Data Expeditions

A Data Expedition is an element of an undergraduate course that introduces students to exploratory data analysis.

Pairs of graduate students, often from different disciplines, work with the course instructor to formulate a question that will engage the students, and a pathway through a dataset that will provide insight.

Graduate student participants will receive a travel grant. Browse our current projects to find opportunities.

Projects

Fluid mechanics is the study of how fluids (e.g., air, water) move and the forces on them. Scientists and engineers have developed mathematical equations to model the motions of fluid and inertial particles. However, these equations are often computationally expensive, meaning they take a long time for the computer to solve.

To reduce the computation time, we can use machine learning techniques to develop statistical models of fluid behavior. Statistical models do not actually represent the physics of fluids; rather, they learn trends and relationships from the results of previous simulation experiments. Statistical models allow us to leverage the findings of long, expensive simulations to obtain results in a fraction of the time. 

In this project, we provide students with the results of direct numerical simulations (DNS), which took many weeks for the computer to solve. We ask students to use machine learning techniques to develop statistical models of the results of the DNS.

Introduction

Female baboons occasionally exhibit large swellings on their behinds. Although these ‘sexual swellings’ may evoke disgust from human onlookers, they provide important information to group members about a female’s reproductive state. To figure out what these sexual swellings mean and whether male baboons notice, we need to look at the data.

This data expedition explores the relationship between female baboon sexual swellings, female estrogen concentrations, and male mating success. The expedition uses long-term data collected on wild baboons by the Amboseli Baboon Research Project. After learning background information about baboon social lives and reproduction, students generate testable predictions for two hypotheses about baboon reproduction. Students then learn how to use the popular R packages dplyr and ggplot2 to calculate descriptive statistics about the dataset. Finally, students perform data visualization to understand and explore patterns in animal mating behavior and sexual signals.

Learning objectives

  • Learn basics of exploratory data analysis (descriptive statistics, generating plots) in R
  • Learn basics of popular R packages dplyr and ggplot2
  • Increase understanding of association between hormones and mating behavior
  • Increase science literacy skills (e.g. generating predictions, interpreting results)

Course Materials

Workflow

The lesson started with a brief PowerPoint presentation to introduce the class to basic information on baboon sociality and reproduction. At the end of the presentation, students were introduced to two key hypotheses about baboon reproduction that they then explored using R. The class was divided into small groups in which students worked together to propose predictions that would test these hypotheses (filling out the first section of the provided worksheet) before seeing the provided predictions.

Students then worked through the provided R script and accompanying dataset to test these predictions. Students with prior experience with R were able to skip ahead by following instructions on the R script, while most of the class worked through the script step-by-step with guidance from instructors. The course instructors walked the students through most of the script, then let students work independently to complete Data Visualization Part 2. Students filled out the worksheet as they went along.

By the end of the R script, students had replicated Figure 1 from Gesquiere et al. 2007. This figure is included as the final slide of the PowerPoint presentation. The end of the class session was used to interpret the figure and discuss how it relates to the two project hypotheses.

Student level

This lesson is designed for undergraduate students who have little to no exposure to R or other programming software. It could be easily adjusted for students who are familiar with R or other programming software. This lesson takes about 75 minutes to complete.

The dataset and the Amboseli Baboon Research Project

The dataset for this expedition is a subset of the long-term database of the Amboseli Baboon Research Project, a project co-directed by Drs. Jeanne Altmann, Susan Alberts, Beth Archie, and Jenny Tung. The Amboseli Baboon Research Project has collected demographic, behavioral, genetic, and endocrinological data on a population of wild baboons since 1971 in order to study questions related to animal behavior, life history, behavioral ecology, genetics, and physiology. The project’s database is managed by Jake Gordon at Duke University and Niki Learn at Princeton University.

The unit of analysis for this dataset is a fecal estrogen sample from a cycling¹ female. For each fecal sample (n = 843), 6 variables are recorded:

  1. female - identity of the female baboon
  2. cyle_day - day of her reproductive cycle
  3. estrogen - fecal estrogen concentration
  4. swelling_size - sexual swelling size²
  5. alpha_consort - whether or not the female consorted³ with an alpha male⁴ on that day
  6. nonalpha_consort - whether or not the female consorted with a non-alpha male on that day

This dataset includes data from 93 female baboons, with approximately 10 fecal estrogen samples per female. Minor differences between this dataset and the dataset used in Gesquiere et al. 2007 are due to small, incremental changes in the database over time.
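
The expedition itself uses R with dplyr and ggplot2. As a rough illustration of the kind of grouped descriptive statistic students compute, here is a minimal Python sketch. The rows are invented for illustration and are not values from the Amboseli dataset; only the column names are taken from the variable list above, and the helper name `summarize_by` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Invented rows for illustration only; NOT real Amboseli data.
# Keys reuse the variable names listed above.
samples = [
    {"female": "A", "cyle_day": -2, "estrogen": 80.0, "swelling_size": 12,
     "alpha_consort": True, "nonalpha_consort": False},
    {"female": "B", "cyle_day": -2, "estrogen": 60.0, "swelling_size": 10,
     "alpha_consort": False, "nonalpha_consort": True},
    {"female": "A", "cyle_day": -10, "estrogen": 30.0, "swelling_size": 4,
     "alpha_consort": False, "nonalpha_consort": False},
    {"female": "C", "cyle_day": -10, "estrogen": 50.0, "swelling_size": 6,
     "alpha_consort": False, "nonalpha_consort": False},
]


def summarize_by(rows, key):
    """Group rows by `key`; return count, mean estrogen, and share of
    samples with an alpha consort for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {
        value: {
            "n": len(group),
            "mean_estrogen": mean(r["estrogen"] for r in group),
            "prop_alpha_consort": mean(1 if r["alpha_consort"] else 0 for r in group),
        }
        for value, group in groups.items()
    }


summary = summarize_by(samples, "cyle_day")
for day in sorted(summary):
    print(day, summary[day])
```

Grouping by `female` instead of `cyle_day` would give per-individual summaries with the same helper.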

Footnotes

1 cycling: sexually mature but not pregnant or lactating
2 female yellow baboons exhibit exaggerated sexual swellings (an enlargement/engorgement of the genital and perineal skin) around ovulation
3 consortship: a period in which a male mate-guards a female. Virtually all matings and conceptions occur during consorts

A large and growing trove of patient, clinical, and organizational data is collected as a part of the “Help Desk” program at Durham’s Lincoln Community Health Center. Help Desk is a group of student volunteers who connect with patients over the phone and help them navigate to community resources (like food assistance programs, legal aid, or employment centers). Data-driven approaches to identifying service gaps, understanding the patient population, and uncovering unseen trends are important for improving patient health and advocating for the necessity of these resources. Disparities in food security, economic stability, education, neighborhood and physical environment, community and social context, and access to the healthcare system are crucial social determinants of health, which studies indicate account for nearly 70% of all health outcomes.

We led a 75-minute class session for the Marine Mammals course at the Duke University Marine Lab that introduced students to strengths and challenges of using aerial imagery to survey wildlife populations, and the growing use of machine learning to address these "big data" tasks.

Most phenomena that data scientists seek to analyze are either spatially or temporally correlated. Examples of spatial and temporal correlation include political elections, contaminant transfer, disease spread, housing markets, and the weather. A key question is how to incorporate spatial correlation into models of such phenomena.

In this project, we focus on the impact of environmental attributes (such as greenness, tree cover, temperature, etc.) along with other socio-demographics and home characteristics on housing prices by developing a model that takes into account the spatial autocorrelation of the response variable. To this end, we introduce a test to diagnose spatial autocorrelation and explain how to integrate spatial autocorrelation into a regression model.
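
The source does not say which diagnostic test was used. One widely used choice is global Moran's I, shown here as a minimal sketch under that assumption. The weight matrix, the price values, and the function name `morans_i` are all invented for illustration.

```python
def morans_i(values, weights):
    """Global Moran's I for `values`, given a square spatial weight matrix.

    I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2
    where W is the sum of all weights.
    """
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in deviations)
    return (n / total_weight) * (cross / variance_sum)


# Four hypothetical houses in a row; each is a neighbour of the adjacent ones.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = [1.0, 1.0, 5.0, 5.0]    # similar values sit next to each other
alternating = [1.0, 5.0, 1.0, 5.0]  # neighbours always differ

print(morans_i(clustered, adjacency))    # positive: similar values cluster
print(morans_i(alternating, adjacency))  # negative: neighbours are dissimilar
```

Values near zero suggest no spatial pattern. A formal test would additionally compare the statistic against its distribution under random permutation of the values, which this sketch omits.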

In this data exploration, students are provided with data collected from remote sensing, census, and Zillow sources. Students are tasked with conducting a regression analysis of real-estate estimates against environmental amenities and other control variables which may or may not include the spatial autocorrelation information.

Over the course of two 90-minute sessions, we led students in the Duke Marine Lab Marine Ecology class (Biology 273LA) on a data expedition using the statistical programming environment R. We gave an introduction to big data, the role of big data in ecology, important things to consider when working with data (quality control, metadata, etc.), dealing with big data in R, what the Tidyverse is, and how to organize tidy data (see class PowerPoint). We then led a hands-on coding workshop in which we explored an open-access citizen science dataset of aquatic plants along the U.S. East Coast (see dataset details below).

The goal of this Data Expedition was to introduce students to the exploration of social networks data using R. Students learned to load and plot a social network in R and then perform some basic analyses on two different networks: Hockey Fights in the National Hockey League in 2018-2019 and characters in Game of Thrones Season 3. Students used social network analysis to better understand who is connected to whom, how frequently they interact, and how they are interacting.
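
The three questions above (who is connected to whom, how often, and how) map onto simple network quantities. The expedition used R; the following is only a minimal Python sketch, with an invented edge list rather than the hockey or Game of Thrones data.

```python
from collections import Counter

# Invented interaction records (one tuple per observed interaction).
interactions = [
    ("Ann", "Bo"),
    ("Ann", "Cy"),
    ("Bo", "Ann"),
    ("Cy", "Dee"),
    ("Ann", "Bo"),
]

# Treat ties as undirected: sort each pair so (Bo, Ann) and (Ann, Bo) match.
# The count per pair answers "how frequently do they interact?"
edge_weights = Counter(tuple(sorted(pair)) for pair in interactions)

# Degree = number of distinct partners; strength = total interactions.
degree = Counter()
strength = Counter()
for (a, b), weight in edge_weights.items():
    degree[a] += 1
    degree[b] += 1
    strength[a] += weight
    strength[b] += weight

for person in sorted(degree):
    print(person, "degree:", degree[person], "strength:", strength[person])
```

Keeping the pairs unsorted would instead give a directed network, which may suit data where it matters who initiated the interaction.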

The data that students see in their statistics courses are often constrained to numeric and tabular data. However, there is an exciting field of data science and statistics known as text analysis. This expedition introduces students to the concept of treating text as data frames of words, and demonstrates how to perform basic analyses on bodies of text using R. Tweets of four Democratic candidates for the 2020 Primary are used as data, and demonstrated text analysis techniques in the expedition include comparisons of word frequencies, log-odds ratios for word usage, and pairwise word correlations.
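
To make the word-frequency and log-odds comparison concrete, here is a minimal Python sketch (the expedition itself used R). The source does not give the exact formula, so this assumes one common convention: the log of the ratio of add-one smoothed relative frequencies. The example texts are invented and are not real tweets.

```python
import math
import re
from collections import Counter


def word_counts(texts):
    """Lowercase each text, split it into alphabetic tokens, and count them."""
    return Counter(
        word for text in texts for word in re.findall(r"[a-z']+", text.lower())
    )


def log_ratio(counts_a, counts_b):
    """Log ratio of each word's add-one smoothed share in corpus A vs corpus B.

    Positive values mean the word is relatively more common in A.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    vocabulary = set(counts_a) | set(counts_b)
    return {
        word: math.log(
            ((counts_a[word] + 1) / (total_a + 1))
            / ((counts_b[word] + 1) / (total_b + 1))
        )
        for word in vocabulary
    }


# Invented example texts, not real tweets.
author_a = ["health care for all", "care for workers"]
author_b = ["climate action now", "action for climate"]

counts_a = word_counts(author_a)
counts_b = word_counts(author_b)
ratios = log_ratio(counts_a, counts_b)

# Words sorted from most characteristic of B to most characteristic of A.
for word in sorted(ratios, key=ratios.get):
    print(f"{word:10s} {ratios[word]:+.2f}")
```

The smoothing keeps the ratio finite for words that one author never uses.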

This project allowed students in BIOL 268D (Mechanisms of Animal Behavior) to explore the relationship between estrogen, female sexual swellings, and male mating success in wild baboons using data from the Amboseli Baboon Research Project. Students learned how to use the popular R packages dplyr and ggplot2 to calculate descriptive statistics about the dataset and perform data visualization to understand and explore patterns in animal mating behavior and sexual signals.

Ecological data comes in various shapes and sizes. When conducting an ecological study, it is common to have population data (such as snail counts) and continuous sensor data (such as stream temperature with 35,000 data points collected each year!). Ecologists must reconcile data collected at different spatial and temporal scales in order to make inferences about their study systems. Luckily, there are standard practices and toolsets that ecologists use. In this data expedition, we ingest, arrange and query data collected in the field through various methods into formats that can be analyzed. We then use different plot types, data transformations and statistical tests, such that our analyses are appropriate for the type of data. We examine both field data collected by students and also large open-source datasets that can be scraped from the web and analyzed locally.

Each year, the Field Ecology students measure physical, chemical, and biological characteristics of the Eno River. The Eno River has also been continuously monitored for numerous environmental parameters as part of the StreamPulse project (Duke and other collaborators worldwide). StreamPulse collects data from in-stream sensors, such as temperature and dissolved oxygen, to estimate ecosystem processes such as metabolism. We are therefore able to compare data collected in the field course with long-term monitoring efforts.

KC and Patrick led two hands-on data workshops for ENVIRON 335: Drones in Marine Biology, Ecology, and Conservation. These labs were intended to introduce students to examples of how drones are currently being used as a remote sensing tool to monitor marine megafauna and their environments, and how machine learning can be used to efficiently analyze remote sensing datasets. The first lab specifically focused on how drones are being used to collect aerial images of whales to measure changes in body condition to help monitor populations. Students were introduced to the methods for making accurate measurements and then received an opportunity to measure whales themselves. The second lab then introduced analysis methods using computer vision and deep neural networks to detect, count, and measure objects of interest in remote sensing data. This work provided students in the environmental sciences an introduction to new techniques in machine learning and remote sensing that can be powerful multipliers of effort when analyzing large environmental datasets.

This two-week teaching module in an introductory-level undergraduate course invites students to explore the power of Twitter in shaping public discourse. The project supplements the close-reading methods that are central to the humanities with large-scale social media analysis. This exercise challenges students to consider how applying visualization techniques to a dataset too vast for manual apprehension might enable them to identify for granular inspection smaller subsets of data and individual tweets—as well as to determine what factors do not lend themselves to close-reading at all. Employing an original dataset of almost one million tweets focused on the contested 2018 Florida midterm elections, students develop skills in using visualization software, generating research questions, and creating novel visualizations to answer those questions. They then evaluate and compare the affordances of large-scale data analytics with investigation of individual tweets, and draw on their findings to debate the role of social media in shaping public conversations surrounding major national events. This project was developed as a collaboration among the English Department (Emma Davenport and Astrid Giugni), Math Department (Hubert Bray), Duke University Library (Eric Monson), and Trinity Technology Services (Brian Norberg).

Understanding how to generate, analyze, and work with datasets in the humanities is often a difficult task without learning how to code or program. In humanities-centered courses, we often privilege close reading or qualitative analysis over other methods of knowing, but by learning some new quantitative techniques we better prepare students to tackle new forms of reading. This class will work with data from the HathiTrust to develop ideas for thinking about how large groups and different discourse communities thought of queens of antiquity like Cleopatra and Dido.

Please refer to https://sites.duke.edu/queensofantiquity/ for more information.

We introduced students to spatial analysis in QGIS and R using location data from two whale species tagged with satellite transmitters. Students were given satellite tracks from five Cuvier’s beaked whales (Ziphius cavirostris) and five short-finned pilot whales (Globicephala macrorhynchus) tagged off the North Carolina coast. Students then used RStudio to calculate two metrics of these species' spatial ranges: home range (where a species spends 95% of its time) and core range (where a species spends 50% of its time). Next, students used QGIS to visualize the data, producing maps that displayed the whales' tracks and their ranges.

This Data Expedition introduces students to network tools and approaches and invites students to consider the relationship(s) between social networks and social imaginaries. Using foundation-funding data collected from The Foundation Directory Online, the Data Expedition enables students to visualize and explore the relationship between networks, social imaginaries, and funding for higher education. The Data Expedition is based on two sets of data. The first set lists the grants received by Duke University in 2016 from five foundations: The Bill and Melinda Gates Foundation, Fidelity Charitable Gift Fund, Silicon Valley Community Foundation, The Community Foundation of Western North Carolina, and The Robert Wood Johnson Foundation. The second set lists the names of board members from Duke University and each of these five foundations, along with the degree-granting institution for their undergraduate education. For the sake of this exercise, the degree-granting institution data was fabricated from a randomized list of the top twenty-five undergraduate institutions.

This Data Expedition seeks to introduce students to statistical analysis in the field of international development. Students construct an index of wealth/poverty based on asset holdings using four datasets collected under the umbrella of the Living Standards Measurement Survey project at the World Bank. We selected countries to represent different continents with comparable and recent survey data: Bulgaria (2007), Tajikistan (2009), Tanzania (2010-2011), and Panama (2008).

First, we construct an index of wealth based on household assets in the different countries using Principal Component Analysis. Once a poverty index is constructed, students seek to understand what the main drivers of wealth/poverty are in different countries. We include variables for health, education, age, relationship to the household head, and sex. Students then use regression analysis to identify the main drivers of poverty in different countries.

This data expedition explores the local (ego) patent citation networks of three hybrid vehicle-related patents. The concept of patent citations and technological development is a core theme in innovation and entrepreneurship, and the purpose of these network explorations is to both quantitatively and visually assess how innovations are connected and what these connections mean for the focal innovations and the technologies that draw on those patents in the future. The expedition was incorporated as part of the Sociology of Entrepreneurship class, where students are thinking about the emergence and diffusion of innovations.

Large publicly available environmental databases are a tremendous resource for both scientists and the general public interested in climate trends and properties. However, without the programming skills to parse and interpret these massive datasets, significant trends may remain hidden from both scientists and the public. In this data exploration, students, over the course of three hours, accessed two large, publicly available datasets, each with more than 4 million observations. They learned how to use R and RStudio to effectively organize, visualize, and statistically explore trends in deep-sea physical oceanography.

Our aim was to introduce students to the wealth of possibilities that human genotyping and sequencing hold by illustrating firsthand the power of these datasets to identify genetic relatives, using the story of the Golden State Killer’s capture with public genetic databases.

This Data Expedition introduced hypothesis-driven data analysis in R and the concept of circular data, while providing some tools for importing it and analyzing it in R.
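
Circular data (compass headings, times of day, and similar measurements) wrap around, so ordinary averages can mislead. The expedition worked in R; as a minimal Python sketch of the idea, with invented headings and a hypothetical function name, the mean direction can be computed from the averaged sine and cosine components:

```python
import math


def circular_mean_degrees(angles_deg):
    """Mean direction of angles given in degrees, wrapped to the 0-360 range."""
    sin_sum = sum(math.sin(math.radians(a)) for a in angles_deg)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0


# Two invented headings just either side of north.
headings = [350.0, 10.0]

naive = sum(headings) / len(headings)     # 180.0, i.e. due south
circular = circular_mean_degrees(headings)  # approximately north (0 or 360)

print("arithmetic mean:", naive)
print("circular mean:  ", circular)
```

The arithmetic mean points in the opposite direction from both observations, while the circular mean lands at north up to floating-point rounding.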

The aim of this data expedition was to give students an introduction to stable isotopes and how the data can be used to understand trophic dynamics. 

Marine mammals exhibit extreme physiological and behavioral adaptations that allow them to dive hundreds to thousands of meters underwater despite their need to breathe air at the surface. Through the development of new remote monitoring technologies, we are just beginning to understand the mechanisms by which they are able to execute these extreme behaviors. Long-term animal-borne tags can now record location, dive depth, and dive duration and then transmit these data to satellite receivers, enabling remote access to behavior occurring both many kilometers out to sea and several kilometers below the ocean surface.

The aim of this Data Expedition was for students to learn hands-on data visualization techniques using a variety of data types. Students first discussed how data visualization is useful, and tips to make graphs both visually appealing and easy to understand. 

Understanding how to manipulate, analyze, and display large datasets is an essential skill in the life sciences. Introducing students to the concepts of coding languages and showing them the diversity of tasks that can be accomplished using a flexible coding scheme like R is an important step in the training of any life sciences professional. For students taking lab-based courses, who are often required to analyze the datasets they produce in class, learning these techniques can be helpful both in the short term (i.e., during the semester) and for their future careers.

Matt and Ken led two labs for the engineering section of STA 111/130, an introductory course in statistics and probability. The lab assignments were written by Matt and Ken in order to bridge the gap between introductory linear regression, which is often explained in terms of a static, complete dataset, and time series analysis, which is not a common topic in introductory courses. 

Graduate Students: Kendra Kaiser and John Mallard

Faculty: Michael O’Driscoll

Course: Landscape Hydrology, EOS 323/723

Graduate Student: Jacob Coleman, 3rd year Ph.D. student in Statistical Science

Faculty Instructor: Colin Rundel

Class: STA 112, Data Science

Graduate student: Hamza Ghadyali          

Faculty instructor: Dr. Paul Bendich

Course: MATH 412 – Topology with Applications

In this Data Expedition, Duke undergraduates were introduced to a real-world traffic citation dataset. Provided by Dr. Frank R. Baumgartner, a political scientist at UNC, the data consist of 15 years of traffic stops, with over 18 million observations of 53 variables.

Dr. Guillermo Sapiro, a professor in the Pratt School of Engineering at Duke University, conducts ongoing autism research. Using image processing, he attempts to program a computer to detect whether babies (around 8 to 14 months of age) display a sign of autism. This very early detection enables doctors to train these babies (when their brain plasticity is high) to behave in ways to counter the behavioral limitations autism imposes, thus allowing these babies to act more normally as they grow up.

Students learned to visualize high-dimensional gene expression data; understand genetic differences in the context of gene networks; connect genetic differences to physiological outcomes; and perform simple analyses using the R programming language.

This data expedition introduced students to “sliding windows and persistence” on time series data, which is an algorithm to turn one dimensional time series into a geometric curve in high dimensions, and to quantitatively analyze hybrid geometric/topological properties of the resulting curve such as “loopiness” and “wiggliness.”
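
The sliding-window step can be sketched in a few lines of Python. This is only an illustration of the embedding, not the expedition's own code: the parameter names `dim` and `tau` are assumptions, and the persistence computation that would quantify "loopiness" is not shown.

```python
import math


def sliding_window(series, dim, tau=1):
    """Map a 1-D series to points (x[t], x[t + tau], ..., x[t + (dim - 1) * tau])."""
    span = (dim - 1) * tau
    return [
        tuple(series[t + k * tau] for k in range(dim))
        for t in range(len(series) - span)
    ]


# A sampled sine wave with a period of 20 samples.
period = 20
signal = [math.sin(2 * math.pi * t / period) for t in range(60)]

# With a delay of a quarter period, each point is (sin(theta), cos(theta)),
# so the embedded curve traces a circle in the plane.
points = sliding_window(signal, dim=2, tau=period // 4)

print(len(points), "points; first few:", points[:3])
```

A periodic signal produces a closed loop like this one, which is the geometric feature that a topological summary can then detect; a larger `dim` gives a curve in higher dimensions.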

Graduate students: Aaron Berdanier and Matt Kwit, University Program in Ecology & Nicholas School of the Environment

Faculty instructors: Rebecca Vidra

Course: ENVIRON 102, Fall 2014

Using social network analysis to predict survival in large-brained mammals.

Introduce NBA and MLB datasets to undergraduates to help them gain expertise in exploratory data analysis, data visualization, statistical inference, and predictive modeling.

Questions asked: Do males and females scent mark equally? Do lemurs scent mark equally in breeding and non-breeding seasons?

STEM education often presents a very sanitized version of the scientific enterprise. To some extent, this is necessary, but overemphasizing neat-and-tidy results and scripted protocol assignments poses the risk of failing to adequately prepare students for the real-world mess of transforming experimental data into meaningful results. The fundamental aim of this project was to guide students in processing large real-world datasets far beyond their academic comfort zone so as to give them a more realistic understanding of how science works.

What drove the prices for paintings in 18th Century Paris?