Data Expeditions

A Data Expedition is an element of an undergraduate course that introduces students to exploratory data analysis.

Pairs of graduate students, often from different disciplines, work with the course instructor to formulate a question that will engage the students, and a pathway through a dataset that will provide insight.

Graduate student participants will receive a travel grant. Browse our current projects to find opportunities.


This Data Expedition introduced hypothesis-driven data analysis in R and the concept of circular data, along with some tools for importing and analyzing such data in R.
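To illustrate why circular data need their own tools, here is a minimal sketch of a mean direction calculation. It is written in Python rather than R purely for illustration, and the function name and the example bearings are hypothetical, not taken from the expedition materials.

```python
import math

def circular_mean_degrees(angles_deg):
    """Mean direction of angles given in degrees, returned in [0, 360).

    Each angle is treated as a unit vector; the mean direction is the
    direction of the summed vector. If the vectors cancel out exactly,
    the mean direction is undefined and the result is not meaningful.
    """
    sin_sum = sum(math.sin(math.radians(a)) for a in angles_deg)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0

# Hypothetical compass bearings on either side of north.
bearings = [350.0, 30.0]

arithmetic = sum(bearings) / len(bearings)   # points roughly south
circular = circular_mean_degrees(bearings)   # points roughly north

print(f"arithmetic mean: {arithmetic:.1f} degrees")
print(f"circular mean:   {circular:.1f} degrees")
```

The ordinary average of the two bearings points nearly due south, even though both observations point close to north. The circular mean respects the wraparound at 360 degrees.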

The aim of this Data Expedition was to introduce students to stable isotopes and to show how stable isotope data can be used to understand trophic dynamics.

Marine mammals exhibit extreme physiological and behavioral adaptations that allow them to dive hundreds to thousands of meters underwater despite their need to breathe air at the surface. Through the development of new remote monitoring technologies, we are just beginning to understand the mechanisms by which they are able to execute these extreme behaviors. Long-term animal-borne tags can now record location, dive depth, and dive duration and then transmit these data to satellite receivers, enabling remote access to behavior occurring both many kilometers out to sea and several kilometers below the ocean surface.

The aim of this Data Expedition was for students to learn hands-on data visualization techniques using a variety of data types. Students first discussed how data visualization is useful, along with tips for making graphs both visually appealing and easy to understand.

Understanding how to manipulate, analyze, and display large datasets is an essential skill in the life sciences. Introducing students to coding concepts and showing them the diversity of tasks that can be accomplished with a flexible language like R is an important step in the training of any life sciences professional. For students taking lab-based courses, who are often required to analyze the datasets they produce in class, learning these techniques can be helpful both in the short term (i.e., during the semester) and for their future careers.

Matt and Ken led two labs for the engineering section of STA 111/130, an introductory course in statistics and probability. They wrote the lab assignments to bridge the gap between introductory linear regression, which is often explained in terms of a static, complete dataset, and time series analysis, which is not a common topic in introductory courses.

Graduate Students: Kendra Kaiser and John Mallard

Faculty: Michael O’Driscoll

Course: Landscape Hydrology, EOS 323/723

Graduate Student: Jacob Coleman, 3rd year Ph.D. student in Statistical Science

Faculty Instructor: Colin Rundel

Class: STA 112, Data Science

Graduate student: Hamza Ghadyali

Faculty instructor: Dr. Paul Bendich

Course: MATH 412 – Topology with Applications

Dr. Guillermo Sapiro, a professor in the Pratt School of Engineering at Duke University, conducts ongoing autism research. Using image processing, he attempts to program a computer to detect whether babies (around 8 to 14 months of age) display signs of autism. This very early detection enables doctors to train these babies (when their brain plasticity is high) to behave in ways that counter the behavioral limitations autism imposes, thus allowing these babies to act more normally as they grow up.

In this Data Expedition, Duke undergraduates were introduced to a real-world traffic citation dataset. Provided by Dr. Frank R. Baumgartner, a political scientist at UNC, the data consist of 15 years of traffic stops, with over 18 million observations of 53 variables.

This Data Expedition introduced students to “sliding windows and persistence” on time series data, an algorithm that turns a one-dimensional time series into a geometric curve in high dimensions and then quantitatively analyzes hybrid geometric/topological properties of the resulting curve, such as “loopiness” and “wiggliness.”
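The sliding window step can be sketched as follows. This is a minimal illustration under stated assumptions, not the expedition's own code: the function name, the parameters `dim` and `tau`, and the cosine test signal are all hypothetical, and the persistence computation that would follow is not shown.

```python
import math

def sliding_window_embedding(series, dim, tau=1):
    """Map a 1-D series to points (x[i], x[i+tau], ..., x[i+(dim-1)*tau]).

    `dim` is the number of coordinates per point and `tau` is the delay
    (in samples) between consecutive coordinates. Both names are
    illustrative assumptions.
    """
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive integers")
    span = (dim - 1) * tau  # samples covered by one window beyond its start
    return [
        tuple(series[i + k * tau] for k in range(dim))
        for i in range(len(series) - span)
    ]

# Hypothetical periodic signal with a period of 20 samples.
signal = [math.cos(2 * math.pi * t / 20) for t in range(100)]

# A delay of a quarter period pairs each value with cos(theta + pi/2),
# so every embedded point lies on the unit circle: a closed loop.
cloud = sliding_window_embedding(signal, dim=2, tau=5)

radii = [math.hypot(x, y) for x, y in cloud]
print(len(cloud), min(radii), max(radii))
```

Two dimensions are used here only so the loop is easy to picture; the same function produces higher-dimensional curves when `dim` is larger. A periodic signal traces a closed loop, which is the kind of "loopiness" that a persistence computation would then quantify.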

Students learned to visualize high-dimensional gene expression data; understand genetic differences in the context of gene networks; connect genetic differences to physiological outcomes; and perform simple analyses using the R programming language.

Graduate students: Aaron Berdanier and Matt Kwit, University Program in Ecology & Nicholas School of the Environment

Faculty instructor: Rebecca Vidra

Course: ENVIRON 102, Fall 2014

Using social network analysis to predict survival in large-brained mammals.

Questions asked: Do males and females scent mark equally? Do lemurs scent mark equally in breeding and non-breeding seasons?

Introduce NBA and MLB datasets to undergraduates to help them gain expertise in exploratory data analysis, data visualization, statistical inference, and predictive modeling.

STEM education often presents a very sanitized version of the scientific enterprise. To some extent, this is necessary, but overemphasizing neat-and-tidy results and scripted protocol assignments poses the risk of failing to adequately prepare students for the real-world mess of transforming experimental data into meaningful results. The fundamental aim of this project was to guide students in processing large real-world datasets far beyond their academic comfort zone so as to give them a more realistic understanding of how science works.

What drove the prices for paintings in 18th-century Paris?