Ecological Data Analysis

Project Summary

Over two 1.5-hour sessions, we led students in the Duke Marine Lab Marine Ecology class (Biology 273LA) on a data expedition using the statistical programming environment R. We introduced big data and its role in ecology, important considerations when working with data (quality control, metadata, etc.), how to handle big data in R, what the Tidyverse is, and how to organize tidy data (see class PowerPoint). We then led a hands-on coding workshop in which we explored an open-access citizen science dataset of aquatic plants along the U.S. East Coast (see dataset details below).

Themes and Categories

Graduate students: Julianna Renzi and Leo Gaskins

Faculty collaborator: Brian Silliman

Course: Marine Ecology (Biology 273LA)

Expedition Learning Goals

  • Understand what big data are
  • Understand why big data are becoming increasingly important in ecology
  • Learn key principles for dealing with any type of data, but particularly big data (quality control, metadata, reproducibility, code documentation, etc.)
  • Become familiar with R, RStudio, and the Tidyverse
  • Get hands-on experience coding in R using the Tidyverse
  • Create annotated code that can be repurposed for student projects in this course and future projects after the course ends

The Dataset

Phenology is the study of the timing of plant and animal activity (e.g. flowering, migration, reproduction) and how the timing of life cycle events changes with climate. The data we used for this expedition were downloaded from the USA National Phenology Network’s (NPN’s) Phenology Observation Portal. The NPN runs a citizen science program called Nature’s Notebook that collects phenological observations on species all around the country. Their observation portal has over 17.3 million phenological observations, 76,100 of which are aquatic. Each observation has 20 default fields, along with 21 optional satellite-derived climate fields and 25 optional record-related fields. The data are freely available and, as with many other large-scale ecological datasets (e.g. eBird, iNaturalist), were collected by citizen scientists as part of an effort to observe large-scale environmental phenomena. Scientists at the NPN are using the data to track how the start of spring is changing across the nation, but a multitude of other questions remain to be explored using NPN data. For this data expedition, we used an aquatic subset of NPN data from the U.S. East Coast and practiced accessing and summarizing information from that subset. We focused on a few key species and looked at when they displayed particular phenophases throughout the year, as well as potential relationships between climate and plant phenology.
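The kind of summary described above (when a species displays a given phenophase during the year) can be sketched in a few lines. The workshop itself used R and the Tidyverse; the sketch below uses plain Python only to illustrate the filter-then-count logic. The records, the field names (`species`, `phenophase`, `date`, `status`), and all values are made up for illustration and are not the NPN download format.

```python
from collections import Counter
from datetime import date

# Hypothetical observation records; field names and values are illustrative
# assumptions, not the actual NPN schema or real observations.
observations = [
    {"species": "black willow", "phenophase": "Breaking leaf buds",
     "date": date(2015, 3, 28), "status": 1},
    {"species": "black willow", "phenophase": "Breaking leaf buds",
     "date": date(2015, 4, 2), "status": 1},
    {"species": "black willow", "phenophase": "Breaking leaf buds",
     "date": date(2015, 7, 15), "status": 0},
    {"species": "black willow", "phenophase": "Leaves",
     "date": date(2015, 7, 15), "status": 1},
    {"species": "example species B", "phenophase": "Breaking leaf buds",
     "date": date(2015, 4, 10), "status": 1},
]


def yes_counts_by_month(records, species, phenophase):
    """Count 'yes' (status == 1) observations per calendar month."""
    counts = Counter()
    for rec in records:
        if (rec["species"] == species
                and rec["phenophase"] == phenophase
                and rec["status"] == 1):
            counts[rec["date"].month] += 1
    # Return months in calendar order so the result reads like a timeline.
    return dict(sorted(counts.items()))


willow_buds = yes_counts_by_month(
    observations, "black willow", "Breaking leaf buds"
)
print(willow_buds)
```

In a Tidyverse workflow the same steps would typically be a filter followed by a grouped count; the point here is only the order of operations.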

Figures: black willow "yes" observations graph; black willow breaking leaf buds graph; black willow phenophase observations; fun spline graph.

Photo: students in the classroom.

Related Projects

This data expedition focused on the mechanisms animals use to orient using environmental stimuli, the methods that scientists use to test hypotheses about orientation, and the statistical methods used with circular orientation data. Students collected their own data set during the class period, performed hypothesis testing on their data using circular statistics in R, and aggregated their data to formally test the hypothesis that isopods orient with light using an RShiny online application.

This exercise served as a capstone to a series of four class sessions on orientation and navigation, where students read primary scientific literature that used circular statistics in their methods. This data exercise was used to give students the opportunity to collect their own data, discover why linear statistics wouldn’t be sufficient to analyze them, and then implement their own analysis. The goal of this course was to give students a better understanding of circular statistics, with hands-on application in forming and testing a hypothesis.
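To see why linear statistics fall short for orientation data, consider headings clustered around north: a few degrees either side of 0° average to a value far from north if treated as ordinary numbers. The sketch below is a minimal illustration, not the class's actual analysis (which used R). The headings are made-up values, and `circular_summary` is a hypothetical helper name. It computes the circular mean direction, the mean resultant length R, and the Rayleigh statistic z = nR².

```python
import math


def circular_summary(angles_deg):
    """Return (mean direction in degrees, mean resultant length R)."""
    n = len(angles_deg)
    # Treat each heading as a unit vector and average the components.
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    mean_dir = math.degrees(math.atan2(s, c)) % 360.0
    r = math.hypot(c, s)  # 0 = no shared direction, 1 = identical headings
    return mean_dir, r


# Made-up headings (degrees) clustered near north; not data from the class.
headings = [350.0, 0.0, 10.0, 20.0]

linear_mean = sum(headings) / len(headings)  # arithmetic mean, ignores wrap-around
mean_dir, r = circular_summary(headings)
rayleigh_z = len(headings) * r ** 2  # Rayleigh test statistic, z = n * R^2

print(f"linear mean:   {linear_mean:.1f} degrees")
print(f"circular mean: {mean_dir:.1f} degrees, R = {r:.3f}, z = {rayleigh_z:.2f}")
```

The arithmetic mean lands at 95°, roughly east, while the circular mean stays near 5°, which matches where the headings actually point. A large z relative to the sample size indicates that the headings are unlikely to be uniformly distributed around the circle.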

In this two-day, virtual data expedition project, students were introduced to the actor-partner interdependence model (APIM) in the context of stress proliferation, linked lives, the spousal relationship, and mental and physical health outcomes.

Stress proliferation is a concept within the stress process paradigm that explains how one person’s stressors can influence others (Thoits 2010). Combining this with the life course principle of linked lives explains that because people are embedded in social networks, stress not only can impact the individual but can also proliferate to people close to them (Elder Jr, Shanahan and Jennings 2015). For example, one spouse’s chronic health condition may lead to stress-provoking strain in the marital relationship, eventually spilling over to affect the other spouse’s mental health. Additionally, because partners share an environment, experiences, and resources (e.g., money and information), as well as exert social control over each other, they can monitor and influence each other’s health and health behaviors. This often leads to health concordance within couples; in other words, because individuals within the couple influence each other’s health and well-being, their health tends to become more similar or more alike (Kiecolt-Glaser and Wilson 2017, Polenick, Renn and Birditt 2018). Thus, a spouse’s current health condition may influence their partner’s future health and spouses may contemporaneously exhibit similar health conditions or behaviors.

However, how spouses influence each other may be patterned by the gender of the spouse with the health condition or exhibiting the health behaviors. Recent evidence suggests that a wife’s health condition may have little influence on her husband’s future health conditions, but that a husband’s health condition will most likely influence his wife’s future health (Kiecolt-Glaser and Wilson 2017).

Fluid mechanics is the study of how fluids (e.g., air, water) move and the forces on them. Scientists and engineers have developed mathematical equations to model the motions of fluid and inertial particles. However, these equations are often computationally expensive, meaning they take a long time for the computer to solve.


To reduce the computation time, we can use machine learning techniques to develop statistical models of fluid behavior. Statistical models do not actually represent the physics of fluids; rather, they learn trends and relationships from the results of previous simulation experiments. Statistical models allow us to leverage the findings of long, expensive simulations to obtain results in a fraction of the time. 


In this project, we provide students with the results of direct numerical simulations (DNS), which took many weeks for the computer to solve. We ask students to use machine learning techniques to develop statistical models of the results of the DNS.
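The idea of a statistical model standing in for an expensive solver can be sketched with the simplest possible learner. The example below is an illustration under stated assumptions, not the project's method: `expensive_simulation` is a hypothetical toy function standing in for a costly solver (it is not DNS), and an ordinary least-squares line is swapped in for whatever machine learning techniques students actually used.

```python
import math


def expensive_simulation(x):
    """Stand-in for a costly solver; a hypothetical toy function, not DNS."""
    return 2.0 * x + 1.0 + 0.1 * math.sin(5.0 * x)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training data": inputs and outputs from previous (here, toy) simulation runs.
train_x = [i / 10 for i in range(11)]
train_y = [expensive_simulation(x) for x in train_x]
slope, intercept = fit_line(train_x, train_y)


def surrogate(x):
    """Cheap statistical approximation learned from the simulation results."""
    return slope * x + intercept


# Compare the surrogate with the solver at an input not used for fitting.
test_x = 0.55
error = abs(surrogate(test_x) - expensive_simulation(test_x))
print(f"surrogate error at x = {test_x}: {error:.3f}")
```

The surrogate knows nothing about the underlying physics; it only reproduces the trend in the outputs it was fitted to, which is why its accuracy should be checked on inputs held out from fitting.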