Ecological Data Analysis

Project Summary

Over the course of two 1.5-hour sessions, we led students in the Duke Marine Lab Marine Ecology class (Biology 273LA) on a data expedition using the statistical programming environment R. We introduced big data and its role in ecology, important considerations when working with data (quality control, metadata, etc.), how to handle big data in R, what the Tidyverse is, and how to organize tidy data (see class PowerPoint). We then led a hands-on coding workshop in which we explored an open-access citizen science dataset of aquatic plants along the U.S. east coast (see dataset details below).

Themes and Categories

Graduate students: Julianna Renzi and Leo Gaskins

Faculty collaborator: Brian Silliman

Course: Marine Ecology (Biology 273LA)

Expedition Learning Goals

  • Understand what big data are
  • Understand why big data are becoming increasingly important in ecology
  • Learn key principles for dealing with any type of data, but particularly big data (quality control, metadata, reproducibility, code documentation, etc.)
  • Become familiar with R, RStudio, and the Tidyverse
  • Get hands-on experience coding in R using the Tidyverse
  • Create annotated code that can be repurposed for student projects in this course and future projects after the course ends

The Dataset

Phenology is the study of the timing of plant and animal activity (e.g. flowering, migration, reproduction) and how the timing of life cycle events changes with climate. The data we used for this expedition were downloaded from the USA National Phenology Network’s (NPN’s) Phenology Observation Portal. The NPN runs a citizen science program called Nature’s Notebook that collects phenological observations on species across the country. Their observation portal has over 17.3 million phenological observations, 76,100 of which are aquatic. Each observation has 20 default fields, along with 21 optional satellite-derived climate fields and 25 optional record-related fields. The data are freely available and, like many other large-scale ecological datasets (e.g. eBird, iNaturalist), were collected by citizen scientists as part of an effort to observe large-scale environmental phenomena. Scientists at the NPN are using the data to track how the start of spring is changing across the nation, but a multitude of other questions remain to be explored using NPN data. For this data expedition, we used an aquatic subset of NPN data from the U.S. east coast and practiced accessing and summarizing information in the subset. We focused on a few key species and looked at when they displayed particular phenophases throughout the year, as well as potential relationships between climate and plant phenology.
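As a rough illustration of the kind of summary described above, the following minimal sketch uses dplyr to ask when a species was recorded in each phenophase. The data frame is toy data invented for this example, not real NPN records. The column names (Common_Name, Phenophase_Description, Day_of_Year, Phenophase_Status) and the status coding (1 = yes, 0 = no, -1 = uncertain) are assumptions modeled on NPN-style fields and should be checked against the metadata of the file you actually download.

```r
library(dplyr)

# Toy observations invented for illustration only (not real NPN data).
# Column names and the status coding are assumptions; verify them against
# the metadata that accompanies the downloaded dataset.
obs <- data.frame(
  Common_Name = rep("black willow", 6),
  Phenophase_Description = c("Breaking leaf buds", "Breaking leaf buds",
                             "Breaking leaf buds", "Leaves", "Leaves",
                             "Breaking leaf buds"),
  Day_of_Year = c(85, 92, 100, 120, 200, 150),
  Phenophase_Status = c(1, 1, 0, 1, 1, -1)  # assumed: 1 = yes, 0 = no, -1 = uncertain
)

# Keep only "yes" observations, then summarize the timing of each
# phenophase for each species.
yes_summary <- obs %>%
  filter(Phenophase_Status == 1) %>%
  group_by(Common_Name, Phenophase_Description) %>%
  summarise(
    n_yes      = n(),                  # number of "yes" records
    first_day  = min(Day_of_Year),     # earliest day of year observed
    median_day = median(Day_of_Year),  # typical day of year
    last_day   = max(Day_of_Year),     # latest day of year observed
    .groups = "drop"
  )

print(yes_summary)
```

Filtering to "yes" records before grouping keeps "no" and uncertain observations from shifting the timing estimates. With real data, the same pipeline could be extended by adding a site or year column to the group_by() call.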

[Figures: Black willow "yes" observations graph; Black willow breaking leaf buds graph; Black willow phenophase observations; Fun spline graph]

[Photo: Students in classroom]





Related Projects

The goal of this Data Expedition was to introduce students to the exploration of social network data using R. Students learned to load and plot a social network in R and then perform some basic analyses on two different networks: hockey fights in the National Hockey League in 2018-2019 and characters in Game of Thrones Season 3. Students used social network analysis to better understand who is connected to whom, how frequently they interact, and how they are interacting.

The data that students see in their statistics courses are often limited to numeric and tabular data. However, there is an exciting field of data science and statistics known as text analysis. This expedition introduces students to the concept of treating text as data frames of words and demonstrates how to perform basic analyses on bodies of text using R. Tweets from four Democratic candidates in the 2020 primary are used as data, and the text analysis techniques demonstrated include comparisons of word frequencies, log-odds ratios for word usage, and pairwise word correlations.

Fluid mechanics is the study of how fluids (e.g., air, water) move and the forces acting on them. Scientists and engineers have developed mathematical equations to model the motions of fluids and inertial particles. However, these equations are often computationally expensive, meaning they take a long time for a computer to solve.

To reduce the computation time, we can use machine learning techniques to develop statistical models of fluid behavior. Statistical models do not actually represent the physics of fluids; rather, they learn trends and relationships from the results of previous simulations. Statistical models allow us to leverage the findings of long, expensive simulations to obtain results in a fraction of the time.

In this project, we provide students with the results of direct numerical simulations (DNS), which took many weeks for the computer to solve. We ask students to use machine learning techniques to develop statistical models of the results of the DNS.