Data Analysis for Ecological Modeling

Project Summary

Ecological data comes in many shapes and sizes. An ecological study commonly combines population data (such as snail counts) with continuous sensor data (such as stream temperature, with about 35,000 data points collected each year). Ecologists must reconcile data collected at different spatial and temporal scales to make inferences about their study systems. Fortunately, they can draw on standard practices and toolsets. In this data expedition, we ingest, arrange, and query field data collected through various methods, converting it into formats that can be analyzed. We then choose plot types, data transformations, and statistical tests that are appropriate for each type of data. We examine both field data collected by students and large open-source datasets that can be scraped from the web and analyzed locally.


Each year, the Field Ecology students measure physical, chemical, and biological characteristics of the Eno River. The Eno River has also been continuously monitored for numerous environmental parameters as part of the StreamPulse project (Duke and other collaborators worldwide). StreamPulse collects data from in-stream sensors, such as temperature and dissolved oxygen, to estimate ecosystem processes such as metabolism. We can therefore compare data collected in the field course with long-term monitoring efforts.

Graduate Students: Emily Ury and Alice Carter

Faculty: Dr. Justin Wright (Dr. Emily Bernhardt helped with the original proposal)

Undergraduate Course: "Field Ecology" (BIO 361) 


Part 1: Observations at the Eno River

  • Students will learn how to ingest their own data into the R programming environment

  • Students will become familiar with different types of ecological data

  • Students will use linear regression and multiple linear regression to examine and predict ecological data

  • Students will try different transformations and statistical tests to examine their data
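To make the regression objective above concrete, here is a minimal sketch of fitting a linear trendline by ordinary least squares. The class worked in R; this sketch uses Python's standard library only for illustration. The function names (`fit_line`, `r_squared`) and the velocity and snail values are hypothetical, invented so the arithmetic is easy to check, and are not data from the Eno River.

```python
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    y_bar = fmean(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    if ss_tot == 0:
        raise ValueError("y must contain at least two distinct values")
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_res / ss_tot


# Hypothetical, made-up values: snail counts that fall off with flow velocity.
velocity = [0.1, 0.2, 0.3, 0.4, 0.5]   # m/s (illustrative only)
snails = [12.0, 10.0, 8.0, 6.0, 4.0]   # counts per quadrat (illustrative only)

slope, intercept = fit_line(velocity, snails)
fit_quality = r_squared(velocity, snails, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={fit_quality:.3f}")
```

Multiple linear regression extends the same idea to several predictors at once; a statistics package handles that case more conveniently than hand-written sums.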

Part 2: A Year of Eno River Data

  • Students will explore the StreamPulse project data platform and R package

  • Students will download and examine a year of Eno River monitoring data

  • Students will begin to examine how long-term monitoring data is used to interpret field observation data in ecological analysis
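One common step when comparing sensor records with field observations is to aggregate the high-frequency series to a coarser time step. The sketch below averages readings by calendar day. It is written in Python for illustration (the class used R and the StreamPulse R package, which is not shown here). The function name `daily_means`, the 15-minute interval, and the reading values are assumptions made up for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean


def daily_means(readings):
    """Average (timestamp, value) readings by calendar day.

    Returns a dict mapping each date to the mean of that day's values,
    ordered by date.
    """
    by_day = defaultdict(list)
    for timestamp, value in readings:
        by_day[timestamp.date()].append(value)
    return {day: fmean(values) for day, values in sorted(by_day.items())}


# Hypothetical, made-up readings: two days at an assumed 15-minute interval
# (96 readings per day), 10.0 on the first day and 12.0 on the second.
start = datetime(2020, 1, 1)
readings = [
    (start + timedelta(minutes=15 * i), 10.0 if i < 96 else 12.0)
    for i in range(192)
]

means = daily_means(readings)
for day, mean_value in means.items():
    print(day, round(mean_value, 2))
```

Daily means are only one choice; daily minima or maxima may matter more for a variable such as dissolved oxygen, which typically varies over the course of a day.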

Here are some examples of the plots we made:

Binning stream parameters to understand population distributions:

Snail distribution graph
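As a rough illustration of binning, the sketch below groups observations of a stream parameter into fixed-width bins and totals the snail counts in each bin. It is written in Python for illustration (the class used R). The function name `bin_counts`, the choice of depth as the parameter, and all values are hypothetical, not the class data.

```python
from collections import Counter


def bin_counts(values, counts, bin_width):
    """Total the counts falling in each fixed-width bin of `values`.

    Each bin is labeled by its lower edge, so with bin_width=10 a value
    of 17 falls in the bin labeled 10 (covering 10 up to, not including, 20).
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    totals = Counter()
    for value, count in zip(values, counts):
        lower_edge = (value // bin_width) * bin_width
        totals[lower_edge] += count
    return dict(sorted(totals.items()))


# Hypothetical, made-up survey: water depth (cm) and snails counted at each spot.
depth_cm = [3, 8, 12, 17, 25, 29]
snails = [5, 7, 2, 4, 1, 0]

snails_by_depth = bin_counts(depth_cm, snails, bin_width=10)
print(snails_by_depth)
```

Integer depths keep the bin edges exact; with floating-point values, edges computed this way can be affected by rounding.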

Trying out various visual, statistical and modeling approaches:

Different modeling approaches graph

Graph of a year of oxygen levels at the Eno River

Student Feedback

“I’ve never used R before, so I learned how to input data, make plots, and do regression analyses (single + multiple)...Stream data was really cool!”

“Thank you! Super helpful. Always so much to learn with R.”

“[I] learned how to fit a linear trendline to a graph.”

“I learned how to customize the data I am working with.”

Student feedback cards

Attached materials for the lesson

Two R markdown files:

One data file:

Photos from the class

Students in class


Related Projects

A large and growing trove of patient, clinical, and organizational data is collected as a part of the “Help Desk” program at Durham’s Lincoln Community Health Center. Help Desk is a group of student volunteers who connect with patients over the phone and help them navigate to community resources (like food assistance programs, legal aid, or employment centers). Data-driven approaches to identifying service gaps, understanding the patient population, and uncovering unseen trends are important for improving patient health and advocating for the necessity of these resources. Disparities in food security, economic stability, education, neighborhood and physical environment, community and social context, and access to the healthcare system are crucial social determinants of health, which studies indicate account for nearly 70% of all health outcomes.

We led a 75-minute class session for the Marine Mammals course at the Duke University Marine Lab that introduced students to the strengths and challenges of using aerial imagery to survey wildlife populations, and to the growing use of machine learning to address these "big data" tasks.

Most phenomena that data scientists seek to analyze are spatially or temporally correlated. Examples include political elections, contaminant transfer, disease spread, housing markets, and the weather. A question of interest is how to incorporate spatial correlation information into models of such phenomena.


In this project, we focus on the impact of environmental attributes (such as greenness, tree cover, and temperature), along with socio-demographics and home characteristics, on housing prices by developing a model that takes into account the spatial autocorrelation of the response variable. To this end, we introduce a test to diagnose spatial autocorrelation and explain how to integrate spatial autocorrelation into a regression model.
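The text does not say which test the project uses. As an assumption, the sketch below computes Moran's I, one widely used statistic for diagnosing spatial autocorrelation. It is written in Python for illustration. The function name `morans_i`, the chain-shaped neighbor matrix, and the values are hypothetical. The code computes only the statistic; a full test would also assess significance, for example by permutation.

```python
def morans_i(values, weights):
    """Moran's I for `values` under a square spatial weight matrix.

    I = (n / W) * sum_ij w_ij * (x_i - mean) * (x_j - mean) / sum_i (x_i - mean)^2
    where W is the sum of all weights. Values near +1 suggest that similar
    values cluster together; values near -1 suggest neighbors tend to differ.
    """
    n = len(values)
    if any(len(row) != n for row in weights) or len(weights) != n:
        raise ValueError("weights must be an n x n matrix")
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    denominator = sum(d * d for d in deviations)
    if total_weight == 0 or denominator == 0:
        raise ValueError("weights and values must both be non-constant zero")
    cross = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    return (n / total_weight) * (cross / denominator)


# Hypothetical layout: four sites in a row, each a neighbor of the next.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

gradient_i = morans_i([1.0, 2.0, 3.0, 4.0], chain)       # smooth trend
alternating_i = morans_i([1.0, -1.0, 1.0, -1.0], chain)  # neighbors differ
print(round(gradient_i, 3), round(alternating_i, 3))
```

The smooth gradient gives a positive value and the alternating pattern a negative one, which matches the usual reading of the statistic.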



In this data exploration, students are provided with data collected from remote sensing, census, and Zillow sources. Students are tasked with conducting a regression analysis of real-estate estimates against environmental amenities and other control variables which may or may not include the spatial autocorrelation information.