A Shallow Dive into Deep Sea Data

Project Summary

Large, publicly available environmental databases are a tremendous resource for scientists and members of the general public interested in climate trends. However, without the programming skills to parse and interpret these massive datasets, significant trends may remain hidden. In this data exploration, students spent three hours working with two large, publicly available datasets, each containing more than 4 million observations. They learned how to use R and RStudio to organize, visualize, and statistically explore trends in deep-sea physical oceanography.

Themes and Categories

Graduate Students: Sarah Solie (Biology) and Arielle Fogel (University Program in Genetics and Genomics, Evolutionary Anthropology)

Faculty Member: Dr. Kate Thomas

Course: Biology 190: Life in the Deep Sea

Students gained experience exploring patterns in multivariate oceanographic data, relevant to their coursework, to answer the following four questions:

  1. How do average temperature and salinity at the surface of the ocean compare to the temperature and salinity at 3000 meters below the surface?
  2. Do the trends observed in question 1 differ across tropical, temperate, and polar climates?
  3. What is the relationship between ocean temperature and salinity across depths ranging continuously from the surface to 5500 meters below sea level?
  4. Do the trends observed in question 3 differ across tropical, temperate, and polar climates?

As students pursued these questions, they were introduced to R, a free software program that provides powerful tools for statistical computing and graphics, and RStudio, an integrated development environment frequently used for easier programming in R. They learned valuable skills for future data analysis, including:

  1. Accessing and downloading two physical oceanography databases (salinity and temperature) from the National Oceanic and Atmospheric Administration (NOAA) and National Oceanographic Data Center (NODC) World Ocean Atlas 2013 - https://www.nodc.noaa.gov/OC5/woa13/woa13data.html
  2. Importing and inspecting a dataset in .csv format in RStudio
  3. Installing and using R packages
  4. Tidying data so that it could be analyzed in R
  5. Manipulating data, including subsetting, filtering, transforming, and summarizing
  6. Creating a new categorical variable and assigning values to it based on existing data
  7. Using graphical visualization (see Graphics Created) including:
    1. Boxplots (Figures 1-2)
    2. Scatterplots (Figures 3-6)
  8. Performing statistical tests including:
    1. A two sample t-test
  9. Following best practices for data wrangling and analysis (e.g., inspecting data after manipulation, annotating code)
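
Several of the skills listed above can be sketched in a few lines of base R. The example below is a minimal illustration, not the course's actual code (which is in the R Markdown file under Course Materials). The data are randomly generated rather than drawn from the World Ocean Atlas, the column names are hypothetical, and the latitude cutoffs used to define climate zones (23.5 and 66.5 degrees) are an assumption, since this summary does not state how climate was assigned.

```r
# Toy stand-in for the imported data: one row per grid cell, one column per
# depth. All values are invented for illustration; they are NOT atlas values.
set.seed(42)
n <- 60
wide <- data.frame(lat = runif(n, min = -89.5, max = 89.5))
wide$temp_0m    <- 28 - 0.3 * abs(wide$lat) + rnorm(n)  # warmer near the equator
wide$temp_3000m <- 2 + rnorm(n, sd = 0.3)               # uniformly cold at depth

# Tidy: stack the depth columns so each row is a single observation.
long <- rbind(
  data.frame(lat = wide$lat, depth_m = 0,    temperature = wide$temp_0m),
  data.frame(lat = wide$lat, depth_m = 3000, temperature = wide$temp_3000m)
)

# New categorical variable derived from an existing one (assumed cutoffs).
long$climate <- cut(
  abs(long$lat),
  breaks = c(0, 23.5, 66.5, 90),
  labels = c("tropical", "temperate", "polar"),
  include.lowest = TRUE
)

# Inspect the data after each manipulation.
str(long)

# Summarize: mean temperature for each depth and climate combination.
means <- aggregate(temperature ~ depth_m + climate, data = long, FUN = mean)
print(means)

# Visualize: boxplots of temperature grouped by depth and climate.
boxplot(temperature ~ depth_m + climate, data = long,
        las = 2, xlab = "", ylab = "Temperature (deg C)")

# Test: two-sample t-test (Welch's by default) comparing the two depths.
result <- t.test(temperature ~ depth_m, data = long)
print(result)
```

Swapping the toy data frame for a table imported with read.csv() would leave the later steps unchanged, provided the column names are adjusted to match.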

At the end of the exercise, students were provided with additional online resources to continue exploring data with R and RStudio.

The Datasets

Students accessed and explored two massive datasets from the National Oceanic and Atmospheric Administration (NOAA) and National Oceanographic Data Center (NODC) World Ocean Atlas 2013. Specifically, they used the annual temperature statistical mean and the annual salinity statistical mean datasets, which contain temperature or salinity observations, respectively, across depth (down to 5500 meters), location (at 1° spatial resolution), and time (1955-2012).

Graphics Created

Figure 1. Temperature at the surface versus 3000 meters below sea level and its relation to climate.
Figure 2. Salinity at the surface versus 3000 meters below sea level and its relation to climate.
Figure 3. Temperature by depth and climate.
Figure 4. Salinity by depth and climate.
Figure 5. Average temperature by depth and climate.
Figure 6. Average salinity by depth and climate.

Course Materials

Please see the R Markdown file titled “deep_sea_data.Rmd” as well as the PDF version, which includes figures, titled “deep_sea_data.pdf”.

Student Feedback

“I learned that programming is probably 10% writing out the code and 90% figuring out what went wrong. It is a ton of troubleshooting, and through that troubleshooting is a lot of frustration. However, it was also a lot of fun doing it. Problem solving has always been enjoyable for me, so I had a good time figuring out what I did wrong.”

“It was ... cool learning all of the different manners in which you can analyze data using the program and also compile all of the information—over 4 million data points—into very easy to read graphs that made interpreting the data very simple.”

“I think it was an amazing experience to make 4 million data [points] into [a] few intuitive graphs.”

“Using the skills I learned in these lessons, I can convey a huge group of data that seems chaotic into a series of tables that [are] both easy to see and easy to analyze.”

“I can understand why and how to use the codes with the instruction of the teachers.”

“Coding [in R] made it easier to graph complicated scientific results with many variables that programs like Excel would struggle with.”

Data Sources

  1. Locarnini, R. A., A. V. Mishonov, J. I. Antonov, T. P. Boyer, H. E. Garcia, O. K. Baranova, M. M. Zweng, C. R. Paver, J. R. Reagan, D. R. Johnson, M. Hamilton, and D. Seidov, 2013. World Ocean Atlas 2013, Volume 1: Temperature. S. Levitus, Ed., A. Mishonov Technical Ed.; NOAA Atlas NESDIS 73, 40 pp.
  2. Zweng, M. M., J. R. Reagan, J. I. Antonov, R. A. Locarnini, A. V. Mishonov, T. P. Boyer, H. E. Garcia, O. K. Baranova, D. R. Johnson, D. Seidov, and M. M. Biddle, 2013. World Ocean Atlas 2013, Volume 2: Salinity. S. Levitus, Ed., A. Mishonov Technical Ed.; NOAA Atlas NESDIS 74, 39 pp.
