From Data to Science: An Introduction to the National Health and Nutrition Examination Survey (NHANES)

Project Summary

STEM education often presents a very sanitized version of the scientific enterprise. To some extent, this is necessary, but overemphasizing neat-and-tidy results and scripted protocol assignments poses the risk of failing to adequately prepare students for the real-world mess of transforming experimental data into meaningful results. The fundamental aim of this project was to guide students in processing large real-world datasets far beyond their academic comfort zone so as to give them a more realistic understanding of how science works.

Themes and Categories
Paul Bendich

Graduate students: Jameson Clarke and Rick Gawne

Faculty instructor: Fred Nijout

Focus: Questions that can be asked and answered using large public health databases

Skills gained: (I) Accessing and using NHANES, (II) Statistical analyses using JMP

Results: Presentations describing findings and pitfalls of large data exploration

Related Projects

Over the course of two, one and a half hour sessions we led students in the Duke Marine Lab Marine Ecology class (Biology 273LA) on a data expedition using the statistical programming environment R. We gave an introduction to big data, the role of big data in ecology, important things to consider when working with data (quality control, metadata, etc.), dealing with big data in R, what the Tidyverse is, and how to organize tidy data (see class PowerPoint). We then led a hands-on coding workshop where we explored an open-access citizen science dataset of aquatic plants along U.S. east coast (see dataset details below).

The goal of this Data Expedition was to introduce students to the exploration of social networks data using R. Students learned to load and plot a social network in R and then perform some basic analyses on two different networks: Hockey Fights in the National Hockey League in 2018-2019 and characters in Game of Thrones Season 3. Students used social network analysis to better understand who is connected to whom, how frequently they interact, and how they are interacting.

The data that students see in their statistics courses are often constrained to numeric and tabular data. However, there is an exciting field of data science and statistics known as text analysis. This expedition introduces students to the concept of treating text as data frames of words, and demonstrates how to perform basic analyses on bodies of text using R. Tweets of four Democratic candidates for the 2020 Primary are used as data, and demonstrated text analysis techniques in the expedition include comparisons of word frequencies, log-odds ratios for word usage, and pairwise word correlations.