Tips in Data Visualization for Genetic Mapping

Project Summary

The aim of this Data Expedition was for students to learn hands-on data visualization techniques using a variety of data types. Students first discussed why data visualization is useful and shared tips for making graphs both visually appealing and easy to understand.

Themes and Categories
C. Ryan Campbell

Graduate Students: Jenn Coughlan, Ryan Campbell

Course: Biology 490s - Methods in Comp Bio & Genomics

Over two 70-minute class periods, the students worked through two tutorials. The first introduced them to the basics of ggplot2, a data visualization package for the free statistical programming environment R. Students were then given a homework assignment to visualize a simple genotype-phenotype dataset, ‘Coughlan_inversiongenopheno.csv’. In the second class, we began by discussing the homework assignment, including the challenges students encountered and possible next steps. Students were then given a much more complicated dataset of reduced-representation whole-genome data from the wildflower Senecio (from Roda et al. 2017, dataset ‘Fst_BSA_wLinkagegrp.csv’). Students used these data to associate survival with allele frequencies across different habitats and to identify regions of the genome associated with adaptation to edaphic conditions.
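As a rough illustration of the second exercise, the sketch below lays out per-marker Fst values along a single genome-wide axis, the arrangement commonly used for Manhattan-style plots, and flags high-Fst markers. It is written in plain Python rather than the R/ggplot2 used in class, and everything specific in it is an assumption: the record layout (linkage group, position, Fst), the toy values, and the 0.5 cutoff are invented for illustration and do not come from ‘Fst_BSA_wLinkagegrp.csv’ or from Roda et al. 2017.

```python
# Hypothetical toy records: (linkage_group, position, fst).
# These values are invented for illustration; they are NOT from the course dataset.
rows = [
    (1, 10.0, 0.05),
    (1, 40.0, 0.62),
    (2, 5.0, 0.08),
    (2, 30.0, 0.11),
    (3, 20.0, 0.55),
]


def manhattan_layout(records):
    """Return (x, fst, group) tuples with x cumulative across linkage groups."""
    # Approximate each group's length by the largest position observed in it.
    lengths = {}
    for group, position, _ in records:
        lengths[group] = max(lengths.get(group, 0.0), position)

    # Each group starts where the previous one (in sorted order) ended.
    offsets = {}
    running_total = 0.0
    for group in sorted(lengths):
        offsets[group] = running_total
        running_total += lengths[group]

    return [(offsets[group] + position, fst, group)
            for group, position, fst in records]


def flag_outliers(points, threshold):
    """Keep points whose Fst is at or above an (assumed, arbitrary) cutoff."""
    return [point for point in points if point[1] >= threshold]


layout = manhattan_layout(rows)
outliers = flag_outliers(layout, threshold=0.5)

# Text stand-in for the plot: one line per marker, '*' marks flagged markers.
for x, fst, group in layout:
    marker = "*" if fst >= 0.5 else " "
    print(f"LG{group}  x={x:5.1f}  Fst={fst:.2f} {marker}")
```

In a plotting library, the cumulative x values would go on the horizontal axis, Fst on the vertical axis, and the linkage group would typically be mapped to color so that neighboring groups are easy to tell apart.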

Download the course slides (PDF).


Related Projects

In this two-day, virtual data expedition project, students were introduced to the Actor-Partner Interdependence Model (APIM) in the context of stress proliferation, linked lives, the spousal relationship, and mental and physical health outcomes.

Stress proliferation is a concept within the stress process paradigm that explains how one person’s stressors can influence others (Thoits 2010). Combined with the life course principle of linked lives, it implies that because people are embedded in social networks, stress can not only affect the individual but also proliferate to people close to them (Elder Jr, Shanahan and Jennings 2015). For example, one spouse’s chronic health condition may lead to stress-provoking strain in the marital relationship, eventually spilling over to affect the other spouse’s mental health. Additionally, because partners share an environment, experiences, and resources (e.g., money and information), and exert social control over each other, they can monitor and influence each other’s health and health behaviors. This often leads to health concordance within couples: because the individuals in a couple influence each other’s health and well-being, their health tends to become more similar over time (Kiecolt-Glaser and Wilson 2017, Polenick, Renn and Birditt 2018). Thus, a spouse’s current health condition may influence their partner’s future health, and spouses may contemporaneously exhibit similar health conditions or behaviors.

However, how spouses influence each other may be patterned by the gender of the spouse with the health condition or exhibiting the health behaviors. Recent evidence suggests that a wife’s health condition may have little influence on her husband’s future health conditions, but that a husband’s health condition will most likely influence his wife’s future health (Kiecolt-Glaser and Wilson 2017).

A team of students led by researchers in the BIG IDEAS lab in the biomedical engineering department will build and validate machine learning techniques to classify longitudinal illness trajectories of individuals with infections such as COVID-19 or flu. Students will construct a pipeline to query survey and wearable device data from our newly constructed database in the Microsoft Azure environment and modify existing machine learning and deep learning algorithms for wearables data analysis. This project will build upon the work accomplished by the Duke Bass Connections team and the Duke MIDS capstone project.

Project Lead: Jessilyn Dunn

A team of students led by Zackary Johnson (Associate Professor, Nicholas School of the Environment and Biology) and supported by other faculty in NSOE, Statistics, Biology and Engineering will perform analyses of a 10+ year oceanographic time-series dataset sampled near the Duke Marine Laboratory. The dataset contains >1000 observations and >400 fields; the team will first mature the MATLAB-based data wrangler and analysis scripts. Using this enhanced tool, the team will then focus on clustering, classification and forecasting to interpret the variability and trends of key variables measured by the Pivers Island Coastal Observatory. The long-term goal of this project is to understand the proximal drivers of variability in coastal marine ecosystems as well as longer-term changes associated with climate change.

Project Lead: Zackary Johnson