To Catch a Thief (with Data!)

Project Summary

Our aim was to introduce students to the wealth of possibilities that human genotyping and sequencing hold by illustrating firsthand the power of these datasets to identify genetic relatives, using the story of the Golden State Killer’s capture with public genetic databases.

Graduate Students: Ryan Campbell and Jenn Coughlan, Duke Biology

Course: BIO190S, Genetics and Evolution in Humans

Building on recent coursework about genetic differences between human populations, we discussed the underlying math used to describe these differences. We also covered the technologies that allow us to measure genetic differences quickly and affordably, the same ones that make consumer products such as 23andMe possible. After the introductory lecture, students split into groups to discuss articles related to the case, which gave them an opportunity to ask questions in an informal setting and learn from each other.

After the introduction and group work, we introduced the students to R and R Markdown files, which let the user combine typed notes and other organizational aids with executable code. The students were given a starter R Markdown file (.Rmd), which they ran in R while making small changes. The file also contained their homework assignment, a set of questions that they answered directly in the file.

In running this R Markdown file, the students downloaded a database containing genetic information for multiple individuals (HapMap). This information was stored as genotypes at a large number of Single Nucleotide Polymorphisms (SNPs) for each of several hundred individuals. The students were asked to summarize the dataset and to determine how many human populations it contained (Fig 1).

Figure 1: First two components of a PCA of HapMap variant data, labeled by population of origin. Four populations were included (CEU - Caucasian, YRI - Yoruban, JPT - Japanese, CHB - Han Chinese), and they form three distinct groups.
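The idea behind Figure 1 can be sketched in a few lines. The module itself used R and real HapMap genotypes; the sketch below is in Python and runs on simulated data, so the population names (`pop_A`, `pop_B`), the allele frequencies, and the helper functions are all assumptions made for illustration. Genotypes are coded as 0, 1, or 2 copies of an allele, and the first principal component is found by power iteration rather than by a library PCA routine.

```python
import random

def simulate_genotypes(freqs_by_pop, n_per_pop, rng):
    """Draw genotypes coded 0/1/2 (copies of one allele) for each population."""
    rows, labels = [], []
    for pop, freqs in freqs_by_pop.items():
        for _ in range(n_per_pop):
            rows.append([(rng.random() < p) + (rng.random() < p) for p in freqs])
            labels.append(pop)
    return rows, labels

def first_pc_scores(rows, rng, n_iter=50):
    """Project individuals onto the first principal component.

    Uses power iteration on X^T X, where X is the column-centered
    genotype matrix (individuals in rows, SNPs in columns).
    """
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]  # random starting direction
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = sum(w * w for w in v) ** 0.5
        v = [w / norm for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in centered]

def demo(seed=1, n_snps=200, n_per_pop=30):
    """Simulate two hypothetical populations and return PC1 scores with labels."""
    rng = random.Random(seed)
    freqs_by_pop = {
        "pop_A": [rng.uniform(0.1, 0.9) for _ in range(n_snps)],
        "pop_B": [rng.uniform(0.1, 0.9) for _ in range(n_snps)],
    }
    rows, labels = simulate_genotypes(freqs_by_pop, n_per_pop, rng)
    return first_pc_scores(rows, rng), labels

scores, labels = demo()
for pop in ("pop_A", "pop_B"):
    pop_scores = [s for s, lab in zip(scores, labels) if lab == pop]
    print(pop, "PC1 range:", round(min(pop_scores), 2), "to", round(max(pop_scores), 2))
```

Because the two simulated populations have different allele frequencies at most SNPs, their PC1 scores fall into two non-overlapping ranges, which is the same kind of clustering the students looked for when counting populations in the real data.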

Once the students were acclimated to the R environment, the dataset, and the markdown file, they drew a random sample of individuals to simulate the membership of common consumer genetic databases such as GEDmatch, which are largely Northern European. In pairs, they each picked a random “criminal”: one partner had code to pick a Northern European criminal, while the other had code to pick an individual not of Northern European ancestry. They then queried the database for the level of genetic similarity (kinship) between its members and their focal criminal, looking for samples that would help their investigation. The target was an individual with high kinship to the criminal, since that person would be a close family relative and could lead to the criminal's identification.

Figure 2: Histogram of kinship between a randomly selected sample and the members of the GEDmatch-like database used in this module. The red histogram is for a Northern European sample and shows an individual of high kinship. The blue histogram is for a non-Northern European sample, which has fewer close relatives in the database because of its makeup.

Finally, the students were asked to assess the differences between the two partners' outcomes, summarized in Figure 2. Because the database has far more Northern European samples, the partner given a criminal of this ancestry (red in the figure) will typically have found a much closer relative than the partner given a criminal of a different ancestry (blue in the figure).
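The kinship query can be illustrated with a small simulation. This is a Python sketch under stated assumptions, not the module's R code: the source does not say which kinship estimator was used, so the one below (half of a standardized genotype correlation, which is about 0.25 for parent and child and about 0 for unrelated pairs) is a common choice swapped in for illustration. All function names and the simulated database are hypothetical. To keep it short, the ancestry skew of the database is reduced to its consequence: criminal A has a child in the database, criminal B has no relative in it.

```python
import random

def draw_genotype(freqs, rng):
    """Unrelated individual: two independent allele draws per SNP (coded 0/1/2)."""
    return [(rng.random() < p) + (rng.random() < p) for p in freqs]

def draw_child(parent, freqs, rng):
    """Child: one allele passed down from `parent`, one drawn from the population."""
    child = []
    for g, p in zip(parent, freqs):
        from_parent = 1 if rng.random() < g / 2 else 0
        from_population = 1 if rng.random() < p else 0
        child.append(from_parent + from_population)
    return child

def estimate_freqs(database):
    """Allele frequency per SNP, estimated from the database itself."""
    n = len(database)
    return [sum(g[j] for g in database) / (2 * n) for j in range(len(database[0]))]

def kinship(x, y, freqs):
    """Half the mean standardized genotype product over polymorphic SNPs."""
    total, used = 0.0, 0
    for a, b, p in zip(x, y, freqs):
        if 0.0 < p < 1.0:  # skip SNPs with no variation in the database
            total += (a - 2 * p) * (b - 2 * p) / (2 * p * (1 - p))
            used += 1
    return total / (2 * used)

def best_match(query, database, freqs):
    """Index and kinship of the database member most similar to `query`."""
    values = [kinship(query, member, freqs) for member in database]
    best = max(range(len(values)), key=values.__getitem__)
    return best, values[best]

def demo(seed=1, n_snps=2000, n_unrelated=100):
    rng = random.Random(seed)
    true_freqs = [rng.uniform(0.2, 0.8) for _ in range(n_snps)]
    criminal_a = draw_genotype(true_freqs, rng)
    criminal_b = draw_genotype(true_freqs, rng)
    database = [draw_genotype(true_freqs, rng) for _ in range(n_unrelated)]
    database.append(draw_child(criminal_a, true_freqs, rng))  # A's relative, last entry
    freqs = estimate_freqs(database)
    return best_match(criminal_a, database, freqs), best_match(criminal_b, database, freqs)

match_a, match_b = demo()
print("criminal A best match (index, kinship):", match_a)
print("criminal B best match (index, kinship):", match_b)
```

In this setup the best match for criminal A is the simulated child, with kinship close to 0.25, while the best match for criminal B stays near zero. That mirrors the contrast between the red and blue histograms in Figure 2.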

Regardless of what the future holds for these students, they will almost assuredly be affected by the sequencing and genotyping of human genomes. Whether they interact with this type of data directly through careers as research scientists and medical doctors, or learn indirectly what their genetic code implies for their own health or that of family members, the impending genomic revolution in medicine will touch nearly everyone. We hope that through this module we have familiarized students with the nature of the genetic data that is presently available, as well as the usefulness and power of data analysis.

Sources: Graham Coop blog post; Atlantic, Golden State Killer; GEDmatch site; "Cryptic Distant Relatives Are Common in Both Isolated and Cosmopolitan Genetic Samples"; Consumer DNA Testing

Related Projects

This data expedition focused on the mechanisms animals use to orient using environmental stimuli, the methods that scientists use to test hypotheses about orientation, and the statistical methods used with circular orientation data. Students collected their own data set during the class period, performed hypothesis testing on their data using circular statistics in R, and aggregated their data in an RShiny online application to formally test the hypothesis that isopods orient with light.

This exercise served as a capstone to a series of four class sessions on orientation and navigation, where students read primary scientific literature that used circular statistics in their methods. This data exercise was used to give students the opportunity to collect their own data, discover why linear statistics wouldn’t be sufficient to analyze them, and then implement their own analysis. The goal of this course was to give students a better understanding of circular statistics, with hands-on application in forming and testing a hypothesis.
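A short example shows why linear statistics fall short for headings. The sketch below is in Python rather than the R used in class, and the eight headings are made-up values, not student data. It compares the arithmetic mean with the circular mean, then computes the mean resultant length and a Rayleigh test of uniformity. The p-value uses the simple large-sample approximation exp(-Z), which is an assumption of this sketch; the class may have used a different test or correction.

```python
import math

def circular_mean_deg(angles_deg):
    """Mean direction in degrees, from the averaged unit vectors."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

def mean_resultant_length(angles_deg):
    """Length of the mean unit vector: 0 means no shared direction, 1 means identical headings."""
    n = len(angles_deg)
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    return math.hypot(s, c)

def rayleigh_test(angles_deg):
    """Rayleigh Z statistic and an approximate p-value (large-sample form exp(-Z))."""
    n = len(angles_deg)
    z = n * mean_resultant_length(angles_deg) ** 2
    return z, min(1.0, math.exp(-z))

# Hypothetical headings (degrees) clustered around north.
headings = [350, 355, 5, 10, 15, 0, 345, 20]

print("arithmetic mean:", sum(headings) / len(headings))
print("circular mean:", round(circular_mean_deg(headings), 1))
print("mean resultant length:", round(mean_resultant_length(headings), 3))
print("Rayleigh Z, p:", rayleigh_test(headings))
```

The arithmetic mean of these headings points roughly southeast even though every animal headed close to north, because 350 and 10 degrees are treated as far apart on a line. The circular mean recovers a direction near north, and the Rayleigh test asks whether the headings are more concentrated than a uniform scatter would be.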

In this two-day, virtual data expedition project, students were introduced to the Actor-Partner Interdependence Model (APIM) in the context of stress proliferation, linked lives, the spousal relationship, and mental and physical health outcomes.

Stress proliferation is a concept within the stress process paradigm that explains how one person’s stressors can influence others (Thoits 2010). Combined with the life course principle of linked lives, it implies that because people are embedded in social networks, stress can not only affect the individual but also proliferate to people close to them (Elder Jr, Shanahan and Jennings 2015). For example, one spouse’s chronic health condition may lead to stress-provoking strain in the marital relationship, eventually spilling over to affect the other spouse’s mental health. Additionally, because partners share an environment, experiences, and resources (e.g., money and information), as well as exert social control over each other, they can monitor and influence each other’s health and health behaviors. This often leads to health concordance within couples: because the individuals in a couple influence each other’s health and well-being, their health tends to become more alike (Kiecolt-Glaser and Wilson 2017, Polenick, Renn and Birditt 2018). Thus, a spouse’s current health condition may influence their partner’s future health, and spouses may contemporaneously exhibit similar health conditions or behaviors.

However, how spouses influence each other may be patterned by the gender of the spouse with the health condition or exhibiting the health behaviors. Recent evidence suggests that a wife’s health condition may have little influence on her husband’s future health conditions, but that a husband’s health condition will most likely influence his wife’s future health (Kiecolt-Glaser and Wilson 2017).

Fluid mechanics is the study of how fluids (e.g., air, water) move and the forces on them. Scientists and engineers have developed mathematical equations to model the motions of fluid and inertial particles. However, these equations are often computationally expensive, meaning they take a long time for the computer to solve.


To reduce the computation time, we can use machine learning techniques to develop statistical models of fluid behavior. Statistical models do not actually represent the physics of fluids; rather, they learn trends and relationships from the results of previous simulation experiments. Statistical models allow us to leverage the findings of long, expensive simulations to obtain results in a fraction of the time. 


In this project, we provide students with the results of direct numerical simulations (DNS), which took many weeks for the computer to solve. We ask students to use machine learning techniques to develop statistical models of the results of the DNS.
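The idea of a statistical model standing in for a costly simulation can be sketched as follows. Everything here is an illustrative assumption: `expensive_simulation` is a cheap made-up function standing in for a DNS run, not a fluid solver, and k-nearest-neighbor regression is only one simple machine learning technique. The source does not say which methods the students used.

```python
import math

def expensive_simulation(x):
    """Stand-in for a costly simulation run at input parameter x (NOT real fluid physics)."""
    return math.sin(3.0 * x) + 0.5 * x

def predict_knn(training_pairs, x, k=3):
    """Predict the output at x as the average of the k nearest stored results."""
    nearest = sorted(training_pairs, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

# "Previous simulation experiments": run the stand-in once per parameter value and keep the results.
inputs = [i / 50 for i in range(101)]  # parameter values from 0.0 to 2.0
training_pairs = [(x, expensive_simulation(x)) for x in inputs]

# Later, estimate the output at a new parameter value without rerunning the simulation.
x_new = 0.777
print("surrogate prediction:", round(predict_knn(training_pairs, x_new), 4))
print("direct simulation:   ", round(expensive_simulation(x_new), 4))
```

The surrogate knows nothing about the function that generated the data; it only interpolates among stored results. That is why such models are fast, and also why they are only trustworthy inside the range of parameters covered by the earlier simulations.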