To Catch a Thief (with Data!)

Project Summary

Our aim was to introduce students to the wealth of possibilities that human genotyping and sequencing hold by illustrating firsthand the power of these datasets to identify genetic relatives, using the story of the Golden State Killer’s capture with public genetic databases.

Graduate Students: Ryan Campbell and Jenn Coughlan, Duke Biology

Course: BIO190S, Genetics and Evolution in Humans

Building on recent coursework on genetic differences between human populations, we discussed the underlying math used to describe these differences. We also covered the technologies that allow us to measure genetic differences quickly and affordably, the same ones that make consumer products such as 23andMe possible. After the introductory lecture, students split into groups to discuss articles related to the case, which gave them an opportunity to ask questions in an informal setting and learn from each other.

After the introduction and group work, we introduced the students to R and R Markdown files, which allow the user to combine typed notes and other organizational tips with executable code. The students were provided with a default R Markdown file (.Rmd), which they ran in R while making small changes. The file also contained their homework assignment, a set of questions that they answered directly in the R Markdown file.

In running this R Markdown file, the students downloaded a database containing genetic information from multiple individuals (HapMap). This information was stored as genotypes at a large number of single nucleotide polymorphisms (SNPs) for each of several hundred individuals. The students were asked to summarize the dataset and to determine how many human populations it contained (Fig 1).

Figure 1: First two components of a PCA of HapMap variant data, labeled by population of origin. Four populations were included (CEU - Caucasian, YRI - Yoruban, JPT - Japanese, CHB - Han Chinese), and they form three distinct groups.
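
The idea behind Figure 1 can be sketched on simulated data. The module itself ran in R on HapMap genotypes; the Python below is only an illustration under stated assumptions: the two populations (pop_A, pop_B), their allele frequencies, and the function names are all hypothetical, and only the first principal component is computed, using power iteration rather than a full PCA routine.

```python
import math
import random

def simulate_genotypes(pop_freqs, n_per_pop, seed=1):
    """Simulate genotypes (0, 1 or 2 alternate alleles per SNP) for each population."""
    rng = random.Random(seed)
    genotypes, labels = [], []
    for pop, freqs in pop_freqs.items():
        for _ in range(n_per_pop):
            genotypes.append([int(rng.random() < p) + int(rng.random() < p) for p in freqs])
            labels.append(pop)
    return genotypes, labels

def first_principal_component(genotypes, iterations=100, seed=1):
    """Return each sample's score on PC1, found by power iteration."""
    n = len(genotypes)
    # Center every SNP column on its mean genotype.
    means = [sum(col) / n for col in zip(*genotypes)]
    centered = [[g - m for g, m in zip(row, means)] for row in genotypes]
    rng = random.Random(seed)
    loadings = [rng.gauss(0, 1) for _ in means]
    for _ in range(iterations):
        # One power-iteration step: loadings <- X^T (X loadings), then normalize.
        scores = [sum(x * w for x, w in zip(row, loadings)) for row in centered]
        loadings = [sum(s * row[j] for s, row in zip(scores, centered))
                    for j in range(len(means))]
        norm = math.sqrt(sum(w * w for w in loadings))
        loadings = [w / norm for w in loadings]
    return [sum(x * w for x, w in zip(row, loadings)) for row in centered]

# Hypothetical frequencies: 20 SNPs differ strongly between populations, 20 do not.
pop_freqs = {
    "pop_A": [0.9] * 20 + [0.5] * 20,
    "pop_B": [0.1] * 20 + [0.5] * 20,
}
genotypes, labels = simulate_genotypes(pop_freqs, n_per_pop=30)
scores = first_principal_component(genotypes)

for pop in pop_freqs:
    pop_scores = [s for s, label in zip(scores, labels) if label == pop]
    print(pop, round(min(pop_scores), 2), round(max(pop_scores), 2))
```

The printed score ranges for the two simulated populations should not overlap, which is the same kind of separation that lets students count groups in a PCA plot. The sign of the scores is arbitrary, so which population lands on the positive side carries no meaning.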

Once the students were acclimated to the R environment, the dataset, and the markdown file, they drew a random population to simulate the membership of common consumer genetic databases such as GEDmatch, which are largely Northern European. In pairs, they picked a random “criminal”: one partner had code to pick a Northern European criminal, while the other had code to pick an individual not of Northern European ancestry. They then queried the database for the level of genetic similarity (kinship) between each sample and their focal criminal, looking for samples that would help their criminal investigation. An individual with high kinship to the criminal would be a close family relative and could therefore help identify them.

Figure 2: Histogram of kinship to a randomly selected sample among the GEDmatch-like database used in this module. The red histogram is that of a Northern European sample, and shows an individual of high kinship. The blue histogram is that of a non-Northern European sample which has fewer close relatives due to the makeup of the database.

Finally, the students were asked to assess the differences between the partners' outcomes, summarized in Figure 2. Because the database has far more Northern European samples, the partner given a criminal of this ancestry (red in the figure) found a much closer relative than the partner given a criminal of a different ancestry (blue in the figure).
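
The kinship query can also be sketched on simulated data. This is a minimal Python illustration, not the module's R code. It assumes unlinked SNPs with known allele frequencies and estimates kinship as half of the frequency-standardized genotype covariance, a common estimator that may differ from the one used in class. The database contents, the names target_a and target_b, and all function names are hypothetical.

```python
import random

def make_person(freqs, rng):
    """An unrelated individual: two haplotypes drawn independently."""
    return ([int(rng.random() < p) for p in freqs],
            [int(rng.random() < p) for p in freqs])

def make_child(mother, father, rng):
    """At each SNP the child inherits one randomly chosen allele from each parent."""
    return ([rng.choice(alleles) for alleles in zip(*mother)],
            [rng.choice(alleles) for alleles in zip(*father)])

def genotype(person):
    """Alternate-allele count (0, 1 or 2) at each SNP."""
    return [a + b for a, b in zip(*person)]

def kinship(g1, g2, freqs):
    """Half the standardized genotype covariance, averaged over SNPs."""
    total = sum((x - 2 * p) * (y - 2 * p) / (2 * p * (1 - p))
                for x, y, p in zip(g1, g2, freqs))
    return total / (2 * len(freqs))

rng = random.Random(42)
freqs = [rng.uniform(0.2, 0.8) for _ in range(2000)]  # assumed allele frequencies

# A toy database: 50 unrelated people plus one parent of target_a.
database = {f"unrelated_{i}": make_person(freqs, rng) for i in range(50)}
mother, father = make_person(freqs, rng), make_person(freqs, rng)
database["parent_of_A"] = mother

target_a = make_child(mother, father, rng)  # has a parent in the database
target_b = make_person(freqs, rng)          # has no relatives in the database

def best_match(target):
    """Return the database entry with the highest estimated kinship to the target."""
    g = genotype(target)
    scores = {name: kinship(g, genotype(p), freqs) for name, p in database.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]

match_a = best_match(target_a)
match_b = best_match(target_b)
print("target_a:", match_a)
print("target_b:", match_b)
```

In this toy setup the top hit for target_a should be the parent, with an estimate near the 0.25 expected for a parent-child pair, while the top hit for target_b should stay close to zero. That mirrors the contrast between the red and blue histograms: a target whose ancestry is well represented in the database is more likely to have a close relative in it.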

Regardless of what the future holds for these students, they will almost assuredly be affected by the sequencing and genotyping of human genomes. Whether they interact with this type of data directly, through careers as research scientists and medical doctors, or learn of the implications of their genetic code indirectly, through their own health or that of family members, the impending genomic revolution in medicine will touch nearly everyone. We hope that through this module we have familiarized students with the nature of the genetic data that is presently available, as well as the usefulness and power of data analysis.

Sources: Graham Coop blog post - Atlantic Golden State Killer; GEDmatch site - GEDmatch; Cryptic Distant Relatives Are Common in Both Isolated and Cosmopolitan Genetic Samples; Consumer DNA Testing
