Exposure to local pathogens is a significant selective pressure on the human genome: the strongest selective forces identified in modern human populations act on mutations that confer increased resistance to malaria infection. Understanding how human genetic variation affects susceptibility to pathogens can reveal important aspects of disease biology and novel treatment targets. Using genome-wide association studies (GWAS) of infection-related cellular traits, we can connect human genetic variation to disease susceptibility in a controlled laboratory environment. Identifying the variants, genes, and cellular pathways involved in infectious disease pathogenesis can inform host-directed therapeutics, clinically effective risk stratification, and epidemiological prediction. This data expedition explores the effect of host genetic variation on the chemokine response to Chlamydia infection.
Graduate Students: Rylee Hackley & Benjamin Schott
Faculty: Steve Haase
Course: Biology 432S: Biology of Host-Pathogen Interactions
In this course, students discuss and critique primary research literature on host-pathogen interactions. However, student-selected papers tend to focus on the pathogen physiology, molecular biology, and genetics that enable increased infection success, so course themes tend to follow the evolutionary arms race from the pathogen perspective. We sought to introduce students to a host-centered perspective through this data expedition, drawing on their familiarity with concepts like GWAS and linkage disequilibrium from Bio 202L. Through this workshop, we wanted students to understand the value of descriptive metadata, exploratory data analysis, and visualization. After the analysis, students were asked to interpret the importance of the results at the level of the data (can you propose a biological mechanism to explain the results?) and to demonstrate their understanding of how the host immune response contributes to disease symptoms.
Learning Outcomes
- Conduct exploratory data analysis:
  - interpret histograms
  - test for normality and general statistics
- Discuss the contribution of both host and pathogen variation to mechanisms of susceptibility and resistance
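The exploratory steps named in the first outcome can be sketched in a few lines. The workshop itself was run in R; the Python below is only an illustrative stand-in, and it uses synthetic, right-skewed values rather than the real CXCL10 measurements. The helper names (`histogram_counts`, `skewness`) are hypothetical, and skewness is used here as a quick numeric check of symmetry, not as a formal normality test.

```python
import math
import random
import statistics


def histogram_counts(values, n_bins=10):
    """Return (bin_edges, counts) for equal-width bins spanning the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


def skewness(values):
    """Moment-based skewness: near 0 for symmetric data, > 0 for a right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Synthetic stand-in for a right-skewed abundance phenotype (NOT the real data).
rng = random.Random(0)
raw = [rng.lognormvariate(5.0, 0.8) for _ in range(527)]
logged = [math.log2(v) for v in raw]

for label, values in (("raw", raw), ("log2", logged)):
    edges, counts = histogram_counts(values)
    print(f"{label}: mean={statistics.fmean(values):.2f} "
          f"median={statistics.median(values):.2f} "
          f"skewness={skewness(values):.2f}")
    # Text histogram: one '#' per four observations in the bin.
    for left, count in zip(edges, counts):
        print(f"  {left:10.2f} | {'#' * (count // 4)}")
```

Comparing the two text histograms shows the kind of pattern students can look for: a long right tail with the mean above the median in the raw values, and a more symmetric shape after the log2 transform.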
The Dataset
Data were collected as part of a larger screen to identify the human genetic variants associated with differential outcomes following Chlamydia trachomatis infection. Sheet 1 of the provided Excel file contains sample metadata, phenotype data, and genotype data for 527 cell lines screened for CXCL10 abundance after C. trachomatis infection. To keep the file simple and small, the genotype data are subset to variants within 2 Mb of the gene that encodes CXCL10. Sheet 1 dimensions are 527 x 4107, and the complete dataset was published [1]. Sheet 2 contains association test results generated with PLINK [2], also subset to variants within 2 Mb of CXCL10. Sheet 2 dimensions are 7 x 4103, and a complete dataset is available [3, H2P2 website].
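As a rough illustration of what "within 2 Mb of the gene" means as a filter, the sketch below keeps variants that fall inside a window extending 2 Mb on either side of a gene interval. It is not the procedure used to build the provided file: the `Variant` class, the `variants_near_gene` helper, the chromosome label, and all coordinates are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    snp_id: str
    chrom: str
    pos: int  # base-pair coordinate on the chromosome


def variants_near_gene(variants, chrom, gene_start, gene_end, flank_bp=2_000_000):
    """Keep variants on `chrom` within `flank_bp` of the gene body (inclusive)."""
    lo = max(0, gene_start - flank_bp)
    hi = gene_end + flank_bp
    return [v for v in variants if v.chrom == chrom and lo <= v.pos <= hi]


# Hypothetical gene interval and variants, for illustration only.
GENE_CHROM, GENE_START, GENE_END = "chrA", 10_000_000, 10_002_500
variants = [
    Variant("snp_upstream_far", "chrA", 7_500_000),       # 2.5 Mb upstream: dropped
    Variant("snp_upstream_near", "chrA", 8_500_000),      # 1.5 Mb upstream: kept
    Variant("snp_in_gene", "chrA", 10_001_000),           # inside the gene: kept
    Variant("snp_downstream_edge", "chrA", 12_002_500),   # exactly 2 Mb downstream: kept
    Variant("snp_other_chrom", "chrB", 10_001_000),       # other chromosome: dropped
]

kept = variants_near_gene(variants, GENE_CHROM, GENE_START, GENE_END)
print([v.snp_id for v in kept])
```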
Download the dataset: Data_final.xls
In-Class Exercises
Prior to this workshop, students downloaded R and RStudio and installed the required packages. Since this course typically reads and discusses student-selected host-pathogen journal articles each week, we provided the most relevant source [1] but did not require students to read it beforehand. Most students had some basic familiarity with the RStudio environment from an introductory statistics class, but we provided some tutorials for anyone who wanted to re-familiarize themselves.
During this 75-minute workshop, 18 students were split into groups of 3-4 to work together. Ben introduced the class to the concept of host genetic variation influencing disease outcomes and summarized key differences between conventional and cellular GWAS. Students were encouraged to think about the limitations of interpreting results from a cellular screen at the organismal level. Rylee then showed students how to open RStudio, download the data, and load the necessary packages. Rylee had students “code-along” for the first chunk, explaining how to execute code and inspect data objects. Groups ran the next code chunk and answered worksheet questions together, while Ben and Rylee circled the room.
After exploring the dataset, we had a class discussion about how the metadata was essential to understanding both the data and the question we were trying to answer with it. Ben transitioned into the association testing results (Sheet 2), explaining the rationale behind log2 normalization of the data, the concept of regression, and the multiple testing burden in quantitative association studies. In groups, students visualized the results of association testing to identify the human variant most associated with high CXCL10 protein levels during Chlamydia infection. Students were surprised to find a second nearby SNP that was equally significant, which prompted a discussion of linkage disequilibrium and haplotypes. Finally, students were shown how to locate genomic features near these SNPs using the UCSC Genome Browser and found that the most associated SNP was near the CXCL10 gene, suggesting a possible cis-regulatory role.
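The ideas in this part of the workshop (log2 transformation, regression of a quantitative trait on genotype, a multiple testing threshold, and linkage disequilibrium) can be illustrated together on simulated data. This is a minimal sketch under stated assumptions, not the PLINK analysis behind Sheet 2: the genotypes and phenotype are simulated, the helper names are hypothetical, the p-value uses a normal approximation to the t distribution, LD is approximated as the squared correlation between allele dosages, and the number of tests is assumed to equal the 4103 dimension reported for Sheet 2.

```python
import math
import random
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def association_test(dosage, phenotype):
    """Regress phenotype on allele dosage (0/1/2); return (slope, p_value).

    ASSUMPTION: the two-sided p-value uses a normal approximation to the
    t distribution, which is only reasonable for large samples.
    """
    n = len(dosage)
    mx, my = fmean(dosage), fmean(phenotype)
    sxx = sum((a - mx) ** 2 for a in dosage)
    sxy = sum((a - mx) * (b - my) for a, b in zip(dosage, phenotype))
    slope = sxy / sxx
    r = pearson_r(dosage, phenotype)
    t_stat = r * math.sqrt((n - 2) / (1 - r * r))
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))  # equals 2 * (1 - Phi(|t|))
    return slope, p_value


rng = random.Random(1)
n_lines = 527  # number of cell lines, as in Sheet 1


def random_dosages(allele_freq):
    """Simulate 0/1/2 allele counts as two independent allele draws."""
    return [(rng.random() < allele_freq) + (rng.random() < allele_freq)
            for _ in range(n_lines)]


snps = {
    "snp_a": random_dosages(0.3),  # simulated causal variant
    "snp_c": random_dosages(0.3),  # simulated unlinked variant
}
snps["snp_b"] = list(snps["snp_a"])  # perfect LD with snp_a: same dosages

# Simulated abundance is multiplicative in genotype, so log2 makes it additive.
raw = [2 ** (8 + 0.8 * g + rng.gauss(0, 1)) for g in snps["snp_a"]]
log2_pheno = [math.log2(v) for v in raw]

n_tests = 4103            # ASSUMED number of tests (Sheet 2 dimension)
alpha = 0.05 / n_tests    # Bonferroni-corrected significance threshold

results = {}
for name in ("snp_a", "snp_b", "snp_c"):
    slope, p = association_test(snps[name], log2_pheno)
    r2 = pearson_r(snps["snp_a"], snps[name]) ** 2  # dosage-based LD proxy
    results[name] = (slope, p, r2)
    print(f"{name}: slope={slope:.2f} p={p:.2e} "
          f"r2_with_snp_a={r2:.2f} significant={p < alpha}")
```

In this simulation, `snp_b` carries exactly the same genotypes as `snp_a`, so the two receive identical test statistics. That mirrors the classroom observation that a second nearby SNP can be equally significant, and it is why association alone cannot say which linked variant is causal.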
Student Feedback
“I like seeing how data can be transformed into graphs”
“It was an interesting premise and had a clear direction”
“I liked using real data and seeing how we could use it”
“When looking at someone else’s plot, we have to understand what x and y axis represent before looking into any statistical parameters.”
Data Sources
- Wang, L., et al. (2018). “An Atlas of Genetic Variation Linking Pathogen-Induced Cellular Traits to Human Disease.” Cell Host Microbe 24(2): 308-323 e306.
- Purcell, S., et al. (2007). “PLINK: a toolset for whole-genome association and population-based linkage analysis.” American Journal of Human Genetics 81(3): 559-575.
- The 1000 Genomes Project Consortium, et al. (2015). “A global reference for human genetic variation.” Nature 526(7571): 68-74.