Looking at the host in “host-pathogen”: Exploring effects of host genetic variation on Chlamydia trachomatis infection

Project Summary

Exposure to local pathogens is a significant selective pressure on the human genome: the strongest selective forces identified in modern human populations are for mutations that confer increased resistance to malaria infection. Understanding how human genetic variation affects susceptibility to pathogens can reveal important aspects of disease biology and point to novel treatment targets. By using genome-wide association of infection-related cellular traits, we can connect human genetic variation to disease susceptibility in a controlled laboratory environment. Identifying the variants, genes, and cellular pathways involved in infectious disease pathogenesis can inform host-directed therapeutics, clinically effective risk stratification, and epidemiological prediction. This data expedition explores the effect of host genetic variation on the chemokine response to Chlamydia infection.

Year: 2022

Graduate Students: Rylee Hackley & Benjamin Schott
Faculty: Steve Haase
Course: Biology 432S: Biology of Host-Pathogen Interactions

In Biology 432S, students discuss and critique primary research literature on host-pathogen interactions. However, student-selected papers tend to focus on the pathogen physiology, molecular biology, and genetics that enable increased infection success, so course themes tend to center on the evolutionary arms race from the pathogen perspective. We sought to introduce students to a host-centered perspective through this data expedition, drawing on their familiarity with concepts like GWAS and linkage disequilibrium from Bio 202L. Through this workshop, we wanted students to understand the value of descriptive metadata, exploratory data analysis, and visualization. After the analysis, students were asked to interpret the importance of the results at the level of the data (can you propose a biological mechanism to explain the results?) and to demonstrate their understanding of how the host immune response contributes to disease symptoms.

Learning Outcomes

  1. Conduct exploratory data analysis:

    1. interpret histograms

    2. test for normality and general statistics

  2. Discuss the contribution of both host and pathogen variation to mechanisms of susceptibility and resistance
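
The exploratory steps in the first outcome can be sketched in a few lines. This is not the workshop code (which ran in R) but a minimal Python stand-in under stated assumptions: the data are synthetic, the helper names `skewness` and `text_histogram` are hypothetical, and skewness is used only as a rough screen for non-normality rather than a formal test such as Shapiro-Wilk.

```python
import math
import random
import statistics


def skewness(values):
    """Third standardized moment (population form); near 0 for symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def text_histogram(values, bins=8, width=40):
    """Return one text line per equal-width bin: left edge, bar, count."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would land in bin index `bins`, so clamp it.
        idx = min(int((v - lo) / step), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    return [
        f"{lo + i * step:10.2f} | {'#' * round(width * c / peak)} {c}"
        for i, c in enumerate(counts)
    ]


random.seed(1)
# Synthetic, right-skewed stand-in for an abundance measurement (NOT real data).
raw = [random.lognormvariate(6.0, 0.8) for _ in range(527)]
log2_values = [math.log2(v) for v in raw]

raw_skew = skewness(raw)
log2_skew = skewness(log2_values)
histogram_lines = text_histogram(log2_values)

print(f"skewness raw:  {raw_skew:.2f}")
print(f"skewness log2: {log2_skew:.2f}")
print("\n".join(histogram_lines))
```

In this toy setting the raw values have a long right tail, while the log2-transformed values are close to symmetric, which is the kind of pattern a histogram makes visible before any statistics are computed.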

The Dataset

Data were collected as part of a larger screen to identify the human genetic variants associated with differential outcomes following Chlamydia trachomatis infection. Sheet 1 of the provided Excel file contains sample metadata, phenotype data, and genotype data for 527 cell lines screened for CXCL10 abundance after C. trachomatis infection. For simplicity and size, the genotype data are subset to include only variants within 2 Mb of the gene that encodes CXCL10. Sheet 1 dimensions are 527 x 4107, and the complete dataset was published [1]. Sheet 2 contains association test results generated with PLINK [2], subset to the same variants within 2 Mb of CXCL10. Sheet 2 dimensions are 7 x 4103, and a complete dataset is available [3, H2P2 website].
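
To make the subsetting step concrete, the sketch below keeps only variants that fall within a fixed window around a gene. It is an illustration only: the coordinates and variant names are made up (they are not real CXCL10 positions), and measuring the 2 Mb window from the gene boundaries rather than from a midpoint is an assumption, since the description above does not say which was used.

```python
WINDOW_BP = 2_000_000  # 2 Mb window, as in the dataset description


def within_window(position, gene_start, gene_end, window=WINDOW_BP):
    """True if a variant lies within `window` bp of the gene body (inclusive)."""
    return gene_start - window <= position <= gene_end + window


# Hypothetical gene coordinates and variant positions, for illustration only.
GENE_START, GENE_END = 10_000_000, 10_002_500
variants = {
    "snp_a": 7_900_000,   # more than 2 Mb upstream: dropped
    "snp_b": 8_000_000,   # exactly on the upstream boundary: kept
    "snp_c": 10_001_000,  # inside the gene: kept
    "snp_d": 12_002_500,  # exactly on the downstream boundary: kept
    "snp_e": 12_002_501,  # one base past the boundary: dropped
}

kept = [
    name
    for name, pos in variants.items()
    if within_window(pos, GENE_START, GENE_END)
]
print(kept)  # -> ['snp_b', 'snp_c', 'snp_d']
```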

Download the dataset: Data_final.xls

In-Class Exercises

Prior to this workshop, students downloaded R and RStudio and installed the required packages. Since this course typically reads and discusses student-selected host-pathogen journal articles each week, we provided the most relevant source [1] but did not require students to read it beforehand. Most students had some basic familiarity with the RStudio environment from an introductory statistics class, but we provided tutorials for anyone who wanted to refamiliarize themselves.

During this 75-minute workshop, 18 students were split into groups of 3-4 to work together. Ben introduced the class to the concept of host genetic variation influencing disease outcomes and summarized key differences between conventional and cellular GWAS. Students were encouraged to think about the limitations of interpreting results from a cellular screen at the organismal level. Rylee then showed students how to open RStudio and download the data and necessary packages. Rylee had students “code along” for the first chunk, explaining how to execute code and inspect data objects. Groups ran the next code chunk and answered worksheet questions together, while Ben and Rylee circled the room.

After exploring the dataset, we had a class discussion about how the metadata were essential to understanding both the data and the question we were trying to answer with them. Ben transitioned into talking about the association testing results (Sheet 2), explaining the rationale behind log2 normalization of the data and the concepts of regression and multiple testing burden in quantitative association studies. In groups, students visualized the results of association testing to identify the human variant most associated with high CXCL10 protein levels during Chlamydia infection. Students were surprised to find a second nearby SNP that was equally significant, which prompted a discussion of linkage disequilibrium and haplotypes. Finally, students were shown how to locate genomic features near these SNPs using the UCSC Genome Browser, and found that the most associated SNP was near the CXCL10 gene, suggesting a possible cis-regulatory role.
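
The ideas in this part of the session (regression of a quantitative trait on genotype, the multiple testing burden, and tied top hits) can be sketched as follows. This is a simplified Python illustration, not the PLINK analysis behind Sheet 2: all genotypes, phenotypes, and p-values are made up, the function names are hypothetical, and the number of tests (4103) is taken from the Sheet 2 dimensions as an assumption about how many variants were tested in the subset. The sketch reports a t statistic rather than a p-value, because the Python standard library has no t distribution.

```python
import math
import statistics


def linear_association(dosages, phenotypes):
    """Least-squares slope and t statistic of phenotype on allele dosage (0/1/2)."""
    n = len(dosages)
    mean_x = statistics.fmean(dosages)
    mean_y = statistics.fmean(phenotypes)
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(dosages, phenotypes)
    )
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff that keeps family-wise error at `alpha`."""
    return alpha / n_tests


# Made-up example: allele dosage per cell line and a log2-scale phenotype.
dosages = [0, 0, 0, 1, 1, 1, 2, 2, 2]
phenotypes = [8.0, 8.2, 7.9, 8.6, 8.4, 8.7, 9.1, 9.0, 9.3]
slope, t_stat = linear_association(dosages, phenotypes)
print(f"effect per allele: {slope:.2f} log2 units (t = {t_stat:.1f})")

# Made-up p-values; two variants tie, as happens for SNPs in strong LD.
p_values = {"snp_1": 3e-4, "snp_2": 2e-9, "snp_3": 2e-9, "snp_4": 0.04}
threshold = bonferroni_threshold(0.05, 4103)
significant = sorted(name for name, p in p_values.items() if p < threshold)
best_p = min(p_values.values())
top_hits = sorted(name for name, p in p_values.items() if p == best_p)

print(f"Bonferroni threshold: {threshold:.2e}")
print("significant:", significant)
print("top hits:", top_hits)
```

Note that `snp_1` would pass a naive 0.05 cutoff but not the corrected one, and that two variants with identical p-values cannot be told apart by the association test alone, which is why linkage disequilibrium matters when interpreting the top hit.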

Student Feedback

“I like seeing how data can be transformed into graphs”
“It was an interesting premise and had a clear direction”
“I liked using real data and seeing how we could use it”
“When looking at someone else’s plot, we have to understand what x and y axis represent before looking into any statistical parameters.”

Data Sources

  1. Wang, L., et al. (2018). "An Atlas of Genetic Variation Linking Pathogen-Induced Cellular Traits to Human Disease." Cell Host Microbe 24(2): 308-323 e306.

  2. Purcell, S., et al. (2007). "PLINK: a toolset for whole-genome association and population-based linkage analysis." American Journal of Human Genetics 81(3): 559-575.

  3. The 1000 Genomes Project Consortium, et al. (2015). "A global reference for human genetic variation." Nature 526(7571): 68-74.
