--- title: "ToCatchAThief" author: "c ryan campbell & jenn coughlan" date: "7/23/2018" output: pdf_document --- Welcome to the "To Catch a Thief: With Data!" walkthrough! https://bioconductor.org/packages/devel/bioc/vignettes/SNPRelate/inst/doc/SNPRelateTutorial.html First we're going to install the required software within R, answer "yes" or "all" to any prompted questions: i ran source successfully!! ```{r setup, include=FALSE} source("http://bioconductor.org/biocLite.R") biocLite("gdsfmt") biocLite("SNPRelate") # Load the R packages: gdsfmt and SNPRelate library(gdsfmt) library(SNPRelate) gg_color_hue <- function(n) { hues = seq(15, 375, length = n + 1) hcl(h = hues, l = 65, c = 100)[1:n] } ``` In case you're wondering, this is an "R Markdown" document. Everything in the white space is notes, the grey space (between ``` symbols) is code to be run. This makes it easy to take notes, and jot down important things that aren't computer jargon right next to the things that you want to run as computer jargon. For details see: . Inspect the SNP data then then save them as "genofile" (in R "<-" saves a variable) ```{r genotype file} snpgdsSummary(snpgdsExampleFileName()) genofile <- snpgdsOpen(snpgdsExampleFileName(), allow.duplicate = TRUE) ``` Now let's inspect the data a little more thoroughly ```{r} head(read.gdsn(index.gdsn(genofile, "snp.rs.id"))) pop <- read.gdsn(index.gdsn(genofile, path="sample.annot/pop.group")) table(pop) ``` Questions to answer (feel free to edit this document to include your answers!): 1) How many samples are in this file? 2) How many SNPs are in this file (per sample, not total)? 3) What do these population codes mean (use Google)? Now we're going to visualize the samples with a Principle Components Analysis (Jenn discussed this method earlier this week). There is a lot going on in the next block of text to run, don't focus on the minor things just the end product. ```{r PCA} #first we're going to eliminate SNPs in LD with each other set.seed(1000) snpset <- snpgdsLDpruning(genofile, ld.threshold=0.2) snpset.id <- unlist(snpset) #runPCA analysis on the smaller set of data, "snpset.id" pca <- snpgdsPCA(genofile, snp.id=snpset.id, num.thread=2) #convert PCA results to a table tab <- data.frame(sample.id = pca$sample.id, EV1 = pca$eigenvect[,1], # the first eigenvector EV2 = pca$eigenvect[,2], # the second eigenvector stringsAsFactors = FALSE) #visualize that table head(tab) #plot plot(tab$EV2, tab$EV1, xlab="PCA2", ylab="PCA1") # Get sample id and pop info sample.id <- read.gdsn(index.gdsn(genofile, "sample.id")) pop_code <- read.gdsn(index.gdsn(genofile, "sample.annot/pop.group")) #add pop info to table tab <- data.frame(sample.id = pca$sample.id, pop = factor(pop_code)[match(pca$sample.id, sample.id)], EV1 = pca$eigenvect[,1], # the first eigenvector EV2 = pca$eigenvect[,2], # the second eigenvector stringsAsFactors = FALSE) head(tab) #plot with pop info plot(tab$EV2, tab$EV1, col=as.integer(tab$pop), xlab="eigenvector 2", ylab="eigenvector 1"); legend("bottomright", legend=levels(tab$pop), pch="o", col=1:nlevels(tab$pop)) ``` 4) Why did we eliminate SNPs that are in LD with each other? 5) How many clusters do we have? 6) What does that tell us about our individuals? 7) Which populations cluster together on the PCA? Now, given a criminal can we find relatives in this database? We'll need to use a different analysis here, one of "kinship" or the relatedeness between two samples There are 3 outputs from this analysis - k0 - probability that zero genetic segments are shared k1 - probability that at least one genetic segments are shared kinship - numerical relationship between samples (ranging from 0 - .5) relatedness = 2 * kinship (ranges from 0-1) How much of your DNA do you share with 8) a parent? 9) a sibling? 10) grandparent? the relatedness values are directly proportional to these answers. If you find a sample of .25 relatedness you've got the criminal's grandparent or aunt/uncle. Here we will generate a pool of individuals similar to GEDmatch. By recent estimates GEDmatch has 10x as many Caucasian entries as any other human population, so the remainder of this walkthrough will have 89 Caucasian samples and 9 Yoruban, 9 Han Chinese, and 9 Japanese. ```{r} sample.id <- read.gdsn(index.gdsn(genofile, "sample.id")) GEDmatch.id <- sample.id[pop_code == "CEU"] GEDmatch.id <- GEDmatch.id[-37] GEDmatch.id <- GEDmatch.id[-43] GEDmatch.id <- GEDmatch.id[-51] GEDmatch.id <- c(GEDmatch.id,sample.id[pop_code == "YRI"][1:9]) GEDmatch.id <- c(GEDmatch.id,sample.id[pop_code == "HCB"][1:9]) GEDmatch.id <- c(GEDmatch.id,sample.id[pop_code == "JPT"][1:9]) ``` The partner differentiation kicks in here SO READ CAREFULLY: ```{r} #PARTNER A - uncomment (delete the #'s) this section and run criminal <- GEDmatch.id[sample(1:89,1)] ############ #PARTNER B - uncomment (delete the #'s) this section and run #criminal <- GEDmatch.id[sample(90:116,1)] ############ #THIS SECTION RANDOMLY ASSIGNS YOU A CRIMINAL #JUST RUN IT ONCE #IF YOU RUN IT AGAIN YOUR CRIMINAL (AND RESULTS) WILL CHANGE ``` Finally we're going to compare every sample in the database to every other, then isolate those that include our "criminal". ```{r} # Estimate IBD coefficients ibd <- snpgdsIBDMoM(genofile, sample.id=GEDmatch.id, snp.id=snpset.id, maf=0.05, missing.rate=0.05, num.thread=4) #convert the data to a table ibd.coeff <- snpgdsIBDSelection(ibd) head(ibd.coeff) #plot ALL data plot(ibd.coeff$k0, ibd.coeff$k1, xlim=c(0,1), ylim=c(0,1), xlab="k0", ylab="k1", main="HapMap samples") plot(ibd.coeff$k1, ibd.coeff$kinship, xlim=c(0,1), ylim=c(0,.5), xlab="k1", ylab="kinship", main="HapMap samples") #isolate criminal match data ibd.criminal <- subset(ibd.coeff, ID1 == criminal | ID2 == criminal) #plot criminal data plot(ibd.criminal$k1, ibd.criminal$kinship, xlim=c(0,1), ylim=c(0,.5), xlab="k1", ylab="kinship", main="Criminal Relatedness Only") #plot kinship hist(ibd.criminal$kinship) #return maximum kinship value max(ibd.criminal$kinship) #return ID match ibd.criminal$ID1[which.max(ibd.criminal$kinship)] #return population tab$pop[tab$sample.id == criminal] ``` 11) Are you partner A or B? 12) What is the ID of your criminal? (just run "criminal") 13) Does your criminal have any matches? 14) What kinship level? 15) What relatedness is that? 16) Who is that, as in which family member? 17) Finally compare your result to your partner - how did they differ and do you have a guess as to why?