This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
print("Hello world!")
## [1] "Hello world!"
Use the green play button in the top right corner of the code chunk to run the code (or all the code in the file). For this lesson, let’s run each chunk one at a time.
Ok, lets get started. You’ve downloaded the datasheet right? It’s a .csv file called “EnoRiverData.csv”. Do you remember where you put it? Maybe it’s in a folder on your Desktop? Great, lets set the working directory to that location and the read in the data file.
Look in the upper right panel, under the Environment tab…do you see the data file there. Good. If not, check to make sure that this markdown file is saved in the same location as the data file and that also matches the working directory (setwd).
Let’s inspect the data:
head(data)
## X Site Meters.Downstream Position DO_mgL DO_pctsat temperature_C SpCond
## 1 1 1 1 3 6.51 85.2 29.0 145.6
## 2 2 2 7 2 5.81 76.8 29.0 146.1
## 3 3 3 12 1 6.49 83.6 28.6 145.6
## 4 4 4 16 5 5.62 74.7 29.9 179.2
## 5 5 5 30 5 7.33 100.0 31.1 178.6
## 6 6 6 37 4 6.60 86.6 29.0 145.5
## pH Depth_m Velocity snail_count Total_pct_plant.algae_cover
## 1 7.50 0.48 0.30 5 73
## 2 7.49 0.03 0.04 0 1
## 3 7.51 0.16 0.20 2 2
## 4 7.73 0.14 0.01 5 0
## 5 8.01 0.08 0.01 10 0
## 6 7.69 0.13 0.65 9 90
## pct_coverBelowSurface pct_coverAboveSurface
## 1 100 0
## 2 100 0
## 3 100 0
## 4 0 0
## 5 0 0
## 6 97 3
## AutotrophCrewNotes
## 1 all species 1
## 2 species 1
## 3 100% species 2
## 4 all sediment
## 5
## 6 some above, most fallen - 6 individuals of species 4, most of species 5
Ok, looks like stream data. I see sites and some other stuff. Ok, there’s pH, we know what that is. Snail count! Nice. Let’s do something with that:
# How many sites are in the dataset
nrow(data)
## [1] 25
# Get the summary statistics for snail count
summary(data$snail_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 2.00 2.92 5.00 10.00
# What was the snail count at site #5
data$snail_count[which(data$Site == 5)]
## [1] 10
We can also subset the data:
#Let's look at only the really shallow sites
data[which(data$Depth_m < 0.10), 1:6]
## X Site Meters.Downstream Position DO_mgL DO_pctsat
## 2 2 2 7 2 5.81 76.8
## 5 5 5 30 5 7.33 100.0
## 15 15 15 114 1 5.18 65.1
How about we see the entire distribution of depts with a histogram?
hist(data$Depth_m)
Now lets make a plot of some of these parameters:
# Lets see depth as we move downstream
plot(data$Meters.Downstream, data$Depth_m)
Looks like it was getting deeper as the investigators moved downstream. Let’s keep that in mind for later because it might impact the relationships we observe in the data. But first, let’s jazz this plot up a bit…
plot(data$Meters.Downstream, data$Depth_m,
xlab = "Distance downstream (meters)", #add a nice axis labels
ylab = "Depth (meters)", #and one for the y axis
pch = 24, #change the symbol
cex = 2, #change symbol siz
main = "Depth across Eno River sample sites", #add a nice title
col = "blue", #choose a color
bg = "aquamarine3" #and symbol fill color
)
Don’t get to carried away with colors, good plots keep the use of color simple, but for reference check out the website: https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf
and some basic graphical parameter info: https://www.statmethods.net/advgraphs/parameters.html
Ok, what about some variables that we think might be correlated? Pick two and create your own plot here:
# we chose temperature and pH
plot(data$temperature_C, data$pH,
xlab = "Temperature (degrees C",
ylab = "pH",
main = "Temperature versus pH at the Eno River",
pch = 17,
col = 4)
Dang! That looks like a stong correlation. But how strong? We can use a linear regression model to check the strength of this relationship.
fit <- glm(pH~temperature_C, data=data) #store the model output as "fit"
fit #now look at the fit
##
## Call: glm(formula = pH ~ temperature_C, data = data)
##
## Coefficients:
## (Intercept) temperature_C
## 2.7381 0.1663
##
## Degrees of Freedom: 24 Total (i.e. Null); 23 Residual
## Null Deviance: 0.421
## Residual Deviance: 0.1682 AIC: -48.09
summary(fit) #and get summary statistics of the model
##
## Call:
## glm(formula = pH ~ temperature_C, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.31686 -0.03034 0.00977 0.02314 0.13629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.73815 0.83303 3.287 0.00323 **
## temperature_C 0.16629 0.02829 5.879 5.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.007314263)
##
## Null deviance: 0.42102 on 24 degrees of freedom
## Residual deviance: 0.16823 on 23 degrees of freedom
## AIC: -48.086
##
## Number of Fisher Scoring iterations: 2
Yep, it looks like the relationship is significant (see how small that p-value is, definitely less than 0.05) And the R-squared value is 0.6 indicating that the model explains about 60% of the variation in the observations. For ecology data, that’s pretty good!
Want to plot the fit?!
plot(data$temperature_C, data$pH,
xlab = "Temperature (degrees C",
ylab = "pH",
main = "Temperature versus pH at the Eno River",
pch = 17,
col = 4)
abline(fit) #adds the line to the plot
text(28.75,8, "pH = 0.1663*(temp) + 2.7381", cex = 0.8, font = 3)
text(28.75,7.96,paste("R^2 = ", format(summary(fit)$r.squared, digits=4)), cex = 0.8)
text(28.75,7.92,paste("p = ", format(summary(fit)$coefficients[2,4], digits=4)), cex = 0.8)
– plot a few more relationships between variables. Do you find any other significant relationships? (Use control-alt-I to insert a new code chunk)
## your code here:
Ready to take it to the next level? Let’s do a linear model with more than one explanitory variable First, lets see if velocity predicts stream oxygen concentration
plot(data$Velocity, data$DO_mgL)
DO.model1 <- lm(DO_mgL ~ Velocity, data = data)
summary(DO.model1)
##
## Call:
## lm(formula = DO_mgL ~ Velocity, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85907 -0.24885 -0.09922 0.18209 1.45086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8592 0.1193 49.121 <2e-16 ***
## Velocity 1.9928 0.5981 3.332 0.0029 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4918 on 23 degrees of freedom
## Multiple R-squared: 0.3255, Adjusted R-squared: 0.2962
## F-statistic: 11.1 on 1 and 23 DF, p-value: 0.002901
There appears to be a relationship, though it is not strong. Perhaps adding temperature will stregthen the model:
DO.model2 <- glm(DO_mgL ~ Velocity + temperature_C, data = data)
summary(DO.model2) # look at the output
##
## Call:
## glm(formula = DO_mgL ~ Velocity + temperature_C, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.74855 -0.40687 0.08191 0.27314 0.85677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.1171 4.3843 -1.167 0.255646
## Velocity 2.2057 0.5462 4.039 0.000549 ***
## temperature_C 0.3720 0.1485 2.504 0.020176 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1967873)
##
## Null deviance: 8.2484 on 24 degrees of freedom
## Residual deviance: 4.3293 on 22 degrees of freedom
## AIC: 35.11
##
## Number of Fisher Scoring iterations: 2
Yes! We have improved the R-squared value by adding a second predictor variable to the model.
FYI: the model output gives you two R squared values, the multiple R-squared and adjusted R-squared. The difference between these two is the multiple R-squared tells you the total amount of variation of your dependent variable is explained by your predictors. The adjustedR-squared is a reduced value that accounts for the use of more predictor variables to help avoid overfitting. So from here on we will refer to the adjusted R-squared value when evaluating our models.
Now let’s use our model to predict oxygen level at the sample sites.
# add a list of predicted values to the data frame and sort the data by velocity so that we can plot the predictions
data$DOpreds<- DO.model2$fitted.values
sort.data <- data[order(data$Velocity),]
plot(data$Velocity, data$DO_mgL, xlab = "Velocity", ylab = "DO (mg/L)")
# add the predictions from linear model as a dashed line
lines(sort.data$Velocity, sort.data$DOpreds, lty = 2)
Often ecological data is not linear, particularly concentration data (DO, conductivity) What happens if you try to transform the data, does this improve the model fit more than adding Temperature as a second predictor? Try a log transformation on the data:
# your code here
Can you predict how many snails will be found given the physical parameters of a site? What varibale might affect where snails live? Build your own model with one or more variables
Not sure where to start? Try making a box plot with some of your variables
boxplot(data$snail_count, ylab = "snail count", xlab = "all sites")
# you can look at the distribution of snails across binned values of another variable:
# Create 4 bins of the DO data, with the same number of points in each bin
breaks <- quantile(data$DO_mgL) # this creates a vector of the s
data$DO_bins <- cut(data$DO_mgL, breaks) # add a column of the DO bins
boxplot(data$snail_count~data$DO_bins, xlab = "Dissolved oxygen (mg/L)",
ylab = "Snail count", main = "Snail distribution by DO")
## fix the x axis by muting the default: xaxt="n"
breaks <- quantile(data$DO_mgL) # this creates a vector of the s
data$DO_bins <- cut(data$DO_mgL, breaks) # add a column of the DO bins
boxplot(data$snail_count~data$DO_bins, xlab = "Dissolved oxygen (mg/L)",
ylab = "Snail count", main = "Snail distribution by DO", xaxt="n")
## and writing your own
axis(1, at = 1:4, labels= c("5.04-5.71", "5.71-6.04", "6.04-6.49", "6.49-7.36"))
You can recreate these bins for wahtever variable and intervals you want by creating a vector called breaks by entering numbers like this:
breaks <- c(5,6,7,8)
data$DO_bins <- cut(data$DO_mgL, breaks) # add a column of the DO bins
boxplot(data$snail_count~data$DO_bins, xlab = "Dissolved oxygen level",
ylab = "Snail count", main = "Snail distribution by DO",
xaxt="n",
col = c("lightblue1", "lightblue3", "lightblue4"))
axis(1, at = 1:3, labels= c("low", "medium", "high"))
legend("bottomright", inset = 0.02, title = "DO (mg/L)", c("5-6", "6-7", "7-8"), fill = c("lightblue1", "lightblue3", "lightblue4"), cex = 0.8)
We can see that this data doesn’t look linear, and we know that it isn’t continuous because you can’t have 0.5 snails! So let’s try modeling our data with a poisson distribution, which is specifically designed for this type of data.
model <- glm(data$snail_count~data$DO_mgL, family = poisson())
summary(model)
##
## Call:
## glm(formula = data$snail_count ~ data$DO_mgL, family = poisson())
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2242 -1.5851 -0.4120 0.5683 2.6113
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.1597 1.2504 -5.726 1.03e-08 ***
## data$DO_mgL 1.3030 0.1897 6.868 6.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 101.485 on 24 degrees of freedom
## Residual deviance: 54.833 on 23 degrees of freedom
## AIC: 107.7
##
## Number of Fisher Scoring iterations: 6
We see this is giving us significant results! To be sure this is the model we should be using, we really should check that the data are poisson distributed. How about you give that a try?
# your code here
Now that we are convinced, lets see what the poisson model fit looks like in a plot:
plot(data$DO_mgL, data$snail_count,
xlab = "DO (mg/L)", ylab = "snail count",
pch = 20, col = "grey40")
b0 <- model$coef[1] # intercept
b1 <- model$coef[2] # slope
SE0 <- sqrt(diag(vcov(model))[1]) #standard error for the intercept
SE1 <- sqrt(diag(vcov(model))[2])# standard error for the slope
# create a sequence of x values that cover the range of DO values
x<-seq(min(data$DO_mgL), max(data$DO_mgL), length.out = nrow(data))
# generate model predicted snail counts
y <- exp(b0+b1*x)
ylower <- exp((b0-SE0)+(b1-SE1)*x)
yupper <- exp((b0+SE0)+(b1+SE1)*x)
lines(x, y)
lines(x,ylower, lty = 2)
lines(x,yupper, lty=2)
Now you have the tools manipulate data frames, plot your data and build statistical models. Why don’t you try it out with a few of the other variables? Or even better, load in some of your own data and find some results!