---
title: 'Lab: Spatial Regression Modeling'
output:
  pdf_document:
    fig_height: 4
    fig_width: 6
  html_document:
    df_print: paged
---

```{r echo = F}
knitr::opts_chunk$set(warning = FALSE, message = FALSE, fig.align = "center")
```

This lab is designed to provide the intuition behind spatial autoregressive models, specifically the **spatial lag** and **spatial error** models.

The **data** are derived from several different sources:

* Zillow Inc. real estate estimates
    + median neighborhood home price per square foot (`price`)
* Satellite remote sensing observations
    + normalized difference vegetation index (`ndvi`)
    + land surface temperature (`lst`)
    + open space fraction (`open_space_fraction`)
    + tree canopy cover (`tcc`)
* American Community Survey
    + median number of rooms (`median_number_rooms`)
    + median age of home (`median_age_home`)
    + median age of residents (`median_age`)
    + proportion of residents with a bachelor's degree (`attained_bachelors`)
    + population density in thousands of people per square km (`popden`)
    + median household income in thousands of dollars (`mhhi_family`)
    + proportion of residents that identify as white (`white`)

The original motivation for this analysis was to identify the economic effects of *environmental attributes* (NDVI, LST, TCC, open space) on *home values* in Zillow neighborhoods. The full study included all major metropolitan areas in the United States, but this abbreviated activity will focus on a single city, Houston, Texas, to simplify computation.

## Load packages

```{r}
# regular suite of tidyverse packages
library(tidyverse)
library(broom)
library(knitr)
library(patchwork)    # organize plots in a grid

# visualize spatial data
library(RColorBrewer) # custom color palettes

# wrangle and model spatial data
library(sf)
library(spatialreg)
library(spdep)
```

## Load the Data

```{r}
# read the shapefile
shape <- read_sf(dsn = "data", layer = "Zillow_Houston")

# convert RegionID to numeric (via character, in case it is stored as a factor)
# before we join, and drop some columns that we don't need
shape <- shape %>%
  mutate(RegionID = as.character(RegionID),
         RegionID = as.numeric(RegionID)) %>%
  dplyr::select(-fid, -State, -County, -City, -Name, -layer, -path, -population)

# load the rest of the data
big_df <- read_csv("data/full_dataset.csv")

# merge, keeping only the neighborhoods in both data sets
merged <- inner_join(shape, big_df, by = "RegionID")
```

## Visualize the data

```{r}
ggplot(data = merged) +
  geom_sf()
```

```{r}
# plot median home price per square foot and add formatting
ggplot(data = merged, aes(fill = price)) +
  geom_sf() +
  labs(title = "Houston, TX", subtitle = "Median price per square foot") +
  theme_void() +
  scale_fill_distiller(guide = "legend")
```

It is a bit difficult to discern differences at the lower price levels. Let's plot again using quantile breaks in the data (a quick illustration of `quantile()` follows). We'll also use a better color palette.
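As a reminder, `quantile()` returns the cut points of a variable at the requested probabilities; this is a minimal sketch using the quartiles of `price` (an illustrative aside, assuming the `merged` data from above):

```{r}
# cut points at the quartiles of price; the ten-group version below
# works the same way, just with more probabilities
quantile(merged$price, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
```

The chunk below applies the same idea with ten evenly spaced quantiles, so the legend breaks track the distribution of `price` rather than its range.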
```{r}
# determine number of quantile groups
grps <- 10

# compute the quantiles
brks <- quantile(merged$price, 0:(grps-1)/(grps-1), na.rm = TRUE, names = FALSE)
brks <- round(brks, 3)

# plot with color scale adjusted for quantiles
ggplot(data = merged, aes(fill = price)) +
  geom_sf() +
  labs(title = "Houston, TX", subtitle = "Median price per square foot") +
  theme_void() +
  scale_fill_distiller(palette = 'RdBu', guide = "legend", breaks = brks)
```

> Exercise 1: Make a plot for each of the following variables:
>
> - median household income (`mhhi_family`)
> - tree canopy cover (`tcc`)
> - land surface temperature (`lst`)
> - population density (`popden`)
>
> Compare these plots to the modified plot above for median price per square foot. Does there appear to be correlation between any of the variables and the median home price per square foot? Briefly explain your response.

```{r}
# determine number of quantile groups
grps <- 10

# compute the quantiles for price
brks <- quantile(merged$price, 0:(grps-1)/(grps-1), na.rm = TRUE, names = FALSE)
brks <- round(brks, 3)

# plot with color scale adjusted for quantiles
p1 <- ggplot(data = merged, aes(fill = price)) +
  geom_sf() +
  labs(title = "Price") +
  scale_fill_distiller(palette = 'RdBu', guide = "legend", breaks = brks)

# compute the quantiles for income
brks <- quantile(merged$mhhi_family, 0:(grps-1)/(grps-1), na.rm = TRUE, names = FALSE)
brks <- round(brks, 3)

# plot with color scale adjusted for quantiles
p2 <- ggplot(data = merged, aes(fill = mhhi_family)) +
  geom_sf() +
  labs(title = "Income") +
  scale_fill_distiller(palette = 'RdBu', guide = "legend", breaks = brks)

p1 + p2
```

Price and income appear to be moderately correlated.

```{r}
# compute the quantiles for tree canopy cover
brks <- quantile(merged$tcc, 0:(grps-1)/(grps-1), na.rm = TRUE, names = FALSE)
brks <- round(brks, 3)

# plot with color scale adjusted for quantiles
p3 <- ggplot(data = merged, aes(fill = tcc)) +
  geom_sf() +
  labs(title = "Tree cover") +
  scale_fill_distiller(palette = 'RdBu', guide = "legend", breaks = brks)

p1 + p3
```

Price and tree cover do not appear to be correlated.

```{r}
# compute the quantiles for land surface temperature
brks <- quantile(merged$lst, 0:(grps-1)/(grps-1), na.rm = TRUE, names = FALSE)
brks <- round(brks, 3)

# plot with color scale adjusted for quantiles
p4 <- ggplot(data = merged, aes(fill = lst)) +
  geom_sf() +
  labs(title = "Temperature") +
  scale_fill_distiller(palette = 'RdBu', guide = "legend", breaks = brks)

p1 + p4
```

Temperature and price appear to be mildly correlated.

```{r}
# compute the quantiles for population density
brks <- quantile(merged$popden, 0:(grps-1)/(grps-1), na.rm = TRUE, names = FALSE)
brks <- round(brks, 3)

# plot with color scale adjusted for quantiles
p5 <- ggplot(data = merged, aes(fill = popden)) +
  geom_sf() +
  labs(title = "Population density") +
  scale_fill_distiller(palette = 'RdBu', guide = "legend", breaks = brks)

p1 + p5
```

Price and population density do not appear to be correlated.

## Build a simple model

Your task is to model the median home price per square foot as a function of the other variables in the dataset. Let's check the distribution of the response variable (`price`).

```{r}
ggplot(data = merged, aes(x = price)) +
  geom_histogram() +
  labs(title = "Distribution of Price")
```

Next, let's fit a regression model where the response variable is `price` and the predictors are socio-demographic and environmental variables.
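Before fitting, we can also quantify the right skew visible in the histogram. This is a minimal sketch using only base R (one common sample-skewness formula; a clearly positive value indicates a right tail):

```{r}
# sample skewness of price: third central moment scaled by sd^3
# (an illustrative check, not part of the original lab code)
x <- na.omit(merged$price)
mean((x - mean(x))^3) / sd(x)^3
```

Now, the model itself: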
```{r}
# ordinary least-squares model
m1 <- lm(price ~ median_number_rooms + median_age_home + median_age +
           attained_bachelors + mhhi_family + popden + white + ndvi +
           tcc + lst + open_space_fraction, data = merged)
```

```{r}
tidy(m1) %>%
  kable(format = "markdown", digits = 4)
```

Below are some of the residual plots we need to check the model assumptions.

```{r}
# add model residuals and predictions to the data
merged <- merged %>%
  mutate(resid = resid(m1),
         pred = predict(m1))
```

```{r}
ggplot(data = merged, aes(x = pred, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(title = "Residuals vs. Predicted", x = "Predicted", y = "Residuals")
```

```{r}
p1 <- ggplot(data = merged, aes(x = resid)) +
  geom_histogram() +
  labs(title = "Distribution of residuals", x = "", y = "")

p2 <- ggplot(data = merged, aes(sample = resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal QQ-plot of the residuals")

# arrange plots using the patchwork package
p1 + p2
```

> Exercise 2: Which assumption(s) appear to be violated based on the plots of the residuals? How can we transform the response variable `price` to address the violation in assumption(s)? Show your code below to create a new variable called `price_trans` that is the transformed version of the response variable `price`.

The histogram and QQ-plot show that the residuals are right-skewed, so the normality assumption is violated. Furthermore, the residuals vs. predicted plot has a "fan" shape, which indicates non-constant variance. We can log-transform the response variable to pull in the right tail and address both violations.

```{r}
# log-transform the response variable
merged$price_trans <- log(merged$price)

# plot a histogram of the new response variable
hist(merged$price_trans)
```

> Exercise 3: Refit the previous model with the transformed response variable, `price_trans`, created in Exercise 2. Show your code and model output.

```{r}
# ordinary least-squares model with the log-transformed response
m1 <- lm(price_trans ~ median_number_rooms + median_age_home + median_age +
           attained_bachelors + mhhi_family + popden + white + ndvi +
           tcc + lst + open_space_fraction, data = merged)
```

```{r}
tidy(m1) %>%
  kable(format = "markdown", digits = 4)
```

> Exercise 4: Interpret the output from the ordinary least squares model created in the previous exercise. Which variables are statistically significant? What is their estimated effect on the response variable?

The statistically significant variables with a positive effect on price are median home age (`median_age_home`), proportion with a bachelor's degree (`attained_bachelors`), and median household income (`mhhi_family`). The statistically significant variables with a negative effect on price are median number of rooms (`median_number_rooms`) and median resident age (`median_age`).

> Exercise 5: Add a new column called `residuals` to the `merged` dataset that contains the residuals from the model in Exercise 3.

```{r}
# add model residuals and predictions to the data
merged <- merged %>%
  mutate(residuals = resid(m1),
         pred = predict(m1))
```

Next, let's assess the independence assumption by looking at how the residuals are distributed in space. If the residuals appear to be randomly distributed, then there is no spatial autocorrelation. If the errors are **not** randomly distributed in space, then we need to test for spatial autocorrelation.
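The map below reuses the quantile-break recipe from the visualization section. Since that recipe keeps coming up, it could be wrapped in a small helper; this is an optional sketch (`map_quantiles()` is a hypothetical convenience function, not part of the lab code):

```{r}
# hypothetical helper: choropleth with quantile-based legend breaks
map_quantiles <- function(data, var, title, grps = 10) {
  brks <- quantile(data[[var]], 0:(grps - 1) / (grps - 1),
                   na.rm = TRUE, names = FALSE)
  ggplot(data = data, aes(fill = .data[[var]])) +
    geom_sf() +
    labs(title = title) +
    theme_void() +
    scale_fill_distiller(palette = "RdBu", guide = "legend",
                         breaks = round(brks, 3))
}
# usage: map_quantiles(merged, "residuals", "Model residuals")
```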
```{r}
# plot the residuals on the map, broken down by quantiles
grps <- 10
brks <- quantile(merged$residuals, 0:(grps-1)/(grps-1), na.rm = TRUE, names = FALSE)
brks <- round(brks, 3)

ggplot(data = merged, aes(fill = residuals)) +
  geom_sf() +
  labs(title = "Houston, TX", subtitle = "Residuals from Least-Squares Model") +
  theme_void() +
  scale_fill_distiller(palette = 'RdBu', guide = "legend", breaks = brks)
```

> Exercise 6: If there were no spatial correlation, i.e. the residuals were randomly distributed in space, what would you expect the map to look like? Based on this, do you think the model residuals are randomly distributed in space? What might be a mechanism for this phenomenon? (In other words, why might the median home price of one neighborhood affect the median home price of an adjacent neighborhood?)

If there were no spatial autocorrelation, I would expect the red and blue polygons to be randomly scattered across the map. Since they are not (red polygons tend to sit next to red, and blue next to blue), we can hypothesize that there is spatial autocorrelation. A possible mechanism is that wealthy neighborhoods are desirable: homeowners are willing to pay a premium to live near wealthy neighbors, so high home values in one neighborhood push up values in adjacent neighborhoods.

As we saw in the lecture, Moran's I test is a robust way to test for spatial autocorrelation. We can use the `spdep` package to calculate Moran's I for our model residuals. Once again, ideally there will be no spatial autocorrelation, i.e. a Moran's I value close to zero.

First, generate the neighborhood list object. The neighborhood list object determines which observations are adjacent to other observations.

```{r}
# make a neighbor list using the spdep package
nb <- poly2nb(merged)
nb

# make a data frame of lines connecting neighbors for plotting
merged_sp <- as(merged, "Spatial")
nb_lines <- nb %>%
  nb2lines(coords = coordinates(merged_sp)) %>%
  as("sf") %>%
  st_set_crs(st_crs(merged))
```

```{r}
# plot neighbors
ggplot(data = merged) +
  geom_sf(fill = "white", color = "lightgrey") +
  geom_sf(data = nb_lines, col = "red") +
  labs(title = "Adjacent Neighborhoods in Houston, TX") +
  theme_void()
```

That's a lot of adjacent neighborhoods! We already have an idea of whether or not the errors (model residuals) are correlated in space. Let's make one more plot to help us understand this correlation (or lack thereof).

```{r}
# calculate the average neighboring residual for each observation
resnb_calc <- sapply(nb, function(x) mean(merged$residuals[x]))

# add average neighboring residuals to the merged data frame
merged <- merged %>%
  mutate(resnb = resnb_calc)
```

```{r}
# plot the average neighboring residuals vs. the residuals
ggplot(data = merged, aes(x = residuals, y = resnb)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Residuals", y = "Mean Adjacent Residuals",
       title = "Mean Adjacent Residuals vs. Residual for Observation i")
```

Now that we've built some intuition for spatial autocorrelation, let's calculate Moran's I. If the observed Moran's I is statistically greater than the null hypothesized value 0, then there is sufficient evidence to conclude that there is spatial autocorrelation.
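Before running the test, it helps to see what the statistic computes. With binary weights $w_{ij}$ ($w_{ij} = 1$ if polygons $i$ and $j$ are adjacent, 0 otherwise) and residuals $e_1, \dots, e_n$, Moran's I is

$$I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(e_i - \bar{e})(e_j - \bar{e})}{\sum_{i}(e_i - \bar{e})^2}$$

which is close in spirit to the slope in the scatterplot above: it sums products of neighboring deviations and normalizes by the overall variance.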
```{r}
# calculate the weights matrix
ww <- nb2listw(nb, style = 'B', zero.policy = T) # binary weights matrix

# Monte Carlo Moran's test
moran.mc(merged$residuals, ww, 1000, zero.policy = T)
```

> Exercise 7: What is the test statistic? What is the p-value? Does Moran's I provide evidence for or against there being significant spatial autocorrelation? Briefly explain your reasoning.

The test statistic is 0.28 and the p-value is 0.001. Using a significance level of 0.05, the p-value is below the cutoff, so we reject the null hypothesis that Moran's I is 0. Therefore, Moran's I provides evidence of significant spatial autocorrelation.

# Spatial regression models

We will introduce two different types of spatial regression models: the **spatial lag model** and the **spatial error model**. Both models add a term to the right-hand side of the equation that involves the spatial weights matrix $W$.

Consider a simple linear regression model:

$$y = \beta_0 + x_1\beta_1 + x_2\beta_2 + \dots + \epsilon$$

where $y$ is the response variable, $x_1$, $x_2$, etc. are the predictor variables, $\beta_1$, $\beta_2$, etc. are estimated coefficients, and $\epsilon$ is an uncorrelated error term.

The **spatial lag model** adds a term that is a product of $W$ and the **response variable**:

$$y = \rho Wy + \beta_0 + x_1\beta_1 + x_2\beta_2 + \dots + \epsilon$$

where $W$ is the spatial weights matrix and $\rho$ is an estimated coefficient.

The **spatial error model**, on the other hand, incorporates $W$ into the **error term**:

$$y = \beta_0 + x_1\beta_1 + x_2\beta_2 + \dots + \lambda Wu + \epsilon$$

where $\lambda$ is an estimated coefficient and $u$ is a spatially correlated error term.

Let's try both models on our data and see if they address the issue of spatial autocorrelation.

## Spatial lag model

```{r, warning=F}
m1_sp_lag <- lagsarlm(price_trans ~ median_number_rooms + median_age_home +
                        median_age + attained_bachelors + mhhi_family + popden +
                        white + ndvi + tcc + lst + open_space_fraction,
                      data = merged, listw = ww, zero.policy = T)
summary(m1_sp_lag)
```

Moran's I of the spatial lag model:

```{r, warning=F}
merged <- merged %>%
  mutate(residuals_lag = residuals(m1_sp_lag))
```

```{r}
moran.mc(merged$residuals_lag, ww, 1000, zero.policy = T)
```

Plot the residuals:

```{r}
brks <- quantile(merged$residuals_lag, 0:(grps-1)/(grps-1), na.rm = TRUE, names = FALSE)
brks <- round(brks, 3)

ggplot(data = merged, aes(fill = residuals_lag)) +
  geom_sf() +
  labs(title = "Houston, TX", subtitle = "Residuals from Spatial Lag Model") +
  theme_void() +
  scale_fill_distiller(palette = 'RdBu', guide = "legend", breaks = brks)
```

> Exercise 8: What can you conclude about the spatial autocorrelation of the original model compared to the lag model? Use your observations from the residuals plot and Moran's test to explain your reasoning.

Moran's I from the original model is 0.28, and Moran's I from the lag model is also 0.28. Both statistics are significantly different from the null value of 0. Therefore, we conclude that the spatial lag model does not address the spatial autocorrelation in the model residuals. This is not entirely surprising: we are measuring spatial autocorrelation in the model residuals (the error), so we expect the spatial error model to have a much greater impact on the model error than the spatial lag model.
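One caveat before moving on: because the response $y$ appears on both sides of the spatial lag equation, a change in a predictor in one neighborhood feeds back through $\rho Wy$ to its neighbors, so the lag model's raw coefficients are not marginal effects. The `impacts()` function in `spatialreg` decomposes each coefficient into direct and indirect (spillover) effects; this is a minimal sketch, assuming the `m1_sp_lag` fit and weights `ww` from above:

```{r}
# decompose lag-model coefficients into direct and indirect (spillover) effects;
# the computation can be slow for large weight matrices
impacts(m1_sp_lag, listw = ww)
```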
## Spatial error model

```{r}
m1_sp_err <- errorsarlm(price_trans ~ median_number_rooms + median_age_home +
                          median_age + attained_bachelors + mhhi_family + popden +
                          white + ndvi + tcc + lst + open_space_fraction,
                        data = merged, listw = ww, zero.policy = T)
summary(m1_sp_err)
```

Moran's I of the spatial error model:

```{r}
merged <- merged %>%
  mutate(residuals_error = residuals(m1_sp_err))

moran.mc(merged$residuals_error, ww, 1000, zero.policy = T)
```

Plot the residuals:

```{r}
brks <- quantile(merged$residuals_error, 0:(grps-1)/(grps-1), na.rm = TRUE, names = FALSE)
brks <- round(brks, 3)

ggplot(data = merged, aes(fill = residuals_error)) +
  geom_sf() +
  labs(title = "Houston, TX", subtitle = "Residuals from Spatial Error Model") +
  theme_void() +
  scale_fill_distiller(palette = 'RdBu', guide = "legend", breaks = brks)
```

> Exercise 9: Let's compare the three different models. How does the spatial autocorrelation of the spatial error model compare to that of the original model and the spatial lag model? Use your observations from the residuals plot and Moran's test to explain your reasoning.

The Moran's I statistics from the first two models are significantly different from the null value, which indicates that the ordinary least squares model (1) and the spatial lag model (2) have statistically significant spatial autocorrelation in their residuals. The spatial error model (3), on the other hand, has a Moran's I statistic that is NOT statistically different from the null value. Therefore, the spatial error model has addressed the issue of spatial autocorrelation in the error term.

> Exercise 10: Briefly describe how the coefficients of the predictor variables differ across the three models. How are the coefficients similar? How do the coefficients differ? Did anything surprise you?

The statistically significant effects have the same signs across all three models, EXCEPT that the error model estimates population density as a statistically significant positive effect, whereas this effect is not significant in the other models. I am surprised that the environmental attributes were not statistically significant in any of the models.

> Exercise 11: Which model would you choose to explain variation in `price`, the median house price in Houston, TX? Briefly explain your choice.

I would choose the spatial error model because it is the only model shown to address the spatial autocorrelation.

> Exercise 12: There is a model in `spatialreg` that combines the spatial lag and spatial error models. It looks like this:
>
> $$y = \rho Wy + X\beta + \lambda Wu + \epsilon$$
>
> Implement this model using the function `sacsarlm`. You can use the code for the `lagsarlm` and `errorsarlm` models as a guide for the syntax. Comment on the coefficient estimates and their significance. Would you use this model versus the one you chose in the previous exercise? Briefly explain why or why not.

```{r}
m1_sp_lag_err <- sacsarlm(price_trans ~ median_number_rooms + median_age_home +
                            median_age + attained_bachelors + mhhi_family + popden +
                            white + ndvi + tcc + lst + open_space_fraction,
                          data = merged, listw = ww, zero.policy = T)
summary(m1_sp_lag_err)
```

The coefficients of the fourth model have the same signs and significance as those of the third model (spatial error), EXCEPT that the proportion of White residents is significant and positive in the fourth model, whereas it was not significant in the others. The fourth model has the most significant terms of all the models.
I would choose this model over the others because it has the highest log-likelihood. One caution: since this model nests the lag and error models, adding the extra spatial parameter can only increase the log-likelihood, so a fairer comparison penalizes model complexity, as in the AIC check below.
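As a quick check on that reasoning, AIC trades off fit against the number of estimated parameters; this is a minimal sketch comparing all four fits (it assumes the model objects above are still in the workspace, and note that all four use the log-transformed response, so they are comparable):

```{r}
# compare all four models by AIC (lower is better);
# AIC = 2k - 2*logLik, so the extra spatial parameters must earn their keep
AIC(m1, m1_sp_lag, m1_sp_err, m1_sp_lag_err)
```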