Graduate Students: Jonathan Holt and Reza Momenifar, Department of Civil & Environmental Engineering
Faculty: Maria Tackett, Department of Statistical Science
Course: "Regression Analysis" (STA 210)
Data scientists are often asked to analyze data from obscure sources. No matter the source of the data, analysts must be comfortable applying their skills to solve the client’s problem. This Data Expeditions course prepares students for the real world by asking them to conduct an environmental and spatial data analysis, build a statistical model suited to the nature of the data, and finally interpret their results.
One of the important concepts in spatial statistics is spatial autocorrelation, an indicator of clustering or dispersion in spatially structured data. As Waldo R. Tobler famously put it, “Everything is related to everything else, but near things are more related than distant things,” and spatial autocorrelation accordingly appears almost everywhere in nature. It is important to test for it because spatial autocorrelation violates the independence of observations, which is a core assumption of many statistical models, including ordinary least squares regression. The diagnosis is conducted via Moran’s I test, which measures the similarity between each observation and its neighbors.
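For reference, Moran’s I is commonly written as below. The notation is an assumption, since the course materials are not reproduced here: $x_i$ is the value observed at location $i$, $\bar{x}$ is the mean of all $n$ values, and $w_{ij}$ is a spatial weight (for example, 1 if locations $i$ and $j$ are neighbors and 0 otherwise).

```latex
I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}}
    \cdot
    \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

Under the null hypothesis of no spatial autocorrelation, the expected value of $I$ is $-1/(n-1)$, which is close to zero for large $n$. Values well above it suggest clustering of similar values, and values well below it suggest dispersion.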
This data exploration project is delivered to the class via (i) a lecture session that teaches students the basic concept of spatial autocorrelation and (ii) a lab session that asks students to put what they have learned into practice.
In the lecture session, students are given a 1-hour lecture introducing the concept of spatial autocorrelation and detailing how to diagnose it in real-world data. We build an intuition for this concept via a tangible example: modelling students’ test scores in a class. We ask the students to come up with important predictors for this problem. After presenting a sample of test scores from a class, we ask students to assess the correlation between an individual test score and the nearby test scores. Then we teach students how to check their assessment via Moran’s I test and comment on the degree of spatial autocorrelation. Later in the lesson, students learn how to embed such information into the regression model for the test scores problem. This lecture prepares students for the lab session, where they work on a similar data analysis task for an interesting real-world problem: building a spatial autoregressive model for housing prices as a function of their important predictors (structural features, demographics, community features, environmental attributes).
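The test-score exercise can be sketched in a few lines of code. This is a minimal illustration, not the lecture's own material: it assumes a hypothetical 4-by-4 seating chart, made-up scores, and "rook" neighbors (the seats directly in front, behind, left, and right), and the function names are illustrative.

```python
def rook_neighbors(n_rows, n_cols):
    """Map each seat index to the seats directly in front, behind, left, and right."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            adjacent = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    adjacent.append(rr * n_cols + cc)
            neighbors[r * n_cols + c] = adjacent
    return neighbors


def morans_i(values, neighbors):
    """Moran's I with binary weights: w_ij = 1 if j is a neighbor of i, else 0."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Sum of products of deviations over all (seat, neighbor) pairs.
    cross = sum(dev[i] * dev[j] for i in neighbors for j in neighbors[i])
    total_weight = sum(len(adj) for adj in neighbors.values())
    sum_sq = sum(d * d for d in dev)
    return n * cross / (total_weight * sum_sq)


# Hypothetical scores, listed row by row for a 4x4 classroom.
clustered = [90, 90, 60, 60] * 4               # high scores sit together on one side
alternating = [90, 60, 90, 60,
               60, 90, 60, 90] * 2             # every neighbor has the opposite score

seats = rook_neighbors(4, 4)
i_clustered = morans_i(clustered, seats)
i_alternating = morans_i(alternating, seats)
print("clustered layout:  ", round(i_clustered, 3))
print("alternating layout:", round(i_alternating, 3))
```

With these made-up scores, the clustered layout yields a positive Moran’s I (similar scores sit next to each other) and the alternating layout yields a negative one (neighbors are systematically dissimilar).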
- What is autocorrelation? Why do we care about it?
- Which test can measure the degree of spatial autocorrelation?
- Is the dependent variable (median housing price) spatially autocorrelated?
- How can one incorporate the spatial autocorrelation information into an ordinary least squares (OLS) regression model?
- For this hedonic analysis, which predictors are more significant? How can one interpret the interaction of predictors?
- How do we test if a spatial autoregressive model has accounted for the spatial autocorrelation? How do we choose the best autoregressive model among others?
The data are curated from a variety of sources (including remote sensing, census, and Zillow housing data) and were collected by Jonathan Holt (using Google Earth Engine, R, or QGIS) as part of his doctoral thesis, which infers the marginal value of urban environmental amenities (“green spaces”) with respect to housing prices. The essential question of the original study was, “How are average neighborhood home prices affected by the presence of environmental amenities (parks, golf courses, lakes, etc.)?” The study found that greenness itself is a dis-amenity, although greenness in the form of shady trees or parks is an amenity. We propose to guide students through an abbreviated version of this study.
The full dataset contains 7597 Zillow neighborhoods (rows) and 127 variables (columns). The 127 columns consist of remote sensing products (such as tree canopy cover and land surface temperature), socio-demographic variables (such as household income and education), location amenities (such as parks and schools), and the Zillow Home Value Index (neighborhood average home price).
The full study included all major metropolitan areas in the United States, but for the purpose of this data exploration project we focus on a single city, Houston, Texas, in order to simplify computation.
Jon and Reza presented a lecture introducing the concept of spatial autocorrelation and its importance in statistical modeling.
The presentation was interactive, including think-pair-share activities with real-world data. For instance, for the “student test scores” problem, students engaged in the class activity by discussing which predictor variables matter for predicting test scores and how the model could be improved. Students were introduced to multiple examples of data with spatial autocorrelation, such as political elections, contaminant transport, disease spread, the housing market, and weather.
After students became familiar with the concept of spatial autocorrelation, Jon presented Moran’s I test as a metric that measures spatial autocorrelation for continuous data. Jon explained the parameters of this metric by applying it to the student test scores data and later elaborated on how Moran’s I may reflect positive, negative, or no autocorrelation. At the end of the presentation, Jon explained the students’ task and the dataset for the lab session. The lab was designed to provide the intuition behind spatial autoregressive models, specifically the spatial lag and spatial error models.
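For readers unfamiliar with these two models, their standard textbook forms are given below. The exact specification used in the lab is not stated here, so the notation is an assumption: $y$ is the vector of responses, $X$ the matrix of predictors, $\beta$ the regression coefficients, $W$ a spatial weights matrix encoding which neighborhoods are neighbors, and $\varepsilon$ a vector of independent errors.

```latex
% Spatial lag model: the response depends on the responses of neighbors
y = \rho W y + X\beta + \varepsilon

% Spatial error model: the error term is spatially autocorrelated
y = X\beta + u, \qquad u = \lambda W u + \varepsilon
```

In the spatial lag model the parameter $\rho$ captures the influence of neighboring responses, and in the spatial error model $\lambda$ captures spatial structure left in the errors. Setting $\rho = 0$ or $\lambda = 0$ recovers the ordinary least squares model.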
In the lab session, students were asked to conduct a hedonic pricing analysis by modelling the median neighborhood home price as a function of socio-demographics, home characteristics, and environmental attributes. Given the spatial distribution of Zillow neighborhoods, students were expected to incorporate this information into their model and interpret the results. With the help of Professor Maria Tackett, Jon and Reza designed an RStudio template in which the datasets and all the necessary packages for this analysis (visualization, regression, spatial data) were already loaded, so that students could focus on the modelling rather than on the details of R programming.
The lab activity started by asking students to plot the response variable, median price per square foot, along with some of the predictors (such as temperature and household income) and to comment on the potential correlation between them. Next, students were asked to build an ordinary least squares regression model that predicts the median home price from the other variables in the dataset, and then to (i) plot the distribution of the residuals and examine the need for any transformation of the response variable and (ii) interpret the coefficients of the predictors and comment on their statistical significance. Afterwards, students were asked to visualize the residuals from the least squares model, assess whether they appear spatially autocorrelated, and then check that assessment with Moran’s I test by interpreting the test statistic and p-value for the residuals. By this point, students had seen evidence of spatial autocorrelation, and their next task was to incorporate this information into three spatial regression models: the spatial lag model, the spatial error model, and their combination. Students were asked to interpret the Moran’s I statistic and p-value for each of these models and to choose the one that best addresses the spatial autocorrelation issue.
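The residual check at the center of this workflow can be sketched as follows. This is a hedged illustration in Python rather than the lab's R template, and every detail is an assumption made for the example: the data are simulated on a hypothetical 6-by-6 grid of neighborhoods, the model has a single predictor, and a permutation test is swapped in for whichever Moran's I test the lab used. Shuffling residuals is itself only an approximation, because regression residuals are not exactly exchangeable.

```python
import random


def rook_neighbors(n_rows, n_cols):
    """Map each grid cell to the cells sharing an edge with it."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            cells = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            neighbors[r * n_cols + c] = [
                rr * n_cols + cc for rr, cc in cells
                if 0 <= rr < n_rows and 0 <= cc < n_cols
            ]
    return neighbors


def morans_i(values, neighbors):
    """Moran's I with binary weights (1 for neighbors, 0 otherwise)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = sum(dev[i] * dev[j] for i in neighbors for j in neighbors[i])
    total_weight = sum(len(adj) for adj in neighbors.values())
    return n * cross / (total_weight * sum(d * d for d in dev))


def ols_residuals(x, y):
    """Residuals from a one-predictor least squares fit y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def permutation_p_value(values, neighbors, n_perm=999, seed=0):
    """One-sided p-value: how often a random relabeling gives an I this large."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Simulated (hypothetical) neighborhoods: price depends on income plus a
# north-south trend that the regression below deliberately leaves out.
rng = random.Random(42)
n_rows, n_cols = 6, 6
grid = rook_neighbors(n_rows, n_cols)
income = [rng.uniform(40, 120) for _ in range(n_rows * n_cols)]
price = [50 + 1.5 * income[r * n_cols + c] + 10 * r + rng.gauss(0, 5)
         for r in range(n_rows) for c in range(n_cols)]

residuals = ols_residuals(income, price)
observed_i, p_value = permutation_p_value(residuals, grid)
print("Moran's I of OLS residuals:", round(observed_i, 3))
print("permutation p-value:", p_value)
```

Because the omitted trend ends up in the residuals, neighboring residuals resemble each other and the test flags positive spatial autocorrelation. That is the signal that motivates moving from ordinary least squares to a spatial lag or spatial error model.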
Here is a quote from the class instructor, Prof. Maria Tackett:
“In the data expedition, Jon and Reza developed an engaging lecture and lab assignment on fitting regression models that account for spatial correlation. This is a topic that students typically learn in upper-level statistics courses, so the data expedition gave students a nice preview of more advanced modeling techniques. In the lab assignment, students applied what they learned in the lecture to analyze a real and complex data set from Zillow. They enjoyed learning the new methods for visualizing and fitting models for spatially correlated data. In fact, a few groups plan to apply some of the methods they learned in the data expedition to their final course project!”
Source of the Data
ESRI. (2019). ArcGIS Online.
Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202, 18–27. https://doi.org/10.1016/J.RSE.2017.06.031
U.S. Census Bureau. (2019). American Community Survey, 2017 5-year Estimates.
Zillow. (2019). Zillow Economic Data.