A Spatial Regression to Analyze the Economic Impact of Urban Green Spaces in Zillow Neighborhoods

Project Summary

Most phenomena that data scientists seek to analyze are either spatially or temporally correlated. Examples of spatially or temporally correlated phenomena include political elections, contaminant transfer, disease spread, housing markets, and the weather. A question of interest is how to incorporate information about spatial correlation when modeling such phenomena.

In this project, we focus on the impact of environmental attributes (such as greenness, tree cover, and temperature), along with socio-demographics and home characteristics, on housing prices by developing a model that takes into account the spatial autocorrelation of the response variable. To this end, we introduce a test to diagnose spatial autocorrelation and explain how to integrate spatial autocorrelation into a regression model.

In this data exploration, students are provided with data collected from remote sensing, census, and Zillow sources. Students are tasked with conducting a regression analysis of real-estate estimates against environmental amenities and other control variables, both with and without the spatial autocorrelation information.

Year: 2020

Contact: Jonathan Holt or Reza Momenifar

Graduate Students: Jonathan Holt and Reza Momenifar, Department of Civil & Environmental Engineering

Faculty: Maria Tackett, Department of Statistical Science

Course: "Regression Analysis" (STA 210)

Introduction

Data scientists are often asked to analyze data from obscure sources. No matter the source of the data, analysts must be comfortable applying their skills to solve the client’s problem. This Data Expeditions course prepares students for the real world by asking them to conduct an environmental and spatial data analysis, build a statistical model based on the nature of the data, and finally interpret their results.

One of the important concepts in spatial statistics is spatial autocorrelation, which is an indicator of clustering or dispersion in a spatial data structure. Following a famous quote from Waldo R. Tobler, “Everything is related to everything else, but near things are more related than distant things”, spatial autocorrelation appears almost always in nature. It is important to evaluate its significance because spatial autocorrelation between observations violates the independence of observations, which is a core assumption of many statistical models, including ordinary least squares regression. This diagnosis is conducted via Moran’s I test, which reflects the correlation (similarity) between an object and its neighbors.
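For reference, Moran’s I is commonly written as below. This is the standard textbook form rather than a formula quoted from the course materials, and the spatial weights $w_{ij}$ (for example, 1 when locations $i$ and $j$ are neighbors and 0 otherwise) are a modeling choice that the analyst has to make.

```latex
I = \frac{n}{S_0}\,
    \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(y_i-\bar{y})(y_j-\bar{y})}
         {\sum_{i=1}^{n} (y_i-\bar{y})^2},
\qquad
S_0 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}
```

Here $y_i$ is the value observed at location $i$, $\bar{y}$ is the mean over all $n$ locations, and $w_{ii} = 0$. Under the null hypothesis of no spatial autocorrelation the expected value is $-1/(n-1)$; values well above it suggest clustering of similar values, and values well below it suggest dispersion.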

This data exploration project is delivered to the class via (i) a lecture session that teaches students the basic concept of spatial autocorrelation and (ii) a lab session that asks students to put what they have learned into practice.

In the lecture session, students are given a 1-hour lecture introducing the concept of spatial autocorrelation and detailing how to diagnose it in real-world data. We build an intuition for this concept via a tangible example for students: modeling students’ test scores in a class. We ask the students to come up with important predictors for modeling this problem. After presenting a sample of test scores in a class, we ask students to assess the correlation between an individual test score and the nearby test scores. Then we teach students how to check their assessment via Moran’s I test and comment on the degree of spatial autocorrelation. Later in the lesson, students learn how to embed such information into the regression model for the student test scores problem. This lecture prepares students for the lab session, in which they work on a similar data analysis task for an interesting real-world problem: building a spatial autoregressive model for housing price as a function of its important predictors (structural features, demographics, community features, environmental attributes).
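As a point of reference, the spatial autoregressive models usually meant by the terms “spatial lag”, “spatial error”, and their combination can be written as follows. The notation is the conventional one from the spatial econometrics literature and is an assumption here, not taken from the lecture slides: $y$ is the vector of housing prices, $X$ the matrix of predictors, $W$ a row-standardized spatial weights matrix, and $\varepsilon$ independent noise.

```latex
\begin{aligned}
\text{Spatial lag:}   \quad & y = \rho W y + X\beta + \varepsilon \\
\text{Spatial error:} \quad & y = X\beta + u, \qquad u = \lambda W u + \varepsilon \\
\text{Combined:}      \quad & y = \rho W y + X\beta + u, \qquad u = \lambda W u + \varepsilon
\end{aligned}
```

In the lag model the neighbors’ prices enter directly as a predictor through $\rho W y$, whereas in the error model the spatial dependence sits in the unexplained part $u$. Setting $\rho = 0$ and $\lambda = 0$ recovers ordinary least squares.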

Guiding Questions

  1. What is autocorrelation? Why do we care about it?
  2. Which test can measure the degree of spatial autocorrelation?
  3. Is the dependent variable (median housing price) spatially autocorrelated?
  4. How can one incorporate the spatial autocorrelation information into an ordinary least squares (OLS) regression model?
  5. For this hedonic analysis, which predictors are more significant? How can one interpret the interaction of predictors?
  6. How do we test whether a spatial autoregressive model has accounted for the spatial autocorrelation? How do we choose the best autoregressive model among the candidates?

The Dataset

The data are curated from a variety of sources (including remote sensing, census, and Zillow housing data) and were collected by Jonathan Holt (using Google Earth Engine, R, or QGIS) as part of his doctoral thesis to infer the marginal value of urban environmental amenities (“green spaces”) with respect to housing prices. The essential question of the original study was, “How are average neighborhood home prices affected by the presence of environmental amenities (parks, golf courses, lakes, etc.)?” The study found that greenness itself is a dis-amenity, although greenness in the form of shady trees or parks is an amenity. We propose to guide students through an abbreviated version of this study.

The native dataset contains 7597 Zillow neighborhoods (rows) and 127 variables (columns). The 127 columns consist of remote sensing products (such as tree canopy cover and land surface temperature), socio-demographics (such as household income and education), location amenities (such as parks and schools), and the Zillow Home Value Index (neighborhood average home price).

The full study included all major metropolitan areas in the United States, but for the purpose of this data exploration project we focus on a single city (Houston, Texas) in order to simplify computation.

In-Class Exercises

Jon and Reza presented a lecture introducing the concept of spatial autocorrelation and its importance in statistical modeling.

The presentation was interactive, including think-pair-share activities with real-world data. For instance, for the “student test scores” problem, students engaged in the class activity by discussing which predictor variables are important for predicting student test scores and how the model could be improved. Students were introduced to multiple examples of data with spatial autocorrelation, such as political elections, contaminant transfer, disease spread, housing markets, and weather.

After students became familiar with the concept of spatial autocorrelation, Jon presented Moran’s I test as a metric that measures spatial autocorrelation for continuous data. Jon explained the parameters of this metric by applying it to the student test scores data and later elaborated on how Moran’s I may reflect positive, negative, or no autocorrelation. At the end of the presentation, Jon explained the students’ task and the dataset for the lab session. The lab was designed to provide the intuition behind spatial autoregressive models, specifically the spatial lag and spatial error models.
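The test scores example can be sketched in a few lines of code. The snippet below is a minimal illustration in Python rather than the R code used in the course; the seating grid, the scores, and the function names `rook_weights` and `morans_i` are all hypothetical. It assumes that two students are neighbors when their seats share an edge.

```python
def rook_weights(n_rows, n_cols):
    """Binary weights for a seating grid: seats sharing an edge are neighbors."""
    n = n_rows * n_cols
    w = [[0.0] * n for _ in range(n)]
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    w[i][rr * n_cols + cc] = 1.0
    return w


def morans_i(values, w):
    """Moran's I: (n / S0) * sum_ij w_ij d_i d_j / sum_i d_i^2, d = deviation from mean."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in w)
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    total = sum(d * d for d in dev)
    return (n / s0) * (cross / total)


# Hypothetical 4 x 4 classroom, listed row by row.
# High scorers sit on the left, low scorers on the right (clustering).
clustered = [90, 88, 60, 58,
             92, 85, 62, 55,
             91, 87, 59, 57,
             89, 86, 61, 56]

# High and low scorers alternate like a checkerboard (dispersion).
checkerboard = [90 if (r + c) % 2 == 0 else 60
                for r in range(4) for c in range(4)]

w = rook_weights(4, 4)
i_clustered = morans_i(clustered, w)
i_checker = morans_i(checkerboard, w)

print(f"Clustered seating:    I = {i_clustered:.3f}")  # positive
print(f"Checkerboard seating: I = {i_checker:.3f}")    # close to -1
```

The clustered seating gives a positive Moran’s I because neighboring students tend to deviate from the class mean in the same direction, while the checkerboard seating gives a value close to -1 because every pair of neighbors deviates in opposite directions.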

Lab Session

In the lab session, students were asked to conduct a hedonic pricing analysis by modeling the median neighborhood home price as a function of socio-demographics, home characteristics, and environmental attributes. Given the spatial distribution of Zillow neighborhoods, students were expected to incorporate this information into their model and interpret the results. With the help of Professor Maria Tackett, Jon and Reza designed an RStudio template in which all the necessary packages for this analysis (visualization, regression, spatial data) and the datasets were already loaded, so that students could concentrate on the modeling rather than on the details of R programming.

The lab activity began by asking students to plot the response variable, median price per square foot, along with some of the predictors (such as temperature and household income) and to comment on the potential correlation between them. Next, students were asked to build a simple regression model (ordinary least squares) that predicts the median home price from the other variables in the dataset and to (i) plot the distribution of the residuals and examine the need for any transformation of the response variable and (ii) interpret the coefficients of the predictors and comment on their statistical significance. Afterwards, students were asked to visualize the residuals from the least squares model, make an assessment of the significance of spatial autocorrelation, and then apply Moran’s I test to the residuals to check that assessment by interpreting the test statistic and p-value. By this point, students had noticed the evidence of spatial autocorrelation, and their next task was to incorporate this information into three spatial regression models: the spatial lag model, the spatial error model, and their combination. Students were asked to interpret the Moran’s I statistic and p-value for each of these models and to choose the one that best addresses the spatial autocorrelation issue.
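The residual diagnosis step of this workflow can be illustrated with a small, self-contained sketch. It is written in Python with made-up data, not the lab’s R template or the Zillow dataset: twenty hypothetical neighborhoods lie along a line, each neighboring the one before and after it, and the price depends on tree cover plus a smooth spatial component that the regression does not see. The p-value comes from a simple permutation test, which is one common way to assess Moran’s I and is an assumption here rather than the method used in the lab.

```python
import math
import random


def chain_weights(n):
    """Binary weights for locations on a line: each unit neighbors the next one."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = 1.0
        w[i + 1][i] = 1.0
    return w


def morans_i(values, w):
    """Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in w)
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * (cross / sum(d * d for d in dev))


def simple_ols(x, y):
    """Least squares fit of y = intercept + slope * x with one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def permutation_test(values, w, n_perm=999, seed=0):
    """One-sided p-value: share of random relabelings with I >= the observed I."""
    rng = random.Random(seed)
    observed = morans_i(values, w)
    shuffled = list(values)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, w) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)


# Made-up data: price = 50 + 2 * tree_cover + smooth spatial component.
n = 20
tree_cover = [float((7 * i) % 10) for i in range(n)]
hidden_spatial = [10.0 * math.sin(i / 3.0) for i in range(n)]
price = [50.0 + 2.0 * t + h for t, h in zip(tree_cover, hidden_spatial)]

# Step 1: ordinary least squares that ignores location.
intercept, slope = simple_ols(tree_cover, price)
residuals = [p - (intercept + slope * t) for p, t in zip(price, tree_cover)]

# Step 2: Moran's I test on the residuals.
w = chain_weights(n)
observed_i, p_value = permutation_test(residuals, w)

print(f"OLS slope for tree cover: {slope:.2f}")
print(f"Moran's I of residuals:   {observed_i:.2f}")
print(f"Permutation p-value:      {p_value:.3f}")
```

A large positive Moran’s I with a small p-value for the residuals signals that the least squares model has left spatial structure unexplained, which is the motivation for moving on to the spatial lag and spatial error models. Repeating the same test on the residuals of a spatial model is one way to check whether the autocorrelation has been accounted for.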

Here is a quote from the class instructor, Prof. Maria Tackett:

“In the data expedition, Jon and Reza developed an engaging lecture and lab assignment on fitting regression models that account for spatial correlation. This is a topic that students typically learn in upper-level statistics courses, so the data expedition gave students a nice preview of more advanced modeling techniques. In the lab assignment, students applied what they learned in the lecture to analyze a real and complex data set from Zillow. They enjoyed learning the new methods for visualizing and fitting models for spatially correlated data. In fact, a few groups plan to apply some of the methods they learned in the data expedition to their final course project!”

Students in a classroom

Source of the Data

ESRI. (2019). ArcGIS Online.

Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202, 18–27. https://doi.org/10.1016/J.RSE.2017.06.031

U.S. Census Bureau. (2019). American Community Survey, 2017 5-year Estimates.

Zillow. (2019). Zillow Economic Data.

Downloads

morans_i.pdf

presentation.pdf

spatial_regression_lab_tidy_solution.pdf

spatial_regression_lab_tidy_solution.Rmd

spatial_regression_lab_tidy.Rmd

Data

ACS_zillow.csv

consolidated.csv

full_dataset.csv

Zillow_All_States.cpg

Zillow_All_States.dbf

Zillow_All_States.gpkg

Zillow_All_States.prj

Zillow_All_States.qpj

Zillow_All_States.shp

Zillow_All_States.shx

Zillow_Houston.cpg

Zillow_Houston.dbf

Zillow_Houston.prj

Zillow_Houston.qpj

Zillow_Houston.shp

Zillow_Houston.shx
