North Carolina Traffic Stops

Project Summary

In this Data Expedition, Duke undergraduates were introduced to a real-world traffic citation data set. Provided by Dr. Frank R. Baumgartner, a political scientist at UNC, the data cover 15 years of traffic stops, with over 18 million observations of 53 variables.

Contact
Paul Bendich
bendich@math.duke.edu

Graduate student: Derek Owens-Oas

Faculty instructor: David Banks

Course: STA 111 - Probability and Statistical Inference

  • Introduced Duke undergraduates to a real-world data set.
  • Explored 15 years of traffic stop data, with 18 million observations of 53 variables.
  • Students collaborated to answer relevant research questions using probability and statistics.
  • Used statistical software to create pie charts, histograms, scatter plots, and other informative graphics.

Summary

Students collaborated to explore the data and answer relevant research questions using probability and statistics. They used STATA, a statistical software package, to create pie charts, histograms, scatter plots, and other informative graphics.

Procedures

The data were stored in 17 large .dta files, each containing a sizable number of stops from the years 2000 to 2014. Each file corresponds either to one of 14 counties in North Carolina or to one of the following agencies or groupings: Highway Patrol, DMV (Division of Motor Vehicles), SHP (State Highway Patrol) Motor Carriers, or Other. Using STATA, students navigated the .dta files and selected one county to focus on for deeper analysis.

Each traffic stop within a .dta file is described by 53 variables, including details such as the age, race, and sex of the driver; the city in which the stop occurred; and whether a search occurred. However, some of the variables are irrelevant to many of the stops. For instance, one variable, ounces, records the number of ounces of illicit contraband found in the vehicle. For most routine stops this variable is not applicable. This naturally motivates the concept of conditional probability: to extract meaningful information, a proportion must be computed within the relevant subset of the data (for example, only those stops in which a search occurred) rather than across all stops.
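The idea of conditioning on a relevant subset can be sketched in a few lines. The course itself used STATA; the Python below is only an illustration, and the records, the `searched` field, and the helper `conditional_probability` are invented for this example (only the variable name ounces comes from the data description above).

```python
# Toy records standing in for traffic stops. The values and the "searched"
# field are hypothetical; they are NOT taken from the real data set.
stops = [
    {"searched": True,  "ounces": 2.0},
    {"searched": True,  "ounces": 0.0},
    {"searched": False, "ounces": None},  # not applicable: no search occurred
    {"searched": False, "ounces": None},
    {"searched": True,  "ounces": 0.5},
]


def conditional_probability(records, event, given):
    """Estimate P(event | given) as a relative frequency within the subset
    of records for which `given` is true."""
    subset = [r for r in records if given(r)]
    if not subset:
        raise ValueError("the conditioning event never occurs in the data")
    return sum(1 for r in subset if event(r)) / len(subset)


# P(contraband found | vehicle searched): restrict to searched stops first,
# then count those with a positive number of ounces.
p_found_given_search = conditional_probability(
    stops,
    event=lambda r: r["ounces"] is not None and r["ounces"] > 0,
    given=lambda r: r["searched"],
)
print(p_found_given_search)
```

Dividing by the number of searched stops, rather than by all stops, is what makes this a conditional probability; stops where ounces is not applicable never enter the denominator.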

To guide exploration, students were prompted by a variety of questions, some very open-ended. A few sample questions are provided below:

  • For what proportion of all stops was information collected about whether the vehicle was searched? Of those for which information is available, what proportion involved vehicle searches?
  • Make a histogram of the month variable and the year variable. Carefully explain what these plots show. Do you notice any apparent trends in number of tickets throughout the year or over the years?
  • Among all stops in which a driver was arrested, make a pie chart displaying the breakdown by race. Do you think this finding is indicative of racism? Why or why not?
  • Give another theoretical example of officer prejudice and describe how basic statistics, probability, or data visualization (in STATA) could be used to investigate whether there is supporting evidence.
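
As a rough illustration of the first two sample questions, the sketch below computes the share of stops with a recorded search value, the share of those that involved a search, and a text histogram of stops per month. Students worked in STATA; this Python version is only an assumption-laden stand-in, and the records and field names (`month`, `searched`) are made up for the example.

```python
from collections import Counter

# Toy stops; None marks a stop with no recorded search information.
# All values and field names here are hypothetical.
stops = [
    {"month": 1, "searched": True},
    {"month": 1, "searched": False},
    {"month": 2, "searched": None},
    {"month": 3, "searched": False},
    {"month": 3, "searched": False},
    {"month": 3, "searched": None},
]

# Proportion of all stops for which search information was collected.
recorded = [s for s in stops if s["searched"] is not None]
prop_recorded = len(recorded) / len(stops)

# Among stops with information available, proportion involving a search.
prop_searched = sum(1 for s in recorded if s["searched"]) / len(recorded)

# Tally stops by month and print a simple text histogram.
stops_per_month = Counter(s["month"] for s in stops)
for month in sorted(stops_per_month):
    print(f"month {month}: {'#' * stops_per_month[month]}")

print(prop_recorded, prop_searched)
```

Note that the two proportions use different denominators: the first divides by all stops, the second only by stops with a recorded value.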

In summary, the questions required students to utilize basic probability and statistics, to visualize and explore data graphically, and to think critically about potential real-world problems that can be investigated. 

Related Projects

This Data Expedition introduces students to network tools and approaches and invites students to consider the relationship(s) between social networks and social imaginaries. Using foundation-funding data collected from The Foundation Directory Online, the Data Expedition enables students to visualize and explore the relationship between networks, social imaginaries, and funding for higher education. The Data Expedition is based on two sets of data. The first set lists the grants received by Duke University in 2016 from five foundations: The Bill and Melinda Gates Foundation, Fidelity Charitable Gift Fund, Silicon Valley Community Foundation, The Community Foundation of Western North Carolina, and The Robert Wood Johnson Foundation. The second set lists the names of board members from Duke University and each of these five foundations, along with the degree-granting institution for their undergraduate education. For the sake of this exercise, the degree-granting institution data were fabricated from a randomized list of the top twenty-five undergraduate institutions.

This Data Expedition seeks to introduce students to statistical analysis in the field of international development. Students construct an index of wealth/poverty based on asset holdings using four datasets collected under the umbrella of the Living Standards Measurement Survey project at the World Bank. We selected countries to represent different continents with comparable and recent survey data: Bulgaria (2007), Tajikistan (2009), Tanzania (2010-2011), and Panama (2008).

First, we construct an index of wealth based on household assets in the different countries using Principal Component Analysis. Once a poverty index is constructed, students seek to understand what the main drivers of wealth/poverty are in different countries. We include variables for health, education, age, relationship to the household head, and sex. Students then use regression analysis to identify the main drivers of poverty in different countries.
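
One common way to build such an asset index is to score each household on the first principal component of its standardized asset indicators. The sketch below is a minimal, standard-library illustration of that idea using power iteration; it is not the expedition's actual procedure, and the asset columns, the toy values, and the function names are all assumptions made for this example.

```python
import math


def standardize(columns):
    """Center each column and scale it to unit sample variance."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (len(col) - 1))
        out.append([(x - mean) / sd for x in col])
    return out


def first_component(columns, iterations=200):
    """Return (weights, standardized columns), where weights approximate the
    leading eigenvector of the correlation matrix via power iteration."""
    z = standardize(columns)
    n, p = len(z[0]), len(z)
    corr = [
        [sum(a * b for a, b in zip(z[i], z[j])) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    v = [1.0] * p  # positive start keeps the sign of the weights positive here
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, z


# Hypothetical ownership indicators (1 = owns) for six toy households.
# Each inner list is one asset: radio, bicycle, refrigerator.
assets = [
    [0, 1, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 1],
]

weights, z = first_component(assets)

# Wealth index: weighted sum of each household's standardized assets.
wealth_index = [
    sum(weights[j] * z[j][i] for j in range(len(z))) for i in range(len(z[0]))
]
print([round(x, 3) for x in wealth_index])
```

In this toy data, households that own every asset receive the highest scores and households that own none receive the lowest. A column with no variation would need to be dropped before standardizing, since its standard deviation is zero.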

This data expedition explores the local (ego) patent citation networks of three hybrid vehicle-related patents. The concept of patent citations and technological development is a core theme in innovation and entrepreneurship, and the purpose of these network explorations is to both quantitatively and visually assess how innovations are connected and what these connections mean for the focal innovations and the technologies that draw on those patents in the future. The expedition was incorporated as part of the Sociology of Entrepreneurship class, where students are thinking about the emergence and diffusion of innovations.