North Carolina Traffic Stops

Project Summary

In this Data Expedition, Duke undergraduates were introduced to a real-world traffic citation data set. Provided by Dr. Frank R. Baumgartner, a political scientist at UNC, the data cover 15 years of traffic stops, with over 18 million observations of 53 variables.

Themes and Categories
Paul Bendich

Graduate student: Derek Owens-Oas

Faculty instructor: David Banks

Course: STA 111 - Probability and Statistical Inference

  • Introduced Duke undergraduates to a real-world data set.
  • Explored 15 years of traffic stop data, with over 18 million observations of 53 variables.
  • Students collaborated to answer relevant research questions using probability and statistics.
  • Used statistical software to create pie charts, histograms, scatter plots, and other informative graphics.


Students worked together to explore the data and answer relevant research questions using probability and statistics. They used STATA, a statistical software package, to create pie charts, histograms, scatter plots, and other informative graphics.


The data were stored in 17 large .dta files, each containing a sizable number of stops from 2000 through 2014. Each file corresponds to one of 14 counties in North Carolina or to Highway Patrol, DMV (Division of Motor Vehicles), SHP Motor Carriers (State Highway Patrol), or Other. Using STATA, students navigated the .dta files and selected one county for deeper analysis.

Each traffic stop within a .dta file is described by 53 variables, including details such as the age, race, and sex of the driver; the city in which the stop occurred; and whether a search occurred. However, some of the variables are irrelevant to many of the stops. For instance, one variable, ounces, records the number of ounces of illicit contraband found in the vehicle. For most stops this variable is not applicable. This naturally introduces the concept of conditional probability: meaningful information is extracted by restricting attention to the relevant subset of stops.
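To make the conditioning idea concrete, the following is a minimal Python sketch rather than the STATA workflow the students used. The records, the field names (`searched`, `contraband_found`, `ounces`), and all values are hypothetical and chosen only for illustration; they are not taken from the actual .dta files.

```python
# Hypothetical miniature of the stop records. Field names and values are
# illustrative assumptions, not the actual variables in the .dta files.
stops = [
    {"searched": True,  "contraband_found": True,  "ounces": 1.5},
    {"searched": True,  "contraband_found": False, "ounces": None},
    {"searched": False, "contraband_found": False, "ounces": None},
    {"searched": None,  "contraband_found": False, "ounces": None},  # search info missing
    {"searched": True,  "contraband_found": True,  "ounces": 0.5},
    {"searched": False, "contraband_found": False, "ounces": None},
]


def conditional_probability(records, event, given):
    """Estimate P(event | given) as a relative frequency within the subset."""
    subset = [r for r in records if given(r)]
    if not subset:
        raise ValueError("conditioning event never occurs in the data")
    return sum(1 for r in subset if event(r)) / len(subset)


# P(contraband found | vehicle searched): only searched stops are relevant.
p_found_given_search = conditional_probability(
    stops,
    event=lambda r: r["contraband_found"],
    given=lambda r: r["searched"] is True,
)

# The ounces variable is only meaningful where contraband was found,
# so its average is computed on that subset alone.
found = [r["ounces"] for r in stops if r["contraband_found"]]
mean_ounces_given_found = sum(found) / len(found)

print(p_found_given_search, mean_ounces_given_found)
```

Averaging `ounces` over all six stops would be meaningless because four of them have no value; restricting to the subset where the variable applies is exactly the conditioning step described above.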

To guide exploration, students were prompted by a variety of questions, some very open ended. A few sample questions are provided below:

  • For what proportion of all stops was information collected about whether the vehicle was searched? Of those for which information is available, what proportion involved vehicle searches?
  • Make a histogram of the month variable and the year variable. Carefully explain what these plots show. Do you notice any apparent trends in number of tickets throughout the year or over the years?
  • Among all stops in which a driver was arrested, make a pie chart displaying the breakdown by race. Do you think this finding is indicative of racism? Why or why not?
  • Give another theoretical example of officer prejudice and describe how basic statistics, probability, or data visualization (in STATA) could be used to investigate whether there is supporting evidence.
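The tabulations behind the first three questions can be sketched in a few lines. This is a Python illustration under stated assumptions, not the students' STATA code: the records, the field names (`month`, `searched`, `arrested`, `race`), and the placeholder race labels "A", "B", and "C" are all hypothetical.

```python
from collections import Counter

# Hypothetical records. Field names and category labels are assumptions
# made for illustration; None marks a stop with no search information.
stops = [
    {"month": 1,  "searched": True,  "arrested": True,  "race": "A"},
    {"month": 1,  "searched": False, "arrested": False, "race": "B"},
    {"month": 3,  "searched": None,  "arrested": False, "race": "A"},
    {"month": 7,  "searched": True,  "arrested": True,  "race": "B"},
    {"month": 7,  "searched": False, "arrested": True,  "race": "A"},
    {"month": 7,  "searched": None,  "arrested": False, "race": "C"},
    {"month": 12, "searched": False, "arrested": True,  "race": "A"},
    {"month": 12, "searched": False, "arrested": False, "race": "B"},
]

# Question 1: share of stops with search information, then the search
# rate among only those stops.
recorded = [s for s in stops if s["searched"] is not None]
recorded_share = len(recorded) / len(stops)
search_rate = sum(1 for s in recorded if s["searched"]) / len(recorded)

# Question 2: stop counts per month, the numbers a histogram would display.
month_counts = Counter(s["month"] for s in stops)
for month in sorted(month_counts):
    print(f"{month:2d} {'#' * month_counts[month]}")

# Question 3: breakdown by race among arrests, the slices of a pie chart.
arrest_races = Counter(s["race"] for s in stops if s["arrested"])
total_arrests = sum(arrest_races.values())
race_shares = {race: n / total_arrests for race, n in arrest_races.items()}

print(recorded_share, search_rate, race_shares)
```

The race shares among arrests are conditional proportions. Judging whether they suggest bias requires a comparison baseline, such as the breakdown by race among all stops, which is the kind of critical reasoning the questions ask for.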

In summary, the questions required students to apply basic probability and statistics, to visualize and explore data graphically, and to think critically about real-world problems that data can help investigate.

Related Projects

A large and growing trove of patient, clinical, and organizational data is collected as a part of the “Help Desk” program at Durham’s Lincoln Community Health Center. Help Desk is a group of student volunteers who connect with patients over the phone and help them navigate to community resources (like food assistance programs, legal aid, or employment centers). Data-driven approaches to identifying service gaps, understanding the patient population, and uncovering unseen trends are important for improving patient health and advocating for the necessity of these resources. Disparities in food security, economic stability, education, neighborhood and physical environment, community and social context, and access to the healthcare system are crucial social determinants of health, which studies indicate account for nearly 70% of all health outcomes.

We led a 75-minute class session for the Marine Mammals course at the Duke University Marine Lab that introduced students to strengths and challenges of using aerial imagery to survey wildlife populations, and the growing use of machine learning to address these "big data" tasks.

Most phenomena that data scientists seek to analyze are either spatially or temporally correlated. Examples include political elections, contaminant transfer, disease spread, housing markets, and the weather. A question of interest is how to incorporate spatial correlation information into models of such phenomena.


In this project, we focus on the impact of environmental attributes (such as greenness, tree cover, and temperature), along with socio-demographics and home characteristics, on housing prices by developing a model that takes into account the spatial autocorrelation of the response variable. To this end, we introduce a test to diagnose spatial autocorrelation and explain how to integrate spatial autocorrelation into a regression model.
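The description does not name the diagnostic test. One widely used statistic for this purpose is Moran's I, and the sketch below assumes it purely for illustration. The four-site layout, the adjacency weights, the sample values, and the helper names are all hypothetical.

```python
import random


def morans_i(values, weights):
    """Moran's I: (n / W) * sum_ij w_ij d_i d_j / sum_i d_i^2,
    where d_i is the deviation from the mean and W is the total weight."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / total_weight) * cross / sum(d * d for d in dev)


def permutation_p_value(values, weights, n_perm=999, seed=0):
    """One-sided p-value: how often shuffled values give an I at least
    as large as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def line_weights(n):
    """Binary adjacency for n sites on a line (neighbors share an edge)."""
    w = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1
    return w


w = line_weights(4)
trend = [1, 2, 3, 4]        # similar values sit next to each other
alternating = [1, 4, 1, 4]  # neighbors are as different as possible

i_trend = morans_i(trend, w)
i_alternating = morans_i(alternating, w)
p_trend = permutation_p_value(trend, w)
print(i_trend, i_alternating, p_trend)
```

Positive values of I indicate that neighboring sites tend to have similar values, and negative values indicate the opposite. With only four sites the permutation test has very little power, so the p-value here serves only to show the mechanics.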



In this data exploration, students are provided with data collected from remote sensing, census, and Zillow sources. Students are tasked with conducting a regression analysis of real-estate estimates against environmental amenities and other control variables which may or may not include the spatial autocorrelation information.