Baseball Analytics with Statcast

Project Summary

Graduate Student: Jacob Coleman, 3rd year Ph.D. student in Statistical Science

Faculty Instructor: Colin Rundel

Class: STA 112, Data Science

  • Data management, summarization, and exploration with R package dplyr
  • Data visualization through R package ggplot2
  • Working with state-of-the-art data pulled from an online source

Summary

In this Data Exploration, students were introduced to Statcast, a baseball dataset downloaded from baseballsavant.mlb.com. The extract included every pitch thrown in the first week of the 2016 season, with 19 characteristics recorded for each pitch. The students were tasked with using the R packages dplyr and ggplot2 to answer data exploration and summarization questions. The exercises challenged them to combine information about the data with their newly acquired computational skills.

The Statcast data is owned by MLB Advanced Media, L.P. and was downloaded from a search performed on baseballsavant.mlb.com for all pitches from 4/1/16 to 4/7/16. Statcast is a relatively new dataset (introduced in 2015) that includes all pitch characteristics from its precursor PITCHf/x (such as pitch movement, type, and start and end velocity). Statcast also added tracking of the ball throughout the entire play, as well as tracking of all fielders. Full Statcast data is not yet available to the public, but Baseball Savant gives the public access to Statcast-added batted ball variables such as launch angle and batted ball speed.

dplyr is an extremely powerful tool for exploring data, using a simple structure to perform complex data management tasks. Students were introduced to dplyr in a previous lecture, and used the Statcast data to gain hands-on experience working with data. Their tasks ranged from simple summaries to sophisticated manipulation (as real data is rarely in perfect form for the desired analysis). They also used the R package ggplot2 to visualize some of their findings and draw further conclusions.
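The kind of task described above might look like the following minimal sketch. It is not the actual class exercise: the data frame `pitches`, its column names (`pitcher`, `pitch_type`, `start_speed`), and its values are hypothetical stand-ins for the Statcast extract, whose 19 variables are not listed here.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the Statcast extract. Column names and values
# are illustrative assumptions, not the real variables from the dataset.
pitches <- tibble(
  pitcher     = c("A", "A", "A", "B", "B", "B"),
  pitch_type  = c("FF", "SL", "FF", "FF", "CU", "CU"),
  start_speed = c(94.1, 85.3, 95.0, 91.2, 78.4, 79.0)
)

# Simple summary: pitch count and mean velocity for each pitch type,
# sorted from fastest to slowest.
speed_summary <- pitches %>%
  group_by(pitch_type) %>%
  summarise(
    n_pitches  = n(),
    mean_speed = mean(start_speed)
  ) %>%
  arrange(desc(mean_speed))

# Visualization: distribution of velocity by pitch type.
speed_plot <- ggplot(pitches, aes(x = pitch_type, y = start_speed)) +
  geom_boxplot() +
  labs(x = "Pitch type", y = "Start speed (mph)")
```

Printing `speed_summary` shows one row per pitch type, and printing `speed_plot` draws the boxplot. With the real data, the same `group_by()` and `summarise()` pattern would apply to whichever columns the download actually contains.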

 

Related Projects

We introduced students to spatial analysis in QGIS and R using location data from two whale species tagged with satellite transmitters. Students were given satellite tracks from five Cuvier’s beaked whales (Ziphius cavirostris) and five short-finned pilot whales (Globicephala macrorhynchus) tagged off the North Carolina coast. Students then used RStudio to calculate two metrics of these species' spatial ranges: home range (where a species spends 95% of its time) and core range (where a species spends 50% of its time). Next, students used QGIS to visualize the data, producing maps that displayed the whales' tracks and their ranges.

This Data Expedition introduces students to network tools and approaches and invites students to consider the relationship(s) between social networks and social imaginaries. Using foundation-funding data collected from the Foundation Directory Online, the Data Expedition enables students to visualize and explore the relationship between networks, social imaginaries, and funding for higher education. The Data Expedition is based on two sets of data. The first set lists the grants received by Duke University in 2016 from five foundations: The Bill and Melinda Gates Foundation, Fidelity Charitable Gift Fund, Silicon Valley Community Foundation, The Community Foundation of Western North Carolina, and The Robert Wood Johnson Foundation. The second set lists the names of board members from Duke University and each of these five foundations, along with the degree-granting institution for their undergraduate education. For the sake of this exercise, the degree-granting institution data was fabricated from a randomized list of the top twenty-five undergraduate institutions.

This Data Expedition seeks to introduce students to statistical analysis in the field of international development. Students construct an index of wealth/poverty based on asset holdings using four datasets collected under the umbrella of the Living Standards Measurement Survey project at the World Bank. We selected countries to represent different continents with comparable and recent survey data: Bulgaria (2007), Tajikistan (2009), Tanzania (2010-2011), and Panama (2008).

First, we construct an index of wealth based on household assets in the different countries using Principal Component Analysis. Once the index is constructed, students seek to understand the main drivers of wealth/poverty in different countries. We include variables for health, education, age, relationship to the household head, and sex. Students then use regression analysis to identify the main drivers of poverty in each country.