NBA and MLB datasets

Project Summary

Introduce NBA and MLB datasets to undergraduates to help them gain expertise in exploratory data analysis, data visualization, statistical inference, and predictive modeling.

Contact
Paul Bendich
bendich@math.duke.edu

Graduate students: Joe Futoma and Ken McAlinn, PhD students, Statistical Science

Faculty instructor: Mine Cetinkaya-Rundel

Course: STA 112 (Data Science)

Applications:

  • Assessing home field advantage
  • Determining long-term trends
  • Predicting game outcomes
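
As a rough illustration of the first application, home field advantage could be assessed by comparing the home team's win rate against an even 50% split. The sketch below is written in Python and is not taken from the course materials. The `games` records are hypothetical toy values, not real NBA or MLB results, and the helper names `home_win_rate` and `binom_upper_tail` are made up for this example.

```python
from math import comb

# Hypothetical toy records of (home_score, away_score).
# These are NOT real NBA or MLB results; they only illustrate the calculation.
games = [(102, 99), (88, 95), (110, 104), (97, 101), (120, 115),
         (105, 98), (93, 90), (99, 108), (111, 107), (101, 100)]


def home_win_rate(games):
    """Return the fraction of games won by the home team."""
    wins = sum(1 for home, away in games if home > away)
    return wins / len(games)


def binom_upper_tail(wins, n, p=0.5):
    """Return P(X >= wins) for X ~ Binomial(n, p).

    This is the p-value of a one-sided exact binomial test of the
    null hypothesis that the home team wins with probability p.
    """
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(wins, n + 1))


n = len(games)
wins = sum(1 for home, away in games if home > away)
rate = home_win_rate(games)
p_value = binom_upper_tail(wins, n)
print(f"home win rate = {rate:.2f}, one-sided p-value = {p_value:.3f}")
```

With only ten toy games the test has very little power. A real analysis would use full seasons and would likely also account for team strength.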

Related Projects

Large publicly available environmental databases are a tremendous resource for both scientists and members of the general public interested in climate trends and properties. However, without the programming skills to parse and interpret these massive datasets, significant trends may remain hidden from both groups. In this data exploration, students spent three hours working with two large, publicly available datasets, each with more than 4 million observations. They learned how to use R and RStudio to organize, visualize, and statistically explore trends in deep-sea physical oceanography.

Our aim was to introduce students to the wealth of possibilities that human genotyping and sequencing hold by illustrating firsthand the power of these datasets to identify genetic relatives, using the story of the Golden State Killer’s capture with public genetic databases.

This Data Expedition introduced hypothesis-driven data analysis in R and the concept of circular data, and provided some tools for importing and analyzing such data in R.