Open Data for Tobacco Retailer Mapping

Project Summary

Felicia Chen (Computer Science, Statistics), Nikkhil Pulimood (Computer Science, Mathematics), and James Wang (Statistics, Public Policy) spent ten weeks working with Counter Tools, a local nonprofit that provides support to over a dozen state health departments. The project goal was to understand how open source data can lead to the creation of a national database of tobacco retailers.

Themes and Categories
Paul Bendich

Project Results: The team performed a feasibility study involving questions of technical accuracy and cost-effectiveness. Working mostly in R, they used a combination of web scraping for data collection, machine learning and text mining for data classification, and MTurk for human validation, and were able to construct a viable dataset for North Carolina.
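As a rough illustration of how text mining and human validation can fit together, the sketch below scores scraped business names against a keyword list and sends uncertain cases to manual review. It is written in Python rather than the R the team used, and the keyword weights, thresholds, and function names are all hypothetical; the source does not describe the team's actual features or models.

```python
import re

# Hypothetical keyword weights (an assumption for illustration only).
KEYWORD_WEIGHTS = {
    "tobacco": 3, "cigar": 3, "vape": 3, "smoke": 2,
    "convenience": 1, "gas": 1, "mart": 1,
}

def score_listing(name: str) -> int:
    """Sum the weights of known keywords found in a business name."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return sum(KEYWORD_WEIGHTS.get(token, 0) for token in tokens)

def triage(name: str, accept_at: int = 3, review_at: int = 1) -> str:
    """Label a listing, routing borderline scores to human validation."""
    score = score_listing(name)
    if score >= accept_at:
        return "likely_retailer"
    if score >= review_at:
        return "needs_human_review"  # e.g. a crowdsourced check
    return "unlikely_retailer"

if __name__ == "__main__":
    for listing in ["Joe's Tobacco & Vape", "Quick Mart", "Main Street Dental"]:
        print(listing, "->", triage(listing))
```

Keeping a middle "review" band is one way to trade classification cost against accuracy: only ambiguous listings incur the expense of human validation.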

They presented findings at an informal briefing of civic leaders and planning officials.

Partially funded by Counter Tools

Project Lead & Project Manager: Mike Dolan Fliss, Counter Tools



"Coming in, I had little knowledge about what data science research entailed. Participating in Data+ was a great step and helped me better realize my career goals. I learned a host of interdisciplinary skills - ranging from web scraping to survey design – that can definitely be applied to future projects." — Felicia Chen, Computer Science & Public Policy

Related Projects

In this project, we are interested in creating a cohesive data pipeline for generating, modeling and visualizing basketball data. In particular, we are interested in understanding how to extract data from freely available video, how to model such data to capture player efficiency, strength and leadership, and how to visualize such data outcomes. We will have four separate teams as part of this project working on interrelated but separate goals:

Team 1: Video data extraction

This team will explore different video data extraction techniques with the goal of identifying player locations, ball location and events at any given time during a basketball game. The software developed as part of this project will be able to generate a usable dataset of time-stamped basketball plays that can be used to model the game of basketball.

Teams 2 & 3: Modeling basketball data: offense and defense

The two teams will explore different models for the game of basketball. The first team will concentrate on modeling offensive plays and try to answer questions such as: How does the ball advance? What leads to successful plays? The second team will concentrate on defensive plays: What is an optimal strategy for minimizing opponent scoring opportunities? How should we evaluate defensive plays?

Team 4: Visualizing basketball data

This team will work on dynamic and static visualization of elements of a basketball game. The goal of the visualization is to capture information about how players and the ball move around the court. They will develop tools to represent average trajectories in these settings that can also capture uncertainty about this information.

Faculty Leads: Alexander Volfovsky, James Moody, Katherine Heller

Project Managers: Fan Bu, Greg Spell, 2 more TBD

A team of students led by researchers in the Energy Access Project will develop means to evaluate non-technical electricity losses (theft) in developing countries through machine learning techniques applied to smart meter electricity consumption data. Students will use data from smart meters installed at transformers and households through a randomized controlled trial. Students will develop algorithms that can be used to detect anomalies in the electricity consumption data and create a dataset of such indicators. This project will provide researchers with new ways of incorporating electricity consumption data and applications for electricity utilities in developing country settings.
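One simple way to flag anomalies in a series of meter readings is a robust z-score based on the median and the median absolute deviation (MAD). The sketch below is only an illustration of that general technique; the function name, the threshold of 3.5, and the sample readings are assumptions, not the project's actual method or data.

```python
from statistics import median

def flag_anomalies(readings, threshold=3.5):
    """Return indices of readings whose robust z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are less
    sensitive to the outliers being searched for than the mean and standard
    deviation. The 0.6745 factor rescales MAD to be comparable to a standard
    deviation under a normal distribution.
    """
    med = median(readings)
    mad = median(abs(x - med) for x in readings)
    if mad == 0:
        return []  # no spread in the data, so nothing can be scored
    return [
        i for i, x in enumerate(readings)
        if 0.6745 * abs(x - med) / mad > threshold
    ]

if __name__ == "__main__":
    # Made-up daily consumption values (kWh) for one household meter.
    daily_kwh = [10.1, 9.8, 10.3, 10.0, 2.0, 9.9, 10.2]
    print(flag_anomalies(daily_kwh))
```

A flagged reading is only an indicator. A sudden drop could reflect an outage or an empty house as easily as a bypassed meter, so such flags would need to be checked against other evidence.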

A team of students, in conjunction with Duke’s Office of Information Technology, will use Duke’s network traffic data to perform IoT device behavioral fingerprinting that can be employed to identify device types. The data will be used to analyze trends and risks, develop security best practices, and build machine learning models that can be used to detect similar device types. Students will work directly with the network data, have access to the analytics tools used in OIT, and explore the data in consultation with OIT network, security, and data analytics professionals.

Project Lead: Jen Vizas