Data+

 

 

Data+ is a full-time, ten-week summer research experience that welcomes Duke undergraduate and master's students interested in exploring new data-driven approaches to interdisciplinary challenges. It is suitable for students from all class years and from all majors.

Students join small project teams (at most 3 undergrads and 1 master's student per team), working alongside other teams in a communal environment. They learn how to marshal, analyze, and visualize data, while gaining broad exposure to the modern world of data science. The projects (see below) come from an extremely diverse set of subject areas. It is our hope that students will be able to both work deeply on their specific project and get a very broad picture of most of the skills needed for modern data science.

Participants will receive a $5,000 stipend, out of which they must arrange their own housing and travel. Funding and infrastructure support are provided by a wide range of departments, schools, and initiatives from across Duke University, as well as by outside industry and community partners.

Data+ students typically have dedicated workspace within Gross Hall at Duke University. For the last two summers (2020 and 2021), Data+ ran entirely remotely due to the pandemic, and was quite successful. We hope to resume in-person programming in the summer of 2022, per Duke University guidance.

See below for information about our past projects!

  • I learned there’s much more to it than looking at data. It’s also a way of thinking and organizing what you have analyzed to help others who aren’t able to look at data in such a way to understand it. It’s also a bit of storytelling in a way.

    - Jessica Ho, Math and Neuroscience ‘22
    Predicting Baseball Players’ Athletic Performance Utilizing Baseline Assessments of Vision

  • I didn't really know how data science research applied to social science, but Data+ showed me that it can be a really successful avenue for discovery and change.

    - Nick Datto, Neuroscience, Computer Science, and Cultural Anthropology ‘23
    Race and Housing in Durham over the Course of the 20th Century

  • I’ve learned how interdisciplinary data science is, and how a team of people with many different academic trajectories can work together on the same project, something that I don't think happens very often in other areas.

    - Anonymous

  • What I have discovered is that a majority of data research is about communication. How you interact with your teammates and superiors is just as important, if not more important, than being a genius in your field.

    -  Andrew Scofield, Computer Science ’22, Birmingham-Southern College
    For love of greed: tracing the early history of consumer culture

  • I had expected it to be very analytical, but I was surprised at the creativity that was also required. I enjoyed this aspect a lot.

    - Amber Potter, Computer Science ‘23
    Predicting Baseball Players’ Athletic Performance Utilizing Baseline Assessments of Vision

  • I've gained a lot of valuable insight into the career fields of environmental health and epidemiology. I've also learned a lot about project workflow and how to work through the different phases of a long term project with a team. In addition, my skills in R coding and Tableau have improved a ton.

    - Anonymous

  • My group has been focused on cybersecurity and automation methods to prevent and seek out attackers to keep Duke websites and accounts from being compromised. I have learned a lot about cybersecurity, a field that I otherwise might not have pursued. It has been a very interesting and enlightening experience so far and I am excited to continue learning from the Duke OIT staff.

    - John Taylor, Computer Science ‘21
    Applying Security Orchestration, Automation & Response (SOAR) to security threat hunting with Duke’s ITSO

  • I have gained so much knowledge and confidence! And it is not limited to the area of technology, although I have learned to code in R, navigate PACE, and so much more. I have better discovered the benefit of working with a team and received motivation and mentors by seeing female-identifying students, like myself, succeed. Hearing their success stories via panels or team meetings has given me so much more confidence as a young woman wanting to pursue a career in STEM. I see that it is possible! I also have never worked with data before Data+, but never felt behind in my lack of knowledge as my team is super supportive. They Zoom me outside of the workday and send me resources to help me complete my assignments. I've also realized that I do have an interest in Data Science and feel like I'm making a difference in the world through this program. Knowing that my project (Predicting Blindness in Duke's Glaucoma Patient Population) is going to help so many clinicians, government officials, patients and more is so empowering. It is crazy to think that I am just 19 years old and working on such an advanced project with beyond accomplished students, doctors, and professors, but I'm doing it! Data+ truly has given me the opportunity to expand my knowledge and network in a safe environment. I find these takeaways pretty impressive, especially since it is all remote this year.

    - Sydney Hunt, Engineering ‘23
    Predicting Blindness in Duke’s Glaucoma Patient Population

  • Beyond solid technical machine learning skills, I've received a greater appreciation for data science as a tool to understand everything--from aircraft maintenance to the humanities. Before, I'd never expected that conducting humanities research would teach me how to wield and utilize the most cutting-edge research in machine learning and natural language processing. My team is using new package libraries and research papers written by lead researchers this year to conduct our analysis of ancient texts. In Data+, New meets Old.

    - Albert Sun, Computer Science and Public Policy ‘23
    For love of greed: tracing the early history of consumer culture

  • Working remotely has made coordination much more difficult. However, we really have been embracing GitHub and Box to overcome these challenges. I have learned a lot about RNNs and the applications of GRUs and LSTMs and how to implement such layers, in addition to learning how to use PyTorch, as previously I had only used TensorFlow.

    - Nathan Warren, MIDS
    Human Activity Recognition using Physiological Data from Wearables

  • As a Biology pre-med, I made the mistake of thinking that coding was irrelevant to me. That changed when I took a biology class where we used R to analyze lab data. That was when I realized that coding (and the problem solving skills that come with it) is invaluable in research. It was difficult at first to jump into Data+, but doing this has benefited me in a few ways. Having to learn Python on my own, in a very short amount of time, with almost no prior coding experience (I didn't even know what a package was) and quickly turning around and using those skills taught me that I am capable of flexibility and learning on the job. Coding also requires an immense amount of problem solving and independence. Although my mentors are fantastic, it's up to me to figure out where I want to take the project and how I want to do it. Finally, Data+ has been a really invaluable exercise in teamwork. This has been especially challenging with remote learning. However, I still feel like our team has grown very close in working toward a common goal.

    - Ellen Mines, Biology and Philosophy ‘21
    Computational Tools to Improve Healthy and Pleasurable Eating in Young Children

  • My coding skills and machine learning knowledge had a huge leap. I learned how to better work in a team as well.

    - Noah Lanier, Psychology ‘22
    Human Activity Recognition using Physiological Data from Wearables

  • I learned a lot about data science and using code to manipulate data. I learned how to properly use a terminal, deep learning/machine learning, pandas, and many other skills. Also, I gained collaboration skills when it comes to developing code.

    - Pavani Jairam, Physics ‘23
    Finding Space Junk with the World’s Biggest Telescopes

  • I’ve learned to work through the entire process of a data science project, from assembling data sources all the way through presenting our findings. I’ve also developed insight into working in a team with people of different backgrounds and interests, which enabled us to contribute to the project in different ways. I’ve taken various lessons and hard skills that will carry with me into my future academic and professional endeavors.

    - Benjamin Chen, Computer Science, Economics ‘22
    Protecting American Investors? Financial Advice from before the New Deal to the Birth of the Internet

  • Data+ absolutely changed my perception of data science research. Learning data science has been more intuitive than expected. There are also resources all over the Internet in addition to team members that are able to provide assistance when one is facing difficulty with an aspect of a project. Data science is also able to be applied to many more scenarios than I expected; I look forward to continuing data science research in the future.

    - Malik Scott, Global Health ‘22
    Predicting Baseball Players’ Athletic Performance Utilizing Baseline Assessments of Vision

  • I gained concrete skills in R and Tableau, the ability to collaborate in a virtual environment, and a better understanding of what data science actually means. I also got a glimpse into the public health field and got to learn what many different public health careers might actually entail.

    - Anna Zolotor, Undeclared
    Piloting an Environmental Public Health Tracking Tool for North Carolina

  • I have gained a significant amount of knowledge of the cybersecurity industry and attack methods due to the nature of the background research I had to do for my project. In addition, I was able to apply my knowledge of statistical analysis to real data and learn new techniques to arrange data such as time series analysis.

    - Matthew Feder, Computer Science ‘22
    Applying Security Orchestration, Automation & Response (SOAR) to security threat hunting with Duke’s ITSO

  • Since I've never participated in research before, especially not research this independently oriented, the main thing I feel I've gained from this experience is confidence. I feel like I have a much better understanding of my own capabilities, and I honestly feel much less intimidated by the idea of pursuing research, not just in Data Science.

    - Donald Pepka, Math, Political Science, and Creative Writing ‘21
    For love of greed: tracing the early history of consumer culture

  • I learned a number of hard skills in terms of coding languages as well as some soft skills along the lines of working with a team and coordinating with a client.

    - Benjamin Williams, ECE ‘21
    ABOUT-US – A BOundary Update Tool for Utility Services

  • I definitely gained a lot of experience in R and in Tableau, but I also learned a ton about the fields of data science and public health. We had several interviews with community partners that helped me learn a lot about the different types of careers in data science, environmental advocacy, and environmental health.

    - Leah Roffman, Environmental Science ‘23
    Piloting an Environmental Public Health Tracking Tool for North Carolina

  • I learned about team communication and organizational skills, time management, and I think I have a greater appreciation for how socio-cultural analysis from a humanities perspective can work in tandem with STEM based modes of collecting information/data.

    - Luci Jones, Environmental Studies, Brown University
    When Black Stories Go Global: Analyzing the Translation of African-American Literature and Film

  • Through the program, I not only developed my technical skills with regards to programming and data visualization, but I also learned a lot more about finance and the intersections of finance and data science. This program really incited my love for programming and problem-solving with data, and has made me even more interested in studying statistical science and data science at Duke. Finally, I learned how to effectively collaborate and communicate with a team in a virtual environment.

    - Helen Chen, Statistics ‘23
    AI in the Investment Office

10 weeks during the summer
2-3 undergraduates per team
1-2 grad student mentors
25 projects sharing ideas and code


Projects

Alexa Goble (Finance) joined Econ majors Chavez Cheong and Eli Levine in a ten-week exploration of mortgage enforcement actions related to the financial crisis from earlier in this century. Using NLP techniques on mortgage data from Ohio and Massachusetts, the team validated a new experimental approach to understanding the dynamics between state regulatory agencies, mortgage lenders, brokers, and loan originators. This project was a continuation of two previous Data+ projects:

https://bigdata.duke.edu/projects/american-predatory-lending-global-financial-crisis

https://bigdata.duke.edu/projects/american-predatory-lending-and-global-financial-crisis-year-2

 

View the team's project poster here

Watch the team's final presentation on Zoom:

 

Project Lead: Lee Reiners

Project Manager: Malcolm Smith Fraser

Stats/Sociology major Mitchelle Mojekwu joined Neuroscience majors Kassie Hamilton and Zineb Jaidi in a ten-week exploration of data relevant to an upcoming public school zone redistricting in Durham County. Using information acquired from the General Social Survey and the US Census, the team applied modern mathematical and statistical methods for generating proposed redistricting plans, with the aim of providing decision-makers with information they can use to produce school districts that are equitable and reflective of the Durham County student population.

View the team's project poster here

Watch the team's final presentation on Zoom:

 

Faculty Lead: Greg Herschlag

Project Manager: Bernard Coles

 

Pryia Juarez (BME/ECE), Jonathan Pilland (ECE/BME), and Matthew Traum (CS/Econ) spent ten weeks analyzing sensor data synthesized by an agile waveform generator. The team used deep reinforcement learning techniques to understand the performance of different synthetic agents representing potential attackers to the sensor system.

 

View the team's project poster here

Watch the team's final presentation on Zoom:

 

Faculty leads: Robert Calderbank, Vahid Tarokh, Ali Pezeshki

Client leads: Dr. Lauren Huie, Dr. Elizabeth Bentley, Dr. Zola Donovan, Dr. Ashley Prater-Bennette, Dr. Erin Trip

Project Manager: Suya Wu

Xixi Lei (CS), Raffey Rana (CS/Econ), and Fan Zhu (Stats) spent ten weeks building tools to enable DUMAC to track and visualize its investments and their performance. The team cleaned data, met with stakeholders, and delivered an interactive dashboard in Tableau.

 

View the team's project poster here

Watch the team's final presentation on Zoom:

 

Project Lead: Yi Wang

Project Manager: Christopher Ritter

Keith Cressman (CS/ECE), Isa Lu (Econ), and Ivan-Aleksandr Mavrov (Econ/Math) spent ten weeks exploring how NLP tools could be put to use to improve document analysis workflow at DUMAC Inc., which is responsible for managing the assets of Duke University.

 

View the team's project poster here

Watch the team's final project presentation on Zoom:

Louis Hu (CS/Math), Fayfay Ning (Math/CS), and Kieran Lele (CS/Sociology) spent ten weeks exploring methods for measuring the similarities between networks of massive size, such as those arising from social media or from protein-protein alignment. The team used a variety of mathematical and software techniques and delivered a comprehensive analysis to experts in the field.

 

View the team's project poster here

Watch the team's final presentation on Zoom:

 

Project Lead: Jianming Xu

Project Manager: Sophie Yu

Shannon Houser (Stats/BioChem), Junbo Guan (MIDS), and Gaurav Sirdeshmukh (Stats) spent ten weeks exploring data concerning child and family health in Yolo County, CA. Using R Shiny, the team produced an interactive data dashboard that enables Yolo County residents to find healthcare and childcare providers, food resources, and transportation information.

View the team's project poster here

Watch the team's final presentation on Zoom:

 

Project Lead: Leigh Ann Simmons (UC Davis)

Caroline Tang (Math/Stats) joined CS majors Frankie Willard and Alex Kumar in a ten-week exploration of AI methods to improve the mapping of energy infrastructure within satellite imagery. The team used cutting-edge methods to create synthetic imagery that, when blended with real imagery, improved the performance of deep learning methods on the energy infrastructure detection task.

 

View the team's project poster here

Watch the team's final presentation on Zoom here:

 

Project Lead: Kyle Bradbury

Project Manager: Wei Hu

Martin Guo (MIDS), Dani Trejo (CS), James Wang (CS/Math), and Grayson York (Math/CS) spent ten weeks building tools to understand voting patterns and gerrymandering of districts in North Carolina. They used dimension reduction techniques to cluster different elections into common groups, and they tested various methods for generating synthetic elections for comparison.

 

View the team's project poster here

Watch the team's final project presentation on Zoom:

 

This project is part of an ongoing set of projects by the sponsoring faculty around Voting, Gerrymandering and Democracy. See their blog (https://sites.duke.edu/quantifyinggerrymandering/) for more information and projects from previous years of Data+ and Bass Connections.


Project Leads: Greg Herschlag, Jonathan Mattingly

Simi Bleznak (Math/AI), Max Brown (Math/Econ), and Julia Choi (Bio) spent ten weeks exploring how visual, cognitive, and physical abilities relate to physical performance, a relationship that can provide insight into the development of athletes. Using two rich datasets provided by USA Baseball, the team used linear regression, logistic regression, and longitudinal methods to deliver key insights to decision-makers. This project was a continuation of a Data+ project from last summer: https://bigdata.duke.edu/projects/predicting-baseball-players%E2%80%99-athletic-performance

 

View the team's project poster here

Watch the team's final presentation on Zoom:

 

Project Leads: Marc Richard, Suhail Mithani, Greg Appelbaum

Project Managers: Billy Carson and Hunter Klein

Sean Fiscus (Math/Econ/EnvEng), Alyssa Shi (Stats), Yamil Lopez-Ruiz (BME/CS), and Emmanuel Mokel (Stats/Math) spent ten weeks working with data from CovIdentify, a study that focuses on using wearables to predict and diagnose COVID-19 and the flu. The team improved the memory efficiency of analytic pipelines and added capacity to ingest different types of data. This project built upon the work accomplished by the Duke Bass Connections team and the Duke MIDS capstone project.

 

View the team's project poster here

Watch the team's final presentation on Zoom:

 

Project Lead: Jessilyn Dunn

CS majors Aiman Haider and Divya Nataraj spent ten weeks analyzing data and developing tools to assist the Fresh Produce Program (FPP) in achieving its core goals of creating a more equitable, sustainable, and integrated local food system. Using modern geospatial data analysis techniques, the team identified optimal locations for food distribution hubs, analyzed demographic data about the distribution area, and built an ArcGIS web app to allow interactive exploration of their results.

 

View the team's project poster here

Watch the team's final presentation on Zoom:

 

Project Lead: Willis Wong

Stats majors Alexandra Lawrence and Morgan Pruchniewski spent ten weeks exploring a dataset comprising 619 variables, including chemical and biological measurements, sourced from the Pivers Island Coastal Observatory (PICO). Using modern time-series analysis techniques, the team delivered key insights to PICO scientific staff, as well as advice for future data collection protocols.

 

View the team's project poster here

Watch the team's final presentation on Zoom:

 

Project Lead: Zackary Johnson

Jayesh Gupta (ECE/CS), Trevon Helm (ECE), and Yvonne Kuo (CS/PoliSci) spent ten weeks developing tools to extract features from network data in collaboration with information security professionals at Duke’s OIT. The team performed exploratory data analysis to extract features and used machine-learning techniques to detect zero-day attacks. This project was sponsored by Cisco, Inc.

View the team's project poster here

Watch the team's final presentation on Zoom:

 

Project Lead: Eric Hope

Project Manager: Pranav Manjunath

Patience Jones (NCCU, Psychology and Elementary Education) joined Duke students Allyson Ashekun (PubPol), Drew Greene (PubPol), and Rhea Tejwani (CS/Econ) in a ten-week curation of data meant to assist decision-makers within Durham Public Schools. Using data from the Durham Compass and the NC School Report Card among many other sources, the team produced an interactive R Shiny dashboard that permits exploration of school statistical and geospatial data. This was a continuation of a joint Duke-NCCU Bass Connections project (https://bassconnections.duke.edu/about/news/how-duke-and-nc-central-university-are-inthistogether-support-durham-schools).

 

View the team's poster here

Watch the team's final presentation on Zoom:

 

Project Lead: Alec Greenwald

Project Manager: Nicolas Restrepo Ochoa

Molly Borowiak (CS) and Joshua Tennyson (CS/BME) spent ten weeks building tools to assist in the analysis of data arising from microbial growth experiments. The team produced a comprehensive Python package that enables the exploration of a variety of modeling techniques with the data.

 

View the team's project poster here

Watch the team's final project presentation on Zoom:

 

(photo credit: Peter Tonner et al., 2020 PLoS Computational Biology https://journals.plos.org/ploscompbiol/article/comments?id=10.1371/journal.pcbi.1008366 )

Project Lead: Amy Schmid

Project Manager: David Buch

Annie Xu (Rice, CEE), Liuren Yin (ECE), and Zoe Zhu (Data Science) spent ten weeks analyzing usage data for MorphoSource, a publicly available 3D data repository maintained by Duke University. Working with Python and Tableau, the team developed an interactive dashboard that allows MorphoSource staff to explore usage patterns for site visitors who view 3D files representing objects from primate skulls to historical art pieces.

 

View the team's project poster here

Watch the team's final presentation on Zoom here:

 

Project Leads: Doug Boyer, Julia Winchester

After London was destroyed during the Great Fire of 1666, it was reconstructed into the “emerald gem of Europe,” a utopian epicenter focused on England’s political and economic interests. For whom was the utopia constructed? Who determined its architectural choices? And what did such a utopia look like in seventeenth-century London?


Our research uses Natural Language Processing to analyze semantic trends in digitized text from the online database “Early English Books Online” (EEBO-TCP https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/) to answer such questions. After applying methods such as word-embedding, sentiment analysis, and hapax richness, we provide an overview of themes in the seventeenth century; specifically, we conducted case studies on changes to coal taxes within the period and the reconstruction of St Paul's Cathedral. Our results thus show that, while a utopian society was originally intended to be built for the people, the project’s motivation eventually shifted to a political purpose, as evidenced by the approval of more costly city projects. In response to backlash against the increase of taxes on coal to support large-scale building projects, the ruling class highlighted positive outcomes in printed materials in order to convince working class persons that their collected taxes contributed to a greater good, despite evidence to the contrary. Finally, during key historical events, sentiment and hapax richness are shown to have an inverse relationship, the results of which can demonstrate how London writers engaged with text and genre as forms of protest.

View the team's project poster here

Watch the team's final project presentation on Zoom:

 

Evan Dragich (Stats/Bio) and Katie Tan (Econ) spent ten weeks working with two sets of survey data provided by alumni of the Duke Graduate School. After cleaning data, the team used R Shiny to build an interactive dashboard that can assist leadership aiming to improve the direction and quality of Duke doctoral education.

 

View the team's project poster here

Watch the team's final project presentation on Zoom:

 

 

Project Lead: Ed Ballesein

Project Manager: Sarah Nolan

Brianna Cellini (Global Health, Neuroscience), Kexin (William) Feng (Public Policy, Philosophy), John Liakos (Neuroscience), and Maya Pandey (Political Science, Public Policy) received updated data from the Durham County Sheriff’s Office and Duke Health, adding data from 2019-2020. They spent the summer cleaning that data, comparing past demographics and descriptive statistics to the new data, and preparing to analyze the effect of COVID and new Durham County policies on incarceration and health service utilization. They also established a new criterion to indicate which patients experience severe mental illness, based on health system records, to complement the mental health tagging system from the detention facility.

 

View the team's project poster here

View the team's final presentation on Zoom:

 

Project Lead: Nicole Schramm-Sapyta

Project Manager: Ruth Wygle

Tejvasi Patil (MEM), Sophia Stameson (CS), and Larry Zheng (Bio) spent ten weeks working with drone footage from different rainforest sources. The team designed a pipeline that performed image classification on the drone footage, and curated a training dataset using SQL.

 

View the team's project poster here

Watch the team's final presentation on Zoom:

 

Project Lead: Martin Brooke

Project Manager: Ryan Huang

 

Is there a right type and amount of consumption? The idea of ethical consumption has gained prominence in recent discourse, both in terms of what we purchase (from fair trade coffee to carbon off-sets) and how much we consume (from rechargeable batteries to energy efficient homes).

Heidi Smith (CS, English) and Biniam Garomsa (DataScience, Math) spent ten weeks building tools to assist the David M. Rubenstein Rare Book and Manuscript Library’s mission of finding and describing historically marginalized voices within their collections. The team performed extensive data wrangling, including modern optical character recognition techniques, with the card catalog, and then did a demographic analysis and a topic modeling analysis with the results. Final deliverables to library professionals included a structured dataset, an interactive web app, and a search tool.

 

View the team's project poster here

Watch the team's final presentation on Zoom:

 

Project Lead: Meghan Lyon

Project Manager: Anna Holleman

Past Projects

The Air Force’s F-15E Strike Eagle jets have parts that wear down and break, causing unscheduled maintenance events that take away valuable time in the air for critical missions and training. Our team, Limitless Data, is working with Seymour Johnson Air Force Base to mine manually entered maintenance data to visualize and predict aircraft failures. We created a prototype data visualization product that will help maintainers on the flight line identify and repair critical failures before they happen, keeping jets ready to fly, fight and win.

This project aims to improve the computational efficiency of signal operations, e.g., sampling and multiplying signals. We design machine learning-based signal processing modules that use an adaptive sampling strategy and interpolation to generate a good approximation of the exact output. While ensuring a low error level, improvements in computational efficiency can be expected for digital signal processing systems using the implemented self-adjusting modules.

Mapping History has focused on the categorizing, labelling, digitization, and 3D reconstruction of 16th- and 17th-century maps and atlases of London and Lisbon. Over the course of the summer, the Mapping History team has developed its own unique analytical dataset by painstakingly labelling every element contained within these maps, used Python to digitize this dataset, and, now in the project's final stage, has begun the process of reconstructing these historical perspectives in a 3D game engine.