Data+

Data+ is a 10-week summer research experience that welcomes Duke undergraduates interested in exploring new data-driven approaches to interdisciplinary challenges. Students join small project teams, working alongside other teams in a communal environment. They learn how to marshal, analyze, and visualize data, while gaining broad exposure to the modern world of data science.

  • "Before Data+, data science research sounded like a non-collaborative job involving PhD-level statistical concepts. Data+, however, showed me that there is a place for collaborative workers from all different backgrounds (and of all skill levels) in data science research. Participating in Data+ has enriched my technical skills as a coder; I am now able to navigate software and employ coding languages that I was not at all familiar with before the start of the program. Even more valuable, however, are the "soft" skills I have gained -- specifically, the ability to approach collaboration with an open mind."

    —Susie Choi, Computer Science

    Visualizing Real Time Data from Mobile Health Technologies

  • "I gained valuable program management experience. Given that after the program was over I got hired as a consultant manager at CollegeVine, I'd say it paid off."

    —Stefan Waldschmidt, English

    Quantified Feminism and the Bechdel Test

  • "My participation in the Data+ program has shown me how to successfully work with a dynamic team. Each of my team members were fundamentally different in course interests and background, yet we came together to create a polished product in which we each were a point person for a specific portion. I have also gained confidence in my ability to learn new skills, as I basically taught myself (through Google and asking teammates) how to program in R over this summer."
    —Devri Adams, Environmental Science

    Data Viz for Long-term Ecological Research and Curricula

  • "Participating in Data+ definitely changed my perception of Data Science research. It was more interdisciplinary than I expected, and the opportunity to work with experts across different fields (Medicine, Civil Engineering, Statistics) was a defining aspect of my Data+ experience."

    —Serge Assad, Biomedical Engineering, Electrical & Computer Engineering

    Classification of Vascular Anomalies using Continuous Doppler Ultrasound and Machine Learning

  • "The Data+ team created two new datasets that we'll immediately deploy as a part of our core research efforts and will serve as the basis for an upcoming Bass Connections in Energy project. The outputs will be used towards two new research projects on energy infrastructure and access in developing countries, and will serve as the ground truth data for developing machine learning techniques for identifying energy infrastructure and access. The students were fantastic - hardworking, passionate about their work, and all-around wonderful people to work with."

    —Kyle Bradbury, Lecturing Fellow and Managing Director, Duke Energy Data Analytics Lab

    Electricity Access in Developing Countries from Aerial Imagery

  • "The project mentor was fantastic. The three students I worked with were superb. We were able to make great progress that will lead to journal publications and grant proposals."

    —Wilkins Aquino, Professor, Duke Department of Civil and Environmental Engineering

    Classification of Vascular Anomalies using Continuous Doppler Ultrasound and Machine Learning

10 weeks during the summer
2-3 undergraduates per team
1-2 grad student mentors
25 projects sharing ideas and code

Projects

United Nations Sustainable Development Goal 7 calls for universal access to affordable, reliable, sustainable, and modern energy. Researchers and practitioners around the world have responded to this call by producing a wealth of energy access data. While many data gaps still exist, are we capturing the fullest potential from the information and research we do have, and what it tells us about how to accelerate energy access? Power for All’s Platform for Energy Access Knowledge (PEAK) is an interactive knowledge platform designed to automatically curate, organize, and streamline large, growing bodies of data into digestible, sharable, and useable knowledge through automated data capture, indexing, and visualization. A team of students led by Rebekah Shirley will consult with Power for All to creatively visualize PEAK’s library, and to explore machine learning and natural language processing tools that can enable auto-extraction and visualization of data for more effective science communication.

Are there relative value opportunities in the global corporate bond markets?  
A team of students will work with Professor Emma Rasiel to understand whether an analysis of credit spreads on bonds issued by international firms in multiple countries over time can shed light on potential arbitrage opportunities. The team will have frequent opportunities to interact with analytics professionals at a leading financial advisory and asset management firm.

A team of students will consult with a leading financial advisory and asset management firm that is seeking to understand how big data can shed light on the secondary market for construction machinery. Students will explore a combination of publicly-available datasets that describe the used-machinery market and its potential implications as an indicator for the business cycle. There will be frequent interactions with analytical professionals from the firm.

A team of students will work with Duke’s Office of Information Technology to conceptualize and potentially develop an “e-advisor” program that will help students navigate, augment, and map their way through Duke’s co-curricular ecosystem. The team will identify available data, programs, and resources; define learning objectives; recommend common pathways; create a storyboard that builds out a “master narrative” experience; and prototype the branching and decision engine. Students will work with de-identified registration and advising data in a secure environment, have access to the analytics tools used in OIT, and have an opportunity to explore the data in consultation with OIT and data analytics professionals.

A team of students, in conjunction with Duke’s Office of Information Technology, will use Duke’s wireless network data to build detailed maps of wireless coverage, strength, and utilization across campus. The data will be overlaid on a campus map of buildings and used to analyze trends in wireless demand (e.g., areas that need additional coverage or bandwidth), trends in wireless utilization (e.g., where and at what times the wireless network is used the most), underutilization that could guide resource reallocation, and trends in how groups of people move around campus. Students will work directly with the network data, have access to the analytics tools used in OIT, and have a great opportunity to explore the data in consultation with OIT network, security, and data analytics professionals.

A team of students, under the direction of Prof. Benjamin C. Lee, will explore how a variety of statistical machine-learning techniques may be able to improve datacenter performance. The team will have frequent opportunities to interact with analytics leadership at Lenovo.

A team of students led by Janet Bettger and an interdisciplinary team with the 6th Vital Sign Study will use Census and other public data to examine the representativeness of people who participated in this smartphone-based population health study. Students will design an online interactive map and other web-based tools that can be easily updated with new study participants and that illustrate key relationships, such as how health status relates to rurality, medical service availability, and sociodemographics. The online tools will be used to direct education efforts on the importance of walking speed as a marker of health and as the sixth vital sign. Findings from the data analysis will be used by GANDHI to direct scale-up of smartphone-based research in target geographic areas and with specific population subgroups such as older adults and those with chronic illness.

Today, our society is struggling with an unprecedented amount of misinformation and disinformation. A team of students led by researchers in the Duke Reporters’ Lab and Department of Computer Science will build databases, systems, and apps to help fact-checkers combat falsehoods and hyperbole, and disseminate their fact-checks to the public. The team will apply database, machine learning, algorithmic, and app development techniques to scout media and public interest for check-worthy claims, and to alert media consumers instantly to previously checked claims.

A team of students led by Professors Jonathan Mattingly and Gregory Herschlag will investigate gerrymandering in political districting plans. Students will improve on and employ an algorithm to sample the space of compliant redistricting plans for both state and federal districts. The output of the algorithm will be used to detect gerrymandering for a given district plan; this data will be used to analyze and study the efficacy of the idea of partisan symmetry. This work will continue the Quantifying Gerrymandering project, seeking to understand the space of redistricting plans and to find justiciable methods to detect gerrymandering. The ideal team has a mixture of members with programming backgrounds (C, Java, Python), statistical experience including possibly R, mathematical and algorithmic experience, and exposure to political science or other social science fields.

Read the latest updates about this ongoing project by visiting Dr. Mattingly's Gerrymandering blog.

A team of students led by researchers in the Energy Data Analytics Lab and the Sustainable Energy Transitions Initiative will develop machine learning techniques for automatically mapping global electricity infrastructure using satellite imagery. By identifying substations, transmission lines, and distribution lines, students will create and publish a training dataset that we will use to automate grid infrastructure geolocation. These data and techniques will empower researchers and policymakers to better understand who has grid-connected access to electricity, who is underserved, and how to most efficiently transition communities and countries towards sustainable electrification.

A team of students led by faculty and researchers at the Social Science Research Institute will bring together data that will facilitate research using social determinants of health (SDH) to examine, understand, and ameliorate health disparities. This project will identify SDH variables that have the potential to be linked to data from the MURDOCK Study, a longitudinal health study based in Cabarrus County, NC. Much of this data – information relevant to understanding socioeconomic status, education, the physical and social environment, employment, and social support networks – is publicly available or easily obtained, and its aggregation and analysis offer opportunities to significantly improve predictions of health risks and improve personalized care. Students will evaluate potential data sources, develop ethical policies to protect respondent privacy, clean and merge data, create documentation for data sharing and reuse, and use statistical tools and neighborhood mapping software to examine patterns of disparity.

Despite overwhelming scientific evidence on the benefits of vaccinations, pregnant women and parents of young children often refuse to accept, or choose to space out, vaccinations for themselves or their children. This phenomenon, termed vaccine hesitancy, has been blamed for several vaccine-preventable outbreaks in the U.S. As part of a larger study to understand vaccine hesitancy locally, students will conduct secondary data analysis of the coverage and timeliness of maternal and pediatric vaccines in Durham, and identify determinants of timely vaccination uptake. Results may inform the development of interventions to reduce hesitancy and improve the coverage and timeliness of maternal and pediatric vaccine uptake in Durham.

A team of students will contribute to an effort to operationalize the application of distributed computing methodologies in the analysis of electronic medical records (EMR) at Duke. Specifically, the team will compare and contrast conventional (Oracle Exadata) and distributed (Apache Spark) systems in the analysis of EMR data, and create recommendations for implementation. Students will then use these systems to execute natural language processing (NLP) on clinical narratives and radiology notes with existing, ongoing analyses of Duke data. This Data+ team will work with the Duke Forge, an interdepartmental collaboration focused on data science research and innovation in health and biomedical sciences.

A team of students led by Rachel Richesson (Duke University School of Nursing) will explore patterns of health care treatment and utilization for several rare metabolic disorders treated at Duke University Health System (DUHS). Students will gain an understanding of medical data, the use of reference terminologies to generate new relationships and inferences, and various data analysis and visualization techniques to describe and compare the clinical profiles of patients with different conditions. Students will interact with faculty experts from multiple disciplines (statistics, network analysis, medicine, genetics, and population health) to demonstrate how data-driven clinical profiles can inform our understanding of patients’ health care experience and support clinical care and research.

Would you like to know what influences patients’ medical decisions when outcomes are uncertain? Using a big data approach, we will explore a large number of physician-patient conversations and disentangle the complex decision-making process.  Students will be introduced not only to data science but also to behavioral research and aspects of communication in healthcare. This work will inform physicians on how to reduce overutilization of unnecessary interventions and ensure the well-being of patients.

How are women influenced by the spaces that they are allowed to occupy? A group of students, led by English Professor Charlotte Sussman, will examine how the spaces and places women can inhabit have changed over time, and how such changes have affected women’s rights and opportunities. The team will analyze the visual representations of women depicted in magazines from the nineteenth to the twenty-first century through the Women’s Magazine Archive, considering how images about women influence the reality that women can both imagine and live. Using this data, the group will design and visualize a potential women’s space that can empower and support women to reach their highest potential.

A team of students led by researchers in the Center for Health Policy and Inequalities Research will develop a platform that visualizes significant life events across time for more than 3,000 orphaned and separated children in Cambodia, Ethiopia, India, Kenya, and Tanzania from the Positive Outcomes for Orphans (POFO) study. The types of life events visualized on the timeline will include: the death of a parent, changes in living locations, school levels achieved, special events, traumatic events, and reported wellbeing at different ages. This data will be displayed via mobile devices and will serve to allow the participant to visualize and verify the information provided about their lives. Ultimately, the platform will allow researchers to ensure accuracy of the data provided and also allow greater audiences to visualize the individuality of the study's aggregate data.

A team of students led by Glenn Elementary School Parent Teacher Association (PTA) President David Vanie will explore publicly available data to develop a set of metrics for understanding the needs of the GSE parent community in a holistic way. The data will identify potential barriers to parent involvement and will inform best practices for increasing participation throughout the 2018-2019 school year at GSE. The work will be used to provide helpful insight for engaging parents in PTA organizations at public schools throughout Durham and across the country.

A team of students led by UNC-CH graduate student Grant Glass and Duke English professor Charlotte Sussman will track the thousands of editions of Daniel Defoe’s Robinson Crusoe, including the plethora of movies and “Robinsonades,” most of which deviate from Defoe’s original work. By examining the differences in these stories through word-vector models and categorization algorithms, we can trace how the deviations often reflect the place and time of their production and consumption, raising a range of questions that further our understanding of how the expansion and collapse of the British Empire is wrapped up in notions of capitalism, race, empire, gender, and climate concerns. Along the way, we will examine questions of intellectual property, piracy, and authorship as they relate to both the 18th century and today.

A team of students led by clinical and non-clinical global reproductive health researchers at the Duke Global Health Institute will develop an interactive, web-based platform that curates raw data on contraceptive discontinuation from the Demographic and Health Surveys (DHS) into a tool to help researchers and family planning advocates develop fresh insights around contraceptive discontinuation. Students will develop and refine the prototype and create a dissemination plan with guidance from Alexander Pavluck, Senior Manager of Information and Communication Technology (ICT) for the Global Health Division of RTI’s International Development Group (IDG). Students will have an opportunity to pilot creative ways to incorporate social media data into the tool and ways to validate this data against ground-truth data from population representative surveys.

A team of students led by a computational biologist and a cell biologist will develop methods to identify cell subsets and their developmental, maturation and activation lineage relationships using deep learning approaches. Students will learn to process single cell RNA sequencing data and use the Python programming language and TensorFlow to characterize lung stem cells involved in wound healing. This work will help Duke researchers establish a deep learning pipeline for single cell analysis with applications in immunology, cell biology and cancer.

What do we mean by the term “poverty”?

A team of students under the direction of Professor Astrid Giugni will analyze how the way we talk about poverty and public policy has changed over time. The team will work with two databases containing visual, textual, and audio documents from 1473 to the present, allowing students to track how our understanding of poverty has shifted. The group will tackle the challenge of analyzing the political and popular language and imagery of poverty in order to create a visualization that contextualizes how financial and welfare policy is influenced by how we talk about poverty.

A team of students will work with Duke's Office of Development & Alumni Affairs to understand how cutting-edge data analytic techniques, such as sentiment analysis and network analysis, can be used to understand a variety of giving behaviors and trajectories. Students will work with de-identified data in a secure computing environment, and will have a rich opportunity for creative exploration in consultation with Development professionals.

A team of students led by Dr. Nicole Schramm-Sapyta of the Duke Institute for Brain Sciences will provide analytical consulting support to the Durham Crisis Intervention Team (CIT) Collaborative, a county-wide effort to provide law enforcement and first responders with specialized training in mental illness and crisis intervention techniques. The team will build on last summer’s descriptive analysis of 9-1-1 call data by incorporating data from partner agencies to assess whether CIT training reduces recidivism, increases utilization of mental health services, and generally improves the lives of Durham citizens with mental illness.

Past Projects

Sophie Guo, Math/PoliSci major, Bridget Dou, ECE/CompSci major, Sachet Bangia, Econ/CompSci major, and Christy Vaughn spent ten weeks studying different procedures for drawing congressional boundaries, and quantifying the effects of these procedures on the fairness of actual election results.

Anna Vivian (Physics, Art History) and Vinai Oddiraju (Stats) spent ten weeks working closely with the director of the Durham Neighborhood Compass. Their goal was to produce metrics for things like ambient stress and neighborhood change, to visualize these metrics within the Compass system, and to interface with a variety of community stakeholders in their work.

Maddie Katz (Global Health and Evolutionary Anthropology Major), Parker Foe (Math/Spanish, Smith College), and Tony Li (Math, Cornell) spent ten weeks analyzing data from the National Transgender Discrimination Survey. Their goal was to understand how the discrimination faced by the trans community is realized on a state, regional, and national level, and to partner with advocacy organizations around their analysis.