Research

Research projects at iiD focus on building connections. We encourage cross-pollination of ideas across disciplines and the development of new forms of collaboration that will advance research and education across the full spectrum of disciplines at Duke. The topics below show areas of research focus at iiD. See all of our research.

This Data Expedition introduced hypothesis-driven data analysis in R and the concept of circular data, while providing some tools for importing it and analyzing it in R.
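
As a small illustration of why circular data (compass bearings, times of day, and similar measurements) needs its own tools, the sketch below compares an ordinary arithmetic mean with a circular mean. It is written in Python rather than R, uses made-up angles, and the function name `circular_mean_deg` is hypothetical; it is not taken from the Data Expedition's materials.

```python
import math


def circular_mean_deg(angles_deg):
    """Return the mean direction of angles given in degrees, in [0, 360).

    Each angle is treated as a unit vector; the mean direction is the
    angle of the summed vector. (Hypothetical helper for illustration.)
    """
    sin_sum = sum(math.sin(math.radians(a)) for a in angles_deg)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0


# Made-up bearings that sit on either side of north (0 degrees).
angles = [350.0, 20.0]

arithmetic = sum(angles) / len(angles)  # ignores the wrap-around at 360
circular = circular_mean_deg(angles)    # respects the wrap-around

print(f"arithmetic mean: {arithmetic:.1f} degrees")
print(f"circular mean:   {circular:.1f} degrees")
```

The arithmetic mean points almost due south even though both bearings are close to north, which is the kind of pitfall that motivates dedicated circular statistics.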

We are seeking an exceptional researcher to work with Vahid Tarokh at the Information Initiative at Duke on foundations of Non-Commutative Information Theory, and the Design of Algorithms for the Processing of Multimodal Data based on these theoretical findings.

We are seeking up to two exceptional researchers to work with Vahid Tarokh at the Information Initiative at Duke on the calculation of Fundamental Limits of Learning for High Dimensional Data, Purely High Dimensional Data, and Sparse Data, and on the Design of Limit-Achieving Algorithms.

We are seeking an exceptional researcher to work on Change Detection for Multimodal Data, and Algorithm Design with Vahid Tarokh at the Information Initiative at Duke.

We are seeking an exceptional candidate to work with Robert Calderbank and Vahid Tarokh on Data-Driven Optimization with non-convex time-varying objectives at the Information Initiative at Duke.

We are seeking up to two exceptional researchers to work with Rob Calderbank and Vahid Tarokh at the Information Initiative at Duke on the applications of machine learning to analysis, detection and the design of radio frequency signals.

United Nations Sustainable Development Goal 7 calls for universal access to affordable, reliable, sustainable, and modern energy. Researchers and practitioners around the world have responded to this call by producing a wealth of energy access data. While many data gaps still exist, are we capturing the fullest potential from the information and research we do have, and what it tells us about how to accelerate energy access? Power for All’s Platform for Energy Access Knowledge (PEAK) is an interactive knowledge platform designed to automatically curate, organize, and streamline large, growing bodies of data into digestible, sharable, and useable knowledge through automated data capture, indexing, and visualization. A team of students led by Rebekah Shirley will consult with Power for All to creatively visualize PEAK’s library, and to explore machine learning and natural language processing tools that can enable auto-extraction and visualization of data for more effective science communication.

Are there relative value opportunities in the global corporate bond markets?  
A team of students will work with Professor Emma Rasiel to understand whether an analysis of credit spreads on bonds issued by international firms in multiple countries over time can shed light on potential arbitrage opportunities. The team will have frequent opportunities to interact with analytics professionals at a leading financial advisory and asset management firm.

 

A team of students will consult with a leading financial advisory and asset management firm that is seeking to understand how big data can shed light on the secondary market for construction machinery. Students will explore a combination of publicly available datasets that describe the used-machinery market and its potential implications as an indicator for the business cycle. There will be frequent interactions with analytical professionals from the firm.

At Bank of America, we’ll match your drive and ambition to where you can make a real impact. As one of the world’s largest financial institutions, our global connections allow you to create a career on your own terms. Technology touches every part of Bank of America and underpins every deal we make. From low-latency programming to big data challenges resulting from high-profile mandates, technology is critical to our success.

A team of students will work with Duke’s Office of Information Technology to conceptualize and potentially develop an “e-advisor” program that will help students navigate, augment, and map their way through Duke’s co-curricular ecosystem. The team of students will identify available data, programs, and resources; define learning objectives; recommend common pathways; create a storyboard of the program that builds out a “master narrative” experience; and prototype the branching and decision engine. Students will work with de-identified registration and advising data in a secure environment, have access to the analytics tools used in OIT, and will have an opportunity for exploration of the data in consultation with OIT and data analytics professionals.

A team of students in conjunction with Duke’s Office of Information Technology will make use of Duke’s wireless network data to build detailed maps of wireless coverage, strength, and utilization across campus. The data will be overlaid on a campus map of buildings and used to analyze trends in wireless demand (e.g., areas that need additional coverage or bandwidth) and trends in wireless utilization (e.g., where and at what times the wireless network is used the most), to identify underutilization for resource reallocation, and to study trends in how groups of people move around campus. Students will work directly with the network data, have access to the analytics tools used in OIT, and will have a great opportunity for exploration of the data in consultation with OIT network, security, and data analytics professionals.

The aim of this data expedition was to give students an introduction to stable isotopes and how the data can be used to understand trophic dynamics. 

A team of students, under the direction of Prof. Benjamin C. Lee, will explore how a variety of statistical machine-learning techniques may be able to improve datacenter performance. The team will have frequent opportunities to interact with analytics leadership at Lenovo.

A team of students led by Janet Bettger and an interdisciplinary team with the 6th Vital Sign Study will use Census and other public data to examine the representativeness of people who participated in this smartphone-based population health study. Students will design an online interactive map and other web-based tools that can be easily updated with new study participants, illustrating key relationships such as health status with rurality, medical service availability, and sociodemographics. The online tools will be used to direct education efforts on the importance of walking speed as a marker of health and as the sixth vital sign. Findings from the data analysis will be used by GANDHI to direct scale-up of smartphone-based research in target geographic areas and with specific population subgroups such as older adults and those with chronic illness.

Today, our society is struggling with an unprecedented amount of misinformation and disinformation. A team of students led by researchers in the Duke Reporters’ Lab and Department of Computer Science will build databases, systems, and apps to help fact-checkers combat falsehoods and hyperboles, and disseminate their fact-checks to the public. The team will apply database, machine learning, algorithmic, and app development techniques to scout media and public interest for check-worthy claims, and alert media consumers to previously checked claims instantly.

A team of students led by Professors Jonathan Mattingly and Gregory Herschlag will investigate gerrymandering in political districting plans. Students will improve on and employ an algorithm to sample the space of compliant redistricting plans for both state and federal districts. The output of the algorithm will be used to detect gerrymandering for a given district plan; this data will be used to analyze and study the efficacy of the idea of partisan symmetry. This work will continue the Quantifying Gerrymandering project, seeking to understand the space of redistricting plans and to find justiciable methods to detect gerrymandering. The ideal team has a mixture of members with programming backgrounds (C, Java, Python), statistical experience including possibly R, mathematical and algorithmic experience, and exposure to political science or other social science fields.

A team of students led by researchers in the Energy Data Analytics Lab and the Sustainable Energy Transitions Initiative will develop machine learning techniques for automatically mapping global electricity infrastructure using satellite imagery. By identifying substations, transmission lines, and distribution lines, students will create and publish a training dataset that we will use to automate grid infrastructure geolocation. These data and techniques will empower researchers and policymakers to better understand who has grid-connected access to electricity, who is underserved, and how to most efficiently transition communities and countries towards sustainable electrification.

A team of students led by faculty and researchers at the Social Science Research Institute will bring together data that will facilitate research using social determinants of health (SDH) to examine, understand, and ameliorate health disparities. This project will identify SDH variables that have the potential to be linked to data from the MURDOCK Study, a longitudinal health study based in Cabarrus County, NC. Much of this data (information relevant to understanding socioeconomic status, education, the physical and social environment, employment, and social support networks) is publicly available or easily obtained, and its aggregation and analysis offer opportunities to significantly improve predictions of health risks and improve personalized care. Students will evaluate potential data sources, develop ethical policies to protect respondent privacy, clean and merge data, create documentation for data sharing and reuse, and use statistical tools and neighborhood mapping software to examine patterns of disparity.

Despite overwhelming scientific evidence on the benefits of vaccinations, pregnant women and parents of young children often refuse to accept, or choose to space out, vaccinations for themselves or their children. This phenomenon, termed vaccine hesitancy, has been blamed for several vaccine-preventable outbreaks in the U.S. As part of a larger study to understand vaccine hesitancy locally, students will conduct secondary data analysis of the coverage and timeliness of maternal and pediatric vaccines in Durham, and identify determinants of timely vaccination uptake. Results may inform the development of interventions to reduce hesitancy and improve the coverage and timeliness of maternal and pediatric vaccine uptake in Durham.

A team of students will contribute to an effort to operationalize the application of distributed computing methodologies in the analysis of electronic medical records (EMR) at Duke. Specifically, the team will compare and contrast conventional (Oracle Exadata) and distributed (Apache Spark) systems in the analysis of EMR data, and create recommendations for implementation. Students will then use these systems to execute natural language processing (NLP) on clinical narratives and radiology notes with existing, ongoing analyses of Duke data. This Data+ team will work with the Duke Forge, an interdepartmental collaboration focused on data science research and innovation in health and biomedical sciences.

A team of students led by Rachel Richesson (Duke University School of Nursing) will explore patterns of health care treatment and utilization for several rare metabolic disorders treated at Duke University Health System (DUHS). Students will gain an understanding of medical data, the use of reference terminologies to generate new relationships and inferences, and various data analysis and visualization techniques to describe and compare the clinical profiles of patients with different conditions. Students will interact with faculty experts from multiple disciplines (statistics, network analysis, medicine, genetics, and population health) to demonstrate how data-driven clinical profiles can inform our understanding of patients’ health care experience and support clinical care and research.

Would you like to know what influences patients’ medical decisions when outcomes are uncertain? Using a big data approach, we will explore a large number of physician-patient conversations and disentangle the complex decision-making process.  Students will be introduced not only to data science but also to behavioral research and aspects of communication in healthcare. This work will inform physicians on how to reduce overutilization of unnecessary interventions and ensure the well-being of patients.

How are women influenced by the spaces that they are allowed to occupy? A group of students, led by English Professor Charlotte Sussman, will examine how the spaces and places women can inhabit have changed over time, and how such changes have affected women’s rights and opportunities. The team will analyze the visual representations of women depicted in magazines from the nineteenth to the twenty-first century through the Women’s Magazine Archive, considering how images about women influence the reality that women can both imagine and live. Using this data, the group will design and visualize a potential women’s space that can empower and support women to reach their highest potential.

A team of students led by researchers in the Center for Health Policy and Inequalities Research will develop a platform that visualizes significant life events across time for more than 3,000 orphaned and separated children in Cambodia, Ethiopia, India, Kenya, and Tanzania from the Positive Outcomes for Orphans (POFO) study. The types of life events visualized on the timeline will include: the death of a parent, changes in living locations, school levels achieved, special events, traumatic events, and reported wellbeing at different ages. This data will be displayed via mobile devices and will serve to allow the participant to visualize and verify the information provided about their lives. Ultimately, the platform will allow researchers to ensure accuracy of the data provided and also allow greater audiences to visualize the individuality of the study's aggregate data.

A team of students led by Glenn Elementary School Parent Teacher Association (PTA) President, David Vanie, will explore publicly available data in order to develop a set of metrics that serve to understand the needs of the GSE parent community in a holistic way.  The data will identify potential obstacles that are barriers for parent involvement, and will inform best practices for increasing participation throughout the 2018-2019 school year at GSE.  The work will be used to provide helpful insight for engaging parents in PTA organizations at public schools throughout Durham, and across the country. 

 

A team of students led by UNC-CH graduate student Grant Glass and Duke English professor Charlotte Sussman will track the thousands of editions of Daniel Defoe’s Robinson Crusoe, including the plethora of movies and “Robinsonades,” most of which are deviations from Defoe’s original work. By examining the differences in these stories through word-vector models and categorization algorithms, we can trace how the deviations often reflect the place and time of their production and consumption, evoking a range of questions that further our understanding of how the expanse and collapse of the British Empire is wrapped up in notions of capitalism, race, empire, gender, and climate concerns. Along the way, we will examine questions of intellectual property, piracy, and authorship as they relate to both the 18th century and today.

A team of students led by clinical and non-clinical global reproductive health researchers at the Duke Global Health Institute will develop an interactive, web-based platform that curates raw data on contraceptive discontinuation from the Demographic and Health Surveys (DHS) into a tool to help researchers and family planning advocates develop fresh insights around contraceptive discontinuation. Students will develop and refine the prototype, debut it with experts in online data visualization platforms at RTI and prepare a dissemination plan for the tool. Students will have an opportunity to pilot creative ways to incorporate social media data into the tool and ways to validate this data against ground-truth data from population representative surveys.

A team of students led by a computational biologist and a cell biologist will develop methods to identify cell subsets and their developmental, maturation and activation lineage relationships using deep learning approaches. Students will learn to process single cell RNA sequencing data and use the Python programming language and TensorFlow to characterize lung stem cells involved in wound healing. This work will help Duke researchers establish a deep learning pipeline for single cell analysis with applications in immunology, cell biology and cancer.

What do we mean by the term “poverty”?

A team of students under the direction of Professor Astrid Giugni will analyze how the way we talk about poverty and public policy has changed over time. The team will work with two databases containing visual, textual, and audio documents from 1473 to the present, allowing students to track and analyze how our understanding of poverty has changed over time. The group will tackle the challenge of analyzing the political and popular language and imagery of poverty in order to create a visualization that contextualizes how financial and welfare policy is influenced by how we talk about poverty.

A team of students will work with Duke's Office of Development & Alumni Affairs to understand how cutting-edge data analytic techniques, such as sentiment analysis and network analysis, can be used to understand a variety of giving behaviors and trajectories. Students will work with de-identified data in a secure computing environment, and will have a rich opportunity for creative exploration in consultation with Development professionals.

 

A team of students led by Dr. Nicole Schramm-Sapyta of the Duke Institute for Brain Sciences will provide analytical consulting support to the Durham Crisis Intervention Team (CIT) Collaborative, a county-wide effort to provide law enforcement and first responders with specialized training in mental illness and crisis intervention techniques. The team will build on last summer’s descriptive analysis of 9-1-1 call data by incorporating data from partner agencies to assess whether CIT training reduces recidivism, increases utilization of mental health services, and generally improves the lives of Durham citizens with mental illness.

Marine mammals exhibit extreme physiological and behavioral adaptations that allow them to dive hundreds to thousands of meters underwater despite their need to breathe air at the surface. Through the development of new remote monitoring technologies, we are just beginning to understand the mechanisms by which they are able to execute these extreme behaviors. Long-term animal-borne tags can now record location, dive depth, and dive duration and then transmit these data to satellite receivers, enabling remote access to behavior occurring both many kilometers out to sea and several kilometers below the ocean surface.

The aim of this Data Expedition was for students to learn hands-on data visualization techniques using a variety of data types. Students first discussed how data visualization is useful, and tips to make graphs both visually appealing and easy to understand. 

A Durham-based startup, founded by Duke professors Larry Carin (ECE) and Ricardo Henao (B&B), is looking for interns excited to develop skills and gain experience in data science and machine learning. This is an opportunity to learn from and contribute to a new and growing company focused on cutting-edge machine learning and deep learning technology.

Understanding how to manipulate, analyze, and display large datasets is an essential skill in the life sciences. Introducing students to the concepts of coding languages and showing them the diversity of tasks that can be accomplished using a flexible coding scheme like R is an important step in the training of any life sciences professional. For students taking lab-based courses, who are often required to analyze the datasets they produce in class, learning these techniques can be helpful both in the short term (i.e., during the semester) and for their future careers.

Sophie Guo, Math/PoliSci major, Bridget Dou, ECE/CompSci major, Sachet Bangia, Econ/CompSci major, and Christy Vaughn spent ten weeks studying different procedures for drawing congressional boundaries, and quantifying the effects of these procedures on the fairness of actual election results.

Anna Vivian (Physics, Art History) and Vinai Oddiraju (Stats) spent ten weeks working closely with the director of the Durham Neighborhood Compass. Their goal was to produce metrics for things like ambient stress and neighborhood change, to visualize these metrics within the Compass system, and to interface with a variety of community stakeholders in their work.

Sharrin Manor, Arjun Devarajan, Wuming Zhang, and Jeffrey Perkins explored a large collection of imagery data provided by the U.S. Geological Survey, with the goal of identifying solar panels using image recognition. They worked closely with the Energy Data Analytics Lab, part of the Energy Initiative at Duke.

ECE majors Mitchell Parekh and Yehan (Morton) Mo, along with IIT student Nikhil Tank, spent ten weeks understanding parking behavior at Duke. They worked closely with the Parking and Transportation Office, as well as with Vice President for Administration Kyle Cavanaugh.

Maddie Katz (Global Health and Evolutionary Anthropology Major), Parker Foe (Math/Spanish, Smith College), and Tony Li (Math, Cornell) spent ten weeks analyzing data from the National Transgender Discrimination Survey. Their goal was to understand how the discrimination faced by the trans community is realized on a state, regional, and national level, and to partner with advocacy organizations around their analysis.

Matt and Ken led two labs for the engineering section of STA 111/130, an introductory course in statistics and probability. The lab assignments were written by Matt and Ken in order to bridge the gap between introductory linear regression, which is often explained in terms of a static, complete dataset, and time series analysis, which is not a common topic in introductory courses. 

Yanmin (Mike) Ma, mathematics/economics major, and Manchen (Mercy) Fang, electrical and computer engineering/computer science major, spent ten weeks studying historical archives and building a model to predict the price of pigs, relative to a number of interesting factors.

David Clancy, a Stats/Math/EnvSci major, and Tianyi Mu, an ECE/CompSci major, spent ten weeks studying the effects of weather, surroundings, and climate on the operational behavior of water reservoirs across the United States. They used a large dataset compiled by the U.S. Army Corps of Engineers, and they worked closely with Lauren Patterson from the Water Policy Program at Duke's Nicholas Institute for Environmental Policy Solutions. Project mentorship was provided by Alireza Vahid, a postdoctoral candidate in Electrical Engineering.

Luke Raskopf, PoliSci major, and Xinyi (Lucy) Lu, Stats/CompSci major, spent ten weeks investigating the effectiveness of policies to combat unemployment and wage stagnation faced by working and middle-class families in the State of North Carolina. They worked closely with Allan Freyer at the North Carolina Justice Center.

This paper addresses analysis of heterogeneous data, such as ordered, categorical, real and count data. Such data are of interest in our motivating application, cognitive and brain science, in which subjects may answer questionnaires, and also (separately) undergo fMRI interrogation. A contribution of this paper concerns the joint analysis of how people answer questionnaires and how their brain responds to external stimuli (here visual), the latter measured via fMRI.

Computer Science major Yumin Zhang and IIT student Akhil Kumar Pabbathi spent ten weeks working closely with Dr. Joe McClernon from Psychiatry and Behavioral Sciences to understand smoking and tobacco purchase behavior through activity space analysis.

Biomedical Engineering major Chi Kim Trinh and Biostatistics MS student Can Cui spent ten weeks constructing a computational and statistical framework to evaluate the effects of health coaching on Type II Diabetes patients’ quality metrics, including Hemoglobin A1c, blood pressure, eye exam consistency, tobacco use, and prescription adherence to statins, aspirin, and angiotensin-converting enzyme (ACE) inhibitors/angiotensin receptor blockers (ARB).

Biomedical Engineering and Electrical and Computer Engineering major David Brenes, and Electrical and Computer Engineering/Computer Science majors Xingyu Chen and David Yang spent ten weeks working with mobile eye tracker data to optimize data processing and feature extraction. They generated their own video data with SMI Eye Tracking Glasses, and created computer vision algorithms to categorize subject gazing behavior in a grocery purchase decision-making environment.

Xinyu (Cindy) Li (Biology and Chemistry) and Emilie Song (Biology) spent ten weeks exploring the Black Queen Hypothesis, which predicts that co-operation in animal societies could be a result of genetic/functional trait losses, as well as polymorphism of workers in eusocial animals such as ants and termites. The goal was to investigate this idea in four different eusocial insect species.

BME major Neel Prabhu, along with CompSci and ECE majors Virginia Cheng and Cheng Lu, spent ten weeks studying how cells from embryos of the common fruit fly move and change in shape during development. They worked with Cell-Sheet-Tracker (CST), an algorithm developed by former Data+ student Roger Zou and faculty lead Carlo Tomasi. This algorithm uses computer vision to model and track a dynamic network of cells using a deformable graph.

Weiyao Wang (Math) and Jennifer Du, along with NCCU Physics majors Jarrett Weathersby and Samuel Watson, spent ten weeks learning about how search engines often provide results which are not representative in terms of race and/or gender. Working closely with entrepreneur Winston Henderson, their goal was to understand how to frame this problem via statistical and machine-learning methodology, as well as to explore potential solutions.

Matthew Newman (Sociology), Sonia Xu (Statistics), and Alexandra Zrenner (Economics) spent ten weeks exploring giving patterns and demographic characteristics of anonymized Duke donors. They worked closely with the Duke Alumni Affairs and Development Office, with the goal of understanding the data and constructing tools to generate data-driven insight about donor behavior.

Artem Streltsov (Masters Economics) and IIT Mechanical Engineering major Vinod Ramakrishnan spent ten weeks exploring North Carolina state budget documents. Working closely with the Budget and Tax Center, part of the North Carolina Justice Center, their goal was to help build a keystone tool that can be used for analysis of the state budget as well as future budget proposals.

Yuangling (Annie) Wang, a Math/Stats major, and Jason Law, a Math/Econ major, spent ten weeks analyzing message-testing data about the 2015 Marijuana Legalization Initiative in Ohio; the data were provided by Public Opinion Strategies, one of the nation's leading public opinion research firms.

The goal was to understand how statistics and machine learning might help develop microtargeting strategies for use in future campaigns.

Devri Adams (Environmental Science), Annie Lott (Statistics), and Camila Vargas Restrepo (Visual Media Studies, Psychology) spent ten weeks creating interactive and exploratory visualizations of ecological data. They worked with over sixty years of data collected at the Hubbard Brook Experimental Forest (HBEF) in New Hampshire.

Ana Galvez (Cultural and Evolutionary Anthropology), Xinyu Li (Biology), and Jonathan Rub (Math, Computer Science) spent ten weeks studying the impact of diet on organ and bone growth in developing laboratory rats. The goal was to provide insight into the growth dynamics of these model organisms that could eventually be generalized to inform research on human development.

Robbie Ha (Computer Science, Statistics), Peilin Lai (Computer Science, Mathematics), and Alejandro Ortega (Mathematics) spent ten weeks analyzing the content and dissemination of images of the Syrian refugee crisis, as part of a general data-driven investigation of Western photojournalism and how it has contributed to our understanding of this crisis.

Runliang Li (Math), Qiyuan Pan (Computer Science), and Lei Qian (Masters in Statistics and Economic Modelling) spent ten weeks investigating discrepancies between posted wait times and actual wait times for rides at Disney World. They worked with data provided by TouringPlans.

Building off the work of a 2016 Data+ team, Yu Chen (Economics), Peter Hase (Statistics), and Ziwei Zhao (Mathematics) spent ten weeks working closely with analytical leadership at Duke's Office of University Development. The project goal was to identify distinguishing characteristics of major alumni donors and to model their lifetime giving behavior.

Over ten weeks, Computer Science majors Daniel Bass-Blue and Susie Choi joined forces with Biomedical Engineering major Ellie Wood to prototype interactive interfaces from Type II diabetics' mobile health data. Their specific goals were to encourage patient self-management and to effectively inform clinicians about patient behavior between visits.

Over ten weeks, Computer Science majors Amber Strange and Jackson Dellinger joined forces with Psychology major Rachel Buchanan to perform a data-driven analysis of mental health intervention practices by the Durham Police Department. They worked closely with leadership from the Durham Crisis Intervention Team (CIT) Collaborative, made up of officers who have completed 40 hours of specialized training in mental illness and crisis intervention techniques.

A team of students led by Duke mathematician Marc Ryser and University of Southern California Pathology professor Darryl Shibata will characterize phenotypic evolution during the growth of human colorectal tumors. 

Graduate Students: Kendra Kaiser and John Mallard

Faculty: Michael O’Driscoll

Course: Landscape Hydrology, EOS 323/723

A team of students led by Dr. Shanna Sprinkle of Duke Surgery will combine success metrics of Duke Surgery residents from a set of databases and create a user interface for residency program directors and possibly residents themselves to view and better understand residency program performance.

Lauren Fox (Cultural Anthropology) and Elizabeth Ratliff (Statistics, Global Health) spent ten weeks analyzing and mapping pedestrian, bicycle, and motor vehicle data provided by Durham's Department of Transportation. This project was a continuation of a seminar on "ghost bikes" taught by Prof. Harris Solomon.

Boning Li (Masters Electrical and Computer Engineering), Ben Brigman (Electrical and Computer Engineering), Gouttham Chandrasekar (Electrical and Computer Engineering), Shamikh Hossain (Computer Science, Economics), and Trishul Nagenalli (Electrical and Computer Engineering, Computer Science) spent ten weeks creating datasets of electricity access indicators that can be used to train a classifier to detect electrified villages. This coming academic year, a Bass Connections Team will use these datasets to automatically find power plants and map electricity infrastructure.

Liuyi Zhu (Computer Science, Math), Gilad Amitai (Masters, Statistics), Raphael Kim (Computer Science, Mechanical Engineering), and Andreas Badea (East Chapel Hill High School) spent ten weeks streamlining and automating the process of electronically rejuvenating medieval artwork. They used a 14th-century altarpiece by Francescuccio Ghissi as a working example.

Over ten weeks, Math/CompSci majors Benjamin Chesnut and Frederick Xu joined forces with International Comparative Studies major Katharyn Loweth to understand the myriad academic pathways traveled by undergraduate students at Duke. They focused on data from Mathematics and the Duke Global Health Institute, and worked closely with departmental leadership from both areas.

Felicia Chen (Computer Science, Statistics), Nikkhil Pulimood (Computer Science, Mathematics), and James Wang (Statistics, Public Policy) spent ten weeks working with Counter Tools, a local nonprofit that provides support to over a dozen state health departments. The project goal was to understand how open source data can lead to the creation of a national database of tobacco retailers.

Selen Berkman (ECE, CompSci), Sammy Garland (Math), and Aaron VanSteinberg (CompSci, English) spent ten weeks undertaking a data-driven analysis of the representation of women in film and in the film industry, with special attention to a metric called the Bechdel Test. They worked with data from a number of sources, including fivethirtyeight.com and the-numbers.com.

Over ten weeks, BME and ECE majors Serge Assaad and Mark Chen joined forces with Mechanical Engineering Masters student Guangshen Ma to automate the diagnosis of vascular anomalies from Doppler Ultrasound data, with goals of improving diagnostic accuracy and reducing physician time spent on simple diagnoses. They worked closely with Duke Surgeon Dr. Leila Mureebe and Civil and Environmental Engineering Professor Wilkins Aquino.

Furthering the work of a 2016 Data+ team in predictive modeling of pancreatic cancer, students Siwei Zhang (Masters Biostatistics) and Jake Ukleja (Computer Science) spent ten weeks building a model to predict pancreatic cancer from electronic medical record (EMR) data. They worked with nine years' worth of EMR data, including ICD9 diagnostic codes, that contained records from over 200,000 patients.

Over ten weeks, Mathematics/Economics majors Khuong (Lucas) Do and Jason Law joined forces with Analytical Political Economy Masters student Feixiao Chen to analyze the spatio-temporal distribution of birth addresses in North Carolina. The goal of the project was to understand how/whether the distributions of different demographic categories (white/black, married/unmarried, etc.) differed, and how these differences connected to a variety of socioeconomic indicators.

Zijing Huang (Statistics, Finance), Artem Streltsov (Masters Economics), and Frank Yin (ECE, CompSci, Math) spent ten weeks exploring how Internet of Things (IoT) data could be used to understand potential online financial behavior. They worked closely with analytical and strategic personnel from TD Bank, who provided them with a massive dataset compiled by Epsilon, a global company that specializes in data-driven marketing.

John Benhart (CompSci, Math) and Esko Brummel (Masters in Bioethics and Science Policy) spent ten weeks analyzing current and potential scholarly collaborations within the community of Duke faculty. They worked closely with the leadership of the Scholars@Duke database.

Angelo Bonomi (Chemistry), Remy Kassem (ECE, Math), and Han (Alessandra) Zhang (Biology, CompSci) spent ten weeks analyzing data from social networks for communities of people facing chronic conditions. The social network data, provided by MyHealth Teams, contained information shared by community members about their diagnoses, symptoms, co-morbidities, treatments, and details about each treatment.

Over ten weeks, Public Policy major Amy Jiang and Mathematics and Computer Science major Kelly Zhang joined forces with Economics Masters student Amirhossein Khoshro to investigate academic hiring patterns across American universities and to analyze the educational background of faculty. They worked closely with Academic Analytics, a provider of data and solutions for universities in the U.S. and the U.K.

Linda Adams (CompSci), Amanda Jankowski (Sociology, Global Health), and Jessica Needleman (Statistics/Economics) spent ten weeks prototyping small-area mapping of public-health information within the Durham Neighborhood Compass, with a focus on mortality data. They worked closely with the director of DataWorks NC, an independent data intermediary dedicated to democratizing the use of quantitative information.

Gary Koplik (Masters in Economics and Computation) and Matt Tribby (CompSci, Statistics) spent ten weeks investigating the burden of rare diseases on the Duke University Health System (DUHS). They worked with a massive set of ICD diagnosis codes and visit data provided by DUHS.

Over ten weeks, Biology major Jacob Sumner and Neuroscience major Julianna Zhang joined forces with Biostatistics Masters student Jing Lyu to analyze potential drug diversion in the Duke Medical Center. Early detection of drug diversion assists health care providers in helping patients recover from their condition, as well as in mitigating the effects on any patients under their care.

William Willis (Mechanical Engineering, Physics) and Qitong Gao (Masters Mechanical Engineering) spent ten weeks with the goal of mapping the ocean floor autonomously with high resolution and high efficiency. Their efforts were part of a team taking part in the Shell Ocean Discovery XPRIZE, and they made extensive use of simulation software built from Bellhop, an open-source program distributed by HLS Research.

Graduate Student: Jacob Coleman, 3rd year Ph.D. student in Statistical Science

Faculty Instructor: Colin Rundel

Class: STA 112, Data Science

Joy Patel (Math and CompSci) and Hans Riess (Math) spent ten weeks analyzing massive amounts of simulated weather data supplied by Spectral Sciences Inc. Their goal was to investigate ways in which advanced mathematical techniques could assist in quantifying storm intensity, helping to augment today's more qualitatively-based methods.

Albert Antar (Biology) and Zidi Xiu (Biostatistics) spent ten weeks leveraging Duke Electronic Medical Record (EMR) data to build predictive models of pancreatic ductal adenocarcinoma (PDAC). PDAC is the 4th leading cause of cancer deaths in the US, and is most often diagnosed in stage IV, with a survival rate of only 1% and life expectancy measured in months. Diagnosis of PDAC is very challenging because of its deep anatomical placement and the significant risk imposed by traditional biopsy. The goal of this project is to utilize EMR data to identify potential avenues for diagnosing PDAC in the early treatable stages of disease.

Priya Sarkar (Computer Science), Lily Zerihun (Biology and Global Health), and Anqi Zhang (Biostatistics) spent ten weeks utilizing Duke Electronic Medical Record (EMR) data to identify subgroups of diabetic patients, and predict future complications associated with Type II Diabetes.

Computer Science and Psychology major Molly Chen and Neuroscience major Emily Wu spent ten weeks working with patient diagnosis co-occurrence data derived from Duke Electronic Medical Records to develop network visualizations of co-occurring disorders within demographic groups. Their goal was to make healthcare more holistic, and reduce healthcare disparities by improving patient and provider awareness of co-occurring disorders for patients within similar demographic groups.

Emily Horn (Public Policy, Global Health), Aasha Reddy (Economics), and Shanchao Wang (Masters Economics) spent ten weeks working with data from the National Asset Scorecard for Communities of Color (NASCC), an ongoing survey project that gathers information about asset and debt of households at a detailed racial and national origin level. They worked closely with faculty and researchers from the Samuel Dubois Cook Center for Social Equity.

Vivek Sriram (Computer Science and Math), Lina Yang (Biostatistics), and Pablo Ortiz (BME) spent ten weeks working in close collaboration with the Department of Biostatistics and Bioinformatics implementing an image analysis pipeline for immunofluorescence microscopy images of developing mouse lungs.

Statistical Science majors Nathaniel Brown and Corey Vernot, and Economics student Guan-Wun Hao spent ten weeks exploring changes in food purchase behavior and nutritional intake following the event of a new Metformin prescription for Type II Diabetes. They worked closely with Matthew Harding and researchers in the BECR Center, as well as Dr. Susan Spratt, an endocrinologist in Duke Medicine.

Anne Driscoll (Economics, Statistical Science) and Austin Ferguson (Math, Physics) spent ten weeks examining metrics for inter-departmental cooperativity and productivity, and developing a collaboration network of Duke faculty. This project was sponsored by the Duke Clinical and Translational Science Award, with the larger goal of promoting collaborative success in the School of Medicine and School of Nursing.

Joel Tewksbury (BME) and Miriam Goldman (Math and Statistics, Arizona State University) spent ten weeks analyzing time-series darkness visual adaptation scores from over 1200 study participants to identify trends in night vision, and ultimately genetic markers that might confer a visual advantage.

Lindsay Hirschhorn (Mechanical Engineering) and Kelsey Sumner (Global Health and Evolutionary Anthropology) spent ten weeks determining optimal vaccination clinic locations in Durham County for a simulated Zika virus outbreak. They worked closely with researchers at RTI International to construct models of disease spread and health impact, and developed an interactive visualization tool.

The team built a ground truth dataset comprising satellite images, building footprints, and building heights (LIDAR) of 40,000+ buildings, along with road annotations. This dataset can be used to train computer vision algorithms to determine a building’s volume from an image, and is a significant contribution to the broader research community, with applications in urban planning, civil emergency mitigation and human population estimation.

With the significant international consequences of recent outbreaks, the ITP Lab conducted extensive stakeholder interviews and macro-level health policy analysis to expose gaps in pandemic preparedness and develop legal frameworks for future threats. 

Graduate student: Hamza Ghadyali          

Faculty instructor: Dr. Paul Bendich

Course: MATH 412 – Topology with Applications

Students in the Performance and Technology Class create a series of performances that explore the interface between society and our machines. With the theme of the cloud to guide them, they have created increasingly complex art using digital media, microcontrollers, and motion tracking. Their work will be on display at the Duke Choreolab 2016.

Computer Science majors Erin Taylor and Ian Frankenburg, along with Math major Eric Peshkin, spent ten weeks understanding how geometry and topology, in tandem with statistics and machine-learning, can aid in quantifying anomalous behavior in cyber-networks. The team was sponsored by Geometric Data Analytics, Inc., and used real anonymized Netflow data provided by Duke's Information Technology Security Office.

Paclitaxel (Taxol) is a small molecule drug belonging to the taxane family. It is one of the most commonly used chemotherapeutics for many cancers, given as a monotherapy or in combination with other drugs to treat breast, lung and ovarian cancer as well as Kaposi’s sarcoma. Taxol is on the World Health Organization’s (WHO) List of Essential Medicines, a list that includes the most important medications for basic health. The worldwide demand for paclitaxel is exceeding the current supply.

How well and in what ways do governments communicate with their citizens? How do governments analyze data and create visualizations to promote public access to government information? 

A virtual reality system to recreate the archaeological experience using data and 3D models from the neolithic site of Çatalhöyük, in Anatolia, Turkey. 

This project summarizes the existing sample agreements from different institutions, analyzes the key contractual issues in the formation of alliances, and develops master charts of legal provisions to compare different approaches, to provide a reference for the formation of new alliances in the era of epidemic disease outbreaks. 

Geometric Data Analytics, Inc. is a Triangle-based research, development and consulting company that applies cutting-edge mathematical techniques to solve complex data analysis

This project transforms an inaccessible audio archive of historic North Carolina folk music collected by Frank Clyde Brown in the 1920s-40s into a vital, publicly accessible digital archive and museum exhibition.

Imagine a world where we understand how to detect mental health and developmental problems in early childhood so that we can intervene early in life and prevent future suffering and impairment. This is a challenge that can only be addressed by an interdisciplinary team of computational people with child psychiatrists and neuroscientists who can integrate and mine knowledge from cross-cultural and global data.

Molly Rosenstein, an Earth and Ocean Sciences major, and Tess Harper, an Environmental Science and Spanish major, spent ten weeks developing interactive data applications for use in Environmental Science 101, taught by Rebecca Vidra.

Lineage Logistics is the second largest cold storage network in the world, playing a critical role in multiple global supply chains. We store and transport temperature-sensitive commodities (about 40 billion lbs per year) in a large network of warehouses, trucks and rail cars. Our inventories include everything from Boeing’s carbon fiber to your 4th of July baby-back ribs.

Two to three undergraduates joined a research group led by Douglas Boyer and Ingrid Daubechies, with the goal of testing and developing mathematical and statistical methodology for measuring similarities between bones and teeth.

Nonnegative matrix factorization (NMF) has an established reputation as a useful data analysis technique in numerous applications. However, its use in practical situations has faced challenges in recent years. The fundamental reason is the ever-growing size of the datasets available and needed in the information sciences. To address this, in this work we propose to use structured random compression, that is, random projections that exploit the data structure, for two NMF variants: classical and separable. In separable NMF (SNMF) the left factors are a subset of the columns of the input matrix. We present suitable formulations for each problem, dealing with different representative algorithms within each one.
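
As a rough illustration of the separable case, the sketch below compresses a matrix with a randomized range finder and then selects columns with the successive projection algorithm (SPA). This is a minimal sketch under stated assumptions, not the formulation used in the work itself: SPA is swapped in as one well-known separable NMF method, and the function names, the oversampling and power-iteration settings, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def compression_basis(X, rank, oversample=5, power_iters=2, seed=0):
    """Orthonormal basis Q whose span approximates the column space of X.

    The random sketch is multiplied by X itself, so the projection adapts
    to the structure of the data (assumed settings, for illustration only).
    """
    rng = np.random.default_rng(seed)
    Y = X @ rng.standard_normal((X.shape[1], rank + oversample))
    for _ in range(power_iters):
        Q, _ = np.linalg.qr(Y)      # re-orthonormalize for numerical stability
        Y = X @ (X.T @ Q)           # one power iteration
    Q, _ = np.linalg.qr(Y)
    return Q

def successive_projection(B, r):
    """Pick r columns of B: take the largest-norm column, project it out, repeat."""
    R = np.array(B, dtype=float)
    chosen = []
    for _ in range(r):
        j = int(np.argmax(np.einsum("ij,ij->j", R, R)))  # squared column norms
        chosen.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)     # remove the component along the chosen column
    return chosen

def compressed_separable_nmf(X, r):
    """Run column selection on the small matrix Q^T X instead of on X."""
    Q = compression_basis(X, r)
    return successive_projection(Q.T @ X, r)

# Synthetic separable data: r "pure" columns plus convex mixtures of them.
rng = np.random.default_rng(1)
r, m, n = 4, 60, 40
W = rng.random((m, r))
H_mix = 0.8 * rng.dirichlet(np.ones(r), size=n - r).T + 0.2 / r
perm = rng.permutation(n)
X = (W @ np.hstack([np.eye(r), H_mix]))[:, perm]

print("selected columns:", sorted(compressed_separable_nmf(X, r)))
print("true pure columns:", sorted(np.flatnonzero(perm < r).tolist()))
```

In this toy setting the selection runs on a matrix with only `rank + oversample` rows, which is where the savings would come from when the original matrix has many rows.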

In this work, we turn musical audio time series data into shapes for various tasks in music matching and musical structure understanding. 

Inspiring and empowering donors to give more effectively

We want three bright, motivated students to participate in this nine-week Data+ project!

The goal of this project is to take a large amount of data from the Massive Open Online Courses offered by Duke professors, and produce from it a coherent and compelling data analysis challenge that might then be used for a Duke or nation-wide data analysis competition.

Kelsey Sumner, EvAnth and Global Health major, and Christopher Hong, CompSci/ECE major, spent ten weeks analyzing high-dimensional microRNA data taken from patients with viral and/or bacterial conditions. They worked closely with the medical faculty and practitioners who generated the data.

Kang Ni, Math/Econ major, Kehan Zhang, Econ/Stats major, and Alex Hong spent ten weeks investigating a large collection of grocery store transaction data. They worked closely with Matt Harding and the Behavioral Economics and Healthy Food Choice Research Center (BECR Center).

Ethan Levine, Annie Tang, and Brandon Ho spent ten weeks investigating whether personality traits can be used to predict how people make risky decisions. They used a large dataset collected by the lab of Prof. Scott Huettel, and were mentored by graduate students Emma Wu Dowd and Jonathan Winkle.

Spenser Easterbrook, a Philosophy and Math double major, joined Biology majors Aharon Walker and Nicholas Branson in a ten-week exploration of the connections between journal publications from the humanities and the sciences. They were guided by Rick Gawne and Jameson Clarke, graduate students from Philosophy and Biology.

The Triangle Census Research Network (TCRN) is an interdisciplinary team of researchers from Duke University and the National Institute of Statistical Sciences dedicated to improving the way that federal statistical agencies collect, analyze, and disseminate data to the public.

Large-scale databases from the social, behavioral, and economic sciences offer enormous potential benefits to society. However, as most stewards of social science data are acutely aware, wide-scale dissemination of such data can result in unintended disclosures of data subjects' identities and sensitive attributes, thereby violating promises, and in some instances laws, to protect data subjects' privacy and confidentiality.

We present a framework for high-dimensional regression using the GMRA data structure. In analogy to a classical wavelet decomposition of function spaces, a GMRA is a tree-based decomposition of a data set into local linear projections.
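
To make the phrase "tree-based decomposition of a data set into local linear projections" concrete, here is a minimal sketch, assuming a simple binary split along the top principal direction at each node. It illustrates only the multiscale local-projection idea; it does not implement the regression framework or the full GMRA construction, and the names `build_node` and `approximate` as well as the splitting rule are hypothetical choices for this example.

```python
import numpy as np

def build_node(X, dim, depth, min_points=10):
    """Fit a local affine (PCA) approximation to X, then split recursively."""
    center = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered points.
    _, _, Vt = np.linalg.svd(X - center, full_matrices=False)
    node = {"center": center, "basis": Vt[:dim], "split": None, "children": None}
    if depth > 0 and len(X) >= 2 * min_points:
        side = (X - center) @ Vt[0] > 0          # split along the top direction
        if min_points <= side.sum() <= len(X) - min_points:
            node["split"] = Vt[0]
            node["children"] = (
                build_node(X[~side], dim, depth - 1, min_points),
                build_node(X[side], dim, depth - 1, min_points),
            )
    return node

def approximate(node, x, level):
    """Project x onto the local affine subspace found after `level` splits."""
    while level > 0 and node["children"] is not None:
        go_right = int((x - node["center"]) @ node["split"] > 0)
        node = node["children"][go_right]
        level -= 1
    c, B = node["center"], node["basis"]
    return c + B.T @ (B @ (x - c))

# Points on a circle embedded in 3-D, approximated by local 1-D pieces.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 400)
X = np.column_stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)])
tree = build_node(X, dim=1, depth=3)

for level in range(4):
    err = np.mean([np.sum((x - approximate(tree, x, level)) ** 2) for x in X])
    print(f"level {level}: mean squared approximation error = {err:.5f}")
```

As in a wavelet decomposition, deeper levels of the tree give finer approximations: each extra split replaces one coarse linear piece with two more local ones.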

In this Data Expedition, Duke undergraduates were introduced to a real world traffic citation data set. Provided by Dr. Frank R. Baumgartner, a political scientist at UNC, the data consist of 15 years of traffic stops, with over 18 million observations of 53 variables.

Dr. Guillermo Sapiro, a professor in the Pratt School of Engineering at Duke University, conducts ongoing autism research. Using image processing, he attempts to program a computer to detect whether babies (around eight to 14 months of age) display a sign of autism. This very early detection enables doctors to train these babies (when their brain plasticity is high) to behave in ways to counter the behavioral limitations autism imposes, thus allowing these babies to act more normally as they grow up.

Using social network analysis to predict survival in large-brained mammals.

Graduate students: Aaron Berdanier and Matt Kwit, University Program in Ecology & Nicholas School of the Environment

Students learned to visualize high-dimensional gene expression data; understand genetic differences in the context of gene networks; connect genetic differences to physiological outcomes; and perform simple analyses using the R programming language.

This data expedition introduced students to “sliding windows and persistence” on time series data, which is an algorithm to turn one dimensional time series into a geometric curve in high dimensions, and to quantitatively analyze hybrid geometric/topological properties of the resulting curve such as “loopiness” and “wiggliness.”
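
The first half of that idea, turning a one-dimensional series into a curve in higher dimensions, can be sketched in a few lines. The example below is a minimal sketch, assuming a plain delay embedding with a window dimension `dim` and delay `tau`; the function name and parameters are hypothetical, and the persistence computation that would follow (which needs a topological data analysis library) is not shown.

```python
import numpy as np

def sliding_window(x, dim, tau):
    """Map a 1-D series to points (x[t], x[t + tau], ..., x[t + (dim - 1) * tau])."""
    x = np.asarray(x, dtype=float)
    n_points = len(x) - (dim - 1) * tau
    if n_points <= 0:
        raise ValueError("series too short for this window")
    # Column i holds the series shifted by i * tau samples.
    return np.stack([x[i * tau : i * tau + n_points] for i in range(dim)], axis=1)

# A cosine with period 40 samples, embedded with a quarter-period delay.
t = np.arange(200)
signal = np.cos(2.0 * np.pi * t / 40.0)
cloud = sliding_window(signal, dim=2, tau=10)

print("point cloud shape:", cloud.shape)
print("distance from origin (min, max):",
      np.linalg.norm(cloud, axis=1).min(), np.linalg.norm(cloud, axis=1).max())
```

With these settings each point is (cos θ, cos(θ + π/2)) = (cos θ, -sin θ), so the periodic signal traces out a closed loop. That loop is the kind of "loopiness" a persistence computation would then quantify.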

In this project, we aim to solve the compressive sensing (CS) hyperspectral / video image reconstruction problem. The proposed algorithm is robust to different initializations. This is useful for CS reconstruction problems where suitable training datasets are not available.

Questions asked: Do males and females scent mark equally? Do lemurs scent mark equally in breeding and non-breeding seasons?

Introduce NBA and MLB datasets to undergraduates to help them gain expertise in exploratory data analysis, data visualization, statistical inference, and predictive modeling.

STEM education often presents a very sanitized version of the scientific enterprise. To some extent, this is necessary, but overemphasizing neat-and-tidy results and scripted protocol assignments poses the risk of failing to adequately prepare students for the real-world mess of transforming experimental data into meaningful results. The fundamental aim of this project was to guide students in processing large real-world datasets far beyond their academic comfort zone so as to give them a more realistic understanding of how science works.

What drove the prices for paintings in 18th Century Paris?

A new model is developed for joint analysis of ordered, categorical, real and count data. In the motivating application, the ordered and categorical data are answers to questionnaires, the (word) count data correspond to the text questions from the questionnaires, and the real data correspond to fMRI responses for each subject. We also combine the analysis of these data with single-nucleotide polymorphism (SNP) data from each individual. 

The sub-thalamic nucleus (STN) within the sub-cortical region of the basal ganglia is a crucial targeting structure for deep brain stimulation (DBS) surgery, in particular for alleviating Parkinson’s disease (PD) symptoms. Volumetric segmentation of such a small and complex structure, which is elusive in clinical MRI protocols, is therefore a prerequisite for reliable DBS targeting. While direct visualization and localization of the STN is facilitated with advanced high-field 7T MR imaging, such high fields are not always clinically available.

Volumetric segmentation of sub-cortical structures such as the basal ganglia and thalamus is necessary for non-invasive diagnosis and neurosurgery planning. This is a challenging problem due in part to limited boundary information between structures, similar intensity profiles across the different structures, and low contrast data.

Intelligent mobile sensor agents can adapt to heterogeneous environmental conditions to achieve optimal performance in tasks such as demining and maneuvering-target tracking.

Successful high-resolution signal reconstruction -- in problems ranging from astronomy to biology to medical imaging -- depends crucially on our ability to make the most out of indirect, incomplete, a