Diagnosing Diabetes and Predicting Complications

Project Summary

Priya Sarkar (Computer Science), Lily Zerihun (Biology and Global Health), and Anqi Zhang (Biostatistics) spent ten weeks using Duke Electronic Medical Record (EMR) data to identify subgroups of diabetic patients and to predict future complications associated with Type II diabetes.

Project Results

The team used t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce the dimensionality of data on prescribed medications, medical diagnoses, laboratory tests, and patient outcomes. They then performed K-means clustering to identify meaningful clusters of similar patients and explored what made the patients in each cluster similar. The team also constructed and tested statistical models to predict 13 common complications in diabetic patients, and found high predictive accuracy for several of these complications when leveraging the rich data available in the EMR.
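The clustering step described above can be illustrated with a minimal sketch. The summary does not say which software, features, or number of clusters the team used, so everything here is an assumption: the points are made-up two-dimensional coordinates standing in for a low-dimensional embedding such as t-SNE output, the function names are hypothetical, and k = 2 is chosen only because the toy data has two groups. The t-SNE step itself is not shown; in practice it would typically come from a library rather than hand-written code.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct data points (an assumption;
    # other initialisation schemes exist).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return labels, centroids


# Made-up 2D coordinates: ten "patients" near (0, 0), ten near (10, 10).
group_a = [(i * 0.1, (i % 3) * 0.1) for i in range(10)]
group_b = [(10 + i * 0.1, 10 + (i % 3) * 0.1) for i in range(10)]
points = group_a + group_b

labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On real EMR data the clusters would rarely be this cleanly separated, and the choice of k would need to be justified, for example by comparing results across several values.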

Project Manager

"Data+ provided an invaluable opportunity to work with motivated, hard-working students on exciting and challenging data problems. I learned so much about working with others, communicating effectively, and managing students with a variety of backgrounds. Though each of my students had a different level of statistics and coding experience, they made mentoring so easy with their hard work and interest in the project, as well as the effective organization of the summer as a whole. It was a great experience that I highly recommend to other graduate students!"
Liz Lorenzi, Ph.D. Candidate, Statistics

Participants

  • Lillian Zerihun, Duke University Biology & Global Health
  • Priya Sarkar, Duke University Computer Science
  • Anqi Zhang, Duke University Biostatistics

Disciplines Involved

  • Biostatistics
  • Public Health
  • All quantitative STEM

 

Related Projects

United Nations Sustainable Development Goal 7 calls for universal access to affordable, reliable, sustainable, and modern energy. Researchers and practitioners around the world have responded to this call by producing a wealth of energy access data. While many data gaps still exist, are we capturing the fullest potential from the information and research we do have, and what it tells us about how to accelerate energy access? Power for All’s Platform for Energy Access Knowledge (PEAK) is an interactive knowledge platform designed to automatically curate, organize, and streamline large, growing bodies of data into digestible, sharable, and useable knowledge through automated data capture, indexing, and visualization. A team of students led by Rebekah Shirley will consult with Power for All to creatively visualize PEAK’s library, and to explore machine learning and natural language processing tools that can enable auto-extraction and visualization of data for more effective science communication.

Are there relative value opportunities in the global corporate bond markets?  
A team of students will work with Professor Emma Rasiel to understand whether an analysis of credit spreads on bonds issued by international firms in multiple countries over time can shed light on potential arbitrage opportunities. The team will have frequent opportunities to interact with analytics professionals at a leading financial advisory and asset management firm.

A team of students will consult with a leading financial advisory and asset management firm that is seeking to understand how big data can shed light on the secondary market for construction machinery. Students will explore a combination of publicly available datasets that describe the used-machinery market and its potential value as an indicator of the business cycle. There will be frequent interactions with analytics professionals from the firm.