Diagnosing Diabetes and Predicting Complications

Project Summary

Priya Sarkar (Computer Science), Lily Zerihun (Biology and Global Health), and Anqi Zhang (Biostatistics) spent ten weeks using Duke Electronic Medical Record (EMR) data to identify subgroups of diabetic patients and to predict future complications associated with Type II Diabetes.

Themes and Categories

Project Results

The team used t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce the dimensionality of data on prescribed medications, medical diagnoses, laboratory tests, and patient outcomes. They then applied K-means clustering to identify meaningful clusters of similar patients and explored what made the patients in each cluster similar. The team also constructed and tested statistical models to predict 13 common complications in diabetic patients, and found high predictive accuracy for several of these complications when drawing on the rich data available in the EMR.
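The clustering step described above can be illustrated with a minimal sketch. This is not the team's code, which the summary does not include. It is a plain-Python version of K-means (Lloyd's algorithm) run on synthetic two-dimensional points that stand in for the low-dimensional coordinates a t-SNE embedding would produce. The function name `kmeans`, the synthetic data, and all parameter values are assumptions made for illustration.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    Returns (labels, centroids). Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Synthetic stand-in for embedded patient coordinates: two separated groups.
data_rng = random.Random(42)
group_a = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(50)]
group_b = [(data_rng.gauss(10, 1), data_rng.gauss(10, 1)) for _ in range(50)]
points = group_a + group_b

labels, centroids = kmeans(points, k=2)
print("cluster sizes:", [labels.count(c) for c in range(2)])
```

In practice a library implementation would likely be used for both steps, for example `sklearn.manifold.TSNE` for the embedding and `sklearn.cluster.KMeans` for the clustering. The stdlib version here only makes the assignment and update steps explicit.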

Faculty Sponsor

Project Manager

"Data+ provided an invaluable opportunity to work with motivated, hard-working students on exciting and challenging data problems. I learned so much about working with others, communicating effectively, and managing students with a variety of backgrounds. Though each of my students had a different level of statistics and coding experience, they made mentoring so easy with their hard work and interest in the project, as well as the effective organization of the summer as a whole. It was a great experience that I highly recommend to other graduate students!" Liz Lorenzi, Ph.D. Candidate, Statistics


  • Lillian Zerihun, Duke University Biology & Global Health
  • Priya Sarkar, Duke University Computer Science
  • Anqi Zhang, Duke University Biostatistics

Disciplines Involved

  • Biostatistics
  • Public Health
  • All quantitative STEM


Related Projects

Brooke Erikson (Economics/Computer Science), Alejandro Ortega (Math), and Jade Wu (Computer Science) spent ten weeks developing open-source tools for automatic document categorization, PDF table extraction, and data identification. Their motivating application was provided by Power for All’s Platform for Energy Access Knowledge, and they frequently collaborated with professionals from that organization.

Jake Epstein (Statistics/Economics), Emre Kiziltug (Economics), and Alexander Rubin (Math/Computer Science) spent ten weeks investigating the existence of relative value opportunities in global corporate bond markets. They worked closely with a dataset provided by a leading asset management firm.

Maksym Kosachevskyy (Economics) and Jaehyun Yoo (Statistics/Economics) spent ten weeks studying temporal patterns in the used construction machinery market and investigating the relationship between these patterns and macroeconomic trends. They worked closely with a large dataset provided by MachineryTrader.com and discussed their findings with analytics professionals from a leading asset management firm.
