Validating a Topic Model that Predicts Pancreatic Cancer from Latent Structures in the Electronic Medical Record

Project Summary

Extending a 2016 Data+ team's work on predictive modeling of pancreatic cancer from electronic medical record (EMR) data, students Siwei Zhang (Master's, Biostatistics) and Jake Ukleja (Computer Science) spent ten weeks developing a model to predict the disease from patient records. They worked with nine years' worth of EMR data, including ICD9 diagnostic codes, covering more than 200,000 patients.

Contact
Paul Bendich
bendich@math.duke.edu

Project Results: The team began with exploratory data analysis that illustrated the median time of appearance and the frequency of specific ICD9 codes, with an eye toward understanding how these statistics relate to pancreatic cancer diagnosis. They then trained a topic model on ICD9 codes that predicted past pancreatic cancer diagnosis with high accuracy (an AUC of 0.93). Finally, they used the topic model outcomes to identify a pool of high-risk patients for potential future study.
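
For readers unfamiliar with the AUC reported above, the following is a minimal sketch of how the metric can be computed. It is not the team's code: the function name `auc` and the toy scores and labels are hypothetical, chosen only to illustrate that AUC is the probability that a randomly chosen diagnosed patient receives a higher model score than a randomly chosen undiagnosed patient.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as half).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical risk scores for eight patients (1 = diagnosed, 0 = not).
# These numbers are invented for illustration, not taken from the project.
toy_scores = [0.91, 0.80, 0.35, 0.62, 0.40, 0.30, 0.20, 0.10]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_auc = auc(toy_scores, toy_labels)  # 13 of 15 pairs ranked correctly
```

In this toy case 13 of the 15 diagnosed/undiagnosed pairs are ordered correctly, giving an AUC of about 0.87. An AUC of 0.93 would mean roughly 93 percent of such pairs are ordered correctly, while 0.5 corresponds to random guessing.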

Project Leads:

Lisa Satterwhite, PhD

James Abbruzzese, MD

Joseph Lucas, PhD

Project Manager: Tyler Massaro

Related Projects

Marine mammals exhibit extreme physiological and behavioral adaptations that allow them to dive hundreds to thousands of meters underwater despite their need to breathe air at the surface. Through the development of new remote monitoring technologies, we are just beginning to understand the mechanisms by which they are able to execute these extreme behaviors. Long-term animal-borne tags can now record location, dive depth, and dive duration and then transmit these data to satellite receivers, enabling remote access to behavior occurring both many kilometers out to sea and several kilometers below the ocean surface.

The aim of this Data Expedition was for students to learn hands-on data visualization techniques using a variety of data types. Students first discussed why data visualization is useful and shared tips for making graphs both visually appealing and easy to understand.

The aim of our data expeditions course was to give students in Bio 190S-0.2, a summer session course in sensory systems, an introduction to what real data look like and how they are analyzed. Over the course of a two-hour class session, 16 students ranging from 16 to 22 years old were given the opportunity to explore a dataset on the color vision capabilities of three species of cleaner shrimp.