An Integrated System for Accessing Large-Scale, Confidential Social Science Data

Project Summary

Large-scale databases from the social, behavioral, and economic sciences offer enormous potential benefits to society. However, as most stewards of social science data are acutely aware, wide-scale dissemination of such data can result in unintended disclosures of data subjects' identities and sensitive attributes, thereby violating promises, and in some instances laws, to protect data subjects' privacy and confidentiality.

Contact
Jerry Reiter
Statistical Science
jerry@stat.duke.edu

Supported by a grant from the National Science Foundation Data Infrastructure Building Blocks program, we are developing an integrated system for disseminating large-scale social science data. The system includes:

(i) Capability to generate highly redacted, synthetic data intended for wide access, coupled with

(ii) Means for approved researchers to access the confidential data via secure remote access solutions, glued together by

(iii) A verification server that allows users to assess the quality of their analyses with the redacted data, so that they can use remote data access more efficiently.
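
To make the role of the verification server concrete, here is a minimal sketch of the kind of check it might perform. The function name `verify_mean`, the tolerance parameter, and the sample values are all hypothetical and are not taken from the project. The sketch assumes the server holds both versions of the data and returns only a coarse yes/no verdict instead of the confidential estimate itself; a real server would presumably also need to limit how much the verdict reveals.

```python
from statistics import mean


def verify_mean(synthetic_values, confidential_values, tolerance):
    """Hypothetical check: is the mean computed on the synthetic data
    within `tolerance` of the mean computed on the confidential data?

    Only a boolean is returned, so the caller never sees the
    confidential estimate directly.
    """
    gap = abs(mean(synthetic_values) - mean(confidential_values))
    return gap <= tolerance


# Illustrative values only (not real data).
synthetic = [41, 52, 47, 60]
confidential = [40, 55, 45, 62]

close_enough = verify_mean(synthetic, confidential, tolerance=1.0)
too_strict = verify_mean(synthetic, confidential, tolerance=0.25)
```

A user who receives a favorable verdict could rely on the synthetic-data result, while an unfavorable one would signal that the analysis is worth repeating through secure remote access.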

Related Projects

A team of students led by Professors Jonathan Mattingly and Gregory Herschlag will investigate gerrymandering in political districting plans. Students will improve and employ an algorithm that samples the space of compliant redistricting plans for both state and federal districts. The output of the algorithm will be used to detect gerrymandering in a given district plan, and to analyze the efficacy of partisan symmetry as a measure. This work will continue the Quantifying Gerrymandering project, seeking to understand the space of redistricting plans and to find justiciable methods to detect gerrymandering. The ideal team has a mixture of members with programming backgrounds (C, Java, Python), statistical experience (possibly including R), mathematical and algorithmic experience, and exposure to political science or other social science fields.

Read the latest updates about this ongoing project by visiting Dr. Mattingly's Gerrymandering blog.

A team of students led by faculty and researchers at the Social Science Research Institute will bring together data that will facilitate research using social determinants of health (SDH) to examine, understand, and ameliorate health disparities. This project will identify SDH variables that have the potential to be linked to data from the MURDOCK Study, a longitudinal health study based in Cabarrus County, NC. Much of this data – information relevant to understanding socioeconomic status, education, the physical and social environment, employment, and social support networks – is publicly available or easily obtained, and its aggregation and analysis offer opportunities to significantly improve predictions of health risks and improve personalized care. Students will evaluate potential data sources, develop ethical policies to protect respondent privacy, clean and merge data, create documentation for data sharing and reuse, and use statistical tools and neighborhood mapping software to examine patterns of disparity.

A team of students led by researchers in the Center for Health Policy and Inequalities Research will develop a platform that visualizes significant life events across time for more than 3,000 orphaned and separated children in Cambodia, Ethiopia, India, Kenya, and Tanzania from the Positive Outcomes for Orphans (POFO) study. The types of life events visualized on the timeline will include the death of a parent, changes in living locations, school levels achieved, special events, traumatic events, and reported wellbeing at different ages. This data will be displayed on mobile devices so that participants can view and verify the information provided about their lives. Ultimately, the platform will allow researchers to ensure the accuracy of the data provided and will allow wider audiences to see the individual lives behind the study's aggregate data.