Large-scale databases from the social, behavioral, and economic sciences offer enormous potential benefits to society. When made widely accessible, these databases facilitate advances in research and policy-making, enable students to develop skills in data analysis, and help ordinary citizens learn about their communities. However, as most stewards of social science data are acutely aware, wide-scale dissemination of such data can result in unintended disclosures of data subjects’ identities and sensitive attributes, thereby violating promises, and in some instances laws, to protect data subjects’ privacy and confidentiality. iiD faculty, social science faculty, and Duke OIT staff are developing new methods and infrastructure for providing access to large-scale confidential social science data. These methods have the potential to revolutionize how researchers, students, and the general public access data, helping facilitate more and deeper analyses of our society.
Jerry Reiter, the Mrs. Alexander Hehmeyer Professor of Statistical Science at Duke, has obtained more than $5 million in grants from the National Science Foundation (NSF) to develop methods and infrastructure for making confidential social science data, such as data held by the U.S. Census Bureau and other agencies, available to researchers, policy makers, and the public while preserving confidentiality.
How will he do this? A cornerstone of his approach is to create synthetic data that mimic the trends and information contained in the original data but protect confidentiality. “Simply stripping names and addresses doesn’t protect confidentiality when the data contain demographic variables, employment/education histories, or other information that ill-intentioned people can use to match up records,” says Reiter, who is also iiD’s deputy director. “Agencies need more sophisticated methods for safely sharing and disseminating data.” Reiter’s team uses advanced statistical tools to construct synthetic data sets that capture crucial relationships found in the original data but carry low risks of disclosure.
Reiter and his team plan to develop an NSF-funded system that will allow data stewards across the world to use this technology, coupled with systems for providing remote access to trusted users. The team is also developing new statistical methods to improve the quality of the data and link them to other information. For example, combining census data with payroll tax records would provide better estimates of people’s wages and salaries, which are extremely important for monitoring the health of the U.S. economy.
“We have the potential to be the best place in the world where information science meets the social sciences.” —Robert Calderbank