A few weeks ago I participated in an expert meeting hosted by the National Academy of Sciences and sponsored by the U.S. Bureau of Economic Analysis (BEA). The BEA is responsible for compiling the national accounts (e.g., how many dollars were spent on gasoline, on real estate, on milk), which are vital for the government’s macroeconomic planning as well as for decision-making in corporate America. The purpose of the meeting was to consider how the BEA could best leverage large-scale, organic data—that is, data arising naturally as opposed to being collected in planned surveys—to improve estimates of the national accounts. Around the table were representatives from high levels of government, top academic institutions, and big players in industry (Microsoft, JP Morgan, Zillow, and MasterCard, among others). The panelists heard presentations about applications of organic commercial data, and offered opinions and insights on their potential for the BEA.
The presentations were fascinating, and it was clear that organic data should be utilized. There was one discussion thread that I thought was especially important: what do the processes that generate the data imply about their usefulness? As an example, the panelists from JP Morgan and MasterCard presented analyses of American consumers’ spending estimated from billions of credit card transactions. Compared to surveys, these data are far more voluminous, more timely, and—especially relevant in this age of shrinking budgets for statistical agencies—far cheaper.
This is all very good. However, the experts (and the panelists, to their credit) also highlighted some limitations of the data. By design, the data completely omit cash transactions. The demographics of credit card holders differ from those of the overall population, so looking solely at credit card transactions can give incomplete (and biased) estimates of spending patterns. The experts and panelists also noted that attribute definitions in organic data may not always match what is desired. For example, if I buy a book online from Amazon, should the charge be attributed to Durham, to Seattle (Amazon headquarters), or to a local distribution center? In organic data this choice is driven by software, not by policy needs. There also were concerns over the largely unknown quality of organic data; about data privacy and data access, and the implications of limited access for the verifiability and reproducibility of results; and about potential conflicts between companies’ profit incentives and the creation of a public good. I left the meeting thinking there is enormous potential but also enormous challenge, which is of course the ideal scenario for researchers (and exactly where iiD lives).
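To make the coverage concern concrete, here is a minimal simulation sketch. All of the numbers are invented for illustration (they are not from the meeting or from any real transaction data); the only point is that when card ownership is associated with spending, an estimate built solely from card transactions drifts away from the population value.

```python
# Toy illustration of coverage bias: cash-only spenders are invisible to card data.
# All parameters below are hypothetical, chosen only to make the bias visible.
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
# Assumption for illustration: 70% of people use cards, and card users spend more.
uses_card = rng.random(n) < 0.70
spending = np.where(uses_card,
                    rng.normal(2500, 400, n),   # monthly spending, card users
                    rng.normal(1600, 400, n))   # monthly spending, cash-only

true_mean = spending.mean()                  # what a well-designed survey targets
card_only_mean = spending[uses_card].mean()  # what transaction data alone can see

print(f"population mean spending : {true_mean:8.1f}")
print(f"card-transaction estimate: {card_only_mean:8.1f}")
# The card-only estimate overstates average spending because the cash-only
# population is missing by design, not by accident.
```

The gap between the two numbers is not a sampling error that more transactions would shrink; it comes from who generates the data in the first place, which is exactly why the data-generating process matters.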
The discussion also got me thinking about teaching. We (quantitative researchers at Duke and elsewhere) tend to focus on data analysis—e.g., novel models for learning from big data, efficient computing techniques, effective visualizations—and give short shrift to the data collection process. This paints an incomplete picture. Indeed, as was apparent at the BEA meeting, arguably the most important component of an analysis is understanding the data-generating process. The very best algorithm applied to lousy data still yields lousy results. As big data begin to permeate education, we should not neglect teaching students where the data come from. We need to find ways to build interdisciplinary collaborations among data experts and quantitative/computational scientists into our training, as exemplified by iiD’s signature teaching initiatives Data+ and Data Expeditions, just as we have done in our research.
Jerry Reiter