Looking inside the data-making factory

Accessing economic data these days can be almost deceptively easy. A few clicks on the U.S. Bureau of Labor Statistics’ website or a simple search on the St. Louis Federal Reserve’s FRED site reveals thousands of datasets that anyone interested in economics can grab. Such quick-and-easy access to data has helped drive the “credibility revolution” in empirical economics research. Yet this multitude of data sets does not burst forth into the world fully formed. Rather, the creation of data can be a messy process, especially when considering the version that the U.S. public and researchers may eventually see.

Here’s one such mess: a new paper presented at last week’s Brookings Panel on Economic Activity warns that procedures used to protect privacy might be distorting many of the data sets that researchers use. Large data sets can be incredibly powerful in the amount of information they capture about individuals, but they also present a risk to those individuals’ privacy. Take the Current Population Survey, which is maintained by the U.S. Bureau of Labor Statistics and the U.S. Census Bureau. The survey contains information about a person’s family status, income, location, and quite a few more variables. If these data were made available to the public unaltered, then someone could look into the data and conceivably identify a respondent.

To avoid identification through these means, both private and public sector data providers mask respondents’ identities through a number of statistical processes. In the new Brookings paper, John M. Abowd of Cornell University and Ian M. Schmutte of the University of Georgia detail the ways that statistical disclosure limitation, or SDL (the catch-all name for these processes), can affect economic analyses. The authors single out one form of SDL as particularly troublesome: swapping. In this process, the statistical agency or company exchanges certain attributes between respondents. The locations of two individuals, for example, might be switched in order to protect their identities, but that could corrupt the economic analysis a researcher is running.
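To make the swapping concern concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the toy data, the 40 percent swap rate, and the function names (`swap_locations`, `mean_income`, `income_gap`) are all hypothetical and do not come from Abowd and Schmutte’s paper or from any agency’s actual procedure.

```python
import random

def swap_locations(records, swap_rate, seed=0):
    """Return a copy of `records` in which randomly chosen pairs of
    respondents have exchanged their 'location' values.
    Hypothetical swapping rule, for illustration only."""
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]  # copy so the input is untouched
    # Pick an even number of respondents so they can be paired off.
    n_chosen = int(len(swapped) * swap_rate) // 2 * 2
    chosen = rng.sample(range(len(swapped)), n_chosen)
    for i, j in zip(chosen[0::2], chosen[1::2]):
        swapped[i]["location"], swapped[j]["location"] = (
            swapped[j]["location"],
            swapped[i]["location"],
        )
    return swapped

def mean_income(records, location):
    """Average income of respondents recorded in `location`."""
    incomes = [r["income"] for r in records if r["location"] == location]
    return sum(incomes) / len(incomes)

def income_gap(records):
    """Mean income in location A minus mean income in location B."""
    return mean_income(records, "A") - mean_income(records, "B")

# Assumed toy data: 100 respondents in each of two locations.
records = (
    [{"location": "A", "income": 60_000} for _ in range(100)]
    + [{"location": "B", "income": 40_000} for _ in range(100)]
)
released = swap_locations(records, swap_rate=0.4)

true_gap = income_gap(records)       # gap in the confidential data
released_gap = income_gap(released)  # gap a researcher would measure
print(true_gap, released_gap)
```

In this sketch the number of respondents in each location and the overall list of incomes are unchanged by swapping, so the released file looks the same in summary. The measured income gap between locations, however, can only stay the same or shrink, because every swap that crosses locations places a high income in the low-income area and vice versa. A researcher who is told roughly how much swapping took place could, in principle, adjust for that attenuation.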

What the authors call for is a certain amount of disclosure from the organization creating these limited datasets. Abowd and Schmutte don’t want the raw, unaltered data, but rather better disclosure of how exactly the organization changed the data. If the agencies or companies let researchers know about certain aspects of the process that anonymized the data, then the researchers could account for these distortions in their analysis.

Of course, the agencies could just make more of the raw data available. But as participants at the Brookings panel noted, there are severe consequences for the employees of these agencies if someone is identified in the data. The punishment could include a fine of several thousand dollars or jail time. That potential punishment makes agency staff quite averse to disclosing data.

Right now, these statistical disclosure limitation processes are used almost exclusively on survey data, which makes some economists sanguine about the future of data. Microeconomists have increasingly turned away from survey data and toward administrative data, which is taken directly from government sources and includes things like tax receipts. But one participant warned that these SDL processes will almost certainly make their way to administrative data.

Abowd and Schmutte’s paper is an important reminder that we all need to pay attention to the sources and presentation of economic data. The old joke about economic analysis is that it’s like looking for a set of lost keys only where the streetlight illuminates the pavement. The least we can do is to make sure the light shines as bright and as wide as possible.

March 24, 2015
