TY - RPRT T1 - Effects of a Government-Academic Partnership: Has the NSF-Census Bureau Research Network Helped Secure the Future of the Federal Statistical System? Y1 - 2017 A1 - Weinberg, Daniel A1 - Abowd, John M. A1 - Belli, Robert F. A1 - Cressie, Noel A1 - Folch, David C. A1 - Holan, Scott H. A1 - Levenstein, Margaret C. A1 - Olson, Kristen M. A1 - Reiter, Jerome P. A1 - Shapiro, Matthew D. A1 - Smyth, Jolene A1 - Soh, Leen-Kiat A1 - Spencer, Bruce A1 - Spielman, Seth E. A1 - Vilhuber, Lars A1 - Wikle, Christopher AB -

Effects of a Government-Academic Partnership: Has the NSF-Census Bureau Research Network Helped Secure the Future of the Federal Statistical System? Weinberg, Daniel; Abowd, John M.; Belli, Robert F.; Cressie, Noel; Folch, David C.; Holan, Scott H.; Levenstein, Margaret C.; Olson, Kristen M.; Reiter, Jerome P.; Shapiro, Matthew D.; Smyth, Jolene; Soh, Leen-Kiat; Spencer, Bruce; Spielman, Seth E.; Vilhuber, Lars; Wikle, Christopher The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodological questions of interest and significance to the broader research community and to the Federal Statistical System (FSS), particularly the Census Bureau. The activities to date have covered both fundamental and applied statistical research and have focused at least in part on the training of current and future generations of researchers in skills of relevance to surveys and alternative measurement of economic units, households, and persons. This paper discusses some of the key research findings of the eight nodes, organized into six topics: (1) Improving census and survey data collection methods; (2) Using alternative sources of data; (3) Protecting privacy and confidentiality by improving disclosure avoidance; (4) Using spatial and spatio-temporal statistical modeling to improve estimates; (5) Assessing data cost and quality tradeoffs; and (6) Combining information from multiple sources. It also reports on collaborations across nodes and with federal agencies, new software developed, and educational activities and outcomes. The paper concludes with an evaluation of the ability of the FSS to apply the NCRN’s research outcomes and suggests some next steps, as well as the implications of this research-network model for future federal government renewal initiatives. 
This paper began as a May 8, 2015 presentation to the National Academies of Science’s Committee on National Statistics by two of the principal investigators of the National Science Foundation-Census Bureau Research Network (NCRN) – John Abowd and the late Steve Fienberg (Carnegie Mellon University). The authors acknowledge the contributions of the other principal investigators of the NCRN who are not co-authors of the paper (William Block, William Eddy, Alan Karr, Charles Manski, Nicholas Nagle, and Rebecca Nugent), the co-principal investigators, and the comments of Patrick Cantwell, Constance Citro, Adam Eck, Brian Harris-Kojetin, and Eloise Parker. We note with sorrow the deaths of Stephen Fienberg and Allan McCutcheon, two of the original NCRN principal investigators. The principal investigators also wish to acknowledge Cheryl Eavey’s sterling grant administration on behalf of the NSF. The conclusions reached in this paper are not the responsibility of the National Science Foundation (NSF), the Census Bureau, or any of the institutions to which the authors belong.

PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52650 ER - TY - RPRT T1 - Making Confidential Data Part of Reproducible Research Y1 - 2017 A1 - Vilhuber, Lars A1 - Lagoze, Carl AB - Making Confidential Data Part of Reproducible Research Vilhuber, Lars; Lagoze, Carl Disclaimer and acknowledgements: While this column mentions the Census Bureau several times, any opinions and conclusions expressed herein are those of the authors and do not necessarily represent the views of the U.S. Census Bureau or the other statistical agencies mentioned herein. PB - Cornell University UR - http://hdl.handle.net/1813/52474 ER - TY - JOUR T1 - Making Confidential Data Part of Reproducible Research JF - Chance Y1 - 2017 A1 - Vilhuber, Lars A1 - Lagoze, Carl UR - http://chance.amstat.org/2017/09/reproducible-research/ ER - TY - RPRT T1 - NCRN Meeting Spring 2017 Y1 - 2017 A1 - Vilhuber, Lars AB - NCRN Meeting Spring 2017 Vilhuber, Lars PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52163 ER - TY - RPRT T1 - NCRN Meeting Spring 2017: Welcome Y1 - 2017 A1 - Vilhuber, Lars AB - NCRN Meeting Spring 2017: Welcome Vilhuber, Lars PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52163 ER - TY - RPRT T1 - NCRN Newsletter: Volume 3 - Issue 3 Y1 - 2017 A1 - Vilhuber, Lars A1 - Knight-Ingram, Dory AB - NCRN Newsletter: Volume 3 - Issue 3 Vilhuber, Lars; Knight-Ingram, Dory Overview of activities at NSF-Census Research Network nodes from December 2016 through February 2017. NCRN Newsletter Vol. 3, Issue 3: March 10, 2017 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/46686 ER - TY - RPRT T1 - NCRN Newsletter: Volume 3 - Issue 4 Y1 - 2017 A1 - Vilhuber, Lars A1 - Knight-Ingram, Dory AB - NCRN Newsletter: Volume 3 - Issue 4 Vilhuber, Lars; Knight-Ingram, Dory The NCRN Newsletter is published quarterly by the NCRN Coordinating Office. 
PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52259 ER - TY - RPRT T1 - Proceedings from the 2016 NSF–Sloan Workshop on Practical Privacy Y1 - 2017 A1 - Vilhuber, Lars A1 - Schmutte, Ian AB - Proceedings from the 2016 NSF–Sloan Workshop on Practical Privacy Vilhuber, Lars; Schmutte, Ian On October 14, 2016, we hosted a workshop that brought together economists, survey statisticians, and computer scientists with expertise in the field of privacy-preserving methods: Census Bureau staff working on implementing cutting-edge methods in the Bureau’s flagship public-use products mingled with academic researchers from a variety of universities. The four products discussed as part of the workshop were 1. the American Community Survey (ACS); 2. Longitudinal Employer-Household Data (LEHD), in particular the LEHD Origin-Destination Employment Statistics (LODES); 3. the 2020 Decennial Census; and 4. the 2017 Economic Census. The goal of the workshop was to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; 2. Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas. PB - Cornell University UR - http://hdl.handle.net/1813/46197 ER - TY - RPRT T1 - Proceedings from the 2017 Cornell-Census-NSF-Sloan Workshop on Practical Privacy Y1 - 2017 A1 - Vilhuber, Lars A1 - Schmutte, Ian M. AB - Proceedings from the 2017 Cornell-Census-NSF-Sloan Workshop on Practical Privacy Vilhuber, Lars; Schmutte, Ian M. These proceedings report on a workshop hosted at the U.S. Census Bureau on May 8, 2017. Our purpose was to gather experts from various backgrounds together to continue discussing the development of formal privacy systems for Census Bureau data products. This workshop was a successor to a previous workshop held in October 2016 (Vilhuber & Schmutte 2017). 
At our prior workshop, we hosted computer scientists, survey statisticians, and economists, all of whom were experts in data privacy. At that time we discussed the practical implementation of cutting-edge methods for publishing data with formal, provable privacy guarantees, with a focus on applications to Census Bureau data products. The teams developing those applications were just starting out when our first workshop took place, and we spent our time brainstorming solutions to the various problems researchers were encountering, or anticipated encountering. For these cutting-edge formal privacy models, there had been very little effort in the academic literature to apply those methods in real-world settings with large, messy data. We therefore brought together an expanded group of specialists from academia and government who could shed light on technical challenges, subject matter challenges and address how data users might react to changes in data availability and publishing standards. In May 2017, we organized a follow-up workshop, which these proceedings report on. We reviewed progress made in four different areas. The four topics discussed as part of the workshop were 1. the 2020 Decennial Census; 2. the American Community Survey (ACS); 3. the 2017 Economic Census; 4. measuring the demand for privacy and for data quality. As in our earlier workshop, our goals were to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; 2. Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas. Comments can be provided at https://goo.gl/ZAh3YE PB - Cornell University UR - http://hdl.handle.net/1813/52473 ER - TY - RPRT T1 - Proceedings from the Synthetic LBD International Seminar Y1 - 2017 A1 - Vilhuber, Lars A1 - Kinney, Saki A1 - Schmutte, Ian M. 
AB - Proceedings from the Synthetic LBD International Seminar Vilhuber, Lars; Kinney, Saki; Schmutte, Ian M. On May 9, 2017, we hosted a seminar to discuss the conditions necessary to implement the SynLBD approach with interested parties, with the goal of providing a straightforward toolkit to implement the same procedure on other data. The proceedings summarize the discussions during the workshop. PB - Cornell University UR - http://hdl.handle.net/1813/52472 ER - TY - RPRT T1 - Recalculating - How Uncertainty in Local Labor Market Definitions Affects Empirical Findings Y1 - 2017 A1 - Foote, Andrew A1 - Kutzbach, Mark J. A1 - Vilhuber, Lars AB - Recalculating - How Uncertainty in Local Labor Market Definitions Affects Empirical Findings Foote, Andrew; Kutzbach, Mark J.; Vilhuber, Lars This paper evaluates the use of commuting zones as a local labor market definition. We revisit Tolbert and Sizer (1996) and demonstrate the sensitivity of definitions to two features of the methodology. We show how these features impact empirical estimates using a well-known application of commuting zones. We conclude with advice to researchers using commuting zones on how to demonstrate the robustness of empirical findings to uncertainty in definitions. The analysis, conclusions, and opinions expressed herein are those of the author(s) alone and do not necessarily represent the views of the U.S. Census Bureau or the Federal Deposit Insurance Corporation. All results have been reviewed to ensure that no confidential information is disclosed, and no confidential data was used in this paper. This document is released to inform interested parties of ongoing research and to encourage discussion of work in progress. Much of the work developing this paper occurred while Mark Kutzbach was an employee of the U.S. Census Bureau. 
PB - Cornell University UR - http://hdl.handle.net/1813/52649 ER - TY - RPRT T1 - Two Perspectives on Commuting: A Comparison of Home to Work Flows Across Job-Linked Survey and Administrative Files Y1 - 2017 A1 - Green, Andrew A1 - Kutzbach, Mark J. A1 - Vilhuber, Lars AB - Two Perspectives on Commuting: A Comparison of Home to Work Flows Across Job-Linked Survey and Administrative Files Green, Andrew; Kutzbach, Mark J.; Vilhuber, Lars Commuting flows and workplace employment data have a wide constituency of users including urban and regional planners, social science and transportation researchers, and businesses. The U.S. Census Bureau releases two national data products that give the magnitude and characteristics of home to work flows. The American Community Survey (ACS) tabulates households’ responses on employment, workplace, and commuting behavior. The Longitudinal Employer-Household Dynamics (LEHD) program tabulates administrative records on jobs in the LEHD Origin-Destination Employment Statistics (LODES). Design differences across the datasets lead to divergence in a comparable statistic: county-to-county aggregate commute flows. To understand differences in the public use data, this study compares ACS and LEHD source files, using identifying information and probabilistic matching to join person and job records. In our assessment, we compare commuting statistics for job frames linked on person, employment status, employer, and workplace and we identify person and job characteristics as well as design features of the data frames that explain aggregate differences. We find a lower rate of within-county commuting and farther commutes in LODES. We attribute these greater distances to differences in workplace reporting and to uncertainty of establishment assignments in LEHD for workers at multi-unit employers. Minor contributing factors include differences in residence location and ACS workplace edits. 
The results of this analysis and the data infrastructure developed will support further work to understand and enhance commuting statistics in both datasets. PB - Cornell University UR - http://hdl.handle.net/1813/52611 ER - TY - RPRT T1 - Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics Y1 - 2017 A1 - Haney, Samuel A1 - Machanavajjhala, Ashwin A1 - Abowd, John M A1 - Graham, Matthew A1 - Kutzbach, Mark A1 - Vilhuber, Lars AB - Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics Haney, Samuel; Machanavajjhala, Ashwin; Abowd, John M; Graham, Matthew; Kutzbach, Mark; Vilhuber, Lars National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data is protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. 
An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ϵ≥1, the additive error introduced by our provably private algorithms is comparable to, and in some cases better than, the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional PB - Cornell University UR - http://hdl.handle.net/1813/49652 ER - TY - RPRT T1 - NCRN Meeting Fall 2016 Y1 - 2016 A1 - Vilhuber, Lars AB - NCRN Meeting Fall 2016 Vilhuber, Lars Held at the U.S. Census Bureau HQ, Washington DC. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/45885 ER - TY - RPRT T1 - NCRN Meeting Spring 2016 Y1 - 2016 A1 - Vilhuber, Lars AB - NCRN Meeting Spring 2016 Vilhuber, Lars Held at the U.S. Census Bureau HQ, Washington DC. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/45899 ER - TY - RPRT T1 - NCRN Newsletter: Volume 2 - Issue 4 Y1 - 2016 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB -

NCRN Newsletter: Volume 2 - Issue 4 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from September 2015 through December 2015. NCRN Newsletter Vol. 2, Issue 4: January 28, 2016.

PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/42394 ER - TY - RPRT T1 - NCRN Newsletter: Volume 3 - Issue 1 Y1 - 2016 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - NCRN Newsletter: Volume 3 - Issue 1 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from January 2016 through May 2016. NCRN Newsletter Vol. 3, Issue 1: June 10, 2016 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/44199 ER - TY - RPRT T1 - NCRN Newsletter: Volume 3 - Issue 2 Y1 - 2016 A1 - Vilhuber, Lars A1 - Knight-Ingram, Dory AB - NCRN Newsletter: Volume 3 - Issue 2 Vilhuber, Lars; Knight-Ingram, Dory Overview of activities at NSF-Census Research Network nodes from June 2016 through December 2016. NCRN Newsletter Vol. 3, Issue 2: December 23, 2016 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/46171 ER - TY - RPRT T1 - The NSF-Census Research Network in 2016: Taking stock, looking forward Y1 - 2016 A1 - Vilhuber, Lars AB - The NSF-Census Research Network in 2016: Taking stock, looking forward Vilhuber, Lars An overview of the activities of the NSF-Census Research Network as of 2016, given on Saturday, May 21, 2016, at a workshop on spatial and spatio-temporal design and analysis for official statistics, hosted by the Spatio-Temporal Statistics NSF Census Research Network (STSN) at the University of Missouri, and sponsored by the NSF-Census Research Network (NCRN) PB - University of Missouri UR - http://hdl.handle.net/1813/46210 ER - TY - JOUR T1 - Synthetic establishment microdata around the world JF - Statistical Journal of the International Association for Official Statistics Y1 - 2016 A1 - Vilhuber, Lars A1 - Abowd, John M. A1 - Reiter, Jerome P. 
KW - Business data KW - confidentiality KW - differential privacy KW - international comparison KW - Multiple imputation KW - synthetic AB - In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature. VL - 32 UR - http://content.iospress.com/download/statistical-journal-of-the-iaos/sji964 IS - 1 ER - TY - JOUR T1 - Using partially synthetic microdata to protect sensitive cells in business statistics JF - Statistical Journal of the International Association for Official Statistics Y1 - 2016 A1 - Miranda, Javier A1 - Vilhuber, Lars KW - confidentiality protection KW - gross job flows KW - local labor markets KW - Statistical Disclosure Limitation KW - Synthetic data KW - time-series AB - We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions). 
VL - 32 UR - http://content.iospress.com/download/statistical-journal-of-the-iaos/sji963 IS - 1 ER - TY - RPRT T1 - NCRN Meeting Spring 2015 Y1 - 2015 A1 - Vilhuber, Lars AB - NCRN Meeting Spring 2015 Vilhuber, Lars May 7 meetings @ U.S. Census Bureau, Washington DC. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/45867 ER - TY - Generic T1 - NCRN Meeting Spring 2015: Broadening data access through synthetic data Y1 - 2015 A1 - Vilhuber, Lars AB -

NCRN Meeting Spring 2015: Broadening data access through synthetic data Vilhuber, Lars Presentation at the NCRN Meeting Spring 2015

PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40185 ER - TY - RPRT T1 - NCRN Newsletter: Volume 2 - Issue 1 Y1 - 2015 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - NCRN Newsletter: Volume 2 - Issue 1 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from October 2014 to January 2015. NCRN Newsletter Vol. 2, Issue 1: January 30, 2015. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40193 ER - TY - RPRT T1 - NCRN Newsletter: Volume 2 - Issue 2 Y1 - 2015 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - NCRN Newsletter: Volume 2 - Issue 2 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from January 2015 to May 2015. NCRN Newsletter Vol. 2, Issue 2: May 12, 2015. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40194 ER - TY - RPRT T1 - NCRN Newsletter: Volume 2 - Issue 2 Y1 - 2015 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - NCRN Newsletter: Volume 2 - Issue 2 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from February 2015 to May 2015. NCRN Newsletter Vol. 2, Issue 2: May 12, 2015. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/44200 ER - TY - RPRT T1 - NCRN Newsletter: Volume 2 - Issue 3 Y1 - 2015 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB -

NCRN Newsletter: Volume 2 - Issue 3 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from June 2015 through August 2015. NCRN Newsletter Vol. 2, Issue 3: September 15, 2015.

PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/42393 ER - TY - RPRT T1 - Presentation: NADDI 2015: Crowdsourcing DDI Development: New Features from the CED2AR Project Y1 - 2015 A1 - Perry, Benjamin A1 - Kambhampaty, Venkata A1 - Brumsted, Kyle A1 - Vilhuber, Lars A1 - Block, William AB - Presentation: NADDI 2015: Crowdsourcing DDI Development: New Features from the CED2AR Project Perry, Benjamin; Kambhampaty, Venkata; Brumsted, Kyle; Vilhuber, Lars; Block, William Recent years have shown the power of user-sourced information evidenced by the success of Wikipedia and its many emulators. This sort of unstructured discussion is currently not feasible as a part of the otherwise successful metadata repositories. Creating and augmenting metadata is a labor-intensive endeavor. Harnessing collective knowledge from actual data users can supplement officially generated metadata. As part of our Comprehensive Extensible Data Documentation and Access Repository (CED2AR) infrastructure, we demonstrate a prototype of crowdsourced DDI, using DDI-C and supplemental XML. The system allows for any number of network connected instances (web or desktop deployments) of the CED2AR DDI editor to concurrently create and modify metadata. The backend transparently handles changes, and frontend has the ability to separate official edits (by designated curators of the data and the metadata) from crowd-sourced content. We briefly discuss offline edit contributions as well. CED2AR uses DDI-C and supplemental XML together with Git for a very portable and lightweight implementation. This distributed network implementation allows for large scale metadata curation without the need for a hardware intensive computing environment, and can leverage existing cloud services, such as Github or Bitbucket. Ben Perry (Cornell/NCRN) presents joint work with Venkata Kambhampaty, Kyle Brumsted, Lars Vilhuber, & William C. Block at NADDI 2015. 
PB - Cornell University UR - http://hdl.handle.net/1813/40172 ER - TY - RPRT T1 - Synthetic Establishment Microdata Around the World Y1 - 2015 A1 - Vilhuber, Lars A1 - Abowd, John M. A1 - Reiter, Jerome P. AB - Synthetic Establishment Microdata Around the World Vilhuber, Lars; Abowd, John M.; Reiter, Jerome P. In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature. PB - Cornell University UR - http://hdl.handle.net/1813/42340 ER - TY - RPRT T1 - Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics Y1 - 2015 A1 - Vilhuber, Lars A1 - Miranda, Javier AB - Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics Vilhuber, Lars; Miranda, Javier We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions). 
PB - Cornell University UR - http://hdl.handle.net/1813/42339 ER - TY - RPRT T1 - CED 2 AR: The Comprehensive Extensible Data Documentation and Access Repository Y1 - 2014 A1 - Lagoze, Carl A1 - Vilhuber, Lars A1 - Williams, Jeremy A1 - Perry, Benjamin A1 - Block, William C. AB - CED 2 AR: The Comprehensive Extensible Data Documentation and Access Repository Lagoze, Carl; Vilhuber, Lars; Williams, Jeremy; Perry, Benjamin; Block, William C. We describe the design, implementation, and deployment of the Comprehensive Extensible Data Documentation and Access Repository (CED 2 AR). This is a metadata repository system that allows researchers to search, browse, access, and cite confidential data and metadata through either a web-based user interface or programmatically through a search API, all the while reusing and linking to existing archive and provider generated metadata. CED 2 AR is distinguished from other metadata repository-based applications due to requirements that derive from its social science context. These include the need to cloak confidential data and metadata and manage complex provenance chains. Presented at 2014 IEEE/ACM Joint Conference on Digital Libraries (JCDL), Sept 8-12, 2014 PB - Cornell University UR - http://hdl.handle.net/1813/44702 ER - TY - RPRT T1 - Collaborative Editing of DDI Metadata: The Latest from the CED2AR Project Y1 - 2014 A1 - Perry, Benjamin A1 - Kambhampaty, Venkata A1 - Brumsted, Kyle A1 - Vilhuber, Lars A1 - Block, William AB - Collaborative Editing of DDI Metadata: The Latest from the CED2AR Project Perry, Benjamin; Kambhampaty, Venkata; Brumsted, Kyle; Vilhuber, Lars; Block, William Benjamin Perry's presentation on "Collaborative Editing and Versioning of DDI Metadata: The Latest from Cornell's NCRN CED²AR Software" at the 6th Annual European DDI User Conference in London, 12/02/2014. 
PB - Cornell University UR - http://hdl.handle.net/1813/38200 ER - TY - RPRT T1 - NCRN Meeting Fall 2014 Y1 - 2014 A1 - Vilhuber, Lars AB - NCRN Meeting Fall 2014 Vilhuber, Lars Held at the ILR NYC Conference Center. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/45868 ER - TY - RPRT T1 - NCRN Meeting Spring 2014 Y1 - 2014 A1 - Vilhuber, Lars AB - NCRN Meeting Spring 2014 Vilhuber, Lars Held at the Census Headquarters, Washington, DC. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/45869 ER - TY - RPRT T1 - NCRN Meeting Spring 2014: Integrating PROV with DDI: Mechanisms of Data Discovery within the U.S. Census Bureau Y1 - 2014 A1 - Block, William A1 - Brown, Warren A1 - Williams, Jeremy A1 - Vilhuber, Lars A1 - Lagoze, Carl AB - NCRN Meeting Spring 2014: Integrating PROV with DDI: Mechanisms of Data Discovery within the U.S. Census Bureau Block, William; Brown, Warren; Williams, Jeremy; Vilhuber, Lars; Lagoze, Carl Presentation at the NCRN Spring 2014 meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/36392 ER - TY - RPRT T1 - NCRN Meeting Spring 2014: Summer Working Group for Employer List Linking (SWELL) Y1 - 2014 A1 - Gathright, Graton A1 - Kutzbach, Mark A1 - McCue, Kristin A1 - McEntarfer, Erika A1 - Monti, Holly A1 - Trageser, Kelly A1 - Vilhuber, Lars A1 - Wasi, Nada A1 - Wignall, Christopher AB - NCRN Meeting Spring 2014: Summer Working Group for Employer List Linking (SWELL) Gathright, Graton; Kutzbach, Mark; McCue, Kristin; McEntarfer, Erika; Monti, Holly; Trageser, Kelly; Vilhuber, Lars; Wasi, Nada; Wignall, Christopher Presentation for the NCRN Spring 2014 meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/36396 ER - TY - RPRT T1 - NCRN Newsletter: Volume 1 - Issue 2 Y1 - 2014 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - NCRN Newsletter: Volume 1 - Issue 2 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie 
Overview of activities at NSF-Census Research Network nodes from November 2013 to March 2014. NCRN Newsletter Vol. 1, Issue 2: March 20, 2014 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40233 ER - TY - RPRT T1 - NCRN Newsletter: Volume 1 - Issue 3 Y1 - 2014 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - NCRN Newsletter: Volume 1 - Issue 3 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from March 2014 to July 2014. NCRN Newsletter Vol. 1, Issue 3: July 23, 2014 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40234 ER - TY - RPRT T1 - NCRN Newsletter: Volume 1 - Issue 4 Y1 - 2014 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - NCRN Newsletter: Volume 1 - Issue 4 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from July 2014 to October 2014. NCRN Newsletter Vol. 1, Issue 4: October 15, 2014 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40192 ER - TY - RPRT T1 - Using partially synthetic data to replace suppression in the Business Dynamics Statistics: early results Y1 - 2014 A1 - Miranda, Javier A1 - Vilhuber, Lars AB - Using partially synthetic data to replace suppression in the Business Dynamics Statistics: early results Miranda, Javier; Vilhuber, Lars The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as additional tabulations are being developed, at ever more detailed geographic levels, the number of suppressions increases dramatically. 
This paper explores the option of providing public-use data that are analytically valid and without suppressions, by leveraging synthetic data to replace observations in sensitive cells.
PB - Cornell University
UR - http://hdl.handle.net/1813/40852
ER -

TY - CONF
T1 - Encoding Provenance Metadata for Social Science Datasets
T2 - Metadata and Semantics Research
Y1 - 2013
A1 - Lagoze, Carl
A1 - Williams, Jeremy
A1 - Vilhuber, Lars
ED - Garoufallou, Emmanouel
ED - Greenberg, Jane
KW - DDI
KW - eSocial Science
KW - Metadata
KW - Provenance
JF - Metadata and Semantics Research
T3 - Communications in Computer and Information Science
PB - Springer International Publishing
VL - 390
SN - 978-3-319-03436-2
UR - http://dx.doi.org/10.1007/978-3-319-03437-9_13
ER -

TY - RPRT
T1 - Encoding Provenance of Social Science Data: Integrating PROV with DDI
Y1 - 2013
A1 - Lagoze, Carl
A1 - Block, William C
A1 - Williams, Jeremy
A1 - Abowd, John
A1 - Vilhuber, Lars
AB - Provenance is a key component of evaluating the integrity and reusability of data for scholarship. While recording and providing access to provenance has always been important, it is even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. The PROV model, developed under the auspices of the W3C, is a foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We report on the results of our experimentation with integrating the PROV model into the DDI metadata for a complex, but characteristic, example of social science data. We also present some preliminary thinking on how to visualize those graphs in the user interface.
Submitted to EDDI13, the 5th Annual European DDI User Conference, December 2013, Paris, France.
PB - Cornell University
UR - http://hdl.handle.net/1813/34443
ER -

TY - RPRT
T1 - Improving User Access to Metadata for Public and Restricted Use US Federal Statistical Files
Y1 - 2013
A1 - Block, William C.
A1 - Williams, Jeremy
A1 - Vilhuber, Lars
A1 - Lagoze, Carl
A1 - Brown, Warren
A1 - Abowd, John M.
AB - Presentation at NADDI 2013. This record has also been archived at http://kuscholarworks.ku.edu/dspace/handle/1808/11093.
PB - Cornell University
UR - http://hdl.handle.net/1813/33362
ER -

TY - RPRT
T1 - Managing Confidentiality and Provenance across Mixed Private and Publicly-Accessed Data and Metadata
Y1 - 2013
A1 - Vilhuber, Lars
A1 - Abowd, John
A1 - Block, William
A1 - Lagoze, Carl
A1 - Williams, Jeremy
AB - Social science researchers are increasingly interested in making use of confidential micro-data that contain linkages to the identities of people, corporations, etc. The value of this linking lies in the potential to join these identifiable entities with external data such as genome data, geospatial information, and the like. Leveraging these linkages is an essential aspect of “big data” scholarship. However, the utility of these confidential data for scholarship is compromised by the complex nature of their management and curation. This makes it difficult to fulfill US federal data management mandates and interferes with basic scholarly practices such as validation and reuse of existing results.
We describe in this paper our work on the CED2AR prototype, a first step in providing researchers with a tool that spans the confidential/publicly-accessible divide, making it possible for researchers to identify, search, access, and cite those data. The particular points of interest in our work are the cloaking of metadata fields and the expression of provenance chains. For the former, we make use of existing fields in the DDI (Data Documentation Initiative) specification and suggest some minor changes to the specification. For the latter problem, we investigate the integration of DDI with recent work by the W3C PROV working group that has developed a generalizable and extensible model for expressing data provenance.
PB - Cornell University
UR - http://hdl.handle.net/1813/34534
ER -

TY - RPRT
T1 - NCRN Meeting Spring 2013
Y1 - 2013
A1 - Vilhuber, Lars
AB - Held at the NISS Headquarters, Research Triangle Park, NC.
PB - NCRN Coordinating Office
UR - http://hdl.handle.net/1813/45870
ER -

TY - RPRT
T1 - NCRN Newsletter: Volume 1 - Issue 1
Y1 - 2013
A1 - Vilhuber, Lars
A1 - Karr, Alan
A1 - Reiter, Jerome
A1 - Abowd, John
A1 - Nunnelly, Jamie
AB - Overview of activities at NSF-Census Research Network nodes from July 2013 to November 2013. NCRN Newsletter Vol. 1, Issue 1: November 17, 2013
PB - NCRN Coordinating Office
UR - http://hdl.handle.net/1813/40232
ER -

TY - RPRT
T1 - Data Management of Confidential Data
Y1 - 2012
A1 - Lagoze, Carl
A1 - Block, William C.
A1 - Williams, Jeremy
A1 - Abowd, John M.
A1 - Vilhuber, Lars
AB - Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc.
The value of this data lies in the ability to join the identifiable entities with external data such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfill US federal data management mandates and interfering with basic scholarly practices such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access, and cite that data.
PB - Cornell University
UR - http://hdl.handle.net/1813/30924
ER -

TY - RPRT
T1 - An Early Prototype of the Comprehensive Extensible Data Documentation and Access Repository (CED2AR)
Y1 - 2012
A1 - Block, William C.
A1 - Williams, Jeremy
A1 - Abowd, John M.
A1 - Vilhuber, Lars
A1 - Lagoze, Carl
AB - This presentation will demonstrate the latest DDI-related technological developments of Cornell University’s $3 million NSF-Census Research Network (NCRN) award, dedicated to improving the documentation, discoverability, and accessibility of public and restricted data from the federal statistical system in the United States. The current internal name for our DDI-based system is the Comprehensive Extensible Data Documentation and Access Repository (CED²AR). CED²AR ingests metadata from heterogeneous sources and supports filtered synchronization between restricted and public metadata holdings. Currently-supported CED²AR “connector workflows” include mechanisms to ingest IPUMS, zero-observation files from the American Community Survey (DDI 2.1), and SIPP Synthetic Beta (DDI 1.2).
These disparate metadata sources are all transformed into a DDI 2.5 compliant form and stored in a single repository. In addition, we will demonstrate an extension to DDI 2.5 that allows for the labeling of elements within the schema to indicate confidentiality. This metadata can then be filtered, allowing the creation of derived public use metadata from an original confidential source. This repository is currently searchable online through a prototype application demonstrating the ability to search across previously heterogeneous metadata sources. Presentation at the 4th Annual European DDI User Conference (EDDI12), Norwegian Social Science Data Services, Bergen, Norway, 3 December 2012.
PB - Cornell University
UR - http://hdl.handle.net/1813/30922
ER -

TY - RPRT
T1 - Encoding Provenance Metadata for Social Science Datasets
Y1 - 2012
A1 - Lagoze, Carl
A1 - Williams, Jeremy
A1 - Vilhuber, Lars
AB - Recording provenance is a key requirement for data-centric scholarship, allowing researchers to evaluate the integrity of source data sets and reproduce, and thereby validate, results. Provenance has become even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. Recent work by the W3C on the PROV model provides the foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We apply that model to complex, but characteristic, provenance examples of social science data, describe scenarios that make scholarly use of those provenance descriptions, and propose a manner for encoding this provenance metadata within the widely-used DDI metadata standard. Submitted to the Metadata and Semantics Research (MTSR 2013) conference.
PB - Cornell University
UR - http://hdl.handle.net/1813/55327
ER -

TY - RPRT
T1 - NCRN Meeting Fall 2012
Y1 - 2012
A1 - Vilhuber, Lars
AB - Held at the Census Bureau Headquarters, Suitland, MD.
PB - NCRN Coordinating Office
UR - http://hdl.handle.net/1813/45884
ER -

TY - RPRT
T1 - The NSF-Census Research Network: Cornell Node
Y1 - 2012
A1 - Block, William C.
A1 - Lagoze, Carl
A1 - Vilhuber, Lars
A1 - Brown, Warren A.
A1 - Williams, Jeremy
A1 - Arguillas, Florio
AB - Cornell University has received a $3M NSF-Census Research Network (NCRN) award to improve the documentation and discoverability of both public and restricted data from the federal statistical system. The current internal name for this project is the Comprehensive Extensible Data Documentation and Access Repository (CED²AR). The diagram to the right provides a high-level architectural overview of the system to be implemented. The CED²AR will be based upon leading metadata standards such as the Data Documentation Initiative (DDI) and Statistical Data and Metadata eXchange (SDMX) and be flexibly designed to ingest documentation from a variety of source files. It will permit synchronization between the public and confidential instances of the repository. The scholarly community will be able to use the CED²AR as it would a conventional metadata repository, deprived only of the values of certain confidential information, but not their metadata. The authorized user, working on the secure Census Bureau network, could use the CED²AR with full information in authorized domains.
PB - Cornell University
UR - http://hdl.handle.net/1813/30925
ER -

TY - CHAP
T1 - A Proposed Solution to the Archiving and Curation of Confidential Scientific Inputs
T2 - Privacy in Statistical Databases
Y1 - 2012
A1 - Abowd, John M.
A1 - Vilhuber, Lars
A1 - Block, William
ED - Domingo-Ferrer, Josep
ED - Tinnirello, Ilenia
KW - Data Archive
KW - Data Curation
KW - Privacy-preserving Datamining
KW - Statistical Disclosure Limitation
JF - Privacy in Statistical Databases
T3 - Lecture Notes in Computer Science
PB - Springer Berlin Heidelberg
VL - 7556
SN - 978-3-642-33626-3
UR - http://dx.doi.org/10.1007/978-3-642-33627-0_17
ER -

TY - RPRT
T1 - NCRN Meeting Fall 2011
Y1 - 2011
A1 - Vilhuber, Lars
AB - Held at the Census Bureau Conference Center.
PB - NCRN Coordinating Office
UR - http://hdl.handle.net/1813/46201
ER -

TY - RPRT
T1 - A Proposed Solution to the Archiving and Curation of Confidential Scientific Inputs
Y1 - 2011
A1 - Abowd, John M.
A1 - Vilhuber, Lars
A1 - Block, William
AB - We develop the core of a method for solving the data archive and curation problem that confronts the custodians of restricted-access research data and the scientific users of such data. Our solution recognizes the dual protections afforded by physical security and access limitation protocols. It is based on extensible tools and can be easily incorporated into existing instructional materials.
PB - Cornell University
UR - http://hdl.handle.net/1813/30923
ER -