TY - JOUR T1 - Data fusion for correcting measurement errors Y1 - Submitted A1 - J. P. Reiter A1 - T. Schifeling A1 - M. De Yoreo AB - Often in surveys, key items are subject to measurement errors. Given just the data, it can be difficult to determine the distribution of this error process, and hence to obtain accurate inferences that involve the error-prone variables. In some settings, however, analysts have access to a data source on different individuals with high-quality measurements of the error-prone survey items. We present a data fusion framework for leveraging this information to improve inferences in the error-prone survey. The basic idea is to posit models about the rates at which individuals make errors, coupled with models for the values reported when errors are made. This can avoid the unrealistic assumption of conditional independence typically used in data fusion. We apply the approach to the reported values of educational attainment in the American Community Survey, using the National Survey of College Graduates as the high-quality data source. In doing so, we account for the informative sampling design used to select the National Survey of College Graduates. We also present a process for assessing the sensitivity of various analyses to different choices for the measurement error models. Supplemental material is available online. ER - TY - JOUR T1 - A framework for sharing confidential research data, applied to investigating differential pay by race in the U.S. government Y1 - Submitted A1 - Barrientos, A. F. A1 - Bolton, A. A1 - Balmat, T. A1 - Reiter, J. P. A1 - Machanavajjhala, A. A1 - Chen, Y. A1 - Kneifel, C. A1 - DeLong, M. A1 - de Figueiredo, J. M. AB - Data stewards seeking to provide access to large-scale social science data face a difficult challenge. They have to share data in ways that protect privacy and confidentiality, are informative for many analyses and purposes, and are relatively straightforward to use by data analysts. 
We present a framework for addressing this challenge. The framework uses an integrated system that includes fully synthetic data intended for wide access, coupled with means for approved users to access the confidential data via secure remote access solutions, glued together by verification servers that allow users to assess the quality of their analyses with the synthetic data. We apply this framework to data on the careers of employees of the U.S. federal government, studying differentials in pay by race. The integrated system performs as intended, allowing users to explore the synthetic data for potential pay differentials and learn through verifications which findings in the synthetic data hold up in the confidential data and which do not. We find differentials across races; for example, the gap between black and white female federal employees' pay increased over the time period. We present models for generating synthetic careers and differentially private algorithms for verification of regression results. ER - TY - JOUR T1 - Imputation in U.S. Manufacturing Data and Its Implications for Productivity Dispersion JF - Review of Economics and Statistics Y1 - Submitted A1 - T. Kirk White A1 - Jerome P. Reiter A1 - Amil Petrin AB - In the U.S. Census Bureau's 2002 and 2007 Censuses of Manufactures, 79% and 73% of observations, respectively, have imputed data for at least one variable used to compute total factor productivity (TFP). The Bureau primarily imputes for missing values using mean-imputation methods, which can reduce the true underlying variance of the imputed variables. For every variable entering TFP in 2002 and 2007, we show that the dispersion is significantly smaller in the Census mean-imputed versus the Census non-imputed data. 
As an alternative to mean imputation, we show how to use classification and regression trees (CART) to allow for a distribution of multiple possible imputed values drawn from other plants that the CART algorithm determines to be similar in terms of other observed variables. For 90% of the 473 industries in 2002 and 84% of the 471 industries in 2007, we find that TFP dispersion increases as we move from Census mean-imputed data to Census non-imputed data to CART-imputed data. UR - http://www.mitpressjournals.org/doi/abs/10.1162/REST_a_00678 ER - TY - JOUR T1 - Sequential identification of nonignorable missing data mechanisms JF - Statistica Sinica Y1 - Submitted A1 - Mauricio Sadinle A1 - Jerome P. Reiter KW - Identification KW - Missing not at random KW - Non-parametric saturated KW - Partial ignorability KW - Sensitivity analysis AB - With nonignorable missing data, likelihood-based inference should be based on the joint distribution of the study variables and their missingness indicators. These joint models cannot be estimated from the data alone, thus requiring the analyst to impose restrictions that make the models uniquely obtainable from the distribution of the observed data. We present an approach for constructing classes of identifiable nonignorable missing data models. The main idea is to use a sequence of carefully set up identifying assumptions, whereby we specify potentially different missingness mechanisms for different blocks of variables. We show that the procedure results in models with the desirable property of being non-parametric saturated. ER - TY - JOUR T1 - The Earned Income Tax Credit and Food Insecurity: Who Benefits? Y1 - forthcoming A1 - Shaefer, H.L. A1 - Wilson, R. 
ER - TY - JOUR T1 - The Response of Consumer Spending to Changes in Gasoline Prices Y1 - forthcoming A1 - Gelman, Michael A1 - Gorodnichenko, Yuriy A1 - Kariv, Shachar A1 - Koustas, Dmitri A1 - Shapiro, Matthew D A1 - Silverman, Daniel A1 - Tadelis, Steven AB - This paper estimates how overall consumer spending responds to changes in gasoline prices. It uses the differential impact across consumers of the sudden, large drop in gasoline prices in 2014 for identification. This estimation strategy is implemented using comprehensive, daily transaction-level data for a large panel of individuals. The estimated marginal propensity to consume (MPC) is approximately one, higher than estimates found in less comprehensive or less well-measured data. This estimate takes into account the elasticity of demand for gasoline and potential slow adjustment to changes in prices. The high MPC implies that changes in gasoline prices have large aggregate effects. ER - TY - JOUR T1 - Understanding Household Consumption and Saving Behavior using Account Data Y1 - forthcoming A1 - Gelman, Michael ER - TY - JOUR T1 - Earnings Inequality and Mobility Trends in the United States: Nationally Representative Estimates from Longitudinally Linked Employer-Employee Data JF - Journal of Labor Economics Y1 - 2018 A1 - John M. Abowd A1 - Kevin L. Mckinney A1 - Nellie Zhao AB - Using earnings data from the U.S. Census Bureau, this paper analyzes the role of the employer in explaining the rise in earnings inequality in the United States. We first establish a consistent frame of analysis appropriate for administrative data used to study earnings inequality. We show that the trends in earnings inequality in the administrative data from the Longitudinal Employer-Household Dynamics Program are inconsistent with other data sources when we do not correct for the presence of misused SSNs. After this correction to the worker frame, we analyze how the earnings distribution has changed in the last decade. 
We present a decomposition of the year-to-year changes in the earnings distribution from 2004-2013. Even when simplifying these flows to movements between the bottom 20%, the middle 60%, and the top 20% of the earnings distribution, about 20.5 million workers undergo a transition each year. Another 19.9 million move between employment and nonemployment. To understand the role of the firm in these transitions, we estimate a model for log earnings with additive fixed worker and firm effects using all jobs held by eligible workers from 2004-2013. We construct a composite log earnings firm component across all jobs for a worker in a given year and a non-firm component. We also construct a skill-type index. We show that, while the differences between working at a low- or middle-paying firm are relatively small, the gains from working at a top-paying firm are large. Specifically, the benefits of working for a high-paying firm are not only realized today, through higher earnings paid to the worker, but also persist through an increase in the probability of upward mobility. High-paying firms facilitate moving workers to the top of the earnings distribution and keeping them there. ER - TY - JOUR T1 - Sorting Between and Within Industries: A Testable Model of Assortative Matching JF - Annals of Economics and Statistics Y1 - 2018 A1 - John M. Abowd A1 - Francis Kramarz A1 - Sebastien Perez-Duarte A1 - Ian M. Schmutte ER - TY - JOUR T1 - Adaptively-Tuned Particle Swarm Optimization with Application to Spatial Design JF - Stat Y1 - 2017 A1 - Simpson, M. A1 - Wikle, C.K. A1 - Holan, S.H. AB - Particle swarm optimization (PSO) algorithms are a class of heuristic optimization algorithms that are attractive for complex optimization problems. We propose using PSO to solve spatial design problems, e.g., choosing new locations to add to an existing monitoring network. 
Additionally, we introduce two new classes of PSO algorithms that perform well in a wide variety of circumstances, called adaptively tuned PSO and adaptively tuned bare bones PSO. To illustrate these algorithms, we apply them to a common spatial design problem: choosing new locations to add to an existing monitoring network. Specifically, we consider a network in the Houston, TX, area for monitoring ambient ozone levels, which have been linked to out-of-hospital cardiac arrest rates. Published 2017. This article has been contributed to by US Government employees and their work is in the public domain in the USA. VL - 6 UR - http://onlinelibrary.wiley.com/doi/10.1002/sta4.142/abstract IS - 1 ER - TY - JOUR T1 - Bayesian estimation of bipartite matchings for record linkage JF - Journal of the American Statistical Association Y1 - 2017 A1 - Mauricio Sadinle KW - Assignment problem KW - Bayes estimate KW - Data matching KW - Fellegi-Sunter decision rule KW - Mixture model KW - Rejection option AB - The bipartite record linkage task consists of merging two disparate datafiles containing information on two overlapping sets of entities. This is non-trivial in the absence of unique identifiers, and it is important for a wide variety of applications given that it needs to be solved whenever we have to combine information from different sources. Most statistical techniques currently used for record linkage are derived from a seminal paper by Fellegi and Sunter (1969). These techniques usually assume independence in the matching statuses of record pairs to derive estimation procedures and optimal point estimators. We argue that this independence assumption is unreasonable and instead target a bipartite matching between the two datafiles as our parameter of interest. Bayesian implementations allow us to quantify uncertainty on the matching decisions and derive a variety of point estimators using different loss functions. 
We propose partial Bayes estimates that allow uncertain parts of the bipartite matching to be left unresolved. We evaluate our approach to record linkage using a variety of challenging scenarios and show that it outperforms the traditional methodology. We illustrate the advantages of our methods by merging two datafiles on casualties from the civil war of El Salvador. VL - 112 UR - http://amstat.tandfonline.com/doi/abs/10.1080/01621459.2016.1148612 IS - 518 ER - TY - JOUR T1 - Bayesian Hierarchical Multi-Population Multistate Jolly-Seber Models with Covariates: Application to the Pallid Sturgeon Population Assessment Program JF - Journal of the American Statistical Association Y1 - 2017 A1 - Wu, G. A1 - Holan, S.H. AB - Estimating abundance for multiple populations is of fundamental importance to many ecological monitoring programs. Equally important is quantifying the spatial distribution and characterizing the migratory behavior of target populations within the study domain. To achieve these goals, we propose a Bayesian hierarchical multi-population multistate Jolly–Seber model that incorporates covariates. The model is proposed using a state-space framework and has several distinct advantages. First, multiple populations within the same study area can be modeled simultaneously. As a consequence, it is possible to achieve improved parameter estimation by “borrowing strength” across different populations. In many cases, such as our motivating example involving endangered species, this borrowing of strength is crucial, as there is relatively less information for one of the populations under consideration. Second, in addition to accommodating covariate information, we develop a computationally efficient Markov chain Monte Carlo algorithm that requires no tuning. Importantly, the model we propose allows us to draw inference on each population as well as on multiple populations simultaneously. 
Finally, we demonstrate the effectiveness of our method through a motivating example of estimating the spatial distribution and migration of hatchery and wild populations of the endangered pallid sturgeon (Scaphirhynchus albus), using data from the Pallid Sturgeon Population Assessment Program on the Lower Missouri River. Supplementary materials for this article are available online. VL - 112 UR - http://www.tandfonline.com/doi/abs/10.1080/01621459.2016.1211531 IS - 518 ER - TY - JOUR T1 - The Cepstral Model for Multivariate Time Series: The Vector Exponential Model JF - Statistica Sinica Y1 - 2017 A1 - Holan, S.H. A1 - McElroy, T.S. A1 - Wu, G. KW - Autocovariance matrix KW - Bayesian estimation KW - Cepstral KW - Coherence KW - Spectral density matrix KW - stochastic search variable selection KW - Wold coefficients. AB - Vector autoregressive (VAR) models have become a staple in the analysis of multivariate time series and are formulated in the time domain as difference equations, with an implied covariance structure. In many contexts, it is desirable to work with a stable, or at least stationary, representation. To fit such models, one must impose restrictions on the coefficient matrices to ensure that certain determinants are nonzero; which, except in special cases, may prove burdensome. To circumvent these difficulties, we propose a flexible frequency domain model expressed in terms of the spectral density matrix. Specifically, this paper treats the modeling of covariance stationary vector-valued (i.e., multivariate) time series via an extension of the exponential model for the spectrum of a scalar time series. We discuss the modeling advantages of the vector exponential model and its computational facets, such as how to obtain Wold coefficients from given cepstral coefficients. 
Finally, we demonstrate the utility of our approach through simulation as well as two illustrative data examples focusing on multi-step ahead forecasting and estimation of squared coherence. VL - 27 UR - http://www3.stat.sinica.edu.tw/statistica/J27N1/J27N12/J27N12.html ER - TY - RPRT T1 - Computationally Efficient Multivariate Spatio-Temporal Models for High-Dimensional Count-Valued Data. (With Discussion). Y1 - 2017 A1 - Bradley, J.R. A1 - Holan, S.H. A1 - Wikle, C.K. KW - Aggregation KW - American Community Survey KW - Bayesian hierarchical model KW - Big Data KW - Longitudinal Employer-Household Dynamics (LEHD) program KW - Markov chain Monte Carlo KW - Non-Gaussian. KW - Quarterly Workforce Indicators AB - We introduce a Bayesian approach for multivariate spatio-temporal prediction for high-dimensional count-valued data. Our primary interest is when there are possibly millions of data points referenced over different variables, geographic regions, and times. This problem requires extensive methodological advancements, as jointly modeling correlated data of this size leads to the so-called "big n problem." The computational complexity of prediction in this setting is further exacerbated by acknowledging that count-valued data are naturally non-Gaussian. Thus, we develop a new computationally efficient distribution theory for this setting. In particular, we introduce a multivariate log-gamma distribution and provide substantial theoretical development including: results regarding conditional distributions, marginal distributions, an asymptotic relationship with the multivariate normal distribution, and full-conditional distributions for a Gibbs sampler. To incorporate dependence between variables, regions, and time points, a multivariate spatio-temporal mixed effects model (MSTM) is used. 
The results in this manuscript are extremely general, and can be used for data that exhibit fewer sources of dependency than what we consider (e.g., multivariate, spatial-only, or spatio-temporal-only data). Hence, the implications of our modeling framework may have a large impact on the general problem of jointly modeling correlated count-valued data. We show the effectiveness of our approach through a simulation study. Additionally, we demonstrate our proposed methodology with an important application analyzing data obtained from the Longitudinal Employer-Household Dynamics (LEHD) program, which is administered by the U.S. Census Bureau. JF - arXiv UR - https://arxiv.org/abs/1512.07273 ER - TY - JOUR T1 - Cost-Benefit Analysis for a Quinquennial Census: The 2016 Population Census of South Africa JF - Journal of Official Statistics Y1 - 2017 A1 - Spencer, Bruce D. A1 - May, Julian A1 - Kenyon, Steven A1 - Seeskin, Zachary KW - demographic statistics KW - fiscal allocations KW - loss function KW - population estimates KW - post-censal estimates AB - The question of whether to carry out a quinquennial Census is faced by national statistical offices in increasingly many countries, including Canada, Nigeria, Ireland, Australia, and South Africa. We describe uses and limitations of cost-benefit analysis in this decision problem in the case of the 2016 Census of South Africa. The government of South Africa needed to decide whether to conduct a 2016 Census or to rely on increasingly inaccurate postcensal estimates accounting for births, deaths, and migration since the previous (2011) Census. The cost-benefit analysis compared predicted costs of the 2016 Census to the benefits of improved allocation of intergovernmental revenue, which was considered by the government to be a critical use of the 2016 Census, although not the only important benefit. Without the 2016 Census, allocations would be based on population estimates. 
Accuracy of the postcensal estimates was estimated from the performance of past estimates, and the hypothetical expected reduction in errors in allocation due to the 2016 Census was estimated. A loss function was introduced to quantify the improvement in allocation. With this evidence, the government was able to decide not to conduct the 2016 Census, but instead to improve data and capacity for producing post-censal estimates. VL - 33 SN - 2001-7367 UR - https://www.degruyter.com/view/j/jos.2017.33.issue-1/jos-2017-0013/jos-2017-0013.xml IS - 1 ER - TY - CONF T1 - Differentially private regression diagnostics T2 - IEEE International Conference on Data Mining Y1 - 2017 A1 - Chen, Y. A1 - Machanavajjhala, A. A1 - Reiter, J. P. A1 - Barrientos, A. AB - Many data producers seek to provide users access to confidential data without unduly compromising data subjects' privacy and confidentiality. When intense redaction is needed to do so, one general strategy is to require users to do analyses without seeing the confidential data, for example, by releasing fully synthetic data or by allowing users to query remote systems for disclosure-protected outputs of statistical models. With fully synthetic data or redacted outputs, the analyst never really knows how much to trust the resulting findings. In particular, if the user did the same analysis on the confidential data, would regression coefficients of interest be statistically significant or not? We present algorithms for assessing this question that satisfy differential privacy. We describe conditions under which the algorithms should give accurate answers about statistical significance. We illustrate the properties of the methods using artificial and genuine data. 
JF - IEEE International Conference on Data Mining ER - TY - JOUR T1 - Dirichlet Process Mixture Models for Modeling and Generating Synthetic Versions of Nested Categorical Data JF - Bayesian Analysis Y1 - 2017 A1 - Hu, Jingchen A1 - Reiter, Jerome P A1 - Wang, Quanli AB - We present a Bayesian model for estimating the joint distribution of multivariate categorical data when units are nested within groups. Such data arise frequently in social science settings, for example, people living in households. The model assumes that (i) each group is a member of a group-level latent class, and (ii) each unit is a member of a unit-level latent class nested within its group-level latent class. This structure allows the model to capture dependence among units in the same group. It also facilitates simultaneous modeling of variables at both group and unit levels. We develop a version of the model that assigns zero probability to groups and units with physically impossible combinations of variables. We apply the model to estimate multivariate relationships in a subset of the American Community Survey. Using the estimated model, we generate synthetic household data that could be disseminated as redacted public use files. Supplementary materials (Hu et al., 2017) for this article are available online. UR - http://projecteuclid.org/euclid.ba/1485227030 ER - TY - JOUR T1 - Do Interviewer Post-survey Evaluations of Respondents Measure Who Respondents Are or What They Do? A Behavior Coding Study JF - Public Opinion Quarterly Y1 - 2017 A1 - Kirchner, Antje A1 - Olson, Kristen A1 - Smyth, Jolene D. AB - Survey interviewers are often tasked with assessing the quality of respondents’ answers after completing a survey interview. These interviewer observations have been used to proxy for measurement error in interviewer-administered surveys. How interviewers formulate these evaluations and how well they proxy for measurement error has received little empirical attention. 
According to dual-process theories of impression formation, individuals form impressions about others based on the social categories of the observed person (e.g., sex, race) and individual behaviors observed during an interaction. Although initial impressions start with heuristic, rule-of-thumb evaluations, systematic processing is characterized by extensive incorporation of available evidence. In a survey context, if interviewers default to heuristic information processing when evaluating respondent engagement, then we expect their evaluations to be primarily based on respondent characteristics and stereotypes associated with those characteristics. Under systematic processing, on the other hand, interviewers process and evaluate respondents based on observable respondent behaviors occurring during the question-answering process. We use the Work and Leisure Today Survey, including survey data and behavior codes, to examine proxy measures of heuristic and systematic processing by interviewers as predictors of interviewer postsurvey evaluations of respondents’ cooperativeness, interest, friendliness, and talkativeness. Our results indicate that CATI interviewers base their evaluations on actual behaviors during an interview (i.e., systematic processing) rather than perceived characteristics of the respondent or the interviewer (i.e., heuristic processing). These results are reassuring for the many surveys that collect interviewer observations as proxies for data quality. UR - https://doi.org/10.1093/poq/nfx026 ER - TY - JOUR T1 - Dynamic Question Ordering in Online Surveys JF - Journal of Official Statistics Y1 - 2017 A1 - Early, Kirstin A1 - Mankoff, Jennifer A1 - Fienberg, Stephen E. AB - Online surveys have the potential to support adaptive questions, where later questions depend on earlier responses. Past work has taken a rule-based approach, uniformly across all respondents. 
We envision a richer interpretation of adaptive questions, which we call dynamic question ordering (DQO), where question order is personalized. Such an approach could increase engagement, and therefore response rate, as well as imputation quality. We present a DQO framework to improve survey completion and imputation. In the general survey-taking setting, we want to maximize survey completion, and so we focus on ordering questions to engage the respondent and, ideally, collect all information, or at least the information that most characterizes the respondent, for accurate imputations. In another scenario, our goal is to provide a personalized prediction. Since it is possible to give reasonable predictions with only a subset of questions, we are not concerned with motivating users to answer all questions. Instead, we want to order questions to get information that reduces prediction uncertainty, while not being too burdensome. We illustrate this framework with an example of providing energy estimates to prospective tenants. We also discuss DQO for national surveys and consider connections between our statistics-based question-ordering approach and cognitive survey methodology. VL - 33 IS - 3 ER - TY - RPRT T1 - Earnings Inequality and Mobility Trends in the United States: Nationally Representative Estimates from Longitudinally Linked Employer-Employee Data Y1 - 2017 A1 - John M. Abowd A1 - Kevin L. Mckinney A1 - Nellie Zhao AB - Using earnings data from the U.S. Census Bureau, this paper analyzes the role of the employer in explaining the rise in earnings inequality in the United States. We first establish a consistent frame of analysis appropriate for administrative data used to study earnings inequality. We show that the trends in earnings inequality in the administrative data from the Longitudinal Employer-Household Dynamics Program are inconsistent with other data sources when we do not correct for the presence of misused SSNs. 
After this correction to the worker frame, we analyze how the earnings distribution has changed in the last decade. We present a decomposition of the year-to-year changes in the earnings distribution from 2004-2013. Even when simplifying these flows to movements between the bottom 20%, the middle 60%, and the top 20% of the earnings distribution, about 20.5 million workers undergo a transition each year. Another 19.9 million move between employment and nonemployment. To understand the role of the firm in these transitions, we estimate a model for log earnings with additive fixed worker and firm effects using all jobs held by eligible workers from 2004-2013. We construct a composite log earnings firm component across all jobs for a worker in a given year and a non-firm component. We also construct a skill-type index. We show that, while the differences between working at a low- or middle-paying firm are relatively small, the gains from working at a top-paying firm are large. Specifically, the benefits of working for a high-paying firm are not only realized today, through higher earnings paid to the worker, but also persist through an increase in the probability of upward mobility. High-paying firms facilitate moving workers to the top of the earnings distribution and keeping them there. UR - http://digitalcommons.ilr.cornell.edu/ldi/34/ ER - TY - RPRT T1 - Earnings Inequality and Mobility Trends in the United States: Nationally Representative Estimates from Longitudinally Linked Employer-Employee Data Y1 - 2017 A1 - Abowd, John M. A1 - McKinney, Kevin L. A1 - Zhao, Nellie AB - Using earnings data from the U.S. Census Bureau, this paper analyzes the role of the employer in explaining the rise in earnings inequality in the United States. 
We first establish a consistent frame of analysis appropriate for administrative data used to study earnings inequality. We show that the trends in earnings inequality in the administrative data from the Longitudinal Employer-Household Dynamics Program are inconsistent with other data sources when we do not correct for the presence of misused SSNs. After this correction to the worker frame, we analyze how the earnings distribution has changed in the last decade. We present a decomposition of the year-to-year changes in the earnings distribution from 2004-2013. Even when simplifying these flows to movements between the bottom 20%, the middle 60%, and the top 20% of the earnings distribution, about 20.5 million workers undergo a transition each year. Another 19.9 million move between employment and nonemployment. To understand the role of the firm in these transitions, we estimate a model for log earnings with additive fixed worker and firm effects using all jobs held by eligible workers from 2004-2013. We construct a composite log earnings firm component across all jobs for a worker in a given year and a non-firm component. We also construct a skill-type index. We show that, while the differences between working at a low- or middle-paying firm are relatively small, the gains from working at a top-paying firm are large. Specifically, the benefits of working for a high-paying firm are not only realized today, through higher earnings paid to the worker, but also persist through an increase in the probability of upward mobility. High-paying firms facilitate moving workers to the top of the earnings distribution and keeping them there. PB - Cornell University UR - http://hdl.handle.net/1813/52609 ER - TY - RPRT T1 - Effects of a Government-Academic Partnership: Has the NSF-Census Bureau Research Network Helped Secure the Future of the Federal Statistical System? Y1 - 2017 A1 - Weinberg, Daniel A1 - Abowd, John M. A1 - Belli, Robert F. 
A1 - Cressie, Noel A1 - Folch, David C. A1 - Holan, Scott H. A1 - Levenstein, Margaret C. A1 - Olson, Kristen M. A1 - Reiter, Jerome P. A1 - Shapiro, Matthew D. A1 - Smyth, Jolene A1 - Soh, Leen-Kiat A1 - Spencer, Bruce A1 - Spielman, Seth E. A1 - Vilhuber, Lars A1 - Wikle, Christopher AB -

The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodological questions of interest and significance to the broader research community and to the Federal Statistical System (FSS), particularly the Census Bureau. The activities to date have covered both fundamental and applied statistical research and have focused at least in part on the training of current and future generations of researchers in skills of relevance to surveys and alternative measurement of economic units, households, and persons. This paper discusses some of the key research findings of the eight nodes, organized into six topics: (1) Improving census and survey data collection methods; (2) Using alternative sources of data; (3) Protecting privacy and confidentiality by improving disclosure avoidance; (4) Using spatial and spatio-temporal statistical modeling to improve estimates; (5) Assessing data cost and quality tradeoffs; and (6) Combining information from multiple sources. It also reports on collaborations across nodes and with federal agencies, new software developed, and educational activities and outcomes. The paper concludes with an evaluation of the ability of the FSS to apply the NCRN’s research outcomes and suggests some next steps, as well as the implications of this research-network model for future federal government renewal initiatives. 
This paper began as a May 8, 2015 presentation to the National Academies of Sciences' Committee on National Statistics by two of the principal investigators of the National Science Foundation-Census Bureau Research Network (NCRN) – John Abowd and the late Steve Fienberg (Carnegie Mellon University). The authors acknowledge the contributions of the other principal investigators of the NCRN who are not co-authors of the paper (William Block, William Eddy, Alan Karr, Charles Manski, Nicholas Nagle, and Rebecca Nugent), the co-principal investigators, and the comments of Patrick Cantwell, Constance Citro, Adam Eck, Brian Harris-Kojetin, and Eloise Parker. We note with sorrow the deaths of Stephen Fienberg and Allan McCutcheon, two of the original NCRN principal investigators. The principal investigators also wish to acknowledge Cheryl Eavey’s sterling grant administration on behalf of the NSF. The conclusions reached in this paper are not the responsibility of the National Science Foundation (NSF), the Census Bureau, or any of the institutions to which the authors belong.

PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52650 ER - TY - JOUR T1 - An empirical comparison of multiple imputation methods for categorical data JF - The American Statistician Y1 - 2017 A1 - F. Li A1 - O. Akande A1 - J. P. Reiter KW - latent KW - missing KW - mixture KW - nonresponse KW - tree AB - Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet Process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and Bayesian mixture model approaches. They also suggest competing advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. Supplementary material for this article is available online. 
VL - 71 UR - http://www.tandfonline.com/doi/full/10.1080/00031305.2016.1277158 IS - 2 ER - TY - JOUR T1 - Examining Changes of Interview Length over the Course of the Field Period JF - Journal of Survey Statistics and Methodology Y1 - 2017 A1 - Kirchner, Antje A1 - Olson, Kristen AB - It is well established that interviewers learn behaviors both during training and on the job. How this learning occurs has received surprisingly little empirical attention: Is it driven by the interviewer herself or by the respondents she interviews? There are two competing hypotheses about what happens during field data collection: (1) interviewers learn behaviors from their previous interviews, and thus change their behavior in reaction to the behaviors previously encountered; and (2) interviewers encounter different types of and, especially, less cooperative respondents (i.e., nonresponse propensity affecting the measurement error situation), leading to changes in interview behaviors over the course of the field period. We refer to these hypotheses as the experience and response propensity hypotheses, respectively. This paper examines the relationship between proxy indicators for the experience and response propensity hypotheses on interview length using data and paradata from two telephone surveys. Our results indicate that both interviewer-driven experience and respondent-driven response propensity are associated with the length of interview. While general interviewing experience is nonsignificant, within-study experience decreases interview length significantly, even when accounting for changes in sample composition. Interviewers with higher cooperation rates have significantly shorter interviews in study one; however, this effect is mediated by the number of words spoken by the interviewer. We find that older respondents and male respondents have longer interviews despite controlling for the number of words spoken, as do respondents who complete the survey at first contact. 
Not surprisingly, interviews are significantly longer the more words interviewers and respondents speak. VL - 5 SN - 2325-0984 UR - http://dx.doi.org/10.1093/jssam/smw031 IS - 1 ER - TY - RPRT T1 - Formal Privacy Models and Title 13 Y1 - 2017 A1 - Nissim, Kobbi A1 - Gasser, Urs A1 - Smith, Adam A1 - Vadhan, Salil A1 - O'Brien, David A1 - Wood, Alexandra AB - A new collaboration between academia and the Census Bureau to further the Bureau’s use of formal privacy models. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52164 ER - TY - JOUR T1 - How Will Statistical Agencies Operate When All Data Are Private? JF - Journal of Privacy and Confidentiality Y1 - 2017 A1 - Abowd, John M AB -

The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it—compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations—blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies. 

PB - Cornell University VL - 7 UR - http://repository.cmu.edu/jpc/vol7/iss3/1/ IS - 3 ER - TY - JOUR T1 - Itemwise conditionally independent nonresponse modeling for incomplete multivariate data JF - Biometrika Y1 - 2017 A1 - M. Sadinle A1 - J.P. Reiter KW - Loglinear model KW - Missing not at random KW - Missingness mechanism KW - Nonignorable KW - Nonparametric saturated KW - Sensitivity analysis AB - We introduce a nonresponse mechanism for multivariate missing data in which each study variable and its nonresponse indicator are conditionally independent given the remaining variables and their nonresponse indicators. This is a nonignorable missingness mechanism, in that nonresponse for any item can depend on values of other items that are themselves missing. We show that, under this itemwise conditionally independent nonresponse assumption, one can define and identify nonparametric saturated classes of joint multivariate models for the study variables and their missingness indicators. We also show how to perform sensitivity analysis to violations of the conditional independence assumptions encoded by this missingness mechanism. Throughout, we illustrate the use of this modeling approach with data analyses. VL - 104 UR - https://doi.org/10.1093/biomet/asw063 IS - 1 ER - TY - JOUR T1 - Itemwise conditionally independent nonresponse modeling for multivariate categorical data JF - Biometrika Y1 - 2017 A1 - Sadinle, M. A1 - Reiter, J. P. KW - Identification KW - Missing not at random KW - Non-parametric saturated KW - Partial ignorability KW - Sensitivity analysis AB - With nonignorable missing data, likelihood-based inference should be based on the joint distribution of the study variables and their missingness indicators. These joint models cannot be estimated from the data alone, thus requiring the analyst to impose restrictions that make the models uniquely obtainable from the distribution of the observed data. 
We present an approach for constructing classes of identifiable nonignorable missing data models. The main idea is to use a sequence of carefully set up identifying assumptions, whereby we specify potentially different missingness mechanisms for different blocks of variables. We show that the procedure results in models with the desirable property of being non-parametric saturated. VL - 104 ER - TY - RPRT T1 - Making Confidential Data Part of Reproducible Research Y1 - 2017 A1 - Lars Vilhuber A1 - Carl Lagoze PB - Labor Dynamics Institute, Cornell University UR - http://digitalcommons.ilr.cornell.edu/ldi/41/ ER - TY - RPRT T1 - Making Confidential Data Part of Reproducible Research Y1 - 2017 A1 - Vilhuber, Lars A1 - Lagoze, Carl AB - Making Confidential Data Part of Reproducible Research Vilhuber, Lars; Lagoze, Carl Disclaimer and acknowledgements: While this column mentions the Census Bureau several times, any opinions and conclusions expressed herein are those of the authors and do not necessarily represent the views of the U.S. Census Bureau or the other statistical agencies mentioned herein. PB - Cornell University UR - http://hdl.handle.net/1813/52474 ER - TY - JOUR T1 - Making Confidential Data Part of Reproducible Research JF - Chance Y1 - 2017 A1 - Vilhuber, Lars A1 - Lagoze, Carl UR - http://chance.amstat.org/2017/09/reproducible-research/ ER - TY - JOUR T1 - Modeling Endogenous Mobility in Earnings Determination JF - Journal of Business & Economic Statistics Y1 - 2017 A1 - John M. Abowd A1 - Kevin L. Mckinney A1 - Ian M. Schmutte AB - We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. 
We relax exogenous mobility by modeling the matched data as an evolving bipartite graph using a Bayesian latent-type framework. Our results suggest that allowing endogenous mobility increases the variation in earnings explained by individual heterogeneity and reduces the proportion due to employer and match effects. To assess external validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The mobility-bias corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates. UR - http://dx.doi.org/10.1080/07350015.2017.1356727 ER - TY - RPRT T1 - Modeling Endogenous Mobility in Wage Determination Y1 - 2017 A1 - John M. Abowd A1 - Kevin L. Mckinney A1 - Ian M. Schmutte AB - We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax the exogenous mobility assumptions by modeling the evolution of the matched data as an evolving bipartite graph using a Bayesian latent class framework. Our results suggest that endogenous mobility biases estimated firm effects toward zero. To assess validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates. 
UR - http://digitalcommons.ilr.cornell.edu/ldi/28/ ER - TY - JOUR T1 - Multiple imputation of missing categorical and continuous outcomes via Bayesian mixture models with local dependence JF - Journal of the American Statistical Association Y1 - 2017 A1 - J. S. Murray A1 - J. P. Reiter KW - Hierarchical mixture model KW - Missing data KW - Nonparametric Bayes KW - Stick-breaking process AB - We present a nonparametric Bayesian joint model for multivariate continuous and categorical variables, with the intention of developing a flexible engine for multiple imputation of missing values. The model fuses Dirichlet process mixtures of multinomial distributions for categorical variables with Dirichlet process mixtures of multivariate normal distributions for continuous variables. We incorporate dependence between the continuous and categorical variables by (i) modeling the means of the normal distributions as component-specific functions of the categorical variables and (ii) forming distinct mixture components for the categorical and continuous data with probabilities that are linked via a hierarchical model. This structure allows the model to capture complex dependencies between the categorical and continuous data with minimal tuning by the analyst. We apply the model to impute missing values due to item nonresponse in an evaluation of the redesign of the Survey of Income and Program Participation (SIPP). The goal is to compare estimates from a field test with the new design to estimates from selected individuals from a panel collected under the old design. We show that accounting for the missing data changes some conclusions about the comparability of the distributions in the two datasets. We also perform an extensive repeated sampling simulation using similar data from complete cases in an existing SIPP panel, comparing our proposed model to a default application of multiple imputation by chained equations. 
Imputations based on the proposed model tend to have better repeated sampling properties than the default application of chained equations in this realistic setting. VL - 111 IS - 516 ER - TY - JOUR T1 - Multi-rubric Models for Ordinal Spatial Data with Application to Online Ratings from Yelp Y1 - 2017 A1 - Linero, A.R. A1 - Bradley, J.R. A1 - Desai, A. KW - Bayesian hierarchical model KW - Data augmentation KW - Nonparametric Bayes KW - ordinal data KW - recommender systems KW - spatial prediction AB - Interest in online rating data has increased in recent years. Such data consists of ordinal ratings of products or local businesses provided by users of a website, such as Yelp or Amazon. One source of heterogeneity in ratings is that users apply different standards when supplying their ratings; even if two users benefit from a product the same amount, they may translate their benefit into ratings in different ways. In this article we propose an ordinal data model, which we refer to as a multi-rubric model, which treats the criteria used to convert a latent utility into a rating as user-specific random effects, with the distribution of these random effects being modeled nonparametrically. We demonstrate that this approach is capable of accounting for this type of variability in addition to usual sources of heterogeneity due to item quality, user biases, interactions between items and users, and the spatial structure of the users and items. We apply the model developed here to publicly available data from the website Yelp and demonstrate that it produces interpretable clusterings of users according to their rating behavior, in addition to providing better predictions of ratings and better summaries of overall item quality. 
UR - https://arxiv.org/abs/1706.03012 ER - TY - RPRT T1 - NCRN Meeting Spring 2017 Y1 - 2017 A1 - Vilhuber, Lars PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52163 ER - TY - RPRT T1 - NCRN Meeting Spring 2017: Formal Privacy Models and Title 13 Y1 - 2017 A1 - Nissim, Kobbi A1 - Gasser, Urs A1 - Smith, Adam A1 - Vadhan, Salil A1 - O'Brien, David A1 - Wood, Alexandra AB - A new collaboration between academia and the Census Bureau to further the Bureau’s use of formal privacy models. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52164 ER - TY - RPRT T1 - NCRN Meeting Spring 2017: Welcome Y1 - 2017 A1 - Vilhuber, Lars PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52163 ER - TY - RPRT T1 - NCRN Newsletter: Volume 3 - Issue 3 Y1 - 2017 A1 - Vilhuber, Lars A1 - Knight-Ingram, Dory AB - Overview of activities at NSF-Census Research Network nodes from December 2016 through February 2017. NCRN Newsletter Vol. 3, Issue 3: March 10, 2017 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/46686 ER - TY - RPRT T1 - NCRN Newsletter: Volume 3 - Issue 4 Y1 - 2017 A1 - Vilhuber, Lars A1 - Knight-Ingram, Dory AB - The NCRN Newsletter is published quarterly by the NCRN Coordinating Office. 
PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52259 ER - TY - RPRT T1 - Presentation: Introduction to Stan for Markov Chain Monte Carlo Y1 - 2017 A1 - Simpson, Matthew AB - An introduction to Stan (http://mc-stan.org/): a probabilistic programming language that implements Hamiltonian Monte Carlo (HMC), variational Bayes, and (penalized) maximum likelihood estimation. Presentation given at the U.S. Census Bureau on April 25, 2017. PB - University of Missouri UR - http://hdl.handle.net/1813/52656 ER - TY - RPRT T1 - Proceedings from the 2016 NSF–Sloan Workshop on Practical Privacy Y1 - 2017 A1 - Vilhuber, Lars A1 - Schmutte, Ian AB - On October 14, 2016, we hosted a workshop that brought together economists, survey statisticians, and computer scientists with expertise in the field of privacy preserving methods: Census Bureau staff working on implementing cutting-edge methods in the Bureau’s flagship public-use products mingled with academic researchers from a variety of universities. The four products discussed as part of the workshop were 1. the American Community Survey (ACS); 2. Longitudinal Employer-Household Data (LEHD), in particular the LEHD Origin-Destination Employment Statistics (LODES); 3. the 2020 Decennial Census; and 4. the 2017 Economic Census. The goals of the workshop were to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; and 2. Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas. 
PB - Cornell University UR - http://hdl.handle.net/1813/46197 ER - TY - RPRT T1 - Proceedings from the 2017 Cornell-Census-NSF-Sloan Workshop on Practical Privacy Y1 - 2017 A1 - Vilhuber, Lars A1 - Schmutte, Ian M. AB - These proceedings report on a workshop hosted at the U.S. Census Bureau on May 8, 2017. Our purpose was to gather experts from various backgrounds together to continue discussing the development of formal privacy systems for Census Bureau data products. This workshop was a successor to a previous workshop held in October 2016 (Vilhuber & Schmutte 2017). At our prior workshop, we hosted computer scientists, survey statisticians, and economists, all of whom were experts in data privacy. At that time we discussed the practical implementation of cutting-edge methods for publishing data with formal, provable privacy guarantees, with a focus on applications to Census Bureau data products. The teams developing those applications were just starting out when our first workshop took place, and we spent our time brainstorming solutions to the various problems researchers were encountering, or anticipated encountering. For these cutting-edge formal privacy models, there had been very little effort in the academic literature to apply those methods in real-world settings with large, messy data. We therefore brought together an expanded group of specialists from academia and government who could shed light on technical challenges, subject matter challenges and address how data users might react to changes in data availability and publishing standards. In May 2017, we organized a follow-up workshop, which these proceedings report on. We reviewed progress made in four different areas. The four topics discussed as part of the workshop were 1. the 2020 Decennial Census; 2. the American Community Survey (ACS); 3. the 2017 Economic Census; 4. 
measuring the demand for privacy and for data quality. As in our earlier workshop, our goals were to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; 2. Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas. Comments can be provided at https://goo.gl/ZAh3YE PB - Cornell University UR - http://hdl.handle.net/1813/52473 ER - TY - RPRT T1 - Proceedings from the Synthetic LBD International Seminar Y1 - 2017 A1 - Vilhuber, Lars A1 - Kinney, Saki A1 - Schmutte, Ian M. AB - On May 9, 2017, we hosted a seminar to discuss the conditions necessary to implement the SynLBD approach with interested parties, with the goal of providing a straightforward toolkit to implement the same procedure on other data. The proceedings summarize the discussions during the workshop. PB - Cornell University UR - http://hdl.handle.net/1813/52472 ER - TY - RPRT T1 - Recalculating - How Uncertainty in Local Labor Market Definitions Affects Empirical Findings Y1 - 2017 A1 - Foote, Andrew A1 - Kutzbach, Mark J. A1 - Vilhuber, Lars AB - This paper evaluates the use of commuting zones as a local labor market definition. We revisit Tolbert and Sizer (1996) and demonstrate the sensitivity of definitions to two features of the methodology. We show how these features impact empirical estimates using a well-known application of commuting zones. We conclude with advice to researchers using commuting zones on how to demonstrate the robustness of empirical findings to uncertainty in definitions. 
The analysis, conclusions, and opinions expressed herein are those of the author(s) alone and do not necessarily represent the views of the U.S. Census Bureau or the Federal Deposit Insurance Corporation. All results have been reviewed to ensure that no confidential information is disclosed, and no confidential data was used in this paper. This document is released to inform interested parties of ongoing research and to encourage discussion of work in progress. Much of the work developing this paper occurred while Mark Kutzbach was an employee of the U.S. Census Bureau. PB - Cornell University UR - http://hdl.handle.net/1813/52649 ER - TY - JOUR T1 - Regionalization of Multiscale Spatial Processes using a Criterion for Spatial Aggregation Error JF - Journal of the Royal Statistical Society -- Series B. Y1 - 2017 A1 - Bradley, J.R. A1 - Wikle, C.K. A1 - Holan, S.H. KW - American Community Survey KW - empirical orthogonal functions KW - MAUP KW - Reduced rank KW - Spatial basis functions KW - Survey data AB - The modifiable areal unit problem and the ecological fallacy are known problems that occur when modeling multiscale spatial processes. We investigate how these forms of spatial aggregation error can guide a regionalization over a spatial domain of interest. By "regionalization" we mean a specification of geographies that define the spatial support for areal data. This topic has been studied vigorously by geographers, but has been given less attention by spatial statisticians. Thus, we propose a criterion for spatial aggregation error (CAGE), which we minimize to obtain an optimal regionalization. To define CAGE we draw a connection between spatial aggregation error and a new multiscale representation of the Karhunen-Loeve (K-L) expansion. 
This relationship between CAGE and the multiscale K-L expansion leads to illuminating theoretical developments including: connections between spatial aggregation error, squared prediction error, spatial variance, and a novel extension of Obled-Creutin eigenfunctions. The effectiveness of our approach is demonstrated through an analysis of two datasets, one using the American Community Survey and one related to environmental ocean winds. UR - https://arxiv.org/abs/1502.01974 ER - TY - RPRT T1 - Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods Y1 - 2017 A1 - John M. Abowd A1 - Ian M. Schmutte AB - We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently under-supplied by a private provider. Solving the appropriate social planner’s problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial. JF - Labor Dynamics Institute Document UR - http://digitalcommons.ilr.cornell.edu/ldi/37/ ER - TY - RPRT T1 - Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods Y1 - 2017 A1 - Abowd, John A1 - Schmutte, Ian M. 
AB - We consider the problem of the public release of statistical information about a population, explicitly accounting for the public-good properties of both data accuracy and privacy loss. We first consider the implications of adding the public-good component to recently published models of private data publication under differential privacy guarantees using a Vickrey-Clarke-Groves mechanism and a Lindahl mechanism. We show that data quality will be inefficiently under-supplied. Next, we develop a standard social planner’s problem using the technology set implied by (ε, δ)-differential privacy with (α, β)-accuracy for the Private Multiplicative Weights query release mechanism to study the properties of optimal provision of data accuracy and privacy loss when both are public goods. Using the production possibilities frontier implied by this technology, explicitly parameterized interdependent preferences, and the social welfare function, we display properties of the solution to the social planner’s problem. Our results directly quantify the optimal choice of data accuracy and privacy loss as functions of the technology and preference parameters. Some of these properties can be quantified using population statistics on marginal preferences and correlations between income, data accuracy preferences, and privacy loss preferences that are available from survey data. Our results show that government data custodians should publish more accurate statistics with weaker privacy guarantees than would occur with purely private data publishing. Our statistical results using the General Social Survey and the Cornell National Social Survey indicate that the welfare losses from under-providing data accuracy while over-providing privacy protection can be substantial. 
A complete archive of the data and programs used in this paper is available via http://doi.org/10.5281/zenodo.345385. PB - Cornell University UR - http://hdl.handle.net/1813/39081 ER - TY - RPRT T1 - Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods Y1 - 2017 A1 - Abowd, John A1 - Schmutte, Ian M. AB - We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently under-supplied by a private provider. Solving the appropriate social planner’s problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52612 ER - TY - JOUR T1 - The role of statistical disclosure limitation in total survey error JF - Total Survey Error in Practice Y1 - 2017 A1 - A. F. 
Karr KW - big data issues KW - data quality KW - data swapping KW - decision quality KW - risk-utility paradigms KW - Statistical Disclosure Limitation KW - total survey error AB - This chapter presents the thesis that statistical disclosure limitation (SDL) ought to be viewed as an integral component of total survey error (TSE). TSE and SDL will move forward together by integrating multiple criteria: cost, risk, data quality, and decision quality. The chapter explores the value of unifying two key TSE procedures - editing and imputation - with SDL. It discusses “Big data” issues, and contains a mathematical formulation that, at least conceptually and at some point in the future, does unify TSE and SDL. Modern approaches to SDL are based explicitly or implicitly on tradeoffs between disclosure risk and data utility. There are three principal classes of SDL methods: reduction/coarsening techniques; perturbative methods; and synthetic data methods. Data swapping is among the most frequently applied SDL methods for categorical data. The chapter sketches how it can be informed by knowledge of TSE. ER - TY - ABST T1 - Sequential Prediction of Respondent Behaviors Leading to Error in Web-based Surveys Y1 - 2017 A1 - Eck, Adam A1 - Soh, Leen-Kiat ER - TY - RPRT T1 - Sorting Between and Within Industries: A Testable Model of Assortative Matching Y1 - 2017 A1 - John M. Abowd A1 - Francis Kramarz A1 - Sebastien Perez-Duarte A1 - Ian M. Schmutte AB - We test Shimer's (2005) theory of the sorting of workers between and within industrial sectors based on directed search with coordination frictions, deliberately maintaining its static general equilibrium framework. We fit the model to sector-specific wage, vacancy and output data, including publicly-available statistics that characterize the distribution of worker and employer wage heterogeneity across sectors. Our empirical method is general and can be applied to a broad class of assignment models. 
The results indicate that industries are the loci of sorting: more productive workers are employed in more productive industries. The evidence confirms that strong assortative matching can be present even when worker and employer components of wage heterogeneity are weakly correlated. PB - Labor Dynamics Institute UR - http://digitalcommons.ilr.cornell.edu/ldi/40/ ER - TY - JOUR T1 - Stop or continue data collection: A nonignorable missing data approach for continuous variables JF - Journal of Official Statistics Y1 - 2017 A1 - T. Paiva A1 - J. P. Reiter AB - We present an approach to inform decisions about nonresponse follow-up sampling. The basic idea is (i) to create completed samples by imputing nonrespondents' data under various assumptions about the nonresponse mechanisms, (ii) to take hypothetical samples of varying sizes from the completed samples, and (iii) to compute and compare measures of accuracy and cost for different proposed sample sizes. As part of the methodology, we present a new approach for generating imputations for multivariate continuous data with nonignorable unit nonresponse. We fit mixtures of multivariate normal distributions to the respondents' data, and adjust the probabilities of the mixture components to generate nonrespondents' distributions with desired features. We illustrate the approaches using data from the 2007 U. S. Census of Manufactures. ER - TY - RPRT T1 - Two Perspectives on Commuting: A Comparison of Home to Work Flows Across Job-Linked Survey and Administrative Files Y1 - 2017 A1 - Green, Andrew A1 - Kutzbach, Mark J. A1 - Vilhuber, Lars AB - Commuting flows and workplace employment data have a wide constituency of users including urban and regional planners, social science and transportation researchers, and businesses. The U.S.
Census Bureau releases two national data products that give the magnitude and characteristics of home to work flows. The American Community Survey (ACS) tabulates households’ responses on employment, workplace, and commuting behavior. The Longitudinal Employer-Household Dynamics (LEHD) program tabulates administrative records on jobs in the LEHD Origin-Destination Employment Statistics (LODES). Design differences across the datasets lead to divergence in a comparable statistic: county-to-county aggregate commute flows. To understand differences in the public use data, this study compares ACS and LEHD source files, using identifying information and probabilistic matching to join person and job records. In our assessment, we compare commuting statistics for job frames linked on person, employment status, employer, and workplace, and we identify person and job characteristics as well as design features of the data frames that explain aggregate differences. We find a lower rate of within-county commuting and farther commutes in LODES. We attribute these greater distances to differences in workplace reporting and to uncertainty of establishment assignments in LEHD for workers at multi-unit employers. Minor contributing factors include differences in residence location and ACS workplace edits. The results of this analysis and the data infrastructure developed will support further work to understand and enhance commuting statistics in both datasets. PB - Cornell University UR - http://hdl.handle.net/1813/52611 ER - TY - RPRT T1 - Unique Entity Estimation with Application to the Syrian Conflict Y1 - 2017 A1 - Chen, B. A1 - Shrivastava, A. A1 - Steorts, R. C. KW - Computer Science - Data Structures and Algorithms KW - Computer Science - Databases KW - Statistics - Applications AB - Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability.
Nevertheless, entity resolution has tradeoffs regarding assumptions of the data generation process, error rates, and computational scalability that make it a difficult task for real applications. In this paper, we focus on a related problem of unique entity estimation, which is the task of estimating the unique number of entities and associated standard errors in a data set with duplicate entities. Unique entity estimation shares many fundamental challenges of entity resolution, namely, that the computational cost of all-to-all entity comparisons is intractable for large databases. To circumvent this computational barrier, we propose an efficient (near-linear time) estimation algorithm based on locality sensitive hashing. Our estimator, under realistic assumptions, is unbiased and has provably low variance compared to existing random-sampling-based approaches. In addition, we empirically show its superiority over the state-of-the-art estimators on three real applications. The motivation for our work is to derive an accurate estimate of the documented, identifiable deaths in the ongoing Syrian conflict. Our methodology, when applied to the Syrian data set, provides an estimate of $191,874 \pm 1772$ documented, identifiable deaths, which is very close to the Human Rights Data Analysis Group (HRDAG) estimate of 191,369. Our work provides an example of the challenges and efforts involved in solving a real, noisy, challenging problem where modeling assumptions may not hold. JF - arXiv UR - https://arxiv.org/abs/1710.02690 ER - TY - JOUR T1 - Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics JF - Proceedings of the 2017 ACM International Conference on Management of Data Y1 - 2017 A1 - Samuel Haney A1 - Ashwin Machanavajjhala A1 - John M. Abowd A1 - Matthew Graham A1 - Mark Kutzbach AB - National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data.
The privacy of both individuals and business establishments that feature in these data is protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ε≥ 1, the additive error introduced by our provably private algorithms is comparable to, and in some cases better than, the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL algorithms. Those queries are fodder for future research.
SN - 978-1-4503-4197-4 UR - http://dl.acm.org/citation.cfm?doid=3035918.3035940 ER - TY - RPRT T1 - Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics Y1 - 2017 A1 - Haney, Samuel A1 - Machanavajjhala, Ashwin A1 - Abowd, John M A1 - Graham, Matthew A1 - Kutzbach, Mark A1 - Vilhuber, Lars AB - National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data is protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ϵ≥1, the additive error introduced by our provably private algorithms is comparable to, and in some cases better than, the error introduced by existing SDL techniques that have no provable privacy guarantees.
For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL algorithms. PB - Cornell University UR - http://hdl.handle.net/1813/49652 ER - TY - JOUR T1 - Visualizing uncertainty in areal data estimates with bivariate choropleth maps, map pixelation, and glyph rotation JF - Stat Y1 - 2017 A1 - Lucchesi, L.R. A1 - Wikle, C.K. AB - In statistics, we quantify uncertainty to help determine the accuracy of estimates, yet this crucial piece of information is rarely included on maps visualizing areal data estimates. We develop and present three approaches to include uncertainty on maps: (1) the bivariate choropleth map repurposed to visualize uncertainty; (2) the pixelation of counties to include values within an estimate's margin of error; and (3) the rotation of a glyph, located at a county's centroid, to represent an estimate's uncertainty. The second method is presented as both a static map and visuanimation. We use American Community Survey estimates and their corresponding margins of error to demonstrate the methods and highlight the importance of visualizing uncertainty in areal data. An extensive online supplement provides the R code necessary to produce the maps presented in this article as well as alternative versions of them. VL - 6 UR - http://onlinelibrary.wiley.com/doi/10.1002/sta4.150/abstract IS - 1 ER - TY - RPRT T1 - 2017 Economic Census: Towards Synthetic Data Sets Y1 - 2016 A1 - Caldwell, Carol A1 - Thompson, Katherine Jenny PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52165 ER - TY - JOUR T1 - Assessing disclosure risks for synthetic data with arbitrary intruder knowledge JF - Statistical Journal of the International Association for Official Statistics Y1 - 2016 A1 - McClure, D. A1 - Reiter, J. P.
KW - confidentiality KW - Disclosure KW - risk KW - synthetic AB - Several statistical agencies release synthetic microdata, i.e., data with all confidential values replaced with draws from statistical models, in order to protect data subjects' confidentiality. While fully synthetic data are safe from record linkage attacks, intruders might be able to use the released synthetic values to estimate confidential values for individuals in the collected data. We demonstrate and investigate this potential risk using two simple but informative scenarios: a single continuous variable possibly with outliers, and a three-way contingency table possibly with small counts in some cells. Beginning with the case that the intruder knows all but one value in the confidential data, we examine the effect on risk of decreasing the number of observations the intruder knows beforehand. We generally find that releasing synthetic data (1) can pose little risk to records in the middle of the distribution, and (2) can pose some risks to extreme outliers, although arguably these risks are mild. We also find that the effect of removing observations from an intruder's background knowledge heavily depends on how well that intruder can fill in those missing observations: the risk remains fairly constant if he/she can fill them in well, and drops quickly if he/she cannot. VL - 32 UR - http://content.iospress.com/download/statistical-journal-of-the-iaos/sji957 IS - 1 ER - TY - JOUR T1 - A Bayesian nonparametric Markovian model for nonstationary time series JF - Statistics and Computing Y1 - 2016 A1 - De Yoreo, M. A1 - Kottas, A. KW - Autoregressive Models KW - Bayesian Nonparametrics KW - Dirichlet Process Mixtures KW - Markov chain Monte Carlo KW - Nonstationarity KW - Time Series AB - Stationary time series models built from parametric distributions are, in general, limited in scope due to the assumptions imposed on the residual distribution and autoregression relationship. 
We present a modeling approach for univariate time series data, which makes no assumptions of stationarity, and can accommodate complex dynamics and capture nonstandard distributions. The model for the transition density arises from the conditional distribution implied by a Bayesian nonparametric mixture of bivariate normals. This implies a flexible autoregressive form for the conditional transition density, defining a time-homogeneous, nonstationary, Markovian model for real-valued data indexed in discrete-time. To obtain a more computationally tractable algorithm for posterior inference, we utilize a square-root-free Cholesky decomposition of the mixture kernel covariance matrix. Results from simulated data suggest the model is able to recover challenging transition and predictive densities. We also illustrate the model on time intervals between eruptions of the Old Faithful geyser. Extensions to accommodate higher order structure and to develop a state-space model are also discussed. ER - TY - JOUR T1 - A Bayesian Approach to Graphical Record Linkage and Deduplication JF - Journal of the American Statistical Association Y1 - 2016 A1 - Rebecca C. Steorts A1 - Rob Hall A1 - Stephen E. Fienberg AB - ABSTRACTWe propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. 
Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household Income and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online. VL - 111 UR - http://dx.doi.org/10.1080/01621459.2015.1105807 ER - TY - JOUR T1 - Bayesian Hierarchical Models with Conjugate Full-Conditional Distributions for Dependent Data from the Natural Exponential Family JF - Journal of the American Statistical Association - T&M. Y1 - 2016 A1 - Bradley, J.R. A1 - Holan, S.H. A1 - Wikle, C.K. AB - We introduce a Bayesian approach for analyzing (possibly) high-dimensional dependent data that are distributed according to a member from the natural exponential family of distributions. This problem requires extensive methodological advancements, as jointly modeling high-dimensional dependent data leads to the so-called "big n problem." The computational complexity of the "big n problem" is further exacerbated when allowing for non-Gaussian data models, as is the case here. Thus, we develop new computationally efficient distribution theory for this setting. In particular, we introduce something we call the "conjugate multivariate distribution," which is motivated by the univariate distribution introduced in Diaconis and Ylvisaker (1979).
Furthermore, we provide substantial theoretical and methodological development including: results regarding conditional distributions, an asymptotic relationship with the multivariate normal distribution, conjugate prior distributions, and full-conditional distributions for a Gibbs sampler. The results in this manuscript are extremely general, and can be adapted to many different settings. We demonstrate the proposed methodology through simulated examples and analyses based on estimates obtained from the US Census Bureau's American Community Survey (ACS). UR - https://arxiv.org/abs/1701.07506 ER - TY - JOUR T1 - Bayesian latent pattern mixture models for handling attrition in panel studies with refreshment samples JF - Annals of Applied Statistics Y1 - 2016 A1 - Y. Si A1 - J. P. Reiter A1 - D. S. Hillygus VL - 10 UR - http://projecteuclid.org/euclid.aoas/1458909910 ER - TY - JOUR T1 - Bayesian Lattice Filters for Time-Varying Autoregression and Time-Frequency Analysis JF - Bayesian Analysis Y1 - 2016 A1 - Yang, W.H. A1 - Holan, S.H. A1 - Wikle, C.K. AB - Modeling nonstationary processes is of paramount importance to many scientific disciplines including environmental science, ecology, and finance, among others. Consequently, flexible methodology that provides accurate estimation across a wide range of processes is a subject of ongoing interest. We propose a novel approach to model-based time-frequency estimation using time-varying autoregressive models. In this context, we take a fully Bayesian approach and allow both the autoregressive coefficients and innovation variance to vary over time. Importantly, our estimation method uses the lattice filter and is cast within the partial autocorrelation domain. The marginal posterior distributions are of standard form and, as a convenient by-product of our estimation method, our approach avoids undesirable matrix inversions. As such, estimation is extremely computationally efficient and stable.
To illustrate the effectiveness of our approach, we conduct a comprehensive simulation study comparing our method with other competing methods and find that, in most cases, our approach is superior in terms of average squared error between the estimated and true time-varying spectral density. Lastly, we demonstrate our methodology through three modeling applications; namely, insect communication signals, environmental data (wind components), and macroeconomic data (US gross domestic product (GDP) and consumption). UR - https://arxiv.org/abs/1408.2757 ER - TY - RPRT T1 - Bayesian mixture modeling for multivariate conditional distributions Y1 - 2016 A1 - Maria DeYoreo A1 - Jerome P. Reiter AB - We present a Bayesian mixture model for estimating the joint distribution of mixed ordinal, nominal, and continuous data conditional on a set of fixed variables. The model uses multivariate normal and categorical mixture kernels for the random variables. It induces dependence between the random and fixed variables through the means of the multivariate normal mixture kernels and via a truncated local Dirichlet process. The latter encourages observations with similar values of the fixed variables to share mixture components. Using a simulation of data fusion, we illustrate that the model can estimate underlying relationships in the data and the distributions of the missing values more accurately than a mixture model applied to the random and fixed variables jointly. We use the model to analyze consumers' reading behaviors using a quota sample, i.e., a sample where the empirical distribution of some variables is fixed by design and so should not be modeled as random, conducted by the book publisher HarperCollins.
PB - ArXiv UR - http://arxiv.org/abs/1606.04457 ER - TY - RPRT T1 - A Bayesian nonparametric Markovian model for nonstationary time series Y1 - 2016 A1 - Maria DeYoreo A1 - Athanasios Kottas AB - Stationary time series models built from parametric distributions are, in general, limited in scope due to the assumptions imposed on the residual distribution and autoregression relationship. We present a modeling approach for univariate time series data, which makes no assumptions of stationarity, and can accommodate complex dynamics and capture nonstandard distributions. The model for the transition density arises from the conditional distribution implied by a Bayesian nonparametric mixture of bivariate normals. This implies a flexible autoregressive form for the conditional transition density, defining a time-homogeneous, nonstationary, Markovian model for real-valued data indexed in discrete-time. To obtain a more computationally tractable algorithm for posterior inference, we utilize a square-root-free Cholesky decomposition of the mixture kernel covariance matrix. Results from simulated data suggest the model is able to recover challenging transition and predictive densities. We also illustrate the model on time intervals between eruptions of the Old Faithful geyser. Extensions to accommodate higher order structure and to develop a state-space model are also discussed. PB - ArXiv UR - http://arxiv.org/abs/1601.04331 ER - TY - JOUR T1 - A Bayesian Partial Identification Approach to Inferring the Prevalence of Accounting Misconduct JF - Journal of the American Statistical Association Y1 - 2016 A1 - P. R. Hahn A1 - J. S. Murray A1 - I. Manolopoulou AB - This article describes the use of flexible Bayesian regression models for estimating a partially identified probability function. Our approach permits efficient sensitivity analysis concerning the posterior impact of priors on the partially identified component of the regression model. 
The new methodology is illustrated on an important problem where only partially observed data are available—inferring the prevalence of accounting misconduct among publicly traded U.S. businesses. Supplementary materials for this article are available online. VL - 111 UR - http://www.tandfonline.com/doi/full/10.1080/01621459.2015.1084307 IS - 513 ER - TY - JOUR T1 - Bayesian Simultaneous Edit and Imputation for Multivariate Categorical Data JF - Journal of the American Statistical Association Y1 - 2016 A1 - Daniel Manrique-Vallier A1 - Jerome P. Reiter AB - In categorical data, it is typically the case that some combinations of variables are theoretically impossible, such as a three year old child who is married or a man who is pregnant. In practice, however, reported values often include such structural zeros due to, for example, respondent mistakes or data processing errors. To purge data of such errors, many statistical organizations use a process known as edit-imputation. The basic idea is first to select reported values to change according to some heuristic or loss function, and second to replace those values with plausible imputations. This two-stage process typically does not fully utilize information in the data when determining locations of errors, nor does it appropriately reflect uncertainty resulting from the edits and imputations. We present an alternative approach to editing and imputation for categorical microdata with structural zeros that addresses these shortcomings. Specifically, we use a Bayesian hierarchical model that couples a stochastic model for the measurement error process with a Dirichlet process mixture of multinomial distributions for the underlying, error free values. The latter model is restricted to have support only on the set of theoretically possible combinations. We illustrate this integrated approach to editing and imputation using simulation studies with data from the 2000 U. S. 
census, and compare it to a two-stage edit-imputation routine. Supplementary material is available online. UR - http://dx.doi.org/10.1080/01621459.2016.1231612 ER - TY - JOUR T1 - Bayesian Spatial Change of Support for Count-Valued Survey Data with Application to the American Community Survey JF - Journal of the American Statistical Association Y1 - 2016 A1 - Bradley, J.R. A1 - Wikle, C.K. A1 - Holan, S.H. AB - We introduce Bayesian spatial change of support methodology for count-valued survey data with known survey variances. Our proposed methodology is motivated by the American Community Survey (ACS), an ongoing survey administered by the U.S. Census Bureau that provides timely information on several key demographic variables. Specifically, the ACS produces 1-year, 3-year, and 5-year "period-estimates," and corresponding margins of error, for published demographic and socio-economic variables recorded over predefined geographies within the United States. Despite the availability of these predefined geographies, it is often of interest to data users to specify customized user-defined spatial supports. In particular, it is useful to estimate demographic variables defined on "new" spatial supports in "real-time." This problem is known as spatial change of support (COS), which is typically performed under the assumption that the data follow a Gaussian distribution. However, count-valued survey data are naturally non-Gaussian and, hence, we consider modeling these data using a Poisson distribution. Additionally, survey data are often accompanied by estimates of error, which we incorporate into our analysis. We interpret Poisson count-valued data in small areas as an aggregation of events from a spatial point process. This approach provides us with the flexibility necessary to allow ACS users to consider a variety of spatial supports in "real-time."
We demonstrate the effectiveness of our approach through a simulated example as well as through an analysis using public-use ACS data. UR - https://arxiv.org/abs/1405.7227 ER - TY - JOUR T1 - Categorical data fusion using auxiliary information JF - Annals of Applied Statistics Y1 - 2016 A1 - B. K. Fosdick A1 - M. De Yoreo A1 - J. P. Reiter KW - Imputation KW - Integration KW - Latent Class KW - Matching AB - In data fusion, analysts seek to combine information from two databases comprising disjoint sets of individuals, in which some variables appear in both databases and other variables appear in only one database. Most data fusion techniques rely on variants of conditional independence assumptions. When inappropriate, these assumptions can result in unreliable inferences. We propose a data fusion technique that allows analysts to easily incorporate auxiliary information on the dependence structure of variables not observed jointly; we refer to this auxiliary information as glue. With this technique, we fuse two marketing surveys from the book publisher HarperCollins using glue from the online, rapid-response polling company CivicScience. The fused data enable estimation of associations between people's preferences for authors and for learning about new books. The analysis also serves as a case study on the potential for using online surveys to aid data fusion. VL - 10 UR - http://projecteuclid.org/euclid.aoas/1483606845 ER - TY - JOUR T1 - Computation of the Autocovariances for Time Series with Multiple Long-Range Persistencies JF - Computational Statistics and Data Analysis Y1 - 2016 A1 - McElroy, T.S. A1 - Holan, S.H. AB - Gegenbauer processes allow for flexible and convenient modeling of time series data with multiple spectral peaks, where the qualitative description of these peaks is via the concept of cyclical long-range dependence. The Gegenbauer class is extensive, including ARFIMA, seasonal ARFIMA, and GARMA processes as special cases.
Model estimation is challenging for Gegenbauer processes when multiple zeros and poles occur in the spectral density, because the autocovariance function is laborious to compute. The method of splitting, essentially computing autocovariances by convolving long memory and short memory dynamics, is only tractable when a single long memory pole exists. An additive decomposition of the spectrum into a sum of spectra is proposed, where each summand has a single singularity, so that a computationally efficient splitting method can be applied to each term and then aggregated. This approach differs from handling all the poles in the spectral density at once, via an analysis of truncation error. The proposed technique allows for fast estimation of time series with multiple long-range dependences, which is illustrated numerically and through several case-studies. UR - http://www.sciencedirect.com/science/article/pii/S0167947316300202 ER - TY - ABST T1 - Data management and analytic use of paradata: SIPP-EHC audit trails Y1 - 2016 A1 - Lee, Jinyoung A1 - Seloske, Ben A1 - Córdova Cazar, Ana Lucía A1 - Eck, Adam A1 - Kirchner, Antje A1 - Belli, Robert F. ER - TY - JOUR T1 - Differentially private publication of data on wages and job mobility JF - Statistical Journal of the International Association for Official Statistics Y1 - 2016 A1 - Schmutte, Ian M. KW - Demand for public statistics KW - differential privacy KW - job mobility KW - matched employer-employee data KW - optimal confidentiality protection KW - optimal data accuracy KW - technology for statistical agencies AB - Brazil, like many countries, is reluctant to publish business-level data, because of legitimate concerns about the establishments' confidentiality. A trusted data curator can increase the utility of data, while managing the risk to establishments, either by releasing synthetic data, or by infusing noise into published statistics.
This paper evaluates the application of a differentially private mechanism to publish statistics on wages and job mobility computed from Brazilian employer-employee matched data. The publication mechanism can result in both the publication of specific statistics and the generation of synthetic data. I find that the tradeoff between the privacy guaranteed to individuals in the data, and the accuracy of published statistics, is potentially much better than the worst-case theoretical accuracy guarantee. However, the synthetic data fare quite poorly in analyses outside the set of queries on which they were trained. Note that this article only explores and characterizes the feasibility of these publication strategies, and will not directly result in the publication of any data. VL - 32 UR - http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji962 IS - 1 ER - TY - RPRT T1 - Differentially Private Verification of Regression Model Results Y1 - 2016 A1 - Reiter, Jerry PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52167 ER - TY - JOUR T1 - Do Interviewers with High Cooperation Rates Behave Differently? Interviewer Cooperation Rates and Interview Behaviors JF - Survey Practice Y1 - 2016 A1 - Olson, Kristen A1 - Kirchner, Antje A1 - Smyth, Jolene D. AB - Interviewers are required to be flexible in responding to respondent concerns during recruitment, but standardized during administration of the questionnaire. These skill sets may be at odds. Recent research has shown a U-shaped relationship between interviewer cooperation rates and interviewer variance: the least and the most successful interviewers during recruitment have the largest interviewer variance components. Little is known about why this association occurs.
We posit four hypotheses for this association: 1) interviewers with higher cooperation rates are more conscientious interviewers altogether, 2) interviewers with higher cooperation rates continue to use rapport behaviors from the cooperation request throughout an interview, 3) interviewers with higher cooperation rates display more confidence which translates into different interview behavior, and 4) interviewers with higher cooperation rates continue their flexible interviewing style throughout the interview and deviate more from standardized interviewing. We use behavior codes from the Work and Leisure Today Survey (n=450, AAPOR RR3=6.3%) to evaluate interviewer behavior. Our results largely support the confidence hypothesis. Interviewers with higher cooperation rates do not show evidence of being “better” interviewers. VL - 9 UR - http://www.surveypractice.org/index.php/SurveyPractice/article/view/351 IS - 2 ER - TY - RPRT T1 - Estimating Compensating Wage Differentials with Endogenous Job Mobility Y1 - 2016 A1 - Kurt Lavetti A1 - Ian M. Schmutte AB - We demonstrate a strategy for using matched employer-employee data to correct endogenous job mobility bias when estimating compensating wage differentials. Applied to fatality rates in the census of formal-sector jobs in Brazil between 2003 and 2010, we show why common approaches to eliminating ability bias can greatly amplify endogenous job mobility bias. By extending the search-theoretic hedonic wage framework, we establish conditions necessary to interpret our estimates as preferences. We present empirical analyses supporting the predictions of the model and identifying conditions, demonstrating that the standard models are misspecified, and that our proposed model eliminates latent ability and endogenous mobility biases.
UR - http://digitalcommons.ilr.cornell.edu/ldi/29/ ER - TY - JOUR T1 - Generating Partially Synthetic Geocoded Public Use Data with Decreased Disclosure Risk Using Differential Smoothing JF - Journal of the Royal Statistical Society - Series A Y1 - 2016 A1 - Quick, H. A1 - Holan, S.H. A1 - Wikle, C.K. AB - When collecting geocoded confidential data with the intent to disseminate, agencies often resort to altering the geographies prior to making data publicly available due to data privacy obligations. An alternative to releasing aggregated and/or perturbed data is to release multiply-imputed synthetic data, where sensitive values are replaced with draws from statistical models designed to capture important distributional features in the collected data. One issue that has received relatively little attention, however, is how to handle spatially outlying observations in the collected data, as common spatial models often have a tendency to overfit these observations. The goal of this work is to bring this issue to the forefront and propose a solution, which we refer to as "differential smoothing." After implementing our method on simulated data, highlighting the effectiveness of our approach under various scenarios, we illustrate the framework using data consisting of sale prices of homes in San Francisco. UR - https://arxiv.org/abs/1507.05529 ER - TY - RPRT T1 - Hours Off the Clock Y1 - 2016 A1 - Green, Andrew AB - To what extent do workers work more hours than they are paid for? The relationship between hours worked and hours paid, and the conditions under which employers can demand more hours “off the clock,” is not well understood. The answer to this question impacts worker welfare, as well as wage and hour regulation. In addition, work off the clock has important implications for the measurement and cyclical movement of productivity and wages.
In this paper, I construct a unique administrative dataset of hours paid by employers linked to a survey of workers on their reported hours worked to measure work off the clock. Using cross-sectional variation in local labor markets, I find only a small cyclical component to work off the clock. The results point to labor hoarding rather than efficiency wage theory, indicating work off the clock cannot explain the counter-cyclical movement of productivity. I find that workers employed by small firms, and in industries with high rates of wage and hour violations, are associated with larger differences in hours worked than hours paid. These findings suggest the importance of tracking hours of work for enforcement of labor regulations. PB - Cornell University UR - http://hdl.handle.net/1813/52610 ER - TY - JOUR T1 - How Should We Define Low-Wage Work? An Analysis Using the Current Population Survey JF - Monthly Labor Review Y1 - 2016 A1 - Fusaro, V. A1 - Shaefer, H. Luke AB - Low-wage work is a central concept in considerable research, yet it lacks an agreed-upon definition. Using data from the Current Population Survey’s Annual Social and Economic Supplement, the analysis presented in this article suggests that defining low-wage work on the basis of alternative hourly wage cutoffs changes the size of the low-wage population, but does not noticeably alter time trends in the rate of change. The analysis also indicates that different definitions capture groups of workers with substantively different demographic, social, and economic characteristics. Although the individuals in any of the categories examined might reasonably be considered low-wage workers, a single definition obscures these distinctions. UR - http://www.bls.gov/opub/mlr/2016/article/pdf/how-should-we-define-low-wage-work.pdf ER - TY - RPRT T1 - How Will Statistical Agencies Operate When All Data Are Private? Y1 - 2016 A1 - Abowd, John M. AB -
The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it—compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations—blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies. PB - Cornell University UR - http://hdl.handle.net/1813/44663 ER - TY - JOUR T1 - Incorporating marginal prior information into latent class models JF - Bayesian Analysis Y1 - 2016 A1 - Schifeling, T. S. A1 - Reiter, J. P. VL - 11 UR - https://projecteuclid.org/euclid.ba/1434649584 ER - TY - JOUR T1 - Measuring Poverty Using the Supplemental Poverty Measure in the Panel Study of Income Dynamics, 1998 to 2010 JF - Journal of Economic and Social Measurement Y1 - 2016 A1 - Kimberlin, S. A1 - Shaefer, H.L. A1 - Kim, J. AB - The Supplemental Poverty Measure (SPM) was recently introduced by the U.S. Census Bureau as an alternative measure of poverty that addresses many shortcomings of the official poverty measure (OPM) to better reflect the resources households have available to meet their basic needs. The Census SPM is available only in the Current Population Survey (CPS). This paper describes a method for constructing SPM poverty estimates in the Panel Study of Income Dynamics (PSID), for the biennial years 1998 through 2010.
A public-use dataset of individual-level SPM status produced in this analysis will be available for download on the PSID website. Annual SPM poverty estimates from the PSID are presented for the years 1998, 2000, 2002, 2004, 2006, 2008, and 2010 and compared to SPM estimates for the same years derived from CPS data by the Census Bureau and independent researchers. We find that SPM poverty rates in the PSID are somewhat lower than those found in the CPS, though trends over time and the impact of specific SPM components are similar across the two datasets. VL - 41 UR - http://content.iospress.com/articles/journal-of-economic-and-social-measurement/jem425 IS - 1 ER - TY - ABST T1 - Mismatches Y1 - 2016 A1 - Smyth, Jolene A1 - Olson, Kristen ER - TY - RPRT T1 - Modeling Endogenous Mobility in Earnings Determination Y1 - 2016 A1 - Abowd, John M. A1 - McKinney, Kevin L. A1 - Schmutte, Ian M. AB - We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax the exogenous mobility assumptions by modeling the evolution of the matched data as an evolving bipartite graph using a Bayesian latent class framework. Our results suggest that endogenous mobility biases estimated firm effects toward zero. To assess validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates.
Replication code can be found at DOI: http://doi.org/10.5281/zenodo.376600 and our GitHub repository endogenous-mobility-replication. PB - Cornell University UR - http://hdl.handle.net/1813/40306 ER - TY - JOUR T1 - Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models with Local Dependence JF - Journal of the American Statistical Association Y1 - 2016 A1 - Jared S. Murray A1 - Jerome P. Reiter AB - We present a nonparametric Bayesian joint model for multivariate continuous and categorical variables, with the intention of developing a flexible engine for multiple imputation of missing values. The model fuses Dirichlet process mixtures of multinomial distributions for categorical variables with Dirichlet process mixtures of multivariate normal distributions for continuous variables. We incorporate dependence between the continuous and categorical variables by (i) modeling the means of the normal distributions as component-specific functions of the categorical variables and (ii) forming distinct mixture components for the categorical and continuous data with probabilities that are linked via a hierarchical model. This structure allows the model to capture complex dependencies between the categorical and continuous data with minimal tuning by the analyst. We apply the model to impute missing values due to item nonresponse in an evaluation of the redesign of the Survey of Income and Program Participation (SIPP). The goal is to compare estimates from a field test with the new design to estimates from selected individuals from a panel collected under the old design. We show that accounting for the missing data changes some conclusions about the comparability of the distributions in the two datasets. We also perform an extensive repeated sampling simulation using similar data from complete cases in an existing SIPP panel, comparing our proposed model to a default application of multiple imputation by chained equations.
Imputations based on the proposed model tend to have better repeated sampling properties than the default application of chained equations in this realistic setting. UR - http://dx.doi.org/10.1080/01621459.2016.1174132 ER - TY - JOUR T1 - Multivariate Spatio-Temporal Survey Fusion with Application to the American Community Survey and Local Area Unemployment Statistics JF - Stat Y1 - 2016 A1 - Bradley, J.R. A1 - Holan, S.H. A1 - Wikle, C.K AB - There are often multiple surveys available that estimate and report related demographic variables of interest that are referenced over space and/or time. Not all surveys produce the same information, and thus, combining these surveys typically leads to higher quality estimates. That is, not every survey has the same level of precision nor do they always provide estimates of the same variables. In addition, various surveys often produce estimates with incomplete spatio-temporal coverage. By combining surveys using a Bayesian approach, we can account for different margins of error and leverage dependencies to produce estimates of every variable considered at every spatial location and every time point. Specifically, our strategy is to use a hierarchical modelling approach, where the first stage of the model incorporates the margin of error associated with each survey. Then, in a lower stage of the hierarchical model, the multivariate spatio-temporal mixed effects model is used to incorporate multivariate spatio-temporal dependencies of the processes of interest. We adopt a fully Bayesian approach for combining surveys; that is, given all of the available surveys, the conditional distributions of the latent processes of interest are used for statistical inference. To demonstrate our proposed methodology, we jointly analyze period estimates from the US Census Bureau's American Community Survey, and estimates obtained from the Bureau of Labor Statistics Local Area Unemployment Statistics program. 
Copyright © 2016 John Wiley & Sons, Ltd. UR - http://onlinelibrary.wiley.com/doi/10.1002/sta4.120/full ER - TY - RPRT T1 - NCRN Meeting Fall 2016 Y1 - 2016 A1 - Vilhuber, Lars AB - Held at the U.S. Census Bureau HQ, Washington, DC. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/45885 ER - TY - RPRT T1 - NCRN Meeting Fall 2016: Audit Trails, Parallel Navigation, and the SIPP Y1 - 2016 A1 - Lee, Jinyoung AB - Thanks to Dr. Robert Belli, Ana Lucía Córdova Cazar, and Ben Seloske for the team effort. PB - University of Nebraska UR - http://hdl.handle.net/1813/45823 ER - TY - RPRT T1 - NCRN Meeting Fall 2016: Scanner Data and Economic Statistics: A Unified Approach Y1 - 2016 A1 - Redding, Stephen J. A1 - Weinstein, David E. AB - NCRN Meeting Fall 2016: Scanner Data and Economic Statistics: A Unified Approach Redding, Stephen J.; Weinstein, David E. PB - University of Michigan UR - http://hdl.handle.net/1813/45821 ER - TY - RPRT T1 - NCRN Meeting Spring 2016 Y1 - 2016 A1 - Vilhuber, Lars AB - Held at the U.S. Census Bureau HQ, Washington, DC. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/45899 ER - TY - RPRT T1 - NCRN Meeting Spring 2016: A 2016 View of 2020 Census Quality, Costs, Benefits Y1 - 2016 A1 - Spencer, Bruce D. AB - Census costs affect data quality and data quality affects census benefits. Although measuring census data quality is difficult enough ex post, census planning requires it to be done well in advance. The topic of this talk is the prediction of the cost-quality curve, its uncertainty, and its relation to benefits from census data.
Presented at the NCRN Meeting Spring 2016 in Washington DC on May 9-10, 2016; see http://www.ncrn.info/event/ncrn-spring-2016-meeting PB - Northwestern University UR - http://hdl.handle.net/1813/43897 ER - TY - RPRT T1 - NCRN Meeting Spring 2016: Attitudes Towards Geolocation-Enabled Census Forms Y1 - 2016 A1 - Brandimarte, Laura A1 - Chiew, Ernest A1 - Ventura, Sam A1 - Acquisti, Alessandro AB - Geolocation refers to the automatic identification of the physical locations of Internet users. In an online survey experiment, we studied respondent reactions towards different types of geolocation. After coordinating with US Census Bureau researchers, we designed and administered a replica of a census form to a sample of respondents. We also created slightly different forms by manipulating the type of geolocation implemented. Using the IP address of each respondent, we approximated the geographical coordinates of the respondent and displayed this location on a map on the survey. Across different experimental conditions, we varied the map interface among three Google Maps API views: the default road map, Satellite View, and Street View. We also provided either a specific, pinpointed location, or a set of two circles of 1- and 2-mile radius. Snapshots of responses were captured at every instant information was added, altered, or deleted by respondents when completing the survey. We measured willingness to provide information on the typical Census form, as well as privacy concerns associated with geolocation technologies and attitudes towards the use of online geographical maps to identify one’s exact current location.
Presented at the NCRN Meeting Spring 2016 in Washington DC on May 9-10, 2016; see http://www.ncrn.info/event/ncrn-spring-2016-meeting PB - Carnegie-Mellon University UR - http://hdl.handle.net/1813/43889 ER - TY - RPRT T1 - NCRN Meeting Spring 2016: Developing job linkages for the Health and Retirement Study Y1 - 2016 A1 - Mccue, Kristin A1 - Abowd, John A1 - Levenstein, Margaret A1 - Patki, Dhiren A1 - Rodgers, Ann A1 - Shapiro, Matthew A1 - Wasi, Nada AB - This paper documents work using probabilistic record linkage to create a crosswalk between jobs reported in the Health and Retirement Study (HRS) and the list of workplaces on Census Bureau’s Business Register. Matching job records provides an opportunity to join variables that occur uniquely in separate datasets, to validate responses, and to develop missing data imputation models. Identifying the respondent’s workplace (“establishment”) is valuable for HRS because it allows researchers to incorporate the effects of particular social, economic, and geospatial work environments in studies of respondent health and retirement behavior. The linkage makes use of name and address standardizing techniques tailored to business data that were recently developed in a collaboration between researchers at Census, Cornell, and the University of Michigan. The matching protocol makes no use of the identity of the HRS respondent and strictly protects the confidentiality of information about the respondent’s employer. The paper first describes the clerical review process used to create a set of human-reviewed candidate pairs, and use of that set to train matching models. It then describes and compares several linking strategies that make use of employer name, address, and phone number.
Finally, it discusses alternative ways of incorporating information on match uncertainty into estimates based on the linked data, and illustrates their use with a preliminary sample of matched HRS jobs. Presented at the NCRN Meeting Spring 2016 in Washington DC on May 9-10, 2016; see http://www.ncrn.info/event/ncrn-spring-2016-meeting PB - University of Michigan UR - http://hdl.handle.net/1813/43895 ER - TY - RPRT T1 - NCRN Meeting Spring 2016: Evaluating Data quality in Time Diary Surveys Using Paradata Y1 - 2016 A1 - Córdova Cazar, Ana Lucía A1 - Belli, Robert AB - Over the past decades, time use researchers have been increasingly interested in analyzing wellbeing in tandem with the use of time (Juster and Stafford, 1985; Krueger et al, 2009). Many methodological issues have arisen in this endeavor, including the concern about the quality of the time use data. Survey researchers have increasingly turned to the analysis of paradata to better understand and model data quality. In particular, it has been argued that paradata may serve as a proxy of the respondents’ cognitive response process, and can be used as an additional tool to assess the impact of data generation on data quality. In this presentation, data quality in the American Time Use Survey (ATUS) will be assessed through the use of paradata and survey responses. Specifically, I will talk about a data quality index I have created, which includes measures of different types of ATUS errors (e.g. low number of reported activities, failures to report an activity), and paradata variables (e.g. response latencies, incompletes). The overall objective of this study is to contribute to data quality assessment in the collection of timeline data from national surveys by providing insights on those interviewing dynamics that most impact data quality.
These insights will help to improve future instruments and training of interviewers, as well as to reduce costs. Presented at the NCRN Meeting Spring 2016 in Washington DC on May 9-10, 2016; see http://www.ncrn.info/event/ncrn-spring-2016-meeting PB - University of Nebraska UR - http://hdl.handle.net/1813/43896 ER - TY - RPRT T1 - NCRN Meeting Spring 2016: The ATUS and SIPP-EHC: Recent Developments Y1 - 2016 A1 - Belli, Robert F. AB - One of the main objectives of the NCRN award to the University of Nebraska node is to investigate data quality associated with timeline interviewing as conducted with the American Time Use Survey (ATUS) time diary and the Survey of Income and Program Participation event history calendar (SIPP-EHC). Specifically, our efforts are focused on the relationships between interviewing dynamics as extracted from analyses of paradata with measures of data quality. With the ATUS, our recent efforts have revealed that respondents differ in how they handle difficulty with remembering activities, with some overcoming these difficulties and others succumbing to them. With the SIPP-EHC, we are still in the initial stages of extracting variables from the paradata that are associated with interviewing dynamics. Our work has also involved the development of a CATI time diary in which we are able to analyze audio streams to capture interviewing dynamics. I will conclude this talk by discussing challenges that have yet to be overcome with our work, and our vision of moving forward with the eventual development of self-administered timeline instruments that will be respondent-friendly due to the assistance of intelligent-agent driven virtual interviewers.
Presented at the NCRN Meeting Spring 2016 in Washington DC on May 9-10, 2016; see http://www.ncrn.info/event/ncrn-spring-2016-meeting PB - University of Nebraska UR - http://hdl.handle.net/1813/43893 ER - TY - RPRT T1 - NCRN Meeting Spring 2017: 2017 Economic Census: Towards Synthetic Data Sets Y1 - 2016 A1 - Caldwell, Carol A1 - Thompson, Katherine Jenny AB - NCRN Meeting Spring 2017: 2017 Economic Census: Towards Synthetic Data Sets Caldwell, Carol; Thompson, Katherine Jenny PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52165 ER - TY - RPRT T1 - NCRN Meeting Spring 2017: Differentially Private Verification of Regression Model Results Y1 - 2016 A1 - Reiter, Jerry AB - NCRN Meeting Spring 2017: Differentially Private Verification of Regression Model Results Reiter, Jerry PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52167 ER - TY - RPRT T1 - NCRN Meeting Spring 2017: Practical Issues in Anonymity Y1 - 2016 A1 - Clifton, Chris A1 - Merill, Shawn A1 - Merill, Keith AB - NCRN Meeting Spring 2017: Practical Issues in Anonymity Clifton, Chris; Merill, Shawn; Merill, Keith PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52166 ER - TY - RPRT T1 - NCRN Newsletter: Volume 2 - Issue 4 Y1 - 2016 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB -
Overview of activities at NSF-Census Research Network nodes from September 2015 through December 2015. NCRN Newsletter Vol. 2, Issue 4: January 28, 2016.
PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/42394 ER - TY - RPRT T1 - NCRN Newsletter: Volume 3 - Issue 1 Y1 - 2016 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - Overview of activities at NSF-Census Research Network nodes from January 2016 through May 2016. NCRN Newsletter Vol. 3, Issue 1: June 10, 2016 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/44199 ER - TY - RPRT T1 - NCRN Newsletter: Volume 3 - Issue 2 Y1 - 2016 A1 - Vilhuber, Lars A1 - Knight-Ingram, Dory AB - Overview of activities at NSF-Census Research Network nodes from June 2016 through December 2016. NCRN Newsletter Vol. 3, Issue 2: December 23, 2016 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/46171 ER - TY - JOUR T1 - Noise infusion as a confidentiality protection measure for graph-based statistics JF - Statistical Journal of the International Association for Official Statistics Y1 - 2016 A1 - Abowd, John M. A1 - McKinney, Kevin L. AB - We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S.
Census Bureau's Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs. VL - 32 UR - http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji958 IS - 1 ER - TY - RPRT T1 - The NSF-Census Research Network in 2016: Taking stock, looking forward Y1 - 2016 A1 - Vilhuber, Lars AB - An overview of the activities of the NSF-Census Research Network as of 2016, given on Saturday, May 21, 2016, at a workshop on spatial and spatio-temporal design and analysis for official statistics, hosted by the Spatio-Temporal Statistics NSF Census Research Network (STSN) at the University of Missouri, and sponsored by the NSF-Census Research Network (NCRN). PB - University of Missouri UR - http://hdl.handle.net/1813/46210 ER - TY - JOUR T1 - Parallel associations and the structure of autobiographical knowledge JF - Journal of Applied Research in Memory and Cognition Y1 - 2016 A1 - Belli, R.F. A1 - T. Al Baghal KW - Autobiographical memory; Autobiographical knowledge; Autobiographical periods; Episodic memory; Retrospective reports AB - The self-memory system (SMS) model of autobiographical knowledge conceives that memories are structured thematically, organized both hierarchically and temporally. This model has been challenged on several fronts, including the absence of parallel linkages across pathways. Calendar survey interviewing shows the frequent and varied use of parallel associations in autobiographical recall.
Parallel associations in these data are commonplace, and are driven more by respondents’ generative retrieval than by interviewers’ probing. Parallel associations represent a number of autobiographical knowledge themes that are interrelated across life domains. The content of parallel associations is nearly evenly split between general and transitional events, supporting the importance of transitions in autobiographical memory. Associations in respondents’ memories (both parallel and sequential) demonstrate complex interactions with interviewer verbal behaviors during generative retrieval. In addition to discussing the implications of these results for the SMS model, implications are also drawn for transition theory and the basic-systems model. VL - 5 IS - 2 ER - TY - RPRT T1 - Practical Issues in Anonymity Y1 - 2016 A1 - Clifton, Chris A1 - Merill, Shawn A1 - Merill, Keith AB - Practical Issues in Anonymity Clifton, Chris; Merill, Shawn; Merill, Keith PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52166 ER - TY - JOUR T1 - Probabilistic Record Linkage and Deduplication after Indexing, Blocking, and Filtering JF - Journal of Privacy and Confidentiality Y1 - 2016 A1 - Murray, J. S. AB - Probabilistic record linkage, the task of merging two or more databases in the absence of a unique identifier, is a perennial and challenging problem. It is closely related to the problem of deduplicating a single database, which can be cast as linking a single database against itself. In both cases the number of possible links grows rapidly in the size of the databases under consideration, and in most applications it is necessary to first reduce the number of record pairs that will be compared. Spurred by practical considerations, a range of methods have been developed for this task. These methods go under a variety of names, including indexing and blocking, and have seen significant development.
However, methods for inferring linkage structure that account for indexing, blocking, and additional filtering steps have not seen commensurate development. In this paper we review the implications of indexing, blocking and filtering within the popular Fellegi-Sunter framework, and propose a new model to account for particular forms of indexing and filtering. VL - 7 UR - http://repository.cmu.edu/jpc/vol7/iss1/2 IS - 1 ER - TY - RPRT T1 - Regression Modeling and File Matching Using Possibly Erroneous Matching Variables Y1 - 2016 A1 - Dalzell, N. M. A1 - Reiter, J. P. KW - Statistics - Applications AB - Many analyses require linking records from two databases comprising overlapping sets of individuals. In the absence of unique identifiers, the linkage procedure often involves matching on a set of categorical variables, such as demographics, common to both files. Typically, however, the resulting matches are inexact: some cross-classifications of the matching variables do not generate unique links across files. Further, the matching variables can be subject to reporting errors, which introduce additional uncertainty in analyses. We present a Bayesian file matching methodology designed to estimate regression models and match records simultaneously when categorical matching variables are subject to reporting error. The method relies on a hierarchical model that includes (1) the regression of interest involving variables from the two files given a vector indicating the links, (2) a model for the linking vector given the true values of the matching variables, (3) a measurement error model for reported values of the matching variables given their true values, and (4) a model for the true values of the matching variables. We describe algorithms for sampling from the posterior distribution of the model. We illustrate the methodology using artificial data and data from education records in the state of North Carolina. 
PB - ArXiv UR - http://arxiv.org/abs/1608.06309 ER - TY - JOUR T1 - Releasing synthetic magnitude micro data constrained to fixed marginal totals JF - Statistical Journal of the International Association for Official Statistics Y1 - 2016 A1 - Wei, Lan A1 - Reiter, Jerome P. KW - Confidential KW - Disclosure KW - establishment KW - mixture KW - poisson KW - risk AB - We present approaches to generating synthetic microdata for multivariate data that take on non-negative integer values, such as magnitude data in economic surveys. The basic idea is to estimate a mixture of Poisson distributions to describe the multivariate distribution, and release draws from the posterior predictive distribution of the model. We develop approaches that guarantee the synthetic data sum to marginal totals computed from the original data, as well as approaches that do not enforce this equality. For both cases, we present methods for assessing disclosure risks inherent in releasing synthetic magnitude microdata. We illustrate the methodology using economic data from a survey of manufacturing establishments. VL - 32 UR - http://content.iospress.com/download/statistical-journal-of-the-iaos/sji959 IS - 1 ER - TY - JOUR T1 - Simultaneous edit-imputation and disclosure limitation for business establishment data JF - Journal of Applied Statistics Y1 - 2016 A1 - H. J. Kim A1 - J. P. Reiter A1 - A. F. Karr AB - Business establishment microdata typically are required to satisfy agency-specified edit rules, such as balance equations and linear inequalities. Inevitably some establishments' reported data violate the edit rules. Statistical agencies correct faulty values using a process known as edit-imputation. Business establishment data also must be heavily redacted before being shared with the public; indeed, confidentiality concerns lead many agencies not to share establishment microdata as unrestricted access files. 
When microdata must be heavily redacted, one approach is to create synthetic data, as done in the U.S. Longitudinal Business Database and the German IAB Establishment Panel. This article presents the first implementation of a fully integrated approach to edit-imputation and data synthesis. We illustrate the approach on data from the U.S. Census of Manufactures and present a variety of evaluations of the utility of the synthetic data. The paper also presents assessments of disclosure risks for several intruder attacks. We find that the synthetic data preserve important distributional features from the post-editing confidential microdata, and have low risks for the various attacks. ER - TY - JOUR T1 - Spatial Variation in the Quality of American Community Survey Estimates JF - Demography Y1 - 2016 A1 - Folch, David C. A1 - Arribas-Bel, Daniel A1 - Koschinsky, Julia A1 - Spielman, Seth E. VL - 53 ER - TY - JOUR T1 - Synthetic establishment microdata around the world JF - Statistical Journal of the International Association for Official Statistics Y1 - 2016 A1 - Vilhuber, Lars A1 - Abowd, John M. A1 - Reiter, Jerome P. KW - Business data KW - confidentiality KW - differential privacy KW - international comparison KW - Multiple imputation KW - synthetic AB - In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature. 
VL - 32 UR - http://content.iospress.com/download/statistical-journal-of-the-iaos/sji964 IS - 1 ER - TY - THES T1 - Topics on Official Statistics and Statistical Policy T2 - Statistics Y1 - 2016 A1 - Zachary Seeskin AB - My dissertation studies decision questions for government statistical agencies, both regarding data collection and how to combine data from multiple sources. Informed decisions regarding expenditure on data collection require information about the effects of data quality on data use. For the first topic, I study two important uses of decennial census data in the U.S.: for apportioning the House of Representatives and for allocating federal funds. Estimates of distortions in these two uses are developed for different levels of census accuracy. Then, I thoroughly investigate the sensitivity of findings to the census error distribution and to the choice of how to measure the distortions. The chapter concludes with a proposed framework for partial cost-benefit analysis that charges a share of the cost of the census to allocation programs. Then, I investigate an approximation to make analysis of the effects of census error on allocations feasible when allocations also depend on non-census statistics, as is the case for many formula-based allocations. The approximation conditions on the realized values of the non-census statistics instead of using the joint distribution over both census and non-census statistics. The research studies how using the approximation affects conclusions. I find that in some simple cases, the approximation always either overstates or equals the true effects of census error. Understatement is possible in other cases, but theory suggests that the largest possible understatements are about one-third the amount of the largest possible overstatements. 
In simulations with a more complex allocation formula, the approximation tends to overstate the effects of census error with the overstatement increasing with error in non-census statistics but decreasing with error in census statistics. In the final chapter, I evaluate the use of 2008-2010 property tax data from CoreLogic, Inc. (CoreLogic), aggregated from county and township governments from around the country, to improve 2010 American Community Survey (ACS) estimates of property tax amounts for single-family homes. Particularly, I evaluate the potential to use CoreLogic to reduce respondent burden, to study survey response error, and to improve adjustments for survey nonresponse. The coverage of the CoreLogic data varies between counties as does the correspondence between ACS and CoreLogic property taxes. This geographic variation implies that different approaches toward using CoreLogic are needed in different areas of the country. Further, large differences between CoreLogic and ACS property taxes in certain counties seem to be due to conceptual differences between what is collected in the two data sources. I examine three counties (Clark County, NV; Philadelphia County, PA; and St. Louis County, MO) and compare how estimates would change with different approaches using the CoreLogic data. Mean county property tax estimates are highly sensitive to whether ACS or CoreLogic data are used to construct estimates. Using CoreLogic data in imputation modeling for nonresponse adjustment of ACS estimates modestly improves the predictive power of imputation models, although estimates of county property taxes and property taxes by mortgage status are not very sensitive to the imputation method. 
JF - Statistics PB - Northwestern University CY - Evanston, Illinois VL - PHD UR - http://search.proquest.com/docview/1826016819 ER - TY - JOUR T1 - Using Data Mining to Predict the Occurrence of Respondent Retrieval Strategies in Calendar Interviewing: The Quality of Retrospective Reports JF - Journal of Official Statistics Y1 - 2016 A1 - Belli, Robert F. A1 - Miller, L. Dee A1 - Baghal, Tarek Al A1 - Soh, Leen-Kiat AB - Determining which verbal behaviors of interviewers and respondents are dependent on one another is a complex problem that can be facilitated via data-mining approaches. Data are derived from the interviews of 153 respondents of the Panel Study of Income Dynamics (PSID) who were interviewed about their life-course histories. Behavioral sequences of interviewer-respondent interactions that were most predictive of respondents spontaneously using parallel, timing, duration, and sequential retrieval strategies in their generation of answers were examined. We also examined which behavioral sequences were predictive of retrospective reporting data quality as shown by correspondence between calendar responses with responses collected in prior waves of the PSID. The verbal behaviors of immediately preceding interviewer and respondent turns of speech were assessed in terms of their co-occurrence with each respondent retrieval strategy. Interviewers’ use of parallel probes is associated with poorer data quality, whereas interviewers’ use of timing and duration probes, especially in tandem, is associated with better data quality. Respondents’ use of timing and duration strategies is also associated with better data quality and both strategies are facilitated by interviewer timing probes. Data mining alongside regression techniques is valuable to examine which interviewer-respondent interactions will benefit data quality. 
VL - 32 IS - 3 ER - TY - JOUR T1 - Using partially synthetic microdata to protect sensitive cells in business statistics JF - Statistical Journal of the International Association for Official Statistics Y1 - 2016 A1 - Miranda, Javier A1 - Vilhuber, Lars KW - confidentiality protection KW - gross job flows KW - local labor markets KW - Statistical Disclosure Limitation KW - Synthetic data KW - time-series AB - We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions). VL - 32 UR - http://content.iospress.com/download/statistical-journal-of-the-iaos/sji963 IS - 1 ER - TY - RPRT T1 - Why Statistical Agencies Need to Take Privacy-loss Budgets Seriously, and What It Means When They Do Y1 - 2016 A1 - John M. Abowd UR - http://digitalcommons.ilr.cornell.edu/ldi/32/ ER - TY - JOUR T1 - Accounting for nonignorable unit nonresponse and attrition in panel studies with refreshment samples JF - Journal of Survey Statistics and Methodology Y1 - 2015 A1 - Schifeling, T. A1 - Cheng, C. A1 - Hillygus, D. S. A1 - Reiter, J. P. AB - Panel surveys typically suffer from attrition, which can lead to biased inference when basing analysis only on cases that complete all waves of the panel. Unfortunately, panel data alone cannot inform the extent of the bias from the attrition, so that analysts using the panel data alone must make strong and untestable assumptions about the missing data mechanism. 
Many panel studies also include refreshment samples, which are data collected from a random sample of new individuals during some later wave of the panel. Refreshment samples offer information that can be utilized to correct for biases induced by nonignorable attrition while reducing reliance on strong assumptions about the attrition process. To date, these bias correction methods have not dealt with two key practical issues in panel studies: unit nonresponse in the initial wave of the panel and in the refreshment sample itself. As we illustrate, nonignorable unit nonresponse can significantly compromise the analyst’s ability to use the refreshment samples for attrition bias correction. Thus, it is crucial for analysts to assess how sensitive their inferences—corrected for panel attrition—are to different assumptions about the nature of the unit nonresponse. We present an approach that facilitates such sensitivity analyses, both for suspected nonignorable unit nonresponse in the initial wave and in the refreshment sample. We illustrate the approach using simulation studies and an analysis of data from the 2007-2008 Associated Press/Yahoo News election panel study. VL - 3 UR - http://jssam.oxfordjournals.org/content/3/3/265.abstract IS - 3 ER - TY - JOUR T1 - Bayesian Analysis of Spatially-Dependent Functional Responses with Spatially-Dependent Multi-Dimensional Functional Predictors JF - Statistica Sinica Y1 - 2015 A1 - Yang, W. H. A1 - Wikle, C.K. A1 - Holan, S.H. A1 - Sudduth, K. A1 - Meyers, D.B. VL - 25 UR - http://www3.stat.sinica.edu.tw/preprint/SS-13-245w_Preprint.pdf ER - TY - JOUR T1 - Bayesian Binomial Mixture Models for Estimating Abundance in Ecological Monitoring Studies JF - Annals of Applied Statistics Y1 - 2015 A1 - Wu, G. A1 - Holan, S.H. A1 - Nilon, C.H. A1 - Wikle, C.K. 
VL - 9 UR - http://projecteuclid.org/euclid.aoas/1430226082 ER - TY - JOUR T1 - Bayesian Hierarchical Statistical SIRS Models JF - Statistical Methods and Applications Y1 - 2015 A1 - Zhuang, L. A1 - Cressie, N. VL - 23 ER - TY - JOUR T1 - Bayesian Latent Pattern Mixture Models for Handling Attrition in Panel Studies With Refreshment Samples JF - ArXiv Y1 - 2015 A1 - Yajuan Si A1 - Jerome P. Reiter A1 - D. Sunshine Hillygus KW - Categorical KW - Dirichlet pro- cess KW - Multiple imputation KW - Non-ignorable KW - Panel attrition KW - Refreshment sample AB - Many panel studies collect refreshment samples---new, randomly sampled respondents who complete the questionnaire at the same time as a subsequent wave of the panel. With appropriate modeling, these samples can be leveraged to correct inferences for biases caused by non-ignorable attrition. We present such a model when the panel includes many categorical survey variables. The model relies on a Bayesian latent pattern mixture model, in which an indicator for attrition and the survey variables are modeled jointly via a latent class model. We allow the multinomial probabilities within classes to depend on the attrition indicator, which offers additional flexibility over standard applications of latent class models. We present results of simulation studies that illustrate the benefits of this flexibility. We apply the model to correct attrition bias in an analysis of data from the 2007-2008 Associated Press/Yahoo News election panel study. UR - http://arxiv.org/abs/1509.02124 IS - 1509.02124 ER - TY - JOUR T1 - Bayesian Lattice Filters for Time-Varying Autoregression and Time-Frequency Analysis JF - ArXiv Y1 - 2015 A1 - Yang, W. H. A1 - Holan, S. H. A1 - Wikle, C.K. AB - Modeling nonstationary processes is of paramount importance to many scientific disciplines including environmental science, ecology, and finance, among others. 
Consequently, flexible methodology that provides accurate estimation across a wide range of processes is a subject of ongoing interest. We propose a novel approach to model-based time-frequency estimation using time-varying autoregressive models. In this context, we take a fully Bayesian approach and allow both the autoregressive coefficients and innovation variance to vary over time. Importantly, our estimation method uses the lattice filter and is cast within the partial autocorrelation domain. The marginal posterior distributions are of standard form and, as a convenient by-product of our estimation method, our approach avoids undesirable matrix inversions. As such, estimation is extremely computationally efficient and stable. To illustrate the effectiveness of our approach, we conduct a comprehensive simulation study that compares our method with other competing methods and find that, in most cases, our approach performs superior in terms of average squared error between the estimated and true time-varying spectral density. Lastly, we demonstrate our methodology through three modeling applications; namely, insect communication signals, environmental data (wind components), and macroeconomic data (US gross domestic product (GDP) and consumption). UR - http://arxiv.org/abs/1408.2757 IS - 1408.2757 ER - TY - JOUR T1 - Bayesian Lattice Filters for Time-Varying Autoregression and Time–Frequency Analysis JF - Project Euclid Y1 - 2015 A1 - Yang, W. H. A1 - Holan, Scott H. A1 - Wikle, Christopher K. KW - locally stationary KW - model selection KW - nonstationary partial autocorrelation KW - piecewise stationary KW - sequential estimation KW - time-varying spectral density AB - Modeling nonstationary processes is of paramount importance to many scientific disciplines including environmental science, ecology, and finance, among others. Consequently, flexible methodology that provides accurate estimation across a wide range of processes is a subject of ongoing interest. 
We propose a novel approach to model-based time–frequency estimation using time-varying autoregressive models. In this context, we take a fully Bayesian approach and allow both the autoregressive coefficients and innovation variance to vary over time. Importantly, our estimation method uses the lattice filter and is cast within the partial autocorrelation domain. The marginal posterior distributions are of standard form and, as a convenient by-product of our estimation method, our approach avoids undesirable matrix inversions. As such, estimation is extremely computationally efficient and stable. To illustrate the effectiveness of our approach, we conduct a comprehensive simulation study that compares our method with other competing methods and find that, in most cases, our approach performs superior in terms of average squared error between the estimated and true time-varying spectral density. Lastly, we demonstrate our methodology through three modeling applications; namely, insect communication signals, environmental data (wind components), and macroeconomic data (US gross domestic product (GDP) and consumption). UR - http://projecteuclid.org/euclid.ba/1445263834 ER - TY - JOUR T1 - Bayesian Marked Point Process Modeling for Generating Fully Synthetic Public Use Data with Point-Referenced Geography JF - Spatial Statistics Y1 - 2015 A1 - Quick, Harrison A1 - Holan, Scott H. A1 - Wikle, Christopher K. A1 - Reiter, Jerome P. VL - 14 UR - http://www.sciencedirect.com/science/article/pii/S2211675315000718 ER - TY - JOUR T1 - Bayesian Marked Point Process Modeling for Generating Fully Synthetic Public Use Data with Point-Referenced Geography JF - ArXiv Y1 - 2015 A1 - Quick, H. A1 - Holan, S. H. A1 - Wikle, C. K. A1 - Reiter, J. P. AB - Many data stewards collect confidential data that include fine geography. 
When sharing these data with others, data stewards strive to disseminate data that are informative for a wide range of spatial and non-spatial analyses while simultaneously protecting the confidentiality of data subjects' identities and attributes. Typically, data stewards meet this challenge by coarsening the resolution of the released geography and, as needed, perturbing the confidential attributes. When done with high intensity, these redaction strategies can result in released data with poor analytic quality. We propose an alternative dissemination approach based on fully synthetic data. We generate data using marked point process models that can maintain both the statistical properties and the spatial dependence structure of the confidential data. We illustrate the approach using data consisting of mortality records from Durham, North Carolina. UR - http://arxiv.org/abs/1407.7795 IS - 1407.7795 ER - TY - JOUR T1 - Bayesian Semiparametric Hierarchical Empirical Likelihood Spatial Models JF - Journal of Statistical Planning and Inference Y1 - 2015 A1 - Porter, A.T. A1 - Holan, S.H. A1 - Wikle, C.K. VL - 165 ER - TY - JOUR T1 - Bayesian Spatial Change of Support for Count-Valued Survey Data with Application to the American Community Survey JF - Journal of the American Statistical Association Y1 - 2015 A1 - Bradley, Jonathan A1 - Wikle, C.K. A1 - Holan, S. H. AB - We introduce Bayesian spatial change of support methodology for count-valued survey data with known survey variances. Our proposed methodology is motivated by the American Community Survey (ACS), an ongoing survey administered by the U.S. Census Bureau that provides timely information on several key demographic variables. Specifically, the ACS produces 1-year, 3-year, and 5-year “period-estimates,” and corresponding margins of errors, for published demographic and socio-economic variables recorded over predefined geographies within the United States. 
Despite the availability of these predefined geographies it is often of interest to data-users to specify customized user-defined spatial supports. In particular, it is useful to estimate demographic variables defined on “new” spatial supports in “real-time.” This problem is known as spatial change of support (COS), which is typically performed under the assumption that the data follows a Gaussian distribution. However, count-valued survey data is naturally non-Gaussian and, hence, we consider modeling these data using a Poisson distribution. Additionally, survey-data are often accompanied by estimates of error, which we incorporate into our analysis. We interpret Poisson count-valued data in small areas as an aggregation of events from a spatial point process. This approach provides us with the flexibility necessary to allow ACS users to consider a variety of spatial supports in “real-time.” We show the effectiveness of our approach through a simulated example as well as through an analysis using public-use ACS data. UR - http://www.tandfonline.com/doi/abs/10.1080/01621459.2015.1117471 ER - TY - JOUR T1 - Bayesian Spatial Change of Support for Count-Valued Survey Data with Application to the American Community Survey JF - Journal of the American Statistical Association Y1 - 2015 A1 - Bradley, Jonathan R. A1 - Wikle, Christopher K. A1 - Holan, Scott H. KW - Aggregation KW - American Community Survey KW - Bayesian hierarchical model KW - Givens angle prior KW - Markov chain Monte Carlo KW - Multiscale model KW - Non-Gaussian. AB - We introduce Bayesian spatial change of support methodology for count-valued survey data with known survey variances. Our proposed methodology is motivated by the American Community Survey (ACS), an ongoing survey administered by the U.S. Census Bureau that provides timely information on several key demographic variables. 
Specifically, the ACS produces 1-year, 3-year, and 5-year “period-estimates,” and corresponding margins of errors, for published demographic and socio-economic variables recorded over predefined geographies within the United States. Despite the availability of these predefined geographies it is often of interest to data-users to specify customized user-defined spatial supports. In particular, it is useful to estimate demographic variables defined on “new” spatial supports in “real-time.” This problem is known as spatial change of support (COS), which is typically performed under the assumption that the data follows a Gaussian distribution. However, count-valued survey data is naturally non-Gaussian and, hence, we consider modeling these data using a Poisson distribution. Additionally, survey-data are often accompanied by estimates of error, which we incorporate into our analysis. We interpret Poisson count-valued data in small areas as an aggregation of events from a spatial point process. This approach provides us with the flexibility necessary to allow ACS users to consider a variety of spatial supports in “real-time.” We show the effectiveness of our approach through a simulated example as well as through an analysis using public-use ACS data. UR - http://www.tandfonline.com/doi/abs/10.1080/01621459.2015.1117471 ER - TY - JOUR T1 - Bayesian Spatial Change of Support for Count–Valued Survey Data JF - ArXiv Y1 - 2015 A1 - Bradley, J. R. A1 - Wikle, C.K. A1 - Holan, S. H. AB - We introduce Bayesian spatial change of support methodology for count-valued survey data with known survey variances. Our proposed methodology is motivated by the American Community Survey (ACS), an ongoing survey administered by the U.S. Census Bureau that provides timely information on several key demographic variables. 
Specifically, the ACS produces 1-year, 3-year, and 5-year "period-estimates," and corresponding margins of errors, for published demographic and socio-economic variables recorded over predefined geographies within the United States. Despite the availability of these predefined geographies it is often of interest to data users to specify customized user-defined spatial supports. In particular, it is useful to estimate demographic variables defined on "new" spatial supports in "real-time." This problem is known as spatial change of support (COS), which is typically performed under the assumption that the data follows a Gaussian distribution. However, count-valued survey data is naturally non-Gaussian and, hence, we consider modeling these data using a Poisson distribution. Additionally, survey-data are often accompanied by estimates of error, which we incorporate into our analysis. We interpret Poisson count-valued data in small areas as an aggregation of events from a spatial point process. This approach provides us with the flexibility necessary to allow ACS users to consider a variety of spatial supports in "real-time." We demonstrate the effectiveness of our approach through a simulated example as well as through an analysis using public-use ACS data. UR - http://arxiv.org/abs/1405.7227 IS - 1405.7227 ER - TY - RPRT T1 - Blocking Methods Applied to Casualty Records from the Syrian Conflict Y1 - 2015 A1 - Sadosky, Peter A1 - Shrivastava, Anshumali A1 - Price, Megan A1 - Steorts, Rebecca JF - ArXiv UR - http://arxiv.org/abs/1510.07714 ER - TY - JOUR T1 - Capturing multivariate spatial dependence: Model, estimate, and then predict JF - Statistical Science Y1 - 2015 A1 - Cressie, N. A1 - Burden, S. A1 - Davis, W. A1 - Krivitsky, P. A1 - Mokhtarian, P. A1 - Seusse, T. A1 - Zammit-Mangion, A. VL - 30 UR - http://projecteuclid.org/euclid.ss/1433341474 IS - 2 ER - TY - RPRT T1 - Categorical data fusion using auxiliary information Y1 - 2015 A1 - Fosdick, B. K. 
A1 - Maria DeYoreo A1 - J. P. Reiter AB - In data fusion, analysts seek to combine information from two databases comprising disjoint sets of individuals, in which some variables appear in both databases and other variables appear in only one database. Most data fusion techniques rely on variants of conditional independence assumptions. When inappropriate, these assumptions can result in unreliable inferences. We propose a data fusion technique that allows analysts to easily incorporate auxiliary information on the dependence structure of variables not observed jointly; we refer to this auxiliary information as glue. With this technique, we fuse two marketing surveys from the book publisher HarperCollins using glue from the online, rapid-response polling company CivicScience. The fused data enable estimation of associations between people's preferences for authors and for learning about new books. The analysis also serves as a case study on the potential for using online surveys to aid data fusion. PB - arXiv UR - http://arxiv.org/abs/1506.05886 ER - TY - JOUR T1 - Change in Visible Impervious Surface Area in Southeastern Michigan Before and After the “Great Recession:” Spatial Differentiation in Remotely Sensed Land-Cover Dynamics JF - Population and Environment Y1 - 2015 A1 - Wilson, C. R. A1 - Brown, D. G. VL - 36 UR - http://link.springer.com/article/10.1007%2Fs11111-014-0219-y IS - 3 ER - TY - CONF T1 - Changing ‘Who’ or ‘Where’: Implications for Data Quality in the American Time Use Survey T2 - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) Y1 - 2015 A1 - Deal, C.E. A1 - Kirchner, A. A1 - Cordova-Cazar, A.L. A1 - Ellyne, L. A1 - Belli, R.F. 
JF - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) CY - Hollywood, Florida UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - JOUR T1 - Comment on Article by Ferreira and Gamerman JF - Bayesian Analysis Y1 - 2015 A1 - Cressie, N. A1 - Chambers, R. L. VL - 10 UR - http://projecteuclid.org/euclid.ba/1429880217 IS - 3 ER - TY - JOUR T1 - Comment on “Semiparametric Bayesian Density Estimation with Disparate Data Sources: A Meta-Analysis of Global Childhood Undernutrition” by Finucane, M. M., Paciorek, C. J., Stevens, G. A., and Ezzati, M. JF - Journal of the American Statistical Association Y1 - 2015 A1 - Wikle, C.K. A1 - Holan, S.H. ER - TY - JOUR T1 - Comment: Spatial sampling designs depend as much on “how much?” and “why?” as on “where?” JF - Bayesian Analysis Y1 - 2015 A1 - Cressie, N. A1 - Chambers, R. L. AB - A comment on “Optimal design in geostatistics under preferential sampling” by G. da Silva Ferreira and D. Gamerman ER - TY - JOUR T1 - Communicating Uncertainty in Official Economic Statistics: An Appraisal Fifty Years after Morgenstern JF - Journal of Economic Literature Y1 - 2015 A1 - Manski, Charles F. KW - B22: History of Economic Thought: Macroeconomics KW - C82: Methodology for Collecting, Estimating, and Organizing Macroeconomic Data; Data Access KW - E23: Macroeconomics: Production AB - Federal statistical agencies in the United States and analogous agencies elsewhere commonly report official economic statistics as point estimates, without accompanying measures of error. Users of the statistics may incorrectly view them as error free or may incorrectly conjecture error magnitudes. This paper discusses strategies to mitigate misinterpretation of official statistics by communicating uncertainty to the public. Sampling error can be measured using established statistical principles. The challenge is to satisfactorily measure the various forms of nonsampling error. 
I find it useful to distinguish transitory statistical uncertainty, permanent statistical uncertainty, and conceptual uncertainty. I illustrate how each arises as the Bureau of Economic Analysis periodically revises GDP estimates, the Census Bureau generates household income statistics from surveys with nonresponse, and the Bureau of Labor Statistics seasonally adjusts employment statistics. I anchor my discussion of communication of uncertainty in the contribution of Oskar Morgenstern (1963a), who argued forcefully for agency publication of error estimates for official economic statistics. (JEL B22, C82, E23) VL - 53 UR - http://www.aeaweb.org/articles.php?doi=10.1257/jel.53.3.631 ER - TY - JOUR T1 - Comparing and selecting spatial predictors using local criteria JF - Test Y1 - 2015 A1 - Bradley, J.R. A1 - Cressie, N. A1 - Shi, T. VL - 24 UR - http://dx.doi.org/10.1007/s11749-014-0415-1 IS - 1 ER - TY - THES T1 - A Comparison of Multiple Imputation Methods for Categorical Data (Master's Thesis) T2 - Statistical Science Y1 - 2015 A1 - Akande, O. JF - Statistical Science PB - Duke University ER - TY - RPRT T1 - Cost-Benefit Analysis for a Quinquennial Census: The 2016 Population Census of South Africa. Y1 - 2015 A1 - Spencer, Bruce D. A1 - May, Julian A1 - Kenyon, Steven A1 - Seeskin, Zachary H. KW - demographic statistics KW - fiscal allocations KW - loss function KW - population estimates KW - post-censal estimates AB -
The question of whether to carry out a quinquennial census is being faced by national statistical offices in increasingly many countries, including Canada, Nigeria, Ireland, Australia, and South Africa. The authors describe uses, and limitations, of cost-benefit analysis for this decision problem in the case of the 2016 census of South Africa. The government of South Africa needed to decide whether to conduct a 2016 census or to rely on increasingly inaccurate post-censal estimates accounting for births, deaths, and migration since the previous (2011) census. The cost-benefit analysis compared predicted costs of the 2016 census to the benefits from improved allocation of intergovernmental revenue, which was considered by the government to be a critical use of the 2016 census, although not the only important benefit. Without the 2016 census, allocations would be based on population estimates. Accuracy of the post-censal estimates was estimated from the performance of past estimates, and the hypothetical expected reduction in errors in allocation due to the 2016 census was estimated. A loss function was introduced to quantify the improvement in allocation. With this evidence, the government was able to decide not to conduct the 2016 census, but instead to improve data and capacity for producing post-censal estimates.
JF - IPR Working Paper Series PB - Northwestern University, Institute for Policy Research UR - http://www.ipr.northwestern.edu/publications/papers/2015/ipr-wp-15-06.html ER - TY - CONF T1 - Determining Potential for Breakoff in Time Diary Survey Using Paradata T2 - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) Y1 - 2015 A1 - Wettlaufer, D. A1 - Arunachalam, H. A1 - Atkin, G. A1 - Eck, A. A1 - Soh, L.-K. A1 - Belli, R.F. JF - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) CY - Hollywood, Florida UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - JOUR T1 - Dirichlet Process Mixture Models for Nested Categorical Data JF - ArXiv Y1 - 2015 A1 - Hu, J. A1 - Reiter, J.P. A1 - Wang, Q. AB - We present a Bayesian model for estimating the joint distribution of multivariate categorical data when units are nested within groups. Such data arise frequently in social science settings, for example, people living in households. The model assumes that (i) each group is a member of a group-level latent class, and (ii) each unit is a member of a unit-level latent class nested within its group-level latent class. This structure allows the model to capture dependence among units in the same group. It also facilitates simultaneous modeling of variables at both group and unit levels. We develop a version of the model that assigns zero probability to groups and units with physically impossible combinations of variables. We apply the model to estimate multivariate relationships in a subset of the American Community Survey. Using the estimated model, we generate synthetic household data that could be disseminated as redacted public use files with high analytic validity and low disclosure risks. Supplementary materials for this article are available online. 
UR - http://arxiv.org/pdf/1412.2282v3.pdf IS - 1412.2282 ER - TY - THES T1 - Dirichlet Process Mixture Models for Nested Categorical Data (Ph.D. Thesis) T2 - Statistical Science Y1 - 2015 A1 - Hu, J. JF - Statistical Science PB - Duke University UR - http://dukespace.lib.duke.edu/dspace/handle/10161/9933 ER - TY - CONF T1 - Do Interviewers with High Cooperation Rates Behave Differently? Interviewer Cooperation Rates and Interview Behaviors T2 - International Conference on Total Survey Error Y1 - 2015 A1 - Olson, K. A1 - Smyth, J.D. A1 - Kirchner, A. JF - International Conference on Total Survey Error CY - Baltimore, MD UR - http://www.niss.org/events/2015-international-total-survey-error-conference ER - TY - CONF T1 - Do Interviewers with High Cooperation Rates Behave Differently? Interviewer Cooperation Rates and Interview Behaviors T2 - Joint Statistical Meetings Y1 - 2015 A1 - Olson, K. A1 - Smyth, J.D. A1 - Kirchner, A. JF - Joint Statistical Meetings CY - Seattle, WA UR - http://www.amstat.org/meetings/jsm/2015/program.cfm ER - TY - THES T1 - Dynamic Models of Human Capital Accumulation (Ph.D. Thesis) T2 - Economics Y1 - 2015 A1 - Ransom, T. JF - Economics PB - Duke University UR - http://dukespace.lib.duke.edu/dspace/handle/10161/9929 ER - TY - RPRT T1 - Economic Analysis and Statistical Disclosure Limitation Y1 - 2015 A1 - Abowd, John M. A1 - Schmutte, Ian M. AB -

This paper explores the consequences for economic research of methods used by data publishers to protect the privacy of their respondents. We review the concept of statistical disclosure limitation for an audience of economists who may be unfamiliar with these methods. We characterize what it means for statistical disclosure limitation to be ignorable. When it is not ignorable, we consider the effects of statistical disclosure limitation for a variety of research designs common in applied economic research. Because statistical agencies do not always report the methods they use to protect confidentiality, we also characterize settings in which statistical disclosure limitation methods are discoverable; that is, they can be learned from the released data. We conclude with advice for researchers, journal editors, and statistical agencies.

PB - Cornell University UR - http://hdl.handle.net/1813/40581 ER - TY - JOUR T1 - Economic Analysis and Statistical Disclosure Limitation JF - Brookings Papers on Economic Activity Y1 - 2015 A1 - Abowd, John M. A1 - Schmutte, Ian M. AB - This paper explores the consequences for economic research of methods used by data publishers to protect the privacy of their respondents. We review the concept of statistical disclosure limitation for an audience of economists who may be unfamiliar with these methods. We characterize what it means for statistical disclosure limitation to be ignorable. When it is not ignorable, we consider the effects of statistical disclosure limitation for a variety of research designs common in applied economic research. Because statistical agencies do not always report the methods they use to protect confidentiality, we also characterize settings in which statistical disclosure limitation methods are discoverable; that is, they can be learned from the released data. We conclude with advice for researchers, journal editors, and statistical agencies. VL - Spring 2015 UR - http://www.brookings.edu/about/projects/bpea/papers/2015/economic-analysis-statistical-disclosure-limitation ER - TY - JOUR T1 - The Effect of CATI Questionnaire Design Features on Response Timing JF - Journal of Survey Statistics and Methodology Y1 - 2015 A1 - Olson, K. A1 - Smyth, J.D. VL - 3 IS - 3 ER - TY - RPRT T1 - Effects of Census Accuracy on Apportionment of Congress and Allocations of Federal Funds. Y1 - 2015 A1 - Seeskin, Zachary H. A1 - Spencer, Bruce D. AB -

How much accuracy is needed in the 2020 census depends on the cost of attaining accuracy and on the consequences of imperfect accuracy. The cost target for the 2020 census of the United States has been specified, and the Census Bureau is developing projections of the accuracy attainable for that cost. It is desirable to have information about the consequences of the accuracy that might be attainable for that cost or for alternative cost levels. To assess the consequences of imperfect census accuracy, Seeskin and Spencer consider alternative profiles of accuracy for states and assess their implications for apportionment of the U.S. House of Representatives and for allocation of federal funds. An error in allocation is defined as the difference between the allocation computed under imperfect data and the allocation computed with perfect data. Estimates of expected sums of absolute values of errors are presented for House apportionment and for federal funds allocations.

JF - IPR Working Paper Series PB - Northwestern University, Institute for Policy Research UR - http://www.ipr.northwestern.edu/publications/papers/2015/ipr-wp-15-05.html ER - TY - CONF T1 - Effects of interviewer and respondent behavior on data quality: An investigation of question types and interviewer learning T2 - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) Y1 - 2015 A1 - Kirchner, A. A1 - Olson, K. JF - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) CY - Hollywood, Florida UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - CONF T1 - Effects of interviewer and respondent behavior on data quality: An investigation of question types and interviewer learning T2 - 6th Conference of the European Survey Research Association Y1 - 2015 A1 - Kirchner, A. A1 - Olson, K. JF - 6th Conference of the European Survey Research Association CY - Reykjavik, Iceland UR - http://www.europeansurveyresearch.org/conference ER - TY - JOUR T1 - An empirical comparison of multiple imputation methods for categorical data JF - arXiv Y1 - 2015 A1 - Akande, O. A1 - Li, Fan A1 - Reiter, J. P. AB - Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. 
We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet Process mixture models. We base the simulations on categorical data from the American Community Survey. The results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and mixture model approaches. They also suggest competing advantages for the regression tree and mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. UR - http://arxiv.org/abs/1508.05918 IS - 1508.05918 ER - TY - JOUR T1 - Entity Resolution with Empirically Motivated Priors JF - Bayesian Anal. Y1 - 2015 A1 - Steorts, Rebecca C. AB - Databases often contain corrupted, degraded, and noisy data with duplicate entries across and within each database. Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a bipartite graph that links the observed records to unobserved latent entities. Bayesian approaches provide attractive benefits, naturally providing uncertainty quantification via posterior probabilities. We propose a novel record linkage approach based on empirical Bayesian principles. Specifically, the empirical Bayesian-type step consists of taking the empirical distribution function of the data as the prior for the latent entities. This approach improves on the earlier HB approach not only by avoiding the prior specification problem but also by allowing both categorical and string-valued variables. 
Our extension to string-valued variables also involves the proposal of a new probabilistic mechanism by which observed record values for string fields can deviate from the values of their associated latent entities. Categorical fields that deviate from their corresponding true value are simply drawn from the empirical distribution function. We apply our proposed methodology to a simulated data set of German names and an Italian household survey on income and wealth, showing our method performs favorably compared to several standard methods in the literature. We also consider the robustness of our methods to changes in the hyper-parameters. VL - 10 UR - http://dx.doi.org/10.1214/15-BA965SI ER - TY - JOUR T1 - Entity resolution with empirically motivated priors JF - Bayesian Analysis Y1 - 2015 A1 - Steorts, Rebecca C. AB - Databases often contain corrupted, degraded, and noisy data with duplicate entries across and within each database. Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a bipartite graph that links the observed records to unobserved latent entities. Bayesian approaches provide attractive benefits, naturally providing uncertainty quantification via posterior probabilities. We propose a novel record linkage approach based on empirical Bayesian principles. Specifically, the empirical Bayesian--type step consists of taking the empirical distribution function of the data as the prior for the latent entities. This approach improves on the earlier HB approach not only by avoiding the prior specification problem but also by allowing both categorical and string-valued variables. 
Our extension to string-valued variables also involves the proposal of a new probabilistic mechanism by which observed record values for string fields can deviate from the values of their associated latent entities. Categorical fields that deviate from their corresponding true value are simply drawn from the empirical distribution function. We apply our proposed methodology to a simulated data set of German names and an Italian household survey, showing our method performs favorably compared to several standard methods in the literature. We also consider the robustness of our methods to changes in the hyper-parameters. VL - 10 UR - http://projecteuclid.org/euclid.ba/1441790411 IS - 5 ER - TY - THES T1 - Essays on Multinational Production and the Propagation of Shocks T2 - Department of Economics Y1 - 2015 A1 - Flaaen, Aaron KW - Business Cycle Comovement KW - Global Supply Chains KW - Multinational Firms AB - The increased exposure of the United States to economic shocks originating from abroad is a common concern of those critical of globalization. An understanding of the cross-country transmission of shocks is of central importance for policymakers seeking to limit excess volatility resulting from international linkages. Firms whose ownership spans multiple countries are one under-appreciated mechanism. These multinationals represent an enormous share of the global economy, but a general scarcity of firm-level data has limited our understanding of how they affect both origin and destination countries. One contribution of this dissertation is to expand the data availability on these firms, using innovative data-linking techniques. The first chapter provides some of the first ever causal evidence on the role of trade and multinational production in the transmission of economic shocks and the cross-country synchronization of business cycles. This chapter leverages the 2011 Japanese earthquake/tsunami as a natural experiment. It finds that those U.S. 
firms with large exposure to intermediate inputs from Japan -- typically the affiliates of Japanese multinationals -- experience significant output declines after this shock, roughly one-for-one with declines in imported inputs. Structural estimation of the production function reveals substantial complementarities between imported and domestic inputs. These results suggest that global supply chains are more rigid than previously thought. The second chapter incorporates this low production elasticity of imported inputs into an otherwise standard dynamic stochastic general equilibrium model. The low degree of input substitutability, when applied to the share of trade governed by multinational firms, can generate effects in the aggregate. Value-added co-movement increases by 11 percentage points in the baseline model relative to a model where such features are absent. The model confirms that real linkages -- in addition to financial and policy spillovers -- play an important role in business cycle synchronization. The third chapter describes additional characteristics of multinational firms relative to domestic and exporting firms in the U.S. economy. These firms are larger, more productive, more capital intensive, and pay higher wages than other firms. The relative patterns of trade and output offer valuable guidance for the motives for ownership that spans national boundaries. JF - Department of Economics PB - University of Michigan CY - Ann Arbor, MI UR - http://hdl.handle.net/2027.42/111331 ER - TY - CHAP T1 - Evaluation of diagnostics for hierarchical spatial statistical models T2 - Geometry Driven Statistics Y1 - 2015 A1 - Cressie, N. A1 - Burden, S. ED - I.L. Dryden ED - J.T. 
Kent JF - Geometry Driven Statistics PB - Wiley CY - Chichester SN - 978-1118866573 UR - http://niasra.uow.edu.au/content/groups/public/@web/@inf/@math/documents/doc/uow169240.pdf ER - TY - JOUR T1 - Expanding the Discourse on Antipoverty Policy: Reconsidering a Negative Income Tax JF - Journal of Poverty Y1 - 2015 A1 - Jessica Wiederspan A1 - Elizabeth Rhodes A1 - H. Luke Shaefer KW - economic well-being KW - poverty alleviation KW - public policy KW - social welfare policy AB - This article proposes that advocates for the poor consider the replacement of the current means-tested safety net in the United States with a Negative Income Tax (NIT), a guaranteed income program that lifts families’ incomes above a minimum threshold. The article highlights gaps in service provision that leave millions in poverty, explains how a NIT could help fill those gaps, and compares current expenditures on major means-tested programs to estimated expenditures necessary for a NIT. Finally, it addresses the financial and political concerns that are likely to arise in the event that a NIT proposal gains traction among policy makers. VL - 19 UR - http://dx.doi.org/10.1080/10875549.2014.991889 ER - TY - JOUR T1 - Figures of merit for simultaneous inference and comparisons in simulation experiments JF - Stat Y1 - 2015 A1 - Cressie, N. A1 - Burden, S. VL - 4 UR - http://onlinelibrary.wiley.com/doi/10.1002/sta4.88/epdf IS - 1 ER - TY - THES T1 - Four Essays in Unemployment, Wage Dynamics and Subjective Expectations T2 - Department of Economics Y1 - 2015 A1 - Hudomiet, Peter KW - measurement error KW - subjective expectations KW - unemployment AB - This dissertation contains four essays on unemployment differences between skill groups, on the effect of non-employment on wages and measurement error, and on subjective expectations of Americans about mortality and the stock market. 
Chapter 1 tests how much of the unemployment rate differences between education groups can be explained by occupational differences in labor adjustment costs. The educational gap in unemployment is substantial. Recent empirical studies found that the largest component of labor adjustment costs is adaptation costs: newly hired workers need a few months to get up to speed and reach full productivity. The chapter evaluates the effect of adaptation costs on unemployment using a calibrated search and matching model. Chapter 2 tests how short periods of non-employment affect survey reports of annual earnings. Non-employment has strong and non-standard effects on response error in earnings. Persons tend to report the permanent component of their earnings accurately, but transitory shocks are underreported. Transitory shocks due to career interruptions are very large, taking up several months of lost earnings, on average, and people only report 60-85% of these earnings losses. The resulting measurement error is non-standard: it has a positive mean, it is right-skewed, and the bias correlates with predictors of turnover. Chapter 3 proposes and tests a model, the modal response hypothesis, to explain patterns in mortality expectations of Americans. The model is a mathematical expression of the idea that survey responses of 0%, 50%, or 100% to probability questions indicate a high level of uncertainty about the relevant probability. The chapter shows that subjective survival expectations in 2002 line up very well with realized mortality of the HRS respondents between 2002 and 2010, and our model performs better than typically used models in the literature of subjective probabilities. Chapter 4 analyzes the impact of the stock market crash of 2008 on households' expectations about the returns on the stock market index: the population average of expectations, the average uncertainty, and the cross-sectional heterogeneity in expectations from March 2008 to February 2009. 
JF - Department of Economics PB - University of Michigan CY - Ann Arbor, MI UR - http://hdl.handle.net/2027.42/113598 ER - TY - CONF T1 - Grids and Online Panels: A Comparison of Device Type from a Survey Quality Perspective T2 - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) Y1 - 2015 A1 - Wang, Mengyang A1 - McCutcheon, Allan L. A1 - Allen, Laura JF - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) CY - Hollywood, Florida UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - RPRT T1 - The role of occupation specific adaptation costs in explaining the educational gap in unemployment. Y1 - 2015 A1 - Hudomiet, Peter UR - https://sites.google.com/site/phudomiet/Hudomiet-JobMarketPaper.pdf?attredirects=0 ER - TY - CHAP T1 - Hierarchical models for uncertainty quantification: An overview T2 - Handbook of Uncertainty Quantification Y1 - 2015 A1 - Wikle, C.K. ED - Ghanem, R. ED - Higdon, D. ED - Owhadi, H. JF - Handbook of Uncertainty Quantification PB - Springer ER - TY - CHAP T1 - Hierarchical Agent-Based Spatio-Temporal Dynamic Models for Discrete Valued Data T2 - Handbook of Discrete-Valued Time Series Y1 - 2015 A1 - Wikle, C.K. A1 - Hooten, M.B. ED - Davis, R. ED - Holan, S. ED - Lund, R. ED - Ravishanker, N. JF - Handbook of Discrete-Valued Time Series PB - Chapman and Hall/CRC Press CY - Boca Raton, FL. UR - http://www.crcpress.com/product/isbn/9781466577732 ER - TY - CHAP T1 - Hierarchical Dynamic Generalized Linear Mixed Models for Discrete-Valued Spatio-Temporal Data T2 - Handbook of Discrete-Valued Time Series Y1 - 2015 A1 - Holan, S.H. A1 - Wikle, C.K. ED - Davis, R. ED - Holan, S. ED - Lund, R. 
ED - Ravishanker, N. JF - Handbook of Discrete-Valued Time Series PB - Chapman and Hall/CRC Press CY - Boca Raton, FL SN - ISBN 9781466577732 UR - http://www.crcpress.com/product/isbn/9781466577732 N1 - To appear in "Handbook of Discrete-Valued Time Series" ER - TY - CHAP T1 - Hierarchical Spatial Models T2 - Encyclopedia of Geographical Information Science Y1 - 2015 A1 - Arab, A. A1 - Hooten, M.B. A1 - Wikle, C.K. JF - Encyclopedia of Geographical Information Science PB - Springer ER - TY - JOUR T1 - Hierarchical, stochastic modeling across spatiotemporal scales of large river ecosystems and somatic growth in fish populations under various climate models: Missouri River sturgeon example JF - Geological Society Y1 - 2015 A1 - Wildhaber, M.L. A1 - Wikle, C.K. A1 - Moran, E.H. A1 - Anderson, C.J. A1 - Franz, K.J. A1 - Dey, R. ER - TY - JOUR T1 - Hot enough for you? A spatial exploratory and inferential analysis of North American climate-change projections JF - Mathematical Geosciences Y1 - 2015 A1 - Cressie, N. A1 - Kang, E.L. UR - http://dx.doi.org/10.1007/s11004-015-9607-9 ER - TY - RPRT T1 - How individuals smooth spending: Evidence from the 2013 government shutdown using account data Y1 - 2015 A1 - Gelman, Michael A1 - Kariv, Shachar A1 - Shapiro, Matthew D. A1 - Silverman, Dan A1 - Tadelis, Steven AB - Using comprehensive account records, this paper examines how individuals adjusted spending and saving in response to a temporary drop in income due to the 2013 U.S. government shutdown. The shutdown cut paychecks by 40% for affected employees, which was recovered within 2 weeks. 
Though the shock was short-lived and completely reversed, spending dropped sharply, implying a naïve estimate of the marginal propensity to spend of 0.58. This estimate overstates how consumption responded. While many individuals had low liquidity, they used multiple strategies to smooth consumption, including delaying recurring payments such as mortgages and credit card balances. PB - National Bureau of Economic Research ER - TY - CONF T1 - I Know What You Did Next: Predicting Respondent’s Next Activity Using Machine Learning T2 - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) Y1 - 2015 A1 - Arunachalam, H. A1 - Atkin, G. A1 - Eck, A. A1 - Wettlaufer, D. A1 - Soh, L.-K. A1 - Belli, R.F. JF - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) CY - Hollywood, Florida UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - RPRT T1 - Introduction to The Survey of Income and Program Participation (SIPP) Y1 - 2015 A1 - Shaefer, H. Luke AB - Goals for the SIPP workshop: provide you with an introduction to the SIPP and get you up and running on the public-use SIPP files; offer some advanced tools for 2008 Panel SIPP data analysis; get you some experience analyzing SIPP data; introduce you to the SIPP EHC (SIPP Redesign); and introduce you to the SIPP Synthetic Beta (SSB). Presentation made on May 15, 2015, at the Census Bureau, and previously in 2014 at Duke University and the University of Michigan PB - University of Michigan UR - http://hdl.handle.net/1813/40169 ER - TY - CHAP T1 - Long Memory Discrete-Valued Time Series T2 - Handbook of Discrete-Valued Time Series Y1 - 2015 A1 - Lund, R. A1 - Holan, S.H. A1 - Livsey, J. 
JF - Handbook of Discrete-Valued Time Series PB - Chapman and Hall UR - http://www.crcpress.com/product/isbn/9781466577732 ER - TY - RPRT T1 - Modeling Endogenous Mobility in Wage Determination Y1 - 2015 A1 - Abowd, John M. A1 - McKinney, Kevin L. A1 - Schmutte, Ian M. AB - We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax the exogenous mobility assumptions by modeling the evolution of the matched data as an evolving bipartite graph using a Bayesian latent class framework. Our results suggest that endogenous mobility biases estimated firm effects toward zero. To assess validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates. PB - Cornell University UR - http://hdl.handle.net/1813/40306 ER - TY - RPRT T1 - Modeling Endogenous Mobility in Wage Determination Y1 - 2015 A1 - Abowd, John M. A1 - McKinney, Kevin L. A1 - Schmutte, Ian M. AB - We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. 
First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax exogenous mobility by modeling the matched data as an evolving bipartite graph using a Bayesian latent-type framework. Our results suggest that allowing endogenous mobility increases the variation in earnings explained by individual heterogeneity and reduces the proportion due to employer and match effects. To assess external validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The mobility-bias corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/52608 ER - TY - RPRT T1 - Modeling for Dynamic Ordinal Regression Relationships: An Application to Estimating Maturity of Rockfish in California Y1 - 2015 A1 - DeYoreo, M. A1 - Kottas, A. KW - Statistics - Applications AB - We develop a Bayesian nonparametric framework for modeling ordinal regression relationships which evolve in discrete time. The motivating application involves a key problem in fisheries research on estimating dynamically evolving relationships between age, length and maturity, the latter recorded on an ordinal scale. The methodology builds from nonparametric mixture modeling for the joint stochastic mechanism of covariates and latent continuous responses. This approach yields highly flexible inference for ordinal regression functions while at the same time avoiding the computational challenges of parametric models. A novel dependent Dirichlet process prior for time-dependent mixing distributions extends the model to the dynamic setting. 
The methodology is used for a detailed study of relationships between maturity, age, and length for Chilipepper rockfish, using data collected over 15 years along the coast of California. PB - ArXiv UR - http://arxiv.org/abs/1507.01242 ER - TY - JOUR T1 - Modern Perspectives on Statistics for Spatio-Temporal Data JF - WIRES Computational Statistics Y1 - 2015 A1 - Wikle, C.K. VL - 7 UR - http://dx.doi.org/10.1002/wics.1341 IS - 1 ER - TY - JOUR T1 - Moving Toward the New World of Censuses and Large-Scale Sample Surveys: Methodological Developments and Practical Implementations JF - Journal of Official Statistics Y1 - 2015 A1 - Fienberg, S. E. N1 - In press ER - TY - JOUR T1 - Multiple imputation for harmonizing longitudinal non-commensurate measures in individual participant data meta-analysis JF - Statistics in Medicine Y1 - 2015 A1 - Siddique, J. A1 - Reiter, J. P. A1 - Brincks, A. A1 - Gibbons, R. A1 - Crespi, C. A1 - Brown, C. H. UR - http://onlinelibrary.wiley.com/doi/10.1002/sim.6562/abstract ER - TY - JOUR T1 - Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models with Local Dependence JF - arXiv Y1 - 2015 A1 - Murray, J. S. A1 - Reiter, J. P. AB - We present a nonparametric Bayesian joint model for multivariate continuous and categorical variables, with the intention of developing a flexible engine for multiple imputation of missing values. The model fuses Dirichlet process mixtures of multinomial distributions for categorical variables with Dirichlet process mixtures of multivariate normal distributions for continuous variables. We incorporate dependence between the continuous and categorical variables by (i) modeling the means of the normal distributions as component-specific functions of the categorical variables and (ii) forming distinct mixture components for the categorical and continuous data with probabilities that are linked via a hierarchical model. 
This structure allows the model to capture complex dependencies between the categorical and continuous data with minimal tuning by the analyst. We apply the model to impute missing values due to item nonresponse in an evaluation of the redesign of the Survey of Income and Program Participation (SIPP). The goal is to compare estimates from a field test with the new design to estimates from selected individuals from a panel collected under the old design. We show that accounting for the missing data changes some conclusions about the comparability of the distributions in the two datasets. We also perform an extensive repeated sampling simulation using similar data from complete cases in an existing SIPP panel, comparing our proposed model to a default application of multiple imputation by chained equations. Imputations based on the proposed model tend to have better repeated sampling properties than the default application of chained equations in this realistic setting. UR - arxiv.org/abs/1410.0438 IS - 1410.0438 ER - TY - ICOMM T1 - Multiscale Analysis of Survey Data: Recent Developments and Exciting Prospects Y1 - 2015 A1 - Bradley, J.R. A1 - Wikle, C.K. A1 - Holan, S.H. JF - Statistics Views ER - TY - JOUR T1 - Multivariate Spatial Covariance Models: A Conditional Approach Y1 - 2015 A1 - Cressie, N. A1 - Zammit-Mangion, A. AB - Multivariate geostatistics is based on modelling all covariances between all possible combinations of two or more variables at any sets of locations in a continuously indexed domain. Multivariate spatial covariance models need to be built with care, since any covariance matrix that is derived from such a model must be nonnegative-definite. In this article, we develop a conditional approach for spatial-model construction whose validity conditions are easy to check. We start with bivariate spatial covariance models and go on to demonstrate the approach's connection to multivariate models defined by networks of spatial variables. 
In some circumstances, such as modelling respiratory illness conditional on air pollution, the direction of conditional dependence is clear. When it is not, the two directional models can be compared. More generally, the graph structure of the network reduces the number of possible models to compare. Model selection then amounts to finding possible causative links in the network. We demonstrate our conditional approach on surface temperature and pressure data, where the role of the two variables is seen to be asymmetric. UR - https://arxiv.org/abs/1504.01865 ER - TY - JOUR T1 - Multivariate Spatial Hierarchical Bayesian Empirical Likelihood Methods for Small Area Estimation JF - STAT Y1 - 2015 A1 - Porter, A.T. A1 - Holan, S.H. A1 - Wikle, C.K. VL - 4 UR - http://dx.doi.org/10.1002/sta4.81 IS - 1 ER - TY - JOUR T1 - Multivariate Spatio-Temporal Models for High-Dimensional Areal Data with Application to Longitudinal Employer-Household Dynamics JF - ArXiv Y1 - 2015 A1 - Bradley, J. R. A1 - Holan, S. H. A1 - Wikle, C.K. AB - Many data sources report related variables of interest that are also referenced over geographic regions and time; however, there are relatively few general statistical methods that one can readily use that incorporate these multivariate spatio-temporal dependencies. Additionally, many multivariate spatio-temporal areal datasets are extremely high-dimensional, which leads to practical issues when formulating statistical models. For example, we analyze Quarterly Workforce Indicators (QWI) published by the US Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) program. QWIs are available by different variables, regions, and time points, resulting in millions of tabulations. Despite their already expansive coverage, by adopting a fully Bayesian framework, the scope of the QWIs can be extended to provide estimates of missing values along with associated measures of uncertainty. 
Motivated by the LEHD, and other applications in federal statistics, we introduce the multivariate spatio-temporal mixed effects model (MSTM), which can be used to efficiently model high-dimensional multivariate spatio-temporal areal datasets. The proposed MSTM extends the notion of Moran's I basis functions to the multivariate spatio-temporal setting. This extension leads to several methodological contributions including extremely effective dimension reduction, a dynamic linear model for multivariate spatio-temporal areal processes, and the reduction of a high-dimensional parameter space using a novel parameter model. UR - http://arxiv.org/abs/1503.00982 IS - 1503.00982 ER - TY - JOUR T1 - Multivariate Spatio-Temporal Models for High-Dimensional Areal Data with Application to Longitudinal Employer-Household Dynamics JF - Annals of Applied Statistics Y1 - 2015 A1 - Bradley, J.R. A1 - Holan, S.H. A1 - Wikle, C.K. AB - Many data sources report related variables of interest that are also referenced over geographic regions and time; however, there are relatively few general statistical methods that one can readily use that incorporate these multivariate spatio-temporal dependencies. Additionally, many multivariate spatio-temporal areal datasets are extremely high-dimensional, which leads to practical issues when formulating statistical models. For example, we analyze Quarterly Workforce Indicators (QWI) published by the US Census Bureau’s Longitudinal Employer-Household Dynamics (LEHD) program. QWIs are available by different variables, regions, and time points, resulting in millions of tabulations. Despite their already expansive coverage, by adopting a fully Bayesian framework, the scope of the QWIs can be extended to provide estimates of missing values along with associated measures of uncertainty. 
Motivated by the LEHD, and other applications in federal statistics, we introduce the multivariate spatio-temporal mixed effects model (MSTM), which can be used to efficiently model high-dimensional multivariate spatio-temporal areal datasets. The proposed MSTM extends the notion of Moran’s I basis functions to the multivariate spatio-temporal setting. This extension leads to several methodological contributions including extremely effective dimension reduction, a dynamic linear model for multivariate spatio-temporal areal processes, and the reduction of a high-dimensional parameter space using a novel parameter model. VL - 9 IS - 4 ER - TY - RPRT T1 - NCRN Meeting Fall 2016: Dynamic Question Ordering: Obtaining Useful Information While Reducing Burden Y1 - 2015 A1 - Early, Kirstin AB - NCRN Meeting Fall 2016: Dynamic Question Ordering: Obtaining Useful Information While Reducing Burden Early, Kirstin PB - Carnegie-Mellon University UR - http://hdl.handle.net/1813/45822 ER - TY - RPRT T1 - NCRN Meeting Spring 2015 Y1 - 2015 A1 - Vilhuber, Lars AB - NCRN Meeting Spring 2015 Vilhuber, Lars May 7 meetings @ U.S. Census Bureau, Washington DC. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/45867 ER - TY - Generic T1 - NCRN Meeting Spring 2015: A Vision for the Future of Data Access Y1 - 2015 A1 - Reiter, J.P. AB -
NCRN Meeting Spring 2015: A Vision for the Future of Data Access Reiter, J.P. Presentation at the NCRN Meeting Spring 2015
PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40181 ER - TY - Generic T1 - NCRN Meeting Spring 2015: Broadening data access through synthetic data Y1 - 2015 A1 - Vilhuber, Lars AB -
NCRN Meeting Spring 2015: Broadening data access through synthetic data Vilhuber, Lars Presentation at the NCRN Meeting Spring 2015
PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40185 ER - TY - RPRT T1 - NCRN Meeting Spring 2015: Building and Training the Next Generation of Survey Methodologists and Researchers Y1 - 2015 A1 - Nugent, Rebecca AB - NCRN Meeting Spring 2015: Building and Training the Next Generation of Survey Methodologists and Researchers Nugent, Rebecca Presentation at the NCRN Meetings Spring 2015 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40188 ER - TY - RPRT T1 - NCRN Meeting Spring 2015: Can Government-Academic Partnerships Help Secure the Future of the Federal Statistical System? Examples from the NSF-Census Research Network Y1 - 2015 A1 - Abowd, John M. A1 - Fienberg, Stephen E. AB - NCRN Meeting Spring 2015: Can Government-Academic Partnerships Help Secure the Future of the Federal Statistical System? Examples from the NSF-Census Research Network Abowd, John M.; Fienberg, Stephen E. May 8, 2015 CNSTAT Public Seminar PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40186 ER - TY - RPRT T1 - NCRN Meeting Spring 2015: Comment on: Can Government-Academic Partnerships Help Secure the Future of the Federal Statistical System? Examples from the NSF-Census Research Network Y1 - 2015 A1 - Groshen, Erica L. AB - NCRN Meeting Spring 2015: Comment on: Can Government-Academic Partnerships Help Secure the Future of the Federal Statistical System? Examples from the NSF-Census Research Network Groshen, Erica L. Public Seminar Presentation by Erica L. 
Groshen at the Spring 2015 NCRN/CNSTAT Meetings PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40187 ER - TY - RPRT T1 - NCRN Meeting Spring 2015: Geographic Aspects of Direct and Indirect Estimators for Small Area Estimation Y1 - 2015 A1 - Nagle, Nicholas AB - NCRN Meeting Spring 2015: Geographic Aspects of Direct and Indirect Estimators for Small Area Estimation Nagle, Nicholas Presentation at the NCRN Meeting Spring 2015 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40182 ER - TY - RPRT T1 - NCRN Meeting Spring 2015: Geography and Usability of the American Community Survey Y1 - 2015 A1 - Spielman, Seth AB - NCRN Meeting Spring 2015: Geography and Usability of the American Community Survey Spielman, Seth Presentation at the NCRN Meeting Spring 2015 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40183 ER - TY - RPRT T1 - NCRN Meeting Spring 2015: Models for Multiscale Spatially-Referenced Count Data Y1 - 2015 A1 - Holan, Scott A1 - Bradley, Jonathan R. A1 - Wikle, Christopher K. AB - NCRN Meeting Spring 2015: Models for Multiscale Spatially-Referenced Count Data Holan, Scott; Bradley, Jonathan R.; Wikle, Christopher K. Presentation at the NCRN Meeting Spring 2015 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40176 ER - TY - RPRT T1 - NCRN Meeting Spring 2015: Regionalization of Multiscale Spatial Processes Using a Criterion for Spatial Aggregation Error Y1 - 2015 A1 - Wikle, Christopher K. A1 - Bradley, Jonathan A1 - Holan, Scott AB - NCRN Meeting Spring 2015: Regionalization of Multiscale Spatial Processes Using a Criterion for Spatial Aggregation Error Wikle, Christopher K.; Bradley, Jonathan; Holan, Scott Develop and implement a statistical criterion to diagnose spatial aggregation error that can facilitate the choice of regionalizations of spatial data. 
Presentation at NCRN Meeting Spring 2015 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40177 ER - TY - RPRT T1 - NCRN Meeting Spring 2015: Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods Y1 - 2015 A1 - Abowd, John M. A1 - Schmutte, Ian AB - NCRN Meeting Spring 2015: Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods Abowd, John M.; Schmutte, Ian Presentation at the NCRN Meeting Spring 2015 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40184 ER - TY - RPRT T1 - NCRN Meeting Spring 2015: Survey Informatics: The Future of Survey Methodology and Survey Statistics Training in the Academy? Y1 - 2015 A1 - McCutcheon, Allan L. AB -
NCRN Meeting Spring 2015: Survey Informatics: The Future of Survey Methodology and Survey Statistics Training in the Academy? McCutcheon, Allan L. Presentation at the NCRN Meeting Spring 2015
PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40309 ER - TY - RPRT T1 - NCRN Meeting Spring 2015: Training Undergraduates, Graduate Students, Postdocs, and Federal Agencies: Methodology, Data, and Science for Federal Statistics Y1 - 2015 A1 - Cressie, Noel A1 - Holan, Scott H. A1 - Wikle, Christopher K. AB - NCRN Meeting Spring 2015: Training Undergraduates, Graduate Students, Postdocs, and Federal Agencies: Methodology, Data, and Science for Federal Statistics Cressie, Noel; Holan, Scott H.; Wikle, Christopher K. Presentation at the NCRN Spring 2015 Meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40179 ER - TY - RPRT T1 - NCRN Newsletter: Volume 2 - Issue 1 Y1 - 2015 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - NCRN Newsletter: Volume 2 - Issue 1 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from October 2014 to January 2015. NCRN Newsletter Vol. 2, Issue 1: January 30, 2015. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40193 ER - TY - RPRT T1 - NCRN Newsletter: Volume 2 - Issue 2 Y1 - 2015 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - NCRN Newsletter: Volume 2 - Issue 2 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from January 2015 to May 2015. NCRN Newsletter Vol. 2, Issue 2: May 12, 2015. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40194 ER - TY - RPRT T1 - NCRN Newsletter: Volume 2 - Issue 2 Y1 - 2015 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - NCRN Newsletter: Volume 2 - Issue 2 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from February 2015 to May 2015. 
NCRN Newsletter Vol. 2, Issue 2: May 12, 2015. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/44200 ER - TY - RPRT T1 - NCRN Newsletter: Volume 2 - Issue 3 Y1 - 2015 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB -
NCRN Newsletter: Volume 2 - Issue 3 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from June 2015 through August 2015. NCRN Newsletter Vol. 2, Issue 3: September 15, 2015.
PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/42393 ER - TY - RPRT T1 - Noise Infusion as a Confidentiality Protection Measure for Graph-Based Statistics Y1 - 2015 A1 - Abowd, John A. A1 - McKinney, Kevin L. AB - Noise Infusion as a Confidentiality Protection Measure for Graph-Based Statistics Abowd, John A.; McKinney, Kevin L. We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau’s Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs. PB - Cornell University UR - http://hdl.handle.net/1813/42338 ER - TY - JOUR T1 - Nonparametric Bayesian models with focused clustering for mixed ordinal and nominal data JF - ArXiV Y1 - 2015 A1 - DeYoreo, Maria A1 - Reiter , J. P. A1 - Hillygus, D. S. AB - Dirichlet process mixtures can be useful models of multivariate categorical data and effective tools for multiple imputation of missing categorical values. In some contexts, however, these models can fit certain variables well at the expense of others in ways beyond the analyst's control. 
For example, when the data include some variables with non-trivial amounts of missing values, the mixture model may fit the marginal distributions of the nearly and fully complete variables at the expense of the variables with high fractions of missing data. Motivated by this setting, we present a Dirichlet process mixture model for mixed ordinal and nominal data that allows analysts to split variables into two groups: focus variables and remainder variables. The model uses three sets of clusters, one set for ordinal focus variables, one for nominal focus variables, and one for all remainder variables. The model uses a multivariate ordered probit specification for the ordinal variables and independent multinomial kernels for the nominal variables. The three sets of clusters are linked using an infinite tensor factorization prior, as well as via dependence of the means of the latent continuous focus variables on the remainder variables. This effectively specifies a rich, complex model for the focus variables and a simpler model for remainder variables, yet still potentially captures associations among the variables. In the multiple imputation context, focus variables include key variables with high rates of missing values, and remainder variables include variables without much missing data. Using simulations, we illustrate advantages and limitations of using focused clustering compared to mixture models that do not distinguish variables. We apply the model to handle missing values in an analysis of the 2012 American National Election Study. PB - arXiv UR - http://arxiv.org/abs/1508.03758 IS - 1508.03758 ER - TY - JOUR T1 - Nonparametric Bayesian models with focused clustering for mixed ordinal and nominal data JF - Bayesian Analysis Y1 - 2015 A1 - M. De Yoreo A1 - J. P. Reiter A1 - D. S. Hillygus AB - Dirichlet process mixtures can be useful models of multivariate categorical data and effective tools for multiple imputation of missing categorical values. 
In some contexts, however, these models can fit certain variables well at the expense of others in ways beyond the analyst's control. For example, when the data include some variables with non-trivial amounts of missing values, the mixture model may fit the marginal distributions of the nearly and fully complete variables at the expense of the variables with high fractions of missing data. Motivated by this setting, we present a Dirichlet process mixture model for mixed ordinal and nominal data that allows analysts to split variables into two groups: focus variables and remainder variables. The model uses three sets of clusters, one set for ordinal focus variables, one for nominal focus variables, and one for all remainder variables. The model uses a multivariate ordered probit specification for the ordinal variables and independent multinomial kernels for the nominal variables. The three sets of clusters are linked using an infinite tensor factorization prior, as well as via dependence of the means of the latent continuous focus variables on the remainder variables. This effectively specifies a rich, complex model for the focus variables and a simpler model for remainder variables, yet still potentially captures associations among the variables. In the multiple imputation context, focus variables include key variables with high rates of missing values, and remainder variables include variables without much missing data. Using simulations, we illustrate advantages and limitations of using focused clustering compared to mixture models that do not distinguish variables. We apply the model to handle missing values in an analysis of the 2012 American National Election Study. ER - TY - JOUR T1 - A nonparametric, multiple imputation-based method for the retrospective integration of data sets JF - Multivariate Behavioral Research Y1 - 2015 A1 - M.M. Carrig A1 - D. Manrique-Vallier A1 - K. Ranby A1 - J.P. Reiter A1 - R. 
Hoyle VL - 50 UR - http://www.tandfonline.com/doi/full/10.1080/00273171.2015.1022641 IS - 4 ER - TY - JOUR T1 - Perceptions, behaviors and satisfaction related to public safety for persons with disabilities in the United States JF - Criminal Justice Review Y1 - 2015 A1 - Brucker, D. VL - 1 IS - 18 ER - TY - CONF T1 - Predicting Breakoff Using Sequential Machine Learning Methods T2 - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) Y1 - 2015 A1 - Soh, L.-K. A1 - Eck, A. A1 - McCutcheon, A.L. JF - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) CY - Hollywood, Florida UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - RPRT T1 - Presentation: NADDI 2015: Crowdsourcing DDI Development: New Features from the CED2AR Project Y1 - 2015 A1 - Perry, Benjamin A1 - Kambhampaty, Venkata A1 - Brumsted, Kyle A1 - Vilhuber, Lars A1 - Block, William AB - Presentation: NADDI 2015: Crowdsourcing DDI Development: New Features from the CED2AR Project Perry, Benjamin; Kambhampaty, Venkata; Brumsted, Kyle; Vilhuber, Lars; Block, William Recent years have shown the power of user-sourced information evidenced by the success of Wikipedia and its many emulators. This sort of unstructured discussion is currently not feasible as a part of the otherwise successful metadata repositories. Creating and augmenting metadata is a labor-intensive endeavor. Harnessing collective knowledge from actual data users can supplement officially generated metadata. As part of our Comprehensive Extensible Data Documentation and Access Repository (CED2AR) infrastructure, we demonstrate a prototype of crowdsourced DDI, using DDI-C and supplemental XML. The system allows for any number of network connected instances (web or desktop deployments) of the CED2AR DDI editor to concurrently create and modify metadata. 
The backend transparently handles changes, and frontend has the ability to separate official edits (by designated curators of the data and the metadata) from crowd-sourced content. We briefly discuss offline edit contributions as well. CED2AR uses DDI-C and supplemental XML together with Git for a very portable and lightweight implementation. This distributed network implementation allows for large scale metadata curation without the need for a hardware intensive computing environment, and can leverage existing cloud services, such as Github or Bitbucket. Ben Perry (Cornell/NCRN) presents joint work with Venkata Kambhampaty, Kyle Brumsted, Lars Vilhuber, & William C. Block at NADDI 2015. PB - Cornell University UR - http://hdl.handle.net/1813/40172 ER - TY - JOUR T1 - Preventive policy strategy for banking the unbanked: Savings accounts for teenagers? JF - Journal of Poverty Y1 - 2015 A1 - Friedline, T. A1 - Despard, M. A1 - Chowa, G. KW - financial assets KW - savings KW - Survey of Income and Program Participation (SIPP) KW - teenagers KW - unbanked KW - young adults AB - Concern over percentages of unbanked and underbanked households in the United States and their lack of connectedness to the financial mainstream has led to policy strategies geared toward reaching these households. Using nationally-representative longitudinal data, a preventive strategy for banking households is tested that asks whether young adults are more likely to be banked and own a diversity of financial assets when they are connected to the financial mainstream as teenagers. Young adults are more likely to own checking accounts, savings accounts, certificates of deposit, and stocks when they had savings accounts as teenagers. Policy implications are discussed. 
VL - 20 UR - http://www.tandfonline.com/doi/full/10.1080/10875549.2015.1015068 IS - 1 ER - TY - JOUR T1 - Privacy and human behavior in the age of information JF - Science Y1 - 2015 A1 - Alessandro Acquisti A1 - Laura Brandimarte A1 - George Loewenstein KW - confidentiality KW - privacy AB - This Review summarizes and draws connections between diverse streams of empirical research on privacy behavior. We use three themes to connect insights from social and behavioral sciences: people’s uncertainty about the consequences of privacy-related behaviors and their own preferences over those consequences; the context-dependence of people’s concern, or lack thereof, about privacy; and the degree to which privacy concerns are malleable—manipulable by commercial and governmental interests. Organizing our discussion by these themes, we offer observations concerning the role of public policy in the protection of privacy in the information age. VL - 347 UR - http://www.sciencemag.org/content/347/6221/509 IS - 6221 ER - TY - THES T1 - Probabilistic Hashing Techniques For Big Data T2 - Computer Science Y1 - 2015 A1 - Anshumali Shrivastava AB - We investigate probabilistic hashing techniques for addressing computational and memory challenges in large scale machine learning and data mining systems. In this thesis, we show that the traditional idea of hashing goes far beyond near-neighbor search and there are some striking new possibilities. We show that hashing can improve state of the art large scale learning algorithms, and it goes beyond the conventional notions of pairwise similarities. Despite being a very well studied topic in literature, we found several opportunities for fundamentally improving some of the well know textbook hashing algorithms. In particular, we show that the traditional way of computing minwise hashes is unnecessarily expensive and without loosing anything we can achieve an order of magnitude speedup. 
We also found that for cosine similarity search there is a better scheme than SimHash. In the end, we show that the existing locality sensitive hashing framework itself is very restrictive, and we cannot have efficient algorithms for some important measures like inner products which are ubiquitous in machine learning. We propose asymmetric locality sensitive hashing (ALSH), an extended framework, where we show provable and practical efficient algorithms for Maximum Inner Product Search (MIPS). Having such an efficient solutions to MIPS directly scales up many popular machine learning algorithms. We believe that this thesis provides significant improvements to some of the heavily used subroutines in big-data systems, which we hope will be adopted. JF - Computer Science PB - Cornell University VL - Ph.D. UR - https://ecommons.cornell.edu/handle/1813/40886 ER - TY - THES T1 - Ranking Firms Using Revealed Preference and Other Essays About Labor Markets T2 - Department of Economics Y1 - 2015 A1 - Isaac Sorkin KW - economics KW - labor markets AB - This dissertation contains essays on three questions about the labor market. Chapter 1 considers the question: why do some firms pay so much and some so little? Firms account for a substantial portion of earnings inequality. Although the standard explanation is that there are search frictions that support an equilibrium with rents, this chapter finds that compensating differentials for nonpecuniary characteristics are at least as important. To reach this finding, this chapter develops a structural search model and estimates it on U.S. administrative data. The model analyzes the revealed preference information in the labor market: specifically, how workers move between the 1.5 million firms in the data. With on the order of 1.5 million parameters, standard estimation approaches are infeasible and so the chapter develops a new estimation approach that is feasible on such big data. 
Chapter 2 considers the question: why do men and women work at different firms? Men work for higher-paying firms than women. The chapter builds on chapter 1 to consider two explanations for why men and women work in different firms. First, men and women might search from different offer distributions. Second, men and women might have different rankings of firms. Estimation finds that the main explanation for why men and women are sorted is that women search from a lower-paying offer distribution than men. Indeed, men and women are estimated to have quite similar rankings of firms. Chapter 3 considers the question: what are the long-run effects of the minimum wage? An empirical consensus suggests that there are small employment effects of minimum wage increases. This chapter argues that these are short-run elasticities. Long-run elasticities, which may differ from short-run elasticities, are more policy relevant. This chapter develops a dynamic industry equilibrium model of labor demand. The model makes two points. First, long-run regressions have been misinterpreted because even if the short- and long-run employment elasticities differ, standard methods would not detect a difference using U.S. variation. Second, the model offers a reconciliation of the small estimated short-run employment effects with the commonly found pass-through of minimum wage increases to product prices. JF - Department of Economics PB - University of Michigan CY - Ann Arbor, MI UR - http://hdl.handle.net/2027.42/116747 ER - TY - JOUR T1 - Record Linkage using STATA: Pre-processing, Linking and Reviewing Utilities JF - The Stata Journal Y1 - 2015 A1 - Wasi, Nada A1 - Flaaen, Aaron AB - In this article, we describe Stata utilities that facilitate probabilistic record linkage—the technique typically used for merging two datasets with no common record identifier. 
While the preprocessing tools are developed specifically for linking two company databases, the other tools can be used for many different types of linkage. Specifically, the stnd_compname and stnd_address commands parse and standardize company names and addresses to improve the match quality when linking. The reclink2 command is a generalized version of Blasnik's reclink (2010, Statistical Software Components S456876, Department of Economics, Boston College) that allows for many-to-one matching. Finally, clrevmatch is an interactive tool that allows the user to review matched results in an efficient and seamless manner. Rather than exporting results to another file format (for example, Excel), inputting clerical reviews, and importing back into Stata, one can use the clrevmatch tool to conduct all of these steps within Stata. This helps improve the speed and flexibility of matching, which often involves multiple runs. VL - 15 UR - http://www.stata-journal.com/article.html?article=dm0082 IS - 3 ER - TY - CONF T1 - Recording What the Respondent Says: Does Question Format Matter? T2 - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) Y1 - 2015 A1 - Smyth, J.D. A1 - Olson, K. JF - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) CY - Hollywood, Florida UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - JOUR T1 - Reducing the Margins of Error in the American Community Survey Through Data-Driven Regionalization JF - PlosOne Y1 - 2015 A1 - Folch, D. A1 - Spielman, S. E. UR - http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115626 ER - TY - JOUR T1 - Regionalization of Multiscale Spatial Processes using a Criterion for Spatial Aggregation Error JF - ArXiv Y1 - 2015 A1 - Bradley, J. R. A1 - Wikle, C.K. A1 - Holan, S. H. 
AB - The modifiable areal unit problem and the ecological fallacy are known problems that occur when modeling multiscale spatial processes. We investigate how these forms of spatial aggregation error can guide a regionalization over a spatial domain of interest. By "regionalization" we mean a specification of geographies that define the spatial support for areal data. This topic has been studied vigorously by geographers, but has been given less attention by spatial statisticians. Thus, we propose a criterion for spatial aggregation error (CAGE), which we minimize to obtain an optimal regionalization. To define CAGE we draw a connection between spatial aggregation error and a new multiscale representation of the Karhunen-Loeve (K-L) expansion. This relationship between CAGE and the multiscale K-L expansion leads to illuminating theoretical developments including: connections between spatial aggregation error, squared prediction error, spatial variance, and a novel extension of Obled-Creutin eigenfunctions. The effectiveness of our approach is demonstrated through an analysis of two datasets, one using the American Community Survey and one related to environmental ocean winds. UR - http://arxiv.org/abs/1502.01974 IS - 1502.01974 ER - TY - JOUR T1 - Rejoinder on: Comparing and selecting spatial predictors using local criteria JF - Test Y1 - 2015 A1 - Bradley, J.R. A1 - Cressie, N. A1 - Shi, T. VL - 24 UR - http://dx.doi.org/10.1007/s11749-014-0414-2 IS - 1 ER - TY - THES T1 - Relaxations of differential privacy and risk utility evaluations of synthetic data and fidelity measures T2 - Statistics Department Y1 - 2015 A1 - McClure, D. AB - Many organizations collect data that would be useful to public researchers, but cannot be shared due to promises of confidentiality to those that participated in the study. This thesis evaluates the risks and utility of several existing release methods, as well as develops new ones with different risk/utility tradeoffs. 
In Chapter 2, I present a new risk metric, called model-specific probabilistic differential privacy (MPDP), which is a relaxed version of differential privacy that allows the risk of a release to be based on the worst-case among plausible datasets instead of all possible datasets. In addition, I develop a generic algorithm called local sensitivity random sampling (LSRS) that, under certain assumptions, is guaranteed to give releases that meet MPDP for any query with computable local sensitivity. I demonstrate, using several well-known queries, that LSRS releases have much higher utility than the standard differentially private release mechanism, the Laplace Mechanism, at only marginally higher risk. In Chapter 3, using two synthesis models, I empirically characterize the risks of releasing synthetic data under the standard “all but one” assumption on intruder background knowledge, as well as the effect that decreasing the number of observations the intruder knows beforehand has on that risk. I find in these examples that even in the “all but one” case, there is no risk except to extreme outliers, and even then the risk is mild. I find that the effect that removing observations from an intruder’s background knowledge has on risk depends heavily on how well that intruder can fill in those missing observations: the risk remains fairly constant if he/she can fill them in well, and the risk drops quickly if he/she cannot. In Chapter 4, I characterize the risk/utility tradeoffs for an augmentation of synthetic data called fidelity measures (see Section 1.2.3). Fidelity measures were proposed in Reiter et al. (2009) to quantify the degree to which the results of an analysis performed on a released synthetic dataset match with the results of the same analysis performed on the confidential data. 
I compare the risk/utility of two different fidelity measures, the confidence interval overlap (Karr et al., 2006) and a new fidelity measure I call the mean predicted probability difference (MPPD). Simultaneously, I compare the risk/utility tradeoffs of two different private release mechanisms, LSRS and a heuristic release method called “safety zones”. I find that the confidence interval overlap can be applied to a wider variety of analyses and is more specific than MPPD, but MPPD is more robust to the influence of individual observations in the confidential data, which means it can be released with less noise than the confidence interval overlap with the same level of risk. I also find that while safety zones are much simpler to compute and generally have good utility (whereas the utility of LSRS depends on the value of ε), they are also much more vulnerable to context-specific attacks that, while not easy for an intruder to implement, are difficult to anticipate. JF - Statistics Department PB - Duke University VL - PhD UR - http://hdl.handle.net/10161/11365 ER - TY - CONF T1 - The Role of Device Type and Respondent Characteristics in Internet Panel Survey Breakoff T2 - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) Y1 - 2015 A1 - Allan L. McCutcheon JF - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) CY - Hollywood, Florida UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - JOUR T1 - The SAR model for very large datasets: A reduced-rank approach JF - Econometrics Y1 - 2015 A1 - Burden, S. A1 - Cressie, N. A1 - Steel, D.G. VL - 3 UR - http://www.mdpi.com/2225-1146/3/2/317 IS - 2 ER - TY - JOUR T1 - Semi-parametric selection models for potentially non-ignorable attrition in panel studies with refreshment samples JF - Political Analysis Y1 - 2015 A1 - Y. Si A1 - J.P. Reiter A1 - D.S. 
Hillygus VL - 23 UR - http://pan.oxfordjournals.org/cgi/reprint/mpu009?%20ijkey=joX8eSl6gyIlQKP&keytype=ref ER - TY - JOUR T1 - Simultaneous Edit-Imputation for Continuous Microdata JF - Journal of the American Statistical Association Y1 - 2015 A1 - Kim, H. J. A1 - Cox, L. H. A1 - Karr, A. F. A1 - Reiter, J. P. A1 - Wang, Q. VL - 110 UR - http://www.tandfonline.com/doi/abs/10.1080/01621459.2015.1040881 ER - TY - JOUR T1 - Small Area Estimation via Multivariate Fay-Herriot Models With Latent Spatial Dependence JF - Australian & New Zealand Journal of Statistics Y1 - 2015 A1 - Porter, A.T. A1 - Wikle, C.K. A1 - Holan, S.H. VL - 57 UR - http://arxiv.org/abs/1310.7211 ER - TY - JOUR T1 - Spatio-temporal change of support with application to American Community Survey multi-year period estimates JF - Stat Y1 - 2015 A1 - Bradley, Jonathan R. A1 - Wikle, Christopher K. A1 - Holan, Scott H. KW - Bayesian KW - change-of-support KW - dynamical KW - hierarchical models KW - mixed-effects model KW - Moran's I KW - multi-year period estimate AB - We present hierarchical Bayesian methodology to perform spatio-temporal change of support (COS) for survey data with Gaussian sampling errors. This methodology is motivated by the American Community Survey (ACS), which is an ongoing survey administered by the US Census Bureau that provides timely information on several key demographic variables. The ACS has published 1-year, 3-year, and 5-year period estimates, and margins of errors, for demographic and socio-economic variables recorded over predefined geographies. The spatio-temporal COS methodology considered here provides data users with a way to estimate ACS variables on customized geographies and time periods while accounting for sampling errors. Additionally, 3-year ACS period estimates are to be discontinued, and this methodology can provide predictions of ACS variables for 3-year periods given the available period estimates. 
The methodology is based on a spatio-temporal mixed-effects model with a low-dimensional spatio-temporal basis function representation, which provides multi-resolution estimates through basis function aggregation in space and time. This methodology includes a novel parameterization that uses a target dynamical process and recently proposed parsimonious Moran's I propagator structures. Our approach is demonstrated through two applications using public-use ACS estimates and is shown to produce good predictions on a hold-out set of 3-year period estimates. Copyright © 2015 John Wiley & Sons, Ltd. VL - 4 UR - http://dx.doi.org/10.1002/sta4.94 ER - TY - JOUR T1 - Statistical Disclosure Limitation in the Presence of Edit Rules JF - Journal of Official Statistics Y1 - 2015 A1 - Kim, H.J. A1 - Karr, A.F. A1 - Reiter, J.P. VL - 31 ER - TY - JOUR T1 - A stochastic bioenergetics model based approach to translating large river flow and temperature in to fish population responses: the pallid sturgeon example JF - Geological Society Y1 - 2015 A1 - Wildhaber, M.L. A1 - Dey, R. A1 - Wikle, C.K. A1 - Anderson, C.J. A1 - Moran, E.H. A1 - Franz, K.J. VL - 408 ER - TY - JOUR T1 - Stop or continue data collection: A nonignorable missing data approach for continuous variables JF - ArXiv Y1 - 2015 A1 - T. Paiva A1 - J.P. Reiter KW - Methodology AB - We present an approach to inform decisions about nonresponse followup sampling. The basic idea is (i) to create completed samples by imputing nonrespondents' data under various assumptions about the nonresponse mechanisms, (ii) to take hypothetical samples of varying sizes from the completed samples, and (iii) to compute and compare measures of accuracy and cost for different proposed sample sizes. As part of the methodology, we present a new approach for generating imputations for multivariate continuous data with nonignorable unit nonresponse. 
We fit mixtures of multivariate normal distributions to the respondents' data, and adjust the probabilities of the mixture components to generate nonrespondents' distributions with desired features. We illustrate the approaches using data from the 2007 U. S. Census of Manufactures. UR - http://arxiv.org/abs/1511.02189 IS - 1511.02189 ER - TY - JOUR T1 - Studying Neighborhoods Using Uncertain Data from the American Community Survey: A Contextual Approach JF - Annals of the Association of American Geographers Y1 - 2015 A1 - Seth E. Spielman A1 - Alex Singleton AB - In 2010 the American Community Survey (ACS) replaced the long form of the decennial census as the sole national source of demographic and economic data for small geographic areas such as census tracts. These small area estimates suffer from large margins of error, however, which makes the data difficult to use for many purposes. The value of a large and comprehensive survey like the ACS is that it provides a richly detailed, multivariate, composite picture of small areas. This article argues that one solution to the problem of large margins of error in the ACS is to shift from a variable-based mode of inquiry to one that emphasizes a composite multivariate picture of census tracts. Because the margin of error in a single ACS estimate, like household income, is assumed to be a symmetrically distributed random variable, positive and negative errors are equally likely. Because the variable-specific estimates are largely independent from each other, when looking at a large collection of variables these random errors average to zero. This means that although single variables can be methodologically problematic at the census tract scale, a large collection of such variables provides utility as a contextual descriptor of the place(s) under investigation. This idea is demonstrated by developing a geodemographic typology of all U.S. census tracts. 
The typology is firmly rooted in the social scientific literature and is organized around a framework of concepts, domains, and measures. The typology is validated using public domain data from the City of Chicago and the U.S. Federal Election Commission. The typology, as well as the data and methods used to create it, is open source and published freely online. VL - 105 UR - http://dx.doi.org/10.1080/00045608.2015.1052335 ER - TY - CONF T1 - Survey Informatics: The Future of Survey Methodology and Survey Statistics Training in the Academy? T2 - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) Y1 - 2015 A1 - Allan L. McCutcheon JF - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) CY - Hollywood, Florida UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - RPRT T1 - Synthetic Establishment Microdata Around the World Y1 - 2015 A1 - Vilhuber, Lars A1 - Abowd, John A. A1 - Reiter, Jerome P. AB - In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature. 
PB - Cornell University UR - http://hdl.handle.net/1813/42340 ER - TY - JOUR T1 - Understanding the Dynamics of $2-a-Day Poverty in the United States JF - The Russell Sage Foundation Journal of the Social Sciences Y1 - 2015 A1 - Shaefer, H. Luke A1 - Edin, Kathryn A1 - Talbert, E. VL - 1 IS - Severe Deprivation ER - TY - JOUR T1 - Understanding the Human Condition through Survey Informatics JF - IEEE Computer Y1 - 2015 A1 - Eck, A. A1 - Leen-Kiat, S. A1 - McCutcheon, A. L. A1 - Smyth, J.D. A1 - Belli, R.F. VL - 48 IS - 11 ER - TY - CONF T1 - The Use of Paradata to Evaluate Interview Complexity and Data Quality (in Calendar and Time Diary Surveys) T2 - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) Y1 - 2015 A1 - Cordova-Cazar, A.L. A1 - Belli, R.F. JF - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) CY - Hollywood, Florida UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - CONF T1 - Using Data Mining to Examine Interviewer-Respondent Interactions in Calendar Interviews T2 - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) Y1 - 2015 A1 - Belli, R.F. A1 - Miller, L.D. A1 - Soh, L.-K. A1 - T. Al Baghal JF - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) CY - Hollywood, Florida UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - CONF T1 - Using Machine Learning Techniques to Predict Respondent Type from A Priori Demographic Information T2 - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) Y1 - 2015 A1 - Atkin, G. A1 - Arunachalam, H. A1 - Eck, A. A1 - Wettlaufer, D. A1 - Soh, L.-K. A1 - Belli, R.F. 
JF - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) CY - Hollywood, Florida UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - RPRT T1 - Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics Y1 - 2015 A1 - Vilhuber, Lars A1 - Miranda, Javier AB - We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions). PB - Cornell University UR - http://hdl.handle.net/1813/42339 ER - TY - CONF T1 - Web Surveys, Online Panels, and Paradata: Automating Responsive Design T2 - 2015 Joint Program in Survey Methodology (JPSM) Distinguished Lecture Y1 - 2015 A1 - Allan L. McCutcheon JF - 2015 Joint Program in Survey Methodology (JPSM) Distinguished Lecture CY - University of Maryland. College Park, MD UR - http://www.jpsm.umd.edu/ ER - TY - JOUR T1 - Who’s Left Out? Characteristics of Households in Economic Need not Receiving Public Support JF - Journal of Sociology and Social Welfare Y1 - 2015 A1 - Fusaro, V. VL - 42 IS - 3 ER - TY - CONF T1 - Why Do Interviewers Speed Up? An Examination of Changes in Interviewer Behaviors over the Course of the Survey Field Period T2 - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) Y1 - 2015 A1 - Olson, K. A1 - Smyth, J.D. 
JF - 70th Annual Conference of the American Association for Public Opinion Research (AAPOR) CY - Hollywood, Florida UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - CONF T1 - Achieving balance: Understanding the relationship between complexity and response quality T2 - American Association for Public Opinion Research 2014 Annual Conference Y1 - 2014 A1 - Powell, R.J. A1 - Kirchner, A. JF - American Association for Public Opinion Research 2014 Annual Conference CY - Anaheim, CA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - JOUR T1 - Agent Based Models: Statistical Challenges and Opportunities JF - Statistics Views Y1 - 2014 A1 - Wikle, C.K. PB - Wiley UR - http://www.statisticsviews.com/details/feature/6354691/Agent-Based-Models-Statistical-Challenges-and-Opportunities.html ER - TY - CHAP T1 - Analytical frameworks for data release: A statistical view T2 - Confidentiality and Data Access in the Use of Big Data: Theory and Practical Approaches Y1 - 2014 A1 - A. F. Karr A1 - J. P. Reiter JF - Confidentiality and Data Access in the Use of Big Data: Theory and Practical Approaches PB - Cambridge University Press CY - New York City, NY ER - TY - ABST T1 - An Approach for Identifying and Predicting Economic Recessions in Real-Time Using Time-Frequency Functional Models, Seminar on Bayesian Inference in Econometrics and Statistics (SBIES) Y1 - 2014 A1 - Holan, S.H. ER - TY - CONF T1 - An Approach for Identifying and Predicting Economic Recessions in Real-Time Using Time-Frequency Functional Models T2 - Joint Statistical Meetings 2014 Y1 - 2014 A1 - Holan, S.H. JF - Joint Statistical Meetings 2014 PB - Joint Statistical Meetings CY - Boston, MA UR - http://www.amstat.org/meetings/jsm/2014/onlineprogram/AbstractDetails.cfm?abstractid=310841 ER - TY - JOUR T1 - Asymptotic Theory of Cepstral Random Fields JF - Annals of Statistics Y1 - 2014 A1 - McElroy, T. A1 - Holan, S. 
PB - University of Missouri VL - 42 UR - http://arxiv.org/pdf/1112.1977v4.pdf ER - TY - CHAP T1 - Autobiographical memory dynamics in survey research T2 - SAGE Handbook of Applied Memory Y1 - 2014 A1 - Belli, R. F. ED - T. J. Perfect ED - D. S. Lindsay JF - SAGE Handbook of Applied Memory PB - Sage UR - http://dx.doi.org/10.4135/9781446294703 ER - TY - ABST T1 - A Bayesian Approach to Estimating Agricultural Yield Based on Multiple Repeated Surveys Y1 - 2014 A1 - Holan, S.H. ER - TY - CONF T1 - Bayesian Dynamic Time-Frequency Estimation T2 - Twelfth World Meeting of ISBA Y1 - 2014 A1 - Holan, S.H. JF - Twelfth World Meeting of ISBA PB - ISBA CY - Cancun, Mexico ER - TY - JOUR T1 - Bayesian estimation of disclosure risks for multiply imputed, synthetic data JF - Journal of Privacy and Confidentiality Y1 - 2014 A1 - Reiter, J. P. A1 - Wang, Q. A1 - Zhang, B. AB -

Agencies seeking to disseminate public use microdata, i.e., data on individual records, can replace confidential values with multiple draws from statistical models estimated with the collected data. We present a framework for evaluating disclosure risks inherent in releasing multiply-imputed, synthetic data. The basic idea is to mimic an intruder who computes posterior distributions of confidential values given the released synthetic data and prior knowledge. We illustrate the methodology with artificial fully synthetic data and with partial synthesis of the Survey of Youth in Custody.

VL - 6 UR - http://repository.cmu.edu/jpc/vol6/iss1/2 IS - 1 ER - TY - JOUR T1 - Bayesian estimation of discrete multivariate latent structure models with structural zeros JF - Journal of Computational and Graphical Statistics Y1 - 2014 A1 - Manrique-Vallier, D. A1 - Reiter, J.P. VL - 23 ER - TY - JOUR T1 - Bayesian multiple imputation for large-scale categorical data with structural zeros JF - Survey Methodology Y1 - 2014 A1 - D. Manrique-Vallier A1 - J.P. Reiter VL - 40 UR - http://www.stat.duke.edu/ jerry/Papers/SurvMeth14.pdf ER - TY - RPRT T1 - Bayesian Nonparametric Modeling for Multivariate Ordinal Regression Y1 - 2014 A1 - DeYoreo, M. A1 - Kottas, A. KW - Statistics - Methodology AB - Univariate or multivariate ordinal responses are often assumed to arise from a latent continuous parametric distribution, with covariate effects which enter linearly. We introduce a Bayesian nonparametric modeling approach for univariate and multivariate ordinal regression, which is based on mixture modeling for the joint distribution of latent responses and covariates. The modeling framework enables highly flexible inference for ordinal regression relationships, avoiding assumptions of linearity or additivity in the covariate effects. In standard parametric ordinal regression models, computational challenges arise from identifiability constraints and estimation of parameters requiring nonstandard inferential techniques. A key feature of the nonparametric model is that it achieves inferential flexibility, while avoiding these difficulties. In particular, we establish full support of the nonparametric mixture model under fixed cut-off points that relate through discretization the latent continuous responses with the ordinal responses. The practical utility of the modeling approach is illustrated through application to two data sets from econometrics, an example involving regression relationships for ozone concentration, and a multirater agreement problem. 
PB - ArXiv UR - http://arxiv.org/abs/1408.1027 ER - TY - ABST T1 - Big Data Methodology Applied to Small Area Estimation Y1 - 2014 A1 - Porter, A.T. ER - TY - CONF T1 - Call back later: The association of recruitment contact and error in the American Time Use Survey T2 - American Association for Public Opinion Research 2014 Annual Conference Y1 - 2014 A1 - Countryman, A. A1 - Cordova-Cazar, A.L. A1 - Deal, C.E. A1 - Belli, R.F. JF - American Association for Public Opinion Research 2014 Annual Conference CY - Anaheim, CA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - JOUR T1 - A CAR model for multiple outcomes on mismatched lattices JF - Spatial and Spatio-Temporal Epidemiology Y1 - 2014 A1 - Porter, A.T. A1 - Oleson, J. VL - 11 UR - http://www.sciencedirect.com/science/article/pii/S1877584514000604 ER - TY - JOUR T1 - Causes and Patterns of Uncertainty in the American Community Survey JF - Applied Geography Y1 - 2014 A1 - Spielman, S. E. A1 - Folch, D. A1 - Nagle, N. VL - 46 UR - http://www.sciencedirect.com/science/article/pii/S0143622813002518 ER - TY - RPRT T1 - CED 2 AR: The Comprehensive Extensible Data Documentation and Access Repository Y1 - 2014 A1 - Lagoze, Carl A1 - Vilhuber, Lars A1 - Williams, Jeremy A1 - Perry, Benjamin A1 - Block, William C. AB - We describe the design, implementation, and deployment of the Comprehensive Extensible Data Documentation and Access Repository (CED 2 AR). This is a metadata repository system that allows researchers to search, browse, access, and cite confidential data and metadata through either a web-based user interface or programmatically through a search API, all the while reusing and linking to existing archive and provider generated metadata. 
CED 2 AR is distinguished from other metadata repository-based applications due to requirements that derive from its social science context. These include the need to cloak confidential data and metadata and manage complex provenance chains. Presented at the 2014 IEEE/ACM Joint Conference on Digital Libraries (JCDL), Sept 8-12, 2014. PB - Cornell University UR - http://hdl.handle.net/1813/44702 ER - TY - RPRT T1 - The Cepstral Model for Multivariate Time Series: The Vector Exponential Model. Y1 - 2014 A1 - Holan, S.H. A1 - McElroy, T.S. A1 - Wu, G. AB -

Vector autoregressive (VAR) models have become a staple in the analysis of multivariate time series and are formulated in the time domain as difference equations, with an implied covariance structure. In many contexts, it is desirable to work with a stable, or at least stationary, representation. To fit such models, one must impose restrictions on the coefficient matrices to ensure that certain determinants are nonzero, which, except in special cases, may prove burdensome. To circumvent these difficulties, we propose a flexible frequency domain model expressed in terms of the spectral density matrix. Specifically, this paper treats the modeling of covariance stationary vector-valued (i.e., multivariate) time series via an extension of the exponential model for the spectrum of a scalar time series. We discuss the modeling advantages of the vector exponential model and its computational facets, such as how to obtain Wold coefficients from given cepstral coefficients. Finally, we demonstrate the utility of our approach through simulation as well as two illustrative data examples focusing on multi-step ahead forecasting and estimation of squared coherence.

PB - arXiv UR - http://arxiv.org/abs/1406.0801 ER - TY - CONF T1 - Changes in interviewer-related error over the course of the field period: An empirical examination using paradata T2 - Joint Statistical Meetings Y1 - 2014 A1 - Olson, K. A1 - Kirchner, A. JF - Joint Statistical Meetings CY - Boston, MA ER - TY - CONF T1 - Changes in interviewer-related error over the course of the field period: An empirical examination using paradata T2 - American Association for Public Opinion Research 2014 Annual Conference Y1 - 2014 A1 - Olson, K. A1 - Kirchner, A. JF - American Association for Public Opinion Research 2014 Annual Conference CY - Anaheim, CA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - JOUR T1 - The Co-Evolution of Residential Segregation and the Built Environment at the Turn of the 20th Century: a Schelling Model JF - Transactions in GIS Y1 - 2014 A1 - Spielman, S. E. A1 - Harrison, P. VL - 18 UR - http://onlinelibrary.wiley.com/enhanced/doi/10.1111/tgis.12014/ ER - TY - RPRT T1 - Collaborative Editing of DDI Metadata: The Latest from the CED2AR Project Y1 - 2014 A1 - Perry, Benjamin A1 - Kambhampaty, Venkata A1 - Brumsted, Kyle A1 - Vilhuber, Lars A1 - Block, William AB - Benjamin Perry's presentation on "Collaborative Editing and Versioning of DDI Metadata: The Latest from Cornell's NCRN CED²AR Software" at the 6th Annual European DDI User Conference in London, 12/02/2014. PB - Cornell University UR - http://hdl.handle.net/1813/38200 ER - TY - CONF T1 - Commitment, concealment, and confusion: An empirical assessment of interviewer and respondent behaviors in survey interviews T2 - 39th Annual Conference of the Midwest Association for Public Opinion Research Y1 - 2014 A1 - Kirchner, A. A1 - Olson, K. 
JF - 39th Annual Conference of the Midwest Association for Public Opinion Research CY - Chicago, IL UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - RPRT T1 - Communicating Uncertainty in Official Economic Statistics Y1 - 2014 A1 - Manski, Charles AB - Federal statistical agencies in the United States and analogous agencies elsewhere commonly report official economic statistics as point estimates, without accompanying measures of error. Users of the statistics may incorrectly view them as error-free or may incorrectly conjecture error magnitudes. This paper discusses strategies to mitigate misinterpretation of official statistics by communicating uncertainty to the public. Sampling error can be measured using established statistical principles. The challenge is to satisfactorily measure the various forms of nonsampling error. I find it useful to distinguish transitory statistical uncertainty, permanent statistical uncertainty, and conceptual uncertainty. I illustrate how each arises as the Bureau of Economic Analysis periodically revises GDP estimates, the Census Bureau generates household income statistics from surveys with nonresponse, and the Bureau of Labor Statistics seasonally adjusts employment statistics. PB - Northwestern University UR - http://hdl.handle.net/1813/36323 ER - TY - RPRT T1 - Communicating Uncertainty in Official Economic Statistics: An Appraisal Fifty Years after Morgenstern Y1 - 2014 A1 - Manski, Charles F. AB -

Federal statistical agencies in the United States and analogous agencies elsewhere commonly report official economic statistics as point estimates, without accompanying measures of error. Users of the statistics may incorrectly view them as error-free or may incorrectly conjecture error magnitudes. This paper discusses strategies to mitigate misinterpretation of official statistics by communicating uncertainty to the public. Sampling error can be measured using established statistical principles. The challenge is to satisfactorily measure the various forms of nonsampling error. I find it useful to distinguish transitory statistical uncertainty, permanent statistical uncertainty, and conceptual uncertainty. I illustrate how each arises as the Bureau of Economic Analysis periodically revises GDP estimates, the Census Bureau generates household income statistics from surveys with nonresponse, and the Bureau of Labor Statistics seasonally adjusts employment statistics. I anchor my discussion of communication of uncertainty in the contribution of Morgenstern (1963), who argued forcefully for agency publication of error estimates for official economic statistics.

PB - Northwestern University UR - http://hdl.handle.net/1813/40830 ER - TY - THES T1 - Comparing models of Demographic Subpopulations (Master's Thesis) Y1 - 2014 A1 - Moehl, J. PB - University of Tennessee UR - http://trace.tennessee.edu/utk_gradthes/2835/; http://trace.tennessee.edu/cgi/viewcontent.cgi?article=4005&context=utk_gradthes ER - TY - CHAP T1 - A Comparison of Blocking Methods for Record Linkage T2 - Privacy in Statistical Databases Y1 - 2014 A1 - Steorts, R. A1 - Ventura, S. A1 - Sadinle, M. A1 - Fienberg, S. E. A1 - Domingo-Ferrer, J. JF - Privacy in Statistical Databases PB - Springer VL - 8744 UR - http://link.springer.com/chapter/10.1007/978-3-319-11257-2_20 ER - TY - JOUR T1 - A Comparison of Spatial Predictors when Datasets Could be Very Large JF - ArXiv Y1 - 2014 A1 - Bradley, J. R. A1 - Cressie, N. A1 - Shi, T. KW - Statistics - Methodology AB -

In this article, we review and compare a number of methods of spatial prediction. To demonstrate the breadth of available choices, we consider both traditional and more-recently-introduced spatial predictors. Specifically, in our exposition we review: traditional stationary kriging, smoothing splines, negative-exponential distance-weighting, Fixed Rank Kriging, modified predictive processes, a stochastic partial differential equation approach, and lattice kriging. This comparison is meant to provide a service to practitioners wishing to decide between spatial predictors. Hence, we provide technical material for the unfamiliar, which includes the definition and motivation for each (deterministic and stochastic) spatial predictor. We use a benchmark dataset of CO2 data from NASA's AIRS instrument to address computational efficiencies that include CPU time and memory usage. Furthermore, the predictive performance of each spatial predictor is assessed empirically using a hold-out subset of the AIRS data.

UR - http://arxiv.org/abs/1410.7748 IS - 1410.7748 ER - TY - JOUR T1 - Dasymetric Modeling and Uncertainty JF - The Annals of the Association of American Geographers Y1 - 2014 A1 - Nagle, N. A1 - Buttenfield, B. A1 - Leyk, S. A1 - Spielman, S. E. VL - 104 UR - http://www.tandfonline.com/doi/abs/10.1080/00045608.2013.843439 ER - TY - THES T1 - Data Fusion Methods for Improved Demographic Resolution of Population Distribution Datasets (Ph.D. Thesis) Y1 - 2014 A1 - Rose, A. PB - University of Tennessee ER - TY - CONF T1 - Data Quality among Devices to Complete Surveys: Comparing Personal Computers, Smartphones and Tablets T2 - Midwest Association for Public Opinion Research Annual Meeting Y1 - 2014 A1 - Wang, Mengyang A1 - McCutcheon, Allan L. JF - Midwest Association for Public Opinion Research Annual Meeting CY - Chicago, IL UR - http://www.mapor.org/conferences.html ER - TY - JOUR T1 - Deprivation Among U.S. Children With Disabilities Who Receive Supplemental Security Income JF - Journal of Disability Policy Studies Y1 - 2014 A1 - Ghosth, S. A1 - Parish, S. L. ER - TY - CONF T1 - Designing an Intelligent Time Diary Instrument: Visualization, Dynamic Feedback, and Error Prevention and Mitigation T2 - UNL/SRAM/Gallup Symposium Y1 - 2014 A1 - Atkin, G. A1 - Arunachalam, H. A1 - Eck, A. A1 - Soh, L.-K. A1 - Belli, R.F. JF - UNL/SRAM/Gallup Symposium CY - Omaha, NE UR - http://grc.unl.edu/unlsramgallup-symposium ER - TY - CONF T1 - Designing an Intelligent Time Diary Instrument: Visualization, Dynamic Feedback, and Error Prevention and Mitigation T2 - American Association for Public Opinion Research 2014 Annual Conference Y1 - 2014 A1 - Atkin, G. A1 - Arunachalam, H. A1 - Eck, A. A1 - Soh, L.-K. A1 - Belli, R. JF - American Association for Public Opinion Research 2014 Annual Conference CY - Anaheim, CA. 
UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - JOUR T1 - Detecting Duplicates in a Homicide Registry Using a Bayesian Partitioning Approach JF - Annals of Applied Statistics Y1 - 2014 A1 - Sadinle, M. VL - 8 ER - TY - CHAP T1 - Disclosure risk evaluation for fully synthetic data T2 - Privacy in Statistical Databases Y1 - 2014 A1 - J. Hu A1 - J.P. Reiter A1 - Q. Wang JF - Privacy in Statistical Databases PB - Springer CY - Heidelberg VL - 8744 ER - TY - JOUR T1 - The Economics of Privacy JF - Journal of Economic Literature Y1 - 2014 A1 - Acquisti, A. A1 - Taylor, C. N1 - Commissioned article. To appear ER - TY - CONF T1 - The Effect of CATI Questionnaire Design Features on Response Timing T2 - American Association for Public Opinion Research 2014 Annual Conference Y1 - 2014 A1 - Olson, K. A1 - Smyth, Jolene JF - American Association for Public Opinion Research 2014 Annual Conference CY - Anaheim, CA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - CONF T1 - The effects of unfamiliar terms on interviewer and respondent behaviors: Are subsequent questions affected? T2 - Paper presented at the Midwest Association for Public Opinion Research annual meeting Y1 - 2014 A1 - Lee, J. A1 - Olson, K. JF - Paper presented at the Midwest Association for Public Opinion Research annual meeting CY - Chicago, IL UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - CHAP T1 - Enabling statistical analysis of suppressed tabular data, in Privacy in Statistical Databases T2 - Lecture Notes in Computer Science Y1 - 2014 A1 - L. Cox JF - Lecture Notes in Computer Science PB - Springer CY - Heidelberg VL - 8744 ER - TY - JOUR T1 - Entity Resolution with Empirically Motivated Priors JF - ArXiv Y1 - 2014 A1 - Steorts, R. C. KW - Statistics - Methodology AB - Databases often contain corrupted, degraded, and noisy data with duplicate entries across and within each database. 
Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a bipartite graph that links the observed records to unobserved latent entities. Bayesian approaches provide attractive benefits, naturally providing uncertainty quantification via posterior probabilities. We propose a novel record linkage approach based on empirical Bayesian principles. Specifically, the empirical Bayesian--type step consists of taking the empirical distribution function of the data as the prior for the latent entities. This approach improves on the earlier HB approach not only by avoiding the prior specification problem but also by allowing both categorical and string-valued variables. Our extension to string-valued variables also involves the proposal of a new probabilistic mechanism by which observed record values for string fields can deviate from the values of their associated latent entities. Categorical fields that deviate from their corresponding true value are simply drawn from the empirical distribution function. We apply our proposed methodology to a simulated data set of German names and an Italian household survey, showing our method performs favorably compared to several standard methods in the literature. We also consider the robustness of our methods to changes in the hyper-parameters. UR - http://arxiv.org/abs/1409.0643 IS - 1409.0643 ER - TY - CONF T1 - Fast Estimation of Time Series with Multiple Long-Range Persistencies T2 - ASA Proceedings of the Joint Statistical Meetings Y1 - 2014 A1 - McElroy, T.S. A1 - Holan, S.H. JF - ASA Proceedings of the Joint Statistical Meetings PB - American Statistical Association CY - Alexandria, VA ER - TY - CONF T1 - Flexible Bayesian Methodology for Multivariate Spatial Small Area Estimation T2 - Joint Statistical Meetings 2014 Y1 - 2014 A1 - Porter, A.T. 
JF - Joint Statistical Meetings 2014 CY - Boston, MA ER - TY - RPRT T1 - Flexible prior specification for partially identified nonlinear regression with binary responses Y1 - 2014 A1 - P. R. Hahn A1 - J. S. Murray A1 - I. Manolopoulou AB - This paper adapts tree-based Bayesian regression models for estimating a partially identified probability function. In doing so, ideas from the recent literature on Bayesian partial identification are applied within a sophisticated applied regression context. Our approach permits efficient sensitivity analysis concerning the posterior impact of priors over the partially identified component of the regression model. The new methodology is illustrated on an important problem where we only have partially observed data -- inferring the prevalence of accounting misconduct among publicly traded U.S. businesses. PB - arXiv UR - https://arxiv.org/abs/1407.8430v1 IS - 1407.8430 ER - TY - CONF T1 - A Fully Bayesian Approach for Generating Synthetic Marks and Geographies for Confidential Data T2 - International Indian Statistical Association Y1 - 2014 A1 - Quick, H. JF - International Indian Statistical Association PB - IISA ER - TY - JOUR T1 - The generalized multiset sampler JF - Journal of Computational and Graphical Statistics Y1 - 2014 A1 - H. Kim A1 - S. N. MacEachern UR - http://dx.doi.org/10.1080/10618600.2014.962701 ER - TY - CONF T1 - ‘Good Respondent, Bad Respondent’? Assessing Response Quality in Internet Surveys T2 - American Association for Public Opinion Research 2014 Annual Conference Y1 - 2014 A1 - Kirchner, A. A1 - Powell, R. JF - American Association for Public Opinion Research 2014 Annual Conference CY - Anaheim, CA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - JOUR T1 - Harnessing Naturally Occurring Data to Measure the Response of Spending to Income JF - Science Y1 - 2014 A1 - Gelman, M. A1 - Kariv, S. A1 - Shapiro, M.D. A1 - Silverman, D. A1 - Tadelis, S. 
AB - This paper presents a new data infrastructure for measuring economic activity. The infrastructure records transactions and account balances, yielding measurements with scope and accuracy that have little precedent in economics. The data are drawn from a diverse population that overrepresents males and younger adults but contains large numbers of underrepresented groups. The data infrastructure permits evaluation of a benchmark theory in economics that predicts that individuals should use a combination of cash management, saving, and borrowing to make the timing of income irrelevant for the timing of spending. As in previous studies and in contrast to the predictions of the theory, there is a response of spending to the arrival of anticipated income. The data also show, however, that this apparent excess sensitivity of spending results largely from the coincident timing of regular income and regular spending. The remaining excess sensitivity is concentrated among individuals with less liquidity. Link to data at Berkeley Econometrics Lab (EML): https://eml.berkeley.edu/cgi-bin/HarnessingDataScience2014.cgi VL - 345 UR - http://www.sciencemag.org/content/345/6193/212.full IS - 11 ER - TY - CONF T1 - Having a Lasting Impact: The Effects of Interviewer Errors on Data Quality T2 - Midwest Association for Public Opinion Research Annual Conference Y1 - 2014 A1 - Timm, A. A1 - Olson, K. A1 - Smyth, J.D. JF - Midwest Association for Public Opinion Research Annual Conference CY - Chicago, IL UR - http://www.mapor.org/conferences.html ER - TY - CHAP T1 - Hierarchical Linkage Clustering with Distributions of Distances for Large Scale Record Linkage T2 - Privacy in Statistical Databases (Lecture Notes in Computer Science) Y1 - 2014 A1 - Ventura, S. A1 - Nugent, R. A1 - Fuchs, E. ED - Domingo-Ferrer, J. JF - Privacy in Statistical Databases (Lecture Notes in Computer Science) PB - Springer VL - 8744 ER - TY - CONF T1 - Hours or Minutes: Does One Unit Fit All?
T2 - Midwest Association for Public Opinion Research Annual Conference Y1 - 2014 A1 - Cochran, B. A1 - Smyth, J.D. JF - Midwest Association for Public Opinion Research Annual Conference CY - Chicago, IL UR - http://www.mapor.org/conferences.html ER - TY - ICOMM T1 - How to Make a Better Map—Using Neuroscience Y1 - 2014 A1 - Laura Bliss KW - Nicholas Nagle KW - Seth Spielman AB -
The work of Seth Spielman and Nicholas Nagle was noted in this article in City Lab, a publication from The Atlantic magazine, available at http://www.citylab.com/design/2014/11/how-to-make-a-better-map-according-to-science/382898/.
PB - Citylab UR - http://www.citylab.com/design/2014/11/how-to-make-a-better-map-according-to-science/382898/ ER - TY - JOUR T1 - I Cheated, but only a Little: Partial Confessions to Unethical Behavior JF - Journal of Personality and Social Psychology Y1 - 2014 A1 - Peer, E. A1 - Acquisti, A. A1 - Shalvi, S. VL - 106 ER - TY - JOUR T1 - Identifying Regions based on Flexible User Defined Constraints JF - International Journal of Geographic Information Science Y1 - 2014 A1 - Folch, D. A1 - Spielman, S. E. VL - 28 UR - http://www.tandfonline.com/doi/abs/10.1080/13658816.2013.848986 ER - TY - JOUR T1 - Imputation of confidential data sets with spatial locations using disease mapping models JF - Statistics in Medicine Y1 - 2014 A1 - T. Paiva A1 - A. Chakraborty A1 - J.P. Reiter A1 - A.E. Gelfand VL - 33 ER - TY - RPRT T1 - Interval Estimates for Official Statistics with Survey Nonresponse Y1 - 2014 A1 - Manski, C. ER - TY - CONF T1 - Interviewer variance and prevalence of verbal behaviors in calendar and conventional interviewing T2 - American Association for Public Opinion Research 2014 Annual Conference Y1 - 2014 A1 - Belli, R.F. A1 - Charoenruk, N. JF - American Association for Public Opinion Research 2014 Annual Conference CY - Anaheim, CA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - CONF T1 - Interviewer variance of interviewer and respondent behaviors: A comparison between calendar and conventional interviewing T2 - XVIII International Sociological Association World Congress of Sociology Y1 - 2014 A1 - Belli, R.F.
A1 - Charoenruk, N. JF - XVIII International Sociological Association World Congress of Sociology CY - Yokohama, Japan UR - https://isaconf.confex.com/isaconf/wc2014/webprogram/Paper34278.html ER - TY - JOUR T1 - Longitudinal mixed membership trajectory models for disability survey data JF - Annals of Applied Statistics Y1 - 2014 A1 - Manrique-Vallier, D. VL - 8 ER - TY - CONF T1 - Making sense of paradata: Challenges faced and lessons learned T2 - American Association for Public Opinion Research 2014 Annual Conference Y1 - 2014 A1 - Eck, A. A1 - Stuart, L. A1 - Atkin, G. A1 - Soh, L.-K. A1 - McCutcheon, A.L. A1 - Belli, R.F. JF - American Association for Public Opinion Research 2014 Annual Conference CY - Anaheim, CA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - CONF T1 - Making Sense of Paradata: Challenges Faced and Lessons Learned T2 - UNL/SRAM/Gallup Symposium Y1 - 2014 A1 - Eck, A. A1 - Stuart, L. A1 - Atkin, G. A1 - Soh, L.-K. A1 - McCutcheon, A.L. A1 - Belli, R.F. JF - UNL/SRAM/Gallup Symposium CY - Omaha, NE UR - http://grc.unl.edu/unlsramgallup-symposium ER - TY - JOUR T1 - Multiple imputation by ordered monotone blocks with application to the Anthrax Vaccine Adsorbed Trial JF - Journal of Computational and Graphical Statistics Y1 - 2014 A1 - Li, Fan A1 - Baccini, Michela A1 - Mealli, Fabrizia A1 - Zell, Elizabeth R. A1 - Frangakis, Constantine E. A1 - Rubin, Donald B. VL - 23 UR - http://www.tandfonline.com/doi/abs/10.1080/10618600.2013.826583 ER - TY - THES T1 - Multiple Imputation Methods for Nonignorable Nonresponse, Adaptive Survey Design, and Dissemination of Synthetic Geographies (Ph.D. thesis) T2 - Department of Statistical Sciences Y1 - 2014 A1 - Thais Paiva JF - Department of Statistical Sciences PB - Duke University VL - Ph.D.
UR - http://dukespace.lib.duke.edu/dspace/handle/10161/9406 ER - TY - JOUR T1 - Multiple imputation of missing or faulty values under linear constraints JF - Journal of Business and Economic Statistics Y1 - 2014 A1 - Kim, H. J. A1 - Reiter, J. P. A1 - Wang, Q. A1 - Cox, L. H. A1 - Karr, A. F. AB -
Many statistical agencies, survey organizations, and research centers collect data that suffer from item nonresponse and erroneous or inconsistent values. These data may be required to satisfy linear constraints, for example, bounds on individual variables and inequalities for ratios or sums of variables. Often these constraints are designed to identify faulty values, which then are blanked and imputed. The data also may exhibit complex distributional features, including nonlinear relationships and highly nonnormal distributions. We present a fully Bayesian, joint model for modeling or imputing data with missing/blanked values under linear constraints that (i) automatically incorporates the constraints in inferences and imputations, and (ii) uses a flexible Dirichlet process mixture of multivariate normal distributions to reflect complex distributional features. Our strategy for estimation is to augment the observed data with draws from a hypothetical population in which the constraints are not present, thereby taking advantage of computationally expedient methods for fitting mixture models. Missing/blanked items are sampled from their posterior distribution using the Hit-and-Run sampler, which guarantees that all imputations satisfy the constraints. We illustrate the approach using manufacturing data from Colombia, examining the potential to preserve joint distributions and a regression from the plant productivity literature. Supplementary materials for this article are available online.
VL - 32 ER - TY - RPRT T1 - NCRN Meeting Fall 2014 Y1 - 2014 A1 - Vilhuber, Lars AB - NCRN Meeting Fall 2014 Vilhuber, Lars Held at the ILR NYC Conference Center. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/45868 ER - TY - RPRT T1 - NCRN Meeting Fall 2014: Bayesian Marked Point Process Modeling for Generating Fully Synthetic Public Use Data with Point-Referenced Geography Y1 - 2014 A1 - Quick, Harrison A1 - Holan, Scott A1 - Wikle, Christopher A1 - Reiter, Jerry AB - NCRN Meeting Fall 2014: Bayesian Marked Point Process Modeling for Generating Fully Synthetic Public Use Data with Point-Referenced Geography Quick, Harrison; Holan, Scott; Wikle, Christopher; Reiter, Jerry Presentation from NCRN Fall 2014 meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/37750 ER - TY - RPRT T1 - NCRN Meeting Fall 2014: Change in Visible Impervious Surface Area in Southeastern Michigan Before and After the "Great Recession" Y1 - 2014 A1 - Wilson, Courtney A1 - Brown, Daniel G.
Presentation at Fall 2014 NCRN meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/37446 ER - TY - RPRT T1 - NCRN Meeting Fall 2014: Constrained Smoothed Bayesian Estimation Y1 - 2014 A1 - Steorts, Rebecca A1 - Shalizi, Cosma AB - NCRN Meeting Fall 2014: Constrained Smoothed Bayesian Estimation Steorts, Rebecca; Shalizi, Cosma Presentation from NCRN Fall 2014 meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/37748 ER - TY - RPRT T1 - NCRN Meeting Fall 2014: Decomposing Medical-Care Expenditure Growth Y1 - 2014 A1 - Dunn, Abe A1 - Liebman, Eli A1 - Shapiro, Adam AB - NCRN Meeting Fall 2014: Decomposing Medical-Care Expenditure Growth Dunn, Abe; Liebman, Eli; Shapiro, Adam PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/37411 ER - TY - RPRT T1 - NCRN Meeting Fall 2014: Designer Census Geographies Y1 - 2014 A1 - Spielman, Seth AB - NCRN Meeting Fall 2014: Designer Census Geographies Spielman, Seth Presentation from NCRN Fall 2014 meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/37747 ER - TY - RPRT T1 - NCRN Meeting Fall 2014: Geographic linkages between National Center for Health Statistics’ population health surveys and air quality measures Y1 - 2014 A1 - Parker, Jennifer AB - NCRN Meeting Fall 2014: Geographic linkages between National Center for Health Statistics’ population health surveys and air quality measures Parker, Jennifer PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/37412 ER - TY - RPRT T1 - NCRN Meeting Fall 2014: Mixed Effects Modeling for Multivariate-Spatio-Temporal Areal Data Y1 - 2014 A1 - Bradley, Jonathan A1 - Holan, Scott A1 - Wikle, Christopher AB - NCRN Meeting Fall 2014: Mixed Effects Modeling for Multivariate-Spatio-Temporal Areal Data Bradley, Jonathan; Holan, Scott; Wikle, Christopher Presentation from NCRN Fall 2014 meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/37749 ER - TY - RPRT T1 - NCRN Meeting Fall 2014: Respondent-Driven 
Sampling Estimation and the National HIV Behavioral Surveillance System Y1 - 2014 A1 - Spiller, Michael (Trey) AB - NCRN Meeting Fall 2014: Respondent-Driven Sampling Estimation and the National HIV Behavioral Surveillance System Spiller, Michael (Trey) PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/37414 ER - TY - RPRT T1 - NCRN Meeting Spring 2014 Y1 - 2014 A1 - Vilhuber, Lars AB - NCRN Meeting Spring 2014 Vilhuber, Lars Held at the Census Headquarters, Washington, DC. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/45869 ER - TY - RPRT T1 - NCRN Meeting Spring 2014: Adaptive Protocols and the DDI 4 Process Model Y1 - 2014 A1 - Greenfield, Jay A1 - Kuan, Sophia AB - NCRN Meeting Spring 2014: Adaptive Protocols and the DDI 4 Process Model Greenfield, Jay; Kuan, Sophia Presentation from NCRN Spring 2014 meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/36393 ER - TY - RPRT T1 - NCRN Meeting Spring 2014: Aiming at a More Cost-Effective Census Via Online Data Collection: Privacy Trade-Offs of Geo-Location Y1 - 2014 A1 - Brandimarte, Laura A1 - Acquisti, Alessandro AB - NCRN Meeting Spring 2014: Aiming at a More Cost-Effective Census Via Online Data Collection: Privacy Trade-Offs of Geo-Location Brandimarte, Laura; Acquisti, Alessandro Presentation at NCRN Spring 2014 meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/36397 ER - TY - RPRT T1 - NCRN Meeting Spring 2014: Imputation of multivariate continuous data with non-ignorable missingness Y1 - 2014 A1 - Paiva, Thais A1 - Reiter, Jerry AB - NCRN Meeting Spring 2014: Imputation of multivariate continuous data with non-ignorable missingness Paiva, Thais; Reiter, Jerry Presentation at Spring 2014 NCRN meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/36399 ER - TY - RPRT T1 - NCRN Meeting Spring 2014: Integrating PROV with DDI: Mechanisms of Data Discovery within the U.S.
Census Bureau Y1 - 2014 A1 - Block, William A1 - Brown, Warren A1 - Williams, Jeremy A1 - Vilhuber, Lars A1 - Lagoze, Carl AB - NCRN Meeting Spring 2014: Integrating PROV with DDI: Mechanisms of Data Discovery within the U.S. Census Bureau Block, William; Brown, Warren; Williams, Jeremy; Vilhuber, Lars; Lagoze, Carl Presentation at NCRN Spring 2014 meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/36392 ER - TY - RPRT T1 - NCRN Meeting Spring 2014: Introduction Y1 - 2014 A1 - Thompson, John AB - NCRN Meeting Spring 2014: Introduction Thompson, John NCRN Spring 2014 Meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/36395 ER - TY - RPRT T1 - NCRN Meeting Spring 2014: Metadata Standards & Technology Development for the NSF Survey of Earned Doctorates Y1 - 2014 A1 - Noonan, Kimberly A1 - Heus, Pascal A1 - Mulcahy, Tim AB - NCRN Meeting Spring 2014: Metadata Standards & Technology Development for the NSF Survey of Earned Doctorates Noonan, Kimberly; Heus, Pascal; Mulcahy, Tim Presentation from NCRN Spring 2014 meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/36394 ER - TY - RPRT T1 - NCRN Meeting Spring 2014: Research Program and Enterprise Architecture for Adaptive Survey Design at Census Y1 - 2014 A1 - Miller, Peter A1 - Mathur, Anup A1 - Thieme, Michael AB - NCRN Meeting Spring 2014: Research Program and Enterprise Architecture for Adaptive Survey Design at Census Miller, Peter; Mathur, Anup; Thieme, Michael PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/36400 ER - TY - RPRT T1 - NCRN Meeting Spring 2014: Summer Working Group for Employer List Linking (SWELL) Y1 - 2014 A1 - Gathright, Graton A1 - Kutzbach, Mark A1 - McCue, Kristin A1 - McEntarfer, Erika A1 - Monti, Holly A1 - Trageser, Kelly A1 - Vilhuber, Lars A1 - Wasi, Nada A1 - Wignall, Christopher AB - NCRN Meeting Spring 2014: Summer Working Group for Employer List Linking (SWELL) Gathright, Graton; Kutzbach, Mark; McCue, Kristin;
McEntarfer, Erika; Monti, Holly; Trageser, Kelly; Vilhuber, Lars; Wasi, Nada; Wignall, Christopher Presentation for NCRN Spring 2014 meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/36396 ER - TY - RPRT T1 - NCRN Meeting Spring 2014: Web Surveys, Online Panels, and Paradata: Automating Adaptive Design Y1 - 2014 A1 - McCutcheon, Allan AB - NCRN Meeting Spring 2014: Web Surveys, Online Panels, and Paradata: Automating Adaptive Design McCutcheon, Allan Presentation at NCRN Spring 2014 meeting PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/36398 ER - TY - RPRT T1 - NCRN Newsletter: Volume 1 - Issue 2 Y1 - 2014 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - NCRN Newsletter: Volume 1 - Issue 2 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from November 2013 to March 2014. NCRN Newsletter Vol. 1, Issue 2: March 20, 2014 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40233 ER - TY - RPRT T1 - NCRN Newsletter: Volume 1 - Issue 3 Y1 - 2014 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - NCRN Newsletter: Volume 1 - Issue 3 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from March 2014 to July 2014. NCRN Newsletter Vol. 1, Issue 3: July 23, 2014 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40234 ER - TY - RPRT T1 - NCRN Newsletter: Volume 1 - Issue 4 Y1 - 2014 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - NCRN Newsletter: Volume 1 - Issue 4 Vilhuber, Lars; Karr, Alan; Reiter, Jerome; Abowd, John; Nunnelly, Jamie Overview of activities at NSF-Census Research Network nodes from July 2014 to October 2014. NCRN Newsletter Vol. 
1, Issue 4: October 15, 2014 PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40192 ER - TY - RPRT T1 - A New Method for Protecting Interrelated Time Series with Bayesian Prior Distributions and Synthetic Data Y1 - 2014 A1 - Schneider, Matthew J. A1 - Abowd, John M. AB - A New Method for Protecting Interrelated Time Series with Bayesian Prior Distributions and Synthetic Data Schneider, Matthew J.; Abowd, John M. Organizations disseminate statistical summaries of administrative data via the Web for unrestricted public use. They balance the trade-off between confidentiality protection and inference quality. Recent developments in disclosure avoidance techniques include the incorporation of synthetic data, which capture the essential features of underlying data by releasing altered data generated from a posterior predictive distribution. The United States Census Bureau collects millions of interrelated time series micro-data that are hierarchical and contain many zeros and suppressions. Rule-based disclosure avoidance techniques often require the suppression of count data for small magnitudes and the modification of data based on a small number of entities. Motivated by this problem, we use zero-inflated extensions of Bayesian Generalized Linear Mixed Models (BGLMM) with privacy-preserving prior distributions to develop methods for protecting and releasing synthetic data from time series about thousands of small groups of entities without suppression based on the magnitudes or number of entities. We find that as the prior distributions of the variance components in the BGLMM become more precise toward zero, confidentiality protection increases and inference quality deteriorates. We evaluate our methodology using a strict privacy measure, empirical differential privacy, and a newly defined risk measure, Probability of Range Identification (PoRI), which directly measures attribute disclosure risk. We illustrate our results with the U.S.
Census Bureau’s Quarterly Workforce Indicators. PB - Cornell University UR - http://hdl.handle.net/1813/40828 ER - TY - Generic T1 - NewsViews: An Automated Pipeline for Creating Custom Geovisualizations for News Y1 - 2014 A1 - Gao, T. A1 - Hullman, J. A1 - Adar, E. A1 - Hecht, B. A1 - Diakopoulos, N. AB - Interactive visualizations add rich, data-based context to online news articles. Geographic maps are currently the most prevalent form of these visualizations. Unfortunately, designers capable of producing high-quality, customized geovisualizations are scarce. We present NewsViews, a novel automated news visualization system that generates interactive, annotated maps without requiring professional designers. NewsViews’ maps support trend identification and data comparisons relevant to a given news article. The NewsViews system leverages text mining to identify key concepts and locations discussed in articles (as well as potential annotations), an extensive repository of “found” databases, and techniques adapted from cartography to identify and create visually “interesting” thematic maps. In this work, we develop and evaluate key criteria in automatic, annotated, map generation and experimentally validate the key features for successful representations (e.g., relevance to context, variable selection, "interestingness" of representation and annotation quality). UR - http://cond.org/newsviews.html ER - TY - JOUR T1 - The Past, Present, and Future of Geodemographic Research in the United States and United Kingdom JF - The Professional Geographer Y1 - 2014 A1 - Singleton, A. A1 - Spielman, S. E. VL - 4 ER - TY - CONF T1 - The Poisson Change of Support Problem with Applications to the American Community Survey T2 - Joint Statistical Meetings 2014 Y1 - 2014 A1 - Bradley, J.R.
JF - Joint Statistical Meetings 2014 ER - TY - CONF T1 - Predicting Survey Breakoff in Online Survey Panels T2 - American Association for Public Opinion Research 2014 Annual Conference Y1 - 2014 A1 - McCutcheon, A.L. JF - American Association for Public Opinion Research 2014 Annual Conference CY - Anaheim, CA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - RPRT T1 - Reducing Uncertainty in the American Community Survey through Data-Driven Regionalization Y1 - 2014 A1 - Spielman, Seth A1 - Folch, David AB - Reducing Uncertainty in the American Community Survey through Data-Driven Regionalization Spielman, Seth; Folch, David The American Community Survey (ACS) is the largest US survey of households and is the principal source for neighborhood scale information about the US population and economy. The ACS is used to allocate billions in federal spending and is a critical input to social scientific research in the US. However, estimates from the ACS can be highly unreliable. For example, in over 72% of census tracts the estimated number of children under 5 in poverty has a margin of error greater than the estimate. Uncertainty of this magnitude complicates the use of social data in policy making, research, and governance. This article develops a spatial optimization algorithm that is capable of reducing the margins of error in survey data via the creation of new composite geographies, a process called regionalization. Regionalization is a complex combinatorial problem. Here, rather than focusing on the technical aspects of regionalization, we demonstrate how to use a purpose-built open-source regionalization algorithm to post-process survey data in order to reduce the margins of error to some user-specified threshold.
PB - University of Colorado at Boulder / University of Tennessee UR - http://hdl.handle.net/1813/38121 ER - TY - CONF T1 - Remembering where: A look at the American Time Use Survey T2 - Paper presented at the annual conference of the Midwest Association for Public Opinion Research Y1 - 2014 A1 - Deal, C. A1 - Cordova-Cazar, A.L. A1 - Countryman, A. A1 - Kirchner, A. A1 - Belli, R.F. JF - Paper presented at the annual conference of the Midwest Association for Public Opinion Research CY - Chicago, IL UR - http://www.mapor.org/conferences.html ER - TY - JOUR T1 - Reputation as a Sufficient Condition for Data Quality on Amazon Mechanical Turk JF - Behavior Research Methods Y1 - 2014 A1 - Peer, E. A1 - Vosgerau, J. A1 - Acquisti, A. VL - 46 ER - TY - CHAP T1 - The Rise of Incarceration Among the Poor with Mental Illnesses: How Neoliberal Policies Contribute T2 - The Routledge Handbook of Poverty in the United States Y1 - 2014 A1 - Camp, J. A1 - Haymes, S. A1 - Haymes, M. V. d. A1 - Miller, R.J. JF - The Routledge Handbook of Poverty in the United States PB - Routledge ER - TY - CONF T1 - The Role of Device Type in Internet Panel Survey Breakoff T2 - Midwest Association for Public Opinion Research Annual Conference Y1 - 2014 A1 - McCutcheon, A.L. JF - Midwest Association for Public Opinion Research Annual Conference CY - Chicago, IL UR - http://www.mapor.org/conferences.html ER - TY - JOUR T1 - Savings from ages 16 to 35: A test to inform Child Development Account policy JF - Poverty & Public Policy Y1 - 2014 A1 - Friedline, T. A1 - Nam, I. VL - 6 UR - http://onlinelibrary.wiley.com/store/10.1002/pop4.59/asset/pop459.pdf IS - 1 ER - TY - JOUR T1 - Seeing the Non-Stars: (Some) Sources of Bias in Past Disambiguation Approaches and a New Public Tool Leveraging Labeled Records JF - Research Policy Y1 - 2014 A1 - Ventura, S. A1 - Nugent, R. A1 - Fuchs, E. 
N1 - Selected for Special Issue on Big Data ER - TY - ABST T1 - SIPP: From Conventional Questionnaire to Event History Calendar Interviewing Y1 - 2014 A1 - Belli, R.F. N1 - Workshop on "Conducting Research using the Survey of Income and Program Participation (SIPP)". Presented at Duke University, Social Science Research Institute, Durham, NC ER - TY - CONF T1 - SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication T2 - AISTATS 2014 Proceedings, JMLR Y1 - 2014 A1 - Steorts, R. A1 - Hall, R. A1 - Fienberg, S. E. JF - AISTATS 2014 Proceedings, JMLR PB - W&CP VL - 33 ER - TY - RPRT T1 - Sorting Between and Within Industries: A Testable Model of Assortative Matching Y1 - 2014 A1 - Abowd, John M. A1 - Kramarz, Francis A1 - Perez-Duarte, Sebastien A1 - Schmutte, Ian M. AB - Sorting Between and Within Industries: A Testable Model of Assortative Matching Abowd, John M.; Kramarz, Francis; Perez-Duarte, Sebastien; Schmutte, Ian M. We test Shimer's (2005) theory of the sorting of workers between and within industrial sectors based on directed search with coordination frictions, deliberately maintaining its static general equilibrium framework. We fit the model to sector-specific wage, vacancy and output data, including publicly-available statistics that characterize the distribution of worker and employer wage heterogeneity across sectors. Our empirical method is general and can be applied to a broad class of assignment models. The results indicate that industries are the loci of sorting–more productive workers are employed in more productive industries. The evidence confirms that strong assortative matching can be present even when worker and employer components of wage heterogeneity are weakly correlated. PB - Cornell University UR - http://hdl.handle.net/1813/52607 ER - TY - JOUR T1 - Spatial Collective Intelligence? Accuracy, Credibility in Crowdsourced Data JF - Cartography and Geographic Information Science Y1 - 2014 A1 - Spielman, S. E.
VL - 41 UR - http://go.galegroup.com/ps/i.do?action=interpret&id=GALE|A361943563&v=2.1&u=nysl_sc_cornl&it=r&p=AONE&sw=w&authCount=1 IS - 2 ER - TY - ABST T1 - Spatial Fay-Herriot Models for Small Area Estimation With Functional Covariates Y1 - 2014 A1 - Holan, S.H. ER - TY - JOUR T1 - Spatial Fay-Herriot Models for Small Area Estimation with Functional Covariates JF - Spatial Statistics Y1 - 2014 A1 - Porter, A.T. A1 - Holan, S.H. A1 - Wikle, C.K. A1 - Cressie, N. VL - 10 UR - http://arxiv.org/pdf/1303.6668v3.pdf ER - TY - CONF T1 - Spiny CACTOS: OSN Users Attitudes and Perceptions Towards Cryptographic Access Control Tools T2 - Proceedings of the Workshop on Usable Security (USEC) Y1 - 2014 A1 - Balsa, E. A1 - Brandimarte, L. A1 - Acquisti, A. A1 - Diaz, C. A1 - Gürses, S. JF - Proceedings of the Workshop on Usable Security (USEC) UR - https://www.internetsociety.org/doc/spiny-cactos-osn-users-attitudes-and-perceptions-towards-cryptographic-access-control-tools ER - TY - CONF T1 - Supporting Planners' Work with Uncertain Demographic Data T2 - GIScience Workshop on Uncertainty Visualization Y1 - 2014 A1 - Griffin, A. L. A1 - Spielman, S. E. A1 - Jurjevich, J. A1 - Merrick, M. A1 - Nagle, N. N. A1 - Folch, D. C. JF - GIScience Workshop on Uncertainty Visualization VL - 23 UR - http://cognitivegiscience.psu.edu/uncertainty2014/papers/griffin_demographic.pdf ER - TY - CONF T1 - Supporting Planners' Work with Uncertain Demographic Data T2 - Proceedings of IEEE VIS 2014 Y1 - 2014 A1 - Griffin, A. L. A1 - Spielman, S. E. A1 - Nagle, N. N. A1 - Jurjevich, J. A1 - Merrick, M. A1 - Folch, D. C. JF - Proceedings of IEEE VIS 2014 PB - Proceedings of IEEE VIS 2014 UR - http://cognitivegiscience.psu.edu/uncertainty2014/papers/griffin_demographic.pdf ER - TY - CONF T1 - Survey Fusion for Data that Exhibit Multivariate, Spatio-Temporal Dependencies T2 - Joint Statistical Meetings 2014 Y1 - 2014 A1 - Bradley, J.R.
JF - Joint Statistical Meetings 2014 ER - TY - CONF T1 - Survey Informatics: Ideas, Opportunities, and Discussions T2 - UNL/SRAM/Gallup Symposium Y1 - 2014 A1 - Eck, A. A1 - Soh, L-K JF - UNL/SRAM/Gallup Symposium CY - Omaha, NE UR - http://grc.unl.edu/unlsramgallup-symposium ER - TY - ABST T1 - A Survey of Contemporary Spatial Models for Small Area Estimation Y1 - 2014 A1 - Porter, A.T. ER - TY - JOUR T1 - SynLBD 2.0: Improving the Synthetic Longitudinal Business Database JF - Statistical Journal of the International Association for Official Statistics Y1 - 2014 A1 - S. K. Kinney A1 - J. P. Reiter A1 - J. Miranda VL - 30 ER - TY - JOUR T1 - Top-Coding and Public Use Microdata Samples from the U.S. Census Bureau JF - Journal of Privacy and Confidentiality Y1 - 2014 A1 - Crimi, N. A1 - Eddy, W. C. VL - 6 UR - http://repository.cmu.edu/jpc/vol6/iss2/2/ ER - TY - JOUR T1 - Toward healthy balance sheets: Savings accounts as a gateway for young adults’ asset diversification and accumulation JF - The St. Louis Federal Reserve Bulletin Y1 - 2014 A1 - Friedline, T. A1 - Johnson, P. A1 - Hughes, R. UR - http://research.stlouisfed.org/publications/review/2014/q4/friedline.pdf ER - TY - THES T1 - Towards an Understanding of Dynamics Between Race, Population Movement, and the Built Environment of American Cities (undergraduate honors thesis) Y1 - 2014 A1 - Bellman, B. PB - University of Colorado at Boulder ER - TY - RPRT T1 - Twitter, Big Data, and Jobs Numbers Y1 - 2014 A1 - Hudomiet, Peter JF - LSA Today UR - http://www.lsa.umich.edu/lsa/ci.twitterbigdataandjobsnumbers_ci.detail ER - TY - RPRT T1 - Uncertain Uncertainty: Spatial Variation in the Quality of American Community Survey Estimates Y1 - 2014 A1 - Folch, David C. A1 - Arribas-Bel, Daniel A1 - Koschinsky, Julia A1 - Spielman, Seth E. AB - The U.S. 
Census Bureau's American Community Survey (ACS) is the foundation of social science research, much federal resource allocation, and the development of public policy and private sector decisions. However, the high uncertainty associated with some of the ACS's most frequently used estimates can jeopardize the accuracy of inferences based on these data. While there is a high-level understanding in the research community that problems exist in the data, the sources and implications of these problems have been largely overlooked. Using 2006-2010 ACS median household income at the census tract scale as the test case (where a third of small-area estimates have higher than recommended errors), we explore the patterns in the uncertainty of ACS data. We consider various potential sources of uncertainty in the data, ranging from response level to geographic location to characteristics of the place. We find that there exist systematic patterns in the uncertainty in both the spatial and attribute dimensions. Using a regression framework, we identify the factors that are most frequently correlated with the error at national, regional and metropolitan area scales, and find these correlates are not consistent across the various locations tested. The implication is that data quality varies in different places, making cross-sectional analysis both within and across regions less reliable. We also present general advice for data users and potential solutions to the challenges identified. PB - University of Colorado at Boulder / University of Tennessee UR - http://hdl.handle.net/1813/38122 ER - TY - CHAP T1 - The Untold Story of Multi-Mode (Online and Mail) Consumer Panels: From Optimal Recruitment to Retention and Attrition T2 - Online Panel Surveys: An Interdisciplinary Approach Y1 - 2014 A1 - McCutcheon, Allan L. A1 - Rao, K. A1 - Kaminska, O. ED - Callegaro, M. ED - Baker, R. ED - Bethlehem, J. ED - Göritz, A. ED - Krosnick, J. ED - Lavrakas, P. 
JF - Online Panel Surveys: An Interdisciplinary Approach PB - Wiley ER - TY - JOUR T1 - An updated method for calculating income and payroll taxes from PSID data using the NBER’s TAXSIM, for PSID survey years 1999 through 2011 JF - Unpublished manuscript, University of Michigan. Accessed May Y1 - 2014 A1 - Kimberlin, Sara A1 - Kim, Jiyoun A1 - Shaefer, Luke AB - This paper describes a method to calculate income and payroll taxes from Panel Study of Income Dynamics data using the NBER's Internet TAXSIM version 9 (http://users.nber.org/~taxsim/taxsim9/), for PSID survey years 1999, 2001, 2003, 2005, 2007, 2009, and 2011 (tax years n-1). These methods are implemented in two Stata programs, designed to be used with the PSID public-use zipped Main Interview data files: PSID_TAXSIM_1of2.do and PSID_TAXSIM_2of2.do. The main program (2of2) was written by Sara Kimberlin (skimberlin@berkeley.edu) and generates all TAXSIM input variables, runs TAXSIM, adjusts tax estimates using additional information available in PSID data, and calculates total PSID family unit taxes. A separate program (1of2) was written by Jiyoon (June) Kim (junekim@umich.edu) in collaboration with Luke Shaefer (lshaefer@umich.edu) to calculate mortgage interest for itemized deductions; this program needs to be run first, before the main program. Jonathan Latner contributed code to use the programs with the PSID zipped data. The overall methods build on the strategy for using TAXSIM with PSID data outlined by Butrica & Burkhauser (1997), with some expansions and modifications. Note that the methods described below are designed to prioritize accuracy of income taxes calculated for low-income households, particularly refundable tax credits such as the Earned Income Tax Credit (EITC) and the Additional Child Tax Credit. Income tax liability is generally low for low-income households, and the amount of refundable tax credits is often substantially larger than tax liabilities for this population. 
Payroll tax can also be substantial for low-income households. Thus the methods below focus on maximizing accuracy of income tax and payroll tax calculations for low-income families, with less attention to tax items that largely impact higher-income households (e.g. the treatment of capital gains). VL - 6 ER - TY - CONF T1 - The use of paradata (in time use surveys) to better evaluate data quality T2 - American Association for Public Opinion Research 2014 Annual Conference Y1 - 2014 A1 - Cordova-Cazar, A.L. A1 - Belli, R.F. JF - American Association for Public Opinion Research 2014 Annual Conference CY - Anaheim, CA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - RPRT T1 - Using partially synthetic data to replace suppression in the Business Dynamics Statistics: early results Y1 - 2014 A1 - Miranda, Javier A1 - Vilhuber, Lars AB - The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as additional tabulations are being developed, at ever more detailed geographic levels, the number of suppressions increases dramatically. This paper explores the option of providing public-use data that are analytically valid and without suppressions, by leveraging synthetic data to replace observations in sensitive cells. PB - Cornell University UR - http://hdl.handle.net/1813/40852 ER - TY - JOUR T1 - Using Partially Synthetic Data to Replace Suppression in the Business Dynamics Statistics: Early Results JF - Privacy in Statistical Databases Y1 - 2014 A1 - J. Miranda A1 - L. 
Vilhuber AB - The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as additional tabulations are being developed, at ever more detailed geographic levels, the number of suppressions increases dramatically. This paper explores the option of providing public-use data that are analytically valid and without suppressions, by leveraging synthetic data to replace observations in sensitive cells. SN - 978-3-319-11256-5 UR - http://dx.doi.org/10.1007/978-3-319-11257-2_18 ER - TY - RPRT T1 - Using Social Media to Measure Labor Market Flows Y1 - 2014 A1 - Antenucci, Dolan A1 - Cafarella, Michael J A1 - Levenstein, Margaret C. A1 - Ré, Christopher A1 - Shapiro, Matthew UR - http://www-personal.umich.edu/~shapiro/papers/LaborFlowsSocialMedia.pdf ER - TY - CONF T1 - Web Surveys, Online Panels, and Paradata: Automating Adaptive Design T2 - NSF-Census Research Network (NCRN) Spring Meeting Y1 - 2014 A1 - McCutcheon, A.L. JF - NSF-Census Research Network (NCRN) Spring Meeting CY - Washington, DC UR - http://www.ncrn.info/event/ncrn-meeting-spring-2014 N1 - Conference on Methodological Innovations in the Study of Elections in Europe and Beyond. Presented at Texas A&M University ER - TY - JOUR T1 - What are You Doing Now? Activity Level Responses and Errors in the American Time Use Survey JF - Journal of Survey Statistics and Methodology Y1 - 2014 A1 - T. Al Baghal A1 - Belli, R.F. A1 - Phillips, A.L. A1 - Ruther, N. VL - 2 IS - 4 ER - TY - JOUR T1 - Why data availability is such a hard problem JF - Statistical Journal of the International Association for Official Statistics Y1 - 2014 A1 - A. F. 
Karr KW - Data Archive KW - Data availability KW - public good KW - replicability KW - reproducibility AB - If data availability were a simple problem, it would already have been resolved. In this paper, I argue that by viewing data availability as a public good, it is possible to both understand the complexities with which it is fraught and identify a path to a solution. VL - 30 IS - 2 ER - TY - CONF T1 - Would a Privacy Fundamentalist Sell their DNA for $1000... if Nothing Bad Happened Thereafter? A Study of the Westin Categories, Behavioral Intentions, and Consequences T2 - Proceedings of the Tenth Symposium on Usable Privacy and Security (SOUPS) Y1 - 2014 A1 - Woodruff, A. A1 - Pihur, V. A1 - Acquisti, A. A1 - Consolvo, S. A1 - Schmidt, L. A1 - Brandimarte, L. JF - Proceedings of the Tenth Symposium on Usable Privacy and Security (SOUPS) PB - ACM CY - New York, NY UR - https://www.usenix.org/conference/soups2014/proceedings/presentation/woodruff N1 - IAPP SOUPS Privacy Award Winner ER - TY - JOUR T1 - Are independent parameter draws necessary for multiple imputation? JF - The American Statistician Y1 - 2013 A1 - Hu, J. A1 - Mitra, R. A1 - Reiter, J.P. VL - 67 UR - http://www.tandfonline.com/doi/full/10.1080/00031305.2013.821953 ER - TY - ABST T1 - A Bayesian Approach to Estimating Agricultural Yield Based on Multiple Repeated Surveys, Institute of Public Policy and the Truman School of Public Affairs Y1 - 2013 A1 - Holan, S.H. ER - TY - RPRT T1 - A Bayesian Approach to Graphical Record Linkage and De-duplication Y1 - 2013 A1 - Steorts, Rebecca C. A1 - Hall, Rob A1 - Fienberg, Stephen E. AB - We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. 
This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online. JF - arXiv UR - https://arxiv.org/abs/1312.4645 ER - TY - ABST T1 - Bayesian inference for the Spatial Random Effects Model Y1 - 2013 A1 - Cressie, N. JF - Department of Statistics, Macquarie University PB - Macquarie University ER - TY - CONF T1 - Bayesian learning of joint distributions of objects T2 - Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) 2013 Y1 - 2013 A1 - Banerjee, A. A1 - Murray, J. A1 - Dunson, D. B. AB -

There is increasing interest in broad application areas in defining flexible joint models for data having a variety of measurement scales, while also allowing data of complex types, such as functions, images and documents. We consider a general framework for nonparametric Bayes joint modeling through mixture models that incorporate dependence across data types through a joint mixing measure. The mixing measure is assigned a novel infinite tensor factorization (ITF) prior that allows flexible dependence in cluster allocation across data types. The ITF prior is formulated as a tensor product of stick-breaking processes. Focusing on a convenient special case corresponding to a Parafac factorization, we provide basic theory justifying the flexibility of the proposed prior and resulting asymptotic properties. Focusing on ITF mixtures of product kernels, we develop a new Gibbs sampling algorithm for routine implementation relying on slice sampling. The methods are compared with alternative joint mixture models based on Dirichlet processes and related approaches through simulations and real data applications.

Also at http://arxiv.org/abs/1303.0449

JF - Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) 2013 UR - http://jmlr.csail.mit.edu/proceedings/papers/v31/banerjee13a.html ER - TY - CONF T1 - Bayesian Modeling in the Era of Big Data: the Role of High-Throughput and High-Performance Computing T2 - The Extreme Science and Engineering Discovery Environment Conference Y1 - 2013 A1 - Wu, G. JF - The Extreme Science and Engineering Discovery Environment Conference CY - San Diego, CA ER - TY - RPRT T1 - Bayesian multiple imputation for large-scale categorical data with structural zeros Y1 - 2013 A1 - Manrique-Vallier, D. A1 - Reiter, J. P. AB - We propose an approach for multiple imputation of items missing at random in large-scale surveys with exclusively categorical variables that have structural zeros. Our approach is to use mixtures of multinomial distributions as imputation engines, accounting for structural zeros by conceiving of the observed data as a truncated sample from a hypothetical population without structural zeros. This approach has several appealing features: imputations are generated from coherent, Bayesian joint models that automatically capture complex dependencies and readily scale to large numbers of variables. We outline a Gibbs sampling algorithm for implementing the approach, and we illustrate its potential with a repeated sampling study using public use census microdata from the state of New York, USA. 
PB - Duke University / National Institute of Statistical Sciences (NISS) UR - http://hdl.handle.net/1813/34889 ER - TY - RPRT T1 - b-Bit Minwise Hashing in Practice Y1 - 2013 A1 - Li, Ping A1 - Shrivastava, Anshumali A1 - König, Arnd Christian AB - Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications. PB - Cornell University UR - http://hdl.handle.net/1813/37986 ER - TY - CONF T1 - b-Bit Minwise Hashing in Practice T2 - Internetware'13 Y1 - 2013 A1 - Ping Li A1 - Anshumali Shrivastava A1 - König, Arnd Christian AB - Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications. Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations for each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20 to 80 and becomes substantially smaller than the data loading time. 
Reducing the preprocessing time is highly beneficial in practice, e.g., for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce very similar learning results as using fully random permutations. Experiments on datasets of up to 200GB are presented. JF - Internetware'13 UR - http://www.nudt.edu.cn/internetware2013/ ER - TY - CONF T1 - Beyond Pairwise: Provably Fast Algorithms for Approximate K-Way Similarity Search T2 - Neural Information Processing Systems (NIPS) Y1 - 2013 A1 - Anshumali Shrivastava A1 - Ping Li JF - Neural Information Processing Systems (NIPS) ER - TY - CONF T1 - Binomial Mixture Models for Urban Ecological Monitoring Studies Using American Community Survey Demographic Covariates T2 - Joint Statistical Meetings 2013 Y1 - 2013 A1 - Wu, G. JF - Joint Statistical Meetings 2013 CY - Montreal, Canada ER - TY - CONF T1 - The Co-Evolution of Residential Segregation and the Built Environment at the Turn of the 20th Century: A Schelling Model T2 - Transactions in GIS Y1 - 2013 A1 - S.E. Spielman A1 - Patrick Harrison JF - Transactions in GIS ER - TY - CHAP T1 - Collecting paradata for measurement error evaluation T2 - Improving Surveys with Paradata: Analytic Uses of Process Information Y1 - 2013 A1 - Olson, K. A1 - Parkhurst, B. ED - Frauke Kreuter JF - Improving Surveys with Paradata: Analytic Uses of Process Information PB - John Wiley and Sons CY - Hoboken, NJ. 
ER - TY - JOUR T1 - Comment: Innovations Associated with Multiple Systems Estimation in Human Rights Settings JF - The American Statistician Y1 - 2013 A1 - Fienberg, S. E. VL - 67 ER - TY - CONF T1 - Comparing and Selecting Predictors Using Local Criteria T2 - International Workshop on Recent Advances in Statistical Inference: Theory and Case Studies Y1 - 2013 A1 - Cressie, N. JF - International Workshop on Recent Advances in Statistical Inference: Theory and Case Studies PB - International Workshop on Recent Advances in Statistical Inference: Theory and Case Studies CY - Padua, Italy ER - TY - CONF T1 - Complementary Perspectives on Privacy and Security: Economics T2 - IEEE Security & Privacy Y1 - 2013 A1 - Acquisti, A. JF - IEEE Security & Privacy VL - 11 N1 - Invited paper ER - TY - RPRT T1 - Credible interval estimates for official statistics with survey nonresponse Y1 - 2013 A1 - Manski, Charles F. AB - Government agencies commonly report official statistics based on survey data as point estimates, without accompanying measures of error. In the absence of agency guidance, users of the statistics can only conjecture the error magnitudes. Agencies could mitigate misinterpretation of official statistics if they were to measure potential errors and report them. Agencies could report sampling error using established statistical principles. It is more challenging to report nonsampling errors because there are many sources of such errors and there has been no consensus about how to measure them. To advance discourse on practical ways to report nonsampling error, this paper considers error due to survey nonresponse. I summarize research deriving interval estimates that make no assumptions about the values of missing data. In the absence of assumptions, one can obtain computable bounds on the population parameters that official statistics intend to measure. 
I also explore the middle ground between interval estimation making no assumptions and traditional point estimation using weights and imputations to implement assumptions that nonresponse is conditionally random. I am grateful to Aanchal Jain for excellent research assistance and to Bruce Spencer for helpful discussions. I have benefitted from the opportunity to present this work in a seminar at the Institute for Social and Economic Research, University of Essex. PB - Northwestern University UR - http://hdl.handle.net/1813/34447 ER - TY - JOUR T1 - Data Management of Confidential Data JF - International Journal of Digital Curation Y1 - 2013 A1 - Carl Lagoze A1 - William C. Block A1 - Jeremy Williams A1 - John M. Abowd A1 - Lars Vilhuber AB - Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc. The value of this data lies in the ability to join the identifiable entities with external data such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfill US federal data management mandates and interfering with basic scholarly practices such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access, and cite that data. VL - 8 N1 - Presented at 8th International Digital Curation Conference 2013, Amsterdam. See also http://hdl.handle.net/1813/30924 ER - TY - CONF T1 - Do ‘Don’t Know’ Responses = Survey Satisficing? Evidence from the Gallup Panel Paradata T2 - American Association for Public Opinion Research 2013 Annual Conference Y1 - 2013 A1 - Wang, Mengyang A1 - Ruppanner, Leah A1 - McCutcheon, Allan L. 
JF - American Association for Public Opinion Research 2013 Annual Conference CY - Boston, MA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - JOUR T1 - Do single mothers in the United States use the Earned Income Tax Credit to reduce unsecured debt? JF - Review of Economics of the Household Y1 - 2013 A1 - Shaefer, H. Luke A1 - Song, Xiaoqing A1 - Williams Shanks, Trina R. KW - Earned Income Tax Credit Single Mothers Unsecured Debt AB -

The Earned Income Tax Credit (EITC) is a refundable credit for low income workers mainly targeted at families with children. This study uses the Survey of Income and Program Participation’s topical modules on Assets and Liabilities to examine associations between the EITC expansions during the early 1990s and the unsecured debt of the households of single mothers. We use two difference-in-differences comparisons over the study period 1988–1999, first comparing single mothers to single childless women, and then comparing single mothers with two or more children to single mothers with exactly one child. In both cases we find that the EITC expansions are associated with a relative decline in the unsecured debt of affected households of single mothers. While not direct evidence of a causal relationship, this is suggestive evidence that single mothers may have used part of their EITC to limit the growth of their unsecured debt during this period.

N1 - NCRN ER - TY - CONF T1 - Ecological Prediction with Nonlinear Multivariate Time-Frequency Functional Data Models T2 - Joint Statistical Meetings 2013 Y1 - 2013 A1 - Wikle, C.K. JF - Joint Statistical Meetings 2013 CY - Montreal, Canada ER - TY - JOUR T1 - Ecological Prediction With Nonlinear Multivariate Time-Frequency Functional Data Models JF - Journal of Agricultural, Biological, and Environmental Statistics Y1 - 2013 A1 - Yang, W.H. A1 - Wikle, C.K. A1 - Holan, S.H. A1 - Wildhaber, M.L. VL - 18 UR - http://link.springer.com/article/10.1007/s13253-013-0142-1 ER - TY - JOUR T1 - Empirical Analysis of Data Breach Litigation JF - Journal of Empirical Legal Studies Y1 - 2013 A1 - Romanosky, A. A1 - Hoffman, D. A1 - Acquisti, A. VL - 11 ER - TY - CONF T1 - Encoding Provenance Metadata for Social Science Datasets T2 - Metadata and Semantics Research Y1 - 2013 A1 - Lagoze, Carl A1 - Williams, Jeremy A1 - Vilhuber, Lars ED - Garoufallou, Emmanouel ED - Greenberg, Jane KW - DDI KW - eSocial Science KW - Metadata KW - Provenance JF - Metadata and Semantics Research T3 - Communications in Computer and Information Science PB - Springer International Publishing VL - 390 SN - 978-3-319-03436-2 UR - http://dx.doi.org/10.1007/978-3-319-03437-9_13 ER - TY - RPRT T1 - Encoding Provenance of Social Science Data: Integrating PROV with DDI Y1 - 2013 A1 - Lagoze, Carl A1 - Block, William C A1 - Williams, Jeremy A1 - Abowd, John A1 - Vilhuber, Lars AB - Provenance is a key component of evaluating the integrity and reusability of data for scholarship. While recording and providing access to provenance has always been important, it is even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. 
The PROV model, developed under the auspices of the W3C, is a foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We report on the results of our experimentation with integrating the PROV model into the DDI metadata for a complex, but characteristic, example of social science data. We also present some preliminary thinking on how to visualize those graphs in the user interface. Submitted to EDDI13, the 5th Annual European DDI User Conference, December 2013, Paris, France. PB - Cornell University UR - http://hdl.handle.net/1813/34443 ER - TY - CONF T1 - Encoding Provenance of Social Science Data: Integrating PROV with DDI T2 - 5th Annual European DDI User Conference Y1 - 2013 A1 - Carl Lagoze A1 - William C. Block A1 - Jeremy Williams A1 - Lars Vilhuber KW - DDI KW - eSocial Science KW - Metadata KW - Provenance AB - Provenance is a key component of evaluating the integrity and reusability of data for scholarship. While recording and providing access to provenance has always been important, it is even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. The PROV model, developed under the auspices of the W3C, is a foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We report on the results of our experimentation with integrating the PROV model into the DDI metadata for a complex, but characteristic, example of social science data. We also present some preliminary thinking on how to visualize those graphs in the user interface. JF - 5th Annual European DDI User Conference ER - TY - JOUR T1 - On estimation of mean squared errors of benchmarked and empirical Bayes estimators JF - Statistica Sinica Y1 - 2013 A1 - Rebecca C. 
Steorts A1 - Malay Ghosh VL - 23 ER - TY - CONF T1 - Exact Sparse Recovery with L0 Projections T2 - 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining Y1 - 2013 A1 - Ping Li A1 - Cun-Hui Zhang JF - 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining ER - TY - CONF T1 - Examining item nonresponse through paradata and respondent characteristics: A multilevel approach T2 - American Association for Public Opinion Research 2013 Annual Conference Y1 - 2013 A1 - Cordova-Cazar, A.L. JF - American Association for Public Opinion Research 2013 Annual Conference CY - Boston, MA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - CONF T1 - Examining response time outliers through paradata in Online Panel Surveys T2 - American Association for Public Opinion Research 2013 Annual Conference Y1 - 2013 A1 - Lee, J. A1 - T. Al Baghal JF - American Association for Public Opinion Research 2013 Annual Conference CY - Boston, MA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - CONF T1 - Examining the relationship between error and behavior in the American Time Use Survey using audit trail paradata T2 - American Association for Public Opinion Research 2013 Annual Conference Y1 - 2013 A1 - Ruther, N. A1 - T. Al Baghal A1 - A. Eck A1 - L. Stuart A1 - L. Phillips A1 - R. Belli A1 - Soh, L-K JF - American Association for Public Opinion Research 2013 Annual Conference CY - Boston, MA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - RPRT T1 - Fast Near Neighbor Search in High-Dimensional Binary Data Y1 - 2013 A1 - Shrivastava, Anshumali A1 - Li, Ping AB - Numerous applications in search, databases, machine learning, and computer vision can benefit from efficient algorithms for near neighbor search. 
This paper proposes a simple framework for fast near neighbor search in high-dimensional binary data, which are common in practice (e.g., text). We develop a very simple and effective strategy for sub-linear time near neighbor search, by creating hash tables directly using the bits generated by b-bit minwise hashing. The advantages of our method are demonstrated through thorough comparisons with two strong baselines: spectral hashing and sign (1-bit) random projections. PB - Cornell University UR - http://hdl.handle.net/1813/37987 ER - TY - CONF T1 - Flexible Semiparametric Hierarchical Spatial Models T2 - Joint Statistical Meetings 2013 Y1 - 2013 A1 - Porter, A.T. JF - Joint Statistical Meetings 2013 CY - Montreal, Canada ER - TY - JOUR T1 - From Facebook Regrets to Facebook Privacy Nudges JF - Ohio State Law Journal Y1 - 2013 A1 - Wang, Y. A1 - Leon, P. G. A1 - Chen, X. A1 - Komanduri, S. A1 - Norcie, G. A1 - Scott, K. A1 - Acquisti, A. A1 - Cranor, L. F. A1 - Sadeh, N. N1 - Invited paper ER - TY - JOUR T1 - A Generalized Fellegi-Sunter Framework for Multiple Record Linkage with Application to Homicide Record Systems JF - Journal of the American Statistical Association Y1 - 2013 A1 - Sadinle, M. A1 - Fienberg, S. E. VL - 108 UR - http://dx.doi.org/10.1080/01621459.2012.757231 ER - TY - JOUR T1 - Gone in 15 Seconds: The Limits of Privacy Transparency and Control JF - IEEE Security & Privacy Y1 - 2013 A1 - Acquisti, A. A1 - Adjerid, I. A1 - Brandimarte, L. VL - 11 ER - TY - JOUR T1 - Handling Attrition in Longitudinal Studies: The Case for Refreshment Samples JF - Statist. Sci. Y1 - 2013 A1 - Deng, Yiting A1 - Hillygus, D. Sunshine A1 - Reiter, Jerome P. A1 - Si, Yajuan A1 - Zheng, Siyu AB - Panel studies typically suffer from attrition, which reduces sample size and can result in biased inferences. It is impossible to know whether or not the attrition causes bias from the observed panel data alone. 
Refreshment samples—new, randomly sampled respondents given the questionnaire at the same time as a subsequent wave of the panel—offer information that can be used to diagnose and adjust for bias due to attrition. We review and bolster the case for the use of refreshment samples in panel studies. We include examples of both a fully Bayesian approach for analyzing the concatenated panel and refreshment data, and a multiple imputation approach for analyzing only the original panel. For the latter, we document a positive bias in the usual multiple imputation variance estimator. We present models appropriate for three waves and two refreshment samples, including nonterminal attrition. We illustrate the three-wave analysis using the 2007–2008 Associated Press–Yahoo! News Election Poll. VL - 28 UR - http://dx.doi.org/10.1214/13-STS414 ER - TY - JOUR T1 - Hierarchical Bayesian Spatio-Temporal Conway-Maxwell Poisson Models with Dynamic Dispersion JF - Journal of Agricultural, Biological, and Environmental Statistics Y1 - 2013 A1 - Wu, G. A1 - Holan, S.H. A1 - Wikle, C.K. CY - Anchorage, Alaska VL - 18 UR - http://link.springer.com/article/10.1007/s13253-013-0141-2 ER - TY - JOUR T1 - Hierarchical Spatio-Temporal Models and Survey Research JF - Statistics Views Y1 - 2013 A1 - Wikle, C. A1 - Holan, S. A1 - Cressie, N. UR - http://www.statisticsviews.com/details/feature/4730991/Hierarchical-Spatio-Temporal-Models-and-Survey-Research.html ER - TY - JOUR T1 - Hierarchical Statistical Modeling of Big Spatial Datasets Using the Exponential Family of Distributions JF - Spatial Statistics Y1 - 2013 A1 - Sengupta, A. A1 - Cressie, N. KW - EM algorithm KW - Empirical Bayes KW - Geostatistical process KW - Maximum likelihood estimation KW - MCMC KW - SRE model VL - 4 UR - http://www.sciencedirect.com/science/article/pii/S2211675313000055 ER - TY - ABST T1 - How can survey estimates of small areas be improved by leveraging social-media data? Y1 - 2013 A1 - Cressie, N. A1 - Holan, S. 
A1 - Wikle, C. JF - The Survey Statistician UR - http://isi.cbs.nl/iass/N68.pdf ER - TY - JOUR T1 - Identifying Neighborhoods Using High Resolution Population Data JF - Annals of the Association of American Geographers Y1 - 2013 A1 - S.E. Spielman A1 - J. Logan VL - 103 ER - TY - RPRT T1 - Improving User Access to Metadata for Public and Restricted Use US Federal Statistical Files Y1 - 2013 A1 - Block, William C. A1 - Williams, Jeremy A1 - Vilhuber, Lars A1 - Lagoze, Carl A1 - Brown, Warren A1 - Abowd, John M. AB - Presentation at NADDI 2013. This record has also been archived at http://kuscholarworks.ku.edu/dspace/handle/1808/11093. PB - Cornell University UR - http://hdl.handle.net/1813/33362 ER - TY - CONF T1 - Is it the Typeset or the Type of Statistics? Disfluent Font and Self-Disclosure T2 - Proceedings of Learning from Authoritative Security Experiment Results (LASER) Y1 - 2013 A1 - Balebako, R. A1 - Pe'er, E. A1 - Brandimarte, L. A1 - Cranor, L. F. A1 - Acquisti, A. JF - Proceedings of Learning from Authoritative Security Experiment Results (LASER) PB - USENIX Association CY - New York, NY UR - https://www.usenix.org/laser2013/program/balebako ER - TY - RPRT T1 - Managing Confidentiality and Provenance across Mixed Private and Publicly-Accessed Data and Metadata Y1 - 2013 A1 - Vilhuber, Lars A1 - Abowd, John A1 - Block, William A1 - Lagoze, Carl A1 - Williams, Jeremy AB - Social science researchers are increasingly interested in making use of confidential micro-data that contains linkages to the identities of people, corporations, etc.
The value of this linking lies in the potential to join these identifiable entities with external data such as genome data, geospatial information, and the like. Leveraging these linkages is an essential aspect of “big data” scholarship. However, the utility of these confidential data for scholarship is compromised by the complex nature of their management and curation. This makes it difficult to fulfill US federal data management mandates and interferes with basic scholarly practices such as validation and reuse of existing results. We describe in this paper our work on the CED2AR prototype, a first step in providing researchers with a tool that spans the confidential/publicly-accessible divide, making it possible for researchers to identify, search, access, and cite those data. The particular points of interest in our work are the cloaking of metadata fields and the expression of provenance chains. For the former, we make use of existing fields in the DDI (Data Documentation Initiative) specification and suggest some minor changes to the specification. For the latter problem, we investigate the integration of DDI with recent work by the W3C PROV working group that has developed a generalizable and extensible model for expressing data provenance. PB - Cornell University UR - http://hdl.handle.net/1813/34534 ER - TY - JOUR T1 - Memory, communication, and data quality in calendar interviews JF - Public Opinion Quarterly Y1 - 2013 A1 - Belli, R. F. A1 - Bilgen, I. A1 - T. Al Baghal VL - 77 ER - TY - THES T1 - Mental Disorders and Inequality in the United States: Intersection of race, gender, and disability on employment and income T2 - Social Work Y1 - 2013 A1 - Camp, J. JF - Social Work PB - Wayne State University VL - Ph.D.
ER - TY - JOUR T1 - Misplaced confidences: Privacy and the control paradox JF - Social Psychological and Personality Science Y1 - 2013 A1 - Laura Brandimarte A1 - Alessandro Acquisti A1 - George Loewenstein VL - 4 ER - TY - RPRT T1 - NCRN Meeting Spring 2013 Y1 - 2013 A1 - Vilhuber, Lars AB - Held at the NISS Headquarters, Research Triangle Park, NC. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/45870 ER - TY - RPRT T1 - NCRN Newsletter: Volume 1 - Issue 1 Y1 - 2013 A1 - Vilhuber, Lars A1 - Karr, Alan A1 - Reiter, Jerome A1 - Abowd, John A1 - Nunnelly, Jamie AB - Overview of activities at NSF-Census Research Network nodes from July 2013 to November 2013. NCRN Newsletter Vol. 1, Issue 1: November 17, 2013. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/40232 ER - TY - JOUR T1 - Neighborhood contexts, health, and behavior: understanding the role of scale and residential sorting JF - Environment and Planning B Y1 - 2013 A1 - Spielman, S. E. A1 - Linkletter, C. A1 - Yoo, E.-H. VL - 3 ER - TY - CONF T1 - Nonlinear Dynamic Spatio-Temporal Statistical Models T2 - Southern Regional Council on Statistics Summer Research Conference Y1 - 2013 A1 - Wikle, C.K. JF - Southern Regional Council on Statistics Summer Research Conference ER - TY - JOUR T1 - Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys JF - Journal of Educational and Behavioral Statistics Y1 - 2013 A1 - Si, Y. A1 - Reiter, J.P. VL - 38 UR - http://www.stat.duke.edu/~jerry/Papers/StatinMed14.pdf ER - TY - CONF T1 - Paradata for Measurement Error Evaluation T2 - American Association for Public Opinion Research 2013 Annual Conference Y1 - 2013 A1 - Olson, K.
JF - American Association for Public Opinion Research 2013 Annual Conference CY - Boston, MA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - CONF T1 - Predicting survey breakoff in Internet survey panels T2 - American Association for Public Opinion Research 2013 Annual Conference Y1 - 2013 A1 - McCutcheon, A.L. A1 - T. Al Baghal JF - American Association for Public Opinion Research 2013 Annual Conference CY - Boston, MA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - CONF T1 - Predicting the occurrence of respondent retrieval strategies in calendar interviewing: The quality of autobiographical recall in surveys T2 - Biennial conference of the Society for Applied Research in Memory and Cognition Y1 - 2013 A1 - Belli, R.F. A1 - Miller, L.D. A1 - Soh, L-K A1 - T. Al Baghal JF - Biennial conference of the Society for Applied Research in Memory and Cognition CY - Rotterdam, Netherlands UR - http://static1.squarespace.com/static/504170d6e4b0b97fe5a59760/t/52457a8be4b0012b7a5f462a/1380285067247/SARMAC_X_PaperJune27.pdf ER - TY - CONF T1 - Predicting the occurrence of respondent retrieval strategies in calendar interviewing: The quality of retrospective reports T2 - American Association for Public Opinion Research 2013 Annual Conference Y1 - 2013 A1 - Belli, R.F. A1 - Miller, L.D. A1 - Soh, L-K A1 - T. 
Al Baghal JF - American Association for Public Opinion Research 2013 Annual Conference CY - Boston, MA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - RPRT T1 - Presentation: Predicting Multiple Responses with Boosting and Trees Y1 - 2013 A1 - Li, Ping A1 - Abowd, John AB - Presentation by Ping Li and John Abowd at FCSM on November 4, 2013. PB - Cornell University UR - http://hdl.handle.net/1813/40255 ER - TY - CONF T1 - The process of turning audit trails from a CATI survey into useful data: Interviewer behavior paradata in the American Time Use Survey T2 - American Association for Public Opinion Research 2013 Annual Conference Y1 - 2013 A1 - Ruther, N. A1 - Phipps, P. A1 - Belli, R.F. JF - American Association for Public Opinion Research 2013 Annual Conference CY - Boston, MA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - ABST T1 - Recent Advances in Spatial Methods for Federal Surveys Y1 - 2013 A1 - Holan, S.H. ER - TY - RPRT T1 - Reconsidering the Consequences of Worker Displacements: Survey versus Administrative Measurements Y1 - 2013 A1 - Flaaen, Aaron A1 - Shapiro, Matthew A1 - Isaac Sorkin AB - Displaced workers suffer persistent earnings losses. This stark finding has been established by following workers in administrative data after mass layoffs under the presumption that these are involuntary job losses owing to economic distress. Using linked survey and administrative data, this paper examines this presumption by matching worker-supplied reasons for separations with what is happening at the firm. The paper documents substantially different earnings dynamics in mass layoffs depending on the reason the worker gives for the separation.
Using a new methodology for accounting for the increase in the probability of separation among all types of survey response during a mass layoff, the paper finds earnings loss estimates that are surprisingly close to those using only administrative data. Finally, the survey-administrative link allows the decomposition of earnings losses due to subsequent nonemployment into non-participation and unemployment. Including the zero earnings of those identified as being unemployed substantially increases the estimate of earnings losses. PB - University of Michigan UR - http://www-personal.umich.edu/~shapiro/papers/ReconsideringDisplacements.pdf ER - TY - ABST T1 - A Reduced Rank Model for Analyzing Multivariate Spatial Datasets Y1 - 2013 A1 - Bradley, J.R. JF - University of Missouri-Kansas City PB - University of Missouri-Kansas City ER - TY - JOUR T1 - Ringtail: a generalized nowcasting system. JF - WebDB Y1 - 2013 A1 - Antenucci, Dolan A1 - Li, Erdong A1 - Liu, Shaobo A1 - Zhang, Bochun A1 - Cafarella, Michael J A1 - Ré, Christopher AB - Social media nowcasting—using online user activity to describe real-world phenomena—is an active area of research to supplement more traditional and costly data collection methods such as phone surveys. Given the potential impact of such research, we would expect general-purpose nowcasting systems to quickly become a standard tool among non-computer scientists, yet it has largely remained a research topic. We believe a major obstacle to widespread adoption is the nowcasting feature selection problem. Typical nowcasting systems require the user to choose a handful of social media objects from a pool of billions of potential candidates, which can be a time-consuming and error-prone process. We have built Ringtail, a nowcasting system that helps the user by automatically suggesting high-quality signals. We demonstrate that Ringtail can make nowcasting easier by suggesting relevant features for a range of topics.
The user provides just a short topic query (e.g., unemployment) and a small conventional dataset in order for Ringtail to quickly return a usable predictive nowcasting model. VL - 6 UR - http://cs.stanford.edu/people/chrismre/papers/Ringtail-VLDB-demo.pdf ER - TY - JOUR T1 - Ringtail: Feature Selection for Easier Nowcasting. JF - WebDB Y1 - 2013 A1 - Antenucci, Dolan A1 - Cafarella, Michael J A1 - Levenstein, Margaret C. A1 - Ré, Christopher A1 - Shapiro, Matthew AB - In recent years, social media “nowcasting”—the use of online user activity to predict various ongoing real-world social phenomena—has become a popular research topic; yet, this popularity has not led to widespread actual practice. We believe a major obstacle to widespread adoption is the feature selection problem. Typical nowcasting systems require the user to choose a set of relevant social media objects, which is difficult, time-consuming, and can imply a statistical background that users may not have. We propose Ringtail, which helps the user choose relevant social media signals. It takes a single user input string (e.g., unemployment) and yields a number of relevant signals the user can use to build a nowcasting model. We evaluate Ringtail on six different topics using a corpus of almost 6 billion tweets, showing that features chosen by Ringtail in a wholly-automated way are better or as good as those from a human and substantially better if Ringtail receives some human assistance. In all cases, Ringtail reduces the burden on the user. UR - http://www.cs.stanford.edu/people/chrismre/papers/webdb_ringtail.pdf ER - TY - JOUR T1 - Rising extreme poverty in the United States and the response of means-tested transfers. JF - Social Service Review Y1 - 2013 A1 - H. Luke Shaefer A1 - Edin, K. AB - This study documents an increase in the prevalence of extreme poverty among US households with children between 1996 and 2011 and assesses the response of major federal means-tested transfer programs.
Extreme poverty is defined using a World Bank metric of global poverty: $2 or less, per person, per day. Using the 1996–2008 panels of the Survey of Income and Program Participation (SIPP), we estimate that in mid-2011, 1.65 million households with 3.55 million children were living in extreme poverty in a given month, based on cash income, constituting 4.3 percent of all nonelderly households with children. The prevalence of extreme poverty has risen sharply since 1996, particularly among those most affected by the 1996 welfare reform. Adding SNAP benefits to household income reduces the number of extremely poor households with children by 48.0 percent in mid-2011. Adding SNAP, refundable tax credits, and housing subsidies reduces it by 62.8 percent. VL - 87 UR - http://www.jstor.org/stable/10.1086/671012 IS - 2 ER - TY - CONF T1 - Sleights of Privacy: Framing, Disclosures, and the Limits of Transparency T2 - Proceedings of the Ninth Symposium on Usable Privacy and Security (SOUPS) Y1 - 2013 A1 - Adjerid, I. A1 - Acquisti, A. A1 - Loewenstein, G. JF - Proceedings of the Ninth Symposium on Usable Privacy and Security (SOUPS) PB - ACM CY - New York, NY ER - TY - ABST T1 - Some Historical Remarks on Spatial Statistics, Spatio-Temporal Statistics Y1 - 2013 A1 - Cressie, N. JF - Reading Group, University of Missouri ER - TY - THES T1 - Some Recent Advances in Non- and Semiparametric Bayesian Modeling with Copulas, Mixtures, and Latent Variables (Ph.D. Thesis) T2 - Department of Statistical Science Y1 - 2013 A1 - Jared S. Murray AB - This thesis develops flexible non- and semiparametric Bayesian models for mixed continuous, ordered and unordered categorical data. These methods have a range of possible applications; the applications considered in this thesis are drawn primarily from the social sciences, where multivariate, heterogeneous datasets with complex dependence and missing observations are the norm.
The first contribution is an extension of the Gaussian factor model to Gaussian copula factor models, which accommodate continuous and ordinal data with unspecified marginal distributions. I describe how this model is the most natural extension of the Gaussian factor model, preserving its essential dependence structure and the interpretability of factor loadings and the latent variables. I adopt an approximate likelihood for posterior inference and prove that, if the Gaussian copula model is true, the approximate posterior distribution of the copula correlation matrix asymptotically converges to the correct parameter under nearly any marginal distributions. I demonstrate with simulations that this method is both robust and efficient, and illustrate its use in an application from political science. The second contribution is a novel nonparametric hierarchical mixture model for continuous, ordered and unordered categorical data. The model includes a hierarchical prior used to couple component indices of two separate models, which are also linked by local multivariate regressions. This structure effectively overcomes the limitations of existing mixture models for mixed data, namely the overly strong local independence assumptions. In the proposed model local independence is replaced by local conditional independence, so that the induced model is able to more readily adapt to structure in the data. I demonstrate the utility of this model as a default engine for multiple imputation of mixed data in a large repeated-sampling study using data from the Survey of Income and Program Participation. I show that it improves substantially on its most popular competitor, multiple imputation by chained equations (MICE), while enjoying certain theoretical properties that MICE lacks. The third contribution is a latent variable model for density regression.
Most existing density regression models are quite flexible but somewhat cumbersome to specify and fit, particularly when the regressors are a combination of continuous and categorical variables. The majority of these methods rely on extensions of infinite discrete mixture models to incorporate covariate dependence in mixture weights, atoms or both. I take a fundamentally different approach, introducing a continuous latent variable which depends on covariates through a parametric regression. In turn, the observed response depends on the latent variable through an unknown function. I demonstrate that a spline prior for the unknown function is quite effective relative to Dirichlet Process mixture models in density estimation settings (i.e., without covariates) even though these Dirichlet process mixtures have better theoretical properties asymptotically. The spline formulation enjoys a number of computational advantages over more flexible priors on functions. Finally, I demonstrate the utility of this model in regression applications using a dataset on U.S. wages from the Census Bureau, where I estimate the return to schooling as a smooth function of the quantile index. JF - Department of Statistical Science PB - Duke University UR - http://dukespace.lib.duke.edu/dspace/handle/10161/8253 ER - TY - ABST T1 - Spatial Fay-Herriot Models for Small Area Estimation with Functional Covariates Y1 - 2013 A1 - Porter, A.T. ER - TY - CHAP T1 - Spatio-temporal Design: Advances in Efficient Data Acquisition T2 - Spatio-temporal Design: Advances in Efficient Data Acquisition Y1 - 2013 A1 - Holan, S. A1 - Wikle, C. ED - Jorge Mateu ED - Werner Muller KW - semiparametric dynamic design for non-Gaussian spatio-temporal data JF - Spatio-temporal Design: Advances in Efficient Data Acquisition PB - Wiley SN - 9780470974292 ER - TY - ABST T1 - Statistics and the Environment: Overview and Challenges Y1 - 2013 A1 - Wikle, C.K. 
N1 - Invited Introductory Overview Lecture ER - TY - ABST T1 - Statistics for Spatio-Temporal Data Y1 - 2013 A1 - Cressie, N. JF - Invited One-Day Short Course at the U.S. Census Bureau ER - TY - CONF T1 - Troubles with time-use: Examining potential indicators of error in the American Time Use Survey T2 - American Association for Public Opinion Research 2013 Annual Conference Y1 - 2013 A1 - Phillips, A.L. A1 - T. Al Baghal A1 - Belli, R.F. JF - American Association for Public Opinion Research 2013 Annual Conference CY - Boston, MA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - JOUR T1 - Two-stage Bayesian benchmarking as applied to small area estimation JF - TEST Y1 - 2013 A1 - Rebecca C. Steorts A1 - Malay Ghosh KW - small area estimation VL - 22 IS - 4 ER - TY - THES T1 - User Modeling via Machine Learning and Rule-based Reasoning to Understand and Predict Errors in Survey Systems Y1 - 2013 A1 - Stuart, Leonard Cleve PB - University of Nebraska-Lincoln UR - http://digitalcommons.unl.edu/computerscidiss/70/ ER - TY - JOUR T1 - Using High Resolution Population Data to Identify Neighborhoods and Determine their Boundaries JF - Annals of the Association of American Geographers Y1 - 2013 A1 - Spielman, S. E. A1 - Logan, J. VL - 103 UR - http://www.tandfonline.com/doi/abs/10.1080/00045608.2012.685049 ER - TY - THES T1 - Using Satellite Imagery to Evaluate and Analyze Socioeconomic Changes Observed with Census Data Y1 - 2013 A1 - Wilson, C. R. N1 - NCRN ER - TY - CONF T1 - What are you doing now?: Audit trails, Activity level responses and error in the American Time Use Survey T2 - American Association for Public Opinion Research Y1 - 2013 A1 - T. Al Baghal A1 - Phillips, A.L. A1 - Ruther, N. A1 - Belli, R.F. A1 - Stuart, L. A1 - Eck, A. 
A1 - Soh, L-K JF - American Association for Public Opinion Research CY - Boston, MA UR - http://www.aapor.org/AAPORKentico/Conference/Recent-Conferences.aspx ER - TY - JOUR T1 - What is Privacy Worth? JF - Journal of Legal Studies Y1 - 2013 A1 - Acquisti, A. A1 - John, L. A1 - Loewenstein, G. VL - 42 N1 - Leading paper, 2010 Future of Privacy Forum's Best “Privacy Papers for Policy Makers” Competition ER - TY - JOUR T1 - Achieving both valid and secure logistic regression analysis on aggregated data from different private sources JF - Journal of Privacy and Confidentiality Y1 - 2012 A1 - Yuval Nardi A1 - Robert Hall A1 - Stephen E. Fienberg VL - 4 ER - TY - JOUR T1 - An Approach for Identifying and Predicting Economic Recessions in Real-Time Using Time-Frequency Functional Models JF - Applied Stochastic Models in Business and Industry Y1 - 2012 A1 - Holan, S. A1 - Yang, W. A1 - Matteson, D. A1 - Wikle, C.K. KW - Bayesian model averaging KW - business cycles KW - empirical orthogonal functions KW - functional data KW - MIDAS KW - spectrogram KW - stochastic search variable selection VL - 28 UR - http://onlinelibrary.wiley.com/doi/10.1002/asmb.1954/full N1 - DOI: 10.1002/asmb.1954 ER - TY - ABST T1 - Asymptotic Theory of Cepstral Random Fields Y1 - 2012 A1 - McElroy, T. A1 - Holan, S. PB - University of Missouri N1 - Arxiv Preprint arXiv:1112.1977 ER - TY - RPRT T1 - Asymptotic Theory of Cepstral Random Fields Y1 - 2012 A1 - McElroy, T.S. A1 - Holan, S.H. AB - Random fields play a central role in the analysis of spatially correlated data and, as a result, have a significant impact on a broad array of scientific applications. Given the importance of this topic, there has been a substantial amount of research devoted to this area. However, the cepstral random field model remains largely underdeveloped outside the engineering literature.
We provide a comprehensive treatment of the asymptotic theory for two-dimensional random field models. In particular, we provide recursive formulas that connect the spatial cepstral coefficients to an equivalent moving-average random field, which facilitates easy computation of the necessary autocovariance matrix. Additionally, we establish asymptotic consistency results for Bayesian, maximum likelihood, and quasi-maximum likelihood estimation of random field parameters and regression parameters. Further, in both the maximum and quasi-maximum likelihood frameworks, we derive the asymptotic distribution of our estimator. The theoretical results are presented generally and are of independent interest, pertaining to a wide class of random field models. The results for the cepstral model facilitate model-building: because the cepstral coefficients are unconstrained in practice, numerical optimization is greatly simplified, and we are always guaranteed a positive definite covariance matrix. We show that inference for individual coefficients is possible, and one can refine models in a disciplined manner. Finally, our results are illustrated through simulation and the analysis of straw yield data in an agricultural field experiment. PB - University of Missouri UR - http://hdl.handle.net/1813/34461 ER - TY - JOUR T1 - Bayesian Multi-Regime Smooth Transition Regression with Ordered Categorical Variables JF - Computational Statistics and Data Analysis Y1 - 2012 A1 - Wang, J. A1 - Holan, S. VL - 56 UR - http://dx.doi.org/10.1016/j.csda.2012.04.018 N1 - http://dx.doi.org/10.1016/j.csda.2012.04.018 ER - TY - ABST T1 - Bayesian Multiscale Multiple Imputation With Implications to Data Confidentiality Y1 - 2012 A1 - Holan, S.H.
N1 - Texas A&M University, January 2012; Duke University (Hosted by Duke Node), February 2012; Rice University, March 2012; Clemson University, April 2012 ER - TY - CONF T1 - Bayesian Parametric and Nonparametric Inference for Multiple Record Linkage T2 - Modern Nonparametric Methods in Machine Learning Workshop Y1 - 2012 A1 - Hall, R. A1 - Steorts, R. A1 - Fienberg, S. E. JF - Modern Nonparametric Methods in Machine Learning Workshop PB - NIPS UR - http://www.stat.cmu.edu/NCRN/PUBLIC/files/beka_nips_finalsub4.pdf ER - TY - CONF T1 - Calendar interviewing in life course research: Associations between verbal behaviors and data quality T2 - Eighth International Conference on Social Science Methodology Y1 - 2012 A1 - Belli, R.F. A1 - Bilgen, I. A1 - T. Al Baghal JF - Eighth International Conference on Social Science Methodology CY - Sydney, Australia UR - https://conference.acspri.org.au/index.php/rc33/2012/paper/view/366 ER - TY - CONF T1 - Change of Support in Spatio-Temporal Dynamical Models T2 - Joint Statistical Meetings Y1 - 2012 A1 - Wikle, C.K. JF - Joint Statistical Meetings CY - Montreal, Canada ER - TY - ABST T1 - Confidentiality and Privacy Protection in a Non-US Census Context Y1 - 2012 A1 - Anne-Sophie Charest PB - Carnegie Mellon University ER - TY - CONF T1 - Counting the people T2 - Nathan and Beatrice Keyfitz Lecture in Mathematics and the Social Sciences Y1 - 2012 A1 - Stephen E. Fienberg JF - Nathan and Beatrice Keyfitz Lecture in Mathematics and the Social Sciences PB - Fields Institute CY - Toronto, Canada ER - TY - THES T1 - Creation and Analysis of Differentially-Private Synthesis Datasets Y1 - 2012 A1 - Anne-Sophie Charest PB - Carnegie Mellon University N1 - PhD Thesis, Department of Statistics ER - TY - RPRT T1 - Data Management of Confidential Data Y1 - 2012 A1 - Lagoze, Carl A1 - Block, William C. A1 - Williams, Jeremy A1 - Abowd, John M.
A1 - Vilhuber, Lars AB - Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc. The value of this data lies in the ability to join the identifiable entities with external data such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfill US federal data management mandates and interfering with basic scholarly practices such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access, and cite that data. PB - Cornell University UR - http://hdl.handle.net/1813/30924 ER - TY - JOUR T1 - Differential Privacy for Protecting Multi-dimensional Contingency Table Data: Extensions and Applications JF - Journal of Privacy and Confidentiality Y1 - 2012 A1 - Yang Xiaolin A1 - Stephen E. Fienberg A1 - Alessandro Rinaldo VL - 4 ER - TY - CONF T1 - Differential Privacy for Synthetic Datasets T2 - Proceedings of the Survey Research Section of the SSC Y1 - 2012 A1 - Anne-Sophie Charest JF - Proceedings of the Survey Research Section of the SSC CY - Guelph, Ontario N1 - Invited session on Confidentiality of the Annual Meeting of the Statistical Society of Canada ER - TY - CONF T1 - Disambiguating USPTO Inventors with Classification Models Trained on Comparisons of Labeled Inventor Records T2 - Conference Presentation Classification Society Annual Meeting, Carnegie Mellon University Y1 - 2012 A1 - Samuel Ventura A1 - Rebecca Nugent A1 - Erich R.H.
Fuchs JF - Conference Presentation Classification Society Annual Meeting, Carnegie Mellon University ER - TY - RPRT T1 - An Early Prototype of the Comprehensive Extensible Data Documentation and Access Repository (CED2AR) Y1 - 2012 A1 - Block, William C. A1 - Williams, Jeremy A1 - Abowd, John M. A1 - Vilhuber, Lars A1 - Lagoze, Carl AB - This presentation will demonstrate the latest DDI-related technological developments of Cornell University’s $3 million NSF-Census Research Network (NCRN) award, dedicated to improving the documentation, discoverability, and accessibility of public and restricted data from the federal statistical system in the United States. The current internal name for our DDI-based system is the Comprehensive Extensible Data Documentation and Access Repository (CED²AR). CED²AR ingests metadata from heterogeneous sources and supports filtered synchronization between restricted and public metadata holdings. Currently-supported CED²AR “connector workflows” include mechanisms to ingest IPUMS, zero-observation files from the American Community Survey (DDI 2.1), and SIPP Synthetic Beta (DDI 1.2). These disparate metadata sources are all transformed into a DDI 2.5 compliant form and stored in a single repository. In addition, we will demonstrate an extension to DDI 2.5 that allows for the labeling of elements within the schema to indicate confidentiality. This metadata can then be filtered, allowing the creation of derived public use metadata from an original confidential source. This repository is currently searchable online through a prototype application demonstrating the ability to search across previously heterogeneous metadata sources.
Presentation at the 4th Annual European DDI User Conference (EDDI12), Norwegian Social Science Data Services, Bergen, Norway, 3 December 2012 PB - Cornell University UR - http://hdl.handle.net/1813/30922 ER - TY - CONF T1 - The Economics of Privacy T2 - The Oxford Handbook of the Digital Economy Y1 - 2012 A1 - Laura Brandimarte A1 - Alessandro Acquisti ED - Martin Peitz ED - Joel Waldfogel JF - The Oxford Handbook of the Digital Economy PB - Oxford University Press SN - 9780195397840 ER - TY - ABST T1 - Efficient Time-Frequency Representations in High-Dimensional Spatial and Spatio-Temporal Models Y1 - 2012 A1 - Wikle, C.K. ER - TY - CONF T1 - Empirical Evaluation of Statistical Inference from Differentially-Private Contingency Tables T2 - Privacy in Statistical Databases Y1 - 2012 A1 - Anne-Sophie Charest ED - Josep Domingo-Ferrer ED - Ilenia Tinnirello JF - Privacy in Statistical Databases PB - Springer VL - 7556 SN - 978-3-642-33627-0 N1 - Print ISBN is 978-3-642-33626-3 ER - TY - RPRT T1 - Encoding Provenance Metadata for Social Science Datasets Y1 - 2012 A1 - Lagoze, Carl A1 - Williams, Jeremy A1 - Vilhuber, Lars AB - Recording provenance is a key requirement for data-centric scholarship, allowing researchers to evaluate the integrity of source data sets and reproduce, and thereby, validate results. Provenance has become even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. Recent work by the W3C on the PROV model provides the foundation for semantically-rich, interoperable, and web-compatible provenance metadata.
We apply that model to complex, but characteristic, provenance examples of social science data, describe scenarios that make scholarly use of those provenance descriptions, and propose a manner for encoding this provenance metadata within the widely-used DDI metadata standard. Submitted to Metadata and Semantics Research (MTSR 2013) conference. PB - Cornell University UR - http://hdl.handle.net/1813/55327 ER - TY - CHAP T1 - Entropy Estimations Using Correlated Symmetric Stable Random Projections T2 - Advances in Neural Information Processing Systems 25 Y1 - 2012 A1 - Ping Li A1 - Cun-Hui Zhang ED - P. Bartlett ED - F.C.N. Pereira ED - C.J.C. Burges ED - L. Bottou ED - K.Q. Weinberger JF - Advances in Neural Information Processing Systems 25 UR - http://books.nips.cc/papers/files/nips25/NIPS2012_1456.pdf ER - TY - JOUR T1 - Estimating identification disclosure risk using mixed membership models JF - Journal of the American Statistical Association Y1 - 2012 A1 - Manrique-Vallier, D. A1 - Reiter, J.P. VL - 107 ER - TY - CONF T1 - On Estimation of Mean Squared Errors of Benchmarked and Empirical Bayes Estimators T2 - 2012 Joint Statistical Meetings Y1 - 2012 A1 - Rebecca C. Steorts A1 - Malay Ghosh JF - 2012 Joint Statistical Meetings CY - San Diego, CA ER - TY - CONF T1 - Exploring interviewer and respondent interactions: An innovative behavior coding approach T2 - Midwest Association for Public Opinion Research 2012 Annual Conference Y1 - 2012 A1 - Walton, L. A1 - Stange, M. A1 - Powell, R. A1 - Belli, R.F. JF - Midwest Association for Public Opinion Research 2012 Annual Conference CY - Chicago, IL UR - http://www.mapor.org/conferences.html ER - TY - ABST T1 - Extreme Poverty in the United States, 1996 to 2011 Y1 - 2012 A1 - Shaefer, H. 
Luke A1 - Edin, Kathryn PB - University of Michigan UR - http://www.npc.umich.edu/publications/policy_briefs/brief28/policybrief28.pdf N1 - NCRN ER - TY - CONF T1 - Fast Multi-task Learning for Query Spelling Correction T2 - The 21$^{st}$ ACM International Conference on Information and Knowledge Management (CIKM 2012) Y1 - 2012 A1 - Xu Sun A1 - Anshumali Shrivastava A1 - Ping Li JF - The 21$^{st}$ ACM International Conference on Information and Knowledge Management (CIKM 2012) UR - http://dx.doi.org/10.1145/2396761.2396800 ER - TY - CONF T1 - Fast Near Neighbor Search in High-Dimensional Binary Data T2 - The European Conference on Machine Learning (ECML 2012) Y1 - 2012 A1 - Anshumali Shrivastava A1 - Ping Li JF - The European Conference on Machine Learning (ECML 2012) ER - TY - CONF T1 - Flexible Spectral Models for Multivariate Time Series T2 - Joint Statistical Meetings 2012 Y1 - 2012 A1 - Holan, S.H. JF - Joint Statistical Meetings 2012 ER - TY - RPRT T1 - A Generalized Fellegi-Sunter Framework for Multiple Record Linkage with Application to Homicide Records Systems Y1 - 2012 A1 - Mauricio Sadinle A1 - Stephen E. Fienberg JF - arXiv UR - https://arxiv.org/abs/1205.3217 ER - TY - CONF T1 - GPU-based minwise hashing T2 - Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume) Y1 - 2012 A1 - Ping Li A1 - Anshumali Shrivastava A1 - Arnd Christian König JF - Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume) UR - http://doi.acm.org/10.1145/2187980.2188129 ER - TY - CONF T1 - Hierarchical General Quadratic Nonlinear Models for Spatio-Temporal Dynamics T2 - Red Raider Conference Y1 - 2012 A1 - Wikle, C.K. JF - Red Raider Conference PB - Texas Tech University CY - Lubbock, TX ER - TY - ABST T1 - Hierarchical Statistical Modeling of Big Spatial Datasets Using the Exponential Family of Distributions Y1 - 2012 A1 - Sengupta, A. A1 - Cressie, N. 
PB - The Ohio State University ER - TY - ABST T1 - Inference for Count Data using the Spatial Random Effects Model Y1 - 2012 A1 - Cressie, N. ER - TY - JOUR T1 - Inferentially valid partially synthetic data: Generating from posterior predictive distributions not necessary JF - Journal of Official Statistics Y1 - 2012 A1 - Reiter, J.P. A1 - Kinney, S.K. VL - 28 ER - TY - CONF T1 - Interviewer variance of interviewer and respondent behaviors: A new frontier in analyzing the interviewer-respondent interaction T2 - Midwest Association for Public Opinion Research 2012 Annual Conference Y1 - 2012 A1 - Charoenruk, N. A1 - Parkhurst, B. A1 - Ay, M. A1 - Belli, R. F. JF - Midwest Association for Public Opinion Research 2012 Annual Conference CY - Chicago, IL UR - http://www.mapor.org/conferences.html N1 - Annual conference of the Midwest Association for Public Opinion Research, Chicago, Illinois. ER - TY - CONF T1 - Logit-Based Confidence Intervals for Single Capture-Recapture Estimation T2 - American Statistical Association Pittsburgh Chapter Banquet Y1 - 2012 A1 - Mauricio Sadinle JF - American Statistical Association Pittsburgh Chapter Banquet CY - Pittsburgh, PA N1 - April 9, 2012 ER - TY - CONF T1 - Maintaining Quality in the Face of Rapid Program Expansion T2 - 2012 Joint Statistical Meetings Y1 - 2012 A1 - Cosma Shalizi A1 - Rebecca Nugent JF - 2012 Joint Statistical Meetings CY - San Diego, CA ER - TY - CONF T1 - Methods Matter: Revamping Inventor Disambiguation Algorithms with Classification Models and Labeled Inventor Records T2 - Conference Presentation Academy of Management Annual Meeting Y1 - 2012 A1 - Samuel Ventura A1 - Rebecca Nugent A1 - Erich R.H. 
Fuchs JF - Conference Presentation Academy of Management Annual Meeting CY - Boston, MA ER - TY - CONF T1 - Multi-File Record Linkage Using a Generalized Fellegi-Sunter Framework T2 - Conference Presentation Classification Society Annual Meeting, Carnegie Mellon University Y1 - 2012 A1 - Mauricio Sadinle JF - Conference Presentation Classification Society Annual Meeting, Carnegie Mellon University ER - TY - RPRT T1 - NCRN Meeting Fall 2012 Y1 - 2012 A1 - Vilhuber, Lars AB - Held at the Census Bureau Headquarters, Suitland, MD. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/45884 ER - TY - RPRT T1 - The NSF-Census Research Network: Cornell Node Y1 - 2012 A1 - Block, William C. A1 - Lagoze, Carl A1 - Vilhuber, Lars A1 - Brown, Warren A. A1 - Williams, Jeremy A1 - Arguillas, Florio AB - Cornell University has received a $3M NSF-Census Research Network (NCRN) award to improve the documentation and discoverability of both public and restricted data from the federal statistical system. The current internal name for this project is the Comprehensive Extensible Data Documentation and Access Repository (CED²AR). The diagram to the right provides a high-level architectural overview of the system to be implemented. The CED²AR will be based upon leading metadata standards such as the Data Documentation Initiative (DDI) and Statistical Data and Metadata eXchange (SDMX) and be flexibly designed to ingest documentation from a variety of source files. It will permit synchronization between the public and confidential instances of the repository. The scholarly community will be able to use the CED²AR as it would a conventional metadata repository, deprived only of the values of certain confidential information, but not their metadata. 
The authorized user, working on the secure Census Bureau network, could use the CED²AR with full information in authorized domains. PB - Cornell University UR - http://hdl.handle.net/1813/30925 ER - TY - CHAP T1 - One Permutation Hashing T2 - Advances in Neural Information Processing Systems 25 Y1 - 2012 A1 - Ping Li A1 - Art Owen A1 - Cun-Hui Zhang ED - P. Bartlett ED - F.C.N. Pereira ED - C.J.C. Burges ED - L. Bottou ED - K.Q. Weinberger JF - Advances in Neural Information Processing Systems 25 UR - http://books.nips.cc/papers/files/nips25/NIPS2012_1436.pdf ER - TY - RPRT T1 - Presentation: Revisiting the Economics of Privacy: Population Statistics and Privacy as Public Goods Y1 - 2012 A1 - Abowd, John AB - Anonymization and data quality are intimately linked. Although this link has been properly acknowledged in the Computer Science and Statistical Disclosure Limitation literatures, economics offers a framework for formalizing the linkage and analyzing optimal decisions and equilibrium outcomes. The opinions expressed in this presentation are those of the author and not those of the National Science Foundation or the Census Bureau. PB - Cornell University UR - http://hdl.handle.net/1813/30937 ER - TY - JOUR T1 - Privacy in a world of electronic data: Whom should you trust? JF - Notices of the AMS Y1 - 2012 A1 - Stephen E. Fienberg VL - 59 ER - TY - JOUR T1 - Privacy-preserving data sharing in high dimensional regression and classification settings JF - Journal of Privacy and Confidentiality Y1 - 2012 A1 - Stephen E. Fienberg A1 - Jiashun Jin VL - 4 ER - TY - CHAP T1 - A Proposed Solution to the Archiving and Curation of Confidential Scientific Inputs T2 - Privacy in Statistical Databases Y1 - 2012 A1 - Abowd, John M. 
A1 - Vilhuber, Lars A1 - Block, William ED - Domingo-Ferrer, Josep ED - Tinnirello, Ilenia KW - Data Archive KW - Data Curation KW - Privacy-preserving Datamining KW - Statistical Disclosure Limitation JF - Privacy in Statistical Databases T3 - Lecture Notes in Computer Science PB - Springer Berlin Heidelberg VL - 7556 SN - 978-3-642-33626-3 UR - http://dx.doi.org/10.1007/978-3-642-33627-0_17 ER - TY - CONF T1 - Query spelling correction using multi-task learning T2 - Proceedings of the 21st World Wide Web Conference (WWW 2012)(Companion Volume) Y1 - 2012 A1 - Xu Sun A1 - Anshumali Shrivastava A1 - Ping Li JF - Proceedings of the 21st World Wide Web Conference (WWW 2012)(Companion Volume) UR - http://doi.acm.org/10.1145/2187980.2188153 ER - TY - JOUR T1 - Rejoinder: An approach for identifying and predicting economic recessions in real time using time frequency functional models JF - Applied Stochastic Models in Business and Industry Y1 - 2012 A1 - Holan, S. A1 - Yang, W. A1 - Matteson, D. A1 - Wikle, C. VL - 28 UR - http://onlinelibrary.wiley.com/doi/10.1002/asmb.1955/full ER - TY - CHAP T1 - Semiparametric Dynamic Design of Monitoring Networks for Non-Gaussian Spatio-Temporal Data T2 - Spatio-temporal Design: Advances in Efficient Data Acquisition Y1 - 2012 A1 - Holan, S. A1 - Wikle, C.K. 
ED - Jorge Mateu ED - Werner Muller JF - Spatio-temporal Design: Advances in Efficient Data Acquisition PB - Wiley CY - Chichester, UK UR - http://onlinelibrary.wiley.com/doi/10.1002/9781118441862.ch12/summary ER - TY - CONF T1 - Sleight of Privacy T2 - Conference on Web Privacy Measurement Y1 - 2012 A1 - Idris Adjerid A1 - Alessandro Acquisti A1 - Laura Brandimarte JF - Conference on Web Privacy Measurement ER - TY - THES T1 - Smooth Post-Stratification in Multiple Capture Recapture Y1 - 2012 A1 - Zachary Kurtz PB - Carnegie Mellon University N1 - Department of Statistics ER - TY - ABST T1 - Spatio-Temporal Statistics at Mizzou, Truman School of Public Affairs Y1 - 2012 A1 - Wikle, C.K. ER - TY - CONF T1 - Statistics in Service to the Nation T2 - Presentation Samuel S. Wilks Lecture Y1 - 2012 A1 - Stephen E. Fienberg JF - Presentation Samuel S. Wilks Lecture CY - Princeton, NJ N1 - April 23, 2012 ER - TY - CONF T1 - Teaching about Big Data: Curricular Issues T2 - 2012 Joint Statistical Meetings Y1 - 2012 A1 - Stephen E. 
Fienberg JF - 2012 Joint Statistical Meetings CY - San Diego, CA ER - TY - JOUR T1 - Testing for Membership to the IFRA and the NBU Classes of Distributions JF - Journal of Machine Learning Research - Proceedings Track for the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012) Y1 - 2012 A1 - Radhendushka Srivastava A1 - Ping Li A1 - Debasis Sengupta VL - 22 UR - http://jmlr.csail.mit.edu/proceedings/papers/v22/srivastava12.html ER - TY - CONF T1 - Thinking inside the box: Mapping the microstructure of urban environment (and why it matters) T2 - AutoCarto 2012 Y1 - 2012 A1 - Seth Spielman A1 - David Folch A1 - John Logan A1 - Nicholas Nagle KW - cartography JF - AutoCarto 2012 CY - Columbus, Ohio UR - http://www.cartogis.org/docs/proceedings/2012/Spielman_etal_AutoCarto2012.pdf ER - TY - CONF T1 - Troubles with time-use: Examining potential indicators of error in the ATUS T2 - Midwest Association for Public Opinion Research 2012 Annual Conference Y1 - 2012 A1 - Phillips, A. L. A1 - T. Al Baghal A1 - Belli, R. F. JF - Midwest Association for Public Opinion Research 2012 Annual Conference CY - Chicago, IL UR - http://www.mapor.org/conferences.html N1 - Presented at the annual conference of the Midwest Association for Public Opinion Research, Chicago, Illinois ER - TY - CONF T1 - Valid Statistical Inference on Automatically Matched Files T2 - Privacy in Statistical Databases Y1 - 2012 A1 - Robert Hall A1 - Stephen E. Fienberg ED - Josep Domingo-Ferrer ED - Ilenia Tinnirello JF - Privacy in Statistical Databases PB - Springer ER - TY - JOUR T1 - The welfare reforms of the 1990s and the stratification of material well-being among low-income households with children JF - Children and Youth Services Review Y1 - 2012 A1 - Shaefer, H. Luke A1 - Ybarra, Marci AB -

We examine the incidence of material hardship experienced by low-income households with children, before and after the major changes to U.S. anti-poverty programs during the 1990s. We use the Survey of Income and Program Participation (SIPP) to examine a series of measures of household material hardship that were collected in the years 1992, 1995, 1998, 2003 and 2005. We stratify our sample to differentiate between the 1) deeply poor (<50% of poverty), who saw a decline in public assistance over this period; and two groups that saw some forms of public assistance increase: 2) other poor households (50–99% of poverty), and 3) the near poor (100–150% of poverty). We report bivariate trends over the study period, as well as presenting multivariate difference-in-differences estimates. We find suggestive evidence that material hardship—in the form of difficulty meeting essential household expenses, and falling behind on utilities costs—has generally increased among the deeply poor but has remained roughly the same for the middle group (50–99% of poverty), and decreased among the near poor (100–150% of poverty). Multivariate difference-in-differences estimates suggest that these trends have resulted in intensified stratification of the material well-being of low-income households with children.

VL - 34 N1 - NCRN ER - TY - CONF T1 - Approaches to Multiple Record Linkage T2 - Proceedings of the 58th World Statistical Congress Y1 - 2011 A1 - Sadinle, M. A1 - Hall, R. A1 - Fienberg, S. E. JF - Proceedings of the 58th World Statistical Congress PB - International Statistical Institute CY - Dublin UR - http://2011.isiproceedings.org/papers/450092.pdf ER - TY - JOUR T1 - Comment on Gates: Toward a Reconceptualization of Confidentiality Protection in the Context of Linkages with Administrative Records JF - Journal of Privacy and Confidentiality Y1 - 2011 A1 - Stephen E. Fienberg VL - 3 ER - TY - RPRT T1 - Do Single Mothers in the United States use the Earned Income Tax Credit to Reduce Unsecured Debt? Y1 - 2011 A1 - Shaefer, H. Luke A1 - Song, Xiaoqing A1 - Williams Shanks, Trina R. AB - The Earned Income Tax Credit (EITC) is a refundable credit for low-income workers that is mainly targeted at families with children. This study uses the Survey of Income and Program Participation’s (SIPP) topical modules on Assets & Liabilities to examine the effects of EITC expansions during the early 1990s on the unsecured debt of the households of single mothers. We use two difference-in-differences comparisons over the study period 1988 to 1999, first comparing single mothers to single childless women, and then comparing single mothers with two or more children to single mothers with exactly one child. In both cases we find that the EITC expansions are associated with a relative decline in the unsecured debt of affected households of single mothers. This suggests that single mothers may have used part of their EITC to limit the growth of their unsecured debt during this period. 
PB - University of Michigan UR - http://hdl.handle.net/1813/34516 ER - TY - RPRT T1 - Estimating identification disclosure risk using mixed membership models Y1 - 2011 A1 - Manrique-Vallier, Daniel A1 - Reiter, Jerome AB - Statistical agencies and other organizations that disseminate data are obligated to protect data subjects' confidentiality. For example, ill-intentioned individuals might link data subjects to records in other databases by matching on common characteristics (keys). Successful links are particularly problematic for data subjects with combinations of keys that are unique in the population. Hence, as part of their assessments of disclosure risks, many data stewards estimate the probabilities that sample uniques on sets of discrete keys are also population uniques on those keys. This is typically done using log-linear modeling on the keys. However, log-linear models can yield biased estimates of cell probabilities for sparse contingency tables with many zero counts, which often occurs in databases with many keys. This bias can result in unreliable estimates of probabilities of uniqueness and, hence, misrepresentations of disclosure risks. We propose an alternative to log-linear models for datasets with sparse keys based on a Bayesian version of grade of membership (GoM) models. We present a Bayesian GoM model for multinomial variables and offer an MCMC algorithm for fitting the model. We evaluate the approach by treating data from a recent US Census Bureau public use microdata sample as a population, taking simple random samples from that population, and benchmarking estimated probabilities of uniqueness against population values. Compared to log-linear models, GoM models provide more accurate estimates of the total number of uniques in the samples. 
Additionally, they offer record-level predictions of uniqueness that dominate those based on log-linear models. PB - Duke University / National Institute of Statistical Sciences (NISS) UR - http://hdl.handle.net/1813/33184 ER - TY - RPRT T1 - NCRN Meeting Fall 2011 Y1 - 2011 A1 - Vilhuber, Lars AB - Held at the Census Bureau Conference Center. PB - NCRN Coordinating Office UR - http://hdl.handle.net/1813/46201 ER - TY - RPRT T1 - A Proposed Solution to the Archiving and Curation of Confidential Scientific Inputs Y1 - 2011 A1 - Abowd, John M. A1 - Vilhuber, Lars A1 - Block, William AB - We develop the core of a method for solving the data archive and curation problem that confronts the custodians of restricted-access research data and the scientific users of such data. Our solution recognizes the dual protections afforded by physical security and access limitation protocols. It is based on extensible tools and can be easily incorporated into existing instructional materials. PB - Cornell University UR - http://hdl.handle.net/1813/30923 ER - TY - JOUR T1 - Secure multiparty linear regression based on homomorphic encryption JF - Journal of Official Statistics Y1 - 2011 A1 - Robert Hall A1 - Stephen E. Fienberg A1 - Yuval Nardi VL - 27 ER - TY - JOUR T1 - Parallel Associations and the Structure of Autobiographical Knowledge JF - Journal of Applied Research in Memory and Cognition Y1 - 2016 A1 - Belli, Robert F. A1 - Al Baghal, Tarek KW - Autobiographical knowledge KW - Autobiographical memory KW - Autobiographical periods KW - Episodic memory KW - Retrospective reports AB - The self-memory system (SMS) model of autobiographical knowledge conceives that memories are structured thematically, organized both hierarchically and temporally. 
This model has been challenged on several fronts, including the absence of parallel linkages across pathways. Calendar survey interviewing shows the frequent and varied use of parallel associations in autobiographical recall. Parallel associations in these data are commonplace, and are driven more by respondents’ generative retrieval than by interviewers’ probing. Parallel associations represent a number of autobiographical knowledge themes that are interrelated across life domains. The content of parallel associations is nearly evenly split between general and transitional events, supporting the importance of transitions in autobiographical memory. Associations in respondents’ memories (both parallel and sequential) demonstrate complex interactions with interviewer verbal behaviors during generative retrieval. In addition to discussing the implications of these results for the SMS model, implications are also drawn for transition theory and the basic-systems model. VL - 5 SN - 2211-3681 UR - http://www.sciencedirect.com/science/article/pii/S2211368116300183 IS - 2 ER - TY - ABST T1 - Are Self-Description Scales Better than Agree/Disagree Scales in Mail and Telephone Surveys? Y1 - 0 A1 - Timbrook, Jerry A1 - Smyth, Jolene D. A1 - Olson, Kristen ER - TY - ABST T1 - The ATUS and SIPP-EHC: Recent developments Y1 - 0 A1 - Belli, R. F. ER - TY - ABST T1 - Audit trails, parallel navigation, and the SIPP Y1 - 0 A1 - Lee, Jinyoung ER - TY - JOUR T1 - Bayesian estimation of bipartite matchings for record linkage JF - Journal of the American Statistical Association Y1 - 0 A1 - Mauricio Sadinle AB - The bipartite record linkage task consists of merging two disparate datafiles containing information on two overlapping sets of entities. 
This is non-trivial in the absence of unique identifiers and it is important for a wide variety of applications given that it needs to be solved whenever we have to combine information from different sources. Most statistical techniques currently used for record linkage are derived from a seminal paper by Fellegi and Sunter (1969). These techniques usually assume independence in the matching statuses of record pairs to derive estimation procedures and optimal point estimators. We argue that this independence assumption is unreasonable and instead target a bipartite matching between the two datafiles as our parameter of interest. Bayesian implementations allow us to quantify uncertainty on the matching decisions and derive a variety of point estimators using different loss functions. We propose partial Bayes estimates that allow uncertain parts of the bipartite matching to be left unresolved. We evaluate our approach to record linkage using a variety of challenging scenarios and show that it outperforms the traditional methodology. We illustrate the advantages of our methods merging two datafiles on casualties from the civil war of El Salvador. ER - TY - JOUR T1 - Biomass prediction using density dependent diameter distribution models JF - Annals of Applied Statistics Y1 - 0 A1 - Schliep, E.M. A1 - A.E. Gelfand A1 - J.S. Clark A1 - B.J. Tomasek AB - Prediction of aboveground biomass, particularly at large spatial scales, is necessary for estimating global-scale carbon sequestration. Since biomass can be measured only by sacrificing trees, total biomass on plots is never observed. Rather, allometric equations are used to convert individual tree diameter to individual biomass, perhaps with noise. The values for all trees on a plot are then summed to obtain a derived total biomass for the plot. Then, with derived total biomasses for a collection of plots, regression models, using appropriate environmental covariates, are employed to attempt explanation and prediction. 
Not surprisingly, when out-of-sample validation is examined, such a model will predict total biomass well for holdout data because it is obtained using exactly the same derived approach. Apart from the somewhat circular nature of the regression approach, it also fails to employ the actual observed plot level response data. At each plot, we observe a random number of trees, each with an associated diameter, producing a sample of diameters. A model based on this random number of tree diameters provides understanding of how environmental regressors explain abundance of individuals, which in turn explains individual diameters. We incorporate density dependence because the distribution of tree diameters over a plot of fixed size depends upon the number of trees on the plot. After fitting this model, we can obtain predictive distributions for individual-level biomass and plot-level total biomass. We show that predictive distributions for plot-level biomass obtained from a density-dependent model for diameters will be much different from predictive distributions using the regression approach. Moreover, they can be more informative for capturing uncertainty than those obtained from modeling derived plot-level biomass directly. We develop a density-dependent diameter distribution model and illustrate with data from the national Forest Inventory and Analysis (FIA) database. We also describe how to scale predictions to larger spatial regions. Our predictions agree (in magnitude) with available wisdom on mean and variation in biomass at the hectare scale. VL - 11 UR - https://projecteuclid.org/euclid.aoas/1491616884 IS - 1 ER - TY - CHAP T1 - Calendar and time diary methods: The tools to assess well-being in the 21st century T2 - Handbook of research methods in health and social sciences Y1 - 0 A1 - Córdova Cazar, Ana Lucía A1 - Belli, Robert F. 
ED - Liamputtong, P JF - Handbook of research methods in health and social sciences PB - Springer ER - TY - ABST T1 - Does relation of retrieval pathways to data quality differ by self or proxy response status? Y1 - 0 A1 - Lee, Jinyoung A1 - Belli, Robert F. ER - TY - ABST T1 - "During the LAST YEAR, Did You...": The Effect of Emphasis in CATI Survey Questions on Data Quality Y1 - 0 A1 - Olson, Kristen A1 - Smyth, Jolene D. ER - TY - ABST T1 - The Effect of Question Characteristics, Respondents and Interviewers on Question Reading Time and Question Reading Behaviors in CATI Surveys Y1 - 0 A1 - Olson, Kristen A1 - Smyth, Jolene A1 - Kirchner, Antje ER - TY - ABST T1 - The Effect of Question Characteristics, Respondents and Interviewers on Question Reading Time and Question Reading Behaviors in CATI Surveys Y1 - 0 A1 - Olson, Kristen ER - TY - ABST T1 - The Effects of Respondent and Question Characteristics on Respondent Behaviors Y1 - 0 A1 - Ganshert, Amanda A1 - Olson, Kristen A1 - Smyth, Jolene ER - TY - JOUR T1 - An Empirical Comparison of Multiple Imputation Methods for Categorical Data JF - The American Statistician Y1 - 0 A1 - Olanrewaju Akande A1 - Fan Li A1 - Jerome Reiter AB - Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. 
We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet Process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and Bayesian mixture model approaches. They also suggest competing advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. Supplementary material for this article is available online. UR - http://dx.doi.org/10.1080/00031305.2016.1277158 ER - TY - JOUR T1 - An ensemble quadratic echo state network for nonlinear spatio-temporal forecasting JF - Stat Y1 - 0 A1 - McDermott, P.L. A1 - Wikle, C.K. AB - Spatio-temporal data and processes are prevalent across a wide variety of scientific disciplines. These processes are often characterized by nonlinear time dynamics that include interactions across multiple scales of spatial and temporal variability. The data sets associated with many of these processes are increasing in size due to advances in automated data measurement, management, and numerical simulator output. Nonlinear spatio-temporal models have only recently seen interest in statistics, but there are many classes of such models in the engineering and geophysical sciences. Traditionally, these models are more heuristic than those that have been presented in the statistics literature, but are often intuitive and quite efficient computationally. 
We show here that with fairly simple, but important, enhancements, the echo state network (ESN) machine learning approach can be used to generate long-lead forecasts of nonlinear spatio-temporal processes, with reasonable uncertainty quantification, and at only a fraction of the computational expense of traditional parametric nonlinear spatio-temporal models. UR - https://arxiv.org/abs/1708.05094 ER - TY - ABST T1 - Evaluating Data quality in Time Diary Surveys Using Paradata Y1 - 0 A1 - Córdova Cazar, Ana Lucía A1 - Belli, Robert F. ER - TY - ABST T1 - An evaluation study of the use of paradata to enhance data quality in the American Time Use Survey (ATUS) Y1 - 0 A1 - Córdova Cazar, Ana Lucía A1 - Belli, Robert F. ER - TY - ABST T1 - Event History Calendar Interviewing Dynamics and Data Quality in the Survey of Income and Program Participation Y1 - 0 A1 - Lee, Jinyoung ER - TY - ABST T1 - Going off Script: How Interviewer Behavior Affects Respondent Behaviors in Telephone Surveys Y1 - 0 A1 - Kirchner, Antje A1 - Olson, Kristen A1 - Smyth, Jolene ER - TY - ABST T1 - How do Low Versus High Response Scale Ranges Impact the Administration and Answering of Behavioral Frequency Questions in Telephone Surveys? Y1 - 0 A1 - Sarwar, Mazen A1 - Olson, Kristen A1 - Smyth, Jolene ER - TY - ABST T1 - How do Mismatches Affect Interviewer/Respondent Interactions in the Question/Answer Process? Y1 - 0 A1 - Smyth, Jolene D. A1 - Olson, Kristen ER - TY - ABST T1 - Interviewer Influence on Interviewer-Respondent Interaction During Battery Questions Y1 - 0 A1 - Cochran, Beth A1 - Olson, Kristen A1 - Smyth, Jolene ER - TY - ABST T1 - Memory Gaps in the American Time Use Survey. Are Respondents Forgetful or is There More to it? Y1 - 0 A1 - Kirchner, Antje A1 - Belli, Robert F. A1 - Deal, Caitlin E. 
A1 - Córdova-Cazar, Ana Lucia ER - TY - ABST T1 - Relation of questionnaire navigation patterns and data quality: Keystroke data analysis Y1 - 0 A1 - Lee, Jinyoung ER - TY - ABST T1 - Respondent retrieval strategies inform the structure of autobiographical knowledge Y1 - 0 A1 - Belli, R. F. ER - TY - ABST T1 - Response Scales: Effects on Data Quality for Interviewer Administered Surveys Y1 - 0 A1 - Sarwar, Mazen A1 - Olson, Kristen A1 - Smyth, Jolene ER - TY - ABST T1 - Using audit trails to evaluate an event history calendar survey instrument Y1 - 0 A1 - Lee, Jinyoung A1 - Seloske, Ben A1 - Belli, Robert F. ER - TY - ABST T1 - Using behavior coding to understand respondent retrieval strategies that inform the structure of autobiographical knowledge Y1 - 0 A1 - Belli, R. F. ER - TY - ABST T1 - Why do Mobile Interviews Take Longer? A Behavior Coding Perspective Y1 - 0 A1 - Timbrook, Jerry A1 - Smyth, Jolene A1 - Olson, Kristen ER - TY - ABST T1 - Working with the SIPP-EHC audit trails: Parallel and sequential retrieval Y1 - 0 A1 - Lee, Jinyoung A1 - Seloske, Ben A1 - Córdova Cazar, Ana Lucía A1 - Eck, Adam A1 - Belli, Robert F. ER -