Software

The nodes develop software and make it available.

We have developed an algorithm that reduces the margins of error in ACS tract and block group data to a user-specified level by “intelligently” combining Census geographies into regions. A region is a collection of one or more Census geographies that meets a user-specified margin of error (or CV, coefficient of variation). We refer to this procedure as “regionalization.” Tutorials: we have developed a tutorial using an extremely simple toy example (Toy Example) and a more realistic tutorial using data for...
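The core idea can be illustrated with a minimal sketch. This is not the node's algorithm, just a greedy toy: it treats list order as spatial adjacency and merges geographies until each region's CV falls below the target. It uses the standard ACS approximation that the MOE of a sum is the square root of the sum of squared MOEs, and the 1.645 factor converting a 90% MOE to a standard error.

```python
import math

def combine(geos, target_cv):
    """Greedily merge geographies (given as (estimate, moe) pairs)
    until each region's CV is at or below target_cv.

    Toy sketch: list order stands in for spatial adjacency; the real
    regionalization algorithm uses true geographic adjacency.
    """
    regions = []
    est, moe_sq = 0.0, 0.0
    for e, m in geos:
        est += e
        moe_sq += m * m  # ACS approximation: MOEs combine in quadrature
        cv = (math.sqrt(moe_sq) / 1.645) / est if est else float("inf")
        if cv <= target_cv:
            regions.append((est, math.sqrt(moe_sq)))
            est, moe_sq = 0.0, 0.0
    if est > 0:  # leftover geographies form a final region
        regions.append((est, math.sqrt(moe_sq)))
    return regions
```

For example, three tracts each with estimate 100 and MOE 50 (CV ≈ 0.30) merge into a single region with CV ≈ 0.18 at a 0.2 target.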
CED²AR is designed to improve the documentation and discoverability of both public and restricted data from the federal statistical system. CED²AR is based upon leading metadata standards (the Data Documentation Initiative, DDI). The CED²AR web application was developed to expose and edit the new features added by the Cornell node. The main CED²AR instance hosts those unique codebooks and showcases the new features in our DDI extensions. The production server can...
Identifying Research Funding. Yulia Muzyrya, 2015. "Data and programs for identifying research funding based on CFDA codes." STATA dta file, STATA do file, CSV file, user guide.
Software to generate a prediction of Initial Claims for Unemployment Insurance using the University of Michigan Social Media Job Loss Index. The prediction is based on a factor analysis of social media messages mentioning job loss and related outcomes. See linked articles for more details.
Many statistical organizations collect data that are expected to satisfy linear constraints; for example, component variables should sum to total variables, and ratios of pairs of variables should be bounded by expert-specified constants. When reported data violate constraints, organizations identify and replace values potentially in error in a process known as edit-imputation. In a paper published in the Journal of the American Statistical Association, we developed an approach that fully...
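The "edit" half of edit-imputation can be sketched as a constraint checker. This is a toy illustration, not the paper's method; the field names and ratio bounds are invented for the example.

```python
def violates(record, tol=1e-9):
    """Return the list of linear-constraint violations for a record.

    Toy edit rules (illustrative only):
      - component variables must sum to the reported total
      - payroll per employee must lie within expert-specified bounds
    """
    errors = []
    if abs(record["total"] - sum(record["components"])) > tol:
        errors.append("components do not sum to total")
    lo, hi = 10.0, 500.0  # hypothetical expert-specified ratio bounds
    if record["employees"] > 0:
        ratio = record["payroll"] / record["employees"]
        if not lo <= ratio <= hi:
            errors.append("payroll/employee ratio out of bounds")
    return errors
```

Records flagged by such edits would then have the suspect values replaced in the imputation step.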
Many datasets include a mix of continuous and categorical variables with missing values. In a paper published in the Journal of the American Statistical Association, we developed a joint model for such mixed data that can be used for multiple imputation. The approach uses a nonparametric Bayesian mixture model as the imputation engine. The mixture model comprises one set of mixture components with multivariate normal kernels for the continuous variables, and a separate set of mixture...
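A heavily simplified sketch of the kernel structure: a finite mixture where each component pairs a normal kernel for the continuous variable with a multinomial kernel for the categorical one. The real model is nonparametric Bayesian with multivariate kernels; this toy only shows how one draw combines the two kernel types.

```python
import random

def draw(weights, norm_params, cat_probs):
    """Draw one (continuous, categorical) record from a finite mixture.

    Toy stand-in for the paper's nonparametric Bayesian model:
      weights     - component probabilities
      norm_params - per-component (mu, sigma) for the normal kernel
      cat_probs   - per-component category probabilities
    """
    k = random.choices(range(len(weights)), weights=weights)[0]
    mu, sigma = norm_params[k]
    x = random.gauss(mu, sigma)          # continuous part: normal kernel
    c = random.choices(range(len(cat_probs[k])),
                       weights=cat_probs[k])[0]  # categorical part
    return x, c
```

For multiple imputation, draws like this (conditioned on a record's observed values) would replace the missing cells.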
This package provides functions to fit the nested Dirichlet process mixture of products of multinomial distributions (NDPMPM) model for nested categorical household data in the presence of impossible combinations, with direct applications in generating synthetic nested household data. The model is Bayesian and estimates the joint distribution of multivariate categorical data when units are nested within groups. Such data arise frequently in social science settings, for...
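The "impossible combinations" idea can be shown with a toy rejection sampler: draw household members' roles, then discard any household that violates a structural-zero rule. The role codes and the one-head-per-household rule are invented for illustration; this is not the NDPMPM sampler itself.

```python
import random

def sample_household(size, role_probs, max_tries=1000):
    """Draw household member roles, rejecting impossible combinations.

    Toy structural-zero rule: a valid household has exactly one "head".
    role_probs maps role name -> marginal probability (illustrative).
    """
    roles = list(role_probs)
    for _ in range(max_tries):
        hh = [random.choices(roles,
                             weights=[role_probs[r] for r in roles])[0]
              for _ in range(size)]
        if hh.count("head") == 1:  # keep only structurally valid draws
            return hh
    raise RuntimeError("no valid household drawn within max_tries")
```

The actual model handles such constraints inside estimation rather than by naive rejection, which becomes inefficient as constraints multiply.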
These R routines create multiple imputations of missing-at-random categorical data, with or without structural zeros. Imputations are based on Dirichlet process mixtures of multinomial distributions, a nonparametric Bayesian modeling approach that allows for flexible joint modeling. Many datasets comprise exclusively categorical variables that suffer from missing data. When the number of variables is large, it can be challenging to specify models for use in multiple imputation (MI)...
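To show the mechanics, here is a toy imputation step for a fitted *finite* mixture of independent multinomials (a stand-in for the Dirichlet process mixture): weight each component by how well it explains the observed cells, then fill each missing cell from the implied marginal. Structure and parameter names are invented for the example.

```python
def impute(record, weights, probs):
    """Fill missing entries (None) in a categorical record.

    Toy finite-mixture version of the R routines' model:
      weights       - component probabilities
      probs[k][j][c] - P(variable j = category c | component k)
    """
    # posterior weight of each component given the observed cells
    post = []
    for k, w in enumerate(weights):
        for j, v in enumerate(record):
            if v is not None:
                w *= probs[k][j][v]
        post.append(w)
    total = sum(post)
    post = [p / total for p in post]
    # fill each missing cell with its most probable category
    out = list(record)
    for j, v in enumerate(record):
        if v is None:
            marg = [sum(post[k] * probs[k][j][c]
                        for k in range(len(weights)))
                    for c in range(len(probs[0][j]))]
            out[j] = marg.index(max(marg))
    return out
```

Proper multiple imputation samples from the posterior (repeatedly, across parameter draws) rather than taking the argmax; the argmax here just keeps the toy deterministic.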
mtcd is a standalone C++ implementation of the statistical model proposed in “Synthesizing Truncated Count Data for Confidentiality,” developed by the Duke-NISS NCRN node. Our team, part of the LDI Summer Lab 2015, created an R wrapper around mtcd, called Rmtcd. See https://github.com/ncrncornell/Rmtcd for all additional details.
STATA utilities to facilitate probabilistic record linkage methods. Installation from within STATA: type "net from http://www-personal.umich.edu/~nwasi/programs". For an explanation of these utilities, see Wasi, Nada, and Aaron Flaaen. "Record Linkage using STATA: Pre-processing, Linking and Reviewing Utilities." The Stata Journal 15, no. 3 (2015): 1-15.
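Two of the steps such utilities automate, name standardization and blocking, can be sketched outside Stata. This toy Python version is only illustrative of the general pre-processing pattern, not a port of the utilities.

```python
import re
from collections import defaultdict

def standardize(name):
    """Uppercase, strip punctuation, collapse whitespace --
    the kind of pre-processing applied before linking."""
    name = re.sub(r"[^A-Za-z0-9 ]", " ", name.upper())
    return " ".join(name.split())

def block(records, key_len=3):
    """Group records by the first key_len characters of the
    standardized name, so only within-block pairs are compared."""
    blocks = defaultdict(list)
    for r in records:
        blocks[standardize(r)[:key_len]].append(r)
    return dict(blocks)
```

Blocking turns an O(n²) all-pairs comparison into much smaller within-block comparisons, at the cost of missing links whose block keys disagree.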
To maintain confidentiality, national statistical agencies traditionally do not include small counts in publicly released tabular data products. They typically delete these small counts or combine them with counts in adjacent table cells to preserve the totals at higher levels of aggregation. In some cases these suppression procedures result in too much loss of information. To increase data utility and make more data publicly available, we created methods and software to generate synthetic...
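The two traditional treatments described above can be sketched on a one-dimensional table. This toy shows deletion (primary suppression) and neighbor-combination; the threshold of 5 is an arbitrary example, and real agency rules are more elaborate.

```python
def suppress(counts, threshold=5):
    """Primary suppression: replace counts below the threshold
    with None (the cell is simply not published)."""
    return [c if c >= threshold else None for c in counts]

def combine_small(counts, threshold=5):
    """Alternative: merge each small cell into its right-hand
    neighbour so the published total is preserved."""
    out, carry = [], 0
    for c in counts:
        c += carry
        carry = 0
        if c < threshold:
            carry = c          # roll the small count forward
        else:
            out.append(c)
    if carry:                  # fold any remainder into the last cell
        if out:
            out[-1] += carry
        else:
            out.append(carry)
    return out
```

Both treatments lose information (a missing cell, or a coarser category), which is the loss the synthetic-data methods are designed to avoid.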