Software

The nodes develop software and make it available.

We have developed an algorithm that reduces the margins of error in ACS tract and block-group data to a user-specified level by “intelligently” combining census geographies into regions. A region is a collection of one or more census geographies that meets a user-specified margin of error (or CV). We refer to this procedure as “regionalization.” Tutorials: We have developed a tutorial using an extremely simple toy example (Toy Example) and a more realistic tutorial using data for...
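Below is a minimal sketch of the merging idea, assuming a one-dimensional ordering stands in for spatial adjacency; the names, numbers, and merge rule are illustrative, not the published algorithm. Estimates sum across merged units and MOEs combine as the root sum of squares, the standard Census approximation for aggregated ACS estimates.

```python
import math

Z90 = 1.645  # ACS margins of error are published at the 90% confidence level

def cv(estimate, moe):
    """Coefficient of variation of an ACS estimate."""
    return (moe / Z90) / estimate if estimate > 0 else float("inf")

def regionalize(units, max_cv):
    """units: list of (estimate, moe) pairs, assumed spatially ordered."""
    regions = [tuple(u) for u in units]
    i = 0
    while i < len(regions):
        est, moe = regions[i]
        if cv(est, moe) <= max_cv or len(regions) == 1:
            i += 1
            continue
        # merge with the right neighbor if one exists, else the left one
        j = i + 1 if i + 1 < len(regions) else i - 1
        est2, moe2 = regions[j]
        merged = (est + est2, math.sqrt(moe**2 + moe2**2))
        lo = min(i, j)
        regions[lo:lo + 2] = [merged]
        i = lo  # re-check the merged region against the threshold
    return regions

print(regionalize([(120, 80), (300, 60), (90, 70)], max_cv=0.30))
```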
CED²AR is designed to improve the documentation and discoverability of both public and restricted data from the federal statistical system. It is based upon leading metadata standards (the Data Documentation Initiative, DDI). The CED²AR web application was developed to expose and edit the new features added by the Cornell node. The main CED²AR instance both hosts those unique codebooks and showcases the new features in our DDI extensions. The production server can...
Identifying Research Funding. Yulia Muzyrya, 2015. "Data and programs for identifying research funding based on CFDA codes." STATA dta file, STATA do file, CSV file, user guide.
Software to generate a prediction of Initial Claims for Unemployment Insurance using the University of Michigan Social Media Job Loss Index. The prediction is based on a factor analysis of social media messages mentioning job loss and related outcomes. See linked articles for more details.
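A hedged sketch of the general approach (not the published model): extract a common factor from counts of job-loss-related social media messages, then regress initial claims on that factor. All data below are simulated.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
weeks = 104
latent = np.cumsum(rng.normal(size=weeks))             # hypothetical job-loss signal
terms = latent[:, None] * rng.uniform(0.5, 2.0, 5) \
        + rng.normal(scale=0.5, size=(weeks, 5))       # counts of 5 message types
claims = 300_000 + 20_000 * latent + rng.normal(scale=5_000, size=weeks)

factor = FactorAnalysis(n_components=1).fit_transform(terms)  # the "index"
model = LinearRegression().fit(factor[:-1], claims[:-1])      # fit on history
print("predicted claims, latest week:", model.predict(factor[-1:])[0])
```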
Many statistical organizations collect data that are expected to satisfy linear constraints; for example, component variables should sum to total variables, and ratios of pairs of variables should be bounded by expert-specified constants. When reported data violate constraints, organizations identify and replace values potentially in error in a process known as edit-imputation. In a paper published in the Journal of the American Statistical Association, we developed an approach that fully...
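A minimal sketch of the edit step in edit-imputation: flag records whose reported values violate linear constraints. The variable names and bounds are hypothetical, and the published approach handles the replacement of flagged values within a fully Bayesian model rather than by simple rules.

```python
records = [
    {"wages": 50, "benefits": 10, "total_comp": 60},   # satisfies both edits
    {"wages": 50, "benefits": 10, "total_comp": 70},   # violates the sum edit
    {"wages": 5,  "benefits": 40, "total_comp": 45},   # violates the ratio bound
]

def violates(rec, ratio_bound=2.0):
    # components must sum to the total, and benefits may not
    # exceed a fixed multiple of wages (illustrative edits)
    sum_edit = rec["wages"] + rec["benefits"] != rec["total_comp"]
    ratio_edit = rec["benefits"] > ratio_bound * rec["wages"]
    return sum_edit or ratio_edit

flagged = [i for i, rec in enumerate(records) if violates(rec)]
print("records needing imputation:", flagged)  # -> [1, 2]
```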
Many datasets include a mix of continuous and categorical variables with missing values. In a paper published in the Journal of the American Statistical Association, we developed a joint model for such mixed data that can be used for multiple imputation. The approach uses a nonparametric Bayesian mixture model as the imputation engine. The mixture model comprises one set of mixture components with multivariate normal kernels for the continuous variables, and a separate set of mixture...
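The generative structure can be sketched as follows. This is a simplified variant in which a single latent class indexes both kernel sets; the published model uses two linked sets of components, and all parameter values here are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

weights = [0.6, 0.4]                                   # mixture weights
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]   # MVN kernel means
cov = np.eye(2)                                        # shared kernel covariance
cat_probs = [np.array([0.7, 0.2, 0.1]),                # multinomial kernel, class 0
             np.array([0.1, 0.3, 0.6])]                # multinomial kernel, class 1

def draw():
    z = rng.choice(2, p=weights)                       # latent class
    x_cont = rng.multivariate_normal(means[z], cov)    # continuous block
    x_cat = rng.choice(3, p=cat_probs[z])              # categorical block
    return x_cont, x_cat

print([draw() for _ in range(3)])
```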
This package provides functions to fit the nested Dirichlet process mixture of products of multinomial distributions (NDPMPM) model for nested categorical household data in the presence of impossible combinations, with direct application to generating synthetic nested household data. It fits a Bayesian model for estimating the joint distribution of multivariate categorical data when units are nested within groups. Such data arise frequently in social science settings, for...
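A sketch of how synthetic nested households can respect structural constraints: sample each member's categories from class-specific multinomials, then reject draws that produce an impossible household (here, not exactly one head). The classes, probabilities, and constraint are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
RELATIONSHIP = ["head", "spouse", "child"]
rel_probs = {0: [0.5, 0.3, 0.2], 1: [0.4, 0.1, 0.5]}   # within-class multinomials

def draw_household(size=3):
    while True:
        cls = rng.choice(2, p=[0.6, 0.4])              # household-level class
        members = [RELATIONSHIP[rng.choice(3, p=rel_probs[cls])]
                   for _ in range(size)]
        if members.count("head") == 1:                 # reject impossible draws
            return members

print([draw_household() for _ in range(3)])
```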
These R routines create multiple imputations of missing-at-random categorical data, with or without structural zeros. Imputations are based on Dirichlet process mixtures of multinomial distributions, a nonparametric Bayesian modeling approach that allows for flexible joint modeling. Many datasets comprise exclusively categorical variables that suffer from missing data. When the number of variables is large, it can be challenging to specify models for use in multiple imputation (MI)...
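A sketch of one imputation step under a mixture of product-multinomials: for a record with a missing entry, compute the class posterior from the observed entries, draw a class, then draw the missing entry from that class's multinomial. The parameters below are fixed for illustration; the R routines learn them under a Dirichlet process prior.

```python
import numpy as np

rng = np.random.default_rng(3)
weights = np.array([0.5, 0.5])
# probs[class][variable] -> category probabilities for two binary variables
probs = np.array([[[0.9, 0.1], [0.8, 0.2]],
                  [[0.2, 0.8], [0.3, 0.7]]])

def impute(record):  # record: list with None marking the missing entry
    obs = [(j, v) for j, v in enumerate(record) if v is not None]
    post = weights * np.prod([probs[:, j, v] for j, v in obs], axis=0)
    z = rng.choice(len(weights), p=post / post.sum())  # draw a latent class
    return [v if v is not None else int(rng.choice(2, p=probs[z, j]))
            for j, v in enumerate(record)]

print(impute([0, None]))  # e.g. -> [0, 0] most of the time
```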
mtcd is a standalone C++ implementation of the statistical model proposed in “Synthesizing Truncated Count Data for Confidentiality”, developed by the Duke-NISS NCRN node (see this page). Our team, part of the LDI Summer Lab 2015, created an R wrapper around mtcd, called Rmtcd. See https://github.com/ncrncornell/Rmtcd for all additional details. Citation: Charley Chen, Hautahi Kingi, Alice Chou, & Lars Vilhuber. (2015, October 20). ncrncornell/Rmtcd: Synthesizing Truncated Count Data for...
STATA utilities to facilitate probabilistic record linkage methods. Installation from within STATA: type "net from http://www-personal.umich.edu/~nwasi/programs". See Wasi, Nada, and Aaron Flaaen, "Record Linkage using STATA: Pre-processing, Linking and Reviewing Utilities," The Stata Journal 15, no. 3 (2015): 1-15, for an explanation of these utilities.
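For readers unfamiliar with probabilistic record linkage, here is a generic Fellegi-Sunter-style scoring sketch in Python (not the STATA utilities themselves): each field comparison contributes a log-likelihood-ratio weight, and pairs above a threshold are declared links. The m- and u-probabilities are invented for the example.

```python
import math

M = {"name": 0.95, "zip": 0.90}   # P(fields agree | records truly match)
U = {"name": 0.01, "zip": 0.10}   # P(fields agree | records do not match)

def match_weight(rec_a, rec_b):
    w = 0.0
    for field in M:
        if rec_a[field] == rec_b[field]:
            w += math.log2(M[field] / U[field])       # agreement weight
        else:
            w += math.log2((1 - M[field]) / (1 - U[field]))  # disagreement
    return w

a = {"name": "smith, j", "zip": "48104"}
b = {"name": "smith, j", "zip": "48104"}
print(match_weight(a, b))  # high weight -> likely the same entity
```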
To maintain confidentiality, national statistical agencies traditionally do not include small counts in publicly released tabular data products. They typically delete these small counts or combine them with counts in adjacent table cells to preserve the totals at higher levels of aggregation. In some cases these suppression procedures result in too much loss of information. To increase data utility and make more data publicly available, we created methods and software to generate synthetic...
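A sketch of the basic idea, not the published method: instead of suppressing small cells, replace them with synthetic draws that preserve the higher-level total. Here the suppressed cells' combined total is known (row total minus the large cells), and synthetic values are drawn from a multinomial over that remainder. The threshold and probabilities are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
counts = np.array([120, 45, 2, 1, 3])    # true cell counts
threshold = 5                            # cells below this would be suppressed

small = counts < threshold
remainder = counts[small].sum()          # preserved by construction
synthetic = counts.copy()
synthetic[small] = rng.multinomial(remainder, np.ones(small.sum()) / small.sum())

print(synthetic, synthetic.sum() == counts.sum())  # total unchanged -> True
```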
This package provides three methods for visualising uncertainty in spatial data. These approaches are based on the methods developed in Lucchesi and Wikle (2017), and we have tried to generalise them so they can be applied to most types of spatial data. We welcome any comments or suggestions about the package. Reference: Lucchesi, L.R., and Wikle, C.K. (2017). Visualizing uncertainty in areal data with bivariate choropleth maps, map pixelation and glyph rotation. Stat, doi:10.1002/sta4.150.
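As an illustration of the bivariate choropleth idea from the reference above: classify each area's estimate and its uncertainty into terciles and look up a colour in a 3x3 palette, so a single map shows both value and reliability. The palette and data below are invented, and the package handles the actual map drawing.

```python
import numpy as np

estimates = np.array([10.0, 25.0, 40.0, 15.0, 35.0, 50.0])
errors    = np.array([ 1.0,  5.0,  2.0,  6.0,  3.0,  7.0])

palette = [["#e8e8e8", "#b5c0da", "#6c83b5"],   # low-error row
           ["#b8d6be", "#90b2b3", "#567994"],
           ["#73ae80", "#5a9178", "#2a5a5b"]]   # high-error row

def tercile(x, values):
    lo, hi = np.quantile(values, [1/3, 2/3])
    return 0 if x <= lo else (1 if x <= hi else 2)

colours = [palette[tercile(e, errors)][tercile(v, estimates)]
           for v, e in zip(estimates, errors)]
print(colours)
```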