Data Science Institute

Linking Multiple Datasets Without Unique Identifiers

Research Project

Analyzing datasets created by linking two or more separate data sources is increasingly important. Researchers and policy analysts seek to integrate administrative and clinical datasets while complying with privacy regulations that limit access to unique identifiers.

We are developing statistically valid Bayesian procedures to link units that appear in all datasets. To link records, the methods rely on associations between variables recorded in the different files, as well as on variables that appear in both sources. The methods propagate linkage errors into subsequent analyses by using Bayesian data augmentation schemes to sample from the posterior distribution of the unknown links and the parameters. These methods improve the quality of the links and reduce bias in estimates of the relationships of interest. Examples include linking prisoner release records with Ryan White clinical records, Meals on Wheels records with Medicare beneficiaries’ records, and trauma registry records with nursing home residents’ records.
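The data augmentation idea can be illustrated with a small toy sketch. It is not the project's actual procedure. It assumes a deliberately simple model in which file A holds a variable x, file B holds a variable y, every record has exactly one match, and y = beta * x + noise with a known noise standard deviation. The function name, the model, and all parameter values are hypothetical choices made for illustration. The sampler alternates between drawing beta given the current links and updating the links given beta, so the retained draws of beta reflect uncertainty about which records belong together.

```python
import math
import random


def sample_links_and_slope(x, y, n_iter=2000, sigma=1.0, prior_var=100.0, seed=0):
    """Toy sampler for unknown one-to-one links and a regression slope.

    Assumed model (illustrative only): y[link[i]] = beta * x[i] + Normal(0, sigma^2),
    beta ~ Normal(0, prior_var), and a uniform prior over all permutations.
    Returns a list of (beta, link) draws, one per iteration.
    """
    rng = random.Random(seed)
    n = len(x)
    link = list(range(n))
    rng.shuffle(link)  # arbitrary starting guess for the links
    sxx = sum(v * v for v in x)

    def sq_err(i, j, beta):
        # Squared residual if record i in file A is linked to record j in file B.
        return (y[j] - beta * x[i]) ** 2

    draws = []
    for _ in range(n_iter):
        # Step 1: draw beta from its conjugate normal posterior given the links.
        sxy = sum(x[i] * y[link[i]] for i in range(n))
        post_var = 1.0 / (sxx / sigma ** 2 + 1.0 / prior_var)
        post_mean = post_var * sxy / sigma ** 2
        beta = rng.gauss(post_mean, math.sqrt(post_var))

        # Step 2: update the links with Metropolis swap proposals given beta.
        for _ in range(n):
            i, j = rng.sample(range(n), 2)
            current = sq_err(i, link[i], beta) + sq_err(j, link[j], beta)
            proposed = sq_err(i, link[j], beta) + sq_err(j, link[i], beta)
            log_ratio = (current - proposed) / (2.0 * sigma ** 2)
            if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
                link[i], link[j] = link[j], link[i]

        draws.append((beta, list(link)))
    return draws


# Simulated example: file B is file A's outcome column in an unknown order.
data_rng = random.Random(1)
true_beta = 2.0
x = [data_rng.gauss(0.0, 3.0) for _ in range(8)]
y = [true_beta * v + data_rng.gauss(0.0, 1.0) for v in x]
data_rng.shuffle(y)

draws = sample_links_and_slope(x, y, n_iter=500)
kept = draws[100:]  # discard early iterations as burn-in
beta_mean = sum(b for b, _ in kept) / len(kept)
print("posterior mean of beta:", beta_mean)
```

Because each retained draw pairs a slope with a complete set of links, any summary computed across the draws averages over plausible linkages rather than conditioning on a single best guess. Realistic applications would also need to handle records without a match, files of different sizes, and unknown variance parameters, none of which this sketch attempts.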

Research Lead

Roee Gutman, Associate Professor of Biostatistics at Brown University

Funding Sources

Gary and Mary West Foundation, RI Foundation