PPRL

Overview

This page will track activities related to the Privacy Preserving Record Linkage (PPRL) evaluation grant.

Project members

PopData

  • Kim McGrail
  • Brent Hills
  • Kelly Sanderson
  • Mike Jarrett
  • Yinshan Zhao

Curtin University Australia

  • James Boyd (j.boyd@curtin.edu.au)
  • Adrian Brown (adrian.brown@curtin.edu.au)
  • Anna Ferrante (A.Ferrante@curtin.edu.au)

Timeline

  • April: Continue work to load LCF data into LinXmart (MJ)
  • May 1: Data stewards working group (SK,MJ)
  • May 10: Meet with Tav (no room scheduled yet) (MJ,BH,TA,SK)
  • Mid June: Expected arrival of vital stats data. MJ will do production linkage and LinXmart linkage
  • August 6-20: Sarah away
  • August 15-31: Mike away
  • September 11-14: IPDLN (MJ,BH,SK)

LinXmart

LinXmart is a record linkage application with PPRL functionality developed at Curtin University. Software usage is detailed in the LinXmart page.

  • 2018-04-03: Received new 6-month license and LinXmart software update from Curtin University
  • 2018-04-19: Software update completed on new "linxmartvm" server
  • 2018-04-23: Tutorial completed nominally

Methods

Data Sets for File-to-File Linkage (with the LCF)

Real Data Sets (real world scenarios)

  • varying complexity of data sets:
  1. missing data in fields
  2. missing fields for data linkage
  3. size (small to large?)
  • conducting data linkage to Population Directory (LCF)
    • LCF: frequency distribution of each variable in terms of missing rate (needed; see the profiling sketch after this list); number of records (size)
  • data set 1: minimal missing data in fields and all fields essential for data linkage are present (unique identifier, surname, gender, first name, date of birth and postal code) - best quality scenario
    • Vital Statistics (deaths): missing rate by variable (percent by variable to include here); number of records (size)
  • data set 2: missing data in fields, but all fields essential for data linkage are present (unique identifier, surname, gender, first name, date of birth and postal code)
  • data set 3: minimal missing data in fields, but some fields essential for data linkage are missing (e.g., unique identifier)
  • data set 4: missing data in fields and missing fields essential for data linkage - poorest quality scenario
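
As a rough sketch of how the per-variable missing rates and record counts above could be profiled, assuming the LCF and vital statistics extracts are available as CSV files; the file paths and column names below are placeholders, not the actual field names:

import pandas as pd

# Hypothetical file paths and column names; substitute the actual extract
# locations and linkage field names.
LINKAGE_FIELDS = ["phn", "surname", "first_name", "gender", "dob", "postal_code"]

def profile_extract(path):
    """Return record count and percent missing for each linkage field."""
    df = pd.read_csv(path, dtype=str).reindex(columns=LINKAGE_FIELDS)
    missing_pct = (df.isna().mean() * 100).round(2)
    return len(df), missing_pct

for name, path in [("LCF", "lcf_extract.csv"), ("Vital Statistics", "deaths_extract.csv")]:
    n_records, missing = profile_extract(path)
    print(f"{name}: {n_records} records")
    print(missing.to_string())

Fields that are entirely absent from an extract (as in data sets 3 and 4) simply show up as 100% missing.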

Simulation Study

  • implement corruptions to the data (e.g., optical character recognition errors, typographic errors, phonetic errors, and token-level errors). Would one method perform better than another under a particular type of error? (See the corruption sketch after this list.)
  • vary missing data within a variable and across variables (e.g., postal code). Would we see differences between methods?
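
A minimal sketch of how such corruptions could be injected for the simulation study; the corruption types and rates here are illustrative assumptions (an established corruptor tool such as GeCo could be used instead):

import random
import string

def typo(value: str) -> str:
    """Substitute one character with a random letter (simple typographic error)."""
    if not value:
        return value
    i = random.randrange(len(value))
    return value[:i] + random.choice(string.ascii_lowercase) + value[i + 1:]

def corrupt_field(value: str, p_typo: float = 0.05, p_missing: float = 0.05) -> str:
    """Apply at most one corruption to a field value with the given probabilities."""
    r = random.random()
    if r < p_missing:
        return ""              # simulate a missing value
    if r < p_missing + p_typo:
        return typo(value)     # simulate a keyboard/typographic error
    return value

# Example: corrupt a small batch of surnames at 5% typo and 5% missing rates.
surnames = ["smith", "nguyen", "macdonald"]
print([corrupt_field(s) for s in surnames])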

Data Linkage Strategies

  • hold record pair comparison and classification identical, so that the linkage field comparison (Bloom-encoded vs. unencoded) can be compared.
  • Three data linkage methods:
  1. Population Data BC method (Gold Standard),
  2. LinXmart (unencrypted, default settings),
  3. LinXmart (Bloom filter PPRL, default settings)
Population Data BC Method
  • Variables for data linkage: surname, first name, gender, dob, postal code, unique identifier (PHN)
  • Blocking/passes: 1. surname (Soundex) and first name initial, 2. date of birth (YYYYMMDD)
  • Initial weights: set by user experience
  • Final weights:
  • Resolution rules:
LinXmart (unencrypted)
  • Variables for data linkage: surname, first name, gender, dob, postal code, unique identifier (PHN)
  • Blocking/passes: 1. surname (Soundex) and first name initial, 2. date of birth (YYYYMMDD)
  • Initial weights: set by software (default)
  • Final weights: not applicable (use initial weights)
  • Resolution rules: not applicable (does not exist in software)
LinXmart (encrypted - Bloom filter applied variable by variable; see the encoding sketch below)
  • Bloom filter fields: surname, first name, dob, gender, unique identifier (PHN), postal code
  • Blocking/passes: 1. surname (Soundex) and first name initial, 2. date of birth (YYYYMMDD)
  • Initial weights: set by software (default)
  • Final weights: not applicable (use initial weights)
  • Resolution rules: not applicable (does not exist in software)
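
As a rough illustration of variable-by-variable Bloom filter encoding for PPRL, the sketch below encodes a field value into a set of bit positions via q-grams and repeated keyed hashing, and compares two encodings with the Dice coefficient. The filter size, number of hashes, and hashing scheme are illustrative assumptions, not necessarily LinXmart's actual settings:

import hashlib

def bloom_encode(value: str, size: int = 1000, num_hashes: int = 30, q: int = 2) -> set:
    """Encode a field value as the set of bit positions set in a Bloom filter.

    The value is padded and split into q-grams (bigrams by default); each q-gram
    is hashed num_hashes times and the resulting positions are marked.
    """
    padded = f"_{value.lower().strip()}_"
    qgrams = {padded[i:i + q] for i in range(len(padded) - q + 1)}
    positions = set()
    for gram in qgrams:
        for k in range(num_hashes):
            digest = hashlib.sha256(f"{k}:{gram}".encode()).hexdigest()
            positions.add(int(digest, 16) % size)
    return positions

def dice_similarity(a: set, b: set) -> float:
    """Dice coefficient of two encodings, usable as the field comparison score."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

# Similar names share many bigrams, so their encodings overlap heavily.
print(dice_similarity(bloom_encode("smith"), bloom_encode("smyth")))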

Specification of File Linkages

  • File-to-file Linkage: Data linkage of vital statistics (deaths) to Population 'Like' Directory (LCF)

Evaluation

  • Popdata method: treated as defining the true and false matches (known distributions) for the vital statistics to LCF data linkage.
  • Compare LinXmart to the Popdata method. This comparison provides information about the data linkage quality of LinXmart (default settings) -- a software assessment. Note that for this comparison the Popdata method differs from LinXmart in blocking, initial and final weights, and resolution (which exists in Popdata's method but not in LinXmart). Because the methods differ in several respects, any observed difference could be attributable to the weights, the blocking/passes, or the resolution step. Some of these factors are more likely than others to contribute to data linkage differences and can be assessed directly (e.g., a comparison of weights between the two methods); to isolate the impact of any one component (e.g., resolution), everything else would need to be held constant between the two methods. Even so, the comparison is still extremely valuable because demographic differences in matched vs. unmatched records can be compared between the two methods, and the time trade-off (default vs. user-specified 'everything' plus resolution linkage) can be determined. Matching all components between LinXmart and the Popdata method up to resolution would allow an evaluation of the resolution phase (e.g., 'clerical' review) and its benefits to data linkage quality and to demographic differences that may eventually impact research findings.
  • Compare LinXmart (unencrypted) to LinXmart (Bloom filter PPRL). This comparison will provide information about the data linkage quality of non-PPRL vs. PPRL linkage, given that both linkage strategies use a) the same blocking passes and b) the same field weights (LinXmart default settings for both); the methodologies can therefore be compared for linkage quality with all else held equal.
  • Note that the Popdata method includes resolution, which does not exist in LinXmart.
  • The metrics used to compare linkage strategies are listed below.

Measures

If true link status is known (Popdata method is the Gold Standard):

  • linkage percentage (percent of linked records)
  • true positives (TP)
  • false positives (FP) - assesses type I errors
  • true negatives (TN)
  • false negatives (FN)
  • recall = TP / (TP + FN)
  • precision = TP / (TP + FP)
  • F-measure = 2 * ((precision * recall) / (precision + recall)); we could use the Ferrante & Boyd (2012) link quality cut-offs (F >= 0.9 'very good', 0.9 > F >= 0.85 'good', 0.85 > F >= 0.80 'fair'). A sketch of computing these measures follows this list.
  • test of the conditional independence assumption: correlations between variables in the linked data and in the non-linked data (this would be interesting to look at)
  • parameter estimation (u and m)
  • cut-off thresholds
  • computational time
  • inter-rater reliability (does this ever come up?)
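
A small sketch of how the confusion-matrix counts and quality measures above could be computed once the gold-standard (Popdata) and candidate link sets are available; representing links as sets of (record_id_a, record_id_b) pairs is an assumption about how the linkage output would be exported:

def linkage_measures(candidate_pairs: set, true_pairs: set, total_pairs: int) -> dict:
    """Confusion-matrix counts and quality measures against a gold standard."""
    tp = len(candidate_pairs & true_pairs)
    fp = len(candidate_pairs - true_pairs)
    fn = len(true_pairs - candidate_pairs)
    tn = total_pairs - tp - fp - fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn,
            "recall": recall, "precision": precision, "F-measure": f_measure}

# Toy example: three true links, one of which the candidate strategy misses.
gold = {(1, "a"), (2, "b"), (3, "c")}
candidate = {(1, "a"), (2, "x"), (3, "c")}
print(linkage_measures(candidate, gold, total_pairs=100))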

Other options to explore:

  • graph analysis (?)
  • clustering (hierarchical) - model fit indexes for matched/unmatched
  • propensity scores

Theoretical

  • linkage quality
  • security and privacy (e.g., frequency attack)
  • computational complexity
  • scalability
  • IRT for resolution phase using probability of linkage patterns
  • Capture-Recapture: estimate the number of entities not contained in one of the datasets; these missing records are entities of the population not represented by that dataset. N ≈ (Na * Nb) / Nm (see the worked example after this list).
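
A worked toy example of the capture-recapture estimate above (the counts are purely illustrative):

def capture_recapture(n_a: int, n_b: int, n_matched: int) -> float:
    """Estimate total population size from two overlapping lists: N ~ Na * Nb / Nm."""
    return n_a * n_b / n_matched

# e.g., 50,000 deaths, 4,000,000 LCF records, 48,000 matched pairs
print(capture_recapture(50_000, 4_000_000, 48_000))  # about 4,166,667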

Software Improvements (issues with implementing LinXmart at Popdata)

  • Spine vs. non-spine linkage for encrypted linkages: data must be re-uploaded for every envelope (an encryption issue). Projects can be linked to each other, so the spine could be uploaded into one project and linked that way (though an implementation of spine-like linkages may be worth considering) - and how do we append new data to the spine each year, etc.?
  • Modeling variation (columns vs. rows): depending on how variable variation is modeled, one approach can lead to memory issues - LinXmart rolls variation out to rows, whereas Popdata rolls it out to columns.
  • Disk space consumption and issues due to SQL tables
  • large amount of disk space required for large data sets like the LCF
  • retain intermediate variables (e.g., field comparison/outcome strings) that will help assess the quality of linkage. This can also help if manual resolution is desired.
  • no resolution phase or way to conduct resolution in current implementation
  • GUI: no option to cancel a job once it is in progress, other than going into the backend and killing services. It would be beneficial to have this in the front-end GUI (more interactivity).
  • The interface does not report how many records are in a project or the number of linked and unlinked records (currently backend tables must be accessed to get this).

Literature

PPRL focused

Graph Linkage Approaches

General Linkage

Grant application articles we may wish to cite in future

Other Articles by Australian Partners

Books

EM

Evaluation

White Paper

This paper builds on the structure of the linking methodology used by Statistics New Zealand in the Integrated Data Infrastructure project.

Table of Contents

Abstract

Introduction

  • Purpose
  • Data Linkage: deterministic, probabilistic and rule-based classification.
  • Population Data BC
    • Population Directory
    • Description of Data sets
  • Challenges

Overview Record Linkage

  • History
  • Weight Calculation
  • Linkage Comparison
  • (Introduction to weight and linkage calculations)

Details of Record Linkage

  • Overview/history
  • Linking projects in Popdata
    • Combination of deterministic and probabilistic classification
    • Two thresholds result in links, possible links and non-links
    • Possible links undergo clerical review and rule-based classification is invoked (resolution)
    • one-to-many classification (? or is this data set dependent, with one-to-one and one-to-many as other options?)
  • Unique identifier linking
  • Probabilistic Linking
  • Selecting a cut-off
  • Quality of linkage
  • Blocking
  • Resolution

Encryption

  • Bloom filter - variable-by-variable, series of variables

Consolidation file (demographic issue, which one to choose, most recent or average?)

Examples

  • Example 1
  • Example 2

Conclusion

Future Direction

  • Linkage
  • Infrastructure
  • Privacy

Software