IDO


Contacts

Hayden Lansdell

Hayden Lansdell
Executive Lead, Integrated Data Office
Hayden.Lansdell@gov.bc.ca

Kathleen Assaf

Kathleen Assaf
Director, Data Policy, Integrated Data Office
Ministry of Jobs, Trade and Technology
Phone: 250 208-1979
Kathleen.Assaf@gov.bc.ca

Brittany Decker

Brittany Decker
Director of Client Engagement Services
Phone 604-855-0940
Mobile 250-508-7287

Donna Coward IGP

Donna Coward IGP
Business Analyst
Integrated Data Office
Ministry of Jobs, Trade & Technology
Phone: 778-698-8887

Dan MacKenzie

Dan MacKenzie
Director, Data Insights
Integrated Data Office
Ministry of Jobs, Trade and Technology
*250-216-2601
Dan.MacKenzie@gov.bc.ca
owns: DI Program_Data Export Process 2018-02-1.docx

No longer with the DIP program as of approx. 2019-10

Greg Lawrance

Greg Lawrance
Integrated Data Office
BC Ministry of Jobs, Trade and Technology
Greg.Lawrance@gov.bc.ca

Paul Ripley

Paul.Ripley@avocette.com

Suraiya Uvic

suraiya.uvic@gmail.com

Data Provider Contacts

Ministry of Education

Michael Sand

Michael.Sand@gov.bc.ca

Social Development and Poverty Reduction

Wayne.Guilbeault@gov.bc.ca

Current Deadlines by Component

Priority List

  • removal of SIN from files (official confirmation received from Brittany)
  • provide Stats Canada data to the Mackenzie project (pending receipt of the data from Noushin; issues to date have been rounding issues)
  • NACRS updates - pending unlinked records for NACRS
  • SDPR version 4 pre-application update for Bruce project (pending)
  • Stats Canada income band (determine if this can be provided to IDD from Popdata's holdings. Note data is only until 2006)
  • Data Linkage Reports. Draft 1 under revision.

Trello Board

SRE

  • discuss naming conventions of files with Donna

Future

  • receive new PSSG data
  • potential to receive ICBC data
  • creation of a data dictionary similar to what is provisioned to SRE users for LMID, SDPR, MED and PSSG. This work is to be done by Donna, with Popdata adding in starts, stops and lengths for SRE users that have this data as part of their project.

Data Ingestion

PSSG

  • data updates/fixes to be received still

Stats Canada Income Data

  • pending

MCFD

  • new files to be received - July 22nd week

SDPR

  • disability files to be received, just for the Bruce project - July 22nd week
  • address files pending

Data Linkage

  • SDPR pre-application update underway

Technological Development

  • OCWA is being PEN tested. V2.0 with input being developed (Jim)
    Probable contacts are:
    Brad Payne <Brad.Payne@gov.bc.ca> and
    Paul Ripley <Paul.Ripley@gov.bc.ca> Business Analyst, Digital Platforms and Data Division, Office of the Chief Information Officer, BC Ministry of Citizen Services
    Aidan Cope <aidan.cope@gmail.com> Developer.
  • DAR online developed and being tested by DPDD

Data Projects/Catalysts

  • Mackenzie: stats canada data

Follow Up

  • nifi ingestion tool
  • SDPR: Use and/or retention of SIN for linkage. Brittany emailed on November 27 and guided us not to use it for data linkage, but to retain it. Maria to follow up with Tav to determine how best to respond.

Project Data Summary

Approved Catalyst Projects

Mackenzie 18-g01 (CYMH)
  • DAR: Approved
  • Data requested: MED, SDPR, PSSG, Consolidation file, DAD, PNET, MSP, NACRS, Pharmacare, Vital Statistics (linked and unlinked).
  • Data received: MED, SDPR (2017/18), PSSG. Consolidation file, MSP, DAD (September 30, 2018). PNET (October 2). NACRs, rpblites, and Pharmacare (October 3). MHS (Oct 17). SDPR legacy (October 23). Vital Statistics (Dec). SDPR version 2 required (note keep version 1 and 2 in the project folder December 19, 2018). MED updated file with unlinked records with unique study ids provisioned January 24, 2019. MSPID and DEPNO in registry file April 8, 2019 week. MCFD and Stats Canada (June 17, 2019).
  • Pending: Stats Canada file. PSSG. Note removal of PSSG Feb 14, 2019 and re-provisioned March 11th week. Provide updated PSSG when received and processed by Popdata. StatsCanada data.
  • Turn around time: 4 - 6 weeks (total)

Wilmer 18-g02
  • DAR: Approved September 4, 2018
  • Data requested: PSSG and MED (linked).
  • Data received: Ver 1.0: Complete September 21, 2018. All files need to be re-provisioned for this project.
  • Pending: PSSG. Currently issue identified with PSSG data (updated file to be received January 30, 2019). Note removal of PSSG Feb 14, 2019 and re-provisioned March 11th week. Provide updated PSSG when received and processed by Popdata.
  • Turn around time: 7-8 business days

Warburton 18-g03 (Health Trajectories Project)
  • DAR: Approved. DAR submitted Jan 7, 2019. Amendment for Vital Statistics to be added July 18, 2019.
  • Data requested: rpblite, consolidation file (registry and demographics), MSP, NACRs, Pharmanet/Pharmacare, DAD, Perinatal, Education. Note all data in this request has a research ready collection except Perinatal. Perinatal is also a one time request/provision for this project. Perinatal is not part of the IDD collections. Vital Statistics.
  • Data received: rpblites, consolidation file, MSP, NACRs, Pharmanet/care, DAD provisioned to SRE January 17, 2019. MED provisioned on January 24, 2019. Perinatal provisioned February 14, 2019. MSPID and DEPNO in registry file April 1, 2019 week. Vital Statistics provisioned on July 19, 2019.
  • Pending: Complete.
  • Turn around time: 7-8 business days for all files, but MED and Perinatal.

Wilmer 18-g04 (Special Needs)
  • DAR: Approved.
  • Data requested: MED, DAD, MSP, Pharmanet. They would like the files by January 23, 2019.
  • Data received: DAD, MSP, Pharmanet provisioned to SRE January 17, 2019. MED provisioned on January 24, 2019.
  • Pending: Complete.
  • Turn around time: 7-8 business days for all files but MED.

Bruce 19-g01 Guaranteed Basic Income (Bill is on this project with 20 researchers)
  • DAR: Submitted pre-DAR to Tim February 15, 2019. Official DAR received mid April.
  • Data requested: LMID, MCFD, MHS, DAD, MSP, rpblites, rapid, registry, NACRs, Pharmanet, Pharmacare, Vital Statistics (births, deaths, still birth), MED
  • Data received: LMID, MHS, DAD, MSP, rpblites, registry, NACRs, Pharmanet, Pharmacare, Vital Statistics (births, deaths), Education (April 24, 2019). MCFD provisioned May 22. SDPR V4 May 29.
  • Pending: Update SDPR version 4 pending. Note: Still Births and Marriages (provisioned, but not released).
  • Turn around time: 2 weeks.

Cormier 19-g02 MCFD project
  • DAR: Submitted March 18, 2019 (*working on amendment). Approved April 1, 2019 week.
  • Data requested: MCFD, DAD, MSP, rpblites, rapid, registry, NACRs, Pharmanet, Pharmacare, Vital Statistics (births, deaths), Education (Note Perinatal is listed, however, an amendment will be coming to have this removed). Note PSSG is removed from the request until a new file is received.
  • Data received: DAD, MSP, rpblites, demographics, NACRs, Pharmanet, Pharmacare, Vital Statistics (births, deaths), Education transferred April 23, 2019. MCFD transferred May 22.
  • Pending: Complete.
  • Turn around time: 4 weeks (issues in transferring size of files, but turn around time was technically 2.5 weeks).

Larson 19-g03
  • DAR: Approved Sept 29 2019
  • Data requested: Advanced Education (LMID), MCFD, MED, Vital Stats, MHS, DAD, MSP, Consolidation File/RPBLite, NACRS, Pcare/PNET, SDPR, Stats Canada
  • Data received: LMID, MED, Vital Stats, DAD, MSP, and NACRS released 2019 11 06.
  • Pending: Still pending: MCFD, MHS, Consolidation File/RPBlite, Pcare/PNET, SDPR, Stats Canada. Stats Canada collection not done yet.
  • Turn around time: First data release within 30 calendar business days. First full DAR for Melissa to review and release.

Wang 19-g04
  • DAR: Approved October 9, 2019; received DAR Nov 1, 2019
  • Data requested: LMID, MCFD, MED, Vital Stats, MHS, DAD, MSP, Consolidation File/RPBlite, NACRS, Pcare/PNET, SDPR, Stats Canada
  • Data received: None
  • Pending: All

Thomas 19-XX
  • DAR: Beginning Stages - requires a list of Popdata members that will be accessing/viewing direct identifiers for linkage. Will require extra clearance and form filling.
  • Data requested: PSSG (linked) + more
  • Data received: Pending
  • Pending: Pending

TBD 1
  • DAR: Beginning Stages
  • Data requested: ICBC and PSSG (linked)
  • Data received: Pending
  • Pending: Pending

Data Sets

Popdata data holdings used for DIP

MOH

the following are under IDD collections:

  • DAD
  • MSP
  • Nacrs
  • Pharmanet
  • Pharmacare
  • rpblites
  • consolidation file (demographics and registry)
  • data sets have project specific IDs based on encrypting ids from the initial data linkage and project provision.
  • all unlinked records are assigned the same study id.
Perinatal *this data is not part of the IDD collections. It is a special request on a project by project basis.
  • data sets have project specific IDs based on encrypting ids from the initial data linkage and project provision.
  • each unlinked record is assigned unique study ids.
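
The two ID rules above (project-specific IDs, and the different treatment of unlinked records) can be illustrated with a minimal sketch. The source only says the IDs are "based on encrypting ids"; the keyed hash (HMAC-SHA256) used here is a stand-in, not Popdata's actual method. The function names, the per-project key, and the exact shape of the 'u' marker are assumptions made for illustration.

```python
import hashlib
import hmac
import itertools


def project_specific_id(study_id: str, project_key: bytes) -> str:
    """Derive a stable pseudonym for study_id that differs per project key.

    Assumption: a keyed hash stands in for the unspecified encryption step.
    """
    digest = hmac.new(project_key, study_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def assign_ids(records, project_key, unique_unlinked=True):
    """Attach a 'project_id' to each record.

    records: iterable of dicts with a 'study_id' key; None marks an unlinked record.
    unique_unlinked=True  -> each unlinked record gets its own 'u'-prefixed id
                             (the rule described for Perinatal).
    unique_unlinked=False -> all unlinked records share one id
                             (the rule described for the MOH IDD collections).
    """
    counter = itertools.count(1)
    out = []
    for rec in records:
        sid = rec["study_id"]
        if sid is not None:
            pid = project_specific_id(sid, project_key)
        elif unique_unlinked:
            pid = f"u{next(counter):08d}"  # hypothetical format for the 'u' marker
        else:
            pid = "u"
        out.append({**rec, "project_id": pid})
    return out
```

Using a different key per project means the same individual receives unrelated IDs in different projects, which is the property the bullets describe.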


DIP specific data sets

PSSG
  • Data span: 2008 to 2017. Only the 2017 year of data is available.
  • Format: ASCII.
  • Summary of data: this data set is BC corrections data from the CORNET data system (Oracle database). This includes information on the individual, the offense and their status.
  • Inclusion criterion: this data set contains only adult corrections data (not youth).
  • Data linkage variables: first name, middle name, last name, dob, gender (no postal code), direct identifier: PHN
  • Data updates: the data is to be received and updated annually.
  • Missing postal codes in 2018 data file due to poor quality. Include alias file for next iteration in March 2019. The alias file is better quality and apparently will help with data linkage.
  • Missing 'eventid' in condition file - if researchers want this, need to request.
  • this data is information on adults. No youth data should be included (as youth transition to adulthood they could appear, and we should work with data providers to ensure this is not present)
  • data update pending. This data will be updated with more years of data. Unlinked records are assigned unique study ids, however, this data has not been released given we are awaiting new files.
MED
  • Data span: 1991 to 2017.
  • Format: .csv (file has been converted to flat format in order to assign coordinating ids etc., to be project specific using Popdata's tools with ease)
  • Summary of data: this data set is education results of children in the BC public, independent and certified offshore schools. This data set includes grades, course registry, FSA scores, attendance, and survey data (student satisfaction survey) filled out by students (anonymously)
  • Inclusion criterion: K-12
  • Exclusion criterion: no private school data
  • Data linkage variables: first name, middle name, last name, dob, gender, postal code, direct identifier: PEN
  • Data updates: this data set is normally updated annually.
  • Issue with the ACTIVE_STATUS and ACTIVE_STATUS_DATE columns in DM_DIMSTUD file. Many records were set to the value ‘UNSPECIFIED’. Almost all of the records in this table will have an ‘ACTIVE’ status (a small number will have a ‘DECEASED’ status where the Ministry has been directly notified by the school and manually updated). MED says these columns are not meaningful, and therefore there is no need to re-send data for the 2018 extract.
  • data sets have project specific IDs based on encrypting ids from the initial data linkage and project provision.
  • each unlinked record is assigned unique study ids.
SDPR version 1
  • important note: SIN is with these files.
  • Data span: 2005-2012 (Legacy data), 2012-2017/18. (Overall: 2005 to 2018)
  • This data comes from the ICM database system (Information case management system). This data is extracted from a live database, as a result, this data can change. This is why researchers at SDPR use data from the GAINs database often referred to as GRD (This is version 2 and 3 we received).
  • Format: .csv (file has been converted to flat format in order to assign coordinating ids etc., to be project specific using Popdata's tools with ease).
  • Summary of data: This data set includes data around income and disability assistance. The data is focused around payment (dates and amounts).
  • Inclusion criterion: includes all individuals that applied for assistance.
  • Data updates: annually.
  • Data linkage variables: first name, middle name, last name, dob, gender, postal code, direct identifier: PHN.
  • No direct identifiers in the data set for MCFD to link to this file (this is important to IDO). May be able to use ICM_CASE_NUMBER, X_CONTACT_NUM, which exist in the MCFD data and link via MCFD. Will have to explore this option for this year's data. Direct identifiers have been requested for future transfers.
  • One minor exception is that the primary key (payment_sk) in the header file is not unique. As this issue only affects two records (one value of the primary key), it can be ignored for now. The one duplicate primary key is due to a typo in another field.
  • note cheque_number for the case_header file was provided for 2017/18. The Executive Director of SDPR indicated that cheque_number has no research value; as a result it will no longer be provided in future years or included in the legacy data (email from Donna 2018/06/25). Also note it has been determined that there is no risk associated with it having been provided for 2017/18.
  • data sets have project specific IDs based on encrypting ids from the initial data linkage and project provision.
  • each unlinked record is assigned unique study ids.
  • note this data has x_contact_num
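
The primary key exception noted above (a non-unique payment_sk in the header file) is the kind of issue a simple duplicate check surfaces. This is a minimal sketch, not the verification Popdata actually ran; the function name and the sample rows are made up for illustration, and only the column name payment_sk comes from the notes.

```python
from collections import Counter


def duplicate_keys(rows, key="payment_sk"):
    """Return {key value: count} for key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Made-up rows: one key value (102) appears on two records.
header_rows = [
    {"payment_sk": 101, "amount": "250.00"},
    {"payment_sk": 102, "amount": "75.50"},
    {"payment_sk": 102, "amount": "75.50"},
    {"payment_sk": 103, "amount": "410.00"},
]
print(duplicate_keys(header_rows))  # → {102: 2}
```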
SDPR version 2 and 3 (referred to as 'GRD')
  • important note: SIN is with these files.
  • this data comes from the GAINs database. This data is used by the researchers at SDPR and is a 'cleaner' version of the data.

Two batch years: 1995-2017 (Version 2)

  • Data span: 1995-01-01 to 2017-12-31.
  • Format: .csv (file has been converted to flat format in order to assign coordinating ids etc., to be project specific using Popdata's tools with ease).
  • Summary of data: This data set includes data around income and disability assistance. The data is focused around payment (dates and amounts). This data is used by researchers at SDPR. It is a cleaner version than the data from ICM.
  • Inclusion criterion: includes all individuals that applied for assistance.
  • Data updates: annually.
  • Data linkage variables: first name, middle name, last name, dob, gender, postal code, direct identifier: PHN.
  • This data is not to replace version 1.
  • This data comes from the gain research database. The Research Branch takes a monthly snapshot from the source systems and completes a 2 month reconciliation process. They consider this data processed and the official caseload. This official data has high rate of accuracy and is relied on for reporting – including public release. It is static and can be reproduced.
  • data sets have project specific IDs based on encrypting ids from the initial data linkage and project provision.
  • each unlinked record is assigned unique study ids.

1989-1994 data set (Version 3)

  • Data span: 1989-1994
  • Format: .csv (file has been converted to flat format in order to assign coordinating ids etc., to be project specific using Popdata's tools with ease).
  • Summary of data: This data set includes data around income and disability assistance. The data is focused around payment (dates and amounts). This data is used by researchers at SDPR and is a cleaner version than the data from ICM.
  • Inclusion criterion: includes all individuals that applied for assistance.
  • Data updates: historical
  • Data linkage variables: first name, middle name, last name, dob, gender, postal code, direct identifier: PHN. The PHN was placed back onto the file, but was initially missing.
  • This data is not to be merged with GRD 1995-2017.

Note on v1, v2, v3:

  • note DSU initially referred to version 1 as the 2017/18 file, version 2 as (2005-2017), version 3 as the GRD (1995-2017) and version 4 as GRD (1989-1994). This has now changed. Version 1 refers to the ICM data (2017/18 and 2005-2017). Version 2 and 3 refer to differing years of the GRD data (1995-2017; 1989-1994).
  • version 1 has contact number, version 2&3 have personid. However, v1 is not a subset of v2, as popdata ids appear in v1 files that do not appear in v2 and vice versa. The lack of overlap is only around 3% in popdata ids.
  • Version 2 and 3: these data sets were combined given they come from the same system and just vary in the years they cover. As a result, the data linkage needs to be reconciled.
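
The overlap statement above (popdata ids present in v1 but not v2, and vice versa) can be checked with set arithmetic. This is a sketch only: the notes do not say how the "around 3%" figure was calculated, so the denominator used here (the union of ids from both versions) is an assumption, and the function name is hypothetical.

```python
def id_overlap(v1_ids, v2_ids):
    """Summarise how two collections of popdata ids overlap.

    non_overlap_share is the fraction of all distinct ids that appear in
    only one of the two versions (assumed definition).
    """
    v1, v2 = set(v1_ids), set(v2_ids)
    union = v1 | v2
    only_v1 = v1 - v2
    only_v2 = v2 - v1
    share = (len(only_v1) + len(only_v2)) / len(union) if union else 0.0
    return {
        "only_v1": len(only_v1),
        "only_v2": len(only_v2),
        "both": len(v1 & v2),
        "non_overlap_share": share,
    }
```

A non-zero count on both sides confirms that neither version is a subset of the other.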
SDPR Version 4: Pre application data (NOT part of collection, but one time for Bruce)
  • Data span: 1991-1998. Note this data is only for Bruce19-g01 project.
  • Format: .csv (file has been converted to flat format in order to assign coordinating ids etc., to be project specific using Popdata's tools with ease).
  • Summary of data: This data set includes data around income and disability assistance. The data is focused around payment (dates and amounts). This data is not as reliable as the other SDPR data sets.
  • Inclusion criterion: includes all individuals that applied for assistance.
  • Data updates: one time instance.
  • Data linkage variables: no PHNs. First name, last name, dob, gender and postal code are available.
  • This data is not to be merged with any other SDPR file.

Note this data linkage was reconciled with version 2 and 3 (GRD). An update was provided with additional columns added (added: office, curroffice, inPayNext13, openEarlyCode. Removed from earlier version: aptnum, prov and onianext3)

SDPR disability file (NOT part of collection, but one time for Bruce)
  • Data span: 2002-10 to 2012-03
  • Format: .csv
  • Summary of data: A listing of all persons with disabilities (PWD) applications, adjudications, and resulting decisions. 108 columns
  • Inclusion criterion: TBD
  • Data updates: one time instance (?)
  • Data linkage variables: PHN, full name, birth date, and postal code available.

SDPR out-of-province file (NOT part of collection, but one time for Bruce)

  • Data span: 1996/01 to 2009/12
  • Summary of data: data file to identify out of province records. ym (year and month), fileid, and monthsbc
  • Data linkage: none; fileid replaced with project-specific IDs
LMID
  • Data span: It appears the data ranges from 2013 to 2018 using the Training files Program start and end date.
  • Format: .csv (file has been converted to flat format in order to assign coordinating ids etc., to be project specific using Popdata's tools with ease).
  • Summary of data: This data includes information around individuals' participation in programs and training for labour market positions. Advanced Education, Skills and Training.
  • Inclusion criterion: NA.
  • Data updates: annually.
  • Data linkage variables: first name, middle name, last name, dob, gender, postal code, direct identifier: No
  • data sets have project specific IDs based on encrypting ids from the initial data linkage and project provision.
  • each unlinked record is assigned unique study ids.
Stats Canada
  • Data span: 2000 to 2016. The data is also provided in two ways: data on a) individuals by community (_IND extension on files) and b) families by community (_FAM extensions on files). Total of 45 files with 26 related to families and 19 to individuals.
  • Format: .csv file
  • Summary of data: census variables regarding employment, family (lone parent, couples), income (average, median etc.,), family units (# of children, e.g., couples with 1 child), # of females and males by age range, social assistance. Variables such as couples with 1 child income by 35-40 age range.
  • File information: For each file, the variables are different. Range of columns vary by data set and > 100 per file. Number of records is largest for individual files (15,000 to 25,000) and 7000 to 7200 for family files.
  • Inclusion criterion: not individual level data, but data at a postal code level.
  • Data linkage variables: each file contains cityid (numeric ####), postal area (alphanumeric, e.g., 3 to 6 digit postal code, 5 to 6 numeric starting with 5, and 3-7 starting with 9. 9XXXXXXX.XXXXXXXXXX), postal walk (appears to be empty with XXXX), level of geo (numeric ##, range 3 to 61) and place geo (appears to be neighbourhood names in BC).
  • data will have to be linked using the PCF+ software. Noushin has tested this, and was able to verify this will work for linkage.
MCFD
  • Data span: 1996-2018.
  • Consists of a total of 7 files:
    all_client_export: contains unique identifiers
    active stage 7: contains level of care, service type
    location history: postal code data (and type of location)
    intake export: largest file in terms of columns. Contains survey data. Information relating to case.
    subsidy: current postal code, living situation
    mcfd_fs_case
    mis_fs_list_export
    cysn_clients
    cs_clients_parental_1A_info
  • Format: .csv (file has been converted to flat format in order to assign coordinating ids etc., to be project specific using Popdata's tools with ease).
  • Summary of data: this data is on children/youth under the care of ministry/in the care system. It provides information regarding why the child is in care (e.g., abuse, abandonment etc.,). It provides information on where they are living, incidents, who their legal guardian is etc., Some of the records are 'anonymous' in that no name is associated to a record.
  • Inclusion criterion: none that we are aware of. Eligibility to programs (e.g., autism, or other disability services)
  • Data linkage variables: first name, last name, dob, gender, postal code. Direct identifier: PHN, however, there is a large number of missing PHNs. Numbers may appear in the place of names. An incident report may exist, but no information associated to who the individual is.

Note: there are more incident reports than there are individuals in the client file. This is because not all incidents have a name associated to it given the nature of the incident. An incident can transpire and no information is provided beyond that or the individual is not in the care system.

To determine which is the primary residential postal code: postal code in the location history only includes child/youth client addresses. The pcdh_base_postal_code only applies to the child care clients.

loc_types has a data dictionary.

Public Safety and Solicitor General
  • Data span:
  • Format:
  • Summary of data: this is housing data, and currently no timeline associated to it. This data will have to be verified, classified, linked etc., for the first time when and if we receive it.
  • Inclusion criterion:
  • Data linkage variables:

Decisions

  • each data set will require specific rules/guidance on handling non BC residents
Non BC residents
  • Date: July 23, 2018
  • Decision: data on non BC residents are to be included in 'research ready collections', as these individuals would have had to be a resident at one point or another to receive this assistance
  • Data set: SDPR, MOH. All data sets
  • Participants: Email communication with Jeannette. IDO connected with individuals in varying government levels to make this decision. Decision tracked in DataClassification_QuestionTracking document
  • Supplementary documentation: Email from July 23, 2018

Aboriginal Data
  • Date: August 1, 2018
  • Decision: aboriginal data will no longer be suppressed
  • Data set: All data sets requested in Mackenzie Project
  • Participants: Communication with Brittany. IDO connected with individuals in varying government levels to make this decision
  • Supplementary documentation: Excel document in IDO folder (email from Brittany August 2, 2018)

Pharmacare/PNET access for IDO
  • Date: July 23, 2018
  • Decision: Andrew Elderfield’s email of 2018-07-05 to IDO stated “Yes, under the ISA, DIP can access PharmaNet (or Pharmacare, which is really a subset of PharmaNet) for DIP projects for non-research purposes (planning, evaluation, improvement, the compilation of statistical information, etc.) without going through DSC.”
  • Data set: Pharmacare/PNET
  • Participants: IDO (email communication with Andrew Elderfield)
  • Supplementary documentation: Excel document (resides in IDO folder)

PSSG data linkage 2018
  • Date: July 30, 2018
  • Decision: For 1 PSSG-to-many PHN with one highest candidate, one candidate will be selected (~70%, so 1 PSSG to 1 PHN). For 1 PSSG-to-many PHN with equal weighted candidates (~30%), the data will be deduped. Many PSSG-to-1 PHN will be ignored (very small number).
  • Data set: PSSG data linkage 2017/18 data set.
  • Participants: Popdata (DSU)
  • Supplementary documentation: Excel and word document (both provided to IDO)

Variable Classification and de-identification process
  • Date: August 13, 2018
  • Decision: De-identification process is specific about removing direct identifiers related to people (e.g., practitioner ID, pharmacist ID), but not businesses (funeral address or code, pharmacy location codes etc.,). No issues regarding interim state or process/policy from ED governance group.
  • Data set: MOH data sets specifically, and all other data sets included in this interim phase.
  • Participants: IDO: GL and KA
  • Supplementary documentation: Excel document

Gender
  • Date: August 27, 2018
  • Decision: Gender will be part of research content
  • Data set: All data sets
  • Participants: IDO Brittany
  • Supplementary documentation: Verbal communication (Tav will follow up for documentation)

MOH Business Related Variables (e.g., hospital numbers etc.,)
  • Date: September 5, 2018
  • Decision: Retain the classification of 3.c Direct Identifier-Replaced for the business/facility identifiers for this data set only. While De-ID guidelines do not protect business identifiers, the wishes of the data provider with regard to their data supersede this allowance. We will make data set by data set decisions regarding this type of variable as we go along.
  • Data set: MOH
  • Participants: IDO
  • Supplementary documentation: Email September 6th from Brittany (Kathleen cc'd)

Replace study ids with project specific ids, and applying encryption project specific to specified fields for the IDD project
  • Date: September 17, 2018
  • Decision: No generic IDD ids for study ids or coordinating ids and Popdata's internal variables (sequence number). For projects within IDD, each project will have project specific ids.
  • Data set: All data sets in a project
  • Participants: IDD
  • Supplementary documentation: Email September 17th from Brittany

Providing full addresses for fields in vital statistics
  • Date: October 16, 2018
  • Decision: Decided to not provide full addresses for particular fields in IDD that are city, country, province, accident location and death location variables.
  • Data set: Vital Statistics
  • Participants: IDD
  • Supplementary documentation: Email from Brittany November 6, 2018

Import feature for SRE
  • Date: January 31, 2019
  • Decision: Use Popdata's existing import feature for DIP programs.
  • Data set: All DIP projects held at Popdata.
  • Participants: IDD - Greg Lawrence email and ticket.
  • Supplementary documentation: Ticket in Jira and email.

SIN number
  • Date: Initiated April 28th. Email/documentation pending.
  • Decision: Removal of SIN in any existing transfers/data sets. No longer to receive SIN in future transfers or retain at Popdata. Currently SDPR has SIN and must be removed (April 30th, 2019).
  • Data set: All IDD research collections
  • Participants: IDD decision. Communication via Brittany and Beth through JIRA and email.
  • Supplementary documentation: Ticket in Jira and email.
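
The PSSG data linkage decision above (select the single highest-weighted PHN candidate; treat equal-weighted candidates separately) can be sketched as follows. This is an illustration under stated assumptions, not the DSU's linkage code: the candidate structure, the weights, and the status labels are hypothetical, and because the notes do not say how tied candidates are deduped, the sketch only flags ties rather than resolving them.

```python
def resolve_candidates(candidates):
    """Pick a PHN for one PSSG record from its weighted linkage candidates.

    candidates: list of (phn, weight) tuples (hypothetical structure).
    Returns (phn or None, status).
    """
    if not candidates:
        return None, "unlinked"
    top = max(weight for _, weight in candidates)
    best = sorted({phn for phn, weight in candidates if weight == top})
    if len(best) == 1:
        # One clear highest candidate: 1 PSSG record to 1 PHN.
        return best[0], "selected"
    # Equal-weighted candidates: left for the dedup step, which is not specified.
    return None, "tie_needs_dedup"
```

For example, `resolve_candidates([("9000000001", 0.92), ("9000000002", 0.61)])` selects the first PHN, while two candidates with the same weight return the tie status.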

Documentation File:DataClassification QuestionTracking18Aug13.pdf

Documentation

All documentation for this project can be found on Alfresco--> Data Services Unit --> IDO
Documents specific to a data set (e.g., variable lists) are found in the data specific folders
General documents are located in the IDO folder

Note the service agreement is located here: Alfresco->Data Sourcing->Agreements-Universal->Current Agreements->Integrated Data Office (IDO)->Modification Agreement

Popdata Documentation on the IDO project

Process Documents
IDO_DocumentYYmonthDD.docx
Appendix A_Data_Management_Raw Data to ExtractYYmonthDD.xlsx
Flow Chart
v1.1: IDO_Initial_Data Flowchart_INPUTS from Intake to Extract_CURRENT_2018June13_TKA.pdf

IDO generated documents

Preparing Data For Transfer_FNL_2018-04-25.docx
DI Program_Data Export Process 2018-02-1.docx
Interim De-Identification Process.docx

MED

Variable classification: idomed_variables4extract2018Aug7_bhfxtka.xlsx
Metadata: accessed via Metabadger DM_DIMSTU & use of Band and First Nation (Aboriginal data) type: MED_SuppressionInformation_NancyEmail_18May23TKA.docx & Metadata for First Nations schools in EDW.docx

PSSG

Variable classification: idopssg_variables4extracts18Aug1.xlsx
Metadata: Cornet Field_Interpretation_Document for IDO V4.docx; IDO_Code Table_DOCTYPES.docx

SDPR

Variable classification: idosdpr_var4extract_draftv218Aug1 fl tka.xlsx
Metadata: SDPR_Data Dictionary for IDO.xlsx
Additional files: SDPR_IDO Header Control Totals export.xlsx

MOH

Variable classification: classification_popdata_collections19Feb25_update.xlsx and classification_popdata_colletionsVitalStatistics_rpblites_MHH18Sep2-1Brittany.xlsx

LMID

Variable classification: LMID_variable classification18Oct16.xlsx

Vital Statistics

Variable classification: classification_popdata_colletionsVitalStatistics_rpblites_MHH18Sep20.xlsx this resides in the MOH folder

Perinatal

Variable classification: Perinatal_variableClassification18Nov30_fxu.xlsx


Popdata and IDO identified collaborative projects

Output checker (OCWA) Phase 1
  • Description: Enable researchers to check their outputs to ensure they are adequately anonymized. Enable human review stage of output check.
  • Status: Development complete (testing phase with users).

Output checker (OCWA) Phase 2
  • Description: Import feature, and updates based on feedback
  • Status: Underway

Nifi Data ingestion
  • Description: Exploring the use of nifi to ingest data from data providers (Aidan and Brent)
  • Status: Underway - exploring

Postgres DB in SRE
  • Description: Exploring the use of DB in the SRE.
  • Status: DB setup including limiting users access by project is complete. Researchers are writing scripts to load in the data currently.

SFU SRE
  • Description: For use by IDD
  • Status: Providing SFU technical specifications requirements (the environment is end of life cycle, and contract is up in 2019). Require letter that allows IDD data to reside at SFU. Otherwise staging server is all set up for development line.

Workbench
  • Description: Notifications, chat feature and documentation.
  • Status: Underway

Gitlab
  • Description: code sharing between and outside of the SRE
  • Status: underway and almost deployed

Package management (Python and R)
  • Description: exploring ways to manage packages and users to install
  • Status: being investigated during upcoming May sprint

Online DAR - Front Counter
  • Description: Create IDD DAR and Front Counter of how to guide researchers
  • Status: Underway (Mike S, Melissa). Almost complete pending review by Brittany for content. Will be launched shortly.

Metadata
  • Description: Ingestion of metadata - variable label and description using BC catalogue. Currently Brittany and Donna are putting together a data dictionary for the 'research ready collections.' Tav has provided them with the .xlsx sheets we provide researchers in the SRE.
  • Status: Underway: Tav is ingesting SDPR.


SRE
  • Create a place where users can view outputs, but not take out outputs, given the time constraints on output checking.
Output Checker history of development
  • Conducting interviews with Paul August 13 to August 31. Interviews will include existing Popdata SRE users, data checkers and other users in government.
  • Technological meeting around design specs August 29, 2018
  • Interim solution implemented by Popdata for October 1st week - instructions included and sent to Greg October 17, 2018.
  • Juile is the output checker

Data Processing and Availability Summary

DIP DATA specific

Data Information IDD Research Ready Collection Status Received from Data Provider (verified) Variable Classification Metadata Linkage Ready for SRE Unlinked Pushed to SRE Linked Pushed to SRE
MED This data is different from Popdata's holding. Filter files are provided with release given lack of metadata. Exists. Unlinked records have unique study id ('u'). Complete March 29, 2018 (verified) Complete June 19, 2018; reclassification of aboriginal content August 1, 2018.

Time was spent ensuring aboriginal content was suppressed correctly (issues of on reserve etc.,), and ethnicity identifying variables that capture language and can be considered a proxy for ethnicity needed to be worked through.
IDO has access to Metabadger: MED metadata online system. Popdata provided variable list and description to users in the SRE, along with filter files. June 28 began. Complete July 26. Data processed and prepared July 30. Changes and updates (added in aboriginal data) August 14. Ver 1.0: Available June 25. Emailed Donna. Ver 1.1: August 7, 2018 (aboriginal content not suppressed). Ver 1.0: August 17, 2018. Ver 1.1: January 22, 2019 -- variables added back in and unlinked records assigned unique study ids.
PSSG This data is no longer going to be provisioned until a new data set is received. Exists and provisioned to a few projects. No longer provisioning until new data files provided. Note only 2017 year of data available. Complete April 24, 2018 Complete May 15, 2018. Allowed to retain 'variable labels' as placeholders (could retain labels as is, but will replace with Placeholder 1, data x00...00x (length of data in field)). Received a detailed word document. This document was updated to include changes that were made (e.g., suppression of variables or change in postal code from 6 digits to 3. This was reflected in this word document). This document is no longer to be provided -- email from Donna in 2019. June 28 data linkage began. Issues with linkage: multiple PHNs for one individual. Analysis to resolve issue conducted July 14 to July 23. July 26 linkage complete. Data processed and prepared July 31, 2018. Ver 1.0: May 22, 2018 Pushed. Ver 1.1: August 7, 2018 (aboriginal content not suppressed) Ver 1.0: August 9, 2018. **Removal of file due to sensitivity. New file to be received and Ver 1.0 will be destroyed.
Social Devel. & Poverty Reduction version 1 (from ICM database) data from the ICM database: 'live' time data and can differ from the GRD data for this reason, which is corrected data. Exists. Unlinked records have unique study id ('u'). Complete June 12, 2018 for 2017/18 data. Error in data sent, resent June 19,2018 (2 files of 8). Remaining 6 files received July 12, 2018. Verified with some issues.

Legacy data being received August 23 to 30th (2005 to 2017).
Complete and approved round 1: June 25. Additional data sets approved July 19. Note: reclassify aboriginal variables (August 1).

Time required to clarify retention and use of SIN for future data linkages and handling of non BC residents in data (decision July 23, 2018).
Received an excel document. Variable label and description, and table relations. Updated metadata to reflect variable label changes for coordinating id replacements (e.g., personid with personcoordid). Legend created to reflect this and variables we did not receive that appeared in the metadata sent. Started July 27 and complete August 3. Data processed and prepared August 14 (include aboriginal changes). Data was re-processed to included unlinked ids in April 2019. Ver 1.0: August 7, 2018. Ver 1.0: August 14, 2018 (2017/18 data only); Ver 1.1: October 23, 2018 (SDPR all data often referred to as legacy). Version 1.2: May 10, 2019 (unlinked records included with 'u').
Social Devel. & Poverty Reduction version 2 (referred to as "GRD") data from the GAIN database. 'Cleaned' version of ICM data and used by researchers at SDPR. Exists. Unlinked records have unique study id ('u'). Initial transfer: November 5, 2018. Issues with data upon receipt (no PHN or gender for data linkage). New file sent November 16, 2018. Complete and approved December 6, 2018 (sent November 28 for approval). Still must resolve issue of SIN. Not received. Started November 27 and complete December 10, 2018. Data relinked to reconcile with SDPR version 3 prior to May 2019. Data processed and complete December 12, 2018. Re-processed and complete for reconciliation May 3, 2019 and also because it can be combined with version 3. unlinked data no longer provided separately. Instead unlinked records have 'u' in studyid. Ver 1.0: December 19, 2018. Ver 1.1: May 10, 2019 (combined with version 2).
Social Devel. & Poverty Reduction version 3 (similar to version 2 "GRD", but different years of data) historical SDPR data. April 9, confirmation that version 2 and 3 can be combined together. Exists. Unlinked labeled 'u' First batch February 8, 2019, second February 13, 2019. Issues with files (missing PHN). Corrected files received February 15 and February 27. Complete and approved February 15 for first batch, and second batch March 12. Not received. Initial complete prior to May. Complete prior to May 2019 (with reconciliation to SDPR version 2 and combined with version 3) Data processed and complete May 3, 2019. unlinked data no longer provided separately. Instead unlinked records have 'u' in studyid. Ver 1.0: May 10, 2019.
Social Devel. & Poverty Reduction version 4 - pre application unique data just for Bruce19-g01 project. No one else is to receive this data. Exists. Unlinked records have unique study id ('u'). April 19, 2019. Note no PHNs in this data for linkage. Used previous classification given similarity of variables in data set to previous versions received of SDPR Not received Ver 1.0 pre-app: May 22 (with reconciliation to SDPR version 2 and 3). Ver 2.0 of pre-app: July 8 - update of file received with more columns of data. Ver 3.0 of pre-app: August 2 - corrected coordinating IDs Ver 1.0: Data processed and complete May 22, 2019. V2.0: processed and completed V3.0: Data processed and completed August 7, 2019 unlinked data no longer provided separately. Instead unlinked records have 'u' in studyid. Ver 1.0: May 29, 2019. Ver 2.0 June 19, 2019 Ver 3.0 August 8, 2019
Social Devel. & Poverty Reduction version 5 - disability unique data just for Bruce19-g01 project. No one else is to receive this data. TBD July 2019 Have draft. Waiting on metadata from SDPR to finalize. Waiting on metadata from SDPR. Initial linkage of disability file is complete 2019-08-01. TBD TBD TBD
Labour Market ID from Jobs Trade Technology this data does not have metadata associated to it. Filter files are provided with release. Exists, but not provisioned yet. Unlinked records require unique study ids still. Data to be received September 7th, 2018. Issues with data set - require resending. Resent September 28. Variable classification sent for review October 17, 2018 to Brittany. Follow up email and confirmation to proceed October 24, 2018. Received excel document with variable labels, description and data dictionary. Complete ~ less than one week. October 24, 2018. Linkage rate: 76.48%. Note no direct identifier. Complete (October 29, 2018). Release Pending update to DARs to include this data. No No
Stats Canada Data This data does not need to be classified. The data was vetted by BC stats. Received March 29. Licensing for PCF++ received April 10. over 100+ to 1000 variables. Given data is aggregated, VC may not be required. NA Examining how to link file to registry. Pending NA Pending
MCFD this data shares a variable with SDPR Exists. Unlinked records 'u'. Files received March 28. Issues in files April 5, 2019 and missing family service file. New files received April 11, 2019. Additional files sent (CYSN) April 18, 2019. Complete and approved May 1, 2019. Not received with initial files. (potentially Received on April 18 - must review). Complete May 17, 2019 Data processed and complete May 17, 2019. unlinked data no longer provided separately. Instead unlinked records have 'u' in studyid. Ver 1.0: May 22, 2019

Popdata holdings used for DIP project

Data Information IDD Research Ready Collection Status Received from Data Provider (verified) Variable Classification Metadata Linkage Ready for SRE Unlinked Pushed to SRE Linked Pushed to SRE
MOH Data: registry, MSP, DAD, NACRs, MHS, HCC data is currently part of 'one' collection and is referred to as Health Exists Round 1. Round 2 involves adding variables not previously facilitated or accessible. Not include HCC. Use existing holdings at Popdata. Agreement signed on June 27th (all data under MOH stewardship at Popdata can be used for IDO projects) Round 1 complete July 11, 2018 by Fan, and July 25, 2018 by Tav. Sent to Brittany on July 25, 2018. July 30th week back and forth on questions (no BC residents, small geographic areas that could have identifying information, aboriginal content not suppressed). MHS (~October 4, 2018). HCC not classified (pending MOH response regarding the recency of this data). Data provider input week of August 27. Finally classification September 5th. Use of our system. September 7, 2018 MSP, DAD, Consolidation, NACRs, rpblites, MHS verified by Tav and DAU unit. Ver 1.0: MSP, DAD, Consolidation File (September 30). NACRS and rpblites (October 3). MHS (October 16). see unlinked.
PNET/Pharmacare Considered apart of the Health collection by DIP. Exists Round 1. Round 2 involves adding variables not previously facilitated or accessible. Use existing holdings at Popdata (Andrew Elderfield emailed to say PNET/Pharmacare can be used for IDO projects. See decision section) Pharmacare Round 1 complete July 11, 2018 by Fan, and July 25, 2018 by Tav. Sent to Brittany on July 25, 2018 (with MOH). Pharmanet complete August 2, 2018 by Tav. Sent to Fan for his review. Same issues of non BC residents exist. Use of our system. PNET data extract began in September. PNET took 3-4 days to complete. Complete and verified by Tav and DAU unit. NA Ver 1.0: PNET (October 2), Pharmacare (October 3).
Vital Statistics (births, deaths, stillbirth and marriages) considered part of the Health Collection by DIP Exists Round 1 complete. Use existing holdings at Popdata - agreement still required to do so. Vital Statistics Round 1 complete July 11, 2018 by Fan, and July 25, 2018 by Tav. Sent to Brittany October 1, 2018 for review. Use of our system Extract began in October, but pending approval process of variable classification (which was confirmed November 6, 2018 by Brittany via email). Complete and verified by Tav and DAU unit. NA Ver 1.0 (November 27)
Perinatal Note: Perinatal is not part of the DIP collection. Project by project provisioning only with VS data steward requirement. Use existing holdings at Popdata - agreement still required to do so. Variable classification began on November 19. Use of our system Complete January 14 Data dictionary provisioned February 8, 2019

Linkage Summary

  • IDO has signed an ISA with MOH. The statistics act allows Popdata to use the existing Population Directory to link data for the IDO catalyst projects (June 27, 2018).
  • Linkage complete for MED, SDPR version 1, SDPR version 2, MOH, LMID, Vital Statistics, PSSG.
  • De-duplicating LCF

Processes

General Overview

  1. Commitment letters are sent to data providers, or an Information Sharing Agreement (ISA) is signed with them, by IDO. IDO works with data providers to provide metadata and data to Popdata via the web portal (other options are available if need be).
  2. Receive data via Popdata Portal
    • verify that the total number of records per file and the number of files received are correct.
    • ensure the keys across tables are retained for researchers (encode directly or encode a coordinating id, depending on the sensitivity of the variables)
    • ensure keys are consistent across files before encoding the data (this includes length, type, and padding).
  3. Variable classification
    • New data sets: classify variables using the variable classification index (direct, indirect and research content) developed by IDO and Popdata. The data provider is asked how non BC resident data should be handled if this type of data exists. Data is only prepared once all questions regarding variable classification are resolved and IDO has been sent the final variable classification sheet for the data set.
    • Appending existing data sets with updates: review the intake of data that will be appended to existing 'research ready collections.' Ensure existing variables are present and their classifications are correct. For new variables, append the new items to the variable classification list and version the variables.
  4. Process data
    • data is received on Defuca
    • Direct identifiers are transferred to Ericsson (Linkage) for data linkage.
    • Research content remains on Defuca (data). Once data linkage is complete, the crosswalk is transferred to Defuca (crosswalk folder) and PopdataIDs are added to the research content file (sequence number exists). This data is then moved to a production folder on Defuca and pushed to George to facilitate project specific extracts. Note: this is the research ready collection.
    • Extracts are created for approved IDO specific projects.
  5. For all research ready collections not previously part of Popdata holdings, and for new versions and updates with additional years of data, each data set to be used for IDO is reviewed by a DSU member using the following steps:
    • some considerations involve randomly auditing extracts for IDO specific projects (random auditing may suffice after initial verification of research ready collection, updates and appending with updated data).
    • The variable classification list is used during this process
    1. Ensure identified variables for modification have been modified appropriately, which includes life events (YYYYMMDD to YYYYMM) and Postal Codes (6 digits to 3),
    2. Ensure the sample size of each file is appropriate
    3. Ensure coordinating ids exist in the extracts where needed for linked or unlinked files and are encrypted (project specific ids). *It is particularly important that certain coordinating ids do not exist for linked data.
    4. Only approved research ready variables are included in the extract
    5. Ensure suppressed variables are not in the extract and no other additional variables are in the data sets (use the variable classification sheet to ensure this)
    6. Ensure Popdata versioning variables exist and are in the expected format; these include study ids (for linked files), seqno, version, and datayr. Ensure seqno is encrypted (project specific ids).
    7. Ensure variables identified for replacement with project specific ids are encrypted using the filter file. For non-csv files, ensure ids do not overlap with another existing project that uses the same data set.
  6. For all research ready collections based on existing Popdata holdings, each of the data sets to be used for the IDO project is generated and reviewed using the following steps:
    1. Use existing data collection extract, and ensure study ids are encrypted in filter files using Popdata automated processes.
    2. Run filter files using IDD filters
  7. Data is then transferred to the project specific data folder on the SRE. Researchers who have access to the IDO specific project are emailed by Tim Choi (DAU) and Donna/Brittany are emailed with an update regarding data availability to IDO specific projects by Tavinder Ark (DSU).
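The field modifications checked in step 5.1 (life events from YYYYMMDD to YYYYMM, postal codes from 6 digits to 3) could be sketched as below. This is a minimal illustration in Python, not Popdata's actual processing code. The function names are hypothetical, and it assumes dates arrive as eight-digit strings and postal codes as six characters with an optional space.

```python
def truncate_date(yyyymmdd: str) -> str:
    """Reduce a YYYYMMDD life-event date to YYYYMM (assumed input format)."""
    value = yyyymmdd.strip()
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {yyyymmdd!r}")
    return value[:6]


def truncate_postal_code(postal_code: str) -> str:
    """Reduce a six-character postal code to its first three characters."""
    value = postal_code.replace(" ", "").upper()
    if len(value) != 6:
        raise ValueError(f"expected 6 characters, got {postal_code!r}")
    return value[:3]


def is_modified(row: dict, date_fields: list, postal_fields: list) -> bool:
    """Return True if every listed field is already in its reduced form."""
    dates_ok = all(len(row[f]) == 6 and row[f].isdigit() for f in date_fields)
    postal_ok = all(len(row[f]) == 3 for f in postal_fields)
    return dates_ok and postal_ok
```

A reviewer could run `is_modified` over a random sample of extract rows, passing the date and postal code fields flagged in the variable classification list.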

Process Intake of Data and Metadata from IDO

Current document for data intake: Preparing Data For Transfer_FNL_2018-04-25.docx.

De-identification document: Interim De-Identification Process.docx

MED

Data was processed and ready to be pushed to SRE (note, postal code has been reduced from 6 digits to 3). Tav verified the list provided by Fan to ensure what is pushed to the SRE matches the agreed upon variable list (.csv file). Additional variables were identified (Brent and Tav) to have indirect identifiers as content in variables such as aboriginal content ("band") in the following types of variables: district name, school name, school type, home language, and funding (sld). Mincode has small n's (geographic suppression). Special needs was also flagged given its sensitive nature. A follow up round 2 of variable classification was sent to MED via Donna.

March 29 Data received (verified).

Tav: when did variable classification take place, and how long did it take, between March 28 and May 4?

May 17 For special needs: Donna (email) stated to retain special needs data for CYMH project needs (Gen and her discussed). Waiting on MED for classification agreement of remaining variables.

May 23 Email from Donna. Holding pattern for MED until clarification around aboriginal data suppression (Nancy at MED provided a document on Band and Aboriginal schools). Small geographic areas can be suppressed during SRE extraction phase (and managed that way). Suggestion around suppression of mincode 97 and require some variable suppression (row level? needs to be clarified).

May 30 Continued holding pattern around MED data. Question around use of the term 'indirect' identifier during variable classification (we classified all indirect identifiers in the IDO document as direct identifiers). In the future we should revisit our terminology to minimize confusion around direct (linkage purposes) and indirect (suppression due to data provider or IDO rules) identifiers. Donna will update the IDO catalyst document to include ethnicity in the indirect identifiers section.

June 6 email from Kathleen providing guidance around indigenous data recommendations.

June 7 followup email classifying MED variables as suppressed based on Kathleen's email. Attached the MED classification for sign off. Kathleen responded and this required a followup conversation around equating language for variable classification around suppressed and special request variables by Tav.

June 11 to 14 Tav sent an updated variable classification document to IDO for review, and there was some back and forth to ensure the language around variable classification and process flows associated to each were in agreement between IDO and Popdata. A final MED classification list was sent to Donna and others using the new agreed upon variable classification list.

June 19 Followup email sent to Donna regarding sign off of the variable classification of MED data. Tav will connect with Fan today and begin to process the file, assuming all is okay.

June 21 Discovered additional variables with Aboriginal identification. Removed by Fan. Data is ready to be pushed to George. Donna is emailed regarding the additional suppression.

June 22 Fan has data ready to be pushed to SRE.

June 25 Data made available to SRE. Donna and Researchers notified.

August 1 IDO requested that aboriginal variables not be suppressed. Processing will commence, providing the variables in a separate file with a coordinating ID to link.

August 7 Version 2 of data available with non suppressed aboriginal data

MED IDD Research Content File COMPLETE.

January 2019 each unlinked record is assigned a unique study id.
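Assigning study ids to unlinked records might look like the sketch below. Only the 'u' marker is documented here; the zero-padded counter, the `width` parameter, and the field names `seqno` and `studyid` are assumptions made for illustration.

```python
from itertools import count


def assign_study_ids(records, crosswalk, width=9):
    """Attach a study id to every record.

    records:   list of dicts, each with a 'seqno' key (assumed layout)
    crosswalk: dict mapping seqno -> study id for records that linked
    Unlinked records get a generated id starting with 'u' (format assumed).
    """
    counter = count(1)
    result = []
    for record in records:
        study_id = crosswalk.get(record["seqno"])
        if study_id is None:
            # No linkage result: generate a unique 'u'-prefixed id.
            study_id = f"u{next(counter):0{width}d}"
        result.append({**record, "studyid": study_id})
    return result
```

Keeping unlinked records in the same file with a distinguishable id means they no longer need to be provided as a separate unlinked file.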

PSSG

April 24 data and metadata received (verified).

Tav when did variable classification begin?

May 10 Email sent to Donna on May 10, 2018 with latest variable classification from Fan.

May 15 Monthly IDO meeting: everyone felt comfortable with pushing PSSG to the SRE once the new individual at the data provider is brought up to speed. Waiting to hear from Donna and the data providers whether they are okay with the list.

Requires 3/4 days of processing once verification of variable list is complete by data provider (Brent to review if this time is required).

May 17 Fan is currently processing the data and has replaced suppressed data with placeholder variable labels and data (x00..00x based on length). Tav reviewed frequencies and data values in fields, and did not notice anything else that needed suppressing (unlike MED). Tav will update the metadata file Cornet Field_Interpretation_Document for IDO V3.docx to include placeholder names in place of variable labels and data field types before the push to SRE; SEQNO, VERSION and DATAYR must be added to the metadata. Note, Brent noticed missing postal codes in the PSSG data file (emailed Donna to find out if we could receive this for record linkage).
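The placeholder scheme (a label such as Placeholder 1, and data of the form x00..00x matching the field length) can be illustrated with a short Python sketch. The function names are hypothetical, and the handling of fields shorter than two characters is an assumption, since the notes do not cover that case.

```python
def placeholder_value(length: int) -> str:
    """Build an x00..00x placeholder with the same length as the field."""
    if length < 2:
        # Assumption: the scheme needs room for both 'x' delimiters.
        raise ValueError("field length must be at least 2")
    return "x" + "0" * (length - 2) + "x"


def placeholder_label(index: int) -> str:
    """Build the replacement variable label, e.g. 'Placeholder 1'."""
    return f"Placeholder {index}"
```

Preserving the original field length keeps fixed-width starts, stops and lengths in the metadata valid after suppression.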

Refer to document PSSG_SummaryDataFiles18Nov28.docx for more details and a diagram (found in the IDO folder under pssg).

May 22 Tav verified extract_summary. Updated metadata file to reflect all changes/additions to files so they can be read in. Fan will push data to George today. Tim sent ubikey via carrier.

May 23 PSSG pushed to SRE. Note: no encryption on SEQNO, and code table still contains code for suppressed data (e.g., Aboriginal, Citizenship etc., but cannot link given the fact that data fields have been removed, so codes are meaningless).

Donna email around missing postal codes: PSSG indicated this field was unreliable. We may receive it in the next round. May also receive an 'alias' file, which is much more reliable for resolving matching issues. Expected to receive it March 2019.

May 25 Tim sent email and researchers have access to MED data. Tav investigating metadata for PSSG for Greg using frictionless package online.

May 30 First samples of metadata sent to Greg for review from PSSG and MED.

May 31 Donna emailed regarding EVENTID missing from the Conditions file, and thus not linkable to the Events file. Question if researchers would like to link on EVENTID and use the information in the Condition files with this variable. Donna also raised the issue of repeating coordinating id for PERSONID. Tav explained this is to be expected given the one to many relationship to the offence data file (here it should be unique and non-repeating).

July 17 Donna sent additional documentation that needs to be included in Mackenzie project (codes for PSSG data variable). Current users have been emailed this document by Donna. Tav will follow up to have this included and part of PSSG metadata.

August 1 IDO requested that aboriginal variables not be suppressed. Processing will commence, providing the variables in a separate file with a coordinating ID to link.

August 7 Version 2 of data available with non suppressed aboriginal data

PSSG IDD Research Content File COMPLETE.

November 20 & November 23 Dan has raised an issue with the PSSG data. The number of records in some of the files does not make sense, and he is wondering if this is an issue in the data linkage, study id replacements, etc. Phone call booked for November 26.

November 26 Dan described the issue on a phone call with Tav. Dan noticed a significant drop off in the number of unique study ids from the offender file to the legal file. Fan reviewed the original files received from PSSG and discovered the number of unique records in the legal, offence, etc. data files is far less than the number in the offender file. There is a significant drop off in records in the files as received. An email was written to Dan and Brittany regarding this matter, noting that the data providers would have to be contacted to determine if this is a data error or a real artifact of the data.

November 27-29 Analysis of data revealed the following:

The number of unique person ids and study ids in the legal file (and all data files except for the offender file) is significantly lower (n = ~ 39,000) than the number of unique person id and study id records found in the offender file (n = ~ 150,000).
Relational keys that were previously suppressed need to be added back, because their suppression introduced ambiguity in table relationships. This includes adding personid, offenceid, assessid, caseno back into the file.
The length of eventid is 20 in the condition table, where it holds the same value for every row, but 30 in the event table, where it is unique. These two variables cannot be walked to each other because the value is not unique in the condition table.
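A check of the kind that surfaced the eventid problem could be sketched as follows. This is an illustrative Python example rather than the analysis that was actually run; `key_profile` and `is_constant` are hypothetical helpers, and rows are assumed to be dicts of strings.

```python
def key_profile(rows, key):
    """Summarize a candidate key: row count, distinct values, value lengths."""
    values = [row[key] for row in rows]
    return {
        "rows": len(values),
        "distinct": len(set(values)),
        "lengths": sorted({len(value) for value in values}),
    }


def is_constant(rows, key):
    """True when a multi-row file carries a single value for the key."""
    return len(rows) > 1 and len({row[key] for row in rows}) == 1
```

Comparing the profiles of the same key in two tables shows mismatched lengths and a constant value before any join is attempted.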

January 2019 each unlinked record is assigned a unique study id. This data is not released because we are awaiting new data.

February 14, 2019 PSSG data extracts have been quarantined (no researchers have access to the data provisioned by Popdata to 18-g01 and 18-g02) due to the presence of youth records (as provided by the data provider. This is considered a data breach by PSSG, and will follow government protocols to ensure the data is treated correctly). Researchers are being contacted to move all PSSG related files they have saved to a folder on the SRE for quarantine. Quarantined here means files are not accessible by researchers, but determination of destruction or what to do with those files is pending guidance of PSSG via IDD.

February 25, 2019 Email received from Brittany regarding the destruction of the 'offence' file from PSSG.

February 28, 2019 All files related to offence have been deleted from the redzone, SRE and initial transfer locations. The process of data destruction and deletion followed Popdata's protocol found in document: Data Destruction Procedures FINAL 2018 07 23.pdf. Not clear yet if researchers in the SRE have removed/deleted their files. Tav will follow up. The remaining files are to be released to the SRE users.

Social Devel. & Poverty Reduction

Version 1: April 24 Connect with Rob about transfer begins.

May 29 Currently we are waiting on Rob Morrison to decide the best mode of transferring data and metadata to Popdata. Three options have been proposed so far: A) use Popdata's secure website for transfer (involves a username and password), B) use/set up an SFTP server on Rob's end, and have Popdata retrieve the data, and C) Popdata sets up an SFTP server, and provides the details to Rob to upload the data. With any of these options, data is to be encrypted. The favoured option from Popdata is option A), followed by B) and then C). Option A) fits better with Popdata's current procedures because we receive a notification the moment data is uploaded and data is 'moved' into the correct location upon upload. With option C), the Security unit would have to periodically check whether data had been uploaded.

June 12 Issues with legacy data - it will take 30 to 60 days to resolve, and will receive this in August. Have data extract for Fiscal Year 2017/18, and will transfer encrypted file today. Case header: 95MB. Case Detail: 83MB. They will use our web transfer system and ensure an encrypted file is sent. Data sent on June 12 to popdata using the web transfer system.

June 14 Phone call with Donna around next steps for SDPR. Priority is to conduct a validation receipt check (number of files, and number of rows per file is correct). Note SDPR did not provide this, so we will be sending this information to Donna to have SDPR verify. Variable classification will begin, with first round just on variable label (using the metadata provided) by Wednesday June 20th. Variable content will commence after that, once Fan has time to process SDPR data sets. First priority is to get MED to the SRE.
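A validation receipt check of this kind compares what the provider says was sent with what arrived. The Python sketch below is illustrative only; it assumes both sides can be expressed as a mapping from file name to row count, and the function name is hypothetical.

```python
def receipt_check(expected, received):
    """Compare expected and received file manifests.

    Both arguments map file name -> row count. Returns a list of
    human-readable problems; an empty list means the receipt is verified.
    """
    problems = []
    for name in sorted(set(expected) | set(received)):
        if name not in received:
            problems.append(f"{name}: missing")
        elif name not in expected:
            problems.append(f"{name}: unexpected file")
        elif expected[name] != received[name]:
            problems.append(
                f"{name}: expected {expected[name]} rows, received {received[name]}"
            )
    return problems
```

When the provider does not supply counts, as happened here, the received side can be sent back for the provider to confirm.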

June 15 Fan reviewed the data and verified that the number of files and the number of rows match what was sent for the header and detail files. However, many issues with the files were identified and are as follows:

1) the number of variables and variable labels for both the header and detail files do not match the Data Dictionary provided.  The header and detail files have 26 variables in each. 
2) the header and detail files have the exact same variables appearing in both.  However, depending on the file, the data is missing completely for a variable.  The first three variable columns in the header file are missing, whereas in the detail file variable columns 9 to 26 are missing.
3) the header does not appear to match the variable content -- for instance under 'due_date,' the name of city, such as Vancouver, will appear in the field rather than a date.  This suggests that the header is not correct.  We also reviewed the file to ensure columns did not get shifted when importing the .csv file, and that does not seem to have occurred.
4) there is no way to link these two files (the primary key is not present in either file).

Donna was emailed this list to determine what the issue is between the two files, and the reason for the discrepancy with the data dictionary. As a result, we have held off on variable classification since we have no data that matches the data dictionary.
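Checks like those in the list above (variables not matching the data dictionary, columns that are entirely empty) can be partly automated. The following Python sketch uses only the standard csv module; the function name and the sample column names are hypothetical, and it assumes the first row of the file is the header.

```python
import csv
import io


def check_against_dictionary(csv_text, dictionary_names):
    """Compare a CSV file's header and content with a data dictionary."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = [name.strip().lower() for name in rows[0]]
    data = rows[1:]
    expected = [name.strip().lower() for name in dictionary_names]

    # A column is 'empty' when every data row is blank (or short) there.
    empty_columns = [
        name
        for position, name in enumerate(header)
        if data and all(
            position >= len(row) or not row[position].strip() for row in data
        )
    ]
    return {
        "missing_from_file": [n for n in expected if n not in header],
        "not_in_dictionary": [n for n in header if n not in expected],
        "empty_columns": empty_columns,
    }
```

This does not detect a mislabelled header (such as a city name under 'due_date'); that still requires a content review of each field.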

June 19 New data sent. Old data to be removed.

June 25 Fan verified the data has the correct number of rows, records and columns. Columns match the data dictionary. Variable classification began and is complete for all data files, including the files not received (round 1: Tav and Fan). Tav emailed Donna to ask when the rest of the 2017/18 files will be available and sent the variable classification for review.

June 27 Brent noted that SINs may require additional legislation for use. Need to follow up to determine how this data can or cannot be used.

July 11 Remaining 6 of 8 files received for 2017/18 data set.

July 12 Fan verified the file with the following issues noted and emailed to Donna on July 12, 2018:

The following variables are in the variable classification list, but not in the data we received: 
a) File: IDO_SDPR_CASEAKA. Variable that is missing: Integration_ID 
b) File: IDO_SDPR_CASEINVOLVE. Variable that is missing: Start Date, and End Date.
Integration_ID is a direct identifier, while start date and end date are research content.  The direct identifier is not required for our linkage process, and will not be required to link the files.  

Linking among files in SDPR
The case involve file can be linked up with all other files without any broken records.  There are only 13 records in the case header that do not show up in the case involve file.  But again, the case involve does not have any issues with linking up all the other files given it is a one to many relationship with case involve and all other files.  We feel this is negligible, and does not require an additional upload to correct etc.,
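The count of records present in one file but absent from another can be obtained with a simple anti-join on the shared key. The sketch below is an illustration in Python; the function name and the `caseid` key used in the usage note are hypothetical, as the actual key name is not recorded here.

```python
def unmatched_keys(left_rows, right_rows, key):
    """Return the key values found in left_rows but not in right_rows."""
    right_values = {row[key] for row in right_rows}
    return sorted({row[key] for row in left_rows if row[key] not in right_values})
```

For example, `unmatched_keys(case_header, case_involve, "caseid")` would list the case header records that do not show up in the case involve file.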

July 18 Variable classification sent to Donna. Asked how to handle non BC resident data.

July 23 Response back from Jeannette states to keep non BC resident data in the data set because individuals would have to be a BC resident in order to obtain services from SDPR. Awaiting sign off on variable classification so Fan can process unlinked data.

July 30 Confirmation to go ahead and proceed with SDPR data preparation given there are no more outstanding questions from Popdata to IDO, and IDO has no questions regarding the variable classification conducted by Popdata (Brittany email).

August 1 IDO requested that aboriginal variables not be suppressed. Will include these variables in release.

August 7 Version 1 data available with non suppressed aboriginal data

August 23 Legacy data is being provided in the following cycle

Aug 22 (2013-2017) - 5 data sets
Aug 23 (2011/2012) - 2 data sets 
Aug 27 (2009/2010) - 2 data sets 
Aug 28 (2008/2009) - 2 data sets 
Aug 29 (2006-2007) - 2 data sets 
Aug 30 (2005) - 2 data sets

October 15 Data linkage complete and sent to Fan to consolidate files.

October 23 Data files (all files including legacy) released to SRE.

SDPR Version 1 IDD Research Content File COMPLETE.

February 2019 working to assign each unlinked record a unique study id.


Version 2: a newer version of SDPR is being provided that is to replace version 1. This file is to be easier for researchers to work with.

November 5 new data set transferred

November 13 Fan conducted validation and discovered issues with the file: no PHN and gender for linkage.

November 14 Brittany emailed regarding this issue. A new data set is to be provisioned with these variables.

November 16 New data set transferred

November 19th week working on the data set for data linkage

November 28 Classification sent to Brittany and Donna.

December 6 Questions and updates on variable classification received by Popdata from Donna. Variable classification is agreed upon.

December 10 Data Linkage complete

December 12 File processed by Fan and ready for Tav to review upon return from vacation

December 17 Review of file revealed addresses in city variable. Variable was suppressed.

December 19 Data released to Mackenzie project.

SDPR Version 2 IDD Research Content File COMPLETE.


January 2019 each unlinked record is assigned a unique study id for version 2.


Version 3: 1990-1994 data, and 1980's.

February 8 received batch 1 (1990-1994)

February 13 received batch 2 (1980's)

February 13-15 verification by Fan. Identified one new variable in these files that did not exist in version 1. Review data for differences. Determine if this data will be part of version 1 or its own separate version. Identified a shift in the rows in the file; the files need to be resent. Note this is only batch 1 (1990-1994) of 2. Email sent: We have reviewed the files received on Friday and Monday this week -- the total number of records and columns match what was sent by the data providers. However, there are issues in the files, which are as follows:

1. no phns
2. for the file bcea_cases2.idd.csv the variable class appears in addr1, addr1 in addr2, city in addr2, postcd in city and postcd in class (so a shifting in the data).

We would like to have guidance on number 2 especially (should we correct it, or is it best for the data provider to do this, is it just a relabeling of columns to data that is incorrect, and we can relabel etc., verification of what this error is, would be great).

As for number 1, PHNs may not exist, but we just wanted to clarify this is the case, and in turn will impact data linkage.

February 11 Batch 2 sent (1980s). Batch 1 verification of data sent to IDD.

February 13 Email sent to Brittany and Donna regarding if this data can be combined with version 1 (non GRD data) for Batch 1. Data has issues in PHN and SDPR will resolve on their end and re-send the files.

February 14 Gap in years of data found. No data for 1994-2005 for version 1, 2, and 3. Transfer of file provided (not for gap), but to replace files sent February 6 and 11 that were incorrect. The file was provided called: bcea_cases2.idd.csv to cover this gap. note: BCEA_Case2.idd.CSV is for 1990/10 to 1994/12. BCEA_Cases_idd3.CSV is for 1989/02 to 1990/09.

February 15 Corrected file sent for issue found in February 14 email. Note this file does not have PHNs because they did not collect them until the late 90s. File name: bcea_cases2.idd.csv.

February 20 Received two files to replace one (bcea_cases2.idd.csv, bcea_cases.idd3.csv). Sent email to understand this.

February 21 providing PHNs, awaiting new files.

February 28 Batch 1 new files received. Data classification, verification and linkage to commence. Batch 2 (1980s) to be processed and reviewed. All files updated/corrected and received.

March 11/12 Email regarding variable classification of version 3. Given data is the same (except one/two new variables), the previous classification will hold. Classification of variables sent in email to Donna. Variable classification complete for version 3 (dated Friday - email from Donna).

March 21 data linkage complete (recvd_2019-03-01)

April 9 Confirmation from Donna that Version 2 and 3 can be merged together into one data set.

April to May 22 need to reconcile linkage differences for all linkages conducted for projects that want this data (e.g., Bruce). Waiting for pre-application and to conduct linkage to reconcile differences before provisioning to users for consistency.


Version 4: Pre-application

April 11 Transfer of pre-application file.

April 16 Fan reviewed the file, and verified it. Missing PHNs. Emailed to ask if this is okay. It is.

April 18 Transfer of Research.branch.bcea.NFA_IOP file.

May 22 Data linkage complete.

July 5 New pre-application sent. This file has additional columns.

July 8 File preapp_cases_final has 11 variables versus 13 from validation file. This file has four new variables: office, curroffice, inPayNext3, and openEarlyCode. The following three variables are not provided this time: aptnum, prov, & onianext3.
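A header comparison like the one above can be scripted so that added and dropped variables are listed automatically on each new transfer. The sketch below is illustrative only: the added and dropped names are taken from the note above, while the shared column names (caseid, postcd) are hypothetical placeholders.

```python
def compare_headers(previous, current):
    """Compare two lists of column names and report what changed."""
    prev_set, curr_set = set(previous), set(current)
    return {
        "added": sorted(curr_set - prev_set),
        "dropped": sorted(prev_set - curr_set),
        "shared": sorted(prev_set & curr_set),
    }


# Hypothetical headers: only the added/dropped names come from the log entry.
PREVIOUS_HEADER = ["caseid", "postcd", "aptnum", "prov", "onianext3"]
CURRENT_HEADER = ["caseid", "postcd", "office", "curroffice", "inPayNext3", "openEarlyCode"]

if __name__ == "__main__":
    report = compare_headers(PREVIOUS_HEADER, CURRENT_HEADER)
    for kind, names in report.items():
        print(kind, names)
```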

July 9 This information has been sent to Donna to determine why 13 vs. 11. The data needs to be relinked rather than appended, as the files are different from the previous version.

MOH/HealthIdeas

July 1st week Variable classification of MOH data sets

July 9 Fan has completed the first round of MOH data variable classification. Tav is to go through the process as well.

July 11 Fan added vital statistics in variable classification. Pending Tav to review now.

July 24 Tav reviewed Fan's classification, sent back to Fan for final review (and answer some questions). Will send off to Donna/Brittany as soon as complete for review. Note, Fan and Tav require additional documentation for a few variables where we cannot classify variable placement.

July 25 Sent MOH classification to Brittany and Donna for review (registry, DAD and MSP + vital statistics and pharmacare).

July 26 Brent can classify variables that Fan and Tav could not.

Week of July 30 Tav asked Brittany (via email) how to handle non BC residents in the MOH data sets.

August 7 Direction provided around including BC residents in data. MCC and HCC question regarding when this data would be provisioned, given we do not have updated versions.

August 9 Tim identified the following records may be missing from the MOH file:

  1. Abortion records were excluded from MSP, DAD and Vital Statistics. I think they were excluded by MOH/VSA before they sent us the data (supposedly, Brent can confirm).
  2. All records that are associated with ICBC or WorkSafeBC claim are excluded from our standard MSP Collection (roughly 3.5 percent of MSP records; or 1.6 and 1.9 percent for WorkSafeBC and ICBC respectively). Currently, with WorkSafeBC’s per-project approval, we can release the WorkSafeBC-related records. For ICBC-related records, I don’t know whether they were excluded before or after the MSP data is sent to PopData.

August 9 Email to Jeannette and Kathleen regarding the role of data providers for variable classification, and awareness of how data will be handled in the IDO context when identifying variables for the research ready collection.

August 10 Mike S variable classification of PNET complete and provided. Variable classification almost complete for MOH data sets.

August 13 Brittany provided with clarification question and specificity of how to handle 'business' variables for MOH data set.

Week of August 13 Variable classification completed (Brent, Tav, Fan and Mike S.)

August 16 call with Andrew Elderfield regarding the variable classification of MOH and what stage we are at in the process. Brittany and Gen emailed regarding Andrew's call. Broad question around what role data providers play in variable classification raised. Updated variable classification for MOH sent to Brittany.

August 27 Brittany has provided a list of variables that will be shared with Andrew Elderfield (and his team) that would be re-classified as available in their raw form (e.g., hospital numbers). We are waiting to see what decision is made regarding the classification of these variables. Once complete and approved by the data provider, we will proceed with extracting the data.

September 5 MOH classification reviewed by data provider with the following note:

We will use the revised classification index and only have the options of 3.a.b.c and 99.a,b,c- as attached.
Any variable previously classified as 6.a,b will now need to be changed to reflect the revised index.
We will retain the classification of 3.c Direct Identifier-Replaced for the business/facility identifiers for this data set only. While De-ID guidelines do not protect business identifiers, the wishes of the data provider with regard to their data, supersede this allowance. We will make data set by data set decisions regarding this type of variable as we go along.
Reminder that gender/sex, ethnicity, language, religion or citizenship are NOT considered direct identifiers and can be included in the research content.

September 10 to 26 Batch 1 release preparation, which includes MSP, DAD, consolidation file; Batch 1a: PNET. PNET and NACRs are being pulled from healthideas. Variables that we have not previously facilitated are identified in this variable review process and will be highlighted for discussion.

September 24 Examine variables not found in data extracts (e.g., NACRs).

September 28 Batch 1 release provided. PNET requires removal of dose description. Vital Statistics filter files complete and extraction complete.

October 1 PNET removal of variable (dose description). Communication with Brittany regarding PNET issues, and facilitation of variables we have not previously facilitated, but are on the list as research content for data provided. Plan to conduct round 2 once all data has been provided. NACRs, rpblites, pharmacare data prepared.

October 2 Update to all filter files to ensure consistency and creation of IDD specific filter files. PNET and Batch 2 complete and under review with DAU.

October 4 City fields are not to be suppressed (email from Brittany)

October 9 Email from Brittany regarding the unsuppression of fields that IDD does not agree with. Phone call followup on same day.

October 11 Email from Brittany with an updated variable classification of vital statistics: variables suppressed by Popdata, shown in orange (cities, province, dob, 6-digit postal code, accident location and death location), are to be unsuppressed.

October 15 Tav wrote email to discuss this after reviewing the risk on the fields -- there are full addresses. Brittany communicated the mistake: dob and 6-digit postal codes are not to be provided in their entirety, but truncated as before. However, the variables with street addresses will still be provided. A followup meeting with Beth and Kathleen is set up to discuss.

October 16 Meeting to discuss release of street address information for certain variables.


ROUND 1 of MSP, DAD, Consolidation File, NACRs, PNET, Pharmacare, rpblite, MHS COMPLETE. ROUND 2: update fields provided to include variables not previously facilitated.

LMID

September 9 LMID data transferred (4 files).

September 13 Identified the following issues with the LMID data transfer (emailed Brittany):

1. the file called intake has 23 columns vs 21
2. the file called training has 169811 records vs 168416, and several rows at the end have ‘.’ ’s in the place of values
3. in the file called participant, participated_id is unique; however, some of the ids are negative (e.g., -123456), which is odd for an id.
4. variable labels are too long.
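Checks like the ones listed above (column count, record count, placeholder rows, odd id values) could be run on receipt of each file. The following is a minimal sketch, not Popdata's actual verification procedure; the function name, the sample data, and the intake_date column are hypothetical, and it assumes one record per line.

```python
import csv
import io


def validate_file(csv_text, expected_columns, expected_records, id_column=None):
    """Return a list of human-readable issues found in a delimited file."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    issues = []

    if len(header) != expected_columns:
        issues.append(f"columns: got {len(header)}, expected {expected_columns}")
    if len(rows) != expected_records:
        issues.append(f"records: got {len(rows)}, expected {expected_records}")

    # Rows made up entirely of '.' placeholders (line 1 is the header).
    placeholder_lines = [
        line_no
        for line_no, row in enumerate(rows, start=2)
        if row and all(value.strip() == "." for value in row)
    ]
    if placeholder_lines:
        issues.append(f"placeholder-only rows at lines {placeholder_lines}")

    if id_column is not None and id_column in header:
        idx = header.index(id_column)
        ids = [row[idx].strip() for row in rows if len(row) > idx]
        if len(set(ids)) != len(ids):
            issues.append(f"{id_column}: duplicate values")
        negative = [value for value in ids if value.startswith("-")]
        if negative:
            issues.append(f"{id_column}: {len(negative)} negative value(s)")

    return issues


# Hypothetical sample with a negative id and a trailing placeholder row.
SAMPLE = (
    "participated_id,intake_date\n"
    "1001,2018-01-05\n"
    "-123456,2018-02-01\n"
    ".,.\n"
)

if __name__ == "__main__":
    for issue in validate_file(SAMPLE, 2, 2, id_column="participated_id"):
        print(issue)
```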

September 19 Met with LMID to resolve data issues. They will resend that data.

September 28 LMID data transfer of files again. TB reviewed.

October 16 Verification email sent regarding training file (variable TrProg in training file has lots of white space and extra rows at bottom). Brittany emailed LMID and they confirmed this is okay. Variable classification complete, and linkage to begin.

October 17 Variable classification sent to Brittany.

October 24 Follow up email sent around variable classification. Brittany was okay with classifications. Holding until the DAR for the Mackenzie project is updated to provision this data.

COMPLETE

January 2019 each unlinked record is assigned a unique study id.


Perinatal

November 26 to December 3 Variable classification by Tav and Fan

December 4 initial classification sent to Donna and Brittany

January 4 to 9 Review of metadata dictionary for all data sets including perinatal. Tav noticed some data files and classifications were missing.

January 8 to 14 Tav classified and updated perinatal classification

January 14 Tav sent updated perinatal classification to Donna and Brittany

January 25 Data provisioned and ready for review.

January 29 Tav reviewed perinatal and sent to Tim for final review. Email sent to Brittany regarding receiving documentation (ISA or commitment letter) to release perinatal data.

February 8 Tim reviewed data. Ready for release pending email from Brittany.

February 11 Received letter from Brittany and sent to Maria for review. Data will be released upon Maria's review of the ISA.

MCFD

March 28 Data transfer of files for just data linkage. This includes: actv_stage_7, all_client_export, intake_export, location_history, subsidy files.

March 29 noticed the following issues in the files received:

*The file: subsidy.csv is missing a crosswalk key (icmpid).  We don't know how to walk this file to the other files.
*We also do not know which postal code to use between the location_history.csv file and the subsidy.csv file (having meta-data on the loc_typ will be helpful).
*We also noticed walking the file with the all_client_export.csv file to location_history had a 25% missing rate, and with intake_export a 5% missing rate.
*Finally we noticed only 53% of phn's are available in the client file.
*We also wanted to know what years of data are we expected to have, and what variable in the data files would verify this?

March 29 Response to our questions:

*The file: subsidy.csv is missing a crosswalk key (icmpid).  We don't know how to walk this file to the other files.  ppd_x_contact_num is the crosswalk key
*We also do not know which postal code to use between the location_history.csv file and the subsidy.csv file (having meta-data on the loc_typ will be helpful).
Use the subsidy postal code for ppd_x_contact_num in the clients table and location history of icmpids in actv_stage_7
*We also noticed walking the file with the all_client_export.csv file to location_history had a 25% missing rate, and with intake_export a 5% missing rate.
Any client in subsidy or CYSN services only will not appear in location_history. 
*Finally we noticed only 53% of phn's are available in the client file.
I sent the latest information in our client system (ICM). PHN is not a mandatory field for many of our services.
*We also wanted to know what years of data are we expected to have, and what variable in the data files would verify this?
*Actv_stage_7 (sorry about that name) use cym
*subsidy use iid_svcym
*intake_export use call_date

April 1 Follow up question sent to clarify issue of differing numbers between files: 25% of unique icmpid values in the location_history file do not appear in the all_client_export file, and 5% of those in intake_export do not appear in the all_client_export file. We are wondering how this could be possible?
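The missing rates quoted in these exchanges are the share of unique keys in one file that have no match in another. A minimal sketch of that calculation, assuming the key values have already been read from each file; the function name and the sample ids are hypothetical.

```python
def missing_rate(child_ids, parent_ids):
    """Share of unique ids in child_ids that do not appear in parent_ids."""
    child = set(child_ids)
    parent = set(parent_ids)
    if not child:
        return 0.0
    return len(child - parent) / len(child)


# Hypothetical icmpid values: one of four location ids has no client record.
LOCATION_IDS = ["a1", "a2", "a3", "a4", "a1"]
CLIENT_IDS = ["a1", "a2", "a3"]

if __name__ == "__main__":
    print(f"{missing_rate(LOCATION_IDS, CLIENT_IDS):.0%}")
```

Working on unique ids rather than rows keeps repeated records for the same person from inflating the rate.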

April 5 Follow up response from Donna: Scott has just informed me that the discrepancy issue is caused by a missing family services file. He is assembling this missing file for transfer later today and will also be sending the update to the field definitions document.

April 11 Received next round of data: mcfd_fs_case, mis_fs_list_export, and corrected file. Transfers received.

April 10 Received data dictionary for MCFD from Donna. New file to be sent that resolves the discrepancy identified in family services file.

April 11 FS_FILE_EXPORT sent by MCFD to handle the discrepancy found earlier

April 12 email sent to understand which variable to use as primary residential postal code for data linkage in location_history file or work file. Also question regarding what a variable means (loc_type)

April 16 email from Donna/Scott to use postal code in location_history for data linkage.

April 18 Child welfare data (CYSN) transferred.

April 30 Variable classification sent to Donna.

May 1 Variable classification approved.

May 17 Data linkage and collection ready.

COMPLETE

May 22 Data provisioned.

Data Linkage

June 27 ISA signed with MOH to utilize the Population Directory at Popdata for data linkage of the IDO catalyst projects. June 28 Begin linkage of MED and PSSG. Anticipate PSSG data linkage will be complete July 14th week (assuming no issues in linkage).


PSSG

June 28 Data linkage of PSSG begins. No PHN and Postal code. July 10th week Identified issues with PSSG data linkage. Issues around one-to-many with differing PHNs. Lixiang will produce and document linkage process. Postal code may have reduced some of this, but this is a larger issue with inmates receiving multiple PHN assignments. Multiple records for one individual with equal weighting occurring.

July 14th week Meet to determine how to address linkage issue. Many 1-to-many with high weight candidate, and 1-to-many equal weight candidates (2 to 9). Analysis initially conducted using linkage rules and linkage quality contained individuals with 1-to-many (equal weighting) randomly choosing one in analysis. Next analysis will just focus on those individuals with 1-to-1 (with one obvious highest weighted individual), to determine how good linkage quality is based on linkage rules. Lixiang could randomly choose between equally weighted candidates with multiple PHNs.

Decided to create data linkage rates based on rules with just 1-to-many highest weighted candidate data and compare to all data linkage rates.

July 23 Lixiang completed analysis. We now have a comparison of 1 PSSG-to-many PHNs highest weighted candidate and 1 PSSG-to-many PHNs (equal weight) by subtracting the two. Lixiang is creating a PSSG linked data set where individuals with 1 PSSG-to-many PHNs highest weighted candidate and 1 PSSG-to-many PHNs (equal weightings) will be linked to all PHNs related to that individual (rather than choosing highest weighted individual or randomly selecting one when weights are equal). Simply put, Lixiang is creating a linked PSSG data file where individuals will have multiple PHNs (this will be for 1-to-many highest weight candidates and 1-to-many equal weighted candidates). From here we may decide to choose one individual for 1-to-many high weight candidates, and for 1-to-many equal weight candidates provide all possibilities to researchers. Note the linkage also includes 1 PHN to multiple PSSGs, but this happens far less frequently (n=48 in total, with n=36 for those with 1 PHN-to-many PSSG one high weighted candidate).

July 26 Lixiang will create a linked data set where: for the scenario of 1 PSSG-to-many with one highest candidate, one candidate will be selected (so 1 PSSG to 1 PHN). For the 1-to-many with equal weighted candidates, the data will be deduped. The results will then be examined for the existence of data points in MSP, DAD and the registry files to come to a final conclusion on resolving this issue. Note, the 1 PHN-to-multiple problem will be ignored given how infrequently it occurs.

July 26 Fan is working away on preparing linked files

July 31 Fan provided preliminary results of PHNs that have data in registry, MSP and DAD. There are cases where a PHN does not have any data across the board. Odd that some show up in MSP and DAD, but not the registry in a high number. Majority appear in all three. Conclusion is to consolidate those with 1 PSSG-to-many PHNs equal weighted candidates (~30% of the data) with one IDO id and thus study ID, and for the remaining 1 PSSG-to-many with one highest candidate (~70%), choose one candidate.
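The resolution rule reached above (pick the single highest-weighted candidate where one exists; consolidate equally weighted candidates under one id) can be sketched as follows. This is an illustration of the rule as described in the log, not the linkage code used by Popdata; the function name, input layout, and sample weights are hypothetical.

```python
from collections import defaultdict


def resolve_candidates(candidates):
    """Resolve 1-to-many link candidates.

    candidates: iterable of (record_id, phn, weight) tuples.
    Returns {record_id: {"phns": [...], "rule": ...}} where a record with one
    clear top-weighted candidate keeps that PHN only, and a record whose top
    candidates tie keeps all tied PHNs so they can share one study id.
    """
    by_record = defaultdict(list)
    for record_id, phn, weight in candidates:
        by_record[record_id].append((phn, weight))

    resolved = {}
    for record_id, options in by_record.items():
        top_weight = max(weight for _, weight in options)
        # Exact equality is assumed here; real weights may need a tolerance.
        tied = sorted({phn for phn, weight in options if weight == top_weight})
        rule = "single_highest" if len(tied) == 1 else "consolidated_tie"
        resolved[record_id] = {"phns": tied, "rule": rule}
    return resolved


# Hypothetical candidates: P1 has a clear winner, P2 has a two-way tie.
SAMPLE_CANDIDATES = [
    ("P1", "9000000001", 30),
    ("P1", "9000000002", 12),
    ("P2", "9000000003", 25),
    ("P2", "9000000004", 25),
]

if __name__ == "__main__":
    for record_id, outcome in resolve_candidates(SAMPLE_CANDIDATES).items():
        print(record_id, outcome)
```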

August 9, 2018 PSSG linked data file is complete and available to be released to the SRE projects.

COMPLETE

MED

June 28 Begin linkage of MED. Anticipated MED data linkage will be complete July 23rd week (assuming no issues in linkage).

July 23rd week Harold is working away on data linkage for MED.

July 23rd week Harold is aiming for the end of this week, but given the size of the data set and quality of linkage desired it will take some time. Brent also conducted his resolution code, and was able to link 1-2% more records.

July 26 Harold has identified that the frequency of 1-to-many PHNs occurrence in MED is 0.74%. This is not nearly as large an issue as it is with PSSG. Data linkage is complete. File needs to be consolidated with data.

July 30 Fan is working to prepare MED linked data for SRE.

August 14 Fan has prepared the MED linked file. After review, a few more variables need to be added in (Aboriginal variables that were originally suppressed, but no longer).

COMPLETE.


SDPR version 1

July 27 Begin linkage for SDPR. Note: this data set has two separate data sets with first, middle and last name, with a variable that captures which is the individual's preferred. This variation will be rolled out for data linkage.

August 3 SDPR data has been linked. File to be consolidated with data.

August 14 SDPR linked data set has been prepared for SRE release.

September 14 SDPR legacy linkage is pending. Completion aimed for mid October.

October 15 SDPR linkage complete. Fan to consolidate file.

COMPLETE.

April 2019 Round 2 of data linkage given new version of SDPR to reconcile for new projects.

Last week of April Complete, with SDPR version 2 and 3 reconciled.

SDPR version 2

November 27 SDPR data to be linked. December 11 Data Linkage complete.

COMPLETE.


LMID

October 17 Data linkage to commence. Note no direct identifiers.

October 24 Data linkage complete with 76.48% linkage rate.

COMPLETE.


Perinatal

December 2018 to January 2019 Popdata working on updating existing data linkage with new year of data and updated existing collection to facilitate IDD project request for this data.

January 21-24, 2019 Data linkage complete.

Technological Development and Discussions

Data Ingestion

  • Meeting September 20th outlined the exploration of a coordinated ID hashing sequence for project-specific assignment
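The notes do not say how project-specific ids would be derived. Purely as an illustration of the general idea, a keyed hash (HMAC) gives the same source id a different, repeatable value in each project. Everything in this sketch is an assumption: the function name, the per-project keys, and the choice of HMAC-SHA256.

```python
import hashlib
import hmac


def project_study_id(source_id, project_key, length=16):
    """Derive a repeatable, project-specific id from a source id.

    project_key is a secret byte string unique to one project (hypothetical);
    the same source_id yields unrelated values under different keys.
    """
    digest = hmac.new(project_key, source_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


if __name__ == "__main__":
    # Hypothetical keys for two projects.
    print(project_study_id("12345", b"project-a-secret"))
    print(project_study_id("12345", b"project-b-secret"))
```

Truncating the digest shortens the id but raises the chance of collisions, so the length would need to be chosen against the size of the population.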


Output Checker Development History

  • Project start to end of September 2018: the import and export feature has been disabled. Users who click on the script to import or export files have a warning that appears that states something to this effect 'Transfer is blocked.' Files can be imported or exported, if the user requires it, but this will be a manual process. IDO is exploring the import and export features of the SRE. The current proposition is to audit all exports out of the SRE, but no discussion has been made around importing.
  • Interim solution was implemented by Popdata's SS team in the first week of October 2018. Instructions can be found at: //Gilbert/Alfresco/Systems & Security/SRE+RTL+SRTL/Docs/Catalyst. The import of syntax is disabled and is manually imported by the SS team on request - this feature is to be implemented in the Batch 2 release of the output checker under development.
  • Popdata's existing scanning procedure rules (https://wiki.popdata.bc.ca/popdata/SRE-in-out-software#Block_.2F_Warn_criteria)

Output Checker Technological Collaboration with IDD Summary

Technological documents can be found at: /Alfresco/Data Service Unit/IDO/Technology/OutputChecker
Output Checker On Premise Storage Design - BGDDI Wiki.pdf
OutputCheckingWorkflowReleasePlanning v5.pdf
Output Checking project MVP v1.2.docx
  • DI Sprint Planning and Output Checker Project kickoff meeting August 13, 2018
  • Conducting interviews with Paul August 13 to August 31. Interviews will include existing Popdata SRE users, data checkers and other users in government. Interview list included: WCB users and Sandra (works with Kim).
  • Technological meeting around design specs August 29, 2018
  • Interim solution implemented by Popdata for October 1st week - instructions included and sent to Greg October 17, 2018.
  • Martin Mockmen will be the output checker - pending his training.
  • October 31, 2018 Access for Julie as output checker requested. Tim has sent along privacy training materials, and awaiting an address to send yubikey.
  • MVP components determined (October 16th meeting) - OutputCheckingWorkflowReleasePlanning v5.pdf
  • Storage solution using Minio and Tusd (October 16 meeting). The system should be able to connect to Popdata's PDS/LDAP - Output Checker On Premise Storage Design - BGDDI Wiki.pdf
  • Mockups of output checker design with current Popdata SRE users (Scott Emerson, Elaine Kingwell & Feng, Sandra Peterson and Eric Sayre (from Avina-Zubieta's group)) scheduled for October 22nd week.
  • Setup staging for IDD - October 15 to November 1. (Aidan was granted access for pipeline flow)
  • November 27th week: output checker implementation almost complete (this includes automated scanning).
  • December 2018: finalizing front end look and feel features. Identified issues of not having the ability to input files. This has been requested in future release.
  • January 2019: finalizing and bug fixes related to output checker. Currently a version is working in SRE (staging).
  • February 11 2019 week: OCWA usability testing
  • February end of month: deployment of new OCWA tool
  • May: deployment of OCWA in popdata environment. testing will be underway with research teams

SRE Access and installation of Tools

  • Note: To access the SRE, users do undergo the traditional training all SRE users get (this involves training around import and export of files to and from the SRE). The materials will need to be modified depending on the decision around the use of the SRE long term, and the decision around the import/export features.
  • Popdata import feature is enabled (January 30th, 2019)
  • Special export feature developed by Popdata for IDD. This feature ensures all outputs are vetted by an IDD individual before release.

Catalyst Request for SRE packages

August 14, 2018 Request to install software by Brittany for the Mackenzie project in the SRE.

  • Text editors: Could we install sublime text and VIM
  • Pentaho Data Integration Community version (ETL)
  • WEKA --a data mining tool (free) (to create models of various types: regression, decision trees, clustering, etc)
  • We'll also need a plug-in for Pentaho (to import model from WEKA into Pentaho) ... but let's first get Pentaho, then I can create screen shots to show from where and how to get the plug-in

October 2, 2018 Ryoko installed packages requested:

  • Sublimetext, VIM, Pentaho Data Integration Community version (ETL), WEKA. Plug in for Pentaho required further instructions, and emailed Brittany for further guidance.

October 5, 2018 Brittany provide additional request for packages for Pentaho.

  • Executor R
  • Cython (Python)
  • Data Mining (Weka)

October 17, 2018 Ryoko attempted to install packages, but this was not easy without modifying config files extensively. She emailed Brittany for further guidance from the team that uses it.

January 2019 Exploring use of a postgres database in SRE. Have to work with user access permission issues. Gitlab also installed (however user permission issues need to be resolved). Installed tools that allow users to write own code over package use in R.

February 2019 Deployment of postgres to SRE environment.

March 2019 Development of gitlab and chat features. This will include a one-stop shop for users to get updates, chat, download code etc., (called workbench).

Metadata

MED: researchers and IDO have access to Metabadger. Greg will be using this to import data into PROOF.

PSSG: received and pushed with data to SRE (location will be in redzone to push data and metadata together to SRE more efficiently).

SDPR version 1 and 2: received

MOH: Provide data dictionary using our automated system.

LMID: Provided - this includes variable options (e.g., True, False etc.,).

For all data sets: we need to provide metadata around variables that may be suppressed, added (due to linkage), or modified (IDs) to Greg for PROOF. Current methodology for consideration is a package built for Python called tableschema (website: https://frictionlessdata.io/specs/table-schema/ and code: https://github.com/frictionlessdata/tableschema-py). Tableschema produces a .json output schema of all variable fields, type, format, and missing values. However, this output is only as good as its ability to infer fields. So for instance if a string variable has missing data entered as numeric, then the field will be inferred as 'numeric.' Updating the metadata is possible, but has to be done manually.
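The inference pitfall described above can be illustrated with a simplified stand-in. The sketch below does not use the tableschema package and is not its inference algorithm; it only mimics the idea of guessing a type from the share of values that parse as numbers. The function names, the sample column, and the use of -9 as a missing-value code are hypothetical; the descriptor layout (fields, missingValues) follows the Table Schema specification.

```python
import json


def _is_number(value):
    """True if the text parses as a float."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def infer_field_type(values, missing_values=("",), confidence=0.75):
    """Guess 'number' or 'string' from the non-missing values of one column."""
    present = [value for value in values if value not in missing_values]
    if not present:
        return "string"
    share_numeric = sum(_is_number(value) for value in present) / len(present)
    return "number" if share_numeric >= confidence else "string"


def build_descriptor(columns, missing_values=("",)):
    """Build a Table Schema style descriptor from {column_name: [values]}."""
    return {
        "fields": [
            {"name": name, "type": infer_field_type(values, missing_values)}
            for name, values in columns.items()
        ],
        "missingValues": list(missing_values),
    }


# Hypothetical string column where missing data was entered as -9.
SAMPLE_COLUMNS = {"office_code": ["-9", "-9", "-9", "A12"]}

if __name__ == "__main__":
    # Without declaring -9 as missing, the column is guessed to be numeric.
    print(json.dumps(build_descriptor(SAMPLE_COLUMNS), indent=2))
    # Declaring the missing-value code recovers the string type.
    print(json.dumps(build_descriptor(SAMPLE_COLUMNS, ("", "-9")), indent=2))
```

Declaring missing-value codes up front is one way to reduce the manual corrections mentioned above, though any inferred schema would still need review.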

Meetings

May 15, 2018

attendees: Brent, Jim, Tim and Tav, Janet, Jeremy, Shawn(?), Donna and Greg

Issues raised: not a privacy incident if certain variables are released to SRE and later deemed to be suppressed due to its sensitive nature given these projects are catalyst projects. Some debate, but agreed to identify any more variables for suppression that may contain identifying information that would be a concern (such as aboriginal content etc.,), and to have those reviewed via Donna by data providers.
Determine a mechanism by which researchers in the SRE can provide feedback if a sensitive data variable is present in the data or notice a problem in the data (add feedback loop). Provide a summary of the data intake to date (for each data set, MED and PSSG so far) with Donna (like a post-mortem assessment) of what went well and what didn't, to look for improvements. Focus on creating processes for data intake and SRE review for catalyst projects.


Jim: end of the week SRE output mechanism will be working.

Janet: working with Health regarding data and linkage.

Donna: PSSG data has undergone 2 rounds of review before the data was provided to IDO and Popdata. Believe ready to be released to researchers once Popdata's classification questions are answered.

Greg: 'glass wall environment' almost complete, geocoding license in review, Viggo(?) shared on github not that EM algorithm implemented.

Jeremy: expectation 16 projects once linked. Project list in 6 weeks.

Brent/Tav: provide 2nd round of variable classification based on values/categories within variables.

Tim: DAR received and signed for MED and PSSG. Privacy training will be complete for the two researchers.

May 24, 2018

attendees: Tav, Donna, Greg and Paul

Discuss metadata for PSSG and other data sets. Greg would like us to use: https://github.com/frictionlessdata/tableschema-py I told Greg I would explore this, but not sure if it will support flat files. Greg is interested in getting the .json output, but specific information also around table schema etc., confused about this, but will review output to see what this provides.

Greg wants name of fields, expected data type (file type), encoding of file, and variable descriptors if possible.

Greg will follow up with Brent about a conversation around digital object identifiers.

June 5, 2018

attendees: Gitta, Jim, Tim, and Tav; Donna, Greg, Janet, Gene, Jeremy, and Donna

Want to have a meeting with Brent and Jim around the use of SAIL and their appliances. Requested to have Addendum (FAQs), org chart, what projects IDO is working on, Project Management around development of different IDO tools that require Popdata's input. Discussed roles and responsibilities, and when Popdata needs to be involved and provide suggestions/guidance given the IDO tools are also being built for Popdata.

Donna: SDPR data has been extracted. Donna will be working on project management, and what current projects IDO is working on that might impact Popdata.

Greg/Jeremy: Harold sent information along, moving forward on Liggo.

Greg: 'glass wall environment' almost complete.

Jeremy: identify next stream of catalyst projects. Most likely get details in July, and DAR requests in August.

June 5, 2018

attendees: Tav and Donna

SDPR data is extracted and excel document for variable classification should be coming. It is expected that Donna and Popdata will be doing the variable classification for SDPR. We discussed creating a document that had our agreed upon variable classification (including examples) to be given to data providers so they can do variable classification, both of field names and field content. Data providers would be given an additional variable classification option of 'suppress,' which would be open for discussion with IDO as to why they would not be provided data on a particular field, or if they do, why we would suppress it on our end. This process will save us time in the long run and back and forth. Tav will follow up with sending along our current used and recommended modifications to the classification list.

In regards to SDPR data transfer, still waiting on Rob to make a decision. The hang up with wanting to use Popdata's secure website has to do with encryption, where encryption takes a long time on their large files. Tav communicated regardless of method, Popdata would still prefer to have all files encrypted no matter what the mode of transfer is (as per Jim's guidance).

Donna did note that fields that are aboriginal specific would be suppressed, but fields whose content includes aboriginal values would not, given the item is not aboriginal specific.

In terms of data linkage, in process of signing ISA with MOH, and should receive that by end of June. So anticipating data linkage for July.

Donna has tracked the issues in the PSSG file - ask for postal codes or alias file for data linkage for next intake, along with missing EVENTID. We need a document between us of improvements for variables on the next intake (Tav).

June 14, 2018

attendees: Tav and Donna

Discuss plan for the SDPR data set and legacy data. SDPR did not provide number of files or nrows per file for Popdata to verify. Popdata will conduct this, and send this information to Donna. Donna will then ask SDPR to verify we have the correct number of files and rows. Tav communicated our priority is to get MED prepared and released to the SRE. As we await a sign off in the variable classification for MED, we will conduct verification of receipt, and begin variable classification June 15.

Legacy data 2005-2012, 2012-2017 will look similar and be comparable in size to the data we received for 2017/18. The legacy data (prior to 2012) will be different and Donna anticipates it will be smaller in size. There is a difference between legacy and the current data, and thus may require its own variable classification.

It is important to note that this data file has NO DIRECT IDENTIFIERS. Linkage can transpire through the X_CONTACT_NUM to the MCFD file, and then perhaps further linkage via MCFD. They have requested direct identifiers.

Regarding PSSG, we discussed allowing us to have access to develop a protocol for extracting and massaging the data to fit our needs, rather than hiring a third party to do this.

Overall: wanting to move to a research-ready collection that is standardized, with no special-request variables for government individuals. This may be different for researchers. The focus this year will be on government individuals. Given the varying levels of data they have received, we may have to re-receive all the data, as the second round will be more data than they have currently received.

July 10 and 11, 2018

attendees: Greg, Aidan, Jeannette, Brent, Jim, and Tav

IDO visit. Purpose was to discuss and create collaborative development projects. We discussed data management and flow, metadata, and the SRE. Produced a list which included: implementation of NiFi for data intake/ingestion, Popdata generating metadata for existing data sets for IDO catalyst and populating atlas/BC catalogue/workbench, use of AppTracker and DARonline for IDO, Ligo probabilistic implementation and creation of a fake data set for validation, and implementation of Jupyter notebooks (and eventually workbench) in the Popdata SRE.

July 16, 2018

attendees: Tav and Gen

Discussion of project list. IDO prioritized the list, and would like Popdata to do the same. Question regarding the role LinxMart and PPRL play with Ligo. Also identified a priority of describing and understanding existing SRE users.

July 23, 2018

attendees: Tav and Brittany

Discussed asking PSSG why multiple PHNs are assigned to inmates, and whether this is an ongoing or historic issue. Conclusion regarding SDPR non-BC residents. Require sign-off on the SDPR Excel document before processing; in Donna's absence, Brittany will follow up with Gen to get this sign-off and email.

August 1, 2018

attendees: Tav and Brittany

Decision to not suppress aboriginal data; this will be controlled during the SRE output phase. Note this decision is interim for the Mackenzie project and can change. Clarification around suppression of small geographic regions and variables at risk of having sensitive content. Kathleen and Jeanette will let Popdata know what to do with this.

October 16, 2018

attendees: Tav, Brittany, Beth and Kathleen

Purpose of this meeting was to discuss the release of addresses in certain variable fields in vital statistics, provide context, and inform a decision. Conclusion from the meeting is that the identified variable fields with addresses are okay to be released: the risk has been assessed, and the fields are not deemed individual identifiers, as one cannot be sure that the accident location or death location is the actual address of the individual.

November 8, 2018

attendees: Tav, Brittany, Dan

Purpose of this meeting was to discuss linkage of Stats Canada data (tax data aggregated for individuals at the community level, and for individuals and families at the community level). Received by IDD in two separate forms: individuals by community (2000 to 2016), and individuals and families by community (2004 to 2016). The files include Stats Can geographic identifiers and 6-digit postal code. Have to determine how to provide the data (study id to each year, or can we walk from the consolidation file to their files?). Total of 16 files x 2 by year.

November 26, 2018

attendees: Tav, Dan, Donna

Purpose of this meeting was to discuss the issues with the PSSG data. The number of records in the offender file is greater than the number that appear in the remaining files (the drop-off is significant, >50%).
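The size of the drop-off can be measured as the share of offender-file identifiers that never appear in another file. The sketch below is an assumption about how one might check this, not the method used in the meeting; drop_off is a hypothetical helper and it assumes both files carry a common identifier.

```python
def drop_off(offender_ids, other_ids):
    """Share of distinct offender-file IDs that are absent from other_ids.

    Assumption: both inputs are iterables of the same identifier type.
    Returns a fraction between 0.0 and 1.0.
    """
    offender = set(offender_ids)
    if not offender:
        return 0.0
    missing = offender - set(other_ids)
    return len(missing) / len(offender)
```

A result above 0.5 would correspond to the >50% drop-off noted above.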

November 26, 2018

attendees: Tav, Brent, Jim, Aidan, Greg

Purpose of this meeting was to discuss technological priorities. Focus was on project-specific IDs: coordinating ID replacement by project, and how to sustain a large number of projects that use the same data and each require these project-specific replacements.
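One possible way to sustain many projects on the same data is to derive each project's IDs from a per-project secret key instead of storing a separate crosswalk per project. The sketch below shows that idea with a keyed hash (HMAC-SHA256); it is an illustration only, not a description of how Popdata or IDO replace IDs, and project_study_id and the key values are hypothetical.

```python
import hashlib
import hmac


def project_study_id(person_id, project_key):
    """Derive a stable, project-specific pseudonym for person_id.

    person_id: str, the internal identifier (hypothetical).
    project_key: bytes, a secret key held per project (hypothetical).
    The same person gets the same ID within a project and an unrelated
    ID in any project with a different key.
    """
    digest = hmac.new(project_key, person_id.encode("utf-8"), hashlib.sha256)
    # Truncated to 16 hex characters for readability; keep more for very large populations.
    return digest.hexdigest()[:16]
```

The trade-off is that IDs can be regenerated on demand from the key, but anyone holding the key can recompute them, so the keys would need the same protection as a crosswalk file.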

November 27, 2018

attendees: Tav, Noushin, Donna

Purpose of this meeting was to discuss linkage of Stats Canada data after a sample was provided. Postal code is not available, and we need to determine whether census tract can be used to link individuals. The goal is to link not only for the Wilmer project, but for other data sets.

December 6, 2018

attendees: Tav and Dan

Connected with Dan regarding the Stats Canada data file. Postal codes are provided for rural areas, but for non-rural areas proprietary codes are provided (these may be census tracts, but he needs to confirm). Stats Canada is not allowed to share postal codes of urban areas, hence the proprietary codes. Dan has two options: 1) determine if the PCCF+ software can be used to merge the files using the census tract with our postal codes, or 2) purchase a special cut of the data from Stats Canada at the level of aggregation he requires, which will take 2-3 months or longer. Based on this, he wants to go with option 1 if possible. He will contact Canada Post to determine the cost of licensing PCCF+ (or whether UBC licensing can be used). Tav made Dan aware of the commercial vs. non-commercial licensing of PCCF+: IDD may charge users to access their data, and Tav was not sure what this would mean for licensing.

December 19, 2018

attendees: Tav, Noushin and Dan

Require a file called GAF to use PCCF+ with the data files Dan has. Note: urban areas have census tracts but no postal code, while rural areas have postal codes but no census tract. They would like to have all postal codes related to a census tract. Dan would like Tav to connect with Angela at Canada Post (number provided) to determine if the UBC license can be used for their project.

January 10, 2019

attendees: Tav, Brittany

Discussion of provisioning data for Warburton 18-g03 and Wilmer 18-g02. Note perinatal is a one-time request for Warburton 19-g03, but is not part of the IDD collection. Discussion of front counter and use of DARonline in the coming months.

January 16, 2019

attendees: Tav, Donna

Finalized variable classification. Discussed that for MED, SDPR, PSSG and LMID, Donna would create a data dictionary similar to what Popdata provides to SRE users to read in files. Popdata will add in starts and stops once these are completed by Donna.
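As an illustration of how starts and stops let a user read in a flat file, the sketch below slices one fixed-width record using a small dictionary. The variable names and column positions are hypothetical and do not come from any actual MED, SDPR, PSSG or LMID layout; positions are assumed to be 1-based and inclusive.

```python
# Hypothetical dictionary: variable name -> (start, stop), 1-based inclusive columns.
DICTIONARY = {
    "studyid": (1, 10),
    "birth_year": (11, 14),
    "region": (15, 17),
}


def with_lengths(dictionary):
    """Add the field length (stop - start + 1) to each dictionary entry."""
    return {
        name: (start, stop, stop - start + 1)
        for name, (start, stop) in dictionary.items()
    }


def read_record(line, dictionary):
    """Slice one fixed-width line into {variable name: value}, trimming padding."""
    return {
        name: line[start - 1:stop].strip()
        for name, (start, stop) in dictionary.items()
    }
```

Keeping starts and stops in the dictionary and deriving lengths from them avoids the three numbers drifting out of agreement.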

January 31, 2019

attendees: Tav, Tim, Bryonny, Brittany, Sue

Discussion on the questions and flow chart for researchers when navigating access to DIP data, data held at Popdata and BC stats.

Process Improvement Ideas

Issues: how to get data sets in 'low risk' category for high volume push

Can we push through low-risk data sets in high volume? (This almost goes back to the idea of DAR approvals, where data stewards would identify low-risk projects and have them pushed through without vetting. How can we do that here?)

Option 1: get data providers to do all work up front and prepare data to be sent to IDO on what they agree upon (low risk data sets)

Option 2: initial intake of all data sets is slow at first, until a low risk data set is identified (variable and field classification process is required, metadata etc.,), and then every year that data set can be appended. Who would do this? Popdata to IDO, IDO to data provider? This has been our current method with 2/3 days of effort, and time gaps in between getting responses.

Can we find a way to have IDO/someone review the 60+ data sets coming in and identify commonalities between the data sets in terms of suppression of variables and variable fields (e.g., all agree on small geographic block suppression, aboriginal data, etc.)? Perhaps we can create an intake form (e.g., fluid survey) that asks all data providers giving data to IDO for more information on whether or not they suppress data, based on a list. The list could be pre-defined from the data fields we currently suppress: our direct identifiers, and indirect ones (aboriginal, small geographic areas, dates that can lead to DOB). We could allow them to list all the variables they would want suppressed if our list was not comprehensive enough. In this intake form, we could also ask whether they have metadata (if so, upload it here), what format their data is in, etc.
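If the intake form collected each provider's suppression choices as a list of categories, the commonalities could be found with a simple tally. The sketch below is hypothetical: the function name, category labels and response format are assumptions about what such a form might return.

```python
from collections import Counter


def common_suppressions(responses, min_share=1.0):
    """Return categories suppressed by at least min_share of providers.

    responses: {provider name: iterable of suppressed categories} (hypothetical format).
    min_share: 1.0 means every provider agrees; 0.5 means at least half.
    """
    counts = Counter()
    for categories in responses.values():
        counts.update(set(categories))  # count each provider once per category
    total = len(responses)
    return sorted(c for c, n in counts.items() if total and n / total >= min_share)
```

Categories returned at min_share=1.0 are candidates for a default suppression rule; the rest would still need provider-by-provider review.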


SRE feedback process during catalyst projects

Require a feedback mechanism by which researchers can flag sensitive variable types or fields that were missed during the variable classification process, so they can be removed. IDO will be doing an audit of the SRE export, but perhaps an audit of the data set is also required.


Bits and Bobs

How to track long term what variables are suppressed by dataset?

How to update the metadata file when files received are flat, so they can be read in correctly? Should we consider running our own extract tool? (Tav: how do we currently produce this metadata in step 16b(3)?)