SRE-in-out-software




Background: SAMBA on server Fraser.popdata.bc.ca

The SAMBA file service runs on the Linux server Fraser and appears as the R: drive on SRE virtual machines. See "Configuring Fraser" in Secure Research Environment.

R drive - user and group folders etc

Since 2011-April, R: is \\fraser\sre\PROJECT_DIR , which contains Data for the project, and the project-specific user folders for all member researchers. User folders contain a folder TRANSFER, itself containing user-specific folders EXPORT_FROM_SRE and IMPORT_TO_SRE . This is achieved through a lot of user-specific unix symbolic links, together with suppression in Samba service of unix extensions, so the symlinks are processed on Fraser and transparent to Windows clients.

Symlink under R: | Target | Purpose
.login/ | /data/sre/.login/ | Log logins
.scripts/ | /data/sre/.scripts/ | Scripts for login and transfer
.trigger-in | /data/sre/IMPORT_TO_SRE/.trigger/ | Creation here by IMPORT-IT triggers yellowfolder transfer
.trigger-out | /data/sre/EXPORT_FROM_SRE/.trigger/ | Creation here by EXPORT-IT triggers yellowfolder transfer
USER/TRANSFER/IMPORT-IT.lnk | /data/sre/.scripts/transfer-in.lnk | -> \\SREfiles\sre\.scripts\transfer-in.pyw
.../transfer-in.pyw | transfer-in.py | ".pyw" triggers suppression of black diagnostic window
USER/TRANSFER/EXPORT-IT.lnk | /data/sre/.scripts/transfer-out.lnk | -> \\SREfiles\sre\.scripts\transfer-out.pyw
.../transfer-out.pyw | transfer-out.py | ".pyw" triggers suppression of black diagnostic window
USER/TRANSFER/IMPORT_TO_SRE | /data/sre/IMPORT_TO_SRE/USER | "inside" folder for import
USER/TRANSFER/EXPORT_FROM_SRE | /data/sre/EXPORT_FROM_SRE/USER | "inside" folder for export
R:\IMPORT_TO_SRE\.trigger | | real trigger directory
R:\.trigger-in\ | /data/sre/IMPORT_TO_SRE/.trigger | symlink in each project
  • Note that transfer-in.lnk and transfer-out.lnk open the versions of transfer-in.pyw and transfer-out.pyw on Fraser (a.k.a. SREfiles). The ".pyw" extension suppresses the command window in which error messages would otherwise be seen.
  • Example: the user directory for mguhn-11-s01 is /data/sre/11-s01/mguhn-11-s01/ ; it contains subdirectory TRANSFER (modifiable only by root), which contains links to user-owned directory /data/sre/EXPORT_FROM_SRE/bforer-11-s01 / ...
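
The symlink layout in the table above can be sketched as follows. This is only an illustration, not the provisioning script actually used on Fraser: the function name `build_transfer_links` is hypothetical, and the demo runs against a temporary directory rather than /data/sre.

```python
import os
import tempfile


def build_transfer_links(root, project, user):
    """Create PROJECT/USER/TRANSFER with symlinks to the real per-user folders.

    Hypothetical helper: mirrors the table rows
    USER/TRANSFER/IMPORT_TO_SRE -> ROOT/IMPORT_TO_SRE/USER (same for export).
    """
    created = {}
    transfer = os.path.join(root, project, user, "TRANSFER")
    os.makedirs(transfer, exist_ok=True)
    for real in ("IMPORT_TO_SRE", "EXPORT_FROM_SRE"):
        target = os.path.join(root, real, user)   # user-owned "real" folder
        os.makedirs(target, exist_ok=True)
        link = os.path.join(transfer, real)       # what the user sees under R:
        if not os.path.islink(link):
            os.symlink(target, link)
        created[link] = target
    return created


if __name__ == "__main__":
    # Demo in a scratch directory standing in for /data/sre.
    with tempfile.TemporaryDirectory() as root:
        links = build_transfer_links(root, "99-t01", "testuser1-99-t01")
        for link in links:
            print(link, "->", os.readlink(link))
```

Because Samba resolves the links on the server, the Windows client only ever sees ordinary folders under TRANSFER.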

NOTE: the home directory (a.k.a. windows profile) is still (as of 2011-09-26) c:\Users\USER , but "soon" will be changed to a more rational u: (R:\USER), so users see their data and settings regardless of which SRE they happened to be logged into.

Yellow Folders - transferring data in and out

See “SRE File Transfer Monitoring Procedures 2011 12 06.docx” and "Incident Response Procedures 2012 04 19 FINAL.docx" under Alfresco on Gilbert "P:\Privacy and Policies\Policies\Information Security Incident Management Policy\". Ask Caitlin or RLU to look up the Data Stewards' requirements for a particular project.

In my interpretation (Denis), all blocked transfers should be reported the same day to "Privacy Lead" (Caitlin), but before concern about a file name or size (whether from an automated email or from inspection of transfer records) gets escalated to be a suspicion to be forwarded to Privacy Lead, it may be necessary to examine the file in question.

File shares - Inside and outside view

Inside SRE | Outside SRE | Saved copy
R: = \\fraser\sre\EXPORT_FROM_SRE | Z: ?= \\fraser\USER\EXPORT_FROM_SRE | /home/saved/EXPORT_FROM_SRE
R: = \\fraser\sre\IMPORT_TO_SRE | Z: ?= \\fraser\USER\IMPORT_TO_SRE | /home/saved/IMPORT_TO_SRE
The outside view is implemented by placing symlinks in the user's Unix home directory on Fraser
(ex: /home/testuser1-99-t01/EXPORT_FROM_SRE => /data/share/outgoing/testuser1-99-t01).
  • Note the connection between inside and outside is that yellowfolder.pl runs and transfers files after inspection.

Transfer software

  • At boot-time on Fraser, /etc/incron.d/yellowfolders configures inotify, which runs /usr/local/sbin/yellowfolder.pl whenever a directory is added to a trigger directory ".trigger" under either EXPORT_FROM_SRE or IMPORT_TO_SRE.
    • Warning! Encrypted filesystems get mounted late and manually, so afterwards you must re-run /etc/init.d/incron restart .
    • To detect the problem: /etc/init.d/incron status may show messages like "cannot create watch for system table yellowfolders: (2) No such file or directory".
    • If users request transfers between reboot and restart, trigger directories must be deleted and re-created.
      diagnostic: cd /data/transfer && find */.trigger* -mindepth 1 -type d -ls
      on SRE: cd /data/sre && find *PORT*SRE/.trigger -mindepth 1 -type d -ls
  • When a user has populated their personal subdirectory of EXPORT_FROM_SRE or IMPORT_TO_SRE, they double-click the Windows shortcut EXPORT-IT.LNK or IMPORT-IT.LNK. This runs, on the virtual Windows machine, the script \\SREfiles\sre\.scripts\transfer-out.pyw or transfer-in.pyw respectively (in folder .scripts), which creates a subdirectory named after the user under the .trigger folder.
  • On Fraser, inotify runs yellowfolder.pl which performs the transfer ...
    • ... after checking that no rules are violated.
    • after successful transfer, transferred files are moved to a folder named "Transferred_YYY-mm-DD_hh.MM.dd". This is done by asking rsync to delete all the files in the remote directory, and move them to a backup directory of the above name.
      • 2014-12-16 DL: To give ownership of "Transferred" backup directory back to the user, /etc/rsyncd.conf on Fraser contains a couple of entries
        pre-xfer exec = /usr/local/sbin/yellowfolder-rsync-transfer-import-pre [leave file in /tmp/yellowfolder-rsync-pre#{PID#}]
        post-xfer exec = /usr/local/sbin/yellowfolder-rsync-transfer-import-post [read file above, chown to user, delete tmp file]
      • 2016-04-13 Since 2016-02-15,18:15 rsync tmp files are not getting deleted, and ownership not changed
  • 2015-12-22 block all archives (except .xlsx and a few other exceptions); reduce size limits over Christmas (warn == block); add "binary" files to warn list (unfortunately this includes Office lockfiles)
  • 2016-01-04 restore size limits; skip "~$*" (Office lockfiles), "._*" (Apple resource forks), ".DS_store" (Apple folder options).
    Note that many users re-open Office files after copying them to the EXPORT_FROM_SRE folder, thus creating a lockfile.
  • 2017-08-29 yellowfolder.pl warns about 10-digit studyID pattern found in files (using grep [a-z1-2]0{2}[0-9]{7}); see also yf-find-studyids
  • 2017-09-12 Exclude (silently leave behind) files named "~$*" (MS Office lock file), "._*" & ".DS_Store" (macOS aux. files), Thumbs.db (Windows aux. file), ".RData" (R temporary data file).
    These are usually hidden files, probably included without the knowledge of the researcher exporting or importing them.
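
The 2017-09-12 exclusion rule can be illustrated with a few lines of Python. This is a sketch only: yellowfolder.pl is a Perl program and its actual matching code is not shown on this page; the name `is_left_behind` and the choice of case-sensitive matching are assumptions.

```python
from fnmatch import fnmatchcase

# Name patterns taken from the 2017-09-12 entry above.
EXCLUDED_PATTERNS = ["~$*", "._*", ".DS_Store", "Thumbs.db", ".RData"]


def is_left_behind(filename):
    """Return True if the bare file name matches an excluded pattern.

    Hypothetical helper; matching is case-sensitive here (an assumption).
    """
    return any(fnmatchcase(filename, pattern) for pattern in EXCLUDED_PATTERNS)


if __name__ == "__main__":
    for name in ["~$report.docx", "._photo.jpg", "Thumbs.db", "results.xlsx"]:
        print(name, "skipped" if is_left_behind(name) else "transferred")
```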

Block / Warn criteria

Updated 2018-10-24 from source code yellowfolder.pl & yellowfolder.conf & scan-studyids.py; see also https://my.popdata.bc.ca/html/SRE/working/file_transfers.html#blocking

SRE | Warn | Block
Import size >= | 10M each, or total 10M | 20M each, or total 50M
Export size >= | 3.5M each, or total 7M | 5M each, or total 10M
File extension | .sav .dbf .dat .csv .bin .egp | Archives; .com .exe .dll .scr .sas7bdat .spv .rda .dta
Exceptions: MS-Office "x" files are all Zip, but allowed. Same for .egp
File content | | matches exactly a file in R:\Data ; encrypted files; contains StudyID
File name | matches exactly a file in R:\Data |
  • Ex: export blocked if the total of all files exceeds 10MB, or a single file exceeds 5M.
  • Archive formats (.zip .rar .7z .gz .tar .jar .tgz .tbz .bzip2) are blocked because they are too hard to scan, and may even be encrypted. A .JAR (Java ARchive) often contains data along with code in zip format; the same goes for .SPV (SPSS report archive).
  • .RDA is zipped R data; .sas7bdat is binary uncompressed, so strings can be found in it.
  • Content scanning uses "egrep", except that PDFs are scanned with "pdfgrep"; TIFF images are skipped because they give too many false positives, and archive formats (xlsx) are skipped until we develop a custom scanner.
    • StudyID pattern is 10 digits, the first of which may be a letter
    • Should scan for dates - many dates are an indicator of personal data.
      Date format examples: 26DEC99 , 19991231 , 1999-12-31 , 12/31/1999 . 12/31/99 , ...
    • Should scan XLSX files with xlsx2csv, or custom scanner based on python package
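
The size thresholds in the table can be read as the following decision rule. This is a sketch, not the logic of yellowfolder.pl or yellowfolder.conf: the names `LIMITS` and `classify_by_size` are hypothetical, "M" is assumed to mean 2**20 bytes, and the comparison uses ">=" as in the table header.

```python
MB = 1024 * 1024  # assumption: "M" in the table means 2**20 bytes

# (per-file limit, total limit) for each verdict, copied from the table.
LIMITS = {
    "import": {"warn": (10 * MB, 10 * MB), "block": (20 * MB, 50 * MB)},
    "export": {"warn": (3.5 * MB, 7 * MB), "block": (5 * MB, 10 * MB)},
}


def classify_by_size(direction, sizes):
    """Return "block", "warn" or "ok" for a list of file sizes in bytes."""
    largest = max(sizes, default=0)
    total = sum(sizes)
    # Check the stricter verdict first so that "block" wins over "warn".
    for verdict in ("block", "warn"):
        each_limit, total_limit = LIMITS[direction][verdict]
        if largest >= each_limit or total >= total_limit:
            return verdict
    return "ok"


if __name__ == "__main__":
    print(classify_by_size("export", [4 * MB]))        # one 4M file
    print(classify_by_size("export", [3 * MB] * 4))    # four 3M files
    print(classify_by_size("import", [1 * MB]))        # one 1M file
```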

OTRS

  • Note that blocking or warning triggers an email to sre@popdata (and an OTRS ticket) from root@Fraser, subject "yellowfolder: ..." ; in addition blocking is brought to the attention

Manual transfer to override blocking by yellowfolder.pl

  • Log in to the appropriate server for the project (groupinfo PROJECT).
  • Review all files present in /data/sre/{PROJECT}/{USERNAME}/TRANSFER (or in the "real" folder under /data/sre/EXPORT_FROM_SRE/ or /data/sre/IMPORT_TO_SRE/).
  • Run as root: yellowfolder.pl ( outgoing | incoming ) {USERNAME}
  • Notify the user and merge the resulting ticket with the original ticket.

Blocking by name or content

  • If a file's name matches exactly a name in the MD5 database, a silent warning is issued.
  • If a file's content exactly matches a file in the MD5 database, the transfer is blocked.
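
The two rules above amount to a checksum lookup followed by a name lookup. The sketch below illustrates the idea only: the real check is done in Perl against the ".md5" files, whose format is not shown on this page, so the database is represented here by two plain Python sets (an assumption), and `md5_of` and `check_against_database` are hypothetical names.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_against_database(path, known_md5s, known_names):
    """Return "block" on a content match, "warn" on a name match, else "ok".

    known_md5s and known_names stand in for the MD5 database (assumed shape).
    """
    if md5_of(path) in known_md5s:
        return "block"
    if os.path.basename(path) in known_names:
        return "warn"
    return "ok"
```

Checking content before name means a renamed copy of a protected data file is still blocked.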

Blocking by filename extension (or file type)

extension description
.tar tar archive
.tbz tar archive bzipped
.tgz tar archive gzipped
.zip Zip archive
.7z 7-zip archive
.bzip2 compressed bzip2
.com windows executable
.dll windows executable
.dta STATA data
.exe windows executable
.gz compressed gzip
.mso_crypt maybe encrypted MS Office
.rar rar archive
.rda R gzipped data
.sas7bdat SAS binary data
.scr windows screen-saver executable
.jar Java archive may contain data
.spv SPSS report may include data ("magic" reports as JAR)

Warning by filename extension (or file type)

extension description
.sav SPSS data
.dbf DBase data
.dat data
.csv comma-sep data
.bin binary data?
.egp SAS Enterprise Guide Project (code archive)

File type is calculated from file content using "magic" database via combined output of "file --mime-type" and "file" (after stripping extra details starting with comma or semicolon).

mime-type ("file" output) deemed extension
application/zip (Zip archive data) .zip
application/x-gzip (gzip compressed data) .gz
application/x-rar (RAR archive data) .rar
application/x-7z-compressed (7-zip archive data) .7z
application/octet-stream (SPSS System File TICS) .sav
application/octet-stream (SAS 7+ data file) .sas7bdat
application/x-dosexec (PE32 executable (GUI) Intel 80386) .exe
application/octet-stream (data) .bin Ex: file of NULs
application/CDFV2-encrypted (Composite Document File V2 Document) .mso_crypt Possibly encrypted MS Office file

Exceptions: some deemed file types will be ignored if the actual filename extension is listed below

actual extension deemed type to be ignored in favour of actual extension
.xlsx .zip (Probably flaky spreadsheet generated by SAS)
.xps .zip (MS-only equivalent of PDF)
.egp .zip (SAS Enterprise Guide Project (zip of SAS code files))
.xls .mso_crypt (all ".xls" files found with "CDFV2-encrypted" were not encrypted, but ".xlsx" were)
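
The interplay between the "deemed" type and the exception list can be sketched as below. This is an illustration under assumptions, not the Perl code in yellowfolder.pl: the function takes the outputs of "file --mime-type" and "file" as arguments instead of running them, only some rows of the table are included, and the names `DEEMED_BY_MAGIC`, `IGNORE_DEEMED`, `strip_details` and `effective_extension` are hypothetical.

```python
import os
import re

# (mime type, stripped "file" description) -> deemed extension; subset of the table.
DEEMED_BY_MAGIC = {
    ("application/zip", "Zip archive data"): ".zip",
    ("application/x-gzip", "gzip compressed data"): ".gz",
    ("application/x-rar", "RAR archive data"): ".rar",
    ("application/x-7z-compressed", "7-zip archive data"): ".7z",
    ("application/octet-stream", "SAS 7+ data file"): ".sas7bdat",
    ("application/x-dosexec", "PE32 executable (GUI) Intel 80386"): ".exe",
    ("application/octet-stream", "data"): ".bin",
    ("application/CDFV2-encrypted", "Composite Document File V2 Document"): ".mso_crypt",
}

# (actual extension, deemed type) pairs where the actual extension wins.
IGNORE_DEEMED = {
    (".xlsx", ".zip"), (".xps", ".zip"), (".egp", ".zip"), (".xls", ".mso_crypt"),
}


def strip_details(description):
    """Drop extra details starting at the first comma or semicolon."""
    return re.split(r"[,;]", description, maxsplit=1)[0].strip()


def effective_extension(filename, mime_type, description):
    """Return the extension to judge the file by: deemed type unless exempted."""
    actual = os.path.splitext(filename)[1].lower()
    deemed = DEEMED_BY_MAGIC.get((mime_type, strip_details(description)))
    if deemed is None or (actual, deemed) in IGNORE_DEEMED:
        return actual
    return deemed
```

With this rule a zip archive renamed to "notes.txt" is still treated as .zip, while a genuine .xlsx keeps its own extension.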

Warnings and blocks get mentioned in logfile AND emailed to sre@popdata.bc.ca (OTRS). Blocking was instituted in summer 2011. When blocking, a pop-up is displayed on the virtual Windows machine to notify the researcher.

Yellowfolder.pl history

  • 2015-10-10 changed message in yellowfolders popup to "Transfer blocked for further inspection. \n Do not re-submit. \n Contact sre\@PopData.bc.ca."
  • 2015-12-21 block all transfers for an account if
    • import and/or export has been blocked manually. E.g. /data/sre/EXPORT_FROM_SRE/.restrictions/{USERNAME}/inspect-all
    • a previous transfer was blocked: all transfers are blocked automatically after the first block (i.e. "inspect-all" is created as part of the blocking procedure).
    • diagnostic:
      • ticket contains "This account currently requires inspection for all files."
      • on each SRE file server: yf-restrict will list all currently blocked accounts
  • 2017-02-20 block .JAR; updated list above
  • 2017-09-12 leave invisible (leading-dot) files behind (untransferred)
  • 2018-01-24 block by content: look for 10-digit studyID pattern in most files
  • 2018-02-14 split yellowfolder.conf from yellowfolder.pl
  • 2018-04-10 change subject line of message to OTRS when manual transfer triggers a warning
  • Future:
    • automatically block transfers exceeding daily limit. Cumulative MB in .restrictions/{USERNAME}/export-stats [date&time bytecount filecount]

YellowFolder interactive popup window

The pop-up works this way: the IMPORT-IT or EXPORT-IT scripts (Python-2.7) wait for a message to be written in .../.trigger/USER/status.txt with a final line "END". It then displays the message and deletes status.txt and directory .trigger/USER .

See transfer-out.py and transfer-in.py on Fraser in /data/sre/{sre|rtl}/.scripts/ (source and test bed in /usr/local/src/YellowFolders/ )

  1. create trigger directory to activate yellowfolder software. If directory pre-existing, warn and quit. If directory over 1 minute old, delete, invite "retry" and quit.
  2. open pop-up window with "wait" message
  3. wait for yellowfolder.pl on fileserver Fraser to create status.txt
  4. read status.txt until line "END", display message in pop-up
  5. delete status.txt and trigger directory
  6. Wait for SRE user to dismiss pop-up window.
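
Steps 3 to 5 above boil down to polling for a file that ends with an "END" line. The sketch below shows that handshake in Python 3 for illustration; the production scripts are Python 2.7 with a tkinter pop-up, and the function names, timeout and polling interval here are assumptions.

```python
import os
import time


def wait_for_status(status_path, timeout=60.0, poll=0.5):
    """Poll until status.txt ends with the line "END".

    Returns the message lines before "END", or None if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(status_path):
            with open(status_path, encoding="utf-8", errors="replace") as handle:
                lines = [line.rstrip("\r\n") for line in handle]
            # The writer may still be mid-way; only accept a complete message.
            if lines and lines[-1].strip() == "END":
                return lines[:-1]
        time.sleep(poll)
    return None


def clean_up(trigger_dir):
    """Delete status.txt and then the (now empty) trigger directory."""
    status_path = os.path.join(trigger_dir, "status.txt")
    if os.path.exists(status_path):
        os.remove(status_path)
    os.rmdir(trigger_dir)
```

Waiting for the final "END" line avoids displaying a half-written message if the client reads the file while the server is still writing it.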

Addition to yellowfolder process: click-through (2012-07):

  1. SRE users (but not RTL users) will be required to "click-through" a statement about no data (different for import/export).
    •  ? should a text input box be provided for researcher to explain contents?
      • how does text get carried to Fraser?
    • how does transfer-*.py know between SRE and RTL? Separate program?
  2. following this, create trigger directory as before

Possible addition to yellowfolder process: researcher statement

  • in addition to checkbox or button, researcher might be asked for a brief description of the data
    • this could be requested only if data was blocked
    • would it be a bad idea to "unblock" transfer based on assertion?
    • we have agreed to block transfers based on criteria
  • how about whitelisting transfer? Researcher would ask for permission, and they would be allocated a single unblocked transfer.

YellowFolder interactive popup window - status/error messages

2012-04-03 on Fraser yellowfolder.pl creates status.txt and on SRE machines transfer-in.pyw / transfer-out.pyw (windows shortcut IMPORT-IT.lnk / EXPORT-IT.lnk in users\USERNAME\TRANSFER\) are responsible for displaying the contents of status.txt in a yellow popup window.

Researchers will see a sequence of 2 status reports when they click on the EXPORT-IT or IMPORT-IT link in their TRANSFER folder. The purpose of messages about blocking is to have researchers rethink the appropriateness of their transfer, and perhaps to consult with RLU. The messages are chosen to not reveal too much (because that might invite workarounds).

  • The first immediately says something like:
    • YellowFolder export
    • Processing transfer request...
  • as soon as the transfer is done, they should see:
    • Transfer request: 1 files, 1 folders, 123 KB
    • Transfer successful
  • alternately they might see (if they used wrong-case version of username for login, e.g. "louisw" for "LouisW"):
    • Unrecognized username:
    • Transfer aborted.
  • or (if size or filename extension triggers blocking):
    • Transfer blocked by policy.
    • Contact sre@PopData.bc.ca
  • or (if they neglected to place any files into the appropriate folder):
    • Nothing to do.
  • or (if back-end on server Fraser fails to make an archive copy):
    • Cancelled due to technical difficulties
  • or (for some reason):
    • Warning: transfer incomplete
  • In the future I plan to add:
    • Warning: destination folder not empty.
    • Old files may get renamed instead of replaced.

"find" commands to match links by target, replace

  • list matching links
    • PROJ='11-s01'; USER='*'; LINK='EXPORT-IT.lnk'; TARGET='transfer-out.lnk'
    • Alternately: USER='rhershler-11-s01'
    • find $PROJ/$USER/TRANSFER -type l -name "$LINK" -lname "*$TARGET" -printf "%t %p\t=> " -exec readlink '{}' ';'
    • find $PROJ/$USER/TRANSFER -type l -name "$LINK" -not -lname "*$TARGET" -printf "%t %p\t=> " -exec readlink '{}' ';'
  • Repeat with: LINK='IMPORT-IT.lnk'; TARGET='transfer-in.lnk'
  • command to change EXPORT-IT ( cd /data/sre/ )
    • PROJ='11-s01'; USER='*'; LINK='EXPORT-IT.lnk'; TARGET='transfer-out.lnk'
    • find $PROJ/$USER/TRANSFER -type l -name "$LINK" -not -lname "*$TARGET" -ls -delete -exec ln -s /data/sre/.scripts/"$TARGET" '{}' ';' -printf "%t %p\t=> " -exec readlink '{}' ';'
  • Repeat with: LINK='IMPORT-IT.lnk'; TARGET='transfer-in.lnk'

Reasons for preferring tkinter to win32api include the ability to update text in the popup, and the possibility of not waiting for user interaction. Documentation about win32api is poor/nonexistent: www.python.org/getit/windows says "[...] download Win32all, Mark Hammond's add-on that includes the Win32 API [...] SourceForge." http://sourceforge.net/projects/pywin32/ has nothing but a download of the installer. http://vermeulen.ca/python-win32api.html has ~100 links to examples of use. At http://stackoverflow.com/questions/1025029/how-to-use-win32-apis-with-python A. Martelli (author of Python in a Nutshell) suggests the alternative http://docs.python.org/library/ctypes.html , which seems a very verbose way to go. http://python.net/crew/mhammond/win32/ talks about compiling and versions. I guess you need Windows documentation.

fail to move files when destination subfolder not empty

  • Users often fail to clean up destination folder.
    • The archive copy into "saved" folder always succeeds because it goes into a new directory,
    • but the move fails with message "mv: cannot move `mv-test-src/level2' to `mv-test-dest/level2': Directory not empty", because "mv" will not do a recursive delete when replacing an existing directory.
    • Fix: use "--backup=numbered". In the example below, mv-test-dest/level2/ contains 1 file.
      • dlaplante@fraser:~/tmp$ mv --backup=numbered -v mv-test-src/level2 mv-test-dest/
      • `mv-test-src/level2' -> `mv-test-dest/level2' (backup: `mv-test-dest/level2.~1~')

md5-collect.pl : adding to /usr/local/var/md5/Data

/usr/local/sbin/md5-update.sh is run daily from /etc/cron.daily/yellowfolders-maint . It finds all Data files added since last run, adds their MD5 checksum to repository. Source & RCS in /usr/local/src/YellowFolders/Cron.daily/ .

Updated 2012-02-07 - add new project-style data directories; ToDo: search saved outgoing transfers. Updated 2012-04-18: skip directories named "docs" (old md5 files purged of files under docs)

List all files added/changed since md5 data was last collected:

  • cd /home/sre/Data
  • for D in * ; do M="/usr/local/var/md5/Data/$D.md5" ; echo ""; ls -ld "$D"; if [ -e "$M" ]; then ls -ld "$M"; find "$D" -type f -newer "$M" -ls | head; else echo "*** missing $M ***"; fi; done

Alternately list all files changed since a particular date:

  • find . -newermt 2011-03-02 -ls

Now add md5 files for changed directories:

  • for D in GECKO 11-s01 ; do M="/usr/local/var/md5/Data/$D.md5" ; /usr/local/sbin/md5-collect.pl "$D" > "$M"; ls -l "$M"; done

Excluding files from MD5 list because they don't need protecting

  • Files that should not be blocked from exporting should be treated specially, because false security incident are awful.
    • NOTE: every morning at 06:25 the scripts in /etc/cron.daily/ run, including "yellowfolders-maint", which runs md5-update.sh, which checks each data folder for changes and runs md5-collect.pl . If triggered by a recent change in a file or directory time (indicating files added, deleted or renamed), all files under DATA are re-calculated (which can take a few hours), and the previous version of ${PROJECT}.md5 is moved to sub-folder "Backup".
      • Modifying the per-project _DO_NOT_BLOCK.txt file will trigger a recalculation, but changing the per-server _DO_NOT_BLOCK.txt will not. Best is to edit the ".md5" file for the offending project(s) and delete the offending line(s) - this solves the problem immediately and does not require a lengthy recalculation until the DATA folder gets changed.
        To find the offending projects, ex: F='PopData extracting PharmaNet 2017 11 16.pdf' ; cd /usr/local/var/md5/Data && grep "$F" *.md5
    • 2014-03 Tim says that they can't move files supplied by researchers into "docs" folders.
    • Denis has counter-proposed having RLU provide a list of such documentation files not to be blocked.
    • 2014-05-06 Kelly tentatively agrees with Denis' proposal:
      • Exclude from blocking *.doc *.docx *.pdf and */docs/
      • Do not exclude directories */*docs/ , because outside data suppliers might mix data with docs.
      • Use per-project DATA/_DO_NOT_BLOCK.txt (regex of path/filenames, anchored at end. See sample in 99-t01/DATA/)
    • 2014-05-06 still waiting for Kelly
    • change md5-collect.pl to use global /usr/local/var/md5/_DO_NOT_BLOCK.txt for list of files to be excluded
      • Ex: "feeitem-4digit.xls", "MSP - Fee Item documentation 2007 10 22.doc"
    • 2015-05-27 DL : Added to global _DO_NOT_BLOCK.txt "data.?dictionary\.xls.?" (DataDictionary.xls, data_dictionary.xlsx ...)
    • 2016-12-13 DL updated md5-collect.pl to implement case-insensitive matches as intended; updated global _DO_NOT_BLOCK.txt to escape '(', and to add descriptive comments.
       ? Consider anchoring most file names with prefix "/" and suffix "$" ? Probably not worth it.

Content of global _DO_NOT_BLOCK.txt

 # Global list for leaving non-data files out of yellowfolder block list.  Used by /usr/local/sbin/md5-collect.pl
 # NOTES:
 #	Case-insensitive Perl REGEX search (".*" means skip 0 or more of any character; ".?" means skip 1);
 #	... other magic characteres needing "\" escape prefix: '(', ')', '|', '+', '*', '?', '{', '}'
 #	Path names relative to /data/sre/, ex: 99-t01/DATA/block_me.dat
 # 	Ignore comment or empty lines
 #	Each project can have own supplementary list, ex: 99-t01/DATA/_DO_NOT_BLOCK.txt
 #
 # 	Skip any folder named "docs"
 /docs/
 # 	Skip standard exlude list
 _DO_NOT_BLOCK.txt
 # 	Skip list of standard docs from RLU - March 2014, augmented by Ex: locate 'MSP - Fee' | awk -F / '{print($NF)}' | sort | uniq 
 Comparability of ICD-10 and ICD-9 for Mortality Statistics in Canada \(Statistics Canada 2005\).pdf
 thercode20060815.xls
 Explanatory Codes_2012 07 10.pdf
 feeitem-4digit.xls
 feeitem-5digit.xls
 MSP - Fee Item documentation 2007 10 22.doc
 MSP - Fee Item documentation_20090513.doc
 MSP Diagnostic Code Descriptions \(ICD9\)_20030130.pdf
 MSP Diagnostic Code Descriptions \(ICD9\)_20030130.txt
 MSP Diagnostic Code Descriptions \(ICD9\) - Jan 30, 2003.pdf
 MSP Diagnostic Codes paper.doc
 MSP_Subsidy Code_Premium Rate History_1989 to 2012_2012 08 22.xls
 MSP_Subsidy Code_Premium Rate History_1989 to 2012.xls
 Automated Geographic Coding to Census Geographies.doc
 Automated Geographic Coding To Census Geographies.docx
 R&PB CANCEL REASONS.doc
 # 	Skip data dictionaries (Ex: "DataDictionary.xls" "data_dictionary_....xlsx" "...data dictionary....xls")
 data.?dictionary.*xlsx?

Contents of 99-t01/DATA/_DO_NOT_BLOCK.txt . Note that this is an "unanchored match" , so the item below would match "not/DATA/test-export-block-exempt.pdf.37"

# Documentation files in a "docs" sub-folder, or listed here, will not be blocked from export.
DATA/test-export-block-exempt.pdf
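
The effect of the unanchored, case-insensitive match can be reproduced with a short sketch. md5-collect.pl uses Perl regular expressions; Python's `re` module is used here as a stand-in (an assumption that holds for simple patterns like these), and the names `load_patterns` and `is_exempt` are hypothetical.

```python
import re


def load_patterns(text):
    """Compile one case-insensitive regex per line, skipping comments and blanks."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        patterns.append(re.compile(line, re.IGNORECASE))
    return patterns


def is_exempt(relative_path, patterns):
    """Unanchored search: the pattern may match anywhere in the path."""
    return any(pattern.search(relative_path) for pattern in patterns)


if __name__ == "__main__":
    sample = """
    # Skip any folder named "docs"
    /docs/
    data.?dictionary.*xlsx?
    DATA/test-export-block-exempt.pdf
    """
    patterns = load_patterns(sample)
    for path in [
        "99-t01/DATA/docs/readme.txt",
        "11-s01/DATA/Data_Dictionary_v2.xlsx",
        "not/DATA/test-export-block-exempt.pdf.37",
        "99-t01/DATA/block_me.dat",
    ]:
        print(path, "exempt" if is_exempt(path, patterns) else "protected")
```

Anchoring a pattern with a leading "/" and a trailing "$" would stop the third path from matching.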

md5-check-same.pl: manually check a directory for MD5 matches

md5-check-same.pl in /usr/local/src/YellowFolders/ can test for MD5 matches in a directory tree
Example: /usr/local/src/YellowFolders/md5-check-same.pl '/usr/local/var/md5/Data/*.md5' /data/saved/EXPORT_FROM_SRE/*/20120207*

md5-check-same.pl: 43 files (under /data/saved/EXPORT_FROM_SRE/jboonstra-11-s01/20120207070603, /data/saved/EXPORT_FROM_SRE/jboonstra-11-s01/20120207100601, /data/saved/EXPORT_FROM_SRE/lchen-11-s01/20120207095040, /data/saved/EXPORT_FROM_SRE/lchen-11-s01/20120207095041, /data/saved/EXPORT_FROM_SRE/speterson-11-s07/20120207122532) checked against 2512 MD5 checksums. 0 size matches, 0 MD5 matches

Examining transferred files

sre-inspect-copy.sh - copy to SRE project 99-t01

  • USAGE: sre-inspect-copy.sh [IN|OUT] [USERNAME] [DAYS]
    list all in or out transfers archived over past days as specified; for each archive folder, list name and statistics.
    then prompt for space-separated list of folders to be copied to 99-t01/working/EXPORT_INSPECT (or IMPORT_INSPECT), read-only by all
  • Note: requires 99-t01 to have folders IMPORT_INSPECT and EXPORT_INSPECT under /data/sre/99-t01/working/ .
    • If R: drive is on another server, in CMD window enter for example: net use N: \\noyon\sre\99-t01

scan for studyIDs

  • /usr/local/bin/yf-find-studyids {DAYS}
    For all exports in last {DAYS} days, recursively grep all exports for PopData StudyIDs.
    egrep -nH -m 2 -w -e '[a-z1-2]0{2}[0-9]{7}'
  • Alternative excluding PDF coordinates, i.e. "not followed by a space and 5 digits" (Ex: "0000042717 00000 n"). Options: -w (whole word) -c (count) -n (line number) -H (filename) -P (perl regular expression)
    grep -c -nH -w -P '[a-z0-2]0{2}[0-9]{7}(?! [0-9]{5})'
  • /usr/local/bin/scan-studyids.py {FILENAMES} ...
  • See also my.popdata..file_transfers for a PDF explaining scanning by SRE users.
    The original is scan-studyids-how.docx in Alfresco/Systems & Security/SRE+RTL+SRTL/Software/YellowFolder/
    The PDF goes on Sullivan in /home/www/pds/html/media/ and is accessible from SRE machines.
  • Custom studyID scanning for html
    • Ticket#2019011510000041 blocked for studyid pattern in html file.
      Verendrye:/data/saved/EXPORT_FROM_SRE/ncroteau-15-052/20190115111017-blocked/ncroteau-15-052/ file contained 10-digit strings URL-encoded
DXImageTransform%2EMicrosoft%2Egradient%28startColorstr%3D%27%2380000000%27%2C%20endColorstr%3D%27%2300000000%27%2C%20GradientType%3D1%29%7D%2E

DXImageTransform.Microsoft.gradient(startColorstr='#80000000', endColorstr='#00000000', GradientType=1)}.

https://www.w3schools.com/tags/ref_urlencode.asp
%3D(=) %27(') %23(#) %2c(,)
%28(() %29()) %7D(}) %2E(.)
  • Should ignore 8-digit strings starting with "%23"(#)
    https://urldecode.org
  • Alternately strip out <tags> before scanning sed -e 's/<[^>]*>//g'
  • RTF files sometimes have 10-digit codes
  • 2019-11-13 shirazea-18-036 "pngblip\bliptag-1587363951{" , "wmetafile8\bliptag-1587363951{"
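
The grep -P pattern and the tag-stripping idea above can be tried out in Python. This is a sketch and not the code of scan-studyids.py or yf-find-studyids: `find_studyids` is a hypothetical name, "\b" only approximates grep's -w option, and the optional tag stripping mirrors the sed expression above.

```python
import re

# Same pattern as the grep -P example; \b approximates grep's -w (whole word).
STUDYID = re.compile(r"\b[a-z0-2]0{2}[0-9]{7}\b(?! [0-9]{5})")
TAG = re.compile(r"<[^>]*>")


def find_studyids(text, strip_tags=False):
    """Return all StudyID-like strings found in text.

    With strip_tags=True, anything between "<" and ">" is removed first,
    like: sed -e 's/<[^>]*>//g'
    """
    if strip_tags:
        text = TAG.sub("", text)
    return STUDYID.findall(text)


if __name__ == "__main__":
    print(find_studyids("patient a001234567 seen twice"))
    print(find_studyids("0000042717 00000 n"))  # PDF cross-reference entry
```

The negative lookahead rejects a candidate that is followed by a space and five digits, which is how PDF cross-reference offsets look.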

Grep small counts in CSVs

  • egrep -nH -e '^[0-4],|,[0-4],|,[0-4]$' *.csv
    Looks for single-digit values from 0 to 4 at the front, middle or back of a line (the counts of interest are 1 to 4; use [1-4] to ignore zeros). Fails if the CSV uses quote marks.
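
Quoted fields can be handled by parsing the CSV instead of grepping it. The sketch below uses Python's standard csv module; the function name `small_count_cells` and the default range of 1 to 4 are assumptions based on the description above, not an existing PopData tool.

```python
import csv
import io


def small_count_cells(text, low=1, high=4):
    """Return (row, column, value) for every cell holding an integer in [low, high].

    Rows and columns are numbered from 1; quoted fields are parsed correctly.
    """
    hits = []
    for row_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        for column_number, cell in enumerate(row, start=1):
            value = cell.strip()
            if value.isascii() and value.isdigit() and low <= int(value) <= high:
                hits.append((row_number, column_number, int(value)))
    return hits


if __name__ == "__main__":
    sample = 'group,count\n"A",3\n"B, north","2"\nC,15\nD,0\n'
    for hit in small_count_cells(sample):
        print(hit)
```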

sas7bdat , sas7bcat

  • .SAS7BDAT is a SAS data file in binary uncompressed form.
    • Blocked by yellowfolder.pl because always contains some kind of "data", though not necessarily personal data (often lists of drug numbers)
    • In Windows right click on the data icon and then select "open with SAS Enterprise Guide"
    • "strings" will usually show the variable names
  • .SAS7BCAT is a SAS 'catalog' usually containing compiled macros or formats. http://support.sas.com/kb/22/352.html "no way to retrieve the original source code"
    • "strings" seems to show most of the source code, interspersed with unreadable stuff.
    • Not blocked because it contains code. Scanned for StudyIDs, get false positive for esayre-11-014, esayre-14-113
      Should be blocked (like sas7bdat) because they need special inspection.

.PDF .XPS

    • .XPS is a MS Office equivalent of PDF, stored as a ZIP archive. PDF and XPS can be opened using "evince".

Testing the yellowfolder.pl software

Script test-yellowfolder-offline.sh in /usr/local/src/YellowFolders/ will run tests in directories under Test-yf.dir , where all directories are writeable by group "admins".

  • It must *NOT* be run as root.
  • See test-yellowfolder-how.txt
    • USE: test-yellowfolder-offline.sh [ incoming | outgoing | clean ] [ manual | bad | size_KB ]
    • To test a program other than "yellowfolder.pl", E.g. "PROG=../yellowfolder+.pl test-yellowfolder-offline.sh ..." .

Removing stale files from IMPORT / EXPORT directories

The folders incoming => IMPORT_TO_SRE and EXPORT_FROM_SRE => outgoing are meant to be temporary.

Files placed by user in incoming or EXPORT_FROM_SRE should only sit there until transfer has been triggered. Files made available to user (by yellowfolder.pl started by triggering script) in IMPORT_TO_SRE or outgoing should be removed as soon as possible by user. Unfortunately it takes an extra effort to delete files after dragging them to a new location, so stale files should get deleted after a decent interval. Note that a backup of transferred files is kept, but not of files waiting to be transferred.

Import: files in "incoming" folder should be deleted (moved) by "DO_IT" script, so we remove all after 7 days.

Export: files in "outgoing" folder should remain only until user has moved them somewhere else. Unfortunately they might not think of deleting their old junk. If the file was read after writing it could be presumed to be ready to delete. Or we could just wait 7 days.


List the read, modify and inode-change times of all files, then have awk show files not read after change:
# find /home/share/outgoing/dlaplante -type f -ctime +7 -printf "%Ay%Am%Ad.%AH%AM%AS %Ty%Tm%Td.%TH%TM%TS %Cy%Cm%Cd.%CH%CM%CS  %s %h/%f\n" | awk '{if($1 <= $3){print}}'

Command to delete stale files:
OLD_IN=7 ; OLD_OUT=7;
cd /home/share
find incoming -type f -ctime +${OLD_IN} -exec rm -f '{}' ';'
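
Before deleting, it can be useful to list what would be removed. The following sketch is a dry-run equivalent in Python and is not part of the existing scripts: `stale_files` is a hypothetical name, and the age test is a plain "older than N days" comparison, which differs slightly from find's -ctime +N rounding.

```python
import os
import time


def stale_files(top, days=7, now=None):
    """Yield paths of regular files under top whose inode change time is older than days."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    for folder, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(folder, name)
            # lstat: judge a symlink by its own timestamp, not its target's.
            if os.lstat(path).st_ctime < cutoff:
                yield path


if __name__ == "__main__":
    # Dry run only: print candidates, delete nothing.
    for path in stale_files("incoming", days=7):
        print(path)
```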

IDD CATALYST export review

  • Catalyst is a joint project between IDO and PopData (Tav) for prototyping the use of the SRE environment for studies by government staff.
  • Projects 18-g* require special procedure for manual inspection of all export, derived from "yellowfolder".
  • Currently 18-g01 "MacKenzie", 18-g02 "Wilmer" hosted on file server Hubbard.
  • Export submissions for review will be copied to project 18-g99; members of that "project" will inspect files, and pass them on to Fraser for export.
    Details at IDO_Catalyst_export_review
  • Import for Catalyst users is same as for SRE users. If anything gets blocked please CC <Brittany.Decker@gov.bc.ca> Director, Client Engagement and Service Delivery (CITZ Integrated Data Office), BC Ministry of Citizen Services.

Report on previous transfers

Note that sre-transfers-report.pl added several options in March 2011, examples below are out of date.

The following reports 14 days back from now (-b14) for a total of 14 days (-d14). It would report oversize files and illegal extensions.

root@fraser:/data/saved# DATE='-b14 -d14' ; for DIR in EXPORT_FROM_SRE IMPORT_TO_SRE; do echo; echo $DIR; ~/src/sre-transfers-report.pl ${DATE} ${DIR}; done

EXPORT_FROM_SRE Report on SRE transfers 20110131 - 20110215

   KB FILES
   59   3 bforer  2011-02-09,13:09:30 {Ext: 3*.xlsx}
   65   3 bwarburton  2011-02-09,14:24:34 {Ext: 2*.XLS 1*.doc}
   50   1 bwarburton  2011-02-09,14:29:50 {Ext: 1*.xlsx}
  208   1 bwarburton  2011-02-09,17:11:18 {Ext: 1*.XLS}
  215   1 bwarburton  2011-02-09,17:37:12 {Ext: 1*.XLS}
   20   2 dsarkany  2011-01-31,13:23:35 {Ext: 2*.rtf}
   93   1 dsarkany  2011-02-01,14:09:24 {Ext: 1*.smcl}
  232   4 dsarkany  2011-02-04,16:09:46 {Ext: 2*.docx 2*.rtf}
   63   1 dsarkany  2011-02-04,17:29:58 {Ext: 1*.smcl}
   70   1 dsarkany  2011-02-08,16:11:32 {Ext: 1*.smcl}
   70   1 dsarkany  2011-02-08,16:16:17 {Ext: 1*.smcl}
   71   1 dsarkany  2011-02-08,18:07:43 {Ext: 1*.smcl}
   20   2 dsarkany  2011-02-08,18:24:47 {Ext: 2*.rtf}
  886   1 fxu  2011-02-10,14:52:36 {Ext: 1*.xlsx}
 1497   1 jboonstra  2011-02-03,17:33:33 {Ext: 1*.xlsx}
   82   1 jboonstra  2011-02-10,09:44:49 {Ext: 1*.pptx}
   14   1 jlloyd  2011-02-12,11:22:19 {Ext: 1*.xlsx}
  310   1 ueka  2011-02-10,15:56:42 {Ext: 1*.xlsx}
  312   1 ueka  2011-02-14,08:46:21 {Ext: 1*.xlsx}
 = 4337 KB, 28 files, 66 dirs (28 marked executable).

IMPORT_TO_SRE Report on SRE transfers 20110131 - 20110215

   KB FILES
   30   1 fxu  2011-02-10,14:48:41 {Ext: 1*.sas}
 = 30 KB, 1 files, 21 dirs (1 marked executable).
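
The "{Ext: ...}" column in the report above counts files per extension within one transfer. A minimal sketch of how such a summary could be built, in Python rather than the Perl of sre-transfers-report.pl; the function name and the sort order are assumptions, not taken from the real script:

```python
from collections import Counter
from pathlib import PurePath


def extension_summary(filenames):
    """Return a summary such as '{Ext: 2*.XLS 1*.doc}' for a list of file names."""
    # Count suffixes exactly as written, so .XLS and .xls stay distinct.
    counts = Counter(PurePath(name).suffix or "(none)" for name in filenames)
    # Sorting by extension is an assumption; the real report may order differently.
    parts = [f"{n}*{ext}" for ext, n in sorted(counts.items())]
    return "{Ext: " + " ".join(parts) + "}"


print(extension_summary(["a.XLS", "b.XLS", "notes.doc"]))
```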

Examples of warnings:

  • WARNING: 58918888 bytes [SPSS System File MS Windows Release 13.0 spssio3]: EDI_Cycle1_MASTERFILE_43913_03-24-06.sav
  • WARNING: forbidden extension on EDI_Cycle1_MASTERFILE_43913_03-24-06.sav (bytes: 58918888, contents: SPSS System File MS Windows Release 13.0 spssio3)
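
The two warnings above correspond to a size check and an extension check. A rough sketch of that logic follows. The size limit and the forbidden-extension set are hypothetical placeholders (the real values are not stated on this page), and the libmagic content description shown in the real warnings is left out:

```python
from pathlib import PurePath

# Hypothetical policy values for illustration only.
MAX_BYTES = 50_000_000
FORBIDDEN_EXTENSIONS = {".sav", ".sas7bdat", ".dta"}


def transfer_warnings(name, size, max_bytes=MAX_BYTES, forbidden=FORBIDDEN_EXTENSIONS):
    """Return warning strings for one file, loosely modelled on the examples above."""
    warnings = []
    if size > max_bytes:
        # Oversize file: report the byte count and the file name.
        warnings.append(f"WARNING: {size} bytes: {name}")
    if PurePath(name).suffix.lower() in forbidden:
        # Extension comparison is case-insensitive (an assumption).
        warnings.append(f"WARNING: forbidden extension on {name} (bytes: {size})")
    return warnings
```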

Rules for import and export by researchers

See SRE Upload_Information for Researchers_Final 2014 09 04.doc and SRE Download_Information for Researchers Final 2014 09 04.doc under //Gilbert/Shoebox/SRE/SRE Uloads and Downloads/

IMPORT TO SRE
1. Non-individual level records/data/information
  • The information does not contain records about individuals (as opposed to broad classes, groups, or categories); cannot be linked at the individual level; and is being used to answer the project’s approved research questions/objectives.
AND 2. Publicly available
  • The information can be obtained, requested, or derived without formal approvals or permissions and is being used to answer the project’s approved research questions/objectives.
EXPORT FROM SRE
1. Products from the results of research analysis that are aggregated (related to broad classes, groups, or categories) and therefore cannot be linked or identified are permitted to be downloaded from the SRE. Unless otherwise permitted in your Research Agreement, suppress any cells with five or less.
Examples of allowed downloads (noting cell size restrictions):
  • Maps
  • Tables (e.g. frequency, summary)
  • Graphs
  • Log files
2. Documentation used to support research that does not contain record-level information.
Examples of allowed downloads:
  • Scripts
  • Syntax
  • Research notes
  • Data dictionary
  • Supporting papers/materials provided by PopData (e.g. MSP diagnostic code paper; information on census geographies)

Megan Engelhardt's BAH

  • Megan Engelhardt requests SRE support to download a file from Fraser bimonthly.
  • It is a CHSPR monthly report in Excel format with no encryption.
  • Email sample:
    "Can you please transfer the file “Uploads_May13” in ‘mengelhardt-14-s06’ out of the SRE?"
  • Check that the above file has been copied under fraser:/data/sre/14-s06/mengelhardt-14-s06/EXPORT_FROM_SRE/* (the real path is fraser:/data/sre/EXPORT_FROM_SRE/mengelhardt-14-s06/*)
  • Use gnumeric to open this Excel file and check its content.
    • Note: this file DOES contain personal data, and it is exempted.
  • Run: /usr/local/sbin/yellowfolder.pl outgoing mengelhardt-14-s06 (file will move to ? )
  • Notify Megan.

Data hidden in stats report files

  • Some formats such as SPV, SAV and XLSX may keep a copy of the original data in a file that, when opened, displays only statistics.
    For example a pivot table in Excel (see Ticket#2019050710000031)
    or SPSS "PROC FREQ" or "Kaplan-Meier" in an SPV file (Ticket#2017030610000031).
    Older versions of R used to have a hidden "." cache file containing data, which might get included with a folder in an export (now blocked).
  • Diagnostic: get a listing of uncompressed sizes. Unix ex: unzip -l file.xlsx . Windows ex: right-click on filename (context menu) and select open archive with 7z.
    A large uncompressed size for a component suggests a lot of data. Unfortunately Microsoft does not break data elements into lines, so some editing may be necessary. Typically
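
The same diagnostic can be scripted. A minimal sketch using Python's standard zipfile module, which lists the members of a ZIP-based file (XLSX, SPV, ...) with their uncompressed sizes, largest first; the function names are illustrative only:

```python
import zipfile


def uncompressed_sizes(path):
    """Return (member name, uncompressed bytes) pairs, largest first."""
    with zipfile.ZipFile(path) as archive:
        members = [(info.filename, info.file_size) for info in archive.infolist()]
    return sorted(members, key=lambda item: item[1], reverse=True)


def report(path, limit=10):
    """Print the largest members, one per line, size first."""
    for name, size in uncompressed_sizes(path)[:limit]:
        print(f"{size:>12}  {name}")
```

Usage: report("file.xlsx") prints the ten largest components. A component much larger than the rest is the first place to look for embedded data.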

Rules for graphs

  • Original data can be extracted from graphs, depending on the graph file format and resolution.
  • ConnectedResearchers.com software list includes 16 packages.
    • WebPlotDigitizer is JavaScript-based (it claims no data is uploaded to a server), with desktop versions for Mac, Windows & Linux. It offers a very nice YouTube tutorial (18 minutes). After you specify scales and the colour to look for, it will automatically scan the image for curves or data points.
    • R journal article (PDF 2011) The digitize Package: Extracting Numerical Data from Scatterplots by Timothée Poisot
    • Engauge-digitizer open-source Win/Mac/Linux, includes Debian package.

Explanation to researchers for export restrictions

  • To fulfill our agreements with Data Stewards to protect their data, Population Data BC has
    • set up rules and guidelines on the content of files that can be exported or imported and
    • set up software and procedures to accomplish such transfer in or out of the Secure Research Environment.
      • Please read https://my.popdata.bc.ca/html/SRE/working/file_transfers.html for the current information
      • The following is a proposed addition to the above page, and/or perhaps an email to researchers.
        • Starting in summer of 2015, the following changes to procedure will appear
          1. Instead of just a count of files and a reminder of conditions before clicking on "Agree", there will be a list of all files, and the researcher will be required to enter a brief description for each (e.g. "code", "aggregated results", "documentation"). The window will have clickable buttons to quickly enter the words listed above.
          2. We will block the export or import of "archives" (i.e. collections of files compressed together, in formats such as zip, rar, tar, 7z ...), to facilitate our inspection process.
        • Note that we inspect every import or export at some point, even those not blocked.
        • Sometimes the "yellowfolder" software (triggered by the "EXPORT-IT" or "IMPORT-IT" link in your TRANSFER folder) will refuse to transfer. The reasons could be size, file name or content. You might think of this as a deferred transfer, like being selected at customs for baggage inspection. Staff are immediately notified, and depending on result of inspection will either arrange with the researcher for manual transfer, or escalate to an incident report. Re-exporting the same file under a different name or format without consulting with <sre@popdata.bc.ca> doesn't look good.
        • Blocking of archives need not create great inconvenience. Instead of placing a number of files into a folder, "zipping" up the folder into an archive, placing the archive in your TRANSFER\EXPORT_FROM_SRE folder and triggering the export, just change the sequence: create one or more sub-folders under TRANSFER\EXPORT_FROM_SRE , trigger the export, and from your outside computer "zip" up the folder into an archive.

Blocked for inspection: explanation to researchers

SAMPLE from David/Denis

We always block SPSS viewer output files (SPV) for inspection because SPV is an archive, and has sometimes been found to contain individual data (without the author being aware).
If you are quite sure that SPV outputs are the only format that you can accept, we can examine the file and export it manually.
There might be a delay as we don't have a lot of expertise to ensure there is no data embedded in the file.
We discovered from other researchers that SPSS can output in PDF format.

Suggestions for restriction by extension

Jacqui Boonstra recommends that we flag data file types for various software:

Software on SRE Data File Extension Syntax File Extension Output File Extension
SPSS 18.0.1 .sav .sps .spv/.spo
SAS 9.2 (64 Bit) .sas7bdat/.sd2/.sd7 .sas .log .lst
Stata 10 .dta .do .txt/.gph/.out/.log
ArcGIS 9.3 .dbf .mxd
HLM 6.06 .mdm/.mdmt .hlm/.mlm .sts/.txt/.out
Stat Transfer 9 .any
GeoDA 0.9.5i
R 2.8.1 .RData .R
R 64 Bit 2.11.0
Python 2.5 .any .py
Python 2.6 .any .py
Microsoft Visual FoxPro 9 .dbf/.fxp .prg
Microsoft Office Excel 2007 .xlsx --
Microsoft Office Word 2007 .docx --
M-Plus 5.21 .dat .inp .out, .txt
Other data files (produced by various software above) .csv/.xml
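
A sketch of how the data-file column above could be turned into a flagging check. The set below is a partial transcription: only entries whose column placement is unambiguous in the table are included, ".any" entries are left out because they cannot be expressed as a fixed extension, and case-insensitive matching is an assumption. The function name is hypothetical:

```python
from pathlib import PurePath

# Partial transcription of the "Data File Extension" column (see caveats above).
DATA_EXTENSIONS = {
    ".sav", ".sas7bdat", ".sd2", ".sd7", ".dta", ".dbf",
    ".mdm", ".mdmt", ".rdata", ".dat", ".csv", ".xml",
}


def is_flagged_data_file(filename):
    """True when the file's extension is in the data-file set (case-insensitive)."""
    return PurePath(filename).suffix.lower() in DATA_EXTENSIONS
```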

Restriction by country of IP

Starting 2012-03-20, "sre-projects" users are blocked from logging in from outside Canada by the script "login.py" (on Fraser in R:\.scripts), which is called by the login script "project.bat" provided by the Active Directory server Gilbert. Old-style "sre" users have been subject to "login.py" for longer. Happily, the record of VPN connections on Cabot shows no connections from outside Canada (except a few where the virtual IP corresponds to VPN groups that have no access to SRE, just to CHSPR and other department servers).

See Software-sys-maintenance#Group_Policy for the MS GPO aspects of restricting SRE Windows login based on the IP address of the VPN connection.

  • Command "query session /server:sre13" will show remote desktop sessions.
  • login.py first logs one line to \\srefiles\sre\.login\logins.txt, then various directory timestamps
    • if the session on SRE is not from Canada, it spawns off.bat in parallel ("os.spawn(os.P_NOWAIT...)"), which first waits 5 seconds (ping -n 5 127.0.0.1 >nul) to give time to read the popup message, then issues an instant disconnect (tsdiscon), so that if the following "logoff" offers to cancel logoff because quitting is too slow, nobody will see the prompt. The session lasts another 5 seconds in disconnected mode before disappearing.
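
The flow described above (log first, then disconnect non-Canadian sessions without waiting) can be sketched as follows. This is not the real login.py: how the country is determined is not described here, so it is passed in as a parameter, the log line format is invented, and the disconnect step is an injected callable standing in for the spawn of off.bat:

```python
import datetime


def handle_login(user, country_code, log_path, disconnect):
    """Append one log line, then call disconnect() unless the session is from Canada.

    Returns True when the login is allowed.
    """
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    # Log before deciding, so blocked attempts are recorded too.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{stamp} {user} {country_code}\n")
    if country_code != "CA":
        # In the real script this step spawns off.bat without waiting
        # (delay, tsdiscon, logoff); here it is a hypothetical callable.
        disconnect()
        return False
    return True
```

Passing the disconnect action in as a callable keeps the sketch testable away from a Windows session.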

Samba on Fraser should also restrict access to outside shares based on IP.

  • Easy to do: Block if VPN connection is for Student username.
  • Hard to do: Block if outside IP of VPN connection is tagged by GeoIP on Cabot as outside Canada.

Tracking connections on SRE

  • https://pds.popdata.bc.ca/radm/sreinfo shows users based on Gilbert script checking current connections on every SRE machine every few seconds
  • Command "query session /server:sre13" will show remote desktop sessions.
  • "Remote Desktop Services Manager" (run as Administrator on Gilbert) shows login times, idle time; can disconnect or logout users.

Restrict transfer of archives

  • 2015-11-17 - We need to request that researchers not transfer archives via yellowfolders - instead they should transfer files, then archive them after transfer.
    • First step is to inform and request.
      • See text below.
    • Second step is to enforce.
      • In yellowfolder.pl, move archive extensions from "warn" to "block", with an exemption for "ZIP" archives based on filename extension.
        • .XLSX (spreadsheets created by SAS not recognized as Excel) ; .XPS (MS equivalent of PDF); .EGP (SAS Enterprise Guide Project - zip of SAS code and results)
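
One reading of the enforcement step, sketched in Python rather than the Perl of yellowfolder.pl: a file whose content starts with the ZIP signature is blocked unless its filename extension is on the exemption list above. Only the ZIP format is handled (7Z, RAR and others would need their own signatures), and the function and constant names are hypothetical:

```python
from pathlib import PurePath

ZIP_MAGIC = b"PK\x03\x04"  # local file header signature at the start of most ZIP files
# ZIP-based formats exempted by filename extension, per the list above.
EXEMPT_ZIP_EXTENSIONS = {".xlsx", ".xps", ".egp"}


def archive_decision(filename, first_bytes):
    """Return 'block' for ZIP content unless the extension is exempt, else 'pass'."""
    if first_bytes.startswith(ZIP_MAGIC):
        if PurePath(filename).suffix.lower() in EXEMPT_ZIP_EXTENSIONS:
            return "pass"
        return "block"
    return "pass"
```

Checking content rather than the name alone means a ZIP renamed to an unlisted extension is still blocked. A ZIP renamed to .xlsx would still pass, so exempted files remain subject to the other inspections.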

Restrict transfer of archives - text

Dear SRE researchers
 We will soon be blocking the transfer of archives (ZIP, 7Z, RAR ...) in and out of the SRE.
 Data Stewards require PopData to inspect all transfers in and out of the SRE and block "data".
 We implement this through a mix of automated inspection in the "YellowFolder" transfer process, and manual inspection.
 The automation includes a number of criteria (size, content or name of files ...) that trigger a message to staff
 to inspect the file.  In some cases the transfer is blocked, with a pop-up message
 "Transfer blocked for further inspection.  Do not re-submit.  Contact sre@popdata.bc.ca".
 
 Archives allow you to combine and/or compress files into a single archive file.  
 Examples of archiving software: windows compressed folder (with "Z" in icon); winzip; 7zip; winrar. 
 Unfortunately inspecting the contents of archives is labour-intensive.
 Some SRE users combine their files into archives for convenience, and this can be accommodated
 with our suggestions below.  Some use it to evade restrictions on size or content.
 
 Change in researcher's workflow to accommodate this change:
 Instead of creating an archive before transfer, please first copy or move the 
 files and folders to your EXPORT_FROM_SRE or IMPORT_TO_SRE folder, trigger the transfer, 
 and then combine them into an archive on the other side.
 If you suspect the transfer will be blocked for inspection, email sre@popdata.bc.ca requesting
 a manual transfer.
 You can think of our inspection as similar to customs or security inspections, 
 and our objection to "wrapping" files into an archive as similar to requiring 
 objects in luggage to be unwrapped.

libmagic "file" content identification

The source code for libmagic file header descriptions is extremely cryptic; the format is described in man magic(5).

  • 2017-09-05 security update: verify as apt-get -qq update; apt-cache policy file libmagic1 and see https://security-tracker.debian.org/tracker/CVE-2017-1000249
    export PATH="/usr/sbin:/sbin:$PATH"; apt-get install file
    Includes libmagic-mgc (no real change) & libmagic1; libfile-libmagic-perl not affected; debian<9 (stretch) not affected. Updated on Cortereal, Drake, Noyon, Verendrye.