BioSense 2.0. User Community Extension Project. Getting Started With The Data Lockers. Information Contributed by:

(1)

BioSense 2.0

User Community Extension Project

Getting Started With The Data Lockers

Information Contributed by:

Harold Gil, County of San Diego, Public Health Services Edward Castagna, Maine Centers for Disease Control and Prevention Robby Beum, Denver Health and Hospital Authority

(2)

Contents

Purpose……….….3

Background.………..4

Introducing The Data Lockers……….……….6

➢

Raw Data Locker……….………….6

➢

Binned Data Locker……….….6

➢

Binning Algorithm………....6

Getting Access To The Data Lockers………...8

Exporting Data From The Lockers…………..………10

➢

phpMyAdmin………...10

➢

FrontEnd……….17

➢

RStudio Server….………20

R Script Examples..….…….……..……….30

➢

Data Quality Metrics………...31

➢

Mapping Syndromes to Select Fields………..33

Automating R Script Runs………..34

(3)

Purpose

Our intention in writing this guide is to empower our colleagues with the information necessary to smooth the transition from BioSense technical novice to expert, enhancing their capacity to solve BioSense-related problems with solutions that will benefit many jurisdictions.

(4)

Background

The BioSense 2.0 Program is a nationwide collaborative system with the aim of

protecting and improving population health by providing timely information for better decision-making. BioSense supports this aim by improving situational-awareness of syndromes within and across jurisdictions.

A core component of BioSense is a cloud-enabled, web-based platform that provides several surveillance capabilities. BioSense is continually evolving to meet the needs of public health officials.

Several public health agencies (PHAs) are already participating in BioSense. The rate at which PHAs are joining the BioSense community is rapidly increasing. We can expect that in the near future, BioSense will become even more valuable in supporting syndromic surveillance activities across the nation.

As the BioSense community grows, more syndromic surveillance obstacles will arise that will require a strongly skilled and informed user-base in order to effectively tackle.

This guide contains essential information to improve users’ abilities to conceive and implement solutions to BioSense-related problems.

A focus of this guide is on the implementation of solutions with a statistical web application installed on the BioSense servers, RStudio Server. The decision to use emphasize the utility of the RStudio Server application was not trivial. The

overwhelming majority of PHAs employ Statistical Analysis System (SAS) for their surveillance operations. SAS solution implementations would, therefore, be much easier and faster for PHAs to adopt.

However, the technical features of the BioSense lockers are such that connection to the lockers is not possible using a local installation of any analytics application (SAS, SPSS, Stata, R, etc) on a Windows or Macintosh operating system (OS) machine. Connection to the BioSense lockers from a local machine can only be achieved if that machine has a Linux OS. The BioSense Redesign Team provides a way for users to circumvent this issue by accessing and connecting to the lockers via the phpMyAdmin and RStudio Server applications, both of which are installed on the BioSense servers. However, only with RStudio Server can users set up an automated process by which to export data from the

(5)

This guide will teach R routines necessary for working with the BioSense lockers. Users looking to further their R expertise can check out the resources below. SAS is not

covered in this guide since most users are already familiar with it or have local resources for it available at their respective PHAs.

It is likely that most PHAs will begin designing and implementing hybrid solutions employing both RStudio Server and SAS.

R Learning Resources

● Data Camp (https://www.datacamp.com/): Take the free “Introduction to R” lessons. Data Camp’s courses are deployed on a unique platform that provide lesson information alongside R’s command line interface so that you can learn by doing. No previous programming experience is necessary for “Introduction to R”. Follow-up “Introduction to R” with Data Camp’s “Data Analysis and Statistical Inference” labs.

● Coursera’s R Programming (https://www.coursera.org/course/rprog): Another free introductory course for R.

● Quick R (http://www.statmethods.net/): This site serves as a great resource once you are familiar with R (beyond Data Camp’s “Introduction to R” lessons). Snippets of R code for all sorts of data management or analysis tasks are organized here.

(6)

Introducing The Data Lockers

The “BioSense lockers” are MySQL databases hosted on the Amazon GovCloud servers. The BioSense servers are actually Amazon GovCloud servers.

Each BioSense locker contains one or many MySQL database tables.

There are several BioSense lockers, but the two most popular are: the “raw data locker” and the “binned data locker”.

Raw Data Locker

The raw data locker contains data that has been submitted to BioSense and been filtered before making into this locker. The raw locker is a valuable source of information, in large part because one of the tables in this database possesses a “chief complaint” field (a data field that is commonly mined by PHAs).

The raw locker contains two tables: the raw data table and the exceptions table. The former contains the raw data records that have passed the filter criteria, the latter contains descriptions of records that have not passed through the filter.

Binned Data Locker

The binned data locker is populated by records from the raw data locker that have undergone additional processing. As mentioned above, raw data records possess a chief complaint field. The binned data records, on the other hand, possess a field called

“binvalue”. The binvalue field is the syndrome/sub-syndrome category that one or more raw data records fell into based on the information from four raw data fields: diagnosis code, diagnosis text, procedure code, and chief complaint. A raw record might be assigned a binned record with no binvalue at all. Multiple raw records within a 48 hour period that fall under a single syndrome/sub-syndrome will show up as a single binned record in the binned data locker.

The Binning Algorithm

(7)

webpage below.

The binning algorithm is described in detail on this webpage: https://biosen.se/syndrome-info.php

For more information regarding the BioSense lockers (architecture and function), please see the excellent webinar hosted by the BioSense Redesign Team, “Understanding the BioSense 2.0 Fat Pipe Architecture” (http://vimeo.com/70436781). You should watch the webinar now and then again after reading the Exporting Data From The Lockers section.

(8)

Getting Access To The Data Lockers

Obtaining access to data in your jurisdiction’s BioSense lockers is a quick and simple routine.

Before we discuss how to obtain access, please note that there are two access account types: “jurisdiction account” and “personal account”.

The jurisdiction account was the first account type to be offered. Previously, when a jurisdiction would begin participating in BioSense, a BioSense data steward role and a jurisdiction account were assigned to someone at the jurisdiction’s PHA. Surveillance practitioners at this jurisdiction would share the same jurisdiction account.

The jurisdiction account was the norm, however, the assigning of personal accounts to individual users is now becoming common as it follows best practice for security.

If you do not have a personal account (or a jurisdiction account), we recommend that you contact the BioSense Redesign Team per the directions below to obtain one.

Send an email to the BioSense Redesign Team ([email protected]) with the following message:

“Hello,

My name is __[name]___ and I serve as __[role]__ for __[jurisdiction]__. I would like to obtain access to all of my jurisdiction’s BioSense lockers.

[Signature]”

You will then receive this access in the form of login credentials within 2 business days.

Now you should have access to your jurisdiction’s BioSense lockers. This immediately affords you access through phpMyAdmin (to be discussed next).

(9)

Send an email to the BioSense Redesign Team ([email protected]) with the following message:

“Hello,

My name is __[name]___ and I serve as __[role]__ for __[jurisdiction]__. I would like to obtain access to the RStudio Server application on the BioSense servers.

[Signature]”

You will then receive this access in the form of additional login credentials within 2 business days.

Please note that we are specifically referring to “the RStudio Server application installed

on the BioSense servers”. We are not referring to a local installation of R or RStudio that

can be accomplished by downloading these applications yourself from the web. As mentioned previously, these local installations cannot connect to the BioSense data lockers due to an incompatibility of technical features.

(10)

Exporting Data From The Lockers

phpMyAdmin

The BioSense lockers for individual jurisdictions can be accessed at:

https://adm.biosen.se/dbadmin/index.php. This interface is typically referred to as the “back-end” (in contrast to the “front-end” which will be discussed in the next section).

(11)

Login using your jurisdictional or personal username/password combination. You will need to choose which locker you want to view.

● The raw data locker is data3.biosen.se

● The binned data locker is adm2.biosen.se

In this example we will login to the raw data locker, data3.biosen.se. Successfully logging in displays:

(12)

The raw data locker has two tables.

● Raw_Data_Table contains the raw data records

● Exceptions_Table contains the exception data You can click on either of these to view that specific table.

(13)

(14)

A query window will pop up. You can perform any valid SQL query by submitting the query through this window.

(15)

For instance, to export all data with Admit_Date_Time values that correspond to the month of August in the year 2010, I enter the following query into the query window:

select * from Raw_Data_Table where Admit_Date_Time>='2010-08-01 00:00:00' and Admit_Date_Time<='2010-08-31 23:59:59'

Note: Your jurisdiction’s raw data table will not be named ‘Raw_Data_Table’. You will have to substitute ‘Raw_Data_Table’ in the query above with your jurisdiction’s corresponding raw data table name.

For other queries, please consult a local staff resource that is experienced with SQL or resources available online.

Note: Unfortunately, phpMyAdmin has a 5 minute logout associated with it. Thus, if the query takes more than 5 minutes (and many times they do for a number of reasons) and you have not interacted with the phpMyAdmin application within the logout time, you will be logged out, your query will terminate, and you will have to start over. Even if your query finishes, if you have not carried out the export process before being logged out, you will lose your query results and have to start over again. Due to these issues, we recommend using RStudio Server to export data. With RStudio Server, your logout time is much longer, and even if you do get logged out, your query will continue to run and the results will be saved for when you return later.

(16)

An available option after completion of the query is to export the retrieved query result.

After clicking on the Export button, you are presented with a set of export options. To create a CSV file, click on “CSV” from the Format drop-down menu.

(17)

Then press the “Go” button to download the data from this query as a CSV file.

(18)

Front-End

The front-end application can be accessed at: https://biosen.se/login.php. The front-end is a user-friendly interface for performing analytics and exporting line-level data. The catch with the front-end is that it only works with binned data and not raw data.

Nevertheless, the front-end can be a very useful tool for situational awareness depending on the quality of the data that goes into your lockers. Since the front-end application is highly intuitive, this section will only show screenshots of what the application looks like.

(19)

After logging in, the user will be able to search a standard BioSense 2.0 syndrome by jurisdiction and date range. The jurisdictions that a user can view data for is limited by the sharing permissions that the jurisdiction’s BioSense 2.0 data steward has enabled.

(20)

After pressing the “GO” button, the search parameters will be processed and graph of incident numbers will be plotted by

If you have been allowed line level data access by your data steward, you’ll get the green “Export Line Level Data” button on your screen which will allow you to export the data from the front-end matching your search parameters.

(21)

RStudio Server

The RStudio Server application installed on the BioSense servers can be accessed at: https://adm.biosen.se/rstudio/auth-sign-in.

Note: The RStudio Server application is limited to work only on the Google Chrome or Firefox browsers (Internet Explorer is not supported).

(22)

Both the raw and binned data lockers can be accessed through this application. The major drawback, however, is that RStudio Server employs a command line interface as opposed to a graphical user interface. This means that to execute commands with this application, the user needs to enter the appropriate code (specifically R code) in the command

(23)

The good news, however, is that scripts to accomplish certain common tasks have already been developed and are freely available in the UCEP’s “Useful R Scripts” Google folder: https://drive.google.com/folderview?id=0B9v3HiiWbFkpcmlCMlJta2FzeHM&usp=sharin g.

Among these is a script called “BioSense_Data_Export_Script.R”. Download this script as we will now demonstrate how to run it within the RStudio Server application. These directions for running a script in RStudio Server will actually work for any script.

(24)

Note: You might want to take a look at the other scripts in the UCEP’s “Useful R Scripts” folder as they could carry out important tasks that you are interested in (data quality metrics, linking raw and binned records, etc.).

Now that you have downloaded “BioSense_Data_Export_Script.R” and saved it in some directory on your computer, go back to the RStudio Server application. You will now upload the script to your RStudio Server workspace.

Click on the “Upload” button in the RStudio Server application.

(25)

You will notice that the name of the script is next to the “Choose File” button. At this point, click on the “OK” button.

The script is now uploaded to the RStudio Server application. You will see it in the “Files” window in the lower right corner. There are a few ways to run this script now that it is uploaded. We’ll demonstrate one of them. Click on “BioSense_Data_Export_Script.R”.

(26)

The contents of the file will now be displayed in the upper left corner.

Scroll down through the top of the file. You will see directions on how to enter some necessary parameters (for instance, your BioSense login username, password, and the number of previous days of data you want to export).

(27)

Follow these directions carefully and enter your parameters.

After you’ve entered your parameters. Click on the “Source on Save” checkbox and then on the “Save” button (floppy disk icon).

(28)

The script will now run with the parameters you’ve specified. The script will take quite a while to run depending on how much data you’re exporting (around the order of a few hours). However, a benefit of using RStudio Server to export data is that your script will continue to run to completion even if you close the application.

(29)

When the script has finished running, your data file will be available in the “Files” window.

To export this data file (or any file in the “Files” window), click on the checkbox next to it. Then click on “More” which will reveal more options.

(30)

The additional options under the “More” dropdown menu will include an “Export…” option.

You will be prompted regarding your export action. Click on the “Download” button to save the file to your computer.

(31)

Note: If you ever want to run a script that has already been uploaded, you can simply click on the script, which should be in the “Files” window. Then, after the file opens up in a new sub-window in the upper left corner of the screen, click on the “Source” button (not “Run”) at the top right of the sub-window.

R Script Examples

In this section, you will learn how to obtain some preliminary and pre-defined information regarding the quality of the data in your jurisdiction’s lockers. We will apply two R scripts that have been developed for this purpose.

Similar to the R script we applied in the previous section

(“BioSense_Data_Export_Script.R”), these two data quality scripts are freely available in the UCEP’s “Useful R Scripts” Google folder:

https://drive.google.com/folderview?id=0B9v3HiiWbFkpcmlCMlJta2FzeHM&usp=sharin g. These data quality scripts are named “UCEP-BioSense_DQ-5_12_14.R” and

“Syndrome_Mapping-5_12_14.R”. This section will explain how these scripts work and what the information in the script output means.

The information that is discussed in this section is covered in much greater detail in a BioSense 2.0 webinar, hosted by ISDS and the BioSense Redesign Team, titled “Data Quality Checks and Assurance”:

(32)

http://www.syndromic.org/resources/isds-webinars/upcoming-recent/772-biosense2-webinar-dataquality. Note that the video for this webinar unsyncs from the audio during the second portion of the webinar.

Data Quality Metrics

The “UCEP-BioSense_DQ-5_12_14.R” script was designed to provide the user with measures of completeness for data. The file is also simple to use, requiring only the user’s username, password, and the number of previous days of data to take into account (the email parameter is optional and may not work anyway because of spam blockers). Run the file in RStudio Server. After the script is finished running, it will produce an HTML file and an image file. When you open the HTML file, the image file will be embedded within the HTML file. The HTML file will display measures of completeness and other important information for your jurisdiction’s lockers. A screenshot of this HTML file (and embedded image) is displayed below.

(33)

In greater detail, the information shown on the HTML file conveys:

● Date-time range used in the analysis (time frame)

● Total number of raw records with unique patient id’s

● Total number and percentage of raw records with unique patient id’s that were assigned a binned record with a non-missing binvalue

● Percentage of complete data for a set of variables and by hospital facility

● Most recent date for which a raw record was created in your locker for a hospital facility

High-level overview of how the script works: A copy of all data in both the raw and binned lockers from the past X days is made. The raw and binned records are

(34)

binned records when both the raw and binned record share the same unique patient id and visit date. A list of every binned record that matches with each raw record is created. This list is used to calculate the number and percentage of raw records that were

assigned a non-missing binvalue. The data is then stratified by hospital and the percentage of raw records with complete entries for each of a set of variables is

calculated. This script (and the following one) can be modified and expanded fairly easily by anyone who is familiar with the R programming language to encapsulate more

features.

Mapping Syndromes to Select Fields

The “Syndrome_Mapping-DQ-5_12_14.R” script was designed to show users what

BioSense syndromes (or “binvalues”) their raw records were assigned (according to the BioSense binning algorithm). The file requires the user to input only their username, password, and the number of previous days of data to take into account (the email

parameter is optional and may not work anyway because of spam blockers). After the file is finished running, a CSV file named “Syndrome_Mapping.csv” is created. The CSV file lists by row the raw record information used in the binning process (chief complaint, diagnosis code, diagnosis text, procedure code) and the assigned BioSense syndrome

(35)

(rows with missing binvalues, i.e. binvalue=NA, appear last). A screenshot of the top portion of this CSV file is displayed below.

High-level overview of how the script works: A copy of all data in both the raw and binned lockers from the past X days is made. The raw and binned records are

deduplicated on unique patient id. Raw records are matched with their corresponding binned records when both the raw and binned record share the same unique patient id and visit date. A list of every binned record that matches with each raw record is created. A CSV file named “Syndrome_Mapping.csv”, showing the raw record information used in the binning process (chief complaint, diagnosis code, diagnosis text, procedure code) and the assigned BioSense syndrome from the associated binned record, is created.

Automating R Script Runs

This part gets tricky and can be difficult to understand and unless you’re technically savvy. It might be best if you work with your IT personal to help you set this up.

Since the BioSense servers are Unix-based, in order to start automating the previous ‘R’ jobs to run at a scheduled/designated time, you’ll need to set-up SSH public-key

authentication that lets you use one identity (i.e. one password) to connect to the BioSense SSH server.

(36)

You do this by setting up and sending the BioSense team a public-key for authentication after generating an SSH key pair. Directions can be found here:

https://hkn.eecs.berkeley.edu/~dhsu/ssh_public_key_howto.html

The “Putty Configuration” screen shots can be found below. This allows the connection to the BioSense unix servers using the SSH Key Pair to access the BioSense Unix

environment and set up/schedule the CRON job to automatically run the ‘R’ programs at the designated times.

Open putty.exe and save a session that points to the adm.biosense.se server.

Set the path to where the SSH key has been placed. Since BioSense has the key and you have the key, the two systems can safely communicate to each other.

(37)

Once signed in, you will create the program that contains the code to run the three ‘R’ jobs.

1. vi your_new_program_name.txt (‘vi’ is the command in unix to open the unix editor)

2. Type/copy the following into the vi editor. The ‘mailto’ and the directory structure path will be different per individual.

[email protected] 45 15 * * * Rscript /home/username/R/SourceCode/Pull_Binned_Data.R 55 15 * * * Rscript /home/username/R/SourceCode/Pull_Exceptions_Data.R 05 16 * * * Rscript /home/username/R/SourceCode/Pull_Raw_Data.R

3. The previous code does the following: emails the individual when it runs the three jobs and then runs the jobs where the first 2 numbers represent the run time [minutes-(45) and hour-(15)] . In other words, the first ‘R’ program runs at 3:45PM EST, the second at 3:55PM EST and the third at 4:05PM EST. You might need to play with the times.

(38)

Example of vi editor for the file called “DH_rjob.txt”

Download Output to Your Local Site

As was done in the previous step (i.e. Schedule Jobs Using SSH Keys, Unix Commands and Cron), you’ll need to set-up another SSH public-key authentication that lets you use one identity (i.e. one password) to connect to the BioSense SSH server. Once this second set of SSH keys are set up you can continue on to setting up the sftp code.

Add the next line of code to run the program that takes the output created from the running of the three ‘R’ jobs and sftps the output to the desired location.

(39)

Below is an example code for DH_SFTP_Script.sh. Of course the directory paths and server information will be different for each site.

#!/bin/bash user="cdc_biosense_adm" host="XXX.X.XXX.XX" datetime=$(date +%y%m%d%H.%M.%S) thedate=$(date +%y%m%d) ##cd to directory containing files to download ##copy *.csv to upload echo $datetime ": Copying files to upload directory" cp ~/R/Output/*.csv ~/upload; ##check if files were copied if ls ~/upload/* > /dev/null ; then##Transfer files echo $datetime ": Successfully copied files to upload directory" cd ~/upload echo $datetime ": connecting to remote server. Will try to upload files on successful connection."

(40)

sftp $user@$host<<EOF put *.csv ls bye EOF echo $datetime ": Moving files to archive at " ~/archive/$thedate mkdir ~/archive/$thedate mv b ~/upload/*.csv ~/archive/$thedate/ if ls ~/archive/$thedate/* > /dev/null ; then echo $datetime ": Successfully moved files to archive at " ~/archive/$thedate echo $datetime ": Cleaning ~/R/Output/ folder" rm ~/R/Output/*.csv if ls ~/R/Output/*.csv > /dev/null ; then echo $datetime ": Failed to delete ~/R/Output/*.csv" else echo $datetime ": Successfully deleted ~/R/Output/*.csv" fi else echo $datetime ": Failed to move files to archive at " ~/archive/$thedate fi else echo $datetime ": No files in ~/R/Output/*.csv to upload " fi echo $datetime ": Sending email to [email protected] " echo "" #mail s "Biosense cloud file tranfer: "$datetime " [email protected] " < ~/Logs/out$(date +%y%m%d).log mail s "Biosense cloud file tranfer: "$datetime " [email protected] " < ~/Logs/out.log cat ~/Logs/out.log >> ~/Logs/out$(date +%y%m%d).log

(https://www.datacamp.com/)

(https://www.coursera.org/course/rprog)

(http://www.statmethods.net/)

https://biosen.se/index.php,

https://biosen.se/syndrome-info.php

(http://vimeo.com/70436781)

https://adm.biosen.se/dbadmin/index.php.

: https://biosen.se/login.php.

https://adm.biosen.se/rstudio/auth-sign-in.

https://drive.google.com/folderview?id=0B9v3HiiWbFkpcmlCMlJta2FzeHM&usp=sharing.

: https://hkn.eecs.berkeley.edu/~dhsu/ssh_public_key_howto.html