Lowering the barrier to connect scientific data to LabKey Server
Increased flexibility in routing data
Cooperating with other systems
Giving users more options with their data
Historical look at LabKey Server connectivity
Focus on some recent changes (REDCap, FreezerPro, ETL)
Future directions
LabKey Software focused on Proteomics
CPAS server processing MS2 runs through the data pipeline
Data uploaded through the browser and results saved to the database
Pipeline tasks could parse specific data formats
Analysis of Flow data
FCS files processed via the pipeline
2005 Connectivity Summary
[Diagram: Data Pipeline, Form Entry, LabKey Server, Java Module]
Collaboration with SCHARP on the Atlas Portal
Many data types associated with HIV/AIDS research
Lots of study and assay data
CRF and specimen data imported through the pipeline
Assay data consisted of machine generated data files
Assay framework and GPAT
Imports data from spreadsheets or tab-separated text files
No built-in specialized analysis or visualizations
Needed many custom applications for Atlas
Java modules were complex to build and maintain
Build custom applications without the module overhead
LabKey APIs & Simple Modules
Lowered the extensibility barrier
Insert, update, delete programmatically
Module based assays allowed easy entry into the assay framework
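As an illustration of what programmatic insert, update, and delete can look like, here is a minimal sketch that only builds the HTTP request for a row-level change and sends nothing. The action names (query-insertRows.api and so on), the URL layout, the server address, and the "lists"/"Samples" names are assumptions made for this example, not details taken from the slides.

```python
import json
from urllib.parse import quote

# Assumed action names, modeled on LabKey's query controller.
ACTIONS = {
    "insert": "query-insertRows.api",
    "update": "query-updateRows.api",
    "delete": "query-deleteRows.api",
}


def build_row_request(base_url, container_path, action, schema_name, query_name, rows):
    """Return (url, json_body) for a row-level change; nothing is sent."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    # Folder-level granularity: the container path is part of the URL.
    path = quote(container_path.strip("/"))
    url = f"{base_url.rstrip('/')}/{path}/{ACTIONS[action]}"
    body = json.dumps(
        {"schemaName": schema_name, "queryName": query_name, "rows": rows}
    )
    return url, body


# Hypothetical server, folder, and list names.
url, body = build_row_request(
    "https://labkey.example.org",
    "/MyProject/MyFolder",
    "insert",
    "lists",
    "Samples",
    [{"SampleId": "S-001", "Volume": 1.5}],
)
print(url)
print(body)
```

In practice one of the client libraries would handle authentication and sending; the sketch only shows that each operation reduces to a schema name, a query name, and a list of rows.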
Lists
Create tables in LabKey Server and integrate with existing data
Easily import file based data through the browser
Tools to infer fields from files
2008 Connectivity Summary
[Diagram: Data Pipeline, Form Entry, File upload, LabKey Server, Java Module, Simple Module]
Support for connecting to data sources not in the LabKey Server schema
Relocating the data is no longer required
LabKey Server security could be applied
Editing of external table through the LabKey Server UI can be enabled
Supported data sources:
SAS, PostgreSQL, Microsoft SQL Server, Oracle, MySQL
2008-2009 External Schemas
LabKey Software continues to refine APIs
Additional language bindings for Perl and Python
Polish module based tools
Remote connections
LabKey Server as an external data source
Connectivity through the LabKey Server API
Folder level granularity
2012 Connectivity Summary
[Diagram: Data Pipeline, Form Entry, File upload, Client API, LabKey Server, Java Module, Simple Module, External SQL Data Sources]
REDCap
Web application for building and managing online surveys and databases
Developed and distributed by Vanderbilt University
Popular in the academic and research community for designing clinical and translational research databases
International Center of Excellence for Malaria Research (ICEMR) at the University of Washington
Demographic and clinical data in REDCap
Wanted their REDCap data integrated into their LabKey Server
Visualizations
Queries
Integration with experimental data
Data needed to be synchronized from REDCap to the LabKey Server
REDCap API allowed programmatic and secure access to the projects of interest
Data is extracted and saved in a format that can be imported into a LabKey Server study
Scheduled automatic import
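The extract-and-save step described above can be sketched roughly as follows. The form fields follow the general shape of a REDCap record export request; the token, the sample payload, and the choice of tab-separated output are assumptions for illustration, and the sketch does not contact any server.

```python
import csv
import io
import json
from urllib.parse import urlencode


def build_export_params(token):
    """Form fields for a REDCap record export (sent as an HTTP POST)."""
    return {"token": token, "content": "record", "format": "json", "type": "flat"}


def records_to_tsv(records_json):
    """Convert a JSON array of flat records into tab-separated text."""
    records = json.loads(records_json)
    if not records:
        return ""
    # Column order is taken from the first record.
    columns = list(records[0].keys())
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Hypothetical payload standing in for a real API response.
sample = '[{"record_id": "1", "age": "34"}, {"record_id": "2", "age": "41"}]'
print(urlencode(build_export_params("HYPOTHETICAL_TOKEN")))
print(records_to_tsv(sample))
```

A scheduled job would run this export at a fixed interval and hand the resulting file to the study import.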
FreezerPro
Commercial web application for frozen specimen inventory management
Supports various sample types
Tracks location and availability of specimens
Allows user defined fields
Users can create custom reports and export data
Novo Nordisk Type 1 Diabetes Research Center
Uses FreezerPro to manage their research specimens
Needed their specimen inventory integrated into LabKey Server
Combine with experimental data
Queries
API access to the remote FreezerPro server
LabKey Server encrypts the FreezerPro credentials in secure storage
Inventory information is imported directly into LabKey Server
Uses the data pipeline
Study specimen repository
Users control field mapping, filtering, and the synchronization schedule
ETL stands for extract, transform, and load
Developed as part of HIDRA (Hutch Integrated Data Repository & Archive)
Goals of building a LabKey Server ETL Framework
Provenance
Understanding the origin of the data, knowing when and how it got there
Auditing
Security
Built on top of Pipelines
Functionality
Query based ETLs
Stored procedures
Remote Sources
Checkers (identify whether work is to be done)
Scheduling
Logging output
2013-2014 ETL Framework
ETLs are module based
An ETL consists of a set of Transform Steps
Key components of a transform
Source table or query
Destination table
Filter strategy
Identifies which rows to transform and whether there is any work to do
Filter Strategies
Choose which rows to move to target table
Select all
Just get all the data, every time
Last modified
Rows with a date/time column newer than last run
Records most recent value
Run filter
Checks a specified column, especially an incrementing integer column
Any rows with higher value than last time are transformed
Useful for rows written by previous ETLs
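The three filter strategies above can be sketched as plain functions over in-memory rows. This is a simplified illustration, not the framework's implementation: the function names, the column names, and the `state` dictionary that remembers the last seen value are all hypothetical.

```python
from datetime import datetime


def select_all(rows, state):
    """Select all: return every source row on every run."""
    return list(rows)


def last_modified(rows, state, column="modified"):
    """Return rows whose timestamp is newer than the value recorded last run."""
    since = state.get("last_modified")
    picked = [r for r in rows if since is None or r[column] > since]
    if picked:
        # Record the most recent value for the next run.
        state["last_modified"] = max(r[column] for r in picked)
    return picked


def run_filter(rows, state, column="run_id"):
    """Return rows whose run value is higher than the one seen last time."""
    last = state.get("last_run")
    picked = [r for r in rows if last is None or r[column] > last]
    if picked:
        state["last_run"] = max(r[column] for r in picked)
    return picked


source = [
    {"id": 1, "modified": datetime(2014, 1, 1), "run_id": 1},
    {"id": 2, "modified": datetime(2014, 2, 1), "run_id": 2},
]
state = {}
first = last_modified(source, state)   # both rows: no previous run recorded
second = last_modified(source, state)  # no rows: nothing newer, so no work to do
print(len(first), len(second))
```

An empty result doubles as the "is there work to do" check: when a strategy selects no rows, the transform can be skipped.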
Target Options
How to add data to target table
truncate
Delete all rows in the target table, then add the selected ones
append
Add new rows to the target table
Fails if a new row duplicates an existing primary key
merge
Update or Insert
Matches Primary Keys
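To make the difference between the three target options concrete, here is a small sketch that treats the target table as a dictionary keyed by primary key. The `load` function and the `id` key column are hypothetical names chosen for the example; a real target would be a database table.

```python
def load(target, rows, option, key="id"):
    """Apply rows to target (a dict keyed by primary key) using a target option."""
    if option == "truncate":
        # Delete everything, then add the selected rows.
        target.clear()
        for row in rows:
            target[row[key]] = dict(row)
    elif option == "append":
        # Check first so a failure leaves the target unchanged.
        duplicates = [row[key] for row in rows if row[key] in target]
        if duplicates:
            raise ValueError(f"duplicate primary keys: {duplicates}")
        for row in rows:
            target[row[key]] = dict(row)
    elif option == "merge":
        # Update rows whose key already exists, insert the rest.
        for row in rows:
            target.setdefault(row[key], {}).update(row)
    else:
        raise ValueError(f"unknown target option: {option}")
    return target


target = {1: {"id": 1, "name": "old"}}
incoming = [{"id": 1, "name": "new"}, {"id": 2, "name": "second"}]

load(target, incoming, "merge")
print(target[1]["name"], sorted(target))
```

With the same starting table, append would raise an error because key 1 already exists, while truncate would discard the old row before loading both incoming rows.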
Schedule Options
When to run the transform
Poll option
Check at a defined interval
Cron option
Can be used to check at a particular time of day
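The two schedule options can be compared by computing when the next check would happen. This sketch is a deliberate simplification: `next_poll` and `next_daily` are hypothetical helpers, and a fixed time of day stands in for a full cron expression, which supports many more fields.

```python
from datetime import datetime, timedelta


def next_poll(last_check, interval):
    """Poll option: the next check is a fixed interval after the last one."""
    return last_check + interval


def next_daily(now, hour, minute=0):
    """Cron-like option: the next occurrence of a given time of day."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # That time has already passed today, so use tomorrow.
        candidate += timedelta(days=1)
    return candidate


now = datetime(2014, 5, 1, 3, 30)
print(next_poll(now, timedelta(hours=1)))
print(next_daily(now, hour=2))
print(next_daily(now, hour=4))
```

Polling suits sources that change unpredictably, while a fixed time of day suits loads that should run outside working hours.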
Connectivity Summary
[Diagram: Data Pipeline, Form Entry, File upload, Client API, LabKey Server, Java Module, Simple Module, External SQL Data Sources]
Other connection strategies LabKey is investigating
DatStat
Online data and study management software
i2b2
Informatics framework that enables clinical researchers to use existing clinical data for discovery research
Caisis
Open-source cancer data management system
Any questions?
Karl Lum