Data Management Tools: practical
approaches and lessons
learned when scaling up a computing and
data environment to keep
Declaration of Potential Conflicts-of-Interest,
Consulting and Corporate Collaborators
Scientific research is now data driven,
analysis costs now exceed data generation
costs, most data goes unanalyzed…
and the most competitive institutes will be
those that embrace informatics as a
Computations in the life and medical
sciences are unique I
Emphasis on symbolic/integer (non-floating point) intense computations (yet with floating point capabilities).
Diverse types of computations that are continuously evolving, and demand different hardware, software and compute environment configurations
Emphasis on a mix of computing technologies (microprocessor, GPGPU, FPGA) with an objective to build the capability to optimize different codes (or code parts) on the different platforms for maximum performance in a pipeline.
Emphasis on scalable, sustainable hardware/software/environmental architectures and installations (including support staff).
Emphasis on data intense computations, so significant storage and
bandwidth to/from the HPC installation and to/from storage to processors is essential.
Computations in the life and medical
sciences are unique II
Emphasis on providing answers through very simple web interfaces
(especially for biomedical researchers, health care providers, or applications that appeal to lay people) by creating or porting applications that demand real-time HPC intense resources. Life and Medical Scientists want to solve cancer, not become programmers.
Installation would have readily available en masse data from major public and local databases and a semantic web approach to gathering and
accessing the larger data and knowledge bases that are available and essential to extract new knowledge. Typical datasets, such as NextGen Human Genome Sequences and Medical images are TB in size, and some projects are in the PB size.
Informatics involves hardware, software,
and expertise – because scientists want
answers – should be thought of as an
integrated mix that is continuously
evolving, and complementary, not
competitive
Computing hardware (local, centralized/supercomputer centers, cloud)
Computing software (many, varied, continuously evolving, few standards; the best software becomes proprietary; comparisons of different implementations are biased; and, most importantly, there is little funding for sustaining software)
Data analysts (many flavors – programmers, informaticians, bioinformaticians, statisticians, clinical informaticians, anthropologists….)
Local computing is primarily done on a
machine we developed: SHADOWFAX
A heterogeneous computing environment for data intensive computations
~2,524 CPUs, > 12TB RAM (Dell/Intel)
~27,000 GPUs (nVidia)
8 FPGA hybrid core systems (Convey)
~0.8 PB Disk Arrays (DDN)
100 PB Sun/Oracle tape storage system
Local computing is primarily done on a
machine we developed: SHADOWFAX
With local synchronized copies of major databases:
Medline, arXiv, PubMed Central, Genbank, SwissProt, 1000
Genomes Project, The Cancer Genome Atlas, Wikipedia
Designed to meet the needs of applications that demand HPC:
deep sequencing assembly and analysis, molecular modeling, simulations, proteomics analysis, text mining, Health IT
“Deals” with vendors greatly controlled cost
Data Analysis Core (DAC) provides turnkey
study design, monitoring and analysis
Projects are diverse
• ~80 projects completed in 2 years
• Genome assembly focuses on non-human and especially challenging genomes
• Turkey, bacteria, insects (butterfly), fish
• Genome variation discovery and annotation projects
• RNAseq
• Multiple projects ranging from binary comparisons to multifactor time studies
• miRNA expression and discovery
• SNP population studies
• Metagenomic studies
Co-authored papers where contributions are made to the science
• 9 published
• 4 submitted
• ~10 currently in draft
Core personnel directly
participate in grant
applications
• USDA grant submitted (PI)
• 7 grants submitted as coPI
A data intense example…
NextGen DNA sequence analysis is now
the rate limiting step
• The cost of sequencing has dropped from $3B/genome to ~$1K/genome.
• New genomes are sequenced daily.
• It is estimated that there are 30,000 human genomes complete, with 15,000 of these in the public domain.
• Analysis has focused on Single Nucleotide Polymorphisms (“SNPs”), which are single letter changes in the DNA code.
• For complex diseases like cancer, heart disease and mental disorders, extensive work still explains only 10-20% of the known genetic component.
• Recent research indicates that, due to experimental measurement noise, perhaps most of the measured variations are false positives.
• Data analysis pipelines are built from a number of standard tools.
• There are many public and proprietary analysis pipelines, and their performance accuracy is highly contested.
• “Truth Data” is just beginning to be assembled.
Different types of DNA
Microsatellites, or repetitive DNA
sequences are particularly challenging
• Microsatellites, also called Simple Sequence Repeats or Short Tandem Repeats, are an understudied portion of the genome because they are considered part of our “Junk DNA” or, more recently, “Dark Matter” DNA; research focus has been on Single Nucleotide Polymorphisms (“SNPs”)
• Microsatellites have known value: they have long been used for paternity and forensic testing and are linked to neurological diseases (e.g. Huntington’s and Fragile-X)
• None of the major genomic research projects have focused on microsatellites: not the Human Genome Project, 1000 Genomes Project, The Cancer Genome Atlas, ENCODE or the iCOGS study.
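To make the term concrete, here is a minimal Python sketch of what a Short Tandem Repeat looks like in sequence data: a short unit (1 to 6 bases here) repeated back to back. The function name, the unit-length range and the minimum repeat count are illustrative assumptions; this is a naive regular-expression scan, not the proprietary genotyping algorithm mentioned in these slides.

```python
import re

def find_microsatellites(seq, min_unit=1, max_unit=6, min_repeats=3):
    """Return (start, unit, repeat_count) for each perfect tandem repeat in seq.

    Illustrative only: finds exact, uninterrupted repeats and ignores
    sequencing error, which real genotyping has to deal with.
    """
    # A captured unit followed by at least (min_repeats - 1) more copies of it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits

# Toy sequence containing four back-to-back copies of "AC".
print(find_microsatellites("TTGACACACACAGG"))  # [(3, 'AC', 4)]
```

The repeat count at a locus is the quantity that varies between individuals, which is why these loci are used in forensic and paternity testing.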
Microsatellite myths dispelled, enabling
new discoveries
• Myth 1: Accurate and efficient analysis of the ~1 million Microsatellites is not
possible.
• Microsatellite genotypes in the 1000 Genomes Project and The Cancer Genome Atlas were shown to be only 20% accurate¹; a new proprietary algorithm is 96% accurate
• Myth 2: Microsatellites are hyper-variable, and will therefore not be useable in
genotype-phenotype association studies
• Analysis of 1,200 “healthy” genomes demonstrated that 98% of the
~150,000 microsatellites in genes are highly invariant
• Myth 3: Heritable and spontaneous components of disease will be explained by
SNPs.
• Recent iCOGS study involving over 200,000 subjects demonstrated that
known and new SNPs explain less than 50% of heritability in breast, ovarian and prostate cancer
Research Pipeline
Download and rebuild thousands of “healthy” and
“affected” genomes
Create genotype distributions for “healthy” and “affected” populations
Compute Fisher’s Exact Test p-value for each of ~1 million loci and rank results
Identify “Patterns of Informative Microsatellites”
(PIM) from loci that pass Bonferroni and Benjamini–Hochberg False Discovery
Rate tests
Annotate with ontologies, literature, input from
experts
Business analysis; product definition; IP
Validate PIM with sequencing of well-characterized samples
Publish; translate, regulatory approval, reimbursement; team with established clinical services co.
Manually review, do QC, compute sensitivity and specificity
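The statistical steps of the pipeline above (a Fisher's exact test per locus, ranking, then Bonferroni and Benjamini–Hochberg filtering) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: each locus is reduced to a 2x2 table of counts, the locus names and counts are made up for illustration, and the function names are hypothetical. It is not the pipeline's actual code.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def bonferroni(pvalues, alpha=0.05):
    """True where a p-value passes the Bonferroni-corrected threshold."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def benjamini_hochberg(pvalues, alpha=0.05):
    """True where a p-value is accepted by the Benjamini-Hochberg FDR procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            cutoff = rank  # largest rank whose p-value is under its threshold
    passed = [False] * m
    for i in order[:cutoff]:
        passed[i] = True
    return passed

# Hypothetical counts per locus: (affected with variant genotype, affected without,
# healthy with variant genotype, healthy without).
tables = {
    "locus_A": (30, 10, 8, 32),
    "locus_B": (12, 28, 11, 29),
    "locus_C": (25, 15, 14, 26),
}
names = list(tables)
pvals = [fisher_exact_two_sided(*tables[n]) for n in names]
bonf, bh = bonferroni(pvals), benjamini_hochberg(pvals)
for p, name, pass_bonf, pass_bh in sorted(zip(pvals, names, bonf, bh)):
    print(f"{name}: p={p:.3g} bonferroni={pass_bonf} bh={pass_bh}")
```

With ~1 million loci the Bonferroni threshold at alpha = 0.05 is about 5e-8, so the less conservative Benjamini–Hochberg procedure will generally admit more loci.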
Genomeon has created a unique library of
over 8000 genomes from 1000 Genomes
Project and The Cancer Genome Atlas with
corrected microsatellites
• “Healthy Population” representing many ethnicities
• Ovarian cancer
• Breast cancer
• Brain cancer: Glioma; Glioblastoma; Medulloblastoma
• Lung cancer
• Prostate cancer
• Melanoma
Comparative analysis has yielded new
actionable clinical diagnostics and drug
targets for cancer,
Pattern of 55 informative microsatellites
differentiates Breast Cancer germlines from
“healthy” germlines
Sensitivity = 88% Specificity = 77%
BRCA1/2 positive samples
Genes proximate to 55 BC Informative Loci
• 52 loci are in genes, 3 loci are intergenic
• Of the 52 loci, 1 is in an exon, 4 are in untranslated regions, and the rest are intronic, located very close to the intron-exon boundary. Many of the genes are known to be alternatively spliced and are differentially expressed, both of which imply mechanism
• Ontologies: notch signaling, genome stability, alternative splicing, programmed cell death, cell cycle and apoptosis
• 32 of the 52 genes previously associated with cancer, 18 with breast cancer
• Several genes are known and highly pursued drug targets; new targets include several kinase and membrane bound proteins. 11 of the 52 genes are targets of or affected by pharmaceuticals, including 5 that are prescribed or in clinical trials for BC.
Applications of these microsatellite loci
variations
Cancer Risk Diagnostics – Microsatellite profiling for increased risk of cancer, and the tissues at highest risk
Companion/Treatment Diagnostics – Many informative microsatellites are functional elements implicated in therapeutic response
Clinical Trial Support – Use of microsatellite profile to differentiate sub-populations in clinical trials
Drug Targets – Identification of a large number of genes previously unassociated with cancer, many with functions associated with cancer processes
Toxicology – Quantification of stress induced exposures/stressors via microsatellite mutation screen
Prognosis – Comparison of microsatellite variations between germlines and tumors
Another data intense example…
Text analytics to quantify publication ethics
violations and fraud
…but let’s talk about that and the fallout
later.
Lessons Learned…..
Informatics has become a critical bottleneck,
is evolving quickly, is expensive and requires
continuous investment ($, people,
recognition….), but it is here to stay and is
required to be competitive.
A few things to keep in mind
• Grow and evolve
  • Systems (hardware/cloud, software and people) should be obtained in smaller, diverse chunks (covering jobs with high memory requirements, fast database access, intense parallel message passing) and grown as demand grows, to take advantage of Moore’s Law, changing requirements, and vendor competition
  • Systems should and can be operational on day one
  • Provide for public-facing real time web services
• Verification and retention
  • Data AND analysis history must be verified and retained – will be required and will make one more competitive
• Restrictions
  • Access to public databases is variable, for example TCGA cannot be downloaded/analyzed in the cloud, and there are minimum systems/personnel required for access
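One simple way to approach the "verification and retention" point is to keep a checksum manifest alongside each analysis run. The sketch below is an illustration under assumptions, not a description of the system used here: the function names and the manifest layout are hypothetical, and it records only file checksums plus the command that was run.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(paths, command):
    """Record the analysis command and a checksum for every input/output file."""
    return {
        "command": command,
        "files": {str(p): sha256_of(p) for p in paths},
    }

def verify_manifest(manifest):
    """Return the paths that are missing or whose contents have changed."""
    return [
        path
        for path, digest in manifest["files"].items()
        if not Path(path).exists() or sha256_of(path) != digest
    ]
```

A manifest built when an analysis finishes can be stored with the results (for example as JSON) and re-checked with verify_manifest before the data is reused or delivered; an empty list means nothing has changed.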
A few things to keep in mind
• Security
  • Server security via multiple layers, limited access, invisibility
  • Collaboratoriums are hard to secure, most times simple solutions are best (Google drive)
  • Varying and changing requirements of institutes, governments, projects
• Liability
  • Material Transfer Agreements now involve data and are getting more complex
  • Release of data, software, etc.
• Uncertainty
  • Changing demand, fluctuating funding, and impact of ‘breakthroughs’
  • HIPAA
  • Clinical data access may not be possible even behind their firewalls, driven by fear of loss of control, “discovering” an adverse event or comparisons across practices
  • Access to data
• Commercialization/translation
  • Patentability, proprietary/trade secret
  • The world’s best bioinformatics company is worth …. The world’s worst
Cloud computing is not yet the answer to
computing in the life and medical sciences
Locality/Dependencies – Where is the data and what about data that must be merged from many sources?
Compute match – Some jobs require non-standard hardware configurations for performance: some genomic assemblies require 2+TB of memory, some
simulations require extremely high data exchange/update rates
Bandwidth – Getting the data to the cloud from local sources can be limiting, as can moving data from cloud to cloud.
Cost – The initial cost is low, but the sustained cost can be high, and in academic settings, funding to support work beyond 3 years is very difficult to obtain
Security – HIPAA compliant clouds exist, but there are issues with acceptance
Storage – Costs are still high for sustained storage. Known amounts of local storage drives scientists to be economical in experimental design.
Unknowns – What happens when the cloud goes down? What happens if a supplier goes out of business? What if……
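The bandwidth point can be made concrete with a back-of-the-envelope calculation. The sketch below estimates how long it takes to move a dataset over a network link; the link speeds and the 70% efficiency factor are illustrative assumptions, not measurements, and the dataset sizes simply reflect the TB-to-PB range mentioned earlier.

```python
def transfer_days(size_tb, link_gbps, efficiency=0.7):
    """Days needed to move size_tb (decimal terabytes) over a link_gbps link.

    efficiency is an assumed fraction of the nominal link rate that is
    actually achieved for sustained transfers.
    """
    bits = size_tb * 1e12 * 8                       # terabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400

# Hypothetical scenarios: 10 TB and 1 PB datasets over 1 and 10 Gbit/s links.
for size_tb in (10, 1000):
    for link_gbps in (1, 10):
        days = transfer_days(size_tb, link_gbps)
        print(f"{size_tb:>5} TB over {link_gbps:>2} Gbit/s: {days:6.1f} days")
```

Under these assumptions a PB-scale project needs on the order of weeks of sustained transfer even on a fast link, which is why data locality matters as much as compute.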
One possible solution…..
Create and support critical mass sized
“entities” that span AIRI members, so that
members together take advantage of scale
Discipline-specific informatics “entities”:
“condo computing” organization where
members buy in and excess capacity is
available to new/unfunded researchers
• Mix of compute technologies and bulk purchasing
• Best of class software, algorithms and data warehouses
• Automated pipelines
• Data analysts as independent researchers and as “collaborators”
• Complete data analysis solutions – computing, statistics, experimental design, data monitoring/archiving, data and analysis reproducibility validation/checking
• Data and analysis delivery portals (required by funders and journals)
• Critical mass so all needed expertise and infrastructure is available and