Data Management Tools: practical approaches and lessons learned when scaling up a computing and data environment to keep up with the pace of data

(1)


(2)

Declaration of Potential Conflicts-of-Interest,

Consulting and Corporate Collaborators

(3)

Scientific research is now data driven, analysis costs now exceed data generation costs, most data goes unanalyzed… and the most competitive institutes will be those that embrace informatics as a

(4)

Computations in the life and medical sciences are unique I

• Emphasis on symbolic/integer (non-floating point) intense computations (yet with floating point capabilities).

• Diverse types of computations that are continuously evolving, and demand different hardware, software and compute environment configurations

• Emphasis on a mix of computing technologies (microprocessor, GPGPU, FPGA) with an objective to build the capability to optimize different codes (or code parts) on the different platforms for maximum performance in a pipeline.

• Emphasis on scalable, sustainable hardware/software/environmental architectures and installations (including support staff).

• Emphasis on data intense computations, so significant storage and bandwidth to/from the HPC installation and to/from storage to processors is essential.

(5)

Computations in the life and medical sciences are unique II

• Emphasis on providing answers through very simple web interfaces (especially for biomedical researchers, health care providers, or applications that appeal to lay people) by creating or porting applications that demand real-time HPC intense resources. Life and medical scientists want to solve cancer, not become programmers.

• The installation would have data from major public and local databases readily available en masse, and a semantic web approach to gathering and accessing the larger data and knowledge bases that are available and essential to extracting new knowledge. Typical datasets, such as NextGen human genome sequences and medical images, are TB in size, and some projects reach PB size.

(6)

Informatics involves hardware, software, and expertise; because scientists want answers, it should be thought of as an integrated mix that is continuously evolving and complementary, not competitive

• Computing hardware (local, centralized/supercomputer centers, cloud)

• Computing software (many, varied, continuously evolving, few standards, best software becomes proprietary, comparisons of different implementations are biased, and, most importantly, there is little funding for sustaining software)

• Data analysts (many flavors – programmers, informaticians, bioinformaticians, statisticians, clinical informaticians, anthropologists, …)

(7)

Local computing is primarily done on a machine we developed: SHADOWFAX

• A heterogeneous computing environment for data intensive computations

• ~2,524 CPUs, > 12 TB RAM (Dell/Intel)

• ~27,000 GPUs (NVIDIA)

• 8 FPGA hybrid core systems (Convey)

• ~0.8 PB Disk Arrays (DDN)

• 100 PB Sun/Oracle tape storage system

(8)

Local computing is primarily done on a machine we developed: SHADOWFAX

• With local synchronized copies of major databases:

  • Medline, arXiv, PubMed Central, Genbank, SwissProt, 1000 Genomes Project, The Cancer Genome Atlas, Wikipedia

• Designed to meet the needs of applications that demand HPC:

  • deep sequencing assembly and analysis, molecular modeling, simulations, proteomics analysis, text mining, Health IT

• “Deals” with vendors greatly controlled cost

(9)

Data Analysis Core (DAC) provides turnkey study design, monitoring and analysis

Projects are diverse

• ~80 projects completed in 2 years

• Genome assembly focuses on non-human and especially challenging genomes

• Turkey, bacteria, insects (butterfly), fish

• Genome variation discovery and annotation projects

• RNAseq

• Multiple projects ranging from binary comparisons to multifactor time studies

• miRNA expression and discovery

• SNP population studies

• Metagenomic studies

Co-authored papers where contributions are made to the science

• 9 published

• 4 submitted

• ~10 currently in draft

Core personnel directly participate in grant applications

• USDA grant submitted (PI)

• 7 grants submitted as coPI

(10)

A data intense example…

(11)

NextGen DNA sequence analysis is now the rate limiting step

• The cost of sequencing has dropped from $3B/genome to ~$1K/genome.

• New genomes are sequenced daily.

• An estimated 30,000 human genomes are complete, with 15,000 of these in the public domain.

• Analysis has focused on Single Nucleotide Polymorphisms (“SNPs”), which are single letter changes in the DNA code.

• For complex diseases like cancer, heart disease and mental disorders, extensive work still explains only 10-20% of the known genetic component.

• Recent research indicates that, due to experimental measurement noise, perhaps most of the measured variations are false positives.

• Data analysis pipelines are built from a number of standard tools.

• There are many public and proprietary analysis pipelines, and their performance accuracy is highly contested.

• “Truth Data” is just beginning to be assembled. Different types of DNA

(12)

Microsatellites, or repetitive DNA sequences, are particularly challenging

• Microsatellites, also called Simple Sequence Repeats or Short Tandem Repeats, are an understudied portion of the genome because they are considered part of our “Junk DNA” or, more recently, “Dark Matter” DNA; research focus has been on Single Nucleotide Polymorphisms (“SNPs”)

• Microsatellites have known value: long used for paternity and forensic testing and linked to neurological diseases (e.g. Huntington’s and Fragile-X)

• None of the major genomic research projects has focused on microsatellites: not the Human Genome Project, 1000 Genomes Project, The Cancer Genome Atlas, ENCODE or the iCOGS study.
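
To make the term concrete, here is a minimal sketch of how short tandem repeats can be located in a DNA string with a regular expression. It is only an illustration, not the proprietary algorithm mentioned in this talk; the function name `find_microsatellites` and the thresholds `max_unit` and `min_repeats` are assumptions chosen for the example.

```python
import re


def find_microsatellites(sequence, max_unit=6, min_repeats=3):
    """Return (start, unit, copies) for each tandem repeat in the sequence.

    max_unit and min_repeats are illustrative thresholds, not values
    taken from the presentation.
    """
    # Group 1 is the repeat unit (1..max_unit bases, shortest unit first);
    # the backreference demands at least min_repeats - 1 further copies.
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    # A made-up sequence containing four copies of the dinucleotide "AC".
    print(find_microsatellites("TTGACACACACAGGT"))
```

Real reads add sequencing errors and imperfect repeats, which this exact-match sketch does not handle.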

(13)

Microsatellite myths dispelled, enabling new discoveries

• Myth 1: Accurate and efficient analysis of the ~1 million Microsatellites is not possible.

  • Microsatellite genotypes in 1000 Genomes Project and The Cancer Genome Atlas demonstrated to be only 20% accurate¹; new proprietary algorithm is 96% accurate

• Myth 2: Microsatellites are hyper-variable, and will therefore not be useable in genotype-phenotype association studies

  • Analysis of 1,200 “healthy” genomes demonstrated that 98% of the ~150,000 microsatellites in genes are highly invariant

• Myth 3: Heritable and spontaneous components of disease will be explained by SNPs.

  • Recent iCOGS study involving over 200,000 subjects demonstrated that known and new SNPs explain less than 50% of heritability in breast, ovarian and prostate cancer

(14)

Research Pipeline

Download and rebuild thousands of “healthy” and “affected” genomes

Create genotype distributions for “healthy” and “affected” populations

Compute Fisher’s exact test p-value for each of ~1 million loci and rank results

Identify “Patterns of Informative Microsatellites” (PIM) from loci that pass Bonferroni and Benjamini–Hochberg False Discovery Rate tests

Annotate with ontologies, literature, input from experts

Business analysis; product definition; IP

Validate PIM with sequencing of well-characterized samples

Publish; translate, regulatory approval, reimbursement; team with established clinical services co.

Manually review, do QC, compute sensitivity and specificity
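
The statistical steps in the pipeline above (a Fisher's exact test per locus, then Bonferroni and Benjamini–Hochberg filtering) can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline's actual code: it assumes each locus has been collapsed to a 2x2 table (for example, common vs. other genotype by healthy vs. affected), which the slides do not specify, and all function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probability of every table no more likely than the observed one.
    p = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p)


def bonferroni_pass(p_values, alpha=0.05):
    """True for each p-value at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_pass(p_values, alpha=0.05):
    """True for each p-value accepted by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= k/m * alpha.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    passed = [False] * m
    for i in order[:cutoff]:
        passed[i] = True
    return passed
```

With ~1 million loci the Bonferroni threshold at alpha = 0.05 is 5e-8 per locus, which is why the less conservative false discovery rate test is usually reported alongside it.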

(15)

Genomeon has created a unique library of over 8000 genomes from 1000 Genomes Project and The Cancer Genome Atlas with corrected microsatellites

• “Healthy Population” representing many ethnicities

• Ovarian cancer

• Breast cancer

• Brain cancer: Glioma; Glioblastoma; Medulloblastoma

• Lung cancer

• Prostate cancer

• Melanoma

(16)

Comparative analysis has yielded new actionable clinical diagnostics and drug targets for cancer,

(17)

Pattern of 55 informative microsatellites differentiates Breast Cancer germlines from “healthy” germlines

Sensitivity = 88%

Specificity = 77%

BRCA1/2 positive samples
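
For readers less familiar with the two figures above, a short sketch of how sensitivity and specificity are computed from a confusion matrix. The counts in the example are hypothetical, chosen only so the arithmetic lands on the reported percentages; they are not the study's sample counts.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of affected samples correctly flagged
    specificity = tn / (tn + fp)  # share of healthy samples correctly cleared
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical counts for illustration only (100 affected, 100 healthy).
    sens, spec = sensitivity_specificity(tp=88, fn=12, tn=77, fp=23)
    print(f"sensitivity={sens:.0%} specificity={spec:.0%}")
```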

(18)

Genes proximate to 55 BC Informative Loci

• 52 loci are in genes, 3 loci are intergenic

• Of the 52 loci, 1 is in an exon, 4 are in untranslated regions, and the rest are intronic, located very close to the intron-exon boundary. Many of the genes are known to be alternatively spliced and are differentially expressed, both of which imply a mechanism

• Ontologies: notch signaling, genome stability, alternative splicing, programmed cell death, cell cycle and apoptosis

• 32 of the 52 genes were previously associated with cancer, 18 with breast cancer

• Several genes are known and highly pursued drug targets; new targets include several kinase and membrane bound proteins. 11 of the 52 genes are targets of or affected by pharmaceuticals, including 5 that are prescribed or in clinical trials for BC.

(19)

Applications of these microsatellite loci variations

Cancer Risk Diagnostics – Microsatellite profiling for increased risk of cancer, and the tissues at highest risk

Companion/Treatment Diagnostics - Many informative microsatellites are functional elements implicated in therapeutic response

Clinical Trial Support - Use of microsatellite profile to differentiate sub-populations in clinical trials

Drug Targets - Identification of large number of genes previously unassociated with cancer - many with functions associated with cancer processes

Toxicology - Quantification of stress induced exposures/stressors via microsatellite mutation screen

Prognosis - Comparison of microsatellite variations between germlines and tumors

(20)

Another data intense example…

Text analytics to quantify publication ethics violations and fraud

…but let’s talk about that and the fallout later.

(21)

Lessons Learned…..

Informatics has become a critical bottleneck, is evolving quickly, is expensive and requires continuous investment ($, people, recognition…), but it is here to stay and is required to be competitive.

(22)

A few things to keep in mind

• Grow and evolve

  • Systems (hardware/cloud, software and people) should be obtained in smaller, diverse chunks (covering jobs with high memory requirements, fast database access, or intense parallel message passing) and grown as demand grows, to take advantage of Moore’s Law, changing requirements, and vendor competition

  • Systems should and can be operational on day one

  • Provide for public-facing real time web services

• Verification and retention

  • Data AND analysis history must be verified and retained; this will be required and will make one more competitive

• Restrictions

  • Access to public databases is variable, for example TCGA cannot be downloaded/analyzed in the cloud, and there are minimum systems/personnel required for access
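
One simple way to approach the verification and retention point above is a checksum manifest stored alongside the data. The sketch below is an assumption about how this could be done, not a description of the tooling used on SHADOWFAX; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its SHA-256 checksum."""
    root = Path(data_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(data_dir, manifest):
    """Return the recorded paths that are now missing or have changed."""
    current = build_manifest(data_dir)
    return sorted(
        name for name in manifest if current.get(name) != manifest[name]
    )
```

Retaining the manifest together with the analysis scripts and their versions gives a record that can be re-checked later when funders or journals ask for it.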

(23)

A few things to keep in mind

• Security

  • Server security via multiple layers, limited access, invisibility

  • Collaboratoriums are hard to secure; most times simple solutions are best (Google Drive)

  • Varying and changing requirements of institutes, governments, projects

• Liability

  • Material Transfer Agreements now involve data and are getting more complex

  • Release of data, software, etc.

• Uncertainty

  • Changing demand, fluctuating funding, and impact of ‘breakthroughs’

  • HIPAA

  • Clinical data access may not be possible even behind their firewalls, driven by fear of loss of control, “discovering” an adverse event or comparisons across practices

  • Access to data

• Commercialization/translation

  • Patentability, proprietary/trade secret

  • The world’s best bioinformatics company is worth …. The world’s worst

(24)

Cloud computing is not yet the answer to computing in the life and medical sciences

• Locality/Dependencies – Where is the data, and what about data that must be merged from many sources?

• Compute match – Some jobs require non-standard hardware configurations for performance: some genomic assemblies require 2+ TB of memory, some simulations require extremely high data exchange/update rates

• Bandwidth – Getting the data to the cloud from local sources can be limiting, as can cases where data is moved from cloud to cloud.

• Cost – The initial cost is low, but the sustained cost can be high, and in academic settings, funding to support work beyond 3 years is very difficult to obtain

• Security – There are HIPAA compliant clouds, but there are issues with acceptance

• Storage – Costs are still high for sustained storage. Known amounts of local storage drive scientists to be economical in experimental design.

• Unknowns – What happens when the cloud goes down? What happens if a supplier goes out of business? What if…
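
The bandwidth point above is easy to quantify. A back-of-the-envelope sketch follows, assuming decimal units (1 TB = 10^12 bytes) and an ideal, fully used link; the link speeds in the example are assumptions for illustration, not measurements from this installation.

```python
TB = 10**12  # bytes, decimal terabyte
PB = 10**15  # bytes, decimal petabyte


def transfer_days(size_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to move size_bytes over a link of the given speed.

    efficiency is the fraction of nominal bandwidth actually achieved
    (1.0 = ideal); real transfers are usually slower.
    """
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400


if __name__ == "__main__":
    # Assumed link speeds: 1 Gbit/s and 10 Gbit/s.
    print(f"1 TB over 1 Gbit/s:  {transfer_days(TB, 1e9) * 24:.1f} hours")
    print(f"1 PB over 10 Gbit/s: {transfer_days(PB, 10e9):.1f} days")
```

Even on an ideal 10 Gbit/s link, a PB-scale project takes more than a week to move, before any cloud-to-cloud transfers are counted.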

(25)

One possible solution…..

Create and support critical mass sized “entities” that span AIRI members, so that members together take advantage of scale

(26)

Discipline-specific informatics “entities”: “condo computing” organization where members buy in and excess capacity is available to new/unfunded researchers

• Mix of compute technologies and bulk purchasing

• Best of class software, algorithms and data warehouses

• Automated pipelines

• Data analysts as independent researchers and as “collaborators”

• Complete data analysis solutions – computing, statistics, experimental design, data monitoring/archiving, data and analysis reproducibility validation/checking

• Data and analysis delivery portals (required by funders and journals)

• Critical mass so all needed expertise and infrastructure is available and

(27)