Big Data: Challenges and Opportunities

(1)

NGWI & USDA/ARS Meeting

USDA Carver Center

April 16, 2014

Doreen Ware

Acting Chief Science Information Officer

(2)

Big Data: Challenges and Response

• Volume

• Velocity

• Variety

• Value

• Complexity

• Human Resources

• Community Building

• Knowledge Management

• Standards

• Policies

• Network

• Storage

• Compute

• Innovation

Biology is an information science

Big Data in Agriculture: an emerging discipline that uses genomic and phenotype information to support accelerated breeding strategies and directed engineering, and integrates environmental genomic and climate data to support improvement of yields

(3)

Sensors & Metadata

Sequencers, Microscopy, Imaging, Mass spec, Metadata & Ontologies

IO Systems

Hard drives, Networking, Databases, Compression, LIMS

Compute Systems

CPU, GPU, Distributed, Clouds, Workflows

Scalable Algorithms

Streaming, Sampling, Indexing, Parallel

Machine Learning

Classification, modeling, visualization & data integration

Results

Domain Knowledge

“Big Data” Biology Pyramid

Quantitative Biology Technologies

(4)

Biological Sensor Network

The rise of a digital immune system

Schatz MC, Phillippy AM (2012) GigaScience 1:4

(@ewanbirney)

(@latimes)

Small labs are now large data generators.

All scientists need to manage their own data & make it accessible to others.

This is a non-trivial engineering objective!

(5)

Data Production & Collection

Expect massive growth of sequencing, imaging, mass spec, and other biological sensor data based on technology and automation over the next 10 years

• Exascale biology is certain, as is mobile streaming of data

• Germplasm developmental traits, field-based phenotyping using robotics (increases the granularity of the phenotyping)

• Molecular & physiological phenotyping (omics, metabolites, infrared)

• Requires careful consideration of the “preciousness” of the sample

• Compression helps, but need to aggressively review the data lifecycle

• Need to capture metadata associated with the sample prior to or at the time of data generation

Major data producers are concentrated in universities, agricultural & pharmaceutical companies, research institutes, germplasm centers, & farmers

• Major efforts in agriculture, bioenergy, climate & human health

• The variety of the data requires the development of standard descriptors

• Coordination across areas of plant science

• Coordination domestically & internationally

(6)

Lack of Standards, or of Their Adoption, Is One of the Major Limitations

• Nomenclature, standard formats, controlled vocabulary

– Human and machine

– Germplasm identifiers

– Variation identifiers

• Adoption of common reference data sets

– Reference assemblies: multiple assemblies & versioning

– Populations

– Variation data sets

• Standard workflows to support reproducibility

– Support for versions of data

– Support for translations between assay types

Anna McClung, Brian Scheffler, Angela Baldo, Jeremy Edwards
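
The standards items above (germplasm identifiers, versioned reference assemblies, controlled descriptors) can be illustrated with a minimal sketch of a sample record that is checked before data are accepted. Everything here is an assumption made for illustration: the field names, the PREFIX:ACCESSION identifier pattern and the example values are hypothetical and are not taken from any ARS or community standard.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical identifier pattern: a prefix, a colon, and an accession string.
GERMPLASM_ID = re.compile(r"^[A-Za-z]+:[A-Za-z0-9_.-]+$")


@dataclass(frozen=True)
class SampleRecord:
    sample_id: str
    germplasm_id: str        # e.g. "GENEBANK:ACC_0001" (made-up value)
    reference_assembly: str  # name of the reference the data are aligned to
    assembly_version: str    # explicit version, so results stay reproducible
    assay_type: str          # e.g. "RNA-Seq"


def validate(record: SampleRecord) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if not record.sample_id.strip():
        problems.append("sample_id is empty")
    if not GERMPLASM_ID.match(record.germplasm_id):
        problems.append("germplasm_id does not match PREFIX:ACCESSION")
    if not record.assembly_version.strip():
        problems.append("assembly_version is missing")
    return problems


good = SampleRecord("S1", "GENEBANK:ACC_0001", "ExampleRef", "v2", "RNA-Seq")
bad = SampleRecord("S2", "no identifier", "ExampleRef", "", "RNA-Seq")
print(validate(good))
print(validate(bad))
```

Checking records at submission time is one way to capture metadata at the point of data generation rather than reconstructing it later.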

(7)

[Figure: “Big Data” Biology Pyramid (Quantitative Biology Technologies), repeated from slide 3]

(8)

Information Centers and Science Data Highway

DOE ESnet

http://www.es.net/

NSF XSEDE

https://www.xsede.org/

NSF iPlant

http://www.iplantcollaborative.org/

EU Elixir

http://www.elixir-europe.org/

(9)

Computational Infrastructure for the

Life Sciences

Kbase.us www.gramene.org

(10)

• Annotation of the Loblolly Pine Megagenome (Jill Wegrzyn)

• 20.15 Gb assembly, split into 40 jobs at 216 CPU/job (8640 CPU total), 17 hours

• I-OMAP (Josh Stein)

• 12 rice species (each w/ 12 chromosome pseudomolecules)

• 96 CPU per chromosome (1152 CPU total), 1-2 hr per genome

Genome                 Assembly    Size (Mb)   CPU    Run Time
Arabidopsis thaliana   TAIR10      120         600    2:44
Arabidopsis thaliana   TAIR10      120         1500   1:27
Zea mays               RefGen_v2   2067        2172   2:53

TACC Lonestar Supercomputer 22,656 CPU cores on 1,888 nodes

MAKER-P at iPlant

Reducing 3 weeks to 3 hours

Josh Stein

Genome annotation pipeline
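
The CPU totals and run times on this slide can be checked with a short calculation. This is a sketch under one stated assumption: the Run Time column is read as hours:minutes, which the slide does not say explicitly. The variable names are illustrative only.

```python
def minutes(run_time: str) -> int:
    """Convert an 'h:mm' string to minutes (the h:mm reading is an assumption)."""
    hours, mins = run_time.split(":")
    return int(hours) * 60 + int(mins)


# Totals quoted on the slide.
loblolly_cpus = 40 * 216   # 40 jobs at 216 CPU per job
rice_cpus = 12 * 96        # 12 chromosomes at 96 CPU per chromosome

# The two Arabidopsis thaliana TAIR10 runs from the table.
cpus_a, time_a = 600, minutes("2:44")
cpus_b, time_b = 1500, minutes("1:27")

speedup = time_a / time_b                  # how much faster the larger run is
efficiency = speedup / (cpus_b / cpus_a)   # 1.0 would be perfect scaling
cpu_hours_a = cpus_a * time_a / 60
cpu_hours_b = cpus_b * time_b / 60

print(loblolly_cpus, rice_cpus)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
print(f"CPU-hours: {cpu_hours_a:.0f} vs {cpu_hours_b:.0f}")
```

Under that reading, 2.5 times the cores gives roughly a 1.9x speedup (about 75% parallel efficiency), so the faster run costs more CPU-hours in total: 2175 against 1640.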

(11)

(12)

Long-read sequencing technologies support improved structural gene annotations. Example from maize: the region Chr4:4172270-4180486 has the most isoforms (140).

(13)

(14)

Genome Services

Uniform data formats

http://plants.ensembl.org/info/website/ftp/index.html

Visualization

Tools

RNA-Seq

Reference genomes

INSDC

Community data

Variable standards

History of tool development

Fostering Intercompatibility

(15)

Compute & Algorithmic Challenges

Expect to see many dozens of major informatics centers that consolidate

regional / topical information

• Clouds for Agriculture, Climate, Commodity, Traits

• Need for short-term, long-term & archival resources for many data types beyond sequence

• Science highways to support movement of data

• Standards: metadata, formats, reproducible workflows

• Move the code to the data

Parallel hardware and algorithms are required

• Expect to see >1000 cores in a single computer

• Compute & input/output (IO) need to be considered together

• Rewriting efficient parallel software is complex and expensive

• Many existing bioinformatics tools are not configured for HPC

• New data types and tools will be needed

Applications will shift from individuals to populations and many species, and will need to support emergent data types

• Moving from a single reference genome to population analysis, pan-genomes, and time-series analysis

• Need for network analysis, probabilistic techniques

• Existing solutions do not scale…
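
One family of techniques that does scale to data too large to hold in memory is streaming combined with sampling. The sketch below uses reservoir sampling (Algorithm R), chosen here purely as an illustration and not named in the talk, to keep a uniform random subset of a stream whose length is not known in advance. The `reads` generator is a hypothetical stand-in for sequencing reads arriving one at a time.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                sample[j] = item         # replace with probability k / (i + 1)
    return sample


# Hypothetical stream standing in for sequencing reads arriving one at a time.
reads = (f"read_{n}" for n in range(1_000_000))
subset = reservoir_sample(reads, k=5, seed=42)
print(subset)
```

Memory use depends only on k, not on the size of the stream, which is the property that matters when compute and IO are considered together.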

(16)

[Figure: “Big Data” Biology Pyramid, repeated from slide 3]

Who is a Data Scientist?

http://en.wikipedia.org/wiki/Data_science

(17)

Learning and Translation

Tremendous power from data aggregation

• Observe the dynamics of biological systems

• Need Reproducible workflows and APIs

• Develop predictive models and support directed

design

• Need for training sets to support model development

• Development of intuitive interfaces to support access

to the data

• Mechanisms to support integration of public and

private data

Mindful of the risks

• Generating & aggregating the data is just the beginning; resources are needed to support the interpretation of the data, starting with its quality. Not all data are equal

• The potential for over-fitting grows with the complexity of the data; statistical significance is a statement about the sample size
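
The point about significance and sample size can be made concrete with a small worked example. This is a sketch under simplifying assumptions that are not from the talk: a two-sample z-test with known unit variance, equal group sizes, and an observed difference exactly equal to a fixed, arbitrary `effect`.

```python
import math


def two_sample_z_pvalue(effect: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means of `effect` standard
    deviations, with n_per_group observations per group and unit variance."""
    z = effect / math.sqrt(2.0 / n_per_group)
    return math.erfc(abs(z) / math.sqrt(2.0))


effect = 0.02  # a very small, arbitrary difference, in standard deviations
for n in (100, 10_000, 1_000_000):
    print(n, two_sample_z_pvalue(effect, n))
```

The same tiny difference is far from significant at 100 or 10,000 observations per group and overwhelmingly significant at one million, even though its practical importance has not changed.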

(18)

Training & Community building

Training scientists at all levels

• Moving from an observational to a quantitative biology will require skills in statistics to support experimental design & interpretation, information technology for best practices in handling the biological data, and software to support interfaces for visualization

• Cross-disciplinary training is necessary for the next generation of scientists

• Development of virtual training materials to support development

and adoption of the emerging infrastructure to minimize the

emphasis on technology and return it to the biology

Changing the culture

• Move from individual science to group science

• Mechanisms & incentives to foster adoption of standards & sharing of data need to be developed

The foundations of biology will continue to be

observation, experimentation, and interpretation

(19)

Workshop “Towards a GrapeIS”

February 2015, Bordeaux, France

Draft recommendations from the workshop

• Data standards: to support generating human-readable (Web pages) or, better, machine-readable (Web service responses or RDF) output

– Minimal information about experiments

– Controlled vocabularies/ontologies

– Varietal/material identification

– Data exchange formats and data collection

• Alignment to broader initiatives

– Participate in & exploit broader data management initiatives (Research Data Alliance, FOODBALL, IPPN, EPPN, DivSeek, etc.) and infrastructures (iPlant Collaborative, ELIXIR, etc.) to increase sustainability, reduce duplication of efforts and ensure broader impact of the data in the scientific community.

• Training and dissemination

• Leverage the International Grapevine Genome Project consortium (IGGP; www.vitaceae.org) to support coordination & development of a grapevine information system

(20)

Community Building

• Need to change the culture within the ARS

• Move from individual CRIS project, location, to a Virtual Organization

• This will require both bottom-up and top-down approaches

– Working groups to support standards, policy, and training

– Review of the existing policy & incentives

• We need to start now and leverage existing ongoing work within projects.

• Pilot projects (6 months), examples:

– Data: transcriptomes, genetic variation, genome assembly & annotation

– Communities: I5K, NPGS, pathogens, natural resources

• Define a set of use cases emphasizing

– Support for data access and sharing

– Analyses

– Data standards

– Best practices

– Training materials

• Establish teams

– Diversity of locations, species, expertise

• Workshops

– Train the trainers

– 2-3 day workshops targeting specific data types

• Evaluate outcomes and recommendations

USDA AGRICULTURAL RESEARCH

(21)

USDA Database workshop PAG Jan. 2015

Lisa Harper, Taner Sean, Steven Cannon (ARS); 70 data scientists representing 23 groups

Objectives

• Bring scientists together to meet each other

• Outline priority needs facing biological databases

• Determine how to address these needs as a group

• Create opportunities to work together and improve interoperability between resources

• Explore what steps ARS leadership can take to facilitate and support excellence in this expanding area

Outcomes

– Survey of community challenges, needs and expertise

– Commitment to monthly meetings

– 4 breakout groups: each group either has or is developing a communication plan and collaborative pilot projects that can be accomplished in 3-6 months

(22)

ARS/iPlant RNAseq Data Workshop Dec. 7-10

• Data sets from insects, plants, animal health, & aquaculture

• Goals

– Foster competency in HPC resources

– Support trainers with a basic knowledge of analyzing RNA-Seq

– Cultivate support and commitment for a sustainable network

– Review data management needs

• Outcomes

– Network of ARS scientists working on a core data type

– Improved competency in HPC resources

– Establishment of 4 working groups: Tools &