
Big Data: Challenges and Opportunities

NGWI & USDA/ARS Meeting

USDA Carver Center

April 16, 2014

Doreen Ware

Acting Chief Science Information Officer

(2)

Big Data: Challenges and Response

Volume

Velocity

Variety

Value

Complexity

Human Resources

Community Building

Knowledge Management

Standards

Policies

Network

Storage

Compute

Innovation

Biology is an information science

Big Data in Agriculture: an emerging discipline that uses genomic and phenotypic information to accelerate breeding strategies and directed engineering, and integrates environmental, genomic, and climate data to improve yields

(3)

Sensors & Metadata

Sequencers, Microscopy, Imaging, Mass spec, Metadata & Ontologies

IO Systems

Hard drives, Networking, Databases, Compression, LIMS

Compute Systems

CPU, GPU, Distributed, Clouds, Workflows

Scalable Algorithms

Streaming, Sampling, Indexing, Parallel

Machine Learning

classification, modeling, visualization & data Integration

Results

Domain Knowledge

“Big Data” Biology Pyramid

Quantitative Biology Technologies

(4)

Biological Sensor Network

The rise of a digital immune system

Schatz MC, Phillippy AM (2012) GigaScience 1:4

(@ewanbirney)

(@latimes)

Small labs are now large data generators.

All scientists need to manage their own data & make it accessible to others

This is a non-trivial engineering objective!

(5)

Data Production & Collection

Expect massive growth of sequencing, imaging, mass spec, and other biological sensor data based on technology and automation over the next 10 years

• Exascale biology is certain, along with mobile streaming of data
• Germplasm developmental traits; field-based phenotyping using robotics (increases the granularity of the phenotyping)
• Molecular & physiological phenotyping (omics, metabolites, infrared)
• Requires careful consideration of the “preciousness” of the sample
• Compression helps, but the data lifecycle needs aggressive review
• Need to capture metadata associated with the sample prior to or at the time of data generation

Major data producers are concentrated in universities, agricultural & pharmaceutical companies, research institutes, germplasm centers, & farmers
• Major efforts in agriculture, bioenergy, climate & human health
• The variety of the data requires the development of standard descriptors
• Coordination across areas of plant science
• Coordination domestically & internationally

(6)

Lack of Standards, or of Their Adoption, Is One of the Major Limitations

• Nomenclature, standard formats, controlled vocabulary
– Human and machine
– Germplasm identifiers
– Variation identifiers
• Adoption of common reference data sets
– Reference assemblies: multiple assemblies & versioning
– Populations
– Variation data sets
• Standard workflows to support reproducibility
– Support for versions of data
– Support for translations between assay types
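The role of required identifiers and a controlled vocabulary can be illustrated with a minimal sketch that checks a sample record before it is shared. The field names (`germplasm_id`, `reference_assembly`, `assay_type`), the allowed values, and the example identifier are assumptions made for illustration only; they are not a standard proposed in this talk.

```python
# Hypothetical metadata check: required fields plus a small controlled vocabulary.
# Field names and allowed values are illustrative assumptions, not an agreed standard.

REQUIRED_FIELDS = ("germplasm_id", "reference_assembly", "assay_type")

CONTROLLED_VOCABULARY = {
    "reference_assembly": {"TAIR10", "RefGen_v2"},
    "assay_type": {"RNA-Seq", "genotyping", "phenotyping"},
}


def validate_record(record):
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            problems.append("missing required field: " + field)
    for field, allowed in CONTROLLED_VOCABULARY.items():
        value = record.get(field)
        # Only check the vocabulary when a value is present; absence is reported above.
        if value is not None and str(value).strip() and value not in allowed:
            problems.append(field + "=" + repr(value) + " is not in the controlled vocabulary")
    return problems


if __name__ == "__main__":
    good = {"germplasm_id": "EXAMPLE-0001", "reference_assembly": "TAIR10", "assay_type": "RNA-Seq"}
    bad = {"germplasm_id": "", "reference_assembly": "tair10", "assay_type": "RNA-Seq"}
    print(validate_record(good))
    print(validate_record(bad))
```

Running the same check at the time of data generation, rather than at submission, is one way to catch free-text variants such as `tair10` before they spread across data sets.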

Anna McClung, Brian Scheffler, Angela Baldo, Jeremy Edwards

(7)

Sensors & Metadata

Sequencers, Microscopy, Imaging, Mass spec, Metadata & Ontologies

IO Systems

Hard drives, Networking, Databases, Compression, LIMS

Compute Systems

CPU, GPU, Distributed, Clouds, Workflows

Scalable Algorithms

Streaming, Sampling, Indexing, Parallel

Machine Learning

classification, modeling, visualization & data Integration

Results

Domain Knowledge

Quantitative Biology Technologies

(8)

Information Centers and Science Data Highway

DOE ESnet

http://www.es.net/

NSF XSEDE

https://www.xsede.org/

NSF iPlant

http://www.iplantcollaborative.org/

EU Elixir

http://www.elixir-europe.org/

(9)

Computational Infrastructure for the

Life Sciences

Kbase.us www.gramene.org

(10)

Annotation of the Loblolly Pine Megagenome (Jill Wegrzyn)

• 20.15 Gb assembly, split into 40 jobs at 216 CPU/job (8640 CPU total), 17 hours

I-OMAP (Josh Stein)

• 12 rice species (each w/ 12 chromosome pseudomolecules)
• 96 CPU per chromosome (1152 CPU total), 1-2 hr per genome

Genome                 Assembly    Size (Mb)   CPU    Run Time
Arabidopsis thaliana   TAIR10      120         600    2:44
Arabidopsis thaliana   TAIR10      120         1500   1:27
Zea mays               RefGen_v2   2067        2172   2:53

TACC Lonestar Supercomputer: 22,656 CPU cores on 1,888 nodes

MAKER-P at iPlant

Reducing 3 weeks to 3 hours

Josh Stein

Genome annotation pipeline
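The run-time figures on this slide can be turned into CPU-hours and a rough scaling efficiency. The sketch below uses only the numbers shown above and assumes the run times are given as hours:minutes; that unit is not stated on the slide.

```python
# Back-of-the-envelope arithmetic on the MAKER-P run times listed on this slide.
# Assumption: "2:44" means 2 hours 44 minutes (the unit is not stated on the slide).

RUNS = [
    ("Arabidopsis thaliana TAIR10", 600, "2:44"),
    ("Arabidopsis thaliana TAIR10", 1500, "1:27"),
    ("Zea mays RefGen_v2", 2172, "2:53"),
]


def to_minutes(h_mm):
    """Convert an 'H:MM' string into whole minutes."""
    hours, minutes = h_mm.split(":")
    return int(hours) * 60 + int(minutes)


def cpu_hours(cpus, h_mm):
    """Total CPU-hours consumed by a run."""
    return cpus * to_minutes(h_mm) / 60


def scaling_efficiency(base, scaled):
    """Observed speedup divided by the increase in CPU count (1.0 = perfect scaling)."""
    _, base_cpus, base_time = base
    _, scaled_cpus, scaled_time = scaled
    speedup = to_minutes(base_time) / to_minutes(scaled_time)
    return speedup / (scaled_cpus / base_cpus)


# Loblolly pine annotation: 40 jobs x 216 CPU/job for 17 hours.
LOBLOLLY_CPUS = 40 * 216
LOBLOLLY_CPU_HOURS = LOBLOLLY_CPUS * 17

if __name__ == "__main__":
    for name, cpus, run_time in RUNS:
        print(name, cpus, "CPU:", round(cpu_hours(cpus, run_time), 1), "CPU-hours")
    print("Arabidopsis 600 -> 1500 CPU efficiency:",
          round(scaling_efficiency(RUNS[0], RUNS[1]), 3))
    print("Loblolly pine:", LOBLOLLY_CPUS, "CPU,", LOBLOLLY_CPU_HOURS, "CPU-hours")
```

Under that assumption, moving Arabidopsis from 600 to 1500 CPU cuts wall-clock time from 164 to 87 minutes, roughly 75% of the ideal gain, so the shorter run costs more CPU-hours in total.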

(11)
(12)

Long-read sequencing technologies support improved structural gene annotations

Example from maize: Chr4: 4172270-4180486 has the most isoforms (140 isoforms)

(13)
(14)

Genome Services

Uniform data formats

http://plants.ensembl.org/info/website/ftp/index.html

Visualization

Tools

RNA-Seq

Reference genomes

INSDC

Community data

Variable standards

History of tool development

Fostering Intercompatibility

(15)

Compute & Algorithmic Challenges

Expect to see many dozens of major informatics centers that consolidate regional / topical information
• Clouds for agriculture, climate, commodity, traits
• Need for short-term, long-term & archival resources for many data types beyond sequence
• Science highways to support movement of data

Standards: metadata, formats, reproducible workflows

Move the code to the data

Parallel hardware and algorithms are required
• Expect to see >1000 cores in a single computer
• Compute & input/output (IO) need to be considered together
• Rewriting efficient parallel software is complex and expensive
• Many existing bioinformatics tools are not configured for HPC
• New data types and tools will be needed

Applications will shift from individuals to populations and many species, and will need to support emergent data types
• Moving from a single reference genome to population analysis, pan-genome, and time series analysis
• Need for network analysis, probabilistic techniques
• Existing solutions do not scale…
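Why compute and IO have to be considered together on a machine with >1000 cores can be sketched with Amdahl's law, which is not named on the slide but is a standard way to reason about it. The parallel fractions used below are illustrative assumptions, not measurements of any bioinformatics tool.

```python
# Amdahl's law: the serial part of a program (for example, single-threaded IO)
# bounds the speedup no matter how many cores are added.
# The parallel fractions below are illustrative assumptions.


def amdahl_speedup(parallel_fraction, cores):
    """Theoretical speedup on `cores` cores when `parallel_fraction` of the work parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for fraction in (0.50, 0.95, 0.999):
        print("parallel fraction", fraction, "-> speedup on 1000 cores:",
              round(amdahl_speedup(fraction, 1000), 1))
```

Even when 95% of the work parallelizes, 1000 cores give a speedup of under 20, which is one reason tools written for a single workstation rarely benefit from HPC without being restructured.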

(16)

Sensors & Metadata

Sequencers, Microscopy, Imaging, Mass spec, Metadata & Ontologies

IO Systems

Hard drives, Networking, Databases, Compression, LIMS

Compute Systems

CPU, GPU, Distributed, Clouds, Workflows

Scalable Algorithms

Streaming, Sampling, Indexing, Parallel

Machine Learning

classification, modeling, visualization & data Integration

Results

Domain Knowledge

Who is a Data Scientist?

http://en.wikipedia.org/wiki/Data_science

(17)

Learning and Translation

Tremendous power from data aggregation

Observe the dynamics of biological systems

Need reproducible workflows and APIs

Develop predictive models and support directed design

Need for training sets to support model development

Development of intuitive interfaces to support access to the data

Mechanisms to support integration of public and private data

Mindful of the risks

Generating & aggregating the data is just the beginning; resources are needed to support the interpretation of the data, starting with its quality. Not all data is equal

The potential for over-fitting grows with the complexity of the data; statistical significance is a statement about the sample size
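The over-fitting risk can be made concrete with a small simulation: when many unrelated features are screened against a trait measured on few samples, some feature will look strongly correlated purely by chance. The sample size, feature counts, and random seed below are arbitrary choices for illustration.

```python
# Simulation sketch: the best chance correlation grows as more noise features are screened.
# Sample size, feature counts, and seed are arbitrary illustrative choices.
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_chance_correlation(n_samples, feature_counts, seed=0):
    """Largest |r| between a random trait and the first k pure-noise features, for each k."""
    rng = random.Random(seed)
    trait = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    results = {}
    for k in range(1, max(feature_counts) + 1):
        noise_feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(noise_feature, trait)))
        if k in feature_counts:
            results[k] = best
    return results


if __name__ == "__main__":
    for k, r in sorted(best_chance_correlation(20, (10, 100, 1000)).items()):
        print(k, "noise features -> best |r| =", round(r, 2))
```

None of the features carries any signal, so every apparent association here is an artifact of screening many variables on a small sample; larger samples or held-out validation data are the usual safeguards.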

(18)

Training & Community building

Training of scientists at all levels

Moving from an observational to a quantitative biology will require skills in statistics to support experimental design & interpretation, information technology for best practices in handling the biological data, and software to support interfaces for visualization

Cross-disciplinary training is necessary for the next generation of scientists

Development of virtual training materials to support development and adoption of the emerging infrastructure, to minimize the emphasis on technology and return it to the biology

Changing the culture

Move from individual science to group science

Mechanisms & incentives to foster adoption of standards & sharing of data need to be developed

The foundations of biology will continue to be observation, experimentation, and interpretation

(19)

Workshop “Towards a GrapeIS”

February 2015, Bordeaux, France

Draft recommendations from the workshop

Data standards: to support generating human-readable (Web pages) or, better, machine-readable (Web service responses or RDF) data
– Minimal information about experiments
– Controlled vocabularies/ontologies
– Varietal/material identification
– Data exchange formats and data collection

Alignment to broader initiatives
– Participate in & exploit broader data management initiatives (Research Data Alliance, FOODBALL, IPPN, EPPN, DivSeek, etc.) and infrastructures (iPlant Collaborative, ELIXIR, etc.) to increase sustainability, reduce duplication of efforts and ensure broader impact of the data in the scientific community

Training and dissemination

Leverage the International Grapevine Genome Project consortium (IGGP; www.vitaceae.org) to support coordination & development of a grapevine information system

(20)

Community Building

Need to change the culture within the ARS

Move from individual CRIS project and location to a Virtual Organization

This will require both bottom-up and top-down approaches

Working groups to support standards, policy, and training

Review of the existing policy & incentives

We need to start now and leverage existing ongoing work within projects.

Pilot projects: 6-month examples
– Data: transcriptomes, genetic variation, genome assembly & annotation
– Communities: I5K, NPGS, pathogens, natural resources
• Define a set of use cases emphasizing
– Support for data access and sharing
– Analyses
– Data standards
– Best practices
– Training materials
• Establish teams
• Diversity of locations, species, expertise
• Workshops
– Train the trainers
– 2-3 day workshops targeting specific data types
• Evaluate outcomes and recommendations

USDA AGRICULTURAL RESEARCH SERVICE

(21)

USDA Database workshop PAG Jan. 2015

Lisa Harper, Taner Sen, Steven Cannon (ARS), 70 data scientists representing 23 groups

Objectives

Bring scientists together to meet each other

Outline priority needs facing biological databases

Determine how to address these needs as a group

Create opportunities to work together and improve interoperability between resources

Explore what steps ARS leadership can take to facilitate and support excellence in this expanding area

Outcomes

– Survey of community challenges, needs and expertise
– Commitment to monthly meetings
– 4 breakout groups: each group either has, or is developing, a communication plan and collaborative pilot projects that can be accomplished in 3-6 months


(22)

ARS/iPlant RNAseq Data Workshop Dec. 7-10

• Data sets from insects, plants, animal health, & aquaculture
• Goals
– Foster competency in HPC resources
– Support trainers with a basic knowledge of analyzing RNA-Seq
– Cultivate support and commitment for a sustainable network
– Review data management needs
• Outcomes
– Network of ARS scientists working on a core data type
– Improved competency in HPC resources
– Establishment of 4 working groups: Tools & Workflows, Install New Tools on HPC Resources, Metadata, Adoption


Dewayne Shoemaker (ARS), Kapeel Chougule, Jason Williams (iPlant), 24 ARS scientists representing 18 locations, 5 ARS areas

(23)

ARS/iPlant Population Data Workshop April 19-22

Ivan Baxter, Jean-Luc Jannink (ARS), Liya Wang, Jason Williams (iPlant), 24 ARS scientists representing 5 ARS areas

• Data sets from fungi, animals, insects, plants, & oomycetes
• Goals of workshop
– Foster competency in HPC resources
– Support training in population genomics
– Cultivate support and commitment for a sustainable network
– Review data management needs


(24)
(25)

Discussion - Big Data

What areas of basic research, tools and resources are needed to facilitate interoperability and promote sharing to advance breakthrough discoveries?

What needs are not currently being met? In this regard, are there opportunities to leverage existing data, tools and infrastructure?

What areas of research training and skills are required and not currently being met?

What opportunities do you see for leveraging investments through public-private and international partnerships?

(26)
(27)

Grapevine Community

Grape Community adds value to iPlant

Share tools for using genomic, phenomic and molecular technologies with iPlant community

Provision of data sets via iPlant Data Commons to iPlant community

Development of metadata standards for inclusion in iPlant Data Commons

Engagement with Grape community

iPlant contributions to Grape Community

Scalable, sustainable platform for Grape IS CI development

Computational support for genome annotation, genotyping and automatic phenotyping
