Big Data: Challenges and Opportunities
NGWI & USDA/ARS Meeting
USDA Carver Center
April 16, 2014
Doreen Ware
Acting Chief Science Information Officer
Big Data: Challenges and Response
• Volume
• Velocity
• Variety
• Value
• Complexity
• Human Resources
• Community Building
• Knowledge Management
• Standards
• Policies
• Network
• Storage
• Compute
• Innovation
Biology is an information science
Big Data in Agriculture: an emerging discipline that uses genomic and phenotypic information to accelerate breeding strategies and directed engineering, and integrates environmental, genomic, and climate data to improve yields
Sensors & Metadata
Sequencers, Microscopy, Imaging, Mass spec, Metadata & Ontologies
IO Systems
Hard drives, Networking, Databases, Compression, LIMS
Compute Systems
CPU, GPU, Distributed, Clouds, Workflows
Scalable Algorithms
Streaming, Sampling, Indexing, Parallel
Machine Learning
Classification, Modeling, Visualization & Data Integration
Results
Domain Knowledge
“Big Data” Biology Pyramid
Quantitative Biology Technologies
Biological Sensor Network
The rise of a digital immune system
Schatz MC, Phillippy AM (2012) GigaScience 1:4
(@ewanbirney)
(@latimes)
Small labs are now large data generators.
All scientists need to manage their own data & make it accessible to others.
This is a non-trivial engineering objective!
Data Production & Collection
Expect massive growth of sequencing, imaging, mass spec, and other biological sensor data based on technology and automation over the next 10 years
• Exascale biology is certain, as is mobile streaming of data
• Germplasm developmental traits, field-based phenotyping using robotics (increasing the granularity of the phenotyping)
• Molecular & physiological phenotyping (omics, metabolites, infrared)
• Requires careful consideration of the “preciousness” of the sample
• Compression helps, but the data lifecycle needs aggressive review
• Need to capture metadata associated with the sample prior to or at the time of data generation
Major data producers concentrated in universities, agricultural &
pharmaceutical companies, research institutes, Germplasm Centers, & Farmers
• Major efforts in agriculture, bioenergy, climate & human health
• The variety of the data requires the development of standard descriptors
• Coordination across areas of plant science
• Coordination domestically & internationally
Lack of standards, or of their adoption, is one of the major limitations
• Nomenclature, standard formats, controlled vocabulary
– Human and Machine
– Germplasm identifiers
– Variation identifiers
• Adoption of common reference data sets
– Reference assemblies: multiple assemblies & versioning
– Populations
– Variation data sets
• Standard workflows to support reproducibility
– Support for versions of data
– Support for translations between assay types
Anna McClung, Brian Scheffler, Angela Baldo, Jeremy Edwards
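One way to picture the standards listed above is a check that runs when a sample record is created, so that missing identifiers or free-text terms are caught before data generation. The sketch below is illustrative only: the field names, the vocabulary terms, the example identifiers, and the `validate_sample` helper are all hypothetical and do not correspond to any ARS or community schema.

```python
# Hypothetical required fields and controlled vocabulary, for illustration only.
REQUIRED_FIELDS = ("germplasm_id", "species", "tissue", "assay_type", "reference_assembly")
CONTROLLED_VOCAB = {
    "tissue": {"leaf", "root", "seed"},
    "assay_type": {"RNA-Seq", "WGS", "GBS"},
}

def validate_sample(record):
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []
    # Every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing required field: {field}")
    # Fields governed by a controlled vocabulary must use one of its terms.
    for field, allowed in CONTROLLED_VOCAB.items():
        value = record.get(field)
        if value and value not in allowed:
            problems.append(f"{field}={value!r} is not in the controlled vocabulary")
    return problems

good = {"germplasm_id": "ACC-0001", "species": "Oryza sativa", "tissue": "leaf",
        "assay_type": "RNA-Seq", "reference_assembly": "example-assembly-v1"}
bad = {"species": "Oryza sativa", "tissue": "Leaf tissue", "assay_type": "RNA-Seq"}

print(validate_sample(good))
for problem in validate_sample(bad):
    print(problem)
```

Recording the reference assembly as an explicit, versioned field is what lets a later workflow tell which of several assemblies a result refers to.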
Information Centers and Science Data Highway
DOE ESnet
http://www.es.net/
NSF XSEDE
https://www.xsede.org/
NSF iPlant
http://www.iplantcollaborative.org/
EU Elixir
http://www.elixir-europe.org/
Computational Infrastructure for the
Life Sciences
Kbase.us www.gramene.org
• Annotation of the Loblolly Pine Megagenome (Jill Wegrzyn)
– 20.15 Gb assembly, split into 40 jobs, 216 CPU/job (8640 CPU total), 17 hours
• I-OMAP (Josh Stein)
– 12 rice species (each w/ 12 chromosome pseudomolecules)
– 96 CPU per chromosome (1152 CPU total), 1-2 hr per genome
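The CPU totals quoted for these two runs can be checked with simple arithmetic. The short sketch below reproduces them from the per-job figures. It assumes the 1152 figure is per genome (12 chromosomes at 96 CPUs each), and the CPU-hour estimate further assumes that all 40 pine jobs ran concurrently for the full 17 hours, which the slide does not state.

```python
# Figures taken from the slide; the concurrency assumption is ours.
pine_jobs, pine_cpus_per_job, pine_hours = 40, 216, 17
rice_chromosomes, rice_cpus_per_chromosome = 12, 96

pine_total_cpus = pine_jobs * pine_cpus_per_job
rice_cpus_per_genome = rice_chromosomes * rice_cpus_per_chromosome

# Upper bound on CPU-hours, ASSUMING all jobs ran concurrently for the whole wall-clock time.
pine_cpu_hours = pine_total_cpus * pine_hours

print(pine_total_cpus, rice_cpus_per_genome, pine_cpu_hours)
```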
Genome                 Assembly    Size (Mb)   CPU    Run Time (h:mm)
Arabidopsis thaliana   TAIR10      120         600    2:44
Arabidopsis thaliana   TAIR10      120         1500   1:27
Zea mays               RefGen_v2   2067        2172   2:53
TACC Lonestar Supercomputer: 22,656 CPU cores on 1,888 nodes
MAKER-P at iPlant
Reducing 3 weeks to 3 hours
Josh Stein
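The two Arabidopsis runs above allow a rough estimate of how well the annotation scaled when more CPUs were added. The sketch below computes speedup and parallel efficiency from those two rows. It assumes the run times are in hours:minutes; the helper names `to_minutes` and `scaling` are illustrative and not part of MAKER-P.

```python
def to_minutes(h_mm):
    """Convert an 'h:mm' string (assumed format) to minutes."""
    hours, minutes = h_mm.split(":")
    return int(hours) * 60 + int(minutes)

def scaling(cpus_a, time_a, cpus_b, time_b):
    """Speedup of run B over run A, and efficiency relative to the CPU increase."""
    t_a, t_b = to_minutes(time_a), to_minutes(time_b)
    speedup = t_a / t_b
    efficiency = speedup / (cpus_b / cpus_a)
    return speedup, efficiency

# Arabidopsis thaliana TAIR10: 600 CPUs in 2:44 versus 1500 CPUs in 1:27.
speedup, efficiency = scaling(600, "2:44", 1500, "1:27")
print(f"speedup {speedup:.2f}x on {1500 / 600:.1f}x the CPUs, efficiency {efficiency:.0%}")
```

Under these assumptions, 2.5 times the CPUs gives a bit less than a 2x speedup, or roughly three quarters of ideal linear scaling.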
Genome annotation pipeline
Long-read sequencing technologies support improved structural gene annotations
Example from maize: Chr4:4172270-4180486 has the most isoforms (140)
Genome Services
Uniform data formats
http://plants.ensembl.org/info/website/ftp/index.html
Visualization
Tools
RNA-Seq
Reference genomes
INSDC
Community data
Variable standards
History of tool development
Fostering Intercompatibility
Compute & Algorithmic Challenges
Expect to see many dozens of major informatics centers that consolidate regional / topical information
• Clouds for Agriculture, Climate, Commodity, Traits
• Need for short-term, long-term & archival resources for many data types beyond sequence
• Science highways to support movement of data
• Standards: metadata, formats, reproducible workflows
• Move the code to the data
Parallel hardware and algorithms are required
• Expect to see >1000 cores in a single computer
• Compute & input/output (IO) need to be considered together
• Rewriting efficient parallel software is complex and expensive
• Many existing bioinformatics tools are not configured for HPC
• New data types and tools will be needed
Applications will shift from individuals to populations and many species, and will need to support emergent data types
• Moving from a single reference genome to population analysis, pan-genomes, and time-series analysis
• Need for network analysis, probabilistic techniques
• Existing solutions do not scale…
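Streaming and sampling are two of the scalable techniques named in the pyramid. As a minimal sketch of what they mean in practice, the example below uses reservoir sampling (Algorithm R) to draw a uniform random subset from a stream without holding the stream in memory. The `fake_reads` generator is a hypothetical stand-in for a real sequence-read source.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

def fake_reads(n):
    """Hypothetical read stream; a real pipeline would parse a file lazily."""
    for i in range(n):
        yield f"read_{i}"

subset = reservoir_sample(fake_reads(1_000_000), k=5, seed=42)
print(subset)
```

Memory use depends only on `k`, not on the length of the stream, which is the property that matters when the data no longer fit on one machine.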
Who is a Data Scientist?
http://en.wikipedia.org/wiki/Data_science
Learning and Translation
Tremendous power from data aggregation
• Observe the dynamics of biological systems
• Need reproducible workflows and APIs
• Develop predictive models and support directed design
• Need for training sets to support model development
• Development of intuitive interfaces to support access to the data
• Mechanisms to support integration of public and private data
Mindful of the risks
• Generating & aggregating the data is just the beginning; resources are needed to support interpretation of the data, starting with its quality. Not all data are equal.
• The potential for over-fitting grows with the complexity of the data; statistical significance is a statement about the sample size.
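The point that statistical significance is a statement about sample size can be illustrated with a small calculation. The sketch below holds a deliberately tiny effect fixed and shows how the p-value of a one-sample z-test changes as the number of observations grows. The effect size of 0.05 standard deviations and the assumption of known unit variance are hypothetical choices made only for illustration.

```python
import math

def two_sided_p(effect_size, n):
    """Two-sided p-value of a one-sample z-test for a standardized mean shift."""
    z = effect_size * math.sqrt(n)
    # For a standard normal, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

effect = 0.05  # hypothetical, deliberately tiny standardized effect
for n in (100, 1_000, 10_000, 100_000):
    print(f"n={n:>7}  p={two_sided_p(effect, n):.3g}")
```

The effect is identical in every row; only the sample size changes. With enough data even a biologically negligible difference crosses any significance threshold, which is why effect sizes and data quality need to be reported alongside p-values.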
Training & Community building
Training scientists at all levels
• Moving from an observational to a quantitative biology will require skills in statistics to support experimental design & interpretation, information technology for best practices in handling the biological data, and software to support interfaces for visualization
• Cross-disciplinary training is necessary for the next generation of scientists
• Development of virtual training materials to support development and adoption of the emerging infrastructure, to minimize the emphasis on technology and return it to the biology
Changing the culture
• Move from individual science to group science
• Mechanisms & incentives to foster adoption of standards & sharing of data need to be developed
The foundations of biology will continue to be
observation, experimentation, and interpretation
Workshop “Towards a GrapeIS”
February 2015, Bordeaux, France
Draft recommendations from the workshop
• Data standards: to support generating human-readable (Web pages) or, better, machine-readable (Web service responses or RDF) data
– Minimal information about experiments
– Controlled vocabularies/ontologies
– Varietal/material identification
– Data exchange formats and data collection
• Alignment to broader initiatives
– Participate in & exploit broader data management initiatives (Research Data Alliance, FOODBALL, IPPN, EPPN, DivSeek, etc.) and infrastructures (iPlant Collaborative, ELIXIR, etc.) to increase sustainability, reduce duplication of effort, and ensure broader impact of the data in the scientific community
• Training and dissemination
• Leverage The International Grapevine Genome Project consortium (IGGP; www.vitaceae.org) to support coordination & development of a grapevine information system
Community Building
• Need to change the culture within the ARS
• Move from individual CRIS project, location, to a Virtual Organization
• This will require both bottom-up and top-down approaches
– Working groups to support standards, policy, and training
– Review of the existing policy & incentives
• We need to start now and leverage existing ongoing work within projects.
• Pilot projects (6 months), examples:
– Data: Transcriptomes, Genetic variation, Genome assembly & Annotation
– Communities: I5K, NPGS, Pathogens, Natural resources
• Define set of use cases emphasizing
– Support for data access and sharing
– Analyses
– Data standards
– Best practices
– Training materials
• Establish Teams
– Diversity of locations, species, expertise
• Workshops
– Train the trainers
– 2-3 day workshops targeting specific data types
• Evaluate outcomes and recommendations
USDA AGRICULTURAL RESEARCH SERVICE
USDA Database workshop PAG Jan. 2015
Lisa Harper, Taner Sean, Steven Cannon (ARS); 70 data scientists representing 23 groups
Objectives
• Bring scientists together to meet each other
• Outline priority needs facing biological databases
• Determine how to address these needs as a group
• Create opportunities to work together and improve interoperability between resources
• Explore what steps ARS leadership can take to facilitate and support excellence in this expanding area
Outcomes
– Survey of community challenges, needs and expertise
– Commitment to monthly meetings
– 4 breakout groups: each group either has or is developing a communication plan and collaborative pilot projects that can be accomplished in 3-6 months
ARS/iPlant RNAseq Data Workshop Dec. 7-10
• Data sets from Insects, Plants, Animal Health, & Aquaculture
• Goals
– Foster competency in HPC resources
– Support trainers with a basic knowledge of analyzing RNA-Seq
– Cultivate support and commitment for a sustainable network
– Review data management needs
• Outcomes
– Network of ARS scientists working on a core data type
– Improved competency in HPC resources
– Establishment of 4 working groups: Tools & Workflows, Installing new tools on HPC resources, Metadata, Adoption
Dewayne Shoemaker (ARS), Kapeel Chougule, Jason Williams (iPlant), 24 ARS scientists, representing 18 locations, 5 ARS areas
ARS/iPlant Population Data Workshop April 19-22
Ivan Baxter, Jan-Luc Jannik (ARS), Liya Wang, Jason Williams (iPlant), 24 ARS scientists, representing 5 ARS areas
• Data sets from Fungi, Animals, Insects, Plants, & Oomycetes
• Goals of workshop
– Foster competency in HPC resources
– Support training in Population Genomics
– Cultivate support and commitment for a sustainable network
– Review data management needs