• No results found

Big Data in BioMedical Sciences. Steven Newhouse, Head of Technical Services, EMBL-EBI

N/A
N/A
Protected

Academic year: 2021

Share "Big Data in BioMedical Sciences. Steven Newhouse, Head of Technical Services, EMBL-EBI"

Copied!
43
0
0

Loading.... (view fulltext now)

Full text

(1)

Big Data in BioMedical Sciences

Steven Newhouse,

(2)

Big Data for BioMedical Sciences

EMBL-EBI: What we do and why?

Challenges & Opportunities

Infrastructure Requirements

European Context

(3)

EMBL-EBI

What we do and why?

(4)

The European Molecular Biology Laboratory

Grenoble Structural biology Hinxton, Cambridge Bioinformatics Hamburg Structural biology Heidelberg Basic research Administration EMBO EMBL staff: 1700 people >60 nationalities Mouse biology Monterotondo, Rome

(5)

EMBL member states

Austria, Belgium, Croatia, Czech Republic, Denmark, Finland, France, Germany, Greece, Iceland, Ireland,

Israel, Italy, Luxembourg, the Netherlands, Norway,

Portugal, Spain, Sweden, Switzerland and the United Kingdom

Associate member states:

(6)

EMBL-EBI

MISSION

To provide freely

available data and

bioinformatics services

to all facets of the

scientific community in

ways that promote

(7)

European Bioinformatics Institute (EBI)

International, non-profit

research institute

Europe’s hub for biological

data services and research

570 members of staff from

53 nations

Funded primarily by

member states and

research bodies (EC, USA, UK, Wellcome Trust)

(8)

What is bioinformatics?

The science of storing, retrieving

and analysing large amounts of biological information

An interdisciplinary science involving: • biologists • biochemists • computer scientists • mathematicians

(9)

Challenges &

Opportunities

(10)

Drug discovery

From discovering a

target to a drug

reaching the market:

12 years

Bioinformatics

shortens time to target discovery.

EMBL-EBI services

support all stages of drug discovery.

(11)

Interpreting human variation

How and why do we differ from one another?

What causes susceptibility to disease?

We explore individual genomes by comparing them

with reference genomes.

Combining that information with other types of data

provides valuable insights into human variation.

(12)

Making choices

There are 3 billion base pairs in the human

genome.

Figuring out which regions are involved in disease

– and what they do – is a major challenge.

(13)

Personalised Medicine

Reduced sequencing

costs enable new techniques

• Past: 13 yrs & £2Bn

• Now: 2 days & £1000

Clinical integration

raises privacy & ethical concerns

• UK: 100K Genomes

(14)

From Data to Knowledge in the Life Sciences

Molecular data Information Knowledge

Mutation X disrupts enzyme function, which causes Y disease

(15)

Answering these Challenges

Data Infrastructure

• Distributed trans-national sources (1000s+ sequencers)

• Individually and collectively producing LOTS of data

• Incredibly sensitive meta-data (i.e. your medical records)

Data Consumption

• Annotating data leads to information and knowledge

Data Exploitation

• Enable access beyond specialised researcher

(16)
(17)
(18)

Infrastructure

Requirements

Role of the Technical Services Cluster

(19)

Life science: many data types

Genes, genomes & variation

Gene, protein & metabolite expression

Protein sequences, families & motifs

Macromolecular structures

Interactions, reactions & pathways

Chemogenomics & metabolomics

(20)
(21)

Challenges of Big Data Infrastructure 2015+

0 50000 100000 150000 200000 250000 2014 2015 2016 2017 2018 2019 2020 Cores 0 500 1000 1500 2000 2014 2015 2016 2017 2018 2019 2020 Storage (PB) 40 50 60 70 80 90 100 110 2014 2015 2016 2017 2018 2019 2020 Racks 0 5 10 2014 2015 2016 2017 2018 2019 2020 Cost (M£/year) OpEx CapEx In 2020: Storage: 1,920PB Cores: 230,000

(22)

Flint Cross (FC)

Hinxton (HX) Hemel Hempstead (HH)

Embassy Cloud Disk Cluster Virtual Servers Archive B Archive A Disk Cluster Virtual Servers R ed un da n t LA N

Disk StandbyDBs Disk

Cluster Virtual Servers Disk Cluster Virtual Servers Staging Area Production Area LA N LAN Delphix DB LAN LAN JANET WLB WLB

EMBL-EBI Data Centre Infrastructure

Web Web Web Web DB DB DB DB R ed un da n t SA N SA N SA N SAN SAN JANET JANET 10Gb Network Connections WLB

(23)

EBI Provides Services and Data Resources

Data Resources

• Public and Managed Access

• Individual sequence or bulk download

Services

• Web & programmatic access for common tools

• Run ‘jobs’ on EMBL-EBI hardware

Volume and variety of genomic data expanding

• EMBL-EBI data doubling every year - replication is challenging

(24)

Service Teams Research Teams Ma n a g e me n t

Technical services

Basic Users Power users Research Partners Infrastructur e Providers Web development User experience Web Development Web platforms Web infrastructure Web deployment Web Production

Storage Network Compute Systems Infrastructure

Database Desktop Cloud Systems Applications

(25)

The Challenge Facing Services @

EMBL-EBI

Need to support complex analysis scenarios

• Access to both public and managed access data sets

• Bespoke workflows and tools across a variety of domains

• Issues with disk to memory bandwidth

• Web and programmatic access to services (3M unique users)

Hard for users to replicate data sets for local analysis

(26)

Embassy Cloud Concept

Virtualised EMBL-EBI Hardware

Publi

c

Data

Manag

ed

Data

Publi

c

Services

Emba

ssy Cl

oud

1

Emba

ssy Cl

oud

2

Embassy Clo

ud

3

Pri

vate D

ata

PanCancer

(27)

Embassy Cloud

Our partners work directly

beside EMBL-EBI data. • High bandwidth

• Low latency

• Robust, secure environment.

Not in competition with

commercial cloud services.

Other cloud initiatives:

• ELIXIR-facing cloud support

(28)

Typical Uses

Web Application Hosting

• Limited need for resources & VMs

• CTTV: Host intranet, databases, …

Data Staging

• Undertake submission from local machine (following data staging) rather from remote location

• BRAEMBL: Remote submission unreliable due to file upload

Data Analysis

• Large scale management and analysis of data

(29)
(30)

ELIXIR: a distributed data infrastructure

EMBL-EBI is a major driver in ELIXIR,

the pan-European research

infrastructure for biological information.

Central Hub at EMBL-EBI, with Nodes at

centres of excellence throughout Europe.

The goal of ELIXIR:

• Build a sustainable European

infrastructure for biological information

• Support life science research and its

translation to medicine, the environment, the bioindustries and society.

(31)

Building capacity in Europe

• ELIXIR: a sustainable

infrastructure for biological information in Europe.

• Supporting life science

research and its translation to:

• medicine

• agriculture

• the environment

• the bioindustries

(32)

AAI Architecture

ELIXIR AAI

External authentication (e-infrastructures) Relying services

eduGAIN IdPs Common IdPs ELIXIR Proxy IdP

ELIXIR Directory

Bona fide management Dataset authorisation management Group/role management Credential translation EGA wiki Cloud Intranet … Data archive … … Attribute self-management

(33)

Cloud & Storage Architecture

ELIXIR External (e-infrastructures) Relying services Cloud IaaS Cores Storage Cloud IaaS Cores Storage Storage Storage Au thenticatio n & Au thorization Accounting Data Transfer Portals Workflow Engines Service Directory Service Registry Metrics Software Library

(34)

Initial Prioritisation and Grouping Analysis

• AAI

• 0: Federated ID, Other ID

• 1: Elixir ID

• 2: Credential Translation, Group/Attribute Mgmt, Endorsed Attributes

• Cloud Compute

• 1: Cloud IaaS, HTC Cluster, Cloud Storage, Federated Cloud IaaS

• 2: Infrastructure Service Registry & Directory, VM Library

• 3: Operational Integration, Resource Accounting

• Data Transfer

• 1:Network File Storage, File Transfer

(35)

Centralization & specialization

35

Data production Data centralization Data

• Data is submitted to specialized centralized repositories.

(36)

Federation

36

If data gets bigger, the data might have to stay where it

is produced.

We might have to provision data producers with storage

and computation.

• Data might be pulled instead of pushed into centralized

repositories.

36

(37)
(38)

Centre for Therapeutic Target Validation

Collaboration to pinpoint the

processes in the human body that have a demonstrable effect on

disease.

Public-private initiative:

• GSK: expertise in disease biology

• EMBL-EBI: expertise in life science data integration and analysis

• Wellcome Trust Sanger Institute:

expertise in the role of genetics in disease.

(39)

Technical Services Cluster: IT as a Service

‘Professionalising’ our IT Services

• Looking at lightweight FitSM portfolio

Developing a Service Portfolio

• Internal and External (Public & Private) Users

Putting the Service Portfolio into a governance process

• To manage and communicate change of the portfolio

(40)

External Infrastructure Activities

EUDAT

• Leverage CDI to strategically replicate large ‘popular’ data sets

HelixNebula

• Meet future needs through hybrid model

• External hosted data centres with direct cloud access

EGI: European Grid Infrastructure

• Provides federated cloud & cluster resources

CERN OpenLab

• Identifying and building common IT services for science

Cancer Research UK

• Bringing islands of protected data together for cloud based analysis

(41)

Other Cloud Activity at EMBL-EBI

Use Amazon to provide geographical distribution

• Direct link to globally replicate databases

HelixNebula

• Integration of commercial cloud providers with big research

Benefit of additional security assurances

• For use by pharmaceutical companies

• For on-demand personalised medicine

Explore using IaaS to supplement/replace data centres

• Put DC on cloud, scale out services (service + database), etc.

(42)

The Future

Virtualised Infrastructure

Storage Compute Data EMBL-EBI IT (Services, Research, Clusters)

EMBL-EBI Data Centre Cloud Providers Virtualised Infrastructure Storage Data Compute Infrastructure Provider Elixir Service Data Data Provider Elixir Community Services

Public Service Virtualised Infrastructure Storage Compute

Geant Network

Private Analysis

Integrating Platform

(43)

References

Related documents

The present study was undertaken to assess the effects of hot air drying on phenolic compositions, total phenolic (TP) content, total anthocyanin (TA) content, as well

The second, called an end vise or tail vise, can clamp work like a front vise, but is more often used to hold boards flat on the bench, pinched between a pin or dog in the vise

Finally, we developed seven new upper-division, 4-unit major courses to provide students with more FIP options, and to take advantage of faculty expertise and student interest in

(To be eligible, services must not have been received within the last 30 days.) More information about the program and our services can be found online at

This study allows educators within higher education to better understand the complex processes of civic commitment development and how to holistically support college students

i) Streamline flow is that flow in which every particle flows along the path of its preceding particle. ii) The path taken by a particle in a flowing fluid is called its line of

a) FM NAV ACCY check using raw data (only if GPS is not primary).. The result of the NAV ACCY check determines the strategy on how to conduct the approach, and as a consequence

With respect with the second high mean, which was for the role of distributing security information in different media, it is in line with Al Harbi [9] findings that social