
VISUALIZING THE PROTEIN SEQUENCE UNIVERSE

L. STANBERRY¹, R. HIGDON¹, W. HAYNES¹, N. KOLKER¹, W. BROOMALL¹, S. EKANAYAKE², A. HUGHES², Y. RUAN², J. QIU², E. KOLKER¹, G. FOX

¹SEATTLE CHILDREN’S, ²INDIANA UNIVERSITY

(2)

Grand Challenge of Functional Genomics

Visualizing PSU

• New technologies produce peta- and exabytes of data
• Protein Sequence Universe (PSU), the protein sequence space, expands exponentially
• EMP, i5K, iPlant, NEON
• 30% of existing sequenced proteins unannotated, even before drastic expansion
• Existing resources overwhelmed, many unsupported: COG, Systers, ClusTr, eggNOG.

(3)

Ultimate Goal: Annotate All Proteins

Our approach:

• Revitalize, expand & enhance protein annotation resources.
• Develop sustainable software framework.
• Use HPC and most powerful Cyberinfrastructure
• Provide rigorous and reliable tools to annotate protein sequences.

(4)

COG: Clusters of Orthologous Groups

• COG database was developed by NCBI.
• Proteins classified into groups with common function encoded in complete genomes.
• Prokaryotes (COG): 66 genomes, 200K proteins, 5K clusters.
• Eukaryotes (KOG): 7 genomes, 113K proteins, 5K clusters.
• Valuable scientific resource: 5K citations.
• Last updated: 2006.

(5)

Protein Sequence Universe

• PSU goal: enhance annotation resources with analytic and visualization (browser) tools.
• One component of PSU is to project sequence data into 3D using multidimensional scaling (MDS).
• MDS interpolation allows expanding the universe without a time-consuming all-vs-all O(N^2) computation
• 3D map allows much faster interpolation
• Uses a set of pairwise dissimilarities; no multiple sequence alignment (MSA) is performed, so the sequences are not vectors in any space

(6)

Multi-Dimensional Scaling (MDS)

• Sammon's objective function
• δ_ij is the dissimilarity measure between sequences i and j
• d is the Euclidean distance (here in 3D for visualization) between the projections x_i and x_j
• Denominator chosen to get larger contribution in objective function from smaller dissimilarities
• f is a monotone transformation of the dissimilarity
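For reference, a sketch of Sammon's stress written with the quantities listed above: δ_ij for the dissimilarity between sequences i and j (the symbol itself is assumed), d for the Euclidean distance between projections, and f for the monotone transformation. The overall normalization constant follows the textbook form of Sammon mapping and is an assumption about the exact expression used in this work.

```latex
E(x_1,\dots,x_N) \;=\; \frac{1}{\sum_{i<j} f(\delta_{ij})}
\sum_{i<j} \frac{\bigl(d(x_i, x_j) - f(\delta_{ij})\bigr)^2}{f(\delta_{ij})},
\qquad d(x_i, x_j) = \lVert x_i - x_j \rVert_2 .
```

Dividing each squared residual by f(δ_ij) is what gives small dissimilarities a larger weight in the sum.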

(7)

Typical Metagenomics MDS

03/02/2020 Visualizing PSU

(8)

MDS Details

• f chosen heuristically to increase the ratio of standard deviation to mean of the transformed dissimilarities f(δ_ij) and to increase the range of dissimilarity measures.
• O(n^2) complexity to map n sequences into 3D.
• MDS can be solved using EM (SMACOF: fastest but limited) or directly by Newton's method (it’s just χ^2)
• Used robust implementation of nonlinear χ^2 minimization with Levenberg-Marquardt
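To make the minimization concrete, here is a minimal sketch that maps a small dissimilarity matrix into 3D by reducing a Sammon-style weighted least-squares stress. It deliberately swaps in plain gradient descent for the Levenberg-Marquardt solver mentioned above so that it needs only the standard library; the function names, step size, and iteration count are all hypothetical choices, and the double loop is O(n^2) per iteration.

```python
import math
import random


def sammon_stress(points, delta):
    """Normalized sum over i<j of (d_ij - delta_ij)^2 / delta_ij."""
    n = len(points)
    total = 0.0
    norm = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            total += (d - delta[i][j]) ** 2 / delta[i][j]
            norm += delta[i][j]
    return total / norm


def sammon_map(delta, dim=3, steps=1000, lr=0.2, seed=0):
    """Gradient descent on the stress (an assumption, not Levenberg-Marquardt)."""
    rng = random.Random(seed)
    n = len(delta)
    pts = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    norm = sum(delta[i][j] for i in range(n) for j in range(i + 1, n))
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = max(math.dist(pts[i], pts[j]), 1e-12)
                # d/dx_i of (d - delta)^2 / delta, divided by the normalization
                coef = 2.0 * (d - delta[i][j]) / (delta[i][j] * d * norm)
                for k in range(dim):
                    grads[i][k] += coef * (pts[i][k] - pts[j][k])
        for i in range(n):
            for k in range(dim):
                pts[i][k] -= lr * grads[i][k]
    return pts


# Four items that are all at dissimilarity 1 from each other (a regular tetrahedron).
delta = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
coords = sammon_map(delta)
print(sammon_stress(coords, delta))
```

A second-order method such as Levenberg-Marquardt would typically need far fewer iterations than this first-order loop, which is one reason it is attractive at the 100K-sequence scale.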

(9)

MDS Details

• Input Data: 100K sequences from well-characterized prokaryotic COGs.
• Proximity measure: sequence alignment % scores
• Scores calculated using Needleman-Wunsch
• Scores “sqrt 4D” transformed and fed into MDS
• Analytic form for transformation to 4D
• δ_ij^n decreases dimension for n > 1; increases it for n < 1
• “sqrt 4D” reduced dimension of distance data from 244 for δ_ij to 14 for f(δ_ij)
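As an illustration of the proximity measure, the sketch below runs a basic Needleman-Wunsch global alignment and turns the fraction of identical aligned positions into a dissimilarity. The scoring values (match +1, mismatch -1, linear gap -1) and the percent-identity conversion are assumptions made for brevity; protein alignments in practice usually rely on a substitution matrix and affine gap penalties, and the exact scoring used for this data set is not stated here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity_dissimilarity(a, b):
    """1 - (identical aligned columns / alignment length); 0.0 for two empty inputs."""
    _, x, y = needleman_wunsch(a, b)
    if not x:
        return 0.0
    same = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 1.0 - same / len(x)


print(needleman_wunsch("ACGT", "AGT"))
print(percent_identity_dissimilarity("ACGT", "AGT"))
```

Because Needleman-Wunsch aligns full sequences end to end, every pair receives a score, which is why the resulting dissimilarity matrix has no missing entries.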

(10)

3D View of 100K COG Sequences

(11)

Implementation

• NW computed in parallel on a 100-node, 8-core system.
• Used Twister (IU) in the Reduce phase of MapReduce
• MDS calculations performed on a 768-core MS HPC cluster (32 nodes)
• Scaling, parallel MPI with threading intranode
• Parallel efficiency of the code approximately 70%
• Lost efficiency due to memory bandwidth saturation
• NW required 1 day; the MDS job required 3 days.
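One common way to spread an all-vs-all alignment over many workers is to tile the upper triangle of the pair matrix into blocks and hand each block to a separate map task. The sketch below shows only that decomposition; it is not the Twister API, and the function names and block size are hypothetical.

```python
from itertools import combinations_with_replacement


def pairwise_blocks(n, block):
    """Yield (row_range, col_range) tiles covering the upper triangle of an n x n matrix."""
    starts = range(0, n, block)
    for r, c in combinations_with_replacement(starts, 2):
        yield range(r, min(r + block, n)), range(c, min(c + block, n))


def pairs_in_block(rows, cols):
    """Pairs (i, j) with i < j inside one tile; diagonal tiles contribute only their upper half."""
    return [(i, j) for i in rows for j in cols if i < j]


# Each tile could be dispatched to a different worker; here we simply list the pairs.
all_pairs = [p for rows, cols in pairwise_blocks(10, 4) for p in pairs_in_block(rows, cols)]
print(len(all_pairs))
```

Since the dissimilarity is symmetric, only tiles on or above the diagonal are generated, which roughly halves the work compared with a full n x n sweep.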

(12)

Cluster Annotation

COG      Annotation                                                            Uniref100
COG1131  ABC-type multidrug transport system, ATPase component                 14406
COG1136  ABC-type antimicrobial peptide transport system, ATPase component     7306
COG1126  ABC-type polar amino acid transport system, ATPase component          4061
COG3839  ABC-type sugar transport systems, ATPase component                    4121
COG0444  ABC-type dipeptide/oligopeptide/nickel transport system, ATPase comp  3520
COG4608  ABC-type oligopeptide transport system, ATPase component              3074
COG3842  ABC-type spermidine/putrescine transport systems, ATPase comp         3665
COG0333  Ribosomal protein L32                                                 1148
COG0454  Histone acetyltransferase HPA2 and related acetyltransferases         14085
COG0477  Permeases of the major facilitator superfamily                        48590
COG1028  Dehydrogenases with different specificities                           37461

(13)

Selected Clusters


(14)

Heatmap of NW vs Euclidean Distances

(15)

Heatmap for Selected Clusters


(16)

Future Steps

• Comparison of Needleman-Wunsch v. Blast v. PSIBlast
• NW easier as complete; Blast has missing distances
• Different transformations distance → monotonic function(distance) to reduce formal starting dimension (increase sigma/mean)
• Automate cluster consensus finding as the sequence that minimizes the maximum distance to other sequences
• Improve O(N^2) to O(N) complexity by interpolating new sequences to original set and only doing small regions with O(N^2)
• Successful in metagenomics
• Can use Oct-tree from 3D mapping or set of consensus vectors
• Some clusters diffuse?
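The consensus rule in the list above (the sequence that minimizes the maximum distance to the other sequences in its cluster) can be sketched in a few lines. The function name and the toy distance matrix are hypothetical; any symmetric dissimilarity matrix for one cluster would do.

```python
def minimax_center(dist):
    """Index of the item whose largest distance to any other item is smallest."""
    n = len(dist)
    return min(
        range(n),
        key=lambda i: max((dist[i][j] for j in range(n) if j != i), default=0.0),
    )


# Toy cluster: four items at positions 0, 1, 2 and 10 on a line.
positions = [0, 1, 2, 10]
dist = [[abs(p - q) for q in positions] for p in positions]
print(minimax_center(dist))
```

In this toy case the item at position 2 is selected because its worst-case distance (8) is smaller than that of any other item, even though it is not the median of the group.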


(18)

Full Data Blast

(19)

Cluster Data Blast

(20)

Use Barnes-Hut OctTree, originally developed to reduce O(N^2) astrophysics calculations to O(N log N)

(21)

OctTree for 100K sample of Fungi

We use OctTree for logarithmic interpolation
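The interpolation step itself can be sketched as follows: given a few already-mapped reference points and the new sequence's dissimilarities to them, find the 3D position whose distances to the references best match. This is a hedged illustration only; the unweighted least-squares objective, the gradient-descent solver, and the function name are assumptions, and the OctTree lookup that would select the nearby reference points is left out.

```python
import math


def interpolate_point(refs, targets, steps=1000, lr=0.5):
    """Place one new point so its distances to refs approximate the target values."""
    dim = len(refs[0])
    # Start from the centroid of the reference points.
    x = [sum(r[k] for r in refs) / len(refs) for k in range(dim)]
    for _ in range(steps):
        grad = [0.0] * dim
        for r, t in zip(refs, targets):
            d = max(math.dist(x, r), 1e-12)
            coef = 2.0 * (d - t) / d  # derivative of (d - t)^2 with respect to x
            for k in range(dim):
                grad[k] += coef * (x[k] - r[k])
        x = [x[k] - lr * grad[k] / len(refs) for k in range(dim)]
    return x


# Four fixed reference points and a hidden location used to generate target distances.
refs = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
hidden = (0.3, 0.4, 0.5)
targets = [math.dist(hidden, r) for r in refs]
placed = interpolate_point(refs, targets)
print(placed)
```

Each new point costs work proportional to the number of reference points it is compared against rather than to the size of the whole map, which is what makes adding sequences to an existing map cheap.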

(22)

440K Interpolated


(23)

Conclusions

• Data → Knowledge: protein annotation
• Overwhelming influx of new sequences
• Annotation is an immense challenge.
• HPC and advanced analytics needed.
• PSU as tool to facilitate annotation:
• Interactive visualization and exploration
• Integrates info on function, pathways, structure, and environment
• MDS preserves grouping structure of protein space
• MDS can use different proximities and biological data
• Parallel MDS handles large-scale data
• MDS interpolation quickly maps new sequences into existing space

(24)

DELSA: Data → Knowledge → Action

Data-Enabled Life Sciences Alliance International

• Collective innovation to tackle modern biological challenges through best computational practices and advanced cyberinfrastructure.
• Harness expertise and resources across disciplines
• Promote accurate, sustainable, scalable approaches
• Facilitate translation of data influx into tangible innovations and groundbreaking discoveries

(25)

DW2 Workshop, May 2012, D.C.

Who was there:

- ~90 participants (by invitation only)
- Academia, Government, Industry, Media, NFP
- 9 Countries (Belgium, Canada, China, Germany, Israel, India, Russia, U.K., and U.S.A.)

What were the goals:

- Help identify Transformational Business Models
- Help identify Top (high impact/potential) Projects
- Stay engaged with DELSA and support mission
- Get the word out about DELSA... tweet, blog, email, talk, FB, LI, present, connect etc.
- Identify people’s optimal role/s in DELSA and endorsed

(26)

DELSA Endorsed Projects

Project 1: Social Networking Platform for Tool Brokering/Community Building *

Goal: Open a dialog and an organizational effort to build a social networking platform to broker bioinformatics tools. This project would encourage community engagement by crowdsourcing, accelerate discovery by making tools more accessible, and through community ranking, more trustworthy. It would also build community and connect people and resources.

Deliverable: Social networking platform for idea exchange, resource sharing, tool ranking and brokering.

Project 2: Data Set Accessibility Project

Lead: Corinna Gries

Goal: Make high quality life sciences data broadly available, traceable and usable.

Deliverable: Follow-on workshop to define issues such as: Curation, Sustainability, Rapid growth in data volume, Data provider incentives, Non-trivial processing on the data in the repository, Limited bandwidth from the open Internet to clouds, and Security.

Project 3: Training Data Scientists

Lead: Geoffrey Fox

Goal: Train new and established scientists to enable more effective use of big data and its cyberinfrastructure.

Deliverable: Courses in data-enabled science culminating in a certification similar to Microsoft or Cisco certification or existing scientific computing or computational science certificates/curricula. Need to evaluate existing resources such as: UW eScience classes, OGF Grid Computing certificate, and XSEDE HPC University… A possible approach is to focus on particular life science subdomains.

Project 4: Global Protein Atlas

Lead: Jack Gilbert

Goal: For all the meta-genomes and genomes that are available, cluster at the protein level and annotate… MG-RAST, CAMERA, MOPED, PSU, etc… The goal is to characterize all the proteins and answer the question: what protein is expressed in what organism, what disease, what tissue, what condition, what environment, and in what concentration?

Deliverable: Based on current large-scale projects such as the Earth Microbiome Project and the Human Microbiome Project, we will analyze samples from diverse communities using meta-genomics and meta-proteomics to produce a Global Protein Atlas.

(27)

DELSA Endorsed Projects, Cont.

Project 5: Internet2 Application

Lead: Michael Sullivan

Background: Internet2 is an advanced not-for-profit networking consortium developing revolutionary Internet technologies and leveraging a high-performance network (http://www.internet2.edu/). It is currently being adopted by NLM. It has three components: 1) Connect pilot place to Internet2; 2) Deploy Science DMZ at the pilot place; and 3) Perform routine exchange of BigData. It provides a dedicated mode for fast data transfer.

Goal: Create scalable process to connect entities (Research institutes, Universities, and Global Governments) to Internet2.

Project 6: DELSA Matchmaking Website *

Goal: Help scientists connect to each other, tools, publications, and industry as a way to facilitate more effective science. Possible examples are VIVO and LinkedIn. Could develop a matchmaking "20 questions" to determine individuals’ skills, interests, tools, and review favorites. Could point to publications, tools or resources.

Deliverable: A web-based platform for connecting scientists to other scientists as well as research resources.

Project 7: Pregnancy Atlas Use Case

Lead: Joseph Kemnitz

Goal: Utilize DELSA and its members and connections for resources that would help the Pregnancy Atlas… The Pregnancy Atlas Consortium has an Integrative Discovery Platform that could be expanded. Help with metrics to assess Platform success.

Deliverable: Additional information for the Pregnancy Atlas such as potential collaborators, CI tools, data formats and funding opportunities. Provide files in a format that could be integrated by the Platform.

Project 8: ParaMEDIC Use Case

Lead: Wu Feng

Background: Frequent Pain Points experienced by DELSA members include ease of use issues with analysis tools and compute resources, as well as performance issues which may be due to compute problems, data management problems or data representation problems.

Goal: Use the automated, easy to use and integrated high-performance Biocomputing system (including ParaMEDIC: Parallel Metadata Environment for Distributed I/O & Computing) on a Suggested BigData challenge to show what can be done if the system was widely available.

Deliverable: BigData life sciences challenging project successfully accomplished.

(28)

References and Resources

• COG data is available at the NCBI site ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/COG0303/
• MDS results are available at http://manxcatcogblog.blogspot.com/
• All software used to analyze and visualize the data is open source.
• DELSA: http://www.delsaglobal.org
• Protein Global Atlas and Data Accessibility Projects

(29)

Acknowledgements

Grant support

• NSF: under DBI: 0969929 (EK) and 0910818 (GF)
• NIH: 5 RC2 HG 005806-02 (GF); NIGMS grant
