VISUALIZING THE PROTEIN
SEQUENCE UNIVERSE
L.STANBERRY1, R.HIGDON1, W.HAYNES1,
N.KOLKER1, W.BROOMALL1, S.EKANAYAKE2,
A.HUGHES2, Y.RUAN2, J.QIU2, E.KOLKER1,
G.FOX2
1SEATTLE CHILDREN’S, 2INDIANA UNIVERSITY
Grand Challenge of Functional
Genomics
Visualizing PSU
New technologies produce peta- and
exabytes of data
Protein Sequence Universe (PSU), the protein
sequence space, expands exponentially
EMP, i5K, iPlant, NEON
30% of existing sequenced proteins
unannotated – even before drastic expansion
Existing resources overwhelmed, many
unsupported: COG, Systers, ClusTr, eggNOG.
Ultimate Goal: Annotate All
Proteins
Our approach:
Revitalize, expand & enhance protein
annotation resources.
Develop a sustainable software framework.
Use HPC and the most powerful cyberinfrastructure.
Provide rigorous and reliable tools to annotate
protein sequences.
COG: Clusters of Orthologous
Groups
COG database was developed by NCBI.
Proteins classified into groups with common
function encoded in complete genomes.
Prokaryotes (COG): 66 genomes, 200K proteins,
5K clusters.
Eukaryotes (KOG): 7 genomes, 113K proteins,
5K clusters.
Valuable scientific resource: 5K citations.
Last updated: 2006.
Protein Sequence Universe
PSU Goal: Enhance annotation resources
with analytic and visualization (browser) tools.
One component of PSU is to project
sequence data into 3D using
multidimensional scaling (MDS).
MDS interpolation allows expanding the
universe without a time-consuming all-vs-all O(N^2) computation
3D map allows much faster interpolation
Uses a set of pairwise dissimilarities; no MSA is
performed, so sequences are not vectors in some space
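The interpolation idea above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a new sequence is placed by least-squares fitting of its 3D position against its dissimilarities to a few already-mapped reference sequences, which stay fixed. The function name `interpolate_point` and the toy cube data are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares


def interpolate_point(ref_xyz, dissim_to_ref):
    """Place one new point in an existing 3D map; reference points stay fixed.

    ref_xyz:       (k, 3) coordinates of k already-mapped sequences
    dissim_to_ref: (k,) dissimilarities of the new sequence to each of them
    """
    ref_xyz = np.asarray(ref_xyz, dtype=float)
    dissim_to_ref = np.asarray(dissim_to_ref, dtype=float)

    def residuals(x):
        # Mapped distance minus target dissimilarity, one residual per reference.
        return np.linalg.norm(ref_xyz - x, axis=1) - dissim_to_ref

    start = ref_xyz.mean(axis=0)  # centroid of the references as a neutral start
    return least_squares(residuals, start).x


# Toy data (assumption): references on the corners of a unit cube, with exact
# Euclidean distances standing in for sequence dissimilarities.
corners = np.array(
    [[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)], dtype=float
)
true_position = np.array([0.3, 0.2, 0.7])
targets = np.linalg.norm(corners - true_position, axis=1)
placed = interpolate_point(corners, targets)
```

Each new point costs work proportional to the number of references it is compared with, which is why interpolation avoids recomputing all pairs.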
Multi-Dimensional Scaling
(MDS)
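Sammon's objective function (standard form, with f applied to the dissimilarities):
$$E(X) = \frac{1}{\sum_{i<j} f(\delta_{ij})} \sum_{i<j} \frac{\left(f(\delta_{ij}) - d(x_i, x_j)\right)^2}{f(\delta_{ij})}$$
$\delta_{ij}$ is the dissimilarity measure between sequences i
and j
d is the Euclidean distance (here in 3D for
visualization) between projections $x_i$ and $x_j$
Denominator chosen to get a larger contribution to the
objective function from smaller dissimilarities
f is a monotone transformation of the dissimilarity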
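The definitions above translate directly into a small stress function. This is a sketch under stated assumptions: the leading normalization follows Sammon's standard formulation, the default `f` is the identity, and the function name `sammon_stress` and the 3-4-5 triangle data are hypothetical.

```python
import math


def sammon_stress(dissim, coords, f=lambda delta: delta):
    """Sammon-style stress of `coords` against the dissimilarity matrix `dissim`.

    Each pair contributes (f(delta_ij) - d_ij)^2 / f(delta_ij), so pairs with
    small dissimilarities weigh more; the sum is normalized by sum f(delta_ij).
    """
    n = len(coords)
    total = 0.0
    norm = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            target = f(dissim[i][j])
            mapped = math.dist(coords[i], coords[j])  # Euclidean distance d
            total += (target - mapped) ** 2 / target
            norm += target
    return total / norm


# Toy data (assumption): a 3-4-5 right triangle embedded in 3D.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
exact = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
perturbed = [[0, 4, 4], [4, 0, 5], [4, 5, 0]]  # pair (0, 1) asks for 4, map gives 3

stress_exact = sammon_stress(exact, coords)
stress_perturbed = sammon_stress(perturbed, coords)
```

With the exact matrix the stress is zero; with the perturbed one only pair (0, 1) contributes, giving (4 - 3)^2 / 4 divided by the normalizer 4 + 4 + 5.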
Typical Metagenomics MDS
03/02/2020 Visualizing PSU
MDS Details
f chosen heuristically to increase the ratio of
standard deviation to mean for $f(\delta_{ij})$ and to
increase the range of dissimilarity measures.
O(n^2) complexity to map n sequences into 3D.
MDS can be solved using EM (SMACOF: fastest but
limited) or directly by Newton's method (it's just
$\chi^2$ minimization)
Used robust implementation of nonlinear $\chi^2$
minimization with Levenberg-Marquardt
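To make the "it's just chi-squared" remark concrete, here is a minimal sketch that writes the weighted stress as a sum of squared residuals and hands it to SciPy's Levenberg-Marquardt solver (`scipy.optimize.least_squares` with `method="lm"`). It is not the parallel implementation used in the talk; the function names, the random start, and the 12-point toy data are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import least_squares


def weighted_residuals(flat_xyz, targets, rows, cols):
    """One residual per pair: (d_ij - delta_ij) / sqrt(delta_ij)."""
    xyz = flat_xyz.reshape(-1, 3)
    mapped = np.linalg.norm(xyz[rows] - xyz[cols], axis=1)
    return (mapped - targets) / np.sqrt(targets)


def mds_lm(dissim, seed=0):
    """Map n items into 3D by minimizing the sum of squared residuals."""
    n = dissim.shape[0]
    rows, cols = np.triu_indices(n, k=1)  # each unordered pair once
    targets = dissim[rows, cols]
    start = np.random.default_rng(seed).normal(size=3 * n)  # random start
    chi2_start = np.sum(weighted_residuals(start, targets, rows, cols) ** 2)
    fit = least_squares(
        weighted_residuals, start, args=(targets, rows, cols), method="lm"
    )
    chi2_end = np.sum(fit.fun ** 2)
    return fit.x.reshape(n, 3), chi2_start, chi2_end


# Toy data (assumption): dissimilarities are exact distances between 12 random
# 3D points, giving 66 residuals for 36 unknowns.
rng = np.random.default_rng(1)
points = rng.uniform(size=(12, 3))
dissim = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
embedding, chi2_before, chi2_after = mds_lm(dissim)
```

The number of residuals grows as n(n-1)/2, which is the O(n^2) cost noted above. A random start can also end in a local minimum, so a production run would need more care than this sketch takes.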
MDS Details
Input Data: 100K sequences from
well-characterized prokaryotic COGs.
Proximity measure: sequence alignment % scores
Scores calculated using Needleman-Wunsch
Scores “sqrt 4D” transformed and fed into MDS
Analytic form for transformation to 4D
$\delta_{ij}^n$ decreases dimension for n > 1; increases it for n < 1
“sqrt 4D” reduced dimension of distance data
from 244 for $\delta_{ij}$ to 14 for $f(\delta_{ij})$
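The effect of a power transformation on the spread of the dissimilarities can be checked numerically. The sketch below does not reproduce the "sqrt 4D" transformation, whose analytic form is not given here; it only illustrates, on synthetic uniformly distributed values (an assumption), how raising dissimilarities to a power n changes the ratio of standard deviation to mean. The helper name `spread_ratio` is hypothetical.

```python
import random
import statistics


def spread_ratio(values):
    """Ratio of standard deviation to mean (sigma/mean)."""
    return statistics.pstdev(values) / statistics.fmean(values)


# Synthetic dissimilarities (assumption): uniform on [0.05, 1.0].
random.seed(0)
deltas = [random.uniform(0.05, 1.0) for _ in range(5000)]

# sigma/mean after applying delta**n for several exponents n.
ratios = {n: spread_ratio([d ** n for d in deltas]) for n in (0.5, 1.0, 2.0)}

for n, ratio in ratios.items():
    print(f"n = {n}: sigma/mean = {ratio:.3f}")
```

On this toy input, n > 1 raises sigma/mean and n < 1 lowers it, in line with the statement that a larger sigma/mean corresponds to a lower formal dimension.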
3D View of 100K COG
Sequences
Implementation
NW computed in parallel on 100 node 8-core
system.
Used Twister (IU) in the Reduce phase of
MapReduce
MDS Calculations performed on 768 core MS
HPC cluster (32 nodes)
Scaling, parallel MPI with threading intranode
Parallel efficiency of the code approximately
70%
Efficiency lost due to memory bandwidth
saturation
NW required 1 day; the MDS job required 3 days.
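For reference, the scoring step of Needleman-Wunsch global alignment can be sketched in a few lines. This is a simplified illustration, not the production code: it assumes a flat match/mismatch/gap scheme, whereas protein alignment normally uses a substitution matrix and affine gap penalties. The sequences and names below are hypothetical.

```python
from itertools import combinations


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of strings a and b (score only, no traceback)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with b[:j]; only two rows of the dynamic-programming table are kept.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Toy sequences (assumption). Every pair is independent of the others, which
# is what makes the all-vs-all computation easy to spread over many cores.
sequences = {"s1": "MKTAY", "s2": "MKTAY", "s3": "MKAY"}
scores = {
    (p, q): needleman_wunsch_score(sequences[p], sequences[q])
    for p, q in combinations(sequences, 2)
}
```

Here `s1` and `s2` are identical (five matches), while `s3` lacks one residue, so its best alignment to either has four matches and one gap.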
Cluster Annotation
COG      Annotation                                                           UniRef100
COG1131  ABC-type multidrug transport system, ATPase component                14406
COG1136  ABC-type antimicrobial peptide transport system, ATPase component    7306
COG1126  ABC-type polar amino acid transport system, ATPase component         4061
COG3839  ABC-type sugar transport systems, ATPase component                   4121
COG0444  ABC-type dipeptide/oligopeptide/nickel transport system ATPase comp  3520
COG4608  ABC-type oligopeptide transport system, ATPase component             3074
COG3842  ABC-type spermidine/putrescine transport systems, ATPase comp        3665
COG0333  Ribosomal protein L32                                                1148
COG0454  Histone acetyltransferase HPA2 and related acetyltransferases        14085
COG0477  Permeases of the major facilitator superfamily                       48590
COG1028  Dehydrogenases with different specificities                          37461
Selected Clusters
Heatmap of NW vs Euclidean
Distances
Heatmap for Selected Clusters
Future Steps
Comparison: Needleman-Wunsch vs. Blast vs. PSIBlast
NW easier as complete; Blast has missing distances
Different transformations: distance → monotonic
function(distance) to reduce formal starting dimension (increase sigma/mean)
Automate cluster consensus finding as the sequence that
minimizes the maximum distance to other sequences
Improve O(N^2) to O(N) complexity by interpolating new
sequences to the original set and only doing small regions with O(N^2)
Successful in metagenomics
Can use Oct-tree from 3D mapping or set of consensus
vectors
Some clusters diffuse?
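The consensus rule listed above (the sequence that minimizes its maximum distance to the other members) is simple to state in code. A minimal sketch, assuming a precomputed dissimilarity matrix indexed by sequence number; the function name and the toy matrix are hypothetical.

```python
def minimax_consensus(members, dissim):
    """Return the member whose largest dissimilarity to any other member is smallest."""

    def worst_case(candidate):
        # default=0.0 covers a cluster with a single member.
        return max(
            (dissim[candidate][other] for other in members if other != candidate),
            default=0.0,
        )

    return min(members, key=worst_case)


# Toy cluster (assumption): four items at positions 0, 1, 3, 7 on a line, so
# the dissimilarity matrix is the absolute difference of those positions.
positions = [0, 1, 3, 7]
dissim = [[abs(p - q) for q in positions] for p in positions]
consensus = minimax_consensus(range(len(positions)), dissim)
```

The worst-case distances are 7, 6, 4 and 7, so the item at position 3 (index 2) is chosen. The scan is quadratic in cluster size, which is acceptable when applied per cluster rather than to the whole set.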
[Figure panels: "Full Data, Blast" and "Cluster Data, Blast"]
Use Barnes-Hut
OctTree originally developed to reduce O(N^2) astrophysics calculations to
O(N log N)
OctTree for 100K sample of Fungi
We use OctTree for logarithmic
interpolation
440K Interpolated
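How an OctTree gives logarithmic lookup can be sketched with a small point octree. This is an illustrative assumption about the mechanism, not the authors' code: mapped points are stored in leaves, and a query descends one octant per level to reach a small set of candidate neighbors that could then serve as interpolation references. The class name, leaf capacity, and points are hypothetical.

```python
class OctreeNode:
    """Cubic cell that splits into 8 octants once it holds too many points."""

    def __init__(self, center, half_size, capacity=4):
        self.center = center
        self.half_size = half_size
        self.capacity = capacity
        self.points = []      # (label, (x, y, z)) pairs, kept only in leaves
        self.children = None  # list of 8 child cells after a split

    def _child_index(self, xyz):
        # Bit `axis` is set when the point lies on the upper side of that axis.
        return sum(1 << axis for axis in range(3) if xyz[axis] >= self.center[axis])

    def insert(self, label, xyz):
        if self.children is not None:
            self.children[self._child_index(xyz)].insert(label, xyz)
            return
        self.points.append((label, xyz))
        # The size floor stops endless splitting on duplicate points.
        if len(self.points) > self.capacity and self.half_size > 1e-9:
            self._split()

    def _split(self):
        h = self.half_size / 2
        self.children = []
        for idx in range(8):
            offset = [h if (idx >> axis) & 1 else -h for axis in range(3)]
            child_center = tuple(c + o for c, o in zip(self.center, offset))
            self.children.append(OctreeNode(child_center, h, self.capacity))
        stored, self.points = self.points, []
        for label, xyz in stored:
            self.children[self._child_index(xyz)].insert(label, xyz)

    def leaf_for(self, xyz):
        """Descend one level per step to the leaf cell containing xyz."""
        node = self
        while node.children is not None:
            node = node.children[node._child_index(xyz)]
        return node


# Toy map (assumption): five mapped points inside the unit cube.
root = OctreeNode(center=(0.5, 0.5, 0.5), half_size=0.5, capacity=4)
mapped = {
    "p0": (0.2, 0.2, 0.2),
    "p1": (0.8, 0.2, 0.2),
    "p2": (0.2, 0.8, 0.2),
    "p3": (0.8, 0.8, 0.8),
    "p4": (0.7, 0.9, 0.9),
}
for label, xyz in mapped.items():
    root.insert(label, xyz)

candidates = sorted(label for label, _ in root.leaf_for((0.75, 0.85, 0.85)).points)
```

A leaf can be empty or miss a true nearest neighbor in an adjacent cell, so a fuller version would also inspect neighboring cells or use per-node summaries, as Barnes-Hut does.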
Conclusions
Data → Knowledge: protein annotation
Overwhelming influx of new sequences
Annotation is an immense challenge.
HPC and advanced analytics needed.
PSU as tool to facilitate annotation:
Interactive visualization and exploration
Integrates info on function, pathways, structure, and
environment
MDS preserves grouping structure of protein space
MDS can use different proximities and biological data
Parallel MDS handles large-scale data
MDS interpolation quickly maps new sequences into existing
space
DELSA: Data → Knowledge → Action
Data-Enabled Life Sciences Alliance International
Collective innovation to tackle modern
biological challenges through best
computational practices and advanced cyberinfrastructure.
Harness expertise and resources across
disciplines
Promote accurate, sustainable,
scalable approaches
Facilitate translation of data influx
into tangible innovations and groundbreaking discoveries
DW2 Workshop, May 2012, D.C.
Who was there:
- ~90 participants (by invitation only)
- Academia, Government, Industry, Media, NFP
- 9 Countries (Belgium, Canada, China, Germany, Israel, India, Russia, U.K., and U.S.A.)
What were the goals:
- Help identify Transformational Business Models
- Help identify Top (high impact/potential) Projects
- Stay engaged with DELSA and support mission
- Get the word out about DELSA... tweet, blog, email, talk, FB, LI, present, connect etc.
- Identify people’s optimal role/s in DELSA and endorsed
DELSA Endorsed Projects
Project 1: Social Networking Platform for Tool Brokering/Community Building *
Goal: Open a dialog and an organizational effort to build a social networking platform to broker bioinformatics tools. This project would encourage community engagement by crowdsourcing, accelerate discovery by making tools more accessible, and through community ranking, more trustworthy. It would also build community and connect people and resources.
Deliverable: Social networking platform for idea exchange, resource sharing, tool ranking and brokering.
Project 2: Data Set Accessibility Project
Lead: Corinna Gries
Goal: Make high quality life sciences data broadly available, traceable and usable.
Deliverable: Follow-on workshop to define issues such as: Curation, Sustainability, Rapid growth in data volume, Data provider incentives, Non-trivial processing on the data in the repository, Limited bandwidth from the open Internet to clouds, and Security.
Project 3: Training Data Scientists
Lead: Geoffrey Fox
Goal: Train new and established scientists to enable more effective use of big data and its cyberinfrastructure.
Deliverable: Courses in data-enabled science culminating in a certification similar to Microsoft or Cisco certification or existing scientific computing or computational science certificates/curricula. Need to evaluate existing resources such as UW eScience classes, the OGF Grid Computing certificate, and XSEDE HPC University… A possible approach is to focus on particular life science subdomains.
Project 4: Global Protein Atlas
Lead: Jack Gilbert
Goal: For all the meta-genomes and genomes that are available, cluster at the protein level and annotate (MG-RAST, CAMERA, MOPED, PSU, etc.). The goal is to characterize all the proteins and answer the question: what protein is expressed in what organism, what disease, what tissue, what condition, what environment, and in what concentration?
Deliverable: Based on current large-scale projects such as the Earth Microbiome Project and Human Microbiome Project, we will analyze samples from diverse communities using meta-genomics and meta-proteomics to produce a Global Protein Atlas.
DELSA Endorsed Projects, Cont.
Project 5: Internet2 Application
Lead: Michael Sullivan
Background: Internet2 is an advanced not-for-profit networking consortium developing revolutionary Internet technologies and leveraging a high-performance network (http://www.internet2.edu/). It is currently being adopted by NLM. The project has three components: 1) connect the pilot site to Internet2; 2) deploy a Science DMZ at the pilot site; and 3) perform routine exchange of BigData. It provides a dedicated mode for fast data transfer.
Goal: Create scalable process to connect entities (Research institutes, Universities, and Global Governments) to Internet2.
Project 6: DELSA Matchmaking Website *
Goal: Help scientists connect to each other and to tools, publications, and industry as a way to facilitate more effective science. Possible examples are VIVO and LinkedIn. Could develop a matchmaking "20 questions" to determine individuals’ skills, interests, tools, and review favorites. Could point to publications, tools or resources.
Deliverable: A web-based platform for connecting scientists to other scientists as well as research resources.
Project 7: Pregnancy Atlas Use Case
Lead: Joseph Kemnitz
Goal: Utilize DELSA and its members and connections for resources that would help the Pregnancy Atlas. The Pregnancy Atlas Consortium has an Integrative Discovery Platform that could be expanded. Help with metrics to assess Platform success.
Deliverable: Additional information for the Pregnancy Atlas such as potential collaborators, CI tools, data formats and funding opportunities. Provide files in a format that could be integrated by the Platform.
Project 8: ParaMEDIC Use Case
Lead: Wu Feng
Background: Frequent Pain Points experienced by DELSA members include ease of use issues with analysis tools and compute resources, as well as performance issues which may be due to compute problems, data management problems or data representation problems.
Goal: Use the automated, easy-to-use and integrated high-performance Biocomputing system (including ParaMEDIC: Parallel Metadata Environment for Distributed I/O & Computing) on a Suggested BigData challenge to show what can be done if the system were widely available.
Deliverable: BigData life sciences challenging project successfully accomplished.
References and Resources
COG data is available at the NCBI site
ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/COG0303/
MDS results are available at
http://manxcatcogblog.blogspot.com/
All software used to analyze and visualize
the data is open source.
DELSA: http://www.delsaglobal.org
Protein Global Atlas and Data Accessibility
Projects
Acknowledgements
Grant support
NSF: under DBI: 0969929 (EK) and 0910818
(GF)
NIH: 5 RC2 HG 005806- 02 (GF); NIGMS grant