(1)

IU ORE-Chem Update

(2)

IU to lead New US NSF Track 2d $10M Award

(3)

What We Said We Would Do

Apply data-centric workflow technologies (Dryad)

Significant effort

Install and run triple store

Done locally.

Need to do this in Azure.

Design alternative formats for ORE (JSON, Microformats)

– Nothing to report yet

Design secure services, compositions, mash-ups

OAuth piece done.

Significant effort on social network interfaces

– Nothing to report on ORE-chem enabled services yet

Investigate clouds for ORE-Chem

Infrastructure and runtime

(4)

Layer Cake of IU Activities

Web 2.0 Research: Security for REST Services

Cloud Computing: Infrastructure and Runtimes

(5)
(6)

Cloud Infrastructure

Tempest: HP distributed shared memory cluster with 768 processor cores and 1.5 TB total memory capacity. The cluster includes 13.7 TB of local spinning disk.

– Tempest can be dynamically reconfigured to act as either a Windows HPC or Linux cluster.

– Smaller versions Madrid and Barcelona

Other machines:

– The IBM iDataPlex system is an IBM e1350 distributed shared memory cluster with 1024 processor cores and 3 TB total memory capacity.

– Cray XT5m distributed shared memory cluster with 672 processor cores and 1.3 TB total memory capacity.

(7)
(8)
(9)

Triple Store: Intellidimension

This has been installed on IU servers.

We are ready for data.

Efforts to install this on MS Azure were not successful.

Inadequate documentation earlier in the year.

(10)

Open Elastic Block Store

Amazon EBS is a way to mount virtual disks in cloud-space.

Empty disk space or archived data stores

– ORECHEM enabled data sets, for example.

Clone-able, so keep your own version of community data.

We are implementing an open version of this.

Contribute to Nimbus, an open-source EC2

But independent of Xen, etc.

Would be interesting to do this for Windows

Eventual backbone: IU has over a petabyte of disk space in a Lustre file system.

Can be used to load and store VMs.

X. Gao won best student poster award at TG09.

(11)

Block Store Architecture

Volume Server

Volume Delegate

Virtual Machine Manager (Xen Dom

(12)

Integration with Cloud Computing Systems

[Diagram: Volume Server, Volume Delegate, Xen Dom 0, Xen Delegate, Xen Dom U, VBS Web Service, VBS Client, VBD, iSCSI, Nimbus Workspace Service, VBS_Nimbus Web Service]

Operations shown: Create Volume, Export Volume, Create Snapshot, etc.; Import Volume, Attach Device, Detach Device, etc.

VBS_Nimbus call: Attach-volume <volId> <Nimbus Instance Id> <device>

Query for Xen Dom0 Host and DomUId with <Nimbus
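The operation names on this slide suggest a simple volume lifecycle. The following is a minimal Python sketch under stated assumptions: the `Volume` class, its state names, and the order of transitions are hypothetical illustrations, not the VBS API. Only the operation names (export, import, attach, detach, snapshot) come from the slide.

```python
class Volume:
    """Hypothetical model of a block-store volume; not the real VBS API."""

    # Assumed transitions: operation -> (required state, resulting state).
    TRANSITIONS = {
        "export": ("created", "exported"),
        "import": ("exported", "imported"),
        "attach": ("imported", "attached"),
        "detach": ("attached", "imported"),
    }

    def __init__(self, vol_id, size_gb):
        self.vol_id = vol_id
        self.size_gb = size_gb
        self.state = "created"
        self.device = None

    def apply(self, operation, device=None):
        """Perform one operation, rejecting it if the volume is in the wrong state."""
        required, result = self.TRANSITIONS[operation]
        if self.state != required:
            raise ValueError(f"cannot {operation} a volume in state {self.state!r}")
        self.state = result
        # Only an attached volume has a device name.
        self.device = device if operation == "attach" else None
        return self

    def snapshot(self, new_id):
        """Clone the volume's metadata; a real store would copy or share blocks."""
        return Volume(new_id, self.size_gb)


vol = Volume("vol-1", size_gb=10)
vol.apply("export").apply("import").apply("attach", device="/dev/sdb")
print(vol.state, vol.device)
```

Keeping the allowed transitions in one table makes it easy to see which calls a service would have to reject, for example attaching a volume that has not yet been imported on the VM host.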

(13)
(14)

Multicore and Cloud Technologies to support Data Intensive applications

Using Dryad (Microsoft) and MPI to study the structure of gene sequences on the Tempest cluster. We are working on PubChem.

See http://www.infomall.org/salsa for the SALSA lab.

(15)

The PubChem dataset consists of binary 166-bit MACCS keys (fingerprints), which indicate whether or not each chemical compound contains a particular functional group or substructure.

We have a total of 26,466,421 chemical compounds (i.e., the PubChem dataset has 166 dimensions and about 26M records).

We randomly selected 50K chemicals to produce a 3D GTM map. GTM is an algorithm for finding a lower-dimensional structure (3D in this case) in higher-dimensional data.

http://www.youtube.com/watch?v=nylgjKgnSLg
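To make the dataset dimensions concrete, the sketch below packs one 166-bit fingerprint into bytes, estimates the storage needed for all 26,466,421 records, and draws a random 50K sample of record indices. The function names and the packing layout are assumptions for illustration; only the sizes come from the slide.

```python
import random

N_KEYS = 166               # MACCS keys per compound (from the slide)
N_COMPOUNDS = 26_466_421   # compounds in the dataset (from the slide)


def pack_fingerprint(bits):
    """Pack 166 zeros/ones into 21 bytes, first key in the most significant bit."""
    if len(bits) != N_KEYS:
        raise ValueError("expected 166 bits")
    value = 0
    for bit in bits:
        value = (value << 1) | (1 if bit else 0)
    n_bytes = (N_KEYS + 7) // 8
    # Left-align the bits so the padding sits at the end of the last byte.
    value <<= n_bytes * 8 - N_KEYS
    return value.to_bytes(n_bytes, "big")


def sample_indices(total, k, seed=0):
    """Pick k distinct record indices uniformly at random."""
    return random.Random(seed).sample(range(total), k)


bytes_per_record = (N_KEYS + 7) // 8
total_bytes = bytes_per_record * N_COMPOUNDS
print(bytes_per_record, total_bytes)   # 21 555794841

subset = sample_indices(N_COMPOUNDS, 50_000)
print(len(subset))                      # 50000
```

At 21 bytes per record the packed fingerprints come to roughly 556 MB, so the full set fits in the memory of a single node, and the 50K sample is about 1 MB.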

(16)

IU’s ORE-CHEM Pipeline

Harvest NIH PubChem for 3D Structures

Convert PubChem XML to CML

Convert CML to Gaussian Input

Submit Jobs to TeraGrid with Swarm

Convert Gaussian Output to CML

Convert CML to RDF->ORE-Chem

Insert RDF into RDF Triple Store

Conversions are done with Jumbo/CML tools from Peter Murray-Rust’s group at Cambridge. Swarm is a Web service capable of managing tens of thousands of jobs on the TeraGrid. We are developing a Dryad version of the pipeline.

Goal is to create a public, searchable triple store populated with ORE-CHEM data on drug-like
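The pipeline is a linear chain of conversions, so it can be sketched as a list of stage functions applied in order. In the Python sketch below every stage is a placeholder that only records its name. The real work is done by the Jumbo/CML converters, Gaussian, and Swarm, whose interfaces are not shown on the slide and are not modeled here; the compound identifier is also hypothetical.

```python
from functools import reduce


def make_stage(name):
    """Return a placeholder stage that appends its name to the item's history."""
    def stage(item):
        return {"payload": item["payload"], "history": item["history"] + [name]}
    return stage


# Stage names follow the slide; the implementations are stand-ins.
PIPELINE = [make_stage(name) for name in (
    "harvest PubChem 3D structure",
    "PubChem XML -> CML",
    "CML -> Gaussian input",
    "submit to TeraGrid via Swarm",
    "Gaussian output -> CML",
    "CML -> RDF (ORE-Chem)",
    "insert into RDF triple store",
)]


def run_pipeline(compound_id, stages=PIPELINE):
    """Push one compound through every stage in order."""
    start = {"payload": compound_id, "history": []}
    return reduce(lambda item, stage: stage(item), stages, start)


result = run_pipeline("CID-0000")  # hypothetical identifier
print(len(result["history"]), result["history"][-1])
```

Because each stage takes one item and returns one item, the same list could be mapped over many compounds independently, which is what makes the pipeline a fit for high-throughput job managers.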

(17)

Iterative MapReduce- Kmeans Clustering and Matrix Multiplication

[Figures: iterative MapReduce algorithm for Matrix Multiplication; Kmeans Clustering implemented as an iterative MapReduce application]

Overhead of parallel runtimes – Matrix Multiplication

Compute intensive application, O(n^3)

Higher data transfer requirements, O(n^2)

CGL-MapReduce shows minimal overheads next to MPI

Overhead of parallel runtimes – Kmeans Clustering

O(n) calculations in each iteration

Small data transfer requirements, O(1)

With large data sets, CGL-MapReduce shows negligible overheads

Much higher overheads in Hadoop and Dryad
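Kmeans as an iterative MapReduce application can be sketched in a few lines. The Python version below runs in a single process and does not model CGL-MapReduce, Hadoop, or Dryad. It only shows the structure: a map step assigns each point to its nearest center, a reduce step averages each group, and the loop repeats until the centers stop moving. All function names are illustrative.

```python
from collections import defaultdict


def kmeans_map(point, centers):
    """Map: emit (index of nearest center, (point, 1))."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return distances.index(min(distances)), (point, 1)


def kmeans_reduce(values):
    """Reduce: average all points assigned to one center."""
    dim = len(values[0][0])
    sums = [0.0] * dim
    count = 0
    for point, n in values:
        count += n
        for i in range(dim):
            sums[i] += point[i]
    return tuple(s / count for s in sums)


def kmeans(points, centers, max_iterations=100, tolerance=1e-9):
    """Repeat map and reduce until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iterations):
        groups = defaultdict(list)
        for point in points:                      # map phase
            key, value = kmeans_map(point, centers)
            groups[key].append(value)
        new_centers = [                           # reduce phase
            kmeans_reduce(groups[i]) if groups[i] else centers[i]
            for i in range(len(centers))
        ]
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new))
            for old, new in zip(centers, new_centers)
        )
        centers = new_centers
        if shift <= tolerance:
            break
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)]))
# [(0.0, 0.5), (10.0, 10.5)]
```

Only the list of centers changes between iterations, which is why the per-iteration data transfer is small compared with the computation over all points.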

(18)

High Performance Parallel Computing on Cloud

• Performance of MPI on virtualized resources

– Evaluated using a dedicated private cloud infrastructure

– Exactly the same hardware and software configurations in bare-metal and virtual nodes

– Applications with different communication:computation ratios

– Different virtual machine (VM) allocation strategies {1 VM per node to 8 VMs per node}

Performance of Matrix Multiplication under different VM configurations

O(n^2) communication (n = dimension of a matrix)

More susceptible to bandwidth than latency

Minimal overheads under virtualized resources

Overhead under different VM configurations for Concurrent Wave Equation Solver

O(1) communication (smaller messages)

More susceptible to latency

Higher overheads under virtualized resources
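The latency/bandwidth contrast can be illustrated with the usual first-order message cost model, time = latency + size / bandwidth. The link parameters and message sizes below are hypothetical, chosen only to show the trend. They are not measurements from this study.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order cost of sending one message."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def latency_fraction(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Share of the transfer time spent on latency rather than on moving bytes."""
    total = transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


# Hypothetical link: 100 microseconds latency, 100 MB/s bandwidth.
LATENCY = 100e-6
BANDWIDTH = 100e6

# An O(1)-sized message: a single 8-byte value.
small = latency_fraction(8, LATENCY, BANDWIDTH)
# An O(n^2)-sized message: a 4096 x 4096 block of 8-byte values.
large = latency_fraction(8 * 4096 * 4096, LATENCY, BANDWIDTH)

print(f"small message: {small:.1%} of time is latency")
print(f"large message: {large:.4%} of time is latency")
```

In this model small messages are dominated by latency and large ones by bandwidth, so any extra per-message latency added by virtualization would hurt the O(1)-communication application far more than the matrix multiplication.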

(19)

Conclusions: Dryad for Scientific Computing

Investigated several applications with various computation, communication, and data access requirements

All DryadLINQ applications work, and in many cases perform better than Hadoop

We can definitely use DryadLINQ (and Hadoop) for scientific analyses

We did not implement (or find) applications that can only be implemented using DryadLINQ but not with typical MapReduce

Current release of DryadLINQ has some performance limitations

DryadLINQ hides many aspects of parallel computing from the user

(20)


(21)

Architecture of Swarm Service

[Diagram: Swarm service architecture]

Components: Standard Web Service Interface, Large Task Load Optimizer, Swarm-Analysis, Local RDBMS, Swarm-Grid, Swarm-Dryad

Connectors: Swarm-Grid Connector, Swarm-Dryad Connector, Swarm-Hadoop Connector

Target resources: Grid HPC/Condor Cluster, Windows Server Cluster, Cloud Comp. Cluster

(22)

Swarm-Grid

Swarm considers traditional Grid HPC clusters suitable for high-throughput jobs.

– Parallel jobs (e.g. MPI jobs)

– Long running jobs

Resource Ranking Manager

– Prioritizes the resources with QBETS, INCA

Fault Manager

– Fatal faults

– Recoverable faults

[Diagram: Standard Web Service Interface; Swarm-Grid with Request Manager, Resource Ranking Manager, Data Model, Job Queue, Job Distributor, and Local RDBMS; Grid HPC/Condor pool Resource Connector using Condor (Grid/Vanilla) with Birdbath; Grid HPC Clusters and a Condor Cluster; QBETS Web Service (hosted by UCSB); MyProxy Server (hosted by TeraGrid Project)]
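The interplay of resource ranking and fault handling can be sketched as follows. This Python sketch is an assumption about how such a component might behave, not Swarm's implementation. The `Resource` fields stand in for a queue-wait prediction (such as QBETS provides) and a monitoring result (such as INCA provides), and the wait times, exception classes, and `fake_submit` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    predicted_wait_s: float   # stand-in for a queue-wait prediction
    healthy: bool             # stand-in for a monitoring result


class RecoverableFault(Exception):
    """A fault worth retrying on another resource."""


class FatalFault(Exception):
    """A fault that retrying cannot fix."""


def rank_resources(resources):
    """Return healthy resources ordered by shortest predicted wait."""
    return sorted((r for r in resources if r.healthy), key=lambda r: r.predicted_wait_s)


def submit_with_failover(job, resources, submit):
    """Try resources in ranked order; move on after recoverable faults."""
    for resource in rank_resources(resources):
        try:
            return submit(job, resource)
        except RecoverableFault:
            continue            # try the next-best resource
    raise FatalFault(f"no resource could run {job!r}")


def fake_submit(job, resource):
    """Hypothetical submitter: one machine is temporarily unavailable."""
    if resource.name == "Mercury":
        raise RecoverableFault("queue temporarily unavailable")
    return f"{job} -> {resource.name}"


pool = [
    Resource("Abe", 600.0, True),
    Resource("Mercury", 30.0, True),
    Resource("Cobalt", 5.0, False),
]
print(submit_with_failover("job-1", pool, fake_submit))  # job-1 -> Abe
```

A fatal fault raised by the submitter itself propagates immediately, while recoverable faults only cost one attempt on the affected resource.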

(23)

Some Details

We can submit jobs to 3 different TeraGrid machines

– Abe, Mercury, Cobalt (all at NCSA)

– IU’s BigRed has some technical problems

Can do about 100-200 molecules per day in tests.

Approach is fragile because application and system administrators tend to change things every few months.

(24)

Dryad Data Partitioning

Two methods:

Manually place the files on every node, or

Write C# code that uses DryadLINQ partitioning operators such as HashPartition<T,K> or RangePartition<T,K>

A partitioned data set consists of 2 types of files:

A metadata file (.pt extension) that describes the partitions

A set of partition files, one for each data partition.

Example metadata file:

\DryadData\UserName\InputData (file path and name)

4 (number of partitions, depending on the number of nodes available)

0,2000,NODE01 (partition files: partition number, size in bytes, node name : file path)

1,2000,NODE02,NODE03:FilePath
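The metadata layout described above is simple enough to parse by hand. The Python sketch below assumes exactly the layout shown on the slide (a path line, a partition count, then one comma-separated line per partition with an optional `:FilePath` suffix). It is not based on the DryadLINQ documentation, and the names `Partition` and `parse_partition_metadata` are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    number: int
    size_bytes: int
    nodes: List[str]
    path_override: Optional[str]


def parse_partition_metadata(text):
    """Parse a path line, a partition count, and one line per partition."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    base_path, count = lines[0], int(lines[1])
    partitions = []
    for line in lines[2:]:
        # An optional ":FilePath" suffix follows the node list.
        fields, _, override = line.partition(":")
        number, size, *nodes = fields.split(",")
        partitions.append(Partition(int(number), int(size), nodes, override or None))
    if len(partitions) != count:
        raise ValueError(f"expected {count} partitions, found {len(partitions)}")
    return base_path, partitions


# Two-partition example in the slide's layout.
SAMPLE = r"""\DryadData\UserName\InputData
2
0,2000,NODE01
1,2000,NODE02,NODE03:FilePath
"""

base_path, partitions = parse_partition_metadata(SAMPLE)
print(base_path)
for p in partitions:
    print(p.number, p.size_bytes, p.nodes, p.path_override)
```

Checking the declared count against the number of partition lines catches a metadata file that was edited by hand and left inconsistent.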

(25)

Programming the Pipeline

IQueryable<T> represents a query over the data.

Input data is represented by a PartitionedTable<T> object.

DryadLINQ programs apply LINQ query operations to PartitionedTable<T> objects.

LINQ queries on the PartitionedTable<T> object are executed on the cluster.

Jobs are executed on different nodes, and the output is collected in the outputDirectory.

IQueryable<LineRecord> filenames = PartitionedTable.Get<LineRecord> (filepathuri);

IQueryable<outputinfo> outputs = filenames.Select(s =>

(26)
(27)

OAuth: REST Security

This is actually a Year 2 deliverable, but we made progress in Year 1.

OAuth is essentially security for REST.

Provides authentication and authorization

Relevant to ORE-CHEM services

Use REST URL and HTTP method

Resources are identified by URLs

Access privileges are identified by HTTP methods (GET, POST, PUT, DELETE)

Extend OAuth

Add finer-grained authorization information in
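The idea that a resource is a URL and a privilege is an HTTP method can be sketched as a small access check. In the Python sketch below the access-control table, user names, and URLs are hypothetical, and authentication (establishing who the user is, which is OAuth's job) is assumed to have already happened.

```python
from urllib.parse import urlparse

# Hypothetical access-control list: (user, resource path) -> allowed HTTP methods.
ACL = {
    ("alice", "/molecules/42"): {"GET", "PUT"},
    ("bob", "/molecules/42"): {"GET"},
}


def is_authorized(user, method, url, acl=ACL):
    """Allow the request only if the method is granted on the URL's path."""
    path = urlparse(url).path
    return method.upper() in acl.get((user, path), set())


print(is_authorized("alice", "PUT", "http://example.org/molecules/42"))    # True
print(is_authorized("bob", "DELETE", "http://example.org/molecules/42"))   # False
```

Denying by default (an unknown user or path yields an empty set of methods) keeps the check safe when the table is incomplete.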

(28)

OAuth Security Status

OAuth *Core* Code provides the fundamental piece of the OAuth 1.0 specification.

Includes minimal webapp example

– The sample web apps support only a shared secret.

– We extended them to support PKI.

– We also fixed some bugs in the code.

– To support OAuth extensions, more code is needed in OAuth core.

For OpenID, we use the OpenID4Java library, which seems to offer enough functionality so far.

Tutorial given at TeraGrid09

Slides: http://www.collab-ogce.org/ogce/images/3/39/OAuthOverview-TG09.ppt

Code:

(29)

Acknowledgments

Geoffrey Fox

Judy Qiu and SALSA team: data mining

www.infomall.org/salsa

Jaliya Ekanayake: Dryad and Cloud performance

Sangmi Pallickara: Swarm service

Xiaoming Gao: Virtual Block Store

Zhenhua Guo: OAuth, OpenID, and Social

(30)
(31)

Dryad and DryadLINQ

Dryad is a high-performance, general-purpose distributed computing engine that simplifies the task of implementing distributed applications on clusters of computers running a Windows operating system.

DryadLINQ allows us to implement Dryad applications in managed code by using an extended version of the LINQ programming model and API. LINQ was introduced with Microsoft .NET Framework version 3.5.

The DryadLINQ provider translates the application’s LINQ queries into a Dryad job and runs the job as a distributed application on a Windows HPC cluster.

(32)
(33)
(34)

Client Workstation: runs the DryadLINQ application.

DryadLINQ Provider: creates a Windows HPC job on the cluster to handle the Dryad processing, receives the results, and returns them to the application.

Job Manager: Windows HPC task that manages execution of the associated Dryad job on the cluster.

Head Node: manages the cluster, hosts the Windows HPC Administration Console and Dryad management service.

(35)

java.exe + jumboconverters.jar: xml -> cml, cml -> gaussian input

Local Machine

DryadLINQ Provider (LinqToDryad.dll)

Dryad Cluster

Distribute gaussian Input files across the cluster and run gaussian.exe on every file at every node in the

(36)

Distribute all the initial xml files over the cluster

Stage 1: xml to cml conversion

Stage 2: cml to gaussian conversion

Stage 3: run gaussian on every file

(37)

Drilling Through Data Clouds

Bare metal (Computer, network, storage)

FutureGrid/VM/Virtual Storage

Cloud Technologies (MapReduce, Dryad, Hadoop)

Classic HPC (MPI)

Applications

§ Cheminformatics: Mapping PubChem data into low dimensions to aid drug discovery

§ Biology: Expressed Sequence Tag (EST) sequence assembly (CAP3)

§ Biology: Pairwise Alu sequence alignment (SW)

§ Health: Correlating childhood obesity with environmental factors

Data mining Algorithm

Clustering (Pairwise, Vector)

MDS, GTM, PCA, CCA

Visualization

(38)

Architecture and Performance of Runtime Environments for Data Intensive Scalable Computing

Data/compute intensive applications implemented as MapReduce “filters”

[Figure: Architecture of CGL-MapReduce]

Measured using 32 compute nodes, each with 8 cores and 16 GB of memory

CAP3 – Gene Assembly Program (plotted: number of reads processed)

Compute intensive application

Embarrassingly parallel operation

All runtimes perform equally well

High Energy Physics Data Analysis

Data intensive application

MapReduce style parallel operation

Both runtimes perform comparably well

(39)
(40)
