Outline
Motivation
Requirements for Scientific Data Transfer
Related Works
Our Proposal: GridTorrent Framework
Test Results
Summary
Motivation
• Computational science is becoming data intensive
• Scientists are faced with mountains of data that stem from four sources [1]:
New scientific instruments double their output every year or so
Simulations generate floods of data
The Internet and computational Grid allow the
Motivation (cont.)
Scientific discovery increasingly driven by data collection [3]
Computationally intensive analyses
Massive data collections
Data distributed across networks of varying capability
Internationally distributed collaborations
Data Intensive Science: 2000-2015
Dominant factor: data growth (1 Petabyte = 1000 TB)
Motivation (cont.)
Scientific applications that generate petabytes of data are very diverse:
– Fusion power
– Climate modeling
– Earthquake engineering
– Astronomy
– Bioinformatics
Motivation (cont.)
Some examples[]
Climate modeling
Community Climate System Model and other simulation applications generate 1.5 petabytes/year
Bioinformatics
The Pacific Northwest National Laboratory is building new confocal microscopes that will generate 5 petabytes/year
High-energy physics
The Large Hadron Collider (LHC) project at CERN will create 15
Motivation Conclusion
The scientific community has a large set of distributed data
Requirements for Scientific Data Transfer
Transferring scientific data at large scale requires:
efficient
high-performance
reliable
secure
policy-aware management
balanced system
CPU farms
storage
Is it a new problem?
The answer is no.
There have been attempts to meet the above requirements, such as:
GridFTP
GridFTPXIO
GridHTTP
TeraGrid Copy (TGCP)
The Replica Location Service (RLS)
GridFTP
Extension of the standard FTP protocol
Reliable
Secure
High performance
Efficient
The de facto standard for transferring data in many Grid
projects
GridFTP (cont.)
Additional features supported by the GridFTP protocol:
Grid Security Infrastructure (GSI) and Kerberos support
Support for reliable and restartable data transfer: transfers restart from the point of failure when a failure occurs
Partial file transfer: only selected regions of a file are transferred
Parallel data transfer: multiple TCP streams between two network endpoints to improve bandwidth
Third-party control of data transfer: the ability to control
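As a rough illustration of the partial, parallel and restartable transfer features listed above, the Python sketch below splits a file into byte ranges (one per stream) and works out what is still missing after an interruption. The function names and the way progress is recorded are hypothetical; this is not the GridFTP protocol or any Globus API.

```python
def split_ranges(file_size, streams):
    """Split [0, file_size) into contiguous byte ranges, one per stream."""
    base, extra = divmod(file_size, streams)
    ranges, start = [], 0
    for i in range(streams):
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


def remaining_ranges(ranges, done):
    """Given bytes already received per range, return what is left to fetch."""
    left = []
    for (start, end), received in zip(ranges, done):
        if start + received < end:
            left.append((start + received, end))
    return left


if __name__ == "__main__":
    ranges = split_ranges(300 * 1024 * 1024, 4)   # 300 MB over 4 streams
    # Assumed progress at the moment of failure, in bytes per stream.
    done = [end - start for start, end in ranges]
    done[2] = 10 * 1024 * 1024                    # third stream stopped early
    print(remaining_ranges(ranges, done))         # only the unfinished region
```

Restarting then means requesting only the ranges returned by `remaining_ranges`, rather than the whole file.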
GridHTTP
Allows large (gigabyte) files to be transferred at optimal speeds using HTTP
Does not deviate from existing HTTP standards, but describes how to use existing headers and methods to produce an encrypted data stream
Supports bulk data transfers via unencrypted HTTP
Supports authentication and authorization with the
GridFTPXIO
The Globus eXtensible Input/Output (XIO) system:
provides an abstraction layer to transport protocols
enables different I/O problems to be presented uniformly through a simple open/close/read/write (OCRW) interface
is a support framework for developing communication protocols
is an interface that enables existing applications written with XIO to access their hardware
Primary usage scenarios:
Independence from the Transport Control Protocol
Ease of adding GridFTP support to third-party applications
Ease of providing GridFTP access to data
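The idea of an open/close/read/write abstraction can be sketched in Python as follows. This is only an analogy: `Driver`, `MemoryDriver` and `copy_message` are hypothetical names, and the real Globus XIO library is a C API that is not reproduced here.

```python
from abc import ABC, abstractmethod


class Driver(ABC):
    """Hypothetical transport driver exposing open/close/read/write."""

    @abstractmethod
    def open(self, target): ...

    @abstractmethod
    def close(self): ...

    @abstractmethod
    def read(self, n): ...

    @abstractmethod
    def write(self, data): ...


class MemoryDriver(Driver):
    """Toy driver that keeps bytes in memory instead of using a network."""

    def __init__(self):
        self.buffer = bytearray()
        self.pos = 0
        self.is_open = False

    def open(self, target):
        self.is_open = True

    def close(self):
        self.is_open = False

    def read(self, n):
        chunk = bytes(self.buffer[self.pos:self.pos + n])
        self.pos += len(chunk)
        return chunk

    def write(self, data):
        self.buffer.extend(data)
        return len(data)


def copy_message(driver, target, payload):
    """Application code: depends only on the four OCRW calls."""
    driver.open(target)
    driver.write(payload)
    echoed = driver.read(len(payload))
    driver.close()
    return echoed
```

Because `copy_message` only touches the four calls, swapping `MemoryDriver` for a driver built on another transport would not require changing the application code.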
TeraGrid Copy (TGCP)
The TeraGrid Copy (TGCP) solution includes three main components:
GridFTP Service
RFT Service
TGCP shell script
In the striped configuration:
the GridFTP service runs on several nodes of a cluster
the data to be transferred is partitioned among the nodes
each node may use several parallel
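One way to picture how data might be partitioned among the nodes is the sketch below. The round-robin block layout and the `stripe` function are assumptions made for illustration; the slides do not say which partitioning scheme the striped GridFTP service uses.

```python
def stripe(file_size, nodes, block_size):
    """Assign fixed-size blocks to nodes round-robin.

    Returns {node_index: [(start, end), ...]} with end exclusive.
    """
    layout = {n: [] for n in range(nodes)}
    for block, start in enumerate(range(0, file_size, block_size)):
        end = min(start + block_size, file_size)
        layout[block % nodes].append((start, end))
    return layout


if __name__ == "__main__":
    # A 10-byte "file" striped over 2 nodes in 4-byte blocks.
    print(stripe(10, 2, 4))
```

Each node then transfers only its own list of byte ranges, so the nodes can work concurrently.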
TGCP (cont.)
The tgcp script can use the globus-url-copy tool either
(A) in third-party transfer mode, or
(B) in conventional
TGCP (cont.)
The RFT Service is used to manage the transfer
It adds reliability to the transfer request
The Replica Location Service (RLS)
Provides a framework for tracking the physical locations of data that have been replicated
Maps logical names to physical names
Replication of data items can:
reduce access latency
improve data locality
increase robustness, scalability and performance for distributed applications
Does not operate in isolation
Used with other components like the Reliable File Transfer
RLS (cont.)
The current RLS implementation has the following features:
Local Replica Catalogs (LRCs)
Replica Location Indices (RLIs)
LRCs send information about their state to RLIs using soft state protocols.
Optional "Bloom Filter" compression can be used to summarize the contents of the LRC.
The current RLS implementation maintains static
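To make the logical-to-physical mapping and the Bloom filter summary more concrete, here is a minimal Python sketch. `LocalReplicaCatalog`, `BloomFilter`, the filter size and the example names are all hypothetical and chosen for illustration; they are not taken from the RLS implementation.

```python
import hashlib


class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, name):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{name}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, name):
        for p in self._positions(name):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, name):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(name))


class LocalReplicaCatalog:
    """Hypothetical LRC: maps logical names to physical locations."""

    def __init__(self):
        self.mappings = {}

    def register(self, logical, physical):
        self.mappings.setdefault(logical, []).append(physical)

    def summary(self):
        """Build the compressed summary that would be sent to an index."""
        bf = BloomFilter()
        for logical in self.mappings:
            bf.add(logical)
        return bf
```

An index holding only the summary can answer "this catalog might hold the name" without storing every mapping; a positive answer still has to be confirmed against the catalog itself.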
So, if there are solutions...
There is no pure P2P data transfer mechanism used in this area.
There are several different protocols
Our Proposal: GridTorrent Framework
We propose a new distributed peer-to-peer file transfer protocol for moving scientific data at an acceptable speed
Similar to GridFTP's extension of FTP, it redefines the BitTorrent protocol to adapt it for scientific data transfer
Why do we need the GridTorrent Framework?
Requirements and characteristics of scientific data transfer:
Large and voluminous data sets
Security
Reliability
Efficiency
Scalability
User-friendly environment
Balanced
Why do we need the GridTorrent Framework? (cont.)
GridTorrent has faster download speed
Large and voluminous data sets
Balanced
GridTorrent allows peers to share bandwidth
Efficiency
GridTorrent is based on BitTorrent
Reliability
Why do we need the GridTorrent Framework? (cont.)
GridTorrent has a security manager
Security
GridTorrent has a content management framework
User-friendly environment
Why BitTorrent?
Alternative peer-to-peer protocols:
FastTrack
Gnutella
eDonkey
Direct Connect
Ares
Why BitTorrent?
Better bandwidth utilization
Unprecedented speeds
Limits free riding: tit-for-tat
Limits leech attacks: upload and download are coupled
Spurious files are not propagated
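The tit-for-tat idea can be sketched in a few lines of Python: a peer uploads to those who recently uploaded the most to it, plus one randomly chosen peer so that newcomers get a chance. The function name, the number of slots and the rate bookkeeping are assumptions for illustration, not the exact rules of any BitTorrent client or of GridTorrent.

```python
import random


def choose_unchoked(upload_rates, slots=3, rng=random):
    """Pick the peers this client will upload to.

    upload_rates: {peer_id: bytes/s that peer recently uploaded to us}
    The best `slots` uploaders are reciprocated; one more peer is picked
    at random (an "optimistic unchoke") from the rest.
    """
    ranked = sorted(upload_rates, key=upload_rates.get, reverse=True)
    unchoked = ranked[:slots]
    others = ranked[slots:]
    if others:
        unchoked.append(rng.choice(others))
    return unchoked


if __name__ == "__main__":
    rates = {"a": 50, "b": 10, "c": 0, "d": 30, "e": 5}
    print(choose_unchoked(rates, slots=2))
```

A peer that never uploads (rate 0) is served only when it happens to win the random slot, which is what limits free riding.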
Why BitTorrent? (cont.)
BitTorrent has proved suitable for distributing very large files.
Many companies use BitTorrent as a distribution protocol:
Amazon S3
Microsoft's Avalanche (inspired by BitTorrent)
Blizzard (game production company)
Advantages of GridTorrent Framework
Saves resources by taking advantage of the unused upload capacity of downloaders:
CPU
Network bandwidth
Disk
Reliable
Jobs can be started and stopped using a web interface
Can be deployed on any system
GridTorrent Framework Components (cont.)
GridTorrent Framework has three major components:
GridTorrent Client
GridTorrent Content Manager
GridTorrent WS-Tracker
GridTorrent Client
It has four components
Torrent Data Sharing Algorithm
Task Manager
WS-Tracker Client
Data Transfer layer
GridTorrent Content Manager
Four main components:
Task Manager
ACL Manager
Content Manager
GridTorrent WS-Tracker
It functions as a regular BitTorrent tracker:
Sends the source and peer list to peers
Updates their status
It sends the task list obtained from the GridTorrent Content Manager
All communications are secure (SSL)
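The tracker duties listed above (handing out source and peer lists, updating status, forwarding tasks) might look roughly like this in Python. The `Tracker` class, its method names and the in-memory storage are hypothetical; the actual WS-Tracker is a web service with SSL-secured communication, which this sketch leaves out.

```python
import time


class Tracker:
    """Hypothetical in-memory tracker, without the web-service or SSL layers."""

    def __init__(self):
        self.peers = {}   # torrent_id -> {peer_id: status dict}
        self.tasks = {}   # peer_id -> list of pending task descriptions

    def announce(self, torrent_id, peer_id, address, downloaded, total):
        """Record a peer's status and return the other members of the swarm."""
        swarm = self.peers.setdefault(torrent_id, {})
        swarm[peer_id] = {
            "address": address,
            "is_source": downloaded >= total,   # holds the complete file
            "last_seen": time.time(),
        }
        return [
            (info["address"], info["is_source"])
            for pid, info in swarm.items() if pid != peer_id
        ]

    def add_task(self, peer_id, task):
        """Queue a task (e.g. from a content manager) for a client."""
        self.tasks.setdefault(peer_id, []).append(task)

    def fetch_tasks(self, peer_id):
        """Hand over and clear the tasks queued for a client."""
        return self.tasks.pop(peer_id, [])
```

A client would call `announce` periodically to refresh its status and `fetch_tasks` to learn what it has been asked to download or share.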
GridTorrent Content Manager
It allows a content owner to publish content at different access levels:
Public level
User level
Group level
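A simple way to think about the three access levels is the check below. The `Content` record, its field names and the `can_access` function are hypothetical and only illustrate the idea; the slides do not describe how the ACL Manager is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    name: str
    level: str                                   # "public", "user" or "group"
    allowed_users: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)


def can_access(content, user=None, groups=()):
    """Return True if the requester may download the content."""
    if content.level == "public":
        return True                              # anyone may access
    if content.level == "user":
        return user in content.allowed_users     # named users only
    if content.level == "group":
        return bool(content.allowed_groups & set(groups))
    return False                                 # unknown level: deny
```

For example, content published at group level for a "climate" group would be visible to any user whose group list includes "climate" and to nobody else.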
Initial Test Results
File size: 300 MB
Number of streams/sources: 4
Source machines: gridfarm (Bloomington, IN)
LAN test:
Iperf bandwidth (Mbps): 857
Client machine: complexity (Indianapolis, IN)
WAN test:
Iperf bandwidth (Mbps): 30.2
Initial Test Results (cont.)
Table 1: Download speed (Mbps) of PTCP vs. GridTorrent with 4 streams/sources

            PTCP    GridTorrent (1 stream)    GridTorrent (4 streams)
LAN Test    80      90                        95
WAN Test    42      49                        102

Table 2: GridTorrent bandwidth load balancing on downloaded file segment with 4 streams/sources
Bandwidth usage (MB downloaded from each source)

            Source1    Source2    Source3    Source4
LAN Test    44         53         47         42
Research Issues
The current BitTorrent protocol is designed for the existing network environment
Modifications needed to provide pure scientific data transfer:
modification of message format and frequency
UDP
GridFTP
Requirements needed to provide pure scientific data transfer:
Security
References
[1] Bell, G.; Gray, J.; Szalay, A. Petascale computational systems. Computer, Volume 39, Issue 1, Jan. 2006, pp. 110-112.
[2] Graham, S.L.; Snir, M.; Patterson, C.A. (eds). Getting Up To Speed, The Future of Supercomputing. NAE Press, 2004, ISBN 0-309-09502-6.
[3] Foster, I. Overview of Grid Computing. http://www-fp.mcs.anl.gov/~foster/Talks/ResearchLibraryGroupGridsApril2002.ppt, last seen 2007.
[4] Science-Driven Network Requirements for Esnet, http://