• No results found

Using the Amazon Compute Cloud to Analyze Large Data Sets

N/A
N/A
Protected

Academic year: 2021

Share "Using the Amazon Compute Cloud to Analyze Large Data Sets"

Copied!
52
0
0

Loading.... (view fulltext now)

Full text

(1)

Using the Amazon Compute

Cloud to Analyze Large Data Sets

Benjamin F. Williams

University of Washington

(2)

Ben Williams March 11, 2015

(3)
(4)

Coverage in 3 Cameras

Text

Ultraviolet camera

Optical camera

Infrared camera

Characterizing all star types requires ultraviolet optical

and infrared measurements

(5)

WFC3/IR +

(6)

WFC3/IR +

WFC3/UVIS

ACS

+

WFC3/IR +

WFC3/UVIS

ACS

180

o

flip

(7)
(8)

Outline

Pipeline Architecture

Challenges and solutions for running on the

Amazon Cloud (EC2)

Necessary improvements for larger data sets.

Example application: M31 HST Survey

(9)

Break down data reduction into many individual

tasks.

Run tasks on lots of data in parallel.

Group data to optimize measurements

Efficient to reduce same data multiple ways.

Chain of tasks can be restarted from any point

of failure

(10)

Pipeline Architecture

Each task alters data, metadata, parameters,

and/or produces new data

task 1

task 2

task 2

task 2

task 2

task 2

task 2

task 2

task 3

task 3

task 4

task 4

task 4

task 4

task 4

Parameters

Data and

Metadata

Can branch off separate reduction with

different parameters at any point

Database

(11)

Disk

Node 1

Node 2

Node 3

Node N

Database

Cluster Computing

Even

ts

Job Requests

(12)

Why Move to EC2?

Root privileges for installing and maintaining

software

No end-of-life/contract problems

Scaling up or down based on data amounts

Easy to change machine specifications

Minimal unscheduled down time (hackers,

upgrades, failures)

Amazon Research Grants

(13)

S3

N1

N2

DB

EC2 Computing

Data

In

Data Updates

Ev

en

ts,

Job

s, a

nd

Data checks
(14)

Security: launch instances through DB server

Disk storage: separate for each instance

All data on Instances vanishes when they are

terminated

Job queuing and node selection done by server

Initial Challenges

(15)

Security

Only allow instances to be passed

a single number (the event code)

Nodes only let single number

through the firewall.

Node then initiates a connection

to the server -- more complex

information can be exchanged.

S3

N1 N2

Nn

(16)

File Handling

Each file’s metadata is entered

into the DB

A copy of each file is put on S3

When a file is requested, server

syncs S3 and node to the

newest version

Requires careful scripting

S3

N1 N2

Nn

DB

(17)

Tools Produced

Automating start up multiple nodes with a single

command to the database

Web interface to keep track of which nodes are

doing which jobs, as well as finding, and fixing

failures

Task launching regulated and coordinated across

nodes

File synchronization across many nodes through

S3.

(18)
(19)

File synchronization issues -- colliding requests.

More careful task scripting

Server overload -- too many requests

DB indexing

More powerful server node

File transfer size limits and errors

Increased checks that transfers succeed.

Too expensive

(20)

Hourly Rates

type

memory

cpu

on-demand

reserved

1yr min.

typical

spot

Average

8

2

0.14

0.09

0.02

more

memory

15

2

0.18

0.10

0.02

more CPU

15

4

0.28

0.18

0.03

high

memory

61

8

0.70

0.38

0.07

high CPU

30

16

0.84

0.51

0.13

(21)

On demand $0.28 per

hour

(22)

Prices can jump

unexpectedly

(23)

Continuing

Improvements

Optimizing node use (ratio of high to low

memory nodes, and putting the right jobs

on the right type of node).

Still occasional communication problems,

intermittent and very difficult to trace.

Server always on… $$$

(24)

Improvements needed

for Scaling

automated termination of idle nodes

automated adding of nodes (of appropriate

type) when queue is long

storing and using technical requirements

for each task

high-powered server instance (off cloud?)

(25)
(26)

Measure 3 cameras at once

Multiple orientations

Multiple pixel scales

Multiple distortions

Multiple point spread functions

Andy Dolphin, DOLPHOT 2.0

(http://americano.dolphinsim.com/dolphot/)

Neighboring fields aligned for catalog merging and mosaic

(27)

F475W

,

F814W

,

F160W

(28)

F475W

,

F814W

,

F160W

2’ (0.45 kpc)

(29)
(30)

30’’ (0.11 kpc)

(31)
(32)

“Crowding” varies w/ Radius

Bulge

Outer Disk

(33)

Crowding Dominates

Photometric Quality

(34)

Blue

Red

Cyan

Magenta

Yellow

Optical-IR

Main sequence

Red Giant Branch

Asymptotic Giant Branch Bump

Red Clump

(35)

Clusters

(HST vs Ground)

HST

HST

(36)
(37)
(38)

Young clusters probe Initial Mass Function

(39)
(40)

Red Giant Branch Sensitive to Extinction

F110W-F160W

CMD of Single

WFC3/IR

frame,

subdivided in

4x4 grid

(41)

Unreddened

peak

Reddened

peak

Subregions

of single

WFC3/IR

frame

(42)

Map of

the Dust

Residual

structure: 1.5%

flat fielding errors

(43)

Excellent Consistency with

Dust Emission

(44)

Choose dust-free locations

(45)
(46)
(47)
(48)

Spectral energy distribution

Temperature, gravity,

metallicity, intervening dust

properties

(49)
(50)

Public Release

http://archive.stsci.edu/prepds/phat/

117 million stars

Paper reference:

(51)

Julianne J. Dalcanton (UW)

Keith Rosema (Random Walk)

Dan Pirone (Qumulo)

Dustin Lang (CMU)

Andrew Dolphin (Raytheon)

Daniel R. Weisz (UW)

Lori Beerman (UW)

Eric F. Bell (UMich)

Luciana Bianchi (STScI)

Nell Byler (UW)

Morgan Fouesneau (MPIA)

Karoline Gilbert (STScI)

Leo Girardi (INAF)

Karl Gordon (STScI)

Dimitrios Gouliermis (MPIA

Dylan Gregersen (UUtah)

Cliff Johnson (UW)

Jason Kalirai (STScI)

Tod Lauer (NOAO)

Alexia Lewis (UW)

Antonela Monachesi (UMich)

Hans-Walter Rix (MPIA)

Philip Rosenfield (UNIPD)

Anil Seth (UUtah)

(52)

THANKS

NASA: Space Telescope Science Institute, GO

12055

(http://americano.dolphinsim.com/dolphot/) http://archive.stsci.edu/prepds/phat/

References

Related documents

MapReduce is a framework for processing the parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if

In many real world classification tasks, the number of labeled instances is very few due to the prohibitive cost of manually labeling every single data point,

A visual comparison of Figure 3 with Figure 2 shows that the number of potent nodes and the biological activity of “better” daughter nodes found by RP/SA with mean

(ii) When a petition of appiying a filter or operation on an image is received in the cloud computing service access point agent, it spread the message to thoso working agents

Data Mining approach for identifying and monitoring firm’s competitors and to solve number of question like standardizing and quantifying the competitiveness between two items,

A common problem of iMesh and MEMO is that both of them are based on routing protocols proposed for mobile ad hoc networks that rely on broadcasting for route

MapReduce is a framework for processing the parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a

Figure 5 shows the load imbalance in terms of the number of distance calculations performed by each process (using 1024 processes) for the standard k -means and triangle