Migrating Production HPC to AWS

(1)

A Story of Early Adoption & Lessons Learned

(2)

Common Computing Service (CCS)

The Common Computing Service (CCS) is the HPC

(grid computing) environment at a major commodity trader

A custom software layer providing map-reduce and

memoization functions

Schedules client jobs across multiple compute nodes

that execute models provided by the quant teams

Jobs are Closures in that their input contains all the

necessary data for evaluation
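Because each job carries all of its own inputs, a result can be cached against a hash of those inputs and reused when an identical job arrives. The sketch below illustrates that idea together with a simple map-reduce over independent tasks. It is a minimal illustration, not the CCS implementation: the in-memory cache, the function names (`job_key`, `run_job`, `map_reduce`) and the toy `discounted_value` model are all hypothetical.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical in-memory result cache; the real CCS store is not described in the slides.
_cache: Dict[str, Any] = {}


def job_key(model_name: str, inputs: Dict[str, Any]) -> str:
    """Derive a stable cache key from the model name and its complete inputs."""
    payload = json.dumps({"model": model_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_job(model_name: str, model: Callable[..., Any], inputs: Dict[str, Any]) -> Any:
    """Evaluate a self-contained job, reusing a cached result when one exists."""
    key = job_key(model_name, inputs)
    if key not in _cache:
        _cache[key] = model(**inputs)
    return _cache[key]


def map_reduce(
    model_name: str,
    model: Callable[..., Any],
    tasks: List[Dict[str, Any]],
    reduce_fn: Callable[[Iterable[Any]], Any],
) -> Any:
    """Map the model over independent tasks, then reduce the partial results."""
    return reduce_fn(run_job(model_name, model, task) for task in tasks)


calls = 0  # counts real evaluations so the effect of memoization is visible


def discounted_value(cashflow: float, rate: float, years: int) -> float:
    """Toy stand-in for a quant model (hypothetical)."""
    global calls
    calls += 1
    return cashflow / (1.0 + rate) ** years


tasks = [
    {"cashflow": 100.0, "rate": 0.0, "years": 1},
    {"cashflow": 50.0, "rate": 0.0, "years": 2},
    {"cashflow": 100.0, "rate": 0.0, "years": 1},  # identical to the first task
]
total = map_reduce("discounted_value", discounted_value, tasks, sum)
print(total, calls)
```

The third task repeats the first, so the model is evaluated only twice even though three results are summed.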

(3)

CCS Architecture

[Architecture diagram: End Users (Excel) and Trading Systems (Murex, DealBus, Openlink) send client job execution requests to the CCS Service (CCS Client Interface, CCS Job Scheduler, CCS Model Store), which sends task execution requests to the CCS Agent and Application Model on each CCS Compute Node.]

Clustered servers hosting CCS service and associated components (MS HPC Server and SQL Server)

Dedicated servers and (potentially) scavenged / virtual / cloud

Four environments in total

(4)

CCS in Q1 2013

CCS entered service in Q1 providing a shared grid

computing environment as planned

Used by multiple business units and applications

As is usual with such systems, load was quite volatile

Average utilisation 24/7 of under 20%

Peak of 100% for the four-hour EoD batch
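The two utilisation figures together imply how idle the grid is outside the batch window. The short calculation below is a sketch under stated assumptions: that the EoD batch runs once every day, that it holds the grid at 100% for exactly four hours, and that the 20% average is time-weighted. The function name is hypothetical.

```python
HOURS_PER_DAY = 24


def off_peak_utilisation(average: float, peak: float, peak_hours: float) -> float:
    """Utilisation outside the peak window implied by a time-weighted daily average."""
    off_peak_hours = HOURS_PER_DAY - peak_hours
    return (average * HOURS_PER_DAY - peak * peak_hours) / off_peak_hours


# Average is quoted as "under 20%", so this is an upper bound on off-peak load.
upper_bound = off_peak_utilisation(average=0.20, peak=1.00, peak_hours=4)
print(f"Off-peak utilisation is at most about {upper_bound:.0%}")
```

Under those assumptions the grid averages no more than about 4% utilisation for the other twenty hours of the day.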

(5)

Predicted Growth

After go-live there was a capacity uplift of 25% to

accommodate demand from the US business

Empirical evidence from other Financial Services

organisations was that over 5 years grid demand grew by between 10 and 100 fold

If replicated in this case, annual operational

costs would rise to consume up to 20% of the division's operating budget
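To put the 10 to 100 fold range in annual terms, the compound growth rate can be worked out directly. This is plain arithmetic on the figures above, assuming growth compounds evenly over the five years; the function name is hypothetical.

```python
def annual_growth_factor(total_fold: float, years: int) -> float:
    """Constant yearly multiplier that compounds to `total_fold` over `years`."""
    return total_fold ** (1.0 / years)


low = annual_growth_factor(10, 5)    # 10-fold over 5 years
high = annual_growth_factor(100, 5)  # 100-fold over 5 years
print(f"Annual growth between {low - 1:.0%} and {high - 1:.0%}")
```

In other words, demand would need to grow by roughly 58% to 151% every year to land in that range.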

(6)

Need to Control Costs

The possible growth in operating costs was such that

alternatives had to be considered

The low average utilisation showed there was an

opportunity to do this

An alternative that could scale capacity to meet

demand was very attractive

So in Q3 2013 the decision was taken to investigate

(7)

Which Cloud to Use?

CCS is based on Windows HPC Server so our first

thought was to use Azure

However there was no contract in place with Azure

There was one for AWS

(8)

Feasibility

The first step was to show CCS would run in AWS

Adopted a change-nothing, lift-and-shift approach

The first manual build took about a week, which

included learning how to use AWS

By the end of October 2013 we knew the project was

technically feasible

The next step was to get approval to proceed with

(9)

Quite a Few Stakeholders

The Business

Quants

Digital Security

Operational Integrity

Central Accounting

Compliance, Control & Legal

Internal & External Networks

Infrastructure

(10)

Digital Security

Worked extensively with Digital Security to show that

migrating CCS to the Cloud would not introduce unacceptable risks

Demonstrated that CCS was equivalent to several of

the SaaS products already in use

Once submitted, CCS jobs did not require access to

internal data

All communications could be initiated internally

No need for AWS machines to access in-house

resources

(11)

Central Accounting

No mechanism to pay AWS!

Worked with central accounting function to design

new process

AWS provide consolidated billing at the business unit

level

Which needed to be recharged to the individual

(12)

System Build

The CCS environment is reasonably complicated with

a strict sequence of steps required to build a new instance

Time-consuming and error-prone to do this by hand,

so decided to automate the process

Achieved using a combination of Chef and

PowerShell to give fine-grained control

The end result was that a new environment could be

(13)

Development Migration

To ensure that the system would function correctly in

AWS, elements of the development environment were migrated

Build and Unit Tests executed in-house by TFS

When a clean build was available it was automatically

deployed to AWS

Then the set of Acceptance Tests would run in AWS

(14)

SLA for a Scalable System

What is the SLA for a scalable environment?

At times demand will exceed current scale

After some discussion it was agreed that the

appropriate measure was the maximum queue time between job submission and start of execution

This was adopted as the system SLA with different

(15)

Scaling to Meet Demand

Produced a model that predicted the amount of time

a job would queue once submitted

Based on estimating the time taken to complete

currently executing jobs and jobs already queued

The challenge was that it took 15 minutes from

requesting a new node to it being operational

Addressed by creating a fleet of halted nodes which

would start in 60 seconds

[Diagram: a newly submitted job joins the queued jobs in the CCS Job Scheduler; the CCS Resource Manager estimates when a node will become available (example estimates shown: 15 sec and 28 sec).]
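One way to picture the queue-time model is as a small simulation: take the estimated time until each running node is free, hand the queued jobs to whichever node frees up first, and read off when the next node becomes available for the new job. The sketch below does that and then asks how many halted nodes would need to be started to meet a queue-time SLA. It is an assumption-laden illustration, not the CCS model: FIFO dispatch, one job per node, the function names and the example figures are all hypothetical. Only the 60 second start time for a halted node comes from the slide.

```python
import heapq
from typing import List


def predicted_queue_time(node_free_in: List[float], queued_durations: List[float]) -> float:
    """Seconds a newly submitted job would wait, assuming FIFO dispatch.

    node_free_in: estimated seconds until each running node is free (0 if idle).
    queued_durations: estimated run time of each job already queued, in order.
    """
    if not node_free_in:
        return float("inf")
    free_at = list(node_free_in)
    heapq.heapify(free_at)
    for duration in queued_durations:
        start = heapq.heappop(free_at)  # earliest node to become free
        heapq.heappush(free_at, start + duration)
    return free_at[0]


def nodes_to_start(
    node_free_in: List[float],
    queued_durations: List[float],
    sla_seconds: float,
    halted_available: int,
    start_delay: float = 60.0,  # halted nodes start in 60 seconds (from the slide)
) -> int:
    """Smallest number of halted nodes to start so the prediction meets the SLA.

    Falls back to starting every halted node if the SLA cannot be met.
    """
    for extra in range(halted_available + 1):
        fleet = list(node_free_in) + [start_delay] * extra
        if predicted_queue_time(fleet, queued_durations) <= sla_seconds:
            return extra
    return halted_available


# Hypothetical load: two busy nodes and two long jobs already queued.
wait_now = predicted_queue_time([300.0, 400.0], [600.0, 600.0])
extra_nodes = nodes_to_start([300.0, 400.0], [600.0, 600.0], sla_seconds=120.0, halted_available=5)
print(wait_now, extra_nodes)
```

In this example the first two extra nodes are absorbed by the jobs already queued, so a third is needed before the newly submitted job can start within the SLA.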

(16)

Automated Scaling

The Resource Manager scales the running compute node fleet to meet demand

Compute nodes started as load rises and halted as it falls

But always run for 60 minutes, as this is the AWS minimum time charged
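The billing rule suggests a simple scale-down policy: an idle node has already been paid for until the end of its current hour, so there is nothing to gain by halting it earlier. The sketch below encodes that rule. It assumes charging continues in whole 60-minute blocks after the first, and the function names and the two-minute safety margin are hypothetical rather than taken from the Resource Manager.

```python
BILLING_PERIOD_S = 3600  # the 60-minute minimum charge stated on the slide


def seconds_left_in_billed_hour(uptime_s: float) -> float:
    """Time remaining in the hour that has already been paid for."""
    return BILLING_PERIOD_S - (uptime_s % BILLING_PERIOD_S)


def should_halt(idle: bool, uptime_s: float, margin_s: float = 120.0) -> bool:
    """Halt an idle node only in the last `margin_s` seconds of a paid hour.

    `margin_s` is a hypothetical allowance for the time the halt itself takes.
    """
    return idle and seconds_left_in_billed_hour(uptime_s) <= margin_s


print(should_halt(idle=True, uptime_s=600))    # early in the hour: keep running
print(should_halt(idle=True, uptime_s=3500))   # near the end of the hour: halt
```

Keeping idle nodes up until the end of the paid hour costs nothing extra and leaves them available immediately if load rises again.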

(17)

Reliability

Production management components in one AZ, DR

in another one

Compute nodes spread across all available AZs

Use AWS ELB to provide well-known IP addresses

[Diagram: Production Clients connect through the Prod ELB to CCS Production and its Prod Nodes; the Fail Over Manager monitors a heartbeat and the ELB status.]

(18)

DR - Automated Fail Over

Extended the use of the ELB components to automate fail over and fail back

Production failure detected in 60 seconds

Fail back automated once production system

recovered

[Diagram: Production Clients connect through the Prod ELB to CCS Production and its Prod Nodes, with a DR Node alongside; the Fail Over Manager monitors a heartbeat and uses ELB status and control.]
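The fail over and fail back behaviour can be sketched as a small state machine driven by heartbeats. This is an illustration only: the class name, the 15 second heartbeat interval and the immediate fail back on the first healthy heartbeat are assumptions, and the way the real Fail Over Manager reads and controls the ELB is not shown in the slides. Only the 60 second detection time comes from the source.

```python
from dataclasses import dataclass


@dataclass
class FailoverManager:
    """Tracks which environment should receive client traffic (hypothetical sketch)."""

    heartbeat_interval_s: float = 15.0  # assumed polling interval
    failure_window_s: float = 60.0      # detection time quoted on the slide
    active: str = "production"
    _unhealthy_for_s: float = 0.0

    def on_heartbeat(self, production_healthy: bool) -> str:
        """Process one heartbeat result and return the environment to route to."""
        if production_healthy:
            self._unhealthy_for_s = 0.0
            self.active = "production"  # automated fail back once recovered
        else:
            self._unhealthy_for_s += self.heartbeat_interval_s
            if self._unhealthy_for_s >= self.failure_window_s:
                self.active = "dr"      # fail over after the detection window
        return self.active


manager = FailoverManager()
history = [manager.on_heartbeat(ok) for ok in [True, False, False, False, False, True]]
print(history)
```

With a 15 second interval, four consecutive missed heartbeats span the 60 second window, so traffic moves to DR on the fourth and returns to production as soon as a healthy heartbeat is seen.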

(19)

Test At Production Scale

Tested with production workloads for production

timescales

Measured the performance of system and individual

components

Revealed a number of bottlenecks which were

(20)

What Was Delivered

Fully automated, reliable and repeatable

deployments

Pay for usage, Opex reduced by 40%

No more hardware purchases, end of Capex shocks

Ability to meet unusual business demand

(21)

Lessons Learned

Find and engage all the stakeholders

Right-size the architecture, experiment with

alternative platform configurations

Dynamic environments are not as stable as dedicated

hardware, need a strategy to cope

Build automation is a must in order to achieve required

levels of agility

Production-scale testing is a must to identify and

remediate bottlenecks

(22)

Next Steps?

Recharge to the business line

Distributed Data Assets to remove repeated data

transmissions
