Migrating Production
HPC to AWS
A Story of Early Adoption & Lessons Learned
Common Computing Service (CCS)
The Common Computing Service (CCS) is the HPC
(grid computing) environment at a major commodity trader
A custom software layer providing map-reduce and
memoization functions
Schedules client jobs across multiple compute nodes
that execute models provided by the quant teams
Jobs are Closures in that their input contains all the
necessary data for evaluation
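The closure property is what makes memoization safe: if a job's input carries everything needed to evaluate it, identical inputs always give identical results, so a result can be cached under a hash of the input. The following is a minimal sketch of that idea, not the CCS implementation; `job_key`, `run_job`, `map_reduce`, the in-memory cache and the toy `price` model are all hypothetical names chosen for illustration.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

# Hypothetical in-memory result cache; the real CCS store is not described here.
_cache: Dict[str, Any] = {}


def job_key(model_name: str, inputs: dict) -> str:
    """Hash the model name plus the complete input payload into a cache key."""
    payload = json.dumps({"model": model_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_job(model_name: str, model: Callable[[dict], Any], inputs: dict) -> Any:
    """Evaluate a job once; identical inputs are served from the cache."""
    key = job_key(model_name, inputs)
    if key not in _cache:
        _cache[key] = model(inputs)
    return _cache[key]


def map_reduce(model_name: str, model: Callable[[dict], Any],
               tasks: List[dict], reduce_fn: Callable[[List[Any]], Any]) -> Any:
    """Map the model over every task, then combine the results."""
    return reduce_fn([run_job(model_name, model, task) for task in tasks])


# Toy model that records how often it is actually evaluated.
calls: List[dict] = []


def price(inputs: dict) -> int:
    calls.append(inputs)
    return inputs["notional"] * inputs["rate"]


tasks = [
    {"notional": 100, "rate": 2},
    {"notional": 50, "rate": 4},
    {"notional": 100, "rate": 2},  # duplicate of the first task
]
total = map_reduce("price", price, tasks, sum)
```

In this sketch the duplicate third task never reaches the model, because its key matches the first task's key. On a real grid the map step would be spread across compute nodes rather than run in a local loop.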
CCS Architecture
[Diagram: CCS architecture. Clients: End Users (Excel) and Trading Systems (Murex, DealBus, Openlink). CCS Service: CCS Client Interface, CCS Job Scheduler, CCS Model Store. CCS Compute Nodes, each running a CCS Agent and an Application Model. Flows: Client Job Execution Request/Response and CCS Task Execution Request/Response. Annotations: clustered servers hosting CCS service and associated components (MS HPC Server and SQL Server); dedicated servers and (potentially) scavenged / virtual / cloud.]
Four environments in total
CCS in Q1 2013
CCS entered service in Q1 providing a shared grid
computing environment as planned
Used by multiple business units and applications
As is usual with such systems, load was quite volatile
Average utilisation 24/7 of under 20%
Peak of 100% for four hour EoD batch
Predicted Growth
After go live there was a capacity uplift of 25% to
accommodate demand from the US business
Empirical evidence from other Financial Services
organisations was that over 5 years grid demand grew by between 10 and 100 fold
If replicated in this case, annual operational costs
would rise to consume up to 20% of the division's operating budget
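To put the 10 to 100 fold range in annual terms, the sketch below converts a five-year growth multiple into an equivalent yearly rate. It assumes demand compounds at a constant rate each year, which is an illustrative simplification and not something the figures above state; `annual_growth_rate` is a hypothetical helper.

```python
def annual_growth_rate(fold: float, years: int) -> float:
    """Constant yearly growth rate that compounds to `fold` over `years`."""
    return fold ** (1.0 / years) - 1.0


low = annual_growth_rate(10, 5)    # roughly 0.58, i.e. about 58% a year
high = annual_growth_rate(100, 5)  # roughly 1.51, i.e. about 151% a year
```

Under that assumption, even the low end of the range means demand growing by more than half every year.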
Need to Control Costs
The possible growth in operating costs was such that
alternatives had to be considered
The low average utilisation showed there was an
opportunity to do this
An alternative that could scale capacity to meet
demand was very attractive
So in Q3 2013 the decision was taken to investigate
Which Cloud to Use?
CCS is based on Windows HPC Server so our first
thought was to use Azure
However there was no contract in place with Azure
There was one for AWS
Feasibility
The first step was to show CCS would run in AWS
Adopted a change-nothing, lift-and-shift approach
The first manual build took about a week, which
included learning how to use AWS
By the end of October 2013 knew the project was
technically feasible
The next step was to get approval to proceed with
Quite A Few Stakeholders
The Business
Quants
Digital Security
Operational Integrity
Central Accounting
Compliance, Control & Legal
Internal & External Networks
Infrastructure
Digital Security
Worked extensively with Digital Security to show that
migrating CCS to the Cloud would not introduce unacceptable risks
Demonstrated that CCS was equivalent to several of
the SaaS products already in use
Once submitted, CCS jobs did not require access to
internal data
All communications could be initiated internally
No need for AWS machines to access in-house
resources
Central Accounting
No mechanism to pay AWS!
Worked with central accounting function to design
new process
AWS provide consolidated billing at the business unit
level
Which needed to be recharged to the individual
System Build
The CCS environment is reasonably complicated with
a strict sequence of steps required to build a new instance
Time consuming and error prone to do this by hand
so decided to automate the process
Achieved using a combination of Chef and PowerShell
to give fine-grained control
The end result was that a new environment could be
Development Migration
To ensure that the system would function correctly in
AWS, elements of the development environment were migrated
Build and Unit Tests executed in-house by TFS
When a clean build was available it was automatically
deployed to AWS
Then the set of Acceptance Tests would run in AWS
SLA for a Scalable System
What is the SLA for a scalable environment?
At times demand will exceed current scale
After some discussion it was agreed that the
appropriate measure was the maximum queue time between job submission and start of execution
This was adopted as the system SLA with different
Scaling to Meet Demand
Produced a model that predicted the amount of time
a job would queue once submitted
Based on estimating the time taken to complete
currently executing jobs and jobs already queued
A challenge was that it took 15 minutes from
requesting a new node to it being operational
Addressed by creating a fleet of halted nodes which
would start in 60 seconds
[Diagram: a new job submitted to the CCS Job Scheduler joins the queued jobs; the CCS Resource Manager and an available node are shown, with example estimates of 15 sec and 28 sec.]
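The queue-time prediction can be sketched as a small simulation: track when each node next becomes free, let the jobs already queued claim nodes in order, and read off when the first node frees up for the new job. This is a minimal sketch under stated assumptions, not the model that was built: it assumes FIFO scheduling, one job per node and known duration estimates, and the names `estimate_queue_time` and `nodes_needed` are hypothetical. A halted node that starts in 60 seconds is modelled as a node that becomes free after 60 seconds.

```python
import heapq
from typing import List


def estimate_queue_time(free_in: List[float], queued_durations: List[float]) -> float:
    """Estimated seconds a newly submitted job waits before it starts.

    free_in: seconds until each node is free (0.0 for an idle node).
    queued_durations: estimated run times of jobs already queued, in FIFO order.
    """
    if not free_in:
        return float("inf")  # no nodes at all: the job can never start
    heap = list(free_in)
    heapq.heapify(heap)
    # Each queued job takes the node that frees up first and occupies it.
    for duration in queued_durations:
        start = heapq.heappop(heap)
        heapq.heappush(heap, start + duration)
    return heap[0]


def nodes_needed(free_in: List[float], queued_durations: List[float],
                 sla_seconds: float, startup_seconds: float = 60.0,
                 max_extra: int = 100) -> int:
    """Smallest number of halted nodes to start so the estimate meets the SLA."""
    for extra in range(max_extra + 1):
        candidate = list(free_in) + [startup_seconds] * extra
        if estimate_queue_time(candidate, queued_durations) <= sla_seconds:
            return extra
    return max_extra


# Illustrative numbers only: two busy nodes, one job already queued.
wait = estimate_queue_time([15.0, 28.0], [10.0])
extra = nodes_needed([300.0, 400.0], [200.0, 200.0], sla_seconds=120.0)
```

Comparing the estimate against the maximum queue time agreed as the SLA gives a direct trigger for starting more nodes.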
Automated Scaling
The Resource Manager
scales the running
compute node fleet to meet demand
Compute nodes started
as load rises and halted as it falls
But always run for 60
minutes as this is the AWS minimum time charged
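One way to honour the 60-minute minimum charge when scaling down is to halt an idle node only near the end of a period that has already been paid for. The sketch below illustrates that rule. It assumes charging continues in whole 60-minute periods after the first, and the function name `should_halt` and the five-minute safety margin are hypothetical choices, not details of the Resource Manager.

```python
BILLING_PERIOD_SECONDS = 3600  # the 60-minute minimum charge noted above


def should_halt(uptime_seconds: float, is_idle: bool,
                margin_seconds: float = 300.0) -> bool:
    """Halt an idle node only in the last few minutes of a paid period."""
    if not is_idle:
        return False  # never halt a node that is executing a task
    seconds_into_period = uptime_seconds % BILLING_PERIOD_SECONDS
    return seconds_into_period >= BILLING_PERIOD_SECONDS - margin_seconds


early = should_halt(600, is_idle=True)    # 10 minutes in: keep the paid-for node
late = should_halt(3400, is_idle=True)    # close to the hour: safe to halt
busy = should_halt(3400, is_idle=False)   # still working: keep running
```

Keeping a node until the end of its paid period costs nothing extra and leaves it available if load rises again within the hour.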
Reliability
Production management components in one AZ, DR
in another one
Compute nodes spread across all available AZs
Use AWS ELB to provide well known IP addresses
[Diagram: Production Clients connect through the Prod ELB to CCS Production and its Prod Nodes; a Fail Over Manager receives a heartbeat and the ELB status.]
DR - Automated Fail Over
Extended the use of the ELB components to automate
fail over and fail back
Production failure detected in 60 seconds
Fail back automated once production system
recovered
[Diagram: as above with a DR Node added; the Fail Over Manager receives a heartbeat and has ELB status and control.]
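The fail over and fail back behaviour can be pictured as a small state machine driven by heartbeats. The sketch below is an illustration only: the class and method names are hypothetical, timestamps are passed in explicitly so the logic is easy to follow, and the step that would actually re-point the load balancer is left as a comment because the mechanism used is not described here.

```python
from dataclasses import dataclass


@dataclass
class FailOverManager:
    detection_window: float = 60.0  # failure detected in 60 seconds
    last_heartbeat: float = 0.0     # time of the latest production heartbeat
    active: str = "production"      # which system currently serves clients

    def record_heartbeat(self, now: float) -> None:
        """Called whenever the production system reports it is alive."""
        self.last_heartbeat = now

    def evaluate(self, now: float) -> str:
        """Fail over when heartbeats stop; fail back once they resume."""
        healthy = (now - self.last_heartbeat) < self.detection_window
        target = "production" if healthy else "dr"
        if target != self.active:
            # A real system would redirect client traffic here (assumption).
            self.active = target
        return self.active


manager = FailOverManager()
manager.record_heartbeat(0.0)
states = [manager.evaluate(30.0), manager.evaluate(61.0)]
manager.record_heartbeat(70.0)
states.append(manager.evaluate(75.0))
```

In this run `states` shows production serving at 30 seconds, DR taking over once the heartbeat is more than 60 seconds old, and production resuming after a fresh heartbeat.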
Test At Production Scale
Tested with production workloads for production
timescales
Measured the performance of system and individual
components
Revealed a number of bottlenecks which were
What Was Delivered
Fully automated, reliable and repeatable
deployments
Pay for usage, Opex reduced by 40%
No more hardware purchases, end of Capex shocks
Ability to meet unusual business demand
Lessons Learned
Find and engage all the stakeholders
Right size the architecture, experiment with
alternative platform configurations
Dynamic environments are not as stable as dedicated
hardware, need a strategy to cope
Build automation a must in order to achieve required
levels of agility
Production scale testing is a must to identify and
remediate bottlenecks
Next Steps?
Recharge to the business line
Distributed Data Assets to remove repeated data
transmissions