© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
November 12, 2014 | Las Vegas
BDT312
Using the Cloud to Scale
from a Database to a Data Platform
Hi, I’m Ryan
What is Twilio?
We provide a communications API that enables phones,
VoIP, and messaging to be embedded into web, desktop and mobile software.
How Does it Work?
A user calls your number
What is the User Data Team?
• We scale Twilio's backend database infrastructure • We build customer facing data APIs
Calls and Messages are Stateful
Queued Ringing In Progress Completed Queued Sending Sent DeliveredIn the Beginning…
All data was placed in the same physical database regardless of where the call or message was in its lifecycle.
The Monolithic Database Model
API Web Billing MySQL Call/Message Service CarriersProblems at Scale
• Many consumers of data
• Data with different performance characteristics • Failure in the database degrades many services • Horizontal scaling and orchestration is
What is a Service-Oriented Architecture?
An architecture in which required system behavior is decomposed into discrete units of functionality, implemented as individual services for applications to compose and consume.
Communicate Through Interfaces, Not Databases
API Web Billing In Flight MySQL Call/Message Service In Flight Service Post Flight Service Post Flight MySQL CarriersDatabase Can Change Without Changing Every Service API Web Billing In Flight MySQL Call/Message Service In Flight Service Post Flight Service Post Flight Amazon DynamoDB Carriers
SOA Doesn’t Solve Everything
No matter how many services you put in front of MySQL, it’s still a single point of failure.
Implementing Sharding (the easy part)
1. Choose partitioning scheme 2. Implement routing logic
3. Send application queries through router 4. Go!
Sharding at Twilio
Application Router Shard1
Shard2 Shard0
0-3
3-6
Rolling it Out With Zero Downtime (the hard part)
• We provide a 24/7, always on service
• Communications is intolerant of inconsistency and latency
Bringing Up a New Shard
Master1 Slave1 Master2 Slave2 Application 0-9Split Odds and Evens for Writes
Master1 Slave1 Master2 Slave2 Application Odds Evens 0-9Update Routing
Master1 Slave1 Master2 Slave2 Application Odds Evens 0-4 5-9Cut Slave Link
Master1 Slave1 Master2 Slave2 Application 0-4 5-9A Necessary Burden
In the beginning, the burden of managing our
own databases was non-negotiable.
The Landscape has Changed
We now have a variety of managed database services which solve these problems for us, such as Amazon RDS, Amazon DynamoDB, Amazon SimpleDB, Amazon Redshift, etc.
Cost Is Never Optimized
Application developers do not (and should not) optimize for database cost.
Self Managed Databases are Costly
EverythingElse 22%
Databases 78%
Keeping up With Growth
As growth continues to accelerate, we need to somehow keep up.
A Change in Approach
• Change our hiring practices and bring in specialists • Remove the context switching
Thinking in Terms of Throughput
Amazon DynamoDB allows us to scale in terms of throughput, not machines. This is the future of
Operations
Management and scaling of our cluster is fully abstracted away from us.
Cost Compared to MySQL
MySQL 82%
Amazon
DynamoDB 18%
Cost with MySQL Fully Replaced
Everything Else 61% Databases
39%
A Relational Model with Amazon DynamoDB
Many of our services allow for querying data in a way that maps naturally to a relational database.
GET /Accounts/2/Events
SELECT * FROM events WHERE IpAddress=“5.6.7.8” ORDER BY date DESC;
SELECT * FROM events WHERE IpAddress=“5.6.7.8” AND Date<=“2014-10-03” ORDER BY date DESC;
AccountId (Hash) Date (Range) IpAddress_Date Type 2 2014-10-03 5.6.7.8|2014-10-03 call
2 2014-10-01 5.6.7.8|2014-10-01 message
GET /Accounts/2/Events
AccountId (Hash) IpAddress_Date (Range) Date Type 2 5.6.7.8|2014-10-03 2014-10-03 call 2 5.6.7.8|2014-10-01 2014-10-01 message GET /Accounts/2/Events?IpAddress=5.6.7.8
AccountId (Hash) IpAddress_Date (Range) Date Type 2 5.6.7.8|2014-10-03 2014-10-03 call 2 5.6.7.8|2014-10-01 2014-10-01 message GET /Accounts/2/Events?IpAddress=5.6.7.8&Date<=2014-10-03
Need to Handle Exceeded Throughput Failures
Handling Exceeded Write Throughput with
Amazon SQS
Queuing events to Amazon SQS processing
asynchronously allows us to gracefully deal with write throughput errors.
API Web Billing Amazon SQS Events Processor Amazon DynamoDB
Maximum of 5 Global and 5 Local Indexes
You can manage your own indexes, but your
Local Index Size Limits
Local secondary indexes provide immediate
consistency… and limit the data set for a given hash key to 10GB.
Brief History
2008 - 2011
All business intelligence queries run on replicas of MySQL clusters serving production traffic.
Brief History
2011 - 2013
Data pushed to Amazon S3 and queried with Pig,
Amazon EMR, improving ability to aggregate, but with high latency.
Brief History
2013 - Present
Move to Amazon Redshift cut the time these reports took from hours to seconds allowing us to answer
Pushing Data Into Amazon Redshift
Post Flight Service Kafka SQS (DLQ) Amazon S3 Loader S3 Warehouse Loader Amazon RedshiftManaged Services as a Culture
Our focus is on creating an experience that unifies and simplifies communications is a reflection on our adoption of managed services.
Managed Services as a Culture
Understanding and focusing on our areas of expertise and leveraging managed services for the rest
accelerates the delivery of value and innovation to our customers.