Monetizing Millions of Mobile Users with Cloud Business
Analytics
MicroStrategy World 2013 David Abercrombie
Agenda
Tapjoy
Big Data Architecture
MicroStrategy Cloud Implementation MicroStrategy Cloud vs. Amazon EC2 Vertica
Tapjoy by the Numbers
339MM
Global Reach
5,300+
Active Apps
1.3MM
Daily Ad Conversions
A Few of Our Advertisers
RETAIL HEALTH & BEAUTY CPG TRAVEL AUTOMOTIVE ENTERTAINMEN T QSR INSURANCE TELCO ALCOHOL FINANCIAL TECHNOLOGYBig Data
Operational Data Systems
Ruby application servers in Amazon EC2 Memcached
Amazon Simple DB => Riak in data center Amazon RDS => Heroku PostgreSQL
Hadoop on EC2 (HDFS, Hive, Pig, etc.) Vertica on EC2 => data center
R, Mahout, Mallet, etc. Syslog-ng => RabbitMQ Amazon S3
Analytic Data Volume
About one billion analytic rows per day
10 to 15+ thousand per second
Even more in transactional databases
Several dozen terabytes online
Compression and parallelism: Vertica Aggregate 100 billion rows in a second!
Transactional
System – Ruby
Apps
• web_requests
• Gzip Files
• Other Data Files
Amazon S3 Storage Area
• Was Syslog NG
• Will be Rabbit MQ
Hadoop • Main • Backup • DW RDS MySQL Vertica • BI • Prod • MSTR
ETL 1
ETL 2
ETL 3
ETL 4
ETLs
1) Hadoop
2) Load Vertica
3) Load Vertica
MicroStrategy
Cloud
MicroStrategy Uses
Executive dashboards Canned reports
Ad hoc analysis
External partner dashboards – Mobile
Mobile, web and iframe Replace Tableau, etc.
Why Cloud?
Obvious:
No installation No maintenance
No need for special admin skills
Other:
We have had no data center!
Cloud challenges
User and group management (MSTR LDAP) Project-level setup
Mobile URL configs
Deployment (dev, QA, prod) Training, docs, and consulting Password reset
No end-to-end metrics
Schema Design
25 dimensions 100 metrics
80 tables
Partly denormalized snowflake
Aggregated facts, one-hour granularity Custom “ELT” (within Vertica)
MicroStrategy Design Challenges
Non-additive metrics
Type 2 Slow Changing Dimensions
Tech Note 11286
“Row-level” security for external partners
Tech Note 11351
MicroStrategy
Cloud
vs.
EC2 response time anomalies
Good (“prod”) Bad (“bi”)
Minimum 120 ms 150 ms
First Quartile 132 ms 344 ms
Median 153 ms 2,660 ms
Mean 432 ms 10,180 ms
Third Quartile 219 ms 11,230 ms
Maximum 9,652 ms 211,500 ms
EC2 load anomalies (response
time)
top - 18:53:52 up 16 days, 1:09, 4 users, load average: 82.65, 86.79, 89.87 Tasks: 147 total, 1 running, 144 sleeping, 2 stopped, 0 zombie
Cpu(s): 0.1%us, 2.0%sy, 11.9%ni, 81.9%id, 3.9%wa, 0.0%hi, 0.0%si, 0.2%st Mem: 71687580k total, 71642800k used, 44780k free, 1136k buffers Swap: 0k total, 0k used, 0k free, 63913076k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1666 dbadmin 18 0 23.3g 5.5g 76m S 112 8.1 47448:02 vertica 175 root 10 -5 0 0 0 S 1 0.0 20:12.49 kswapd0 1392 spread 4 -11 24680 5708 1096 S 1 0.0 19:03.80 spread 17297 root 15 0 0 0 0 S 0 0.0 0:08.50 pdflush
1 root 18 0 10304 772 636 S 0 0.0 0:02.01 init
2 root RT 0 0 0 0 S 0 0.0 0:01.48 migration/0 3 root 34 19 0 0 0 S 0 0.0 0:00.50 ksoftirqd/0 4 root RT 0 0 0 0 S 0 0.0 0:00.01 watchdog/0 5 root 10 -5 0 0 0 S 0 0.0 0:00.95 events/0 6 root 10 -5 0 0 0 S 0 0.0 0:01.44 khelper
Clustered Database
“Move the computation to the data, never move the data to the computation”
Dr. Michael Stonebraker
“Moving Computation is Cheaper than Moving Data”
Vertica/MicroStrategy
Resources
Track 2 Session 8, 4:45 today in Lafleur 1:
“HP Vertica: A Step-by-Step Plan for
Implementing Mobile Business Intelligence”
Vertica Magic
MPP shared nothing
Parallel SQL
Column store Compression
Vertica Parallel SQL
Client connects to any node
Becomes the initiator node
Initiator creates execution plan
Initiator sends plans to other nodes
Nodes execute plans on their local data Nodes return results to initiator
Initiator combines result subsets from nodes Initiator send answer to client
Vertica Column Store
Pioneered at C-Store project at MIT, etc. Read only the columns you need
Reduces IO
Each IO is more relevant
Vertica Projections
Disk data structures
Like Oracle materialized views Like index-only access path
Can have many per table
Optimize for different queries Watch load time and disk
Run Length Encoding
Sorted: identical values together in same “bucket”
Each “bucket” maps to a set of rows in subsequent columns
F M Fresh Junior Senior Soph Fresh Junior Senior Soph F F F F T T T T F F F F T T T T
Gender Class Pass Name SELECT Name
FROM Students
WHERE Class=‘Junior’ AND Gender=‘M’
308 million rows, 28 columns, 8
GB
Column Name Bytes Rows per Bytedevice_size_code 5,290 58,100
month_key 5,460 56,300
day_key 34,000 9,030
hour_key 571,000 537
platform_key 9,160,000 34
… … …
conversions 308,000,000 1.0
currency_key 609,000,000 0.5
publisher_partner_key 722,000,000 0.42
advertiser_partner_key 729,000,000 0.42
offer_key 742,000,000 0.41
publisher_app_key 784,000,000 0.39
tapjoy_revenue 903,000,000 0.34
Denormalize!
Column Name Bytes Rows per Byte
device_size_code 5,290 58,100
month_key 5,460 56,300
day_key 34,000 9,030
hour_key 571,000 537
platform_key 9,160,000 34
Very little extra disk and IO to denormalize few MB on 8 GB table
Moderately denormalized snowflake schema
Vertica Segmentation (sharding)
Distribute data evenly among nodes
For parallel query processing
Typically hash() a high-cardinality column
But not date or timestamp
More than one segmentation allowed
Parallel Joins
Local joins require identical segmentation
Must include equality on segmentation columns
Predicate Pushdown and Fact
Levels
“MicroStrategy facts must be at same level to be combined into a single metric”
921 Advanced project design p 128.
Vertica aggregation views can change a fact level
Reduce dimensionality
Vertica will “push” a filter predicate down into a view before aggregating, helping performance
Changing fact level with SQL view
CREATE VIEW fact_table_1_ab (dimension_a, dimension_b, high_cardinality_id, fact_1
) AS SELECT dimension_a, dimension_b, high_cardinality_id, sum(metric_1) FROM fact_table_1 GROUP BY dimension_a, dimension_b, high_cardinality_id;
Predicate Pushdown and Fact
Levels
EXPLAIN SELECT * FROM fact_table_1_ab WHERE dimension_a = 'A';
Access Path:
+-GROUPBY HASH (SORT OUTPUT) (PATH ID: 3) | Aggregates: sum(fact_table_1.metric_1)
| Group By: fact_table_1.dimension_a, fact_table_1.dimension_b, | fact_table_1.high_cardinality_id
| Execute on: All Nodes
| +---> STORAGE ACCESS for fact_table_1 (PATH ID: 4)
| | Projection: mstr.fact_table_1_super_b0
| | Materialize: fact_table_1.dimension_a, fact_table_1.dimension_b, | | fact_table_1.high_cardinality_id,
| | fact_table_1.metric_1
| | Filter: (fact_table_1.dimension_a = 'A') | | Execute on: All Nodes
Vertica Challenges
Data structure design is key
Must match SQL and business needs
Parallel joins
Poor update/delete and referential integrity
Complicates ELT
Agenda
Tapjoy
Big Data Architecture
MicroStrategy Cloud Implementation MicroStrategy Cloud vs. Amazon EC2 Vertica
© 2013 Tapjoy, Inc. All Rights Reserved. Tapjoy, Inc. Confidential and Proprietary - Please Do Not Copy or Distribute Without Tapjoy, Inc.’s Prior Written Consent. The data provided herein is for information purposes only and while all efforts are made to ensure accuracy, errors may arise. Tapjoy and the Tapjoy logo are trademarks or registered trademarks of Tapjoy, Inc. All third party logos and trademarks mentioned are the property of their respective owners.