Monetizing Millions of Mobile Users with Cloud Business Analytics

(1)

Monetizing Millions of Mobile Users with Cloud Business

Analytics

MicroStrategy World 2013 David Abercrombie

(2)

Agenda

Tapjoy

Big Data Architecture

MicroStrategy Cloud Implementation MicroStrategy Cloud vs. Amazon EC2 Vertica

(3)

(4)

Tapjoy by the Numbers

339MM

Global Reach

5,300+

Active Apps

1.3MM

Daily Ad Conversions

(5)

A Few of Our Advertisers

RETAIL HEALTH & BEAUTY CPG TRAVEL AUTOMOTIVE ENTERTAINMEN T QSR INSURANCE TELCO ALCOHOL FINANCIAL TECHNOLOGY

(6)

(7)

(8)

Big Data

(9)

Operational Data Systems

Ruby application servers in Amazon EC2 Memcached

Amazon Simple DB => Riak in data center Amazon RDS => Heroku PostgreSQL

Hadoop on EC2 (HDFS, Hive, Pig, etc.) Vertica on EC2 => data center

R, Mahout, Mallet, etc. Syslog-ng => RabbitMQ Amazon S3

(10)

Analytic Data Volume

About one billion analytic rows per day

10 to 15+ thousand per second

Even more in transactional databases

Several dozen terabytes online

Compression and parallelism: Vertica Aggregate 100 billion rows in a second!

(11)

Transactional

System – Ruby

Apps

• web_requests

• Gzip Files

• Other Data Files

Amazon S3 Storage Area

• Was Syslog NG

• Will be Rabbit MQ

Hadoop • Main • Backup • DW RDS MySQL Vertica • BI • Prod • MSTR

ETL 1

ETL 2

ETL 3

ETL 4

ETLs

1) Hadoop

2) Load Vertica

3) Load Vertica

(12)

MicroStrategy

Cloud

(13)

MicroStrategy Uses

Executive dashboards Canned reports

Ad hoc analysis

External partner dashboards – Mobile

Mobile, web and iframe Replace Tableau, etc.

(14)

Why Cloud?

Obvious:

No installation No maintenance

No need for special admin skills

Other:

We have had no data center!

(15)

Cloud challenges

User and group management (MSTR LDAP) Project-level setup

Mobile URL configs

Deployment (dev, QA, prod) Training, docs, and consulting Password reset

No end-to-end metrics

(16)

Schema Design

25 dimensions 100 metrics

80 tables

Partly denormalized snowflake

Aggregated facts, one-hour granularity Custom “ELT” (within Vertica)

(17)

MicroStrategy Design Challenges

Non-additive metrics

Type 2 Slow Changing Dimensions

Tech Note 11286

“Row-level” security for external partners

Tech Note 11351

(18)

MicroStrategy

Cloud

vs.

(19)

(20)

EC2 response time anomalies

Good (“prod”) Bad (“bi”)

Minimum 120 ms 150 ms

First Quartile 132 ms 344 ms

Median 153 ms 2,660 ms

Mean 432 ms 10,180 ms

Third Quartile 219 ms 11,230 ms

Maximum 9,652 ms 211,500 ms

(21)

EC2 load anomalies (response

time)

top - 18:53:52 up 16 days, 1:09, 4 users, load average: 82.65, 86.79, 89.87 Tasks: 147 total, 1 running, 144 sleeping, 2 stopped, 0 zombie

Cpu(s): 0.1%us, 2.0%sy, 11.9%ni, 81.9%id, 3.9%wa, 0.0%hi, 0.0%si, 0.2%st Mem: 71687580k total, 71642800k used, 44780k free, 1136k buffers Swap: 0k total, 0k used, 0k free, 63913076k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1666 dbadmin 18 0 23.3g 5.5g 76m S 112 8.1 47448:02 vertica 175 root 10 -5 0 0 0 S 1 0.0 20:12.49 kswapd0 1392 spread 4 -11 24680 5708 1096 S 1 0.0 19:03.80 spread 17297 root 15 0 0 0 0 S 0 0.0 0:08.50 pdflush

1 root 18 0 10304 772 636 S 0 0.0 0:02.01 init

2 root RT 0 0 0 0 S 0 0.0 0:01.48 migration/0 3 root 34 19 0 0 0 S 0 0.0 0:00.50 ksoftirqd/0 4 root RT 0 0 0 0 S 0 0.0 0:00.01 watchdog/0 5 root 10 -5 0 0 0 S 0 0.0 0:00.95 events/0 6 root 10 -5 0 0 0 S 0 0.0 0:01.44 khelper

(22)

(23)

Clustered Database

“Move the computation to the data, never move the data to the computation”

Dr. Michael Stonebraker

“Moving Computation is Cheaper than Moving Data”

(24)

Vertica/MicroStrategy

Resources

Track 2 Session 8, 4:45 today in Lafleur 1:

“HP Vertica: A Step-by-Step Plan for

Implementing Mobile Business Intelligence”

(25)

Vertica Magic

MPP shared nothing

Parallel SQL

Column store Compression

(26)

Vertica Parallel SQL

Client connects to any node

Becomes the initiator node

Initiator creates execution plan

Initiator sends plans to other nodes

Nodes execute plans on their local data Nodes return results to initiator

Initiator combines result subsets from nodes Initiator send answer to client

(27)

Vertica Column Store

Pioneered at C-Store project at MIT, etc. Read only the columns you need

Reduces IO

Each IO is more relevant

(28)

Vertica Projections

Disk data structures

Like Oracle materialized views Like index-only access path

Can have many per table

Optimize for different queries Watch load time and disk

(29)

Run Length Encoding

Sorted: identical values together in same “bucket”

Each “bucket” maps to a set of rows in subsequent columns

F M Fresh Junior Senior Soph Fresh Junior Senior Soph F F F F T T T T F F F F T T T T

Gender Class Pass Name SELECT Name

FROM Students

WHERE Class=‘Junior’ AND Gender=‘M’

(30)

308 million rows, 28 columns, 8

GB

Column Name Bytes Rows per Byte

device_size_code 5,290 58,100

month_key 5,460 56,300

day_key 34,000 9,030

hour_key 571,000 537

platform_key 9,160,000 34

… … …

conversions 308,000,000 1.0

currency_key 609,000,000 0.5

publisher_partner_key 722,000,000 0.42

advertiser_partner_key 729,000,000 0.42

offer_key 742,000,000 0.41

publisher_app_key 784,000,000 0.39

tapjoy_revenue 903,000,000 0.34

(31)

Denormalize!

Column Name Bytes Rows per Byte

device_size_code 5,290 58,100

month_key 5,460 56,300

day_key 34,000 9,030

hour_key 571,000 537

platform_key 9,160,000 34

Very little extra disk and IO to denormalize few MB on 8 GB table

Moderately denormalized snowflake schema

(32)

Vertica Segmentation (sharding)

Distribute data evenly among nodes

For parallel query processing

Typically hash() a high-cardinality column

But not date or timestamp

More than one segmentation allowed

(33)

Parallel Joins

Local joins require identical segmentation

Must include equality on segmentation columns

(34)

Predicate Pushdown and Fact

Levels

“MicroStrategy facts must be at same level to be combined into a single metric”

921 Advanced project design p 128.

Vertica aggregation views can change a fact level

Reduce dimensionality

Vertica will “push” a filter predicate down into a view before aggregating, helping performance

(35)

Changing fact level with SQL view

CREATE VIEW fact_table_1_ab (

dimension_a, dimension_b, high_cardinality_id, fact_1

) AS SELECT dimension_a, dimension_b, high_cardinality_id, sum(metric_1) FROM fact_table_1 GROUP BY dimension_a, dimension_b, high_cardinality_id;

(36)

Predicate Pushdown and Fact

Levels

EXPLAIN SELECT * FROM fact_table_1_ab WHERE dimension_a = 'A';

Access Path:

+-GROUPBY HASH (SORT OUTPUT) (PATH ID: 3) | Aggregates: sum(fact_table_1.metric_1)

| Group By: fact_table_1.dimension_a, fact_table_1.dimension_b, | fact_table_1.high_cardinality_id

| Execute on: All Nodes

| +---> STORAGE ACCESS for fact_table_1 (PATH ID: 4)

| | Projection: mstr.fact_table_1_super_b0

| | Materialize: fact_table_1.dimension_a, fact_table_1.dimension_b, | | fact_table_1.high_cardinality_id,

| | fact_table_1.metric_1

| | Filter: (fact_table_1.dimension_a = 'A') | | Execute on: All Nodes

(37)

Vertica Challenges

Data structure design is key

Must match SQL and business needs

Parallel joins

Poor update/delete and referential integrity

Complicates ELT

(38)

Agenda

Tapjoy

Big Data Architecture

MicroStrategy Cloud Implementation MicroStrategy Cloud vs. Amazon EC2 Vertica

(39)

© 2013 Tapjoy, Inc. All Rights Reserved. Tapjoy, Inc. Confidential and Proprietary - Please Do Not Copy or Distribute Without Tapjoy, Inc.’s Prior Written Consent. The data provided herein is for information purposes only and while all efforts are made to ensure accuracy, errors may arise. Tapjoy and the Tapjoy logo are trademarks or registered trademarks of Tapjoy, Inc. All third party logos and trademarks mentioned are the property of their respective owners.