Big Data with Component Based Software
Who am I
Erik who?
Erik Forsberg <[email protected]>
Linköping University, 1998-2003.
Computer Science programme + lot's of time at Lysator ACS At Opera Software since 2008
Outline
● Background
● The problem
● Components for Processing of Big Data – Hadoop
● Components for Coordination - ZooKeeper
● Components for Storing of Big Data - Cassandra
● Gluing it together
Background - What is Opera Software?
1200+ employees 13 locations
Norway, Sweden (Linköping [70+ people], Gothenburg, Stockholm), Poland, USA, Japan, China, South Korea, Taiwan, Russia, Ukraine, Iceland,
Singapore
We make browsers, deliver content, serve ads, compress video, help operators sell data plans, etc..
And use a lot of computers and bandwidth :-)
What am I doing @ Opera Software?
I'm the Grumpy Guy in the corner Architect, Opera Statistics Platform
Working with around 10 other people in Poland, India, USA
Dealing with Hadoop, Zookeeper, Cassandra, python, Java, Nagios, Zabbix, etc.
Background
Background – What's our problem?
Opera has a multitude of services producing data
● Opera Mini
● Opera Coast
● Opera for Android
● Opera Mediaworks
● Opera Discover
● Etc..
Basically every Opera Service / Product produce some kind of data
Background – contd.
Some of these produce huge amounts of data
Opera Mini – produce 5.5 TiB of data daily for some 250 million monthly users
Background – What's our problem?
We want statistics on.. a lot of things
● Usage by country?
● Usage by version?
● Usage by country and version?
● Pageviews by country and version?
● Unique users by country and version? Per day, week, month, quarter and year?
● Top domains used? Top domains per country?
● How do we make users stay with our products?
● Etc..
Problem #1: Big Data
Don't try to handle 5.5TiB of daily data using your laptop..
Don't try to store the results in a MySQL database and do JOIN. It doesn't work too well when you're adding 200+ million combinations every day.
“Internet-scale data”
Problem #2: Scalability
Our services are growing, and so is the list of required features Must be able to cope with growth by adding more machines, not by replacing one machine with another more expensive and powerful one.
Problem #3: Reliability / Fault tolerance
Statistics are important and must be delivered in time, even if machines in data processing cluster go down
Opera Statistics Platform
Opera Statistics Platform
Architectural overview
The General Idea
How it all works 1. Collect data
2. Produce combinations of variables for fixed timeperiods in Apache Hadoop
3. Store in Apache Cassandra as key/values 4. Access data via Web UI talking to Cassandra
Click Insert > Header & Footer
Input data
Logs are produced by various clusters
We install nginx on each machine.
Logs are retrieved over https, with client certificate authentication.
Components used: nginx
Click Insert > Header & Footer
Getting the logs
OSP Log Manager, written in-house.
Pro tip: Doing 400 concurrent https connections, per file,
works well when communicating with servers in China :-)
Currently using 2Gbit/s during approx. 1.5h period
Logs are directly uploaded to HDFS
Components: Twisted library, python, zookeeper
Processing Big Data
Apache Hadoop
Click Insert > Header & Footer
Apache Hadoop
Map/Reduce (M/R) framework
Distributed fault-tolerant filesystem (HDFS)
Scalable, just add more computers
M/R jobs can be written in Java (native), C++ (pipes),
or with any programming language (streaming)
Hadoop M/R example (in python)
class MRExample(object):
def map(self, key, value):
if key.startswith("i_want_this"):
yield key, value
def reduce(self, key, values):
yield key, sum(values)
Input:
(”i_want_this_1”, 7) (”i_want_this_2”, 8) (”dont_want_this”, 10) (”i_want_this_1”, 7)
Results:
(”i_want_this_1”, 14) (”i_want_this_2”, 8)
Click Insert > Header & Footer
OSP and Hadoop
●
Long chain of M/R jobs
●
Begins with logextract job with directory of logs as input.
●
Continues with jobs that merges timeperiods
●
Then jobs that produce combinations from variables
●
Ends with job that bulkloads permutations to Cassandra
Current Hadoop Production Cluster
● 88 nodes
● Each with 16 cores, 64GiB RAM, 6TB disk
● In total 450TB of HDFS storage
● On Iceland, where Power and Cooling is environmentally friendly and cheap.
Hardware, you cannot always avoid it
Click Insert > Header & Footer
OSP Scheduler
●
Keeps track of the chain of jobs
●
One job's output serves as input to another job
●
Multiple chains, executed per timeperiod
●
Multiple machines, multiple worker processes per machine
●
Uses Apache Zookeeper for coordination, locks, distributed queue
OSP Scheduler
Keeps track of our chain of events
● Reads a configuration file (.ini)
● When one job has completed, jobs depending on it will be started
● Tasks written in python
● Either runs on the scheduler node, or starts a Hadoop job
● Developed in-house. No good alternative available (when we started)
● Fault-tolerant and Scalable
Components used: Python, ZooKeeper
Coordinating Distributed
Systems
Apache ZooKeeper
Apache ZooKeeper
● Centralized service for
distributed synchronization
● Really simple API, but can be used to build complex
distributed services
● Provides a simple way of not having to worry about doing things like distributed locks right.
Because coordinating distributed systems is a Zoo
Apache ZooKeeper
● Filesystem-like data structure
● Znodes
● Znodes can have data (up to 1MiB)
● Znodes can have children
● Atomic changes
● Versioning
● Ephemeral nodes
● Watches Overview
Apache ZooKeeper
Fault-tolerance and performance
● Needs at least 3 servers for redundancy
● One server goes down – service is still up
● With 5 servers, 2 servers may go down and service is still up
● Coordinator is dynamically elected through quorum vote
● Write performance limited by I/O performance on coordinator node
● Read-performance is good, reads are local on each server. All servers keep full copy of database.
● Clients connect to a randomly selected node, writes are forwarded to coordinator node
How OSP uses ZooKeeper
● Distributed queue
● Multiple producers, multiple consumers
● No polling, uses watches
● Locks
● Either a znode is present, or it's not (atomicity)
● Want to run only one cron job at any time, but on multiple nodes – ephemeral node with known name
● Barriers
● Waiting until all tasks in a chain have completed before some action
● Configuration data
● Dynamic reconfig of service by watching known znode
Storing Big Data
Apache Cassandra
Apache Cassandra
● Distributed key/value store, but with columns
● Scalable. Keys distributed among nodes. Just add more hardware.
● Handles large amounts of data
● Tunable consistency
● Configurable replication
● All nodes in cluster are equal.
No master node.
A distributed NoSQL solution
NoSQL
I was raised with RDBMS and MySQL, this is different
● No relations
● Key/Value
● But with Columns. A Key can have many many columns
● Think Different
● “How will data be accessed by the application?”
● Not “How should I store data in the most efficient way?”
Cassandra Data Model
Row keys with sparse columns
● Row keys are hashed and data is distributed among servers
● Keyspace replication factor (RF) decides number of servers that hold each key
● RF > X protects against X servers failing
● RF also decides read performance
● Columns are sorted, and can be retrieved in slices
Click Insert > Header & Footer
Cassandra Cluster
Keyspace1, RF=2 Keyspace1, RF=2 Column Family X
Column Family Y
Column Family Y Column Family Z
Cassandra data model
Sparse columns
Row
key
Home address Work e-mail Home e-mail forsberg The outskirts forsberg@opera.com
carrie NY Downtown sexandthecity
@gmail.com [email protected]
ola Lambohov Ola.leifler@liu.
se
Cassandra data model
No JOIN, use a second Column Family
Row
key
Username[email protected] forsberg [email protected] carrie
Cassandra storage
Sorted String Tables
● Writes to Cassandra are to commitlog, then to memory
● Memory is dumped to sstables periodically (or when they get full)
● Very fast, scalable writes!
● Sparse columns are stored efficiently. No null values to store.
● Columns can have TTL, which makes data go away automatically
● Sstables are merged automatically over time
How OSP use Cassandra
● Results of Hadoop calculations are stored in Cassandra
● i.e. “country=sweden,shoesize=43”: 7 unique users for Opera Mini
● Millions of keys stored every day
● TTL used to make keys go away after our retention period (i.e. daily data is gone after 6 months)
Choosing the right tool for
the job
Hadoop vs Cassandra
Hadoop
● Really good Streaming I/O
● Slow as a melting iceberg on random access. So not useful as backend for UI.
● Has M/R component
Cassandra
● Really good response times on random access to any row key.
● Makes it useful as backend for UI.
● No data crunching component
● Hadoop can crunch data from Cassandra efficiently
Cassandra vs ZooKeeper
Cassandra
● Scalable write performance
● Can keep large blobs
● Structured data, keys and
columns (can have) data types
● No atomicity, but tunable consistency
ZooKeeper
● Write performance limited by I/O of one node
● Small amounts of data (1MiB).
Data must fit in Heap
● Just a blob (bsob?) (we use JSON)
● Atomicity, ephemeral nodes, watches
Two very different Key/Value (kind of) systems
How do you choose the right component?
● KISS
● Don't be afraid of Open Source.
● But evaluate the community activeness
● Evaluate multiple alternatives
● Choose what seems easiest to work with. If it takes 3 days to set it up for an experiment, it probably isn't right.
The UI
The UI
I'm not a UI person, but we have other team members who are
● Written in PHP
● Uses Highcharts, a fine Norwegian component for drawing SVG graphs
● Also uses Memcache for some local caching
Tying it together
Tying it all together
● We choose python
● Chances are that if you need to communicate with component X, python already has a module for it
● Really quick to do development in. Get down to business at once!'
● At Opera we use a multitude of Languages
You need some kind of glue to build a system
End
Questions?
We are hiring!
http://jobs.opera.com
| © 2013 Opera Software ASA. All rights reserved.