Big Data for Architects
Sam Babad @ MidLink
052-595-8885
Agenda
•
What is BigData
•
BigData challenges
•
Textbook big data architecture
•
Analysis, visualization & self service
•
BigData in the cloud
What is BigData
Why is everyone talking about BigData?
•
BigData is a Big Buzzword coined by Big
Companies to make Big Money
•
There are a lot of new and exciting
technologies behind the buzzword
•
A lot of money has been and is being
invested into BigData companies
What is BigData
•
Commonly defined
as a combination of
three
•
(the three Vs)
–
Volume
–
Velocity
–
Variety
•
My Definition
– Data that prevents running the algorithms you want on it in a reasonable amount of time.
What is BigData?
•
Big Data is not a buzzword. Big Data is something
more and more organizations have
•
Today most consider > 15TB Big Data
•
For real time > 5TB is considered Big Data
•
The internet entities are managing PBs of data
Data sources
•
Machine generated
•
Handheld devices
•
Optimizers
•
Analyzers
•
Loggers
•
Probes
•
And a billion sensors
Use Cases
•
Advertisement optimization and verification
•
Security and fraud detection
•
Algo-trading
•
Behavioral analysis
•
Network optimization
•
Real time billing
Some numbers
•
In memory data bases of 50TB are
reasonable
•
Columnar of 2-5PB are reasonable
•
We usually do not consider 15TB big
data for analytics
•
This is all on commodity hardware
Some numbers
•
AirBus A380 => ~640TB per flight
•
Twitter ~12TB per day
•
NYSE 1TB per day
•
Storage capacity doubles every 3 years
Is there value in BigData
• In 2014 Big Data vendors will pocket nearly $30 Billion from HW, SW and PS (2020 => $76B) :marketwatch.com
• How did Target know your 16 years old is pregnant before you did?
• Would you like to know the weather tomorrow?
• Take a look at Goldcorp Challenge
• Fight Cancer, Crime, Starvation & pollution
• There is one CCTV camera for every 32 people in the UK
Leveraging BigData technologies
•
BigData technologies can be used to solve
other problems
–
50GB is not BD but reducing foot print is big!
–
Postgre the new DB since 1989
–
CEP (Complex Event Processing) have
replaced rules engines
Big Data Challenge
–
Data and Analytical Complexity
Extracting actual business value Data-driven decision making
BigData challenges
•
Organizations capture far more data than
they ever have
•
Storage is usually not the issue
•
No two organizations are the same, nor are
their data sets
Data Challenges & Complexity
•
Volume
– Collecting
– Storage & backup
– Scanning (querying)
•
Velocity
– Data that has to be processed in real time
– Parallelism
– Correlation (over time and between series)
Data Challenges & Complexity
•
Variety
– Unstructured data along side structured data
– Internal along side external data
– A variety of formats MP4, AVI, Avro etc…
•
Distribution
– Multiple data centers
– Regulation
Data Challenges & Complexity
• Life cycle management – Retention policy
• What is a enough sampling? 2, 3 xmases? • Stale data (in different systems)
• Coherence/Consistency
– Between systems & data sources
• Integration
– Data collection and augmentation
– Multiple systems integration (islands of data)
Data Challenges & Complexity
•
Security
–
All your data is in one location
–
Who is allowed to access which data?
–
Multiple tenancy
•
One version of the truth
–
Where do you live right now?
Data Challenges & Complexity
•
Working with Legacy systems
–
prolong life time
•
Self-service
– Investigative access to ALL users
– Non techies included
– Keeping access permissions in mind
– Ever changing data model
– High touch vs. Low touch data
BigData the BigPicture
–
partial
No SQL Columnar Hadoop In-Memory Appliance OLTP Cloud Visualization Sharding IBM EMC HP Oracle Google Yahoo Amazon SAP Actian Datastax EnterpriseDB Apache Twitter Tableau LinkedIn DatameerBigData echo system
ETL Enrichment Storage Lifecycle Virtualization Deep Analytics BI CodeConsiderations
•
Is data your business?
•
How much data do you have?
•
Where is it now?
•
What is your growth projection?
•
Can you take it outside the organization?
•
Do you need to augment?
Considerations
• For how long do you need the data?
• Can you save partial data/aggregates?
• Can you join/dedup/aggregate historical data?
• Do you need real-time processing?
• Do you have frequent schema changes?
• Do you need to do OLAP & OLTP on the same system?
• Any limitations HW/OS/MEM?
Climbing up the ladder
• Will fast storage be enough?
• Will open source cut it? Or, what part do you use open source for?
• Which route is best? Hadoop, NoSQL, InMemory, Columnar etc…
• What do you use to code?
• Which ETL do you use?
Textbook architecture (lambda)
Introducing the major players
Kafka.apache.org
• Apache Kafka is publish-subscribe messaging rethought as a distributed commit log
• Open sourced and maintained by LinkedIn
• Fast
• Scalable
• Durable
• Distributed by Design
Kafka
–
how it works
•
Publishers send messages to a cluster of brokers
•
The brokers persist the messages to disk
•
Consumers can request a range of messages
– Offset and Length
•
Everything is distributed Pub, Sub & Queues
•
Consumers maintain their own state (No TX)
•
Throughput oriented
Kafka
–
log based queue
•
Messages are persisted to append-only logs
– Sequential writes and reads
•
Topics are distributed queues (partitions)
– Partitions are replicated
– Leader & followers
– Producers load balance
– Instances know each other
Kafka
•
A single broker can handle hundreds of
megabytes of reads and writes per second from
thousands of clients
•
A cluster is a great data transfer backbone
•
Data is retained based on a preset time (2 days)
•
Guaranties order of messages per partition
•
Uses java.nio.FileChannel#transferTo (sendfile
system call)
Storm
• CEP – Complex Event Processor
• Platform for analyzing streams of data as they “occur” • Highly distributed, real-time computation system
• Provides primitives for real time computation • Simplifies working with queues and workers • Fault tolerant and scalable
• Complementary to Hadoop
• Created at Backtype and acquired by Twitter in 2011 • Component of Apache Incubator from 2013
Storm Components
•
Master node - Nimbus daemon
– Distributes code around the cluster, assigns the tasks and monitors failures
•
Worker node - Supervisor daemon
– Listens to the work assigned, runs the worker process
•
Zookeeper
– maintains the coordination service between the supervisor and master
Storm Components
• Stream - An unbounded sequence of tuples
• Spout - A stream source. Reads data from real data sources (logs, API calls, event data..) and generates a stream..
• Bolt - processes input streams (joins, filters,
aggregations...) and produces output streams. Contains data processing, persistence, and messaging alert logic.
Storm Components
•
Spouts and Bolts are packaged into a
Topology
•
Topology is submitted to Storm clusters for
execution and runs indefinitely until it is
manually terminated
Storm use-cases
•
Stream processing of tweets
•
Real-time log processing
•
Sensor data analysis
•
Financial market analysis
•
Natural Language Processing
Hadoop
•
H
igh
A
vailability
D
istributed
O
bject
O
peration
P
latform
•
Distributed infrastructure for parallel processing
•
Initially HDFS & Map/Reduce
•
Ability to grow to thousands of machines
•
Virtualizes the hardware
•
Redundant
•
Simplifies growth & parallel processing
•
Distros: Cloudera, Hortonworks, Apache, IBM
…
HDFS
•
H
adoop
D
istributed
F
ile
S
ystem
•
Distributed file system for redundant storage
•
Uses commodity hardware
•
Supports big files (PBs) + large quantities of files
•
Write-once-ready-many
•
Built for HW failure
HDFS Architecture
HDFS
–
Master
/Slave
•
Data is organized into files and directories
•
Files are divided into uniform sized blocks
•
Master
“
Namenode
”
• Manages the file system namespace in memory
• Maintain file name to list blocks + location mapping
• Manages block allocation/replication
• Checkpoints namespace and journals namespace
changes for reliability
HDFS
–
Master/
Slave
• Slaves “Datanodes” handle block storage
• Blocks are stored using the underlying OS’s files
• Clients access the blocks directly from datanodes
• Periodically sends block reports to Namenode
• Periodically check block integrity
• Blocks are replicated (=3)
• File system keeps checksums of data for corruption detection and recovery
• HDFS exposes block placement so that computation can be migrated to data
Map/Reduce
•
Programming model for distributed
computation jobs at a massive scale
•
A framework to organize and execute such
job
•
The idea is take the logic to the data and
not vice versa
Map/Reduce
•
Map
– Inspect a huge amount of data
– Get something of interest
– Shuffle and sort the interesting data
•
Reduce
– Aggregate the interesting data
– Generate a final report
•
Think of grep | sort | aggregate by customer ID
Hadoop
–
the big picture
Pig
•
Map/Reduce is a lot of work to write
•
Pig is a High-level data flow language
•
Data processing language
•
Compiler to translate Pig Latin to
Map/Reduce
Pig example
Users = load‘users’as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25; Pages = load ‘pages’ as (user, url);
Jnd = joinFltrdby name, Pages by user; Grpd = groupJndbyurl;
Smmd = foreachGrpdgenerate group,COUNT(Jnd) as clicks; Srtd = orderSmmdby clicks desc;
Top5 = limitSrtd 5;
store Top5 into‘top5sites’;
Hive
–
SQL with Hadoop
• Puts a schema/structure to log data stored in HDFS
• Provides an SQL-like query language
• SQL is complied into a chain of M/R jobs
• Usually faster than Java M/R because of the optimizer in the complier
• Has a command shell interface
• Tables can be associated with a serializer /deserializer class - parse data into the table
Hive high level architecture
HBase
•
Open source non-relational distributed
column-oriented database on top of HDFS
– (row key, column family, column, timestamp) -> value
•
Real time read/write access to data in HDFS
•
No: Sql, Join, TX
•
Yes: Scalable, Redundant, Consistent
Cloudera Impala
•
High performance general purpose SQL query
language for Hadoop not based on M/R
•
Written in C++ with dynamic code generation
•
Impalad runs on every node
– Accepts client request, plans & executes the query
•
Statestored used to find the data
– Provides name service & metadata distribution
•
Catalogd metadata updates to all nodes
Cloudera impala performance
•
I/O bound x3-4 then hive
•
Multi M/R phases in Hive x45 faster in
impala
Parquet
•
Columnar storage across the Hadoop platform
– Reduces IO requirement
– Saves space (better compression)
– Different encoding for different types of data
– Additional metadata Page/Column/File statistics
•
Implemented both in Java for M/R & C++(Impala)
– Can be used with Impala, Hive, Pig & M/R
– Impala & Hive can access the same parquet tables
Columnar databases
•
Stores information by columns, rather than rows
•
This is much better for data crunching applications
and DWH than using row storage.
•
The table below will be stored on disk in this order:
– 1, 2, 3 -10:17, 11:11, 13:15 -1:37, 5:13, 3:59 -12345, 23456, 34567
•
Brands
– ParAccel a.k.a Matrix
– Vertica
– RedShift (cloud)
Call ID Time Duration Number
1 10:17 1:37 12345
2 11:11 5:13 23456
Columnar database features
• Homogenous blocks on disk allow for better compression/encoding of data
• Storage reduction by X3 to X12
• Massive Parallel Processing (MPP) on commodity HW enabled by correct distribution keys
• Take the code to the data
• Shared nothing
• Standard SQL language and relational schema
• UDF (User Defined Functions) like stored procedures
Columnar database performance
•
Columnar addressed the disk I/O bottle neck
specifically targeting seek time
•
Designed for BI oriented queries
– Select 7 columns from a table with 28 columns will fetch ¼ of the data as each block has info from only one column
– The data will also be mostly sequential so no seek time
– Compression will reduce amount of data fetched even more
•
Local processing on each node (shared nothing)
Columnar database pitfalls
• Trickle loading is a tricky
• Concurrency might be an issue
• Internal network speed is crucial
• Data redistribution is an issue
• Updates are out of the question
• Joins can be an issue
• However, for that you get X10 to X100 faster queries, less storage, bulk inserts of XTB an hour and excellent
simultaneous load and query it is worth it
In-Memory database
•
A DBMS the only uses main memory (RAM) to
store data
•
In Memory replication is used for High Availability
•
Usually these DBs do an Asyncbackup to an
external file or DB
•
NVDIMM now enables to run at full speed and
maintain data in the event of power failure
In-Memory database
• Designed for extreme OLTP processing
• Some IMDBs (In Memory DB) are MPP
• Close to linear scaling (some claim a=0.85)
• Some use smart mechanisms to only hold HOT data in memory
• Memory is 3,000 times faster then disk
• Cost in constantly dropping while size is increasing
In-Memory examples
•
Hana
– Column-oriented, Relational, MPP RDBMS
– Convergence of OLTP and OLAP Analytics
•
VoltDB
– MPP, Relational, RDBMS, ACID, ANSI SQL, Java UDF
– No write a head, no redo log, no access to disk in TX
– Real time response & async-offload (File,Columnar,Hadoop)
Analysis, visualization and self service
ETL
•
ETL is still mostly a traditional realm
– Goldengate, Informatica & Python are very much in use as an addition/instead of Hadoop
•
More sources and targets as each BigData
technology is targeted at a different requirement
•
Data travels more
– Multiple new BigData technologies side by side with
Traditional BI tools
•
Traditional BI tools still play a big roll
– OBIEE, Cognos, BO, Microstrategy
– Have connectors to BD technologies like Impala & Hive • Most use JDBC or ODBC connectivity
– Enable federated reports from multiple sources
– Still require building a “world”
– Are still relatively cumbersome
– Provide most users with limited access to limited data
Self service BI
•
With more and more data collected & its value
rising, accessibility is becoming a big issue
•
Canned reports simply don
’
t cut it anymore
•
Business users want to
“
add
”
a local excel file
they have to the data and run correlations
•
Exported partial data is less valuable
Self service
•
Data consumers want to run their own analysis,
do their own exploration and use simple tools
•
They are not techies
•
Security and constraints are required for self
service, especially in multi tenants environments
•
The trick in self service is to give access to all the
data required and only that data
•
Ability to easily manipulate data is critical
New visualization tools
•
The
“
new
”
visualization tools are still traditional
– Tableau, Qlickview and others offer a sexier, easier to use interface, intuitive for more users
– Point and click discovery and reporting
– Do not offer a paradigm shift
•
New BI & visualization tools are just emerging
– Datameer, Platfora
New BI tools Datameer as an example
•
Paradigm shifts only happen with new BI tools that
step out of the old realm
•
Changing the way we think about BI
– No ETL – just EL
– No building of “worlds”
– Working directly and exclusively on Hadoop
– Easy integration and import of data
– An “Excel” like interface for analytics
Datameer capabilities
• Easy integration and import of information from files, logs, database and many other sources
• Data manipulation done via excel. SQL or M/R knowledge is not required
• Administrator can control data access at the column level
• “Power Point” like visualization
• A new way of looking at self service BI
BigData in the cloud
•
More and more organization place their data in the
cloud, thus the data in the cloud is exploding
– Pay as you go, for what you use
– Elasticity – Grow when required
• think of changes as duplicate, change, test, switch drop the old
– Decrease administration overhead
– Use state of the art technologies
Google BigQuery
•
Big Data analysis requires expensive hardware
and skilled DB administrators
•
Managing data centers and tuning software is
time consuming and expensive
•
Why not to use web services as analytic tools
instead?
•
BigQuery - a fully managed data analytics service
in the cloud
What is BigQuery?
•
Service for interactive analyzing of big datasets
•
Works in conjunction with Google Storage
•
Uses SQL like query syntax
•
Web-service accessed by a RESTful API
•
99.9% reliable and secure:
• Data replicated across multiple data centers
• Secured through Access Control Lists
•
Scalability to any number of users
BigQuery
–
echo system
•
Part of the Google cloud offering
App engine(app execution)Compute Engine (linuxVMs) Storage
BigQuery
•
Works with most BI & Visualization tools
•
Direct access from Google App engine
•
Securely share & distribute the results
•
Offered with Premium support
BigQuery - Technology
•
Columnar
•
MPP
•
Uses tree structure for distribution
•
Limitations
•
Only one join per query
•
Relatively slow and runs out of resources
(rare)
Amazon RedShift
•
Data warehousing done the AWS way
–
Easy to provision and scale up massively
–
Pay as you go
–
Really fast performance
–
Open and flexible with support for popular tools
–
Petabytescale
Amazon RedShift
•
Dramatically reduce I/O by
– Column storage
– Data compression
– Zone maps
– Direct-attached storage
– Large data block sizes
Amazon RedShift
•
Parallelize and distribute everything
–
Query
–
Load
–
Backup
–
Restore
–
Resize
Amazon RedShift
–
more features
•
Monitor query performance
•
Point & click resize or recreate
•
Built in security
•
Automatic backups
•
Integrates with multiple data sources
•
Use existing analysis tools
DynamoDB
• Fully managed noSQL DB service on AWS
• Table based (each table is independent)
• Data stored in the form of name - value attributes
• Automatic scaling – Provisioned throughput – Storage scaling
– Distributed architecture
• Monitoring tables with CloudWatch
• Integration with Elastic MapReduce – Analyze and store in S3
DynamoDB
●
Schema free
●
Fast to find using primary and range keys.
●Support for complex queries (scan)
●
Consistency by default with high cost to ensure it
●Must use SDK/API to access
●
Complex queries are made using Sequential/Full
table scan (high cost)
DynamoDB Architecture
●
Data spread across hundreds of servers (nodes)
●Multiple versions of data across multiple nodes
●Conflict resolutions during reads (not writes)
●Servers form a cluster in a form of a
“
ring
”
●Client connection through:
– Routing using a load balancer
– Client-library that reflects Dynamo’s partitioning
scheme and can determine the storage host to connect
DynamoDB Limitations
• 64 KB limit on item size (row size)
• 1 MB limit on fetching data
• Pay more if you want strongly consistent data
• Size is multiple of 4KB (provisioning throughput wastage)
• No table joins
• Indexes are created during table creation only
• No triggers
• Limited comparison capability
Amazon Elastic MapReduce (EMR)
• A web service to process vast amounts of data
• “On demand” Hadoop cluster
• Store data on Amazon S3
• Scale the number of virtual servers in your cluster to manage your computation needs
• Start Hadoop cluster to process data
• Turn off when done
• Pay for the hours used
• EMR integrates seamlessly with AWS services
EMR
•
Hadoop clusters running on Amazon EMR use:
– EC2 instances as virtual Linux servers for the master and slave nodes
– Amazon S3 for bulk storage of input and output data
– CloudWatch to monitor cluster performance and raise alarms
•
Move data into and out of DynamoDB using
Amazon EMR and Hive (Used by EMR)
EMR considerations
• EMR reduces Hadoop management complexity/costs
• Compared to a local cluster - Lower performance of M/R jobs
• Reduced data throughput (S3 cannot be compared to a local hdd)
• EMR Hadoop is usually not the latest version
Kinesis
• AWS service for real time processing of streaming data
• Easy administration
• Performs continuous processing on streaming big data
• Scales seamlessly to match the operational needs
• Redshift and DynamoDB integration
• Client libraries allow designing and operating real-time streaming data applications
• Ensures high durability and availability of data by replicating across multiple Availability Zones
• Cost efficient for workloads of any scale
Kinesis architecture - Input
• Kinesis streams are sharded
• Each shard ingests up to 1MB/sec of data and up to 1000 tps
• Data is stored for 24 hours
• Scaling is done by adding or removing shards
• To store data in a stream producers use a PUT call
• PUTs are distributed across the shards by using a Partition Key
Kinesis architecture -Output
• You must design distributed, fault tolerant and scalable application that can keep up with the stream
• Use Kinesis Client Library to:
– Simplify reading from the stream – Automatically start Kinesis workers
– Adjust the number of workers as the number of shards
changes
– Restart workers if they fail and redistribute to use the
new EC2 instances
Kinesis pricing
• Pay as you go. No up-front costs Hourly shard rate - $0.015
Per 1,000,000,000 PUTs - $0.028
● Customers specify throughput requirements in shards ● Each shard delivers 1MB/s on ingest and 2MB/s on
egress
● Inbound data transfer is free
● EC2 charges apply for Kinesis processing apps
Will hardware catch up?
Technology to balance growth
•
A lot of the BigData technology revolved around
tackling disk speed and size
– Avoiding seek time
– Working in memory
– Smart Caching
– Order data homogeneously for better compression
– Duplicate data for bandwidth purposes
– Eventually consistent
The storage world is not idle
•
Storage today is faster with more capacity and
“
smarter
”
•
1 million IOPs is
“
easy
”
– Violin – 70TB of usable space in a 3U SSD box
– XtreamIO – Extreme deduplication in SSD arrays
– Infinidat – 2PB in a 42U rack
– ExaData – Database logic executed at hardware level
Storage revives legacy
• Legacy system longevity can be greatly prolonged by advanced storage
• The easiest “upgrade” path – just copy the data
• Limitations like data size, TX per second and other boundaries can be taken down by storage
• And that is before DNA Storage – 5.5PB on one cubic millimeter
What does the future hold?
Information Week
–
innovators
1.
MongoDB
2.
Amazon (Redshift,
EMR, DynamoDB)
3.
Cloudera(CDH, Impala)
4.
Couchbase
5.
Datameer
6.
Datastax
7.
Hadapt
8.
Hortonworks
9.
Karmasphere
10.
MapR
11.
Neo Technology
12.
Platfora
13.
Splunk
A few facts
• Data will continue to explode
• Value will become harder to harvest
• Storage will get cheaper
• CPUs will be faster
• Cost will be a big player
• The realm remains prone to disruption
Sam Babad @ MidLink