Data-Intensive Programming
Timo Aaltonen
Department of Pervasive Computing
Data-Intensive Programming
• Lecturer: Timo Aaltonen
– University Lecturer – [email protected]
• Assistants: Henri Terho and Antti Kallonen
– helping with technicalities – answering questions
• Course work
• Exam!
Communication
• Communication between students and
personnel is carried out with Slack
– team collaboration tool
– all students will get an e-mail invitation – the main channel for asking questions
Exam
• The popularity of the course surprised
• For the course work we are in plan C
– grading 150 students with our current plan does not work
– therefore, the course will have an exam – sorry
Course Work
• Hands on course
• Course work
– Groups of three students
– Java and Python (maybe Scala)
– Hadoop, HDFS, MapReduce, Spark, Hue, … – Traffic data by Finnish Transport Agency – Report
Course Work
• Cloudera CDH 5 -virtual machine
– Each group should install to own laptop – Virtualbox, for instance
• Data preparation, import, analysis,
dissemination
• The tasks will be published during the course
• This week’s task: form a group of three
Today
• Big data
• Data Science
• Hadoop
• HDFS
Big Data
• World is drowning in data
– click stream data is collected by web servers – NYSE generates 1 TB trade data every day – MTC collects 5000 attributes for each call – Smart marketers collect purchasing habits
• “More data usually beats better algorithms”
Three Vs of Big Data
• Volume: amount of data
– Transaction data stored through the years,
unstructured data streaming in from social media, increasing amounts of sensor and machine-to-
machine data
• Velocity: speed of data in and out
– streaming data from RFID, sensors, …
• Variety: range of data types and sources
– structured, unstructured
Big Data
• Variability
– Data flows can be highly inconsistent with periodic peaks
• Complexity
– Data comes from multiple sources.
– linking, matching, cleansing and transforming data across systems is a complex task
Data Science
• Definition: Data science is an activity to
extracts insights from messy data
• Facebook analyzes location data
– to identify global migration patterns
– to find out the fanbases to different sport teams
• A retailer might track purchases both online
and in-store to targeted marketing
Data Science
New Challenges
• Compute-intensiveness
– raw computing power
• Challenges of data intensiveness
– amount of data
– complexity of data
– speed in which data is changing
Data Storage Analysis
• Hard drive from 1990
– store 1,370 MB – speed 4.4 MB/s
• Hard drive 2010s
– store 1 TB
– speed 100 MB/s
Scalability
• Grows without requiring developers to re-
architect their algorithms/application
• Horizontal scaling
• Vertical scaling
Parallel Approach
• Reading from multiple disks in parallel
– 100 drives having 1/100 of the data => 1/100 reading time
• Problem: Hardware failures
– replication
• Problem: Most analysis tasks need to be able
to combine data in some way
– MapReduce
• Hadoop
Hadoop
• Hadoop is a frameworks of tools
– libraries and methodologies
• Operates on large unstructured datasets
• Open source (Apache License)
• Simple programming model
• Scalable
Hadoop
• A scalable fault-tolerant distributed system for
data storage and processing (open source
under the Apache license)
• Core Hadoop has two main systems:
– Hadoop Distributed File System: self-healing high- bandwidth clustered storage
– MapReduce: distributed fault-tolerant resource management and scheduling coupled with a
scalable data programming abstraction
Hadoop
• Administrators
– Installation
– Monitor/Manage Systems – Tune Systems
• End Users
– Design MapReduce Applications – Import Export data
– Work with various Hadoop Tools
Hadoop
• Developed by Doug Cutting and Michael J.
Cafarella
• Based on Google MapReduce technology
• Designed to handle large amounts of data and
be robust
• Donated to Apache Foundation in 2006 by
Yahoo
Hadoop Design Principles
• Moving computation is cheaper than moving data
• Hardware will fail, manage it
• Hide execution details from the user
• Use streaming data access
• Use simple file system coherency model
• Hadoop is not a replacement for SQL, always fast
and efficient quick ad-hoc querying
Hadoop MapReduce
• Collocate data with compute node
– data access is fast since its local (data locality)
• Network bandwidth is the most precious
resource in the data center
– MR implementations explicit model the network topology
• MR operates at a high level of abstraction
– programmer thinks in terms of functions of key and value pairs
Hadoop MapReduce
• MR is a shared-nothing architecture
– tasks do not depend on each other
– failed tasks can be rescheduled by the system
• Invented by Google
– used for producing search indexes
– applicable to many other problems too
Hadoop Ecosystem
• Common
– A set of components and interfaces for distributed file systems and general I/O
• Avro
– A serialization system for efficient, cross-language RPC and persistent storage
• MapReduce
– Distributed data processing model and execution environment
Hadoop Ecosystem
• HDFS
– A Distributed filesystem
• Pig, Hive HBase, ZooKeeper, Sqoop, Oozie
RDBMS vs HDFS
• Schema-on-Write (RDBMS)
– Schema must be created before any data can be loaded
– An explicit load operation which transforms data to DB internal structure
– New columns must be added explicitly before
new data for such columns can be loaded into the DB
• Schema-on-Read (HDFS)
– Data is simply copied to the file store, no
transformation is needed – A SerDe (Serializer
/Deserlizer) is applied
during read time to extract the required columns (late binding)
– New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
Flexibility: Complex Data Processing
1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop).
2. Streaming MapReduce (aka Pipes): Allows you to develop in any programming language of your choice, but slightly lower
performance and less flexibility than native Java MapReduce.
3. Crunch: A library for multi-stage MapReduce pipelines in Java (modeled After Google’s FlumeJava)
4. Pig Latin: A high-level language out of Yahoo, suitable for batch data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a meta- store mapping files to their schemas and associated SerDes.
6. Oozie: A PDL XML workflow engine that enables creating a workflow of jobs composed of any of the above.
Hadoop Distributed File System
• Hadoop comes with distributed file system
called HDFS (Hadoop Distributed File System)
• Based on Google’s GFS (Google File System)
• HDFS provides redundant storage for massive
amounts of data
– using commodity hardware
• Data in HDFS is distributed across all data
nodes
– Efficient MapReduce processing
HDFS Design
• File system on commodity hardware
– Survives even with high failure rates of the components
• Supports lots of large files
– File size hundreds GB or several TB
• Main design principles
– Write once, read many times
– Rather streaming reads, than frequent random access – High throughput is more important than low latency
HDFS Architecture
• HDFS operates on top of existing file system
• Files are stored as blocks (default size 64 MB,
different from file system blocks)
• File reliability is based on block-based replication
– Each block of a file is typically replicated across several DataNodes (default replication is 3)
• NameNode stores metadata, manages replication
and provides access to files
• No data caching (because of large datasets), but
direct reading/streaming from DataNode to client
HDFS Architecture
• NameNode stores HDFS metadata
– filenames, locations of blocks, file attributes – Metadata is kept in RAM for fast lookups
• The number of files in HDFS is limited by the
amount of available RAM in the NameNode
– HDFS NameNode federation can help in RAM issues: several NameNodes, each of which
manages a portion of the file system namespace
HDFS Architecture
• DataNode stores file contents as blocks
– Different blocks of the same file are typically stored on different DataNodes
– Same block is typically replicated across several DataNodes for redundancy
– Periodically sends report of all existing blocks to the NameNode
– DataNodes exchange heartbeats with the NameNode
HDFS Architecture
• Built-in protection against DataNode failure
• If NameNode does not receive any heartbeat
from a DataNode within certain time period,
DataNode is assumed to be lost
• In case of failing DataNode, block replication is
actively maintained
– NameNode determines which blocks were on the lost DataNode
– The NameNode finds other copies of these lost blocks and replicates them to other nodes
High-Availability (HA) Issues:
NameNode Failure
• NameNode failure corresponds to losing all
files on a file system
% sudo rm --dont-do-this /
• For recovery, Hadoop provides two options
– Backup files that make up the persistent state of the file system
– Secondary NameNode
• Also some more advanced techniques exist
HA Issues: the secondary NameNode
• The secondary NameNode is not mirrored NameNode
• Required memory-intensive administrative functions
– NameNode keeps metadata in memory and writes changes to an edit log
– The secondary NameNode periodically combines previous namespace image and the edit log into a new namespace image, preventing the log to become too large
• Keeps a copy of the merged namespace image, which can be used in the event of the NameNode failure
• Recommended to run on a separate machine
• Requires as much RAM as primary NameNode
Network Topology
• HDFS is aware how close two nodes are in the
network
• From closer to further
0: Processes in the same node
2: Different nodes in the same rack
4: Nodes in different racks in the same data center 6: Nodes in different data centers
File Block Placement
• Clients always read from the closest node
• Default placement strategy
– One replica in the same local node as client – Second replica in a different rack
– Third replica in different, randomly selected, node in the same rack as the second replica
• Additional (3+) replicas are random
Balancing
• Hadoop works best when blocks are evenly
spread out
• Support for DataNodes of different size
– In optimal case the disk usage percentage in all DataNodes approximately the same level
• Hadoop provides balancer daemon
– Re-distributes blocks
– Should be run when new DataNodes are added
Accessing Data
• Data can be accessed using various methods
– Java API: Demo – C API
– Command line / POSIX (FUSE mount) – Command line / HDFS client: Demo – HTTP
HDFS URI
• All HDFS (CLI) commands take path URIs as
arguments
• URI format
– scheme://authority/path
• The scheme and authority are optional
– If not specified the default (set in the configuration file) one are used
Conclusions
• Pros
– Support for very large files – Designed for streaming data – Commodity hardware
• Cons
– Not designed for low-latency data access
– Architecture does not support lots of small files – No support for multiple writers / arbitrary file
modifications (Writes always at the end of the file)