Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

(1)

Data-Intensive Programming

Timo Aaltonen

Department of Pervasive Computing

(2)

Data-Intensive Programming

• Lecturer: Timo Aaltonen

– University Lecturer – [email protected]

• Assistants: Henri Terho and Antti Kallonen

– helping with technicalities – answering questions

• Course work

• Exam!

(3)

Communication

• Communication between students and

personnel is carried out with Slack

– team collaboration tool

– all students will get an e-mail invitation – the main channel for asking questions

(4)

Exam

• The popularity of the course surprised

• For the course work we are in plan C

– grading 150 students with our current plan does not work

– therefore, the course will have an exam – sorry

(5)

Course Work

• Hands on course

• Course work

– Groups of three students

– Java and Python (maybe Scala)

– Hadoop, HDFS, MapReduce, Spark, Hue, … – Traffic data by Finnish Transport Agency – Report

(6)

Course Work

• Cloudera CDH 5 -virtual machine

– Each group should install to own laptop – Virtualbox, for instance

• Data preparation, import, analysis,

dissemination

• The tasks will be published during the course

• This week’s task: form a group of three

(7)

Today

• Big data

• Data Science

• Hadoop

• HDFS

(8)

Big Data

• World is drowning in data

– click stream data is collected by web servers – NYSE generates 1 TB trade data every day – MTC collects 5000 attributes for each call – Smart marketers collect purchasing habits

• “More data usually beats better algorithms”

(9)

Three Vs of Big Data

• Volume: amount of data

– Transaction data stored through the years,

unstructured data streaming in from social media, increasing amounts of sensor and machine-to-

machine data

• Velocity: speed of data in and out

– streaming data from RFID, sensors, …

• Variety: range of data types and sources

– structured, unstructured

(10)

Big Data

• Variability

– Data flows can be highly inconsistent with periodic peaks

• Complexity

– Data comes from multiple sources.

– linking, matching, cleansing and transforming data across systems is a complex task

(11)

Data Science

• Definition: Data science is an activity to

extracts insights from messy data

• Facebook analyzes location data

– to identify global migration patterns

– to find out the fanbases to different sport teams

• A retailer might track purchases both online

and in-store to targeted marketing

(12)

Data Science

(13)

New Challenges

• Compute-intensiveness

– raw computing power

• Challenges of data intensiveness

– amount of data

– complexity of data

– speed in which data is changing

(14)

Data Storage Analysis

• Hard drive from 1990

– store 1,370 MB – speed 4.4 MB/s

• Hard drive 2010s

– store 1 TB

– speed 100 MB/s

(15)

Scalability

• Grows without requiring developers to re-

architect their algorithms/application

• Horizontal scaling

• Vertical scaling

(16)

Parallel Approach

• Reading from multiple disks in parallel

– 100 drives having 1/100 of the data => 1/100 reading time

• Problem: Hardware failures

– replication

• Problem: Most analysis tasks need to be able

to combine data in some way

– MapReduce

• Hadoop

(17)

Hadoop

• Hadoop is a frameworks of tools

– libraries and methodologies

• Operates on large unstructured datasets

• Open source (Apache License)

• Simple programming model

• Scalable

(18)

Hadoop

• A scalable fault-tolerant distributed system for

data storage and processing (open source

under the Apache license)

• Core Hadoop has two main systems:

– Hadoop Distributed File System: self-healing high- bandwidth clustered storage

– MapReduce: distributed fault-tolerant resource management and scheduling coupled with a

scalable data programming abstraction

(19)

Hadoop

• Administrators

– Installation

– Monitor/Manage Systems – Tune Systems

• End Users

– Design MapReduce Applications – Import Export data

– Work with various Hadoop Tools

(20)

Hadoop

• Developed by Doug Cutting and Michael J.

Cafarella

• Based on Google MapReduce technology

• Designed to handle large amounts of data and

be robust

• Donated to Apache Foundation in 2006 by

Yahoo

(21)

Hadoop Design Principles

• Moving computation is cheaper than moving data

• Hardware will fail, manage it

• Hide execution details from the user

• Use streaming data access

• Use simple file system coherency model

• Hadoop is not a replacement for SQL, always fast

and efficient quick ad-hoc querying

(22)

Hadoop MapReduce

• Collocate data with compute node

– data access is fast since its local (data locality)

• Network bandwidth is the most precious

resource in the data center

– MR implementations explicit model the network topology

• MR operates at a high level of abstraction

– programmer thinks in terms of functions of key and value pairs

(23)

Hadoop MapReduce

• MR is a shared-nothing architecture

– tasks do not depend on each other

– failed tasks can be rescheduled by the system

• Invented by Google

– used for producing search indexes

– applicable to many other problems too

(24)

Hadoop Ecosystem

• Common

– A set of components and interfaces for distributed file systems and general I/O

• Avro

– A serialization system for efficient, cross-language RPC and persistent storage

• MapReduce

– Distributed data processing model and execution environment

(25)

Hadoop Ecosystem

• HDFS

– A Distributed filesystem

• Pig, Hive HBase, ZooKeeper, Sqoop, Oozie

(26)

RDBMS vs HDFS

• Schema-on-Write (RDBMS)

– Schema must be created before any data can be loaded

– An explicit load operation which transforms data to DB internal structure

– New columns must be added explicitly before

new data for such columns can be loaded into the DB

• Schema-on-Read (HDFS)

– Data is simply copied to the file store, no

transformation is needed – A SerDe (Serializer

/Deserlizer) is applied

during read time to extract the required columns (late binding)

– New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it

(27)

Flexibility: Complex Data Processing

1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop).

2. Streaming MapReduce (aka Pipes): Allows you to develop in any programming language of your choice, but slightly lower

performance and less flexibility than native Java MapReduce.

3. Crunch: A library for multi-stage MapReduce pipelines in Java (modeled After Google’s FlumeJava)

4. Pig Latin: A high-level language out of Yahoo, suitable for batch data flow workloads.

5. Hive: A SQL interpreter out of Facebook, also includes a meta- store mapping files to their schemas and associated SerDes.

6. Oozie: A PDL XML workflow engine that enables creating a workflow of jobs composed of any of the above.

(28)

Hadoop Distributed File System

• Hadoop comes with distributed file system

called HDFS (Hadoop Distributed File System)

• Based on Google’s GFS (Google File System)

• HDFS provides redundant storage for massive

amounts of data

– using commodity hardware

• Data in HDFS is distributed across all data

nodes

– Efficient MapReduce processing

(29)

HDFS Design

• File system on commodity hardware

– Survives even with high failure rates of the components

• Supports lots of large files

– File size hundreds GB or several TB

• Main design principles

– Write once, read many times

– Rather streaming reads, than frequent random access – High throughput is more important than low latency

(30)

HDFS Architecture

• HDFS operates on top of existing file system

• Files are stored as blocks (default size 64 MB,

different from file system blocks)

• File reliability is based on block-based replication

– Each block of a file is typically replicated across several DataNodes (default replication is 3)

• NameNode stores metadata, manages replication

and provides access to files

• No data caching (because of large datasets), but

direct reading/streaming from DataNode to client

(31)

HDFS Architecture

• NameNode stores HDFS metadata

– filenames, locations of blocks, file attributes – Metadata is kept in RAM for fast lookups

• The number of files in HDFS is limited by the

amount of available RAM in the NameNode

– HDFS NameNode federation can help in RAM issues: several NameNodes, each of which

manages a portion of the file system namespace

(32)

HDFS Architecture

• DataNode stores file contents as blocks

– Different blocks of the same file are typically stored on different DataNodes

– Same block is typically replicated across several DataNodes for redundancy

– Periodically sends report of all existing blocks to the NameNode

– DataNodes exchange heartbeats with the NameNode

(33)

HDFS Architecture

• Built-in protection against DataNode failure

• If NameNode does not receive any heartbeat

from a DataNode within certain time period,

DataNode is assumed to be lost

• In case of failing DataNode, block replication is

actively maintained

– NameNode determines which blocks were on the lost DataNode

– The NameNode finds other copies of these lost blocks and replicates them to other nodes

(34)

High-Availability (HA) Issues:

NameNode Failure

• NameNode failure corresponds to losing all

files on a file system

% sudo rm --dont-do-this /

• For recovery, Hadoop provides two options

– Backup files that make up the persistent state of the file system

– Secondary NameNode

• Also some more advanced techniques exist

(35)

HA Issues: the secondary NameNode

• The secondary NameNode is not mirrored NameNode

• Required memory-intensive administrative functions

– NameNode keeps metadata in memory and writes changes to an edit log

– The secondary NameNode periodically combines previous namespace image and the edit log into a new namespace image, preventing the log to become too large

• Keeps a copy of the merged namespace image, which can be used in the event of the NameNode failure

• Recommended to run on a separate machine

• Requires as much RAM as primary NameNode

(36)

Network Topology

• HDFS is aware how close two nodes are in the

network

• From closer to further

0: Processes in the same node

2: Different nodes in the same rack

4: Nodes in different racks in the same data center 6: Nodes in different data centers

(37)

File Block Placement

• Clients always read from the closest node

• Default placement strategy

– One replica in the same local node as client – Second replica in a different rack

– Third replica in different, randomly selected, node in the same rack as the second replica

• Additional (3+) replicas are random

(38)

Balancing

• Hadoop works best when blocks are evenly

spread out

• Support for DataNodes of different size

– In optimal case the disk usage percentage in all DataNodes approximately the same level

• Hadoop provides balancer daemon

– Re-distributes blocks

– Should be run when new DataNodes are added

(39)

Accessing Data

• Data can be accessed using various methods

– Java API: Demo – C API

– Command line / POSIX (FUSE mount) – Command line / HDFS client: Demo – HTTP

(40)

HDFS URI

• All HDFS (CLI) commands take path URIs as

arguments

• URI format

– scheme://authority/path

• The scheme and authority are optional

– If not specified the default (set in the configuration file) one are used

(41)

Conclusions

• Pros

– Support for very large files – Designed for streaming data – Commodity hardware

• Cons

– Not designed for low-latency data access

– Architecture does not support lots of small files – No support for multiple writers / arbitrary file

modifications (Writes always at the end of the file)