Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Academic year: 2021

(1)

Data-Intensive Programming

Timo Aaltonen

Department of Pervasive Computing

(2)

Data-Intensive Programming

• Lecturer: Timo Aaltonen

– University Lecturer

– [email protected]

• Assistants: Henri Terho and Antti Kallonen

– helping with technicalities

– answering questions

• Course work

• Exam!

(3)

Communication

• Communication between students and staff takes place on Slack

– team collaboration tool

– all students will get an e-mail invitation

– the main channel for asking questions

(4)

Exam

• The popularity of the course surprised us

• For the course work we are on plan C

– grading 150 students with our current plan does not work

– therefore, the course will have an exam

– sorry

(5)

Course Work

• Hands-on course

• Course work

– Groups of three students

– Java and Python (maybe Scala)

– Hadoop, HDFS, MapReduce, Spark, Hue, …

– Traffic data by Finnish Transport Agency

– Report

(6)

Course Work

• Cloudera CDH 5 virtual machine

– Each group should install it on their own laptop

– VirtualBox, for instance

• Data preparation, import, analysis, dissemination

• The tasks will be published during the course

• This week’s task: form a group of three

(7)

Today

• Big data

• Data Science

• Hadoop

• HDFS

(8)

Big Data

• The world is drowning in data

– clickstream data is collected by web servers

– NYSE generates 1 TB of trade data every day

– MTC collects 5000 attributes for each call

– Smart marketers collect purchasing habits

• “More data usually beats better algorithms”

(9)

Three Vs of Big Data

• Volume: amount of data

– Transaction data stored through the years, unstructured data streaming in from social media, increasing amounts of sensor and machine-to-machine data

• Velocity: speed of data in and out

– streaming data from RFID, sensors, …

• Variety: range of data types and sources

– structured, unstructured

(10)

Big Data

• Variability

– Data flows can be highly inconsistent with periodic peaks

• Complexity

– Data comes from multiple sources.

– linking, matching, cleansing and transforming data across systems is a complex task

(11)

Data Science

• Definition: Data science is the activity of extracting insights from messy data

• Facebook analyzes location data

– to identify global migration patterns

– to find out the fan bases of different sports teams

• A retailer might track purchases both online and in-store to target its marketing

(12)

Data Science

(13)

New Challenges

• Compute-intensiveness

– raw computing power

• Challenges of data intensiveness

– amount of data

– complexity of data

– speed at which data is changing

(14)

Data Storage Analysis

• Hard drive from 1990

– capacity 1,370 MB

– transfer speed 4.4 MB/s

• Hard drive from the 2010s

– capacity 1 TB

– transfer speed 100 MB/s
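
The point of these figures is how long it takes to read a whole drive, which is capacity divided by transfer speed. A minimal sketch of that arithmetic, assuming 1 TB = 1,000,000 MB (the function name is illustrative only):

```python
def full_read_seconds(capacity_mb: float, speed_mb_per_s: float) -> float:
    """Time to read an entire drive sequentially."""
    return capacity_mb / speed_mb_per_s


# Figures from the slide; 1 TB is taken as 1,000,000 MB (an assumption).
drive_1990 = full_read_seconds(1_370, 4.4)
drive_2010s = full_read_seconds(1_000_000, 100)

print(f"1990 drive:  {drive_1990 / 60:.1f} minutes")   # 5.2 minutes
print(f"2010s drive: {drive_2010s / 3600:.1f} hours")  # 2.8 hours
```

Capacity has grown far faster than transfer speed, so reading a full drive now takes hours rather than minutes.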

(15)

Scalability

• A scalable system grows without requiring developers to re-architect their algorithms/applications

• Horizontal scaling

– adding more machines

• Vertical scaling

– adding more resources to a single machine

(16)

Parallel Approach

• Reading from multiple disks in parallel

– 100 drives having 1/100 of the data => 1/100 reading time

• Problem: Hardware failures

– replication

• Problem: Most analysis tasks need to be able to combine data in some way

– MapReduce

• Hadoop
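
As a rough sketch of the parallel-read claim, assume a 1 TB dataset (taken as 1,000,000 MB), drives that each deliver 100 MB/s, perfectly even data distribution, and no coordination overhead or failures. The function name is hypothetical:

```python
def parallel_read_seconds(total_mb: float, speed_mb_per_s: float, drives: int) -> float:
    """Ideal read time when the data is split evenly over `drives` disks."""
    per_drive_mb = total_mb / drives
    return per_drive_mb / speed_mb_per_s


single = parallel_read_seconds(1_000_000, 100, drives=1)
parallel = parallel_read_seconds(1_000_000, 100, drives=100)

print(f"1 drive: {single:.0f} s, 100 drives: {parallel:.0f} s")
# 1 drive: 10000 s, 100 drives: 100 s
```

Real clusters fall short of this ideal, which is where replication and MapReduce come in.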

(17)

Hadoop

• Hadoop is a framework of tools

– libraries and methodologies

• Operates on large unstructured datasets

• Open source (Apache License)

• Simple programming model

• Scalable

(18)

Hadoop

• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)

• Core Hadoop has two main systems:

– Hadoop Distributed File System: self-healing high-bandwidth clustered storage

– MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction

(19)

Hadoop

• Administrators

– Installation

– Monitor/Manage Systems

– Tune Systems

• End Users

– Design MapReduce Applications

– Import and export data

– Work with various Hadoop Tools

(20)

Hadoop

• Developed by Doug Cutting and Michael J. Cafarella

• Based on Google MapReduce technology

• Designed to handle large amounts of data and be robust

• Donated to the Apache Foundation in 2006 by Yahoo

(21)

Hadoop Design Principles

• Moving computation is cheaper than moving data

• Hardware will fail, manage it

• Hide execution details from the user

• Use streaming data access

• Use simple file system coherency model

• Hadoop is not a replacement for SQL, and it is not always fast and efficient for quick ad-hoc querying

(22)

Hadoop MapReduce

• Co-locate data with the compute node

– data access is fast since it is local (data locality)

• Network bandwidth is the most precious resource in the data center

– MR implementations explicitly model the network topology

• MR operates at a high level of abstraction

– programmer thinks in terms of functions of key-value pairs
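
To make "functions of key-value pairs" concrete, here is a minimal in-memory word count sketch in plain Python. It is not the Hadoop API: `run_mapreduce`, `map_fn` and `reduce_fn` are hypothetical names, and the grouping step only imitates what the framework's shuffle does between the map and reduce phases.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # imitates the shuffle step
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


# Input records are (line offset, line text) pairs.
lines = [(0, "big data beats better algorithms"), (1, "big data is big")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

The programmer writes only `map_fn` and `reduce_fn`; distribution, grouping and rescheduling of failed tasks are left to the framework.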

(23)

Hadoop MapReduce

• MR is a shared-nothing architecture

– tasks do not depend on each other

– failed tasks can be rescheduled by the system

• Invented by Google

– used for producing search indexes

– applicable to many other problems too

(24)

Hadoop Ecosystem

• Common

– A set of components and interfaces for distributed file systems and general I/O

• Avro

– A serialization system for efficient, cross-language RPC and persistent storage

• MapReduce

– Distributed data processing model and execution environment

(25)

Hadoop Ecosystem

• HDFS

– A distributed filesystem

• Pig, Hive, HBase, ZooKeeper, Sqoop, Oozie

(26)

RDBMS vs HDFS

• Schema-on-Write (RDBMS)

– Schema must be created before any data can be loaded

– An explicit load operation which transforms data to DB internal structure

– New columns must be added explicitly before new data for such columns can be loaded into the DB

• Schema-on-Read (HDFS)

– Data is simply copied to the file store, no transformation is needed

– A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding)

– New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
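
A small sketch of schema-on-read, under loud assumptions: the raw lines, field names and the two "SerDe" functions below are invented for illustration and are not Hive's SerDe interface. The raw data is stored untouched, and the schema is applied only when reading.

```python
# Raw lines as they were copied into the file store (hypothetical traffic data).
raw_lines = [
    "2015-09-01,E75,120,82.5",
    "2015-09-01,E12,95,61.0",
]


def serde_v1(line):
    """First reader: knows only about three columns."""
    fields = line.split(",")
    return {"date": fields[0], "road": fields[1], "vehicles": int(fields[2])}


def serde_v2(line):
    """Updated reader: also parses the fourth column."""
    record = serde_v1(line)
    record["avg_speed"] = float(line.split(",")[3])
    return record


def read(lines, serde):
    """Apply the schema at read time (late binding)."""
    return [serde(line) for line in lines]


print(read(raw_lines, serde_v1)[0])
print(read(raw_lines, serde_v2)[0])
```

The fourth column was in the stored data all along; it appears retroactively once the reader is updated, without reloading anything.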

(27)

Flexibility: Complex Data Processing

1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop).

2. Streaming MapReduce (aka Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce.

3. Crunch: A library for multi-stage MapReduce pipelines in Java (modeled after Google’s FlumeJava)

4. Pig Latin: A high-level language out of Yahoo, suitable for batch data flow workloads.

5. Hive: A SQL interpreter out of Facebook, also includes a metastore mapping files to their schemas and associated SerDes.

6. Oozie: A PDL XML workflow engine that enables creating a workflow of jobs composed of any of the above.

(28)

Hadoop Distributed File System

• Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System)

• Based on Google’s GFS (Google File System)

• HDFS provides redundant storage for massive amounts of data

– using commodity hardware

• Data in HDFS is distributed across all data nodes

– Efficient MapReduce processing

(29)

HDFS Design

• File system on commodity hardware

– Survives even with high failure rates of the components

• Supports lots of large files

– File sizes of hundreds of GB or several TB

• Main design principles

– Write once, read many times

– Streaming reads rather than frequent random access

– High throughput is more important than low latency

(30)

HDFS Architecture

• HDFS operates on top of an existing file system

• Files are stored as blocks (default size 64 MB in Hadoop 1.x, 128 MB since Hadoop 2.x; different from file system blocks)

• File reliability is based on block-based replication

– Each block of a file is typically replicated across several DataNodes (default replication is 3)

• NameNode stores metadata, manages replication and provides access to files

• No data caching (because of large datasets), but direct reading/streaming from DataNode to client
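
A quick sketch of what block size and replication mean for one file, using the slide's 64 MB block size and replication factor 3. The 1,000 MB file and the function names are made up for the example:

```python
import math

BLOCK_SIZE_MB = 64   # block size as given on the slide
REPLICATION = 3      # default replication factor


def block_count(file_size_mb: float) -> int:
    """Number of HDFS blocks needed for a file; the last block may be partial."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


def raw_storage_mb(file_size_mb: float) -> float:
    """Total disk space used across the cluster, counting every replica."""
    return file_size_mb * REPLICATION


file_size_mb = 1_000  # hypothetical file
blocks = block_count(file_size_mb)
print(f"blocks: {blocks}")                                   # blocks: 16
print(f"block replicas: {blocks * REPLICATION}")             # block replicas: 48
print(f"raw storage: {raw_storage_mb(file_size_mb):.0f} MB")  # raw storage: 3000 MB
```

A partially filled last block occupies only as much disk as it holds, which is why raw storage is computed from the file size rather than from the block count.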

(31)

HDFS Architecture

• NameNode stores HDFS metadata

– filenames, locations of blocks, file attributes

– Metadata is kept in RAM for fast lookups

• The number of files in HDFS is limited by the amount of available RAM in the NameNode

– HDFS NameNode federation can help in RAM issues: several NameNodes, each of which manages a portion of the file system namespace

(32)

HDFS Architecture

• DataNode stores file contents as blocks

– Different blocks of the same file are typically stored on different DataNodes

– Same block is typically replicated across several DataNodes for redundancy

– Periodically sends a report of all existing blocks to the NameNode

– DataNodes exchange heartbeats with the NameNode

(33)

HDFS Architecture

• Built-in protection against DataNode failure

• If the NameNode does not receive any heartbeat from a DataNode within a certain time period, the DataNode is assumed to be lost

• In case of a failing DataNode, block replication is actively maintained

– NameNode determines which blocks were on the lost DataNode

– The NameNode finds other copies of these lost blocks and replicates them to other nodes
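
The failure-handling steps above can be sketched as follows. This is a toy model, not NameNode code: the timeout value, node names, block IDs and function names are all hypothetical.

```python
TIMEOUT_S = 30  # hypothetical heartbeat timeout


def lost_nodes(last_heartbeat, now, timeout=TIMEOUT_S):
    """DataNodes whose last heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def under_replicated(block_locations, lost, target=3):
    """Map each affected block to the number of extra replicas it needs."""
    missing = {}
    for block, nodes in block_locations.items():
        live = nodes - lost
        if len(live) < target:
            missing[block] = target - len(live)
    return missing


# Time of the last heartbeat per DataNode, in seconds.
last_heartbeat = {"dn1": 100, "dn2": 98, "dn3": 40, "dn4": 99}
block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn1", "dn2", "dn4"},
}

lost = lost_nodes(last_heartbeat, now=100)
todo = under_replicated(block_locations, lost)
print(lost)  # {'dn3'}
print(todo)  # {'blk_1': 1}
```

Each entry in `todo` would then be copied from a surviving replica to another DataNode.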

(34)

High-Availability (HA) Issues: NameNode Failure

• NameNode failure corresponds to losing all files on a file system

% sudo rm --dont-do-this /

• For recovery, Hadoop provides two options

– Back up the files that make up the persistent state of the file system

– Secondary NameNode

• Some more advanced techniques also exist

(35)

HA Issues: the secondary NameNode

• The secondary NameNode is not a mirror of the NameNode

• It performs memory-intensive administrative functions

– NameNode keeps metadata in memory and writes changes to an edit log

– The secondary NameNode periodically combines the previous namespace image and the edit log into a new namespace image, preventing the log from becoming too large

• Keeps a copy of the merged namespace image, which can be used in the event of NameNode failure

• Recommended to run on a separate machine

• Requires as much RAM as the primary NameNode
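
A toy sketch of the checkpoint idea: replay the edit log onto the previous namespace image to get a new image, after which the log can start empty again. The dictionary layout and operation names are invented for illustration and do not reflect Hadoop's actual image or edit log formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = edit[2]            # edit[2]: number of blocks
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)  # edit[2]: new path


def checkpoint(image, edit_log):
    """Merge the previous image with the edit log; return new image and empty log."""
    merged = dict(image)
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


image = {"/data/a.csv": 3}  # path -> number of blocks (toy metadata)
edit_log = [
    ("create", "/data/b.csv", 5),
    ("rename", "/data/a.csv", "/data/old.csv"),
    ("delete", "/data/b.csv"),
]

new_image, new_log = checkpoint(image, edit_log)
print(new_image)  # {'/data/old.csv': 3}
print(new_log)    # []
```

The merge has to hold the whole namespace in memory, which is why the secondary NameNode needs as much RAM as the primary.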

(36)

Network Topology

• HDFS is aware of how close two nodes are in the network

• Distances, from closest to furthest

0: Processes in the same node

2: Different nodes in the same rack

4: Nodes in different racks in the same data center

6: Nodes in different data centers
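
One way to reproduce these distances is to treat the network as a tree and count the hops from each node up to the closest common ancestor. The sketch below assumes node locations written as `/datacenter/rack/node` paths; the path strings are hypothetical.

```python
def distance(a: str, b: str) -> int:
    """Hops from a and b up to their closest common ancestor, summed."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for part_a, part_b in zip(path_a, path_b):
        if part_a != part_b:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0: same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2: same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4: same data center
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # 6: different data centers
```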

(37)

File Block Placement

• Clients always read from the closest node

• Default placement strategy

– One replica on the same node as the client

– Second replica in a different rack

– Third replica on a different, randomly selected node in the same rack as the second replica

• Additional replicas (beyond three) are placed randomly
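
The default strategy for the first three replicas can be sketched like this. It is a simplification under stated assumptions: the client runs on a DataNode, every remote rack has at least two nodes, and the topology dictionary, node names and function names are hypothetical.

```python
import random


def rack_of(node, topology):
    """Return the rack that contains the given node."""
    for rack, nodes in topology.items():
        if node in nodes:
            return rack
    raise ValueError(f"unknown node: {node}")


def place_replicas(client, topology, rng=random):
    """Pick three nodes following the default placement strategy."""
    first = client                                    # same node as the client
    local_rack = rack_of(client, topology)
    remote_rack = rng.choice([r for r in topology if r != local_rack])
    second = rng.choice(topology[remote_rack])        # a different rack
    third = rng.choice([n for n in topology[remote_rack] if n != second])
    return [first, second, third]                     # third shares rack with second


topology = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4"],
    "rack3": ["n5", "n6"],
}
replicas = place_replicas("n1", topology, random.Random(42))
print(replicas)
```

Two racks hold the three replicas, so a whole-rack failure cannot lose the block, while only one copy has to cross racks during the write.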

(38)

Balancing

• Hadoop works best when blocks are evenly spread out

• Support for DataNodes of different size

– In the optimal case, the disk usage percentage is approximately the same on all DataNodes

• Hadoop provides a balancer daemon

– Re-distributes blocks

– Should be run when new DataNodes are added
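
A sketch of the balancing criterion: compare each DataNode's disk usage percentage with the cluster-wide percentage and flag nodes that deviate by more than a threshold. The 10-point threshold, node names and capacities below are assumptions for illustration, not values from the slides.

```python
def unbalanced(nodes, threshold_pct=10.0):
    """Return nodes whose usage deviates from the cluster average by more than the threshold.

    `nodes` maps a node name to (used, capacity) in the same unit.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    cluster_pct = 100 * total_used / total_capacity
    result = {}
    for name, (used, capacity) in nodes.items():
        deviation = 100 * used / capacity - cluster_pct
        if abs(deviation) > threshold_pct:
            result[name] = deviation
    return result


# (used GB, capacity GB); dn3 is a newly added, almost empty DataNode.
nodes = {"dn1": (600, 1000), "dn2": (1300, 2000), "dn3": (100, 1000)}
print(unbalanced(nodes))  # {'dn2': 15.0, 'dn3': -40.0}
```

Using percentages rather than absolute sizes is what lets DataNodes of different capacity count as balanced.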

(39)

Accessing Data

• Data can be accessed using various methods

– Java API: Demo

– C API

– Command line / POSIX (FUSE mount)

– Command line / HDFS client: Demo

– HTTP

(40)

HDFS URI

• All HDFS (CLI) commands take path URIs as arguments

• URI format

– scheme://authority/path

• The scheme and authority are optional

– If not specified, the defaults set in the configuration file are used
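
To illustrate the `scheme://authority/path` format and the fallback to defaults, here is a sketch using Python's standard `urllib.parse`. The default scheme and authority, the host names and the paths are hypothetical stand-ins for whatever the configuration file specifies.

```python
from urllib.parse import urlsplit

# Hypothetical defaults, standing in for the values set in the configuration file.
DEFAULT_SCHEME = "hdfs"
DEFAULT_AUTHORITY = "localhost:8020"


def resolve(uri: str):
    """Split a path URI into (scheme, authority, path), applying defaults if omitted."""
    parts = urlsplit(uri)
    if not parts.scheme:
        return DEFAULT_SCHEME, DEFAULT_AUTHORITY, parts.path
    return parts.scheme, parts.netloc, parts.path


print(resolve("hdfs://namenode:8020/user/data/a.txt"))
# ('hdfs', 'namenode:8020', '/user/data/a.txt')
print(resolve("/user/data/a.txt"))
# ('hdfs', 'localhost:8020', '/user/data/a.txt')
```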

(41)

Conclusions

• Pros

– Support for very large files

– Designed for streaming data

– Commodity hardware

• Cons

– Not designed for low-latency data access

– Architecture does not support lots of small files

– No support for multiple writers / arbitrary file modifications (writes always at the end of the file)
