Cloud Computing, Big Data, & CDN Emerging Technologies

(1)

Cloud Introduction

Cloud Computing

(2)

Cloud Computing

What does Cloud Computing do?

• Provides online data storage

• Enables configuration and accessing of online applications

• Provides a variety of software usage

• Provides computing platform and computing infrastructure

(3)

Cloud Computing

Application Example

• Using Gmail on my smartphone to check e-mails

• Receive an e-mail with an MS PowerPoint attachment file

• However, MS PowerPoint and Windows OS are not installed on my smartphone!

• Google Drive service’s Google Docs, Sheets, and Slides can be used to open the file

(4)

Cloud Computing

What is a Cloud?

• Cloud can provide services through a public or private network or the Internet, where the service hosting system is at a remote location

• Cloud can support various applications

• E-mail, Web Conferencing, Games, Database Management, CRM (Customer Relationship Management), etc.

(5)

Cloud Computing

Cloud Models

(6)

Cloud Models

• Public Cloud

˗ Enables public systems and service access

˗ Open architecture (e.g., e-mail)

˗ Could be less secure due to openness

• Private Cloud

˗ Enables service access within an organization

˗ Due to its private nature, it is more secure

Cloud Computing

(7)

• Community Cloud

˗ Cloud accessible by a group of organizations

• Hybrid Cloud

˗ Hybrid Cloud = Public Cloud + Private Cloud

˗ Private cloud supports critical activities

˗ Public cloud supports non-critical activities

Cloud Computing

Cloud Models

(8)

Cloud Computing

Cloud Service Models

• SaaS: Software as a Service

• PaaS: Platform as a Service

• IaaS: Infrastructure as a Service

The lower service model supports the management, computing power, and security of its upper service model

(9)

Cloud Computing

Software as a Service (SaaS)

• Provides a variety of software applications as a service to end users

Platform as a Service (PaaS)

• Provides a program executable platform for applications, development tools, etc.

Infrastructure as a Service (IaaS)

• Provides the fundamental computing and security resources for the entire cloud

• Backup storage, computing power, VM (Virtual Machines), etc.

(10)

Cloud Computing

Cloud Service Models

• There are many other service models

• XaaS = Anything as a Service

• NaaS → N for Network as a Service

• DaaS → D for Database as a Service

• BaaS → B for Business as a Service

• etc.

(11)

Cloud Computing

Cloud Benefits

(12)

Cloud Computing

Characteristics


(16)

Cloud Service Models

Cloud Computing

(17)

Cloud Computing

Cloud Service Models

• SaaS: Software as a Service

• PaaS: Platform as a Service

• IaaS: Infrastructure as a Service

The lower service model supports the management, computing power, and security of its upper service model

(18)

IaaS

IaaS (Infrastructure as a Service)

• Infrastructure support over the Internet

• Cloud’s Computing & Storage Resources

• Computing Power

• Storage Services

• Software Packages & Bundles

• VLAN (Virtual Local Area Network)

• VM (Virtual Machine) Features

(19)

IaaS

VM (Virtual Machine) Administration

• IaaS enables control of computing resources through Administrative Access to VMs

→ Server Virtualization features

• Access to computing resources is enabled by Administrative Access to VMs

• VM Administrative Command examples

• Save data on cloud server

• Start web server

• Install new application

(20)

IaaS

IaaS Procedures

(21)

IaaS

IaaS Benefits

• Flexible and Efficient Renting of Computer & Server Hardware

• Rentable Resources

• VM, Storage, Bandwidth, IP Addresses, Monitoring Services, Firewalls, etc.

• Rent Payment Basis

• Resource type

• Usage time

• Service packages

(22)

IaaS

IaaS Benefits

• Portability & Interoperability with Legacy Applications

• Enables portability based on infrastructure resources that are used through Internet connections

• Enables a method to maintain interoperability with legacy applications and workloads between IaaS clouds

(23)

PaaS

PaaS (Platform as a Service)

• Provides development & deployment tools for application development

• Provides runtime environment for apps.

(24)

PaaS

PaaS Types

[Figure: the four PaaS types: Stand-Alone Development Environment, Application Delivery-Only Environment, Open Platform as a Service, Add-On Development Facilities]

(25)

PaaS

PaaS Types

• Application Delivery-Only Environment

• Provides on-demand scaling & application security

• Stand-Alone Development Environment

• Provides an independent platform for a specific function

• Open Platform as a Service

• Provides open source software that helps PaaS providers run applications

• Add-On Development Facilities

• Enables customization to the existing SaaS platforms

(26)

PaaS

PaaS Benefits

(27)

PaaS

Benefits

• Lower Administrative Overhead

• User does not need to be involved in any administration of the platform

• Lower Total Cost of Ownership

• User does not need to purchase any hardware, memory, or server

(28)

PaaS

Benefits

• Scalable Solutions

• Resources are scaled up or down automatically based on the application's resource demand

• More Current System Software

• Cloud provider is responsible for maintaining software upgrades & patch installations
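The automatic scale control described above can be sketched as a simple rule. This is a minimal illustration only: the function name, the per-instance capacity, and the minimum and maximum limits are assumptions made for this example, not the behavior of any particular PaaS provider.

```python
import math

def desired_instances(current_load, capacity_per_instance,
                      min_instances=1, max_instances=10):
    """Return how many instances are needed to serve current_load.

    All parameters are hypothetical: load and capacity are in the same
    unit (e.g., requests per second).
    """
    needed = math.ceil(current_load / capacity_per_instance)
    # Clamp to the allowed range so the platform never scales to zero
    # or beyond the configured maximum.
    return max(min_instances, min(max_instances, needed))

# Example: 250 requests/s with 100 requests/s per instance.
instances = desired_instances(250, 100)
```

The clamp is the design choice worth noting: scaling down stops at a minimum so the application stays reachable, and scaling up stops at a maximum so cost stays bounded.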

(29)

SaaS

SaaS (Software as a Service)

• Provides software applications as a service to the user

• Software that is deployed on a cloud server which is accessible through the Internet

(30)

SaaS

Characteristics

• On Demand Availability

• Cloud software is available anywhere that the cloud is reachable via Internet

• Easy Maintenance

• No user software upgrade or maintenance needed

→ All supported by the cloud

• Flexible Scale Up or Scale Down

• Centralized Management & Data

(31)

SaaS

Characteristics

• Enables a Shared Data Model

• Multiple users can share a single data model and database

• Cost Effectiveness

• Pay based on usage

• No risk in buying the wrong software

• Multitenant Programming Solutions

• Multiple programmers are guaranteed to use the same software version

→ No version mismatch problems

(32)

Software-as-a-service

Open SaaS

Applications


(36)

Cloud Services

Cloud Computing

(37)

Cloud Services

Google Cloud

• Google App Engine

˗ Released as a preview in April 2008

˗ PaaS (Platform as a Service) for web applications

˗ Provides automatic scaling based on resource demands and server load

• Google Cloud Storage

˗ Launched in May 2010

˗ Online file storage service

(38)

Cloud Services

Google Cloud

• Google BigQuery

˗ Released in April 2012

˗ Data analysis tool that uses SQL-like queries to process big datasets in seconds

• Google Compute Engine

˗ Released in June 2012

˗ IaaS (Infrastructure as a Service) support

to enable on demand launching of VMs (Virtual Machines)

(39)

Cloud Services

Google Cloud

• Google Cloud Endpoints

˗ Released in November 2013

˗ Tool to create services inside App Engine

˗ Easily connects from Android, iOS, and JavaScript clients

• Google Cloud DNS (Domain Name System)

˗ DNS service supported by the Google Cloud

(40)

Cloud Services

• Google Cloud Datastore

˗ NoSQL (non-relational) data storage

• Google Cloud SQL (Structured Query Language)

˗ Released in February 2014 as GA (General Availability)

˗ Fully managed MySQL database

Google Cloud

(41)

Cloud Services

Amazon S3 (Simple Storage Service)

• Online file storage web service offered by Amazon Web Services

• Public web service released in the United States in March 2006 and in Europe in November 2007

• Provides storage through web services interfaces

(REST, SOAP, and BitTorrent)

(42)

Cloud Services

Amazon Cloud Drive

• Amazon Cloud Drive was released in

March 2011

• Web storage application from Amazon

• Storage Space Characteristics

˗ Can be accessed from up to eight specific devices (e.g., mobile devices & different computers) and by using

different browsers on the same computer

(43)

Cloud Services

Amazon Cloud Drive

• Cloud Player (Originally bundled)

˗ Users can play music in their Cloud Drive from any computer or Android device

˗ Music browsing based on song titles, albums, artists, genres (website only), and playlists

(44)

Cloud Services

Amazon Cloud Drive

Options

• Unlimited

Photos

˗ Unlimited storage for photos & raw data files

˗ 5 gigabytes of video storage

• Unlimited

Everything

˗ Unlimited storage for photos, videos, documents, and various file types

(45)

Cloud Services

iCloud

• Developed by Apple, Inc.

• Public release in October 2011

• Cloud Storage & Cloud Computing

• Operating system

˗ OS X (10.7 Lion or later)

˗ Microsoft Windows 7 or later

˗ iOS 5 or later

(46)

Cloud Services

iCloud replaces MobileMe

• Subscription-based collection of Apple’s online services and software

• MobileMe was replaced by iCloud

• MobileMe ceased services in June 2012

• MobileMe users were allowed transfers to iCloud until

July 2012

(47)

Cloud Services

iCloud Features

• Email, Contacts, and Calendars

• Find My Friends

• Backup & Restore

˗ Backup feature for device settings & data

˗ iOS 5 or later required

• Find My iPhone

˗ Enables a user to track the location of an iOS device or Mac

˗ Formerly a feature of MobileMe

(48)

Cloud Services

iCloud Features

• Can manage lost or stolen Apple devices

• Back to My Mac

˗ Enables remote login to other computers that have Back to My Mac installed (using the same Apple ID)

• iWork for iCloud

˗ Apple's iWork suite (Pages, Numbers, and Keynote) made available on a web interface

(49)

Cloud Services

iCloud Features

• Photo Stream

˗ Can store most recent 1,000 photos

˗ Free storage for up to 30 days

• iCloud Photo Library

˗ Stores all photos at original resolution

˗ Stores photo metadata

• Storage (Introduced in 2011)

˗ 5 GB of free storage per account

(50)

Cloud Services

iCloud Features

• iCloud Drive

˗ Can save photos, videos, documents, and apps

• iCloud Keychain

˗ Secure database for website and Wi-Fi passwords

˗ Secure credit card & debit card management for quick access and auto-fill

(51)

Cloud Services

iCloud Features

• iTunes Match

˗ Scans the iTunes music library and matches its tracks

˗ Serves tracks copied from CDs or other sources


(55)

Big Data Examples

Big Data

(56)

Big Data

New FLU Virus Starts in the U.S.!

• H1N1 flu virus (which has combined virus elements of the bird and swine (pig) flu) started to spread in the U.S. in 2009

• U.S. CDC (Centers for Disease Control and Prevention) was only collecting diagnostic data of Medical Doctors once a week

• Using the CDC information to find how the flu was spreading would have an approximately 2-week lag, which is far too slow compared to the speed of the virus spreading

(57)

Big Data

New FLU Virus Starts in the U.S.!

• What vaccine was needed?

• How much vaccine was needed?

• Where was the vaccine needed?

• Vaccine preparation and delivery plans could not be set up fast enough to safely prevent the virus from spreading out of control

(58)

Big Data

New FLU Virus Starts in the U.S.!

• Fortunately, Google published a paper about how they could predict the spread of the winter flu in the U.S. accurately down to specific regions and states

• This paper was published in the journal Nature a few weeks before the H1N1 virus made the headline news

(59)

Big Data

New FLU Virus Starts in the U.S.!

• Millions of the most common search terms and millions of different mathematical models were tested on Google’s database

• Google receives more than 3 billion search queries a day

• Analysis system was set to look for correlation between the frequency of certain search queries and the spread of the flu over time and space
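The idea of looking for correlation between search query frequency and flu spread can be sketched with a Pearson correlation coefficient. This is only an illustration of the concept: the weekly numbers below are made up, and the slides do not state which statistical models Google actually used.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Made-up weekly data for one region (illustrative values only).
query_counts = [120, 150, 310, 480, 400]   # searches for a flu-related term
flu_cases = [10, 14, 30, 45, 41]           # reported flu cases

r = pearson(query_counts, flu_cases)  # close to 1 means strongly correlated
```

A search term whose frequency tracks reported cases closely (r near 1) is a candidate predictor; terms with r near 0 would be discarded.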

(60)

Big Data

New FLU Virus Starts in the U.S.!

• Google’s method of analysis did not use data provided from hospitals or Medical Doctors

• Google used Big Data analysis on the most common search terms people use

• Google’s system proved to be more accurate and faster than analyzing government statistics

(61)

Big Data

Wal-Mart

• Wal-Mart’s Data Warehouse

• Stores 4 petabytes ($4 \times 10^{15}$ bytes) of data

• Records every single purchase

• Approximately 267 million transactions a day from 6,000 stores worldwide are recorded

(62)

Big Data

Wal-Mart

• Wal-Mart’s Data Analysis

• Focused on evaluating the effectiveness of pricing strategies and advertising campaigns

• Seeks improvements in inventory management and supply chains

(63)

Big Data

Recommendation System using Big Data

• Based on data analysis of simple elements

• What users purchased in the past

• Which items they have in their virtual shopping cart

• Which items customers rated and liked

• What influence the ratings had on other customers' purchases

(64)

Big Data

Amazon.com

• Amazon.com’s Recommendation System

• Item-to-Item Collaborative Filtering Algorithm

• Personalization of the Online Store

→ Customized to each customer

• Each customer’s store is based on the customer’s personal interest

• Example: For a new mother, the store will display baby supplies and toys
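The item-to-item idea can be sketched as follows: two items are similar when many of the same customers bought both. This is a simplified illustration under stated assumptions, not Amazon's production algorithm; the purchase data, item names, and function names are hypothetical.

```python
from math import sqrt

# Hypothetical purchase history: item -> set of customers who bought it.
purchases = {
    "diapers": {"ann", "bob", "cara"},
    "baby_wipes": {"ann", "bob"},
    "phone_case": {"dave"},
}

def similarity(item_a, item_b):
    """Cosine similarity between two items' binary customer vectors."""
    a, b = purchases[item_a], purchases[item_b]
    return len(a & b) / sqrt(len(a) * len(b))

def recommend(item, top_n=1):
    """Rank all other items by their similarity to `item`."""
    others = [other for other in purchases if other != item]
    ranked = sorted(others, key=lambda o: similarity(item, o), reverse=True)
    return ranked[:top_n]

suggestions = recommend("diapers")
```

Because similarity is computed between items rather than between customers, the item-to-item table can be built offline and looked up quickly when a customer views or buys an item.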

(65)

Big Data

Citibank

• Bank operations in 100 countries

• Big Data analysis on the database of basic financial transactions can enable Global insight on

investments, market changes, trade patterns, and economic conditions

• Many companies (e.g., Zara, H&M, etc.) work with Citibank to locate new stores and factories

(66)

Big Data

Product Development & Sales

• For example, a Smartphone takes significant time and money to manufacture

• In addition, the duration of popularity for a new Smartphone is limited

• To maximize sales, a company needs to manufacture just the right amount of products and sell them in the right locations

(67)

Big Data

Product Development & Sales

• Too many will result in leftovers and a big waste for the company!

• Too few will result in a lost opportunity for company profit and growth!

• Big Data analysis can help find how many smartphones could be sold and where the products could be popular, based on common search terms that people use

→ Use this to also estimate how many products could be sold in a certain location

→ But why is this difficult?


(72)

Big Data's 4 Vs

Big Data

(73)

Big Data

Big Data’s 4 V Big Challenges

• Volume – Data Size

• Variety – Data Formats

• Velocity – Data Streaming Speeds

• Veracity – Data Trustworthiness

(74)

Big Data

Volume – Data Size

• 40 Zettabytes ($10^{21}$) of data is predicted to be created by 2020

• 2.5 Quintillion ($10^{18}$) bytes of data are created every day

• 6 Billion ($10^{9}$) people have mobile phones

• 100 Terabytes ($10^{12}$) of data (at least) is stored by most U.S. companies

• 966 Petabytes ($10^{15}$) was the approximate storage size of the American manufacturing industry in 2009

(75)

Big Data

Variety – Data Formats

• 150 Exabytes ($10^{18}$) was the estimated size of data for health care throughout the world in 2011

• More than 4 Billion ($10^{9}$) hours each month are used in watching YouTube

• 30 Billion pieces of content are exchanged every month on Facebook

• 200 Million monthly active users exchange 400 Million tweets every day

(76)

Big Data

Velocity – Data Streaming Speeds

• 1 Terabyte ($10^{12}$) of trade information is exchanged during every trading session at the New York Stock Exchange

• 100 sensors (approximately) are installed in modern cars to monitor fuel level, tire pressure, etc.

• 18.9 Billion network connections are predicted to exist by 2016

(77)

Big Data

Veracity – Data Trustworthiness

• 1 out of 3 business leaders has experienced trust issues with their data when trying to make a business decision

• 3.1 Trillion ($10^{12}$) U.S. dollars a year is estimated to be wasted in the U.S. economy due to poor data quality

(78)

Big Data

New technology is needed to overcome these 4 V Big Data Challenges

• Volume – Data Size

• Variety – Data Formats

• Velocity – Data Streaming Speeds

• Veracity – Data Trustworthiness


(83)

HADOOP

Big Data

(84)

Hadoop

Data Storage, Access, and Analysis

• Hard drive storage capacity has tremendously increased

• But the data read and write speeds to and from the hard drives have not significantly improved yet

• Simultaneous parallel read and write of data with multiple hard disks requires advanced technology

(85)

Hadoop

Data Storage, Access, and Analysis

• Challenge 1: Hardware Failure

˗ When using many computers for data storage and analysis, the probability that at least one computer will fail is very high

• Challenge 2: Cost

˗ To avoid loss of data or of computed analysis results, backup computers and memory are needed, which improves reliability but is very expensive
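The hardware failure challenge can be made concrete with a short calculation. Assuming each machine fails independently with the same probability p during some period (a simplifying assumption for this sketch, and the value of p below is hypothetical), the probability that at least one of n machines fails is 1 - (1 - p)^n.

```python
def prob_any_failure(n_machines, p_fail):
    """Probability that at least one of n independent machines fails."""
    return 1 - (1 - p_fail) ** n_machines

# Hypothetical per-machine failure probability of 1% for the period.
single = prob_any_failure(1, 0.01)
cluster = prob_any_failure(1000, 0.01)
```

With one machine the risk equals p, but with a thousand machines a failure somewhere in the cluster is almost certain, which is why Hadoop treats failure as a normal event rather than an exception.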

(86)

Hadoop

Data Storage, Access, and Analysis

• Challenge 3: Combining Analyzed Data

˗ Combining the analyzed data is very difficult

˗ If one part of the analyzed data is not ready, then the overall combining process has to be delayed

˗ If one part has errors in its analysis, then the overall combined result may be unreliable and useless

(87)

Hadoop

• Hadoop is a Reliable Shared Storage and Analysis System

• Hadoop = HDFS + MapReduce + α

˗ HDFS provides Data Storage

˗ HDFS: Hadoop Distributed FileSystem

˗ MapReduce provides Data Analysis

˗ MapReduce = Map Function + Reduce Function

(88)

Hadoop

HDFS: Hadoop Distributed FileSystem

• DFS (Distributed FileSystem) is designed for storage management of a network of computers

• HDFS is optimized to store huge files with streaming data access patterns

• HDFS is designed to run on clusters of general computers

(89)

Hadoop

HDFS: Hadoop Distributed FileSystem

• HDFS was designed to be optimal in performance for a WORM (Write Once, Read Many times) pattern, which is a very efficient data processing pattern

• HDFS was designed considering the time to read the whole dataset to be more important than the time

required to read the first record

(90)

Hadoop

HDFS

• HDFS clusters use 2 types of nodes

• Namenode (master node)

• Datanode (worker node)

(91)

Hadoop

HDFS: Namenode

• Manages the filesystem namespace

• Maintains the filesystem tree and the metadata for all the files and directories in the tree

• Stores on the local disk using 2 file forms

• Namespace Image

• Edit Log

(92)

Hadoop

HDFS: Datanodes

• Workhorse of the filesystem

• Store and retrieve blocks when requested by the client or the namenode

• Report back to the namenode periodically with lists of blocks that were stored

(93)

Hadoop

MapReduce

• MapReduce is a programming model that abstracts the analysis problem from stored data

• MapReduce transforms the analysis problem into a computation process that uses a set of keys and values

(94)

Hadoop

MapReduce System Architecture

• MapReduce was designed for tasks that consume

several minutes or hours on a set of dedicated trusted computers connected with a broadband high-speed network managed by a single master data center

(95)

Hadoop

MapReduce Characteristics

• MapReduce uses a somewhat brute-force data analysis approach

• The entire dataset (or a big part of the dataset) is processed for every query

→ Batch Query Processor model

(96)

Hadoop

MapReduce Characteristics

• MapReduce enables the ability to run an ad hoc query against the whole dataset within a reasonable time

• Many distributed systems combine data from multiple sources (which is very difficult), but MapReduce does this in a very effective and efficient way

(97)

Hadoop

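Technical Terms used in MapReduce

• Seek Time is the delay to move the disk head to the location on the disk where the data is read or written

• Transfer Rate is the speed at which data streams to or from the disk (the disk's bandwidth)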

• Transfer Rate has improved significantly more (i.e., now has much faster transfer speeds) compared to improvements in Seek Time (i.e., still relatively slow)
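A rough calculation shows why the gap between seek time and transfer rate matters. The numbers below (10 ms per seek, 100 MB/s transfer rate, 1 GB of data) are assumptions chosen for illustration, not measurements.

```python
SEEK_TIME_S = 0.010         # assumed 10 ms per seek
TRANSFER_RATE_BPS = 100e6   # assumed 100 MB/s

def read_time(total_bytes, n_seeks):
    """Seconds to read total_bytes using n_seeks separate seeks."""
    return n_seeks * SEEK_TIME_S + total_bytes / TRANSFER_RATE_BPS

# Reading 1 GB as one sequential stream (a single seek) ...
streaming = read_time(1e9, n_seeks=1)
# ... versus reading it as one million 1 KB records with a seek each.
random_access = read_time(1e9, n_seeks=1_000_000)
```

Under these assumptions the streaming read is dominated by transfer time, while the record-by-record read is dominated by seek time and is several orders of magnitude slower.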

(98)

Hadoop

MapReduce

• MapReduce gains performance enhancement through optimal balancing of Seeking and Transfer operations

• Reduce Seek operations

• Effectively use Transfer operations

• In the next lecture, we will compare MapReduce with a traditional RDBMS (Relational Database Management System)


(103)

MapReduce vs.

RDBMS

Big Data

(104)

Hadoop

MapReduce vs. RDBMS

• RDBMS (Relational Database Management System) Characteristics

• RDBMS is good for updating a small proportion of a big database

• RDBMS uses a traditional B-Tree, which is highly dependent on the time required to perform seek operations

(105)

Hadoop

MapReduce vs. RDBMS

• MapReduce Characteristics

• MapReduce is good for updating all (or a majority) of a big database

• MapReduce uses Sort and Merge to rebuild the database, which depends more on transfer

operations

(106)

Hadoop

MapReduce vs. RDBMS

• RDBMS is good for applications that require the datasets of the database to be very frequently updated (e.g., point queries or small dataset updates)

• MapReduce is better for WORM (Write Once and Read Many times) based data applications

• MapReduce is a complementary system to RDBMS

(107)

Hadoop

MapReduce vs. RDBMS

                 RDBMS                      MapReduce
Data Size        Gigabytes ($10^{9}$)       Petabytes ($10^{15}$)
Access           Interactive & Batch        Batch
Updates          Read & Write Many Times    WORM (Write Once, Read Many Times)
Data Structure   Static Schema              Dynamic Schema
Integrity        High                       Low
Scalability      Nonlinear                  Linear

(108)

Hadoop

MapReduce vs. RDBMS:

Data Types

• Structured Data: Data that has a formal defined structure (e.g., XML documents or database tables)

• Semi-Structured Data: Data that has a looser format where the data structure is used as a guide and may be ignored

• Unstructured Data: Data that does not have any formal structure (e.g., plain text or image data)

(109)

Hadoop

MapReduce vs. RDBMS:

Data Types

• MapReduce is very effective on unstructured and semi-structured data

• Why?

• MapReduce interprets data during the data processing sessions

• MapReduce does not use intrinsic properties of the data as input keys or input values. The parameters used

are selected by the person analyzing the data

(110)

Hadoop

MapReduce vs. RDBMS:

Scalability

• MapReduce has a programming model that is linearly scalable

• MapReduce Functions: 2 types

• Map function

• Reduce function

• Both of these functions define a Key-Value pair mapping relation (e.g., Key-Value pair 1 → Key-Value pair 2)
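The key-value mapping of the Map and Reduce functions can be sketched with a word count, written here in plain Python rather than against the Hadoop API. The function names and the in-memory `run_job` driver are assumptions made for this illustration; a real Hadoop job distributes these steps across many machines.

```python
from collections import defaultdict

def map_fn(_line_no, line):
    """Map: (line number, line text) -> zero or more (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: (word, list of counts) -> (word, total count)."""
    yield word, sum(counts)

def run_job(lines):
    """Tiny single-process stand-in for the map, group, and reduce phases."""
    groups = defaultdict(list)
    for line_no, line in enumerate(lines):
        for key, value in map_fn(line_no, line):
            groups[key].append(value)        # group values by key
    result = {}
    for key in sorted(groups):               # keys reach reduce in sorted order
        for out_key, out_value in reduce_fn(key, groups[key]):
            result[out_key] = out_value
    return result

counts = run_job(["big data", "Big cloud"])
```

Because each map call depends only on its own input pair and each reduce call only on one key's values, the work can be spread over more machines as the data grows.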

(111)

Hadoop

Hadoop Release Series

Feature                          1.x                                  0.22   2.x
Secure authentication            Yes                                  No     Yes
Old configuration names          Yes
New configuration names          No                                   Yes    Yes
Old MapReduce API                Yes                                  Yes    Yes
New MapReduce API                Yes (with some missing libraries)    Yes    Yes
MapReduce 1 runtime (Classic)    Yes                                  Yes    No
MapReduce 2 runtime (YARN)       No                                   No     Yes
HDFS Federation                  No                                   No     Yes
HDFS High-Availability           No                                   No     Yes

Release 2.6.0 became available Nov. 2014

(112)

Hadoop

Hadoop Release Series

• 2.x includes several major new features

• MapReduce 2 is the new MapReduce runtime implemented on a new system called YARN

• YARN

• Yet Another Resource Negotiator

• General resource management system for running distributed applications

(113)

Hadoop

Hadoop Release Series

• HDFS Federation partitions the HDFS namespace across multiple namenodes

• Enables improved support for clusters with very large numbers of files

• HDFS High-Availability feature uses standby namenodes for backup, and therefore, the namenode is no longer a potential SPOF (Single Point of Failure)


(118)

MapReduce

Big Data

(119)

MapReduce

Hadoop

˗ HDFS provides Data Storage

˗ MapReduce provides Data Analysis

˗ MapReduce = Map Function + Reduce Function

(120)

MapReduce

Scaling Out

• Scaling out is done by the DFS (Distributed FileSystem), where the data is divided and stored in distributed computers & servers

• Hadoop uses HDFS to move the MapReduce computation to several distributed computing machines, each of which processes its assigned part of the divided data

(121)

MapReduce

Jobs

• MapReduce job is a unit of work that needs to be executed

• Job consists of: Data input, MapReduce program, Configuration Information, etc.

• Job is executed by dividing it into tasks of two types

• Map Task

• Reduce Task

(122)

MapReduce

Node types for Job execution

• Job execution is controlled by 2 types of nodes

• Jobtracker

• Tasktracker

• Jobtracker coordinates all jobs

• Jobtracker schedules all tasks and assigns the tasks to tasktrackers

(123)

MapReduce

• Tasktracker will execute its assigned task

• Tasktracker will send progress reports to the Jobtracker

• Jobtracker will keep a record of the progress of all jobs executed

(124)

MapReduce

Data flow

• Hadoop divides the input into input splits (or splits) suitable for the MapReduce job

• Split has a fixed size

• Split size is commonly matched to the size of an HDFS block (64 MB) for maximum processing efficiency

(125)

MapReduce

• Map Task is created for each split

• Map Task executes the map function for all records within the split

• Hadoop commonly executes the Map Task on the node where the input data resides

Data flow

(126)

MapReduce

Data flow

• Data-Local Map Task

• Data locality optimization does not need to use the cluster network

• Data-local flow process shows why the Optimal Split Size = 64 MB HDFS Block Size

(127)

MapReduce

Data flow

• Rack-Local Map Task

• A node hosting the HDFS block replicas for a map task’s input split could be running other map tasks

• Job Scheduler will look for a free map slot on a node in the same rack as one of the blocks

[Figure: Map Task, HDFS Block, Node, Rack, Data Center]

(128)

MapReduce

Data flow

• Off-Rack Map Task

• Needed when the Job Scheduler cannot perform data-local or rack-local map tasks

• Uses inter-rack network transfer
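The three placement cases (data-local, rack-local, off-rack) can be summarized in a small classification function. This is a conceptual sketch only: the function name and the node and rack identifiers are hypothetical, and the real Job Scheduler logic is more involved.

```python
def locality(task_node, task_rack, replica_locations):
    """Classify where a map task runs relative to its input block.

    replica_locations is a list of (node, rack) pairs that hold a
    replica of the HDFS block (hypothetical identifiers).
    """
    if any(node == task_node for node, _ in replica_locations):
        return "data-local"    # block is on the same node: no network use
    if any(rack == task_rack for _, rack in replica_locations):
        return "rack-local"    # block is on another node in the same rack
    return "off-rack"          # block must cross the inter-rack network

replicas = [("node1", "rackA"), ("node5", "rackB")]
case = locality("node2", "rackA", replicas)
```

The order of the checks mirrors the scheduler's preference: it tries the cheapest option first and falls back to inter-rack transfer only when nothing closer is free.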

(129)

MapReduce

Map

• Map task will write its output to the local disk

• Map task output is not the final output; it is only the intermediate output

Reduce

• Map task output is processed by Reduce Tasks to produce the final output

• Reduce Task output is stored in HDFS

• For a completed job, the Map Task output can be discarded

(130)

MapReduce

Single Reduce Task

• Node includes Split, Map, Sort, and Output unit

• Light blue arrows show data transfers in a node

• Black arrows show data transfers between nodes

(131)

MapReduce

Single Reduce Task

• Number of reduce tasks is specified independently, and is not based on the size of the input

(132)

MapReduce

Combiner Function

• User specified function to run on the Map output → Forms the input to the Reduce function

• Specifically designed to minimize the data transferred between Map Tasks and Reduce Tasks

• Solves the problem of limited network speed on the cluster and helps to reduce the time in completing MapReduce jobs
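The effect of a combiner can be sketched with a word count: the combiner sums the counts on the map side so that fewer key-value pairs need to cross the network. The function names are assumptions for this plain-Python illustration, and it relies on summation being associative and commutative, which is what makes partial sums safe to compute before the reduce step.

```python
from collections import Counter

def map_task(lines):
    """Emit one (word, 1) pair per word, as a map task would."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Combiner: locally sum the values for each key on the map side."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

map_output = map_task(["a b a", "a b"])   # 5 pairs before combining
combined = combine(map_output)            # 2 pairs after combining
```

In this small example five pairs shrink to two, and the reduce function still produces the same totals because it only needs to add the partial sums.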

(133)

MapReduce

Multiple Reducers

• Map tasks partition their output, each creating one partition for each reduce task

• Each partition may contain many keys and their associated values

• All records for a key are kept in a single partition
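One common way to keep all records for a key in a single partition is to hash the key and take the result modulo the number of reduce tasks. The sketch below uses CRC32 as a stand-in hash, which is an assumption of this example rather than the function Hadoop uses; the function names are hypothetical.

```python
import zlib

def partition(key, num_reducers):
    """Map a key to a partition index in [0, num_reducers)."""
    # CRC32 is deterministic, so the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_output(pairs, num_reducers):
    """Split a map task's (key, value) pairs into one list per reduce task."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    return partitions

parts = partition_output([("a", 1), ("b", 1), ("a", 1), ("c", 1)], 3)
```

Since the partition index depends only on the key, every map task sends a given key to the same reduce task, which is what lets that reduce task see all values for the key.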

(134)

MapReduce

Multiple Reducers

• Shuffle process is used in the data flow between the Map tasks and Reduce tasks

Shuffle

(135)

MapReduce

Zero Reducer

• Zero reducer uses no shuffle process

• Applied when all of the processing can be carried out in parallel Map tasks


(140)

HDFS

Big Data

(141)

HDFS

Hadoop

˗ HDFS provides Data Storage

˗ MapReduce provides Data Analysis

˗ MapReduce = Map Function + Reduce Function

(142)

HDFS

HDFS: Hadoop Distributed FileSystem

• DFS (Distributed FileSystem) is designed for storage management of a network of computers

• HDFS is optimized to store large terabyte size files with streaming data access patterns

(143)

HDFS

HDFS: Hadoop Distributed FileSystem

• HDFS was designed to be optimal in performance for a WORM (Write Once, Read Many times) pattern

• HDFS is designed to run on clusters of general computers & servers from multiple vendors

(144)

HDFS

HDFS Characteristics

• HDFS is optimized for large scale and high throughput data processing

• HDFS does not perform well in supporting applications that require minimum delay (e.g., tens of milliseconds range)

(145)

HDFS

Blocks

• Files in HDFS are divided into block size chunks → 64 Megabyte default block size

• Block is the minimum amount of data that HDFS can read or write

• Blocks simplify the storage and replication process → Provides fault tolerance & processing speed enhancement for larger files
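How a file is divided into blocks can be sketched with simple arithmetic. The function name is hypothetical, and the 64 MB value is the default block size stated above.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size from the slide

def block_sizes(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block a file is divided into."""
    full_blocks, remainder = divmod(file_size, block_size)
    # The last block only holds the leftover bytes, if there are any.
    return [block_size] * full_blocks + ([remainder] if remainder else [])

MB = 1024 * 1024
blocks = block_sizes(200 * MB)   # three full blocks plus one 8 MB block
```

Each block in the resulting list can be stored and replicated on a different datanode, which is what allows a single file to be larger than any one disk in the cluster.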

(146)

HDFS

• HDFS clusters use 2 types of nodes

• Namenode (master node)

• Datanode (worker node)

(147)

HDFS

Namenode

• Manages the filesystem namespace

• Namenode keeps track of the datanodes that have blocks of a distributed file assigned

• Maintains the filesystem tree and the metadata for all the files and directories in the tree

• Stores on the local disk using 2 file forms

• Namespace Image

• Edit Log
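The relationship between the two file forms can be sketched as follows: the namespace image is a snapshot of the filesystem tree, and the edit log records the changes made since that snapshot, so replaying the log on top of the image gives the current namespace. The operation names and data layout below are hypothetical simplifications, not the actual on-disk formats.

```python
def replay(namespace_image, edit_log):
    """Rebuild the current set of paths from a snapshot plus logged edits.

    namespace_image: iterable of paths present at snapshot time.
    edit_log: list of (operation, path) tuples (hypothetical format).
    """
    namespace = set(namespace_image)
    for operation, path in edit_log:
        if operation == "create":
            namespace.add(path)
        elif operation == "delete":
            namespace.discard(path)
    return namespace

image = {"/data/a.txt"}
log = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]
current = replay(image, log)
```

Keeping a compact snapshot plus an append-only log means each change is cheap to record, while the full state can still be rebuilt after a restart.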

(148)

HDFS

Namenode

• Namenode holds the filesystem metadata in its memory

• Namenode’s memory size determines the limit to the number of files in a filesystem

• But then, what is Metadata?

(149)

HDFS

Metadata

• Traditional concept of the library card catalogs

• Categorizes and describes the contents and context of the data files

• Maximizes the usefulness of the original data file by making it easy to find and use

(150)

HDFS

Metadata

Types

• Structural Metadata

• Focuses on the data structure's design and specification

• Descriptive Metadata

• Focuses on the individual instances of application data or the data content