Cloud Computing, Big Data, & CDN Emerging Technologies


Academic year: 2021


(1)

Cloud Introduction

Cloud Computing

(2)

Cloud Computing

What does Cloud Computing do?

Provides online data storage

Enables configuration and accessing of online applications

Provides a variety of software usage

Provides computing platform and computing infrastructure

(3)

Cloud Computing

Application Example

Using Gmail on my smartphone to check e-mails

Receive an e-mail with an MS PowerPoint attachment file

However, MS PowerPoint and Windows OS are not installed on my smartphone!

Google Drive service’s Google Docs, Sheets, and Slides can be used to open the file

(4)

Cloud Computing

What is a Cloud?

A cloud can provide services through a public or private network or the Internet, where the service hosting system is at a remote location

A cloud can support various applications

E-mail, Web Conferencing, Games, Database Management, CRM (Customer Relationship Management), etc.

(5)

Cloud Computing

Cloud Models

(6)

Cloud Models

Public Cloud

˗ Enables public systems and service access

˗ Open architecture (e.g., e-mail)

˗ Could be less secure due to openness

Private Cloud

˗ Enables service access within an organization

˗ Due to its private nature, it is more secure

Cloud Computing

(7)

Community Cloud

˗ Cloud accessible by a group of organizations

Hybrid Cloud

˗ Hybrid Cloud = Public Cloud + Private Cloud

˗ Private cloud supports critical activities

˗ Public cloud supports non-critical activities

Cloud Computing

Cloud Models

(8)

Cloud Computing

Cloud Service Models

• SaaS: Software as a Service

• PaaS: Platform as a Service

• IaaS: Infrastructure as a Service

Each lower service model supports the management, computing power, and security of the service model above it

(9)

Cloud Computing

Software as a Service (SaaS)

Provides a variety of software applications as a service to end users

Platform as a Service (PaaS)

Provides a program executable platform for applications, development tools, etc.

Infrastructure as a Service (IaaS)

Provides the fundamental computing and security resources for the entire cloud

Backup storage, computing power, VM (Virtual Machines), etc.

(10)

Cloud Computing

Cloud Service Models

There are many other service models

XaaS = Anything as a Service

NaaS → N for Network as a Service

DaaS → D for Database as a Service

BaaS → B for Business as a Service

etc.

(11)

Cloud Computing

Cloud Benefits

(12)

Cloud Computing

Characteristics

(13)

REFERENCES

Cloud Computing

(14)

• K. Kumar and Y. H. Lu, “Cloud Computing for Mobile Users: Can Offloading Computation Save Energy?,” Computer, vol. 43, no. 4, pp. 51–56, Apr. 2010.

• Wikipedia, http://www.wikipedia.org

• Apple, iCloud, https://www.icloud.com

• Google, Google Cloud, https://cloud.google.com/products [Accessed June 1, 2015]

• Virtualization, Cisco’s IaaS cloud, http://www.virtualization.co.kr/data/file/01_2/1889266503_6f489654_1.jpg [Accessed June 1, 2015]

• Tutorialspoint, Cloud computing, http://www.tutorialspoint.com/cloud_computing/cloud_computing_tutorial.pdf [Accessed June 1, 2015]

References

(15)

Image sources

• AWS Simple Icons Storage Amazon S3 Bucket with Objects, By Amazon Web Services LLC [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

• iCloud Logo, By EEIM (Own work) [Public domain], via Wikimedia Commons

• MobileMe Logo, By Apple Inc. [Public domain], via Wikimedia Commons

References

(16)

Cloud Service Models

Cloud Computing

(17)

Cloud Computing

Cloud Service Models

• SaaS: Software as a Service

• PaaS: Platform as a Service

• IaaS: Infrastructure as a Service

Each lower service model supports the management, computing power, and security of the service model above it

(18)

IaaS

IaaS (Infrastructure as a Service)

Infrastructure support over the Internet

Cloud’s Computing & Storage Resources

Computing Power

Storage Services

Software Packages & Bundles

VLAN (Virtual Local Area Network)

• VM (Virtual Machine) Features

(19)

IaaS

VM (Virtual Machine) Administration

IaaS enables control of computing resources through Administrative Access to VMs

→ Server Virtualization features

Access to computing resources is enabled by Administrative Access to VMs

VM Administrative Command examples

Save data on cloud server

Start web server

Install new application

(20)

IaaS

IaaS Procedures

(21)

IaaS

IaaS Benefits

Flexible and Efficient Renting of Computer & Server Hardware

Rentable Resources

VM, Storage, Bandwidth, IP Addresses, Monitoring Services, Firewalls, etc.

Rent Payment Basis

Resource type

Usage time

Service packages
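The payment basis listed above can be pictured as a simple metering calculation. The following is a minimal sketch only: the resource names and hourly rates are made-up assumptions for illustration and do not reflect any provider's actual pricing.

```python
from dataclasses import dataclass

# Hypothetical hourly rates per resource type (illustrative values only).
HOURLY_RATE = {"vm": 0.05, "storage_gb": 0.0001, "ip_address": 0.005}


@dataclass
class Usage:
    resource_type: str  # key into HOURLY_RATE
    quantity: float     # e.g., number of VMs or gigabytes stored
    hours: float        # usage time


def rental_cost(usages):
    """Sum rate * quantity * hours over all rented resources."""
    return sum(HOURLY_RATE[u.resource_type] * u.quantity * u.hours for u in usages)


# Two VMs for 100 hours plus 500 GB of storage for 720 hours.
bill = rental_cost([Usage("vm", 2, 100), Usage("storage_gb", 500, 720)])
print(f"Total: {bill:.2f}")
```

A service package could be modeled in the same way as a fixed price that replaces several metered entries.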

(22)

IaaS

IaaS Benefits

Portability & Interoperability with Legacy Applications

Enables portability based on infrastructure resources that are used through Internet connections

Enables a method to maintain interoperability with legacy applications and workloads between IaaS clouds

(23)

PaaS

PaaS (Platform as a Service)

Provides development & deployment tools for application development

Provides a runtime environment for apps

(24)

Cloud Services

PaaS Types

Stand-Alone Development Environment

Application Delivery-Only Environment

Open Platform as a Service

Add-On Development Facilities

(25)

PaaS

PaaS Types

Application Delivery-Only Environment

Provides on-demand scaling & application security

Stand-Alone Development Environment

Provides an independent platform for a specific function

Open Platform as a Service

Provides open source software to run applications for PaaS providers

Add-On Development Facilities

Enables customization to the existing SaaS platforms

(26)

PaaS

PaaS Benefits

(27)

PaaS

Benefits

Lower Administrative Overhead

User does not need to be involved in any administration of the platform

Lower Total Cost of Ownership

User does not need to purchase any hardware, memory, or server

(28)

PaaS

Scalable Solutions

Resources are scaled automatically based on the application's resource demand

More Current System Software

The cloud provider is responsible for software upgrades & patch installations
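Demand-based automatic scaling can be illustrated with a simple target-utilization rule. This is a sketch under stated assumptions: the function name, the 60% target, and the instance limits are hypothetical and are not taken from any specific PaaS product.

```python
def desired_instances(current, cpu_percent, target_percent=60, min_n=1, max_n=20):
    """Return how many instances keep average CPU near target_percent.

    current      number of instances running now
    cpu_percent  average CPU utilization across them (integer percent)
    """
    # Ceiling division with integers avoids floating-point rounding surprises.
    wanted = (current * cpu_percent + target_percent - 1) // target_percent
    # Clamp to the allowed range so the platform never scales to zero or without bound.
    return max(min_n, min(max_n, wanted))


for load in (15, 60, 90):
    print(load, "->", desired_instances(4, load))
```

With four instances, a load above the target grows the pool and a load below it shrinks the pool, which is the behavior the slide describes as automatic scale control.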

Benefits

(29)

SaaS

SaaS (Software as a Service)

Provides software applications as a service to the user

Software that is deployed on a cloud server which is accessible through the Internet

(30)

SaaS

Characteristics

On Demand Availability

Cloud software is available anywhere that the cloud is reachable via Internet

Easy Maintenance

No user software upgrade or maintenance needed

→ All supported by the cloud

Flexible Scale Up or Scale Down

Centralized Management & Data

(31)

SaaS

Characteristics

Enables a Shared Data Model

Multiple users can share a single data model and database

Cost Effectiveness

Pay based on usage

No risk in buying the wrong software

Multitenant Programming Solutions

All users are guaranteed to use the same software version

→ No version mismatch problems

(32)

Software-as-a-service

Open SaaS

Applications


(36)

Cloud Services

Cloud Computing

(37)

Cloud Services

Google Cloud

Google App Engine

˗ Released as a preview in April 2008

˗ PaaS (Platform as a Service) for web applications

˗ Provides automatic scaling based on resource demands and server load

• Google Cloud Storage

˗ Launched in May 2010

˗ Online file storage service

(38)

Cloud Services

Google Cloud

Google BigQuery

˗ Released in April 2012

˗ Data analysis tool that uses SQL-like queries to process big datasets in seconds

Google Compute Engine

˗ Released in June 2012

˗ IaaS (Infrastructure as a Service) support

to enable on demand launching of VMs (Virtual Machines)

(39)

Cloud Services

Google Cloud

Google Cloud Endpoints

˗ Released in November 2013

˗ Tool to create services inside App Engine

˗ Easily connects from Android, iOS, and JavaScript clients

Google Cloud DNS (Domain Name System)

˗ DNS service supported by the Google Cloud

(40)

Cloud Services

Google Cloud Datastore

˗ NoSQL (non-relational) data storage

Google Cloud SQL (Structured Query Language)

˗ Released in February 2014 as GA (General Availability)

˗ Fully managed MySQL database

Google Cloud

(41)

Cloud Services

Amazon S3 (Simple Storage Service)

Online file storage web service offered by Amazon Web Services

Public web service released in the United States in March 2006 and in Europe in November 2007

Provides storage through web services interfaces

(REST, SOAP, and BitTorrent)

(42)

Cloud Services

Amazon Cloud Drive

Amazon Cloud Drive was released in March 2011

Web storage application from Amazon

Storage Space Characteristics

˗ Can be accessed from up to eight specific devices (e.g., mobile devices & different computers) and by using different browsers on the same computer

(43)

Cloud Services

Amazon Cloud Drive

Cloud Player (Originally bundled)

˗ Users can play music in their Cloud Drive from any computer or Android device

˗ Music browsing based on song titles, albums, artists, genres (website only), and playlists

(44)

Cloud Services

Amazon Cloud Drive

Options

Unlimited Photos

˗ Unlimited storage for photos & raw data files

˗ 5 gigabytes of video storage

Unlimited Everything

˗ Unlimited storage for photos, videos, documents, and various file types

(45)

Cloud Services

iCloud

Developed by Apple, Inc.

Public release in October 2011

Cloud Storage & Cloud Computing

Operating system

˗ OS X (10.7 Lion or later)

˗ Microsoft Windows 7 or later

˗ iOS 5 or later

(46)

Cloud Services

iCloud replaces MobileMe

Subscription-based collection of Apple’s online services and software

MobileMe was replaced by iCloud

MobileMe ceased services in June 2012

MobileMe users were allowed transfers to iCloud until July 2012

(47)

Cloud Services

iCloud Features

Email, Contacts, and Calendars

Find My Friends

Backup & Restore

˗ Back up feature for device settings & data

˗ iOS 5 or later required

Find My iPhone

˗ Enables a user to track the location of an iOS device or Mac

˗ Formerly a feature of MobileMe

(48)

Cloud Services

˗ Find My iPhone can manage lost or stolen Apple devices

Back to My Mac

˗ Enables remote login to other computers that have Back to My Mac installed (using the same Apple ID)

iWork for iCloud

˗ Apple's iWork suite (Pages, Numbers, and Keynote) made available on a web interface

iCloud Features

(49)

Cloud Services

iCloud Features

Photo Stream

˗ Can store most recent 1,000 photos

˗ Free storage for up to 30 days

iCloud Photo Library

˗ Stores all photos at original resolution

˗ Stores photo metadata

Storage (Introduced in 2011)

˗ 5 GB of free storage per account

(50)

Cloud Services

iCloud Drive

˗ Can save photos, videos, documents, and apps

iCloud Keychain

˗ Secure database for website and Wi-Fi passwords

˗ Secure credit card & debit card management for quick access and auto-fill

iCloud Features

(51)

Cloud Services

iTunes Match

˗ Scans the iTunes music library and matches its tracks

˗ Serves tracks copied from CDs or other sources

iCloud Features


(55)

Big Data Examples

Big Data

(56)

Big Data

New FLU Virus Starts in the U.S.!

H1N1 flu virus (which has combined virus elements of the bird and swine (pig) flu) started to spread in the U.S. in 2009

The U.S. CDC (Centers for Disease Control and Prevention) was collecting diagnostic data from medical doctors only once a week

Using the CDC information to find how the flu was spreading would have an approximate 2-week lag, which is far too slow compared to the speed of the virus spreading

(57)

Big Data

New FLU Virus Starts in the U.S.!

What vaccine was needed?

How much vaccine was needed?

Where was the vaccine needed?

Vaccine preparation and delivery plans could not be set up fast enough to safely prevent the virus from spreading out of control

(58)

Big Data

Fortunately, Google published a paper about how they could predict the spread of the winter flu in the U.S. accurately down to specific regions and states

This paper was published in the journal Nature a few weeks before the H1N1 virus made the headline news

New FLU Virus Starts in the U.S.!

(59)

Big Data

Millions of the most common search terms and millions of different mathematical models were tested on Google’s database

Google receives more than 3 billion search queries a day

The analysis system was set to look for correlations between the frequency of certain search queries and the spread of the flu over time and space
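The idea of looking for correlation between query frequency and flu activity can be sketched with a Pearson correlation coefficient. This is an illustration only: the weekly numbers and query strings below are synthetic, and the source does not state which correlation measure or model Google actually used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Synthetic weekly values (made up for illustration, not real data).
flu_cases = [10, 14, 25, 40, 38, 22, 12]
query_counts = {
    "flu symptoms": [105, 130, 260, 410, 370, 230, 125],
    "movie times": [300, 290, 310, 305, 295, 300, 308],
}

# Rank candidate search terms by how closely they track the flu curve.
ranked = sorted(
    query_counts,
    key=lambda q: pearson(query_counts[q], flu_cases),
    reverse=True,
)
for query in ranked:
    print(query, round(pearson(query_counts[query], flu_cases), 3))
```

Repeating this ranking over millions of terms and per region is what makes the task a Big Data problem rather than a spreadsheet exercise.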

New FLU Virus Starts in the U.S.!

(60)

Big Data

Google’s method of analysis did not use data provided from hospitals or Medical Doctors

Google used Big Data analysis on the most common search terms people use

Google’s system proved to be more accurate and faster than analyzing government statistics

New FLU Virus Starts in the U.S.!

(61)

Big Data

Wal-Mart

Wal-Mart’s Data Warehouse

Stores 4 petabytes (4 × 10^15 bytes) of data

Records every single purchase

Approximately 267 million transactions a day from 6000 stores worldwide are recorded

(62)

Big Data

Wal-Mart’s Data Analysis

Focused on evaluating the effectiveness of pricing strategies and advertising campaigns

Seeking ways to improve inventory management and supply chains

Wal-Mart

(63)

Big Data

Recommendation System

using Big Data

Based on data analysis of simple elements

What did users purchase in the past?

Which items do they have in their virtual shopping cart?

Which items did customers rate and like?

What influence did the ratings have on other customers' purchases?

(64)

Big Data

Amazon.com

Amazon.com’s Recommendation System

Item-to-Item Collaborative Filtering Algorithm

Personalization of the Online Store

→ Customized to each customer

Each customer’s store is based on the customer’s personal interest

Example: For a new mother, the store will display baby supplies and toys
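Item-to-item collaborative filtering can be sketched as follows: items are considered similar when the same customers buy them, and a customer is recommended items similar to those already purchased. This is a toy illustration using cosine similarity over sets of buyers; the customer names and purchase data are hypothetical, and it is not Amazon's production implementation.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical purchase history: customer -> set of purchased items.
purchases = {
    "alice": {"diapers", "baby wipes", "toy rattle"},
    "bob": {"diapers", "baby wipes"},
    "carol": {"novel", "bookmark"},
    "dave": {"diapers", "toy rattle"},
}


def item_similarities(purchases):
    """Cosine similarity between items, based on which customers bought them."""
    buyers = defaultdict(set)  # item -> customers who bought it
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    sims = defaultdict(dict)
    for a, b in combinations(buyers, 2):
        common = len(buyers[a] & buyers[b])
        if common:
            score = common / sqrt(len(buyers[a]) * len(buyers[b]))
            sims[a][b] = sims[b][a] = score
    return sims


def recommend(customer, purchases, sims, top_n=3):
    """Score items the customer does not own by similarity to owned items."""
    owned = purchases[customer]
    scores = defaultdict(float)
    for item in owned:
        for other, score in sims[item].items():
            if other not in owned:
                scores[other] += score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


sims = item_similarities(purchases)
print(recommend("bob", purchases, sims))
```

Because the similarity table is computed per item rather than per customer, it can be built offline and looked up quickly when a customer visits the store.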

(65)

Big Data

Citibank

Bank operations in 100 countries

Big Data analysis on the database of basic financial transactions can enable Global insight on

investments, market changes, trade patterns, and economic conditions

Many companies (e.g., Zara, H&M, etc.) work with Citibank to locate new stores and factories

(66)

Big Data

Product Development & Sales

For example, a smartphone takes significant time and money to manufacture

In addition, the duration of popularity for a new smartphone is limited

To maximize sales, a company needs to manufacture just the right amount of products and sell them in the right locations

(67)

Big Data

Product Development & Sales

Too many will result in leftovers and a big waste for the company!

Too few will result in a lost opportunity for company profit and growth!

Big Data analysis can help find how many smartphones could be sold and where the products could be popular, based on common search terms that people use

→ Use this to also estimate how many products could be sold in a certain location

→ But why is this difficult?

(68)

REFERENCES

Big Data

(69)

• V. Mayer-Schönberger and K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt, 2013.

• T. White, Hadoop: The Definitive Guide. O'Reilly Media, 2012.

• J. Venner, Pro Hadoop. Apress, 2009.

• S. LaValle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, “Big Data, Analytics and the Path From Insights to Value,” MIT Sloan Management Review, vol. 52, no. 2, Winter 2011.

• R. E. Bryant, R. H. Katz, and E. D. Lazowska, “Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science and Society,” Computing Community Consortium, pp. 1-15, Dec. 2008.

• G. Linden, B. Smith, and J. York, “Amazon.com Recommendations: Item-to-Item Collaborative Filtering,” IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan/Feb. 2003.

References

(70)

• J. R. Galbraith, “Organizational Design Challenges Resulting From Big Data,” Journal of Organization Design, vol. 3, no. 1, pp. 2-13, Apr. 2014.

• S. Sagiroglu and D. Sinanc, “Big data: A review,” Proc. IEEE International Conference on Collaboration Technologies and Systems, pp. 42-47, May 2013.

• M. Chen, S. Mao, and Y. Liu, “Big Data: A Survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171-209, Jan. 2014.

• X. Wu, X. Zhu, G. Q. Wu, and W. Ding, “Data Mining with Big Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97–107, Jan. 2014.

• Z. Zheng, J. Zhu, and M. R. Lyu, “Service-Generated Big Data and Big Data-as-a-Service: An Overview,” Proc. IEEE International Congress on Big Data, pp. 403–410, Jun/Jul. 2013.

References

(71)

• I. Palit and C. K. Reddy, “Scalable and Parallel Boosting with MapReduce,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 10, pp. 1904-1916, 2012.

• M.-Y. Choi, E.-A. Cho, D.-H. Park, C.-J. Moon, and D.-K. Baik, “A Database Synchronization Algorithm for Mobile Devices,” IEEE Transactions on Consumer Electronics, vol. 56, no. 2, pp. 392-398, May 2010.

• IBM, What is big data?, http://www.ibm.com/software/data/bigdata/what-is-big-data.html [Accessed June 1, 2015]

• Hadoop Apache, http://hadoop.apache.org

• Wikipedia, http://www.wikipedia.org

Image sources

• Walmart Logo, By Walmart [Public domain], via Wikimedia Commons

• Amazon Logo, By Balajimuthazhagan (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

References

(72)

Big Data's 4 Vs

Big Data

(73)

Big Data

Big Data’s 4 V Big Challenges

Volume – Data Size

Variety – Data Formats

Velocity – Data Streaming Speeds

Veracity – Data Trustworthiness

(74)

Big Data

Volume – Data Size

40 Zettabytes (10^21) of data is predicted to be created by 2020

2.5 Quintillion bytes (10^18) of data are created every day

6 Billion (10^9) people have mobile phones

100 Terabytes (10^12) of data (at least) is stored by most U.S. companies

966 Petabytes (10^15) was the approximate storage size of the American manufacturing industry in 2009

(75)

Big Data

Variety – Data Formats

150 Exabytes (10^18) was the estimated size of data for health care throughout the world in 2011

More than 4 Billion (10^9) hours each month are used in watching YouTube

30 Billion pieces of content are exchanged every month on Facebook

200 Million monthly active users exchange 400 Million tweets every day

(76)

Big Data

Velocity – Data Streaming Speeds

1 Terabyte (10^12) of trade information is exchanged during every trading session at the New York Stock Exchange

Approximately 100 sensors are installed in modern cars to monitor fuel level, tire pressure, etc.

18.9 Billion network connections are predicted to exist by 2016

(77)

Big Data

Veracity – Data Trustworthiness

1 out of 3 business leaders has experienced trust issues with their data when trying to make a business decision

$3.1 Trillion (10^12) a year is estimated to be wasted in the U.S. economy due to poor data quality

(78)

Big Data

New technology is needed to overcome these 4 V Big Data Challenges

Volume – Data Size

Variety – Data Formats

Velocity – Data Streaming Speeds

Veracity – Data Trustworthiness


(83)

HADOOP

Big Data

(84)

Hadoop

Data Storage, Access, and Analysis

Hard drive storage capacity has tremendously increased

But the data read and write speeds to and from the hard drives have not significantly improved yet

Simultaneous parallel read and write of data with multiple hard disks requires advanced technology

(85)

Hadoop

Challenge 1: Hardware Failure

˗ When using many computers for data storage and analysis, the probability that at least one computer will fail is very high

Challenge 2: Cost

˗ To avoid losing data or computed analysis results, backup computers and memory are needed, which improves reliability but is very expensive
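The hardware-failure claim follows from basic probability: if each computer fails independently with probability p during some period, the chance that at least one of n computers fails is 1 - (1 - p)^n. The sketch below assumes independent failures and an illustrative p of 1%; both are assumptions, not figures from the lecture.

```python
def prob_any_failure(n_machines, p_fail):
    """Probability that at least one of n independent machines fails."""
    return 1 - (1 - p_fail) ** n_machines


# Assumed per-machine failure probability over the period of interest.
P_FAIL = 0.01

for n in (1, 10, 100, 1000):
    print(f"{n:>5} machines: {prob_any_failure(n, P_FAIL):.3f}")
```

Even with reliable individual machines, the probability of some failure approaches certainty as the cluster grows, which is why the storage system itself must tolerate failures.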

Data Storage, Access, and Analysis

(86)

Hadoop

Challenge 3: Combining Analyzed Data

˗ Combining the analyzed data is very difficult

˗ If one part of the analyzed data is not ready, then the overall combining process has to be delayed

˗ If one part has errors in its analysis, then the overall combined result may be unreliable and useless

Data Storage, Access, and Analysis

(87)

Hadoop

Hadoop

Hadoop is a Reliable Shared Storage and Analysis System

Hadoop = HDFS + MapReduce + α

˗ HDFS provides Data Storage

˗ HDFS: Hadoop Distributed FileSystem

˗ MapReduce provides Data Analysis

˗ MapReduce = Map Function + Reduce Function

(88)

Hadoop

HDFS: Hadoop Distributed FileSystem

DFS (Distributed FileSystem) is designed for storage management of a network of computers

HDFS is optimized to store huge files with streaming data access patterns

HDFS is designed to run on clusters of general-purpose (commodity) computers

(89)

Hadoop

HDFS: Hadoop Distributed FileSystem

HDFS was designed to be optimal in performance for a WORM (Write Once, Read Many times) pattern, which is a very efficient data processing pattern

HDFS was designed considering the time to read the whole dataset to be more important than the time required to read the first record

(90)

Hadoop

HDFS

HDFS clusters use 2 types of nodes

Namenode (master node)

Datanode (worker node)

(91)

Hadoop

HDFS: Namenode

Manages the filesystem namespace

Maintains the filesystem tree and the metadata for all the files and directories in the tree

Stores on the local disk using 2 file forms

Namespace Image

Edit Log

(92)

Hadoop

HDFS: Datanodes

Workhorse of the filesystem

Store and retrieve blocks when requested by the client or the namenode

Report back to the namenode periodically with lists of blocks that were stored
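The division of labor between the namenode and the datanodes can be sketched with a toy in-memory model. The class and method names below are hypothetical and chosen for illustration; they are not the HDFS API, and real HDFS adds replication policy, heartbeats, and persistence (namespace image and edit log).

```python
class DataNode:
    """Stores and retrieves blocks; reports its block list to the namenode."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def retrieve(self, block_id):
        return self.blocks[block_id]

    def block_report(self):
        return list(self.blocks)


class NameNode:
    """Holds only metadata: the namespace and where each block lives."""

    def __init__(self):
        self.namespace = {}  # file path -> ordered list of block ids
        self.locations = {}  # block id -> set of datanode names

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def receive_report(self, datanode):
        for block_id in datanode.block_report():
            self.locations.setdefault(block_id, set()).add(datanode.name)

    def locate(self, path):
        return [(b, sorted(self.locations.get(b, ()))) for b in self.namespace[path]]


nn = NameNode()
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.store("blk_1", b"hello ")
dn2.store("blk_1", b"hello ")  # second copy of the same block
dn2.store("blk_2", b"world")
nn.add_file("/logs/a.txt", ["blk_1", "blk_2"])
nn.receive_report(dn1)
nn.receive_report(dn2)
print(nn.locate("/logs/a.txt"))
```

A client would ask the namenode where the blocks of a file are and then read the block contents directly from the datanodes.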

(93)

Hadoop

MapReduce

MapReduce is a programming model that abstracts the analysis problem from the stored data

MapReduce transforms the analysis problem into a computation process that uses a set of keys and values
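Expressing an analysis as a computation over keys and values can be illustrated with the classic word-count problem. This is a single-process sketch of the model, not Hadoop's API: `run_mapreduce` is a hypothetical helper that stands in for the framework's grouping of values by key.

```python
from collections import defaultdict


def map_fn(_offset, line):
    """Map: emit (word, 1) for every word in an input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts collected for one word."""
    yield word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map to every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output


lines = ["big data big cloud", "cloud data"]
result = run_mapreduce(enumerate(lines), map_fn, reduce_fn)
print(result)
```

The analyst writes only the two small functions; the grouping step in the middle is what the framework provides.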

(94)

Hadoop

MapReduce System Architecture

MapReduce was designed for tasks that consume several minutes or hours on a set of dedicated trusted computers connected with a broadband high-speed network managed by a single master data center

(95)

Hadoop

MapReduce Characteristics

MapReduce uses a somewhat brute-force data analysis approach

The entire dataset (or a big part of the dataset) is processed for every query

→ Batch Query Processor model

(96)

Hadoop

MapReduce Characteristics

MapReduce makes it possible to run an ad hoc query against the whole dataset and get the results in a reasonable time

Many distributed systems combine data from multiple sources (which is very difficult), but MapReduce does this in a very effective and efficient way

(97)

Hadoop

Technical Terms

used in MapReduce

Seek Time is the delay in moving the disk head to the place on the disk where data is read or written

Transfer Rate is the speed at which data is streamed to or from the disk (the disk's bandwidth)

Transfer Rate has improved significantly more (i.e., now has much faster transfer speeds) compared to improvements in Seek Time (i.e., still relatively slow)

(98)

Hadoop

MapReduce

MapReduce gains performance enhancement through optimal balancing of Seeking and Transfer operations

Reduce Seek operations

Effectively use Transfer operations
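A short back-of-the-envelope calculation shows why reducing seeks and streaming data pays off. The seek time (10 ms), transfer rate (100 MB/s), dataset size, and record size below are assumed round numbers for illustration, not measurements from the lecture.

```python
SEEK_TIME_S = 0.010        # assumed average seek time: 10 ms
TRANSFER_RATE_BPS = 100e6  # assumed transfer rate: 100 MB/s


def streaming_time(dataset_bytes):
    """Time to read the whole dataset sequentially (seek cost negligible)."""
    return dataset_bytes / TRANSFER_RATE_BPS


def seek_based_time(n_records, record_bytes):
    """Time to access records one by one, paying one seek per record."""
    return n_records * (SEEK_TIME_S + record_bytes / TRANSFER_RATE_BPS)


dataset = 1e12                     # 1 TB dataset
record = 100                       # 100-byte records
touched = 0.01 * dataset / record  # access just 1% of the records

print(f"stream everything: {streaming_time(dataset) / 3600:.1f} hours")
print(f"seek to 1% of records: {seek_based_time(touched, record) / 86400:.1f} days")
```

Under these assumptions, streaming the full terabyte takes a few hours, while seeking to only 1% of the records takes days, so a system that processes large fractions of the data should favor transfer over seek operations.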

In the next lecture, we will compare MapReduce with a traditional RDBMS (Relational Database Management System)


(103)

MapReduce vs.

RDBMS

Big Data

(104)

Hadoop

RDBMS (Relational Database Management System) Characteristics

RDBMS is good for updating a small proportion of a big database

RDBMS uses a traditional B-Tree, which is highly dependent on the time required to perform seek operations

MapReduce vs. RDBMS

(105)

Hadoop

MapReduce vs. RDBMS

MapReduce Characteristics

MapReduce is good for updating all (or a majority) of a big database

MapReduce uses Sort and Merge to rebuild the database, which depends more on transfer operations
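The sort-and-merge approach can be sketched as a single sequential pass over two key-sorted sequences, where an update replaces the existing record with the same key. This is a generic merge illustration with hypothetical data, not Hadoop's internal implementation.

```python
def merge_rebuild(records, updates):
    """Merge two key-sorted lists of (key, value); updates win on equal keys."""
    out = []
    i = j = 0
    while i < len(records) and j < len(updates):
        record_key, update_key = records[i][0], updates[j][0]
        if record_key < update_key:
            out.append(records[i])
            i += 1
        elif record_key > update_key:
            out.append(updates[j])  # new key inserted in order
            j += 1
        else:
            out.append(updates[j])  # same key: the update replaces the record
            i += 1
            j += 1
    # At most one of these tails is non-empty.
    out.extend(records[i:])
    out.extend(updates[j:])
    return out


records = [(1, "a"), (3, "c"), (5, "e")]
updates = sorted([(5, "E"), (2, "b")])  # sort the batch of changes first
print(merge_rebuild(records, updates))
```

Both inputs are read front to back exactly once, so the cost is dominated by transfer rather than by seeking to individual records as a B-Tree update would.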

(106)

Hadoop

RDBMS is good for applications that require the datasets of the database to be very frequently updated (e.g., point queries or small dataset updates)

MapReduce is better for WORM (Write Once and Read Many times) based data applications

MapReduce is a complementary system to RDBMS

MapReduce vs. RDBMS

(107)

Hadoop

MapReduce vs. RDBMS

                 RDBMS                      MapReduce

Data Size        Gigabytes (10^9)           Petabytes (10^15)

Access           Interactive & Batch        Batch

Updates          Read & Write Many Times    WORM (Write Once, Read Many Times)

Data Structure   Static Schema              Dynamic Schema

Integrity        High                       Low

Scalability      Nonlinear                  Linear

(108)

Hadoop

MapReduce vs. RDBMS:

Data Types

Structured Data: Data that has a formal defined structure (e.g., XML documents or database tables)

Semi-Structured Data: Data that has a looser format where the data structure is used as a guide and may be ignored

Unstructured Data: Data that does not have any formal structure (e.g., plain text or image data)

(109)

Hadoop

MapReduce vs. RDBMS:

Data Types

MapReduce is very effective on unstructured and semi-structured data

Why?

MapReduce interprets data during the data processing sessions

MapReduce does not use intrinsic properties of the data as input keys or input values. The parameters used are selected by the person analyzing the data
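Interpreting the data at processing time means that the same raw input can be analyzed with different keys, depending on the question being asked. In the sketch below, the log line format and the two map functions are hypothetical examples; lines that do not fit the expected shape are simply skipped.

```python
# Hypothetical raw log lines: date, user, method, path, status.
raw_lines = [
    "2015-06-01 alice GET /index.html 200",
    "2015-06-01 bob GET /missing 404",
    "malformed line",
    "2015-06-02 alice GET /about.html 200",
]


def map_by_user(line):
    """Analyst's choice 1: key on the user field."""
    fields = line.split()
    if len(fields) == 5:  # interpret structure only now; skip bad lines
        yield fields[1], 1


def map_by_status(line):
    """Analyst's choice 2: key on the status field of the same raw data."""
    fields = line.split()
    if len(fields) == 5:
        yield fields[4], 1


def count(pairs):
    """Reduce step: total the values for each key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


by_user = count(kv for line in raw_lines for kv in map_by_user(line))
by_status = count(kv for line in raw_lines for kv in map_by_status(line))
print(by_user, by_status)
```

No schema had to be declared before loading the data, which is what makes this style a good fit for unstructured and semi-structured input.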

(110)

Hadoop

MapReduce vs. RDBMS:

Scalability

MapReduce has a programming model that is linearly scalable

MapReduce Functions: 2 types

Map function

Reduce function

Both of these functions define a Key-Value pair mapping relation

(e.g., Key-Value pair 1 → Key-Value pair 2)
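A minimal single-process sketch of the two function types is shown below, using word counting as the example. It only imitates the programming model in plain Python and does not use Hadoop; the names map_fn, reduce_fn, and map_reduce are hypothetical.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: (line offset, line text) -> list of (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce: (word, [1, 1, ...]) -> (word, total count)."""
    return key, sum(values)


def map_reduce(records):
    """Run map over all records, group the values by key, then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


if __name__ == "__main__":
    lines = [(0, "big data"), (9, "Big cloud")]
    print(map_reduce(lines))
```

Because each call to map_fn and reduce_fn depends only on its own input pair, the calls can in principle be spread over many machines, which is what makes the model scale.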

(111)

Hadoop

Hadoop Release Series

Feature | 1.x | 0.22 | 2.x
Secure authentication | Yes | No | Yes
Old configuration names | Yes | |
New configuration names | No | Yes | Yes
Old MapReduce API | Yes | Yes | Yes
New MapReduce API | Yes (with some missing libraries) | Yes | Yes
MapReduce 1 runtime (Classic) | Yes | Yes | No
MapReduce 2 runtime (YARN) | No | No | Yes
HDFS Federation | No | No | Yes
HDFS High-Availability | No | No | Yes

Release 2.6.0 became available Nov. 2014

(112)

Hadoop

Hadoop Release Series

2.x includes several major new features

MapReduce 2 is the new MapReduce runtime implemented on a new system called YARN

YARN

Yet Another Resource Negotiator

General resource management system for running distributed applications

(113)

Hadoop

Hadoop Release Series

HDFS Federation partitions the HDFS namespace across multiple namenodes

Enables improved support for clusters with very large numbers of files

HDFS High-Availability feature uses standby namenodes for backup, and therefore, the namenode is no longer a potential SPOF (Single Point of Failure)


(118)

MapReduce

Big Data

(119)

MapReduce

Hadoop

Hadoop is a Reliable Shared Storage and Analysis System

Hadoop = HDFS + MapReduce + α

˗ HDFS provides Data Storage

˗ HDFS: Hadoop Distributed FileSystem

˗ MapReduce provides Data Analysis

˗ MapReduce = Map Function + Reduce Function

(120)

MapReduce

Scaling Out

Scaling out is done by the DFS (Distributed FileSystem), where the data is divided and stored across distributed computers & servers

Hadoop uses HDFS to move the MapReduce computation to several distributed computing machines, each of which processes its assigned part of the divided data

(121)

MapReduce

Jobs

A MapReduce job is a unit of work that needs to be executed

Job components: Input data, MapReduce program, Configuration Information, etc.

Job is executed by dividing it into tasks of two types

Map Task

Reduce Task

(122)

MapReduce

Node types for Job execution

Job execution is controlled by 2 types of nodes

Jobtracker

Tasktracker

Jobtracker coordinates all jobs

Jobtracker schedules all tasks and assigns the tasks to tasktrackers

(123)

MapReduce

Tasktracker will execute its assigned task

Tasktracker will send progress reports to the Jobtracker

Jobtracker will keep a record of the progress of all jobs executed

(124)

MapReduce

Data flow

Hadoop divides the input into input splits (or splits) suitable for the MapReduce job

Each split has a fixed size

Split size is commonly matched to the size of a HDFS block (64 MB) for maximum processing efficiency
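How a file is cut into fixed-size splits can be sketched as below. This is a simplified illustration with a hypothetical compute_splits helper; it assumes the 64 MB size quoted above and does not reproduce the details of Hadoop's actual split calculation.

```python
MB = 1024 * 1024
SPLIT_SIZE = 64 * MB  # matches the 64 MB HDFS block size quoted above


def compute_splits(file_size, split_size=SPLIT_SIZE):
    """Return a list of (offset, length) pairs covering the whole file."""
    splits = []
    offset = 0
    while offset < file_size:
        # The last split is shorter when the file size is not a multiple of split_size.
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


if __name__ == "__main__":
    for offset, length in compute_splits(200 * MB):
        print(f"split at {offset // MB} MB, length {length // MB} MB")
```

For a 200 MB file this yields three full 64 MB splits and one final 8 MB split, so four map tasks would be created.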

(125)

MapReduce

Data flow

Map Task is created for each split

Map Task executes the map function for all records within the split

Hadoop commonly executes the Map Task on the node where the input data resides

(126)

MapReduce

Data flow

Data-Local Map Task

Data locality optimization: the map task runs on the node that stores its input split, so it does not need to use the cluster network

Data-local flow process shows why the Optimal Split Size = 64 MB HDFS Block Size

(127)

MapReduce

Data flow

Rack-Local Map Task

A node hosting the HDFS block replicas for a map task’s input split could be running other map tasks

Job Scheduler will look for a free map slot on a node in the same rack as one of the blocks

[Figure legend: Map Task, HDFS Block, Node, Rack, Data Center]

(128)

MapReduce

Data flow

Off-Rack Map Task

Needed when the Job Scheduler cannot perform data-local or rack-local map tasks

Uses inter-rack network transfer
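The preference order of data-local, then rack-local, then off-rack placement can be sketched as a small selection function. This is a simplified illustration, not Hadoop's scheduler code; pick_node, the node names, and the rack map are all hypothetical.

```python
def pick_node(replica_nodes, free_nodes, rack_of):
    """Choose a node for a map task and report the locality level achieved.

    replica_nodes: nodes holding a replica of the task's input block
    free_nodes:    set of nodes that currently have a free map slot
    rack_of:       dict mapping node name -> rack name
    """
    # 1. Data-local: a node that stores the block and has a free slot.
    for node in replica_nodes:
        if node in free_nodes:
            return node, "data-local"
    # 2. Rack-local: a free node in the same rack as one of the replicas.
    replica_racks = {rack_of[node] for node in replica_nodes}
    for node in sorted(free_nodes):
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Off-rack: any free node; the data must cross racks.
    for node in sorted(free_nodes):
        return node, "off-rack"
    return None, "none"


if __name__ == "__main__":
    racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2", "n5": "r3"}
    replicas = ["n1", "n3"]
    print(pick_node(replicas, {"n1", "n5"}, racks))
    print(pick_node(replicas, {"n2", "n5"}, racks))
    print(pick_node(replicas, {"n5"}, racks))
```

Each step down the list costs more network bandwidth, which is why the scheduler only falls back to it when the cheaper option is unavailable.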

(129)

MapReduce

Map

Map task will write its output to the local disk

Map task output is not the final output; it is only the intermediate output

Reduce

Map task output is processed by Reduce Tasks to produce the final output

Reduce Task output is stored in HDFS

For a completed job, the Map Task output can be discarded

(130)

MapReduce

Single Reduce Task

Node includes Split, Map, Sort, and Output unit

Light blue arrows show data transfers in a node

Black arrows show data transfers between nodes

(131)

MapReduce

Single Reduce Task

Number of reduce tasks is specified independently, and is not based on the size of the input

(132)

MapReduce

Combiner Function

User specified function that runs on the Map output → Its output forms the input to the Reduce function

Specifically designed to minimize the data transferred between Map Tasks and Reduce Tasks

Helps with the problem of limited network bandwidth on the cluster and reduces the time needed to complete MapReduce jobs
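The effect of a combiner can be sketched in plain Python for a counting job. The combine function below is a hypothetical stand-in for a user-specified combiner; it pre-aggregates one map task's output locally so that fewer key-value pairs need to cross the network.

```python
from collections import defaultdict


def combine(map_output):
    """Sum the values per key within a single map task's output."""
    partial = defaultdict(int)
    for key, value in map_output:
        partial[key] += value
    return sorted(partial.items())


if __name__ == "__main__":
    # Output of one map task before combining: one pair per occurrence.
    map_output = [("cloud", 1), ("data", 1), ("cloud", 1), ("cloud", 1)]
    combined = combine(map_output)
    print(f"{len(map_output)} pairs before, {len(combined)} pairs after: {combined}")
```

This works for summation because partial sums can be merged in any order and grouping. Operations without that property, such as a plain mean, cannot be used as a combiner unchanged.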

(133)

MapReduce

Multiple Reducers

Map tasks partition their output, each creating one partition for each reduce task

Each partition may contain many keys and their associated values

All records for a key are kept in a single partition
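One common way to guarantee that all records for a key land in the same partition is to hash the key and take the result modulo the number of reduce tasks. The sketch below assumes that approach and uses CRC32 as a deterministic stand-in hash; the function names are hypothetical and this is not Hadoop's own partitioner code.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Map a key to a partition index in the range [0, num_reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_output(pairs, num_reducers):
    """Split one map task's output into one list per reduce task."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    return partitions


if __name__ == "__main__":
    pairs = [("cloud", 1), ("data", 1), ("cloud", 1), ("hadoop", 1), ("data", 1)]
    for index, part in enumerate(partition_output(pairs, 3)):
        print(f"partition {index}: {part}")
```

Since the partition index depends only on the key, every map task sends a given key to the same reduce task, while a single partition can still hold several different keys.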

(134)

MapReduce

Multiple Reducers

Shuffle process is used in the data flow between the Map tasks and Reduce tasks

(135)

MapReduce

Zero Reducer

Zero reducer uses no shuffle process

Applied when all of the processing can be carried out in parallel by the Map tasks


(140)

HDFS

Big Data


(142)

HDFS

HDFS: Hadoop Distributed FileSystem

DFS (Distributed FileSystem) is designed for storage management across a network of computers

HDFS is optimized to store very large files (up to terabytes in size) with streaming data access patterns

(143)

HDFS

HDFS: Hadoop Distributed FileSystem

HDFS was designed to be optimal in performance for a WORM (Write Once, Read Many times) pattern

HDFS is designed to run on clusters of general computers & servers from multiple vendors

(144)

HDFS

HDFS Characteristics

HDFS is optimized for large scale and high throughput data processing

HDFS does not perform well in supporting applications that require low-latency data access (e.g., in the tens of milliseconds range)

(145)

HDFS

Blocks

Files in HDFS are divided into block size chunks → 64 Megabyte default block size

A block is the minimum amount of data that HDFS can read or write

Blocks simplify the storage and replication process → Provides fault tolerance & processing speed enhancement for larger files
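The block arithmetic can be sketched as follows. The 64 MB block size comes from the slide above; the replication factor of 3 is an assumption used for illustration, and both helper names are hypothetical.

```python
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB   # default block size quoted above
REPLICATION = 3        # assumed replication factor, for illustration only


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file (ceiling division)."""
    return -(-file_size // block_size)


def raw_storage(file_size, replication=REPLICATION):
    """Total bytes stored across the cluster when every block is replicated."""
    return file_size * replication


if __name__ == "__main__":
    size = 1024 * MB  # a 1 GB file
    print(f"{block_count(size)} blocks, "
          f"{raw_storage(size) // MB} MB of raw storage with {REPLICATION} replicas")
```

Because each block is an independent unit, the blocks of one large file can be stored and replicated on different datanodes and read in parallel.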

(146)

HDFS

HDFS

HDFS clusters use 2 types of nodes

Namenode (master node)

Datanode (worker node)

(147)

HDFS

Namenode

Manages the filesystem namespace

Namenode keeps track of the datanodes that have blocks of a distributed file assigned

Maintains the filesystem tree and the metadata for all the files and directories in the tree

Stores this information on the local disk in 2 files

Namespace Image

Edit Log

(148)

HDFS

Namenode

Namenode holds the filesystem metadata in its memory

Namenode’s memory size determines the limit to the number of files in a filesystem
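The relation between namenode memory and the number of files can be illustrated with a rough estimate. The sketch assumes a commonly quoted rule of thumb of about 150 bytes of namenode memory per file, directory, or block object; that figure and the helper name are assumptions, not values from the slides.

```python
BYTES_PER_OBJECT = 150  # assumed rule-of-thumb memory cost per namespace object


def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Estimate namenode memory for files that each occupy a given number of blocks.

    Each file contributes one file object plus one object per block.
    Directory objects are ignored to keep the estimate simple.
    """
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    for files in (1_000_000, 100_000_000):
        mem_mb = namenode_memory_bytes(files) / 1e6
        print(f"{files:,} single-block files: about {mem_mb:,.0f} MB of namenode memory")
```

Under this assumption, memory use grows linearly with the file count, so many small files are more costly for the namenode than a few large ones holding the same data.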

But then, what is Metadata?

(149)

HDFS

Metadata

Similar in concept to a traditional library card catalog

Categorizes and describes the contents and context of the data files

Maximizes the usefulness of the original data file by making it easy to find and use

(150)

HDFS

Metadata

Types

Structural Metadata

Focuses on the data structure's design and specification

Descriptive Metadata

Focuses on the individual instances of application data or the data content
