Cloud Introduction
Cloud Computing
What does Cloud Computing do?
• Provides online data storage
• Enables configuration of and access to online applications
• Provides access to a variety of software
• Provides computing platform and computing infrastructure
Application Example
• Using Gmail on my smartphone to check e-mails
• Receive an e-mail with an MS PowerPoint attachment file
• However, MS PowerPoint and Windows OS are not installed on my smartphone!
• Google Drive service’s Google Docs, Sheets, and Slides can be used to open the file
What is a Cloud?
• A cloud can provide services through a public or private network or the Internet, where the service hosting system is at a remote location
• A cloud can support various applications
• E-mail, Web Conferencing, Games, Database Management, CRM (Customer Relationship Management), etc.
Cloud Models
• Public Cloud
˗ Enables public systems and service access
˗ Open architecture (e.g., e-mail)
˗ Could be less secure due to openness
• Private Cloud
˗ Enables service access within an organization
˗ Due to its private nature, it is more secure
• Community Cloud
˗ Cloud accessible by a group of organizations
• Hybrid Cloud
˗ Hybrid Cloud = Public Cloud + Private Cloud
˗ Private cloud supports critical activities
˗ Public cloud supports non-critical activities
Cloud Service Models
• SaaS: Software as a Service
• PaaS: Platform as a Service
• IaaS: Infrastructure as a Service
The lower service model supports the management, computing power, and security of its upper service model
Software as a Service (SaaS)
• Provides a variety of software applications as a service to end users
Platform as a Service (PaaS)
• Provides a platform on which applications can run, along with development tools, etc.
Infrastructure as a Service (IaaS)
• Provides the fundamental computing and security resources for the entire cloud
• Backup storage, computing power, VM (Virtual Machines), etc.
• There are many other service models
• XaaS = Anything as a Service
• NaaS: N for Network as a Service
• DaaS: D for Database as a Service
• BaaS: B for Business as a Service
• etc.
Cloud Benefits
Characteristics
REFERENCES
• K. Kumar and Y. H. Lu, “Cloud Computing for Mobile Users: Can Offloading Computation Save Energy?,” Computer, vol. 43, no. 4, pp. 51–56, Apr. 2010.
• Wikipedia, http://www.wikipedia.org
• Apple, iCloud, https://www.icloud.com
• Google, Google Cloud, https://cloud.google.com/products [Accessed June 1, 2015]
• Virtualization, Cisco’s IaaS cloud, http://www.virtualization.co.kr/data/file/01_2/1889266503_6f489654_1.jpg [Accessed June 1, 2015]
• Tutorialspoint, Cloud computing, http://www.tutorialspoint.com/cloud_computing/cloud_computing_tutorial.pdf [Accessed June 1, 2015]
Image sources
• AWS Simple Icons Storage Amazon S3 Bucket with Objects, By Amazon Web Services LLC [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
• iCloud Logo, By EEIM (Own work) [Public domain], via Wikimedia Commons
• MobileMe Logo, By Apple Inc. [Public domain], via Wikimedia Commons
Cloud Service Models
IaaS
IaaS (Infrastructure as a Service)
• Infrastructure support over the Internet
• Cloud’s Computing & Storage Resources
• Computing Power
• Storage Services
• Software Packages & Bundles
• VLAN (Virtual Local Area Network)
• VM (Virtual Machine) Features
VM (Virtual Machine) Administration
• IaaS enables control of computing resources through administrative access to VMs
→ Server virtualization features
• Access to computing resources is enabled by administrative access to VMs
• VM Administrative Command examples
• Save data on cloud server
• Start web server
• Install new application
IaaS Procedures
IaaS Benefits
• Flexible and Efficient Renting of Computer & Server Hardware
• Rentable Resources
• VM, Storage, Bandwidth, IP Addresses, Monitoring Services, Firewalls, etc.
• Rent Payment Basis
• Resource type
• Usage time
• Service packages
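The rent payment basis above (resource type and usage time) can be illustrated with a small sketch. The resource names and hourly rates are made-up placeholders for illustration, not any provider's actual pricing.

```python
# Hypothetical hourly rates per resource type (illustrative values only).
HOURLY_RATE = {"vm": 0.10, "storage_gb": 0.001, "ip_address": 0.005}

def rental_cost(usage):
    """Sum the cost of each rented resource: quantity x hours x hourly rate.

    usage: list of (resource_type, quantity, hours) tuples.
    """
    return sum(HOURLY_RATE[kind] * quantity * hours
               for kind, quantity, hours in usage)

# Two VMs for 10 hours plus 100 GB of storage for 24 hours.
bill = rental_cost([("vm", 2, 10), ("storage_gb", 100, 24)])
print(f"Estimated bill: ${bill:.2f}")
```

A service package could be modeled in the same way as a fixed bundle of such tuples with a discounted rate.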
• Portability & Interoperability with Legacy Applications
• Enables portability based on infrastructure resources that are used through Internet connections
• Enables a method to maintain interoperability with legacy applications and workloads between IaaS clouds
PaaS
PaaS (Platform as a Service)
• Provides development & deployment tools for application development
• Provides a runtime environment for apps
PaaS Types
• Application Delivery-Only Environment
• Provides on-demand scaling & application security
• Stand-Alone Development Environment
• Provides an independent platform for a specific function
• Open Platform as a Service
• Provides open source software that helps PaaS providers run applications
• Add-On Development Facilities
• Enables customization of existing SaaS platforms
PaaS Benefits
• Lower Administrative Overhead
• User does not need to be involved in any administration of the platform
• Lower Total Cost of Ownership
• User does not need to purchase any hardware, memory, or server
• Scalable Solutions
• Resources are scaled automatically based on the application's resource demand
• More Current System Software
• The cloud provider is responsible for maintaining software upgrades & patch installations
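The demand-based automatic scaling mentioned above can be sketched as a simple rule. This is a minimal illustration under assumed parameters (the per-instance capacity and the instance limits are hypothetical), not how any specific PaaS provider implements scaling.

```python
import math

def instances_needed(requests_per_second, capacity_per_instance,
                     min_instances=1, max_instances=10):
    """Return how many instances to run for the current demand.

    The raw requirement is rounded up, then clamped to the allowed range.
    """
    wanted = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, min(max_instances, wanted))

for load in (50, 450, 5000):
    print(load, "req/s ->", instances_needed(load, capacity_per_instance=100))
```

The upper limit keeps cost bounded when demand spikes, and the lower limit keeps the application reachable when demand is near zero.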
SaaS
SaaS (Software as a Service)
• Provides software applications as a service to the user
• Software that is deployed on a cloud server which is accessible through the Internet
Characteristics
• On Demand Availability
• Cloud software is available anywhere that the cloud is reachable via Internet
• Easy Maintenance
• No user software upgrade or maintenance needed
→ All supported by the cloud
• Flexible Scale Up or Scale Down
• Centralized Management & Data
• Enables a Shared Data Model
• Multiple users can share a single data model and database
• Cost Effectiveness
• Pay based on usage
• No risk in buying the wrong software
• Multitenant Programming Solutions
• Multiple programmers are guaranteed to use the same software version
→ No version mismatch problems
Open SaaS Applications
Cloud Services
Google Cloud
• Google App Engine
˗ Released as a preview in April 2008
˗ PaaS (Platform as a Service) for web applications
˗ Provides automatic scaling based on resource demands and server load
• Google Cloud Storage
˗ Launched in May 2010
˗ Online file storage service
• Google BigQuery
˗ Released in April 2012
˗ Data analysis tool that uses SQL-like queries to process big datasets in seconds
• Google Compute Engine
˗ Released in June 2012
˗ IaaS (Infrastructure as a Service) support to enable on-demand launching of VMs (Virtual Machines)
• Google Cloud Endpoints
˗ Released in November 2013
˗ Tool to create services inside App Engine
˗ Easily connects from Android, iOS, and JavaScript clients
• Google Cloud DNS (Domain Name System)
˗ DNS service supported by the Google Cloud
• Google Cloud Datastore
˗ NoSQL (non-relational; often read as “Not only SQL”) data storage
• Google Cloud SQL (Structured Query Language)
˗ Released in February 2014 as GA (General Availability)
˗ Fully managed MySQL database
Amazon S3 (Simple Storage Service)
• Online file storage web service offered by Amazon Web Services
• Public web service released in the United States in March 2006 and in Europe in November 2007
• Provides storage through web services interfaces (REST, SOAP, and BitTorrent)
Amazon Cloud Drive
• Amazon Cloud Drive was released in March 2011
• Web storage application from Amazon
• Storage Space Characteristics
˗ Can be accessed from up to eight specific devices (e.g., mobile devices & different computers) and by using different browsers on the same computer
• Cloud Player (Originally bundled)
˗ Users can play music in their Cloud Drive from any computer or Android device
˗ Music browsing based on song titles, albums, artists, genres (website only), and playlists
Amazon Cloud Drive Options
• Unlimited Photos
˗ Unlimited storage for photos & raw data files
˗ 5 gigabytes of video storage
• Unlimited Everything
˗ Unlimited storage for photos, videos, documents, and various file types
iCloud
• Developed by Apple, Inc.
• Public release in October 2011
• Cloud Storage & Cloud Computing
• Operating system
˗ OS X (10.7 Lion or later)
˗ Microsoft Windows 7 or later
˗ iOS 5 or later
iCloud replaces MobileMe
• MobileMe was a subscription-based collection of Apple’s online services and software
• MobileMe was replaced by iCloud
• MobileMe ceased services in June 2012
• MobileMe users were allowed to transfer to iCloud until July 2012
iCloud Features
• Email, Contacts, and Calendars
• Find My Friends
• Backup & Restore
˗ Backup feature for device settings & data
˗ iOS 5 or later required
• Find My iPhone
˗ Enables a user to track the location of an iOS device or Mac
˗ Formerly a feature of MobileMe
˗ Can manage lost or stolen Apple devices
• Back to My Mac
˗ Enables remote login to other computers that have Back to My Mac installed (using the same Apple ID)
• iWork for iCloud
˗ Apple's iWork suite (Pages, Numbers, and Keynote) made available on a web interface
• Photo Stream
˗ Can store the most recent 1,000 photos
˗ Free storage for up to 30 days
• iCloud Photo Library
˗ Stores all photos at original resolution
˗ Stores photo metadata
• Storage (Introduced in 2011)
˗ 5 GB of free storage per account
• iCloud Drive
˗ Can save photos, videos, documents, and apps
• iCloud Keychain
˗ Secure database for website and Wi-Fi passwords
˗ Secure credit card & debit card management for quick access and auto-fill
• iTunes Match
˗ Scans the iTunes music library and matches tracks
˗ Serves tracks copied from CDs or other sources
Big Data Examples
Big Data
New FLU Virus Starts in the U.S.!
• The H1N1 flu virus (which combined elements of the bird flu and swine (pig) flu viruses) started to spread in the U.S. in 2009
• The U.S. CDC (Centers for Disease Control and Prevention) was only collecting diagnostic data from medical doctors once a week
• Using the CDC information to find how the flu was spreading would have an approximate 2-week lag, which is far too slow compared to the speed of the virus spreading
• What vaccine was needed?
• How much vaccine was needed?
• Where was the vaccine needed?
• Vaccine preparation and delivery plans could not be set up fast enough to safely prevent the virus from spreading out of control
• Fortunately, Google published a paper about how they could predict the spread of the winter flu in the U.S. accurately down to specific regions and states
• This paper was published in the journal Nature a few weeks before the H1N1 virus made the headline news
• Millions of the most common search terms and millions of different mathematical models were tested on Google’s database
• Google receives more than 3 billion search queries a day
• The analysis system was set to look for correlation between the frequency of certain search queries and the spread of the flu over time and space
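The idea of looking for correlation between search query frequency and flu spread can be sketched with a Pearson correlation coefficient. This is only an illustration of the principle: the weekly counts below are made up, and the slides do not state which statistical models Google actually used.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Made-up weekly counts, for illustration only.
flu_cases = [10, 20, 35, 50, 40]
query_counts = {
    "flu symptoms": [100, 210, 340, 520, 390],
    "movie times": [300, 280, 310, 290, 305],
}

# Rank search terms by how closely their frequency tracks the flu cases.
ranked = sorted(query_counts,
                key=lambda q: pearson(query_counts[q], flu_cases),
                reverse=True)
print(ranked[0])
```

Search terms whose frequency rises and falls with the case counts score close to 1, while unrelated terms score near 0.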
• Google’s method of analysis did not use data provided from hospitals or Medical Doctors
• Google used Big Data analysis on the most common search terms people use
• Google’s system proved to be more accurate and faster than analyzing government statistics
Wal-Mart
• Wal-Mart’s Data Warehouse
• Stores 4 petabytes (4 × 10^15 bytes) of data
• Records every single purchase
• Approximately 267 million transactions a day from 6,000 stores worldwide are recorded
• Wal-Mart’s Data Analysis
• Focused on evaluating the effectiveness of pricing strategies and advertising campaigns
• Seeking improvements in inventory management and supply chains
Recommendation System using Big Data
• Based on data analysis of simple elements
• What purchases users made in the past
• Which items they have in their virtual shopping cart
• Which items customers rated and liked
• What influence the ratings had on other customers' purchases
Amazon.com
• Amazon.com’s Recommendation System
• Item-to-Item Collaborative Filtering Algorithm
• Personalization of the Online Store
→ Customized to each customer
• Each customer’s store is based on the customer’s personal interest
• Example: For a new mother, the store will display baby supplies and toys
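A simplified sketch in the spirit of item-to-item collaborative filtering is shown below: items are compared by the customers who bought them, and the items most similar to a given item are recommended. The purchase data and function names are hypothetical, and this in-memory version leaves out the offline precomputation a production system would need.

```python
from math import sqrt

# Hypothetical purchase data: customer -> set of purchased items.
purchases = {
    "alice": {"diapers", "baby wipes", "toy"},
    "bob": {"diapers", "baby wipes"},
    "carol": {"laptop", "mouse"},
    "dave": {"diapers", "toy"},
}

def item_vectors(purchases):
    """Invert the data: item -> set of customers who bought it."""
    vectors = {}
    for customer, items in purchases.items():
        for item in items:
            vectors.setdefault(item, set()).add(customer)
    return vectors

def cosine(a, b):
    """Cosine similarity of two binary vectors represented as sets."""
    return len(a & b) / sqrt(len(a) * len(b))

def similar_items(item, purchases):
    """Rank other items by similarity to `item`, most similar first."""
    vectors = item_vectors(purchases)
    scores = {other: cosine(vectors[item], customers)
              for other, customers in vectors.items() if other != item}
    # Keep only items that share at least one customer; break ties by name.
    return sorted((o for o in scores if scores[o] > 0),
                  key=lambda o: (-scores[o], o))

print(similar_items("diapers", purchases))
```

A customer who buys diapers would be shown baby wipes and toys, which matches the new-mother example above.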
Citibank
• Bank operations in 100 countries
• Big Data analysis on the database of basic financial transactions can enable global insight on investments, market changes, trade patterns, and economic conditions
• Many companies (e.g., Zara, H&M, etc.) work with Citibank to locate new stores and factories
Product Development & Sales
• For example, a smartphone takes significant time and money to manufacture
• In addition, the duration of popularity for a new smartphone is limited
• To maximize sales, a company needs to manufacture just the right amount of products and sell them in the right locations
• Too much will result in leftovers and a big waste for the company!
• Too little will result in a lost opportunity for company profit and growth!
• Big Data analysis can help find how many smartphones could be popular, and where, based on common search terms that people use
→ Use this to also estimate how many products could be sold in a certain location
→ But why is this difficult?
REFERENCES
• V. Mayer-Schönberger and K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt, 2013.
• T. White, Hadoop: The Definitive Guide. O'Reilly Media, 2012.
• J. Venner, Pro Hadoop. Apress, 2009.
• S. LaValle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, “Big Data, Analytics and the Path From Insights to Value,” MIT Sloan Management Review, vol. 52, no. 2, Winter 2011.
• R. E. Bryant, R. H. Katz, and E. D. Lazowska, “Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science and Society,” Computing Community Consortium, pp. 1-15, Dec. 2008.
• G. Linden, B. Smith, and J. York, “Amazon.com Recommendations: Item-to-Item Collaborative Filtering,” IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan/Feb. 2003.
• J. R. Galbraith, “Organizational Design Challenges Resulting From Big Data,” Journal of Organization Design, vol. 3, no. 1, pp. 2-13, Apr. 2014.
• S. Sagiroglu and D. Sinanc, “Big data: A review,” Proc. IEEE International Conference on Collaboration Technologies and Systems, pp. 42-47, May 2013.
• M. Chen, S. Mao, and Y. Liu, “Big Data: A Survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171-209, Jan. 2014.
• X. Wu, X. Zhu, G. Q. Wu, and W. Ding, “Data Mining with Big Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97–107, Jan. 2014.
• Z. Zheng, J. Zhu, and M. R. Lyu, “Service-Generated Big Data and Big Data-as-a-Service: An Overview,” Proc. IEEE International Congress on Big Data, pp. 403–410, Jun/Jul. 2013.
• I. Palit and C. K. Reddy, “Scalable and Parallel Boosting with MapReduce,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 10, pp. 1904-1916, 2012.
• M.-Y. Choi, E.-A. Cho, D.-H. Park, C.-J. Moon, and D.-K. Baik, “A Database Synchronization Algorithm for Mobile Devices,” IEEE Transactions on Consumer Electronics, vol. 56, no. 2, pp. 392-398, May 2010.
• IBM, What is big data?, http://www.ibm.com/software/data/bigdata/what-is-big-data.html [Accessed June 1, 2015]
• Hadoop Apache, http://hadoop.apache.org
• Wikipedia, http://www.wikipedia.org
Image sources
• Walmart Logo, By Walmart [Public domain], via Wikimedia Commons
• Amazon Logo, By Balajimuthazhagan (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Big Data's 4 Vs
Big Data’s 4 V Big Challenges
• Volume – Data Size
• Variety – Data Formats
• Velocity – Data Streaming Speeds
• Veracity – Data Trustworthiness
Volume – Data Size
• 40 zettabytes (40 × 10^21 bytes) of data is predicted to be created by 2020
• 2.5 quintillion bytes (2.5 × 10^18 bytes) of data are created every day
• 6 billion (6 × 10^9) people have mobile phones
• 100 terabytes (100 × 10^12 bytes) of data (at least) is stored by most U.S. companies
• 966 petabytes (966 × 10^15 bytes) was the approximate storage size of the American manufacturing industry in 2009
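The unit prefixes used for the volume figures above are powers of ten, so converting between them is simple arithmetic. The helper names below are illustrative.

```python
# Decimal (SI) byte units used on the slides, as powers of ten.
UNITS = {"terabyte": 10**12, "petabyte": 10**15,
         "exabyte": 10**18, "zettabyte": 10**21}

def to_bytes(amount, unit):
    """Convert an amount in the given unit to bytes."""
    return amount * UNITS[unit]

def convert(amount, from_unit, to_unit):
    """Convert an amount from one unit to another."""
    return to_bytes(amount, from_unit) / UNITS[to_unit]

# 2.5 quintillion bytes per day expressed in exabytes.
daily_bytes = 2.5 * 10**18
print(daily_bytes / UNITS["exabyte"], "exabytes per day")
print(convert(40, "zettabyte", "exabyte"), "exabytes in 40 zettabytes")
```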
Variety – Data Formats
• 150 exabytes (150 × 10^18 bytes) was the estimated size of data for health care throughout the world in 2011
• More than 4 billion (4 × 10^9) hours each month are spent watching YouTube
• 30 billion pieces of content are exchanged every month on Facebook
• 200 million monthly active users exchange 400 million tweets every day
Velocity – Data Streaming Speeds
• 1 terabyte (10^12 bytes) of trade information is exchanged during every trading session at the New York Stock Exchange
• Approximately 100 sensors are installed in modern cars to monitor fuel level, tire pressure, etc.
• 18.9 billion network connections are predicted to exist by 2016
Veracity – Data Trustworthiness
• 1 out of 3 business leaders has experienced trust issues with their data when trying to make a business decision
• $3.1 trillion ($3.1 × 10^12) a year is estimated to be wasted in the U.S. economy due to poor data quality
New technology is needed to overcome these 4 V Big Data Challenges
• Volume – Data Size
• Variety – Data Formats
• Velocity – Data Streaming Speeds
• Veracity – Data Trustworthiness
HADOOP
Hadoop
Data Storage, Access, and Analysis
• Hard drive storage capacity has tremendously increased
• But the data read and write speeds to and from the hard drives have not significantly improved yet
• Simultaneous parallel read and write of data with multiple hard disks requires advanced technology
• Challenge 1: Hardware Failure
˗ When using many computers for data storage and analysis, the probability that at least one computer will fail is very high
• Challenge 2: Cost
˗ To avoid data loss or computed analysis information loss, using backup computers and memory is needed, which helps the reliability, but is very expensive
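The hardware failure challenge can be made concrete with a short calculation. As a sketch, assume each computer fails independently with the same probability p during some period; the value of p used below is hypothetical.

```python
def prob_any_failure(n_machines, p_fail):
    """Probability that at least one of n independent machines fails.

    This is 1 minus the probability that every machine survives.
    """
    return 1 - (1 - p_fail) ** n_machines

# Hypothetical per-machine failure probability of 0.1% for the period.
for n in (1, 100, 1000):
    print(n, "machines:", round(prob_any_failure(n, 0.001), 3))
```

Even with a very reliable individual machine, the chance that something in a large cluster fails grows quickly with the number of machines, which is why the storage system itself has to tolerate failures.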
• Challenge 3: Combining Analyzed Data
˗ Combining the analyzed data is very difficult
˗ If one part of the analyzed data is not ready, then the overall combining process has to be delayed
˗ If one part has errors in its analysis, then the overall combined result may be unreliable and useless
• Hadoop is a Reliable Shared Storage and Analysis System
• Hadoop = HDFS + MapReduce + α
˗ HDFS provides Data Storage
˗ HDFS: Hadoop Distributed FileSystem
˗ MapReduce provides Data Analysis
˗ MapReduce = Map Function + Reduce Function
HDFS: Hadoop Distributed FileSystem
• DFS (Distributed FileSystem) is designed for storage management of a network of computers
• HDFS is optimized to store huge files with streaming data access patterns
• HDFS is designed to run on clusters of general-purpose (commodity) computers
• HDFS was designed to be optimal in performance for a WORM (Write Once, Read Many times) pattern, which is a very efficient data processing pattern
• HDFS was designed considering the time to read the whole dataset to be more important than the time required to read the first record
HDFS
• HDFS clusters use 2 types of nodes
• Namenode (master node)
• Datanode (worker node)
HDFS: Namenode
• Manages the filesystem namespace
• Maintains the filesystem tree and the metadata for all the files and directories in the tree
• Stores this information on the local disk in 2 files
• Namespace Image
• Edit Log
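The relationship between the namespace image and the edit log can be sketched as a checkpoint plus a list of changes that is replayed on top of it. This is a toy model with made-up operation names and metadata; it does not reflect the actual on-disk formats used by HDFS.

```python
def replay(namespace_image, edit_log):
    """Apply logged operations to a checkpoint to rebuild the current namespace."""
    namespace = dict(namespace_image)  # path -> metadata; checkpoint stays untouched
    for op, path, meta in edit_log:
        if op == "create":
            namespace[path] = meta
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace

# Checkpoint taken earlier, and the changes recorded since then.
image = {"/data/a.txt": {"size": 10}}
log = [("create", "/data/b.txt", {"size": 20}),
       ("delete", "/data/a.txt", None)]
print(replay(image, log))
```

Appending to a log is cheap, so changes are recorded there first, and the full image only needs to be rewritten occasionally.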
HDFS: Datanodes
• Workhorse of the filesystem
• Store and retrieve blocks when requested by the client or the namenode
• Report back to the namenode periodically with lists of blocks that were stored
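The periodic block reports can be sketched as follows: each datanode sends the list of blocks it stores, and the namenode uses these lists to know where every block lives. The class and identifiers below are hypothetical and greatly simplified compared with real HDFS.

```python
from collections import defaultdict

class ToyNamenode:
    """Tracks which datanodes hold each block, based on their reports."""

    def __init__(self):
        self.block_locations = defaultdict(set)  # block id -> datanode ids

    def receive_block_report(self, datanode_id, block_ids):
        # Drop stale entries for this datanode, then record the fresh list.
        for holders in self.block_locations.values():
            holders.discard(datanode_id)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode_id)

    def locate(self, block_id):
        """Return the datanodes currently known to hold the block."""
        return sorted(self.block_locations.get(block_id, ()))

nn = ToyNamenode()
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_2"])
print(nn.locate("blk_2"))
```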
MapReduce
• MapReduce is a programming model that abstracts the analysis problem from the stored data
• MapReduce transforms the analysis problem into a computation process that uses a set of keys and values
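The key and value based computation can be illustrated with the classic word count example. The sketch below runs in a single process and only imitates the map, shuffle, and reduce phases; it does not use the Hadoop API, and the function names are illustrative.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: emit (word, 1) for every word in an input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: combine all values that share a key."""
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            results[out_key] = out_value
    return results

# Input records are (line number, line text) pairs.
lines = [(0, "big data big cloud"), (1, "cloud data")]
print(run_mapreduce(lines, map_fn, reduce_fn))
```

Because each map call looks at one record and each reduce call looks at one key, the work can be spread across many computers.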
MapReduce System Architecture
• MapReduce was designed for tasks that take several minutes or hours on a set of dedicated, trusted computers connected by a high-speed broadband network and managed within a single data center
MapReduce Characteristics
• MapReduce uses a somewhat brute-force data analysis approach
• The entire dataset (or a big part of the dataset) is processed for every query
→ Batch Query Processor model
• MapReduce makes it possible to run an ad hoc query against the whole dataset within a reasonable time
• Many distributed systems combine data from multiple sources (which is very difficult), but MapReduce does this in a very effective and efficient way
Technical Terms used in MapReduce
• Seek Time is the delay in moving the disk head to the location on the disk where the data is read or written
• Transfer Rate is the speed at which data is streamed to or from the disk (the disk's bandwidth)
• Transfer Rate has improved significantly more (i.e., now has much faster transfer speeds) compared to improvements in Seek Time (i.e., still relatively slow)
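The effect of seek time versus transfer rate can be estimated with simple arithmetic. The sketch below assumes a 10 ms seek time and a 100 MB/s transfer rate; these are illustrative values, not measurements from the lecture.

```python
def read_time_seconds(total_mb, chunk_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Time to read total_mb when every chunk of chunk_mb costs one seek."""
    seeks = total_mb / chunk_mb
    return seeks * seek_ms / 1000.0 + total_mb / transfer_mb_per_s

one_tb = 1_000_000  # 1 TB expressed in MB

small_chunks = read_time_seconds(one_tb, chunk_mb=0.004)  # many small random reads
large_chunks = read_time_seconds(one_tb, chunk_mb=128)    # large sequential chunks
print(f"4 KB chunks:   {small_chunks / 3600:.1f} hours")
print(f"128 MB chunks: {large_chunks / 3600:.1f} hours")
```

With small chunks almost all of the time is spent seeking, while with large chunks the read time is dominated by the transfer rate.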
• MapReduce gains performance enhancement through optimal balancing of Seeking and Transfer operations
• Reduce Seek operations
• Effectively use Transfer operations
• In the next lecture, we will compare MapReduce with a traditional RDBMS (Relational Database Management System)
MapReduce vs. RDBMS
• RDBMS (Relational Database Management System) Characteristics
• RDBMS is good for updating a small proportion of a big database
• RDBMS uses a traditional B-Tree, which is highly dependent on the time required to perform seek operations
• MapReduce Characteristics
• MapReduce is good for updating all (or a majority) of a big database
• MapReduce uses Sort and Merge to rebuild the database, which depends more on transfer operations
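The two update styles can be contrasted in a small sketch. A Python dict stands in for the B-Tree index (an assumption made only to keep the example short), and the sort and merge rebuild is done in memory with `heapq.merge` rather than on disk.

```python
import heapq

def point_update(table, updates):
    """RDBMS-style: look up each key individually and change it in place."""
    for key, value in updates:
        table[key] = value  # one index lookup (seek) per updated record
    return table

def sort_merge_rebuild(sorted_old, updates):
    """MapReduce-style: sort the updates, then stream-merge them with the
    old sorted data to produce a completely new dataset in one pass."""
    merged = {}
    # On equal keys, heapq.merge yields the item from the earlier input first,
    # so the old record arrives first and the update overwrites it.
    for key, value in heapq.merge(sorted_old, sorted(updates),
                                  key=lambda kv: kv[0]):
        merged[key] = value
    return sorted(merged.items())

old = [(1, "a"), (2, "b"), (3, "c")]
new = sort_merge_rebuild(old, [(3, "C"), (2, "B"), (4, "d")])
print(new)
```

When only a few records change, the point updates touch very little data; when most records change, one sequential pass over everything avoids paying a seek per record.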
• RDBMS is good for applications that require the datasets of the database to be very frequently updated (e.g., point queries or small dataset updates)
• MapReduce is better for WORM (Write Once, Read Many times) based data applications
• MapReduce is a complementary system to RDBMS
Property | RDBMS | MapReduce
Data Size | Gigabytes (10^9 bytes) | Petabytes (10^15 bytes)
Access | Interactive & Batch | Batch
Updates | Read & Write Many Times | WORM (Write Once, Read Many Times)
Data Structure | Static Schema | Dynamic Schema
Integrity | High | Low
Scalability | Nonlinear | Linear
MapReduce vs. RDBMS: Data Types
• Structured Data: Data that has a formal defined structure (e.g., XML documents or database tables)
• Semi-Structured Data: Data that has a looser format where the data structure is used as a guide and may be ignored
• Unstructured Data: Data that does not have any formal structure (e.g., plain text or image data)
• MapReduce is very effective on unstructured and semi-structured data
• Why?
• MapReduce interprets data during the data processing sessions
• The input keys and input values used by MapReduce are not intrinsic properties of the data; they are selected by the person analyzing the data
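Interpreting the data at processing time can be sketched with raw log lines that two analysts read in different ways. The log format and function names below are hypothetical.

```python
def parse_by_status(line):
    """One analyst's choice: key = HTTP status, value = 1 (count requests)."""
    fields = line.split()
    return fields[2], 1

def parse_by_path(line):
    """Another choice over the same raw text: key = path, value = bytes sent."""
    fields = line.split()
    return fields[1], int(fields[3])

def aggregate(lines, parse):
    """Sum the values per key, using whichever interpretation is supplied."""
    totals = {}
    for line in lines:
        key, value = parse(line)
        totals[key] = totals.get(key, 0) + value
    return totals

# Hypothetical raw log lines: "<method> <path> <status> <bytes>".
raw = ["GET /index.html 200 512", "GET /missing 404 0", "GET /index.html 200 512"]

print(aggregate(raw, parse_by_status))
print(aggregate(raw, parse_by_path))
```

The stored text never changes; only the parsing function decides what counts as a key and what counts as a value.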
MapReduce vs. RDBMS: Scalability
• MapReduce has a programming model that is linearly scalable
• MapReduce Functions: 2 types
• Map function
• Reduce function
• Both of these functions define a Key-Value pair mapping relation (e.g., Key-Value pair 1 → Key-Value pair 2)
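The key-value mapping relation of the two functions is commonly written in the following general form, where the symbols K1, V1, K2, V2, K3, and V3 are generic type names introduced here for illustration:

```latex
\begin{align*}
\text{map}    &: (K_1, V_1) \rightarrow \text{list}(K_2, V_2) \\
\text{reduce} &: (K_2, \text{list}(V_2)) \rightarrow \text{list}(K_3, V_3)
\end{align*}
```

The output key and value types of the map function must match the input types of the reduce function, since the framework groups all map output values by key before calling reduce.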
Hadoop Release Series
Feature | 1.x | 0.22 | 2.x
Secure authentication | Yes | No | Yes
Old configuration names | Yes | |
New configuration names | No | Yes | Yes
Old MapReduce API | Yes | Yes | Yes
New MapReduce API | Yes (with some missing libraries) | Yes | Yes
MapReduce 1 runtime (Classic) | Yes | Yes | No
MapReduce 2 runtime (YARN) | No | No | Yes
HDFS Federation | No | No | Yes
HDFS High-Availability | No | No | Yes
Release 2.6.0 became available Nov. 2014
• 2.x includes several major new features
• MapReduce 2 is the new MapReduce runtime implemented on a new system called YARN
• YARN
• Yet Another Resource Negotiator
• General resource management system for running distributed applications
• HDFS Federation partitions the HDFS namespace across multiple namenodes
• Enables improved support for clusters with very large numbers of files
• HDFS High-Availability feature uses standby
namenodes for backup, and therefore, the namenode is no longer a potential SPOF (Single Point of Failure)
Hadoop Release Series
MapReduce
Big Data
MapReduce
Hadoop
• Hadoop is a Reliable Shared Storage and Analysis System
• Hadoop = HDFS + MapReduce + α
˗ HDFS provides Data Storage
˗ HDFS: Hadoop Distributed FileSystem
˗ MapReduce provides Data Analysis
˗ MapReduce = Map Function + Reduce Function
MapReduce
Scaling Out
• Scaling out is done by the DFS (Distributed FileSystem), where the data is divided and stored across distributed computers & servers
• Hadoop stores the data in HDFS and moves the MapReduce computation to the distributed machines, each of which processes its assigned part of the divided data
MapReduce
Jobs
• A MapReduce job is a unit of work that needs to be executed
• A job consists of: input data, MapReduce program, configuration information, etc.
• Hadoop executes a job by dividing it into tasks of two types
• Map Task
• Reduce Task
MapReduce
Node types for Job execution
• Job execution is controlled by 2 types of nodes
• Jobtracker
• Tasktracker
• Jobtracker coordinates all jobs
• Jobtracker schedules all tasks and assigns the tasks to tasktrackers
MapReduce
• Tasktracker executes its assigned tasks
• Tasktracker sends progress reports to the Jobtracker
• Jobtracker keeps a record of the progress of all jobs executed
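The division of roles between the two node types can be sketched as follows. This is a toy, single-process model under stated assumptions: the class names, the round-robin assignment, and the single "done" report are all hypothetical simplifications and do not reflect how Hadoop actually schedules tasks.

```python
class TaskTracker:
    """Executes assigned tasks and reports progress back to the jobtracker."""

    def __init__(self, name):
        self.name = name

    def run(self, task_id, work, report):
        result = work()                      # execute the assigned task
        report(self.name, task_id, "done")   # send a progress report
        return result


class JobTracker:
    """Assigns tasks to tasktrackers and records the reported progress."""

    def __init__(self, trackers):
        self.trackers = trackers
        self.progress = {}  # task_id -> (tasktracker name, status)

    def record(self, tracker_name, task_id, status):
        self.progress[task_id] = (tracker_name, status)

    def run_job(self, tasks):
        results = {}
        for index, (task_id, work) in enumerate(tasks.items()):
            # Assumption: plain round-robin assignment, for illustration only.
            tracker = self.trackers[index % len(self.trackers)]
            results[task_id] = tracker.run(task_id, work, self.record)
        return results


jobtracker = JobTracker([TaskTracker("tt-1"), TaskTracker("tt-2")])
results = jobtracker.run_job({
    "map-0": lambda: len("split zero"),
    "map-1": lambda: len("split one"),
    "reduce-0": lambda: "final",
})
print(jobtracker.progress)
```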
MapReduce
Data flow
• Hadoop divides the input into input splits (or splits) suitable for the MapReduce job
• Each split has a fixed size
• Split size is commonly matched to the size of an HDFS block (64 MB by default) for maximum processing efficiency
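As a worked example of the arithmetic, the sketch below divides a file into fixed-size splits using the 64 MB block size quoted above. The function name `compute_splits` and the 200 MB file are hypothetical, and the sketch ignores any additional rules a real input format may apply to the last split.

```python
HDFS_BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted above


def compute_splits(file_size, split_size=HDFS_BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


splits = compute_splits(200 * 1024 * 1024)  # a hypothetical 200 MB input file
print(len(splits))                          # number of map tasks that would be created
print(splits[-1][1] // (1024 * 1024))       # size of the last split in MB
```

With these numbers the file yields three full 64 MB splits and one 8 MB remainder, so four map tasks would be created.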
MapReduce
Data flow
• Map Task is created for each split
• Map Task executes the map function for all records within the split
• Hadoop commonly executes the Map Task on the node where the input data resides
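MapReduce
Data flow
• Data-Local Map Task
• Data locality optimization: the map task reads its input from the local node and does not need to use the cluster network
• The data-local flow shows why the optimal split size equals the HDFS block size (64 MB): a block is the largest amount of input that is guaranteed to be stored on a single node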
MapReduce
Data flow
• Rack-Local Map Task
• All nodes hosting replicas of the HDFS block for a map task’s input split may already be running other map tasks
• The Job Scheduler then looks for a free map slot on a node in the same rack as one of the block replicas
[Figure legend: Map Task, HDFS Block, Node, Rack, Data Center]
MapReduce
Data flow
• Off-Rack Map Task
• Needed when the Job Scheduler cannot place the map task on a data-local or rack-local node
• Uses inter-rack network transfer
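The preference order across the three cases (data-local, then rack-local, then off-rack) can be sketched as a small selection function. The node names, rack names, and function names are hypothetical, and a real job scheduler considers more than locality.

```python
def classify_locality(candidate, replica_nodes, rack_of):
    """Return the locality level of running a map task on the candidate node."""
    if candidate in replica_nodes:
        return "data-local"
    replica_racks = {rack_of[node] for node in replica_nodes}
    if rack_of[candidate] in replica_racks:
        return "rack-local"
    return "off-rack"


def pick_node(free_nodes, replica_nodes, rack_of):
    """Pick the free node with the best locality; return (node, level) or None."""
    order = {"data-local": 0, "rack-local": 1, "off-rack": 2}
    ranked = sorted(
        free_nodes,
        key=lambda node: (order[classify_locality(node, replica_nodes, rack_of)], node),
    )
    if not ranked:
        return None
    best = ranked[0]
    return best, classify_locality(best, replica_nodes, rack_of)


# Hypothetical cluster layout: four nodes in three racks.
RACK_OF = {"n1": "rack-a", "n2": "rack-a", "n3": "rack-b", "n4": "rack-c"}
REPLICAS = {"n1", "n3"}  # nodes holding replicas of the split's HDFS block

print(pick_node({"n1", "n4"}, REPLICAS, RACK_OF))
print(pick_node({"n2", "n4"}, REPLICAS, RACK_OF))
print(pick_node({"n4"}, REPLICAS, RACK_OF))
```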
MapReduce
Map
• Map task writes its output to the local disk
• Map task output is not the final output; it is only the intermediate output
Reduce
• Map task output is processed by Reduce Tasks to produce the final output
• Reduce Task output is stored in HDFS
• Once a job is complete, the Map Task output can be discarded
MapReduce
Single Reduce Task
• Node includes Split, Map, Sort, and Output unit
• Light blue arrows show data transfers in a node
• Black arrows show data transfers between nodes
MapReduce
Single Reduce Task
• Number of reduce tasks is specified independently, and is not based on the size of the input
MapReduce
Combiner Function
• User-specified function that runs on the Map output → its output forms the input to the Reduce function
• Specifically designed to minimize the data transferred between Map Tasks and Reduce Tasks
• Mitigates the limited network bandwidth of the cluster and helps to reduce the time needed to complete MapReduce jobs
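A minimal sketch of the effect of a combiner, using a word count over two hypothetical splits. The function names are illustrative and not Hadoop's API. Note the assumption that makes this valid: summing is commutative and associative, so pre-aggregating per map task does not change the final result. Not every reduce function has this property.

```python
from collections import Counter


def map_task(lines):
    """Emit one (word, 1) pair per word in this task's split."""
    return [(word, 1) for line in lines for word in line.split()]


def combiner(pairs):
    """Pre-aggregate one map task's output locally, before it crosses the network."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_task(all_pairs):
    """Sum the counts per word across all map tasks."""
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)


split_a = ["to be or not to be"]
split_b = ["to do or not to do"]

raw = map_task(split_a) + map_task(split_b)
combined = combiner(map_task(split_a)) + combiner(map_task(split_b))

# Records that would be transferred without and with the combiner.
print(len(raw), len(combined))
```

In this toy input the combiner cuts the transferred records from 12 to 8 while the reduce output stays the same.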
MapReduce
Multiple Reducers
• Each map task partitions its output, creating one partition for each reduce task
• Each partition may contain many keys and their associated values
• All records for a given key are kept in a single partition
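The partitioning rule can be sketched with a hash of the key taken modulo the number of reduce tasks. This is an assumption-laden illustration: CRC32 is used here only as a stand-in hash that is deterministic across Python runs, and the function names are hypothetical.

```python
import zlib


def partition_for(key, num_reducers):
    """Map a key to a reduce task; the same key always gets the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_map_output(pairs, num_reducers):
    """Split one map task's output into one partition per reduce task."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions


map_output = [("apple", 1), ("pear", 1), ("apple", 1), ("plum", 1), ("pear", 1)]
partitions = partition_map_output(map_output, num_reducers=2)
for index, partition in enumerate(partitions):
    print(index, partition)
```

Because the partition depends only on the key, every map task sends a given key to the same reduce task, which is what keeps all records for that key together.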
MapReduce
Multiple Reducers
• Shuffle process is used in the data flow between the Map tasks and Reduce tasks
MapReduce
Zero Reducer
• Zero reducer uses no shuffle process
• Applied when all of the processing can be carried out in parallel Map tasks
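A map-only job can be sketched as follows, assuming a hypothetical per-record transformation (upper-casing each line). Threads stand in for distributed map tasks here, and the function names are illustrative. Each map task's output is kept as-is, with no shuffle and no reduce step.

```python
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """Process one split independently of all other splits."""
    return [line.upper() for line in split]


def run_map_only_job(splits):
    """Run one map task per split in parallel and keep one output per task."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(map_task, splits))


outputs = run_map_only_job([["a", "b"], ["c"]])
print(outputs)
```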
HDFS
Big Data
HDFS
HDFS
HDFS: Hadoop Distributed FileSystem
• DFS (Distributed FileSystem) is designed for storage management across a network of computers
• HDFS is optimized to store very large files (up to terabytes in size) with streaming data access patterns
HDFS
HDFS: Hadoop Distributed FileSystem
• HDFS was designed for optimal performance with a WORM (Write Once, Read Many times) pattern
• HDFS is designed to run on clusters of general-purpose (commodity) computers & servers from multiple vendors
HDFS
HDFS Characteristics
• HDFS is optimized for large-scale, high-throughput data processing
• HDFS does not perform well for applications that require low-latency data access (e.g., in the tens of milliseconds range)
HDFS
Blocks
• Files in HDFS are divided into block-size chunks → 64 Megabyte default block size
• A block is the minimum amount of data that HDFS can read or write
• Blocks simplify the storage and replication process → provides fault tolerance & processing speed enhancement for larger files
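How blocks support replication and fault tolerance can be sketched with a toy placement model. The replication factor of 3, the round-robin placement, the datanode names, and the function names are assumptions made for this example and are not given in the slides.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as stated above
REPLICATION = 3                # assumption: replication factor, not given in the slides


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks that store a file of file_size bytes."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` datanodes, round-robin.

    The replicas of a block are distinct only if replication <= len(datanodes).
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + i) % len(datanodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a live datanode."""
    return [b for b, nodes in placement.items() if set(nodes) - set(failed_nodes)]


blocks = split_into_blocks(150 * 1024 * 1024)  # a hypothetical 150 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(placement)
print(readable_blocks(placement, failed_nodes=["dn1", "dn2"]))
```

In this layout the file remains fully readable after two datanodes fail, because every block still has a replica on a live node.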
HDFS
HDFS
• HDFS clusters use 2 types of nodes
• Namenode (master node)
• Datanode (worker node)
HDFS
Namenode
• Manages the filesystem namespace
• Keeps track of which datanodes hold the blocks of each file
• Maintains the filesystem tree and the metadata for all the files and directories in the tree
• Stores the filesystem tree and metadata on the local disk in 2 files
• Namespace Image
• Edit Log
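The interplay of the namespace image and the edit log can be sketched with a toy model: changes are appended to the log, and a checkpoint folds the log into a new image. The class and method names are hypothetical, the two "files" are plain in-memory objects, and block-to-datanode locations are deliberately not modelled.

```python
class ToyNamenode:
    """Keeps the namespace in memory, backed by an image plus an edit log."""

    def __init__(self):
        self.namespace = {}        # in memory: path -> list of block ids
        self.namespace_image = {}  # stand-in for the on-disk namespace image
        self.edit_log = []         # stand-in for the on-disk edit log

    def add_file(self, path, block_ids):
        self.edit_log.append(("add", path, block_ids))  # log the change first
        self.namespace[path] = block_ids

    def delete_file(self, path):
        self.edit_log.append(("delete", path, None))
        self.namespace.pop(path, None)

    def checkpoint(self):
        """Fold the edit log into a new namespace image and clear the log."""
        self.namespace_image = dict(self.namespace)
        self.edit_log = []

    def recover(self):
        """Rebuild the namespace after a restart: load the image, replay the log."""
        rebuilt = dict(self.namespace_image)
        for op, path, block_ids in self.edit_log:
            if op == "add":
                rebuilt[path] = block_ids
            else:
                rebuilt.pop(path, None)
        return rebuilt


namenode = ToyNamenode()
namenode.add_file("/a", [1, 2])
namenode.checkpoint()
namenode.add_file("/b", [3])
namenode.delete_file("/a")
print(namenode.recover())
```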
HDFS
Namenode
• Namenode holds the filesystem metadata in its memory
• Namenode’s memory size determines the limit to the number of files in a filesystem
• But then, what is Metadata?
HDFS
Metadata
• Works like the traditional library card catalog
• Categorizes and describes the contents and context of the data files
• Maximizes the usefulness of the original data file by making it easy to find and use
HDFS
Metadata Types
• Structural Metadata
• Focuses on the data structure's design and specification
• Descriptive Metadata
• Focuses on the individual instances of application data or the data content