Lecture 4
Introduction to
Hadoop & GAE
Cloud Application Development
(SE808, School of Software, Sun Yat-Sen University)
Yabo (Arber) Xu
Outline
Introduction to Hadoop
• The Hadoop ecosystem
• Related projects
• How to start
Introduction to GAE
• What is GAE
• Overview of runtime environment
• Scalable services
• Advantages and limitations
• Billing and free quotas
• Demo, and how to start
Hadoop Stack & Google’s Equivalents
Google      Hadoop              Role
MapReduce   Hadoop MapReduce    Programming framework
GFS         HDFS                Distributed file system
BigTable    HBase               Distributed column database
Sawzall     Pig / Hive          High-level language
Chubby      ZooKeeper           Distributed consensus engine
Pig
Data-flow oriented language
• “Pig Latin”
• Datatypes include sets, associative arrays, tuples
• High-level language for routing data, allows easy integration of Java for complex tasks
Developed at Yahoo!
Hive
SQL-based data warehousing app
• Feature set is similar to Pig
• Language is more strictly SQL-esque
Supports SELECT, JOIN, GROUP BY, etc.
Features for analyzing very large data sets
• Partition columns
• Sampling
• Buckets
Developed at Facebook
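To make partition columns, buckets, and sampling more concrete, here is a minimal Python sketch. It is not Hive code. It assumes rows are plain dicts, that a partition is chosen by the value of one column, and that a bucket is chosen by hashing another column modulo a fixed bucket count. The column names, the bucket count, and the CRC32 hash are illustrative assumptions, not Hive's actual implementation.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count, chosen only for illustration


def bucket_of(value, num_buckets=NUM_BUCKETS):
    """Map a column value to a bucket using a stable hash (CRC32 here)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def layout(rows, partition_col, bucket_col):
    """Group rows by (partition value, bucket number)."""
    table = defaultdict(list)
    for row in rows:
        cell = (row[partition_col], bucket_of(row[bucket_col]))
        table[cell].append(row)
    return table


def sample_bucket(table, bucket):
    """Sampling by bucket: read a single bucket from every partition."""
    return [row for (_, b), rows in table.items() if b == bucket for row in rows]


# Hypothetical click log, partitioned by day and bucketed by user.
rows = [
    {"day": "2010-03-01", "user": "alice", "clicks": 3},
    {"day": "2010-03-01", "user": "bob", "clicks": 1},
    {"day": "2010-03-02", "user": "alice", "clicks": 7},
    {"day": "2010-03-02", "user": "carol", "clicks": 2},
]
table = layout(rows, "day", "user")
```

A query restricted to one day only needs the cells of that partition, and a query over one bucket touches roughly 1/NUM_BUCKETS of the data while still seeing every row of the users that hash to it.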
HBase
Row/Column store
Billions of rows * millions of columns
Column-oriented – nulls are free
Untyped – stores bytes.
Constrained access model
• (key, value) lookup
• Limited transactions (only one row)
HBase – Data Model
Data schema
Disk storage
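The data model can be pictured as a sparse, sorted map from row key to columns to bytes. The Python sketch below is a toy stand-in under that assumption, not the HBase client API; the class name `SparseTable` and the "family:qualifier" column strings are hypothetical names used only for illustration.

```python
class SparseTable:
    """Toy row/column store: row key -> {column name -> bytes}."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        # Untyped storage: the table only ever holds raw bytes.
        if not isinstance(value, bytes):
            raise TypeError("encode values to bytes before storing")
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        # A missing cell costs nothing to store; it is simply absent.
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start, stop):
        """Yield (row key, columns) for start <= row key < stop, in key order."""
        for key in sorted(self._rows):
            if start <= key < stop:
                yield key, dict(self._rows[key])


t = SparseTable()
t.put("com.example/a", "content:html", b"<html>a</html>")
t.put("com.example/b", "anchor:home", b"Home")
t.put("org.sample/x", "content:html", b"<html>x</html>")
```

Because rows are kept in key order, a range scan such as `t.scan("com.example/", "com.example0")` visits only the rows that share the `com.example/` prefix.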
HBase – Design & Features
Design similar to GFS
• Name node → Master server
• Data node → Region server, organized in columns and cells
Features
Fault tolerant and auto load balancing
Fast access to cells, and fast scan over the ranges of rows.
More flexible schema than traditional database.
Less transaction support and weaker consistency guarantees than a traditional database.
HBase as a MapReduce Input
Each row is an input record to MapReduce
MapReduce jobs can sort/search/index/query data in bulk
*If you are interested in knowing more about HBase, you may take a look at Cloudera’s training video on HBase.
ZooKeeper
Distributed consensus engine
Provides well-defined concurrent access semantics:
• Leader election
• Service discovery
• Distributed locking / mutual exclusion
• Message board / mailboxes
Pipes, Streaming
Multi-language connector libraries for MapReduce
• Write native-code MapReduce in C++
• Write MapReduce passes in arbitrary scripting languages
Hadoop related projects
Avro – A data serialization system
Chukwa – Hadoop log aggregation
Scribe – More general log aggregation
Mahout – Machine learning library
Cassandra – Column store database on a P2P backend
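Returning to the Streaming connector described earlier: a Streaming-style pass is a program that reads lines and writes tab-separated key/value lines, so it can be written in a scripting language. The sketch below is a word count in that style. It assumes the reducer receives its input already sorted by key; the function names and the `map`/`reduce` command-line argument are hypothetical choices made for this example.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input is sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Assumed usage: pass "map" or "reduce" as the first argument and
    # pipe data through standard input and output.
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    if role in ("map", "reduce"):
        step = map_lines if role == "map" else reduce_lines
        for out in step(sys.stdin):
            print(out)
```

Locally, the two passes can be chained with a sort in between to imitate the shuffle step, which is a convenient way to debug before submitting a job to a cluster.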
Hadoop Status
Still under active development
Current stable release: 0.20.2 (Hadoop official website)
There are some other well-maintained distributions
• Cloudera’s CDH2
• Yahoo’s Distribution: Hadoop 0.20.10
Supported platforms
• Linux as production platform / Win32 as a dev platform
Get yourself started (also Lab 1’s task)
• Download a Hadoop stable release
• Set up a single-node Hadoop installation
• Try out the HDFS operations
• Read the WordCount example code, and run your first MR job on Hadoop
Introduction to Google App Engine (GAE)
IaaS: Infrastructure as a Service
PaaS: Platform as a Service
SaaS: Software as a Service
#1: The technology that drives PaaS
#2: Application development on top of a PaaS platform
What is Google App Engine
• A PaaS platform for hosting web applications in Google-managed data centers.
• Released in April 2008 with Python support.
• Java support added in May 2009.
Google App Engine + Java Language + Google Web Toolkit = Google App Engine for Java
A Traditional Scalable Website
A GAE Scalable Website
GAE Advantages
• Easy to use, scale and manage
• Run your application on Google’s infrastructure
• Forget the worries of managing your own servers
• Focus on developing more features for your web application, and let Google manage the rest
• No server restart, no network issues
GAE Architecture
GAE Java Runtime Environment
• Java 6 VM
• Servlet 2.5 Container
• HTTP Session support (need to enable explicitly)
• JDO/JPA for Datastore API
• JSR 107 for Memcache API
• javax.mail for Mail API
• java.net.URLConnection for URLFetch API
http://code.google.com/appengine/docs/java/runtime.html
Java Standards on GAE
http://code.google.com/appengine/docs/java/runtime.html
Datastore API
• Storing and manipulating data
• Based on Bigtable
• Not a relational database
• GQL (Google Query Language)
• Need to use JDO/JPA
http://code.google.com/appengine/docs/java/datastore/
Memcache
• Faster than Datastore
• Stored in memory rather than on disk
• Arbitrary key-value pair mapping
• Implements the JCache interface
• 1MB limit per entry
• Free quota: 8.6M/day, 800 requests/sec
http://code.google.com/appengine/docs/java/memcache/
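A common way to use an in-memory key-value cache in front of a slower store is the read-through pattern: check the cache first, and only on a miss load from the store and remember the result. The Python sketch below illustrates that pattern under stated assumptions. `TinyCache` is an in-process stand-in, not the GAE Memcache API, and `load_from_datastore` is a hypothetical callable supplied by the caller. Only the 1MB per-entry limit is taken from the slide.

```python
MAX_ENTRY_BYTES = 1024 * 1024  # the 1MB per-entry limit from the slide


class TinyCache:
    """In-process stand-in for a memcache client (illustrative only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        if len(value) > MAX_ENTRY_BYTES:
            return False  # too large to cache; the caller still has the value
        self._data[key] = value
        return True


def read_through(cache, key, load_from_datastore):
    """Return the cached bytes for key, loading and caching them on a miss."""
    value = cache.get(key)
    if value is None:
        value = load_from_datastore(key)
        cache.put(key, value)
    return value
```

A real cache may evict entries at any time, so code written this way should always be able to fall back to the underlying store.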
Users & Authentication
@gmail.com address
Apps for Domain
Admin Privileges
URLFetch
• Loads external URLs
• Asynchronous support
• HTTP/HTTPS
• Max 10-second response time
• Max 1MB data
http://code.google.com/appengine/docs/java/urlfetch/
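The two limits above can be mimicked on the client side to get a feel for what they mean in practice. The sketch below uses Python's standard `urllib.request`, not the GAE URLFetch API. The `fetch` function and its `opener` parameter are hypothetical names for this example. Note that `urlopen`'s `timeout` applies to individual blocking socket operations, so it only approximates an overall 10-second deadline.

```python
import urllib.request

TIMEOUT_SECONDS = 10               # response-time limit from the slide
MAX_RESPONSE_BYTES = 1024 * 1024   # 1MB data limit from the slide


def fetch(url, opener=urllib.request.urlopen):
    """Fetch url and enforce the time and size limits locally."""
    with opener(url, timeout=TIMEOUT_SECONDS) as response:
        # Read one byte past the limit so an oversized body is detectable.
        body = response.read(MAX_RESPONSE_BYTES + 1)
    if len(body) > MAX_RESPONSE_BYTES:
        raise ValueError("response exceeds the 1MB limit")
    return body
```

Passing the opener as a parameter keeps the function easy to exercise without network access, for example with a stub that returns an in-memory byte stream.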
Even More…
• Datastore – database storage and operations
• Memcache API – high performance in-memory key-value cache
• User Accounts – using Google accounts for authentication
• URLFetch – invoking external URLs
• Mail – sending mail from your application
• XMPP – sending/receiving XMPP-compatible instant messages
• Task Queues – for invoking background processes
• Images – for image manipulation
• Cron Jobs – tasks scheduled to run at defined times
http://code.google.com/appengine/docs/java/apis.html
Who is using GAE?
http://code.google.com/appengine/casestudies.html
GAE Demo
Demo site: http://shen-ma.appspot.com/
How Do You Start
The best way to learn is by practice!
Follow GAE’s Getting Started: Java guide, and have your first application online in 2 hours. (Also Lab 1 Task)
Everybody is recommended to use Eclipse as the development IDE; GAE offers a very nice plugin.
Other GAE examples are available on our course website