www.kit.edu
04.08
KIT – the cooperation of Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH)
Cloud Computing mit
mathematischen Anwendungen
Vorlesung SoSe 2009
Dr. Marcel Kunze
Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC)
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 2
Agenda Cloud Computing
1. Einleitung
Was ist Cloud Computing?
2. Grundlagen
Virtualisierung, Web Services,…
3. Cloud Architekturen
Infrastruktur, Plattform, Anwendung 4. Cloud Services
Amazon Web Services, Google App Engine 5. Aufbau einer Cloud
OpenCirrus Projekt, Eucalyptus, Hadoop, Google 6. Cloud Algorithmen
MapReduce, Optimierungsverfahren, …
Praktische Übungen und Anwendungen Vorlesung im Web:
http://www.mathematik.uni-karlsruhe.de/mitglieder/lehre/cloud2009s/
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 3
Infrastructure for Search Systems
Several key pieces of infrastructure:
GFS: large-scale distributed file system
MapReduce: makes it easy to write/run large scale jobs
generate production index data more quickly perform ad-hoc experiments more rapidly
BigTable: semi-structured storage system
online, efficient access to per-document information at any time multiple processes can update per-doc info asynchronously critical for updating documents in minutes instead of hours
http://labs.google.com/papers/gfs.html
http://labs.google.com/papers/mapreduce.html http://labs.google.com/papers/bigtable.html
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 4
Google Cluster
Thousands of computers Distributed
Computers have their own disks, and the file system spans those disks
Failures are the norm
Disks, networks, processors, power supplies, application software, operating system software, human error
Files are huge
Multi-gigabyte files, each containing many objects
Read/write characteristics
Files are mutated by appending
Once written, files are typically only read
Large streaming reads and small random reads are typical
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 5
Google File System (GFS)
A GFS cluster has one master and many chunkservers Files are divided into 64 MB chunks
Chunks are replicated and stored in the Unix file systems of the chunkservers
The master holds metadata
Clients get metadata from the master, and data directly from chunkservers
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 6
Google File System - Read
From byte offset within the file, client computes chunk index Client sends filename and chunk index to master
Master returns a list of replicas of the chunk Client interacts with a replica to access data
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 7
Google File System - Write
Client asks master for identity of primary and secondary replicas
Client pushes data to memory at all replicas via a replica-to-replica
“chain”
Client sends write request to primary Primary orders concurrent requests, and triggers disk writes at all replicas Primary reports success or failure to client
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 8
What is BigTable?
“A BigTable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, a column key, and a timestamp;
each value in the map is an uninterpreted array of bytes.”
Google‘s Implementation of a database
Lots of semi-structured data Enormous scale
(row:string, column:string, time:int64) -> string
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 9
Relative to a DBMS, BigTable provides …
- Simplified data retrieval mechanism
(row, column, time:) -> string only No relational operators
- Atomic updates only possible at row level + Arbitrary number of columns per row
+ Arbitrary data type for each column
Designed and optimized for Google’s application set
Provides extremely large scale (data, throughput) at extremely small cost
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 10
Physical representation of data
A logical “table” is divided into multiple tablets
Each tablet is one or more SSTable files in GFS
Each tablet stores an interval of table rows
Rows are lexicographically ordered (by key)
If a tablet grows beyond a certain size, it is split into two new tablets
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 11
Software structure of a BigTable cell
One master server
Communicates only with tablet servers
Multiple tablet servers
Perform actual client accesses
Chubby lock service holds metadata (e.g., the location of the root metadata tablet for the table ), handles master election
GFS servers provide underlying storage
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 12
High-level structure
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 13
The Eight Design Fallacies
The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn't change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.
-- Peter Deutsch and James Gosling, Sun Microsystems
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 14
ATC Architecture
NETWORK INFRASTRUCTURE NETWORK INFRASTRUCTURE
ATC State ATC State
ATC status is a kind of temporal database: for each ATC sector, it tells us what flights might be in that sector and when they will be there
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 15
Server replication
Let’s think about the service that tracks the status of ATC sectors
Client systems are like web browsers
Server is like a web service. ATC is a “cloud” but one with special needs: it speaks with “one voice”
Now, an ATC needs highly available servers.
Else a crash could leave controller unable to make decisions So: how can we make a service highly available?
Most obvious option: “primary/backup”
We run two servers on separate platforms The primary sends a log to the backup
If primary crashes, the backup soon catches up and can take over
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 16
A primary-backup scenario…
primary
backup
Clients initially connected to primary, which keeps backup up to date. Backup collects the log
log
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 17
Split brain Syndrome…
Transient problem causes some links to break but not all.
Backup thinks it is now primary, primary thinks backup is down primary
backup
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 18
Split brain Syndrome
Some clients still connected to primary, but one has switched to backup and one is completely disconnected from both
primary
backup
Safe for US227 to land on Ithaca’s NW runway
Safe for NW111 to land on Ithaca’s NW runway
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 19
Oh no! But how could this happen?
How do web service systems detect failures?
The specifications don’t really answer this question
A web client senses a failure if it can’t connect to a server, or if the connection breaks
And the connections are usually TCP
So, how does TCP detect failures?
Under the surface, TCP sends data in IP packets, and the receiver acknowledges receipt.
TCP channels break if a timeout occurs.
Fix the problem
Just have the backup unplug the primary
Alternative: Install a consistency mechanism (Lock service)
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 20
Chubby
A coarse-grained lock service
Other distributed systems can use this to synchronize access to shared resources
Intended for use by “loosely-coupled distributed systems”
Design Goals
High availability Reliability
Anti-goals
High performance Throughput
Storage capacity
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 21
Chubby - Intended Use Cases
GFS: Elect a master
BigTable: master election, client discovery, table service locking Well-known location to bootstrap larger systems
Partition workloads
Locks should be coarse: held for hours or days – build your own fast locks on top
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 22
Chubby - External Interface
Presents a simple distributed file system Clients can open/close/read/write files
Reads and writes are whole-file
Also supports advisory reader/writer locks
Clients can register for notification of file update
Files == Locks?
“Files” are just handles to information
The contents of the file is one (primary) attribute
As is the owner of the file, permissions, date modified, etc
Can also have an attribute indicating whether the file is locked or not.
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 23
Chubby - Topology
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 24
Chubby – Master Election
Master election is simple: all replicas try to acquire a write lock on designated file. The one who gets the lock is the master.
Master can then write its address to file; other replicas can read this file to discover the chosen master name.
Chubby doubles as a name service
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 25
Chubby - Distributed Consensus
Chubby cell is usually 5 replicas
3 must be alive for cell to be viable
How do replicas in Chubby agree on their own master, official lock values?
PAXOS algorithm
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 26
PAXOS
Paxos is a family of algorithms (by Leslie Lamport) designed to provide distributed consensus in a network of several processors.
Processor Assumptions
Operate at arbitrary speed Independent, random failures
Procs with stable storage may rejoin protocol after failure
Do not lie, collude, or attempt to maliciously subvert the protocol
Network Assumptions
All processors can communicate with (“see”) one another
Messages are sent asynchronously and may take arbitrarily long to deliver Order of messages is not guaranteed: they may be lost, reordered, or duplicated
Messages, if delivered, are not corrupted in the process
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 27
A Fault Tolerant Memory of Facts
Paxos provides a memory for individual “facts” in the network.
A fact is a binding from a variable to a value.
Paxos between 2F+1 processors is reliable and can make progress if up to F of them fail.
Roles
Proposer – An agent that proposes a fact Leader – the authoritative proposer
Acceptor – holds agreed-upon facts in its memory Learner – May retrieve a fact from the system
Safety Guarantees
Nontriviality: Only proposed values can be learned Consistency: Only at most one value can be learned
Liveness: If at least one value V has been proposed, eventually any learner L will get some value
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 28
Step 1: Prepare
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 29
Step 2: Promise
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 30
Step 3: Accept
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 31
Step 4: Accepted
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 32
Learning Values
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 33
Cross-Language Information Retrieval
Translate all the world’s documents to all the world’s languages
increases index size substantially computationally expensive
but huge benefits if done well
Challenges:
continuously improving translation quality
large-scale systems work to deal with larger and more complex language models
to translate one sentence ~1M lookups in multi-TB model
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 34
ACLs in Information Retrieval Systems
Retrieval systems with mix of private, semiprivate, widely shared and public documents
e.g. e-mail vs. shared doc among 10 people vs. messages in group with 100,000 members vs. public web pages
Challenge: building retrieval systems that efficiently deal with ACLs that vary widely in size
best solution for doc shared with 10 people is different than for doc shared with the world
sharing patterns of a document might change over time
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 35
Automatic Construction of Efficient IR Systems
Currently use several retrieval systems
e.g. one system for sub-second update latencies, one for very large # of documents but daily updates, ...
common interfaces, but very different implementations primarily for efficiency
works well, but lots of effort to build, maintain and extend different systems
Challenge: can we have a single parameterizable system that automatically constructs efficient retrieval system based on these parameters?
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 36
Information Extraction from Semi-structured Data
Data with clearly labelled semantic meaning is a tiny fraction of all the data in the world
But there’s lots semi-structured data
books & web pages with tables, data behind forms, ...
Challenge: algorithms/techniques for improved extraction of structured information from unstructured/semi-structured sources
noisy data, but lots of redundancy
want to be able to correlate/combine/aggregate info from different sources
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 37
Summary
Google Cloud
GFS
BigTable
Consistency
Split Brain Problem Chubby Lock Service
Distributed Consensus: Paxos Algorithm
Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 38
Karlsruhe Institute of Technology
Thank you for your attention.