Cloud Computing mit mathematischen Anwendungen

(1)

www.kit.edu

04.08

KIT – the cooperation of Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH)

Cloud Computing mit

mathematischen Anwendungen

Vorlesung SoSe 2009

Dr. Marcel Kunze

Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC)

(2)

Cloud Computing Teil 10 | SoSe 2009 | Dr. M.Kunze 2

Agenda Cloud Computing

1. Einleitung

Was ist Cloud Computing?

2. Grundlagen

Virtualisierung, Web Services,…

3. Cloud Architekturen

Infrastruktur, Plattform, Anwendung 4. Cloud Services

Amazon Web Services, Google App Engine 5. Aufbau einer Cloud

OpenCirrus Projekt, Eucalyptus, Hadoop, Google 6. Cloud Algorithmen

MapReduce, Optimierungsverfahren, …

Praktische Übungen und Anwendungen Vorlesung im Web:

http://www.mathematik.uni-karlsruhe.de/mitglieder/lehre/cloud2009s/

(3)

Infrastructure for Search Systems

Several key pieces of infrastructure:

GFS: large-scale distributed file system

MapReduce: makes it easy to write/run large scale jobs

generate production index data more quickly perform ad-hoc experiments more rapidly

BigTable: semi-structured storage system

online, efficient access to per-document information at any time multiple processes can update per-doc info asynchronously critical for updating documents in minutes instead of hours

http://labs.google.com/papers/gfs.html

http://labs.google.com/papers/mapreduce.html http://labs.google.com/papers/bigtable.html

(4)

Google Cluster

Thousands of computers Distributed

Computers have their own disks, and the file system spans those disks

Failures are the norm

Disks, networks, processors, power supplies, application software, operating system software, human error

Files are huge

Multi-gigabyte files, each containing many objects

Read/write characteristics

Files are mutated by appending

Once written, files are typically only read

Large streaming reads and small random reads are typical

(5)

Google File System (GFS)

A GFS cluster has one master and many chunkservers Files are divided into 64 MB chunks

Chunks are replicated and stored in the Unix file systems of the chunkservers

The master holds metadata

Clients get metadata from the master, and data directly from chunkservers

(6)

Google File System - Read

From byte offset within the file, client computes chunk index Client sends filename and chunk index to master

Master returns a list of replicas of the chunk Client interacts with a replica to access data

(7)

Google File System - Write

Client asks master for identity of primary and secondary replicas

Client pushes data to memory at all replicas via a replica-to-replica

“chain”

Client sends write request to primary Primary orders concurrent requests, and triggers disk writes at all replicas Primary reports success or failure to client

(8)

What is BigTable?

“A BigTable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, a column key, and a timestamp;

each value in the map is an uninterpreted array of bytes.”

Google‘s Implementation of a database

Lots of semi-structured data Enormous scale

(row:string, column:string, time:int64) -> string

(9)

Relative to a DBMS, BigTable provides …

- Simplified data retrieval mechanism

(row, column, time:) -> string only No relational operators

- Atomic updates only possible at row level + Arbitrary number of columns per row

+ Arbitrary data type for each column

Designed and optimized for Google’s application set

Provides extremely large scale (data, throughput) at extremely small cost

(10)

Physical representation of data

A logical “table” is divided into multiple tablets

Each tablet is one or more SSTable files in GFS

Each tablet stores an interval of table rows

Rows are lexicographically ordered (by key)

If a tablet grows beyond a certain size, it is split into two new tablets

(11)

Software structure of a BigTable cell

One master server

Communicates only with tablet servers

Multiple tablet servers

Perform actual client accesses

Chubby lock service holds metadata (e.g., the location of the root metadata tablet for the table ), handles master election

GFS servers provide underlying storage

(12)

High-level structure

(13)

The Eight Design Fallacies

The network is reliable.

Latency is zero.

Bandwidth is infinite.

The network is secure.

Topology doesn't change.

There is one administrator.

Transport cost is zero.

The network is homogeneous.

-- Peter Deutsch and James Gosling, Sun Microsystems

(14)

ATC Architecture

NETWORK INFRASTRUCTURE NETWORK INFRASTRUCTURE

ATC State ATC State

ATC status is a kind of temporal database: for each ATC sector, it tells us what flights might be in that sector and when they will be there

(15)

Server replication

Let’s think about the service that tracks the status of ATC sectors

Client systems are like web browsers

Server is like a web service. ATC is a “cloud” but one with special needs: it speaks with “one voice”

Now, an ATC needs highly available servers.

Else a crash could leave controller unable to make decisions So: how can we make a service highly available?

Most obvious option: “primary/backup”

We run two servers on separate platforms The primary sends a log to the backup

If primary crashes, the backup soon catches up and can take over

(16)

A primary-backup scenario…

primary

backup

Clients initially connected to primary, which keeps backup up to date. Backup collects the log

log

(17)

Split brain Syndrome…

Transient problem causes some links to break but not all.

Backup thinks it is now primary, primary thinks backup is down primary

backup

(18)

Split brain Syndrome

Some clients still connected to primary, but one has switched to backup and one is completely disconnected from both

primary

backup

Safe for US227 to land on Ithaca’s NW runway

Safe for NW111 to land on Ithaca’s NW runway

(19)

Oh no! But how could this happen?

How do web service systems detect failures?

The specifications don’t really answer this question

A web client senses a failure if it can’t connect to a server, or if the connection breaks

And the connections are usually TCP

So, how does TCP detect failures?

Under the surface, TCP sends data in IP packets, and the receiver acknowledges receipt.

TCP channels break if a timeout occurs.

Fix the problem

Just have the backup unplug the primary

Alternative: Install a consistency mechanism (Lock service)

(20)

Chubby

A coarse-grained lock service

Other distributed systems can use this to synchronize access to shared resources

Intended for use by “loosely-coupled distributed systems”

Design Goals

High availability Reliability

Anti-goals

High performance Throughput

Storage capacity

(21)

Chubby - Intended Use Cases

GFS: Elect a master

BigTable: master election, client discovery, table service locking Well-known location to bootstrap larger systems

Partition workloads

Locks should be coarse: held for hours or days – build your own fast locks on top

(22)

Chubby - External Interface

Presents a simple distributed file system Clients can open/close/read/write files

Reads and writes are whole-file

Also supports advisory reader/writer locks

Clients can register for notification of file update

Files == Locks?

“Files” are just handles to information

The contents of the file is one (primary) attribute

As is the owner of the file, permissions, date modified, etc

Can also have an attribute indicating whether the file is locked or not.

(23)

Chubby - Topology

(24)

Chubby – Master Election

Master election is simple: all replicas try to acquire a write lock on designated file. The one who gets the lock is the master.

Master can then write its address to file; other replicas can read this file to discover the chosen master name.

Chubby doubles as a name service

(25)

Chubby - Distributed Consensus

Chubby cell is usually 5 replicas

3 must be alive for cell to be viable

How do replicas in Chubby agree on their own master, official lock values?

PAXOS algorithm

(26)

PAXOS

Paxos is a family of algorithms (by Leslie Lamport) designed to provide distributed consensus in a network of several processors.

Processor Assumptions

Operate at arbitrary speed Independent, random failures

Procs with stable storage may rejoin protocol after failure

Do not lie, collude, or attempt to maliciously subvert the protocol

Network Assumptions

All processors can communicate with (“see”) one another

Messages are sent asynchronously and may take arbitrarily long to deliver Order of messages is not guaranteed: they may be lost, reordered, or duplicated

Messages, if delivered, are not corrupted in the process

(27)

A Fault Tolerant Memory of Facts

Paxos provides a memory for individual “facts” in the network.

A fact is a binding from a variable to a value.

Paxos between 2F+1 processors is reliable and can make progress if up to F of them fail.

Roles

Proposer – An agent that proposes a fact Leader – the authoritative proposer

Acceptor – holds agreed-upon facts in its memory Learner – May retrieve a fact from the system

Safety Guarantees

Nontriviality: Only proposed values can be learned Consistency: Only at most one value can be learned

Liveness: If at least one value V has been proposed, eventually any learner L will get some value

(28)

Step 1: Prepare

(29)

Step 2: Promise

(30)

Step 3: Accept

(31)

Step 4: Accepted

(32)

Learning Values

(33)

Cross-Language Information Retrieval

Translate all the world’s documents to all the world’s languages

increases index size substantially computationally expensive

but huge benefits if done well

Challenges:

continuously improving translation quality

large-scale systems work to deal with larger and more complex language models

to translate one sentence ~1M lookups in multi-TB model

(34)

ACLs in Information Retrieval Systems

Retrieval systems with mix of private, semiprivate, widely shared and public documents

e.g. e-mail vs. shared doc among 10 people vs. messages in group with 100,000 members vs. public web pages

Challenge: building retrieval systems that efficiently deal with ACLs that vary widely in size

best solution for doc shared with 10 people is different than for doc shared with the world

sharing patterns of a document might change over time

(35)

Automatic Construction of Efficient IR Systems

Currently use several retrieval systems

e.g. one system for sub-second update latencies, one for very large # of documents but daily updates, ...

common interfaces, but very different implementations primarily for efficiency

works well, but lots of effort to build, maintain and extend different systems

Challenge: can we have a single parameterizable system that automatically constructs efficient retrieval system based on these parameters?

(36)

Information Extraction from Semi-structured Data

Data with clearly labelled semantic meaning is a tiny fraction of all the data in the world

But there’s lots semi-structured data

books & web pages with tables, data behind forms, ...

Challenge: algorithms/techniques for improved extraction of structured information from unstructured/semi-structured sources

noisy data, but lots of redundancy

want to be able to correlate/combine/aggregate info from different sources

(37)

Summary

Google Cloud

GFS

BigTable

Consistency

Split Brain Problem Chubby Lock Service

Distributed Consensus: Paxos Algorithm

(38)

Karlsruhe Institute of Technology

Thank you for your attention.