Computing at the Edge: Improving the Performance of Big Data and Mobile Apps with In-Network Computation

(1)

Peter R. Pietzuch



(joint work with Andreas Pamboris*, Lukas Rupprecht, Luo Mai, Abdul Alim, Paolo Costa*, Matteo Migliavacca, Alexander L. Wolf, Miguel Baguena Albaladejo, Pietro Manzoni)

Large-Scale Distributed Systems Group

Department of Computing

(2)

Network Devices Have Changed…


1973: First portable mobile phone call in NYC

1946: First mobile phone call from car

(3)

Today’s Mobile Devices Are About Apps

• Mobile applications require low latency access to services

• Mobile networks: often slow and unpredictable

– Highly variable network conditions

• Consider online gaming use case

– Different clients perceive updates to shared game state with lag

Multiplayer games

(4)

Mobile Users Have High Expectations

Mobile devices have limited resources (CPU, memory) compared to laptops/desktops/servers

Consider limited device memory: e.g. 512 MB (iPhone 4S)

– Application left with 213 MB (rest for OS)

– Single 8-megapixel photo: over 30 MB of bitmap data

– Image editing app cannot have more than 7 photos resident in memory

Real-time augmented reality

Speech recognition

(5)

Mobile Networks Are Dumb Pipes

Mobile Network

Call Of Duty

(6)

Data Centres Have Come A Long Way…


1997: Google’s first data centre at Stanford

Today: Google’s data centre in Oregon

(7)

…

Big Data Analytics is Network-Bound

Our digital interactions are stored, searched, analysed

– e.g. email, online shopping, texts, tweets, pictures, Facebook updates

• Explosion in data volume

• Requires massively-parallel processing in many servers

Smart fridges, Smartphones

(8)

Data Centre Networking Provides Dumb Pipes


Data Centre Network

(9)

Idea: Hosting Services within the Network

Network

Call Of Duty

End users

Backend services

Call Of Duty

Talk outline:

1. NetAgg: Higher bandwidth for distributed applications

2. EdgeCloud: Lower latency to Internet backend services

(10)

1. Increasing Bandwidth


Network

Call Of Duty

End users

Backend services

1. NetAgg: Higher bandwidth for distributed applications

→ Avoid bandwidth bottleneck

2. EdgeCloud: Lower latency to Internet backend services

3. CloudSplit: Access to more memory resources

(11)

Data Centre Networks Are Expensive

Cisco 3560-48TD: $11,995 Juniper Ex8216: $750,000 Cisco 6509E: $500,000

10 Gbps

20 Gbps

1 Gbps

Top of the rack switch

Aggregation switch

Core switch

(12)

Problem: Bandwidth Over-Subscription

1 Gbps

10 Gbps

20 Gbps

40 Gbps

20 Gbps

160 Gbps

40 Gbps

1:100 oversubscription rate not uncommon, with reported cases of 1:240

Figure labels: 1:2 oversubscription, 1:4

• Performance of distributed applications limited by network
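An oversubscription ratio is the bandwidth that servers can offer to a switch divided by that switch's uplink capacity. The sketch below works through the arithmetic under assumed port counts: the slide gives only link speeds and ratios, so the 40-server rack, the 8-rack aggregation layer, and the function name `oversubscription` are all hypothetical.

```python
from fractions import Fraction


def oversubscription(downlink_gbps, n_downlinks, uplink_gbps, n_uplinks=1):
    """Return offered downstream bandwidth divided by uplink capacity."""
    offered = Fraction(downlink_gbps) * n_downlinks
    capacity = Fraction(uplink_gbps) * n_uplinks
    return offered / capacity


# Hypothetical rack: 40 servers at 1 Gbps behind a 20 Gbps uplink.
tor = oversubscription(1, 40, 20)
# Hypothetical aggregation layer: 8 such uplinks (160 Gbps) behind 40 Gbps.
agg = oversubscription(20, 8, 40)

# Ratios multiply along the path from a server to the core.
print(f"ToR 1:{tor}, aggregation 1:{agg}, end to end 1:{tor * agg}")
```

Under these assumptions a server that sends across the core can count on only one eighth of its edge link when every server is sending at once.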

(13)

Network-as-a-Service: In-Network Processing

[HotICE’12]

• Network switches augmented with processing capabilities

– Multiple options: software-only middleboxes, programmable FPGAs

• Applications deploy processing functions on each device

(14)

NetAgg: On-Path Data Aggregation

[CoNEXT’14]

[Figure, two panels, each showing a single Layer 2 domain. Panel 1 labels: Input Data Flow, Outgoing Data Flow, Bandwidth bottleneck. Panel 2 labels: Input Data Flow, Aggr Data Flow, On-path Aggregation. Legend: AS = L2 Aggr Switch, S = L2 Switch, ToR = Top-of-Rack Switch, W = Worker Server, M = Master Server]

• Aggregate application data early along network paths to reduce traffic

(15)

Example: MapReduce Jobs

[Figure labels: Input file (Chunk 0, Chunk 1, Chunk 2), Map Task, Reduce Task, Intermediate results, Final results]

• Mainstream “Big Data” analytics framework

– E.g. PageRank (web search indexing)

– Open source version (Hadoop) by Yahoo/Apache

(16)

Shuffle Phase is Network Bottleneck

[Figure labels: Input file (Split 0, Split 1, Split 2), Map Task, Reduce Task, Intermediate results, Final results, Parent Server]

• Shuffle phase challenging for data centre networks

– All-to-all traffic pattern with O(N^2) flows

– Bottleneck due to limited edge bandwidth, over-subscription and TCP incast
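The quadratic growth of shuffle traffic can be checked with a few lines of arithmetic. This is a minimal sketch that assumes one flow per (map task, reduce task) pair; the function names are illustrative and not taken from any MapReduce implementation.

```python
def shuffle_flows(n_mappers: int, n_reducers: int) -> int:
    """Every map task sends one partition to every reduce task."""
    return n_mappers * n_reducers


def incast_fan_in(n_mappers: int) -> int:
    """Number of senders that converge on a single reducer's edge link."""
    return n_mappers


# With N mappers and N reducers the flow count grows as N * N.
for n in (10, 100, 1000):
    print(n, shuffle_flows(n, n), incast_fan_in(n))
```

The fan-in figure is what makes the shuffle prone to TCP incast: all senders target the same edge link at roughly the same time.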

(17)

Data Reduction in MapReduce

• Final results typically much smaller than intermediate results (e.g. WordCount)

– In most Facebook jobs, final size is 5.4% of intermediate size

– In most Yahoo jobs, final size is 8.2%

[Figure labels: Input file (Split 0, Split 1, Split 2), Map Task, Reduce Task]
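WordCount shows why the final result is smaller than the intermediate data: map tasks emit one pair per word occurrence, while the reduce output holds one pair per distinct word. The sketch below uses a toy input that is not from the talk, so its ratio only illustrates the effect and says nothing about the Facebook or Yahoo figures.

```python
from collections import Counter


def map_words(split: str) -> list[tuple[str, int]]:
    """Map task: emit a (word, 1) pair for every word in the split."""
    return [(word, 1) for word in split.split()]


def reduce_counts(pairs: list[tuple[str, int]]) -> Counter:
    """Reduce task: sum the counts for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


# Toy input splits (hypothetical).
splits = ["to be or not to be", "to do is to be", "be to do or not"]
intermediate = [pair for split in splits for pair in map_words(split)]
final = reduce_counts(intermediate)

ratio = len(final) / len(intermediate)
print(f"{len(intermediate)} intermediate pairs, {len(final)} final pairs, ratio {ratio:.3f}")
```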

(18)

On-Path Aggregation Trees

• Aggregation trees perform multi-step aggregation

[Figure: tree of Reduce nodes]
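Multi-step aggregation can be sketched as a tree in which each level merges a few partial results and passes a smaller result upwards. The code below is an illustration only: the `fan_in` parameter, the use of `Counter` for partial results, and the function names are assumptions, not the NetAgg implementation.

```python
from collections import Counter


def merge(partials: list[Counter]) -> Counter:
    """One aggregation step: combine several partial results into one."""
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # adds counts key by key
    return combined


def aggregate_tree(partials: list[Counter], fan_in: int = 2) -> Counter:
    """Merge partial results level by level until a single result remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not partials:
        return Counter()
    level = partials
    while len(level) > 1:
        # Each group of up to fan_in results is merged by one tree node.
        level = [merge(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]


worker_outputs = [Counter(a=1, b=2), Counter(b=1), Counter(a=3, c=1), Counter(c=4), Counter(a=1)]
print(aggregate_tree(worker_outputs, fan_in=2))
```

Because the merge is associative and commutative, the tree returns the same result as a single reducer that sees every worker output directly.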

(19)

In-Network Aggregation Increases Throughput

[Figure: processing throughput (Gbps) against number of clients, for Plain Solr and NetAgg + Solr. Annotation: bottlenecked by 1 Gbps edge bandwidth]

(20)

2. Reducing Latency


Network

Call Of Duty

End users

Backend services

1. NetAgg: Higher bandwidth for distributed applications

2. EdgeCloud: Lower latency to Internet backend services

→ Avoids high Internet delays

(21)

Edge Service Hosting in Mobile Networks

• Edge services can reduce latency to clients

• Challenge 1: Mobile clients roam over time

– Need to adapt placement decisions over time

• Challenge 2: Need to migrate service state efficiently

(22)

Scenario: First Person Shooter (FPS) Game

[Figure: game clients 1, 2, 3 and possible game server locations 4, 5, 6; link latencies shown: 180ms, 80ms, 40ms, 5ms, 5ms, 5ms]

(23)

Scenario: First Person Shooter (FPS) Game

[Figure: clients 1, 2, 3 and servers 4, 5, 6. Legend: Idle servers, Active game server. Latency annotations: 5ms, 10ms]

(24)

[Figure panels: Client 1 moving…; Client 2 looking at Client 1, 10ms to Server 4; Client 3 looking at Client 1, 300ms to Server 4]

(25)

EdgeCloud: Automatic Edge Service Migration

Game clients:

- Monitor network connectivity to servers

- Send updates to active server: <server, latency>

[Figure: clients 1, 2, 3 and servers 4, 5, 6]

(26)

[Figure: clients 1, 2, 3 and servers 4, 5, 6; link latencies shown: 180ms, 80ms, 40ms, 5ms, 5ms, 5ms]

Active server:

- Checks for “better” server

- Minimise maximum latency

Client 1     Client 2     Client 3
<4, 5ms>     <4, 10ms>    <4, 300ms>
<5, 15ms>    <5, 0ms>     <5, 210ms>
<6, 125ms>   <6, 130ms>   <6, 180ms>

(27)

Client 1     Client 2     Client 3
<4, 5ms>     <4, 10ms>    <4, 300ms>
<5, 15ms>    <5, 0ms>     <5, 210ms>
<6, 125ms>   <6, 130ms>   <6, 180ms>

→ Server 6 is the fairest choice
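The "minimise maximum latency" rule can be checked against the reported <server, latency> pairs. This sketch assumes the active server holds every client's latest report in a dictionary; the function names and data layout are illustrative and not taken from EdgeCloud.

```python
# Latency reports <server, latency in ms> from the slide, keyed by client.
reports = {
    "Client 1": {4: 5, 5: 15, 6: 125},
    "Client 2": {4: 10, 5: 0, 6: 130},
    "Client 3": {4: 300, 5: 210, 6: 180},
}


def worst_latency(server: int, reports: dict[str, dict[int, int]]) -> int:
    """Highest latency that any client reports towards the given server."""
    return max(latencies[server] for latencies in reports.values())


def pick_server(reports: dict[str, dict[int, int]]) -> int:
    """Choose the server that minimises the maximum client latency."""
    # Only servers that every client has measured are candidates.
    candidates = set.intersection(*(set(r) for r in reports.values()))
    # Ties are broken by the lower server number (an arbitrary choice).
    return min(candidates, key=lambda s: (worst_latency(s, reports), s))


best = pick_server(reports)
print(best, worst_latency(best, reports))
```

The worst-case latencies are 300ms for Server 4, 210ms for Server 5 and 180ms for Server 6, which is why Server 6 is selected even though it is not the closest server for Client 1 or Client 2.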

(28)

Create snapshot of game service

[Figure: clients 1, 2, 3 and servers 4, 5, 6]

(29)

Send snapshot to Server 6 via Client 1

Restart game service on Server 6

[Figure: clients 1, 2, 3 and servers 4, 5, 6]

(30)

[Figure panels: Client 1 moving…; Client 2 looking at Client 1, 130ms to Server 6; Client 3 looking at Client 1, 180ms to Server 6]

(31)

To minimise “dead time” during migration

- Clients and servers cache previous snapshots received

[Figure: clients 1, 2, 3 and servers 4, 5, 6, each holding a cache]

(32)

When game service migrates back to Server 4…

- Only diffs from cached snapshot sent

[Figure: diff computed against the cached snapshot]
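Sending only the difference from a cached snapshot can be sketched as follows. The talk does not describe the snapshot or diff format, so the flat dictionary representation, the field names, and the `make_diff` / `apply_diff` helpers are all assumptions made for illustration.

```python
def make_diff(cached: dict, current: dict) -> dict:
    """Describe how to turn the cached snapshot into the current one."""
    changed = {k: v for k, v in current.items() if k not in cached or cached[k] != v}
    removed = [k for k in cached if k not in current]
    return {"changed": changed, "removed": removed}


def apply_diff(cached: dict, diff: dict) -> dict:
    """Rebuild the current snapshot from the cached one plus the diff."""
    restored = {k: v for k, v in cached.items() if k not in diff["removed"]}
    restored.update(diff["changed"])
    return restored


# Hypothetical game state: Server 4 still holds an older snapshot in its cache.
cached_on_server_4 = {"player1_pos": (10, 4), "player2_pos": (7, 7), "score": 12}
current_on_server_6 = {"player1_pos": (11, 4), "player2_pos": (7, 7), "score": 13, "item": "medkit"}

diff = make_diff(cached_on_server_4, current_on_server_6)
restored = apply_diff(cached_on_server_4, diff)
print(diff)
```

Only the entries that changed since the cached snapshot travel over the network; the unchanged `player2_pos` entry is never resent.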

(33)

Less data → faster transmission → faster migration

[Figure: cached snapshot combined with the received diff]

(34)

3. Extending Client Resources


Network

Call Of Duty

End users

Backend services

1. NetAgg: Higher bandwidth for distributed applications

2. EdgeCloud: Lower latency to Internet backend services

3. CloudSplit: Access to more memory resources

→ Offloads apps running on resource-constrained mobile devices

Code offloading

(35)

Code Offloading for Mobile Apps

• Most existing offloading systems speed up execution of apps

– e.g. MAUI, COMET, ThinkAir, CloneCloud

– Offload compute-intensive app functions (e.g. game AI, image processing, …)

– Exploits faster/more parallel CPU of remote server

• Challenge: Increase memory available to application?

– Application memory footprint > available device memory

– Offloading systems today migrate application state

(36)

State Migration vs. State Partitioning

[Figure labels: <offload request> + application state; transmit portion of state; current state-of-the-art: MAUI, ThinkAir, COMET, CloneCloud]

(37)

State Migration vs. State Partitioning

[Figure label: Application state]

(38)

State Migration vs. State Partitioning

[Figure labels: <response> + application state]

(39)

State Migration vs. State Partitioning

[Figure labels: Internet, Local state, Remote state]

(40)


(41)


(42)

State Migration State Partitioning

Pros

Simplifies handling of network failures:

- Repeat failed call locally

Cons

Limited by device memory

Higher offloading overhead

- Repeatedly offloaded calls require sending same state multiple times

Solved!

Need to recover remote state after

(43)

Snapshot-based Failure Recovery

Before offloading: Local state snapshot L_k-1

[Figure labels: Local state, Remote state]

(44)

Before returning: Remote state snapshot R_k-1

[Figure labels: Local state, Remote state]

(45)

In the background:

Transmits most recent remote

[Figure labels: Local state, background task, Remote state, <R_k>]

(46)

Case A: R_k-1 received before failure

After network failure during k-th offloaded call:

- Use <R_k-1> and <L_k> to restore state

- Repeat failed k-th offloaded call locally

[Figure labels: Local state, Remote state, <R_k-1>]

(47)

snapshot-based fault-tolerance mechanism!

Case B: R_k-1 not received before failure

After network failure (during k-th offloaded call):

- Use <R_k-2> and <L_k-1> to restore state

[Figure labels: Local state, Remote state, <L_k>, <L_k-1>, <R_k-2>]
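The two recovery cases reduce to a choice of which snapshot pair to restore. The sketch below encodes only that choice, assuming the device tracks the indices of the remote snapshots it has received; the function name, its return convention, and the error for a missing R_k-2 are hypothetical and not taken from CloudSplit.

```python
def recovery_snapshots(k: int, received_remote: set[int]) -> tuple[int, int]:
    """Pick the snapshots to restore after a failure during the k-th call.

    received_remote holds every index i for which the device has already
    received remote snapshot R_i. Returns (remote index, local index).
    """
    if k - 1 in received_remote:
        # Case A: R_{k-1} arrived before the failure, pair it with L_k.
        # The failed k-th call is then repeated locally.
        return (k - 1, k)
    if k - 2 in received_remote:
        # Case B: R_{k-1} is missing, fall back to R_{k-2} and L_{k-1}.
        return (k - 2, k - 1)
    # Not covered by the slides; treated as an error in this sketch.
    raise LookupError("no usable remote snapshot on the device")


print(recovery_snapshots(5, {3, 4}))  # Case A
print(recovery_snapshots(5, {3}))     # Case B
```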

(48)

Conclusions

• Modern applications have new network requirements

– Interactive mobile apps need extremely low latency

– Data centre applications require high bandwidth

– Clients remain resource constrained

• We need to change the network model: Push application-specific computation into the network

– Exposes network locations for cloud service hosting

– Blurs boundary between applications and networks

– Three examples: NetAgg, EdgeCloud, CloudSplit

• Lots of open research challenges

– Programming models, network management, security, …

Thank You! Any Questions?

Peter Pietzuch

(49)

(50)

Cooling and Power Distribution

(51)

Developer’s View of the Network

• POSIX network socket interface:

→ Strict separation between applications and network

→ No control over network

int send(int sockfd, const void *msg, int len, int flags);

int recv(int sockfd, void *buf,
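Python's standard `socket` module wraps the same send and receive calls, which makes the narrowness of the interface easy to see. In this small sketch, a connected socket pair stands in for a client and a server; the payload is arbitrary.

```python
import socket

# A connected pair of sockets stands in for a client and a server.
left, right = socket.socketpair()
try:
    sent = left.send(b"hello")    # the application hands over bytes...
    received = right.recv(1024)   # ...and gets bytes back, nothing more
finally:
    left.close()
    right.close()

print(sent, received)
```

Neither call takes an argument that describes the path the data should take or any processing to apply on the way, which is the separation the slide points out.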

(52)

Why Large DCs? Because Performance Matters

• Online services (e.g. Facebook, Google Search, Amazon)

– e.g. Facebook: 400,000,000 requests/s, 200 TB of data in RAM

– Expected response time < 100 ms

• Amazon: every 10 ms of latency cost them 1% in sales

• Google: extra 500 ms reduced traffic by 20%

• Offline batch processing (e.g. MapReduce, graph processing, ML)

– High throughput (e.g. Facebook 60-90 TB/day)

• Internal services to support above applications

– Google (GFS, BigTable, MapReduce), Yahoo (Hadoop, HDFS), Amazon (Dynamo, Astrolabe), Microsoft (Dryad), Facebook (Cassandra, HBase)

(53)

Scalability through Partition/Aggregation

• Many large-scale services use partition/aggregation pattern

– e.g. Google Search

1. Request sent to multiple workers

2. Workers process request in parallel

3. Responses from workers are combined

• Every request on Amazon might involve 10,000+ servers

→ Network critically determines application
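The three steps of the partition/aggregation pattern map onto a scatter-gather call. The sketch below uses threads in one process to stand in for remote workers; the shard contents and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical index shards, one per worker: term -> number of matches.
SHARDS = [
    {"edge": 3, "cloud": 1},
    {"edge": 1, "network": 5},
    {"cloud": 2, "edge": 4},
]


def search_shard(shard: dict[str, int], term: str) -> int:
    """Step 2: a worker answers the request for its own partition."""
    return shard.get(term, 0)


def search(term: str) -> int:
    """Steps 1 and 3: fan the request out, then combine the responses."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partial_results = pool.map(search_shard, SHARDS, [term] * len(SHARDS))
        return sum(partial_results)


print(search("edge"))
```

In a real deployment each partial result crosses the network, so the combine step is where response traffic converges on a single node.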

(54)

Mismatch: Abstraction & Reality

Switches

Router

One logical hop is mapped to multiple physical hops

Two disjoint logical paths share some physical links

(55)

Mismatch between App & Network Abstraction

Abstraction

Reality

• Applications use logical topologies to communicate

Dynamo MapReduce Tree Search Databus

(56)

Addressing Over-subscription

Solutions

– Keep traffic local to racks

– Reverse engineering of IP addresses to extract rack location

– Use multi-path topologies (fat trees) to increase bandwidth in core network [SIGCOMM’08, SIGCOMM’09]

(57)

Problem #2: TCP Incast

Distributed file system

Files are striped across multiple servers

Client issues several synchronised reads for given query

Servers respond (approx.) at same time

Switches drop entire windows of packets, causing TCP collapse

CONGESTION

Solutions

– Add random delays to each reply

– Use small replies (<2.5 KB)
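The effect of adding random delays can be illustrated by counting how many replies land in the same short time window. This is a toy model, not a network simulation: the 64 servers, the 10 ms maximum delay, and the 1 ms window are assumed values chosen only to show the trend.

```python
import random


def peak_arrivals(arrival_times: list[float], window: float) -> int:
    """Largest number of replies arriving within any span of `window` seconds."""
    times = sorted(arrival_times)
    peak, start = 0, 0
    for end, t in enumerate(times):
        # Slide the start of the window forward until it spans at most `window`.
        while t - times[start] > window:
            start += 1
        peak = max(peak, end - start + 1)
    return peak


def reply_times(n_servers: int, max_delay: float, rng: random.Random) -> list[float]:
    """Each server waits a random delay in [0, max_delay] before replying."""
    return [rng.uniform(0.0, max_delay) for _ in range(n_servers)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
synchronised = reply_times(64, 0.0, rng)   # no jitter: all replies at once
jittered = reply_times(64, 0.010, rng)     # up to 10 ms of random delay

print(peak_arrivals(synchronised, 0.001), peak_arrivals(jittered, 0.001))
```

Without jitter every reply falls into the same window, which is the burst that overflows switch buffers; spreading replies over time lowers the peak at the cost of added latency.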

(58)

Problem #3: Path Collision

Network allocates paths independently

Applications cannot modify way packets are routed

(59)

Bridging the Application/Network Gap

Application perspective

• Applications only see IP addresses

− Hard to infer locality & congestion

• No control over packet routing

• Need to reverse-engineer network

Network perspective

• Network only sees packets

• No insights about application behaviour

• Has to infer application patterns

The network is a black box for applications (and vice versa)

[Figure labels: ?, Why slow?]

(60)

Data Centre vs. Internet Networking

Internet

• Multiple administrative domains

• Heterogeneous HW and network

• Topology not known

• Malicious software

Data Centres

• Single administrative domain

• Homogeneous HW and network

− x86 and Ethernet

• Topology known

− and can be customised

• Trusted components

− e.g. using virtualisation

DC networks affected by Internet design…

…but data centres are not mini-Internets

Strict layer isolation

(61)

Letting Applications Adapt the Network

• Many proposals for software-based routers and switches

– e.g. RouteBricks, ServerSwitch, PacketShader, SideCar, NetMap, …

• Replace traditional (application-agnostic) network services

– e.g. IPv4 forwarding, deep packet inspection, firewalls

• Why not use them to implement application-specific services?

– e.g. aggregate packets in distributed query or content-based routing

Goal: Abstractions & mechanisms to enable applications to efficiently, easily and safely process data within the network

(62)

NetAgg Platform: Design

• Agg nodes connected to switches via high capacity links

– High-performance aggregation in software

– Exploit associativity of aggregation computation for parallel execution on multicore CPUs

• Shim layers at edge nodes intercept network data

– Transparent aggregation for applications

– Requires few changes to existing applications

[Figure labels: Worker node, Master node, Shim layers, Agg node 1, Agg node 2, AS, S, W, requests]
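Associativity is what allows an aggregation to be split across cores: chunks can be reduced independently and the per-chunk results reduced again. The sketch below shows the idea with a thread pool; it is not the NetAgg agg node code, and the helper name and chunking scheme are assumptions. In CPython, threads do not speed up CPU-bound work, so this only shows that regrouping gives the same answer.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def parallel_aggregate(values, op, n_workers=4):
    """Reduce chunks independently, then reduce the per-chunk results.

    Correct only when `op` is associative, because operands are grouped
    differently from a left-to-right sequential reduce.
    """
    if not values:
        raise ValueError("nothing to aggregate")
    size = -(-len(values) // n_workers)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda chunk: reduce(op, chunk), chunks))
    return reduce(op, partials)


data = list(range(1, 101))
print(parallel_aggregate(data, add), reduce(add, data))
```

A non-associative operation such as subtraction would give different answers for different chunkings, which is why the design restricts itself to associative aggregation functions.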

(63)

NetAgg Aggregation I

1. Data processing request received by master node

2. Partial requests sent to all worker nodes for parallel processing

[Figure labels: Worker node, Master node, Shim layers, Agg node 1, Agg node 2, AS, S, W, requests; steps 1 and 2 marked]

(64)

NetAgg Aggregation II

3. Partial results redirected to agg nodes by shim layers on worker nodes

[Figure labels: Worker node, Agg node 1, Agg node 2, AS, S, W, partial results; step 3 marked]

(65)

NetAgg Aggregation III

4. Agg nodes form aggregation tree along network paths, exchanging partially aggregated results

5. Final aggregated result returned to master node

[Figure labels: Master node, Agg node 1, Agg node 2, AS, S, aggregated results; steps 4 and 5 marked]

(66)

Experimental Evaluation

• Questions:

– What are the benefits for NetAgg applications?

– What is the impact for non-NetAgg applications?

– What is the Agg Node processing rate achieved?

• Set-up: Packet-level simulator (OMNeT++)

– 1,024-server fat-tree topology (1 Gbps edge links)

– Over-subscription of 1:4

– 10,000 simulated TCP flows

– Consider Flow Completion Time (FCT)

(67)

Individual Flow Completion Time

[Figure: CDF of flow completion time (seconds) for NetAgg and edge-based aggregation (rack, chain, binary tree). Includes 50% non-NetAgg flows]

(68)

Total Flow Completion Time

[Figure: flow completion time relative to rack-level aggregation, against Agg node processing rate R (Gbps). Axis annotations: Worse, Better, Faster Agg node]

Even low aggregation rate enough to achieve significant benefits

R=1 Gbps already reduces time by 83%

91% time reduction

(69)

EdgeCloud Features

• Edge service migration occurs via clients

– Supports clients in multi-homed environments: e.g. switching from LTE network to Wi-Fi LAN environment

– Does not require edge locations to interact directly

• No service modification required

– Edge services treated as “black boxes” running in containers

• Asynchronous transfer of service state

– Service state transferred while service is running

(70)

Problems due to increased network latency

Game events from Client 1

– Propagated to Clients 2 and 3 with some lag

• Client 2 – Server 4 → 10ms latency

• Client 3 – Server 4 → 300ms latency

Client 2 has an advantage over Client 3

– Can react to different events faster

– May lead to game inconsistencies

• Client 3 shoots at Client 1 thinking the latter is in its range

• Client 1 only appears to be in Client 3’s shooting range

» due to increased network delays

(71)

Game events from Client 1

– Received by Clients 2 and 3 in a synchronous fashion

• Client 2 – Server 6 → 130ms latency

• Client 3 – Server 6 → 180ms latency

Clients 2 and 3 playing on equal terms

(72)

Snapshot-based Fault-Tolerance Mechanism

Device periodically stores snapshot of remote state

– Used to reconstruct application state after failure

– Execution rolls back to last consistent snapshot received

User-level virtual memory scheme after failure

– Remote state may grow larger than available memory

– Snapshots stored in flash memory