Peter R. Pietzuch
[email protected]
Computing at the Edge:
Improving the Performance of Big Data and
Mobile Apps with In-Network Computation
Peter Pietzuch
(joint work with Andreas Pamboris*, Lukas Rupprecht, Luo Mai, Abdul Alim, Paolo Costa*, Matteo Migliavacca, Alexander L. Wolf, Miguel Baguena Albaladejo,
Pietro Manzoni)
Large-Scale Distributed Systems Group
Department of Computing
Network Devices Have Changed…
2
1973: First portable mobile phone call in NYC
1946: First mobile phone call from car
Today’s Mobile Devices Are About Apps
•
Mobile applications require
low latency
access to services
•
Mobile networks: often slow and unpredictable
– Highly variable network conditions
•
Consider online gaming use case
– Different clients perceive updates to shared game state with lag
Multiplayer games
Mobile Users Have High Expectations
Mobile devices have
limited resources
(CPU, memory)
compared to laptops/desktops/servers
Consider limited device memory: eg 512MB (iPhone 4S)
– Application left with 213MB (rest for OS)
– Single 8-megapixel photo: over 30MB of bitmap data
– Image editing app cannot have more than 7 photos resident in memory
4 Real-time augmented reality
Speech recognition
Mobile Networks Are Dumb Pipes
Mobile Network
Call Of Duty
Data Centres Have Come A Long Way…
6
1997: Google’s first data centre at Stanford
Today: Google’s data centre in Oregon
…
Big Data Analytics is Network-Bound
Our digital interactions are stored, searched, analysed
– eg email, online shoppping, texts, tweets, pictures, Facebook updates
•
Explosion in data volume
•
Requires massively-parallel processing in many servers
Smart fridges Smartphones
Data Centre Networking Provides Dumb Pipes
8
Data Centre Network
Idea: Hosting Services within the Network
Network
Call Of Duty
End users
Backend services
Call Of Duty
1. NetAgg:
Higher
bandwidth
for distributed applications
2. EdgeCloud:
Lower
latency
to Internet backend services
Talk outline:
1. Increasing Bandwidth
10
Network
Call Of Duty
End users
Backend services
1. NetAgg:
Higher bandwidth for distributed applictions
Avoid bandwidth bottleneck
2. EdgeCloud: Lower latency to Internet backend services
3. CloudSplit: Access to more memory resources
Data Centre Networks Are Expensive
Cisco 3560-48TD: $11,995 Juniper Ex8216: $750,000 Cisco 6509E: $500,00010 Gbps
20 Gbps
1 Gbps
Top of the rack switchAggregation
switch
Core
switch
Problem: Bandwidth Over-Subscription
1 Gbps
10 Gbps
20 Gbps
40 Gbps
20 Gbps
160 Gbps
40 Gbps
1:100 oversubscription
rate not uncommon, with
reported cases of 1:240
1:2
oversubscription
oversubscription
1:4
12
•
Performance of distributed applications limited by network
Network-as-a-Service: In-Network Processing
[HotICE’12]•
Network switches augmented with
processing capabilities
– Multiple options: software-only middleboxes, programmable FPGAs
•
Applications deploy
processing functions
on each device
NetAgg: On-Path Data Aggregation
[CoNEXT’14] 14 ToR W S S S S AS W W W ToR W M W W ToR W W W W ToR W W W W ASAS = L2 Aggr Switch
S = L2 Switch
ToR = Top-of-Rack Switch
W = Worker Server
M = Master Server
Input Data Flow
Outgoing Data Flow
A Single Layer 2 Domain
Bandwidth
bottleneck
ToR W S S S S AS W W W ToR W M W W ToR W W W W ToR W W W W ASAS = L2 Aggr Switch
S = L2 Switch
ToR = Top-of-Rack Switch
W = Worker Server
M = Master Server
Input Data Flow
Aggr Data Flow
On-path
Aggregation
A Single Layer 2 Domain
•
Aggregate application data early
along network paths
to reduce traffic
Example: MapReduce Jobs
Chunk
0
Chunk
1
Chunk
2
Input fileMap Task
Map Task
Map Task
Reduce Task
Reduce Task
Reduce Task
Intermediate results Final results
•
Mainstream “Big Data” analytics framework
– E.g. PageRank (web search indexing)
– Open source version (Hadoop) by Yahoo/Apache
Shuffle Phase is Network Bottleneck
Split
0
Split
1
Split
2
Input fileMap Task
Map Task
Map Task
Reduce Task
Reduce Task
Reduce Task
Intermediate results Final results
•
Shuffle phase
challenging for data centre networks
– All-to-all traffic pattern with O(N2) flows – Bottleneck due to limted edge bandwidth,
over-subscription and TCP incast
Parent Server
Data Reduction in MapReduce
•
Final results typically much smaller than intermediate results
(e.g. WordCount)
– Most Facebook jobs final size is 5.4% of intermediate size
– In most Yahoo jobs, final size is 8.2%
Split
0
Split
1
Split
2
Input file
Map Task
Map Task
Map Task
Reduce Task
Reduce Task
Reduce Task
On-Path Aggregation Trees
•
Aggregation trees perform multi-step aggregation
Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce
Reduce Reduce
In-Network Aggregation Increases Throughput
Pr ocessing thr oug hpu t (Gbp s) Number of clientsBottlenecked by 1 Gbps
edge bandwidth
Plain Solr
NetAgg + Solr
2. Reducing Latency
20
Network
Call Of Duty
End users
Backend services
1. NetAgg: High bandwidth for distributed applications
1. EdgeCloud:
Lower latency to Internet backend services
Avoids high Internet delays
3. CloudSplit: Access to more memory resources
Edge Service Hosting in Mobile Networks
•
Edge services can reduce latency to clients
•
1. Challenge: Mobile clients roam over time
– Need to adapt placement decisions over time
•
2. Challenge: Need to migrate
service state
efficiently
Scenario: First Person Shooter (FPS) Game
22Game clients
180ms 80ms 40ms 5ms 5ms 5ms1
2
3
4
5
6
Possible game server
locations
Scenario: First Person Shooter (FPS) Game
180ms 80ms 40ms 5ms 5ms 5ms1
2
3
=5ms
4
5
6
Idle servers
Active game server
=10ms
24 Client 1 Moving… Client 2 Looking at Client 1 10ms to Server 4 Client 3 Looking at Client 1 300ms to Server 4
E
DGE
C
LOUD
: Automatic Edge Service Migration
180ms 80ms 40ms 5ms 5ms 5ms4
5
6
Game clients:
-
Monitor network connectivity to servers
-
Send updates to active server:
<server, latency>
2
3
1
26 180ms 80ms 40ms 5ms 5ms 5ms
1
2
3
Active server:
-
Checks for “better” server
-
Minimise maximum latency
4
5
6
Client 1 Client 2 Client 3
<4, 5ms> <4, 10ms> <4, 300ms> <5, 15ms> <5, 0ms> <5,
210ms> <6,
125ms> <6, 130ms> <6, 180ms>
180ms 80ms 40ms 5ms 5ms 5ms
1
2
3
4
5
6
Client 1 Client 2 Client 3
<4, 5ms> <4, 10ms> <4, 300ms> <5, 15ms> <5, 0ms> <5, 210ms> <6, 125ms> <6, 130ms> <6, 180ms>
Server 6 is most fair choice
28 180ms 80ms 40ms 5ms 5ms 5ms
1
2
3
5
Create snapshot of game service
4
6
180ms 80ms 40ms 5ms 5ms 5ms
1
2
3
5
Send snapshot to Server 6 via Client 1
Restart game service on Server 6
4
6
30 Client 1 Moving… Client 2 Looking at Client 1 130ms to Server 6 Client 3 Looking at Client 1 180ms to Server 6
180ms 80ms 40ms 5ms 5ms 5ms
1
2
3
5
To minimise “dead time” during migration
-
Clients and servers cache previous snapshots received
4
6
cache cache cache cache cache cache32 180ms 80ms 40ms 5ms 5ms 5ms
1
2
3
5
4
6
When game service migrates back to Server 4…
-
Only diffs from cached snapshot sent
-
cache=
cache
cache
180ms 80ms 40ms 5ms 5ms 5ms
1
2
3
5
4
6
Less data
faster transmission
faster migration
+
cache=
3. Extending Client Resources
34
Network
Call Of Duty
End users
Backend services
1. NetAgg: Higher bandwidth for distributed applications
1. EdgeCloud: Lower latency to Internet backend services
2. CloudSplit: Access to more memory resources
Offloads apps running on resource-constrained mobile devices
Code offloadingCode Offloading for Mobile Apps
•
Most existing offloading systems speed up execution of apps
– eg MAUI, COMET, ThinkAir, CloneCloud
– Offload compute-intensive app functions (eg game AI, image processing, …) – Exploits faster/more parallel CPU of remote server
•
Challenge: Increase
memory
available to application?
– Application memory footprint > available device memory – Offloading systems today migrate application state
State Migration
Vs State Partitioning
36
<offload request> +
Application state
transmit portion of state
Current state-of-the-art Maui ThinkAir COMET CloneCloud
State Migration
Vs State Partitioning
Application state
State Migration
Vs State Partitioning
38
Application state
<response> +
State Migration
Vs
State Partitioning
Internet
Local state
Remote state
State Migration
Vs
State Partitioning
40
<offload request>
State Migration
Vs
State Partitioning
<response>
State Migration State Partitioning
42
Pros
Simplifies handling of network
failures:
- Repeat failed call locally
Cons
Limited by device memory
Higher offloading overhead
Repeatedly offloaded calls require sending same state multiple times
Solved!
Need to recover
remote state after
Local state snapshot Lk-1
Local state
Remote state
Before offloading:
Snapshot-based Failure Recovery
44
Remote state snapshot Rk-1
Local state
Remote state
Before returning:
Local state
background taskRemote state
In the background:
<Rk>
Transmits most recent remote
Snapshot-based Failure Recovery
46
Local state
Remote state
After network failure during k
thoffloaded call:
-
Use
<R
k-1>
and <L
k> to restore state
-
Repeat failed k
thoffloaded call locally
<Rk-1>
Case A:
R
k-1received before failure
snapshot-based fault-tolerance mechanism!
Local state
Remote state
Case B
:
R
k-1not
received before failure
After network failure (during k
thoffloaded call):
Snapshot-based Failure Recovery
<Lk> <Lk-1>
<Rk-2>
-
Use
<R
k-2>
and <L
k-1> to restore state
Conclusions
•
Modern applications have new network requirements
– Interactive mobile apps need extremely low latency – Data centre applications require high bandwidth – Clients remain resource constrained
•
We need to change the network model:
Push application-specific computation into the network
– Exposes network locations for cloud service hosting – Blurs boundary between applications and networks – Three examples: NetAgg, EdgeCloud, CloudSplit
•
Lots of open research challenges
– Programming models, network management, security, …
Thank You! Any Question?
Peter Pietzuch
Cooling and Power Distribution
Developer’s View of the Network
•
POSIX network socket interface:
Strict separation between applications and network
No control over network
int
send(
int
sockfd
,
const void *
msg
,
int
len
,
int
flags
);
int
recv(
int
sockfd
,
void *
buf
,
Why Large DCs? Because Performance Matters
•
Online services
(e.g. Facebook, Google Search, Amazon)
– e.g. Facebook: 400,000,000 requests/s, 200 TB of data in RAM – Expected response time < 100 ms
• Amazon: every 10 ms of latency cost them 1% in sales • Google: extra 500 ms reduced traffic by 20%
•
Offline batch processing
(e.g. MapReduce, graph processing, ML)
– High throughput (e.g. Facebook 60-90 TB/day)
•
Internal services
to support above applications
– Google (GFS, BigTable, MapReduce), Yahoo (Hadoop, HDFS), Amazon (Dynamo, Astrolabe), Microsoft (Dryad), Facebook (Cassandra, Hbase)
Scalability through Partition/Aggregation
•
Many large-scale
services use
partition/aggregation
pattern
– e.g. Google Search 1. Request sent to
multiple workers 2. Workers process
request in parallel 3. Responses from
workers are combined
•
Every request on
Amazon might involve
10,000+ servers
Network critically determines application
Mismatch: Abstraction & Reality
Switches
Router
One logical hop is mapped to multiple physical hops
Two disjoint
logical paths share some physical links
Mismatch between App & Network Abstraction
Abstraction
Reality
•
Applications use logical
topologies to communicate
Dynamo MapReduce Tree Search DatabusAddressing Over-subscription
Solutions
– Keep traffic local to racks
– Reverse engineering of IP addresses to extract rack location
– Use multi-path topologies (fat trees) to increase bandwidth in core network [SIGCOMM’08, SIGCOMM’09]
Problem #2: TCP Incast
Distributed file system
Files are striped across
multiple servers
Client issues several synchronised reads
for given query
Servers respond (approx.) at same time
Switches drop entire windows of packets, causing TCP collapse
CONGESTIO N
Solutions
– Add random delays to each reply – Use small replies (<2.5 KB)
Problem #3: Path Collision
Network allocates paths independently
Applications cannot modify way packets are routed
Bridging the Application/Network Gap
Applications perspective
Network Perspective
•
Network only sees packets
•
No insights about application
behaviour
•
Has to infer application patterns
The network is a black box
for applications (and vice versa)
•
Applications only see IP addresses
− Hard to infer locality & congestion
•
No control over packet routing
•
Need to reverse-engineer
network
?
?
Why slow?
•
Single
administrative domain
•
Homogenous
HW and network
− x86 and Ethernet
•
Topology
known
− and can be customised
•
Trusted
components
− e.g. using virtualisation
Data Centre vs. Internet Networking
Internet
Data Centres
DC networks affected by Internet design…
…but data centres are not mini-Internets
Strict
layer
isolation
•
Multiple
administrative domains
•
Heterogeneous
hw and network
•
Topology
not
known
•
Malicious
software
Letting Applications Adapt the Network
•
Many proposals for
software-based routers and switches
– e.g. RouteBricks , ServerSwitch, PacketShader, SideCar, NetMap,…
•
Replace traditional (
application-agnostic
) network services
– e.g. IPv4 forwarding, deep packet inspection, firewalls
•
Why don’t use them to implement
application-specific
services?
– E.g. aggregate packets in distributed query or content-based routing
Goal:
Abstractions & mechanisms to enable applications to efficiently,
easily and safely
process data within the network
NetAgg Platform: Design
•
Agg nodes
connected to
switches via high capacity links
– High-performance aggregation in software
– Exploit associativity of aggregation computation for parallel execution
on multicore CPUs
•
Shim layers
at edge nodes
intercept network data
– Transparent aggregation for applications
– Requires few changes to existing applications Worker node Worker node Master node Shim layers 1 2 W W Agg node 1 AS Agg node 2
...
... S requests 62NetAgg Aggregation I
Worker node Worker node Master node Shim layers 1 2 W W Agg node 1 AS Agg node 2...
... S requests1. Data processing request
received by
master node
2. Partial requests sent to all
worker nodes
for parallel
processing
NetAgg Aggregation II
3. Partial results redirected to
agg nodes
by
shim layers
on
worker nodes
Worker node Worker node 3 Agg node 1 AS Agg node 2 S W W ......
3 partial results 64NetAgg Aggregation III
4.
Agg nodes
form
aggregation
tree
along network paths,
exchanging partially aggregated
results
5. Final aggregated result
returned to
master node
Master node Agg node 1 AS Agg node 2 S
...
4 5 aggregated resultsExperimental Evaluation
•
Questions:
– What are the benefits for NetAgg applications? – What is the impact for non-NetAgg applications? – What is the Agg Node processing rate achieved?
•
Set-up: Packet-level simulator (OMNeT++)
– 1,024-server fat-tree topology (1 Gbps edge links) – Over-subscription of 1:4
– 10,000 simulated TCP flows
– Consider Flow Completion Time (FCT)
Individual Flow Completion Time
Includes 50%
non-NetAgg
flows
CDF
Flow Completion Time (seconds)
NetAgg
Edge-based
aggregation
(rack, chain,
binary tree)
Total Flow Completion Time
Faster Agg node
Worse
Better
Fl ow Co mpleti onTime
R elativ e to R ack -lev el aggr egatio nAgg node processing rate R (Gbps)
Even low aggregation rate enough
to achieve significant benefits
R=1 Gbps already
reduces time by 83%
91% time
reduction
E
DGE
C
LOUD
Features
•
Edge service migration occurs via clients
– Supports clients in multi-homed environments:
e.g. switching from LTE network to Wi-Fi LAN environment – Does not require edge locations to interact directly
•
No service modification required
– Edge services treated as “black boxes” running in containers
•
Asynchronous transfer of service state
– Service state transferred while service is running
Game events from Client 1
– Propagated to Clients 2 and 3 with some lag
• Client 2 – Server 4
10ms latency
• Client 3 – Server 4
300ms latency
Client 2 has an advantage over Client 3
– Can react to different events faster
– May lead to game inconsistencies
• Client 3 shoots at Client 1 thinking the latter is in its range
• Client 1 only appears to be in Client 3’s shooting range
»
due to increased network delays
Problems due to increased network latency
Game events from Client 1
– Received by Clients 2 and 3 in a synchronous fashion
• Client 2 – Server 6
130ms latency
• Client 3 – Server 6
180ms latency
Clients 2 and 3 playing on equal terms
Snapshot-based Fault-Tolerance Mechanism
72
Device periodically stores snapshot of remote state
– Used to reconstruct application state after failure
– Execution rolls back to last consistent snapshot received
User-level virtual memory scheme after failure
– Remote state may grow larger than available memory – Snapshots stored in flash memory