Data Center Quantized Congestion Notification (QCN):
Implementation and Evaluation on NetFPGA
2010 June 14th
Masato Yasuda (NEC Corporation)
Background
▐
In data centers, there is a movement of Network Convergence
▐
The spec of Converged Enhanced Ethernet (CEE) has been discussed in
IEEE 802.1 standard committees
Converged Enhanced Ethernet
Internet servers terminals Internet terminals Converged DC transport LAN (Fibre Channel: FC) (TCP) SAN storage storage servers
Congestion Control Mechanism in CEE
▐
In CEE, it has Congestion Control Mechanism
Non-TCP traffic (storage, media) can avoid network
congestion
▐
Congestion Notification (CN)
The spec of Congestion Control Mechanism in CEE
It is specified in IEEE802.1Qau in Data Center Bridging Task Group
We proposed QCN (Quantized Congestion Notification) and finally
accepted as a standard in March 2010.
Converged Enhanced Ethernet (CEE)
Converged Enhanced Ethernet (CEE)
FCoE FCoE IP IP IPIP TCP TCP UDP (for media) UDP (for media) FC (for storage) FC (for storage) Support Support Congestion Congestion Control Control at Layer 2 at Layer 2
Significant restrictions/requirements for CN
We must avoid queue oscillation which causes overflow (causes link pause) or underflow (lose utilization)
Stable
Stable
Unlike TCP slow start, sources can come on at the full line rate of 10Gbps
Sources can start
Sources can start
at line late
at line late
Links can be paused by pause mechanism but CN is needed to avoid congestion
spreading (pause spreading to innocent paths).
Links can be
Links can be
paused
paused
The algorithm should be simple enough to be implemented in hardware (for 40G or 100G processing)
Simple
Simple
No way to know round trip time. Not automatically self-clocked like TCP.
(The algorithm need to have counters for self-clocking)
No per
No per
-
-
packet
packet
ACKs
QCN (Quantized Congestion Notification)
▐
QCN: Congestion Control Mechanism at Layer 2
Proposed by our group and accepted in IEEE 802.1Qau
▐
Terminology:
Congestion Point: Where congestion occurs, mainly switches
Reaction Point: Source of traffic, mainly rate limiters in Ethernet
NICs
Reaction Point
Reaction Point
Congestion Point
Congestion Point
Feedback Message Fb Fb Traffic from other sources Switch NIC▐
Feedback Processing
Calculate Feedback value (Fb) from Switch Queue length
Send feedback to Reaction Point (RP)
▐
Fb sampling probability
No Congestion
:
Pmin(1%)
Under Congestion
:
Pmax(10%) at maximum
▐
Fb Calculation
QCN Algorithm: Congestion Point (CP)
)
Q
w
Q
(
F
b
off
delta Qeq : Qlen in equibriumQold : Queue length at previous processing
Q: Current Queue Length
Fb : Feedback (quantized to 6bits) w: fixed parameter (= 2)
Qeq Q Qold Qdelta Qoff Qoff: Offset eq off
Q
Q
Q
Qdelta: Variance old deltaQ
Q
Q
|Fb| Re fle ct io n Pr ob ab ili ty Pmin PmaxQCN Algorithm: Reaction Point (RP)
▐
Rate Control
Rate Decrease
:
Feedback from CP
Rate Increase: Increase by itself
▐
Rate Increase Algorithm
:
Averaging Principle
Based on BIC-TCP
:
Works without ACK
Target Rate (TR) Current Rate (CR)
Fast Recovery
Fast Recovery
Active
Active
Increase
Increase
HyperHyper--Active Active Increase Increase Receive Feedback Receive Feedback ) F G CR(1 CR d b CR TR 1/2) F (Gd bmax TR)/2 (CR CR TR TR AI R TR TR TR)/2 (CR CR CR (CR TR)/2 HAI iR TR TR Time Rate Trigger
Current QCN Implementation
▐
▐
World
World
’
’
s first Hardware QCN Full implementation
s first Hardware QCN Full implementation
It is fully functional
We confirmed that the result matches OMNet++ simulation result!
▐
QCN-NIC
Multiple Reaction Points
Improved Token Bucket Rate Limiter
▐
QCN-Switch
Multiple Congestion Points
▐
Compliant with Pseudo Code v2.3 [1]
1Gpbs Platform (NetFPGA)
Packet Format
▐
Data Frame (Normal Ethernet UDP/IP Frame)
▐
QCN Fb Frame(64Bytes)
SRC MAC[48:32]
DST MAC[47:0]
63 0
SRC MAC[31:0] Type =16’h0800
IP and UDP Header
Extract as flowid[7:0] at QCN Switch
SRC MAC[48:32]
DST MAC[47:0]
63 0
SRC MAC[31:0] Type = 16’hXXXX QCN Field[15:0]
flowid[7:0] qcn_fb[5:0]
Ethernet Padding
reserved(2bits) QCN Field[15:0]
QCN NIC Block Diagram
User Data Path
Input Arbiter
Demux Mux Demux
Delay Line
SRAM
Receive Packet parser QCN Terminator Fb Demux qcn_fb, flowid Rate Calculator Timer Byte Count Token Bucket Rate Limiter UDP Pkt Gen Reaction Point 0 Demux Mux Reaction Point 1 Reaction Point 2 Reaction Point 3Token Bucket Rate Limiter
▐
Token bucket algorithm avoiding burstiness
Packet
Increase token_fill size tokens If refresh timer has expired Decrease packet size tokens
in sending packet
Bucket Size B
Wait for being allowed to send tokens
Send Condition:
tokens in the bucket > 0
Send Condition:
tokens in the bucket > 0 Send Condition:
tokens in the bucket >= token size
Send Condition:
tokens in the bucket >= token size
Causing burstiness!
Previous implementation Current implementation
QCN Switch Module Structure
User Data Path
Fb Calculator Qlen Monitor Fb Generator Prioritized Mux Output Queues Input Arbiter Packet Parser Port Byte Count Receive Block Fwd Port Congestion Point src/dst mac src/dst port Demux 4port Prioritized Mux Prioritized Mux Prioritized Mux Rate Limiter
SRAM
QCN Hardware Evaluation System
QCN-NIC0 QCN-Switch Port1 MAC Port2 MAC QCN-NIC1 CP LimiterRate Fb Mux Demux drop RP0 RP1 RP2 RP3 Receive Mux MAC Delay Line RP0 RP1 RP2 RP3 Receive Mux MAC Delay Line Port0 MAC Output Queue (Port0) Fwd Port▐
Number of Reaction Points (RPs): 1-8
Each RP generates 1Gbps UDP Traffic
▐
Limit Rate at Switch:
950Mbps(3.7sec) 200Mbps(3.7sec) 950Gbps(3.7sec)
Limit Rate
(950M200M950M) to check flow control
behavior
4RPs
Parameters (QCN)
▐
NIC
FAST_RECOVERY_THRESHOLD = 5
AI_INC = 0.5 Mbps
HAI_INC = 5 Mbps
BC_LIMIT = 150 KB (30% randomness)
TIMER_PERIOD = 25 ms (30% randomness)
MIN_RATE = 0.5 Mbps
GD = 1/128
▐
Switch
Quantized_Fb: 6 bits
Q_EQ = 33 KB
W = 2
Base marking = 150 KB, and varies according to the lookup table in
the pseudo code (30% randomness)
Based on QCN Pseudo Code
Hardware Evaluation Result (1RP) , RTT=100usec
Switch: Queue Length
NIC: Rate
0 20 40 60 80 100 120 140 1 3 5 7 9 11Evaluation Time [sec]
Q u eu e S iz e [ K B y te s ] 0 200 400 600 800 1000 1 3 5 7 9 11
Evaluation Time [sec]
R a te [ M b p s ]
Queue length rises temporarily in limiting Switch rate
but go back to Qeq in shorter time.
Queue length rises temporarily in limiting Switch rate
but go back to Qeq in shorter time.
The NIC rate behavior well follows the limited Switch rate
by QCN Algorithm
The NIC rate behavior well follows the limited Switch rate
OMNet++ Simulation Result (1RP) , RTT=100usec
Switch: Queue Length
NIC: Rate
0 20 40 60 80 100 120 140 1 3 5 7 9 11Simulation Time [sec]
Q ueue S iz e [ K B y te s ] 0 200 400 600 800 1000 1 3 5 7 9 11
Simulation Time [sec]
R a te [ M b p s ]
Simulation result well matches the Hardware Evaluation Result !
Hardware Evaluation Result (8RPs) , RTT=1msec
Switch: Queue Length
NIC: Rate
0 20 40 60 80 100 120 140 1 3 5 7 9 11Evaluation Time [sec]
Q u e u e S iz e [ K B y te s ] 0 200 400 600 800 1000 1 3 5 7 9 11
Evaluation Time [sec]
R a te [ M b p s ]
It also matches OMNet++ Simulation Result with the same parameter
It also matches OMNet++ Simulation Result with the same parameter
Wiggling because of longer delay but still keeps the same queue length.
Wiggling because of longer delay but still keeps the same queue length.
The aggregate NIC rate behavior still well follows the Switch limited rate
The aggregate NIC rate behavior still well follows the Switch limited rate
Conclusion
▐
We presented/demonstrated QCN theory and implementation
▐
QCN Implementation on NetFPGA Board
All QCN Functions are implemented (Pseudo Code v2.3)
QCN-NIC
•
Multiple RPs are ready
•
Improved Token Bucket Rate Limiter
QCN-Switch
• Multiple CPs are ready
• Binary Search Logic for Fb Calculation