IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 6, NO. 9, DECEMBER 1988 1587
Queueing in High-Performance Packet Switching
Abstract-Because of the unscheduled nature of arrivals to a packet switch, two or more packets may arrive on different inputs destined for the same output. The switch architecture may allow one of these packets to pass through to the output, but the others must be queued for later transmission. We study the performance of four different approaches for providing the queueing necessary to smooth fluctuations in packet arrivals to a high-performance packet switch. They are 1) input queueing where a separate buffer is provided at each input to the switch; 2) input smoothing where a frame of b packets is stored at each of the N input lines to the switch and simultaneously launched into a switch fabric of size Nb × Nb; 3) output queueing where packets are queued in a separate first-in first-out (FIFO) buffer located at each output of the switch; and 4) completely shared buffering where all queueing is done at the outputs and all buffers are completely shared among all the output lines. Input queues saturate at an offered load that depends on the service policy and the number of inputs N, but is approximately 0.586 with FIFO buffers when N is large. At the expense of an increase in the switch fabric size and latency, the lost packet rate for input smoothing can be made small by increasing the frame size b. Output queueing and completely shared buffering both achieve the optimal throughput-delay performance for any packet switch. However, compared to output queueing, completely shared buffering requires less buffer memory at the expense of an increase in switch fabric size.
I. INTRODUCTION
In the move toward high-performance packet switching for integrated service networks [1] and multiprocessor interconnects [2], attention is focusing on packet-switching architectures that provide many simultaneous input/output paths through the switch fabric and allow the internal paths to be time-multiplexed in a statistical rather than deterministic fashion. Such architectures provide the capability for high-speed transmission (1-200 Mbits/s) on each input/output with a total switch capacity of 1-200 Gbits/s. Because of the high-speed operation of the switch, the processing of packets is largely hardware-based, with packet headers containing address information that is used by the switch fabric to route packets from inputs to outputs on the switch. Depending on its design, the switch fabric may be blocking. That is, the switch fabric may be unable to provide simultaneous, independent paths between arbitrary pairs of inputs and outputs.
However, even if the switch fabric is nonblocking, congestion in the switch will still arise because, unlike a circuit switch, arrivals to a packet switch are unscheduled: two or more packets may arrive simultaneously on different inputs destined for the same output. One of these contending packets for an output may be allowed to pass through the switch, but the others must be queued for later transmission on the output. This form of congestion is unavoidable in a packet switch and dealing with it often represents the greatest source of complexity in the switch architecture.
Manuscript received November 2, 1987; revised June 11, 1988. This paper was presented at INFOCOM ’88, New Orleans, LA, March 1988.
This work was performed while M. G. Hluchyj was with AT&T Bell Laboratories.
M. G. Hluchyj is with Codex Corporation, Mansfield, MA 02048.
M. J. Karol is with AT&T Bell Laboratories, Holmdel, NJ 07733.
IEEE Log Number 8824400.
In this paper, we examine the performance of four different approaches for providing the queueing necessary to smooth the statistical fluctuations in packet arrivals to the switch. The switch fabric in all cases is assumed to be nonblocking and, as illustrated in Fig. 1, operates synchronously with fixed-length packets arriving on the N inputs in a time-slotted fashion. The four different approaches to packet queueing in the switch are described in Section II, and the performance of each is analyzed in Section III.
II. PACKET QUEUEING ARCHITECTURES
Fig. 2 illustrates four approaches to providing the queueing for a high-performance packet switch. In this section, we describe how each functions to smooth (in time) the packet arrivals destined to a common output.
A. Input Queueing
With input queueing, illustrated in Fig. 2(a), a separate buffer is placed on each input to the switch. Each arriving packet enters, at least momentarily, the buffer on its input where it awaits access to the switch fabric. Initially, we assume the buffers are served first-in first-out (FIFO), so that at the beginning of each time slot only the packets at the heads of the FIFO’s contend for access to the switch outputs. If every packet is addressed to a different output, the nonblocking switch fabric allows each to pass through to its respective output. If k packets at the heads of the input FIFO’s are addressed to a particular output, one is allowed to pass through the switch fabric, while the other k - 1 must wait until the next time slot, when a new selection is made among the packets that are then waiting.
Note that while a packet is waiting its turn for access to an output, other packets may be queued behind it in the FIFO and, consequently, blocked from reaching possibly idle outputs on the switch. As we shall see in Section III-A, this results in a maximum throughput, for large N, of (2 - √2) = 0.586 for input queueing with FIFO buffers. The throughput can be increased by relaxing the strict first-in first-out queueing discipline of the input buffers.
0733-8716/88/1200-1587$01.00 © 1988 IEEE
Fig. 2. Four approaches to providing the queueing for a high-performance packet switch: (a) input queueing, (b) input smoothing, (c) output queueing, (d) completely shared buffering.
Each input still sends, at most, one packet into the switch fabric per time slot, but not necessarily the first packet in its queue, and no more than one packet is allowed to pass through the switch fabric to each output in a time slot.
For example, at the beginning of each time slot, suppose the first w packets in each input queue sequentially contend for access to the switch outputs. The packets at the heads of the input queues contend first for access to the switch outputs. Those inputs not selected to transmit the first packets in their input queues then contend with their second packets for access to any remaining idle outputs (i.e., outputs not yet assigned to receive packets in this time slot). The contention process is repeated up to w times at the beginning of each time slot, sequentially allowing the w packets in an input buffer’s “window” to contend for any remaining idle outputs, until the input is selected to transmit a packet. A window size of w = 1 corresponds to input queueing with FIFO buffers.
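The windowed contention procedure just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function name `contend_with_window`, the representation of each packet by its destination output number, and the uniform random tie-break among contenders at every window depth are all assumptions made for the example.

```python
import random

def contend_with_window(queues, w, rng=random):
    """Run one time slot of windowed input contention (hypothetical helper).

    queues[i] lists the destinations of the packets waiting at input i,
    head of the queue first. Returns a dict mapping each selected input
    to the queue position of the packet it transmits in this slot.
    """
    selected = {}         # input index -> queue position chosen
    busy_outputs = set()  # outputs already assigned in this slot
    for depth in range(w):
        # Inputs not yet selected offer the packet at this window depth,
        # but only if its output is still idle.
        requests = {}
        for i, q in enumerate(queues):
            if i in selected or depth >= len(q):
                continue
            dest = q[depth]
            if dest not in busy_outputs:
                requests.setdefault(dest, []).append(i)
        # Each requested idle output accepts one contender (assumed: at random).
        for dest, inputs in requests.items():
            winner = rng.choice(inputs)
            selected[winner] = depth
            busy_outputs.add(dest)
    return selected
```

With `w = 1` only the head packets contend, which is the FIFO case. For two inputs that both hold packets for outputs 0 and 1 in that order, `w = 1` lets one packet through, while `w = 2` lets the losing input send its second packet to the idle output 1.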
B. Input Smoothing
Fig. 2(b) illustrates an arrangement where the arriving packets are not so much queued at each input but smoothed; hence, the name input smoothing. Specifically, the packets within a frame of b time slots are stored at each of the N inputs (i.e., demultiplexed) and simultaneously launched into a switch fabric of size Nb × Nb. At most, Nb packets enter the fabric, of which b can be simultaneously received at each output where the packets are then multiplexed onto the output line. Any more than b packets destined for an output are dropped (i.e., lost) within the switch fabric. In Section III-B, we show that the probability of dropping a packet can be made small by making the frame size b large. This is analogous to fixed-length source coding in information theory where code words are only assigned to a subset of likely source sequences. By making the source sequence sufficiently long, the probability of a source sequence generated for which there is no assigned code word can be made arbitrarily small [3].
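The per-frame drop rule can be made concrete with a small sketch. The function name `launch_frame` and the flat list of destinations are assumptions for illustration; the only rule taken from the text is that each output accepts at most b packets per frame and the excess is lost.

```python
from collections import Counter

def launch_frame(frame_destinations, b):
    """Count delivered and dropped packets for one frame (hypothetical helper).

    frame_destinations holds the destination output of every packet
    (up to N*b of them) entering the Nb x Nb fabric in this frame.
    """
    per_output = Counter(frame_destinations)
    delivered = sum(min(k, b) for k in per_output.values())
    dropped = sum(max(0, k - b) for k in per_output.values())
    return delivered, dropped
```

For example, with b = 2 and destinations `[0, 0, 0, 1, 2, 2]`, output 0 receives three packets and must drop one, so five packets are delivered and one is lost.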
Note that although the switch fabric has been enlarged from N × N to Nb × Nb, the speed at which each input to the fabric operates can be reduced by a factor of b. The Starlite Digital Switch [4] uses demultiplexing¹ to reduce the required switch fabric speed relative to the incoming line speed. Its use as a means to smooth traffic arrivals does not seem to have been exploited in any proposed switch architecture.
¹With Starlite, fixed-length packets arrive to the switch multiplexed bit-by-bit (i.e., the first b bits of the frame correspond to the first bit of each of the b packets, the next b bits correspond to the second bit, and so on). This has the same smoothing effect as the “packet multiplexed” approach analyzed here. However, bit-by-bit multiplexing reduces the latency through the switch since only b bits, rather than an entire frame of b packets, have to be accumulated at the input before entering the switch fabric.
HLUCHYJ AND KAROL: QUEUEING IN HIGH-PERFORMANCE PACKET SWITCHING
C. Output Queueing
With output queueing, shown in Fig. 2(c), all queueing is done at the outputs of the switch with a separate b packet FIFO provided for each output. One can think of the switch fabric as operating N times as fast as the inputs and outputs, so that if k (k = 1, ..., N) packets arrive in a time slot on different inputs all addressed to the same output, all k can be routed through the switch fabric and into the proper output FIFO within one time slot. Only one packet, however, can be transmitted on the output line in a time slot; the remaining k - 1 packets must wait in the output FIFO for transmission during subsequent time slots.
Note that with output queueing, unlike input queueing, arriving packets addressed to one output do not interfere with (i.e., block or delay) packets going to different outputs. It is only at each output that one finds the unavoidable congestion caused by multiple packets simultaneously arriving on different inputs addressed to the same output. The waiting time performance for output queueing represents the best achievable by any approach.
It is possible to implement output queueing without the N times speed-up of the switch fabric. The Knockout Switch [5], having a fully interconnected switch fabric topology, uses an N to L concentrator at each output to reduce the number of buffers needed to receive simultaneously arriving packets. Packet loss is inevitable in any packet network; with L = 8, the probability of losing a packet in the concentrator is under $10^{-6}$ for an arbitrarily large switch size N. A novel buffering scheme, combining L separate FIFO’s into the equivalent of a single FIFO with L inputs and one output, is then used to queue at the output. Hence, the Knockout Switch achieves output queueing without requiring a speed-up of the switch fabric.
D. Completely Shared Buffering
The buffer architecture shown in Fig. 2(d) still provides for output queueing, but rather than have a separate buffer for each output, all memory is pooled into one completely shared buffer. Fig. 2(d) represents the architecture for the Starlite Digital Switch with a trap [4]. The common buffer has a separate input and separate output for each of up to Nb packets, and the switch fabric is enlarged from N × N to N(b + 1) × N(b + 1).² Up to N new packet arrivals to the switch and up to Nb buffered packets enter the switch fabric at the beginning of each time slot. If k (k = 1, 2, ..., N(b + 1)) packets are addressed to the same output, the switch fabric will route one to the output and the remaining k - 1 will be routed to k - 1 of the Nb inputs to the shared buffer. These k - 1 packets will wait until the beginning of the next time slot before reentering the switch fabric along with the other stored packets and any new arrivals on the inputs. The packets continue to recirculate through the switch fabric and shared buffer, with the output removing one packet from the group each time slot.
²We assume no input smoothing as described in Section II-B.
Effectively, a separate queue is formed for each output of the switch, but physically, all queued packets in the switch share the same buffer space. We shall see in the next section that this sharing allows one to reduce the total amount of buffering in the switch, but at the expense of an increase in the size of the switch fabric.
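The recirculation behavior in one time slot can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `shared_buffer_slot` is hypothetical, buffered packets are assumed to be served ahead of new arrivals, and packets that find the shared buffer full are assumed to be dropped in arrival order. The text does not specify either policy.

```python
def shared_buffer_slot(buffered, arrivals, capacity):
    """One time slot of completely shared buffering (hypothetical helper).

    buffered and arrivals are lists of destination outputs; capacity is
    the total shared buffer size (N*b in the text's notation).
    Returns (outputs served, new buffer contents, number dropped).
    """
    served = set()
    new_buffer = []
    dropped = 0
    # Assumption: recirculating packets are considered before new arrivals.
    for dest in buffered + arrivals:
        if dest not in served:
            served.add(dest)          # one packet per output per slot
        elif len(new_buffer) < capacity:
            new_buffer.append(dest)   # recirculate through the shared buffer
        else:
            dropped += 1              # shared buffer is full
    return sorted(served), new_buffer, dropped
```

Calling `shared_buffer_slot([0, 0], [0, 1, 1], 2)` serves outputs 0 and 1, keeps two packets for output 0 in the buffer, and drops the second packet for output 1, which shows how traffic for one output can crowd out another.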
III. PERFORMANCE ANALYSIS
In this section, we analyze and compare the performance of the four queueing architectures described in the previous section. In each case, we determine the probability of packet loss and the expected packet waiting time in the switch. In all cases, we model the packet arrivals on the N inputs by independent and identical Bernoulli processes. That is, in any given time slot, the probability that a packet will arrive on a particular input is p; each packet has equal probability 1/N of being addressed to any given output, and successive packets are independent.
A. Input Queueing
In this section, we concentrate primarily on the performance of input queueing with FIFO buffers. Unlike the other three architectures, the first-in first-out queueing discipline of input queueing limits the maximum throughput of the switch. Specifically, packets within the input FIFO’s are prevented from reaching idle outputs, blocked by packets at the heads of the FIFO’s contending for common outputs. At the end of this section, we show that the throughput can be increased by relaxing the strict first-in first-out queueing discipline of the input buffers.
To determine the maximum throughput of the switch, we examine the case where all the input queues are saturated. That is, packets are always waiting in every input FIFO, and whenever a packet is transmitted through the switch, a new packet immediately replaces it at the head of the input queue. We assume that if there are k packets waiting at the heads of input queues addressed to the same output, the selection of one to pass through the switch is done at random, each having equal probability (1/k) of being selected.
Following the analysis in [6], we define $B_m^i$ as the number of packets at the heads of the input queues destined for output i in the mth time slot, but not selected to pass through the switch. We define $A_m^i$ as the number of packets moving to the heads of the input queues during the mth time slot and destined for output i. Note that a packet can only move to the head of an input queue if, in the previous time slot, a packet was removed from that queue for transmission on an output. It follows that

$$B_m^i = \max\left(0,\; B_{m-1}^i + A_m^i - 1\right). \tag{1}$$

Although $B_m^i$ does not represent the occupancy of any physical queue, notice that (1) has the same form as the fundamental queueing relation for a single-server queueing system [7].
With each new packet arrival to the head of an input queue having equal probability 1/N of being addressed to any given output, $A_m^i$ has the binomial probabilities

$$\Pr[A_m^i = k] = \binom{F_{m-1}}{k} \left(\frac{1}{N}\right)^{k} \left(1 - \frac{1}{N}\right)^{F_{m-1}-k}, \qquad k = 0, 1, \ldots, F_{m-1} \tag{2}$$

where

$$F_{m-1} = N - \sum_{i=1}^{N} B_{m-1}^i. \tag{3}$$

$F_{m-1}$ represents the total number of packets transmitted through the switch during the (m - 1)st time slot, which is also equal to the total number of input queues with new packets at their heads in the mth time slot. That is,

$$F_{m-1} = \sum_{i=1}^{N} A_m^i. \tag{4}$$

Note that $\bar{F}/N = \rho_0$ where $\bar{F}$ is the mean steady-state number of packets passing through the switch and $\rho_0$ is the utilization of the output lines (i.e., the normalized switch throughput). In addition, $A^i$, the steady-state number of packets addressed to output i that move to the head of input queues each time slot, becomes Poisson with rate $\rho_0$ as N → ∞ [6]. These observations imply that (1) is driven by the same Markov process as an M/D/1 queue.

Using the results for the mean steady-state queue size for an M/D/1 queue [7], for N = ∞ we have

$$\bar{B}^i = \frac{\rho_0^2}{2(1 - \rho_0)}. \tag{5}$$

From (3) and $\bar{F}/N = \rho_0$, we also have

$$\bar{B}^i = 1 - \rho_0 \tag{6}$$

as N → ∞. Equating (5) and (6) gives $\rho_0^2 - 4\rho_0 + 2 = 0$, whose only root in [0, 1] is $\rho_0 = (2 - \sqrt{2}) = 0.586$ when the switch is saturated and N = ∞.

For small values of N, a Markov chain analysis of the system throughput can be done, yielding the results given in Table I [6], [8]. From Table I and the simulation results shown in Fig. 3, note the rapid convergence to the asymptotic throughput of 0.586.
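The saturation model lends itself to a short Monte Carlo check. The sketch below is not the simulation used for Fig. 3; the function name `saturated_throughput`, the slot count, and the seed are arbitrary choices for illustration. It tracks only the destination of the head-of-line packet at each saturated input, picks one contender per output uniformly at random as assumed in the analysis, and replaces each transmitted packet with a fresh one addressed uniformly over the N outputs.

```python
import random

def saturated_throughput(n, slots=20000, seed=1):
    """Estimate the normalized throughput of an n x n switch with
    saturated FIFO input queues (hypothetical helper)."""
    rng = random.Random(seed)
    # Destination of the head-of-line packet at each input.
    heads = [rng.randrange(n) for _ in range(n)]
    sent = 0
    for _ in range(slots):
        # Group the contending inputs by the output they request.
        contenders = {}
        for inp, dest in enumerate(heads):
            contenders.setdefault(dest, []).append(inp)
        for dest, inputs in contenders.items():
            winner = rng.choice(inputs)        # random selection, prob 1/k each
            heads[winner] = rng.randrange(n)   # a new packet moves to the head
            sent += 1
    return sent / (slots * n)
```

For n = 2 the estimate should sit close to 0.75, and for larger n it should drift down toward the asymptotic value of 2 - √2 derived in this section.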
Before saturation, a discrete-time Geom/G/1 queueing model is used to determine an exact formula for the expected waiting time for the limiting case N = ∞ [6]. The arrival process to each input queue is Bernoulli: a packet arrives independently in each time slot with probability p, equally likely destined for each output. The “service time” for a packet at the head of an input queue addressed to output j consists of the wait until it is randomly selected among all packets at the heads of input queues contending for output j, plus one time slot for its transmission through the switch. As N → ∞, the steady-state number of packet “arrivals” to the heads of input queues, and addressed to output j, becomes Poisson with rate $\rho_0$. Hence, the service time distribution for the discrete-time Geom/G/1 model is itself the packet delay distribution of another queueing
TABLE I
THE MAXIMUM THROUGHPUT ACHIEVABLE USING INPUT QUEUEING WITH FIFO BUFFERS
Fig. 4. The mean waiting time for input queueing with FIFO buffers for the limiting case of N = ∞ (horizontal axis: offered load).
TABLE II
THE MAXIMUM THROUGHPUT ACHIEVABLE WITH INPUT QUEUEING FOR VARIOUS SWITCH SIZES N AND WINDOW SIZES w (columns: window size w = 1 through 8)
... window sizes (N and w, respectively). The values were obtained by simulation. Note that a big increase in the achievable throughput is possible by increasing the window size w from w = 1 (i.e., FIFO buffers) to w = 2, 3, and 4, with diminishing improvements thereafter. However, input queueing with even an infinite window (w = ∞) does not attain the optimal throughput-delay performance of output queueing and completely shared buffering. Input queueing limits each input to send at most one packet into the switch fabric per time slot, which can prevent packets from reaching idle outputs.
B. Input Smoothing
With input smoothing, the packets within a frame of b time slots are stored at each input and then enter the switch fabric together on separate input ports [Fig. 2(b)]. With each output connected to exactly b output ports on the switch fabric, if k > b packets enter the switch fabric destined for a given output, then k - b packets will be lost. Defining the random variable A as the number of packets entering the switch fabric destined for a given output, we have

$$\Pr[A = k] = \binom{Nb}{k} \left(\frac{p}{N}\right)^{k} \left(1 - \frac{p}{N}\right)^{Nb-k}, \qquad k = 0, 1, \ldots, Nb. \tag{8}$$

Hence, the probability that a packet is lost within the fabric is given by

$$\Pr[\text{packet loss}] = \frac{1}{bp} \sum_{k=b+1}^{Nb} (k - b)\,\Pr[A = k] = 1 - \frac{1}{p} + \frac{1}{bp} \sum_{k=0}^{b-1} (b - k)\,\Pr[A = k]. \tag{9}$$

Taking the limit as N → ∞, we obtain

$$\Pr[\text{packet loss}] = 1 - \frac{1}{p} + \frac{e^{-bp}}{bp} \sum_{k=0}^{b-1} (b - k)\,\frac{(bp)^k}{k!}. \tag{10}$$

The packet loss probability increases with increasing N, and so (10) represents an upper bound on the lost packet performance for all finite N. As illustrated in Fig. 5(a) and (b), the bound is tight for N > 16. Fig. 6 shows, for N = ∞, the lost packet performance of input smoothing as a function of the frame size b for offered loads between 0.7 and 0.95. The y axis has been scaled to make it easier to compare the performance of input smoothing to the other packet queueing approaches. Note from Fig. 6 that the decrease in the packet loss probability with increasing frame size b is slow. For example, to achieve a lost packet probability of [...] at an offered load of 85 percent (i.e., p = 0.85) requires the frame size b > 100.

For those packets not dropped in the switch fabric, the mean packet waiting time (measured in packet time slots) $\bar{W}$ is given by

$$\bar{W} = \frac{b-1}{2} + (b-1) + \frac{\displaystyle\sum_{k=1}^{b-1} k\,\frac{k-1}{2}\,\Pr[A = k] + b\,\frac{b-1}{2}\,\Pr[A \ge b]}{\displaystyle\sum_{k=1}^{b-1} k\,\Pr[A = k] + b\,\Pr[A \ge b]}. \tag{11}$$
Fig. 5. The packet loss probability for input smoothing as a function of the buffer size (frame size) b and the switch size N, for offered loads (a) p = 0.8 and (b) p = 0.9.
Equation (11) follows from the timing diagram in Fig. 7. The first term on the right-hand side of (11) is the expected amount of time a packet has to wait while the frame is being stored at the inputs. The second term is the delay resulting from the fabric running at 1/b the speed of the inputs and outputs. The last term represents the expected waiting time in the multiplexing operation at the outputs.
Fig. 6. The packet loss probability for input smoothing as a function of the buffer size (frame size) b and offered loads varying from p = 0.70 to p = 0.95, for the limiting case of N = ∞.
Fig. 7. The timing diagram used to compute the mean waiting time for input smoothing.
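Using (8), (11) may be rewritten as

$$\bar{W} = \frac{3(b-1)}{2} + \frac{\dfrac{b(b-1)}{2}\left(1 - \Pr[A = 0]\right) + \displaystyle\sum_{k=1}^{b-1} \frac{k(k-1) - b(b-1)}{2}\,\Pr[A = k]}{b - \displaystyle\sum_{k=0}^{b-1} (b-k)\,\Pr[A = k]} \tag{12}$$

with Pr[A = k] given by (8). Taking the limit as N → ∞, we obtain

$$\bar{W} = \frac{3(b-1)}{2} + \frac{\dfrac{b(b-1)}{2}\left[e^{bp} - 1\right] + \displaystyle\sum_{k=1}^{b-1} \frac{k(k-1) - b(b-1)}{2}\,\frac{(bp)^k}{k!}}{b\,e^{bp} - \displaystyle\sum_{k=0}^{b-1} (b-k)\,\frac{(bp)^k}{k!}}. \tag{13}$$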
The mean waiting time for input smoothing is plotted in Fig. 8 against the offered load p for N = ∞ and various values of the frame size b. The mean packet waiting time curves for finite N ≥ 2 are only slightly below those shown in Fig. 8. Note from Fig. 8 and (12) and (13), that the mean waiting time increases proportionally with b.
In summary, for input smoothing to achieve a low packet loss probability requires a large frame size b. Unfortunately, a large frame size increases the size of the Nb × Nb switch fabric and also the packet delay through the switch. Hence, although intellectually interesting, input smoothing does not seem to have much practical value, other than allowing the switch fabric to run b times slower than the input and output lines.
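The N = ∞ expressions for input smoothing are easy to evaluate numerically. The sketch below assumes the Poisson limit with mean bp and evaluates the loss probability of (10) and the mean waiting time of (13); the function names are hypothetical, and (13) is computed after multiplying its numerator and denominator by exp(-bp), which leaves the ratio unchanged but avoids overflow for large b.

```python
import math

def poisson_pmf(mean, kmax):
    """Return [Pr[A = k] for k = 0..kmax] for a Poisson variable."""
    pmf = [math.exp(-mean)]
    for k in range(1, kmax + 1):
        pmf.append(pmf[-1] * mean / k)
    return pmf

def smoothing_loss(b, p):
    """Packet loss probability for N = infinity, as in equation (10)."""
    a = poisson_pmf(b * p, b - 1)
    return 1.0 - 1.0 / p + sum((b - k) * a[k] for k in range(b)) / (b * p)

def smoothing_wait(b, p):
    """Mean waiting time in slots for N = infinity, as in equation (13),
    with numerator and denominator both scaled by exp(-b*p)."""
    a = poisson_pmf(b * p, b - 1)
    num = 0.5 * b * (b - 1) * (1.0 - a[0])
    num += sum(0.5 * (k * (k - 1) - b * (b - 1)) * a[k] for k in range(1, b))
    den = b - sum((b - k) * a[k] for k in range(b))
    return 1.5 * (b - 1) + num / den
```

Evaluating `smoothing_loss` over a range of frame sizes at a fixed load shows how slowly the loss falls with b, while `smoothing_wait` grows roughly in proportion to b, which is the tradeoff summarized in this section.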
C. Output Queueing
With output queueing, all queueing is done at the out- puts with a separate b packet FIFO at each output of the switch fabric [Fig. 2(c)]. In the analysis, we fix our at- tention on a particular (i.e., tagged) output queue. Defin- ing the random variable A as the number of packet arrivals destined for the tagged output in a given time slot, we have
which, for N = 00, becomes
A p ke -P
cdk = Pr [A = k ] = - k !
k = 0, 1, 2, * * e , N = 00. (15) Letting Q , denote the number of packets in the tagged queue at the end of the mth time slot, and A, denote the number of packet arrivals during the mth time slot, we have
Q, = min {max (0, Q , - l + A , - l ) , b } . (16) When Q , - = 0 and A,
>
0, one of the arriving packets is immediately transmitted during the mth time slot; that is, a packet flows through the switch without suffering any delay. For N = 00 and b = 0 0 , the queue size Q , can be modeled by a M I D / 1 queue [ 6 ] . For finite N and b , we model Q , by a finite-state, discrete-time Markov chainwith state transition probabilities Pi A = Pr [ Q, = j
1
Q, -= i ] given by Qo + Q1 QO
i = 0 , j = 0
1 I i I b, j = i - 1 1 s j s b - 1 , 0 I i s j
N
a, j = b , O s i s j
m = j - i + l
0 otherwise ( 1 7 )
(18) A ( 1 - Qo - Q l )
.
q1 = Pr [ Q = 13 = 40
QO
where ak is given by (14) and (15) for N
<
00 and N =00, respectively. The steady-state queue size can be ob- tained directly from the Markov chain balance equations to yield
where
A packet will not be transmitted on the tagged output line during the mth time slot if, and only if, Q,- = 0 and A, = 0. Therefore, letting po denote the normalized switch throughput, we have
Po = 1 - qoao. (21)
A packet will be lost if, when emerging from the switch fabric, it finds the output queue already containing b packets. Dividing the utilization of the output line po by the arrival rate p, we obtain the packet success probabil- ity. Therefore,
(22) P
Pr [packet loss] = 1 - P
Fig. 9(a) and (b) show the lost packet performance for output queueing as a function of the output FIFO size b for various numbers of users N and offered loads p = 0.8 and 0.9. Note that at an 80 percent offered load, with b = 28, the probability of a lost packet is under $10^{-6}$ for arbitrarily large N. Note also that the N = ∞ curve is a good approximation for finite N > 32. In Fig. 10, for N = ∞, we plot the lost packet performance for output queueing against the output FIFO size b for offered loads p varying from 0.70 to 0.95.
Output queueing achieves the optimal throughput-delay performance because packets are delayed only by the unavoidable congestion caused by two or more packets simultaneously arriving on different inputs destined for the same output. The mean waiting time $\bar{W}$ for a packet making it into an output FIFO is obtained from Little's result

$$\bar{W} = \frac{\bar{Q}}{\rho_0} = \frac{\sum_{n=1}^{b} n\, q_n}{1 - q_0 a_0}. \tag{23}$$

The mean waiting time for output queueing is shown in Fig. 11 as a function of the offered load p for N = ∞ and various values of the output FIFO size b. When N = ∞ and b = ∞, we have the mean waiting time for an M/D/1 queue [7]

$$\bar{W} = \frac{p}{2(1 - p)}. \tag{24}$$
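The output queueing results can be evaluated with a short recursion. The sketch below is an illustration rather than the authors' program: it assumes the arrival probabilities of (14) for finite N or (15) for N = ∞, builds the ratios $q_n/q_0$ from the balance equations in the recursive form of (18) and (19), normalizes them as in (20), and then applies (21) to (23). The function names and the convention `n=None` for N = ∞ are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math

def arrival_pmf(p, b, n=None):
    """Return [a_0, ..., a_b]: binomial for finite n, Poisson if n is None."""
    if n is None:
        return [math.exp(-p) * p**k / math.factorial(k) for k in range(b + 1)]
    return [math.comb(n, k) * (p / n)**k * (1 - p / n)**(n - k) if k <= n else 0.0
            for k in range(b + 1)]

def output_queue(p, b, n=None):
    """Return (packet loss probability, mean waiting time in slots)
    for an output FIFO holding up to b packets (b >= 1)."""
    a = arrival_pmf(p, b, n)
    r = [1.0, (1 - a[0] - a[1]) / a[0]]          # r[k] = q_k / q_0, first step
    for k in range(2, b + 1):                    # remaining balance equations
        tail = sum(a[j] * r[k - j] for j in range(2, k + 1))
        r.append(((1 - a[1]) * r[k - 1] - tail) / a[0])
    q0 = 1.0 / sum(r)                            # normalization
    q = [q0 * x for x in r]
    rho0 = 1 - q[0] * a[0]                       # normalized throughput
    loss = 1 - rho0 / p                          # packet loss probability
    wait = sum(k * q[k] for k in range(1, b + 1)) / rho0   # Little's result
    return loss, wait
```

With a large buffer and N = ∞ the computed waiting time should approach p/(2(1 - p)), the M/D/1 value, which gives a quick consistency check. Because the recursion subtracts nearly equal terms, results far below about $10^{-12}$ should be treated with caution in double precision.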
D. Completely Shared Buffering
By increasing the switch fabric size from N × N to N(b + 1) × N(b + 1) and pooling all memory into one completely shared buffer [Fig. 2(d)], we still achieve the optimal throughput-delay performance of output queueing, but save on the total amount of buffering needed to achieve a desired packet loss probability. Because of the statistical nature of packet arrivals, more efficient use is made of the Nb buffer locations when they are shared by all outputs, rather than dedicating b to each of the N outputs.
Fig. 9. The packet loss probability for output queueing as a function of the buffer size b and the switch size N, for offered loads (a) p = 0.8 and (b) p = 0.9.
Packets that enter the buffer will recirculate through the switch fabric and shared buffer until they are transmitted over the appropriate output. Sizing the buffer is equivalent to determining the number of recirculation ports needed in the Starlite Switch. Following a different approach, our results agree with those obtained by Eckberg and Hou [10] for large N. Their analysis includes the negative correlations that exist between packet streams destined for different outputs.
Fig. 10. The packet loss probability for output queueing as a function of the buffer size b and offered loads varying from p = 0.70 to p = 0.95, for the limiting case of N = ∞.
Fig. 11. The mean waiting time for output queueing as a function of the offered load p, for N = ∞ and output FIFO sizes varying from b = 1 to b = ∞.
If $Q_m^i$ denotes the number of packets destined for output i in the buffer at the end of the mth time slot, then $\sum_{i=1}^{N} Q_m^i$ is the total number of packets in the shared buffer at the end of the mth time slot. If the buffer size is infinite (Nb = ∞), then

$$Q_m^i = \max\left(0,\; Q_{m-1}^i + A_m^i - 1\right) \tag{25}$$

where $A_m^i$ is the number of packets addressed to output i that arrive during the mth time slot. With a finite buffer size, packet arrivals destined for some outputs may fill the shared buffer at the expense of other arrivals in the same time slot; the resulting buffer overflow invalidates (25). We will use (25), however, since it is a good approximation in the region of interest: the low packet loss probability region (e.g., less than [...] packet loss probability).
Fig. 12. The packet loss probability for completely shared buffering as a function of the buffer size per output b and the switch size N, for offered loads (a) p = 0.8 and (b) p = 0.9.
INPUT QUEUEING
• Simple queueing structure
• Throughput limited to 0.586 with FIFO buffers
INPUT SMOOTHING
• Number of switch inputs/outputs grows as Nb
• Very large b required for small packet loss probability
• Delay through the switch proportional to b
OUTPUT QUEUEING
• Achieves optimal throughput/delay performance
• Separate buffer for each output
COMPLETELY SHARED BUFFERING
• Achieves optimal throughput/delay performance
• Small total buffer memory required for large N
• Number of switch inputs/outputs grows as N(b + 1)
• Heavily loaded output could affect other outputs
Fig. 13. A performance summary of the four approaches to providing the queueing for a high-performance packet switch.
Fig. 14. A comparison of the packet loss probabilities at an offered load of p = 0.85 (horizontal axis: buffer size b, in packets).
For finite N, $A^i$, the steady-state number of packet arrivals destined for output i, is unfortunately not independent of $A^j$ (j ≠ i). At most N packets arrive to the switch, so a large number of packets arriving for one output implies a small number for the remaining outputs. As N increases, however, the $A^i$ become independent Poisson random variables (each with mean value p), and the steady-state number of packets in the buffer that are destined for output i, $Q^i$, becomes independent of $Q^j$ (j ≠ i). We will use the Poisson and independence assumptions even for finite N, and show that the approximations are good for N ≥ 16.
Fig. 15. A comparison of the mean waiting times for the limiting case of N = ∞ (horizontal axis: offered load p).
Our approach, therefore, is to model $\sum_{i=1}^{N} Q^i$, the steady-state number of packets in the buffer, as the N-fold convolution of N M/D/1 queues. With the assumption of an infinite buffer size, we then approximate the packet loss probability by $\Pr[\sum_{i=1}^{N} Q^i \ge Nb]$. Fig. 12(a) and (b) show the packet loss probability for completely shared buffering as a function of b, the buffer size per output, for various numbers of users N, and offered loads p = 0.8 and 0.9, respectively. The results converge to the asymptotic limit of $p^2/[2(1 - p)]$ recirculation ports per output; $p^2/[2(1 - p)]$ is the mean value of the M/D/1 queue size [6].
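The convolution approximation can be sketched in plain Python. This is an illustrative sketch under the stated Poisson and independence assumptions, not the authors' computation: each per-output queue distribution is obtained by running the balance-equation recursion for relation (25) with Poisson arrivals of mean p (using Pr[Q = 0] = (1 - p)/a_0, which follows from throughput equal to p when nothing is lost), and the N distributions are then convolved. All function names are hypothetical, and small negative values produced by rounding are clipped to zero.

```python
import math

def md1_queue_pmf(p, size):
    """Approximate [Pr[Q = n] for n = 0..size-1] for one output queue
    obeying relation (25) with Poisson(p) arrivals (size >= 2)."""
    a = [math.exp(-p)]
    for k in range(1, size):
        a.append(a[-1] * p / k)
    q = [(1 - p) / a[0]]                          # Pr[Q = 0]
    q.append(q[0] * (1 - a[0] - a[1]) / a[0])
    for n in range(2, size):
        tail = sum(a[k] * q[n - k] for k in range(2, n + 1))
        q.append(max(0.0, ((1 - a[1]) * q[n - 1] - tail) / a[0]))
    return q

def convolve(x, y, size):
    """Convolve two pmfs, keeping only the first `size` terms."""
    out = [0.0] * size
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            if i + j < size:
                out[i + j] += xi * yj
    return out

def shared_buffer_loss(p, n, b):
    """Approximate Pr[total queued packets >= n*b] for n independent queues."""
    size = max(2, n * b)
    single = md1_queue_pmf(p, size)
    total = [1.0] + [0.0] * (size - 1)
    for _ in range(n):
        total = convolve(total, single, size)
    return max(0.0, 1.0 - sum(total[: n * b]))
```

Comparing `shared_buffer_loss(p, n, b)` for n = 1 and for larger n at the same buffer size per output illustrates the saving from sharing. Double-precision rounding limits this sketch to probabilities well above roughly $10^{-13}$, so it cannot reproduce the deepest parts of the curves in Fig. 12.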
In this section, we have computed packet loss probabilities for uniform traffic models. A potential problem with completely shared buffering is that one heavily loaded output might monopolize use of the shared buffer, thereby adversely affecting the performance of other outputs.
IV. CONCLUSION
Figs. 13, 14, and 15 summarize the results of this paper. The throughput of input queueing with FIFO buffers is limited to 0.586, but can be increased by relaxing the strict first-in first-out queueing discipline. Input smoothing increases the throughput, but at the expense of a large increase in switch fabric size and latency. Completely shared buffering requires less buffer memory than output queueing, but requires a larger switch fabric size. Both output queueing and completely shared buffering, however, achieve the optimal throughput-delay performance.
REFERENCES
[1] J. S. Turner and L. F. Wyatt, “A packet network architecture for integrated services,” in Proc. IEEE GLOBECOM ’83 Conf. Rec., Nov. 1983, pp. 45-50.
[2] T.-Y. Feng, “A survey of interconnection networks,” Computer, vol. 14, pp. 12-27, Dec. 1981.
[3] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[4] A. Huang and S. Knauer, “Starlite: A wideband digital switch,” in Proc. IEEE GLOBECOM ’84 Conf. Rec., Nov. 1984, pp. 121-125.
[5] Y. S. Yeh, M. G. Hluchyj, and A. S. Acampora, “The knockout switch: A simple, modular architecture for high-performance packet switching,” IEEE J. Select. Areas Commun., vol. SAC-5, pp. 1274-1283, Oct. 1987.
[6] M. J. Karol, M. G. Hluchyj, and S. P. Morgan, “Input vs. output queueing on a space-division packet switch,” IEEE Trans. Commun., vol. COM-35, pp. 1347-1356, Dec. 1987.
[7] L. Kleinrock, Queueing Systems, Vol. I: Theory. New York: Wiley, 1975.
[8] D. P. Bhandarkar, “Analysis of memory interference in multiprocessors,” IEEE Trans. Comput., vol. C-24, pp. 897-908, Sept. 1975.
[9] T. Meisling, “Discrete-time queueing theory,” Oper. Res., vol. 6, pp. 96-105, Jan.-Feb. 1958.
[10] A. E. Eckberg and T.-C. Hou, “Effect of output buffer sharing on buffer requirements in an ATDM packet switch,” in Proc. INFOCOM ’88, Mar. 1988, pp. 459-466.
Michael G. Hluchyj (S’75-M’82) was born in Erie, PA, on October 23, 1954. He received the B.S.E.E. degree in 1976 from the University of Massachusetts at Amherst and the S.M., E.E., and Ph.D. degrees in electrical engineering from the Massachusetts Institute of Technology, Cambridge, in 1968, 1978, and 1981, respectively.
From 1977 to 1981 he was a Research Assistant in the Data Communication Networks Group at the M.I.T. Laboratory for Information and Decision Systems, where he investigated fundamental problems in packet radio networks and multiple access communications. In 1981, he joined the technical staff at Bell Laboratories, where he worked on the architectural design and performance analysis of local area networks. In 1984 he transferred to the Network Systems Research Department at AT&T Bell Laboratories, performing fundamental and applied research in the areas of high-performance, integrated communication networks and multiuser lightwave networks. In June 1987 he assumed his current position as Director of Networking Research at Codex Corporation. His current research interests include wide-band circuit and packet switching architectures, integrated voice and data networks, and local area network interconnects.
Dr. Hluchyj is active in the IEEE COMMUNICATIONS SOCIETY and is a member of the Technical Editorial Board for the IEEE NETWORK MAGAZINE.
Mark J. Karol (S’79-M’85) was born in Jersey City, NJ, on February 28, 1959. He received the B.S. degree in mathematics and the B.S.E.E. degree in 1981 from Case Western Reserve University, and the M.S.E., M.A., and Ph.D. degrees in electrical engineering from Princeton University in 1982, 1984, and 1985, respectively.
Since 1985, he has been a member of the Network Systems Research Department at AT&T Bell Laboratories. His current research interests include local and metropolitan area lightwave networks, and wide-band circuit and packet switching architectures.