Objective: In this lecture we will learn about communication over a channel of practical interest, in which the transmitted signal is subjected to additive white Gaussian noise. We will derive the famous capacity formula.
1 The Gaussian channel
Suppose we send information over a channel that is subjected to additive white Gaussian noise. Then the output is
\[ Y_i = X_i + Z_i, \]
where $Y_i$ is the channel output, $X_i$ is the channel input, and $Z_i$ is zero-mean Gaussian with variance $N$: $Z_i \sim \mathcal{N}(0, N)$. This is different from the channel models we saw before, in that the output can take on a continuum of values. This is also a good model for a variety of practical communication channels.
We will assume that there is a constraint on the input power. If we have an input codeword $(x_1, x_2, \ldots, x_n)$, we will assume that the average power is constrained so that
\[ \frac{1}{n} \sum_{i=1}^{n} x_i^2 \le P. \]
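Let us consider the probability of error for binary transmission. Suppose that we can send either $+\sqrt{P}$ or $-\sqrt{P}$ over the channel, each with probability 1/2. The receiver looks at the received signal amplitude and determines the signal transmitted using a threshold test at zero. Then
\[
\begin{aligned}
P_e &= \tfrac{1}{2} P(Y < 0 \mid X = +\sqrt{P}) + \tfrac{1}{2} P(Y > 0 \mid X = -\sqrt{P}) \\
    &= \tfrac{1}{2} P(Z < -\sqrt{P} \mid X = +\sqrt{P}) + \tfrac{1}{2} P(Z > \sqrt{P} \mid X = -\sqrt{P}) \\
    &= P(Z > \sqrt{P}) \\
    &= \int_{\sqrt{P}}^{\infty} \frac{1}{\sqrt{2\pi N}} e^{-x^2/2N} \, dx \\
    &= Q\left(\sqrt{P/N}\right) = 1 - \Phi\left(\sqrt{P/N}\right),
\end{aligned}
\]
where
\[ Q(x) = \frac{1}{\sqrt{2\pi}} \int_{x}^{\infty} e^{-t^2/2} \, dt \quad \text{or} \quad \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2} \, dt. \]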
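As a quick numerical check of the error probability $P_e = Q(\sqrt{P/N})$, the following minimal sketch evaluates the Q function through the complementary error function in the Python standard library. The function names `q_function` and `binary_error_probability` are illustrative choices, not part of the lecture.

```python
import math


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = P(standard normal > x)."""
    # Q(x) = 0.5 * erfc(x / sqrt(2)) is a standard identity.
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def binary_error_probability(power: float, noise_var: float) -> float:
    """P_e = Q(sqrt(P/N)) for antipodal signalling at +/- sqrt(P)."""
    return q_function(math.sqrt(power / noise_var))


# Error probability for a few signal-to-noise ratios P/N (noise variance 1).
for snr in (1.0, 4.0, 9.0):
    print(f"P/N = {snr:4.1f}: P_e = {binary_error_probability(snr, 1.0):.5f}")
```

The error probability falls quickly as $P/N$ grows, but it is never zero for uncoded binary signalling.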
Definition 1 The information capacity of the Gaussian channel with power constraint is
\[ C = \max_{p(x):\, E X^2 \le P} I(X;Y). \]
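We can compute this as follows:
\[
\begin{aligned}
I(X;Y) &= h(Y) - h(Y \mid X) \\
       &= h(Y) - h(X + Z \mid X) \\
       &= h(Y) - h(Z \mid X) \\
       &= h(Y) - h(Z) \\
       &\le \tfrac{1}{2} \log 2\pi e (P + N) - \tfrac{1}{2} \log 2\pi e N \\
       &= \tfrac{1}{2} \log(1 + P/N),
\end{aligned}
\]
since $Z$ is independent of $X$, $E Y^2 = P + N$, and the Gaussian is the maximum-entropy distribution for a given variance. So
\[ C = \frac{1}{2} \log(1 + P/N), \]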
bits per channel use. The maximum is obtained when X is Gaussian distributed. (How do we make the input distribution look Gaussian?)
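To get a feel for the numbers, here is a minimal sketch that evaluates $C = \frac{1}{2}\log_2(1 + P/N)$ for a few signal-to-noise ratios. The helper name `awgn_capacity` and the chosen SNR values are illustrative assumptions.

```python
import math


def awgn_capacity(power: float, noise_var: float) -> float:
    """Capacity in bits per channel use: 0.5 * log2(1 + P/N)."""
    return 0.5 * math.log2(1.0 + power / noise_var)


# Capacity at a few SNR values given in dB (noise variance normalised to 1).
for snr_db in (0, 10, 20, 30):
    snr = 10.0 ** (snr_db / 10.0)
    print(f"{snr_db:2d} dB: {awgn_capacity(snr, 1.0):.3f} bits per channel use")
```

At high SNR, each additional 6 dB of power buys roughly one more bit per channel use.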
Definition 2 An (M, n) code for the Gaussian channel with power constraint P
consists of the following:
1. An index set {1,2, . . . , M}
2. An encoding function $x : \{1, \ldots, M\} \to \mathcal{X}^n$, which maps an input index into a sequence that is $n$ elements long, yielding codewords $x^n(1), x^n(2), \ldots, x^n(M)$, such that the average power constraint is satisfied:
\[ \sum_{i=1}^{n} (x_i(w))^2 \le nP \quad \text{for } w = 1, 2, \ldots, M. \]
3. A decoding function $g : \mathcal{Y}^n \to \{1, 2, \ldots, M\}$. □
Definition 3 A rate $R$ is said to be achievable for a Gaussian channel with a power constraint $P$ if there exists a sequence of $(2^{nR}, n)$ codes with codewords satisfying the power constraint such that the maximal probability of error $\lambda^{(n)} \to 0$.
The capacity of the channel is the supremum of the achievable rates. □
Theorem 1 The capacity of a Gaussian channel with power constraint $P$ and noise variance $N$ is
\[ C = \frac{1}{2} \log\left(1 + \frac{P}{N}\right) \]
bits per transmission.
Geometric plausibility For a codeword of length $n$, the received vector (in $n$-space) is normally distributed with mean equal to the true codeword. With high probability, the received vector is contained in a sphere about the mean of radius $\sqrt{n(N+\epsilon)}$. Why? Because with high probability, the vector falls within about one standard deviation of the mean in each direction, and the total squared distance is the Euclidean sum:
\[ E[Z_1^2 + Z_2^2 + \cdots + Z_n^2] = nN. \]
This is the square of the distance within which we expect to fall. If we assign everything within this sphere to the given codeword, we misdetect only if we fall outside this sphere.
Other codewords will have other spheres, each with radius approximately $\sqrt{n(N+\epsilon)}$. The codewords are limited in power by $P$, so the received vectors have energy at most about $n(P+N)$ and must all lie in a sphere of radius $\sqrt{n(P+N)}$. The number of (approximately) nonintersecting decoding spheres is therefore
\[ \text{number of spheres} \approx \frac{\text{volume of sphere in } n\text{-space with radius } r = \sqrt{n(P+N)}}{\text{volume of sphere in } n\text{-space with radius } r = \sqrt{n(N+\epsilon)}}. \]
The volume of a sphere of radius $r$ in $n$-space is proportional to $r^n$. Substituting in this fact we get
\[ \text{number of spheres} \approx \frac{(n(P+N))^{n/2}}{(n(N+\epsilon))^{n/2}} \approx 2^{\frac{n}{2} \log\left(1 + \frac{P}{N}\right)}, \]
which corresponds to a rate of $\frac{1}{2}\log(1 + P/N)$ bits per transmission.
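The sphere-hardening claim behind this argument can be illustrated numerically. The sketch below, under the assumption of i.i.d. $\mathcal{N}(0, N)$ noise samples, estimates the normalised noise energy $\frac{1}{n}\sum Z_i^2$ for increasing $n$ and computes the per-symbol exponent of the sphere count. The function names, the seed, and the parameter values are illustrative only.

```python
import math
import random


def normalized_noise_energy(n: int, noise_var: float, rng: random.Random) -> float:
    """Return (1/n) * sum of Z_i^2 for n i.i.d. N(0, noise_var) samples."""
    sigma = math.sqrt(noise_var)
    return sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n)) / n


def sphere_count_exponent(power: float, noise_var: float, eps: float = 0.0) -> float:
    """(1/n) * log2 of the volume ratio: 0.5 * log2((P + N) / (N + eps))."""
    return 0.5 * math.log2((power + noise_var) / (noise_var + eps))


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
for n in (10, 100, 10_000):
    # The squared noise radius per dimension concentrates near N = 2.0 as n grows.
    print(n, round(normalized_noise_energy(n, 2.0, rng), 3))

# With P = 3 and N = 1 the exponent equals 0.5 * log2(4) = 1 bit per symbol.
print(sphere_count_exponent(3.0, 1.0))
```

As $n$ grows, the noise vector lies ever closer to the surface of a sphere of radius $\sqrt{nN}$, which is what makes the packing picture plausible.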
Proof We will follow essentially the same steps as before.
1. First we generate a codebook at random. This time we generate the codebook according to the Gaussian distribution: let $X_i(w)$, $i = 1, 2, \ldots, n$, be the code sequence corresponding to input index $w$, where each $X_i(w)$ is selected at random i.i.d. according to $\mathcal{N}(0, P - \epsilon)$. (With high probability, this has average power no greater than $P$.) The codebook is known by both transmitter and receiver.
2. Encode as described above.
3. The receiver gets a Yn, and looks at the list of codewords {Xn(w)} and searches for one which is jointly typical with the received vector. If there is only one such vector, it is declared as the transmitted vector. If there is more than one such vector, an error is declared. An error is also declared if the chosen codeword does not satisfy the power constraint.
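For the probability of error, assume without loss of generality that codeword 1 is sent:
\[ Y^n = X^n(1) + Z^n. \]
Define the following events:
\[ E_0 = \left\{ \frac{1}{n} \sum_{i=1}^{n} X_i^2(1) > P \right\} \]
(the event that the codeword exceeds the power constraint) and
\[ E_i = \left\{ (X^n(i), Y^n) \text{ is in } A_\epsilon^{(n)} \right\}. \]
The probability of error is then
\[
\begin{aligned}
P(E) &= P(E_0 \cup E_1^c \cup E_2 \cup E_3 \cup \cdots \cup E_{2^{nR}}) \\
     &\le P(E_0) + P(E_1^c) + \sum_{i=2}^{2^{nR}} P(E_i) \qquad \text{(union bound)}.
\end{aligned}
\]
By the LLN, $P(E_0) \to 0$. By the joint AEP, $P(E_1^c) \to 0$, so $P(E_1^c) \le \epsilon$ for $n$ sufficiently large. By the code generation process, $Y^n$ and $X^n(i)$ are independent for $i \ne 1$. So the probability that $X^n(i)$ and $Y^n$ are jointly typical is $\le 2^{-n(I(X;Y) - 3\epsilon)}$ by the joint AEP. So
\[
\begin{aligned}
P_e^{(n)} &\le \epsilon + \epsilon + \sum_{i=2}^{2^{nR}} 2^{-n(I(X;Y) - 3\epsilon)} \\
          &\le 2\epsilon + (2^{nR} - 1)\, 2^{-n(I(X;Y) - 3\epsilon)} \\
          &\le 2\epsilon + 2^{nR}\, 2^{-n(I(X;Y) - 3\epsilon)} \le 3\epsilon
\end{aligned}
\]
for $n$ sufficiently large, if $R < I(X;Y) - 3\epsilon$.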
This gives the average probability of error: we then go through the same kinds of arguments as before to conclude that the maximum probability of error also must go to zero.
□
The converse is that rates $R > C$ are not achievable, or, equivalently, that if $P_e^{(n)} \to 0$ then it must be that $R \le C$.
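Proof The proof starts with Fano's inequality:
\[ H(W \mid Y^n) \le 1 + nR P_e^{(n)} = n\epsilon_n, \]
where $\epsilon_n = \frac{1}{n} + R P_e^{(n)}$ and $\epsilon_n \to 0$ as $n \to \infty$.
The proof is a string of inequalities, in which $P_i$ denotes the average power of the $i$th symbol over the codebook:
\[
\begin{aligned}
nR &= H(W) = I(W; Y^n) + H(W \mid Y^n) && \text{uniform } W; \text{ definition of } I \\
   &\le I(W; Y^n) + n\epsilon_n && \text{Fano's inequality} \\
   &\le I(X^n; Y^n) + n\epsilon_n && \text{data processing} \\
   &= h(Y^n) - h(Y^n \mid X^n) + n\epsilon_n \\
   &= h(Y^n) - h(Z^n) + n\epsilon_n \\
   &\le \sum_{i=1}^{n} h(Y_i) - h(Z^n) + n\epsilon_n \\
   &= \sum_{i=1}^{n} h(Y_i) - \sum_{i=1}^{n} h(Z_i) + n\epsilon_n \\
   &\le \sum_{i=1}^{n} \left[ \tfrac{1}{2} \log 2\pi e (P_i + N) - \tfrac{1}{2} \log 2\pi e N \right] + n\epsilon_n && \text{entropies of } Y \text{ and } Z \\
   &= \sum_{i=1}^{n} \tfrac{1}{2} \log(1 + P_i/N) + n\epsilon_n \\
   &= n \left( \frac{1}{n} \sum_{i=1}^{n} \tfrac{1}{2} \log(1 + P_i/N) \right) + n\epsilon_n \\
   &\le n\, \tfrac{1}{2} \log\left(1 + \frac{1}{n} \sum_{i=1}^{n} P_i/N \right) + n\epsilon_n && \text{Jensen's} \\
   &\le n\, \tfrac{1}{2} \log(1 + P/N) + n\epsilon_n && \text{power constraint.}
\end{aligned}
\]
Dividing through by $n$,
\[ R \le \frac{1}{2} \log(1 + P/N) + \epsilon_n. \]
□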
2 Band-limited channels
We now come to the first time in the book where the information is actually carried by a time-waveform, instead of a random variable. We will consider transmission over a band-limited channel (such as a phone channel). A key result is the sampling theorem:
Theorem 2 If $f(t)$ is bandlimited to $W$ Hz, then the function is completely determined by samples of the function taken every $\frac{1}{2W}$ seconds apart.
This is the classical Nyquist sampling theorem. However, Shannon's name is also attached to it, since he provided a proof and used it. A representation of the function $f(t)$ is
\[ f(t) = \sum_{n} f\left(\frac{n}{2W}\right) \operatorname{sinc}\left(t - \frac{n}{2W}\right), \quad \text{where} \quad \operatorname{sinc}(t) = \frac{\sin(2\pi W t)}{2\pi W t}. \]
From this theorem, we conclude (the dimensionality theorem) that a bandlimited function has only $2W$ degrees of freedom per second.
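The interpolation formula can be tried numerically. The sketch below is only an illustration under stated assumptions: it takes $W = 1$ Hz, uses the test signal $(\sin(\pi W t)/(\pi W t))^2$ (whose spectrum is a triangle supported on $[-W, W]$, so it is bandlimited to $W$ Hz), and truncates the infinite sum to a finite number of terms. The names `kernel`, `signal`, and `reconstruct` are hypothetical.

```python
import math

W = 1.0  # assumed bandwidth in Hz (illustrative value)


def kernel(t: float) -> float:
    """Interpolation kernel sin(2*pi*W*t) / (2*pi*W*t), equal to 1 at t = 0."""
    x = 2.0 * math.pi * W * t
    return 1.0 if x == 0.0 else math.sin(x) / x


def signal(t: float) -> float:
    """Test signal (sin(pi*W*t) / (pi*W*t))**2, bandlimited to W Hz."""
    x = math.pi * W * t
    return 1.0 if x == 0.0 else (math.sin(x) / x) ** 2


def reconstruct(t: float, n_terms: int = 200) -> float:
    """Truncated sampling series built from samples taken every 1/(2W) seconds."""
    ts = 1.0 / (2.0 * W)
    return sum(signal(n * ts) * kernel(t - n * ts)
               for n in range(-n_terms, n_terms + 1))


# Compare the signal with its reconstruction at points between the samples.
for t in (0.1, 0.37, 1.25):
    print(t, signal(t), reconstruct(t))
```

The test signal was chosen because its samples decay quickly, so the truncated sum is already accurate; a signal with slowly decaying samples would need many more terms.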
For a signal which has "most" of its energy in bandwidth $W$ and "most" of its energy in a time $T$, there are about $2WT$ degrees of freedom, and the time- and band-limited function can be represented using $2WT$ orthogonal basis functions, known as the prolate spheroidal functions. We can view band- and time-limited functions as vectors in a $2WT$-dimensional vector space.
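Assume that the noise power spectral density of the channel is $N_0/2$. Then the noise power is $(N_0/2)(2W) = N_0 W$. Over the time interval of $T$ seconds, the signal energy per sample (per channel use) is
\[ \frac{PT}{2WT} = \frac{P}{2W}, \]
and the noise energy per sample is $\frac{N_0 W T}{2WT} = \frac{N_0}{2}$.
Use this information in the capacity formula, with the per-sample signal energy $P/2W$ in place of $P$ and the per-sample noise variance $N_0/2$ in place of $N$:
\[
\begin{aligned}
C &= \frac{1}{2} \log\left(1 + \frac{P/2W}{N_0/2}\right) \text{ bits per channel use} \\
  &= \frac{1}{2} \log\left(1 + \frac{P}{N_0 W}\right) \text{ bits per channel use.}
\end{aligned}
\]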
There are $2W$ samples each second (channel uses), so the capacity is
\[ C = (2W)\, \frac{1}{2} \log\left(1 + \frac{P}{N_0 W}\right) \text{ bits/second} \]
or
\[ C = W \log\left(1 + \frac{P}{N_0 W}\right). \]
This is the famous and key result of information theory. As $W \to \infty$, we have to do a little calculus to find that
\[ C \to \frac{P}{N_0} \log_2 e \text{ bits per second.} \]
This is interesting: even with infinite bandwidth, the capacity is not infinite, but grows linearly with the power.
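The infinite-bandwidth behaviour is easy to see numerically. The following sketch evaluates $C = W \log_2(1 + P/(N_0 W))$ for growing $W$ and compares it with $(P/N_0)\log_2 e$, which is what the expression tends to when the logarithm is taken to base 2. The values of `P` and `N0` and the function name are illustrative assumptions.

```python
import math


def bandlimited_capacity(power: float, n0: float, bandwidth: float) -> float:
    """C = W * log2(1 + P / (N0 * W)) in bits per second."""
    # log1p keeps the computation accurate when P / (N0 * W) is tiny.
    return bandwidth * math.log1p(power / (n0 * bandwidth)) / math.log(2.0)


P, N0 = 1.0, 1e-3  # illustrative values, not taken from the notes
limit = (P / N0) * math.log2(math.e)

for W in (1e2, 1e3, 1e4, 1e6):
    c = bandlimited_capacity(P, N0, W)
    print(f"W = {W:9.0f} Hz: C = {c:8.1f} bits/s (limit {limit:.1f})")
```

The capacity keeps increasing with bandwidth but saturates: once $P/(N_0 W)$ is small, extra bandwidth buys almost nothing and only more power helps.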
Example 1 For a phone channel, take $W = 3300$ Hz. If the SNR is $P/N_0 W = 40$ dB $= 10000$, we get
\[ C = 43850 \text{ bits per second.} \]
If $P/N_0 W = 20$ dB $= 100$ we get
\[ C = 21972 \text{ bits/second.} \]
(The book is dated.) □
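The figures in this example can be reproduced with a few lines. This is a minimal sketch; the helper names `db_to_ratio` and `capacity_bps` are illustrative.

```python
import math


def db_to_ratio(db: float) -> float:
    """Convert a power ratio expressed in decibels to a linear ratio."""
    return 10.0 ** (db / 10.0)


def capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """C = W * log2(1 + SNR) in bits per second, with SNR = P / (N0 * W)."""
    return bandwidth_hz * math.log2(1.0 + db_to_ratio(snr_db))


for snr_db in (40.0, 20.0):
    print(f"SNR = {snr_db:.0f} dB: C = {capacity_bps(3300.0, snr_db):.0f} bits/s")
```

Halving the SNR in dB roughly halves the capacity here, because at high SNR the capacity is nearly proportional to the SNR expressed in dB.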
We cannot do better than capacity!
3 Kuhn-Tucker Conditions
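Before proceeding with the next section, we need a result from constrained optimization theory known as the Kuhn-Tucker condition.
Suppose we are minimizing some convex objective function $L(x)$,
\[ \min L(x) \]
subject to a constraint
\[ f(x) \le 0. \]
Let the optimal value of $x$ be $x_0$. Then either the constraint is inactive, in which case we get
\[ \left. \frac{\partial L}{\partial x} \right|_{x_0} = 0 \]
or, if the constraint is active, it must be the case that the objective function increases for all admissible values of $x$:
\[ \left. \frac{\partial L}{\partial x} \right|_{x \in A} \ge 0 \]
where $A$ is the set of admissible values, for which
\[ \frac{\partial f}{\partial x} \le 0. \]
(Think about what happens if this is not the case.) Thus,
\[ \operatorname{sgn} \frac{\partial L}{\partial x} = -\operatorname{sgn} \frac{\partial f}{\partial x} \]
or
\[ \frac{\partial L}{\partial x} + \lambda \frac{\partial f}{\partial x} = 0, \qquad \lambda \ge 0. \tag{1} \]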
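We can create a new objective function
\[ J(x, \lambda) = L(x) + \lambda f(x), \]
so the necessary conditions become
\[ \frac{\partial J}{\partial x} = 0 \]
and $f(x) \le 0$, where
\[
\lambda \begin{cases}
\ge 0 & f(x) = 0 \quad \text{(constraint is active)} \\
= 0 & f(x) < 0 \quad \text{(constraint is inactive).}
\end{cases}
\]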
For a vector variable $x$, the condition (1) means: $\frac{\partial L}{\partial x}$ is parallel to $\frac{\partial f}{\partial x}$ and pointing in the opposite direction, where $\frac{\partial L}{\partial x}$ is interpreted as the gradient.
In words, what condition (1) says is: the gradient of L with respect to x at a minimum must be pointed in such a way that decrease of L can only come by violating the constraints. Otherwise, we could decrease L further. This is the essence of the Kuhn-Tucker condition.
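A one-dimensional toy problem makes the two cases concrete. The sketch below is an illustration chosen for this purpose, not an example from the notes: it minimizes $L(x) = (x - a)^2$ subject to $f(x) = x - b \le 0$, solves the Kuhn-Tucker conditions in closed form, and checks the stationarity condition (1). The function names are hypothetical.

```python
def solve_kt(a: float, b: float) -> tuple[float, float]:
    """Minimize L(x) = (x - a)**2 subject to f(x) = x - b <= 0.

    Returns (x0, lam), the minimizer and the multiplier.
    """
    if a <= b:
        # The unconstrained minimizer x = a is admissible: constraint inactive.
        return a, 0.0
    # Constraint active: x0 = b, and dL/dx + lam * df/dx = 0 with
    # dL/dx = 2 * (b - a) and df/dx = 1 gives lam = 2 * (a - b) > 0.
    return b, 2.0 * (a - b)


def stationarity_residual(a: float, x0: float, lam: float) -> float:
    """Left-hand side of condition (1): dL/dx + lam * df/dx at x0."""
    return 2.0 * (x0 - a) + lam * 1.0


for a, b in ((2.0, 1.0), (0.0, 1.0)):
    x0, lam = solve_kt(a, b)
    print(f"a={a}, b={b}: x0={x0}, lambda={lam}, "
          f"residual={stationarity_residual(a, x0, lam)}")
```

In the first case the constraint is active and the multiplier is positive; in the second the constraint is inactive and the multiplier is zero, matching the two branches of the condition.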
4 Parallel Gaussian channels
Parallel Gaussian channels are used to model bandlimited channels with a non-flat frequency response. We assume we have $k$ Gaussian channels,
\[ Y_j = X_j + Z_j, \quad j = 1, 2, \ldots, k, \]
where
\[ Z_j \sim \mathcal{N}(0, N_j) \]
and the channels are independent. The total power used is constrained:
\[ E \sum_{j=1}^{k} X_j^2 \le P. \]
One question we might ask is: how do we distribute the power across the $k$ channels to get maximum throughput?
We can find the maximum mutual information (the information channel capacity) as
\[
\begin{aligned}
I(X_1, \ldots, X_k; Y_1, \ldots, Y_k) &= h(Y_1, \ldots, Y_k) - h(Y_1, \ldots, Y_k \mid X_1, \ldots, X_k) \\
 &= h(Y_1, \ldots, Y_k) - h(Z_1, \ldots, Z_k) \\
 &= h(Y_1, \ldots, Y_k) - \sum_{i=1}^{k} h(Z_i) \\
 &\le \sum_{i=1}^{k} \left[ h(Y_i) - h(Z_i) \right] \\
 &\le \sum_{i} \frac{1}{2} \log(1 + P_i/N_i),
\end{aligned}
\]
where $P_i = E X_i^2$. Equality is obtained when the $X$s are independent and normally distributed. We want to distribute the power available among the various channels, subject to not exceeding the power constraint:
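\[ J(P_1, \ldots, P_k) = \sum_{i} \frac{1}{2} \log\left(1 + \frac{P_i}{N_i}\right) + \lambda \sum_{i=1}^{k} P_i \]
with a side constraint (not shown) that $P_i \ge 0$. Differentiate w.r.t. $P_j$ to obtain
\[ \frac{1}{2} \frac{1}{P_j + N_j} + \lambda \le 0, \]
with equality when the constraint $P_j \ge 0$ is inactive (that is, when $P_j > 0$). After some fiddling, we obtain
\[ P_j = \nu - N_j \]
(since $\lambda$ is a constant; with natural logarithms, $\nu = -1/(2\lambda)$). However, we must also have $P_j \ge 0$, so we must ensure that we don't violate that if $N_j > \nu$. Thus, we let
\[ P_j = (\nu - N_j)^+ \]
where
\[ (x)^+ = \begin{cases} x & x \ge 0 \\ 0 & x < 0 \end{cases} \]
and $\nu$ is chosen so that
\[ \sum_{i=1}^{k} (\nu - N_i)^+ = P. \]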
Draw picture; explain “water filling.”
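Water filling is straightforward to compute. The sketch below is one simple way to find the water level $\nu$, assuming a positive total power: it tries using the $m$ quietest channels for $m = k, k-1, \ldots$ until every channel in use receives positive power. The function names and the example noise levels are illustrative assumptions.

```python
import math


def water_fill(noise: list[float], total_power: float) -> tuple[float, list[float]]:
    """Return (nu, powers) with powers[j] = max(nu - noise[j], 0).

    The powers sum to total_power (assumed positive).
    """
    order = sorted(noise)
    # Try filling the m quietest channels, starting with all of them.
    for m in range(len(order), 0, -1):
        nu = (total_power + sum(order[:m])) / m
        if nu > order[m - 1]:
            # All m channels sit below the water level, so nu is consistent.
            break
    powers = [max(nu - n_j, 0.0) for n_j in noise]
    return nu, powers


def parallel_capacity(noise: list[float], powers: list[float]) -> float:
    """Sum of 0.5 * log2(1 + P_j / N_j) over the channels, in bits."""
    return sum(0.5 * math.log2(1.0 + p / n) for p, n in zip(powers, noise))


noise_levels = [1.0, 2.0, 5.0]
nu, powers = water_fill(noise_levels, 3.0)
print("water level:", nu)
print("powers:", powers)
print("capacity:", parallel_capacity(noise_levels, powers))
```

In this example the noisiest channel lies above the water level and receives no power at all; the available power goes to the two quieter channels.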
5 Channels with colored Gaussian noise
We will extend the results of the previous section now to channels with non-white Gaussian noise. Let $K_z$ be the covariance of the noise and $K_x$ the covariance of the input, with the input constrained by
\[ \frac{1}{n} \sum_{i} E X_i^2 \le P, \]
which is the same as
\[ \frac{1}{n} \operatorname{tr}(K_x) \le P. \]
We can write
\[ I(X_1, \ldots, X_n; Y_1, \ldots, Y_n) = h(Y_1, \ldots, Y_n) - h(Z_1, \ldots, Z_n), \]
where
\[ h(Y_1, \ldots, Y_n) \le \frac{1}{2} \log\left( (2\pi e)^n |K_x + K_z| \right). \]
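Now how do we choose $K_x$ to maximize $|K_x + K_z|$, subject to the power constraint? Let
\[ K_z = Q \Lambda Q^T \]
(with $Q$ orthogonal and $\Lambda$ diagonal); then
\[ |K_x + K_z| = |K_x + Q \Lambda Q^T| = |Q|\,|Q^T K_x Q + \Lambda|\,|Q^T| = |Q^T K_x Q + \Lambda| = |A + \Lambda|, \]
where $A = Q^T K_x Q$. Observe that
\[ \operatorname{tr}(A) = \operatorname{tr}(Q^T K_x Q) = \operatorname{tr}(Q Q^T K_x) = \operatorname{tr}(K_x). \]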
So we want to maximize $|A + \Lambda|$ subject to $\operatorname{tr}(A) \le nP$. The key is to use an inequality, in this case Hadamard's inequality. Hadamard's inequality follows directly from the "conditioning reduces entropy" theorem:
\[ h(X_1, \ldots, X_n) \le \sum_{i} h(X_i). \]
Let $X \sim \mathcal{N}(0, K)$. Then
\[ h(X) = \frac{1}{2} \log (2\pi e)^n |K| \quad \text{and} \quad h(X_i) = \frac{1}{2} \log (2\pi e) K_{ii}. \]
Substituting in and simplifying gives
\[ |K| \le \prod_{i} K_{ii} \]
with equality iff $K$ is diagonal.
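Getting back to our problem,
\[ |A + \Lambda| \le \prod_{i} (A_{ii} + \Lambda_{ii}) \]
with equality iff $A$ is diagonal. We have
\[ \frac{1}{n} \sum_{i} A_{ii} \le P \]
(the power constraint), and $A_{ii} \ge 0$. As before, we take
\[ A_{ii} = (\nu - \lambda_i)^+, \]
where $\lambda_i = \Lambda_{ii}$ are the eigenvalues of $K_z$ and $\nu$ is chosen so that
\[ \sum_{i} A_{ii} = nP. \]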
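For a small case the whole procedure can be carried out by hand or in a few lines. The sketch below is a toy illustration under stated assumptions: block length $n = 2$, a positive definite noise covariance $\begin{pmatrix} a & b \\ b & c \end{pmatrix}$ whose eigenvalues are available in closed form, and water filling over those eigenvalues with total power $nP$. All function names and numerical values are hypothetical.

```python
import math


def eig_sym_2x2(a: float, b: float, c: float) -> tuple[float, float]:
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mid = 0.5 * (a + c)
    rad = math.hypot(0.5 * (a - c), b)
    return mid - rad, mid + rad


def water_fill(levels, budget):
    """Return (nu, alloc) with alloc[i] = max(nu - levels[i], 0) summing to budget."""
    order = sorted(levels)
    for m in range(len(order), 0, -1):
        nu = (budget + sum(order[:m])) / m
        if nu > order[m - 1]:
            break
    return nu, [max(nu - lv, 0.0) for lv in levels]


def colored_noise_rate(a: float, b: float, c: float, power: float):
    """Water-fill over the noise eigenvalues; rate in bits per transmission."""
    n = 2
    lams = eig_sym_2x2(a, b, c)            # eigenvalues of K_z (assumed > 0)
    nu, alloc = water_fill(lams, n * power)  # A_ii = (nu - lambda_i)^+
    rate = sum(0.5 * math.log2(1.0 + p / lam) for p, lam in zip(alloc, lams)) / n
    return lams, alloc, rate


lams, alloc, rate = colored_noise_rate(2.0, 1.0, 2.0, 1.5)
print("noise eigenvalues:", lams)
print("power per eigen-direction:", alloc)
print("rate (bits per transmission):", rate)
```

The input covariance that achieves this allocation is $K_x = Q A Q^T$: the power is placed along the eigenvectors of the noise covariance, with more power in the directions where the noise is weaker.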
Now we want to generalize to a continuous time system. Consider a channel with additive Gaussian noise with covariance matrix $K_Z^{(n)}$. If the channel noise process is stationary, then the covariance matrix is Toeplitz, and the eigenvalues of the covariance matrix tend to a limit as $n \to \infty$. The density of the eigenvalues on the real line tends to the power spectrum of the stochastic process. That is, if $K_{ij} = K_{i-j}$ are the autocorrelation values $r_k = K_k$ and the power spectrum is
\[ S(\omega) = \mathcal{F}[r_k], \]
then
\[ \lim_{M \to \infty} \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_M}{M} = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\omega) \, d\omega. \]
In this case, the water filling translates to water filling in the spectral domain. The capacity of the channel with noise spectrum $N(f)$ can be shown to be
\[ C = \int \frac{1}{2} \log\left(1 + \frac{(\nu - N(f))^+}{N(f)}\right) df \]
where $\nu$ is chosen so that
Z