The concept of the information bottleneck provides another way to think about our problem. We follow the presentation in [3], which relies heavily on [137] for the mutual information and rate distortion material.
Consider a signal $x \in X$ used to predict a signal $y \in Y$. The relevant information in $x$ is defined as the amount of information about $y$ in $x$. There is a certain minimal amount of information in the signal $x$ that is needed to predict the signal $y$. Equivalently, there is a way to represent $X$ in a short code so as to preserve the maximum information about $Y$. The information about $Y$ in $X$ is thus squeezed through the bottleneck of a set of codewords $\tilde{X}$. We thereby have for the information flow,
$$
X \to \tilde{X}, \qquad \tilde{X} \to Y. \tag{4.26}
$$
Information about $Y$ in $X$ is represented by a limited set of codewords $\tilde{X}$, and the limited set of codewords is then used to predict $Y$ with the goal of maximizing the amount of information about $Y$.
The mapping between elements of $X$ and the codewords $\tilde{x} \in \tilde{X}$ can be represented by a probability density $p(\tilde{x}|x)$. The mapping thereby partitions $X$ into blocks, with each block associated to a codeword $\tilde{x}$ with probability
$$
p(\tilde{x}) = \sum_{x} p(x)\, p(\tilde{x}|x). \tag{4.27}
$$
Notice here that the probability $p(\tilde{x})$ is just one of the marginal distributions produced from the joint distribution $p(x, \tilde{x}) = p(x)\, p(\tilde{x}|x)$.
To determine the quality of a signal, two things are needed: the rate of transmission and the accuracy of the transmission. The rate of transmission, i.e., the average number of bits per message needed to specify an element in the codebook without confusion, is determined by the mutual information. The mutual information bounds from below the rate per element of $X$,
$$
I(X; \tilde{X}) = \sum_{x \in X} \sum_{\tilde{x} \in \tilde{X}} p(x, \tilde{x}) \log \frac{p(\tilde{x}|x)}{p(\tilde{x})}. \tag{4.28}
$$
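As a minimal numerical sketch of (4.27) and (4.28), the following Python computes the codeword marginal and the rate for a discrete encoder. The function names, the toy four-symbol source, and the use of base-2 logarithms (so that the rate comes out in bits) are assumptions made for illustration and are not taken from [3].

```python
import math

def marginal_codeword(p_x, p_xt_given_x):
    """p(x~) = sum_x p(x) p(x~|x), Eq. (4.27).

    p_xt_given_x[i][j] holds p(x~_j | x_i).
    """
    n_codes = len(p_xt_given_x[0])
    return [sum(p_x[i] * p_xt_given_x[i][j] for i in range(len(p_x)))
            for j in range(n_codes)]

def mutual_information(p_x, p_xt_given_x):
    """I(X; X~) in bits, Eq. (4.28), with the convention 0 log 0 = 0."""
    p_xt = marginal_codeword(p_x, p_xt_given_x)
    total = 0.0
    for i, px in enumerate(p_x):
        for j, cond in enumerate(p_xt_given_x[i]):
            joint = px * cond  # p(x, x~) = p(x) p(x~|x)
            if joint > 0.0:
                total += joint * math.log2(cond / p_xt[j])
    return total

# Hypothetical example: four equiprobable symbols and two codewords.
p_x = [0.25, 0.25, 0.25, 0.25]
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # deterministic blocks
blind = [[0.5, 0.5] for _ in range(4)]                   # encoder ignores x

print(mutual_information(p_x, hard))   # 1.0 (one bit selects the block)
print(mutual_information(p_x, blind))  # 0.0 (codeword carries nothing about x)
```

The two encoders bracket the possibilities for this toy source: a deterministic partition into two equal blocks costs one bit, while an encoder that ignores $x$ costs nothing.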
To determine the accuracy of a signal, the distortion function is used. The distortion function is supposed to be small, and so it gives a measure for the most relevant aspects of $X$. The expected distortion for the partitioning of $X$ from $p(\tilde{x}|x)$ is,
$$
\langle d(x, \tilde{x}) \rangle = \sum_{x \in X} \sum_{\tilde{x} \in \tilde{X}} p(x, \tilde{x})\, d(x, \tilde{x}). \tag{4.29}
$$
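To make (4.29) concrete, here is a short sketch that evaluates the expected distortion for a discrete encoder. The squared-error distortion, the symbol positions, and the codeword positions are hypothetical choices for the example; the text itself leaves $d(x, \tilde{x})$ unspecified.

```python
def expected_distortion(p_x, p_xt_given_x, d):
    """<d(x, x~)> = sum_{x, x~} p(x) p(x~|x) d(x, x~), Eq. (4.29)."""
    return sum(p_x[i] * p_xt_given_x[i][j] * d[i][j]
               for i in range(len(p_x))
               for j in range(len(d[i])))

# Hypothetical setup: symbols on a line, two codewords, squared-error distortion.
positions = [0.0, 1.0, 2.0, 3.0]
codewords = [0.5, 2.5]
d = [[(x - c) ** 2 for c in codewords] for x in positions]

p_x = [0.25, 0.25, 0.25, 0.25]
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # nearest-codeword blocks
blind = [[0.5, 0.5] for _ in range(4)]                   # encoder ignores x

print(expected_distortion(p_x, hard, d))   # 0.25
print(expected_distortion(p_x, blind, d))  # 2.25
```

Assigning each symbol to its nearest codeword gives the smaller distortion, at the price of a nonzero rate; the encoder that ignores $x$ has zero rate but a much larger distortion.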
As noted earlier, the rate distortion function $R(D)$ is given by the minimal achievable rate (i.e., the minimal mutual information) for a given distortion $D$,
$$
R(D) \equiv \min I(X; \tilde{X}), \tag{4.30}
$$
where the minimum is taken over $\{p(\tilde{x}|x) : \langle d(x, \tilde{x}) \rangle \leq D\}$. The rate distortion function $R(D)$ captures the relation between rate and distortion. A larger rate $R$ means there is a smaller achievable distortion $D$. The optimal distribution can be found, and it is,
$$
p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x, \beta)} \exp\left(-\beta\, d(x, \tilde{x})\right), \tag{4.31}
$$
where $Z$ is a normalization function (the partition function) and $\beta$ is the positive Lagrange multiplier used to solve the constrained optimization problem for $R(D)$,
$$
\frac{\delta R}{\delta D} = -\beta. \tag{4.32}
$$
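Since $p(\tilde{x})$ in (4.31) itself depends on $p(\tilde{x}|x)$ through (4.27), one common way to obtain a solution numerically is to alternate the two equations until they agree; this is the standard Blahut-Arimoto iteration, which the text above does not describe. The sketch below assumes a small discrete source with a hypothetical squared-error distortion, uses base-2 logarithms for the rate, and is not hardened against underflow at very large $\beta$.

```python
import math

def blahut_arimoto(p_x, d, beta, n_iter=200):
    """Alternate Eq. (4.31) with the marginal p(x~) = sum_x p(x) p(x~|x)."""
    n_x, n_c = len(p_x), len(d[0])
    p_xt = [1.0 / n_c] * n_c  # initial guess for p(x~)
    for _ in range(n_iter):
        cond = []
        for i in range(n_x):
            w = [p_xt[j] * math.exp(-beta * d[i][j]) for j in range(n_c)]
            z = sum(w)  # partition function Z(x, beta)
            cond.append([wj / z for wj in w])
        # Re-estimate the codeword marginal from the updated encoder.
        p_xt = [sum(p_x[i] * cond[i][j] for i in range(n_x)) for j in range(n_c)]
    return cond, p_xt

def rate_and_distortion(p_x, cond, p_xt, d):
    """Return (I(X; X~) in bits, <d>) for the encoder cond[i][j] = p(x~_j|x_i)."""
    rate = dist = 0.0
    for i, px in enumerate(p_x):
        for j, c in enumerate(cond[i]):
            if px * c > 0.0:
                rate += px * c * math.log2(c / p_xt[j])
            dist += px * c * d[i][j]
    return rate, dist

# Hypothetical source: four equiprobable points, two codewords, squared error.
positions = [0.0, 1.0, 2.0, 3.0]
codewords = [0.5, 2.5]
d = [[(x - c) ** 2 for c in codewords] for x in positions]
p_x = [0.25, 0.25, 0.25, 0.25]

for beta in (0.1, 1.0, 5.0):
    cond, p_xt = blahut_arimoto(p_x, d, beta)
    rate, dist = rate_and_distortion(p_x, cond, p_xt, d)
    print(f"beta={beta}: rate={rate:.3f} bits, distortion={dist:.3f}")
```

Sweeping $\beta$ traces out points on the rate distortion trade-off for this toy source: larger $\beta$ yields a higher rate and a lower distortion, consistent with $\beta$ setting the (negative) slope in (4.32).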
A difficulty with the usual rate distortion tools is that it is hard to find a correct and non-arbitrary distortion measure $d(x, \tilde{x})$. Tishby et al. therefore propose using an information bottleneck to determine the relevant information content of a signal [3]. Access to $p(x, y)$ is assumed, just as access to $p(x)$ is assumed in rate distortion theory. $X$ and $Y$ must have positive mutual information: the relevant information about $Y$ is found in $X$. $\tilde{X}$ compresses $X$ as much as possible while preserving as much information about $Y$ as possible. Hence, the following relation must hold,
$$
I(\tilde{X}; Y) = \sum_{y} \sum_{\tilde{x}} p(y, \tilde{x}) \log \frac{p(y, \tilde{x})}{p(y)\, p(\tilde{x})} \leq I(X; Y), \tag{4.33}
$$
since the compression cannot in general preserve more information about $Y$ than what $X$ has. The goal is then to keep fixed the amount of information about $Y$ in $X$ while minimizing the number of bits needed to represent that information in $\tilde{X}$. The method of Lagrange multipliers then gives the following functional to minimize,
$$
\mathcal{L}[p(\tilde{x}, x)] = I(\tilde{X}; X) - \beta I(\tilde{X}; Y). \tag{4.34}
$$
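The bound (4.33) and the functional (4.34) can be checked numerically for any given encoder. The sketch below is only an illustration: the joint distribution `p_xy`, the encoder `hard`, the value of `beta`, and the function names are all hypothetical, and base-2 logarithms are assumed.

```python
import math

def mi_from_joint(joint):
    """Mutual information in bits from a joint table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for a, row in enumerate(joint)
               for b, p in enumerate(row) if p > 0.0)

def ib_functional(p_xy, p_xt_given_x, beta):
    """Return (L, I(X~;X), I(X~;Y)) with L = I(X~;X) - beta I(X~;Y), Eq. (4.34)."""
    n_x, n_y, n_c = len(p_xy), len(p_xy[0]), len(p_xt_given_x[0])
    p_x = [sum(row) for row in p_xy]
    # p(x, x~) = p(x) p(x~|x)
    joint_x_xt = [[p_x[i] * p_xt_given_x[i][j] for j in range(n_c)]
                  for i in range(n_x)]
    # p(x~, y) = sum_x p(x~|x) p(x, y)
    joint_xt_y = [[sum(p_xt_given_x[i][j] * p_xy[i][k] for i in range(n_x))
                   for k in range(n_y)] for j in range(n_c)]
    i_x = mi_from_joint(joint_x_xt)
    i_y = mi_from_joint(joint_xt_y)
    return i_x - beta * i_y, i_x, i_y

# Hypothetical joint distribution p(x, y): four values of x, two values of y.
p_xy = [[0.20, 0.05], [0.15, 0.10], [0.10, 0.15], [0.05, 0.20]]
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # two blocks of two

value, i_x, i_y = ib_functional(p_xy, hard, beta=5.0)
print(f"I(X~;X)={i_x:.4f}  I(X~;Y)={i_y:.4f}  I(X;Y)={mi_from_joint(p_xy):.4f}")
print(f"L={value:.4f}")
```

For this encoder the compressed variable retains part, but not all, of the information that $X$ carries about $Y$, as (4.33) requires.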
It is then shown that the solution to the optimization problem is,
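$$
p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x, \beta)} \exp\left(-\beta \sum_{y} p(y|x) \log \frac{p(y|x)}{p(y|\tilde{x})}\right), \qquad p(y|\tilde{x}) = \frac{1}{p(\tilde{x})} \sum_{x} p(y|x)\, p(\tilde{x}|x)\, p(x). \tag{4.35}
$$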
Because of the multiple appearances of $p(\tilde{x}, x)$, the solution for $p(\tilde{x}, x)$ and $p(\tilde{x})$ must be determined self-consistently.
Notice that this solution (4.35) is equivalent to,
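$$
p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x, \beta)} \exp\left(-\beta D_{\mathrm{KL}}\left(p(y|x) \,\|\, p(y|\tilde{x})\right)\right), \qquad Z(x, \beta) = \sum_{\tilde{x}} p(\tilde{x}) \exp\left(-\beta D_{\mathrm{KL}}\left(p(y|x) \,\|\, p(y|\tilde{x})\right)\right). \tag{4.36}
$$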
So the KL divergence appears as the correct distortion measure for this problem of optimizing the information bottleneck.
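One way to see what "self-consistently" means in practice is to iterate (4.27), the second relation in (4.35), and (4.36) in turn until the distributions stop changing. The following is a rough sketch under stated assumptions, not the procedure used elsewhere in this work: the toy distributions, the random initialization, the fixed iteration count, and the use of natural logarithms in the KL divergence are all choices made for the example, and the code assumes every codeword keeps nonzero probability.

```python
import math
import random

def kl(p, q):
    """D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def information_bottleneck(p_x, p_y_given_x, n_codes, beta, n_iter=300, seed=0):
    n_x, n_y = len(p_x), len(p_y_given_x[0])
    rng = random.Random(seed)

    def decoder(cond):
        # p(x~) from Eq. (4.27), then p(y|x~) from the second part of Eq. (4.35).
        p_xt = [sum(p_x[i] * cond[i][j] for i in range(n_x))
                for j in range(n_codes)]
        p_y_xt = [[sum(p_y_given_x[i][k] * cond[i][j] * p_x[i]
                       for i in range(n_x)) / p_xt[j]
                   for k in range(n_y)] for j in range(n_codes)]
        return p_xt, p_y_xt

    def encoder(p_xt, p_y_xt):
        # p(x~|x) = p(x~) exp(-beta D_KL) / Z(x, beta), Eq. (4.36).
        cond = []
        for i in range(n_x):
            w = [p_xt[j] * math.exp(-beta * kl(p_y_given_x[i], p_y_xt[j]))
                 for j in range(n_codes)]
            z = sum(w)
            cond.append([wj / z for wj in w])
        return cond

    # Random soft assignment: an exactly uniform start would keep all
    # codewords identical and never leave the trivial solution.
    cond = []
    for _ in range(n_x):
        w = [rng.random() + 0.1 for _ in range(n_codes)]
        cond.append([wj / sum(w) for wj in w])

    p_xt, p_y_xt = decoder(cond)
    for _ in range(n_iter):
        cond = encoder(p_xt, p_y_xt)
        p_xt, p_y_xt = decoder(cond)
    return cond, p_xt, p_y_xt

# Hypothetical toy problem: four values of x, two of y, two codewords.
p_x = [0.25, 0.25, 0.25, 0.25]
p_y_given_x = [[0.8, 0.2], [0.6, 0.4], [0.4, 0.6], [0.2, 0.8]]
cond, p_xt, p_y_xt = information_bottleneck(p_x, p_y_given_x, n_codes=2, beta=5.0)
for i, row in enumerate(cond):
    print(f"p(x~|x_{i}) = {[round(v, 3) for v in row]}")
```

Such an iteration generally converges only to a local solution that depends on the initialization and on $\beta$, so the result should be read as an illustration of the self-consistency conditions rather than as a unique answer.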
The connection to our problem should be clear from the above discussion. With the stochastic association of blocks of $X$ into elements of $\tilde{X}$, we here see another way to understand the renormalization group procedure. $X$ represents the UV variables. $\tilde{X}$ represents the renormalized variables, e.g., in the Kadanoff procedure they are the block-spins that summarize the information about the spins in a block. $Y$ then represents the variables after performing a rescaling, e.g., of the lattice in the Kadanoff procedure.
The KL divergence that we have been calculating can be understood as an effective distortion measure of passing information from the UV variables to the IR variables via renormalized variables: in particular, it can be understood as an effective distortion measure of passing information from string physics to SM physics as the scale is changed. The distribution $p(y|x)$ is the final distribution after renormalization and rescaling; $p(x)$ is the original UV distribution, giving $p(x, y) = p(y|x)\, p(x)$; $p(\tilde{x}|x)$ is the kernel $T$ that transforms from the UV variables to the blocked variables, which allows us by (4.35) to find $p(y|\tilde{x})$ by finding $p(y, \tilde{x})$ and then marginalizing over $y$ to get $p(\tilde{x})$. However, it is not clear how to find $p(y, \tilde{x})$, and it is not clear how to easily invert (4.35) to find $p(y|\tilde{x})$ directly. Due to these difficulties, we continue on with our procedure of using the KL divergence to directly quantify the information.