Entropic update of probabilities - The principle of entropy maximisation

Chapter 3 Entropy Maximisation

3.3 The principle of entropy maximisation

3.3.5 Entropic update of probabilities

It has been demonstrated how PEM can be used to construct a probability distribution describing the behaviour of almost any system of interest by constructing an entropy function and converting the available information about the system into a set of constraints. It has also been noted that the maximised value of the entropy function captures the amount of information missing from the constructed probabilities. The entropic probability of a system therefore reflects our state of knowledge about the system. We therefore expect that the more information we have about the system under consideration, the better the predictive power of the constructed probability distribution and hence the more we can trust its description of the system. This means that being able to construct a probability distribution based on the current state of knowledge about the system is not enough. It is equally important to have a means of updating the distribution as and when new information becomes available. The updating scheme ensures that the probability distribution at any given time reflects our most current state of knowledge about the system.

One obvious updating scheme is the Bayesian method where the existing probability distribution (called the prior) is replaced by a posterior distribution as described in Section 3.2.

Employing the Bayesian method requires making some assumptions about the probability distribution of the new information, which means artificially "adding" information that may be true or false. The good news is that we do not have to switch to the Bayesian updating method when new information becomes available. The framework of entropy maximisation allows the probability distribution to be updated with new information in at least two consistent ways: absolute entropy update (AEU), as proposed by Jaynes (1957), and relative entropy update (REU), also called cross entropy update, first proposed by Kullback and Leibler (1951).

Absolute entropy update (AEU)

The AEU approach simply discards the existing probability distribution and constructs a new one using both the old and the new information. This approach is attractive and efficient if the volume of new information is significantly larger than that of the old information and/or if the new information is considered more accurate than certain aspects of the old information. In many practical problems, however, the new information is likely to be significantly smaller in volume and/or to reflect some aspect of the system that has not been adequately described by the existing probability distribution. In these situations, the REU method is more efficient and more relevant.

Relative entropy update (REU)

The REU generates a new (posterior) probability distribution by allowing the construction of Boltzmann’s entropy function to account explicitly for the existing (prior) probability distribution.

Proposition 3.1: Maximising entropy is equivalent to maximising the log-likelihood of the multinomial probability mass function:

\[
f(n_1, n_2, \ldots, n_m \mid 𝒫) = \left( \frac{n!}{\prod_{i=1}^{m} n_i!} \right) \prod_{i=1}^{m} 𝓅_i^{\,n_i} \tag{3.15}
\]

with the assumption that the prior probabilities 𝓅_𝑖 (𝑖 = 1,2, … 𝑚) follow a uniform distribution.

Here 𝓅_𝑖 is the prior probability of alternative 𝑖, and 𝑛_𝑖 is the value taken by alternative 𝑖, such that 𝑛₁ + 𝑛₂ + ⋯ + 𝑛_𝑚 = 𝑛.

Proof 3.1: The term in the bracket of Equation (3.15) is called the multinomial coefficient and corresponds to the entropy function in Equation (3.8). Taking the natural logarithm of (3.15) and applying Stirling’s approximation we have:

\[
\ln f = n \ln n - \sum_{i} n_i \ln n_i + \sum_{i} n_i \ln 𝓅_i
\]
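For reference, the Stirling step can be written out term by term. This is a sketch that assumes the leading-order form ln 𝑘! ≈ 𝑘 ln 𝑘 − 𝑘, chosen because it reproduces the expression above; lower-order terms are dropped:

\[
\begin{aligned}
\ln f &= \ln n! - \sum_{i} \ln n_i! + \sum_{i} n_i \ln 𝓅_i \\
      &\approx (n \ln n - n) - \sum_{i} (n_i \ln n_i - n_i) + \sum_{i} n_i \ln 𝓅_i \\
      &= n \ln n - \sum_{i} n_i \ln n_i + \sum_{i} n_i \ln 𝓅_i
\end{aligned}
\]

The linear terms cancel because 𝑛₁ + 𝑛₂ + ⋯ + 𝑛_𝑚 = 𝑛.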

Define the posterior probability as 𝑝_𝑖 = 𝑛_𝑖/𝑛, where 𝑛 is the total size of the system and 𝑛_𝑖 is the size occupied by alternative 𝑖, or the number of times that event 𝑖 occurred.

Converting the 𝑛_𝑖 into 𝑝_𝑖 and simplifying, we have:

\[
\ln f = -n \sum_{i} p_i \ln p_i + n \sum_{i} p_i \ln 𝓅_i
\]

which simplifies to become

\[
\ln f = -n \sum_{i} p_i \ln \frac{p_i}{𝓅_i} \tag{3.16}
\]

If the prior probabilities follow a uniform distribution, 𝓅₁ = 𝓅₂ = ⋯ = 𝓅_𝑚 = 𝓅, then, because the 𝑝_𝑖 sum to one, the above simplifies to:

\[
\ln f = -n \sum_{i} p_i \ln p_i + n \ln 𝓅 = -n \sum_{i} p_i \ln p_i + k
\]

where 𝑘 = 𝑛 ln 𝓅 is constant and can be ignored in the optimisation process. Clearly, maximising the entropy 𝑆 in (3.8) is equivalent to maximising 𝑓 under the assumption that the prior probabilities are uniformly distributed. In fact, it has already been shown that the uniform probability distribution is the default distribution when the entropy of a system is maximised with no available information, an outcome that echoes Laplace’s principle of insufficient reason.
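Proposition 3.1 can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function names and the counts are hypothetical, and `math.lgamma` is used only to evaluate the exact logarithm of Equation (3.15) for comparison with the Stirling form in Equation (3.16).

```python
import math

def log_multinomial_pmf(counts, prior):
    """Exact ln f for Equation (3.15), using lgamma(k + 1) = ln k!."""
    n = sum(counts)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_coeff + sum(c * math.log(q) for c, q in zip(counts, prior) if c > 0)

def stirling_log_f(counts, prior):
    """Equation (3.16): -n * sum_i p_i ln(p_i / prior_i), with p_i = n_i / n."""
    n = sum(counts)
    return -n * sum((c / n) * math.log((c / n) / q)
                    for c, q in zip(counts, prior) if c > 0)

def entropy(p):
    """Shannon entropy -sum_i p_i ln p_i (zero-probability terms contribute 0)."""
    return -sum(x * math.log(x) for x in p if x > 0)

counts = [5000, 3000, 2000]          # hypothetical n_i, chosen for illustration
n, m = sum(counts), len(counts)
uniform = [1.0 / m] * m              # uniform prior over the m alternatives
p = [c / n for c in counts]          # posterior p_i = n_i / n

exact = log_multinomial_pmf(counts, uniform)
approx = stirling_log_f(counts, uniform)
# Under a uniform prior, the prior term is the constant n * ln(1/m).
via_entropy = n * entropy(p) + n * math.log(1.0 / m)

print(exact, approx, via_entropy)
```

Under these assumptions the Stirling form and the entropy-plus-constant form agree to rounding error, and both track the exact log-likelihood up to the lower-order terms that Stirling's approximation drops. A more even set of counts has both a higher entropy and a higher log-likelihood, as the proposition states.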

Thus, the entropy in (3.8) or (3.11) can be generalised by relaxing the uniform distribution assumption of the prior probabilities:

\[
S(p_1, \ldots, p_n \mid 𝒫) = -\sum_{i} p_i \ln \frac{p_i}{𝓅_i} \tag{3.17}
\]

Since maximising 𝑆(𝑝₁, … , 𝑝_𝑛|𝒫) is equivalent to minimising −𝑆(𝑝₁, … , 𝑝_𝑛|𝒫), the problem is usually stated as a minimisation. The quantity −𝑆(𝑝₁, … , 𝑝_𝑛|𝒫) is referred to in the literature as cross entropy or Kullback-Leibler divergence (Kullback and Leibler 1951):

\[
\vec{S}(p_1, \ldots, p_n \mid 𝒫) = -S(p_1, \ldots, p_n \mid 𝒫) = \sum_{i} p_i \ln \frac{p_i}{𝓅_i} \tag{3.18}
\]
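As an illustration of Equation (3.18), the sketch below evaluates the relative entropy for a hypothetical three-alternative prior. The function name and the numbers are assumptions made for this example. It shows the two properties that make the quantity usable as an updating criterion: it is zero when the posterior equals the prior, and positive otherwise.

```python
import math

def relative_entropy(p, prior):
    """Equation (3.18): sum_i p_i ln(p_i / prior_i).

    Terms with p_i = 0 contribute 0; the prior is assumed strictly positive.
    """
    return sum(x * math.log(x / q) for x, q in zip(p, prior) if x > 0)

prior = [0.5, 0.3, 0.2]                 # hypothetical prior probabilities

unchanged = relative_entropy(prior, prior)          # posterior equals prior
shifted = relative_entropy([0.2, 0.3, 0.5], prior)  # posterior moved away from prior

print(unchanged, shifted)
```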

The process of minimising (3.18) subject to the new information is generally referred to in the literature as the principle of minimum information (Williams 1980; Kullback and Leibler 1951; Caticha and Giffin 2006).
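The following sketch shows one way the minimisation of (3.18) could be carried out when the new information takes the form of a single mean-value constraint, Σ 𝑝_𝑖 𝑎_𝑖 = 𝐴. This form of constraint, the function name `reu_update`, the bracketing interval for the Lagrange multiplier and the numbers are all assumptions made for illustration. For such a constraint the minimiser has the form 𝑝_𝑖 ∝ 𝓅_𝑖 exp(𝜆𝑎_𝑖), and because the constrained mean increases with 𝜆, the multiplier can be found by bisection.

```python
import math

def reu_update(prior, values, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Minimise sum_i p_i ln(p_i / prior_i) subject to sum_i p_i * values_i = target_mean.

    Assumes a strictly positive prior and a multiplier inside [lo, hi]
    (both are assumptions of this sketch).
    """
    if not min(values) < target_mean < max(values):
        raise ValueError("target_mean must lie strictly between min and max of values")

    def tilted(lam):
        # p_i proportional to prior_i * exp(lam * a_i); shift exponents for stability.
        exps = [lam * a for a in values]
        shift = max(exps)
        weights = [q * math.exp(e - shift) for q, e in zip(prior, exps)]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(p):
        return sum(x * a for x, a in zip(p, values))

    # The mean of the tilted distribution is increasing in lam, so bisection applies.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(tilted(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return tilted(0.5 * (lo + hi))

prior = [0.5, 0.3, 0.2]      # hypothetical existing (prior) distribution
values = [1.0, 2.0, 3.0]     # hypothetical quantity a_i attached to each alternative
prior_mean = sum(q * a for q, a in zip(prior, values))

posterior = reu_update(prior, values, target_mean=2.0)   # new information: mean is 2.0
no_news = reu_update(prior, values, target_mean=prior_mean)

print(posterior)
```

If the new constraint is already satisfied by the prior, the update leaves the distribution essentially unchanged. With a uniform prior the same routine reduces to ordinary entropy maximisation under the constraint, consistent with Proposition 3.1.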

We have demonstrated ways in which existing probabilities can be updated with new information as and when it becomes available, making the entropy framework a truly universal method of deductive inference. It is also worth noting that some scholars, such as Williams (1980), Diaconis and Zabell (1982), Jaynes (1988) and Caticha and Giffin (2006), have investigated the link between the principle of minimum information and the Bayesian updating method and found the latter to be a special case of the former. As noted earlier, a connection with maximum likelihood estimation has also been established (Burg 1978; Seth and Kapur 1990). Thus, the principle of entropy maximisation is truly universal: it allows for the construction of probabilities, the updating of those probabilities with new information as it becomes available, and the estimation of the parameters governing the distributions. The next section presents a numerical example to illustrate some important features of the entropy concept.

In document The Siting Of Multi-User Inland Intermodal Container Terminals In Transport Networks (Page 86-90)