Chapter 3 Entropy Maximisation
3.3 The principle of entropy maximisation
3.3.5 Entropic update of probabilities
It has been demonstrated how PEM can be used to construct a probability distribution describing the behaviour of almost any system of interest by constructing an entropy function and converting available information about the system into a set of constraints. It has also been noted that the maximised value of the entropy function captures the amount of missing information in the probabilities constructed. The entropic probability of a system therefore reflects our state of knowledge about the system. We therefore expect that the more information we have about the system under consideration, the better the predictive power of the constructed probability distribution, and hence the more we can trust its description of the system. This means that being able to construct a probability distribution based on the current state of knowledge about the system is not enough. It is equally important to find a means of updating the distribution as and when new information becomes available. The updating scheme ensures that the probability distribution at any given time reflects our most current state of knowledge about the system.
One obvious updating scheme is the Bayesian method, in which the existing probability distribution (called the prior) is replaced by a posterior distribution as described in Section 3.2.
Employing the Bayesian method requires making some assumptions about the probability distribution of the new information, which means artificially "adding" information that may or may not be true. The good news is that we do not have to switch to the Bayesian updating method when new information becomes available. The framework of entropy maximisation allows the probability distribution to be updated with new information in at least two consistent ways: absolute entropy update (AEU), as proposed by Jaynes (1957), and relative entropy update (REU), also called cross entropy update, first proposed by Kullback and Leibler (1951).
Absolute entropy update (AEU)
The AEU approach simply discards the existing probability distribution and reconstructs a new one using both the old and the new information. This approach is attractive and efficient if the volume of new information is significantly larger than that of the old information and/or if the new information is considered more accurate than certain aspects of the old information. In many practical problems, however, the new information is likely to be significantly smaller in volume and/or to reflect some aspect of the system that has not been adequately described by the existing probability distribution. In these situations, the REU method is more efficient and more relevant.
Relative entropy update (REU)
The REU generates a new (posterior) probability distribution by allowing the construction of Boltzmann's entropy function to account explicitly for the existing (prior) probability distribution.
Proposition 3.1: Maximising entropy is equivalent to maximising the log-likelihood of the multinomial probability mass function:
$$P(N_1, N_2, \dots, N_n \mid \mathbf{q}) = \left( \frac{N!}{\prod_{i=1}^{n} N_i!} \right) \prod_{i=1}^{n} q_i^{N_i} \tag{3.15}$$
with the assumption that the prior probabilities $q_i$ ($i = 1, 2, \dots, n$) follow a uniform distribution.
$q_i$ is the prior probability that alternative $i$ takes on the value $N_i$, such that $N_1 + N_2 + \dots + N_n = N$.
Proof 3.1: The term in the bracket of Equation (3.15) is called the multinomial coefficient and corresponds to the entropy function in Equation (3.8). Taking the natural logarithm of (3.15) and applying Stirling's approximation, $\ln N! \approx N \ln N - N$ (the linear terms cancel because $\sum_i N_i = N$), we have:
$$\ln P = N \ln N - \sum_i N_i \ln N_i + \sum_i N_i \ln q_i$$
Define the posterior probability as $p_i = N_i / N$, where $N$ is the total size of the system and $N_i$ is the size occupied by alternative $i$, or the number of times that event $i$ occurred.
Converting the $N_i$'s into $p_i$'s and simplifying, we have:
$$\ln P = -N \sum_i p_i \ln p_i + N \sum_i p_i \ln q_i$$
which simplifies to become
$$\ln P = -N \sum_i p_i \ln \frac{p_i}{q_i} \tag{3.16}$$
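The quality of the Stirling step behind (3.16) can be checked numerically. The sketch below is illustrative only: the counts, the prior and the function names are assumptions chosen for the example, not values from this chapter. It compares the exact logarithm of (3.15), computed with the log-gamma function, against the right-hand side of (3.16), both expressed per unit of $N$.

```python
import math

def log_multinomial_pmf(counts, prior):
    """Exact ln of the multinomial pmf in (3.15), using lgamma(k + 1) = ln k!."""
    total = sum(counts)
    log_coeff = math.lgamma(total + 1) - sum(math.lgamma(k + 1) for k in counts)
    return log_coeff + sum(k * math.log(q) for k, q in zip(counts, prior))

def relative_entropy_form(counts, prior):
    """Right-hand side of (3.16): -N * sum_i p_i ln(p_i / q_i), with p_i = N_i / N."""
    total = sum(counts)
    return -total * sum(
        (k / total) * math.log((k / total) / q)
        for k, q in zip(counts, prior)
        if k > 0  # terms with p_i = 0 contribute nothing
    )

counts = [500_000, 300_000, 200_000]  # hypothetical N_i, so N = 1,000,000
prior = [0.2, 0.3, 0.5]               # hypothetical prior probabilities q_i

n_total = sum(counts)
exact = log_multinomial_pmf(counts, prior)
approx = relative_entropy_form(counts, prior)

# Per-observation values; the gap shrinks roughly like ln(N) / N.
print(exact / n_total, approx / n_total)
```

The two values differ only through the lower-order terms that Stirling's approximation drops, which is why the agreement improves as $N$ grows.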
If the prior probabilities follow a uniform distribution, $q_1 = q_2 = \dots = q_n = q$, then the above simplifies to become:
$$\ln P = -N \sum_i p_i \ln p_i + N \ln q = -N \sum_i p_i \ln p_i + C$$
where $C = N \ln q$ is constant and can be ignored in the optimisation process. Clearly, maximising the entropy $H$ in (3.8) is equivalent to maximising $P$ under the assumption that the prior probabilities are uniformly distributed. In fact, it has already been shown that the uniform probability distribution is the default distribution when the entropy of a system is maximised with no available information, an outcome that echoes Laplace's principle of insufficient reason.
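As a worked instance of the constant term, assume the uniform prior is normalised over the $n$ alternatives, so that each prior probability equals $1/n$, and write $H = -\sum_i p_i \ln p_i$ for the entropy of the posterior probabilities $p_i$ (this notation is an assumption made here for illustration). The constant is then $-N \ln n$, and

$$\ln P = -N \sum_i p_i \ln p_i - N \ln n = N \left( H - \ln n \right).$$

Since $H \le \ln n$, with equality only when every $p_i = 1/n$, the approximated log-likelihood is never positive and reaches its maximum of zero at the uniform distribution, which is consistent with the default distribution noted above.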
Thus, the entropy in (3.8) or (3.11) can be generalised by relaxing the uniform distribution assumption of the prior probabilities:
$$H(p_1, \dots, p_n \mid \mathbf{q}) = -\sum_i p_i \ln \frac{p_i}{q_i} \tag{3.17}$$
Since maximising $H(p_1, \dots, p_n \mid \mathbf{q})$ is equivalent to minimising $-H(p_1, \dots, p_n \mid \mathbf{q})$, the quantity $-H(p_1, \dots, p_n \mid \mathbf{q})$ is referred to in the literature as cross entropy or Kullback-Leibler divergence (Kullback and Leibler 1951):
$$H^{-}(p_1, \dots, p_n \mid \mathbf{q}) = -H(p_1, \dots, p_n \mid \mathbf{q}) = \sum_i p_i \ln \frac{p_i}{q_i} \tag{3.18}$$
The process of minimising (3.18) subject to the new information is generally referred to in the literature as the principle of minimum information (Williams 1980; Kullback and Leibler 1951; Caticha and Giffin 2006).
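A minimal sketch of a relative entropy update follows, under stated assumptions. The new information is taken to be a single constraint on the mean of the alternatives' values, and the minimiser of (3.18) is taken to have the exponential form $p_i \propto q_i e^{\lambda x_i}$, which is the standard Lagrangian result for a linear constraint but is not derived in this section. The six-valued example, the prior, the target mean and all function names are hypothetical.

```python
import math

def tilted(prior, values, lam):
    """Distribution with p_i proportional to q_i * exp(lam * x_i)."""
    shift = max(lam * x for x in values)  # subtract the largest exponent to avoid overflow
    weights = [q * math.exp(lam * x - shift) for q, x in zip(prior, values)]
    z = sum(weights)
    return [w / z for w in weights]

def mean(p, values):
    return sum(pi * x for pi, x in zip(p, values))

def reu_update(prior, values, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Minimise sum_i p_i ln(p_i / q_i) subject to sum_i p_i x_i = target_mean.

    The mean of the tilted distribution increases with lam, so the
    multiplier can be located by bisection on the interval [lo, hi].
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(tilted(prior, values, mid), values) < target_mean:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return tilted(prior, values, 0.5 * (lo + hi))

def cross_entropy(p, q):
    """The quantity in (3.18): sum_i p_i ln(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

values = [1, 2, 3, 4, 5, 6]                    # hypothetical values of the alternatives
prior = [0.10, 0.10, 0.20, 0.20, 0.20, 0.20]   # hypothetical prior, with mean 3.9

posterior = reu_update(prior, values, target_mean=4.5)  # new information: mean is 4.5
unchanged = reu_update(prior, values, target_mean=3.9)  # information the prior already satisfies

print(posterior, cross_entropy(posterior, prior))
```

When the new constraint is already satisfied by the prior, the update leaves the distribution unchanged, and with a uniform prior the same routine reduces to an ordinary entropy maximisation, in line with Proposition 3.1.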
We have demonstrated ways in which existing probabilities can be updated with new information as and when it becomes available, making the entropy framework a truly universal method of deductive inference. It is also worth noting that some scholars, such as Williams (1980), Diaconis and Zabell (1982), Jaynes (1988) and Caticha and Giffin (2006), have investigated the link between the principle of minimum information and the Bayesian updating method and found the latter to be a special case of the former. As noted earlier, a connection with maximum likelihood estimation has also been established (Burg 1978; Seth and Kapur 1990). Thus, the principle of entropy maximisation is truly universal: it allows for the construction of probabilities, the updating of those probabilities with new information as it becomes available, and the estimation of the parameters governing the distributions. The next section presents a numerical example to illustrate some important features of the entropy concept.