Background - Generalized Maximum Entropy, Convexity and Machine Learning

The proof of the Hammersley-Clifford Theorem relies on terminology and results from both number theory and graph theory. We give the necessary elements here, together with q-analogues of some elementary mathematical operations.

7.2.1 Graphical Models

A graph $G = (V, E)$ consists of a set of vertices (or nodes), $V$, and a set of edges, $E$. An edge is an ordered pair of nodes. A clique, $c \subseteq V$, is a fully connected subgraph of $G$. The vertices, $\{a_i\}_{i=1}^{M}$, in a graphical model typically correspond to the variables of a distribution with density $f(x)$. We will move between these representations assuming the order of the nodes is the same.

We will say a function $f$ $T$-decomposes according to a graph $G$, for the transformation $T : \mathbb{R}^X \to \mathbb{R}^X$, if there are functions $\psi_c$ such that

$$Tf(x) = \sum_{c \in C} \psi_c(x_c), \tag{7.1}$$

where $C$ is the set of cliques of $G$ and where $x_c$ is equal to $x$ with the entries corresponding to $V \setminus c$ removed.

Let $x_c$ denote a vector equal to $x$ with the $m$-th and $n$-th entries deleted. We say that a function has the pairwise generalized-Markov property if, whenever the nodes $a_m$ and $a_n$ do not share an edge, there exist functions $h_1$ and $h_2$, with appropriate domains and ranges, such that

$$Tf(x) = h_1(x_m, x_c) + h_2(x_n, x_c). \tag{7.2}$$

In the classic formulation of the Hammersley-Clifford Theorem, $T = \log$ is implicitly assumed. In this setting, decomposition is equivalent to multiplicative factorization. The role of $T$ is therefore to convert factorization (of some kind) into addition. The generalized-Markov property then asserts the separability of the effects of $x_m$ and $x_n$ when the corresponding nodes are separated. More in-depth background in graph theory, especially as it relates to graphical models, is given by Lauritzen (1996).
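As an illustration of the pairwise generalized-Markov property with T = log, the following minimal sketch builds an unnormalised density on the chain a1 - a2 - a3 and checks numerically that log f contains no interaction between x1 and x3. The potentials `psi_12` and `psi_23`, the evaluation point, and the helper `mixed_difference` are hypothetical choices made for this example and do not come from the text.

```python
import math

def psi_12(x1, x2):
    # Hypothetical clique potential on the edge (a1, a2).
    return -(x1 - x2) ** 2

def psi_23(x2, x3):
    # Hypothetical clique potential on the edge (a2, a3).
    return -0.5 * (x2 + x3) ** 2

def f(x1, x2, x3):
    # Unnormalised density on the chain a1 - a2 - a3 (a1 and a3 share no edge).
    return math.exp(psi_12(x1, x2) + psi_23(x2, x3))

def g(x1, x2, x3):
    # Same density with an extra x1 * x3 term, i.e. an added edge (a1, a3).
    return math.exp(psi_12(x1, x2) + psi_23(x2, x3) - x1 * x3)

def log_of(density):
    # T = log applied to a density.
    return lambda x1, x2, x3: math.log(density(x1, x2, x3))

def mixed_difference(tf, x1, x2, x3, d1=0.7, d3=1.3):
    # Second mixed difference in x1 and x3 with x2 held fixed. It vanishes
    # whenever tf(x) = h1(x1, x2) + h2(x3, x2).
    return (tf(x1 + d1, x2, x3 + d3) - tf(x1 + d1, x2, x3)
            - tf(x1, x2, x3 + d3) + tf(x1, x2, x3))

point = (0.3, -0.2, 0.5)
chain_interaction = mixed_difference(log_of(f), *point)
extra_edge_interaction = mixed_difference(log_of(g), *point)
print(chain_interaction, extra_edge_interaction)
```

For the chain the mixed difference is zero up to rounding, which is what the additive form in (7.2) requires. With the extra x1 * x3 term it equals -d1 * d3, so no such separation exists.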

7.2.2 Number Theory Essentials

Number theoretic tools play an important role in accounting for all the operations that could be defined on a graph. Although the subject is rather deep, we only require a modest gathering of results for our problem.

The prime numbers will be denoted $p_i$, which represents the $i$-th prime number.

An arithmetic function is one of the form $g : \mathbb{N} \to \mathbb{C}$. A multiplicative function is an arithmetic function that also satisfies $g(m \cdot n) = g(m)\,g(n)$ whenever the greatest common divisor of $m$ and $n$ is 1. If this holds for all $m$, $n$, the arithmetic function is said to be totally multiplicative. The sum-function of an arithmetic function $g$ is defined as

$$S_g(n) := \sum_{d \mid n} g(d),$$

where $\sum_{d \mid n}$ is standard notation for the sum over the divisors of $n$.

The fundamental theorem of arithmetic states that any $1 < n \in \mathbb{N}$ decomposes into a unique product of powers of primes: $n = p_{i_1}^{\alpha_{i_1}} \cdots p_{i_M}^{\alpha_{i_M}}$. The function $\lambda(n)$ will be used to denote the number of prime factors of $n$.

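We require a few of the important number theoretic functions. One is the Möbius function: for $n = p_{i_1}^{\alpha_{i_1}} \cdots p_{i_M}^{\alpha_{i_M}}$,

$$\mu(n) = \begin{cases} 0 & \text{if any } \alpha_i > 1, \\ 1 & \text{if } n = 1, \\ (-1)^{\lambda(n)} & \text{otherwise.} \end{cases} \tag{7.3}$$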

The first condition tests whether or not the factorization of n contains a square number.
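The definition in (7.3) can be sketched in a few lines of Python. This is only an illustrative implementation based on trial division; the function names `prime_factorization` and `mobius` are our own and are not taken from the text.

```python
def prime_factorization(n):
    """Return {prime: exponent} for an integer n >= 1, using trial division."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        # Whatever remains is a prime larger than the square root of the input.
        factors[n] = factors.get(n, 0) + 1
    return factors

def mobius(n):
    """Moebius function following the three cases of (7.3)."""
    factors = prime_factorization(n)
    if any(alpha > 1 for alpha in factors.values()):
        return 0  # some exponent exceeds 1, so n is divisible by a square
    # n is square-free, so the number of prime factors is len(factors);
    # n = 1 has an empty factorization, giving (-1) ** 0 = 1.
    return (-1) ** len(factors)

print(prime_factorization(360))
print([mobius(n) for n in range(1, 11)])
```

Here 360 factorizes as 2^3 · 3^2 · 5, and the second line prints the values of µ(n) for n = 1, ..., 10.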

The constant function will be denoted $1(n)$, or simply $1$ when the context makes it clear that a function, not a number, is required. The Dirichlet identity is defined as

$$\epsilon(n) = \begin{cases} 1 & \text{if } n = 1, \\ 0 & \text{otherwise.} \end{cases}$$

The Dirichlet convolution of two arithmetic functions, $f$ and $g$, is another arithmetic function. It is denoted $f * g$ and defined as

$$(f * g)(n) = \sum_{d \mid n} f(d)\, g\!\left(\frac{n}{d}\right). \tag{7.4}$$

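This operation is commutative and associative. The sum-function, defined earlier, is in fact the Dirichlet convolution $S_f = f * 1$. Two other important identities are $\mu * 1 = \epsilon$ and, for all arithmetic functions $f$, $f * \epsilon = f$, where $\epsilon$ is the Dirichlet identity. Using these, we can easily derive the famous Möbius Inversion theorem:

$$S_f = f * 1 \quad \Longleftrightarrow \quad f = \mu * S_f. \tag{7.5}$$

Indeed, $\mu * S_f = \mu * (f * 1) = f * (\mu * 1) = f * \epsilon = f$.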

This theorem plays an important role in the analysis. Our account of the number theoretic tools has been terse. To place them in their proper context one should consult a reference such as Wilf (1994) or Graham et al. (1989). The notes of Stankova-Frenkel (1999) are also helpful.
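The Dirichlet convolution and Möbius inversion can be checked numerically. The sketch below is a brute-force illustration only: the helper names (`divisors`, `dirichlet`, `one`, `identity`) and the test function `f` are assumptions made for this example, not notation from the text.

```python
def divisors(n):
    # All positive divisors of n, by brute force.
    return [d for d in range(1, n + 1) if n % d == 0]

def mobius(n):
    # Moebius function via trial division.
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # p squared divides the original n
            result = -result
        p += 1
    return -result if n > 1 else result

def dirichlet(f, g):
    # (f * g)(n) = sum over d | n of f(d) g(n / d), as in (7.4).
    return lambda n: sum(f(d) * g(n // d) for d in divisors(n))

one = lambda n: 1                        # the constant function 1
identity = lambda n: 1 if n == 1 else 0  # the Dirichlet identity

f = lambda n: n * n + 1                  # an arbitrary arithmetic function
sum_f = dirichlet(f, one)                # the sum-function S_f = f * 1
recovered = dirichlet(mobius, sum_f)     # Moebius inversion: mu * S_f

mu_star_one = dirichlet(mobius, one)
print(all(mu_star_one(n) == identity(n) for n in range(1, 51)))
print(all(recovered(n) == f(n) for n in range(1, 51)))
```

Both checks should print `True`: the first confirms that µ ∗ 1 equals the Dirichlet identity on 1, ..., 50, and the second that convolving the sum-function with µ recovers f.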

7.2.3 q-logarithm

The $q$-logarithm and $q$-exponential are included to make the chapter self-contained. This material is presented with much more background in the previous chapter. The $q$-logarithm is defined for $q > 0$ as

$$\log_q(p) := \begin{cases} \log(p) & \text{if } q = 1, \\ \dfrac{p^{1-q} - 1}{1 - q} & \text{otherwise.} \end{cases} \tag{7.6}$$

We let the notation $(v)_+$ mean $v$ if $v > 0$ and $0$ otherwise. The inverse of the $q$-logarithm is

$$\exp_q(v) = \big(1 + (1-q)v\big)_+^{\frac{1}{1-q}}.$$
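A minimal numerical sketch of the q-logarithm and q-exponential follows. The function names `q_log` and `q_exp` are our own. The treatment of the clipped case for q > 1, where the exponent 1/(1−q) is negative, is an assumption of this sketch rather than something stated in the text.

```python
import math

def q_log(p, q):
    # q-logarithm of (7.6), for p > 0 and q > 0.
    if p <= 0 or q <= 0:
        raise ValueError("q_log requires p > 0 and q > 0")
    if q == 1:
        return math.log(p)
    return (p ** (1 - q) - 1) / (1 - q)

def q_exp(v, q):
    # q-exponential, with the (.)_+ clipping applied to the base.
    if q == 1:
        return math.exp(v)
    base = max(1 + (1 - q) * v, 0.0)
    if base == 0.0:
        # Assumption of this sketch: 0 for q < 1; for q > 1 the exponent is
        # negative, so the clipped value is reported as infinity.
        return 0.0 if q < 1 else math.inf
    return base ** (1 / (1 - q))

# Round trip: exp_q(log_q(p)) should return p for every p > 0.
round_trip_ok = all(
    abs(q_exp(q_log(p, q), q) - p) < 1e-9
    for q in (0.5, 1, 2)
    for p in (0.1, 1.0, 3.5)
)
print(round_trip_ok)
print(q_log(2.0, 1.0001), math.log(2.0))  # close to each other as q -> 1
```

The last line illustrates that the q-logarithm approaches the ordinary logarithm as q tends to 1.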

Using these two functions we can define an analogue to multiplication:

$$x \otimes_q y = \exp_q\big(\log_q(x) + \log_q(y)\big) = \big(x^{1-q} + y^{1-q} - 1\big)^{\frac{1}{1-q}}$$

if $x^{1-q} + y^{1-q} - 1 > 0$, and otherwise it is $0$. It is associative, commutative, and has $1$ as its neutral element. Under this definition of $\otimes_q$ we have the identities:

$$\exp_q(x + y) = \exp_q(x) \otimes_q \exp_q(y),$$

$$\log_q(x \otimes_q y) = \log_q(x) + \log_q(y)$$

(whenever the left-hand side is defined).

Thus a q-exponential does not factorize, but it does “q-factorize”.
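The q-product and its identities can be checked numerically. The sketch below restricts itself to 0 < q < 1 with positive arguments; the function names and the sample values are hypothetical choices for illustration.

```python
import math

def q_log(p, q):
    # q-logarithm for p > 0 and q > 0.
    return math.log(p) if q == 1 else (p ** (1 - q) - 1) / (1 - q)

def q_exp(v, q):
    # q-exponential; this sketch only exercises it for 0 < q <= 1.
    if q == 1:
        return math.exp(v)
    base = max(1 + (1 - q) * v, 0.0)
    return base ** (1 / (1 - q)) if base > 0 else 0.0

def q_product(x, y, q):
    # x (x)_q y for x, y > 0, set to 0 when the inner sum is not positive.
    if q == 1:
        return x * y
    s = x ** (1 - q) + y ** (1 - q) - 1
    return s ** (1 / (1 - q)) if s > 0 else 0.0

q, x, y = 0.5, 0.4, 1.1

lhs = q_exp(x + y, q)
q_factorized = q_product(q_exp(x, q), q_exp(y, q), q)
ordinary = q_exp(x, q) * q_exp(y, q)
print(lhs, q_factorized, ordinary)

# Neutral element, commutativity and associativity on sample values.
print(q_product(2.0, 1.0, q))
print(q_product(2.0, 3.0, q), q_product(3.0, 2.0, q))
print(q_product(q_product(2.0, 3.0, q), 4.0, q),
      q_product(2.0, q_product(3.0, 4.0, q), q))
```

The first line shows that exp_q(x + y) agrees with the q-product of exp_q(x) and exp_q(y), but not with their ordinary product.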
