The general shape of a probability density may determine properties of statistical inference procedures. We can easily identify various aspects of a probability distribution that has a continuous density function. For discrete distributions, some of the concepts carry over in an intuitive fashion, and some do not apply.
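In the following, we will assume that $X$ is a random variable (or vector) with distribution in the family
$$\mathcal{P} = \{P_\theta : \theta \in \Theta \subseteq \mathbb{R}^k\}$$
that is dominated by a $\sigma$-finite measure $\nu$, and we let
$$f_\theta(x) = \mathrm{d}P_\theta/\mathrm{d}\nu.$$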
First of all, we consider the shape only as a function of the value of the random variable; that is, we consider the shape of the PDF for a fixed member of the family.
In some cases, the shape characteristic that we consider has (simple) meaning only for random variables in $\mathbb{R}$.
Empirical Distributions and Kernels of Power Laws
Many important probability distributions that we identify and give names to arise out of observations on the behavior of some data-generating process. For example, in the process of forming coherent sentences on some specific topics there is an interesting data-generating process that yields the number of occurrences of the most often used word, the number of occurrences of the second most often used word, and so on; that is, the observed data are $x_1, x_2, \ldots$, where $x_i$ is the count of the word that occurs as the $i$th most frequent. The linguist George Kingsley Zipf studied this data-generating process and observed a remarkable empirical relationship. In a given corpus of written documents, the second most commonly used word occurs approximately one-half as often as the most commonly used word, and the third most commonly used word occurs approximately one-third as often as the most commonly used word. (This general kind of relationship had been known before Zipf, but he studied it more extensively.) A probability function that expresses this empirical relationship has the kernel
$$k(x) = x^{-\alpha}, \quad x = 1, 2, \ldots,$$
where $\alpha > 1$.
The salient characteristic, which determines the shape of the PDF, is that the relative frequency is a function of the value raised to some power. This kind of situation is observed often, both in naturally occurring phenomena such as magnitudes of earthquakes or of solar flares, and in measures of human artifacts such as sizes of cities or of corporations. This is called a “power law”. Use of the kernel above leads to a Zipf distribution, also called a zeta distribution because the partition function is the (real) zeta function, $\zeta(s) = \sum_{i=1}^{\infty} i^{-s}$. (The Riemann zeta function is the analytic continuation of this series, and obviously it is much more interesting than the real series.) The PDF, for $\alpha > 1$, is
$$f(x) = \frac{1}{\zeta(\alpha)}\, x^{-\alpha}, \quad x = 1, 2, \ldots$$
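The normalization by the zeta function can be checked numerically. The following is a minimal sketch, not a library routine: the function names `zeta` and `zipf_pmf` and the default `n_terms` are choices made here for illustration, and the series is approximated by a finite sum plus an Euler-Maclaurin estimate of the tail.

```python
import math


def zeta(alpha, n_terms=10_000):
    """Approximate the real zeta function for alpha > 1.

    Sums the first n_terms terms of the series and adds an
    Euler-Maclaurin estimate of the remaining tail.
    """
    if alpha <= 1:
        raise ValueError("the series converges only for alpha > 1")
    head = sum(i ** -alpha for i in range(1, n_terms + 1))
    # Tail estimate: integral of t**(-alpha) from n_terms to infinity,
    # minus half of the last term already included in the head.
    tail = n_terms ** (1 - alpha) / (alpha - 1) - 0.5 * n_terms ** -alpha
    return head + tail


def zipf_pmf(x, alpha):
    """Probability of x = 1, 2, ... under the zeta (Zipf) distribution."""
    return x ** -alpha / zeta(alpha)


if __name__ == "__main__":
    alpha = 2.0
    for x in (1, 2, 3):
        print(x, zipf_pmf(x, alpha))
```

Because the kernel is a pure power, the ratio of the probabilities at any two values depends only on the values and on $\alpha$, not on the normalizing constant.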
power law distribution, Pareto distribution, Benford distribution, power function distribution (not power series distribution)
$$f(x) = c(\alpha, \theta)\, x^{-\alpha} \theta^{x-1}, \quad x = 1, 2, \ldots$$
Symmetric Family
A symmetric family is one for which for any given $\theta$ there is a constant $\tau$, which may depend on $\theta$, such that
$$f_\theta(\tau + x) = f_\theta(\tau - x), \quad \forall x.$$
In this case, we say the distribution is symmetric about $\tau$.
In a symmetric family, the third standardized moment, $\eta_3$ (the skewness coefficient), is 0 if it exists. The converse does not hold, however: if $\eta_3 = 0$, the distribution is not necessarily symmetric.
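A small discrete example shows that a zero third central moment does not force symmetry. The three-point distribution below is hypothetical, constructed here only for illustration; exact rational arithmetic is used so that the moments are computed without rounding.

```python
from fractions import Fraction

# Hypothetical three-point distribution (value -> probability),
# chosen for illustration so that the mean and third moment vanish.
dist = {-2: Fraction(2, 5), 1: Fraction(1, 2), 3: Fraction(1, 10)}


def central_moment(dist, order):
    """Central moment of the given order of a finite discrete distribution."""
    mean = sum(x * p for x, p in dist.items())
    return sum((x - mean) ** order * p for x, p in dist.items())


def is_symmetric(dist):
    """True if the probability function is symmetric about the mean."""
    mean = sum(x * p for x, p in dist.items())
    return all(dist.get(2 * mean - x, 0) == p for x, p in dist.items())


if __name__ == "__main__":
    print("third central moment:", central_moment(dist, 3))
    print("symmetric about the mean:", is_symmetric(dist))
```

Only symmetry about the mean needs to be checked, because a distribution with a finite mean that is symmetric about some point is symmetric about its mean.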
The characteristic function of a distribution that is symmetric about 0 is real, and any distribution whose characteristic function is real must have symmetries about 0 within the periods of the sine function (see equation (1.91) on page 46).
Unimodal Family
A family of distributions is said to be unimodal if for any given $\theta$ the mode of the distribution exists and is unique. This condition is sometimes referred to as strictly unimodal, and the term unimodal is then used even when the mode of the distribution is not unique.
A family of distributions with Lebesgue PDF $f_\theta$ is unimodal if for any given $\theta$, $f_\theta(x)$ is strictly concave in $x$ (exercise). This fact can be generalized to families with superharmonic Lebesgue PDFs (see Definition 0.0.14 on page 659).
Theorem 2.2
A probability distribution with a Lebesgue PDF that is superharmonic is unimodal.
Proof. Exercise.
If the PDF is twice differentiable, by Theorem 0.0.15 unimodality can be characterized by the Laplacian. For densities that are not twice differentiable, negative curvature along the principal axes is sometimes called orthounimodality.
Logconcave Family
If $\log f_\theta(x)$ is strictly concave in $x$ for any $\theta$, the family is called a logconcave family. It is also called a strongly unimodal family. A strongly unimodal family is unimodal; that is, if $\log f_\theta(x)$ is concave in $x$, then $f_\theta(x)$ is unimodal (exercise). Strong unimodality is a special case of total positivity (see below).
The relevance of strong unimodality for location families, that is, for families in which $f_\theta(x) = g(x - \theta)$, is that the likelihood ratio is monotone in $x$ (see below) iff the distribution is strongly unimodal for a fixed value of $\theta$ (exercise).
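Log-concavity can be probed numerically by looking at second differences of the log density on a grid. The sketch below is only a finite-grid check under assumptions made here (the grid limits, the grid size, and the function names are all illustrative); it compares the standard normal density, which is logconcave, with the standard Cauchy density, which is unimodal but not logconcave.

```python
import math


def log_normal_pdf(x):
    """Log of the standard normal density."""
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)


def log_cauchy_pdf(x):
    """Log of the standard Cauchy density."""
    return -math.log(math.pi * (1 + x * x))


def is_logconcave_on_grid(log_pdf, lo=-5.0, hi=5.0, n=201):
    """Check that every second difference of log_pdf on an even grid is negative.

    This is a necessary condition for strict concavity, checked only
    at the grid points; it is not a proof.
    """
    h = (hi - lo) / (n - 1)
    vals = [log_pdf(lo + i * h) for i in range(n)]
    return all(vals[i - 1] - 2 * vals[i] + vals[i + 1] < 0
               for i in range(1, n - 1))


if __name__ == "__main__":
    print("normal:", is_logconcave_on_grid(log_normal_pdf))
    print("Cauchy:", is_logconcave_on_grid(log_cauchy_pdf))
```

The Cauchy case fails because the second derivative of its log density, $2(x^2 - 1)/(1 + x^2)^2$, is positive for $|x| > 1$.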
Heavy-tailed Family
A heavy-tailed family of probability distributions is one in which there is a relatively large probability in a region that includes $]-\infty, b[$ or $]b, \infty[$ for some finite $b$. This general characterization has various explicit instantiations, and one finds in the literature various definitions of “heavy-tailed”. A standard definition of that term is not important, but various specific cases are worth study. A heavy-tailed distribution is also called an outlier-generating distribution, and it is because of “outliers” that such distributions find interesting applications.
The concept of a heavy tail is equally applicable to the “left” or the “right” tail, or even a mixture in the case of a random variable over $\mathbb{R}^d$ when $d > 1$. We will, however, consider only the right tail; that is, a region $]b, \infty[$.
Most characterizations of heavy-tailed distributions can be stated in terms of the behavior of the tail CDF, $\bar{F}(x) = 1 - F(x)$. It is informative to recall the expression for the first moment of a positive-valued random variable in terms of the tail CDF (equation (1.46)):
$$\mathrm{E}(X) = \int_0^\infty \bar{F}(t)\,\mathrm{d}t.$$
If for some constant $b$, $x > b$ implies
$$f(x) > c \exp(-x^{\mathrm{T}} A x), \tag{2.1}$$
where $c$ is some positive constant and $A$ is some positive definite matrix, the distribution with PDF $f$ is said to be heavy-tailed.
Equivalent to the condition (2.1) in terms of the tail CDF $\bar{F}$ is
$$\lim_{x\to\infty} e^{a^{\mathrm{T}} x}\, \bar{F}(x) = \infty \quad \forall a > 0. \tag{2.2}$$
Another interesting condition in terms of the tail CDF $\bar{F}$ that implies a heavy-tailed distribution is
$$\lim_{x\to\infty} \frac{\bar{F}(x+t)}{\bar{F}(x)} = 1 \quad \forall t > 0. \tag{2.3}$$
Distributions with this condition are sometimes called “long-tailed” distributions because of the “flatness” of the tail in the right-hand support of the distribution. This condition states that $\bar{F}(\log(x))$ is a slowly varying function of $x$ at $\infty$. (A function $g$ is said to be slowly varying at $\infty$ if for any $a > 0$, $\lim_{x\to\infty} g(ax)/g(x) = 1$.)
Condition (2.3) implies condition (2.2), but the converse is not true (Exercise 2.3).
Most heavy-tailed distributions of interest are univariate or else product distributions. A common family of distributions that are heavy-tailed is the Cauchy family. Another common example is the Pareto family with $\gamma = 0$.
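The exponential-rate characterization of a heavy tail can be illustrated numerically in the univariate case: the product of $e^{ax}$ and the tail probability $\Pr(X > x)$ should grow without bound for a heavy-tailed distribution. The sketch below, with function names and evaluation points chosen here for illustration, compares the standard Cauchy tail with the standard normal tail for a single value of $a$. A few grid points suggest the limiting behavior but do not establish it.

```python
import math


def normal_tail(x):
    """P(X > x) for a standard normal random variable."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def cauchy_tail(x):
    """P(X > x) for a standard Cauchy random variable, for x > 0."""
    # For x > 0, atan(1/x)/pi equals 1/2 - atan(x)/pi and avoids
    # cancellation when x is large.
    return math.atan(1 / x) / math.pi


def scaled_tail(tail, a, xs):
    """Values of exp(a*x) * tail(x) at the points xs."""
    return [math.exp(a * x) * tail(x) for x in xs]


if __name__ == "__main__":
    xs = [5.0, 10.0, 20.0]
    print("Cauchy:", scaled_tail(cauchy_tail, 0.5, xs))
    print("normal:", scaled_tail(normal_tail, 0.5, xs))
```

For the Cauchy tail, which decays like $1/(\pi x)$, the scaled values increase rapidly, while for the normal tail they decrease toward 0.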
Subexponential Family
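Another condition that makes a family of distributions heavy-tailed is
$$\lim_{x\to\infty} \frac{1 - F^{(2)}(x)}{1 - F(x)} = 2, \tag{2.4}$$
where $F^{(2)}$ is the convolution of $F$ with itself, that is, the CDF of the sum of two independent random variables each with CDF $F$.
A family of distributions satisfying this condition is called a subexponential family (because the condition can be expressed as $\lim_{x\to\infty} e^{-a^{\mathrm{T}} x}/\bar{F}(x) = 0$, where $\bar{F} = 1 - F$).
Condition (2.4) implies condition (2.3), but the converse is not true (Exercise 2.4).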
Monotone Likelihood Ratio Family
The shape of parametric probability densities as a function of both the values of the random variable and the parameter may be important in statistical applications. Here and in the next section, we define some families based on the shape of the density over the cross product of the support and the parameter space. These characteristics are most easily expressed for the case of scalar parameters (k= 1), and they are also most useful in that case.
Let $y(x)$ be a scalar-valued function. The family $\mathcal{P}$ is said to have a monotone likelihood ratio iff for any $\theta_1 \neq \theta_2$, the likelihood ratio,
$$\lambda(\theta_1, \theta_2 \mid x) = f_{\theta_2}(x)/f_{\theta_1}(x),$$
is a monotone function of $x$ for all values of $x$ for which $f_{\theta_1}(x)$ is positive.
We also say that the family has a monotone likelihood ratio in $y(x)$ iff the likelihood ratio is a monotone function of $y(x)$ for all values of $x$ for which $f_{\theta_1}(x)$ is positive.
Some common distributions that have monotone likelihood ratios are shown in Table 2.1. See also Exercise 2.5.
Table 2.1. Some Common One-Parameter Families of Distributions with Monotone Likelihood Ratios
  normal($\mu$, $\sigma_0^2$)
  uniform($\theta_0$, $\theta$), uniform($\theta$, $\theta + \theta_0$)
  exponential($\theta$) or exponential($\alpha_0$, $\theta$)
  double exponential($\theta$) or double exponential($\mu_0$, $\theta$)
  binomial($n$, $\pi$) ($n$ is assumed known)
  Poisson($\theta$)
A subscript on a symbol for a parameter indicates that the symbol represents a known fixed quantity. See Appendix A for meanings of symbols.
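The monotonicity of a likelihood ratio can be inspected numerically for a given pair of parameter values. The following sketch uses function names, parameter values, and grids that are assumptions made here for illustration. It evaluates the ratio $f_{\theta_2}(x)/f_{\theta_1}(x)$ for the Poisson family, which has a monotone likelihood ratio, and for the Cauchy location family, which does not.

```python
import math


def poisson_pmf(x, theta):
    """Poisson probability function with mean theta."""
    return math.exp(-theta) * theta ** x / math.factorial(x)


def cauchy_pdf(x, theta):
    """Cauchy density with location theta and unit scale."""
    return 1 / (math.pi * (1 + (x - theta) ** 2))


def likelihood_ratio(density, theta1, theta2, xs):
    """Values of density(x, theta2) / density(x, theta1) at the points xs."""
    return [density(x, theta2) / density(x, theta1) for x in xs]


def is_monotone(values):
    """True if the sequence is nondecreasing or nonincreasing."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


if __name__ == "__main__":
    poisson = likelihood_ratio(poisson_pmf, 2.0, 5.0, range(0, 21))
    cauchy = likelihood_ratio(cauchy_pdf, 0.0, 1.0,
                              [0.5 * i for i in range(-20, 21)])
    print("Poisson ratio monotone:", is_monotone(poisson))
    print("Cauchy ratio monotone:", is_monotone(cauchy))
```

A check at one pair of parameter values on a finite grid can refute monotonicity but cannot prove it. For the Poisson family the ratio is $e^{-(\theta_2 - \theta_1)}(\theta_2/\theta_1)^x$, which is monotone in $x$ for every pair.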
Families with monotone likelihood ratios are of particular interest because they are easy to work with in testing composite hypotheses (see the discussion in Chapter 7 beginning on page 520).
The concept of a monotone likelihood ratio family can be extended to families of distributions with multivariate parameter spaces, but the applications in hypothesis testing are not as useful because we are usually interested in each element of the parameter separately.
Totally Positive Family
A totally positive family of distributions is defined in terms of the total positivity of the PDF, treating it as a function of two variables, $\theta$ and $x$. In this sense, a family is totally positive of order $r$ iff for all $x_1 < \cdots < x_n$ and $\theta_1 < \cdots < \theta_n$,
$$\begin{vmatrix} f_{\theta_1}(x_1) & \cdots & f_{\theta_1}(x_n) \\ \vdots & \ddots & \vdots \\ f_{\theta_n}(x_1) & \cdots & f_{\theta_n}(x_n) \end{vmatrix} \geq 0 \quad \forall n = 1, \ldots, r. \tag{2.5}$$
A totally positive family with $r = 2$ is a monotone likelihood ratio family.
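For order 2, the determinant condition reduces to a set of $2 \times 2$ determinants, which are easy to evaluate directly. The sketch below checks them for the Poisson family on a small grid; the function names and the grid are assumptions made here, and a finite grid can only illustrate the property.

```python
import math
from itertools import combinations


def poisson_pmf(x, theta):
    """Poisson probability function with mean theta."""
    return math.exp(-theta) * theta ** x / math.factorial(x)


def tp2_minors(density, thetas, xs):
    """All 2x2 determinants with theta1 < theta2 (rows) and x1 < x2 (columns)."""
    minors = []
    for t1, t2 in combinations(sorted(thetas), 2):
        for x1, x2 in combinations(sorted(xs), 2):
            minors.append(density(x1, t1) * density(x2, t2)
                          - density(x2, t1) * density(x1, t2))
    return minors


if __name__ == "__main__":
    minors = tp2_minors(poisson_pmf, [0.5, 1.0, 2.0, 4.0], range(0, 11))
    print("smallest 2x2 determinant:", min(minors))
```

Dividing a nonnegative $2 \times 2$ determinant by $f_{\theta_1}(x_1) f_{\theta_1}(x_2)$ shows that the likelihood ratio at $x_2$ is at least the ratio at $x_1$, which is the connection to a monotone likelihood ratio.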
2.3 “Regular” Families
Conditions that characterize a set of objects for which a theorem applies are called “regularity conditions”. I do not know the origin of this term, but it occurs in many areas of mathematics. In statistics there are a few sets of regularity conditions that define classes of interesting probability distributions.
We will often use the term “regularity conditions” to refer to continuity and differentiability of the PDF wrt the parameter.
2.3.1 The Fisher Information Regularity Conditions
The most important set of regularity conditions in statistics is one that allows us to put a lower bound on the variance of an unbiased estimator (see inequality (B.25) and Sections 3.1.3 and 5.1). Consider the family of distributions $\mathcal{P} = \{P_\theta; \theta \in \Theta\}$ that have densities $f_\theta$.
There are generally three conditions that together are called the Fisher information regularity conditions:
• The parameter space $\Theta \subseteq \mathbb{R}^k$ is convex and contains an open set.
• For any $x$ in the support and $\theta \in \Theta^\circ$, $\partial f_\theta(x)/\partial\theta$ and $\partial^2 f_\theta(x)/\partial\theta^2$ exist and are finite, and $\partial^2 f_\theta(x)/\partial\theta^2$ is continuous in $\theta$.
• The support is independent of $\theta$; that is, all $P_\theta$ have a common support.
The latter two conditions ensure that the operations of integration and differentiation can be interchanged twice.
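A short worked example indicates why the common-support condition matters for the interchange. The uniform(0, $\theta$) family, with density $1/\theta$ on $]0, \theta[$, is used here as an illustration chosen for this note; its support depends on $\theta$, and differentiating before integrating gives a different result from integrating before differentiating:

```latex
\frac{\mathrm{d}}{\mathrm{d}\theta} \int_0^\theta \frac{1}{\theta}\,\mathrm{d}x
  = \frac{\mathrm{d}}{\mathrm{d}\theta}\, 1 = 0,
\qquad
\int_0^\theta \frac{\partial}{\partial\theta}\!\left(\frac{1}{\theta}\right) \mathrm{d}x
  = \int_0^\theta \left(-\frac{1}{\theta^2}\right) \mathrm{d}x
  = -\frac{1}{\theta} \neq 0.
```

The discrepancy comes from the boundary term contributed by the moving upper limit of integration, which is absent when the support does not depend on $\theta$.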
Because the Fisher information regularity conditions are so important, the phrase “regularity conditions” is often taken to mean “Fisher information regularity conditions”. The phrase “Fisher regularity conditions” is also used synonymously, as is “FI regularity conditions”.
2.3.2 The Le Cam Regularity Conditions
The Le Cam regularity conditions are the first two of the usual FI regularity conditions plus the following.
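• The Fisher information matrix (see equation (1.82)) is positive definite for any fixed $\theta \in \Theta$.
• There exist a positive number $c_\theta$ and a positive function $h_\theta$ such that $\mathrm{E}(h_\theta(X)) < \infty$ and
$$\sup_{\gamma : \|\gamma - \theta\| < c_\theta} \left\| \frac{\partial^2 \log f_\gamma(x)}{\partial\gamma\, (\partial\gamma)^{\mathrm{T}}} \right\|_{\mathrm{F}} \leq h_\theta(x) \quad \text{a.e.}, \tag{2.6}$$
where $f_\theta(x)$ is a PDF wrt a $\sigma$-finite measure, and “a.e.” is taken wrt the same measure.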
2.3.3 Quadratic Mean Differentiability
The Fisher information regularity conditions are often stronger than is needed to ensure certain useful properties. The double exponential distribution with Lebesgue PDF $\frac{1}{2\theta} e^{-|y - \mu|/\theta}$, for example, has many properties that make it a useful model, yet it is not differentiable wrt $\mu$ at the point $y = \mu$, and so the FI regularity conditions do not hold. A slightly weaker regularity condition may be more useful.
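The failure of differentiability in the location parameter can be seen from one-sided difference quotients. The following is a minimal numerical sketch; the function names and the step size `h` are choices made here for illustration.

```python
import math


def double_exp_pdf(y, mu, theta):
    """Double exponential density with location mu and scale theta."""
    return math.exp(-abs(y - mu) / theta) / (2 * theta)


def one_sided_derivatives_in_mu(y, mu, theta, h=1e-6):
    """Backward and forward difference quotients of the density in mu."""
    f0 = double_exp_pdf(y, mu, theta)
    left = (f0 - double_exp_pdf(y, mu - h, theta)) / h
    right = (double_exp_pdf(y, mu + h, theta) - f0) / h
    return left, right


if __name__ == "__main__":
    # At mu = y the two quotients have opposite signs.
    print(one_sided_derivatives_in_mu(0.0, 0.0, 1.0))
    # Away from the kink the two quotients agree closely.
    print(one_sided_derivatives_in_mu(1.0, 0.0, 1.0))
```

At the point where the location equals the observation, the left and right derivatives are $1/(2\theta^2)$ and $-1/(2\theta^2)$, so no derivative exists there.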
Quadratic mean differentiability is expressed in terms of the square root of the density. As with differentiability generally, we first consider the property at one point, and then we apply the term to the function, or in this case, family, if the differentiability holds at all points in the domain.
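Consider again a family of distributions $\mathcal{P} = \{P_\theta; \theta \in \Theta \subseteq \mathbb{R}^k\}$ that have densities $f_\theta$. This family is said to be quadratic mean differentiable at $\theta_0$ iff there exists a real $k$-vector function $\eta(x, \theta_0) = (\eta_1(x, \theta_0), \ldots, \eta_k(x, \theta_0))$ such that
∗∗∗fix
$$\int (\ast\ast\ast\ast\ast\ast)^2\,\mathrm{d}x \in o(|h|^2) \quad \text{as } |h| \to 0.$$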
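For orientation, a form of the quadratic mean differentiability condition that is common in the literature is sketched below. It is offered as an assumption about the intended integrand, expressed with the square root of the density and the vector function $\eta$; it is not taken from the text above.

```latex
\int \left( \sqrt{f_{\theta_0 + h}(x)} - \sqrt{f_{\theta_0}(x)}
            - h^{\mathrm{T}} \eta(x, \theta_0) \right)^2 \mathrm{d}x
  \in o(|h|^2) \quad \text{as } |h| \to 0.
```

In this form, $h^{\mathrm{T}} \eta(x, \theta_0)$ plays the role of a linear approximation to the change in $\sqrt{f_\theta}$ near $\theta_0$, with the approximation error measured in the $L_2$ sense.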
Compare quadratic mean differentiability with Fréchet differentiability (Definition 0.1.57, on page 760).
If each member of a family of distributions (specified by $\theta$) is quadratic mean differentiable at $\theta$, then the family is said to be quadratic mean differentiable, or QMD.