The Model - Costly Information and Multiattribute Choice

3.2 The Model

Suppose the decision-maker (DM) must evaluate $I$ options, or items ($i = 1, \ldots, I$), each with $J$ attributes ($j = 1, \ldots, J$). Let $a_{i,j}$ be the value of the $j$-th attribute of the $i$-th item, let $A$ be the $I \times J$ matrix of item-attribute pair values, and let $\vec{a}_i := (a_{i,j})_{j=1}^{J}$ be the vector of attribute values for item $i$. The DM's utility from selecting item $i$ is:²

$$u_i := u(\vec{a}_i) := \sum_{j=1}^{J} a_{i,j} \tag{3.1}$$

The DM has a continuous, non-atomic prior over each item-attribute pair with cumulative distribution function (CDF) $F_{i,j}$ and associated density $f_{i,j}$, with full support on the real line. Assume that each $F_{i,j}$ has a finite second moment. Each item-attribute pair is distributed independently of every other item-attribute pair, and for all $j$ and all items $i, k$, $f_{i,j} = f_{k,j}$. Put differently, the same attribute is independently and identically distributed across items, and within an item, each attribute is independently (but not necessarily identically) distributed. Let $f$ denote the joint density of all item-attribute pairs, and let $f_i$ denote the joint density of all attributes within an item.

The DM wishes to learn the value of each $u_i$ by minimizing the mean squared error (MSE) $E[(u_i - \hat{u}_i)^2]$, where $\hat{u}_i$ denotes her prediction of $u_i$.³ Since item-attribute pairs are independent, this is equivalent to minimizing the sum of mean squared errors:

$$\sum_{j=1}^{J} E[(a_{i,j} - \hat{a}_{i,j})^2] \tag{3.2}$$

where $\hat{a}_{i,j}$ denotes her prediction of $a_{i,j}$.

²It is a fairly strong assumption that utility is additively separable in attribute values. However, linearity is not a strong assumption: as long as utility is monotonically increasing in each attribute and additively separable, one could simply apply a transformation to the underlying attribute values to make utility linear in the transformed values. Moreover, the lack of attribute weights is also not a restrictive assumption, since attribute values can simply be rescaled multiplicatively. For a further discussion of these points, see Subsection 2.1 of Woodford (2012).

³The importance of evaluating each item separately will become clear in Section 3.4 when discussing salience.

She does so by choosing probabilistic subjective representations $r_{i,j}$ of these values, distributed according to the joint density $g_i(\vec{r}_i \mid \vec{a}_i)$, where $\vec{r}_i := (r_{i,j})_{j=1}^{J}$. This incurs a cost $C(f_i, g_i)$ given by a multiplicative scaling of the mutual information (cf. Cover and Thomas, 2006) between $f_i$ and $g_i$:

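$$C(f_i, g_i) = \kappa \int\!\!\int g_i(\vec{r}_i \mid \vec{a}_i)\, f_i(\vec{a}_i) \log \frac{g_i(\vec{r}_i \mid \vec{a}_i)}{\int g_i(\vec{r}_i \mid \vec{b}_i)\, f_i(\vec{b}_i)\, d\vec{b}_i}\, d\vec{r}_i\, d\vec{a}_i \tag{3.3}$$

where $\kappa > 0$. Since the item-attribute pairs are independent, this can be rewritten as:

$$C(f_i, g_i) = \kappa \sum_{j=1}^{J} \int\!\!\int g_{i,j}(r_{i,j} \mid a_{i,j})\, f_{i,j}(a_{i,j}) \log \frac{g_{i,j}(r_{i,j} \mid a_{i,j})}{\int g_{i,j}(r_{i,j} \mid b_{i,j})\, f_{i,j}(b_{i,j})\, db_{i,j}}\, dr_{i,j}\, da_{i,j} \tag{3.4}$$

where $g_{i,j}$ is the density of the probabilistic subjective representation $r_{i,j}$. Mutual information measures the expected reduction in entropy going from the prior distribution to the posterior, and it can be thought of as how much of the uncertainty in the prior is explained by the posterior. It is intuitive that the more one wishes to learn about an attribute value, the more effort is required, and so the greater the cost incurred.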
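To make the cost concrete, the following is a minimal sketch for one special case that is assumed here purely for illustration: each attribute has a Gaussian prior, and its representation equals the true value plus independent Gaussian noise. Under that assumption the mutual information (in nats) has the closed form $\tfrac{1}{2}\log(1 + \sigma^2_{\text{prior}}/\sigma^2_{\text{noise}})$, and by independence the cost for an item is $\kappa$ times the sum over attributes. The function names and the Gaussian setup are hypothetical and do not come from the text.

```python
import math

def gaussian_mutual_information(prior_var: float, noise_var: float) -> float:
    """Mutual information (nats) between a ~ N(mu, prior_var) and r = a + noise,
    where noise ~ N(0, noise_var) is independent of a (illustrative assumption)."""
    return 0.5 * math.log(1.0 + prior_var / noise_var)

def information_cost(kappa: float, prior_vars, noise_vars) -> float:
    """kappa times the sum of per-attribute mutual informations,
    using independence across the attributes of an item."""
    return kappa * sum(
        gaussian_mutual_information(pv, nv)
        for pv, nv in zip(prior_vars, noise_vars)
    )

# Two attributes with the same prior; the second is represented more precisely.
coarse = information_cost(1.0, [1.0, 1.0], [1.0, 1.0])
fine = information_cost(1.0, [1.0, 1.0], [1.0, 0.25])
print(coarse, fine)
```

In this sketch a less noisy representation of an attribute always raises the cost, which matches the intuition that learning more requires more effort.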

Assume that the cost enters the DM's objective function additively, so that she must choose $g_i$ to minimize:

$$E[(u_i - \hat{u}_i)^2] + C(f_i, g_i) \tag{3.5}$$

The MSE is minimized by selecting the posterior mean for each $\hat{a}_{i,j}$. In other words, define:

$$h_{i,j}(a_{i,j} \mid r_{i,j}) := \frac{g_{i,j}(r_{i,j} \mid a_{i,j})\, f_{i,j}(a_{i,j})}{\int g_{i,j}(r_{i,j} \mid b_{i,j})\, f_{i,j}(b_{i,j})\, db_{i,j}} \tag{3.6}$$

i.e. the posterior density of the value of $a_{i,j}$ given $r_{i,j}$. The MSE is minimized by selecting the mean of this distribution as $\hat{a}_{i,j}$. The MSE of $u_i$ is then minimized by selecting:

$$\hat{u}_i = \sum_{j=1}^{J} \hat{a}_{i,j} \tag{3.7}$$

Since the posterior mean minimizes the MSE, the optimal MSE is the posterior variance of $u_i$, which, by independence across attributes, is the sum of the posterior variances of the $a_{i,j}$.
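The claim that the posterior mean minimizes the MSE, and that the resulting MSE equals the posterior variance, can be checked numerically. The sketch below assumes, for illustration only, a standard Gaussian prior and a representation equal to the true value plus Gaussian noise; in that case the posterior mean is a fixed shrinkage of the representation and the posterior variance has a closed form. All names and parameter values are hypothetical.

```python
import random

def simulate_mse(prior_var=1.0, noise_var=1.0, n=50_000, seed=0):
    """Monte Carlo MSE of two predictors of a ~ N(0, prior_var)
    from the representation r = a + N(0, noise_var)."""
    rng = random.Random(seed)
    shrink = prior_var / (prior_var + noise_var)  # posterior-mean weight on r
    se_post_mean = 0.0
    se_raw = 0.0
    for _ in range(n):
        a = rng.gauss(0.0, prior_var ** 0.5)
        r = a + rng.gauss(0.0, noise_var ** 0.5)
        se_post_mean += (a - shrink * r) ** 2  # predict with the posterior mean
        se_raw += (a - r) ** 2                 # predict with the raw representation
    return se_post_mean / n, se_raw / n

def posterior_variance(prior_var=1.0, noise_var=1.0):
    """Closed-form posterior variance in the Gaussian case."""
    return prior_var * noise_var / (prior_var + noise_var)

mse_post_mean, mse_raw = simulate_mse()
print(mse_post_mean, mse_raw, posterior_variance())
```

Under these assumptions the simulated MSE of the posterior-mean predictor should be close to the closed-form posterior variance and below the MSE of the raw representation.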

Before proceeding further, it is necessary to have a quantitative definition for the amount of attention paid to an attribute of an item. Define:

$$\alpha_{i,j} := \frac{\sigma^2_{f_{i,j}} - \sigma^2_{h_{i,j}}}{\sigma^2_{f_{i,j}}} \tag{3.8}$$

to be the amount of attention paid to attribute $j$ of item $i$, where $\sigma^2_q$ is the variance of the distribution with density $q$. In other words, I define the amount of attention paid to an item-attribute pair to be the proportional reduction in variance from observing its subjective representation. Since $\sigma^2_{h_{i,j}} \in [0, \sigma^2_{f_{i,j}}]$, attention takes values between 0 and 1. A value of 0 implies that no attention was paid to an attribute (the posterior variance is the same as the prior variance), and a value of 1 implies that full attention was paid (the posterior variance is 0, implying a one-to-one deterministic mapping between the subjective representation and the true attribute value). It is easy to see that $\alpha_{i,j}$ decreases in the posterior variance (i.e. increases in the posterior precision); the more the DM decreases the spread of her belief, the more attention she pays.
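As a small illustration of definition (3.8), the following sketch computes attention from a prior variance and a posterior variance. The function name and the input checks are assumptions added for the example.

```python
def attention(prior_var: float, posterior_var: float) -> float:
    """Proportional reduction in variance, following equation (3.8)."""
    if prior_var <= 0:
        raise ValueError("prior_var must be positive")
    if not 0.0 <= posterior_var <= prior_var:
        raise ValueError("posterior_var must lie in [0, prior_var]")
    return (prior_var - posterior_var) / prior_var

# No learning, partial learning, and full learning for a prior variance of 4.
print(attention(4.0, 4.0), attention(4.0, 1.0), attention(4.0, 0.0))
```

The three calls correspond to the two boundary cases described above and one interior case.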

3.3 Focusing

3.3.1 The KS Model

I begin this section by presenting a brief overview of KS’s model of focusing effects. They assume, as this paper does, that the DM’s utility for option i is given by:

$$u(\vec{a}_i) = \sum_{j=1}^{J} a_{i,j} \tag{3.9}$$

However, she evaluates options according to the following criterion:

$$u(\vec{a}_i, A) = \sum_{j=1}^{J} \gamma_j(\Delta_j(A))\, a_{i,j} \tag{3.10}$$

where $\gamma_j$ is increasing in its argument, and $\Delta_j(A) := \max_i a_{i,j} - \min_i a_{i,j}$ is the range of attribute $j$ across the items in the choice set. Therefore, when evaluating options, the DM puts greater emphasis on attributes with a wider range, i.e. those that vary more.
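The criterion in (3.10) can be sketched in a few lines. The weighting function below (the identity, so that each weight equals the attribute's range) is a hypothetical choice made only because it is increasing; KS do not prescribe it, and the example matrix is invented for illustration.

```python
def attribute_ranges(A):
    """Range of each attribute (column) across the items (rows) of A."""
    return [max(col) - min(col) for col in zip(*A)]

def focus_weighted_utilities(A, gamma=lambda delta: delta):
    """Evaluate each item by sum_j gamma(range_j) * a_ij, as in (3.10).
    The default gamma is a hypothetical increasing weighting function."""
    weights = [gamma(d) for d in attribute_ranges(A)]
    return [sum(w * a for w, a in zip(weights, row)) for row in A]

# Item 0 has one large advantage; item 1 has two moderate advantages.
A = [[6.0, 0.0, 0.0],
     [0.0, 3.5, 3.5]]

true_utilities = [sum(row) for row in A]         # equation (3.9)
focused_utilities = focus_weighted_utilities(A)  # equation (3.10)
print(true_utilities, focused_utilities)
```

In this example item 1 has the higher utility under (3.9), but item 0 is ranked higher under (3.10) because its single advantage lies on the attribute with the widest range.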

3.3.2 Focusing in a Probabilistic Model

Since my model is probabilistic, instead of using the range of an attribute within the choice set as a measure of how much an attribute varies, I use the variance.⁴ In this subsection, I show that the attention paid to an attribute of an item is weakly increasing in the prior variance.

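Proposition 20. Let $\Phi$ and $\phi$ denote the CDF and density of the standard normal distribution. Then either $h_{i,j} = f_{i,j}$ or

$$h_{i,j}(a_{i,j} \mid r_{i,j}) = \frac{1}{\sigma}\, \phi\!\left(\frac{a_{i,j} - \mu_{h_{i,j}}}{\sigma}\right),$$

where $\mu_{h_{i,j}}$ is the posterior mean and $\sigma = \sqrt{\kappa/2}$, whichever has lower variance.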

A version of this result was stated but not proven by Sims (2003). It may seem remarkable that the optimal posterior is normally distributed regardless of the prior (provided the prior has full support on the real line), but this is not surprising in light of Caplin and Dean's (2013) "locally invariant posteriors" result, which shows in a discrete setting that if a set of posteriors is optimal for some prior and is feasible for another prior, then it is also optimal for that other prior. Proposition 20 can be seen as a continuous version of that result for a specific utility function.
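Read together with definition (3.8), Proposition 20 implies that the optimal posterior variance is the smaller of the prior variance and $\kappa/2$, so attention equals $\max\{0,\, 1 - \kappa/(2\sigma^2_{f_{i,j}})\}$. The sketch below tabulates this for a few prior variances; the function names and the parameter values are hypothetical.

```python
def optimal_posterior_variance(prior_var: float, kappa: float) -> float:
    """Posterior variance implied by Proposition 20: the smaller of the prior
    variance (no learning) and kappa / 2 (normal posterior)."""
    return min(prior_var, kappa / 2.0)

def optimal_attention(prior_var: float, kappa: float) -> float:
    """Attention (3.8) evaluated at the optimal posterior variance."""
    post_var = optimal_posterior_variance(prior_var, kappa)
    return (prior_var - post_var) / prior_var

kappa = 1.0
prior_vars = [0.25, 0.5, 1.0, 2.0, 4.0]
attentions = [optimal_attention(v, kappa) for v in prior_vars]
for v, alpha in zip(prior_vars, attentions):
    print(v, alpha)
```

With these values attention is zero until the prior variance exceeds $\kappa/2$ and rises toward 1 afterwards, which illustrates the sense in which attention is weakly increasing in the prior variance.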

In document Essays in Information and Behavior (Page 131-136)