The reader may also be wondering whether it suffices to just run separate learning algorithms in the two groups or whether multiplicative weights has a special property. In the following theorem, we show that the latter is the case. In particular, multiplicative weights has the property of not doing better than the best expert in hindsight. The main representatives of algorithms that lack this property are those that achieve low approximate regret compared to a shifting benchmark (tracking the best expert), which we already discussed in the previous chapter. More formally, approximate regret against a shifting comparator $f^\star = (f^\star(1), \ldots, f^\star(T))$ is defined as:
\[ \mathrm{ApxReg}_{\epsilon,T}(f^\star) = \sum_t \sum_{f \in \mathcal{F}} p^t_f \ell^t_f - (1+\epsilon) \sum_t \ell^t_{f^\star(t)}. \]
Typical bounds are $\mathbb{E}[\mathrm{ApxReg}_{\epsilon,T}(f^\star)] = O\big(K(f^\star) \cdot \ln(dT)/\epsilon\big)$, where $K(f^\star) = 1 + \sum_{t=2}^{T} \mathbf{1}\{f^\star(t) \neq f^\star(t-1)\}$ is the number of switches in the comparator. We show that any algorithm that achieves such a guarantee even when $K(f^\star) = 2$ does not satisfy fairness in composition with respect to equalized error rate. This indicates that, for the metric of equalized error rates, the algorithm not being too good is essential.
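To make the definition concrete, the following is a minimal sketch that evaluates the approximate regret and the quantity $K(f^\star)$ on a toy sequence. The function names `num_segments` and `approx_regret`, the list-based data layout, and the toy losses are assumptions made for illustration only; they are not taken from the text.

```python
def num_segments(comparator):
    """K(f*) = 1 + number of rounds t >= 2 with f*(t) != f*(t-1)."""
    return 1 + sum(1 for prev, cur in zip(comparator, comparator[1:]) if prev != cur)


def approx_regret(probs, losses, comparator, eps):
    """Approximate regret of a learner against a shifting comparator.

    probs[t][f]   : probability the learner places on expert f at round t
    losses[t][f]  : loss of expert f at round t
    comparator[t] : index of the comparator's expert f*(t) at round t
    """
    learner_loss = sum(
        p * l for p_t, l_t in zip(probs, losses) for p, l in zip(p_t, l_t)
    )
    comparator_loss = sum(l_t[f] for l_t, f in zip(losses, comparator))
    return learner_loss - (1 + eps) * comparator_loss


# Toy example (hypothetical data): two experts, four rounds, uniform learner.
losses = [[1, 0], [1, 0], [0, 1], [0, 1]]
probs = [[0.5, 0.5]] * 4
comparator = [1, 1, 0, 0]  # one switch, so K = 2, and the comparator never errs

print(num_segments(comparator))
print(approx_regret(probs, losses, comparator, eps=0.1))
```

In this toy sequence no single expert has zero loss, but a comparator with $K(f^\star) = 2$ does, which is the kind of benchmark the theorem below exploits.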
Theorem 7.4. For any $\alpha < 1/2$ and $\epsilon > 0$, and for any algorithm that achieves the vanishing approximate regret property against shifting comparators $f$ with $K(f) = 2$, running separate instances of the algorithm for each group is $\alpha$-unfair in composition with respect to equalized error rate.
Proof. Our instance has two groups $\mathcal{G} = \{A, B\}$, two experts $\mathcal{F} = \{f_1, f_2\}$, and three phases described below.
1. Phase I lasts for half of the time horizon, $\{1, \ldots, T/2\}$, and during this time we receive examples from group $A$. At round $t$, the adversary selects loss $\ell^t_f = 1$ for the expert $f \in \mathcal{F}$ that is predicted with higher probability ($p^t_f \geq 1/2$) and $\ell^t_h = 0$ for the other expert $h \in \mathcal{F} \setminus \{f\}$.
2. Phase II lasts $\sum_{\tau=1}^{T/2} \ell^\tau_{f_1}$ rounds and involves examples from group $B$. The adversary selects losses $\ell^t_{f_1} = 1$ and $\ell^t_{f_2} = 0$.
3. Phase III lasts $\sum_{\tau=1}^{T/2} \ell^\tau_{f_2}$ rounds and again involves examples from group $B$. The adversary now selects losses $\ell^t_{f_1} = 0$ and $\ell^t_{f_2} = 1$.
The instance is fair in isolation with respect to equalized error rates as the cardinality of both groups is the same (half of the population in each group) and the experts make the same number of mistakes in both groups. By construction, the algorithm has expected average loss at least $1/2$ on group $A$, since in every round of Phase I the expert that receives at least half of the probability mass incurs loss 1.
We now focus on group $B$. By the shifting approximate regret guarantee, and given that there exists a sequence of experts of length 2 that has 0 loss, the total loss of the algorithm needs to be sublinear in $T$ and, in particular, at most $(\frac{1}{2} - \alpha) \cdot \frac{T}{2}$, which implies an expected error rate of at most $\frac{1}{2} - \alpha$. Subtracting the two error rates concludes the proof.
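The three-phase construction can be simulated directly. The sketch below is illustrative only: it uses a fixed-share style learner as a stand-in for "an algorithm with a shifting-comparator guarantee", and the class name `FixedShare`, the function `run_instance`, and the parameters `eta` and `share` are assumptions that do not come from the text. The theorem itself applies to any algorithm with the stated guarantee, not to this particular learner.

```python
import math


class FixedShare:
    """Hypothetical tracking learner over two experts (fixed-share style update)."""

    def __init__(self, eta=0.5, share=0.01):
        self.w = [0.5, 0.5]
        self.eta = eta
        self.share = share

    def probs(self):
        return list(self.w)

    def update(self, losses):
        # Multiplicative-weights step, then mix a little mass back to uniform.
        v = [w * math.exp(-self.eta * l) for w, l in zip(self.w, losses)]
        total = sum(v)
        self.w = [(1 - self.share) * x / total + self.share / 2 for x in v]


def run_instance(T, make_learner):
    """Run separate learner instances per group on the three-phase instance."""
    learner_a, learner_b = make_learner(), make_learner()
    n = T // 2
    loss_a = 0.0
    mistakes = [0, 0]  # Phase I losses of f1 and f2

    # Phase I (group A): the expert with probability >= 1/2 gets loss 1.
    for _ in range(n):
        p = learner_a.probs()
        f = 0 if p[0] >= 0.5 else 1
        losses = [0.0, 0.0]
        losses[f] = 1.0
        loss_a += p[f]  # expected loss of the learner this round
        mistakes[f] += 1
        learner_a.update(losses)

    # Phases II and III (group B): f1 errs for mistakes[0] rounds, then f2.
    loss_b = 0.0
    for losses, length in (([1.0, 0.0], mistakes[0]), ([0.0, 1.0], mistakes[1])):
        for _ in range(length):
            p = learner_b.probs()
            loss_b += p[0] * losses[0] + p[1] * losses[1]
            learner_b.update(losses)

    return loss_a / n, loss_b / n, mistakes


rate_a, rate_b, mistakes = run_instance(1000, FixedShare)
print(rate_a, rate_b, mistakes)
```

Both experts make `mistakes[0]` and `mistakes[1]` errors in each group, so the instance is balanced across groups, while the learner's error rate on group A is at least 1/2 by construction and its error rate on group B depends on how quickly it tracks the single switch.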
7.6 Remarks
More information about the paper. The results presented in this chapter are joint work with Avrim Blum, Suriya Gunasekar, and Nathan Srebro [28]. With respect to equalized error rates, we also show that group-unaware algorithms suffer from impossibility results. Our work opens up a number of interesting questions about whether other fairness metrics are compatible with the no-regret property. Additionally, in the impossibility result for group-aware algorithms, we relied heavily on the adversary being adaptive and on some imbalance between the two populations; understanding what happens when this is not the case would be interesting.
On balance across groups as a fairness notion. Our work points to an issue that balance notions suffer from. If it is difficult to classify a particular group correctly, balance notions require the decision-maker to jeopardize the performance in other (possibly easily classifiable) groups. Providing bad treatment despite enough confidence about the best alternative is arguably immoral and, in cases such as clinical trials, explicitly illegal. Tackling this concern, in ongoing joint work with Avrim Blum, we suggest a group fairness notion for online decision-making that, instead of focusing on equality, aims for accuracy in all (possibly overlapping) populations, and we discuss the arising incentive issues.
On fairness in online decision-making. Dealing with fairness issues in online decision-making has gained much attention over the last few years. One line of work extends individual notions of fairness, which require that similar individuals (with respect to some similarity metric) be treated similarly [49], to online settings [90, 57, 62]. Another line of work aims to achieve the so-called meritocratic fairness [71, 73], which says that an individual/group of higher intrinsic quality should never be selected with smaller probability than less qualified candidates. Regarding notions targeting discrimination against particular groups, beyond our work, there have been nice attempts to tackle important considerations of sequential decision-making. In particular, a line of work points to counterintuitive externalities of using contextual bandit algorithms agnostic to the group identity and suggests that heterogeneity in data can replace the need for exploration [20, 74, 109]. Other works have focused on designing bandit algorithms that restrict the probabilities of selecting a particular group to avoid overexposure or, equivalently, underexposure [38], or that are only given one-sided feedback [21].
One important distinction compared to these works is that we do not assume that the input is i.i.d. over time. A main complication in most of the above works is that the algorithm needs to be very pessimistic throughout exploration to learn the best fair policy but subsequently the algorithm can just use this policy over time. In non-i.i.d. settings, the fairness consideration does not only affect an initial stage; the algorithm needs to balance the optimization goal with the fairness constraint throughout all time. Focusing on the simplest extension of adversarial online learning with fairness concerns (all experts assumed to be
individually fair), our work sheds light on which notions of fairness are amenable to non-i.i.d. inputs arriving online.
APPENDIX A
SUPPLEMENTARY MATERIAL FROM CHAPTER 2.
A.1 Concentration inequality
Lemma 2.2 (restated). Let $x_1, x_2, \ldots, x_T$ be a sequence of nonnegative random variables, each with $x_t \in [0, 1]$, and let $m_t = \mathbb{E}_{t-1}[x_t] = \mathbb{E}[x_t \mid x_1, \ldots, x_{t-1}]$ be the random variable that is the expectation of $x_t$ conditioned on the sequence $x_1, x_2, \ldots, x_{t-1}$. Let $\epsilon > 0$, $X = \sum_{t=1}^{T} x_t$, and $M = \sum_{t=1}^{T} m_t$. Then, with probability at least $1 - \delta$,
\[ X - (1+\epsilon) M \leq \frac{(1+\epsilon) \ln(1/\delta)}{\epsilon}, \]
and also, with probability at least $1 - \delta$,
\[ (1-\epsilon) M - X \leq \frac{(1+\epsilon) \ln(1/\delta)}{\epsilon}. \]
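As a sanity check on the statement, the following sketch estimates how often each bound fails on a simple dependent 0/1 sequence. Everything specific here is an assumption chosen for illustration: the process in `sample_sums` (conditional mean $0.1$ or $0.7$ depending on the previous outcome), the parameter values, and the function names. A simulation of this kind can only illustrate the lemma on one process; it does not verify it.

```python
import math
import random


def sample_sums(T, rng):
    """One run of a dependent 0/1 sequence; returns (X, M)."""
    X = M = 0.0
    prev = 0
    for _ in range(T):
        m_t = 0.1 + 0.6 * prev  # conditional mean depends on the previous outcome
        x_t = 1 if rng.random() < m_t else 0
        X += x_t
        M += m_t
        prev = x_t
    return X, M


def empirical_failure_rates(T=200, eps=0.5, delta=0.1, runs=2000, seed=0):
    """Fraction of runs in which each inequality of the lemma is violated."""
    rng = random.Random(seed)
    slack = (1 + eps) * math.log(1 / delta) / eps
    upper_fail = lower_fail = 0
    for _ in range(runs):
        X, M = sample_sums(T, rng)
        upper_fail += X - (1 + eps) * M > slack
        lower_fail += (1 - eps) * M - X > slack
    return upper_fail / runs, lower_fail / runs


upper, lower = empirical_failure_rates()
print(upper, lower)
```

The lemma predicts that each printed frequency should be at most `delta`, up to sampling noise.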
Proof. The proof follows the outline of classical Chernoff bounds for independent variables combined with the law of total expectation to handle the dependence.
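First claim. For parameters $b, \lambda > 0$ to be set later, it holds:
\[ \Pr[X - (1+\epsilon) M > b] \leq e^{-\lambda b}\, \mathbb{E}\left[ e^{\lambda (X - (1+\epsilon) M)} \right] = e^{-\lambda b}\, \mathbb{E}\left[ \prod_{t=1}^{T} e^{\lambda (x_t - (1+\epsilon) m_t)} \right] \tag{A.1} \]
We will prove by induction on $T$ that the expectation above is at most 1 if we use $\lambda = \ln(1+\epsilon)$. Given this fact, we can set $b$ such that $e^{-\lambda b} = e^{-\ln(1+\epsilon) b} = \delta$. Using that $\ln(1+\epsilon) \geq \epsilon/(1+\epsilon)$ for all $\epsilon \geq 0$, it follows that $b = \frac{\ln(1/\delta)}{\ln(1+\epsilon)} \leq \frac{(1+\epsilon) \cdot \ln(1/\delta)}{\epsilon}$.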
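Base of induction for first claim. Now consider the expectation $\mathbb{E}\left[ \prod_{t=1}^{T} e^{\lambda (x_t - (1+\epsilon) m_t)} \right]$. For the base case of $T = 1$, we have a single random variable $x_1 \in [0, 1]$ and its expectation $m_1 = \mathbb{E}[x_1]$. The expectation is
\[ \mathbb{E}\left[ e^{\lambda (x_1 - (1+\epsilon) m_1)} \right] = \mathbb{E}\left[ e^{\lambda x_1} \right] \cdot e^{-\lambda (1+\epsilon) m_1}. \]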
Note that for any value of $x \in [0, 1]$, the following simple inequality holds:
\[ e^{\lambda x} \leq x e^{\lambda} - x + 1. \]
This is true as it holds with equality for $x = 0$ and $x = 1$, and the difference is a convex function (the second derivative of $g(x) = e^{\lambda x} - x e^{\lambda} + x - 1$ is $g''(x) = \lambda^2 e^{\lambda x} \geq 0$), so $g(x) \leq 0$ between the two points. Taking expectations, we can write:
\[ \mathbb{E}\left[ e^{\lambda x_1} \right] \leq \mathbb{E}\left[ x_1 e^{\lambda} - x_1 + 1 \right] = \mathbb{E}\left[ x_1 (e^{\lambda} - 1) + 1 \right] = m_1 (e^{\lambda} - 1) + 1 \leq e^{m_1 (e^{\lambda} - 1)}. \]
Using this in the expectation of (A.1), we obtain:
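\[ \mathbb{E}\left[ e^{\lambda (x_1 - (1+\epsilon) m_1)} \right] \leq e^{m_1 (e^{\lambda} - 1)} \cdot e^{-\lambda (1+\epsilon) m_1} = e^{m_1 (e^{\lambda} - 1 - \lambda (1+\epsilon))} \leq 1, \]
where the last inequality follows from the choice of $\lambda = \ln(1+\epsilon)$, as the multiplier of $m_1$ in the exponent with this choice of $\lambda$ is
\[ e^{\lambda} - 1 - \lambda (1+\epsilon) = \epsilon - (1+\epsilon) \ln(1+\epsilon) \leq \epsilon - (1+\epsilon) \left( \epsilon - \epsilon^2/2 \right) = -\frac{\epsilon^2 (1-\epsilon)}{2} < 0. \]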
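The two elementary facts used in the base case can be checked numerically for specific values of $\epsilon \in (0, 1)$. The sketch below is a spot check on a grid, not a proof; the function name `check_base_case`, the grid size, and the floating-point tolerance are assumptions introduced here.

```python
import math


def check_base_case(eps, grid=1000):
    """Spot-check the base-case inequalities for lambda = ln(1 + eps)."""
    lam = math.log(1 + eps)

    # (i) e^{lam * x} <= x * e^{lam} - x + 1 for x on a grid over [0, 1]
    chord_ok = all(
        math.exp(lam * x) <= x * math.exp(lam) - x + 1 + 1e-12
        for x in (i / grid for i in range(grid + 1))
    )

    # (ii) multiplier of m_1 in the exponent, and the claimed upper bound on it
    multiplier = math.exp(lam) - 1 - lam * (1 + eps)
    bound = -(eps ** 2) * (1 - eps) / 2
    return chord_ok, multiplier, bound


for eps in (0.1, 0.5, 0.9):
    print(eps, check_base_case(eps))
```

For each tested value, the first entry should be `True` and the multiplier should lie below the negative bound, consistent with the exponent in the base case being nonpositive.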
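Inductive step for first claim. Now we are ready to prove the general case. Using the law of total expectation, we obtain:
\[ \mathbb{E}\left[ \prod_{t=1}^{T} e^{\lambda (x_t - (1+\epsilon) m_t)} \right] = \mathbb{E}\left[ \prod_{t=1}^{T-1} e^{\lambda (x_t - (1+\epsilon) m_t)} \cdot e^{\lambda (x_T - (1+\epsilon) m_T)} \right] = \mathbb{E}\left[ \prod_{t=1}^{T-1} e^{\lambda (x_t - (1+\epsilon) m_t)} \cdot \mathbb{E}_{T-1}\left[ e^{\lambda (x_T - (1+\epsilon) m_T)} \right] \right], \]
where $\mathbb{E}_{T-1}[\cdot]$ is the random variable taking expectation over the last term conditioned on the previous terms. Given the previous terms, the conditional expectation $\mathbb{E}_{T-1}\left[ e^{\lambda (x_T - (1+\epsilon) m_T)} \right]$ is exactly the base case, and hence at most 1 by the above, so we can conclude that
\[ \mathbb{E}\left[ \prod_{t=1}^{T} e^{\lambda (x_t - (1+\epsilon) m_t)} \right] \leq \mathbb{E}\left[ \prod_{t=1}^{T-1} e^{\lambda (x_t - (1+\epsilon) m_t)} \right], \]
and the statement follows by the induction hypothesis.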
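Second claim. To prove the lower bound, we proceed in an analogous way. For $\lambda = -\ln(1-\epsilon)$, using that $1/(1-\epsilon) \geq 1+\epsilon$, we obtain the equivalent of inequality (A.1) with $b = \frac{\ln(1/\delta)}{\ln(1/(1-\epsilon))} \leq \frac{\ln(1/\delta)}{\ln(1+\epsilon)}$:
\[ \Pr[(1-\epsilon) M - X > b] \leq e^{-\lambda b}\, \mathbb{E}\left[ e^{\lambda ((1-\epsilon) M - X)} \right] = e^{-\lambda b}\, \mathbb{E}\left[ \prod_{t=1}^{T} e^{\lambda ((1-\epsilon) m_t - x_t)} \right] \]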