On Selection Bias and Fairness Issues in Machine-Learning

(1)

On Selection Bias and Fairness Issues

in Machine-Learning

Stephan Clémençon –

LTCI, Telecom Paris

–

with M. Achab (Pompeu Fabra), G. Ausset (Telecom Paris), A. Bellet (INRIA), P. Bertail (Paris X), E. Chautru (Mines Paris), P. Laforgue (Univ. Milano), G. Papa (G-Research), F. Portier (Telecom Paris), C. Tillié (UVSQ), R. Vogel (Univ. Edinburgh)

1/11/2021

(2)

The Way of Considering Bias & Fairness in AI

Here

I Expertise: Statistical Machine Learning, Theory and Algorithms I Scientific Goals

I _Predictive_{issues cast as M-estimation problems:}

I Classification I Regression

I Density level set estimation I ... and theirnumerous variants

I _Complex_{data -}_Minimal_{assumptions on the distribution}

I Algorithms: designfeasible_{M-estimators for specific criteria}

I Many Questions - Theory/Computation/Applicability:

I Theoretical guarantees: optimal elements, consistency,

non-asymptotic excess risk bounds, fast rates of convergence, oracle inequalities

I Practice: numerical optimization, convexification,

randomization, relaxation, scalability (distributed architectures, real-time, memory, etc.)

I Applicability: robustness/reliability, explainability, privacy

preservation, ...

(3)

Many applications of these concepts/methods

e.g. Facial Recognition

(4)

The Flagship Problem: Pattern Recognition

I _{(X, Y) random pair with unknown distribution P} I X ∈ X observation vector

I Y _{∈ {−1, +1} binary label/class}

I A posteriori probability_∼regression function

∀x ∈ X , η(x) = P{Y = 1 | X = x}

I g:X → {−1, +1}classifierthat can be coded using a machine I Performance measure =classification error

L(g) = P{g(X) 6= Y} → min_g I Solution:Bayes rule

∀x ∈ X , g∗(x) = 2I{η(x) > 1/2} − 1

I Bayes errorL∗= L(g∗₎

(5)

Main Paradigm: Empirical Risk Minimization

I Training sample(X1, Y1), . . . , (Xn, Yn) with i.i.d. copies of (X, Y) I ClassG of classifiers (massive catalog)

I Empirical Risk Minimization principle

ˆ gn = arg min g∈G Ln(g) := 1 n n X i=1 I{g(Xi)6= Yi} I Mimic the best classifier in the class

¯

g= arg min g∈G

L(g)

(6)

Does the ERM principle works?

Predict labels of past data vs

(7)

Empirical Processes in Statistical Learning

I Bias-variance decomposition L(ˆgn)− L∗ ≤ (L(ˆgn)− Ln(ˆgn)) + (Ln(¯g)− L(¯g)) + (L(¯g) − L∗) ≤ 2 sup g∈G | L n(g)− L(g) | ! + inf g∈GL(g)− L ∗ I Concentration inequality With probability 1_{− δ:} sup g∈G | Ln(g)− L(g) |≤ E supg∈G | Ln(g)− L(g) | + r 2 log(1/δ) n 7

(8)

The ERM principle works!

(9)

MS1M (Guo et al, 2016) LFW (Huang et al, 2007)

13K images, 5.7K people 5.2M images, 93.4K people

Machine Learning Implemented at Large Scale

In the Big Data era,massivedatasets are available but... the acquisition process may bepoorly controlled.

In facial recognition (FR), public databases do not represent well the target population in terms of ethnicity and gender.

FR systems perform much better on certain segments of the population

(10)

ROC (DET) Curve 0.0 0.2 0.4 0.6 0.8 1.0 FPR 0.0 0.2 0.4 0.6 0.8 1.0 TPR

False positive rate ( )FPR

False negative rate (FNR)

(Grother and Ngan, 2019)

FPR for t s.t. FPR = 10MW -3

FPR FPR

M/F: Male/Female W/B: White/Black

FR systems are less accurate for certain social groups

In FR, the ROC curve evaluates a similarity function s w.r.t. its ability to separate positive and negative observations with thresholding s> t. (left) Recent reports of the NIST show discrepancies in error rates between social groups for FR. (right)

At fixed t, 13× more FP for black females than white males.

(11)

Deployment of Machine Learning - Threats

I In many situations, training data are ’easily available’ and ’Big’ data.Poor control of the acquisition process of the training data!

I The generalization ability of predictive rules is established when

Ptrain = Ptest

Otherwise? It requires novel versions of ML algorithms, new dedicated analyses andauxiliary information about the data acquisition process

I Many selection bias issues documented in the literature ’Women also Snowboard: Overcoming Bias in Captioning Models’ in ECCV 2018, L.A. Hendricks et al.

I In the Big Data era, can the ideas of the Scarce Data era, survey theory in particular, be of any help?

(12)

(Ganin et al, 2016)

Domain Adaptation - Transferring Deep Features

In computer vision, most of the transfer learning work focuses on Domain Adaptation (DA). Most DA work seeks to correct for covariate shift (ptrain _{6= p}test) with invariant deep features, and sometimes also models

differences in posterior distributions.

[Tzeng et al., 2015] [Long et al., 2016]

In e.g.[Ganin et al., 2016], an approach to learn invariant deep features between visual domains is proposed.

(13)

How to apply ERM to biased data?

I Goal: minimize the risk

LP(θ) = EZ∼P[`(Z, θ)]

over the decision spaceΘ, where P is the test/target distribution

I Training data available Z₁0, . . . , Z0 n

i.i.d.

∼ P0_{, with P}0 _{6= P} I A specifictransfer learningproblem

I Heuristic:solve a minimization problem min

θ∈Θ

bLn,ω(θ),

where the objective is aweighted empirical risk

bLn,ω(θ) = n X i=1 ωi`(Z_i0, θ). 10

(14)

How to apply ERM to biased data?

I Ideally, pick theωi’s, so that

sup θ∈Θ bLn,ω(θ)− LP(θ) =OP( p log n/n) I Challenges:

I _Design_{methods/algorithms}_{to build the debiasing weights} I _{Study the}_fluctuations_{of the nearly debiased risk}

I Someauxiliary informationabout the biasing mechanism is

required!

(15)

A First Go: Training from Survey Data

I Framework: original sample(Z1, . . . , ZN) viewed as a

superpopulation

I Sampling planRN= probability distribution on the ensemble

of all nonempty subsets of_{{1, . . . , N}}

I Let S_{∼ R}Nand seti = 1 if i∈ S, i= 0 otherwise

The vector(1, . . . , n) fully describes the training sample S I First and second orderinclusion probabilities:

πi(RN) = P{i ∈ S} and πi,j(RN) = P{(i, j) ∈ S2_} I Do not rely on the raw empirical risk based on the sample S:

1 #S

P

i∈S`(Zi, θ) is abiasedestimate of LP(θ)

(16)

Horvitz -Thompson theory

I Suppose thatthe inclusion probabilities are known

I Inverse Probability Weighting (IPW):Horvitz-Thompson

estimator of the empirical distribution of the Zi’s 1 N N X i=1 i πi δZi

I It isnot a probability measure in generalbut yields an

unbiased estimateof the (empirical) risk

LRN N (θ) = 1 N N X i=1 i πi `(Zi, θ)

I The Horvitz Thompson empirical risk minimizer arg min

θ∈Θ LRN

N (θ) = ˆθN

(17)

A

functional non-asymptotic

Horvitz -Thompson theory

I Due to thedependence structureof the terms averaged in the HT risk, investigating the fluctuations of the supremum

sup θ∈Θ L RN N (θ)− ˆLN(θ) isnot straightforward!

I Many situations can be handled: when data are sampled from

I _a_Poisson_scheme

I _{a survey plan such that the}_i_{’s are}_{negatively associated}_{, e.g.}

rejective sampling, Srinivasan sampling, Rao-Sampford

sampling, Successive sampling, Pareto sampling, Post-stratified sampling ...

I any plan that can be tightly coupled with one of the schemes

above

(18)

On the use of survey schemes for machine-learning

To minimize ˆLN(θ), rather than implementing SGD based on mini-batches selected by simple sampling without replacement, use a Poisson scheme with inclusion probabilities positively correlated to π∗_i(θ) = N0 ||H −1/2 N ∇`(Zi, θ)|| PN j=1||H −1/2 N ∇`(Zi, θ)|| , with HN =∇2_ˆLN(θ∗

N) to drastically reduce the asymptotic variance

0 500 1000 1500 2000 2500 3000 0 1000 2000 3000 4000

More in Bertail, Chautru, Clémençon & Papa (2018)

(19)

Biased training data with known inclusion probabilities:

done!

I It works for successive sampling, post-stratified sampling, etc.

I When theπi’s areknown, the IPW method applied to ERM produces predictive rules with thesame performance as that attained by ERM based on unbiased data

More in e.g. Bertail, Clémençon & Papa (2016), Bertail, Chautru & Clémençon (2016, 2018)

I Well ... but what if the inclusion probabilities areunknown? You mayestimatethem when you havesome knowledge of the

biasing mechanism...

(20)

ERM under Random Censorship

I Regressionframework: predict a random duration Y _{≥ 0 based}

on a random vector X through f(x), so as to minimize LP(f ) = E[(Y− f(X))2]

I Right censoredoutput data: n independent copies(Xi, ˜Yi, δi) of

X, ˜Y = min{Y, C}, δ = I{Y ≤ C},

assuming that Y and C are conditionally independent given X I Applying ERM to the(Xi, ˜Yi)’s would naturally lead to severe

underestimation

I Set SC(t| X) = P{C > t | X} and rewrite the risk as LP(f ) = E δ(˜Y− f(X))

2 SC(˜Y_{− | X)}

(21)

ERM with IP(C)W

I Inverse of Probability of Censoring Weights: minimize

˜Ln(f )def = 1 n n X i=1 δi ˆ

SC(˜Y_{i− | Xi)} Y˜i− f(Xi) 2

I The probability of censoring can be estimated by the

Kaplan-Meier method

I The concentration properties of the process{˜Ln(f )− LP(f )}f ∈F can be established by means oflinearization techniques

I Neglecting the bias error due to the plug-in step, the classic learning rate O_P(plog n/n) is attained by ERM with IPCW More in Ausset, Clémençon & Portier (2019)

(22)

ERM with approximate IPCW works!

100 200 300 500 1000 2000 −0.8 −0.6 −0.4 −0.2 0.0 0.2 0.4 0.6 IPCW Naive Observed

But you need to know something about the selection bias process or to learn it from extra data!

(23)

Weighted ERM beyond IPW

I Assume P<< P0. LetΦ = dP/dP0 I Weights andImportance Sampling

1 n n X i=1 Φ(Z0i)`(Zi0, θ) is an unbiased estimate of LP(θ)

I In various cases (e.g. covariate shift, positive-unlabeled learning, availability of several biased training samples),Φ can be estimated by means of the Z_i0’s andauxiliary information about the target population

More in e.g. Clémençon and Laforgue (2019), Achab, Clémençon, Tillier and Vogel (2019) and Bertail, Clémençon, Guyonvarch & Noiry (2021)

(24)

In Facial Recognition, reweighting may be insufficient!

Certain learning subproblems can be much harder than others, possibly causing a greataccuracy disparity.

See e.g. Bharadwaj et a. (2020)

Fairness constraintsmust be incorporated to the ERM program.

Why facial recognition algorithms can’t be perfectly fair? Clémençon & Maxwell (the Conversation, 2020).

(25)

Fair Machine Learning, beyond Biometrics

Algorithmic decisions are increasingly used in many domains:

Banking (e.g. loans) Recruting (e.g., hiring)

Insurance (e.g. cars) Judiciary (e.g., bail)

Recently, the fairness of algorithms has gathered lots of attention.

05/2016: The COMPAS system predicts recidivism likelihood for US courts.

Algorithms are designed for the interest of some party, fairness in ML suggests confronting those to the law.

“Predictive models are really just opinions embedded in math.” C. O’Neil.

Lack of fairness is not always a consequence of selection bias.

An illustration is given by age performance gaps in biometrics. See e.g. Achab, Clémençon, Tillié & Vogel (2020).

(26)

Fairness Definitions in Binary Classification

A lot of recent works considered fairness in binary classification, with two sensitive groups.

[Donini et al., 2018, Menon and Williamson, 2018, Zafar et al., 2019]

They add a sensitive variable Z_{∈ {0, 1} to the usual binary} classification model(X, Y), and learn g(X) from:

Dn={(X1, Y1, Z1), . . . , (Xn, Yn, Zn)}.

Many definitions of fairness exist, and apply to specific use-cases.

· Treatment: g(X, Z) = g(X) a.s.

· Impact: P{g(X) = +1 | Z = 0} = P{g(X) = +1 | Z = 1} · Error: P{g(X) 6= Y | Z = 0} = P{g(X) 6= Y | Z = 1}

· FPR: P{g(X) = +1 | Y = −1, Z = 0} = P{g(X) = +1 | Y = −1, Z = 1}

(27)

On the Design of Fair Scoring Rules

I Taux de vrais positifs:

1_{− G}s(t) = P{s(X) ≥ t | Y = 1}

I Taux de faux positifs:

1_−Hs(t) = P{s(X) ≥ t | Y = −1}

ROC curve: t_{7→ (1 − H}s(t), 1− Gs(t))

AUC = Area Under the ROC Curve

(28)

ROCG(0)

s, G(1)s

Learning with Pointwise ROC Constraints

Bellet, Clémençon, Vogel (2020)

To measure the difference between cdfs for Z= 0 and Z = 1:

∆H,α(s) = ROC_H(0) s ,H (1) s (α)−α and ∆G,α(s) = ROCG (0) s ,G (1) s (α)−α.

Incorporate mHpointwise constraints for∆H,·and mGfor∆G,·as a

penalization, and maximizeLΛinS, where:

LΛ(s) := AUCHs,Gs− mH X k=1 λ(k)_H |∆_H,α(k) H (s)| − mG X k=1 λ(k)_G |∆_G,α(k) G (s)|.

Finite-sample generalization bounds of order O_P(n−1/2_{) have been proved.}

(29)

Accuracy vs Fairness: Satisfactory trade-offs?

German Credit Dataset (German) in

[Zafar et al., 2019, Zehlike et al., 2017, Singh and Joachims, 2019, Donini et al., 2018].

Sensitive variable: gender.

Bank Marketing Dataset (Bank) in[Zafar et al., 2019]: predict whether a client shall

subscribe to a term deposit. Sensitive variable: age.Manuscript under review by AISTATS 2021

ROCH(1) s, Gs ROCH(0) s, Gs ROCH(1) s, Gs ROCH(0) s, Gs ROCH(0)s, Gs(1) ROCH(1)s, Gs(0) 0.000.250.500.751.00 0.0 0.2 0.4 0.6 0.8 1.0 0.000.250.500.751.00 0.0 0.2 0.4 0.6 0.8 1.0 0.000.250.500.751.00 0.0 0.2 0.4 0.6 0.8 1.0 0.000.250.500.751.00 ROCH(0) s, G(1) s ROCH(1)s, G(0)s ROCH(0) s, G(1) s ROCH(1)s, G(0)s ROCHs(0), Gs(1) ROCHs(1), Gs(0) 0.000.250.500.751.00 ROCH(0) s, H(1) s ROCGs(0), Gs(1) ROCH(0) s, H(1) s ROCG(0)s, G(1)s 0.000.250.500.751.00 ROCH(0) s, H(1)s ROCG(0)s, G(1)s 0.000.250.500.751.00 0.0 0.2 0.4 0.6 0.8 1.0 0.000.250.500.751.00 0.0 0.2 0.4 0.6 0.8 1.0 0.000.250.500.751.00 0.0 0.2 0.4 0.6 0.8 1.0 0.000.250.500.751.00 0.000.250.500.751.00 0.000.250.500.751.00 ROCH(0)s, H(1)s ROCG(0) s, G(1) s ROCH(0)s, H(1)s ROCG(0) s, G(1) s ROCH(0) s, H(1) s ROCGs(0), Gs(1) ROCHs, G(0) s ROCHs, G(1) s ROCHs, G(0)s ROCHs, G(1) s ROCHs, G(0) s ROCHs, G(1) s

Figure 10: ROC curves for Bank and German for a score learned without and with fairness constraints. On all plots, dashed and solid lines represent respectively training and test sets. Black curves represent ROCHs,Gs, and

above the curves we report the corresponding ranking performance AUCHs,Gs.

Figure 11: ROC curves for Adult and Compas for a score learned without and with fairness constraints. On all plots, dashed and solid lines represent respectively training and test sets. Black curves represent ROCHs,Gs, and

above the curves we report the corresponding ranking performance AUCHs,Gs.

(30)

Some references

I Empirical processes in survey sampling. Bertail, Chautru and Clémençon (2016). Scandinavian J. Stat.

I Learning from Survey Training Samples: Rate Bounds for

Horvitz-Thompson Risk Minimizers. Bertail, Clémençon & Papa. In ACML 2016.

I Optimal Survey Schemes for SGD with Applications to M-estimation. Bertail, Chautru, Clémençon & Papa. In ESAIM, 2018.

I Sampling and Empirical Risk Minimization. Bertail, Chautru and Clémençon (2016). In Statistics.

I Empirical Risk Minimization under Random Censorship. Ausset, Clémençon & Portier JMLR (2021).

I Learning from Biased Training Samples. Clémençon & Laforgue (2020). Submitted

I Weighted ERM: Transfer Learning based on Importance Sampling. Achab, Clémençon, Tillier and Vogel (2020). In ICMA.

I Learning Fair Scoring Functions: Bipartite Ranking under ROC-based Fairness Constraints. Bellet, Clémençon and Vogel (2020).

Submitted.

(31)

References I

Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J., and Pontil, M. (2018). Empirical risk minimization under fairness constraints.

In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, pages 2796–2806.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. (2016).

Domain-adversarial training of neural networks. Journal of Machine Learning Research.

Guo, Y., Zhang, L., Hu, Y., He, X., and Gao, J. (2016).

Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In Computer Vision - ECCV 2016 - 14th European Conference, volume 9907 of Lecture Notes in Computer Science, pages 87–102. Springer.

Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller, E. (2007).

Labeled faces in the wild: A database for studying face recognition in unconstrained environments.

Technical Report 07-49, University of Massachusetts, Amherst. Long, M., Zhu, H., Wang, J., and Jordan, M. I. (2016).

Unsupervised domain adaptation with residual transfer networks.

In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, pages 136–144.

(32)

References II

Menon, A. K. and Williamson, R. C. (2018). The cost of fairness in binary classification.

In Conference on Fairness, Accountability and Transparency, FAT 2018, volume 81 of Proceedings of Machine Learning Research, pages 107–118. PMLR.

Singh, A. and Joachims, T. (2019). Policy learning for fairness in ranking.

In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pages 5427–5437.

Tzeng, E., Hoffman, J., Darrell, T., and Saenko, K. (2015). Simultaneous deep transfer across domains and tasks.

In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, page 4068âĂŞ4076, USA. IEEE Computer Society.

Zafar, M. B., Valera, I., Gomez-Rodriguez, M., and Gummadi, K. P. (2019). Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research, 20(75):1–42.

Zehlike, M., Bonchi, F., Castillo, C., Hajian, S., Megahed, M., and Baeza-Yates, R. (2017). Fa*ir: A fair top-k ranking algorithm.

In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, pages 1569–1578. ACM.