Why does my model have different behavior on different demographic groups?

(2)

Examples

(10)

ML systems’ differing behavior by demographic

Rates of predictions {+,-} vary across race, gender, disability status, age…

That does not necessarily mean accuracy varies across these demographics.

E.g., when base rates of {+,-} differ between these groups.

But credit, housing, and employment have special legal protections against differing rates of predictions.

Sweeney observed numerous correlations between names and when bail bond ads were shown.
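To make the base-rate point concrete, here is a minimal sketch with made-up labels for two hypothetical groups, G1 and G2, and a classifier assumed to be perfectly accurate on both. The data, the group names, and the helper functions are illustrative assumptions, not taken from the cited work.

```python
# Hypothetical labels: G1 has a 60% base rate of "+", G2 has 20%.
labels = {
    "G1": [1] * 6 + [0] * 4,
    "G2": [1] * 2 + [0] * 8,
}
# Assumed for illustration: a classifier that gets every label right.
predictions = {g: list(ys) for g, ys in labels.items()}


def positive_rate(preds):
    """Fraction of predictions that are "+"."""
    return sum(preds) / len(preds)


def accuracy(preds, ys):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


rates = {g: positive_rate(predictions[g]) for g in labels}
accs = {g: accuracy(predictions[g], labels[g]) for g in labels}
print("prediction rates:", rates)  # differ across groups
print("accuracy:", accs)           # identical across groups
```

Here the rate of "+" predictions differs only because the base rates differ, while accuracy is the same for both groups.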

(19)

ML systems’ differing behavior by demographic

Error rates vary across race, gender, disability status, age…

and false positive/false negative rates also differ.

Gender identification accuracy varies by skin type and gender [Buolamwini, Gebru 2018].

Rates of facial recognition false positives and false negatives vary by race [NIST, others].

Risk predictions for those incarcerated or charged aren’t equally predictive across race and gender [ProPublica].

(20)

Why does this happen?

Smaller samples

Less statistical significance

Less prioritization in optimization | data

Data sets with more + examples from G1 and more - examples from G2

Less informative features

Both in terms of noise in labels and measurement error

Models which capture correlations between X and Y better for A than B

Loss function a better proxy for performance on larger populations

… and many other reasons.
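The "smaller samples, less statistical significance" point can be sketched with the standard error of an estimated error rate. This assumes i.i.d. samples; the 10% error rate and the two sample sizes are hypothetical numbers chosen for illustration.

```python
import math


def error_rate_std_error(error_rate, n):
    """Standard error of an error rate estimated from n i.i.d. samples."""
    return math.sqrt(error_rate * (1.0 - error_rate) / n)


# Hypothetical: both groups have the same true error rate of 10%,
# but one group contributes 100x more evaluation samples.
se_large = error_rate_std_error(0.10, 10_000)
se_small = error_rate_std_error(0.10, 100)
print(f"n = 10000: 0.10 +/- {se_large:.4f}")
print(f"n = 100:   0.10 +/- {se_small:.4f}")
```

With 100 times fewer samples, the uncertainty on the smaller group's error rate is 10 times larger, so real differences are harder to detect and to optimize against.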

(21)

Why does this happen?

What precisely are we referring to?

Model has different P[f(x) = + | group]? (not) statistical parity

Does 0/1 loss (accuracy) vary by demographic? (not) equalized error rates

Do the demographics face different kinds of errors? (not) equalized false positive/false negative rates

Can we avoid all of these simultaneously?

In general, not at the model selection point of the pipeline :(
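The three questions above correspond to three per-group quantities. A minimal sketch for computing them from binary labels and predictions follows; the function name `group_metrics` and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def group_metrics(y_true, y_pred, groups):
    """Per-group positive prediction rate, error rate, FPR, and FNR."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for y, p, g in zip(y_true, y_pred, groups):
        if p == 1:
            key = "tp" if y == 1 else "fp"
        else:
            key = "fn" if y == 1 else "tn"
        counts[g][key] += 1

    def ratio(num, den):
        # Undefined rates (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    out = {}
    for g, c in counts.items():
        n = sum(c.values())
        out[g] = {
            "positive_rate": ratio(c["tp"] + c["fp"], n),  # statistical parity
            "error_rate": ratio(c["fp"] + c["fn"], n),     # 0/1 loss
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),      # false positives
            "fnr": ratio(c["fn"], c["fn"] + c["tp"]),      # false negatives
        }
    return out


# Toy data for two hypothetical groups, A and B.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
metrics = group_metrics(y_true, y_pred, groups)
for g, m in metrics.items():
    print(g, m)
```

Comparing the rows for A and B answers each question separately; a model can match on one quantity and still differ on the others.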

(22)

Crime Statistics and ML

(23)

A “standard” ML perspective

Can we predict crime?

Can we prevent crime?

And if we can do either, what are the right measures of effectiveness?

You have historical data \(\{(x_i, y_i)\}_{i=1}^n\)

\(x_i \in \mathbb{R}\): geographic location

\(y_i \in \{0, 1\}\): did a (violent) crime occur there yesterday?

(24)

A “standard” ML perspective

If a violent crime occurs tomorrow where we didn’t predict, cost of $100,000.

If no crime occurs tomorrow where we predicted one to occur, cost of $100.

Discussion based on [Hardt, Price, Srebro ’16]. See http://www.fatml.org for more resources.
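The two costs imply a decision threshold. The sketch below assumes a calibrated probability of crime at a location and assumes correct predictions cost nothing; under those assumptions, predicting "crime" has lower expected cost when p × 100,000 > (1 − p) × 100. The function names are hypothetical.

```python
COST_MISSED_CRIME = 100_000  # crime occurs where none was predicted (from the slide)
COST_FALSE_ALARM = 100       # crime predicted where none occurs (from the slide)


def decision_threshold(cost_fn, cost_fp):
    """Probability above which predicting "crime" has lower expected cost.

    Assumes correct predictions cost nothing:
    predict when p * cost_fn > (1 - p) * cost_fp.
    """
    return cost_fp / (cost_fp + cost_fn)


def predict(p_crime, threshold):
    """Return 1 (predict crime) if the probability exceeds the threshold."""
    return 1 if p_crime > threshold else 0


threshold = decision_threshold(COST_MISSED_CRIME, COST_FALSE_ALARM)
print(f"threshold = {threshold:.6f}")
print(predict(0.01, threshold), predict(0.0005, threshold))
```

With costs this lopsided the threshold is about 0.001, so the cost-minimizing rule predicts crime almost everywhere, which is one reason the choice of effectiveness measure matters.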

(25)

A (slightly) more nuanced set of questions

What if our predictions are only effective for some types of crime?

For some types of neighborhoods?

What features are acceptable to use in predicting crime?

How are these features/labels gathered?

What if they are gathered in an uneven manner?

And what will be done with these predictions?

(26)

A “standard” ML perspective

You have historical data \(\{(x_i, a_i, y_i)\}_{i=1}^n\)

\(x_i \in \mathbb{R}\): geographic location

\(a_i \in \{\text{majority minority neighborhood}, \text{low income neighborhood}, \text{majority white neighborhood}, \ldots\}\)

\(y_i \in \{0, 1\}\): did a (violent) crime occur there yesterday?

If a violent crime occurs tomorrow where we didn’t predict, cost of $100,000.

If no crime occurs tomorrow where we predicted one to occur, cost of $100.

(27)

Would most of our concerns be mitigated by:

Removing demographic information from a dataset?

“Fairness through unawareness”, or demographically blind decisions

\(P[f(x_i) = + \mid a_i = \ast] = P[f(x_i) = +]\)

Pro: Simple, easy to audit

Con: Geographic information often contains a proxy for demographics.
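The proxy concern can be illustrated with a small sketch. The records, the neighborhood names, and the rule `blind_model` are all hypothetical: the model never sees the demographic group, but the group is correlated with location.

```python
# Hypothetical records: (neighborhood, demographic group). The group is
# never shown to the model, but it correlates with neighborhood.
records = (
    [("north", "G1")] * 8 + [("north", "G2")] * 2
    + [("south", "G1")] * 2 + [("south", "G2")] * 8
)


def blind_model(neighborhood):
    """A demographically blind rule that only looks at location (assumed)."""
    return 1 if neighborhood == "north" else 0


def positive_rate_by_group(records, model):
    """Rate of "+" predictions per group, with group used only for auditing."""
    totals, positives = {}, {}
    for neighborhood, group in records:
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + model(neighborhood)
    return {g: positives[g] / totals[g] for g in totals}


rates = positive_rate_by_group(records, blind_model)
print(rates)
```

Even though the group never enters the model, the prediction rates differ sharply by group because location carries the demographic signal.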


(28)

Would most of our concerns be mitigated by:

Requiring our predictions be conditionally independent of demographic information?

Demographic parity, statistical parity…

\(P[f(x_i) = + \mid a_i = \ast] = P[f(x_i) = + \mid a_i = \diamond]\)

Pro: Aligns with certain legal definitions of equity

Con: Demographics with lower levels of violent crime will have higher predicted violent crime rates
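A minimal sketch of the parity condition and its stated drawback, using made-up data for two hypothetical groups ("low" and "high" base rate). The function `demographic_parity_gap` is an illustrative helper, not a library API.

```python
def positive_rate(values):
    """Fraction of entries equal to 1."""
    return sum(values) / len(values)


def demographic_parity_gap(y_pred, groups):
    """Largest difference in P[f(x) = +] between any two groups."""
    by_group = {}
    for p, g in zip(y_pred, groups):
        by_group.setdefault(g, []).append(p)
    rates = [positive_rate(v) for v in by_group.values()]
    return max(rates) - min(rates)


# Hypothetical outcomes: group "low" has a 10% base rate, "high" has 30%.
groups = ["low"] * 10 + ["high"] * 10
y_true = [1] + [0] * 9 + [1, 1, 1] + [0] * 7
# Predictions forced to a common 20% positive rate in both groups.
y_pred = [1, 1] + [0] * 8 + [1, 1] + [0] * 8

gap = demographic_parity_gap(y_pred, groups)
low_true = positive_rate(y_true[:10])
low_pred = positive_rate(y_pred[:10])
print("parity gap:", gap)
print("low group: actual rate", low_true, "predicted rate", low_pred)
```

Parity holds (the gap is zero), but only because the lower-base-rate group is predicted positive at twice its actual rate in this toy data.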

(29)

Would most of our concerns be mitigated by:

Requiring equal false positive and negative rates for all demographics?

Equality of odds

\(P[f(x_i) = y \mid a_i = \ast, y'] = P[f(x_i) = y \mid a_i = \diamond, y']\)

Pro: Chance of false prediction of crime (or missing crime) is independent of demographics

Con: Higher complexity to explain to non-experts; necessarily precludes other options.
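The condition can be checked empirically by estimating the positive prediction rate for each pair of group and true label. The sketch below uses made-up data for two hypothetical groups; the helper names and the tolerance `tol` are assumptions for illustration.

```python
def conditional_positive_rates(y_true, y_pred, groups):
    """Estimate P[f(x) = 1 | a, y'] for every (group, true label) pair."""
    totals, positives = {}, {}
    for y, p, g in zip(y_true, y_pred, groups):
        key = (g, y)
        totals[key] = totals.get(key, 0) + 1
        positives[key] = positives.get(key, 0) + p
    return {k: positives[k] / totals[k] for k in totals}


def satisfies_equalized_odds(y_true, y_pred, groups, tol=1e-9):
    """True if, for each true label, all groups share the same positive rate."""
    rates = conditional_positive_rates(y_true, y_pred, groups)
    for label in (0, 1):
        vals = [r for (g, y), r in rates.items() if y == label]
        if vals and max(vals) - min(vals) > tol:
            return False
    return True


# Hypothetical data: G1 has a 50% base rate, G2 has 25%.
groups = ["G1"] * 8 + ["G2"] * 8
y_true = [1, 1, 1, 1, 0, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0] + [1, 1, 1, 1, 1, 0, 0, 0]

cond_rates = conditional_positive_rates(y_true, y_pred, groups)
odds_ok = satisfies_equalized_odds(y_true, y_pred, groups)
overall = {g: sum(y_pred[i:i + 8]) / 8 for g, i in (("G1", 0), ("G2", 8))}
print(cond_rates)
print("equalized odds:", odds_ok)
print("overall positive rates:", overall)
```

In this toy data both groups have the same false positive and true positive rates, yet their overall positive rates differ (0.75 versus 0.625), which illustrates how satisfying this criterion can rule out statistical parity when base rates differ.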


(30)

(Some) additional concerns here

In the US, policing, arresting, charging, and convicting for certain crimes has been applied to different populations at very unequal rates.

E.g., illicit drug use is charged at much higher rates for minority persons.

Moreover, crimes charged at higher rates for certain demographics have been deemed more dangerous than similar ones charged in other demographics.