Why does my model have different behavior on different demographic groups?
Examples
ML systems’ differing behavior by demographic
Rates of predictions {+,-} vary across race, gender, disability status, age…
That does not necessarily mean accuracy varies across these demographics
E.g., when base rates of {+,-} differ across these groups
But credit, housing, and employment have special legal protections against this
Sweeney observed numerous correlations between name and when bail bond ads are shown
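A minimal sketch of the base-rate point above, using made-up labels (the groups "A" and "B" and their counts are assumptions for illustration, not data from the talk). Even a classifier that is always right predicts + at different rates when the groups' base rates differ:

```python
# Hypothetical toy labels: group "A" has a 50% base rate of +, group "B" a 20% base rate.
labels = {
    "A": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "B": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
}


def perfect_classifier(y):
    """Stand-in for an ideal model: it always predicts the true label."""
    return y


def summarize(ys):
    """Return (rate of + predictions, accuracy) for one group."""
    preds = [perfect_classifier(y) for y in ys]
    positive_rate = sum(preds) / len(preds)
    accuracy = sum(p == y for p, y in zip(preds, ys)) / len(ys)
    return positive_rate, accuracy


summary = {group: summarize(ys) for group, ys in labels.items()}
for group, (positive_rate, accuracy) in summary.items():
    print(f"group {group}: positive rate = {positive_rate:.2f}, accuracy = {accuracy:.2f}")
```

Both groups get accuracy 1.0, yet the rate of + predictions simply mirrors each group's base rate.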
Error rates vary across race, gender, disability status, age…
and false positive/false negative rates also differ.
Gender identification accuracy varies by skin type and gender [Buolamwini, Gebru 2018]
Rates of facial recognition false positives and false negatives vary by race [NIST, others]
Risk predictions for those incarcerated or charged aren’t equally predictive across race and gender [ProPublica]
Why does this happen?
Smaller samples
Less statistical significance
Less prioritization in optimization | data
Data sets with more + examples from G1 and more - examples from G2
Less informative features
Both in terms of noise in labels and measurement error
Models which capture correlations between X and Y better for A than B
Loss function a better proxy for performance on larger populations
… and many other reasons.
What precisely are we referring to?
Does the model have different P[f(x) = + | group]? (not) Statistical parity
Does 0/1 loss (accuracy) vary by demographic? (not) Equalized error rates
Do the demographics face different kinds of errors? (not) Equalized false positive/false negative rates
Can we avoid all of these simultaneously?
In general, not at the model selection point of the pipeline :(
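The three questions above can be checked directly on held-out predictions. The following is a sketch only: the function name `group_metrics` and the toy arrays are assumptions for illustration, and labels are taken to be 1 for + and 0 for -.

```python
from collections import defaultdict


def group_metrics(y_true, y_pred, groups):
    """Per-group positive rate, error rate, false positive rate, and false negative rate."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for y, p, g in zip(y_true, y_pred, groups):
        if p == 1:
            key = "tp" if y == 1 else "fp"
        else:
            key = "fn" if y == 1 else "tn"
        counts[g][key] += 1

    def ratio(num, den):
        # Undefined rates (e.g. no true negatives in a group) are reported as NaN.
        return num / den if den else float("nan")

    metrics = {}
    for g, c in counts.items():
        n = sum(c.values())
        metrics[g] = {
            "positive_rate": ratio(c["tp"] + c["fp"], n),  # statistical parity compares these
            "error_rate": ratio(c["fp"] + c["fn"], n),     # 0/1 loss
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),      # false positive rate
            "fnr": ratio(c["fn"], c["fn"] + c["tp"]),      # false negative rate
        }
    return metrics


# Hypothetical toy data with two groups of four examples each.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

metrics = group_metrics(y_true, y_pred, groups)
for g, m in metrics.items():
    print(g, m)
```

Comparing `positive_rate` across groups corresponds to the statistical parity question, `error_rate` to the 0/1 loss question, and `fpr`/`fnr` to the question about kinds of errors.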
Crime Statistics and ML
A “standard” ML perspective
Can we predict crime?
Can we prevent crime?
And if we can do either, what are the right measures of effectiveness?
You have historical data \(\{(x_i, y_i)\}_{i=1}^{n}\)
\(x_i \in R\): Geographic location
\(y_i \in \{0, 1\}\): Did a (violent) crime occur there yesterday?
If a violent crime occurs tomorrow where we didn’t predict one, cost of $100,000.
If no crime occurs tomorrow where we predicted one to occur, cost of $100.
Discussion based on [Hardt, Price, Srebro ’16]. See http://www.fatml.org for more resources.
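The two costs above imply a decision threshold. A sketch, assuming the model outputs a calibrated probability `p_crime` for each location (that assumption, and all names below, are illustrative and not from the slides): predicting a crime is cheaper in expectation whenever `p_crime` exceeds 100 / (100 + 100,000), roughly 0.001.

```python
COST_MISSED_CRIME = 100_000  # crime occurs where we did not predict one (from the slide)
COST_FALSE_ALARM = 100       # no crime occurs where we predicted one (from the slide)


def expected_cost(p_crime, predict_crime):
    """Expected cost of a decision, given an assumed calibrated probability of crime."""
    if predict_crime:
        return (1 - p_crime) * COST_FALSE_ALARM
    return p_crime * COST_MISSED_CRIME


def cost_minimizing_prediction(p_crime):
    """Predict a crime exactly when doing so has the lower expected cost."""
    return expected_cost(p_crime, True) < expected_cost(p_crime, False)


# Break-even probability: p * COST_MISSED_CRIME == (1 - p) * COST_FALSE_ALARM.
threshold = COST_FALSE_ALARM / (COST_FALSE_ALARM + COST_MISSED_CRIME)
print(f"predict a crime when p_crime > {threshold:.6f}")
```

With costs this asymmetric, almost every location with a non-negligible estimated probability gets flagged, which is one reason the choice of effectiveness measure matters.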
A (slightly) more nuanced set of questions
What if our predictions are only effective for some types of crime?
For some types of neighborhoods?
What features are acceptable to use in predicting crime?
How are these features/labels gathered?
What if they are gathered in an uneven manner?
And what will be done with these predictions?
A “standard” ML perspective
You have historical data \(\{(x_i, a_i, y_i)\}_{i=1}^{n}\)
\(a_i \in \{\text{majority minority neighborhood}, \text{low income neighborhood}, \text{majority white neighborhood}, \ldots\}\)
Would most of our concerns be mitigated by:
Removing demographic information from a dataset?
“Fairness through unawareness”, or demographically blind decisions
Pro: Simple, easy to audit
Con: Geographic information often contains a proxy for demographics.
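A toy sketch of the proxy problem. The records, the neighborhood names, and the rule `blind_model` are all made up for illustration. The rule never sees the demographic attribute, yet its + rate differs sharply by group because neighborhood and group are correlated in the data:

```python
# Hypothetical records: (neighborhood, demographic group).
# The neighborhood is the only feature the model is allowed to see.
records = (
    [("north", "G1")] * 9 + [("north", "G2")] * 1
    + [("south", "G1")] * 2 + [("south", "G2")] * 8
)


def blind_model(neighborhood):
    """A demographically blind rule: it flags a location using geography alone."""
    return 1 if neighborhood == "south" else 0


def positive_rate_by_group(records):
    """Fraction of each group's records that the blind rule flags."""
    totals, flagged = {}, {}
    for neighborhood, group in records:
        totals[group] = totals.get(group, 0) + 1
        flagged[group] = flagged.get(group, 0) + blind_model(neighborhood)
    return {g: flagged[g] / totals[g] for g in totals}


rates = positive_rate_by_group(records)
for group, rate in rates.items():
    print(f"group {group}: flagged rate = {rate:.2f}")
```

Dropping the demographic column changes what the model is given, not what it can infer.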
\(P[f(x_i) = + \mid a_i = \ast] = P[f(x_i) = +]\)
Would most of our concerns be mitigated by:
Requiring our predictions be independent of demographic information?
Demographic parity, statistical parity…
Pro: Aligns with certain legal definitions of equity
Con: Demographics with lower levels of violent crime will have predicted violent crime rates higher than their actual rates
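A sketch of the con above, with invented scores and labels (the helper `parity_predictions` and all numbers are assumptions for illustration). One simple way to enforce parity is to flag the same top fraction of each group by score. The group with the lower base rate then ends up with a predicted rate above its actual rate:

```python
def parity_predictions(scores_by_group, target_rate):
    """Flag the top `target_rate` fraction of each group by score,
    so that every group has the same rate of + predictions."""
    predictions = {}
    for group, scores in scores_by_group.items():
        k = round(target_rate * len(scores))
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        top = set(ranked[:k])
        predictions[group] = [1 if i in top else 0 for i in range(len(scores))]
    return predictions


# Hypothetical model scores and true labels; G1 has base rate 0.5, G2 has base rate 0.1.
scores_by_group = {
    "G1": [0.9, 0.8, 0.7, 0.6, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1],
    "G2": [0.7, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1],
}
labels_by_group = {
    "G1": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "G2": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

predictions = parity_predictions(scores_by_group, target_rate=0.3)
predicted_rate = {g: sum(p) / len(p) for g, p in predictions.items()}
base_rate = {g: sum(y) / len(y) for g, y in labels_by_group.items()}
for g in predictions:
    print(f"{g}: predicted rate = {predicted_rate[g]:.1f}, actual rate = {base_rate[g]:.1f}")
```

Here both groups are flagged at rate 0.3, which overstates G2 (actual rate 0.1) and understates G1 (actual rate 0.5).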
\(P[f(x_i) = + \mid a_i = \ast] = P[f(x_i) = + \mid a_i = \diamond]\)
Would most of our concerns be mitigated by:
Requiring equal false positive and false negative rates for all demographics?
Equality of odds
Pro: Chance of false prediction of crime (or missing crime) is independent of demographics
Con: Higher complexity to explain to non-experts; necessarily precludes other options.
\(P[f(x_i) = y \mid a_i = \ast, y'] = P[f(x_i) = y \mid a_i = \diamond, y']\)