2.1.3 Algorithmic Bias and Machine Learning Fairness
Algorithmic bias and machine learning fairness have recently come to public attention due to high-profile incidents such as COMPAS's negative predictions about Black defendants [5]. Broadly, machine learning fairness strives to address bias by creating algorithms that both treat people equally and are perceived as fair by their users. Deploying fair algorithms is essential for creating human-centered intelligent systems. Unfairness can lead to anything from unjust high-stakes decisions about defendant parole [5] to poor user experiences caused by a lack of trust and perceived unfairness (lee big data). When users perceive an algorithm as unfair, they may behave in ways that the system's designers did not intend [118]. Other instances, such as unfairness in voice systems, can diminish user acceptance of those systems.
Friedman and Nissenbaum define computational bias as “computer systems that systematically and unfairly discriminate against certain individuals or groups of individuals in favor of others” [67]. In addition, Friedman and Nissenbaum present a taxonomy of biases in computational systems with top-level categories of Preexisting Bias, Technical Bias, and Emergent Bias. While Friedman and Nissenbaum’s work was prescient in many ways, it is difficult to use this taxonomy to address algorithmic and data bias problems in practice. Their categorization also does not point to underlying causes or solutions.
problems in ways that suggest how to intervene and correct biases. One helpful taxonomy for classifying algorithmic and data bias is presented by Ricardo Baeza-Yates [9]. The taxonomy consists of six types of bias: activity bias, data bias, sampling bias, algorithm bias, interface bias, and self-selection bias. These biases form a directed cyclic graph, in which each stage feeds biased data into the next, where more bias is introduced. While the cyclical nature of bias seems disheartening, the taxonomy also shows where we can intervene in the bias cycle. For example, voice interfaces often struggle with strong regional accents [199]. Classifying this deficiency within the taxonomy allows us to suggest corrective actions. The inaccuracy with strong accents could be due to data bias: the company trained its ASR models on data that did not include such accents and thus needs to collect that data. It could also be due to sampling bias: a training sample that over-represents unaccented voices could be corrected by stratifying the sample differently. The bias taxonomy from Baeza-Yates thus provides a common language for discussing problems with bias.
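The stratified-sampling correction mentioned above can be illustrated with a minimal sketch. The corpus, the "accent" labels, and the function name `stratified_sample` are all hypothetical and chosen only for illustration; a real ASR training pipeline would stratify on richer metadata and would likely need to collect new data when a group is too small.

```python
import random
from collections import defaultdict


def stratified_sample(utterances, key, per_group, seed=0):
    """Draw up to `per_group` items from each group defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for utterance in utterances:
        groups[key(utterance)].append(utterance)
    sample = []
    for label in sorted(groups):
        items = groups[label]
        # A group smaller than `per_group` is used in full.
        k = min(per_group, len(items))
        sample.extend(rng.sample(items, k))
    return sample


# Hypothetical corpus in which one accent group dominates (90 vs. 10).
corpus = [{"id": i, "accent": "general"} for i in range(90)]
corpus += [{"id": 90 + i, "accent": "regional"} for i in range(10)]

balanced = stratified_sample(corpus, key=lambda u: u["accent"], per_group=10)
```

In this sketch the balanced sample contains equal numbers of utterances from each group. Stratification can only rebalance data that already exists; it does not replace collecting data for accents that are missing entirely.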
2.1.3.1 Algorithmic Biases In Voice
While biases in voice technologies are modality- and domain-specific, they illustrate many of the common problems that larger algorithmic bias efforts must deal with. As such, we include this review of algorithmic bias in voice systems here and later review more broadly the topic of voice interfaces and the specific domain our later work will fall into: language usage in music. Automatic speech recognition systems may exhibit biases with regard to different voices and uses of language. Tatman shows that different English dialects result in significantly different accuracy in automatic captions [199]. Other work shows that current natural language processing systems perform poorly with African-American English compared to Standard American English [17, 98]. This may mean that music creators who title their compositions in their own dialects make their work less accessible than work with more standard titles. Recent initiatives to create more open and diverse speech datasets for voice recognition models have yet to bear fruit [142]; nor will these efforts cover every domain. Many voice applications also re-use training data from applications in other modalities. For example, voice web search will at least partially exploit text web search data. However, voice queries are longer and closer to natural language than typed queries [84], so the way users name content by voice may be atypical of long-form training text.
2.1.3.2 Solutions to Voice Biases and Challenges
As we explored with the Baeza-Yates taxonomy, bias has many different sources. These different sources lead to different approaches to overcoming bias, which we apply to voice interfaces. In voice interfaces, correcting biases can range from approaches focused on the interaction model itself to those dealing with the underlying data and algorithms. One approach is to detect when a user is having speech recognition problems and automatically adjust the voice dialogue itself, e.g., by switching from an open to a closed-form questioning style. Other approaches may combine voice recognition with on-screen input. Goto et al. demonstrate this, showing options on-screen in response to uncertain voice commands [78]. However, these approaches do not necessarily solve problems where the training data itself is impoverished and particular content is inaccessible. We focus on the identification of inaccessible content and on solutions through data collection.
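The dialogue-adaptation approach described above can be sketched as follows. This is only an illustration under stated assumptions: the confidence threshold, the number of failed turns, the prompt wording, and the function name `choose_prompt` are hypothetical and are not taken from any of the cited systems.

```python
OPEN_PROMPT = "What would you like to hear?"


def choose_prompt(recent_confidences, options, threshold=0.5, max_failures=2):
    """Pick an open or closed prompt from recent ASR confidence scores.

    The dialogue switches to a closed-form question once the last
    `max_failures` turns all scored below `threshold`.
    """
    recent = recent_confidences[-max_failures:]
    struggling = (
        len(recent) == max_failures
        and all(score < threshold for score in recent)
    )
    if struggling:
        # Closed form: constrain the user to a short list of choices.
        return "Please say one of: " + ", ".join(options) + "."
    return OPEN_PROMPT


# Hypothetical session: two low-confidence turns in a row.
prompt = choose_prompt([0.9, 0.3, 0.2], ["Too High", "Two High"])
```

A closed prompt narrows what the recognizer has to distinguish, but as noted above it does not help when the requested content is missing from the underlying data.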
One aspect of identifying inaccessible content is common in ASR: recognizing out-of-vocabulary (OOV) terms. Parada et al. [151] describe three ways to detect and deal with OOV terms. The first method, filler models, represents unrecognized terms using fillers, sub-words, or generic word models. The second method uses confidence estimation scores to find unreliable OOV regions [88]. The third method uses the local lexical context of a transcription region. Other approaches model the pronunciation of OOV terms; see Can et al. [28]. Alternatively, Parada et al. [152] describe how, after OOV regions have been detected in transcriptions, they use the lexical context to query the web and retrieve related content from which they then derive OOV terms, often names.
The above methods for recognizing OOV terms often assume that the developer has built a specialized ASR from the ground up and can modify it however they choose. With the advent of large-scale public ASR APIs, this assumption may no longer hold. In addition, bias studies find categories of problems that would exist even with a perfect vocabulary. For example, ‘100it racks’ is a creative way of spelling ‘hunnit racks’, which is slang for ‘hundred racks’. Even an ASR with a perfect vocabulary could at best recognize ‘hunnit racks’, which still leaves a gap to bridge between this recognition and the actual song title ‘100it racks’.
Crowdsourcing is a relatively common part of dealing with ASR problems. Data collection through crowdsourcing can be used to learn pronunciations for named entities [181], and similar work exists for the generation of search aliases [33]. Ali et al. [3] describe the challenge of evaluating ASR output for languages without a standard orthographic representation, where no canonical spelling exists. They use crowdsourced transcriptions to evaluate performance for Dialectal Arabic ASRs. Granell and Martínez-Hinarejos [80] use crowdsourcing to collect spoken utterances to help transcribe handwritten documents, combining speech and text image transcription.
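The bridge between a recognized string such as ‘hunnit racks’ and the catalog title ‘100it racks’ could, for instance, take the form of a lookup over collected aliases. The sketch below assumes a hypothetical alias table of the kind crowdsourcing might produce; the function names and the normalization rule are illustrative only and do not describe the method of any cited work.

```python
def normalize(text):
    """Lowercase and collapse whitespace so lookups are forgiving."""
    return " ".join(text.lower().split())


def build_alias_index(aliases):
    """Map every spoken form (and the title itself) to its catalog title."""
    index = {}
    for title, spoken_forms in aliases.items():
        index[normalize(title)] = title
        for form in spoken_forms:
            index[normalize(form)] = title
    return index


def resolve(transcript, index):
    """Return the catalog title for an ASR transcript, or None if unknown."""
    return index.get(normalize(transcript))


# Hypothetical alias data, e.g. gathered from crowd workers.
aliases = {"100it racks": ["hunnit racks", "hundred racks"]}
index = build_alias_index(aliases)

title = resolve("Hunnit  Racks", index)
```

Because the lookup runs on the transcript rather than inside the recognizer, a mapping of this kind remains usable when the ASR is an external service that cannot be modified.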
However, before such processes can be applied, we first need to assess the potential problems occurring within a domain, and their prevalence. We can no longer assume that voice application builders develop their own ASR, nor that they have access to the internals of their ASR service; this creates challenges for correcting ASR errors that require pragmatic solutions. Some content may be less accessible because it draws on cultural and linguistic practices that are not well supported by current speech solutions.
Most ASR systems rely on a language model that prioritizes high-probability word sequences over less likely utterances. These probabilities are estimated from frequencies of word n-grams in corpora [169]. Probability of co-occurrence is also a significant predictor of ASR error [74]. For example, a popular track, at the time of writing, is “Two High” by Moon Taxi. The name of this song is pronounced [tu haɪ] and can correspond to three possible English strings: “to high”, “too high”, or “two high”. Of these strings, “too high” is the most statistically likely, and so when a user asks for the sound string [tu haɪ], a generalized ASR system’s language model is more likely to return the written string “Too high.” This can be problematic for named content, creating confusion over whether the user is asking for the current popular track “Two High” or an older popular track like “Too High” by Stevie Wonder. While some ASR APIs accept custom language models and pronunciation dictionaries, these are usually quite limited. The additional vocabulary words still need to be detected, generated, and supplied with a probability. This is especially an issue for domains where creative language usage is valued (e.g., music, art), or for systems used by audiences with diverse linguistic backgrounds. These errors may require downstream solutions if the developers use off-the-shelf APIs where language models are not directly modifiable.
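The “Two High” example can be made concrete with a toy sketch. The counts below are invented for illustration and do not come from any real corpus; the homophone table, the catalog, and all function names are likewise hypothetical. The first function mimics a language model that prefers the most frequent string, and the second shows one possible downstream step that expands the returned transcript into its homophone variants and checks each against a catalog.

```python
from itertools import product

# Hypothetical homophone sets; a real system would need a pronunciation lexicon.
HOMOPHONES = {word: ["to", "too", "two"] for word in ("to", "too", "two")}

# Made-up frequency counts standing in for language model probabilities.
BIGRAM_COUNTS = {"to high": 40, "too high": 900, "two high": 5}

# Toy catalog keyed by normalized title.
CATALOG = {
    "too high": ("Too High", "Stevie Wonder"),
    "two high": ("Two High", "Moon Taxi"),
}


def lm_best(candidates, counts):
    """Return the candidate string with the highest count."""
    return max(candidates, key=lambda c: counts.get(c, 0))


def expand(transcript):
    """List every string obtained by swapping words for their homophones."""
    words = transcript.lower().split()
    choices = [HOMOPHONES.get(word, [word]) for word in words]
    return [" ".join(combo) for combo in product(*choices)]


def catalog_candidates(transcript, catalog):
    """Return catalog entries matching any homophone variant of the transcript."""
    return [catalog[c] for c in expand(transcript) if c in catalog]


candidates = expand("too high")
top = lm_best(candidates, BIGRAM_COUNTS)
matches = catalog_candidates(top, CATALOG)
```

Under these invented counts the frequency-based choice is “too high”, while the downstream expansion surfaces both tracks, leaving the application to disambiguate, for example by popularity or by asking the user.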