Topics to Avoid: Demoting Latent Confounds in Text Classification

(1)

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 4153–4163,

4153

Topics to Avoid: Demoting Latent Confounds in Text Classification

Sachin Kumar♦ Shuly Wintner♣ Noah A. Smith♥♠ Yulia Tsvetkov♦

♦_{Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA} ♣_{Department of Computer Science, University of Haifa, Haifa, Israel}

♥_{Paul G. Allen School of Computer Science & Engineering,}

University of Washington, Seattle, WA, USA

♠_{Allen Institute for Artificial Intelligence, Seattle, WA, USA}

[email protected], [email protected] [email protected], [email protected]

Abstract

Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not al-ways generalize well. In this work, we ob-serve this limitation with respect to the task ofnative language identification. We find that standard text classifiers which perform well on the test set end up learning topical features which are confounds of the prediction task (e.g., if the input text mentions Sweden, the classifier predicts that the author’s native lan-guage is Swedish). We propose a method that represents the latent topical confounds and a model which “unlearns” confounding features by predicting both the label of the input text and the confound; but we train the two predic-tors adversarially in an alternating fashion to learn a text representation that predicts the cor-rect label but is less prone to using information about the confound. We show that this model generalizes better and learns features that are indicative of the writing style rather than the content.1

1 Introduction

Text classification systems based on neural net-works are biased towards learning frequent spu-rious correlations in the training data that may be confounds in the actual classification task (Leino et al.,2019). A major challenge in building such systems is to discover features that are not just cor-related with the signals in the training data, but are true indicators of these signals, and therefore gen-eralize well.

For example, Kiritchenko and Mohammad (2018) found that sentiment analysis systems im-plicitly overfit to demographic confounds, system-atically amplifying the intensity ratings of posts

1_{The code is available at:} _{https://github.com/} Sachin19/adversarial-classify

written by women. Zhao et al. (2017) showed that visual semantic role labeling models implic-itly capture actions stereotypically associated with men or women (e.g., women are cooking and

men are fixing a faucet), and in cases of higher model uncertainty assign stereotypical labels to actions and objects, thereby amplifying social bi-ases found in the training data.

We focus on the task ofnative language identi-fication(L1ID), which aims at automatically iden-tifying the native language (L1) of an individ-ual based on their language production in a sec-ond language (L2, English in this work). The aim of this task is to discover stylistic features present in the input that are indicative of the au-thor’s L1. However, a model trained to predict L1 is likely to predict that a person is, say, a native Greek speaker, if the texts authored by that person mention Greece, because the training data exhibits such topical correlations (§2).

(2)

Note that these two proposals are task-independent and can be extended to a vast array of text classification tasks where confounding factors are not known a priori. For concreteness, however, we evaluate our approach on the task of L1ID (§5). We experiment with two different datasets: a small corpus of student written essays (Malmasi et al., 2017) and a large and noisy dataset of Reddit posts (Rabinovich et al., 2018). We show that classi-fiers trained on these datasets without any inter-vention learn spurious topical correlations that are not indicative of style, and that our proposed de-confounded classifiers alleviate this problem (§6). We present an analysis of the features discovered after demoting these confounds in§7.

The main contributions of this work are: 1. We introduce a novel method for represent-ing and identifyrepresent-ing variables which are confounds in text classification tasks.

2. We propose a classification model and an al-gorithm aimed at learning textual representations that are invariant to the confounding variable.

3. We introduce a novel approach to adversarial training with multiple adversaries, to alleviate the problem of drifting parameters during alternating classifier–adversary optimization.

4. Finally, we analyze some linguistic features that are not only predictive of the author’s L1 but are also devoid of topical bias.

2 Motivation

We study the general effect oftopicalconfounds in text classification. To motivate the need to demote them, we introduce as a case study the L1ID task, in which the goal is to predict the native language of a writer given their texts in L2.

We begin with a subset of the L2-Reddit cor-pus (Rabinovich et al.,2018), consistsing of Red-dit posts by authors with 23 different L1s, most of them European languages. Some of the posts come from Europe-related forums (e.g. r/Europe, r/AskEurope, r/EuropeanCulture), whereas others are from unrelated forums. We view the latter as out-of-domain data and use them to evaluate the generalization of our models. We use a sub-set of this corpus, with only the 10 most frequent L1s, to guarantee a large enough balanced train-ing set. We remove all the posts with fewer than 50 words and sample the dataset to obtain a bal-anced distribution of labels: from this balbal-anced dataset, we randomly sample 20% of examples

from each class and divide them equally to cre-ate development and test sets. In total, there are around 260,000 examples in the training set and 32,000 examples each in the development, the in-domain test set, and the out-of-in-domain test set.

We trained a standard (non-adversarial) classi-fier, with a bidirectional LSTM encoder followed by two feedforward layers with atanhactivation function and a softmax in the final layer (full ex-perimental details are given in§5.2). We refer to this model as NO-ADV. The results are shown in Table 1. Notice the huge drop in accuracy on the out-of-domain data, which indicates that the model is learning topical features.

To further verify this claim, we used log-odds ratio with Dirichlet prior(Monroe et al.,2008)— a common way to identify words that are sta-tistically overrepresented in a particular popula-tion compared to others—to identify the top-K

words that were most strongly associated with a specific L1 in the training set. (We refer the reader to (Monroe et al., 2008) for the details about the algorithm.) We experimented with

K ∈ {20,50,100,200}. Table 2shows the top-10 words in each class; observe that almost all of these words are geographical (hence, topical) terms that have nothing to do with the L1.

Next, we masked such topical words (by re-placing them with a special token) and evaluate the trained classifier on masked test sets. Ac-curacy (Table 1) degrades on both the in-domain and out-of-domain sets, even when only20words are removed. The drop in accuracy with the out-of-domain dataset is smaller since these data do not include many instances where the presence of topical words would help in identifying the label. These experiments confirm our hypothe-sis that the baseline classifier is primarily learn-ing topical correlations, and motivate the need for a deconfounded classification approach which we describe next.

In-Domain

Out-of-Domain

NO-ADV 52.5 25.7

+MASKTOP-20 32.8 21.0

+MASKTOP-50 31.6 20.4

+MASKTOP-100 30.1 19.7

[image:2.595.320.513.635.730.2]

+MASKTOP-200 28.5 18.7

(3)

3 Representing Confounds

Latent Dirichlet allocation (LDA;Blei et al.,2003) is a probabilistic generative model for discover-ing abstract topics that occur in a collection of documents. Under LDA, each document can be considered a mixture of a small (fixed) number of topics—each represented as a distribution over words—and each word’s presence is assumed to be attributed to one of the document’s topics. More precisely, LDA assigns each document a probability distribution over a fixed number of top-icsK.

LDA topics are known to be poor features for classification (McAuliffe and Blei,2008), indicat-ing that they do not encode all the topical infor-mation. Moreover, they can encode information which is not actually topical and can be a useful L1 marker. Motivated by our case study (§2), we propose a novel method to represent topic distri-butions, based on log-odds scores (Monroe et al., 2008), and compare it to LDA as a baseline.

For each class label y and each word typew, we calculate a log-odds scorelo(w, y) ∈ _R. The higher this score, the stronger the association be-tween the class and the word. As we saw in§2, the highest scored words are mostly topical and hence constitute superficial features which we want the classification model to “unlearn.” We therefore define a distribution which assigns high probabil-ity to a document containing these high scoring words. For a labely ∈ Y and an input document

x=hw₁, . . . , wni, we definep(y|x):

p(y|x)∝p(y)·p(x|y) =p(y)·

n

Y

i=1

p(wi|y)

The above expansion assumes a bag of words rep-resentation. When the dataset is balanced,p(y)is equal for each label and can be omitted. Finally, we definep(wi |y) ∝σ(lo(wi, y)), whereσ(.)is

the sigmoid function, which squashes the log-odds scores (whose values are inR) to the range[0,1].

We normalize the sigmoid values over the vocab-ulary to convert them to a probability distribution. In this distribution, the number of “topics” equals the number of labels,m.

4 Deconfounded Text Classification

We now formalize the task setup and the clas-sification model. We are given N labeled doc-uments in the training set {(x1, y1),(x2, y2),

. . . ,(xN, yN)}, wherexiis a document with label

yi ∈ Y, where m = |Y|is the number of labels.

For each document xi, we represent latent

(top-ical) confounds—domain-specific and superficial document features—as aK-dimensional multino-mial distributionti ∈ {(t1, . . . , tK)|

PK

j=1tj =

1}. In our task, the confounds aretopics, so that eachtjrepresents the proportion of documenti

as-sociated with topicjbut these topics are not given a priori. In this work, the number of topics K, equalsm, but the methods presented in this work are valid for any number of topics.

Our goal is to train a classifierf, parameterized by θ, which learns to accurately predict the tar-get label, while ignoring superficial topical corre-lations present in the training set. That is, for a textxwe wish to predictyˆ=fθ(x)which doesn’t

encode any information aboutt. FollowingGanin et al.(2016),Pryzant et al.(2018), andElazar and Goldberg(2018), we input x to an encoder neu-ral networkh(x;θh) to obtain a hidden

represen-tationhx(see Figure1), followed by two

feedfor-ward networks: (1) c(h(x);θc) to predict the

la-bely; and (2) an adversary networkadv(h(x);θa)

to predict the topics. Departing from prior work which used predefined binary confounds, our ad-versary predicts the topic distribution t. If hx

does not encode any information to predictt, then

c(h(x))will not depend ont. Concretely, we want to optimize the following quantity:

min

c,h

1

N

X

i=1

CE(c(h(xi)), yi)

+CE(adv∗_h(h(xi)),UK)

whereCEdenotes cross-entropy loss, and

adv∗_h = arg min

adv

1

N

X

i=1

CE(adv(h(xi)), ti),

and UK = (_K1, . . . ,_K1). This objective seeks a

representation hx which is maximally predictive

of the class label but not of the topical distribution (ideally, it should output a uniform topic distribu-tion for every input).

4.1 Learning Schedule: Alternating

Optimization of Classifier and Adversary

(4)

[image:4.595.313.517.239.388.2]

English ireland irish british britain russia scotland england states american london brexit Finnish finland finnish finns helsinki swedish finn nordic sweden sauna nokia estonian French french france paris sarkozy macron fillon hollande gaulle hamon marine valls breton German german germany austria merkel refugees asylum germans bavaria austrian berlin also Greece greek greece greeks syriza macedonia athens turkey macedonians fyrom turkish ancient Dutch dutch netherlands amsterdam wilders rotterdam holland rutte belgium bike hague Polish poland polish poles warsaw lithuanian lithuania judges jews ukranians imho tusk Romanian romania romanian romanians moldova bucharest hungarian hungarians transistria Spanish spain catalan spanish catalonia catalans madrid barcelona independence spaniards Swedish sweden swedish swedes stockholm swede malmo danish nordic denmark finland

Table 2: Top words based on log-odds scores for each label in the L2-Reddit dataset.

y

x

𝐭

Text Attribute Predictor (c)

Array of Topic Predictors (adv) LSTM with attention

h

(a) Weights of the LSTM and of the discriminator are fixed. A new topic predictor is trained by minimizing the cross en-tropy of the output and the distribution of the input document over latent topics as described in§3.

y

x

uniform Text Attribute

Predictor (c)

Array of Topic Predictors (adv) LSTM with attention

h

(b) Weights of all the topic predictors are fixed, but the en-coder is trained. The model is jointly minimizing the cross-entropy of the classifier and encouraging the topic predictor toward uniformity.

Figure 1: We alternate between training the topic predictor (left; (1)) and the deconfounded classifier/encoder (right; (2)). Pretraining is not shown in the figure.

quantities:

min

adv

1

N

X

i=1

CE(adv(h(xi)), ti) (1)

min

c,h

1

N

X

i=1

CE(c(h(xi)), yi)

+ CE(adv(h(xi)),UK)

(2)

The training schedule is critical in adversarial setups where the loss has two competing terms (Mescheder et al., 2018; Arjovsky and Bottou, 2017;Roth et al., 2017); here, these terms min-imize classification loss while maximizing the topic prediction loss. Algorithm1details our pro-posed alternating learning procedure.

Inspired by generative adversarial networks (GANs;Goodfellow et al.,2014), the training pro-cedure alternates between training the classifier and the adversary (see Figure1). First (

pretrain-ing), we train the encoder along with the classifier using only classification loss, until convergence. After pretraining,hxhas encoded topical

informa-tion which it uses for classificainforma-tion (as shown in our analysis in§2). Now, we train onlyadv(h(x)) to (accurately) predictt, keeping the parameters of

h(.)fixed. Onceadv(.)is trained, it should be able to successfully extract a topic distribution fromhx

(topic training, see Figure1a). The goal now is to modifyhxin such a way thatadv(hx)produces a

[image:4.595.78.297.241.390.2] [image:4.595.95.290.500.608.2]

(5)

Algorithm 1: Alternating optimization of classifier and adversary.

Result: θh, θc, θa1, . . . , θaT

Randomly initializeθh, θc;

whilenot convergeddo

Sample a minibatch ofbtraining samples; Updateθhandθcusing gradients with

respect to 1_bPb

i=1CE(c(h(xi)), yi).;

end

j = 1;

fornumber of training iterationsT do Randomly initializeθaj;

end

fortstepsdo

Sample a minibatch ofbtraining samples; Fixθhandθc, updateθaj using gradients

with respect to

1 b

Pb

i=1CE(advθ_aj(h(xi)), ti);

end

forcstepsdo

Sample a minibatch ofbtraining samples; Fixθauforu∈R{1, . . . , j}and updateθc

andθhusing gradients with respect to 1

b

Pb

i=1CE(c(h(xi)), yi) +

CE(advθau(h(xi)),UK);

end

j ←−j+ 1;

4.2 Multiple Adversaries

In our experiments, we observe that after every “topic forgetting” stage,adv(.)does end up pro-ducing a uniform distribution, but in the next “topic training” phase,adv(.)is able to reproduce the topical distribution accurately. This is because, during “topic forgetting,” the classifier does not re-ally forget the topics inhx; it just encodes them

in a different way.2 This is a general problem in setups with alternating classifier-adversary op-timization. To solve this issue, we propose using

multiple adversaries, inspired by the “experience replay” approach used in reinforcement learning (O’Neill et al., 2010;Mnih et al., 2015). During theith “topic training” phase, we train a new ad-versary advi (with parameters θai instead of

re-training only one adversary over and over again. In the next “topic forgetting” phase, at each train-ing step we pickadvj at random from the pool of

2_{We observe this in our analysis where the most salient} features encoded after pretraining and topic forgetting phrase are the same.

previously learned adversaries, j ∈_R {1, . . . , i}. By using multiple adversaries, we make it diffi-cult for the classifier to encode topical information anywhere.

5 Experimental Setup

5.1 Datasets

We evaluate our topical confound demotion method on the L1ID task. We show experiments with two datasets where L2 is English: the L2-Reddit dataset described in §2, and TOEFL17, a collection of essays authored by non-native En-glish speakers who apply for academic studies in the US (Malmasi et al.,2017). This corpus reflects eleven L1s: Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. The training data include 11,000 au-thors (1,000 per L1) and the development set has 1,100 essays per L1. We evaluate on the devel-opment set. Each essay is also marked with a prompt ID which was given to the authors to write the essay. There are 8 prompts in total, based on which we construct 8 versions of train and test set. In each version, we remove essays marked with one of the prompts from both the train and the de-velopment sets, and consider the removed essays from the development set an “out-of-domain” test set. We refer to the version where prompt “PK” is out-of-domain as “–PK” in the results (Table 3),

K ∈ {0, . . . ,7}.

5.2 Implementation Details

We tokenized and lowercased all the text using spaCy. Limiting our vocabulary to the most fquent 30,000 words in the training data, we re-placed all out-of-vocabulary words with “UNK.” We encoded each word using a word embed-ding layer (initialized at random and learned) and passed these embeddings to a bidirectional LSTM encoder (one layer for each direction) with atten-tion (h(x);Pryzant et al.,2018). Each LSTM layer had a hidden dimension of 128. We used two lay-ered feed forward networks with atanhactivation function in the middle layer (of size 256), followed by a softmax in the final layer, asc(.)andadv(.).

5.3 Baselines

(6)

Linear classifier with content-independent fea-tures (LR) ReplicatingGoldin et al.(2018), we trained a logistic regression classifier with three types of features: function words, POS trigrams, and sentence length, all of which are reflective of the style of writing. We deliberately avoided using content features (e.g., word frequencies).

Classification with no adversary on masked texts (LO-TOP-K) We mask the top-K words (based on log-odds scores) in both the train and the test sets (as in §2); we train the classification model again without trainingadv(.). After mask-ing the top words, we expect patterns of writmask-ing style (and, therefore, L1) to become more appar-ent.

Adversarial training with gradient reversal (GR-LO) A common method of learning a confound-invariant representations is to use a gra-dient reversal layer (Beutel et al., 2017; Ganin et al.,2016;Pryzant et al.,2018;Elazar and Gold-berg, 2018). The output of the encoder, hx, is

passed through this layer before applyingadv(.). This training setup usually proves too difficult to optimize, and often results in poor performance. That is, even if the performance ofadv(.)is weak,

hx still ends up leaking information about the

confound (Lample et al.,2019;Elazar and Gold-berg, 2018). In the forward pass, this layer acts as identity whereas in the backward pass it multi-plies the gradient values by−λ, essentially revers-ing the gradients before they go into the encoder.

λ controls the intensity of the reversal (we used

λ= 0.2).

LDA topics as confounds (ALT-LDA) We trained LDA on the training set and for each ex-ample in the training set, generated a probability distribution (over50topics), and used it as topical confound with our proposed learning setup, alter-nating classifier-adversary training.

6 Results

6.1 TOEFL17 Dataset

We begin with experiments on the TOEFL17 dataset, where predicting L1 is an easier task due to the lower proficiency of the authors. Table 3 reports the accuracy of our proposed model, de-noted ALT-LO, compared to the logistic regres-sion baseline (LR), and two adversarial baselines: one demotes latent log-odds-based topics via

gra-dient reversal (GR-LO), and another uses our pro-posed novel learning procedure but demotes base-line LDA topics (ALT-LDA). We report both in-domain accuracy and out-of-in-domain results; the latter is obtained by averaging the accuracy of each set “–PK” overK ∈ {0, . . . ,7}.

In-Domain

Out-of-Domain

LR 55.3 50.9

[image:6.595.341.493.158.241.2]

GR-LO 12.7 13.6 ALT-LDA 59.1 50.1 ALT-LO 61.9 60.4

Table 3: Classification accuracy with topic-demoting methods, TOEFL dataset.

Our model strongly outperforms all baselines that demote confounds, in both classification se-tups. We observe in our experiments that gradient reversal is especially unstable and hyperparame-ter sensitive: it has been shown to work well with categorical confounds like domain type or binary gender, but in demoting continuous outputs like a topic distribution, we observe it is not effec-tive. The proposed alternating training with mul-tiple discriminators obtains better results, and re-placing LDA with log-odds-based topics also im-proves both in-domain and (much more substan-tially) out-of-domain predictions, confirming the effectiveness of our proposed innovations.

A vanilla classifier without demoting confounds (denoted in§2 asNO-ADV) yields in-domain and out-of-domain accuracies of 62.0 and 58.3, respec-tively. We would expect that the better generaliza-tion power of our proposed model would come at a price of lower accuracy in-domain. Our goal is to capture the true signals of L1, rather than su-perficial patterns that are more frequent in the data and artificially boost the performance inNO-ADV settings. This is indeed what we observe.

(7)

peo-ple’s demographic attributes: ethnicity, which of-ten correlates with L1, but also gender, race, reli-gion, and others (Hardt et al.,2016;Beutel et al., 2017).

6.2 L2-Reddit Dataset

Next, we experiment with L2-Reddit, a larger and more challenging dataset (since many speakers in the dataset are highly fluent, and the signal of their native language is weaker). The performance of the simple baselines on this dataset is shown in Ta-ble4. The accuracy of the linear classifier is poor (compared to Table1), perhaps because it fails to capture some contextual features learned by the neural network models. WithLO-TOP-20, the per-formance on both test sets improves. It slightly degrades when more words are removed, perhaps because some words indicative of L1 are also re-moved.

In-Domain

Out-of-Domain

LR 21.2 18.5

[image:7.595.321.512.239.333.2]

LO-TOP-20 38.7 21.9 LO-TOP-50 36.4 21.4 LO-TOP-100 35.8 21.2 LO-TOP-200 34.7 20.8

Table 4: Baseline classification accuracy on L2-Reddit.

Finally, we evaluate the impact of our novel training procedure and the quality of our proposed topical confound identification method. We com-pare our proposed solution, denotedALT-LO, with two alternatives, as before, one with a different learning setup (GR-LO) and one with a different confound representation (ALT-LDA). Table5 sum-marizes the results: our proposed learning proce-dure ALT-LO performs better than both the alter-natives. Unsurprisingly, the model trained with gradient reversal (GR-LO) performs particularly poorly; this was our primary motivation to explore better learning techniques.

In-Domain

Out-of-Domain GR-LO 22.5 15.7 ALT-LDA 46.2 21.9 ALT-LO 48.8 22.9

Table 5: Classification accuracy with topic-demoting methods, L2-Reddit dataset.

To further confirm that the ALT-LO model is not learning topical features, we repeat the ex-periment presented in Table 1—masking the top

K topical words (based on log-odds scores) from the test sets, but not retraining the models—now, with our proposed modelALT-LO. Table6shows that in contrast to standard models that do not de-mote topical confounds (as in Table 1), there is less degradation in the performance of ALT-LO. We conjecture that our model is stable to demoting topics because it learns relevant stylistic features, rather than spurious correlations.

In-Domain

Out-of-Domain

ALT-LO 48.8 22.9

+MASKTOP-20 38.7 21.6

+MASKTOP-50 36.2 21.5

+MASKTOP-100 33.5 21.2

[image:7.595.99.267.338.432.2]

+MASKTOP-200 31.9 20.4

Table 6: Accuracy on the L2-Reddit dataset; the pro-posed model (ALT-LO) with different settings of the test sets.

7 Analysis

We present an analysis of what the models are learning, based on words they attend to for clas-sification. We focus on the L2-Reddit dataset.

FollowingPryzant et al.(2018), we generated a lexicon of most attended words by (1) running the model on the test set and saving the attention score for each word; and (2) for each word, computing its average attentional score and selecting the

top-kwords based on this score.

[image:7.595.106.258.660.729.2]

(8)

the correct usage ofthe in particular is quite hard for learners to master. These challenges are evi-dent from the most indicative words of our model. Observe also that the LO-TOP-50model is some-where in the middle: it includes some proper nouns (including geographical terms such aseu or us) but also several function words. A more de-tailed analysis of these observations is left for fu-ture work.

Recently, there has been a debate on whether attention can be used to explain model decisions (Serrano and Smith,2019;Jain and Wallace,2019; Wiegreffe and Pinter,2019), we thus present addi-tional analysis of our proposed method based on saliency maps (Ding et al.,2019). Saliency maps have been shown to better capture word align-ment than attention probabilities in neural ma-chine translation. This method is based on com-puting the gradient of the probability of the pre-dicted label with respect to each word in the input text and normalizing the gradient to obtain proba-bilities. We use saliency maps to generate lexicons similar to the ones generated using attention. As shown in table8, the top indicative words for base-line andLO-TOP-50 follow a similar pattern as the ones obtained with attention scores. In line with results in Table 7, salient words for ALT-LO are determiners and prepositions. However, saliency maps also reveal that our proposed approach still attends to some geographical terms that were not demoted by our classifier.

8 Related Work

Controlling for confounds in text Controlling for confounds is an active field of research, es-pecially in the medical domain, where the com-mon solution is to do random trials or propensity score matching (Rosenbaum and Rubin, 1985). Paul(2017) tackled the problem of learning causal associations between word features and class la-bels using propensity matching for the task of sentiment analysis. This method is not scal-able to large text datasets as it involves training a logistic regression model for every word type. Tan et al. (2014) built models to estimate the number of retweets of Twitter messages and ad-dressed confounding factors by matching tweets of the same author and topic. Reis and Cu-lotta(2018) proposed a statistical technique called Pearl’s back-door adjustment for text classification (Pearl,2009). All these works focused on a

bag-of-words model with lexical features only.

Adversarial training in text Much recent work focuses on learning textual representations that are invariant to selective properties of the text. This work used domain adaptation and transfer learn-ing (Ganin et al., 2016; Tzeng et al., 2014; Xie et al.,2017), either to remove sensitive attributes such as demographic information (Li et al.,2018; Elazar and Goldberg, 2018;Beutel et al., 2017), or to understand costumer behavior for social sci-ence applications (Pryzant et al.,2018). Most of the work in this area, however, focuses on cases where these confounds are known in advance and their values are given along with the training data.

Native language identification The L1ID task was introduced by Koppel et al. (2005), who worked on the International Corpus of Learner English (Granger, 2003). The same experimen-tal setup was adopted by several other authors (Tsur and Rappoport,2007;Wong and Dras,2009, 2011). Since the release of nonnativeTOEFL es-says by the Educational Testing Service (Blan-chard et al.,2013), the task gained popularity and this dataset has been used for two L1ID Shared Tasks (Tetreault et al.,2013;Malmasi et al.,2017). Malmasi and Dras(2017) report that the state of the art is a linear classifier with charactern-grams and lexical and morphosyntactic features.

The best accuracy under cross-validation on the TOEFL17 dataset, which includes 11 native guages (with a rather diverse distribution of lan-guage families), was 85.2%.

The above works all identify the L1 oflearners. Identifying the native language of advanced, flu-ent speakers is a much harder task. Goldin et al. (2018) addressed this task, using the L2-Reddit dataset with as many as 23 different L1s, all of them European and many which are typologically close, which makes the task even harder. They ex-perimented with a variety of features, using logis-tic regression as the classifier, and achieved results as high as 69% accuracy with cross-validation; however, when testing their classifier outside the domain it was trained on (Reddit forums focusing on European issues), accuracy dropped to 36%.

9 Conclusion

(9)

NO-ADV sweden france greece finland poland spain greek germany french eu romania polish dutch german spanish swedish netherlands finnish

LO-TOP-50 eu ’s ’re ’m ’ & uk us because ’ve am its nt english these usa nt here ’ll especially correct pis de within

[image:9.595.112.487.179.265.2]

ALT-LO the in to of that a i is and ’t as from with by ? on but & they are about at because like was would have you

Table 7: The highest scoring words in lexicons generated using attention scores.

NO-ADV poland greek romania greece france spain french sweden finland polish dutch spanish netherlands finnish german

LO-TOP-50 on ’re even ’d up less things ’ll doesn living majority sense talk level ’ve rights took number north

ALT-LO the of to i a in greece romania france finland that for is french & you ’t finnish

Table 8: The highest scoring words in lexicons generated using saliency maps.

with alternating optimization to learn textual rep-resentations which are invariant of confounds. We evaluated the proposed solution on the task of na-tive language identification, and showed that it learns to make predictions using stylistic features, rather than focus on topical information.

The learning procedure we presented is general and applicable to other tasks that require learn-ing invariant representations with respect to some attribute of text (some of which are discussed in

§8). We plan to evaluate our proposed solution on other tasks where topics can be latent confounds, like predicting gender bias (Voigt et al.,2018). We leave this exploration for future work.

Acknowledgments

The authors acknowledge helpful input from the anonymous reviewers. This work was sup-ported in part by NSF grants IIS-1812327 and IIS-1813153, by grant no. 2017699 from the United States-Israel Binational Science Founda-tion (BSF), and by grant no. LU 856/13-1 from the Deutsche Forschungsgemeinschaft. Finally, the authors also thank Anjalie Field, Biswajit Paria, Ella Rabinovich, and Gili Goldin for helpful dis-cussions.

References

Martin Arjovsky and L´eon Bottou. 2017. Towards principled methods for training generative adversar-ial networks. InProc. ICLR.

Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H Chi. 2017. Data decisions and theoretical implications when

ad-versarially learning fair representations. In Proc. FATML.

Daniel Blanchard, Joel Tetreault, Derrick Hig-gins, Aoife Cahill, and Martin Chodorow. 2013. TOEFL11: A corpus of non-native English. ETS Research Report Series.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. JMLR.

Shuoyang Ding, Hainan Xu, and Philipp Koehn. 2019. Saliency-driven word alignment interpretation for neural machine translation. In Proceedings of the Fourth Conference on Machine Translation.

Yanai Elazar and Yoav Goldberg. 2018. Adversarial removal of demographic attributes from text data. In Proc. EMNLP.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Franc¸ois Lavi-olette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural net-works. JMLR.

Gili Goldin, Ella Rabinovich, and Shuly Wintner. 2018. Native language identification with user generated content. InProc. EMNLP.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative ad-versarial nets. InNeurIPS.

Sylviane Granger. 2003. The International Corpus of Learner English: a new resource for foreign lan-guage learning and teaching and second lanlan-guage acquisition research.Tesol Quarterly.

(10)

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. InProc. NAACL.

Svetlana Kiritchenko and Saif M. Mohammad. 2018. Examining gender and race bias in two hundred sen-timent analysis systems. InProc. NAACL.

Moshe Koppel, Jonathan Schler, and Kfir Zigdon. 2005. Automatically determining an anonymous au-thor’s native language. Intelligence and Security In-formatics.

Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc’Aurelio Ranzato, and Y-Lan Boureau. 2019. Multiple-attribute text rewrit-ing. InProc. ICLR.

Klas Leino, Matt Fredrikson, Emily Black, Shayak Sen, and Anupam Datta. 2019. Feature-wise bias amplification. InProc. ICLR.

Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018. Towards robust and privacy-preserving text repre-sentations. InProc. ACL.

Shervin Malmasi and Mark Dras. 2017. Native lan-guage identification using stacked generalization. ArXiv:1703.06541.

Shervin Malmasi, Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, and Yao Qian. 2017. A report on the 2017 native language identification shared task. In Proc. Workshop on Innovative Use of NLP for Build-ing Educational Applications.

Jon D. McAuliffe and David M. Blei. 2008. Supervised topic models. InNeurIPS.

Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. 2018. Which training methods for GANs do actually converge? InProc. ICML.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidje-land, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Na-ture.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin’ words: Lexical feature se-lection and evaluation for identifying the content of political conflict. Political Analysis.

Joseph O’Neill, Barty Pleydell-Bouverie, David Dupret, and Jozsef Csicsvari. 2010. Play it again: reactivation of waking experience and memory. Trends in Neurosciences.

Michael J. Paul. 2017. Feature selection as causal in-ference: Experiments with text classification. In Proc. CoNLL.

Judea Pearl. 2009. Causality: Models, Reasoning and Inference. Cambridge University Press.

Reid Pryzant, Kelly Wang, Dan Jurafsky, and Stefan Wager. 2018. Deconfounded lexicon induction for interpretable social science. InProc. NAACL.

Ella Rabinovich, Yulia Tsvetkov, and Shuly Wintner. 2018. Native language cognate effects on second language lexical choice.TACL.

Virgile Landeiro Dos Reis and Aron Culotta. 2018. Robust text classification under confounding shift. Journal of Artificial Intelligence Research.

Paul R. Rosenbaum and Donald B. Rubin. 1985. Con-structing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician.

Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. 2017. Stabilizing training of generative adversarial networks through regulariza-tion. InNeurIPS.

Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? InProc. ACL.

Chenhao Tan, Lillian Lee, and Bo Pang. 2014. The ef-fect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter. In Proc. ACL.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. 2013. A report on the first native language identi-fication shared task. InProc. Workshop on Building Educational Applications Using NLP.

Oren Tsur and Ari Rappoport. 2007. Using classifier features for studying the effect of native language on the choice of written second language words. In Proc. Workshop on Cognitive Aspects of Computa-tional Language Acquisition.

Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. 2014. Deep do-main confusion: Maximizing for dodo-main invariance. ArXiv:1412.3474.

Rob Voigt, David Jurgens, Vinodkumar Prabhakaran, Dan Jurafsky, and Yulia Tsvetkov. 2018. RtGender: A corpus for studying differential responses to gen-der. InProc. LREC.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. InProc. EMNLP.

Sze-Meng Jojo Wong and Mark Dras. 2009. Con-trastive analysis and native language identification. InProc. Australasian Language Technology Associ-ation Workshop.

Sze-Meng Jojo Wong and Mark Dras. 2011. Exploit-ing parse structures for native language identifica-tion. InProc. EMNLP.

(11)