Serial Recall Effects in Neural Language Modeling
Hassan Hajipoor¹, Hadi Amiri², Maseud Rahgozar¹, Farhad Oroumchian³
¹School of Electrical and Computer Engineering, University of Tehran, Iran
2Department of Biomedical Informatics, Harvard University, Boston, USA
3University of Wollongong in Dubai, Knowledge Village, Dubai, UAE
[email protected],[email protected],
[email protected],[email protected]
Abstract
Serial recall experiments study the ability of humans to recall words in the order in which they occurred. The following serial recall effects are generally investigated in studies with humans: word length and frequency, primacy and recency, semantic confusion, repetition, and transposition effects. In this research, we investigate LSTM language models in the context of these serial recall effects. Our work provides a framework to better understand and analyze neural language models, and opens a new window to develop accurate language models.
1 Introduction
The goal of language modeling is to estimate the probability of a sequence of words in natural language, typically allowing one to make probabilistic predictions of the next word given preceding ones (Bahl et al., 1983; Berger et al., 1996). For several years now, Long Short-Term Memory (LSTM) (Graves, 2013) language models have demonstrated state-of-the-art performance (Melis et al., 2018; Merity et al., 2018a; Sundermeyer et al., 2012). Recent studies have begun to shed light on the information encoded by LSTM networks. These models can effectively use distant history (about 200 tokens of context) and are sensitive to word order, replacement, or removal (Khandelwal et al., 2018); can learn function words much better than content words (Ford et al., 2018); can remember sentence lengths, word identity, and word order (Adi et al., 2017); and can capture syntactic structures (Kuncoro et al., 2018) such as subject-verb agreement (Linzen et al., 2016). These characteristics are often attributed to LSTM's ability to overcome the curse of dimensionality, by associating a distributed feature vector with each word (Hinton et al., 1986; Neubig, 2017), and to model long-range dependencies in faraway context (Khandelwal et al., 2018).
The goal of our research is to complement the prior work and provide a richer understanding of how LSTM language models use prior linguistic context. Inspired by investigations in cognitive psychology about serial recall in humans (Avons et al., 1994; Henson, 1998; Polišenská et al., 2015), where participants are asked to recall a sequence of items in the order in which they were presented, we investigate how word length or frequency (word-frequency effect), word position (primacy, recency, and transposition effects), word similarity (semantic confusion effect), and word repetition (repetition effect) influence learning in LSTM language models. Our investigation provides a framework to better understand and analyze language models at a considerably finer-grained level than previous studies, and opens a new window to develop more accurate language models.
We find that LSTM language models (a) can learn frequent/shorter words considerably better than infrequent/longer ones, (b) can learn recent words in sequences better than words at earlier positions,¹ (c) have a tendency to predict words that are semantically similar to target words, indicating that these networks tend to group semantically similar words while suggesting one specific word as the target based on prior context, (d) predict as output words that are observed in prior context, i.e. repeat words from prior context, and (e) may transpose (swap adjacent) words in the output depending on word syntactic function.
¹ Rats and humans recall the first and last items of sequences best and the middle ones worst (Bolhuis and Van Kampen, 1988).
2 Serial Recall Effects
Language models estimate the probability of a sequence as $P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1})$, where $w_i$ is the $i$-th word and $w_i^j$ denotes the sub-sequence from $w_i$ to $w_j$. These models minimize their prediction error against words in context during training, using e.g. the negative log-likelihood loss: $L = -\frac{1}{n} \sum_{i=1}^{n} \log P(w_i \mid w_1^{i-1})$. In this paper, we denote the loss of a model against a sequence $s$ and against a word $w_i \in s$ by $L(s)$ and $L(s)[w_i]$ respectively.
We use the LSTM language model developed in (Merity et al., 2018b) for our experiments. Given a sequence of words $w_1^{i-1}$ as context, this model predicts the next word in the sequence. We refer to $w_i$ and $\hat{w}_i$ as the target and predicted words respectively; given the global vocabulary $V$, $\hat{w}_i = \arg\max_{w_j \in V} \Pr(w_j \mid w_1^{i-1})$. We study this LSTM in the context of serial recall effects.
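To make these quantities concrete, the following sketch shows one way to extract the per-token losses $L(s)[w_i]$ and the predicted words $\hat{w}_i$ from a trained next-word prediction model. It is a minimal PyTorch illustration under the assumption that a hypothetical `model` maps a (1, T-1) tensor of token ids to next-word logits of shape (1, T-1, |V|); it mirrors, but is not identical to, the AWD-LSTM interface of Merity et al. (2018b).

```python
import torch
import torch.nn.functional as F

def per_token_losses_and_predictions(model, token_ids):
    """Return L(s)[w_i] = -log P(w_i | w_1^{i-1}) and argmax predictions.

    `token_ids` is a (1, T) LongTensor with the word ids of sequence s.
    `model` is assumed (not guaranteed) to return logits of shape
    (1, T-1, |V|) for the (1, T-1) input slice.
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    with torch.no_grad():
        logits = model(inputs)                       # (1, T-1, |V|)
    log_probs = F.log_softmax(logits, dim=-1)
    # Loss of each target word given its left context.
    losses = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Predicted word at each position: argmax over the vocabulary.
    predictions = log_probs.argmax(dim=-1)
    return losses.squeeze(0), predictions.squeeze(0), targets.squeeze(0)
```

The serial recall metrics below are simple aggregations of these per-token losses and predictions.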
2.1 Word-Frequency Effect
What is the effect of word frequency/length² on the performance of LSTM language models? For this effect, we report the average loss for each word frequency as follows:

$L^k_{WF} = \frac{1}{|S_k|} \sum_{s \in S_k,\, w_i \in s} L(s)[w_i]$,  (1)

where $S_k$ is the set of sequences that contain at least one target word with term frequency $k$, and $L^k_{WF}$ is the overall loss for target words of frequency $k$, which sheds light on the expected frequency of words required for accurate language modeling.
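As an illustration, Eq. (1) can be computed by bucketing the per-token losses by the training-set frequency of each target word. The sketch below assumes hypothetical inputs `sequences` (lists of word ids), `losses` (the corresponding $L(s)[w_i]$ values), and `train_freq` (term frequencies from the training set); it is one plausible reading of the equation, not the authors' exact script.

```python
from collections import defaultdict

def word_frequency_effect(sequences, losses, train_freq):
    """Average loss per term frequency k, following Eq. (1)."""
    loss_sum, seqs_with_k = defaultdict(float), defaultdict(set)
    for s_idx, seq in enumerate(sequences):
        for i, w in enumerate(seq):
            k = train_freq[w]
            loss_sum[k] += losses[s_idx][i]
            seqs_with_k[k].add(s_idx)   # S_k: sequences with a frequency-k target
    return {k: loss_sum[k] / len(seqs_with_k[k]) for k in loss_sum}
```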
2.2 Primacy and Recency Effect
What is the effect of word position on the performance of LSTM language models? To analyze this effect, we compute the average loss of the network with respect to the position of target words as follows:

$L^i_{PR} = \frac{1}{Z} \sum_{s} L(s)[w_i]$,  (2)

where $w_i$ is the target word at position $i$ of sequence $s$, $Z$ is the number of sequences (used as a normalization factor), and $L^i_{PR}$ is the average loss at position $i$. This effect sheds light on network performance at specific positions in text, which can help rationalize the need for new language modeling architectures, such as bidirectional LSTMs (Schuster and Paliwal, 1997), or the order in which input data should be processed (Sutskever et al., 2014).
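A minimal sketch of Eq. (2), assuming a hypothetical `losses` array of shape (number of sequences, sequence length) whose entry [s, i] is $L(s)[w_i]$:

```python
import numpy as np

def primacy_recency_effect(losses):
    """Average loss at each target position i, following Eq. (2)."""
    losses = np.asarray(losses)     # (num_sequences Z, seq_len)
    return losses.mean(axis=0)      # L^i_PR for every position i
```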
² Note that word length and frequency are strongly correlated: more frequent words tend to be shorter (Bell et al., 2009).
2.3 Semantic Confusion Effect
Are predicted words semantically similar to the target ones in the case of incorrect predictions? For this analysis, we report the average semantic similarity between target ($w_i$) and predicted ($\hat{w}_i$) words as follows:

$SC = \frac{1}{Z} \sum_{s,\, w_i \in s,\, w_i \neq \hat{w}_i} sim(w_i, \hat{w}_i)$,  (3)

where the function $sim(\cdot,\cdot)$ computes word similarity either through WordNet (Miller, 1998) or through the cosine similarity of the corresponding embeddings of its arguments. This effect sheds light on how effective LSTMs are in disentangling semantically similar concepts. It also gives us a powerful metric to compare networks semantically, especially in the case of equal loss/perplexity.
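The embedding-based variant of Eq. (3) can be sketched as below, assuming `emb` is the model's word-embedding matrix and `targets`/`predictions` are id tensors collected as in the earlier sketch; the WordNet variant would swap a WordNet-based similarity in place of the cosine similarity.

```python
import torch
import torch.nn.functional as F

def semantic_confusion(emb, targets, predictions):
    """Average cosine similarity between target and predicted embeddings
    over positions where the prediction is wrong, following Eq. (3)."""
    wrong = predictions != targets            # keep only w_i != w-hat_i
    t_vec = emb[targets[wrong]]
    p_vec = emb[predictions[wrong]]
    return F.cosine_similarity(t_vec, p_vec, dim=-1).mean().item()
```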
2.4 Repetition Effect
This effect refers to the prediction of a word that already exists in context, i.e. at an earlier position in the sequence. Here, for each target $w_n$, we compute the probability that, instead of $w_n$, any of the words $w_1^{n-1}$ is predicted as output, and report the average across all samples:

$\Pr(RE) = \frac{1}{Z} \sum_{s} \Pr(w_i \in w_1^{n-1},\, w_i \neq w_n \mid w_1^{n-1})$.  (4)

This effect sheds light on the extent to which the network repeats observed words as possible responses. Given that words rarely repeat in sentences, the above metric can be used as a good regularizer for language modeling.
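One way to estimate Eq. (4) from the model's next-word distributions is to sum, at each target position, the probability mass assigned to distinct words already seen in the prior context, excluding the true target. The sketch below assumes `probs` is a (T-1, |V|) tensor of next-word distributions for a single sequence, aligned as in the first sketch, and `token_ids` a 1-D tensor of its word ids; it computes one sequence's contribution to the reported average.

```python
def repetition_probability(probs, token_ids):
    """Probability that some earlier context word is predicted instead of
    the target, averaged over target positions (cf. Eq. 4)."""
    scores = []
    for i in range(1, token_ids.size(0)):
        context = token_ids[:i].unique()
        context = context[context != token_ids[i]]   # exclude the target itself
        scores.append(probs[i - 1, context].sum().item())
    return sum(scores) / len(scores)
```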
2.5 Transposition Effect
This effect refers to word prediction in transposed positions, i.e. the case where a word pair $w_i^{i+1}$ in the original sequence is more likely to be predicted by the network as $w_{i+1} w_i$ in the output. Here, for each pair $w_i^{i+1}$ in the target sequence, we count the number of times in which $w_{i+1}$ is more probable than $w_i$ to be predicted at position $i$, and $w_i$ is more probable than $w_{i+1}$ to be predicted at position $i+1$. We report the average number of transposition occurrences over all samples at each word position. This effect sheds light on how the network learns nearby grammatical orders such as conjunction and adjective order.
3 Experiments
In this section we investigate serial recall effects on a standard LSTM language model (Merity et al., 2018b) with the default parameter set reported in its original paper. For all experiments, we report the average of results from three models trained with different random seeds. Unless otherwise mentioned, we use sequences of length 100 to visualize results for readability; however, we test sequence lengths in the range 5 to 200.³ Our code is publicly available.⁴
In this study, we report network performance in terms of loss on the development set, which is equivalent to relative performance in terms of perplexity.
3.1 Dataset and Settings
Our experiments are conducted on two benchmark language modeling datasets: Penn Treebank (PTB) and WikiText-2 (WT2). The PTB dataset (Marcus et al., 1993; Mikolov et al., 2010) contains Wall Street Journal articles with a vocabulary of 10,000 words. The WT2 dataset, introduced in (Merity et al., 2017), is composed of 720 verified Wikipedia articles. It contains 2M, 217K, and 245K tokens in the train, development, and test sets respectively, and its vocabulary size is 33,278; WT2 is over two times larger than PTB. For experiments at the part-of-speech (POS) tag level, we use the annotated versions of these datasets publicly provided by Khandelwal et al. (2018). Given the part of speech of each word, content words are nouns (NN), verbs (VB), adjectives (JJ), and adverbs (RB), and all other words are considered function words. More details about the datasets at the POS level are presented in the dataset statistics table below.
3.2 Results
Short words are recalled more accurately than longer words. We observe a clear word-length effect.
³ The number of samples is equal across different sequence lengths.
⁴ https://github.com/hassan_hajipoor/
Dataset statistics at the POS level:

         PTB                WT2
         #tokens  #words    #tokens  #words
NN       21k      3k        55k      8k
VB       10k      1.7k      25k      3k
RB       2.6k     286       5k       0.5k
JJ       4.9k     0.9k      12k      1.8k
Func     34.9k    0.2k      118k     0.9k
Increasing the target word length leads to an increase in loss. Figures 1a and 1b show this effect at the POS level for both PTB and WT2: there is a strong correlation between word length and loss for content words, while function words are much less affected. The reason lies in the correlation between word frequency and loss. Figure 1c reveals a strong negative correlation between word frequency and loss: as word frequency increases, loss decreases. From another viewpoint, Figure 1d shows that the frequency of content words decreases as their length increases, whereas Figure 1e shows that the frequency of function words is independent of their length. The fact that function words form a closed class, which cannot be extended with shorter words, can explain this. Bell et al. (2009) also report that content words are shorter when more frequent, while function words are not. Therefore, the word-length effect is strong for content words but weak for function words.
In addition, from Figure 1c and the fact that the average frequency of function words is much higher than that of content words (10k compared to 58 in the PTB dataset), it is clear why the loss of function words is significantly lower than that of content words.
Tokens later in the sequence have better recall. We examine the primacy and recency effects by comparing loss values at different positions in the sequence. Figure 2 shows that the loss decreases (i.e., the network has better recall) as we approach the end of the sequence.
Recalling words in transposed positions is fairly rare. To examine the transposition effect, we count how many times words are recalled in transposed positions. Table 1 shows the number of transpositions normalized over sequence length. It reveals that the number of transpositions does not increase as the sequence length increases. Moreover, transpositions occur rarely: for example, in sequences of length 100, of the 99 candidate pairs only 0.72 are transposed on average. Figure 3 shows the number of transposition occurrences at each position of this sequence. It shows that transpositions occur mostly at the beginning of the sequence.
Figure 1: (a) Dataset statistics; (b) word-frequency effect: the network learns frequent words better than infrequent ones; (c) primacy and recency effect: loss decreases for target words that appear at the end of sequences.
3 Experiments
Datasets. We use two benchmark language modeling datasets: Penn Treebank (PTB) (Marcus et al., 1993; Mikolov et al., 2010) and WikiText-2 (WT2) (Merity et al., 2017). PTB and WT2 have vocabulary sizes of 10K and 33K respectively. We use the POS-tagged versions of these datasets provided by Khandelwal et al. (2018), and treat nouns (NN), verbs (VB), adjectives (JJ), and adverbs (RB) as content words and all other word classes as function words; see details in Figure 1a.
Settings. We set the LSTM's parameters as suggested in (Merity et al., 2018b) for PTB and follow its suggested parameter tuning procedure for WT2. For both datasets, we set the context size to n = 100, chosen from {5, 20, 50, 100, 200} using validation data; note that the number of samples is equal across different sequence lengths.
3.1 Results
We report LSTM performance in terms of prediction loss on the development sets for all experiments.
Word-Frequency Effect: More frequent target words are predicted (learned) more accurately than less frequent ones. Figure 1b shows a strong inverse correlation between word frequency and LSTM prediction loss. This is expected, as neural models learn better with more data. In addition, although the overall loss of function words is considerably lower than that of content words (because of their overall higher frequency), Figure 1b shows that, for the same word frequency, content words are learned better than function words.
Primacy and Recency Effects: Target words that appear later in sequences are predicted considerably better than those at earlier positions. Figure 1c shows that prediction loss considerably decreases for target words that appear toward the end of the sequences. The results are consistent across both datasets.
Figure 2: We report semantic similarity between predicted and target words, between random and target words (lower bound), and between nearest-neighbor and target words (upper bound).
This effect can explain why bidirectional LSTMs, which read input from opposite directions, usually work better in NLP applications such as machine translation (Firat et al., 2016; Domhan, 2018). A remaining question worth investigating is whether bidirectional LSTMs learn the first and last few words of sequences better than those in the middle and, if so, how we can make these models more robust to word position.
Semantic Confusion Effect: There is a significant tendency to predict words that resemble (are semantically similar to) target words. Figure 2 reports the semantic similarity between predicted and target words as a function of prediction loss.
Figure 3: Semantic confusion effect. (a) The chance of the target's neighbors equals the chance of other words. (b) When the neighbor-group size is set to 150, the probability of the target's neighboring group becomes equal to the probability of the target; a neighbor-group size of 600 leads to an equal chance for the predicted token.
Figure 4: (a, b) Repetition effect across POS tag classes in PTB and WT2; (c) repetition probability decreases as a function of word occurrences in prior context.
Semantic confusion remains limited for smaller loss values; however, confusion consistently increases as prediction loss increases. The upper-bound similarity (obtained by treating the nearest neighbor of each target word as the predicted word) indicates that there exist better candidates which the LSTM fails to predict. Our further analyses show that the LSTM has a tendency to group semantically similar words and then suggest one of them. We consider the most similar words as a group of neighbors and examine how the network assigns probabilities to them as compared to other words. Figure 3a shows that the chance of the target's neighbors (group size 150) is equal to the chance of the remaining words (size 9,849). As Figure 3a also shows, the neighbors of predicted words (group size 600) carry an equal chance compared to the remaining words (size 9,399). To find these thresholds, we gradually increase the number of neighbors and track how the probability of the neighboring group approaches that of the target. As shown in Figure 3b, when the size of the neighboring group is set to 150, these probabilities become equal. The same procedure is repeated to find the appropriate size for the neighbors of predicted words.
Repetition Effect: Repetition probability of function words is significantly higher than that of content words. We report the repetition effect, see Eq. (4), at the POS tag level (where the predicted word should have the same POS tag as a target word in prior context). As Figures 4a and 4b show, function words have a higher repetition probability than content words. This is because function words are more frequent, and the average distance between them (i.e., the number of intervening words) is considerably smaller than for other POS tags (e.g., 2.1 words vs. 28.9 and 12.4 words for RB and JJ respectively). We also find that repetition probability decreases as a function of the number of word occurrences in prior context, see Figure 4c. This is because words (especially NNs and VBs) are often self-contained and their occurrence in prior context helps the LSTM avoid "repeating" them. In addition, we find that function words repeat more frequently than other types, and that repetition among NNs and VBs is higher than for other POS tag pairs; Table 1 shows the confusion matrix for repetition across POS tag classes. Perhaps this could explain the recent language modeling improvement obtained in (Ford et al., 2018) by developing separate yet sequentially connected network architectures with respect to POS tag class.
Figure 5: Effective context to learn syntax.
        NN     VB     JJ     RB     Func
NN      0.07   0.00   0.01   0.00   0.24
VB      0.01   0.09   0.00   0.01   0.20
JJ      0.03   0.01   0.01   0.00   0.27
RB      0.01   0.02   0.00   0.03   0.31
Func    0.01   0.01   0.00   0.00   0.52

Table 1: Confusion matrix for the repetition effect. Rows and columns show the POS tag classes of target and predicted words respectively.
Two factors help explain the repetition of a POS class. First, the average distance between consecutive tokens of that POS class; Table 2 reports the corresponding values from the training set. Second, the accuracy of the network in predicting POS classes, shown in Figure 5 and also reported in (Khandelwal et al., 2018). From these, the repetition probability of function words is expected to be higher than that of content words.
        NN    VB    JJ     RB     Func
PTB     3.6   7.1   12.4   28.9   2.1
WT2     3.6   8.5   15.8   38.5   1.9

Table 2: Average distance between tokens of each POS class.
Transposition Effect: Transpositions occur more frequently at the beginning of sequences and rarely at the end. Figure 6 shows the average number of transpositions at each word position across datasets. This result is meaningful because mis-predictions occur more frequently at earlier positions (see the results for the primacy and recency effects). In addition, transpositions are rare at later positions because more context information helps the LSTM make accurate (transposition-free) predictions. Table 3 shows the percentage of transpositions across POS tag classes on PTB. The results show that the LSTM mainly transposes 'RB NN' word pairs into 'NN RB.' In future work, we will conduct further analyses to understand the reason.
The findings presented in this paper provide insight into how LSTMs model context. This information can be useful for improving language models.
Figure 6: Transposition effect; transpositions occur more at the beginning of sequences.
        NN       VB       JJ       RB       Func
NN      0.025%   0.006%   0.032%   0.015%   0.001%
VB      0.011%   0.001%   0.004%   0.002%   0.001%
JJ      0.021%   0.054%   0.045%   0.036%   0.002%
RB      0.073%   0.017%   0.017%   0.023%   0.012%
Func    0.002%   0.002%   0.002%   0.001%   0.006%

Table 3: Confusion matrix for transpositions in PTB at the POS tag level. The LSTM transposes 'RB NN' and 'JJ VB|JJ|RB' more than other pairs.
For instance, the discovery that some word types are repeated in predictions more than others can help regularize neural language models by making them adaptive to different word types.
4 Conclusion
We investigate LSTM language models in the context of serial recall indicators. We find that frequent target words and target words that appear later in sequences are predicted more accurately, predictions often resemble (are semantically similar to) target words, function words from prior context are more likely to be predicted as target words, and word-pair transpositions occur more frequently at the beginning of sequences.
Acknowledgments
We sincerely thank Nazanin Dehghani for her constructive collaboration during the development of this paper and the anonymous reviewers for their insightful comments and effective feedback.
References
Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. International Conference on Learning Representations (ICLR).

SE Avons, KL Wright, and Kristen Pammer. 1994. The Quarterly Journal of Experimental Psychology Section A, 47(1):207–231.

Lalit R Bahl, Frederick Jelinek, and Robert L Mercer. 1983. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, (2):179–190.

Alan Bell, Jason M Brenier, Michelle Gregory, Cynthia Girand, and Dan Jurafsky. 2009. Predictability effects on durations of content and function words in conversational English. Journal of Memory and Language, 60(1):92–111.

Adam L Berger, Vincent J Della Pietra, and Stephen A Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

Johan J Bolhuis and Hendrik S Van Kampen. 1988. Serial position curves in spatial memory of rats: Primacy and recency effects. The Quarterly Journal of Experimental Psychology, 40(2):135–149.

Tobias Domhan. 2018. How much attention do you need? A granular analysis of neural machine translation architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1799–1808.

H Ebbinghaus. 1913. Memory: A contribution to experimental psychology (H. A. Ruger & C. E. Bussenius, Trans.). New York, NY, US.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of NAACL-HLT, pages 866–875.

Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, and George Dahl. 2018. The importance of generation order in language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2942–2946.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Richard NA Henson. 1998. Short-term memory for serial order: The start-end model. Cognitive Psychology, 36(2):73–137.

Geoffrey E Hinton et al. 1986. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, volume 1, page 12. Amherst, MA.

Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 284–294. Association for Computational Linguistics.

Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1426–1436.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Gábor Melis, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. International Conference on Learning Representations.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018a. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018b. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In International Conference on Learning Representations.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

George Miller. 1998. WordNet: An electronic lexical database. MIT Press.

Graham Neubig. 2017. Neural machine translation and sequence-to-sequence models: A tutorial. arXiv preprint arXiv:1703.01619.

Kamila Polišenská, Shula Chiat, and Penny Roy. 2015. Sentence repetition: What does the task measure? International Journal of Language and Communication Disorders, 50(1):106–118.

Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.
Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.