Discussion for Studies 3 and 4 - Understanding Semantic Implicit Learning through distributional linguistic patterns

The results obtained in Studies 3 and 4 show not only that DSMs are able to capture the more general structure of semantic memory but also that their predictions extend to languages other than English. In Study 3, using the best performing models of Study 1, we replicated the finding reported elsewhere in the literature (Buchanan et al., 2001; Chen & Mirman, 2012; Mirman & Magnuson, 2008; Shaoul & Westbury, 2010; Siakaluk et al., 2003) that semantic neighbourhood density as computed by DSMs is a significant predictor of LDRTs. Study 4 elaborates on these results, postulating that if such effects are found in English, then they should manifest in other languages too. While, due to practical considerations, the coverage of the languages examined was quite limited, we find similar effects for SND in Chinese, Malay, French, and Dutch.
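To make the SND measure concrete, the following minimal sketch computes a neighbourhood density as the mean cosine similarity between a word and its k nearest neighbours. This operationalisation, the function names, the value of k, and the toy vectors are all assumptions made for illustration; the models in Studies 3 and 4 may define and parameterise SND differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neighbourhood_density(word, vectors, k=2):
    """Mean cosine similarity between `word` and its k nearest neighbours.

    Assumption: SND is taken to be the average similarity to the k closest
    words in the semantic space; other definitions are possible.
    """
    sims = sorted(
        (cosine(vectors[word], vec)
         for other, vec in vectors.items() if other != word),
        reverse=True,
    )
    top = sims[:k]
    return sum(top) / len(top)


# Toy two-dimensional vectors, invented purely for illustration.
toy_vectors = {
    "cat": [1.0, 0.0],
    "dog": [1.0, 0.0],
    "car": [0.0, 1.0],
    "bus": [0.0, 1.0],
}

dense = neighbourhood_density("cat", toy_vectors, k=1)
sparse = neighbourhood_density("cat", toy_vectors, k=2)
```

In this toy space "cat" has one identical neighbour and two orthogonal ones, so its density drops as k grows. With real DSM vectors the same function would return a graded value per word that can be entered as a predictor of LDRTs.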

The first striking result of the two studies above is the mismatch between the improvement (ΔR²) in English and the other languages. Concretely, for all the DSMs the average improvement in English was 0.115, whereas for Chinese, Dutch, French, and Malay it was 0.025. While, therefore, the SND measure improves the fit of the models for the other languages, it seems that there might be something special about English. Before we assert that these models for some reason work better in English, we see at least three potential explanations for this mismatch in performance. Firstly, the quality of the English corpus was much higher than that of the corpora used for the other languages. As introduced in §2.4.2, the British National Corpus is a pre-processed, human-annotated corpus of both spoken and written English containing high-quality texts from professional authors. The corpora for the other languages differ starkly. For French, the corpus was crawled from the web, which does not guarantee high-quality texts. Similarly, the Dutch corpus, which contains freely available subtitles, and the Malay corpus, crawled from Wikipedia, are not necessarily written by professionals. The only ‘good’ quality corpus was, in fact, the Chinese one, which, however, was quite small. Secondly, the BLP had the practical advantage that trial-level reaction times were made available, instead of item-level averages. This made it possible for us to filter out errors and potential outliers in the data (following the procedure outlined in §3.4.1), something that we were not able to do for the other languages. Finally, the BLP baseline was quite low compared to the baselines for the other languages (and even to the similar baseline of the SPP). This suggests that there might be some errors in the data input of the original dataset, which have cascaded to our variable selection procedure.

Table 4.6. Summary of the best performing neural embeddings models predicting LDRTs from neighbourhood density for each language. The predictor variables for each of these models were the same as the baseline model (see text) with the addition of the neighbourhood density estimates given by the neural embeddings. B_m0 shows the Bayes Factor with mixture-of-variance priors comparing the model that includes the semantic similarity covariate against the baseline. The significance levels refer to the difference between these models and the baseline (see text).

Language    R²      ΔR²         B_m0
Chinese     0.32    +0.01***    1.59 × 10^12
Dutch       0.34    +0.04***    3.56 × 10^42
French      0.41    +0.03***    6.06 × 10^115
Malay       0.56    +0.02***    1.67 × 10^4

Significance levels: † p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
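The size of the gap can be checked with simple arithmetic. The sketch below takes the ΔR² values reported in Table 4.6 and the English average of 0.115 quoted in the text; the variable names are illustrative, and the snippet only re-derives the summary figures, not the regression models themselves.

```python
# ΔR² of the best neural embeddings model per language, copied from Table 4.6.
delta_r2 = {"Chinese": 0.01, "Dutch": 0.04, "French": 0.03, "Malay": 0.02}

# Average improvement across DSMs reported in the text for English.
english_mean_delta_r2 = 0.115

# Mean improvement over the four non-English languages.
mean_other = sum(delta_r2.values()) / len(delta_r2)

# How many times larger the English improvement is.
ratio = english_mean_delta_r2 / mean_other

print(f"mean ΔR² (other languages): {mean_other:.3f}")
print(f"English / other ratio: {ratio:.1f}")
```

The four values in Table 4.6 average to 0.025, in line with the figure quoted above, so the improvement in English is roughly 4.6 times larger than in the other languages.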

Another issue was the size of the corpora used in the Chinese and Malay simulations. Practically, this could be a serious problem for DSMs, as the sparsity of information would result in lower-quality representations. In short, smaller corpora would not provide sufficient information for the DSMs to generate semantic representations that accurately summarise the contexts in which each word appears. The solution the Bayesian Optimiser finds is (a) to reduce the size of the vectors to minimise the potential noise in the representations and (b) to exploit as much information as possible from the corpus. Regarding (a), we have noted that if we increase the size of the vectors to the point where the corpus cannot provide enough information, then the representations become susceptible to noise, which lowers their quality (see also Landauer & Dumais, 1997). As for (b), looking at Table 4.5 we see that the best Chinese and Malay models do not trim the most or least frequent words in the corpus, and the window sizes take the maximum value to exploit as much information from the context as possible.
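The effect of the window size and of frequency trimming on the amount of contextual evidence can be illustrated with a minimal co-occurrence counter. This is a sketch under stated assumptions: the function `cooccurrence_counts`, its `drop` parameter, and the toy sentence are hypothetical, and trimming is applied before windowing, which need not match the preprocessing of the models in Table 4.5.

```python
from collections import Counter


def cooccurrence_counts(tokens, window, drop=frozenset()):
    """Count (target, context) pairs within `window` tokens on each side.

    `drop` is a hypothetical set of words removed before counting, for
    example the most frequent function words.
    """
    kept = [t for t in tokens if t not in drop]
    counts = Counter()
    for i, target in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(target, kept[j])] += 1
    return counts


# Toy sentence, invented purely for illustration.
tokens = "the cat sat on the mat".split()

narrow = cooccurrence_counts(tokens, window=1)
wide = cooccurrence_counts(tokens, window=3)
trimmed = cooccurrence_counts(tokens, window=1, drop={"the", "on"})

print(sum(narrow.values()), sum(wide.values()), sum(trimmed.values()))
```

On the same six tokens the wide window yields more co-occurrence pairs than the narrow one, which is the sense in which a maximal window lets a small corpus contribute as much information as possible. The pair ("sat", "mat") is missed by the narrow window but recovered either by widening the window or by trimming the intervening high-frequency words.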

Let us now take a closer look at the parameters for English, Dutch, and French. The English and Dutch corpora are large enough to provide sufficient information to form high-quality semantic representations. As such, the models discard potentially redundant information such as high-frequency words, which are usually function words (determiners, particles, prepositions, or pronouns) with little semantic content. On the other hand, they retain low-frequency words, as these might be discriminative enough to enrich the representations. Interestingly, the window size differed between these two models, possibly reflecting linguistic differences. The size of the resulting semantic vectors also differed between Dutch and English; this is most likely the result of the scope of the two corpora: the Dutch corpus encompasses more topics (different sorts of movies) and is more versatile than the English one (which mostly contains texts gathered from newspapers and novels). This last point is corroborated by the fact that the number of different word types was 941046 for the Dutch corpus, but only 341056 for the BNC. Even after removal of the high-frequency tokens, each word in Dutch was ‘seen’ in the context of more different words, something that needs larger vectors to be encoded. Similarly, the French corpus also contains a high number of word types (885945), resulting, again, in a larger vector size. Finally, we see that the French model did not discard high-frequency words and had an increased window size. The increased window size could be explained by the presence of high-frequency intervening words, as the distance between the target and any meaningful context words is increased. It is unclear why the Optimiser chose not to discard high-frequency words (although this was not a significant parameter), and we consider this to be an artefact of the corpus used.
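The distinction between word types and tokens, and the effect of discarding the most frequent words, can be sketched as follows. The function `type_statistics`, its parameter name, and the toy sentence are hypothetical and serve only to illustrate the counts discussed above; they are not the procedure used to obtain the corpus figures.

```python
from collections import Counter


def type_statistics(tokens, n_most_frequent_to_drop=0):
    """Return token and type counts after dropping the most frequent words."""
    counts = Counter(tokens)
    dropped = {w for w, _ in counts.most_common(n_most_frequent_to_drop)}
    kept = [t for t in tokens if t not in dropped]
    return {"tokens": len(kept), "types": len(set(kept)), "dropped": dropped}


# Toy sentence, invented purely for illustration.
tokens = "the cat saw the dog and the bird".split()

full = type_statistics(tokens)
without_top = type_statistics(tokens, n_most_frequent_to_drop=1)
```

Dropping the single most frequent word removes three of the eight tokens but only one of the six types, which illustrates why a corpus can remain rich in word types even after its high-frequency tokens are removed.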
