Conclusions and Discussion

The present study investigated the possibility of using computational methods for loanword detection. Several model architectures and configurations were trained and tested to arrive at the best model. The final model incorporated phonotactic, lexical, and semantic distributional features and used an SVM binary classifier to predict whether a word was borrowed. The model achieved state-of-the-art results on the evaluation test set, surpassing the results of previous similar models (Mi et al. 2016, Mi et al. 2018, Fenogenova et al. 2016, Tsvetkov et al. 2015).

Moreover, the detection model performed exceptionally well in a real-life setting. When applied to existing Russian Web corpora, the model made it possible to compile an extended corpus of recent Russian borrowings from English. In addition to the identified loanwords, the corpus contains their predicted English donor words, morphological features and part-of-speech tags, and the sentences in which they occur. The value of the corpus is illustrated by the morphological analysis of the identified loanwords described in Section 4. A frequency study of oblique noun forms focused on the indeclinability of borrowed nouns and showed a positive correlation between the frequencies of noun lemmas and their oblique forms, suggesting that there is no growing trend toward indeclinability among novel Russian loanwords. In addition, a subset of the corpus was used to analyze stem-final consonant mutations in borrowed verbs. The analysis confirmed the previously attested underapplication of expected mutations but found no correlation between the frequency of the verbs and the frequency of verb forms containing expected mutations. The corpus is released in the hope of furthering research on loanwords and language constraints.
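The frequency comparison summarized above can be illustrated with a minimal sketch. It assumes that the relationship is measured with a Pearson coefficient, which this section does not specify. The counts are made-up placeholders rather than figures from the corpus, and the function and variable names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder counts (NOT corpus data): frequency of each noun lemma and the
# summed frequency of that lemma's oblique case forms.
lemma_freq = [10, 20, 30, 40]
oblique_freq = [3, 5, 9, 11]

r = pearson(lemma_freq, oblique_freq)
print(f"r = {r:.3f}")
```

Under this reading, a clearly positive coefficient means that more frequent lemmas also occur more often in oblique forms, which is the pattern the study reports for borrowed nouns.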

At the same time, the detection model has certain limitations. For example, it is too sensitive to hyphenated words and misclassified a large number of native abbreviated or hard-hyphenated words. In the future, the compiled loanword corpus can be used to further improve the accuracy of the detection model.

The construction of the corpus also posed certain challenges. In particular, the initial corpus aggregated from model predictions was too noisy, partly because of model errors but mostly because of errors in lemmatization and part-of-speech tagging. Lemmatization and part-of-speech tagging errors are interconnected in that an incorrect tag leads to the application of incorrect lemmatization rules. While previous studies resorted to reducing the noise of the compiled corpora through manual revision (Fenogenova et al. 2016, Muravyev et al. 2014), the goal of this study was to apply strictly computational methods. Thus, rather than being revised by hand, the corpus was filtered using an intersection of two lemmatization models and two tagging models to reduce the number of errors. More research is needed to improve existing morphological analyzers, which will help increase the accuracy of the detection model.
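The intersection step can be sketched as follows. This is only an illustration under assumptions: the section does not say how agreement between the models was computed, so the sketch simply keeps a token when both lemmatizers return the same lemma and both taggers return the same tag. The function `agreed_analyses`, the toy dictionary-backed analyzers, and the example word forms are hypothetical stand-ins, not the tools or data used in the study.

```python
def agreed_analyses(tokens, lemmatizers, taggers):
    """Keep only tokens for which every lemmatizer returns the same lemma
    and every tagger returns the same part-of-speech tag."""
    kept = []
    for token in tokens:
        lemmas = {lemmatize(token) for lemmatize in lemmatizers}
        tags = {tag(token) for tag in taggers}
        # A single distinct value in each set means all models agree.
        if len(lemmas) == 1 and len(tags) == 1:
            kept.append((token, lemmas.pop(), tags.pop()))
    return kept


# Toy stand-ins for real analyzers (hypothetical outputs, illustration only).
lemmatizer_a = {"lajknul": "lajknut'", "klikal": "klikat'", "brendiki": "brendik"}.get
lemmatizer_b = {"lajknul": "lajknut'", "klikal": "klikat'", "brendiki": "brendiki"}.get
tagger_a = {"lajknul": "VERB", "klikal": "VERB", "brendiki": "NOUN"}.get
tagger_b = {"lajknul": "VERB", "klikal": "NOUN", "brendiki": "NOUN"}.get

clean = agreed_analyses(
    ["lajknul", "klikal", "brendiki"],
    [lemmatizer_a, lemmatizer_b],
    [tagger_a, tagger_b],
)
print(clean)
```

In this toy run only the first token survives: the taggers disagree on the second and the lemmatizers disagree on the third. Filtering by agreement trades recall for precision, since a correct analysis is discarded whenever one model errs.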

Appendix: Novel Russian Loanwords with Predicted Donor Words

ab'juzerstvo 'abuser' ab'juzerša 'abuser' ab'juzivna 'abusive' ab'juzit' 'abuse' averzivnyj 'aversive' avtofikšn 'auto fiction' aggregirovat' 'aggregate' agonizirovat' 'agonize' agregatirovat' 'aggregate' adaptive 'adaptive' advansment 'advancement' advantejdž 'advantage' adventivnyj 'adventive' adventura 'adventure' adventurizm 'adventurism' advenčers 'adventures' advenčurer 'adventurer'

advert 'advert'

advizor 'advisor'

advizori 'advisory' advokejt 'advocate' adderal 'Adderall' adderollovyj 'Adderall'

adopter 'adopter'

akvizišn 'acquisition' akkaunting 'accounting' akkaunts 'accounts' akkolejd 'accolade' akkomplišed 'accomplished' aksident 'accident' alinkljuziv 'all inclusive'

alièkspressovskij 'AliExpress' anderkaver 'undercover' anderkarta 'undercard' anderkloking 'underclocking'

animal 'animal' atributirovat' 'attributing'

audišen 'audition'

autlajner 'outliner' autsortingovyj 'out sorting' affiliejtka 'affiliate' ačivit' 'achieve' baghantit' 'bug hunt'

bajbek 'buyback'

bajkfrendli 'bike friendly' balansit' 'balance'

bampit' 'bump'

bampovat' 'bump'

barbekjuit' 'barbecue' batterskotč 'butterscotch' bezakkauntnyj 'account'

blokirovat'sja 'blocking'

blokovyj 'block'

bljutusnyj 'Bluetooth' bolderingovyj 'bouldering'

bonusit' 'bonus'

brauzit' 'browse' brejkanut' 'break' brejkdaunut' 'breakdown'

brendik 'brand'

brendonutyj 'brand' bridernyj 'breeder' bridževyj 'bridge'

broud 'broad'

brutforsit' 'brute force'

brèjn 'brain'

brèndirovat' 'brand'

bukat' 'book'

bukmarkat' 'bookmark' bulšitit' 'bullshit'

bègmèn 'bagman'

bèkspejsnut' 'backspace' bèttari 'battery'

vegetarianit' 'vegetarian' vesternizirovat' 'westernize'

viktimblejmit' 'victim blaming' viktimizirovat' 'victimize' viralit' 'viral' virtualizirovat' 'virtualize'

višer 'wisher'

vrekerit' 'wrecker'

vudik 'woody'

vèjd 'wade'

gazlajt 'gaslight' gazlajtit' 'gaslight'

gendersvičnut' 'gender switch' geokodirovat' 'geocode' geotarget 'geo target'

gljuten 'gluten'

gould 'gold'

grejlistit' 'gray list'

grejp 'grape'

grejsovyj 'grace' grindit' 'grind'

grinman 'green man'

gugljatka 'google'

darkglass 'dark glass' daunskejlit' 'downscale'

devajsez 'devices'

dedakšen 'deduction' dedlajnit' 'deadline' dednejmit' 'dead name' dezjorvz 'deserves'

dekriptovat' 'decrypt'

deplatformit' 'deplatform' deskremblirovat' 'descramble' deskripšen 'description' despavnit'sja 'despawn' destinejšen 'destination' destoršn 'distortion'

džejlbrejker 'jailbreaker' dženerejšnz 'generations'

dženerik 'generic'

džentrificirovat' 'gentrify'

džimit' 'gym'

džinglbellit' 'jingle bell'

džogger 'jogger'

džoit' 'joy'

džojnit' 'join' džornèlizm 'journalism' džuniper 'juniper'

diskaverabiliti 'discoverability' diskavri 'discovery'

dumskrollar 'doom scroller' dèdlajn 'deadline' iksterminèjt' 'exterminate'

impress 'impress'

impressiv 'impressive'

impruv 'improve'

inbridirovat' 'inbreed' invajtit' 'invite' inversnut'sja 'inverse' invojsing 'invoicing' kamingautnut'sja 'coming out'

kampervan 'campervan' kansellit' 'cancel' karpenter 'carpenter' karpentri 'carpentry' karšaring 'carsharing' kastodial'nyj 'custodial'

kastomnyj 'custom' kinkšejmit' 'kink shame' kitčit' 'kitchen'

klejm 'claim'

klikabel'nyj 'clickable' klikat' 'click' klikbejtit' 'clickbait'

klikingovyj 'clicking'

kliknut' 'click'

klinikal 'clinical'

klènki 'clanky'

klèriti 'clarity' koasternyj 'coaster' kovergentnyj 'convergent' kollajd 'collide' kollajding 'colliding' kollautit' 'callout'

kollbek 'callback'

kollektabls 'collectables' kolonizejšn 'colonization' kommentit' 'comment'

komments 'comments'

kommjoršial 'commercial'

kommitit' 'commit'

komm'jutit' 'commute' kompetišin 'competition' komfortit' 'comfort'

kongratilèjšns 'congratulations' konnekšen 'connection' konsolidejtid 'consolidated' konsolidejšn 'consolidation' konstrakšn 'construction' konfljuentnyj 'confluent' konformer 'conformer' konf'juzing 'confusing' konf'juzit' 'confuse' kopipastat' 'copy paste' koučingovyj 'coaching'

krauder 'crowder'

krauding 'crowding'

kraudfajnit' 'crowd fine' kraudfandingovyj 'crowdfunding' kraudfandit' 'crowdfund' kraudfunding 'crowdfunding' kraftit' 'craft' lavless 'loveless'

lavsong 'love song'

lavstori 'love store' lavhejt 'love hate' lavčajld 'love child'

lajaut 'layout'

lajkanut' 'like' lajknut' 'like' lajmlajt 'limelight' lajnvork 'line work'

lajfa 'life' loukosternyj 'low-cost'

louless 'lawless' mastermind 'mastermind' mafflit' 'muffle' medikejšnyj 'medication'

mejlovyj 'male'

mejnsplejnit' 'mansplain' mensplejnit' 'mansplain' menspreder 'manspreading' menspredit'sja 'manspreading' mentalizirovat' 'mentalizing'

menšenz 'mentions'

milkovyj 'milk'

mirroring 'mirroring' misbihejv 'misbehave' misgendering 'misgendering' misgenderit' 'misgender' misdirekšn 'misdirection'

modellit' 'model'

moderniti 'modernity'

mortar 'mortar'

muvat' 'move'

mèjnstriming 'mainstreaming' mèjntejner 'maintainer'

nekless 'neckless'

njord 'nerd' notifikejšen 'notification'

nèjked 'naked' oversempling 'oversampling' overtonit' 'overtone' overšèrit' 'overshare' okkljudirovat' 'occlude' okkljuzionnyj 'occlusion'

olmoust 'almost'

openkèjs 'open case' opensurs 'open source' operejšn 'operation' patibrejker 'party breaker' patčit' 'patch'

persistirovat' 'persist' persistit' 'persist' pikapnut' 'pick-up'

pikat' 'pick'

pinakl 'pinnacle' pirsingovat' 'piercing' plejgraundovskij 'playground' plejgraundskij 'playground' plejgruppa 'playgroup' plejlandija 'playland' plejlend 'playland' plejstejšnskij 'PlayStation' plejtestit' 'playtest' popkornbuk 'popcorn book' porkupajn 'porcupine' portabl 'portable' prajmirovat' 'prime'

prajml 'primal'

prajoriti 'priority' predikt 'predict' predikšn 'prediction' prezentejšn 'presentation'

prejz 'praise'

preskul 'preschool' prodakšnz 'productions'

prononsièjšn 'pronunciation'

prospamit' 'spam'

pèčvorking 'patchworking'

rajval 'rival' rebutovat'sja 'reboot'

revamp 'revamp'

reversit' 'reverse' revižen 'revision' revižn 'revision'

redakšn 'reduction'

redevelopment 'redevelopment' redraft 'redraft'

rejtit' 'rate' rekognajz 'recognize' rekonstrakšen 'reconstruction' rekrutit' 'recruit'

reprezentativit' 'representative' resellit' 'resell'

resiv 'receive'

respavn 'respawn'

respavnit' 'respawn' respektivnyj 'respective'

restrimit' 'restream' retardnut'sja 'retard' retvitit' 'retweet' retriv 'retriever' retrovejv 'retro wave' refajneri 'refinery' sabskrajber 'subscriber' sajdkvest 'side quest' sajting 'sighting'

sajtnyj 'sight'

sekondari 'secondary'

sekondli 'secondly' sekstit' 'sext' seksualizirovat' 'sexualize' sekuljarnyj 'secular' sek'juritizirovat' 'securitize' sellaut 'sellout' selfisnut'sja 'selfies' selfharmlit'sja 'self-harm'

sender 'sender' skoringovyj 'scoring' skrapbukovyj 'scrapbook' skrapmanija 'scrap' skremblirovat' 'scramble' skribl 'scribble' skrinit' 'screen' skrinšotit' 'screenshot' skrinšotočnyj 'screenshot' skrollit' 'scroll' smartfonit'sja 'smartphone' smartfonnyj 'smartphone'

smètčit' 'match'

snejp 'snape'

snorkelit'sja 'snorkel'

snup 'snoop'

spavnkil 'spawn kill' spajnbrejker 'spine breaker'

sparkat' 'spark'

sparsit' 'parse' spešilti 'specialty' spešièlti 'specialty' spojlerit' 'spoiler'

spènk 'spank'

stakat' 'stack' starri 'starry' stejdždajving 'stage diving' stendaper 'stand-up' stenit' 'stan' stils'jut 'stealth'

stoper 'stopper'

storizmejker 'stories maker' storiteling 'storytelling' straggl 'struggle' strajder 'strider' strajking 'striking' strajpi 'stripy' strajpirovat' 'stripe' stranger 'stranger' strangulejt 'strangulate' strejndž 'strange' strejčing 'stretching' stridfud 'street food' strimerša 'streamer' strimlajn 'streamline' stritlajt 'streetlight' stritrejs 'street race' strongmèn 'strongman' superkul 'supercool' supernatural 'supernatural' supersajz 'supersize' superstajl 'super style' superstarz 'superstars' supersèmpling 'super sampling' superčardž 'supercharge'

suspendirovat' 'suspend'

sèjvori 'savory'

sèjvpojnt 'save point'

sèjk 'sake'

sèjlzmèn 'salesman'

sèjn 'sane'

sèjntski 'saint'

sèlfharming 'self-harming' sèmplirovat' 'sample'

tajmkiller 'time killer' tajmless 'timeless'

tajmmendžment 'time management' tajmstamp 'timestamp'

tajmfors 'time force' targetirovat'sja 'target'

taunhauz 'townhouse' testflajt 'test flight' tehnikalli 'technically' timfajt 'team fight' tinderdejt 'tinder date' tindermetč 'tinder match' topvojs 'top voice' topgerlz 'top girls' trablšutit' 'trouble shoot' trajharder 'try harder' trajhardmen 'try hard man' tranzišen 'transition' trankviliti 'tranquility' transgender 'transgender' transgressirovat' 'transgress' transkodirovat' 'transcode' transmišn 'transmission' transseksualka 'transsexual' trevelmanija 'travel mania' trejdanut' 'trade'

trejderovat' 'trader' trejderstvovat' 'trader' trejlraning 'trail running'

trejnhop 'train hop' trejts 'traits' trekat' 'track'

trekk 'track'

trekrekord 'track record' triangl 'triangle' zatriggerit' 'trigger'

triggerit'sja 'trigger' trikeri 'trickery' trimmingovat' 'trimming' tripanut' 'trip'

ul'travajolens 'ultra-violence' unforgiven 'unforgiven' faerbejz 'fire base' faerbol 'fire ball' faergard 'fireguard' fajrvud 'firewood'

fajèri 'fiery' fanfikšn 'fan fiction' fejdingovat' 'fading'

fejkanut' 'fake'

fejklav 'fake love' fejk-njus 'fake news'

fejk-novost' 'fake'

fejsbidding 'face bidding'

fingerprinter 'finger printer' fingeršejming 'finger shaming' finišd 'finished'

fišeri 'fishery' fišnet 'fishnet' flajtšering 'flight sharing' flaunder 'flounder'

flèšmobit' 'flash mob' follbèk 'fallback' follovit'sja 'follow' forbiden 'forbidden'

former 'former'

fortifikejšn 'fortification' fotospamit' 'photo spam' fotospojler 'photo spoiler' fototipnyj 'phototype' fruting 'fruiting' frutkejk 'fruitcake' frèdžajl 'fragile' fudmarket 'food market' fudpejring 'food pairing' fudtrak 'food track' fudšering 'food sharing' futbord 'footboard'

f'jurri 'furry' fètšejmit' 'fat shame' hajdifinišn 'high definition' hajlajtit' 'highlight'

haknutyj 'hack'

hammerhed 'hammerhead'

handred 'hundred'

hanibejbez 'honey babes'

hanters 'hunters'

harasit' 'harass' harasser 'harasser' hardkorovec 'hardcore' harrikejnz 'hurricanes' hartbrejk 'heartbreak' hevenli 'heavenly' hedžingovyj 'hedging' hedlajn 'headline' hejtit' 'hate' hitmejker 'hit maker'

hičat' 'hitch'

hoarding 'hoarding'

holdevej 'hold away'

honest 'honest'

houmofis 'home office'

houmran 'home run'

houmstej 'homestay'

humanizirovat' 'humanize'

hèdžit' 'hedge'

hèdhantit' 'headhunt'

hèp 'hap'

hètčer 'hatcher'

šejpšifter 'shape shifter' šiping 'shipping'

šipmen 'shipman'

šipment 'shipment'

šitpostit' 'shit post' šitposting 'shit posting'

šopit'sja 'shop'

šopper 'shopper'

èvidentnyj 'evident' èvoljušn 'evolution'

èdalteri 'adultery' èdvajzori 'advisory' èksfolirovat' 'exfoliate' èmbossirovat' 'emboss' èngejdžment 'engagement' èndorsment 'endorsement'

èndurans 'endurance'

ènerdžajzingovyj 'energizing'

ènivèjs 'anyways'

èskortirovat' 'escort' èstimèjtit' 'estimate'

References

Adouane, W., Semmar, N., & Johansson, R. (2016). Romanized Berber and Romanized Arabic Automatic Language Identification Using Machine Learning. Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects, 53–61.

Alvarez-Mellado, E. (2020). An Annotated Corpus of Emerging Anglicisms in Spanish Newspaper Headlines. Proceedings of the LREC 2020 – 4th Workshop on Computational Approaches to Code Switching, 1–8.

Baldwin, T., & Lui, M. (2010). Language identification: the long and the short of the matter. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 229–237.

Balota, D.A., Yap, M.J., Hutchison, K.A., Cortese, M.J., Kessler, B., Loftis, B., Neely, J. H., Nelson, D. L., Simpson, G. B., & Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39, 445-459.

Bisani, M., & Ney, H. (2008). Joint-Sequence Models for Grapheme-to-Phoneme Conversion. Speech Communication, 50(5), 434–451.

Bobkova, N., & Montermini, F. (2021). Suffix Rivalry in Russian: What Low Frequency Words Tell Us. MMM12 Online proceedings 2020.

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors With Subword Information. Transactions of the Association for Computational Linguistics, 5.

Brejter, M.A. (1996). Современные лексические заимствования в русском языке: «чужеродные речения» или средство обогащения литературного языка? Rusistika 14, 33-45.

Brysbaert, M., & New, B. (2009). Moving Beyond Kučera and Francis: A Critical Evaluation of Current Word Frequency Norms and the Introduction of a New and Improved Word Frequency Measure for American English. Behavior research methods, 41(4), 977-990.

Campbell, L. (1998). Historical linguistics: an introduction. Edinburgh: Edinburgh University Press.

Chelba, C., Brants, T., Neveitt, W., & Xu, P. (2010). Study on Interaction between Entropy Pruning and Kneser-Ney Smoothing. Proceedings of Interspeech (2010), 2242-2245.

Ciobanu, A.M., & Dinu, L.P. (2015). Automatic Discrimination between Cognates and Borrowings. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), 431–437.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 8440-8451.

Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. Machine Learning, 20, 273-297.

Davidson, L., & Noyer, R. (1997). Local Phonology in Huave: Nativization and the Ranking of Faithfulness Constraints. Proceedings of the 15th West Coast Conference on Formal Linguistics, 65–80.

Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39 (1), 1–38.

Dyakov, A.I. (2010). Словарь английских заимствований русского языка. Novosibirsk: Novosibirskoe knizhnoe izdatel’stvo.

Efremova, T.V. (2000). Новый словарь русского языка. Толково-образовательный. Moscow: Russkiy Yazyk.

Einar, H. (1950). The Analysis of Linguistic Borrowing. Language, 26, 210–231.

Eskova, N. (2008). И еще раз про букву ё. Nauka i Zhizn 7.

Fenogenova, A., Karpov, I., & Kazorin, V. (2016). A General Method Applicable to the Search for Anglicisms in Russian Social Network Texts. 2016 IEEE Artificial Intelligence and Natural Language Conference (AINL), 1-6.

Forcada, M., Ginestí-Rosell, M., Nordfalk, J., O'Regan, J., Ortiz-Rojas, S., Pérez-Ortiz, J., Sánchez-Martínez, F., Ramírez-Sánchez, G., & Tyers, F. (2011). Apertium: A Free/Open-Source Platform for Rule-Based Machine Translation. Machine Translation 25 (2), 127-144.

Gorlach, M. (ed). (2001). A Dictionary of European Anglicisms: A Usage Dictionary of Anglicisms in Sixteen European Languages. Oxford University Press, Oxford, England.

Gorman, K. (2013). Generative Phonotactics. [Doctoral dissertation, University of Pennsylvania].

Gorman, K. (2016). Pynini: A Python Library for Weighted Finite-State Grammar Compilation. Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata, 75-80.

Gorman, K. (In preparation). CityLex: A Free English Lexical Database.

Gorman, K., Kirov, C., Roark, B., & Sproat, R. (In press). Structured Abbreviation Expansion in Context. Findings of EMNLP.

Graudina, L. K., Ickovic, V. A., & Katlinskaja, L. P. (1976). Грамматическая правильность русской речи: Опыт частотно-стилистического словаря вариантов. Moscow: Nauka.

Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018). Learning Word Vectors for 157 Languages. arXiv:1802.06893 [cs.CL].

Haspelmath, M. (2008). Loanword Typology: Steps Toward a Systematic Cross-Linguistic Study of Lexical Borrowability. Aspects of Language Contact: New Theoretical, Methodological and Empirical Findings with Special Focus on Romancisation Processes, 43-62.

Hellsten, L., Roark, B., Goyal, P., Allauzen, C., Beaufays, F., Ouyang, T., Riley, M., & Rybach, D. (2017). Transliterated Mobile Keyboard Input via Weighted Finite-State Transducers. Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing, 10–19.

Honnibal, M., & Montani, I. (2017, January). spaCy 2: Natural Language Understanding with Bloom Embeddings, Convolutional Neural Networks and Incremental Parsing. Sentometrics Research. https://sentometrics-research.com/publication/72/.

Janurik, S. (2011). Рост аналитизма в русском языке и англо-русские языковые контакты. Исследования по теоретической лингвистике: материалы симпозиума «Смена парадигмы в венгерской лингвистической русистике», 101–115.

Jenkins, A., Gupta, V., & Lenoir, M. (2019). General Regression Neural Networks, Radial Basis Function Neural Networks, Support Vector Machines, and Feedforward Neural Networks. arXiv:1911.07115 [eess.SY].

Kingma, D.P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980.

Kirov, C., Cotterell, R., Sylak-Glassman, J., Walther, G., Vylomova, E., Xia, P., Faruqui, M., Mielke, S.J., McCarthy, A.D., Kübler, S., Yarowsky, D., Eisner, J., & Hulden, M. (2018). UniMorph 2.0: Universal Morphology. arXiv:1810.11101 [cs.CL].

Kruengkrai, C., Srichaivattana, P., Sornlertlamvanich, V., & Isahara, H. (2005). Language Identification Based on String Kernels. Proceedings of the 5th International Symposium on Communications and Information Technologies (ISCIT2005), 896-899.

Kummerfeld, J.K., Berg-Kirkpatrick, T., & Klein, D. (2015). An Empirical Analysis of Optimization for Max-Margin NLP. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 273–279.

Kurokhtina, G. (1996). Новые слова и значения в современном русском языке. Rusistika 13, 21-25.

Kutuzov, A., & Kuzmenko, E. (2017). WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models. Analysis of Images, Social Networks and Texts. AIST 2016. Communications in Computer and Information Science, 661, 155–161.

Lee, J.L., Ashby, L.F.E., Garza, M.E., Lee-Sikka, Y., Miller, S., Wong, A., McCarthy, A.D., & Gorman, K. (2020). Massively Multilingual Pronunciation Modeling with wikipron. Proceedings of the 12th Language Resources and Evaluation Conference, 4223–4228.

Li, H., & Ma, B. (2005). A Phonotactic Language Model for Spoken Language Identification. Proceedings of the 43rd Annual Meeting of the ACL, 515–522.

Lyashevskaya, O., & Sharov, S. (2009). Frequency Dictionary of the Modern Russian Language (The Russian National Corpus). Azbukovnik.

Medvedeva, M., Kroon, M., & Plank, B. (2017). When Sparse Traditional Models Outperform Dense Neural Networks: The Curious Case of Discriminating between Similar Languages. Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects, 156–163.

Mi, C., Yang, Y., Zhou, X., Wang, L., Li, X., & Jiang, T. (2016). Recurrent Neural Network Based Loanwords Identification in Uyghur. Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers, 209–217.

Mi, C., Yang, Y., Wang, L., Zhou, X., & Jiang, T. (2018). A Neural Network Based Model for Loanword Identification in Uyghur. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Michel, J.B., Shen, Y.K., Aiden, A.V., Veres, A., Gray, M.K., Brockman, W., The Google Books Team, Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., & Aiden, E.L. (2010, December 12). Quantitative Analysis of Culture Using Millions of Digitized Books. Science. https://www.science.org/doi/abs/10.1126/science.1199644.

Mikhelson, A.D. (1865). Обьяснение 25 000 иностранных слов, вошедших в употребление в русский язык, с означением их корней. Moscow.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs.CL].

Moshagen, S. N., Pirinen, T. A., & Trosterud, T. (2013). Building an Open-Source Development Infrastructure for Language Technology Projects. Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), 16 (85), 343–352.

Muravyev, N.A., Panchenko, A.I., & Obiedkov, S.A. (2014). Neologisms on Facebook. arXiv:1804.05831 [cs.CL].

Murphy, D.L. (2000). The Gender of Inanimate Indeclinable Common Nouns in Modern Russian [Doctoral dissertation, Ohio State University].

Ney, H., Essen, U., & Kneser, R. (1994). On Structuring Probabilistic Dependences in Stochastic Language Modelling. Computer Speech & Language, 8(1), 1–38.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Pertsova, K. (2016). Transderivational Relations and Paradigm Gaps in Russian verbs. Glossa: a journal of general linguistics 1(1), 13.

Roark, B., & Sproat, R. (2014). Hippocratic Abbreviation Expansion. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 364–369.

Novak, J.R., Minematsu, N., & Hirose, K. (2016). Phonetisaurus: Exploring Grapheme-to-Phoneme Conversion with Joint N-Grams Models in the WFST Framework. Natural Language Engineering, 22(6), 907–938.

Russian National Corpus. (2009). Национальный корпус русского языка: 2006-2008. Saint-Petersburg.

Ryazanova-Clarke, L., & Terence, W. (1999). The Russian Language Today. London/New York: Routledge.

Sagalova, E. N. (1997). Иноязычные речевые элементы в современной русской периодике. Novye slova i slovari novyx slov, 66-77.

Sagot, B. (2018). A Multilingual Collection of CoNLL-U-Compatible Morphological Lexicons. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Segalovich, I. (2003). A Fast Morphological Algorithm with Unknown Word Guessing Induced by a Dictionary for a Web Search Engine. Proceedings of the International Conference on Machine Learning; Models, Technologies and Applications. MLMTA'03, 273-280.

Shavrina, T., & Shapovalova, O.A. (2017). To the Methodology of Corpus Construction for Machine Learning: Taiga Syntax Tree Corpus and Parser. Proceedings of CORPORA2017.

Shisholina, A. (2016). Типология новейших композитов русского языка. Orenburg State University Vestnik, 4 (192), 49-53.

Slioussar, N., & Kholodilova, M. (2013). Paradigm Leveling in Non-Standard Russian. Formal Approaches to Slavic Linguistics: The second MIT meeting 2011, 243-258.

Magomedova, V., & Slioussar, N. (2017). Paradigm Leveling: The Decay of Consonant Alternations in Russian. International Morphology Meeting 2014.

Stankiewicz, E. (1968b). Declension and Gradation of Russian Substantives in Contemporary Standard Russian. The Hague: Mouton.

Stolcke, A. (1998). Entropy-Based Pruning of Backoff Language Models. Proceedings of the DARPA Broadcast News and Understanding Workshop, 270–274.

Straka, M., Hajic, J., & Straková, J. (2016). UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), 4290-4297.

Timofeeva, G. (1995). Английские заимствования в русском языке 90-х годов. Russian Language Journal, 49, 275-286.

Tsvetkov, Y., Ammar, W., & Dyer, C. (2015). Constraint-Based Models of Lexical Borrowing. Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, 598–608.

Weide, R. (2005). The Carnegie Mellon Pronouncing Dictionary [cmudict. 0.7b]. Carnegie Mellon University.

Wenzek, G., Lachaux, M.A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A., & Grave, E. (2020). CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. Proceedings of the 12th Language Resources and Evaluation Conference (LREC), 4003-4012.

Zhang, T. (2004). Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms. ICML 2004: Proceedings of the Twenty-First International Conference on Machine Learning, 919-926.

Zissman, M.A. (1996). Comparison of Four Approaches to Automatic Language Identification of Telephone Speech. IEEE Transactions on Speech and Audio Processing, 4(1), 31-44.
