A Multilayer System of Lexical Resources for Language Technology Infrastructure

Maciej Piasecki, Marek Maziarz, Michał Marci´nczuk, Paweł K˛edzia, Adam Radziszewski, Wroclaw University of Technology

Semantic text processing (e.g. semantic text searching or Information Ex- traction) must be well grounded on the analysis performed on the lexical semantic level. Individual text words (at least selected) must be mapped on the repository of lexical senses to provide a basis for the processing of the lexico-semantic structures. The repository can be automatically in- duced from text corpora, but far better can be achieved if it is manually constructed. Thus, for the needs of the CLARIN-PL – the Polish part of the CLARIN infrastructure – we decided to build a system of lexical resources organised around plWordNet 3.0 as its fundament. The system comprises:

• plWordNet 3.0 – a huge wordnet for Polish providing description of lexical meanings (more than 260 000 planned) and the system of lexico- semantic relations (Maziarz et al. 2013),

• NELexicon 2.0 – a very large dictionary of Polish Proper Names (about 2,5 million PNs) classified into 102 semantic categories (Marci´nczuk et al., 2013),

• mapping of the plWordNet senses to extended SUMO Ontology (K˛edzia and Piasecki 2014),

• a lexicon of lexico-syntactic descriptions of multiword lexical units (about 60 000 descriptions, cf. Kurc et al. 2012),

• and a mapping of plWordNet 3.0 to an extended Princeton WordNet 3.1 for English (cf. Rudnicka et al. 2012).

PlWordNet is a huge network of lexico-semantic relations that provide description for lexical senses that are represented as the network nodes. As plWordNet is exclusively focused on the faithful description of the lexical meanings, the common knowledge description is kept in a separated layer in the form of a combined upper level and middle level ontology. The mapping between the wordnet nodes and the ontology concepts provides also formalised interpretation for the lexical meanings.

Proper Names are an open class, are not a proper part of a lexicon and represent objects in the reality. That is why they are described in a sepa- rate layer described by 102 semantic categories related to the ontology and partially mapped onto the wordnet.

Multiword lexical units are semantically described as wordnet graph nodes, but the wordnet do not include the description of their lexico-syntactic structure that is necessary for their recognition in text (Radziszewski 2013). Thus, the whole system is supplemented by a dictionary of structural descriptions (Kurc et al. 2012). The last element – the mapping to Princeton

WordNet – goes beyond the monolingual system, has been described in (e.g., Rudnicka et al. 2012) and will be omitted here.

As CLARIN infrastructure is meant to provide support for text processing in Humanities and Social Sciences, one can expect practically any texts deliv- ered to the system by the future users. Thus, in order to make the system of the lexical resources a feasible basis, we has to provide extensive coverage for word types and meanings. plWordNet 3.0 is built as a major exten- sion of the previous version 2.0 and it final size was provisionally set to the size approaching the size of the Polish dictionaries ever made. plWordNet 3.0 will cover at least 260 000 lexical meanings that should correspond to about 200 000 morphological base forms (entry forms) described. More- over, plWordNet has been built on the basis of the corpus as the primary knowledge source and its development was supported by a number of tools for extraction of the lexico-semantic knowledge from text. The most im- portant aspects of the plWordNet, its development process and the present state (more than 214 000 senses and 152 000 base forms) will be presented in the talk and the paper.

plWordNet is stored in a network database and is accessible via web service. It can be exported to the XML representation based on the Lexical Markup Framework representation that allows for easy integration with the existing resources of the Linked Open Data initiative.

The described system of lexical resources has been applied in, e.g., Word Sense Disambiguation (e.g. K˛edzia et al. 2014), Named Entity Recogni- tion (Marcińczuk et al., 2013) Information Extraction (Marcińczuk and Ptak, 2012) and Open Domain Question Answering (Marcińczuk et al., 2013). The resource system and constructed language tools are used in several applications for the CLARIN-PL users, e.g. for text semantic classification or iden- tification of text-to-text relations, that will be discussed in the talk.

References

[1] K˛edzia P., and Piasecki M. (2014), Ruled-based, Interlingual Motivated Mapping of plWordNet onto SUMO Ontology. Proceedings of the Ninth International Con- ference on Language Resources and Evaluation (LREC’14), 2014.

[2] K˛edzia P., Piasecki M., Koco´n J., Indyka-Piasecka A. (2014), Distributionally extended network-based Word Sense Disambiguation in semantic clustering of Pol- ish texts, FIA 2014.

[3] Kurc R., Piasecki M., Bartosz B. (2012), Constraint Based Description of Polish Multiword Expressions. W Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), e. Calzolari, Nicoletta i inni. Istanbul, Turkey : European Language Resources Association (ELRA), May. [4] Marci´nczuk M., Koco´n J. and Janicki M. (2013), Liner2 – A Customizable Frame-

work for Proper Names Recognition for Polish. In: Intelligent Tools for Building a Scientific Information Platform, Studies in Computational Intelligence Volume 467, pp 231-253.

[5] Marci´nczuk M. and Ptak M. (2012), Preliminary Study on Automatic Induction of Rules for Recognition of Semantic Relations between Proper Names in Polish Texts. In: Text, Speech and Dialogue, Lecture Notes in Computer Science. [6] Marci´nczuk M., Radziszewski A., Piasecki M., Piasecki D. and Ptak M. (2013),

Evaluation of Baseline Information Retrieval for Polish Open-domain Ques- tion Answering System. In: Recent Advances in Natural Language Processing (RANLP) 2013, pp. 428-435.

[7] Maziarz M., Piasecki M., Rudnicka E., Szpakowicz S. (2013), Beyond the transfer-and-merge wordnet construction: plWordNet and a comparison with WordNet. In: Proceedings of International Conference on Recent Advances in Natural Language Processing, pp. 443–452.

[8] Radziszewski A. (2013), Learning to lemmatise Polish noun phrases. Proceed- ings of the 51st Annual Meeting of the Association for Computational Linguis- tics (Volume 1: Long Papers). Sofia, Bulgaria : Association for Computational Linguistics, August 2013.

[9] Rudnicka E., Maziarz M., Piasecki M., Szpakowicz S. (2012), A strategy of Map- ping Polish WordNet on Princeton WordNet, Proceedings of Coling 2012, pp. 1039-48.

Literary Criticism of Cartography and Indexing of Spatial

In document Practical Applications of Language Corpora (Page 93-96)