The Bantu languages have a solid, well-documented grammatical and lexical foundation. These grammars and lexicons serve as traditional language resources that support humans in creating and processing text in human language technologies today (Bosch, 2007). Halfway through the nineteenth century, interest in Bantu grammars was sparked by the work of missionaries whose primary task was to reach people in their own languages (Bosch, 2007). One of the treasures that emerged from these studies was a broad taxonomy of all the African languages, established mainly by German researchers (Bleek, 1851, 1862, 1869; Meinhof, 1932), Guthrie (1948) and the linguistics department of Oxford University, Belgian researchers (Meeussen, 1956; Meeussen and Rodegem, 1969) and others. This search for a common lexical base and reconstructed forms for all the African languages mirrored the original studies of the Indo-European languages, which attempted to find a reconstructed base for the European languages.
Towards the end of 1986 the HSRC (Human Sciences Research Council) commissioned the LEXINET investigation in order to determine the extent to which computer processing of language abroad might be relevant to South Africa, and to formulate proposals for possible local developments (Bosch, 2007; Morris, 1988). The investigation was divided into seven sub-areas, of which the so-called TEXTNET entailed the investigation into computer processing of language data.
In the ensuing report, published in 1988, it was noted that there had been very little progress in this field in South Africa at the time, especially in comparison with the pace at which NLP was developing abroad. The African WordNet Project gave new impetus to meeting the requirements for contributing to NLP, either by developing new base concepts or by producing a mapping to the Global WordNet base concepts. Significant progress has been made in these areas by the African WordNet Project (Griesel and Bosch, 2013, 2014; Madonsela et al., 2016; Mojapelo, 2016; University of South Africa, 2011, 2013, 2014). The aim of the African WordNet Project is to create a platform for WordNet development for African languages, based on existing global networks such as the English WordNet (Princeton), EuroWordNet and BalkaNet (Bosch, 2007).
Linking the African language WordNets to one another is strategic. Since much of the international work around WordNet and SUMO has been connected to interlingual indices and upper ontologies, such linking is also a goal of the Global WordNet Project (Bond et al., 2016; Vossen, 2007b). WordNets already exist for over 40 languages, and the establishment of interlingual indices and ontologies would make cross-linguistic information retrieval and question answering possible, and would significantly aid machine translation (Fellbaum and Vossen, 2012; Horák and Rambousek, 2010; Peters et al., 1998; Pianta et al., 2002).
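The role of an interlingual index can be illustrated with a minimal sketch. It assumes that each WordNet entry is reduced to a (language, lemma, concept identifier) triple, and that lemmas sharing a concept identifier are treated as cross-lingual equivalents. The concept identifiers (`c:dog`, `c:water`) and the function names are hypothetical and are not taken from the African WordNet or the Global WordNet data; the language codes are ISO 639-3 (`eng` English, `zul` isiZulu, `tsn` Setswana).

```python
from collections import defaultdict

# Toy entries: (language code, lemma, concept identifier).
# The concept identifiers are hypothetical placeholders, not real index keys.
ENTRIES = [
    ("eng", "dog", "c:dog"),
    ("zul", "inja", "c:dog"),
    ("tsn", "ntša", "c:dog"),
    ("eng", "water", "c:water"),
    ("zul", "amanzi", "c:water"),
    ("tsn", "metsi", "c:water"),
]


def build_index(entries):
    """Build two lookups: (language, lemma) -> concepts, concept -> lemmas per language."""
    by_lemma = {}
    by_concept = defaultdict(dict)
    for lang, lemma, concept in entries:
        by_lemma.setdefault((lang, lemma), set()).add(concept)
        by_concept[concept].setdefault(lang, []).append(lemma)
    return by_lemma, by_concept


def translate(lemma, source, target, by_lemma, by_concept):
    """Return target-language lemmas that share a concept with the source lemma."""
    results = []
    for concept in sorted(by_lemma.get((source, lemma), ())):
        results.extend(by_concept[concept].get(target, []))
    return results


BY_LEMMA, BY_CONCEPT = build_index(ENTRIES)
print(translate("inja", "zul", "tsn", BY_LEMMA, BY_CONCEPT))
```

Because every language is linked only to the shared concept layer, adding a further WordNet requires one mapping to the index rather than one mapping per language pair.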
In Bantu linguistics, there have been projects over the last 50 years aimed at aligning the core natural language concepts of the Bantu languages. The two main approaches were originally those of Comparative Bantu and Proto-Bantu (Fleisch, 2008). The Comparative Bantu OnLine Dictionary (CBOLD) project has taken the initial linguistic Comparative Bantu and Proto-Bantu approaches and attempted to unify and extend them (Bostoen and Bastin, 2016; Schadeberg, 2002).
The CBOLD project was initiated in 1994 by Larry Hyman and John Lowe and was aimed at producing a lexicographic database in Berkeley to support and enhance the theoretical, descriptive, and historical linguistic study of the languages of the Bantu language family. CBOLD includes a list of reconstructed Proto-Bantu roots (based on the Comparative Bantu tables of Guthrie (1948) and the Bantu Lexical Reconstruction (BLR) list of Meeussen and Rodegem (1969)), thousands of additional reconstructed regional roots called Bantu Lexical Reconstructions 2 (BLR2) (based on the current work of scholars in Tervuren and elsewhere), and reflexes of these roots for a substantial subset of more than 500 daughter languages. The Tervuren Museum’s Linguistics Sections continued this work and updated the original BLR list of Meeussen and Rodegem (1969). They combined it with the Guthrie research to produce an electronic database called BLR2, which was meant to be the follow-up to Meeussen’s original manuscript (Bostoen and Bastin, 2016; Schadeberg, 2002). A newer version of BLR2, called BLR3, was released in 2002 (Bastin et al., 2005; Schadeberg, 2002). The main enhancement from BLR2 to BLR3 was the data representation (Bostoen and
Of these roots used by BLR3, the CBOLD project has selected 10 000 BLR3 reconstructions that represent so-called main entries, of which there are 1 400. These main entries are referred to as basic reconstructed etymons. They have been further categorized by Maho (2001) to isolate all main entries that have modern reflexes in both Zone A and Zone S, as shown in Figure 3.2 (Zone S is the region containing all the Southern African Bantu languages).
These two zones are geographically maximally removed from each other, and hence it is of great significance if the same proto-form occurs in both (Maho, 2009). Such an occurrence emphasises the generality and the hierarchical importance of a concept. This selection produces 375 roots.
Maho (2001) also isolated all main entries that have modern reflexes in at least 14 zones (231 roots). The two lists produce a core collection of 407 roots.
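The two selection criteria can be sketched in code. The example below is a minimal illustration and not Maho's procedure: it assumes that each main entry is stored with the set of zones in which it has modern reflexes, and that the core collection is the union of the two lists. The record type, the function names and the four toy entries are hypothetical. Under the union reading, the reported counts imply an overlap of 375 + 231 - 407 = 199 roots that satisfy both criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MainEntry:
    root_id: str      # hypothetical identifier, not a real BLR3 index
    zones: frozenset  # zones with attested modern reflexes


def select_core(entries, min_zones=14):
    """Return (Zone A and S list, widely attested list, union of both) as sets of ids."""
    a_and_s = {e.root_id for e in entries if {"A", "S"} <= e.zones}
    wide = {e.root_id for e in entries if len(e.zones) >= min_zones}
    return a_and_s, wide, a_and_s | wide


def implied_overlap(size_a, size_b, size_union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return size_a + size_b - size_union


# Toy data only: the zone sets are invented to exercise each case.
TOY = [
    MainEntry("r1", frozenset("AS")),                # Zones A and S only
    MainEntry("r2", frozenset("ABCDEFGHJKLMNPRS")),  # satisfies both criteria
    MainEntry("r3", frozenset("BCDEFGHJKLMNPR")),    # 14 zones, without A or S
    MainEntry("r4", frozenset("ABC")),               # satisfies neither
]

a_and_s, wide, core = select_core(TOY)
print(sorted(a_and_s), sorted(wide), sorted(core))
print(implied_overlap(375, 231, 407))
```

In the toy data the two lists have two members each and share one, so the union has three members, which mirrors on a small scale how lists of 375 and 231 roots can yield 407 rather than 606.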
Concerns have been expressed about the use of a proto-language in the Bantu context, about the extent of agreement on the unity of the Bantu languages, and about the difficulty of describing the disagreements on the nature of this unity (Marten, 2006). As mentioned in Section 1.4, these concerns are primarily based on the lack of written historical records for the Bantu languages. The challenge in the last century that led to the compilation of BLR3 was the creation of lists of cognate linguistic items in the absence of written historical evidence. The scholars involved used the principles of historical linguistics and language reconstruction to find cognates that may seem unrelated on the surface because of phonological changes over time. Diachronic semantics and semantic reconstruction have received far less attention within Bantu historical linguistics (Bostoen and Bastin, 2016) than in the study of other languages. Fleisch (2008) gives a detailed historical overview and summary of the reconstruction of lexical meaning in Bantu. Unlike sound change, semantic change is not necessarily unidirectional but can be multi-directional and cyclic (Bostoen and Bastin, 2016). Bostoen (2001) gives a detailed Bantu case study involving this sort of semantic shift. He cites an example in which oil palm, palm oil, palm nut, and blood are associated. It is shown that it is difficult to determine which of these was the original meaning of the BLR3 entry and in which direction it evolved semantically (Bostoen, 2001). As mentioned above, the particular challenge is the lack of written historical records for the Bantu languages, and hence much of this semantic research remains purely theoretical.