

2.1.1 Two sides of the same problem

One key step in documenting an UL is to identify (parts of) the lexicon, a central problem addressed in this work. However, to be fully usable by linguists, language learners, ethnologists, etc., discovered lexical units in the UL need to be associated with their counterparts in the WL, and therefore with a proxy for their meaning. We are thus facing two problems:

• A segmentation problem, as we need to transform a continuous sequence of units π in the UL into words or subword units (see Figure 2.1a).

• An alignment problem, as we need to map unknown discovered units in the UL with known units in the WL (see Figure 2.1b).

It is natural to think of segmentation on the UL side as a preprocessing step, performed before aligning with the word units in the WL. This approach, depicted in Figure 2.2a, is indeed taken by many researchers in order to align comparable

¹ Mboshi, Myene, and Basaa in the BULB project.

² French, in the BULB project.

³ When the units are the result of (unsupervised) acoustic unit discovery.

2.1. WORD SEGMENTATION AND ALIGNMENT 17

m/u/r/a/m/ɓ/ɔ/ŋ/l/a/m/a/k/a/l/a → mur / amɓɔŋ / la / makala

(a) The segmentation problem: transforming a continuous sequence of units in the UL into words or subword units.

mur amɓɔŋ la makala

how does the man make donuts

(b) The alignment problem: mapping units in the UL with known units in the WL.

Figure 2.1: A first view of the word segmentation and alignment problems.

units (Section 2.6.1). Conversely, alignment between units in the UL and words in the WL can help infer a segmentation on the UL side, as depicted in Figure 2.2b, although this approach alone is less practical, for reasons explained in Section 2.6.

Lastly, segmentation and alignment can be modeled jointly, in the hope that refined segmentation information during training will inform the model's alignment decisions, while refined alignments will in turn guide it towards a better segmentation of the UL (Figure 2.2c and Section 2.6.2).

[Figure 2.2, panels (a), (b), and (c): each panel relates the unsegmented sequence m/u/r/a/m/ɓ/ɔ/ŋ/l/a/m/a/k/a/l/a, its segmentation "mur amɓɔŋ la makala", and the translation "how does the man make donuts" through SEGMENTATION and ALIGNMENT steps.]

Figure 2.2: The segmentation and alignment tasks in relationship to each other; example from Bàsàá by Fatima Hamlaoui. Segmentation can serve as a preprocessing step for alignment (a), while alignment can guide segmentation (b). Both segmentation and alignment can also be learnt jointly (c).

In the remainder of this thesis, we will refer to the unsupervised word segmentation task interchangeably as word segmentation or word discovery, and to the automatic word alignment task as word alignment. Even though our work focuses mainly on the word segmentation task and its evaluation, the entangled nature of these two problems leads us to also briefly review the literature on automatic word alignment. We now introduce both tasks more formally.

2.1.1.1 Word segmentation

The word segmentation (or discovery) task is, per se, a monolingual task that consists in identifying boundaries around word units in an unsegmented stream of symbols in a given language. Formally, it consists in defining a function that maps the sequence $\pi = \pi_1, \dots, \pi_l, \dots, \pi_L$ to a sequence $\omega = \omega_1, \dots, \omega_j, \dots, \omega_J$, where $\omega_j$, for $j \in [1, J]$, is a word. It is well known that a formal definition of the word "word" is hard to produce in many languages. This discussion would reach far beyond the scope of this work, so we will somewhat sidestep the problem by using "word" to refer either to what a linguist would call (or has called in their annotations) a word, or to a token, i.e. the output of a tokenizer for a particular language.

An equivalent definition of the word segmentation task is to associate the sequence $\pi_1, \dots, \pi_l, \dots, \pi_L$ with a sequence of binary decisions $b_1, \dots, b_l, \dots, b_{L-1}$, with each $b_l$ indicating the presence (1) or absence (0) of a word boundary after unit $\pi_l$ in the original sequence. Note that sentence boundaries are known in our scenario, and that they are also word boundaries for the first and last words in the sentence.
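The equivalence between the two views (a sequence of words versus a sequence of L−1 boundary decisions) can be sketched in a few lines of Python. This is only an illustration: the function names `boundaries_to_words` and `words_to_boundaries` are hypothetical and not part of the original work, and the example reuses the Bàsàá sequence of Figure 2.1.

```python
from typing import List


def boundaries_to_words(units: List[str], boundaries: List[int]) -> List[List[str]]:
    """Group units into words; boundaries[l] == 1 means a word boundary follows units[l]."""
    if len(boundaries) != len(units) - 1:
        raise ValueError("expected exactly one decision between each pair of adjacent units")
    words, current = [], [units[0]]
    for unit, b in zip(units[1:], boundaries):
        if b == 1:  # a boundary precedes this unit: close the current word
            words.append(current)
            current = []
        current.append(unit)
    words.append(current)  # the sentence end is a known word boundary
    return words


def words_to_boundaries(words: List[List[str]]) -> List[int]:
    """Inverse mapping: one binary decision after every unit except the last one."""
    boundaries: List[int] = []
    for word in words:
        boundaries.extend([0] * (len(word) - 1))
        boundaries.append(1)
    return boundaries[:-1]  # drop the sentence-final boundary, which is given


units = "m/u/r/a/m/ɓ/ɔ/ŋ/l/a/m/a/k/a/l/a".split("/")
decisions = [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
segmented = boundaries_to_words(units, decisions)
print(["".join(word) for word in segmented])
```

Both functions are inverses of each other on well-formed input, which is what makes the boundary-based formulation convenient for evaluation at the boundary level.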

In that respect, the word segmentation task presents strong links with unsupervised morphology learning. This is because, from an abstract point of view, morphology learning and lexical acquisition problems can be viewed as instances of the same generic task, which is to learn to segment an input stream of symbols in an unsupervised way, and to extract a minimal inventory of units, be they called words or morphemes.

Some of the background work we review in this chapter will therefore be concerned with learning morphology (learning a list of morphemes in particular) rather than a lexicon. Moreover, the question of the segmentation granularity, i.e. whether we, in effect, segment at the word or subword level (or multi-word level for that matter), will recur in this work.

Another line of research addresses the task of segmenting sentences into words, without supervision, in languages that have no overt word separator in their orthography (Chinese, Japanese, Thai, etc.). As it is formally identical to the task we have just defined, this work is also relevant to our study, although its purpose often remains distant from the language documentation goal we pursue. In particular, many studies in this line of research approach the word segmentation task with machine translation in mind: in this context, the right segmentation granularities for the source and target sides of the corpus are determined by the translation performance achieved for each particular language pair. Rather than seeking a linguistically sound decomposition of, say, the source language, researchers aim at finding the decomposition that works best when translating into a particular target language.

The word segmentation problem is tightly coupled with the design of language models, i.e. probabilistic models assigning a probability $P(w_1, \dots, w_I)$ to a sequence of words $w_1, \dots, w_I$. Without loss of generality, this probability can be rewritten, using the chain rule, as

$$P(w_1, \dots, w_I) = P(w_1) \prod_{i=2}^{I} P(w_i \mid w_1, \dots, w_{i-1}). \tag{2.1}$$
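As a hedged illustration of how the chain rule in Equation (2.1) lets a language model score competing segmentations, the sketch below sums conditional log-probabilities over a word sequence. The model used here is an assumption made purely for illustration: `TOY_UNIGRAM` is a hypothetical, unnormalized table of invented values, and `toy_cond_prob` ignores the history entirely (the simplest possible special case of the conditional in Equation (2.1)).

```python
import math
from typing import Callable, Dict, Sequence

# A conditional model: P(word | history), with history = w_1, ..., w_{i-1}.
CondProb = Callable[[str, Sequence[str]], float]


def sequence_log_prob(words: Sequence[str], cond_prob: CondProb) -> float:
    """Chain rule: log P(w_1..w_I) = sum over i of log P(w_i | w_1..w_{i-1})."""
    return sum(math.log(cond_prob(w, words[:i])) for i, w in enumerate(words))


# Hypothetical toy values, invented for this example only (not estimated from data).
TOY_UNIGRAM: Dict[str, float] = {
    "mur": 0.2, "amɓɔŋ": 0.1, "la": 0.3, "makala": 0.1, "mura": 0.01,
}


def toy_cond_prob(word: str, history: Sequence[str]) -> float:
    """Unigram assumption: the history is ignored; unseen words get a small floor."""
    return TOY_UNIGRAM.get(word, 1e-6)


candidate_a = ["mur", "amɓɔŋ", "la", "makala"]
candidate_b = ["mura", "mɓɔŋ", "la", "makala"]
for candidate in (candidate_a, candidate_b):
    print(candidate, sequence_log_prob(candidate, toy_cond_prob))
```

Under these invented values the first candidate receives the higher score; with a real model, the conditional probabilities would be estimated from data and would typically depend on the history.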

Except for some of the early approaches to word segmentation (Section 2.2), and most paradigmatic approaches (Section 2.3), language models are the theoretical backbone of word segmentation. Computational modeling of child language acquisition, for instance, has relied heavily on such models. We review various studies related to the word segmentation task in Sections 2.2, 2.3, and 2.4.

2.1.1.2 Word alignment

The automatic word alignment task, contrary to the segmentation task we just defined, is in essence a bilingual task. Informally, given a parallel corpus aligned at the sentence level, it consists in identifying links between words (or more generally "units") that are mutual translations of each other.

More formally, this can be viewed as learning a symmetrical binary relation $R$ over the sets $V_S$ and $V_T$ indexing word positions in the source and target parts of each sentence pair.⁴ For reasons that will become clearer in Sections 2.5 and 2.6,⁵ we assume here that our source sentence is the WL sequence $w$, and that the target sentence is the UL sequence $\omega$, but this could be reversed. Learning this binary relation corresponds to learning, for each sentence pair, a subset of the Cartesian product $V_S \times V_T$. A word alignment can, therefore, equivalently be represented by a simple bipartite graph,⁶ making links more explicit. Both mathematical objects can be represented as matrices, in which binary values indicate the presence or absence of a link; the search space $\mathcal{A}$, hence, will correspond to all binary matrices $A = (a_{ij})$, with $a_{ij} = 1$ if source word $w_i$ is aligned to target word $\omega_j$, and $a_{ij} = 0$ otherwise.

For computational reasons, however, the search space $\mathcal{A}$ can be restricted to integer vectors $a = a_1, \dots, a_J$, in which each $a_j \in [1, I]$ indicates the word position in the source sentence $w$ to which target word $\omega_j$ is aligned. This drastically reduces the size of the search space, from $2^{I \times J}$ to $I^J$, and likens the word alignment task to a sequence labeling task, in which word $\omega_j$, aligned to word $w_{a_j}$, is labelled $a_j$. Figure 2.3 depicts the various representations we just discussed. Other representations for alignments can also be found in the literature, especially when trying to align units of different granularities, introducing for instance the concept of spans.
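The relationship between the three representations (label sequence, binary relation, binary matrix) and the two search space sizes can be made concrete with a short sketch. The helper names below (`labels_to_links`, `links_to_matrix`, `search_space_sizes`) are hypothetical and introduced only for illustration; the example uses the label sequence of Figure 2.3, with English as the source (I = 5) and French as the target (J = 7).

```python
from typing import List, Set, Tuple


def labels_to_links(a: List[int]) -> Set[Tuple[int, int]]:
    """Turn labels a_1..a_J into links (i, j): source position i, target position j (1-indexed)."""
    return {(a_j, j) for j, a_j in enumerate(a, start=1)}


def links_to_matrix(links: Set[Tuple[int, int]], I: int, J: int) -> List[List[int]]:
    """Binary I x J matrix with entry 1 wherever source word i is linked to target word j."""
    matrix = [[0] * J for _ in range(I)]
    for i, j in links:
        matrix[i - 1][j - 1] = 1
    return matrix


def search_space_sizes(I: int, J: int) -> Tuple[int, int]:
    """Number of binary I x J matrices versus number of label vectors in [1, I]^J."""
    return 2 ** (I * J), I ** J


source = "there is nothing to say".split()
target = "il n' y a rien à dire".split()
labels = [1, 3, 2, 2, 3, 4, 5]  # a_j for j = 1..7, as listed in Figure 2.3

links = labels_to_links(labels)
for row in links_to_matrix(links, len(source), len(target)):
    print(row)
print(search_space_sizes(len(source), len(target)))
```

In the label-vector representation every column of the matrix contains exactly one link, which is the restriction that shrinks the search space; a source word may still receive several links (here "is" and "nothing" each receive two).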

As we note in Section 2.5, where we provide the theoretical foundations for statistical word alignment, the concept of word alignments emerged in the first word-based

⁴ The source and target terminology is standard in the machine translation (MT) literature. It identifies the direction of the translation, from source to target language. However, due to the use of the noisy channel model, this terminology can sometimes become confusing.

⁵ Succinctly, if $\omega$ is replaced by its unsegmented counterpart $\pi$, and since standard alignment models allow only one outbound link per target unit, this direction will better accommodate the alignment between, say, words and phonemes.

⁶ An undirected graph, without multiple edges or loops, in which every edge connects a vertex in $V_S$ to one in $V_T$.

[Figure 2.3, three panels: "Binary relation or bipartite graph", "Sequence of labels", and "Matrix".]

Source (English), positions 1 to 5: there is nothing to say

Target (French), positions 1 to 7: il n' y a rien à dire

Sequence of labels (target position → source position): {1 → 1, 2 → 3, 3 → 2, 4 → 2, 5 → 3, 6 → 4, 7 → 5}

Figure 2.3: Various representations for word-to-word alignment. (English is the source language, and French, the target language.)

models for machine translation. Alignments have subsequently been the foundation of statistical machine translation (SMT) (Lopez, 2008; Koehn, 2010), and specifically phrase-based SMT, as they allow for the extraction of relevant phrase pairs used to build translation tables. In recent years, and as neural machine translation (NMT) has gradually superseded SMT in machine translation, there has been significantly less work on word alignment; NMT indeed, in its current form, does not rely crucially on such a concept.