Storage using Lexical Transducer - Algorithms for Lexicon Implementation

Chapter 6 Algorithms for Lexicon Implementation

6.4 Storage using Lexical Transducer

Lexical transducers are a better solution for representing lexicon of a language, especially for those languages that are known to have more inflectional morphology. To build a lexical transducer in order to store lexicon a through morphological analysis of the language is needed. The purpose of morphological analysis is to define

the word stems, prefixes and suffixes as well as a set of rules – known as morphotactics. The morphotactics tell how to combine the roots, stems, prefixes and suffixes with each other to make meaningful words. For example, there are two words to represent Muslim: the ‘moslem’ ( ) and ‘mosalmaan’ (ن ). To make

antonym non-Muslim we can use prefix ‘Gayr’ ( ) with ‘moslem’ to make ‘Gayr moslem’ ( a ), but we cannot use it to make ‘Gayr mosalmaan’ (ن a ).

These rules that govern which affix can be joined with which stem are known as morphotactics.

The following subsections cover some basic definitions and simpler data structures used to introduce lexical transducers and to define a method for automatically separating stems from affixes.

6.4.1 Trie – Tree Structure

A trie, or a character tree, is one of the solutions to store a lexicon with less storage space requirement as compared to string storage using a linear list, binary search tree or hash tables. A trie is a tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.

Definition: For a given set of strings S = {A1, A2, … , AN }, where each string

Ai contains characters from the given set of alphabets A = { a, b, c, …, z }; the trie for the given set S is defined recursively as:

{

1 2 r

}

Trie(S)= Trie(S\ ), Trie(S\ ), α α ", Trie(S\ )α

where S\αj means the subset of S consisting of strings that start with αj, stripped of their initial letter αj; recursion is halted when S is empty resulting in an empty trie.

6.4.2 Finite State Automata

Definition: A deterministic finite state automata is a 5–tuple: D = 〈Σ, S, s, F, μ〉,

where:

Σ: is a finite set of alphabets S: is a finite set of states s ∈ S: is the starting state F ⊆ S: is the set of final states

μ(r, a) : S × Σ → S is a transition function, where r ∈ S, a ∈ Σ. If we consider σ ∈ Σ*_{, then μ is extended over S × Σ}*_{using induction: μ(r, ε) = r, and}

μ(r, σa) = μ(μ(r, σ), a), in case μ(r, σ) and μ(r, a) are defined, otherwise μ(r, σa) is undefined

A finite state machine with at most one transition for each symbol and state combination is a deterministic finite state automaton (DFSA).

Definition: An automaton is acyclic, when for ∀r ∈ S and ∀σ ∈ Σ+, there is

μ(r, σ) ≠ r.

The language of acyclic finite state automata is finite. By merging all equivalent sub-tries of a full trie into one, we can get acyclic DFSA.

A trie can be compressed to a minimal acyclic finite-state automata, which is also known as directed acyclic word graph (DWAG) by using algorithms (Daciuk 1998; Ciura and Deorowicz 2001). A directed acyclic graph represents the suffixes of a given string in which each edge is labeled with a character. The characters along a path from the root to a node make the substring, which the node is representing.

Definition: The deterministic finite state automata D = 〈Σ, S, s, F, μ〉 is called

minimal, for a given language L(D) ⊆ Σ*_{, when for every other deterministic finite}

state automata D' = 〈Σ, S', s', F', μ'〉 having language L(D') = L(D), there exists the inequality |S| ≤ |S'_{|, where |S| represents number of states. This means that a minimal}

DFSA has minimum number of states for the given language. For a non-empty language, it is minimal if and only if every state is reachable from the starting state, from every state a final state is reachable, and there are no different equivalent states. There always exists a unique minimal automation for a given language (Daciuk 1998).

6.4.3 Implementation of Word Insertion

The algorithm for the word insertion in a trie, presented in Appendix B, is implemented and tested for English and Urdu word lists. Insertion of the word list (92) into a trie:

(92) _a، _a،ں _a، _{a، a،ے a،ں} _{a،ی a،}_ا

Figure 6.1: A Trie for Representing Urdu Words

In the trie shown in Figure 6.1, each path from root to a leaf represents a single word and the branching in tree represents successive characters. Trie is certainly an improvement over plain string storage, but it can be observed that paths compression is possible by representing common suffixes as one path instead of many paths with the same suffix. Result, then, of course will be a graph instead of a tree. The terms

ر ک ب ا ی ی ا ں ل ڑ ک ا ی ی ا ں ی و ے ے

states and transitions from automata theory will be used instead of nodes and branches from the graph theory. In Figure 6.1, four paths [a ا – a ی –a ےa – ں ایa] can be compressed to one with one final state denoted as , instead of a simple circle , which represents an ordinary state. This simple form of DFSA has one start state and one final state.

One class of algorithms for acyclic DFSA construction is by minimization of the trie (Daciuk 1998; Ciura and Deorowicz 2001) and another class of algorithms is for directly building acyclic DFSA from the given set of strings (Mihov; Daciuk 1998). Inserting the same word list (92) results in an acyclic DFSA, which is shown in Figure 6.2. We can see that the automaton created using above algorithm is both deterministic and acyclic. It has a single start state (with no in-coming transition) and a single final state (with no out-going transition).

Figure 6.2: An acyclic DFSA for Urdu Words

The algorithm to construct the DFSA with no duplicate state is implemented. The resulting FSA is minimal, acyclic and deterministic. Although there are some algorithms available for unsorted words (Daciuk 1998), but the algorithm implemented was for sorted words, which is given in Appendix B. For minimal acyclic DFSA, we could have more than one final state. Therefore states are divided into two classes: First, the terminal final state (TFS), having no out-going transition. There is only one TFS in an automaton. Second, the intermediate final state (IFS) that can have out-going transitions as well as in-coming transitions. There can be many IFS in an automaton. So each state is stored with a flag that tell whether it is a final state or not. The minimal acyclic DFSA is shown in Figure 6.3.

Figure 6.3: A Minimal Acyclic DFSA for Urdu Words

ر _ک ب ی ی ا ں _ک _ڑ ل ی و ک ا ے ر _ک ب ی ا ں ک _ڑ ل ک ا ے ی و

6.4.4 Affix Recognition by minimal acyclic DFSA

Morphological analysis of a language could be performed automatically if we could find morphological roots, stems, prefixes and suffixes for a given language. In this section, it is demonstrated that minimal acyclic DFSA can be used for automatic stemming application. Given the word list of a language containing all morphological forms, we can find prefixes and suffixes strings starting from intermediate final state (IFS) to the terminal final state (TFS). The IFS with many incoming branches shows that this final state is shared by many words, and therefore it is a good candidate for the identification of an affix. The following algorithm is proposed for finding suffixes from a list of words having all word forms.

1: create minimal acyclic DFSA from the list of sorted words. 2: for each state in the DFSA

3: if an IFS state has many incoming branches 4: mark it as Suffix State

5: find suffix string from Suffix state to TFS

The parts of words from start state to suffix state are candidates for being stems. For finding prefixes, the same algorithm is used, but first all strings are reversed and finally each suffix found is also reversed, which results in the required prefix.

Implementation and testing of algorithm showed that although correct identification of affixes and stems is carried out but there is also noticeable false detection and therefore more work on the algorithm may be made to reduce or remove false detection. Only those prefixes and suffixes are to be retained for inclusion to lexical transducer that has moderate frequency in the list of words and at the same time, they have morphological significance.

In document DEVELOPMENT OF ALGORITHMS AND COMPUTATIONAL GRAMMAR FOR URDU (Page 103-107)