Algorithms of Decomposition - Syllable-Based Compression

We describe four universal algorithms of decomposition into syllables (see Algorithm 2): universal left PUL, universal right PUR, universal middle-left PUML, and universal middle-right PUMR. These four algorithms form the class PU. Their inputs are a message M and a language L. Each algorithm consists of two phases. The first is an initialization common to all algorithms of the class PU; the second differs for each algorithm.

• In the initialization phase we decompose the message M into words by algorithm A. The algorithms of class PU then process each word separately.

CHAPTER 4. SYLLABLE-BASED COMPRESSION

Algorithm 2 Decomposing into syllables

input: language L = (Σ, ΣLetter, ΣDigit, ΣLower, φ, A) and message M = α1...αn, αi ∈ Σ
output: decomposition of M into syllables
decompose M into words ω1, ..., ωm by A
for i = 1, ..., m do
    let ωi = ωi1...ωik
    if ωi is a word of non-letters then
        output(ωi), continue
    end if
    for j = 1, ..., k do determine the role of ωij by function φ end for
    find maximal blocks of vowels βi1, ..., βip in word ωi
    find maximal blocks of consonants γi1, ..., γir in word ωi
    if p < 2 then
        output(ωi), continue
    end if
    if γir is a suffix of ωi then
        βip = βip · γir
        remove γir from the list of maximal blocks of consonants, r = r − 1
    end if
    if γi1 is a prefix of ωi then
        βi1 = γi1 · βi1
        remove γi1 from the list of maximal blocks of consonants and renumber the remaining blocks from 1, r = r − 1
    end if
    for j = 1, ..., r do
        let γij = γij1...γijh /* βij is the first block of vowels before γij in ωi and βi(j+1) is the first block of vowels after γij in ωi */
        if algorithm is PUL then
            βij = βij · γij
        else if algorithm is PUR then
            βi(j+1) = γij · βi(j+1)
        else if algorithm is PUMR or (algorithm is PUML and h = 1) then
            βij = βij · γij1...γij⌊h/2⌋
            βi(j+1) = γij(⌊h/2⌋+1)...γijh · βi(j+1)
        else if algorithm is PUML and h ≠ 1 then
            βij = βij · γij1...γij⌈h/2⌉
            βi(j+1) = γij(⌈h/2⌉+1)...γijh · βi(j+1)
        end if
    end for
    for j = 1, ..., p do output(βij) end for
end for

• For each word ωi consisting of letters and for each letter ωij in ωi, the function φ decides whether ωij has the role of a consonant or a vowel.

• Maximal blocks (blocks that cannot be extended) of vowels βij and maximal blocks of consonants γij are found afterwards. Blocks of more than three vowels rarely occur in natural languages, so the maximal length of a block of vowels is set to 3. For each block of vowels we must keep in memory where it begins and ends.

• The number of syllables of ωi is equal to the number of maximal blocks of vowels p. If ωi has no block of vowels or only one, the whole ωi is marked as one syllable. If ωi has at least two blocks of vowels, syllables are created by adding consonants to the blocks of vowels.

• The consonants γi1 that precede the first block of vowels in ωi are added to that block βi1. The consonants γir that follow the last block of vowels in ωi are added to that block βip.
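The common steps described in the bullets above (assigning roles, finding maximal blocks, and attaching the leading and trailing consonants) can be sketched in Python. This is only an illustrative sketch, not the implementation used in this work: the fixed set `VOWELS` is an assumed stand-in for the role function φ of an English-like language, the names `maximal_blocks` and `attach_edges` are hypothetical, and the limit of three on the length of a vowel block is not modelled.

```python
from itertools import groupby

# Assumption: a fixed vowel set stands in for the role function phi.
VOWELS = set("aeiou")


def maximal_blocks(word):
    """Split a word into maximal runs of vowels ('V') and consonants ('C')."""
    return [
        ("V" if is_vowel else "C", "".join(run))
        for is_vowel, run in groupby(word, key=lambda ch: ch.lower() in VOWELS)
    ]


def attach_edges(blocks):
    """Merge a leading consonant block into the first vowel block and a
    trailing consonant block into the last vowel block."""
    blocks = list(blocks)
    if sum(kind == "V" for kind, _ in blocks) < 2:
        # Zero or one vowel block: the whole word is a single syllable,
        # returned here as one block tagged 'V'.
        return [("V", "".join(text for _, text in blocks))]
    if blocks[-1][0] == "C":
        suffix = blocks.pop()[1]
        blocks[-1] = ("V", blocks[-1][1] + suffix)
    if blocks[0][0] == "C":
        prefix = blocks.pop(0)[1]
        blocks[0] = ("V", prefix + blocks[0][1])
    return blocks


print(maximal_blocks("priesthood"))
# [('C', 'pr'), ('V', 'ie'), ('C', 'sth'), ('V', 'oo'), ('C', 'd')]
print(attach_edges(maximal_blocks("priesthood")))
# [('V', 'prie'), ('C', 'sth'), ('V', 'ood')]
```

After this step only the consonant blocks lying between two vowel blocks remain to be distributed, which is where the individual algorithms of the class differ.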

The particular algorithms of class PU differ in how they distribute the consonants that lie between two blocks of vowels. They are named according to the way the consonants are added.

• Universal left PUL adds all consonants between blocks of vowels to the

left block.

• Universal right PUR adds all consonants between blocks of vowels to

the right block.

• Universal middle-right PUMR, in the case of 2n (even count) consonants between blocks, adds n consonants to each block. In the case of 2n + 1 (odd count) consonants between blocks, it adds n consonants to the left block and n + 1 consonants to the right block.

• Universal middle-left PUML, in the case of 2n (even count) consonants between blocks, adds n consonants to each block. In the case of 2n + 1 (odd count) consonants between blocks, it adds n + 1 consonants to the left block and n consonants to the right block. The only exception to this rule is the case of a single consonant between blocks: this consonant is added to the right block.
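The four rules above can be summarised by a single quantity: how many of the h consonants between two blocks of vowels go to the left block, with the rest going to the right block. The following Python sketch states this directly; the function names `left_share` and `split_consonants` are hypothetical and are introduced here only for illustration.

```python
def left_share(h, variant):
    """Number of the h consonants between two vowel blocks that join the left block."""
    if variant == "PUL":
        return h                # everything goes left
    if variant == "PUR":
        return 0                # everything goes right
    if variant == "PUMR":
        return h // 2           # floor(h/2): the extra consonant goes right
    if variant == "PUML":
        # ceil(h/2): the extra consonant goes left,
        # except that a single consonant goes right.
        return 0 if h == 1 else (h + 1) // 2
    raise ValueError(f"unknown variant: {variant}")


def split_consonants(block, variant):
    """Split a consonant block into its (left part, right part)."""
    k = left_share(len(block), variant)
    return block[:k], block[k:]


print(split_consonants("sth", "PUML"))  # ('st', 'h')
print(split_consonants("sth", "PUMR"))  # ('s', 'th')
print(split_consonants("r", "PUML"))    # ('', 'r')
```

Note that PUML and PUMR differ only for an odd number of consonants greater than one; for an even count, and for a single consonant, they produce the same split.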

Example 4.15

We decompose the word priesthood into syllables, using the language LEN. The blocks of vowels are (in order of appearance): ie, oo.

correct decomposition into syllables: priest-hood

universal left PUL: priesth-ood

universal right PUR: prie-sthood

universal middle-left PUML: priest-hood (correct form)

universal middle-right PUMR: pries-thood
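The example can be checked with a short, self-contained Python sketch of the per-word decomposition. It is a simplified illustration under stated assumptions rather than the implementation used in this work: the string `VOWELS` is an assumed stand-in for the role function φ of LEN, the input is assumed to be a single lower-case word, the limit on the length of vowel blocks is ignored, and the function name `decompose` is hypothetical.

```python
import re

# Assumption: a fixed vowel string stands in for the role function phi of L_EN.
VOWELS = "aeiou"


def decompose(word, variant):
    """Decompose one lower-case word into syllables by a class-PU rule."""
    # Alternating maximal runs of vowels and of non-vowels.
    runs = re.findall(rf"[{VOWELS}]+|[^{VOWELS}]+", word)
    is_vowel = [run[0] in VOWELS for run in runs]
    if sum(is_vowel) < 2:
        return [word]  # fewer than two vowel blocks: a single syllable
    syllables = []
    pending = ""  # consonants waiting to be prefixed to the next vowel block
    for run, vowel in zip(runs, is_vowel):
        if vowel:
            syllables.append(pending + run)
            pending = ""
        elif not syllables:
            pending = run  # leading consonants join the first vowel block
        else:
            h = len(run)
            k = {
                "PUL": h,
                "PUR": 0,
                "PUMR": h // 2,
                "PUML": 0 if h == 1 else (h + 1) // 2,
            }[variant]
            syllables[-1] += run[:k]
            pending = run[k:]
    syllables[-1] += pending  # trailing consonants join the last vowel block
    return syllables


for variant in ("PUL", "PUR", "PUML", "PUMR"):
    print(variant, "-".join(decompose("priesthood", variant)))
# PUL priesth-ood
# PUR prie-sthood
# PUML priest-hood
# PUMR pries-thood
```

The sketch reproduces the PUL, PUR and PUML decompositions listed in the example and also prints the PUMR result that follows from the middle-right rule.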

Chapter 5 Small Text Files Compression

Small text files were the first task in our research on syllable-based compression. We chose them because we expected syllable-based compression to handle small files better than word-based compression, and we expected this advantage to diminish as the file size increased. Specifically, we expected character-based compression to be the best for the smallest files, syllable-based compression for larger files, and word-based compression for the largest files. Finding the exact file sizes at which these transitions occur was also our priority.

The next step was to pick suitable compression methods for the implementation. We decided to pick one statistical compression method and one dictionary compression method, and therefore designed and implemented syllable-based variants of Huffman coding and LZW. In this phase we were mostly interested in whether syllable-based methods can in some cases achieve a better compression ratio than word-based methods, and we paid little attention to time or space complexity. This chapter is based on our works [1, 2, 4].

The last part of this chapter, Section 5.3.1, examines the possibility of compressing small files using the Burrows-Wheeler Transformation. It is related to the rest of the chapter only in that it also focuses on syllable-based compression of small text files. This part is based on our previous articles [13, 12].

5.1 Introduction

Text compression methods are usually optimized for large or very large text files. In practice it is usually necessary to compress (collections of) smaller files like newspaper articles, mail messages, etc.

As syllables lie somewhere between characters and words, it is reasonable to expect that syllable compression could be advantageous somewhere between character compression and word compression, that is, on medium-sized files.

Knowledge of the structure of the coded message can be very useful for the design of a successful compression method. When compressing text documents, the structure of messages depends on the language used. We can expect documents written in the same language to possess a similar structure.

The similarity of languages can be viewed from many aspects. Languages can be classified, for example, according to whether they use a fixed or free word order, or whether they have a simple or rich morphology.

Languages with rich morphology include, for example, Czech and German. In these languages a syllable is a natural element, logically somewhere between a character and a word. Words are often composed of two or more syllables.
