Simple Syntactic and Morphological Processing Can Help English Hindi Statistical Machine Translation

(1)

Simple Syntactic and Morphological Processing Can Help English-Hindi

Statistical Machine Translation

Ananthakrishnan Ramanathan,

Pushpak Bhattacharyya

Department of Computer Science

and Engineering

Indian Institute of Technology

Powai, Mumbai-400076

India

{

anand,pb

}

@cse.iitb.ac.in

Jayprasad Hegde, Ritesh M. Shah,

Sasikumar M

CDAC Mumbai (formerly NCST)

Gulmohar Cross Road No. 9

Juhu, Mumbai-400049

India

{

jjhegde,ritesh,sasi

}

@cdacmumbai.in

Abstract

In this paper, we report our work on incor-porating syntactic and morphological infor-mation for English to Hindi statistical ma-chine translation. Two simple and compu-tationally inexpensive ideas have proven to be surprisingly effective: (i) reordering the English source sentence as per Hindi syntax, and (ii) using the sufﬁxes of Hindi words. The former is done by applying simple trans-formation rules on the English parse tree. The latter, by using a simple sufﬁx separa-tion program. With only a small amount of bilingual training data and limited tools for Hindi, we achieve reasonable performance and substantial improvements over the base-line phrase-based system. Our approach es-chews the use of parsing or other sophisti-cated linguistic tools for the target language (Hindi) making it a useful framework for statistical machine translation from English to Indian languages in general, since such tools are not widely available for Indian lan-guages currently.

1 Introduction

Techniques for leveraging syntactic and morpholog-ical information for statistmorpholog-ical machine translation (SMT) are receiving a fair amount of attention nowa-days. For SMT from English to Indian languages, these techniques are especially important for the fol-lowing three reasons: (i) Indian languages differ widely from English in terms of word-order; (ii) In-dian languages are morphologically quite rich; and

(iii) large amounts of parallel corpora are not avail-able for these languages, though smaller amounts of text in specific domains (such as health, tourism, and agriculture) are now becoming accessible. It might therefore be expected that using syntactic and mor-phological information for English to Indian lan-guage SMT will prove highly beneficial in terms of achieving reasonable performance out of limited parallel corpora. However, the difficulty in this is that crucial tools, such as parsers and morphological analyzers, are not widely available for Indian lan-guages yet.

In this paper, we present our work on incorporat-ing syntactic and morphological information for En-glish to Hindi SMT. Our approach, which eschews the use of parsing and other tools for Hindi, is two-pronged:

1. Incorporating syntactic information by com-bining phrase-based models with a set of struc-tural preprocessing rules on English

2. Incorporating morphological information by using a simple sufﬁx separation program for Hindi, the likes of which can be created with limited effort for other Indian languages as well

Signiﬁcant improvements over the baseline phrase-based SMT system are obtained using our approach. Table 1 illustrates this with an example1. Since only limited linguistic effort and tools are required for the target language, we believe that the framework we propose is suitable for SMT from En-glish to other Indian languages as well.

1

(2)

input For a celestial trip of the scientiﬁc kind, visit the planetarium.

reference

e e, ae।

vaigyaanika tariike ke eka divya saira ke lie, taaraamandala aaem

scientiﬁc kind of a celestial trip for, planetarium visit (come)

baseline

pr , ।

ke svargiiya yaatraa ke vaigyaanika prakaara, kaa taaraagruha hai

of celestial trip of scientiﬁc kind, of planetarium is

baseline+syn

pr e, ।

vaigyaanika prakaara ke svargiiya yaatraa ke lie, taaraagruha hai

scientiﬁc kind of celestial trip for, planetarium is

baseline+syn+morph

pr e, ।

vaigyaanika prakaara ke svargiiya yaatraa ke lie, taaraagruha dekhem

scientiﬁc kind of celestial trip for, planetarium visit (see)

Table 1: Effects of Syntactic and Morphological Processing (reference: human reference translation;

baseline: phrase-based system;syn: with syntactic information;morph: with morphological information)

The rest of this paper is organized as follows: Sec-tion 2 outlines related work. SecSec-tion 3 describes our approach – ﬁrst, the phrase-based baseline system is sketched brieﬂy, leading up to the techniques used for incorporating syntactic and morphological infor-mation within this system. Experimental results are discussed in section 4. Section 5 concludes the pa-per with some directions for future work.

2 Related Work

Statistical translation models have evolved from the word-based models originally proposed by Brown et al. (1990) to syntax-based and phrase-based tech-niques.

The beginnings of phrase-based translation can be seen in the alignment template model introduced by Och et al. (1999). A joint probability model for phrase translation was proposed by Marcu and Wong (2002). Koehn et al. (2003) propose certain heuristics to extract phrases that are consistent with bidirectional word-alignments generated by the IBM models (Brown et al., 1990). Phrases extracted us-ing these heuristics are also shown to perform bet-ter than syntactically motivated phrases, the joint model, and IBM model 4 (Koehn et al., 2003).

Syntax-based models use parse-tree representa-tions of the sentences in the training data to learn, among other things, tree transformation probabili-ties. These methods require a parser for the target language and, in some cases, the source language

too. Yamada and Knight (2001) propose a model that transforms target language parse trees to source language strings by applying reordering, insertion, and translation operations at each node of the tree. Graehl and Knight (2004) and Melamed (2004), pro-pose methods based on tree-to-tree mappings. Ima-mura et al. (2005) present a similar method that achieves signiﬁcant improvements over a phrase-based baseline model for Japanese-English transla-tion.

Recently, various preprocessing approaches have been proposed for handling syntax within SMT. These algorithms attempt to reconcile the word-order differences between the source and target lan-guage sentences by reordering the source lanlan-guage data prior to the SMT training and decoding cy-cles. Nießen and Ney (2004) propose some restruc-turing steps for German-English SMT. Popovic and Ney (2006) report the use of simple local trans-formation rules for Spanish-English and Serbian-English translation. Collins et al. (2006) propose German clause restructuring to improve German-English SMT.

The use of morphological information for SMT has been reported in (Nießen and Ney, 2004) and (Popovic and Ney, 2006). The detailed experi-ments by Nießen and Ney (2004) show that the use of morpho-syntactic information drastically reduces the need for bilingual training data.

(3)

pro-poses factored translation models that combine fea-ture functions to handle syntactic, morphological, and other linguistic information in a log-linear model.

Our work uses a preprocessing approach for in-corporating syntactic information within a phrase-based SMT system. For incorporating morphology, we use a simple sufﬁx removal program for Hindi and a morphological analyzer for English. These as-pects are described in detail in the next section.

3 Syntactic & Morphological Information

for English-Hindi SMT

3.1 Phrase-Based SMT: the Baseline

Given a source sentencef, SMT chooses as its trans-lationˆe, which is the sentence with the highest prob-ability:

ˆ

e=arg max

e p(e|f)

According to Bayes’ decision rule, this is written as:

ˆ

e=arg max

e p(e)p(f|e)

The phrase-based model that we use as our base-line system (deﬁned by Koehn et al. (2003)) com-putes the translation modelp(f|e)by using a phrase translation probability distribution. The decoding process works by segmenting the input sentencef into a sequence ofI phrasesfI₁. A uniform proba-bility distribution over all possible segmentations is assumed. Each phrase f_i is translated into a target language phrase e_i with probability φ(f_i|e_i). Re-ordering is penalized according to a simple exponen-tial distortion model.

The phrase translation table is learnt in the fol-lowing manner: The parallel corpus is word-aligned bidirectionally, and using various heuristics (see (Koehn et al., 2003) for details) phrase correspon-dences are established. Given the set of collected phrase pairs, the phrase translation probability is cal-culated by relative frequency:

φ(f|e) = count(f, e)

fcount(f, e)

Lexical weighting, which measures how well words within phrase pairs translate to each other, validates the phrase translation, and addresses the problem of data sparsity.

The language modelp(e)used in our baseline sys-tem is a trigram model with modiﬁed Kneser-Ney smoothing (Chen and Goodman, 1998).

The weights for the various components of the model (phrase translation model, language model, distortion model etc.) are set by minimum error rate training (Och, 2003).

3.2 Syntactic Information

As mentioned in section 2, phrase-based models have emerged as the most successful method for SMT. These models, however, do not handle syntax in a natural way. Reordering of phrases during trans-lation is typically managed by distortion models, which have proved not entirely satisfactory (Collins et al., 2006), especially for language pairs that differ a lot in terms of word-order. We use a preprocess-ing approach to get over this problem, by reorderpreprocess-ing the English sentences in the training and test corpora before the SMT system kicks in. This reduces, and often eliminates, the ‘distortion load’ on the phrase-based system.

The reordering rules that we use for prepro-cessing can be broadly described by the following transformation rule going from English to Hindi word order (Rao et al, 2000):

SSmV VmOOmCm→CmSmSOmOVmV

X_{: Corresponding constituent in Hindi,} whereX isS,O, orV

Xm: modiﬁer ofX

Essentially, the SVO order of English is changed to SOV order, and post-modiﬁers are converted to pre-modiﬁers. Our preprocessing module effects this by parsing the input English sentence2and

ap-2_{Dan Bikel’s parser was used for parsing}

(4)

structural transformation

morph analysis (English) Giza++

alignment correction

phrase extraction suffix separation

(Hindi)

decoder

suffix separation (Hindi)

Figure 1:Syntactic and Morphological Processing: Schematic

plying a handful of reordering rules on the parse tree. Table 2 illustrates this with an example.

3.3 Morphological Information

If an SMT system considers different morphologi-cal forms of a word as independent entities, a cru-cial source of information is neglected. It is con-ceivable that with the use of morphological informa-tion, especially for morphologically rich languages, the requirement for training data might be much re-duced. This is indicated, for example, in recent work on German-English statistical MT with limited bilingual training data (Nießen and Ney, 2004), and also in other applications such as statistical part-of-speech tagging of Hindi (Gupta et al., 2006).

The separation of morphological suffixes con-flates various forms of a word, which results in higher counts for both words and suffixes, thereby

countering the problem of data sparsity. As an exam-ple, assume that the following sentence pair is part of the bilingual training corpus:

English: Players should just play.

Hindi:

e।

khilaadiyom ko kevala khelanaa caahie

Hindi (sufﬁx separated): i e।

khilaada iyom ko kevala khela naa caahie

Now, consider the input sentence, “The men came across some players,” which should be translated as

“a ! ” (aadmiyom ko

(5)

English

amariikaa ke raashtrapati ne juuna mem bhaarata kii yaatraa kii

Table 2:English and Hindi Word-Order

a ae a a e

Table 3:Hindi Sufﬁx List

evidence from the above sentence pair in the train-ing corpus). Also, the general relationship between the oblique case (indicated by the sufﬁxi(iyom)) and the case marker (ko) is not learnt, but only the speciﬁc relationship between ( khi-laadiyom) and (ko). This indicates the necessity of using morphological information for languages such as Hindi.

To incorporate morphological information, we use a morphological analyzer (Minnen et al., 2001) for English, and a simple suffix separation program for Hindi. The suffix separation program is based on the Hindi stemmer presented in (Ananthakrish-nan and Rao, 2003), and works by separating from each word the longest possible suffix from table 3. A detailed analysis of noun, adjective, and verb inflec-tions that were used to create this list can be found in (McGregor, 1977) and (Rao, 1996). A few examples of each type are given below:

Noun Inﬂections: Nouns in Hindi are inﬂected based on the case (direct or oblique), the number (singular or plural), and the gender (masculine or

feminine3). For example, (ladakaa - boy) becomes (ladake) when in oblique case, and the plural (ladake - boys) becomes (ladakom). The feminine noun (ladakii- girl) is inﬂected as +(ladakiyaam- plural direct) and (ladakiyom- plural oblique), but it re-mains uninﬂected in the singular direct case.

Adjective Inflections: Adjectives which end in a(aa) ora(aam) in their direct singular mascu-line form agree with the noun in gender, number, and case. For example, the singular directa,!(accha) is inflected asa,! (acche) in all other masculine forms, and asa,!(acchii) in all feminine forms. Other adjectives are not inflected.

Verb Inflections: Hindi verbs are inflected based on gender, number, person, tense, aspect, modality, formality, and voice. (Rao, 1996) provides a com-plete list of verb inflection rules.

The overall process used for incorporating syn-tactic and morphological information, as described in this section, is shown in ﬁgure 1.

3

(6)

Technique Evaluation Metric

BLEU mWER SSER roughly understandable+ understandable+

baseline 12.10 77.49 91.20 10% 0%

baseline+syn 16.90 69.18 74.40 42% 12%

baseline+syn+morph 15.88 70.69 66.40 46% 28%

Table 4:Evaluation Results(baseline: phrase-based system;syn: with syntactic information;morph: with morphological information)

4 Experimental Results

The corpus described in the table below was used for the experiments.

#sentences #words

Training 5000 120,153

Development 483 11,675

Test 400 8557

Monolingual (Hindi) 49,937 1,123,966

The baseline system was implemented by training the phrase-based system described in section 3 on the 5000 sentence training corpus.

For the Hindi language model, we compared var-ious n-gram models, and found trigram models with modiﬁed Kneser-Ney smoothing to be the best per-forming (Chen and Goodman, 1998). One language model was learnt from the Hindi part of the 5000 sentence training corpus. The larger monolingual Hindi corpus was used to learn another language model. The SRILM toolkit4 was used for the lan-guage modeling experiments.

The development corpus was used to set weights for the language models, the distortion model, the phrase translation model etc. using minimum er-ror rate training. Decoding was performed using Pharaoh5_.

fnTBL (Ngai and Florian, 2001) was used to POS tag the English corpus, and Bikel’s parser was used for parsing. The reordering program was written us-ing the perl module Parse::RecDescent.

We evaluated the various techniques on the fol-lowing criteria. For the objective criteria (BLEU and mWER), two reference translations per sentence were used.

• BLEU (Papineni et al., 2001): This measures

4

http://www.speech.sri.com/projects/srilm/

5

http://www.isi.edu/licensed-sw/pharaoh/

the precision of n-grams with respect to the ref-erence translations, with a brevity penalty. A higher BLEU score indicates better translation.

• mWER (multi-reference word error rate) (Nießen et al., 2000): This measures the edit distance with the most similar refer-ence translation. Thus, a lower mWER score is desirable.

• SSER(subjective sentence error rate) (Nießen et al., 2000): This is calculated using human judgements. Each sentence was judged by a hu-man evaluator on the following ﬁve-point scale, and the SSER was calculated as described in (Nießen et al., 2000).

0 Nonsense

1 Roughly understandable 2 Understandable

3 Good 4 Perfect

Again, the lower the SSER, the better the trans-lation.

(7)

An Example: Consider, again, the example in table 1. The word-order in the baseline translation is woeful, while the translations after syntactic pre-processing (baseline+syn and baseline+syn+morph) follow the correct Hindi order (compare with the ref-erence translation). The effect of sufﬁx separation can be seen from the verb form ((dekhem) – visit or see) in the last translation (baseline+syn+morph). The reason for this is that the pair “visit→” is not available to be learnt from the original and the syntactically preprocessed corpora, but the follow-ing pairs are: (i) to visit → (ii) worth visit-ing→ -, and (iii) can visit→ . Thus, the baseline and baseline+syn models are not able to produce the correct verb form for “visit”. On the other hand, the baseline+syn+morph model, due to the sufﬁx separation process, combines (dekha) and e(em) from different mappings in the aligned corpus, e.g., “visit +ing→ ” and “sing

→ e”, to get the right translation for visit () in this context.

5 Conclusion

We have presented in this paper an effective frame-work for English-Hindi phrase-based SMT. The re-sults demonstrate that signiﬁcant improvements are possible through the use of relatively simple tech-niques for incorporating syntactic and morphologi-cal information.

Since all Indian languages follow SOV order, and are relatively rich in terms of morphology, the framework presented should be applicable to En-glish to Indian language SMT in general. Given that morphological and parsing tools are not yet widely available for Indian languages, an approach like ours which minimizes use of such tools for the target lan-guage would be quite desirable.

In future work, we propose to experiment with a more sophisticated morphological analyzer. As more parallel corpora become available, we also in-tend to measure the effects of using morphology on corpora requirements. Finally, a formal evaluation of these techniques for other Indian languages (es-pecially Dravidian languages such as Tamil) would be interesting.

Acknowledgements

We are grateful to Sachin Anklekar and Saurabh Kushwaha for their assistance with the tedious task of collecting and preprocessing the corpora.

References

Ananthakrishnan Ramanathan and Durgesh Rao, A Lightweight Stemmer for Hindi, Workshop on Computational Linguistics for South-Asian Languages, EACL, 2003.

Peter F. Brown, John Cocke, Stephen Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin, A Statistical Approach to Machine Translation, Computational Linguistics, 16(2), pages 79–85, June 1990.

Stanley F. Chen and Joshua T. Goodman, An Empir-ical Study of Smoothing Techniques for Lan-guage Modeling, Technical Report TR-10-98, Computer Science Group, Harvard University, 1998.

Michael Collins, Philipp Koehn, and Ivona Kucerova, Clause Restructuring for Statistical Machine Translation, Proceedings of ACL, pages 531–540, 2006.

Jonathan Graehl and Kevin Knight, Training Tree Transducers, Proceedings of HLT-NAACL, 2004.

Kuhoo Gupta, Manish Shrivastava, Smriti Singh, and Pushpak Bhattacharyya, Morphological Richness Offsets Resource Poverty – an Expe-rience in Builing a POS Tagger for Hindi, Pro-ceedings of ACL-COLING, 2006.

Kenji Imamura, Hideo Okuma, Eiichiro Sumita, Practical Approach to Syntax-based Statisti-cal Machine Translation, Proceedings of MT-SUMMIT X, 2005.

Philipp Koehn and Hieu Hoang, Factored Transla-tion Models, Proceedings of EMNLP, 2007.

(8)

Daniel Marcu and William Wong, A Phrase-based Joint Probability Model for Statistical Machine Translation,Proceedings of EMNLP, 2002.

R. S. McGregor,Outline of Hindi Grammar, Oxford University Press, Delhi, India, 1974.

I. Dan Melamed, Statistical Machine Translation by Parsing, Proceedings of ACL, 2004.

Guido Minnen, John Carroll, and Darren Pearce, Applied Morphological Processing of English,

Natural Language Engineering, 7(3), 207–223, 2001.

G. Ngai and R. Florian, Transformation-based Learning in the Fast Lane, Proceedings of NAACL, 2001.

Sonja Nießen, Franz Josef Och, Gregor Leusch, and Hermann Ney, An Evaluation Tool for Ma-chine Translation: Fast Evaluation for MT Re-search, International Conference on Language Resources and Evaluation, pages 39–45, 2000.

Sonja Nießen and Hermann Ney, Statistical Ma-chine Translation with Scarce Resources Using Morpho-syntactic Information, Computational Linguistics, 30(2), pages 181–204, 2004.

Franz Josef Och, Christoph Tillman, and Hermann Ney, Improved Alignment Models for Sta-tistical Machine Translation, Proceedings of EMNLP, pages 20–28, 1999.

Franz Josef Och, Minimum Error Rate Training in Statistical Machine Translation,Proceedings of ACL, 2003.

Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu, BLEU: a Method for Automatic Evaluation of Machine Translation, IBM Re-search Report, Thomas J. Watson Research Center, 2001.

Maja Popovic and Hermann Ney, Statistical Ma-chine Translation with a Small Amount of Bilingual Training Data, 5th LREC SALTMIL Workshop on Minority Languages, pages 25– 29, 2006.

Durgesh Rao, Natural Language Generation for En-glish to Hindi Human-Aided Machine Transla-tion of News Stories, Master’s Thesis, Indian Institute of Technology, Bombay, 1996.

Durgesh Rao, Kavitha Mohanraj, Jayprasad Hegde, Vivek Mehta, and Parag Mahadane, A Prac-tical Framework for Syntactic Transfer of Compound-Complex Sentences for English-Hindi Machine Translation, Proceedings of KBCS, 2000.