manual word alignment

Top PDF manual word alignment:

Research on Deep Learning HMM Word Alignment

The experimental data consist of monolingual data, a bilingual parallel corpus, and bilingual sentences with manual word alignment. All monolingual data were collected from the Internet. After de-duplication and segmentation, the English portion contains about 1.1 billion sentences and the Chinese portion about 0.3 billion sentences. These monolingual texts are used to train the low-dimensional word vectors in the model and as features for some of the reference systems. The bilingual parallel corpus comprises all bilingual corpora from the NIST08 machine translation evaluation plus bilingual data mined from the Internet; after de-duplication it contains about 260 million sentence pairs. The bilingual parallel corpus is used to train the context-based neural network word alignment model, as well as the word alignment models of the other reference systems. We use the manually word-aligned data of Haghighi [13], which contain 491 bilingual sentence pairs. Because of differences in Chinese word segmentation tools, we apply heuristic rules to convert these data to our segmentation standard. In addition to these 491 sentence pairs, which serve as the evaluation data, we also manually labeled 600 sentence pairs from the FBIS data set (LDC2003E14). These 600 sentence pairs serve as the development set and are used to tune the context models and the various parameters of the baseline system.

Semi-supervised Word Alignment with Mechanical Turk

Word alignment is used in various natural language processing tasks. Most state-of-the-art statistical machine translation systems rely on word alignment as a preprocessing step. The quality of word alignment is usually measured by AER, which is loosely related to BLEU score (Lopez and Resnik, 2006). There has been research on utilizing manually aligned corpora to assist automatic word alignment, obtaining encouraging results on alignment error rate (Callison-Burch et al., 2004; Blunsom and Cohn, 2006; Fraser and Marcu, 2006; Niehues and Vogel, 2008; Taskar et al., 2005; Liu et al., 2005; Moore, 2005). However, obtaining a large amount of good-quality alignments is problematic: labeling word-aligned parallel corpora requires a significant amount of labor. In this paper we explore the possibility of using Amazon Mechanical Turk (MTurk) to obtain manual word alignment faster and cheaper, while maintaining high quality.
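
Since several of these excerpts rely on AER without stating it, the standard definition (Och and Ney, 2003) is worth recalling; here A is the predicted alignment and S and P are the sure and possible links of the gold annotation (with S a subset of P) — the notation is chosen here for illustration:

```latex
\mathrm{AER}(A; S, P) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}
```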

Word Alignment with Synonym Regularization

For an empirical evaluation of the proposed method, we used a bilingual parallel corpus of English-French Hansards (Mihalcea and Pedersen, 2003). The corpus consists of over 1 million sentence pairs, which include 447 manually word-aligned sentences. We selected 100 sentence pairs randomly from the manually word-aligned sentences as development data for tuning the regularization weight ζ, and used the 347 remaining sentence pairs as evaluation data. We also randomly selected 10k, 50k, and 100k sentence pairs from the corpus as additional training data. We ran the unsupervised training of our proposed word alignment model on the additional training data and the 347 sentence pairs of the evaluation data. Note that the manual word alignment of the 347 sentence pairs was not used for the unsupervised training. After the unsupervised training, we evaluated the word alignment performance of our proposed method by comparing the manual word alignment of the 347 sentence pairs with the prediction provided by the trained model.

Are ACT’s Scores Increasing with Better Translation Quality?

This paper gives a detailed description of the ACT (Accuracy of Connective Translation) metric, a reference-based metric that assesses only connective translations. ACT relies on automatic word-level alignment (using GIZA++) between a source sentence and, respectively, the reference and candidate translations, along with other heuristics for comparing translations of discourse connectives. Using a dictionary of equivalents, the translations are scored automatically or, for more accuracy, semi-automatically. The accuracy of the ACT metric was assessed by human judges on sample data for English/French, English/Arabic, English/Italian and English/German translations; the ACT scores are within 2-5% of human scores.
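
As an illustration only (not the authors' implementation), the core comparison that the excerpt describes might look like the sketch below: the word alignments locate how a source connective was rendered in the reference and in the candidate, and a dictionary of equivalents decides whether the two renderings count as the same sense. All data structures, names, and the three-way judgement are assumptions made for this sketch.

```python
def act_judgement(conn_idx, src2ref, src2cand, ref_tokens, cand_tokens, equivalents):
    """Classify one connective translation as 'correct', 'different sense', or 'missing'."""
    ref = ref_tokens[src2ref[conn_idx]] if conn_idx in src2ref else None
    cand = cand_tokens[src2cand[conn_idx]] if conn_idx in src2cand else None
    if cand is None:
        return "missing"                         # connective dropped in the candidate
    if ref is not None and cand in equivalents.get(ref, {ref}):
        return "correct"                         # same rendering or a listed equivalent
    return "different sense"

# Toy example: English "while" (index 1) aligned to "tandis que" in the reference and
# to "alors que" in the candidate, which the (hypothetical) dictionary lists as equivalent.
equivalents = {"tandis que": {"tandis que", "alors que"}}
print(act_judgement(1, {1: 0}, {1: 2},
                    ["tandis que", "..."], ["...", "...", "alors que"], equivalents))
```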

Word Order Typology through Multilingual Word Alignment

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 92–97, Sofia, Bulgaria, August. Association for Computational Linguistics. Coşkun Mermer and Murat Saraçlar. 2011. Bayesian word alignment for statistical machine translation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 182–187, Stroudsburg, PA, USA. Association for Computational Linguistics.

Word Alignment Combination over Multiple Word Segmentation

Instead of time-consuming segmentation optimization based on alignment, or postponing segmentation combination until the SMT decoding phase, we try to combine word alignments over multiple monolingually motivated word segmentations for the Chinese-English pair, in order to improve word alignment quality and translation performance for all segmentations. We introduce a tabular structure called a word segmentation network (WSN for short) to encode multiple segmentations of a Chinese sentence, and define skeleton links (SL for short) between spans of the WSN and words of the English sentence. The confidence score of an SL is defined over multiple segmentations. Our combination algorithm picks up potential SLs based on their confidence scores, similar to Xiang et al. (2010), and then projects each selected SL to a link in each segmentation respectively. Our algorithm is simple, efficient, easy to implement, and can effectively improve word alignment quality on all segmentations simultaneously, and alignment errors caused …
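
A rough sketch of the combination idea this excerpt describes, under assumed definitions: a candidate link between a Chinese character span and an English word is scored by how many segmentations' alignments support it, confident links are kept as skeleton links, and each kept link is projected onto the words of every segmentation that overlap its span. The voting-based confidence and the overlap-based projection are illustrative choices, not the paper's exact formulas.

```python
from collections import Counter

def skeleton_links(seg_alignments, threshold=0.5):
    """seg_alignments: one set per segmentation of ((c_start, c_end), en_idx) links,
    with Chinese positions given as character offsets so segmentations are comparable."""
    votes = Counter(link for links in seg_alignments for link in links)
    return {link for link, n in votes.items() if n / len(seg_alignments) >= threshold}

def project(link, word_spans):
    """Map one skeleton link onto a segmentation, given its words' character spans."""
    (c_start, c_end), en_idx = link
    return {(w, en_idx) for w, (s, e) in enumerate(word_spans) if s < c_end and c_start < e}
```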

Exact Maximum Inference for the Fertility Hidden Markov Model

The results in Table 1 compare several different algorithms on this same data. The first line is a baseline HMM using exact posterior computation and inference with the standard dynamic programming algorithms. The next line shows the fertility HMM with approximate posterior computation from Gibbs sampling but with the final alignment selected by the Viterbi algorithm. Clearly fertility modeling is improving alignment quality. The prior work compared Viterbi with a form of local search (sampling repeatedly and keeping the max), finding little difference between the two (Zhao and Gildea, 2010). Here, however, the difference between dual decomposition and Viterbi is significant: their results were likely due to search error.

AN OVERVIEW OF TECHNOLOGY EVOLUTION: INVESTIGATING THE FACTORS INFLUENCING NON BITCOINS USERS TO ADOPT BITCOINS AS ONLINE PAYMENT TRANSACTION METHOD

Current Phrase-Based SMT (PBSMT) systems [7] use the GIZA++ tool to produce word alignments, running the tool in both directions, source to target and target to source [8]. Then complex heuristics are applied to obtain a symmetrized alignment. For instance, the grow-diag-final method starts from the intersection of the two word alignments and enhances it with links from their union. We do not discuss this situation further in this paper, which mainly surveys machine translation. The size of the phrase table is important for word alignment, because a smaller phrase table is more precise than a large one. Besides this, machine translation performance does not critically depend on the size of the phrase table, since at least half of the phrases can be removed without any loss in quality. The quality of phrase extraction plays a vital role in the overall translation quality. In the current phrase extraction algorithm, phrases are injected directly into the phrase table as they are extracted from the word alignment. In other words, word alignment affects machine translation quality through the phrase table.
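
To make the grow-diag-final heuristic mentioned above concrete, here is a minimal sketch of the grow-diag part: start from the intersection of the two directional alignments and repeatedly add union links that neighbour an existing link and touch a still-unaligned word. The "final" step, which adds remaining union links for words left unaligned, is omitted; this is an illustration, not the Moses/GIZA++ code.

```python
NEIGHBOURS = [(-1, 0), (1, 0), (0, -1), (0, 1),        # horizontal / vertical
              (-1, -1), (-1, 1), (1, -1), (1, 1)]      # diagonal

def grow_diag(src2tgt, tgt2src):
    """src2tgt, tgt2src: sets of (src_idx, tgt_idx) links from the two directional runs."""
    alignment = src2tgt & tgt2src                      # start from the intersection
    union = src2tgt | tgt2src
    grew = True
    while grew:
        grew = False
        for s, t in sorted(union - alignment):
            near = any((s + ds, t + dt) in alignment for ds, dt in NEIGHBOURS)
            s_free = s not in {a for a, _ in alignment}
            t_free = t not in {b for _, b in alignment}
            if near and (s_free or t_free):            # grow only where a word is unaligned
                alignment.add((s, t))
                grew = True
    return alignment

# Example: keep the consensus links and grow with union links that touch unaligned words.
print(sorted(grow_diag({(0, 0), (1, 1), (2, 1)}, {(0, 0), (1, 1), (2, 2)})))
```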

Improving Word Alignment of Rare Words with Word Embeddings

… can result in better alignments. Bilingual word embedding models like (Zou et al., 2013; Søgaard et al., 2015) train vectors for words in both languages in the same vector space. Hence, a translation model can be built from these embeddings, which can be very useful for the task of word alignment. Despite the advantages of bilingual embedding models, achieving good performance with them and building more informative vector representations for words requires a large amount of data, which is not available for low-resource language pairs. On the other hand, providing a good monolingual corpus for most languages takes less effort, so building a model that mostly uses monolingual data is more reasonable.
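
A minimal sketch, under the assumption of pre-trained source and target vectors living in one shared space, of how such embeddings can serve as a soft translation model for alignment: each candidate link is scored by the cosine similarity of its two word vectors. Names and the scoring choice are illustrative, not the paper's.

```python
import numpy as np

def translation_score(src_word, tgt_word, src_vecs, tgt_vecs):
    """src_vecs / tgt_vecs: dicts mapping words to vectors trained in the same space."""
    u, v = src_vecs[src_word], tgt_vecs[tgt_word]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```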

Regularizing Mono and Bi Word Models for Word Alignment

Such models are known to have a weakness called garbage collection. This refers to the phenomenon that rarely occurring source words tend to align to a significant portion of the target words in the respective sentences, since the probability mass of the frequent words is better used to explain the sentences without the rare words. The effect is known to worsen when one moves beyond single-word-based models (DeNero et al., 2006).


Data Cleaning for Word Alignment

Firstly, the posterior-based approach (Liang, 06) looks at the posterior probability and partially delays the alignment decision. However, this approach does not extend beyond 1:n uni-directional mappings in its word alignment. Secondly, the aforementioned phrase alignment (Marcu and Wong, 02) considers n:m mappings directly, generated bilingually by concepts without word alignment. However, this approach has severe computational complexity problems. Thirdly, linguistically motivated phrases, such as those produced by a tree aligner (Tinsley et al., 06), provide n:m mappings using information from parsing results. However, as that approach runs somewhat in the reverse direction to ours, we omit it from the discussion. Hence, this paper seeks methods that differ from those approaches and whose computational cost is low.

Discriminative Word Alignment with a Function Word Reordering Model

In this paper, we introduce a new approach to improving the modeling of reordering in alignment. Instead of relying on monolingual parses, we condition our reordering model on the behavior of function words and the phrases that surround them. Function words are the "syntactic glue" of sentences, and in fact many syntacticians believe that functional categories, as opposed to substantive categories like noun and verb, are primarily responsible for cross-language syntactic variation (Ouhalla, 1991). Our reordering model can be seen as offering a reasonable approximation to more fully elaborated bilingual syntactic modeling, and this approximation is also highly practical, as it demands no external knowledge (other than a list of function words) and avoids the practical issues associated with the use of monolingual parses, e.g. whether the monolingual parser is robust enough to produce reliable output for every sentence in the training data.

Improving Word Alignment by Adjusting Chinese Word Segmentation

In the aligning phase, the original IBM Model 1 does not work as we expected, because English words prefer to link to single characters, with the result that some correct Chinese translations are not linked. The reason is that the probability of a morpheme, say p(教育|education), is always less than that of its substring, p(教|education), since wherever 教育 occurs, 教 and 育 also occur, but not vice versa. So the aligning result will be 教/Education and 署/Department, and 育 is abandoned. To overcome this problem, an alignment constraint is imposed on the model to ensure that the aligning result covers every Chinese character of a target word and that there are no overlapping characters in the resulting morpheme sequence. For instance, neither 教/Education 署/Department nor 教育/Education 育署/Department is an allowed alignment sequence. The constraint is applied to each possible aligning result; if an alignment violates the constraint, it is rejected.
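
A small sketch of this coverage constraint, under the simplifying assumption that the aligned Chinese morphemes are checked in surface order: their concatenation must reproduce the target word exactly, so both missing characters and overlaps are rejected. Function and variable names are illustrative, not from the paper.

```python
def satisfies_constraint(target_word, aligned_morphemes):
    """True iff the morphemes cover every character of the word exactly once, in order."""
    return "".join(aligned_morphemes) == target_word

print(satisfies_constraint("教育署", ["教育", "署"]))    # True:  教育/Education 署/Department
print(satisfies_constraint("教育署", ["教", "署"]))      # False: 育 is left uncovered
print(satisfies_constraint("教育署", ["教育", "育署"]))  # False: 育 is covered twice
```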

Using Senses in HMM Word Alignment

In this model, two senses (synsets) are functionally equivalent if the list of words that have them in their sense list is the same for both senses. That is to say, if the partial counts that would be added to either of the senses are the same, there is no way of distinguishing between the two senses under this model. For example, in WordNet 3.0, among the synsets listed for the word 'small', there are 3 that have as constituent words only 'small' and 'little'. These 3 synsets would be functionally equivalent for our purposes. When this occurs, the senses that are equivalent are collated under one name, so that it is possible to find out which senses a particular collated sense is made up of.
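
A short sketch of this collation step using NLTK's WordNet interface (requires the nltk package and the WordNet data); grouping synsets by the frozen set of their lemma names is my own choice of key, but it captures the "same word list" criterion described above:

```python
from collections import defaultdict
from nltk.corpus import wordnet as wn   # pip install nltk; nltk.download('wordnet')

groups = defaultdict(list)
for synset in wn.synsets("small"):
    key = frozenset(synset.lemma_names())     # the words that list this sense
    groups[key].append(synset.name())

for words, senses in groups.items():
    if len(senses) > 1:                       # functionally equivalent under this model
        print(sorted(words), "->", senses)
```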

A Discriminative Matching Approach to Word Alignment

On top of these features, we included other kinds of information, such as word-similarity features designed to capture cognate (and exact match) information. We added a feature for exact match of words, exact match ignoring accents, exact match ignoring vowels, and fraction overlap of the longest common subsequence. Since these measures were only useful for long words, we also added a feature which indicates that both words in a pair are short. These orthographic and other features improved AER to 14.4. The running example now has the alignment in Figure 1(c), where one improvement may be attributable to the short pair feature – it has stopped proposing the-de, partially because the short pair feature downweights the score of that pair. A clearer example of these features making a difference is shown in Figure 2, where both the exact-match and character overlap features …
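
A sketch of what such word-similarity features might look like; the exact definitions (the "short" length threshold, the accent stripping, the overlap normalisation) are assumptions for illustration, not the paper's specification:

```python
import unicodedata

def strip_accents(s):
    return "".join(c for c in unicodedata.normalize("NFD", s) if not unicodedata.combining(c))

def strip_vowels(s):
    return "".join(c for c in s if c.lower() not in "aeiou")

def lcs_len(a, b):
    # longest common subsequence length via the standard dynamic programme
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ca == cb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def orthographic_features(e, f):
    return {
        "exact": e == f,
        "exact_no_accents": strip_accents(e) == strip_accents(f),
        "exact_no_vowels": strip_vowels(e) == strip_vowels(f),
        "lcs_overlap": lcs_len(e, f) / max(len(e), len(f), 1),
        "both_short": len(e) <= 3 and len(f) <= 3,   # lets the model downweight pairs like the-de
    }

print(orthographic_features("the", "de"))
```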

A Discriminative Framework for Bilingual Word Alignment

In light of our claims about the ease of optimizing the models, we should make some comments on the time needed to train the parameters. Our current implementation of the alignment search is written in Perl, and is therefore quite slow. Alignment of our 500,000 sentence pair corpus with the LLR-based model took over a day on a 2.8 GHz Pentium IV workstation. Nevertheless, the parameter optimization was still quite fast, since it took only a few iterations over our 224 sentence pair development set. With either the LLR-based or CLP-based models, one combined learning/evaluation pass of perceptron training always took less than two minutes, and it never took more than six passes to reach the local optimum we took to indicate convergence. Total training time was greater since we used multiple runs of perceptron learning with different learning rates for the LLR-based model and different conditional link probability discounts for CLP1, but total training time for each model was around an hour.

Word Alignment via Quadratic Assignment

Recently, discriminative word alignment methods have achieved state-of-the-art accuracies by extending the range of information sources that can be easily incorporated into aligners. The chief advantage of a discriminative framework is the ability to score alignments based on arbitrary features of the matching word tokens, including orthographic form, predictions of other models, lexical context and so on. However, the proposed bipartite matching model of Taskar et al. (2005), despite being tractable and effective, has two important limitations. First, it is limited by the restriction that words have fertility of at most one. More importantly, first-order correlations between consecutive words cannot be directly captured by the model. In this work, we address these limitations by enriching the model form. We give estimation and inference algorithms for these enhancements. Our best model achieves a relative AER reduction of 25% over the basic matching formulation, outperforming intersected IBM Model 4 without using any overly compute-intensive features. By including predictions of other models as features, we achieve AER of 3…
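
The basic matching formulation referred to here (fertility of at most one) can be illustrated as an assignment problem: given a score for every source-target word pair, the highest-scoring one-to-one alignment is found with the Hungarian algorithm. The scores below are toy numbers; this is a sketch of the baseline formulation only, not of the enriched model in the excerpt.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

scores = np.array([[2.0, 0.1, 0.3],      # rows: source words, columns: target words
                   [0.2, 1.5, 0.4],
                   [0.1, 0.2, 1.8]])
rows, cols = linear_sum_assignment(-scores)        # negate to maximise the total score
print(list(zip(rows.tolist(), cols.tolist())))     # [(0, 0), (1, 1), (2, 2)]
```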

Knowledge Intensive Word Alignment with KNOWA

To simulate at least partly the improvement that one can expect from an increase in the size of MultiSemCor, we trained GIZA++ on the union of the available MultiSemCor and EuroCor. The results of the training on MultiSemCor only, and on the union of MultiSemCor and EuroCor, are reported in Table 7. Besides the row for the all-word task, the table also contains a SemCor row. This task concerns all the words that have been manually tagged in SemCor, and roughly corresponds to the content-word task. As the purpose of MultiSemCor is transferring lexical annotations from the English annotated words to the corresponding Italian words, it is particularly important that the alignment for the annotated words be correct. The results showed that GIZA++ works consistently better in the Italian-to-English direction than vice versa, so we report that direction. Only for the training on the union of the MultiSemCor and EuroCor data do we also report the results obtained by symmetrizing the two alignments by intersection. Table 7 shows that the MultiSemCor task is less difficult than the EuroCor task; that GIZA++ consistently performs worse on content words; and finally that the increase in the size of the training corpus produces a non-marginal improvement in precision, although not in recall. Symmetrization produces a big improvement in precision but also an unacceptable worsening of recall for GIZA++.

Discriminative Word Alignment by Linear Modeling

Recent years have witnessed the rapid development of discriminative alignment methods. As a first attempt, Och and Ney (2003) proposed Model 6, a log-linear combination of the IBM models and the HMM model. Cherry and Lin (2003) develop a statistical model to find word alignments, which allows for easy integration of context-specific features. Liu, Liu, and Lin (2005) apply the log-linear model used in SMT (Och and Ney 2002) to word alignment and report significant improvements over the IBM models. Moore (2005) presents a discriminative framework for word alignment and uses averaged perceptron for parameter optimization. Taskar, Lacoste-Julien, and Klein (2005) treat the alignment prediction task as a maximum weight bipartite matching problem and use the large-margin method to train feature weights. Neural networks and transformation-based learning have also been introduced to word alignment (Ayan, Dorr, and Monz 2005a, 2005b). Blunsom and Cohn (2006) propose a new discriminative model based on conditional random fields (CRF). Fraser and Marcu (2006) use sub-models of IBM Model 4 as features and train feature weights using a semi-supervised algorithm. Ayan and Dorr (2006b) use a maximum entropy model to combine word alignments. Cherry and Lin (2006) show that introducing soft syntactic constraints through discriminative training can improve alignment quality. Lacoste-Julien et al. (2006) extend the bipartite matching model of Taskar, Lacoste-Julien, and Klein (2005) by including fertility and first-order interactions. Recently, max-product belief propagation has been successfully applied to discriminative word alignment (Niehues and Vogel 2008; Cromières and Kurohashi 2009). Haghighi et al. (2009) investigate supervised word alignment methods that exploit inversion transduction grammar (ITG) constraints.

Improving Word Alignment with Bridge Languages

We present three sets of experiments. In Table 4, we describe the first set, where all 9 alignment models are trained on nearly the same set of sentences (1.9M sentences, 57.5M words in English). This makes the alignment models in all bridge languages comparable. In the first row, marked None, we do not use a bridge language. Instead, an Ar-En alignment model is trained directly on the set of sentence pairs. The next four rows give the performance of alignment models trained using the bridge languages Es, Fr, Ru and Zh respectively. For each language, we use the procedure (Section 3) to obtain the posterior probability matrix for Arabic-English from the Arabic-X and X-English matrices. The row AC1 refers to alignment combination using interpolation of posterior probabilities, described in Section 4. We combine posterior probability matrices from the systems in the first four rows: None, Es, Fr and Ru. We exclude the Zh system from the AC1 combination because it is found to degrade the translation performance by 0.2 points on the test set.
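
A minimal sketch of the bridging step this excerpt relies on, under the assumption that chaining through the bridge language amounts to multiplying the two posterior matrices and renormalising, and that the AC1-style combination is a weighted interpolation of such matrices; the actual composition rule and weights are the paper's and are not reproduced here.

```python
import numpy as np

def bridge_posteriors(p_ar_x, p_x_en):
    """p_ar_x: |ar| x |x| posterior matrix, p_x_en: |x| x |en| posterior matrix."""
    p = p_ar_x @ p_x_en
    return p / p.sum(axis=1, keepdims=True)            # renormalise over English positions

def interpolate(matrices, weights=None):
    """Combine several Ar-En posterior matrices of the same shape by interpolation."""
    weights = weights if weights is not None else [1.0 / len(matrices)] * len(matrices)
    return sum(w * m for w, m in zip(weights, matrices))
```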
