Chapter 2. Literature Review
2.6 Existing Machine Translation Tools and Services
Several machine translation tools already exist, most of them available online. The best known of these are Google Translate, Bing Translator, Systran, Babel Fish, and Language Weaver.
Google Inc.’s language translation service, Google Translate, is a system based on statistical machine translation (SMT). It is probably the best-known online translation service at present, performing hundreds of millions of translations every day. It can translate text selections, whole documents, and web pages between a number of languages. Speech recognition software used in conjunction with its translation engine also allows the service to translate the spoken word, and to deliver the output as speech using text-to-speech software (Henderson, 2010). Google Translate’s SMT approach was released in 2006 and originally supported only English and Arabic. Until October 2007, earlier versions of the service used Systran-based software for languages other than Arabic, Russian and Chinese. Currently, it offers full support for translation between 64 different languages, and partial support for 11 “alpha” languages, which are still in the earlier stages of development (Aiken & Balan, 2011).
Like many of Google’s services, Translate is free for the public to use. Google also incorporates user input into the service: users can contribute a better translation, and are asked to submit alternative words or phrases where necessary, for example when dealing with technical terms. Statistics gathered from these contributions are used to update the system continually so that it provides more accurate translations. The ability of an SMT-based system to improve itself with use was one of the main reasons for Google’s shift to SMT. The benefits are considerable: with the public’s constant use and contribution, the system can keep improving over time.
Additional features in the system, such as automatic language detection, default translation into English, and automatic web page translation in the Google Chrome browser, make Translate an attractive tool to use. Programs and applications using the main system are available on platforms such as iOS and Android, and can perform tasks such as real-time chat translation (using a GChat chatbot) and speech-to-speech translation for 14 different languages.
Google Translate’s fully supported languages include the following:
Afrikaans, Albanian, Arabic, Belarusian, Bulgarian, Catalan, Chinese (simplified), Chinese (traditional), Croatian, Czech, Danish, Dutch, English, Estonian, Esperanto, Filipino, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Vietnamese, Welsh and Yiddish.
The 11 “alpha” languages still in the earlier stages of development currently include: Armenian, Azerbaijani, Basque, Georgian, Gujarati, Haitian Creole, Kannada, Latin, Tamil, Telugu and Urdu. Although these languages are available to use, they produce less reliable translation results than the fully supported languages, and not all features are available for them (such as speech input/output, and applications using the main system).
The initial SMT system was researched and developed by Franz Josef Och, the head of Google’s machine translation department. According to Och, developing an SMT system from scratch would require a bilingual text corpus of over a million words, and two monolingual corpora of over a billion words each, to form a sound base from which to work. Statistical models are then derived from this data and used to translate between the language pair. Google was able to obtain this immense amount of data from United Nations documents, since the same document is usually written in each of the six official UN languages: Arabic, Chinese, English, French, Russian and Spanish. As a result, Google Translate now manages a huge multilingual corpus of twenty billion words for these languages.
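To make the idea of deriving statistical models from a bilingual corpus more concrete, the following is a minimal sketch and not a description of Google’s system. It assumes a textbook technique, IBM Model 1 trained with expectation maximisation, and applies it to a three-sentence German–English toy corpus. The function names `train_word_model` and `most_likely_translation` and the corpus itself are invented for illustration.

```python
from collections import defaultdict


def train_word_model(pairs, iterations=20):
    """Estimate t(target_word | source_word) with EM, in the style of IBM Model 1.

    `pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    Simplifying assumption: no NULL source word is modelled.
    """
    target_vocab = {e for _, tgt in pairs for e in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start from a uniform distribution over target words.
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of each source word
        for src, tgt in pairs:
            for e in tgt:
                # E-step: share one unit of evidence for `e` among the
                # source words of the sentence, in proportion to t.
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    share = t[(e, f)] / norm
                    count[(e, f)] += share
                    total[f] += share
        # M-step: renormalise the expected counts into probabilities.
        t = defaultdict(float)
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t


def most_likely_translation(t, source_word):
    """Return the target word with the highest probability, or None if unseen."""
    candidates = {e: p for (e, f), p in t.items() if f == source_word}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Hypothetical toy corpus; a real system would use millions of words.
toy_corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
model = train_word_model(toy_corpus)
```

With this corpus, calling `most_likely_translation(model, "buch")` selects "book", because "book" is the only English word that co-occurs with "buch" in both sentences containing it. The same principle, applied to far larger corpora and richer models, is what makes corpus size so important for SMT.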
Microsoft also provides a translation service that enables users to translate selections of text, and even whole web pages, into other supported languages. Bing Translator was previously called Windows Live Translator, and used Systran as its backend translation software. Microsoft Research has since developed its own translation software, called Microsoft Translator, which powers the language pairs currently offered by the service. Where computer-related translation is required (including technical computer terms), Microsoft uses its own syntax-based SMT technology.
Systran is a machine translation company founded in 1968 by Dr Peter Toma. One of the longest-standing machine translation companies, Systran has performed a significant amount of work for the United States Department of Defense. The translation services offered by Yahoo, Babel Fish and AOL use Systran as the software base for their systems. Apple’s Mac OS X operating system uses Systran in its Dashboard Translation widget.
Systran employs a sentence-by-sentence approach to translation, focusing on and processing individual words and their dictionary definitions before parsing the sentence to generate a translation output. The three main groups of modules composing Systran’s framework are: Dictionary, Systems Software, and Linguistic Software. These groups work together to create an automatic machine translation system (Senellart, Dienes et al., 2001).
Babel Fish is a web-based automatic machine translation program built by AltaVista and used by Yahoo. It is humorously named after the fictitious translating animal from Douglas Adams’s book The Hitchhiker’s Guide to the Galaxy. Babel Fish uses Systran’s translation system as a software base, and can translate among English, Simplified and Traditional Chinese, Dutch, French, German, Greek, Italian, Japanese, Korean, Portuguese, Russian, Swedish and Spanish.
The translations provided by Babel Fish are not as reliable as those given by other translation services, and Babel Fish is considered to be only a minor contributor to the language translation industry.
In 2002, Kevin Knight and Daniel Marcu of the University of Southern California founded Language Weaver (now known as SDL Language Weaver), a company commercializing a statistical approach to language translation and spoken language processing.
The software systems used by SDL Language Weaver are an example of slightly more recent progress in the statistical approach to machine translation. They implement learning algorithms to obtain statistical models from bilingual corpora. Since these models originate from pre-existing aligned language pairs, the output is statistically more likely to be accurate (Soricut, Bach et al., 2012).
Another feature of Language Weaver is its ability to be customised to translate technical material. The software’s learning capabilities aid it in specialising in different subjects or styles.
Language Weaver has incorporated recent advances in statistical machine translation and, with some degree of success, is now able to create translation systems for language pairs that have limited amounts of bilingual text. Language Weaver currently offers translation for English to and from French, Italian, German, Greek, Danish, Spanish, Dutch, Portuguese, Swedish, Russian, Czech, Romanian, Polish, Arabic, Persian, Simplified and Traditional Chinese, Korean and Hindi. It also offers Arabic/Spanish, Arabic/French, Spanish/French, and French/German.
Although Language Weaver currently translates using phrase-based SMT, its researchers are studying how to incorporate syntax-based statistical machine translation, as this approach can improve translation quality for certain language pairs.
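As an illustration of what phrase-based SMT stores, the sketch below builds a small phrase table by relative frequency, so that the probability of a target phrase given a source phrase is the number of times the pair was observed divided by the number of times the source phrase was observed. This is a generic textbook formulation and an assumption on our part; it is not Language Weaver’s implementation. The phrase pairs and the names `build_phrase_table` and `best_translation` are hypothetical.

```python
from collections import Counter, defaultdict


def build_phrase_table(phrase_pairs):
    """Estimate P(target_phrase | source_phrase) by relative frequency.

    `phrase_pairs` is a list of (source_phrase, target_phrase) tuples, assumed
    to have been extracted already from a word-aligned bilingual corpus.
    """
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(src for src, _ in phrase_pairs)
    table = defaultdict(dict)
    for (src, tgt), c in pair_counts.items():
        table[src][tgt] = c / source_counts[src]
    return table


def best_translation(table, source_phrase):
    """Return the most probable target phrase, or None if the phrase is unknown."""
    options = table.get(source_phrase)
    if not options:
        return None
    return max(options, key=options.get)


# Hypothetical extracted phrase pairs (French to English).
extracted = [
    ("la maison", "the house"),
    ("la maison", "the house"),
    ("la maison", "the home"),
    ("bleue", "blue"),
]
phrase_table = build_phrase_table(extracted)
```

Here "the house" receives probability 2/3 and "the home" 1/3 for the source phrase "la maison". A full phrase-based decoder would combine such scores with a language model and a reordering model, which this sketch omits.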
Though its main service area is machine translation, Language Weaver also offers several other products, such as Alignment Tool and Customiser. Alignment Tool is a translation memory generator: it takes a translated document as input, aligns it with the original at segment level, and saves a translation memory file. Customiser is a tool that aids in fine-tuning machine translation output, helping to tailor translation to a narrow domain.
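The general idea behind a translation memory generator can be sketched as follows. This is a deliberately naive illustration and not the algorithm used by Alignment Tool, which is not described here: it assumes that the source and translated documents contain the same number of sentences in the same order, whereas practical aligners must also handle merged, split and missing segments. The function names and the tab-separated output format are assumptions made for the example.

```python
import re


def split_segments(text):
    """Split text into sentence-like segments after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_translation_memory(source_text, target_text):
    """Pair source and target segments one-to-one, by position.

    Naive assumption: both texts have the same number of segments and the
    translation preserves their order.
    """
    src = split_segments(source_text)
    tgt = split_segments(target_text)
    if len(src) != len(tgt):
        raise ValueError(
            f"cannot align {len(src)} source segments with {len(tgt)} target segments"
        )
    return list(zip(src, tgt))


def to_tsv(memory):
    """Serialise the memory as tab-separated source/target lines."""
    return "\n".join(f"{s}\t{t}" for s, t in memory)


memory = build_translation_memory(
    "Hello. How are you?",
    "Bonjour. Comment allez-vous?",
)
```

Each resulting entry pairs one source segment with its translation, so that a previously translated segment can be looked up and reused the next time it appears.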