STATISTICAL MACHINE TRANSLATION – PART 1: What is SMT?

Author: Hồ Xuân Vinh


With all my respect to Professor Jong-Hyeok Lee: this work is based on a series of his wonderful and interesting lectures.


Recently, I had the opportunity to work with and learn from a Machine Translation research team at POSTECH. I had an interview with the project leader, Hong-seok, and planned to write a post about it, to share a brilliant story about how they started from scratch 4 months ago and can now achieve results better than those of giants in this field such as Google or Samsung. Unfortunately, due to a confidentiality policy, I could not make it public to share with everyone. This led to another idea: why not share the knowledge from the Statistical Machine Translation class in my own voice, for people who do not have the chance, like me, to listen to one of the amazing eyewitnesses of the adventure to conquer the mysterious land of Natural Language Processing. This first part and the following ones are explained by me; sometimes there are conversations between me and lab mates or the professor to discuss a problem further. I hope this helps those who struggle with SMT, seeing it as a black box instead of something crystal clear in daylight.


Warning: you can skip my blog and start by yourself with 2 famous books: “An Introduction to Machine Translation”, John Hutchins & Harold L. Somers, Academic Press, 1992, and “Statistical Machine Translation”, Philipp Koehn, Cambridge University Press, 2010. Usually, you would first be introduced to basic terms in Linguistics, but I will assume you already know them (from Wikipedia!?), so I only talk about the mechanism of Machine Translation. Hope that does not let you down :))


It’s 2016 now, and everyone is talking about the era of data explosion. One of its biggest advantages is that you can now access many, many resources to look for the things you need: a recipe for Phở (Vietnamese food), the 1957th sentence of Shakespeare’s poems, or even the blog you are reading right now. The catch is that, sometimes, the document you find is written in a foreign language, and you might wait a decade until someone finds it interesting enough to translate into your language. Nah, here comes the part where our hero shows up: Machine Translation, or MT for short. Click Google’s Translate button and, tadaa, the problem is solved in the blink of an eye.

To get to this marvelous achievement, the science community fought its first battle in World War II, with the famous story of the Enigma machine being cracked by the father of Computer Science, Alan Turing. The lesson we can learn from that story is this: languages are basically the same thing with different representations, just as we can use the decimal or binary system to write the number 10. Once we find the patterns, or rules, of an individual language, we can convert from one to another easily. AND here we are: the annoying part is that language, like its inventor, is far more complicated than necessary, and (so far) no one can capture all of its variations in just a list of rules.

You can stop reading right now and say: “Humans are foolish creatures”. Yes, and that is also what makes us beautiful. Different geographies, different cultures, and many other factors on Earth affect the creation of a language: no spaces between words (Chinese, Thai), writing directions (right to left: Arabic, top to bottom: Japanese), lexical terms (many words to describe “rice” in Vietnamese, more than 200 names for “cheese” in France), word order (S-V-O, S-O-V)… Besides that, a language always evolves through history: new words are created, some words lose their original meaning or, worse, vanish from the modern dictionary. Because we humans are fools, we do not see this as a burden, but as another mountain to conquer. On that remarkable journey, a shining milestone is Statistical Machine Translation (SMT): stronger than rule-based approaches, perhaps weaker than the current state-of-the-art Neural Machine Translation (NMT), but special. It is special because people changed their thinking from “you must know a language well to translate it” to the idea that, just by using statistics, a machine could translate a language you have never seen before.

What is Machine Translation?

Translation is the process of transferring text from one human language (the source language) to another (the target language) in a way that preserves meaning. Machine Translation automates (part of) this process: fully automatic translation or computer-aided (human) translation.

In more detail, on the scale of human intervention in MT, we have:

  • MT (Mechanical/Automatic Translation):
    • FAHQT – Fully automatic high-quality translation: no human intervention.
  • CAT (Computer-Aided Translation):
    • HAMT – Human-aided machine translation: humans check the output manually, with pre- and post-editing.
    • MAHT – Machine-aided human translation: the system provides linguistic aids for translation (dictionaries, grammars, previously translated texts…).

In general, there are 3 main factors that an MT system aims to achieve: fully automatic operation, high quality, and unrestricted (input) texts. The fact is, even now, it is still impossible to have all 3 of them at the same time. Figure 1 tells that story: if you want MT to run automatically without any manual editing, yet also produce high-quality output, then the input texts must be restricted to avoid noise. Or, if you want a high-quality translation of unrestricted text, then the only way is to translate it manually. Once you understand this small chart, you will know the required ingredients for a good MT system.

So, what exactly is a restricted input? It means imposing restrictions on input texts to minimize problems for the translator. The restricted text can remain natural (a sublanguage) or be imposed artificially (a controlled language), depending on your intention. For humans, the text becomes more readable, less ambiguous, more “focused”. For MT, there are fewer syntactic constructions, a closed vocabulary with fewer homonyms, and greater certainty about interpretation. Note that:

  • Sublanguage: a specialized vocabulary and grammar of a specific subject domain and/or document type. Its words are often unknown to non-specialists, and its grammar patterns are restricted. Ex: weather reports, stock market reports, etc.
  • Controlled language: a simplified version of a language that may range over all subject areas, widely used in technical authoring. Ex: Basic English is a minimal variety of English, in which the number of general-purpose words is a few hundred rather than 75,000.
Figure 1. The 3 factors of MT

Besides the main scenario of MT, translating 1 source language into 1 target language, there are other scenarios:

  • Assimilation (acquisition): imagine you are a local publisher and want to introduce classic novels from all around the world to your native readers. The novels may be written in many different languages and belong to many categories, from romance and adventure to science fiction. This requires a lot of effort in post-editing to keep the spirit of the originals. Due to this complexity, many translation errors can occur, and the editor/publisher must keep them under control. In other words, MT translates many Sources into 1 Target for all-purpose translation (documents with free styles and topics). The output must be manually checked after running, and the MT system must have robustness (prevent and handle errors during execution) and coverage (handle a wide variety of vocabulary and constructions).
  • Dissemination (distribution): now you switch jobs and become an author whose favorite category is fantasy. In an attempt to make your work better known, you look for publishers who could help you publish it in other countries. In this case, the style and topic of the document are fixed. MT translates 1 Source into many Targets: special translation (controlled style, single topic), no post-editing, and it requires textual quality.
  • Direct communication: when people who speak different languages talk directly to each other, you need speech recognition & synthesis to convert speech to text and back. MT then translates the text in between; and because this is a conversation, the topics vary, so it helps if the system is provided with the context.

A brief history of MT

Warren Weaver, the so-called pioneer of the early days of Machine Translation, after 2 years of discussions with his colleagues at the Rockefeller Foundation starting in 1947, proposed a memorandum [1] with 4 main approaches for MT. Surprisingly, those ideas have all become reality nowadays, which shows how great his vision was. They are:

  • Statistical MT: a word-by-word translation is too naive and simple to work well, especially when a word is polysemous. We can do better by using context information, such as surrounding words or the topic of the document, to achieve a better translation (see the toy sketch below the quote).
  • Neural MT: inspired by work on an early type of neural network by McCulloch and Pitts; given a set of premises, let the computer handle the logic by itself.
  • Cryptography: a foreign language is just a way to encrypt your native language, the plain text. If we can find a two-way function, then we can apply it to translation.
  • Linguistic universals: there is a way to represent all the different languages in one universal language. This would be the bridge to translate one language into another.

“When I look at an article written in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”

Warren Weaver’s memorandum (1949)
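
To make the statistical idea concrete, here is a toy sketch in Python of what later became the core of SMT, the noisy-channel model: pick the target sentence e that maximizes P(f|e) · P(e), the product of a translation model and a language model. All sentences and probabilities below are invented for illustration; real systems estimate them from large corpora.

# Toy noisy-channel decoder; all probabilities are made up for illustration.
# "lm" stands for P(e), the language model: how natural the target sentence is.
# "tm" stands for P(f|e), the translation model: how well e explains the source f.
candidates = {
    "the spirit is willing": {"lm": 0.020, "tm": 0.30},
    "the vodka is strong":   {"lm": 0.015, "tm": 0.05},
}

def score(e):
    # Noisy channel: choose the e that maximizes P(f|e) * P(e).
    return candidates[e]["tm"] * candidates[e]["lm"]

best = max(candidates, key=score)
print(best)  # -> "the spirit is willing"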

The decade of optimism (1954-1966) followed the first MT conference, held at MIT in 1952. Two years later, the first MT system was demonstrated by Georgetown University and IBM [2], opening many opportunities for funding and for the foundation of famous organizations/systems such as SYSTRAN and PAHO. In the same year, Mechanical Translation, the first MT journal, was established. By 1962, there were 48 working groups deeply involved in R&D. Unfortunately, computers were not nearly as powerful as they are nowadays, scientists did not concentrate much on developing formal theories of language structure, and no computational linguistic resources were available: part-of-speech taggers, syntactic parsers, digital lexicons. To sum up, the first generation was the direct-translation decade: word-for-word translation with little or no general syntactic analysis.

In a 1959 report, Bar-Hillel criticized the field, arguing that disambiguation (the ability to distinguish between the meanings of a word) plays an important role in translation, depends on world knowledge, and cannot be programmed into a computer. So fully automatic high-quality MT is not only impractical, but also impossible in principle. Since then, human involvement has been seen as required.

With no significant improvement during the first decade, a committee named ALPAC (Automatic Language Processing Advisory Committee) was formed. Its 1966 report said that there was no need for machine translation because there were enough human translators, and that the field was not going to work for general scientific text in the near future, whatever the goal was. From a modern perspective, this conclusion seems hasty and ridiculous, but perhaps due to the lack of a commercial internet and to general disappointment, the decision was made and marked the end of an exciting era, not only in Natural Language Processing but in Artificial Intelligence in general.

Example from Bar-Hillel’s criticism: “Little John was looking for his toy box. Finally he found it. The box was in the pen. John was very happy.”

Here the word “pen” means “playpen” (a fenced enclosure), not a writing instrument.

Q: “Do you agree with the criticism?”

A: “No, because at that time they chose the word-for-word approach. Picking the correct translation of ‘pen’ is being solved right now: if we check the surrounding words, we can find a lot of relevant and helpful information. Besides that, syntactic and semantic information can also be utilised.”
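
As a rough illustration of that answer, here is a minimal Python sketch of choosing a sense by counting how often it co-occurs with the surrounding words. The counts are purely hypothetical; in practice they would be estimated from a corpus.

# Hypothetical co-occurrence counts between each sense of "pen"
# and possible context words; invented numbers for illustration only.
cooccurrence = {
    "playpen":            {"toy": 40, "box": 25, "baby": 60, "ink": 1},
    "writing instrument": {"toy": 2,  "box": 5,  "baby": 1,  "ink": 80},
}

def pick_sense(context_words):
    # Score each sense by how often it co-occurs with the context words.
    def total(sense):
        return sum(cooccurrence[sense].get(w, 0) for w in context_words)
    return max(cooccurrence, key=total)

print(pick_sense(["toy", "box"]))  # -> "playpen"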

The funny thing is, some people are afraid of the development of MT, for the same reason people fear the AI in the Terminator movies. This fear comes with some popular misconceptions:

  • MT is a waste of time because it cannot translate Shakespeare. Fact: we do not need a perfect translation; most of the time, an understandable one is enough.
  • There was/is an MT system which translated “The spirit is willing, but the flesh is weak” into the Russian equivalent of “The vodka is good, but the steak is lousy”, and “hydraulic ram” into the French equivalent of “water goat”, so MT is useless. Fact: we can solve this by choosing the phrase, rather than the word, as the translation unit (see the sketch after this list).
  • Generally, the translation quality of MT systems is very low, which makes them useless in practice. Fact: they are getting better every day.
  • MT threatens the jobs of translators. Fact: MT is still far from this scenario, and even then, it would still need a good bilingual corpus for training, which is made by none other than human translators.
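
To see why phrases help with that second misconception, here is a tiny Python sketch with a hypothetical phrase table (all entries invented for illustration): a greedy longest-phrase-first lookup translates “hydraulic ram” as one unit instead of word by word.

# Hypothetical English -> French tables; entries invented for illustration.
word_table = {"hydraulic": "hydraulique", "ram": "bélier"}   # "bélier" alone is also the animal
phrase_table = {("hydraulic", "ram"): "bélier hydraulique"}  # whole phrase translated as one unit

def translate(words):
    out, i = [], 0
    while i < len(words):
        # Greedily prefer the longest known phrase starting at position i.
        if tuple(words[i:i + 2]) in phrase_table:
            out.append(phrase_table[tuple(words[i:i + 2])])
            i += 2
        else:
            out.append(word_table.get(words[i], words[i]))
            i += 1
    return " ".join(out)

print(translate(["hydraulic", "ram"]))  # -> "bélier hydraulique", not a word-by-word "water goat"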

During the AI winter, MT started to regain its reputation with the rule-based approach, the second generation. In 1970, SYSTRAN was installed at the USAF, and the restricted language TITUS was presented. In 1975, Météo, an English-French sublanguage system for the domain of weather broadcasts, appeared. In 1979, the Eurotra [3] project, funded by the European Commission, began with the hope of creating a state-of-the-art MT system for the then seven, later nine, official languages of the European Community. The third generation (rule-based, semantics-based) involved long-term efforts in compiling grammar rules and creating dictionaries. Many systems were created and introduced: interlingual systems, transfer-based systems, knowledge-based systems, speech translation, and computer-based tools. This change in approach was supported by many other changes since the late 1980s: increasing use of MT by large enterprises, translation memory, the growth of PC systems, the impact of the internet, online translation, and research on corpus-based MT methods.

Summary

In conclusion, MT is still far from the goal set in the 1950s, but it is getting better every day and achieving significant results. An MT system is very useful for tasks that require too much translation for humans, demand great consistency and speed, and do not need top quality. On the other hand, important situations, such as life-and-death matters or texts that require subtle knowledge of words and a high degree of literary skill (the pharmaceutical business, court proceedings), still need human involvement.

I hope this first part gives you a brief understanding of the history and basic factors of an MT system. Part 2 will introduce MT structures and paradigms, plus the basic idea of SMT. Please feel free to share your thoughts about this post, what you want to know more about, or any matter relevant to MT. See you next time.

References

[1] https://en.wikipedia.org/wiki/Warren_Weaver

[2] https://en.wikipedia.org/wiki/Georgetown%E2%80%93IBM_experiment

[3] https://en.wikipedia.org/wiki/Eurotra

[4] Bonnie Dorr et al., “A Survey of Current Paradigms in Machine Translation,” LAMP-TR-027, 1998.

[5] W. John Hutchins, “Early Years in Machine Translation,” John Benjamins, 2000.
