
Statistical Machine Translation

Prepared by
Yana Zaiets
Group 2.1
The first ideas of Statistical Machine Translation were introduced by
Warren Weaver as far back as 1947.
Statistical machine translation was re-introduced in the late 1980s and
early 1990s by researchers at IBM's Thomas J. Watson Research Center.

SMT is a machine translation paradigm where translations are generated
based on statistical models whose parameters are derived from the
analysis of bilingual text corpora (text bodies): collections of texts in the
source language paired with their existing translations in the target
language.
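The standard formulation behind this paradigm is the noisy-channel model: among candidate target sentences e for a source sentence f, pick the one maximizing P(f|e)·P(e), i.e. translation-model adequacy times language-model fluency. A minimal sketch with invented candidates and scores (not trained values):

```python
# Noisy-channel decoding over a tiny hand-made candidate set.
# All probabilities below are invented for illustration.
candidates = {
    # candidate translation: (translation model P(f|e), language model P(e))
    "the house is small": (0.7, 0.005),
    "small the house is": (0.7, 0.00001),  # adequate but disfluent
    "the building is tiny": (0.2, 0.004),  # fluent but less adequate
}

def decode(candidates):
    """Return the candidate maximizing P(f|e) * P(e)."""
    return max(candidates, key=lambda e: candidates[e][0] * candidates[e][1])

print(decode(candidates))  # → 'the house is small'
```

A real decoder searches over an enormous hypothesis space rather than a fixed list, but the scoring principle is the same.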
Language model
A language model is an essential component of any statistical machine
translation system; it helps make the translation as fluent as possible. It
is a function that takes a translated sentence and returns the probability of it
being produced by a native speaker.
Besides word order, language models also help with word choice: if a
foreign word has multiple possible translations, these functions may assign
higher probabilities to certain translations in specific contexts in the target
language.
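A common way to build such a function is an n-gram model estimated from a monolingual corpus. The following sketch trains a bigram model with add-one smoothing on a tiny invented corpus; a real system would use millions of sentences and stronger smoothing:

```python
from collections import defaultdict

# Hypothetical toy corpus; real language models train on millions of sentences.
corpus = [
    "the house is small",
    "the house is big",
    "the small house is old",
]

# Count bigrams and their history words over the corpus.
bigram_counts = defaultdict(int)
history_counts = defaultdict(int)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[(prev, curr)] += 1
        history_counts[prev] += 1

def sentence_probability(sentence):
    """Bigram probability of a sentence, with add-one smoothing."""
    vocab_size = len(history_counts)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= (bigram_counts[(prev, curr)] + 1) / (history_counts[prev] + vocab_size)
    return prob

# A fluent word order scores higher than a scrambled one.
print(sentence_probability("the house is small") > sentence_probability("house the small is"))
```

This is exactly the fluency signal the decoder multiplies into its score: word orders a native speaker would produce get higher probability.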
Word-based translation
In word-based translation, the fundamental unit of translation is a word in some
natural language.
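At its simplest, this means looking each source word up in a lexical translation table, in the spirit of the IBM word-based models. The table and probabilities below are invented for illustration, not trained:

```python
# Hypothetical lexical translation table (German → English);
# probabilities are invented for illustration, not estimated from data.
lex_table = {
    "das": {"the": 0.9, "that": 0.1},
    "Haus": {"house": 0.8, "home": 0.2},
    "klein": {"small": 0.7, "little": 0.3},
}

def translate_word_by_word(source_tokens):
    """Pick the most probable target word for each source word independently.
    This shows the core limitation: one word in, one word out, with no
    reordering and no handling of one-to-many translations."""
    return [max(lex_table[w], key=lex_table[w].get) for w in source_tokens]

print(translate_word_by_word(["das", "Haus"]))  # → ['the', 'house']
```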

Phrase-based translation
In phrase-based translation, the aim is to overcome the restrictions of
word-based translation by translating whole sequences of words, whose
lengths may differ between source and target.
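The core data structure is a phrase table mapping multi-word source segments to scored target phrases of possibly different length. The table, scores, and greedy matching below are a simplified illustration (a real decoder like Moses searches over all segmentations and reorderings):

```python
# Hypothetical phrase table: source phrases map to scored target phrases
# whose lengths may differ; all entries and scores are invented.
phrase_table = {
    ("ich", "habe"): [("I have", 0.6), ("I've", 0.4)],
    ("keine", "ahnung"): [("no idea", 0.9), ("no clue", 0.1)],
    ("ich",): [("I", 0.8), ("me", 0.2)],
}

def greedy_phrase_translate(tokens):
    """Greedily cover the source with the longest matching phrase,
    emitting the best-scoring target phrase for each segment."""
    output, i = [], 0
    while i < len(tokens):
        # Try the longest span first so multi-word phrases win over single words.
        for span in range(len(tokens) - i, 0, -1):
            key = tuple(tokens[i:i + span])
            if key in phrase_table:
                best = max(phrase_table[key], key=lambda pair: pair[1])
                output.append(best[0])
                i += span
                break
        else:
            output.append(tokens[i])  # pass unknown words through unchanged
            i += 1
    return " ".join(output)

print(greedy_phrase_translate(["ich", "habe", "keine", "ahnung"]))  # → 'I have no idea'
```

Note how "keine ahnung" becomes the two-word idiom "no idea" as a unit, which word-by-word lookup cannot capture.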

Syntax-based translation
Syntax-based translation is based on the idea of translating syntactic units, rather
than single words or strings of words (as in phrase-based MT).
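A minimal sketch of that idea: translation rules apply to syntactic constituents and can reorder them wholesale. The clause structure, rule, and lexicon below are invented for illustration (English SVO reordered to Japanese-style SOV):

```python
# Sketch of syntax-based translation: a rule operates on constituents
# (subject, verb, object), not on a flat word string. Lexicon is illustrative.
def translate_clause(subject, verb, obj, lexicon):
    """Translate each constituent, then reorder per a syntactic rule."""
    s, v, o = lexicon[subject], lexicon[verb], lexicon[obj]
    return [s, o, v]  # rule: SVO → SOV

lexicon = {"I": "watashi-wa", "eat": "tabemasu", "fish": "sakana-o"}
print(translate_clause("I", "eat", "fish", lexicon))
# → ['watashi-wa', 'sakana-o', 'tabemasu']
```

One syntactic rule handles the long-distance verb movement that phrase-based systems struggle with, which is why syntax-based approaches help for language pairs with very different word order.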
Benefits of SMT:
• More efficient use of human and data resources.
• There are many parallel corpora in machine-readable format and even more
monolingual data.
• Generally, SMT systems are not tailored to any specific pair of languages.
• Rule-based translation systems require the manual development of linguistic
rules, which can be costly and which often do not generalize to other
languages.
• More fluent translations owing to the use of a language model.
Shortcomings of SMT:
• Corpus creation can be costly.
• Specific errors are hard to predict and fix.
• Results may have superficial fluency that masks translation problems.
• Statistical machine translation usually works less well for language pairs with
significantly different word order.
• The benefits obtained for translation between Western European languages
are not representative of results for other language pairs, owing to smaller
training corpora and greater grammatical differences.
Conclusion

Statistical machine translation utilizes statistical translation models
whose parameters stem from the analysis of monolingual and
bilingual corpora. Building statistical translation models is a quick
process, but the technology relies heavily on existing multilingual
corpora: at least 2 million words are required for a specific domain,
and even more for general language. Theoretically it is possible to
reach the quality threshold, but most companies do not have such
large amounts of existing multilingual corpora to build the necessary
translation models. Additionally, statistical machine translation is
CPU-intensive and requires an extensive hardware configuration to
run translation models at average performance levels.
Systems implementing statistical machine translation:
• Google Translate (started transition to neural machine translation in 2016)
• Microsoft Translator (started transition to NMT in 2016)
• Omniscien Technologies
• SYSTRAN (started transition to NMT in 2016)
• Yandex.Translate (switched to a hybrid approach incorporating neural
machine translation in 2017)
