Machine Translation Techniques
Historically, three different approaches to MT have been used: direct translation, interlingual translation, and transfer-based translation. From the 1980s and early 1990s onward, several new approaches were also introduced: knowledge-based, corpus-based, and hybrid methods, as well as human-in-the-loop approaches.
Direct translation is the oldest approach to MT. In a direct translation system, the source language text is usually not analyzed structurally beyond morphology. The translation is based on large dictionaries and word-by-word translation with some simple grammatical adjustments, e.g. to word order and morphology. A direct translation system is designed for a specific source and target language pair. The translation unit of the approach is usually a word.
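The word-by-word scheme with simple reordering can be sketched as follows. This is a minimal illustration, not a real system: the tiny English-to-Spanish lexicon and the single adjective-noun reordering rule are invented assumptions.

```python
# Minimal sketch of direct (dictionary-based, word-by-word) translation.
# The lexicon and the adjective-noun reordering rule are invented examples.

LEXICON = {
    "the": "el",
    "red": "rojo",
    "car": "coche",
    "is": "es",
    "fast": "rapido",
}

ADJECTIVES = {"red", "fast"}


def direct_translate(sentence: str) -> str:
    """Translate word by word, after one simple grammatical adjustment:
    English adjective + noun becomes noun + adjective."""
    words = sentence.lower().split()
    reordered = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and words[i] in ADJECTIVES:
            reordered.extend([words[i + 1], words[i]])  # swap the pair
            i += 2
        else:
            reordered.append(words[i])
            i += 1
    # Word-by-word dictionary lookup; unknown words pass through unchanged.
    return " ".join(LEXICON.get(w, w) for w in reordered)
```

Everything beyond the dictionary lookup is a shallow, language-pair-specific adjustment, which is exactly why such systems need a new design for every source-target pair.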
The lexicon is normally conceived of as the repository of word-specific information. Traditional lexical resources, such as machine readable dictionaries, therefore contain lists of words. These lists might delineate senses of a word, represent the meaning of a word, or specify the syntactic frames in which a word can appear, but the level of granularity with which they are concerned is the individual word. There are many linguistic phenomena which pose a challenge to this "word focus" in the lexicon. The incorporation of elements at a higher level of abstraction -- at the phrasal level, where particular words are grouped together into fixed phrases -- provides a basis for improved computational processing of language.
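One simple way to move beyond the individual word is greedy longest-match grouping against a phrasal lexicon before any word-level processing. The entries below are invented examples of multi-word expressions, shown only to illustrate the mechanism.

```python
# Sketch of phrase-level lexicon lookup: greedily group fixed phrases
# before word-level processing. The entries are invented examples.

PHRASAL_LEXICON = {
    ("kick", "the", "bucket"): "DIE",
    ("by", "and", "large"): "GENERALLY",
    ("machine", "translation"): "MT",
}

MAX_PHRASE_LEN = max(len(key) for key in PHRASAL_LEXICON)


def tokenize_with_phrases(text: str) -> list[str]:
    """Replace known multi-word phrases with a single lexicon entry,
    preferring the longest match at each position."""
    words = text.lower().split()
    out, i = [], 0
    while i < len(words):
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 1, -1):
            key = tuple(words[i:i + n])
            if key in PHRASAL_LEXICON:
                out.append(PHRASAL_LEXICON[key])
                i += n
                break
        else:
            out.append(words[i])  # no phrase matched: keep the single word
            i += 1
    return out
```

Because "kick the bucket" is matched as one unit, later stages can translate the idiom as a whole rather than word by word.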
One of the oldest MT systems still in use today, Systran, is essentially a direct translation system. Its first version was published in 1969. Over the years the system has been developed considerably, but its translation capability is still mainly based on very large bilingual dictionaries. No general linguistic theory or parsing principles are necessarily present for direct translation to work; these systems depend instead on well-developed dictionaries, morphological analysis, and text processing software.
The interlingua approach was historically the next step in the development of MT. Esperanto, for example, has been used as an interlingua for translating between languages. In interlingua-based MT, translation is done via an intermediary (semantic) representation of the SL text. The interlingua is supposed to be a language-independent representation from which translations can be generated into different target languages. The approach assumes that it is possible to convert source texts into representations common to more than one language; from such interlingual representations, texts are generated into other languages. Translation is thus in two stages: from the source language to the interlingua (IL), and from the IL to the target language.
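The two-stage scheme can be sketched in miniature. The predicate symbols, lexicons, and subject-verb grammar below are invented toys; the point is only that analysis is written once per source language, generation once per target language, and the two stages share nothing but the interlingual representation.

```python
# Toy interlingua pipeline: analysis maps a source sentence to a
# language-independent predicate structure; generation maps that structure
# into any target language. Lexicons and grammar are invented examples.

# Stage 1: analysis, source language -> interlingua.
EN_ANALYSIS = {"dog": "DOG", "sleeps": "SLEEP"}

def analyze_english(sentence: str) -> tuple[str, str]:
    subject, verb = sentence.lower().split()
    return (EN_ANALYSIS[verb], EN_ANALYSIS[subject])  # (predicate, argument)

# Stage 2: generation, interlingua -> any target language.
GENERATION = {
    "fr": {"DOG": "le chien", "SLEEP": "dort"},
    "de": {"DOG": "der Hund", "SLEEP": "schlaeft"},
}

def generate(il: tuple[str, str], target: str) -> str:
    predicate, argument = il
    lexicon = GENERATION[target]
    return f"{lexicon[argument]} {lexicon[predicate]}"
```

Adding a new target language here means adding one generation table, with no change to the analysis stage, which is the main economy the interlingua approach promises.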
Transfer systems divide translation into steps which clearly differentiate source language and target language parts. The first stage converts source texts into abstract representations; the second stage converts these into equivalent target language-oriented representations; and the third generates the final target language texts. Whereas the interlingua approach necessarily requires complete resolution of all ambiguities in the SL text so that translation into any other language is possible, in the transfer approach only those ambiguities inherent in the language in question are tackled.
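The three stages can be sketched for a single invented English-to-French pair. The POS lexicon, the one structural transfer rule, and the target word list are all illustrative assumptions.

```python
# Sketch of the three transfer stages for one (invented) English->French
# pair: analysis builds an abstract representation, transfer maps it to a
# target-oriented one, and generation linearizes the target text.

def analyze(sentence: str) -> list[tuple[str, str]]:
    """Stage 1: tag each word (toy POS lexicon, an assumption)."""
    pos = {"red": "ADJ", "wine": "NOUN", "the": "DET"}
    return [(w, pos.get(w, "X")) for w in sentence.lower().split()]

def transfer(tree: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Stage 2: language-pair-specific structural rule:
    English ADJ NOUN -> French NOUN ADJ."""
    out, i = [], 0
    while i < len(tree):
        if i + 1 < len(tree) and tree[i][1] == "ADJ" and tree[i + 1][1] == "NOUN":
            out += [tree[i + 1], tree[i]]
            i += 2
        else:
            out.append(tree[i])
            i += 1
    return out

def generate_target(tree: list[tuple[str, str]]) -> str:
    """Stage 3: lexical substitution into the target language."""
    fr = {"red": "rouge", "wine": "vin", "the": "le"}
    return " ".join(fr.get(w, w) for w, _ in tree)
```

Unlike the interlingua sketch, the transfer stage here is written for one specific language pair, so only the ambiguities that matter for that pair need to be resolved.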
Knowledge-based machine translation follows the linguistic and computational instructions supplied to it by human researchers in linguistics and programming. The texts to be translated have to be presented to the computer in machine-readable form. The machine translation process may be unidirectional between a pair of languages: the translation is possible only from Russian to English, for example, and not vice versa, in one system. Or it may be bidirectional.
The dominant approach since around 1970 has been to use handcrafted linguistic rules, but this approach is very expensive to build, requiring the manual entry of large numbers of "rules" by trained linguists. This approach does not scale up well to a general system. Such systems also produce translations that are awkward and hard to understand.
Corpus-based approaches to machine translation (statistical or example-based) tried, and partially succeeded in replacing, traditional rule-based approaches beginning in the mid-1990s, following developments in language technology. The main advantage of corpus-based machine translation systems is that they are self-customising: they can learn the translations of terminology, and even stylistic phrasing, from previously translated materials.
One of the many problems in the field of machine translation is that expressions (multi-word terms) convey ideas that transcend the meanings of the individual words in the expression. A sentence may have unambiguous meaning, but each word in the sentence can have many different meanings. Autocoding (or automatic concept indexing) occurs when a software program extracts terms contained within text and maps them to a standard list of concepts contained in a nomenclature. The purpose of autocoding is to provide a way of organizing large documents by the concepts represented in the text. Autocoders transform text into an index of coded nomenclature terms (sometimes called a "concept index" or "concept signature").
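A minimal autocoder can be sketched as a mapping from terms found in text to concept codes in a nomenclature. The nomenclature entries and codes below are invented stand-ins; a real autocoder would use a standard vocabulary and more careful term matching.

```python
# Sketch of an autocoder: extract terms from text and map them to codes in
# a standard nomenclature. Entries and codes here are invented examples.

NOMENCLATURE = {
    "myocardial infarction": "C0001",
    "heart attack": "C0001",   # synonym maps to the same concept code
    "diabetes": "C0002",
}

def autocode(text: str) -> set[str]:
    """Return the set of concept codes found in the text (the document's
    'concept signature'), using simple substring matching."""
    lowered = text.lower()
    return {code for term, code in NOMENCLATURE.items() if term in lowered}
```

Note that two different surface phrases map to one code, which is what lets the resulting concept index organize documents by meaning rather than by wording.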
Word sense disambiguation is a technique for assigning the most appropriate meaning to a polysemous word within a given context. Word sense disambiguation is considered essential for applications that use knowledge of word meanings in open text, such as machine translation, knowledge acquisition, information retrieval, and information extraction. Accordingly, word sense disambiguation may be used by many commercial applications, such as automatic machine translation (e.g. see the translation services offered by www.altavista.com, www.google.com), intelligent information retrieval (helping the users of search engines find information that is more relevant to their search), text classification, and others.
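A classic, simple disambiguation strategy (a simplified Lesk-style method, used here only as an illustration) picks the sense whose dictionary gloss shares the most words with the surrounding context. The two glosses for "bank" below are invented.

```python
# Simplified Lesk-style word sense disambiguation: choose the sense whose
# gloss overlaps most with the context. Glosses are invented examples.

SENSES = {
    "bank/finance": "institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water",
}

def disambiguate(context: str) -> str:
    """Return the sense key with the largest gloss/context word overlap."""
    context_words = set(context.lower().split())

    def overlap(sense: str) -> int:
        return len(set(SENSES[sense].split()) & context_words)

    return max(SENSES, key=overlap)
```

For MT, choosing the right sense decides which target-language word is used, so even this crude overlap heuristic illustrates why disambiguation sits upstream of translation.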
By the turn of the century, a newer approach based on statistical models - in which a word or phrase is translated to one of a number of possibilities based on the probability that it would occur in the current context - had achieved marked success. The best examples substantially outperform rule-based systems. Statistical machine translation (SMT) may also prove easier and less expensive to expand, since the system can be taught new knowledge domains or languages by giving it large samples of existing human-translated texts.
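The statistical choice just described can be sketched as an argmax over candidate translations, scored by a translation probability times a context score. All the probabilities below are invented for illustration; real SMT systems estimate them from large parallel and monolingual corpora.

```python
# Sketch of the statistical picture: for each source word, choose the
# candidate translation maximizing translation probability times a bigram
# context score (a crude stand-in for a language model). All numbers are
# invented for illustration.

# P(target | source), as if learned from a parallel corpus.
TRANSLATION_MODEL = {
    "bank": {"banque": 0.6, "rive": 0.4},
}

# Bigram scores, as if from a target-language model.
LANGUAGE_MODEL = {
    ("la", "banque"): 0.5,
    ("la", "rive"): 0.1,
}

def best_translation(source_word: str, previous_target: str) -> str:
    """argmax over candidates of P(t|s) * score(t | previous target word)."""
    candidates = TRANSLATION_MODEL[source_word]
    return max(
        candidates,
        key=lambda t: candidates[t] * LANGUAGE_MODEL.get((previous_target, t), 0.01),
    )
```

Expanding such a system to a new domain means re-estimating these tables from new translated text, with no hand-written rules, which is the economic advantage noted above.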
Despite some success, however, severe problems still exist: outputs are often ungrammatical, and the quality and accuracy of translation fall well below those of a human linguist - and well below the demands of all but highly specialized commercial markets.
Hybrid methods are still fundamentally statistics-based, but incorporate higher-level abstract syntax rules to arrive at the final translation. Such hybrids have been explored in the research community, but without much real success, because it has proved difficult to merge the fundamentally different approaches. New algorithms exploit knowledge of how words, phrases and patterns should be translated; knowledge of how syntax-based and non-syntax-based translation rules should be applied; and knowledge of how syntactically based target structures should be generated. Cross-lingual parsers of increasing complexity provide methods to choose different syntactic orderings in different situations.
Human-in-the-loop approaches respond to the difficulties inherent in machine translation from one language to another. Languages are not symmetrically translatable word for word, which greatly complicates software design and makes perfect translation impossible. The greater the differences between languages' structure and culture, the greater the difficulty of accurately translating the intent of the speaker. As with any machine translation, conversions are normally not context-sensitive and may not fully convey the intended meaning of a text. Language experts have noted that machine translation software will never be able to replace a human translator's ability to interpret fine nuances, cultural references, and the use of slang terms or idioms.
Machine translation is not perfect, and may create some poor translations (which can be corrected). Computers, however limited for aiding nonlinguists, are powerful tools for linguists in intelligence and special operations to sort through tons of untranslated information or “triage” documents, sorting contents by priority. Machine “gisting” (reviewing intelligence documents to determine if they contain target key words or phrases) is used to better manage their workloads and target the information that trained linguists need to review in depth. An automated translation system can be used for translation of technical terms and consistent translation of stock phrases in diplomatic and legal documents to help human translators work more efficiently.
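The "gisting" triage described above amounts to scoring documents by how many target key phrases they contain and sorting by priority so linguists review the highest-scoring documents first. The key phrases below are invented examples.

```python
# Sketch of keyword "gisting" triage: score documents by how many target
# key phrases they contain, and sort by priority so human linguists review
# the highest-scoring documents first. Key phrases are invented examples.

TARGET_PHRASES = ["uranium", "shipment", "border crossing"]

def triage(documents: list[str]) -> list[tuple[int, str]]:
    """Return (score, document) pairs, highest priority first."""
    def score(doc: str) -> int:
        lowered = doc.lower()
        return sum(1 for phrase in TARGET_PHRASES if phrase in lowered)

    return sorted(((score(d), d) for d in documents), reverse=True)
```

Nothing here translates anything; the point is workload management, steering scarce human expertise toward the documents most likely to matter.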
Both the private and the public sectors are exploring advances in machine translation of spoken and written communications. Off-the-shelf commercial software is designed for commercially viable languages, but not for the less-commonly taught, low-density languages. Numerous demonstration projects are under way, and early results show some promise for this type of technology.