User:LissandraNewton2198
From In-Portal Developers Guide
Current revision (19:11, 5 September 2012)
Machine Translation: How It Works, What Users Expect, and What They Get
Machine translation (MT) systems have become ubiquitous. This ubiquity is due to a combination of increased need for translation in today's global marketplace and an exponential growth in computing power that has made such systems viable. Under the right circumstances, MT systems are a powerful tool. They provide low-quality translations in situations where a low-quality translation is better than no translation at all, or where a rough translation of a large document delivered within seconds or minutes is more useful than a good translation delivered in three weeks' time.
Unfortunately, despite the widespread accessibility of MT, it is clear that the purpose and limitations of such systems are frequently misunderstood, and their capability widely overestimated. In this article, I give a brief overview of how MT systems work and thus how they can be put to best use. Then, I present some data on how Internet-based MT is being used right now, and show that there is a chasm between the intended and actual use of such systems, and that users still need educating on how to use MT systems effectively.
How machine translation works
You might expect that a computer translation program would use grammatical rules of the languages in question, combining them with some kind of in-memory "dictionary" to produce the resulting translation. And indeed, that is essentially how some earlier systems worked. But most modern MT systems actually take a statistical approach that is quite "linguistically blind". Essentially, the system is trained on a corpus of example translations. The result is a statistical model that incorporates information such as:
- "when the words (a, b, c) occur in succession in a sentence, there is an X% chance that the words (d, e, f) will occur in succession in the translation" (N.B. there doesn't have to be the same number of words in each pair);
- "given two successive words (a, b) in the target language, if word (a) ends in -X, there is an X% chance that word (b) will end in -Y".
Given a huge body of such observations, the system can then translate a sentence by considering various candidate translations, made by stringing words together almost randomly (in reality, via some "naive selection" process), and choosing the statistically most likely option.
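The selection step described above can be illustrated with a toy example. This is only a minimal sketch under stated assumptions: the tables `PHRASE_TABLE` and `BIGRAM_PROB`, the probabilities in them, and the function names are all invented for illustration, and real systems learn far larger tables from a corpus and use a much smarter search than trying every ordering.

```python
import itertools
import math

# Hypothetical translation table: source word -> {target word: probability}.
PHRASE_TABLE = {
    "la": {"the": 0.9, "it": 0.1},
    "casa": {"house": 0.7, "home": 0.3},
    "blanca": {"white": 0.8, "blank": 0.2},
}

# Hypothetical probabilities of one target word following another.
BIGRAM_PROB = {
    ("the", "white"): 0.1,
    ("white", "house"): 0.2,
    ("the", "house"): 0.1,
    ("house", "white"): 0.001,
    ("the", "home"): 0.05,
    ("white", "home"): 0.01,
}
UNSEEN_BIGRAM = 1e-4  # small probability for word pairs never observed


def score(words, translation_prob):
    """Log-probability of a candidate: translation model plus word-order model."""
    log_p = math.log(translation_prob)
    for pair in zip(words, words[1:]):
        log_p += math.log(BIGRAM_PROB.get(pair, UNSEEN_BIGRAM))
    return log_p


def translate(source_words):
    """Try every word choice in every order and keep the most likely candidate."""
    best_words, best_score = None, -math.inf
    options = [PHRASE_TABLE[w].items() for w in source_words]
    for choice in itertools.product(*options):
        words = [target for target, _ in choice]
        prob = math.prod(p for _, p in choice)
        for ordering in itertools.permutations(words):
            candidate_score = score(ordering, prob)
            if candidate_score > best_score:
                best_words, best_score = ordering, candidate_score
    return " ".join(best_words)


print(translate(["la", "casa", "blanca"]))
```

Note that nothing in the sketch knows that "white" is an adjective or that English adjectives precede nouns: the preferred word order falls out of the numbers alone.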
On hearing this high-level description of how MT works, most people are surprised that such a "linguistically blind" approach works at all. What is even more surprising is that it typically works better than rule-based systems. This is partly because relying on grammatical analysis itself introduces errors into the equation (automated analysis is not completely accurate, and humans do not always agree on how to analyse a sentence). And training a system on "bare text" allows you to base the system on far more data than would otherwise be possible: corpora of grammatically analysed texts are small and few in number; pages of "bare text" are available in their trillions.
However, what this approach means is that the quality of translations is highly dependent on how well elements of the source text are represented in the data originally used to train the system. If you accidentally type "he will returned" or "vous avez demander" (instead of "he will return" or "vous avez demandé"), the system will be hampered by the fact that sequences such as "will returned" are unlikely to have occurred many times in the training corpus (or worse, may have occurred with a completely different meaning, as in "they needed his will returned to the solicitor"). And since the system has little notion of grammar (to work out, for example, that "returned" is a form of "return", and that "the infinitive is likely after he will"), it in effect has little to go on.
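To make the point about sparse evidence concrete, here is a small sketch that counts word pairs in a made-up five-sentence corpus. The corpus and the helper names `bigram_counts` and `evidence` are assumptions for illustration only; a real training corpus would contain millions of sentences.

```python
from collections import Counter

# A tiny, invented training corpus (target-language side only).
CORPUS = [
    "he will return tomorrow",
    "she will return the book",
    "he will return soon",
    "he returned yesterday",
    "they needed his will returned to the solicitor",
]


def bigram_counts(sentences):
    """Count how often each pair of adjacent words occurs in the corpus."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


COUNTS = bigram_counts(CORPUS)


def evidence(phrase):
    """Return the corpus count for each adjacent word pair in the phrase."""
    words = phrase.split()
    return [COUNTS[pair] for pair in zip(words, words[1:])]


print(evidence("he will return"))
print(evidence("he will returned"))
```

In this toy corpus the pair ("will", "returned") is seen only once, and that single occurrence comes from the sentence where "will" is a noun, so whatever the system has learned about it is misleading for the mistyped input.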
Similarly, you may ask the system to translate a sentence that is perfectly grammatical and common in everyday use, but which contains features that happen not to have been common in the training corpus. MT systems are generally trained on the types of text for which human translations are readily available, such as technical or business documents, or transcripts of meetings of multilingual parliaments and conferences. This gives MT systems a natural bias towards certain types of formal or technical text. And even if everyday vocabulary is still covered by the training corpus, the grammar of everyday speech (such as using tú instead of usted in Spanish, or using the present tense instead of the future tense in various languages) may not be.
MT systems in practice
Researchers and developers of computer translation systems have long been aware that one of the greatest dangers is public misperception of their purpose and limitations. Somers (2003)[1], observing the use of MT on the web and in forums, comments that: "This increased visibility of MT has had many side effects. [...] There is a need to educate the general public about the low quality of raw MT, and, importantly, why the quality is so low." Observing MT in use last year, there is sadly little evidence that users' awareness of these issues has improved.
As an example, I present a small sample of data from a Spanish-English MT service that we make available on the Espanol-Ingles site. The service works by taking the user's input, applying some "cleanup" processes (such as correcting some common orthographical errors and decoding common instances of "SMS-speak"), and then looking for translations in (a) a bank of examples from the site's Spanish-English dictionary, and (b) an MT engine. Currently, Google Translate is used as the MT engine, although a custom engine may be used in the future. The figures I present here are from an analysis of 549 Spanish-English queries submitted to the system from machines in Mexico[2]; in other words, we assume that most users are translating from their native language.
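The flow of such a service (cleanup, then lookup in an example bank and an MT engine) might be sketched as follows. This is not the site's actual code: the tables `SMS_EXPANSIONS` and `EXAMPLE_BANK`, the function names, and the stand-in engine `fake_engine` are all hypothetical, and the real MT engine is deliberately replaced by a stub rather than a call to any real translation API.

```python
# Hypothetical "SMS-speak" expansions; a real list would be much longer.
SMS_EXPANSIONS = {"q": "que", "k": "que", "xq": "porque"}

# Hypothetical bank of example translations taken from a dictionary.
EXAMPLE_BANK = {"buenas noches": "good night"}


def clean_up(query):
    """Lower-case the query, strip punctuation and expand SMS-style shorthand."""
    words = [w.strip(".,;:!?¡¿") for w in query.lower().split()]
    return " ".join(SMS_EXPANSIONS.get(w, w) for w in words if w)


def translate_query(query, mt_engine):
    """Clean the input, then consult both the example bank and the MT engine."""
    cleaned = clean_up(query)
    return {
        "cleaned": cleaned,
        "example": EXAMPLE_BANK.get(cleaned),  # None if no stored example
        "mt": mt_engine(cleaned),
    }


def fake_engine(text):
    """Stand-in for a real MT engine; it only echoes its input."""
    return f"<MT output for: {text}>"


result = translate_query("Buenas  noches!", fake_engine)
print(result["cleaned"], "->", result["example"])
print(clean_up("xq no vienes?"))
```

Passing the engine in as a parameter reflects the remark above that the engine in use could later be swapped for a custom one.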
First, what are people using the MT system for? For each query, I made a "best guess" at the user's purpose for translating the query. In many cases, the purpose is quite obvious; in a few cases, there is clearly ambiguity. With that caveat, I judge that in about 88% of cases, the intended use is fairly clear-cut, and categorise these uses as follows:
- Looking up a single word or term: 38%
- Translating a formal text: 23%
- Internet chat session: 18%
- Homework: 9%
A surprising (if not alarming!) observation is that in such a large proportion of cases, users are using the translator to look up a single word or term. In fact, 30% of queries consisted of a single word. The finding is a little surprising given that the site in question also has a Spanish-English dictionary, and suggests that users confuse the purpose of dictionaries and translators. Although not represented in the raw figures, there were clearly some cases of consecutive searches where it appeared that a user was deliberately splitting up a sentence or phrase that would probably have been better translated if left together. Perhaps as a result of student over-drilling on dictionary usage, we see, for example, a query for cuarto para ("quarter to") followed immediately by a query for a number. There is clearly a need to educate students and users in general on the difference between the electronic dictionary and the machine translator[3]: in particular, that a dictionary will guide the user to choosing the correct translation given the context, but requires single-word or single-phrase lookups, whereas a translator generally works best on whole sentences and, given a single word or term, will simply report the statistically most common translation.
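The difference between the two tools on a single-word query can be shown with a short sketch. The sense frequencies in `SENSES` are invented for illustration and are not taken from any real dictionary or corpus; the function names are likewise hypothetical.

```python
# Hypothetical relative frequencies of English translations of a Spanish word.
SENSES = {
    "cuarto": {"room": 0.55, "quarter": 0.30, "fourth": 0.15},
}


def translator(word):
    """A statistical translator reports only the single most frequent option."""
    options = SENSES[word]
    return max(options, key=options.get)


def dictionary(word):
    """A dictionary lists every sense so the user can pick one from context."""
    options = SENSES[word]
    return sorted(options, key=options.get, reverse=True)


print(translator("cuarto"))
print(dictionary("cuarto"))
```

Under these assumed figures, a user who splits "cuarto para" into separate queries gets "room" from the translator, even though the sense wanted was "quarter"; the dictionary at least offers that sense as a choice.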
I estimate that in less than a quarter of cases, users are using the MT system for its "trained-for" purpose of translating or gisting a formal text (and are entering an entire sentence, or at least a partial sentence rather than an isolated noun phrase). Of course, it is impossible to know whether any of these translations were then intended for publication without further proofing, which is definitely not the purpose of the system.
The use for translating formal texts is now almost rivalled by the use for translating informal on-line chat sessions, a context on which MT systems are typically not trained. The on-line chat context poses particular problems for MT systems, since features such as non-standard spelling, lack of punctuation and the presence of colloquialisms not found in other written contexts are common. For chat sessions to be translated effectively would probably require a dedicated system trained on a more suitable (and possibly custom-built) corpus.