It’s not what you say, but what you don’t say that matters. Such is the thinking behind a new approach to automatic translation, which looks at the relationships between words in different languages.

Meaning in spaces

Technology Review reports that engineers at Google have come up with a system that creates the phrasebooks and dictionaries necessary to deliver automatic translations. To do this, it mines data and builds a model of one language, which it then compares against a model of another. The engineers, headed up by research scientist Tomas Mikolov, have devised a model that represents words as vectors of numbers and puts the emphasis on the relationships between those vectors, and in so doing have turned a linguistic conundrum into a mathematical one, the technology news provider says.

Mikolov and his colleagues Quoc V Le and Ilya Sutskever have published a report on their new system through Cornell University Library. The paper, entitled Exploiting Similarities Among Languages for Machine Translation, explains how words and phrases currently missing from dictionaries can be acquired and translated by machines that have learnt the “mapping between languages from small bilingual data”.

According to the developers: “This method makes little assumption about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pairs.” Rather than relying on the traditional method of comparing the same text in different languages, the system learns the structure of one language so it can compare it with the skeleton of another.

As Technology Review explains, two sentences in different languages will probably be constructed from similar words, which will feature in similar places within the sentence. For example, ‘the cat sat on the mat’ becomes ‘le chat s’est assis sur le tapis’ in French and ‘kot usiadł na macie’ in Polish. In each case the word for ‘cat’ (‘cat’, ‘chat’, ‘kot’) comes at the beginning of the sentence, and the word for ‘mat’ (‘mat’, ‘tapis’, ‘macie’) at the end. This association is what the new system relies on: it perceives these relationships as vectors that can then be manipulated mathematically to create a translation.
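The claim, roughly, is that words keep the same relationships to one another whichever language they sit in. A minimal sketch of that idea, using made-up two-dimensional vectors rather than real learnt embeddings (real word-vector models use hundreds of dimensions trained on large corpora):

```python
import numpy as np

# Hypothetical 2-D 'embeddings', invented for illustration only; in
# practice these would be learnt from large monolingual corpora.
en = {"cat": np.array([1.0, 0.2]), "mat": np.array([0.3, 1.0])}
fr = {"chat": np.array([0.2, 1.0]), "tapis": np.array([1.0, 0.3])}

def cosine(a, b):
    """Cosine similarity: a measure of the angle between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The coordinates differ between the two languages, but the geometric
# relationship between 'cat' and 'mat' matches that between 'chat'
# and 'tapis' -- it is this shared structure the method exploits.
print(cosine(en["cat"], en["mat"]))
print(cosine(fr["chat"], fr["tapis"]))
```

The two similarities come out identical here only because this toy French ‘space’ is a mirror image of the English one; the method bets that real languages behave approximately, though not exactly, like this.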

It could be said that computers are far better at understanding maths than they are at getting to grips with linguistics, so by turning machine translation into a mathematical equation, the developers were able to come up with a way of solving it. Using a bilingual dictionary and phrasebook, they created a map between groups of words in the two languages, the through line that the translation is built upon. From short word groups, the system was expanded to longer ones incorporating more of the vector space, and Mikolov explains that the translations the system has achieved between English and Spanish, in both directions, have been surprisingly accurate. The mapping is then used as a way of spotting mistakes in the dictionaries machine translations rely on, making the results more accurate.
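A rough sketch of how such a map can be learnt from a small dictionary. The paper trains its translation matrix by gradient descent; the version below substitutes an ordinary least-squares fit, and all vectors and word pairs are invented for illustration:

```python
import numpy as np

# Made-up 2-D stand-ins for learnt word vectors.
en = {"one": np.array([1.0, 0.0]), "two": np.array([0.0, 1.0]),
      "three": np.array([1.0, 1.0]), "four": np.array([2.0, 1.0])}
rot = np.array([[0.0, -1.0], [1.0, 0.0]])  # Spanish space: a rotated copy
es = {w: rot @ v for w, v in zip(["uno", "dos", "tres", "cuatro"],
                                 en.values())}

# The small bilingual dictionary: three known word pairs.
pairs = [("one", "uno"), ("two", "dos"), ("three", "tres")]
X = np.stack([en[a] for a, _ in pairs])
Z = np.stack([es[b] for _, b in pairs])

# Learn the matrix W minimising ||XW - Z||^2 by least squares.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

# Translate a word missing from the dictionary: map its vector across,
# then pick the nearest neighbour in the Spanish space.
guess = en["four"] @ W
best = min(es, key=lambda w: float(np.linalg.norm(es[w] - guess)))
print(best)  # -> cuatro
```

The same nearest-neighbour step suggests how mistakes can be flagged: a dictionary entry whose mapped vector lands far from its listed translation is suspect.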

According to the team behind the method, it is just as effective whether the two languages are from the same linguistic family – such as Indo-European – or from two separate branches. It is an exciting development, one that may make machine translation more accurate and efficient, but the developers note it is only the beginning.

The trouble with syntax

Currently, few, if any, would argue that machine translation generates more accurate results than a professional human translator. Indeed, for truly fluent, accurate and readable results, a living and breathing linguist is really the only way to go.

As its critics have pointed out, machine translation struggles with syntax. While Mikolov et al.’s system is built upon an understanding that the same words in different languages will appear in corresponding places within a sentence, this does not hold for all languages. A further ongoing issue with automatic translation is that while some language pairs – typically the most common ones – are improving all the time, others return far less reliable results. This is because the statistical machine translation technology used by most of the free automatic services works by analysing texts and their translations already online in order to find the matches it needs to create a translation. The obvious problem with this is that where language pairs are not as ubiquitous, the results will not be as accurate.

With new content being uploaded to the internet every day, this is something that will change, and automatic translations will continue to improve. The vector space model created by Mikolov is likely to play an integral role in these improvements. However, until machine translators return consistently high-quality results regardless of the length of the text or the language pair involved, a human translator remains the safest option when considering language services.