Computer scientists at MIT and Israel's Technion recently developed an algorithm that analyzes quirks and errors in English-as-a-second-language (ESL) essays to predict the author's native tongue, potentially unlocking new tools for translation and teaching and also producing an alternative map of language similarity.
The breakthrough got us thinking: How do native speakers of different languages typically screw up English?
Researchers Dr. Boris Katz and graduate student Yevgeni Berzak offered the following examples over email:
In Russian, there are no determiners (no words equivalent to "a" and "the" in English). As a result, native speakers of Russian tend to omit determiners in English. For example, a native speaker of Russian uses the following sentence in our data: "I was champion of swimming competition in Russia," instead of, "I was the champion of a swimming competition in Russia." Another example is "Everyone has now a car" which would be the correct order of modifiers if translated word-to-word to Russian, instead of the proper form in English "Everyone has a car now."
In French, adjectives follow nouns (in most cases). Consequently, native speakers of French use phrases such as "lessons very important" instead of "very important lessons," "afternoon free" instead of "free afternoon" and so forth. Another pattern that is common with French natives is preference for the construction "noun preposition noun" over "noun noun" compounds which are less frequent in French. Hence, French natives may prefer "licence for sailing" over "sailing licence," and "manager of Bruce Springsteen" instead of "Bruce Springsteen's manager."
With natives of German we have the example "The pollution will more and more increase," which would be the correct position of the verb in German.
Two other types of ESL patterns that are influenced by the native language are prepositions, and count/mass noun distinctions.
ESL learners would often direct-translate the preposition from their native language (for example, Portuguese natives confuse "in" and "on" in a systematic way, based on the usage of the equivalent prepositions in Portuguese).
Different languages have different count/mass noun distinctions. For example "information", "furniture" and "homework" are count nouns in French, and so natives of French may say "informations", "furnitures" and "homeworks" in English.
Here's a description of the process provided by the researchers:
The algorithm receives ESL essays by native speakers of different languages as input. It starts by learning linguistic patterns in ESL that are characteristic of each of these languages. The kind of patterns we take into consideration are word order, syntactic relations between words, and morphological suffixes and prefixes.
Using these patterns it is possible to predict the native language of a given ESL document, but in fact our method reveals something even deeper: based on these patterns, the algorithm estimates the similarity between all the language pairs, and clusters them into a tree. It turns out that this tree is very similar to a tree that one would obtain when using similarities according to the linguistic properties of these languages as manually documented by linguists. These linguistic properties are called "typological features" and refer to things like word order, valid syntactic constructions, how to form negation, are there morphological suffixes and prefixes or not, etc.
Finally, the algorithm uses the ESL-based similarity tree for prediction of typological features of languages for which we have no prior linguistic data. For example, imagine that I know nothing about the typological features of Portuguese (I don't know the correct word order, how to form negation etc), but I have a bunch of ESL documents written by natives of Portuguese. Using the similarity tree from ESL, I can determine that Portuguese is in fact similar to Spanish and Catalan. Assuming I do know the typological features for these two languages, I can now predict the typological features of Portuguese based on the features of Spanish and Catalan. Of course, this method is not perfect, but it's a pretty good approximation: in 72% of the cases the predicted typological features are correct.
To sum up, the practical value of the algorithm is the ability to predict native languages, their similarity structure, and their typological features directly from ESL. The fact that we can do this also tells us something new and interesting about language: there is a strong relation between native language typology and second language usage. In other words, the properties of your native language systematically affect the way you use a foreign language.
Based entirely on ESL essays, the researchers were able to create a language similarity tree, showing the close links between, for instance, Japanese and Korean. Their language similarity tree looks remarkably similar to one based on data from The World Atlas of Language Structures (WALS).
"The striking thing about this tree is that our system inferred it without having seen a single word in any of these languages," Berzak said.
Here's a comparison of the ESL tree and the WALS tree:
Courtesy of the researchers
And an artistic rendering of the ESL tree:
Jose-Luis Olivares/MIT