Machine translators are becoming an indispensable tool in modern life.
Wherever you are in the world, whether the United States, Brazil, France, or the island of Borneo in Asia, services such as Google and Facebook can use machine translation to render almost any text on their platforms in the local language.
But what you may not know is that most translation systems use English as an intermediate language. In other words, a translation from Chinese into French actually goes from Chinese to English and then from English to French.
The reason is that English translation datasets (both into and out of English) are very large and easy to obtain. However, using English as an intermediary reduces overall translation accuracy and makes the whole process more complicated and bloated.
The cost adds up at scale: on Facebook, News Feed alone requires about 20 billion translations per day.
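To make the accuracy problem with English pivoting concrete, here is a minimal sketch. The three lookup tables are hypothetical stand-ins for trained translation models, not anything from Facebook's system. Chinese distinguishes informal 你 from formal 您, and French distinguishes tu from vous, but English has only "you", so the distinction is lost on the way through the pivot.

```python
# Toy lookup tables standing in for trained translation models (hypothetical data).
ZH_TO_EN = {"你": "you", "您": "you"}   # English has no formal/informal split
EN_TO_FR = {"you": "vous"}              # the pivot step has to pick one form
ZH_TO_FR = {"你": "tu", "您": "vous"}    # a direct table keeps the distinction


def translate_via_english(word: str) -> str:
    """Pivot translation: Chinese -> English -> French."""
    return EN_TO_FR[ZH_TO_EN[word]]


def translate_direct(word: str) -> str:
    """Direct translation: Chinese -> French, with no intermediate language."""
    return ZH_TO_FR[word]


if __name__ == "__main__":
    for word in ("你", "您"):
        print(word, "pivot:", translate_via_english(word), "direct:", translate_direct(word))
```

In this toy setup the pivot route returns the same French word for both Chinese inputs, while the direct route keeps them apart.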
In response to these problems, Facebook recently developed a new machine translation model that translates directly between two languages, in either direction, without going through English. Under the BLEU evaluation metric, the new model scores 10 points higher than traditional English-centric models.
Facebook’s new model is called M2M-100, and Facebook claims it is the first multilingual machine translation model that can translate directly between any pair of 100 languages. Facebook AI built a huge dataset of 7.5 billion sentences in 100 languages. Using this dataset, the research team trained a general translation model with more than 15 billion parameters. According to a Facebook blog post, the model captures information from related languages and reflects a more diverse range of scripts and word forms.
“The main challenge is how we use our translation system to effectively meet the needs of people all over the world,” said Angela Fan, a research assistant at Facebook AI, in an interview. “You have to translate all languages, covering the various needs that people will encounter. For example, there are many places in the world where local people use multiple languages, and English is not among them, but existing translation systems rely heavily on English.” She also pointed out that of the billions of posts published daily in 160 languages on the Facebook platform, two-thirds are in languages other than English.
To do this, Facebook had to use a variety of new techniques to collect large amounts of public data from all over the world. “A lot of the work here is actually based on our years of research at Facebook. Like different Lego blocks, we are putting the pieces together to build today’s system,” Fan explained.
The team first used CommonCrawl, an open repository of web crawl data, to collect text samples from the web. They then used FastText, a text classification system that Facebook developed and open-sourced a few years ago, to identify the language of each text. “This system basically looks at some text and then tries to determine what language it is written in,” Fan said. “In this way, we separate a bunch of web text into different languages. Next, our goal is to identify the corresponding sentences.”
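The language-sorting step Fan describes can be pictured with a deliberately simple sketch. This is not how FastText works internally: it swaps in a hand-written list of marker words per language (all hypothetical) where a real classifier would use a model trained on data, and it only shows how crawled sentences end up in per-language buckets.

```python
from collections import defaultdict

# Tiny hand-picked marker words per language (hypothetical). A real classifier
# learns its features from training data instead of using a fixed list.
MARKERS = {
    "en": {"the", "is", "and", "of"},
    "fr": {"le", "la", "est", "et"},
    "es": {"el", "los", "es", "y"},
}


def guess_language(sentence: str) -> str:
    """Return the language whose marker words occur most often, or 'unknown'."""
    tokens = sentence.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in MARKERS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def bucket_by_language(sentences):
    """Group sentences into lists keyed by their guessed language."""
    buckets = defaultdict(list)
    for sentence in sentences:
        buckets[guess_language(sentence)].append(sentence)
    return dict(buckets)


if __name__ == "__main__":
    crawl = ["the cat is on the mat", "le chat est sur la table", "el gato es negro"]
    print(bucket_by_language(crawl))
```

Once the text is split this way, the next stage only has to look for matching sentences across two buckets at a time.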
“Traditionally, people use human translators to create translation data,” she continued. “This is difficult to do on a large scale. For example, it is difficult to find people who speak both English and Tamil, and finding people who speak both French and Tamil is even more difficult. Non-English translation is still an area that needs to be strengthened.”
To mine the necessary data on a large scale, Fan’s team relies heavily on the LASER system. “It reads sentences, takes the text and constructs a mathematical representation of that text, so that sentences with the same meaning are mapped to the same representation,” she explained. “If I have a sentence in Chinese and a sentence in French, and they say the same thing, they will overlap like a Venn diagram. The overlapping area is considered a set of corresponding sentences.”
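The idea of matching sentences through their mathematical representations can be sketched as follows. The three-dimensional vectors are made-up stand-ins for real sentence embeddings, and plain cosine similarity with a fixed threshold is a simplification chosen for this illustration, not a description of the actual LASER mining procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mine_pairs(src, tgt, threshold=0.9):
    """For each source sentence, keep its nearest target sentence if similar enough.

    src and tgt map sentence text to an embedding vector.
    Returns a list of (source_sentence, target_sentence, score) tuples.
    """
    pairs = []
    for s_text, s_vec in src.items():
        best_text, best_score = max(
            ((t_text, cosine(s_vec, t_vec)) for t_text, t_vec in tgt.items()),
            key=lambda item: item[1],
        )
        if best_score >= threshold:
            pairs.append((s_text, best_text, best_score))
    return pairs


# Hypothetical embeddings: the first Chinese and French sentences mean the same
# thing ("I like cats"), so their vectors are nearly identical.
ZH = {"我喜欢猫": [0.9, 0.1, 0.0], "今天下雨": [0.0, 0.2, 0.9]}
FR = {"J'aime les chats": [0.88, 0.15, 0.02], "Le train est en retard": [0.1, 0.9, 0.1]}

if __name__ == "__main__":
    for zh, fr, score in mine_pairs(ZH, FR):
        print(zh, "<->", fr, round(score, 3))
```

The second Chinese sentence has no close neighbour among the French vectors, so it falls below the threshold and produces no pair, which is the behaviour one wants when a sentence simply has no translation in the other bucket.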
Of course, not all languages have a lot of text content online.
In these situations, Fan’s team used monolingual data to improve quality. Taking Chinese to French as an example, Fan explained: “If my goal is to translate Chinese to French, but for some reason the translation quality is not good enough, then I can try to improve it using French-only data. What I want to do is train a system in reverse: from French to Chinese. For example, I take all the French text from Wikipedia and then translate it into Chinese.”
The result is a large “synthetic” corpus generated by machine translation. Fan said, “Once I have this ‘synthetic’ Chinese, back-translated from French, I can add the data to my forward model. That is, I take the original Chinese data, add this supplementary ‘synthetic’ data, and then translate all of it into French. Because new example sentences have been added on both the input and the output side, the model will be more powerful.”
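The back-translation recipe Fan outlines can be sketched in a few lines. The function names and the toy French-to-Chinese lookup below are hypothetical placeholders for a trained reverse model. The point is only the data flow: French-only text goes through the reverse model, and the resulting pairs are appended to the real Chinese-to-French training data.

```python
def toy_fr_to_zh(sentence: str) -> str:
    """Hypothetical reverse model (French -> Chinese) backed by a lookup table."""
    table = {"Bonjour": "你好", "Merci": "谢谢"}
    return table.get(sentence, "<unk>")


def back_translate(monolingual_fr, reverse_model):
    """Turn French-only sentences into synthetic (Chinese, French) pairs."""
    return [(reverse_model(fr), fr) for fr in monolingual_fr]


def build_training_set(real_pairs, monolingual_fr, reverse_model):
    """Combine human-made pairs with synthetic pairs for the forward model."""
    return real_pairs + back_translate(monolingual_fr, reverse_model)


if __name__ == "__main__":
    real = [("再见", "Au revoir")]
    french_only = ["Bonjour", "Merci"]
    for zh, fr in build_training_set(real, french_only, toy_fr_to_zh):
        print(zh, "->", fr)
```

Note that the French side of every synthetic pair is genuine text; only the Chinese input is machine-generated, so the forward model still learns to produce real French.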
It remains to be seen whether this project will produce a “digital Babel” that can perform lossless translation among more than 6,200 spoken languages around the world. Fan pointed out that the ultimate success of this project depends on the amount of resources that AI can use. For major languages such as French, Chinese, German, Spanish and Hindi, the resources are massive. “People use these languages to write a lot of text on the Internet,” she said. “They can contribute a lot of data, and our models can use this data to get better.”
“For languages with very few resources, I personally identified a lot of language categories that we might need to improve,” Fan continued. “For African languages, we are quite good at Swahili and Afrikaans. We can make a lot of improvements in languages like Zulu. We need to face additional research challenges in these languages.”
M2M-100 code on GitHub:
https://github.com/pytorch/fairseq/tree/master/examples/m2m_100