At the Forefront of Machine Learning

Computer scientist Derry Wijaya builds tools to translate "low-resource" languages and track how media perspectives shape public opinion


By Jeremy Schwab

In an analog world, translating from one language to another was often done with the help of a dictionary. In today’s digital world, with access to incredibly fast computer algorithms and a vast amount of data, researchers are building a new landscape for language translation. And because computers are the ones doing the translating in this new reality, the landscape is one that a machine can understand. Derry Wijaya, a computer scientist at CAS and a pioneer in this new space, explains.


“The current natural-language processing approach is to treat words not like characters but rather like a vector,” she says. “So you can embed that word in a vector space. It’s a high-dimensional space, and each point in that space is a word.”

The idea is that words with similar meanings sit near each other in that space. So “kitten” and “furry” are close to “cat.” Computer scientists like Wijaya can then map the vector spaces of two languages onto each other, creating a rough translation. This rough translation is then refined by feeding the computer program a diet of correct sentences in each language. The program then learns how to better translate between the two vector spaces (learning where the languages differ from each other).
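To make the idea of nearby words concrete, here is a minimal sketch in Python. The three-dimensional vectors are invented for illustration (real word embeddings have hundreds of dimensions and are learned from text), and the names `TOY_VECTORS`, `cosine_similarity`, and `nearest_words` are hypothetical, not taken from Wijaya's work. Closeness is measured with cosine similarity, one common choice.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
TOY_VECTORS = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "furry": [0.7, 0.6, 0.2],
    "treaty": [0.1, 0.05, 0.95],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(word, vectors, k=2):
    """Return the k words whose vectors are most similar to the given word's."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


print(nearest_words("cat", TOY_VECTORS))
```

With these toy values, “kitten” and “furry” rank as the closest neighbors of “cat,” while an unrelated word like “treaty” scores far lower.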

All of this is done using a technique called deep learning, which is a method of machine learning that empowers computer programs to learn through trial and error and begin to solve problems for themselves without a human having to feed them exact instructions.

Getting from 100 to 7,000

Of the roughly 7,000 human languages, Google Translate, which is increasingly used to communicate between languages, can translate only about 100. Wijaya envisions a day when anybody with a smartphone or computer can quickly translate between their language and another. To help bring that day a little bit closer, she developed a tool using natural-language processing that can translate from a so-called “under-resourced” language to another language. Under-resourced languages are rich in cultural meaning and tradition but do not have a wide presence on the web and so are harder to translate using a more traditional technique like comparing parallel or like sentences.

“There is this big gap between what’s available and what could be available,” says Wijaya, a CAS assistant professor of computer science.

Since Wijaya published her findings, others in the field have built on the techniques she helped pioneer. Eventually, she hopes that this approach can help big social media platforms better police hate speech in under-resourced languages by detecting and taking down posts before they incite violence. She also foresees other uses, like allowing immigrants to communicate better with people in their new countries, or helping social media users communicate more easily across languages (for instance, to send disaster aid).

A faster way

Since Wijaya began working in natural-language processing, the field has grown dramatically as more and more researchers take advantage of deep learning tools. Even in this growing pack, Wijaya’s research stands out. For instance, the prestigious Association for Computational Linguistics conference recently accepted a paper she submitted. The conference, which will be virtual this year due to the pandemic, has a very low acceptance rate for papers, with over 4,000 submissions this year (twice as many as last year).

Wijaya’s paper describes a shortcut of sorts that she has discovered for doing data analysis between two languages. She and colleague Lei Guo from the BU College of Communication had been collaborating on an analysis of the news treatment of topics like gun violence or global warming across languages and countries. And what Wijaya found was that if you are looking to analyze the use of very specific terms like “gun control” or “gun rights” and also the general tone of the news article those terms are used in, you can do it without teaching a computer how to translate entire languages.

How does it work? Wijaya and Guo first create a “frame,” or way of analyzing news articles, in English. They do this by teaching a computer program to categorize articles based on their tone and point of view. Then Wijaya has that program interface with programs that are trained separately in German and Turkish using her vector-space approach. She then teaches the programs the correct translations in each language of just certain key words like “gun control” or “gun rights” or other hot-button terms that are directly relevant to her research focus and often convey the point of view of the author. 

“These key words act as anchors that let you code-switch,” she explains, using a term for switching rapidly from one language to another. “So the vectors become more similar across languages.” 
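The anchoring idea can be sketched with a toy example. This is not Wijaya and Guo's actual method: it assumes two-dimensional vectors and assumes the two languages' spaces differ only by a rotation, which is then estimated from two anchor pairs by least squares. All vectors and names (`ENGLISH`, `GERMAN`, the `DE_` labels, `fit_rotation`, `closest_word`) are hypothetical.

```python
import math


def rotate(vec, angle):
    """Rotate a 2-D vector counterclockwise by the given angle in radians."""
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


def fit_rotation(anchor_pairs):
    """Least-squares rotation angle taking source anchors onto target anchors."""
    cos_sum = 0.0
    sin_sum = 0.0
    for (sx, sy), (tx, ty) in anchor_pairs:
        cos_sum += sx * tx + sy * ty  # dot product
        sin_sum += sx * ty - sy * tx  # 2-D cross product
    return math.atan2(sin_sum, cos_sum)


def closest_word(vec, vocab):
    """Word in vocab whose vector is nearest to vec (Euclidean distance)."""
    return min(vocab, key=lambda word: math.dist(vec, vocab[word]))


# Toy English space, invented for illustration.
ENGLISH = {
    "gun control": (1.0, 0.2),
    "gun rights": (0.2, 1.0),
    "protest": (-0.8, 0.5),
}

# Toy "German" space: the same layout, rotated by an offset the fit must recover.
OFFSET = 0.7
GERMAN = {"DE_" + word: rotate(vec, -OFFSET) for word, vec in ENGLISH.items()}

# Only the two hot-button terms are supplied as known translations (anchors).
anchors = [
    (GERMAN["DE_gun control"], ENGLISH["gun control"]),
    (GERMAN["DE_gun rights"], ENGLISH["gun rights"]),
]
angle = fit_rotation(anchors)

# A word that was never given as an anchor still lands on its English counterpart.
mapped = rotate(GERMAN["DE_protest"], angle)
print(closest_word(mapped, ENGLISH))
```

In this toy setup, two anchor pairs are enough to recover the rotation, so the non-anchor word maps onto “protest” without any full translation step. Real embedding spaces are high-dimensional and only approximately aligned, so the match would be less exact.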

Wijaya found that the program was then able to analyze the point of view of the article just as accurately as it would have using full-language translation software like Google Translate. The accuracy she achieved is possible in part because Google Translate introduces some errors that Wijaya avoided and that were critical to her research focus, such as when Google Translate translated “gun control” in English to “gun rights” in German—effectively the opposite meaning.

For the purpose of this type of narrow research (as opposed to translating whole languages), it doesn’t matter that her program doesn’t fully understand or translate every word in an article. “It’s the same way humans first start to learn language,” she says. “Maybe we don’t know the whole sentence but we learn the important bits of it in order to start to understand it.”

In the same vein, Wijaya sees the broader field of machine learning and deep learning as in its infancy, still trying to understand the basics and build off of that. She imagines a future where machines begin to use both perception and reasoning to analyze information.

“I think with the neural network, we have a way of allowing the machine to perceive the world in a way we have never been able to before, for instance perceiving images or videos without needing to write rules about it to help them understand it,” she says. “They can produce good representations of the world, such as sentences as vectors. But I think there is still a long way to go in terms of having human intelligence in machines, because these things are just perception.”

“There is more to intelligence than perceiving things,” she continues. “Like reasoning. And doing reasoning with vectors is not easy. For instance, to use logic like ‘If x then y,’ if x and y are both vectors then you can’t do reasoning very easily with that. So I think the way it’s moving is towards using neural networks as our senses. Then doing more internal representation, which is more symbolic. So this combination of distributional representation and symbolic representation would be where it’s going.”

Diversifying computer science

Wijaya also makes it her mission to expand access to and awareness of opportunities in computer science for young women. Along with fellow CAS computer scientist Kate Saenko, she is organizing a workshop for undergraduate women in computer science-related fields to learn more about research opportunities in computer science. Postponed until the fall due to the coronavirus, the workshop already has 50 applicants from universities around the Boston area. Participants will learn about opportunities to build research experience while still undergraduates (this year’s focus is artificial intelligence) and be encouraged to apply to graduate programs, a level of computer science training where there are not as many women.

The program is called the Explore CS Research Initiative, and Google Research gave funding to 24 universities this year to participate. The initiative is based on one at Carnegie Mellon University, which Wijaya helped organize when she was a PhD student there.

“I feel it is important to increase the involvement of women in CS research and research careers for many reasons,” says Wijaya. “Women are still currently underrepresented in these areas. Since 2000, women have earned only one in five computer science doctoral degrees. I truly believe that by diversifying the field, we can discover better and more interesting problems and ideas to solve in research as this diversity can bring fresh perspectives that are often much needed in research.”
