When AI fails the language test, who is left out of the conversation?

By Sara Ruberg, The New York Times Company

Stanford researchers gave a popular artificial intelligence chatbot a language test.

They asked the bot in Vietnamese to write a traditional poem in the form known as “song thất lục bát” that follows a pattern of lines made up of seven, seven, six, then eight words. When the bot spit out an answer, it wrote a poem but didn’t follow the format.

The team tried a different prompt, asking what the proper Vietnamese word was for a mother’s younger brother, and it responded with the words for a father’s younger and older siblings.

These flaws are not unique to Claude 3.5, the chatbot by the AI company Anthropic that the researchers queried, but they illustrate some of the ways in which AI can get language outside of standard American English wrong.

While the use of AI has exploded in the West, much of the rest of the world has been left out of the conversation since most of the technology is trained in English. AI experts worry that the language gap could exacerbate technological inequities and that it could leave many regions and cultures behind.

A delay in access to good technology of even a few years “can potentially lead to a few decades of economic delay,” said Sang Truong, a doctoral candidate at the Stanford Artificial Intelligence Laboratory at Stanford University. He is on the team that built a Vietnamese language model and tested it against others.

The tests his team ran found that AI tools across the board could get facts and diction wrong when working with Vietnamese, likely because it is a “low-resource” language by industry standards, which means that there aren’t sufficient data sets and content available online for the AI model to learn from.

Low-resource languages are spoken by tens and sometimes hundreds of millions of people around the world, but they yield less digital data because AI tech development and online engagement are centered in the United States and China. Other low-resource languages include Hindi, Bengali and Swahili, as well as lesser-known dialects spoken by smaller populations around the world.

An analysis of top websites by W3Techs, a tech survey company, found that English makes up more than 60% of the internet’s language data. While English is widely spoken globally, native English speakers make up only about 5% of the world’s population, according to Ethnologue, a research organization that collects language data. Mandarin and Spanish are other examples of languages with a significant online presence and reliable digital data sets.

Academic institutions, grassroots organizations and volunteer efforts are playing catch-up to build resources for speakers of languages that aren’t as well represented in the digital landscape.
