
Student builds AI tool to revitalize endangered Indigenous language

Jared Coleman, who recently earned his Ph.D. in computer science, and his supervisor, Bhaskar Krishnamachari, are bound by a shared love of languages—both human and computer.

Krishnamachari grew up in India speaking Tamil, Hindi and English, and started learning French and Mandarin Chinese in college. Coleman, who was raised Anglophone, loved Spanish in high school and learned Portuguese from his now-wife and friends in college.

During the pandemic, Coleman started taking online classes in a lesser-known language: Owens Valley Paiute. Coleman is a member of the Big Pine Paiute Tribe of Owens Valley—his father, David, grew up on the tribe’s reservation in Big Pine, CA, and Paiute is his ancestral language.

ChatGPT and other large language models (LLMs) exhibit human-level performance on many natural-language tasks in English, in part because roughly one-fifth of the world speaks English and text to train on is plentiful. The same is true of other widely used tongues. But Paiute is deemed a “no-resource language,” meaning there are no publicly available Paiute sentences translated into English on which to train a machine learning model.

In a new paper, “LLM-Assisted Rule-Based Machine Translation for Low/No-Resource Languages,” appearing on the pre-print server arXiv, Coleman and Krishnamachari propose a machine translation approach called LLM-RBMT (Rule-Based Machine Translation) to help people learn no-resource languages. The paper’s co-authors are Khalil Iskarous, a USC Dornsife associate professor of linguistics, and Ruben Rosales, an independent researcher.

Their approach combines more “old school” rule-based translation tools with a more advanced, natural language-based LLM. In the researchers’ method, the LLM does not translate into or from Owens Valley Paiute. Instead, it helps to guide the rule-based translators, which rely on grammatical and vocabulary rules to translate between languages.

“Essentially, the LLM acts as a sophisticated intermediary, using its advanced understanding of language to make sure the rule-based system produces accurate translations,” said Coleman.

The translation tool simplifies complex sentences and uses placeholders (in this case, English words) for unknown words. While this process loses some meaning, it still produces understandable and grammatically correct translations.

This method, said Coleman, mirrors how language learners naturally speak by mixing known and unknown words, making it a practical tool for real-world use.
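
To make the placeholder idea concrete, here is a minimal Python sketch of a rule-based step that keeps unknown words in English. It is not the researchers' code: the lexicon entries are invented stand-in tokens rather than real Owens Valley Paiute vocabulary, the subject-object-verb ordering rule is an assumption chosen only for illustration, and the names SimpleSentence, TOY_LEXICON, translate_word and translate are hypothetical. In the approach described above, an LLM would first reduce a complex sentence to a simple structured form; here that step is done by hand.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon. The target-side tokens are invented stand-ins,
# NOT real Owens Valley Paiute words.
TOY_LEXICON = {
    "dog": "tok_dog",
    "water": "tok_water",
    "drink": "tok_drink",
}


@dataclass
class SimpleSentence:
    """A simplified sentence, as an LLM might extract it from free text."""
    subject: str
    verb: str
    obj: str


def translate_word(word: str) -> str:
    # Known words map to the target token; unknown words stay in English,
    # wrapped in brackets so a learner can see they are placeholders.
    return TOY_LEXICON.get(word.lower(), f"[{word}]")


def translate(sentence: SimpleSentence) -> str:
    # Assumed toy grammar rule: emit subject, then object, then verb.
    ordered = [sentence.subject, sentence.obj, sentence.verb]
    return " ".join(translate_word(w) for w in ordered)


if __name__ == "__main__":
    print(translate(SimpleSentence("dog", "drink", "water")))
    print(translate(SimpleSentence("dog", "drink", "coffee")))
```

In the second call the word "coffee" is missing from the toy lexicon, so it comes through as the bracketed English placeholder while the rest of the sentence is still built by the rules.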

“The tool is smart enough, given a few hints, to be able to do a lot of the translation on its own,” added Krishnamachari.
