1905 Colorado Avenue, Boulder, CO 80309

Title: NLP for Endangered Languages: Where do we start?

Abstract: With more than half the world's 7,100 languages in danger of disappearing by the end of this century, the study and preservation of extremely low-resource languages is caught in a vicious cycle. Academic and community linguists struggle to process language data because of out-of-date software applications that lack machine-learning-powered assistance. While NLP models could automate crucial early steps, they first need enough training data in the language. This talk explores strategic points where we could break the vicious cycle. Linguists who express interest in adopting AI assistance for analysis and annotation find the available user interfaces to NLP models difficult to set up, the models too inaccurate, or both, leaving them in doubt about the usefulness of NLP in language documentation. Many linguists understand that large amounts of training data are necessary to foster language technology development but wonder how much manual annotation they should budget from limited personnel, time, and funding. NLP researchers interested in low-resource languages often follow familiar pipelines or respond to commercial interests, focusing on speech recognition and machine translation, but do these tasks promise the highest impact on the study and preservation of endangered languages? By asking fundamental questions, we can determine baselines, benchmarks, and metrics that address the reality of extremely low-resource settings and explore the potential of NLP to accelerate the documentation of endangered languages and facilitate technological development by their speakers.

Bio: Dr. Sarah Moeller is Assistant Professor of Linguistics, director of the Machine Learning for Endangered Language Documentation (MELD) Lab, and co-founder of African Language Technology for Equitable Representation (ALTER) at the University of Florida. She received her PhD in Linguistics and Cognitive Science at CU Boulder. She is an interdisciplinary researcher bridging technological and knowledge gaps between natural language processing (NLP) and “pen-and-paper” linguistics. Her published works, in both social science and ACL venues, include “AI for Interlinearization and POS-tagging: Teaching Linguists to Fish” and “Leveraging Syntactic Dependencies in Disambiguation: The Case of African American English”; they explore how NLP contributes to the scientific study of minority languages and how those languages test the limits of NLP/AI.