Tulu, my mother tongue, is a language with approximately 2 million speakers, primarily in the coastal regions of Karnataka, including Mangalore and Udupi. While the state language, Kannada, has about 45 million speakers, Tulu has historically been preserved largely through oral traditions, and its own script had no standardized digital encoding until recently.
In a landmark achievement, the Tulu-Tigalari script was officially added to Unicode on September 10, 2024. This effort, led by the Tulu Literary Academy and software engineer U.B. Pavanaja, followed years of work, including the launch of the Tulu Wikipedia in 2016. The inclusion of the script in Unicode marks a crucial step toward digitizing and preserving this language in the digital age.
However, oral preservation alone cannot keep pace with modern digital communication. Project Vaani, a Google-funded initiative, has recorded over 15,000 hours of conversations in 59 Indian languages, but Tulu has not yet been included. This gap underscores the need for further work to preserve and promote Tulu in digital formats.
To contribute to this cause, I have begun an audio transcription effort in collaboration with two native Tulu speakers, Aditya and Shashwik. We are recording basic conversational phrases along with their English translations to create a dataset for fine-tuning OpenAI’s Whisper model. Whisper, an automatic speech recognition (ASR) and translation model, offers a promising route to a Tulu-English translator and a stronger digital presence for the language.
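For anyone curious what such a dataset could look like on disk, here is a minimal sketch, not our actual pipeline. It pairs each recorded clip with its English translation and splits the pairs into training and validation sets. The file names, the `clips` folder, the `audio` and `translation` field names, and the example phrases are all placeholders I made up for illustration.

```python
import json
import random


def build_manifest(pairs, audio_dir="clips"):
    """Turn (file_name, english_translation) pairs into manifest records.

    The "audio" / "translation" field names are an assumption, not a
    format required by Whisper itself.
    """
    records = []
    for file_name, translation in pairs:
        text = " ".join(translation.split())  # collapse stray whitespace
        if not text:
            continue  # skip clips that have no translation yet
        records.append({"audio": f"{audio_dir}/{file_name}", "translation": text})
    return records


def split_manifest(records, val_fraction=0.1, seed=0):
    """Shuffle reproducibly and hold out a fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    if len(shuffled) > 1:
        n_val = max(1, int(len(shuffled) * val_fraction))
    else:
        n_val = 0  # too few clips to hold anything out
    return shuffled[n_val:], shuffled[:n_val]


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Placeholder entries; real rows would come from the recording sessions.
    pairs = [
        ("greeting_001.wav", "How are you?"),
        ("greeting_002.wav", "What is your name?"),
        ("food_001.wav", "Have you eaten?"),
    ]
    train, val = split_manifest(build_manifest(pairs), val_fraction=0.34)
    write_jsonl(train, "train.jsonl")
    write_jsonl(val, "validation.jsonl")
```

A JSON Lines manifest like this is easy to inspect by hand and can be read by most training tools. The fixed seed keeps the train/validation split stable as new recordings are added in later sessions.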
Additionally, I have submitted a request to Google Research to include Tulu in their transcription efforts for underrepresented languages. This initiative aligns with global efforts to digitize and preserve linguistic diversity, and I hope to see Tulu receive the recognition it deserves.
If you have suggestions or would like to contribute to this project, feel free to reach out!
Best,
Aryan