Is AI the New Lingua Franca of Unwritten Languages?
According to current linguistic data, there are 7,139 known living languages in the world. Of these, 3,661 have some form of written documentation, while approximately 3,011 to 3,500 have no written form whatsoever. There are 293 known writing systems in human history, of which only 156 are currently in use. And of those 156 active scripts, many cover hundreds of languages each — the Latin alphabet alone serves approximately 4.9 billion people across Germanic, Romance, and Austronesian language families. The Arabic script serves around 660 million speakers across over two dozen languages. Devanagari covers approximately 608 million people across roughly 120 languages.
But roughly half of humanity’s languages were never written, and they remain invisible to the written world. In India, one of the world’s most linguistically diverse countries, over 1,000 languages have no written form. Papua New Guinea has over 800 languages, many of them spoken by communities of fewer than 1,000 people. Sub-Saharan Africa carries the highest concentration of unwritten languages on Earth. Mexico’s indigenous communities alone have over 100 unwritten languages and dialects. These are not marginal footnotes in the story of human language; they are its heartbeat.
UNESCO estimates that approximately 50% of the world’s languages are endangered. A language is classified as endangered when children are no longer learning it as their mother tongue, that is, when transmission between generations breaks down. By 2100, linguists estimate that between 50% and 90% of the world’s languages may be extinct. The primary drivers are economic pressure toward dominant global languages like English, Mandarin, Hindi, and Spanish; urbanization that separates communities from their linguistic homelands; colonial and post-colonial education systems that punished or discouraged indigenous language use; and media ecosystems that operate almost exclusively in a handful of major languages.
India alone has lost approximately 220 languages in the last 50 years, according to the People’s Linguistic Survey of India. A further 197 are classified as endangered. The crisis is not theoretical: it is unfolding in real time, in living communities, in the last elders who carry the final sounds of a language in their memory.
Lingua Franca — A Concept Reborn in the AI Era
Historically, the spread of a lingua franca was inextricably linked to the acquisition of a written script, which became a prerequisite for its survival across the expansive trade routes and administrative bureaucracies of empires. Consequently, oral traditions lacking a formal orthography were often relegated to obsolescence, unable to endure the “silence” of the unrecorded past.
The advent of Artificial Intelligence fundamentally subverts this historical trajectory. By leveraging the computational power of Large Language Models (LLMs) to decode complex phonetic patterns and structural nuances directly from speech, AI bypasses the traditional necessity for a standardized alphabet. In doing so, it emerges as a sophisticated digital conduit, facilitating the translation of “unwritten” tongues and granting them an unprecedented form of archival permanence.
What AI Can Do for Non-Written Languages
The most immediate contribution AI makes to unwritten languages is documentation. Traditional linguistic fieldwork required a trained linguist to spend years embedded in a community, manually transcribing speech, analyzing grammar, and producing dictionaries and grammars. This process was slow, expensive, and constrained by the availability of specialists who could work with specific language families.
AI-powered speech recognition systems can now process thousands of hours of oral recordings, identify phonemes, map grammatical structures, and generate draft transcriptions, work that would take human researchers decades. Even with minimal training data for unwritten languages, models like Meta’s MMS (Massively Multilingual Speech) and Google’s USM (Universal Speech Model) have demonstrated the ability to work with hundreds of previously unsupported languages.
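To make the idea of a "draft transcription" concrete, here is a minimal sketch of greedy CTC decoding, the kind of final step used by wav2vec 2.0 style recognizers such as the MMS speech models. It assumes the acoustic model has already produced one best label per audio frame. The frame labels, the blank symbol `_`, and the function name `ctc_greedy_decode` are illustrative assumptions, not output from any real model or part of any library.

```python
from itertools import groupby

BLANK = "_"  # assumed symbol for the CTC "no output" label


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse per-frame labels into a draft transcription.

    Step 1: merge runs of identical consecutive labels.
    Step 2: drop the blank label.
    """
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != blank)


# Hypothetical per-frame predictions for a short utterance.
frames = ["_", "k", "k", "_", "a", "a", "a", "_", "_", "t", "_", "a", "_"]
draft = ctc_greedy_decode(frames)
print(draft)
```

The blank label is what lets a genuinely repeated sound survive: the frames `a _ a` decode to two letters, while `a a` collapses to one. A draft produced this way would still need review by speakers of the language.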
In New Zealand, machine learning models have been used to transcribe audio recordings of Māori oral traditions, creating digital archives that can be searched, studied, and taught from. Similar projects are underway for Aboriginal Australian languages, Quechua in the Andes, and dozens of African oral languages.
Creating Written Forms for the First Time
Perhaps AI’s most transformative role is creating written representations for languages that never had them. By analyzing audio patterns and producing phonetic transcriptions, AI tools can effectively give a language its first written form, which becomes a foundation from which dictionaries, primers, and educational materials can be built.
This is not a simple technical feat. It requires careful collaboration with language communities to validate the phonetic representations chosen, to ensure that tonal distinctions are captured accurately, and to build writing systems that speakers themselves can learn and adopt. But AI provides the analytical scaffolding on which this community-driven work can be built far more rapidly than ever before.
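One way to picture the step from phonetic transcription to a community-approved spelling is a lookup table that speakers can review and change. The sketch below is an illustration under stated assumptions: the phonetic symbols, the spellings, and the names `ORTHOGRAPHY` and `to_orthography` are hypothetical and do not describe any real language or tool. It spells a phonetic string using greedy longest-match lookup and flags any sound that has no agreed spelling yet, so that reviewers can decide on it.

```python
# Hypothetical mapping from phonetic symbols to community-chosen spellings.
# Every entry is invented for illustration only.
ORTHOGRAPHY = {
    "tʃ": "ch",
    "ŋ": "ng",
    "ʃ": "sh",
    "a˥": "á",  # high tone marked with an acute accent
    "a˩": "à",  # low tone marked with a grave accent
    "k": "k",
    "t": "t",
    "a": "a",
}


def to_orthography(phonetic, table=ORTHOGRAPHY):
    """Spell a phonetic string using greedy longest-match lookup.

    Returns the spelled text and a list of symbols with no agreed spelling.
    """
    longest = max(len(key) for key in table)
    spelled, unknown = [], []
    i = 0
    while i < len(phonetic):
        # Try the longest candidate first so "tʃ" wins over "t".
        for size in range(min(longest, len(phonetic) - i), 0, -1):
            chunk = phonetic[i:i + size]
            if chunk in table:
                spelled.append(table[chunk])
                i += size
                break
        else:
            # No agreed spelling yet: keep the symbol and flag it for review.
            unknown.append(phonetic[i])
            spelled.append(phonetic[i])
            i += 1
    return "".join(spelled), unknown


text, needs_review = to_orthography("tʃa˥ŋa˩")
print(text, needs_review)
```

Keeping the table as plain data rather than logic is a deliberate choice in this sketch: the people who speak the language can change a spelling or add a tone mark without touching the code.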
India’s Response — A National Case Study
The Magnitude of India’s Linguistic Heritage
India is the world’s most linguistically complex nation. Its 1,652 recorded languages span four major language families — Indo-European, Dravidian, Austro-Asiatic, and Tibeto-Burman — and represent thousands of years of distinct cultural evolution. Tribal languages, spoken by India’s Scheduled Tribes who number over 104 million people, form the most vulnerable tier of this diversity. Most have no written script. Many are spoken by communities of fewer than 10,000 people. Several have only a handful of elderly speakers remaining.
SPPEL — The Government’s Core Initiative
The Government of India’s Scheme for Protection and Preservation of Endangered Languages (SPPEL), administered through the Central Institute of Indian Languages (CIIL) in Mysore, represents the country’s most systematic effort at documentation. SPPEL has identified 117 endangered languages for immediate documentation and aims to cover approximately 500 lesser-known languages in total. The program conducts fieldwork, records vocabulary and grammar, produces bilingual and trilingual dictionaries, and creates pictorial glossaries and ethno-linguistic profiles that are made available through digital repositories.
The work of SPPEL is painstaking and deeply human. Community members like Vasamalli of the Toda tribe in Tamil Nadu’s Nilgiri hills have worked with SPPEL linguists to develop language primers — in Toda’s case, using Tamil script as a phonetic vehicle for a language that had no script of its own. This is precisely the kind of collaborative, community-rooted documentation that forms the foundation for later AI-assisted preservation.
AI-Powered National Infrastructure — Bhashini and Adi-Vaani
India’s most ambitious AI language initiative is Bhashini — a national language translation mission developed under the Ministry of Electronics and Information Technology. Bhashini aims to provide universal language access across India’s linguistic spectrum using automatic speech recognition, text-to-speech synthesis, and neural machine translation. While its initial focus has been on India’s 22 scheduled languages, the platform is being extended toward tribal and unscheduled languages.
Building on this foundation, India launched Adi-Vaani in 2024 — the country’s first AI-driven platform specifically designed for tribal language preservation. Adi-Vaani integrates speech recognition and natural language processing to handle languages including Santali, Bhili, Mundari, and Gondi. It enables real-time translation, helps document oral traditions, and makes tribal languages usable within digital environments for the first time.
The Ministry of Tribal Affairs and the TRI-ECE Scheme
India’s Ministry of Tribal Affairs has committed significant financial resources to AI-based language preservation through the TRI-ECE scheme. This includes funding of Rs. 3.122 crore to a collaborative initiative between BITS Pilani, IIT campuses, and the Bhashini platform for developing AI translation tools that convert English and Hindi text and speech into tribal languages and back. Trilingual dictionaries, oral literature collections, folklore documentation, and school primers in tribal languages are being produced through Tribal Research Institutes across multiple states.
The National Education Policy 2020 has also created a critical structural shift: by mandating mother-tongue-based multilingual education in the early years of schooling, NEP 2020 creates both the need and the institutional will to develop educational materials in tribal languages. AI becomes the tool through which this policy commitment can be practically fulfilled at scale.
UNESCO and International Collaboration
India’s preservation efforts are also gaining international support. The UNESCO New Delhi office has collaborated with the Indira Gandhi National Centre for Arts on identifying and documenting India’s tribal and lesser-known languages. India’s participation in the United Nations’ International Decade of Indigenous Languages (2022–2032) has further mobilized resources and political will. Tamil Nadu’s tribal welfare department has committed an initial corpus fund of Rs. 2 crore specifically for preserving the languages of the Toda, Kota, Solaga, Kani, and Narikuravar communities.
The Bridge and Its Builders
The question posed at the beginning of this essay — ‘Is AI the New Lingua Franca of Unwritten Languages?’ — can now be answered with both affirmation and nuance. Yes: AI is becoming the most powerful technological bridge ever created for languages that writing never reached. It can hear what paper could not receive. It can learn what no alphabet was designed to encode. It can translate, teach, preserve, and transmit across the digital divide that has left nearly half of humanity’s languages invisible in the modern world.
But a lingua franca is only as good as the relationships it enables. The Latin lingua franca of medieval Europe also consolidated institutional power in Rome. The English lingua franca of the 20th century also marginalized indigenous languages on every continent. AI carries the same dual capacity: to serve as a democratic, decolonizing force for linguistic heritage, or to become a new instrument of extraction, where the cultural wealth of the world’s oral communities flows into the training datasets of distant corporations while the communities themselves remain as marginalized as before.
Key References and Data Sources
Ethnologue: Languages of the World, 2024 Edition
UNESCO Atlas of the World’s Languages in Danger, 2024
People’s Linguistic Survey of India, Ganesh Devy (Ed.), 2013
Ministry of Tribal Affairs, Government of India — Annual Report 2023–24
National Education Policy 2020, Ministry of Education, Government of India
Is AI the New Lingua Franca of Unwritten Languages? was originally published in Towards AI on Medium.