The Meros Project (meros: “heritage” in Uzbek) is a multi-modal digital humanities initiative employing advanced neural architectures and community engagement to systematically address the computational and cultural challenges of the Silent Canon—a vast body of Uzbek literary heritage that remains inaccessible to global scholarship and to modern AI systems alike. The Project constructs a living, semantically-aware digital repository serving both specialized academic communities and broader public audiences, aligned with global calls to use technology in service of endangered cultures. This paper presents the scientific, technical, and ethical foundations of the Meros Project across four core objectives: scalable neural translation, contextual knowledge augmentation through a structured knowledge graph, community-powered human-in-the-loop collaboration, and intelligent open-access dissemination. With over 35 million native speakers and a literary tradition of six centuries underrepresented in global AI infrastructure, Uzbek presents a paradigmatic case for language-inclusive artificial intelligence—and for reconceptualizing the human-machine collaboration paradigm in literary translation.
Keywords: Uzbek NLP, low-resource language models, Silent Canon, digital humanities, neural translation, knowledge graph, human-in-the-loop, literary corpus, cultural heritage, Navoi, intangible cultural heritage, open access
1. The Silent Canon: Uzbek Literary Heritage and the AI Divide
The twenty-first century has witnessed an unprecedented convergence of computational power, large-scale data availability, and theoretical breakthroughs in machine learning—collectively constituting the era of Large Language Models (LLMs). Yet this technological renaissance has been profoundly unequal in its geographic and linguistic distribution. A structural asymmetry now defines the global AI landscape: the languages of the economically powerful are perpetually amplified, while the languages that carry the oldest and richest cultural traditions are rendered computationally silent.
The Meros Project designates this asymmetry the Silent Canon: the body of literary and cultural heritage that exists in languages underrepresented in—or entirely absent from—the training infrastructure of modern AI systems. The Uzbek literary tradition constitutes one of the most significant and least computationally accessible instances of the Silent Canon. Spanning more than six centuries, from the classical verse of Alisher Navoi and the autobiographical prose of Babur through the modernist novel and contemporary poetry, this tradition has shaped the cultural identity of tens of millions of people while remaining largely invisible to the global scholarly and digital public.
The Meros Project is a direct response to this condition. Its name encodes its mission: meros (میراس / мерос), heritage. Its ambition is to broaden whose stories are told and how—to decolonize historical narratives and enhance cultural heritage preservation through the responsible, community-centered deployment of artificial intelligence.
1.1 Why Uzbek? A Language at a Crossroads
Uzbek is the official state language of the Republic of Uzbekistan and the second most widely spoken Turkic language globally, after Turkish. According to the Ethnologue (2024 edition), Uzbek has approximately 35–37 million native speakers and an additional 5–6 million second-language users across Uzbekistan, Tajikistan, Kyrgyzstan, Afghanistan, and diaspora communities worldwide. As a member of the Karluk branch of the Turkic language family, Uzbek exhibits unique phonological, morphological, and lexical properties shaped by centuries of contact with Persian, Arabic, and Russian.
From a digital and AI perspective, Uzbek occupies a paradoxical position. On one hand, Uzbekistan’s Ministry of Digital Technologies issued a formal mandate in 2025 to develop a national Uzbek language corpus for AI, explicitly acknowledging that most Uzbek-language materials—including literature, academic works, and oral tradition recordings—have not yet been digitized (Kun.uz, 2025). On the other hand, the majority of the classical literary canon, composed in Perso-Arabic script over six centuries, has never been computationally processed at scale. Bridging this gap is the foundational challenge the Meros Project exists to meet.
1.2 The Digital Inequity of the Silent Canon
Of the roughly 7,000 languages spoken on Earth today, fewer than 100 receive meaningful representation in the training data of flagship AI systems. English alone accounts for approximately 46–48% of all web-crawled text used in LLM pretraining, while Uzbek represents approximately 0.02%. This disparity is not merely a technical inconvenience; it is, as the Stanford Human-Centered AI Institute (2024) documents, a structural failure of the global AI ecosystem to serve the communities that most need its tools.
“Digital is a continent—vast, evolving, and ours to shape.” The Meros Project insists that this shaping must be done with justice and inclusivity, ensuring that the digital continent does not reproduce the exclusions of the analog world.
(Adapted from the Meros Project founding statement)
2. Mission and Core Objectives
The Meros Project’s core mission is to construct a living, semantically-aware digital repository of Uzbek literary heritage that serves both specialized academic communities and broader public audiences. It fundamentally reconceptualizes the human-machine collaboration paradigm in literary translation: rather than positioning AI as a replacement for human expertise, the Project establishes a symbiotic architecture where neural networks perform computationally intensive tasks while human experts provide higher-order cognitive functions. This collaborative framework aims to produce translations and annotations that achieve not merely lexical accuracy but cultural resonance, scholarly rigor, and literary authenticity.
The Project is organized around four interdependent core objectives, each addressed in detail in subsequent sections of this paper.
1. To architect a high-throughput pipeline utilizing domain-adapted transformer models capable of processing substantial volumes of Uzbek literature while maintaining semantic fidelity, stylistic coherence, and cultural authenticity through advanced fine-tuning methodologies.
2. To construct a comprehensive knowledge graph that enriches each translation with multi-layered scholarly annotations, diachronic historical contextualization, and intertextual relationship mapping across the full breadth of the Uzbek literary tradition.
3. To implement a Human-in-the-Loop framework that empowers a global community of scholars, students, and heritage speakers to contribute to the translation and annotation process, ensuring cultural knowledge remains a public good rather than a proprietary resource.
4. To deploy the resulting multilingual corpus within a sophisticated, open-access digital platform, democratizing access to this vital cultural heritage while maintaining scholarly rigor and attribution integrity.
2.1 Objective I: Scalable Neural Translation
Neural machine translation of Uzbek literary text—and in particular of Chagatai classical poetry—presents a challenge qualitatively different from document translation or information retrieval. Literary translation requires the preservation of metrical structure, semantic density, stylistic register, allusive layers, and cultural resonance: properties that standard sequence-to-sequence models trained on parallel news corpora are fundamentally ill-equipped to handle. The Meros Project’s neural translation pipeline addresses this through domain adaptation of transformer models on curated literary corpora, combined with retrieval-augmented generation (RAG) grounded in the Meros knowledge graph, enabling the model to condition its translations on intertextual and historical context rather than surface statistical patterns alone.
2.2 Objective II: Contextual Knowledge Augmentation
Translation without context is impoverishment. The Meros Project’s knowledge graph transforms each translated literary text from an isolated document into a node in a rich semantic network, linked to biographical data about the author, the historical events referenced in the text, the Persian and Arabic literary precedents being invoked or subverted, the Quranic allusions embedded in the verse, and the formal conventions of the genre being employed. This approach operationalizes the scholarly practice of tahqiq—critical edition—at machine scale, without displacing the role of human scholars in constructing and validating the knowledge structure.
2.3 Objective III: Community-Powered Collaboration
The Human-in-the-Loop (HITL) framework is not an afterthought in the Meros Project’s architecture: it is the mechanism by which the Project fulfills its commitment that cultural knowledge remains a public good. The HITL interface allows scholars of Turkic and Persian literature to correct and refine neural translations; heritage speakers to flag culturally inappropriate renderings; students to contribute annotations to less-studied texts; and the global diaspora community to participate in decisions about how their heritage is represented. This distributed contribution model is governed by transparent attribution protocols ensuring that every contributor’s intellectual labor is acknowledged and credited.
2.4 Objective IV: Intelligent and Open Dissemination
The Meros Project’s dissemination platform is designed to serve two distinct but overlapping audiences: the scholarly community, which requires advanced search, corpus analysis tools, and access to the full annotation layer; and the general public, including diaspora communities and students, who require accessible, translated, and contextually enriched versions of the texts. All materials produced by the Project are released under open-access licenses, operationalizing the principle that Uzbek cultural heritage is a public good that cannot be enclosed by commercial interests.
3. The Linguistic Architecture of Uzbek: Computational Challenges
Before designing AI systems for Uzbek, it is essential to understand the language’s structural properties—particularly those that render standard NLP pipelines, designed for analytic languages such as English, fundamentally inadequate. These properties are not obstacles to be engineered around but constitutive features of the language whose correct computational modeling is a prerequisite for culturally faithful translation.
3.1 Agglutinative Morphology and Tokenization
Uzbek is a highly agglutinative language: grammatical relationships are expressed by attaching sequences of morphemes to a root, yielding a single orthographic word that may encode what English requires an entire clause to express. A single verb stem can produce hundreds of inflected forms, each simultaneously encoding tense, aspect, mood, voice, person, number, negation, and evidentiality. Standard subword tokenization strategies such as Byte-Pair Encoding (BPE) or WordPiece, optimized for analytic languages, segment Uzbek words at linguistically arbitrary boundaries, splitting morphological suffixes into uninformative fragments that inflate sequence length and obscure grammatical structure.
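The clause-level density of a single Uzbek word can be made concrete with a toy suffix-stripping sketch. The tiny suffix inventory below is illustrative only; real morphological analysis requires a full finite-state lexicon, as in the UzUDT framework discussed later.

```python
# Toy illustration of Uzbek agglutination: greedily peel known suffixes
# off the right edge of a word. This is NOT a real morphological
# analyzer; it only shows how one orthographic word carries what English
# expresses with a whole phrase.

SUFFIXES = [
    "da",    # locative case: "in/at"
    "imiz",  # 1st person plural possessive: "our"
    "lar",   # plural
]

def segment(word: str) -> list[str]:
    """Strip known suffixes right-to-left, longest match first."""
    morphemes = []
    changed = True
    while changed:
        changed = False
        for suf in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suf) and len(word) > len(suf) + 2:
                morphemes.insert(0, suf)
                word = word[: -len(suf)]
                changed = True
                break
    return [word] + morphemes

# "kitoblarimizda" = kitob (book) + lar (PL) + imiz (our) + da (LOC),
# i.e. the single word means "in our books".
print(segment("kitoblarimizda"))  # ['kitob', 'lar', 'imiz', 'da']
```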
As Mansurov and Mansurov (2021) demonstrated in developing UzBERT, a morphologically-aware tokenizer is necessary to achieve competitive downstream performance. The BERTbek team (ACL Anthology, 2024) corroborated this finding, showing that a morphologically-aware tokenizer directly contributed to BERTbek-News-Big achieving an F1-score of 78.69% on Named Entity Recognition—a substantial improvement over multilingual BERT applied naively to the same task. The Meros-LLM adopts and extends this approach, training a 40,000-token morphology-aware Uzbek vocabulary integrated with the base model’s tokenizer.
3.2 The Multi-Script Problem
The most distinctive computational challenge for Uzbek NLP is the language’s turbulent orthographic history. Over the past century, Uzbek has been written in four distinct scripts—Perso-Arabic (pre-1928), the New Uzbek Latin script (1928–1940), Cyrillic (1940–1992), and the modern Latin script (1992–present, fully standardized 2025)—each encoding different layers of the literary tradition. The classical canon resides in Perso-Arabic; the Soviet literary heritage in Cyrillic; contemporary writing in Latin. A comprehensive Uzbek corpus for AI training must aggregate across all four systems, requiring robust machine transliteration pipelines whose non-triviality Mansurov and Mansurov (2021b) document in detail.
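One leg of such a transliteration pipeline, Cyrillic to modern Latin, can be sketched with an intentionally partial character table. A production system must additionally handle capitalization, the full alphabet, Russian loanword spellings, and Perso-Arabic input, which is the far harder case.

```python
# Illustrative (partial) Cyrillic-to-Latin transliteration for modern
# Uzbek. The table covers only a handful of letters; a straight
# apostrophe stands in for the okina (ʻ) of the standardized Latin
# orthography.

CYR2LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t",
    "у": "u", "ф": "f", "х": "x", "ч": "ch", "ш": "sh",
    "ў": "o'", "қ": "q", "ғ": "g'", "ҳ": "h",
}

def translit(text: str) -> str:
    # Fall back to the original character when no mapping is known.
    return "".join(CYR2LAT.get(ch, ch) for ch in text.lower())

print(translit("китоб"))   # kitob
print(translit("ўзбек"))   # o'zbek
```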
For the Meros Project, the multi-script challenge is not merely technical but hermeneutic: each script encodes a different historical moment in Uzbek cultural consciousness. The Project’s corpus design preserves the original script alongside transliterated versions, treating orthographic provenance as a dimension of scholarly metadata rather than an obstacle to be erased.
3.3 Comparative Linguistic Statistics
| Feature | Uzbek | English | Turkish | Arabic |
|---|---|---|---|---|
| Language family | Turkic (Karluk) | Germanic (West) | Turkic (Oghuz) | Semitic (Central) |
| Morphological type | Agglutinative | Analytic | Agglutinative | Templatic/Semitic |
| Native speakers (millions) | 35–37 | 380+ | 80+ | 310+ |
| Currently active scripts | Latin, Cyrillic | Latin | Latin | Arabic |
| Script reforms (20th c.) | 3 | 0 | 1 | 0 |
| Vowel harmony | Partial | None | Full | None |
| Word order | SOV | SVO | SOV | VSO/SVO |
| Wikipedia articles (2024) | ~210,000 | ~7,000,000 | ~500,000 | ~1,200,000 |
4. The Low-Resource Language Problem in Modern AI
The performance of AI language systems is fundamentally determined by the volume, quality, and diversity of training data. This reality encodes a structural bias into the global AI ecosystem: languages with large digital footprints are perpetually advantaged, while languages with smaller digital presences—regardless of the cultural wealth they encode—are systematically underserved. Addressing this bias is inseparable from the Meros Project’s mission.
4.1 Training Data Asymmetries
| Language | CommonCrawl share | Wikipedia articles | mBERT NER F1 (approx.) | Resource tier |
|---|---|---|---|---|
| English | ~46% | ~7,000,000 | >90% | Very high |
| German | ~8% | ~2,900,000 | >87% | High |
| Turkish | ~1.2% | ~500,000 | ~82% | Medium |
| Uzbek | ~0.02% | ~210,000 | ~55–65% | Low |
| Kyrgyz | ~0.005% | ~90,000 | ~50–60% | Very low |
| Uyghur | ~0.003% | ~12,000 | <50% | Extremely low |
The relationship between language representation in pretraining data and downstream performance is approximately log-linear: doubling the amount of training data for a language corresponds to a fixed increment in benchmark performance. Languages starting from near-zero representation require orders-of-magnitude more data simply to reach the performance baseline that English achieved decades ago. The cascading consequences for Uzbek are severe and well-documented: machine translation quality for Uzbek–English pairs lags that of comparable Turkic language pairs by 8–15 BLEU points; Automatic Speech Recognition systems exhibit word error rates of 25–40% compared to sub-5% for English; and Uzbek remains entirely absent from major multilingual evaluation suites including XTREME and XGLUE.
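The log-linear claim can be written as score ≈ a + b·log2(D): every doubling of the data D buys the same b points. The constants in the sketch below are invented purely for illustration, not fitted to any real benchmark.

```python
# Illustrative log-linear data-scaling curve: benchmark score grows by a
# fixed increment per doubling of pretraining data. Constants a and b
# are made up for illustration.
import math

def expected_score(tokens: float, a: float = 20.0, b: float = 5.0) -> float:
    """score = a + b * log2(tokens)."""
    return a + b * math.log2(tokens)

# Doubling the corpus always buys the same +b points...
gain = expected_score(200) - expected_score(100)
print(round(gain, 2))  # 5.0
# ...so a language ten doublings behind needs ~1000x more data to close
# a 10*b point gap.
```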
4.2 Methodological Responses
The Meros Project deploys four established methodological strategies to mitigate data scarcity, each integrated into the overall pipeline architecture.
Cross-Lingual Transfer Learning
Turkic languages share deep morphological and syntactic structures. Leveraging models pretrained on Turkish, Kazakh, or Kyrgyz as initialization points for Uzbek fine-tuning can significantly reduce data requirements. The cross-lingual transfer paradigm, established by mBERT (Devlin et al., 2019) and XLM-RoBERTa (Conneau et al., 2020), demonstrates that syntactic structure generalizes across related languages even without explicit bilingual supervision.
Parameter-Efficient Fine-Tuning (PEFT)
Techniques including Low-Rank Adaptation (LoRA), prefix tuning, and QLoRA allow adaptation of large pretrained models to new domains and languages by updating only a small fraction of parameters. For the Meros Project’s literary domain, LoRA fine-tuning of a multilingual base model on the Meros Corpus represents a computationally efficient pathway that preserves the base model’s broad language competencies while acquiring literary register.
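The LoRA update itself is simple enough to sketch from scratch in NumPy: the frozen weight W is augmented by a trainable low-rank product scaled by alpha/r. Dimensions are toy-sized; an actual run would use a maintained implementation such as the Hugging Face peft library.

```python
# Minimal sketch of the LoRA idea: keep a frozen weight matrix W (d x k)
# and train only a low-rank pair B (d x r) and A (r x k), applying
# W' = W + (alpha / r) * B @ A. With r << min(d, k), trainable
# parameters drop from d*k to r*(d+k).
import numpy as np

d, k, r, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))            # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01     # trainable down-projection
B = np.zeros((d, r))                   # trainable up-projection (zero init,
                                       # so the update starts at exactly 0)

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Equivalent to x @ (W + (alpha / r) * B @ A).T
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(1, k))
full = d * k
lora = r * (d + k)
print(f"trainable params: {lora} vs {full} ({100 * lora / full:.1f}%)")
# trainable params: 8192 vs 262144 (3.1%)
```

Because B is zero-initialized, the adapted model is exactly the base model at step 0, which is what makes LoRA safe to bolt onto a pretrained checkpoint.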
Retrieval-Augmented Generation (RAG)
Rather than encoding all cultural and literary knowledge in model parameters, the Meros Project’s RAG architecture couples the neural translation model with the Meros knowledge graph as an external retrieval system. This enables the model to condition its translations on verified intertextual, biographical, and historical context—exactly the higher-order cognitive functions that the human-machine symbiosis framework assigns to the structured knowledge layer rather than the parametric model.
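The retrieval step can be sketched with a self-contained toy: rank knowledge-graph snippets against the source passage and prepend the best matches to the translation prompt. The snippets and the bag-of-words "embeddings" below are stand-ins for a trained encoder and a vector index.

```python
# Toy sketch of the RAG retrieval step. Real deployments would use a
# trained encoder and a vector store; the cosine similarity over word
# counts here is only a self-contained stand-in. Snippet texts are
# invented for illustration.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

kg_snippets = [
    "Navoi ghazal convention: the nightingale figures the yearning lover",
    "Baburnama passage on the gardens of Samarkand",
    "Aruz meter ramal commonly used in Chagatai ghazal poetry",
]

query = "ghazal by Navoi about the nightingale and the rose"
q = embed(query)
ranked = sorted(kg_snippets, key=lambda s: cosine(q, embed(s)), reverse=True)

# Prepend the top-2 snippets as grounding context for the translator.
context = "\n".join(ranked[:2])
prompt = f"Context:\n{context}\n\nTranslate the ghazal with this context in mind."
print(ranked[0])
```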
Synthetic Data Augmentation
Back-translation, constrained paraphrasing, and template-based generation can expand limited corpora, with the important caveat that augmentation of literary text must preserve semantic and stylistic fidelity. For the Meros Project, synthetic augmentation is applied selectively to under-represented genres and periods within the corpus, subject to expert review within the HITL framework.
5. Existing Language Models for Uzbek: State of the Art
Understanding the existing landscape of Uzbek pretrained language models is essential for positioning the Meros Project’s technical contributions. A nascent but growing research community has produced several significant artifacts, each with documented strengths and limitations that directly inform the Meros-LLM’s design.
5.1 UzBERT (2021)
The first monolingual pretrained language model for Uzbek, UzBERT was developed by Mansurov and Mansurov (2021) at Copper City Labs. Trained on a high-quality news corpus of approximately 142 million words, UzBERT adopts the BERT architecture (Devlin et al., 2019) and was released under the MIT open-source license, making it the foundational artifact of the Uzbek NLP ecosystem. Its primary limitation—critical for the Meros Project’s literary aims—is its training corpus: by focusing exclusively on news text, UzBERT captures contemporary journalistic Uzbek but lacks exposure to the elevated literary register, archaic vocabulary, and Perso-Arabic loanword density that characterize classical Uzbek literary language.
5.2 BERTbek (2024)
Published at the ACL SIGUL workshop in 2024, BERTbek represents the current state of the art in Uzbek pretrained language models. BERTbek distinguishes itself through: a morphologically-aware tokenizer specifically engineered for Uzbek’s agglutinative structure; training on a larger and more diverse corpus encompassing both Cyrillic and Latin script sources; and comprehensive evaluation across Named Entity Recognition, sentiment analysis, and multi-label topic classification. BERTbek-News-Big achieves an NER F1-score of 78.69%, outperforming mBERT and all prior baselines. The Meros Project proposes a “BERTbek-Meros” variant trained on literary rather than news corpora as the encoder component of its RAG retrieval stack.
5.3 mGPT-1.3B-Uzbek and UzRoberta (2023–2024)
The AI Forever research group released mGPT-1.3B-Uzbek, a 1.3 billion parameter generative language model—the largest publicly available generative model for Uzbek as of early 2026. Its decoder-only architecture makes it suitable for text generation and zero-shot classification. UzRoberta (2024) applies the RoBERTa pretraining improvements to Uzbek. Both models share the corpus limitation of their predecessors: training data skewed toward contemporary rather than classical Uzbek, leaving the Silent Canon computationally unaddressed.
5.4 Model Comparison and the Meros-LLM
| Model | Year | Architecture | Parameters | Corpus | NER F1 | Open source |
|---|---|---|---|---|---|---|
| mBERT | 2019 | BERT Encoder | ~110M | 104 languages | ~55–65% | Yes |
| UzBERT | 2021 | BERT Encoder | ~110M | ~142M words | ~70% | Yes (MIT) |
| BERTbek | 2024 | BERT Encoder | ~110M | Larger, diverse | 78.69% | Partial |
| mGPT-1.3B-Uzbek | 2023 | GPT Decoder | 1.3B | Multilingual | N/A (gen.) | Yes (HF) |
| UzRoberta | 2024 | RoBERTa | ~125M | N/A | Preliminary | Partial |
| Meros-LLM (proposed) | 2027+ | Decoder + RAG | 7–13B | ~548M tokens (literary) | TBD | Target: Yes |
6. The Meros Corpus: Building the Silent Canon’s Digital Foundation
The Meros Corpus is the primary material contribution of the Project’s first phase. It is not merely a collection of text files but a principled, multi-layered artifact whose design reflects the scholarly and cultural values of the communities it serves. Its construction operationalizes Objective I (Scalable Neural Translation) by providing the training data, and Objective II (Contextual Knowledge Augmentation) by providing the primary source material for the knowledge graph.
6.1 The Classical Uzbek Literary Canon
Alisher Navoi (1441–1501)
Widely regarded as the greatest Chagatai poet, Navoi produced an extraordinary volume of verse: the Khazain ul-Maoni divans contain approximately 50,000 lines of poetry across four volumes, complemented by the five-epic Khamsa and the first comparative grammar of Turkic and Persian (Muhokamat ul-Lughatain). Recent computational work has begun constructing a Navoi corpus using digital technologies, with the first version created and semantic tagging underway (ResearchGate, 2024). For the Meros Project, the Navoi corpus constitutes the core of the Silent Canon: the richest and most studied body of Uzbek literary text, yet almost entirely absent from AI training infrastructure.
Zahir ud-Din Muhammad Babur (1483–1530)
The Baburnama—founder-of-empire memoir, nature diary, and military chronicle in one—is among the most historically significant prose works in any Turkic language. Its Chagatai idiom bridges classical and transitional Uzbek and makes it an indispensable component of the corpus.
The 19th–20th Century Transition and the Modern Novel
The transition from Chagatai to modern literary Uzbek, marked by poets including Uvaysiy, Nodira, Muqimi, and Furqat, and subsequently by the first Uzbek novel (O‘tgan Kunlar, Abdulla Qodiriy, 1926) and the Soviet-era prose of Oybek and Abdulla Qahhor, completes the arc of the Meros Corpus from the classical to the contemporary.
6.2 Annotation Schema
The Meros Corpus is designed according to a principled multi-layer annotation schema enabling both neural model training and knowledge graph construction. The schema comprises five layers:

- a metadata layer recording author, title, date, genre, script, and transliteration provenance;
- a morphological layer providing part-of-speech tags, morpheme segmentation, and lemmatization using the UzUDT Universal Dependencies framework;
- a semantic layer annotating named entities and literary motifs (love, mysticism, nature, sovereignty);
- a metrical layer, for poetry, marking aruz prosodic patterns;
- an intertextual layer cross-referencing Quranic allusions, Persian literary precedents, and historical events.

This annotation structure directly feeds the knowledge graph constructed in Objective II.
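One record under such a schema might look as follows. The field names and values are hypothetical illustrations, not the Project's published format.

```python
# Hypothetical shape of one Meros Corpus record with the five annotation
# layers. All field names and values are illustrative.
import json

record = {
    "metadata": {
        "author": "Alisher Navoi",
        "title": "Ghazal 1 (Khazain ul-Maoni)",
        "date": "15th c.",
        "genre": "ghazal",
        "script": "perso-arabic",
        "transliteration_provenance": "manual, modern Latin standard",
    },
    "morphology": [
        # one entry per token: surface form, lemma, POS, morpheme split
        {"form": "kitoblarimizda", "lemma": "kitob", "pos": "NOUN",
         "morphemes": ["kitob", "lar", "imiz", "da"]},
    ],
    "semantics": {"entities": [], "motifs": ["love", "mysticism"]},
    "meter": {"system": "aruz", "pattern": "(to be annotated)"},
    "intertext": [
        {"type": "persian-precedent", "target": "Persian ghazal tradition"},
    ],
}

# Morphology and meter feed model training; metadata, semantics, and
# intertext feed the knowledge graph.
print(json.dumps(sorted(record.keys())))
```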
6.3 Digitization and OCR Challenges
A significant portion of the Uzbek literary canon exists only in physical form: manuscript collections in the Alisher Navoi National Library, the Institute of Oriental Studies in Tashkent, and archives in St. Petersburg, London, and Tehran. OCR for Arabic-script Chagatai texts presents severe challenges including calligraphic variation across scribal traditions; diacritical marks frequently omitted in manuscripts; ligature ambiguity; and right-to-left directionality combined with Persian and Arabic loan passages. UNESCO and the Aga Khan Trust for Culture have demonstrated that deep learning-based OCR trained on Perso-Arabic manuscripts can achieve character error rates below 5% in controlled conditions, providing a feasible pathway. The Meros Project treats historical script variants not as obstacles but as primary evidence about the textual transmission of Uzbek literature.
6.4 Proposed Corpus Composition
| Corpus component | Period | Script | Est. tokens | Current status |
|---|---|---|---|---|
| Navoi collected works | 15th c. | Arabic / Latin | ~15M | Partial digital |
| Baburnama | 16th c. | Arabic / Latin | ~3M | Digitized (tr.) |
| Classical Chagatai anthology | 14–17th c. | Arabic | ~25M | Manuscript |
| 19th–20th c. Uzbek literature | 1800–1950 | Cyrillic / Latin | ~50M | Partial digital |
| Soviet-era Uzbek prose | 1920–1991 | Cyrillic | ~200M | Partial digital |
| Contemporary Uzbek literature | 1991–pres. | Latin | ~100M | Largely digital |
| Academic / scientific Uzbek text | 2000–pres. | Latin | ~150M | Digital |
| Oral tradition transcriptions | Various | Latin | ~5M | In progress |
| Total proposed corpus | — | Multi-script | ~548M tokens | — |
7. Neural Translation Pipeline: Technical Architecture
This section specifies the technical architecture of the Meros Project’s neural translation pipeline, implementing Objective I (Scalable Neural Translation). The architecture is designed to achieve high-throughput processing of Uzbek literary text while maintaining the semantic fidelity, stylistic coherence, and cultural authenticity that distinguish literary translation from information-retrieval translation.
7.1 Base Model Selection
The Meros Project adopts a continued pretraining strategy: selecting an existing open-source multilingual model and extending its pretraining on the Meros Corpus. This approach avoids training from scratch while achieving domain and language specialization. Evaluated candidate base models include:

- Mistral 7B / Mistral Nemo (12B), offering strong multilingual performance and efficient grouped-query attention;
- Meta LLaMA 3.1 (8B), providing a 128K-token context window valuable for long literary texts;
- BLOOMZ (1.7B–7.1B), designed for multilingual zero-shot generalization across 46 languages;
- Qwen2.5 (7B–72B), demonstrating strong performance on Turkic language families.
7.2 Tokenizer Adaptation
Standard tokenizers for models such as LLaMA 3 or Mistral were trained primarily on English and Western European languages, causing Uzbek text to be heavily over-segmented—each Uzbek word tokenized into many more subword units than the equivalent English word. The Meros Project trains a new vocabulary of 30,000–50,000 Uzbek-specific subword tokens using BPE on the Meros Corpus, merges this vocabulary with the base model’s tokenizer, and initializes new embeddings by averaging embeddings of semantically related tokens from the original vocabulary. This approach, documented by Cui et al. (2024) for Chinese LLaMA adaptation, reduces Uzbek token fertility (tokens per word) by approximately 40–60%.
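The embedding-initialization step can be sketched in NumPy at toy scale: each newly merged token starts from the mean of the embeddings the base tokenizer previously used to spell it. Vocabulary and dimensions here are tiny stand-ins for the 30,000–50,000-token vocabulary described above.

```python
# Sketch of embedding initialization after merging new Uzbek subwords
# into a base tokenizer: a new token's embedding is the average of the
# embeddings of the base-vocabulary pieces it replaces. Toy sizes.
import numpy as np

rng = np.random.default_rng(0)
base_vocab = {"ki": 0, "tob": 1, "lar": 2}          # existing subwords
base_emb = rng.normal(size=(len(base_vocab), 4))    # base embedding table

# New merged token and how the base tokenizer used to segment it.
new_tokens = {"kitob": ["ki", "tob"]}

rows = [base_emb]
for tok, pieces in new_tokens.items():
    piece_ids = [base_vocab[p] for p in pieces]
    # Mean of the constituent-piece embeddings, kept as a 1 x dim row.
    rows.append(base_emb[piece_ids].mean(axis=0, keepdims=True))
emb = np.vstack(rows)                                # extended table

# "kitob" now costs 1 token where it previously cost 2: fertility drops.
print(emb.shape)  # (4, 4)
```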
7.3 Three-Stage Training Pipeline
Stage 1: General Uzbek Pretraining
The base model is continued-pretrained on approximately 300–400 million tokens of general-purpose modern Uzbek text (news, academic articles, web text), bridging the gap between the model’s prior multilingual knowledge and the target Uzbek language distribution.
Stage 2: Literary Domain Pretraining
The model is subsequently pretrained on the literary and classical components of the Meros Corpus (~100–150M tokens), with multiple training epochs and a reduced learning rate to prevent catastrophic forgetting. Classical Chagatai material is subject to a separate OCR quality filtering pass within the HITL framework before inclusion.
Stage 3: Instruction Fine-Tuning and RLHF
The pretrained model is instruction fine-tuned on a dataset of Uzbek literary tasks: poem interpretation, biographical question answering, literary translation, poetic composition in classical meters, and textual analysis. Human feedback from Uzbek literature scholars is incorporated via Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO), aligning outputs with expert literary judgment.
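For the preference-optimization option, the DPO objective for a single preference pair can be computed directly: it penalizes the policy when its log-probability margin between the scholar-preferred and rejected translations does not exceed the frozen reference model's margin. The log-probabilities below are invented for illustration.

```python
# Numerical sketch of the DPO loss for one preference pair:
#   loss = -log sigmoid(beta * (policy margin - reference margin))
# where each margin is log p(chosen) - log p(rejected). Values invented.
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the expert-approved translation more strongly than the
# reference does -> positive margin -> loss below log(2) ~= 0.693.
loss = dpo_loss(pi_chosen=-12.0, pi_rejected=-20.0,
                ref_chosen=-14.0, ref_rejected=-18.0)
print(round(loss, 4))  # 0.513
```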
7.4 Evaluation Framework
The Meros Project will develop a suite of literary benchmarks for Uzbek—the first of their kind for the classical tradition—including: UzNavoi-QA (question answering over Navoi’s divans); UzPoetry-Genre (classification into ghazal, qasida, rubaiy, masnavi); UzMeter (identification of aruz prosodic patterns); UzScript-Trans (Arabic-to-Latin transliteration quality for Chagatai texts); and UzLit-Summ (summarization against expert-written summaries). Research published at IEEE (2024) has already demonstrated the feasibility of genre classification as a machine learning task in the Uzbek literary domain, providing a methodological precedent.
8. Contextual Knowledge Augmentation: The Meros Knowledge Graph
The Meros Knowledge Graph implements Objective II, transforming each translated literary text from an isolated document into a node in a rich semantic network. The knowledge graph enriches each text with multi-layered scholarly annotations, diachronic historical contextualization, and intertextual relationship mapping. It operationalizes the scholarly practice of tahqiq—critical edition—at machine scale, without displacing human scholars from the construction and validation of that knowledge structure.
8.1 Graph Architecture
The knowledge graph is organized as a directed heterogeneous network with five primary node types:

- Author nodes encoding biographical, historical, and literary-historical data about Uzbek authors;
- Work nodes representing individual poems, prose works, and their manuscript variants;
- Concept nodes encoding key literary motifs, Sufi philosophical terms, historical events, and cultural references;
- Person and Place nodes for named entities extracted from the corpus and linked to historical authority records;
- Intertextual Source nodes representing Persian, Arabic, and Quranic texts that Uzbek authors drew upon, adapted, or subverted.

Edges between nodes encode typed relationships: alludes-to, responds-to, was-influenced-by, portrays, set-in, and so forth.
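At its simplest, the node and edge typing can be sketched with plain dictionaries; a deployment would use a graph database or triple store, and all identifiers here are illustrative.

```python
# Minimal in-memory sketch of the heterogeneous graph: typed nodes plus
# (source, relation, target) edges. All names are illustrative.
nodes = {
    "navoi":       {"type": "Author", "dates": "1441-1501"},
    "ghazal_1":    {"type": "Work", "title": "Ghazal 1"},
    "nightingale": {"type": "Concept", "gloss": "yearning-lover motif"},
    "herat":       {"type": "Place"},
    "hafez_divan": {"type": "IntertextualSource", "language": "Persian"},
}

edges = [  # (source, relation, target)
    ("ghazal_1", "authored-by", "navoi"),
    ("ghazal_1", "alludes-to", "hafez_divan"),
    ("ghazal_1", "portrays", "nightingale"),
    ("ghazal_1", "set-in", "herat"),
]

def neighbors(node: str, relation: str) -> list:
    """All targets reachable from `node` via edges typed `relation`."""
    return [t for s, r, t in edges if s == node and r == relation]

print(neighbors("ghazal_1", "alludes-to"))  # ['hafez_divan']
```

A retrieval query like this one is exactly what the RAG layer of the translation pipeline issues before generating a draft.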
8.2 Diachronic Contextualization
A distinctive feature of the Meros Knowledge Graph is its diachronic dimension: the graph encodes not only synchronic relationships within the 15th-century corpus but the transmission and transformation of Uzbek literary conventions across time. A reader or AI system querying the graph about a specific ghazal by Navoi can trace both the Persian precedents Navoi was responding to and the Uzbek poets of subsequent centuries who responded to Navoi in turn. This longitudinal view of the literary tradition is precisely the “multi-layered scholarly annotation” that makes translation contextually meaningful rather than merely lexically accurate.
9. Community-Powered Collaboration: The Human-in-the-Loop Framework
The Human-in-the-Loop (HITL) framework is the mechanism by which the Meros Project fulfills its defining commitment: that cultural knowledge remains a public good rather than a proprietary resource. The Project fundamentally reconceptualizes the human-machine collaboration paradigm in literary translation. Rather than positioning AI as a replacement for human expertise, the Meros Project establishes a symbiotic architecture in which neural networks perform computationally intensive tasks—candidate translation generation, entity recognition, intertextual retrieval, morphological analysis—while human experts provide the higher-order cognitive functions of cultural judgment, interpretive authority, and scholarly validation.
This collaborative framework produces translations and annotations that achieve not merely lexical accuracy but cultural resonance, scholarly rigor, and literary authenticity.
9.1 Contributor Roles and Permissions
The HITL interface is structured around four contributor roles, each with distinct permissions and responsibilities:

- Expert scholars—credentialed Turkologists, Uzbek literature specialists, and historians of Central Asia—have authority to accept, reject, or substantially revise neural translation drafts and to validate knowledge graph entries.
- Advanced reviewers—advanced graduate students and qualified heritage speakers—may propose annotations and flag culturally problematic renderings for expert review.
- Community contributors—the global Uzbek diaspora and general public—may submit suggested corrections, report errors, and participate in prioritization of which texts receive translation resources.
- Automated quality assurance systems flag low-confidence neural outputs for mandatory human review before publication.
9.2 Attribution and Intellectual Labor
Every contribution to the Meros Project’s corpus and knowledge graph is recorded in a transparent, versioned attribution system. The principle is explicit: the intellectual labor of every contributor is acknowledged, credited, and citable. This operationalizes the Project’s commitment to cultural knowledge as a public good by ensuring that the community contributions which make the Project possible are not silently absorbed into a corporate or institutional product.
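One way such a transparent, versioned attribution log might look is an append-only revision history per entry, from which citable credits can be derived. This is a sketch under assumed field names and identifiers, not the Project's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Revision:
    contributor: str  # citable contributor identifier (hypothetical format)
    role: str         # e.g. "expert_scholar"
    summary: str      # what this revision changed
    timestamp: str    # ISO 8601, UTC

@dataclass
class AttributedEntry:
    entry_id: str
    history: list = field(default_factory=list)  # append-only: never overwritten

    def record(self, contributor: str, role: str, summary: str) -> int:
        """Append a revision and return the new version number."""
        self.history.append(Revision(
            contributor, role, summary,
            datetime.now(timezone.utc).isoformat()))
        return len(self.history)

    def credits(self) -> list:
        """All contributors to this entry, in order of first contribution."""
        seen = []
        for rev in self.history:
            if rev.contributor not in seen:
                seen.append(rev.contributor)
        return seen

# Hypothetical usage: one translation entry, three recorded revisions.
entry = AttributedEntry("navoi:ghazal_42:translation")
entry.record("a.karimova", "expert_scholar", "initial validated translation")
entry.record("b.tashkentov", "community_contributor", "orthography correction")
version = entry.record("a.karimova", "expert_scholar", "revised couplet 3")
```

Because the log is append-only, no contribution can be silently absorbed: every published version carries its full chain of credited revisions.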
10. Cultural Heritage, Ethics, and Global Precedents
The Meros Project operates at the intersection of AI and intangible cultural heritage in a domain that has attracted increasing scholarly and policy attention. On the first International Day of Intangible Cultural Heritage (17 October 2024), a global UNESCO panel explored the impact of AI on cultural heritage, calling specifically for AI systems to be fed more diverse data sources to reduce bias, and for ethical frameworks foregrounding community rights and consent (UNESCO ICH, 2024). The Meros Project’s design is a direct response to these calls.
10.1 Ethical Principles
The Project adheres to five ethical commitments derived from UNESCO’s Recommendation on the Ethics of Artificial Intelligence (2021) and the emerging literature on AI for endangered language communities:
- Community sovereignty: Uzbek literary heritage belongs to Uzbek communities; the Meros Corpus is compiled and governed with formal involvement of Uzbek academic institutions and civil society organizations.
- Transparency: training data, architecture, and evaluation results are fully documented and openly published.
- Non-displacement: AI augments rather than replaces human scholars, poets, and tradition-keepers.
- Accessibility: all Meros outputs are freely accessible to Uzbek-speaking communities globally.
- Historical sensitivity: classical texts—particularly the Sufi poetry central to the Uzbek tradition—require culturally informed curation that recognizes their sacred dimensions.
10.2 Global Precedents
Several analogous projects provide methodological guidance. Te Hiku Media’s Papa Reo project in New Zealand developed ASR models for te reo Māori achieving 92% transcription accuracy under a data sovereignty framework in which the Māori community owns and controls all training data—a model directly applicable to the Meros Project. InkubaLM (Lelapa AI, Africa) demonstrates that linguistic AI need not require billion-dollar compute budgets: Africa’s first multilingual small language model achieves 75% size reduction without performance loss. India’s government-backed Bhashini platform, supporting 350+ AI-powered language models, provides a model for the institutional infrastructure the Meros Project seeks to replicate at the Central Asian regional level.
11. Intelligent and Open Dissemination
Objective IV specifies the deployment of the Meros Corpus and Meros-LLM within a sophisticated, open-access digital platform. The dissemination platform is designed to serve two distinct but overlapping audiences: the scholarly community, requiring advanced corpus analysis tools, full annotation layer access, and versioned citation infrastructure; and the general public, including diaspora communities and students, requiring accessible, translated, and contextually enriched text presentation. All materials are released under open-access licenses operationalizing the principle that Uzbek cultural heritage is a public good.
11.1 Scholarly Interface
The scholarly interface provides full-text search across the Meros Corpus with morphological normalization; access to the complete knowledge graph and annotation layers; concordance and frequency analysis tools; manuscript image viewing alongside transcription and transliteration; and exportable citation packages meeting standard bibliographic requirements for academic publication.
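Full-text search "with morphological normalization" can be illustrated with a toy suffix-stripping indexer. Real Uzbek morphology is agglutinative, with chained case, possessive, and plural suffixes, so the short suffix list below is an assumed, illustrative fragment rather than a working analyzer:

```python
from collections import defaultdict

# Illustrative Uzbek suffixes, longest first so compound endings match before
# their components (e.g. "larning" before "lar" or "ning").
SUFFIXES = ["larning", "lardan", "larga", "larda", "ning", "lar", "dan", "ga", "da", "ni"]

def normalize(token: str) -> str:
    """Strip one known suffix, guarding against over-stripping short stems."""
    token = token.lower()
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

def build_index(docs: dict) -> dict:
    """Inverted index: normalized token -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in text.split():
            index[normalize(tok)].add(doc_id)
    return index

# Hypothetical mini-corpus: "book" in three inflected forms.
docs = {
    "d1": "kitob",         # "book"
    "d2": "kitoblar",      # "books"
    "d3": "kitoblarning",  # "of the books"
}
index = build_index(docs)
```

The point of normalization is visible here: a search for the stem "kitob" retrieves all three documents even though each attests a different surface form.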
11.2 Public Interface and Application Ecosystem
The public interface provides contextually enriched translation alongside the original text; integrated explanatory notes drawn from the knowledge graph; biographical and historical contextualizations of authors and works; and links to the HITL contribution interface for those wishing to participate. Built on the Meros-LLM, the platform further enables: an educational reading assistant for secondary and university-level Uzbek literature curricula; literary machine translation into English and Russian for the Uzbek diaspora; and cultural heritage interpretation tools for Uzbekistan’s UNESCO World Heritage sites at Samarkand, Bukhara, and Khiva.
12. Research Agenda: 2026–2030
Phase 1: Corpus Foundation (2026–2027)
- Digitize and OCR-process priority classical texts: Navoi’s complete works, the Baburnama, and selected canonical anthologies.
- Develop Arabic-to-Latin transliteration pipeline optimized for Chagatai Uzbek.
- Publish Meros Corpus v1.0 (target: 200M tokens) under an open scholarly license.
- Establish formal institutional partnerships: Alisher Navoi National Library; Tashkent State University of Uzbek Language and Literature; international Turkology centers.
- Initiate knowledge graph construction: Author and Work node population from existing bibliographic databases.
Phase 2: Model and Knowledge Graph Development (2027–2028)
- Extend the tokenizer of the selected base model with a 40,000-token morphology-aware Uzbek vocabulary.
- Complete three-stage training pipeline: general Uzbek pretraining, literary domain pretraining, and RLHF instruction fine-tuning.
- Develop and publish UzNavoi-QA, UzPoetry-Genre, UzMeter, and UzScript-Trans benchmark datasets.
- Release Meros-LLM 7B as open-source model on Hugging Face.
- Launch HITL contribution platform in beta with recruited expert reviewer cohort.
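The tokenizer-extension step in Phase 2 amounts to appending new morphology-aware tokens after the base vocabulary, so that existing token ids stay stable (in practice this is followed by resizing the model's embedding matrix). A library-agnostic sketch, with stand-in vocabularies and assumed token formats:

```python
# Stand-in for a multilingual base model's vocabulary; ids must not change,
# since pretrained embeddings are indexed by them.
base_vocab = {"<unk>": 0, "the": 1, "of": 2}

# Assumed morphology-aware additions: stems plus suffix subwords
# (the "##" continuation convention here is illustrative).
uzbek_tokens = ["kitob", "##lar", "##ning", "ghazal"]

def extend_vocab(vocab: dict, new_tokens: list) -> dict:
    """Append new tokens after the highest existing id, skipping duplicates."""
    extended = dict(vocab)
    next_id = max(vocab.values()) + 1
    for tok in new_tokens:
        if tok not in extended:  # never re-map an existing token
            extended[tok] = next_id
            next_id += 1
    return extended

vocab = extend_vocab(base_vocab, uzbek_tokens)
```

Appending rather than rebuilding is the design choice that makes Phase 2 feasible: the base model's pretrained embeddings remain valid, and only the newly added rows of the embedding matrix need to be initialized and trained.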
Phase 3: Dissemination Platform and Ecosystem (2028–2030)
- Launch public-access Meros digital repository with full scholarly and public interfaces.
- Deploy educational reading assistant for Uzbek literature curriculum.
- Initiate literary machine translation of Navoi’s Khamsa into English and Russian.
- Publish comprehensive research findings in Computational Linguistics, Language Resources and Evaluation, and Digital Scholarship in the Humanities.
- Convene inaugural Central Asian Digital Humanities Symposium on language-inclusive AI.
13. Conclusion
The Meros Project stands at a historically singular moment. The open-source AI ecosystem now provides multilingual base models that can be adapted rather than built from scratch. Uzbekistan’s Ministry of Digital Technologies has formally recognized the urgency of Uzbek language corpus development for AI. And the global scholarly community is increasingly attentive to the ethical imperative of language-inclusive AI. The risk of inaction is not merely that Uzbek will remain underrepresented in AI systems: it is that the digital revolution will replicate, at civilizational scale, the same marginalization that Uzbek literature has faced in global academic discourse for centuries.
The Meros Project is a refusal of that marginalization. It is a wager that the computational tools of the twenty-first century can be made to serve not only the languages of the economically powerful, but the languages in which Navoi wrote of love and justice, in which Babur recorded a world in transformation, in which generations of Uzbek poets found the words for what it means to be human. That is the heritage worth preserving. That is the meros.