rubai-corrector-transcript-uz

Transcript-display normalization model for Uzbek ASR output, with mixed Uzbek/Russian support. Built on the ByT5 architecture.

This is the transcript-display variant of the rubai-corrector model family. For the fine-tuning foundation checkpoint, see rubai-corrector-base.

Capabilities

This checkpoint is tuned for:

  • display-ready punctuation and casing
  • apostrophe normalization
  • OCR / ASR typo cleanup
  • Latin-script Russian -> Cyrillic Russian recovery
  • mixed Uzbek/Russian transcript cleanup
  • selected text-to-number normalization patterns

Intended Use

Use this model after ASR to convert noisy transcript text into better display text.

It is best for:

  • Rubai-style Uzbek ASR postprocessing
  • Uzbek display text cleanup
  • mixed Uzbek/Russian lines where Russian appears in Latin transcription

This normalizer is intended to follow the upstream ASR models in the same Rubai model family. It is focused on line-level transcript outputs that look like the text those ASR models produce.
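
Because the model works on single lines, a multi-line transcript is best split and normalized line by line. A minimal sketch of that loop (the `normalize_transcript` helper and its `generate_fn` callback are illustrative, not part of the package; with the Quick Start code, `generate_fn` would wrap the tokenizer, `model.generate`, and `batch_decode`):

```python
def normalize_transcript(transcript: str, generate_fn) -> str:
    """Normalize a multi-line ASR transcript one line at a time.

    `generate_fn` is any callable mapping a prefixed input string
    ("correct: <line>") to the model's corrected output string.
    """
    corrected = []
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of sending them to the model
        corrected.append(generate_fn(f"correct: {line}"))
    return "\n".join(corrected)
```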

It is not optimized for:

  • literal no-edit transcript preservation
  • noisy Gemini-style mixed-script metadata targets with forced Cyrillic inside Uzbek morphology
  • aggressive general denormalization beyond the transcript-display objective

Model Family

Model                                        Use Case
rubai-corrector-base                         Fine-tuning base for new correction tasks
rubai-corrector-transcript-uz (this model)   ASR transcript display normalization, mixed Uzbek/Russian

Both models share the same ByT5 architecture. This variant is fine-tuned from the base with additional transcript-display objectives.

Quick Start

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "islomov/rubai-corrector-transcript-uz"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

text = "bugun yaxshi kun. segodnya xoroshiy den."
inputs = tokenizer(f"correct: {text}", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
prediction = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(prediction)

Expected output:

Bugun yaxshi kun. Сегодня хороший день.

Real Example Outputs

The examples below are taken from this exact checkpoint's saved eval/test outputs.

Abbreviations / shorthand

Input:     tlefon rqami
Output:    Telefon raqami
Input:     telefon rqami qaysi
Output:    Telefon raqami qaysi

Apostrophes

Input:     ozbekiston gozal mamlakat bolgan
Output:    O'zbekiston go'zal mamlakat bo'lgan
Input:     men ozim kordim
Output:    Men o'zim ko'rdim.

OCR / ASR noise

Input:     0zbekiston Respub1ikasi
Output:    O'zbekiston Respublikasi
Input:     5alom dostlar
Output:    Salom do'stlar
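
The substitutions above (0↔o, 1↔l, 5↔s) are classic OCR/ASR look-alike confusions. A rule-based sketch of the idea, for illustration only — the model learns this from data and also restores casing and apostrophes, which this toy mapping does not:

```python
import re

# Common digit-for-letter look-alike substitutions (illustrative subset)
OCR_MAP = {"0": "o", "1": "l", "5": "s"}

def undo_ocr_digits(text: str) -> str:
    """Replace a confusable digit with its look-alike letter when it
    appears inside a word that also contains real letters."""
    def fix_word(match: re.Match) -> str:
        return "".join(OCR_MAP.get(ch, ch) for ch in match.group(0))
    # Only touch tokens that mix digits with letters; pure numbers stay intact.
    return re.sub(r"\b(?=\w*[a-zA-Z])\w+\b", fix_word, text)
```

Note that pure numbers such as "5000" are left untouched, since the lookahead requires at least one letter in the token.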

Numbers

Input:     uchrashuv o'n beshinchi yanvar kuni
Output:    Uchrashuv 15-yanvar kuni
Input:     narxi yigirma besh ming so'm
Output:    Narxi 25 000 so'm
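
The cardinal patterns above follow standard Uzbek number composition: tens plus units, then a multiplier word such as ming ("thousand"). A minimal, illustrative parser for such cardinals — not the model's mechanism, which learns these patterns end to end, including ordinals like "o'n beshinchi" -> "15-":

```python
# Illustrative subset of Uzbek number words
UNITS = {"nol": 0, "bir": 1, "ikki": 2, "uch": 3, "to'rt": 4, "besh": 5,
         "olti": 6, "yetti": 7, "sakkiz": 8, "to'qqiz": 9}
TENS = {"o'n": 10, "yigirma": 20, "o'ttiz": 30, "qirq": 40, "ellik": 50,
        "oltmish": 60, "yetmish": 70, "sakson": 80, "to'qson": 90}
MULTIPLIERS = {"yuz": 100, "ming": 1_000, "million": 1_000_000}

def words_to_number(words):
    """Parse a sequence of Uzbek cardinal words into an integer."""
    total, current = 0, 0
    for w in words:
        if w in UNITS:
            current += UNITS[w]
        elif w in TENS:
            current += TENS[w]
        elif w in MULTIPLIERS:
            current = max(current, 1) * MULTIPLIERS[w]
            if MULTIPLIERS[w] >= 1_000:
                total += current  # close out this thousand-group
                current = 0
        else:
            raise ValueError(f"unknown number word: {w}")
    return total + current
```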

Mixed Uzbek + Russian

Input:     bugun yaxshi kun. segodnya xoroshiy den.
Output:    Bugun yaxshi kun. Сегодня хороший день.
Input:     men bozorga bordim. tam ya kupil xleb.
Output:    Men bozorga bordim. Там я купил хлеб.

Russian only

Input:     segodnya xoroshaya pogoda
Output:    Сегодня хорошая погода
Input:     privet kak dela
Output:    Привет как дела
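
Recovering Cyrillic from Latin-transliterated Russian can be approximated with greedy longest-match transliteration. An illustrative sketch with a deliberately small table — the model handles this contextually and also restores capitalization and punctuation, which this table does not:

```python
# Greedy longest-match Latin -> Cyrillic table (illustrative subset)
LAT2CYR = {
    "shch": "щ", "sh": "ш", "ch": "ч", "zh": "ж", "kh": "х",
    "yo": "ё", "yu": "ю", "ya": "я", "ts": "ц",
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "e": "е",
    "z": "з", "i": "и", "y": "й", "k": "к", "l": "л", "m": "м",
    "n": "н", "o": "о", "p": "п", "r": "р", "s": "с", "t": "т",
    "u": "у", "f": "ф", "x": "х",
}

def latin_to_cyrillic(text: str) -> str:
    """Transliterate lowercase Latin Russian to Cyrillic, longest match first."""
    keys = sorted(LAT2CYR, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(LAT2CYR[k])
                i += len(k)
                break
        else:
            out.append(text[i])  # pass through spaces, punctuation, etc.
            i += 1
    return "".join(out)
```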

Mixed script

Input:     privet kak делa
Output:    Привет как дела
Input:     zaklad bersa keyin gaplashamiz
Output:    Заклад bersa keyin gaplashamiz

Display-text cleanup

Input:     mustahkamlik sinovida spark boshqa avtomobillarni ortda qoldirdi.
Output:    Mustahkamlik sinovida Spark boshqa avtomobillarni ortda qoldirdi.
Input:     kadrlarda kranning mashina old oynasi ustiga qulaganligini ko'rish mumkin
Output:    Kadrlarda kranning mashina old oynasi ustiga qulaganligini ko'rish mumkin.

Known Tradeoff

This model is more display-oriented than rubai-corrector-base.

That means:

  • it is better at final punctuation and finished-sentence formatting
  • it may add a final period in places where an old reference omitted it
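
When scoring this model against references written for the earlier base, that extra final period can inflate the apparent error rate. A punctuation-tolerant comparison helper (an illustrative evaluation aid, not part of this repo):

```python
def matches_ignoring_final_period(prediction: str, reference: str) -> bool:
    """Compare prediction and reference, ignoring trailing periods on either."""
    return prediction.rstrip().rstrip(".") == reference.rstrip().rstrip(".")
```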

Files

  • test_model.py — a small runnable example/test script for local use and HF packaging

How This Model Was Trained

This model is fine-tuned from rubai-corrector-base (which itself is built on google/byt5-small).

The fine-tuning added transcript-display objectives on top of the base correction capabilities:

  • Uzbek transcript/display pairs from ASR output
  • Russian recovery pairs from Latin-script ASR output
  • Punctuation and formatting polish data

The model expects the correct: instruction prefix at inference time.

Acknowledgements

Special thanks to Davron Ibrokhimov for sponsoring this work and making it possible to keep these models open.

Thank you to the community that supports Uzbek language technology. In particular:

  • MetaSell for support and resources
  • Kotib for their support and collaboration on Uzbek STT
  • Global Move for backing open Uzbek NLP work

Thanks to Arofat, Gulimshaxnoz, and many others who contributed in ways big and small. The list is too long to fit here, but every contribution matters and is appreciated.

Support my work and the open-source movement: https://tirikchilik.uz/islomovs

Model size: 0.3B parameters (F32, Safetensors).