Emirati VITS Male TTS Model
Bilingual VITS-based Text-to-Speech model with male voice for Emirati Arabic and English. Fine-tuned on 70 hours of bilingual audio data, this model delivers natural, conversational speech optimized for call center applications and general-purpose TTS in both languages.
Model Overview
Key Features:
- Bilingual: Native support for both Emirati Arabic and English
- Single-speaker male voice model
- 22050 Hz sample rate
- Emirati Arabic dialect-specific phonemization
- Seamless Arabic-English code-switching
- Emirati-specific text normalization (numbers, dates, currencies)
- Built with NVIDIA NeMo 2.6.1
Model Details
Architecture
- Model Type: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
- Framework: NVIDIA NeMo 2.6.1
- Hidden Channels: 192
- Filter Channels: 768
- Number of Layers: 6
- Attention Heads: 2
Training
- Epochs: 179
- Final Loss: 20.47
- Dataset: 70 hours of bilingual audio
- Use Case: Call center applications, conversational speech
- Sample Rate: 22050 Hz
- Mel Channels: 80
Language Support
- Bilingual Model: Equal support for Emirati Arabic and English
- Emirati Arabic: Gulf Arabic dialect with native phonemization
- English: Full English TTS support
- Code-switching: Seamless mixing of Arabic and English in the same sentence
- G2P: EmiratiG2P with dialect-specific phonemization (from custom NeMo fork)
- Text Tokenizer: IPATokenizer (International Phonetic Alphabet)
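Conceptually, the G2P plus IPATokenizer pipeline maps graphemes to IPA phonemes, then maps each phoneme to an integer ID for the model. A toy sketch with a hypothetical mini-lexicon; the real EmiratiG2P and IPATokenizer live in the custom NeMo fork and differ in detail:

```python
# Toy sketch of a G2P -> IPA-token pipeline (illustrative only; the actual
# EmiratiG2P/IPATokenizer classes in the custom NeMo fork differ in detail).

# Hypothetical mini phoneme lexicon: word -> IPA phoneme sequence
LEXICON = {
    "مرحبا": ["m", "a", "r", "ħ", "a", "b", "aː"],
    "hello": ["h", "ə", "l", "oʊ"],
}

def g2p(text):
    """Look up each word's IPA phonemes; unknown words fall back to letters."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON.get(word, list(word)))
    return phonemes

class ToyIPATokenizer:
    """Map IPA symbols to integer IDs, growing the vocabulary on demand."""
    def __init__(self):
        self.vocab = {}

    def encode(self, phonemes):
        return [self.vocab.setdefault(p, len(self.vocab)) for p in phonemes]

tokenizer = ToyIPATokenizer()
tokens = tokenizer.encode(g2p("hello مرحبا"))
print(tokens)
```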
Quick Start
macOS Users: See INSTALL_MAC.md for detailed installation instructions with quick start examples.
Prerequisites
Python: 3.10 or later (tested with Python 3.14)
# Install UV (modern Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create Python environment (3.10, 3.11, 3.12, 3.13, or 3.14)
uv venv --python 3.14
source .venv/bin/activate # On Windows: .venv\Scripts\activate
Installation
# Install PyTorch (CPU version shown, for GPU use CUDA-enabled version)
uv pip install torch torchvision torchaudio
# Install NeMo with TTS support
uv pip install nemo_toolkit[tts]
# Install audio libraries
uv pip install soundfile librosa
# Install Pynini (required for text normalization)
uv pip install pynini
# Install custom Emirati text normalization
uv pip install git+https://github.com/VadzimBelski-ScienceSoft/NeMo-text-processing.git
Inference
Using the provided inference.py CLI:
# Download the model files first
git clone https://huggingface.co/vadimbelsky/emirati-vits-male-1.0
cd emirati-vits-male-1.0
# Arabic example
python inference.py \
--text "مرحبا، كيف حالك اليوم؟" \
--out output_ar.wav \
--ckpt VITS_emirati_v3--loss_gen_all=20.4726-epoch=179-last.ckpt \
--hparams hparams.yaml \
--require-normalize
# English example
python inference.py \
--text "Hello, how are you today?" \
--out output_en.wav \
--ckpt VITS_emirati_v3--loss_gen_all=20.4726-epoch=179-last.ckpt \
--hparams hparams.yaml \
--require-normalize
Using Python directly:
import torch
from nemo.collections.tts.models import VitsModel
from bilingual_text_normalizer import BilingualTextNormalizer
import soundfile as sf

# Load model
model = VitsModel.restore_from(
    "VITS_emirati_v3--loss_gen_all=20.4726-epoch=179-last.ckpt",
    map_location=torch.device("cpu"),  # or "cuda" for GPU
)
model.eval()

# Initialize bilingual normalizer
normalizer = BilingualTextNormalizer(ar_lang="ar_ae", en_lang="en")

# Synthesize speech
text = "مرحبا، كيف حالك؟"  # or English text
normalized = normalizer.normalize(text)
with model.nemo_infer():
    tokens = model.parse(normalized)
    audio = model.convert_text_to_waveform(tokens=tokens)

# Save audio
sf.write("output.wav", audio.squeeze().cpu().numpy(), 22050)
Usage Examples
Example 1: Basic Arabic TTS
arabic_text = "اليوم الجو جميل في دبي"
normalized = normalizer.normalize(arabic_text)
with model.nemo_infer():
    tokens = model.parse(normalized)
    audio = model.convert_text_to_waveform(tokens=tokens)
sf.write("weather.wav", audio.squeeze().cpu().numpy(), 22050)
Example 2: Numbers in Emirati Dialect
# Numbers are automatically converted to Emirati pronunciation
text_with_numbers = "السعر 1500 درهم"  # Will pronounce as "ألف وخمس مية"
normalized = normalizer.normalize(text_with_numbers)
with model.nemo_infer():
    tokens = model.parse(normalized)
    audio = model.convert_text_to_waveform(tokens=tokens)
sf.write("price.wav", audio.squeeze().cpu().numpy(), 22050)
Example 3: Mixed Arabic-English Code-Switching
mixed_text = "أنا أعمل في Microsoft في Dubai"
normalized = normalizer.normalize(mixed_text)
with model.nemo_infer():
    tokens = model.parse(normalized)
    audio = model.convert_text_to_waveform(tokens=tokens)
sf.write("mixed.wav", audio.squeeze().cpu().numpy(), 22050)
OpenAI-Compatible TTS Server
For easy integration with applications expecting OpenAI-compatible APIs, we provide a FastAPI server that implements the /v1/audio/speech endpoint.
Installation
# Install server dependencies
uv pip install ".[server]"
Running the Server
# Start the server (default: http://localhost:8000)
python openai_tts_server.py
# Or with custom settings
python openai_tts_server.py \
--checkpoint VITS_emirati_v3--loss_gen_all=20.4726-epoch=179-last.ckpt \
--device cuda \
--host 0.0.0.0 \
--port 8000
Usage Examples
Using curl:
# Generate speech from Arabic text
curl http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"input": "مرحبا، كيف حالك اليوم؟",
"voice": "emirati-male",
"response_format": "mp3",
"speed": 1.0
}' \
--output speech.mp3
# Generate speech from English text
curl http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"input": "Hello, how are you today?",
"voice": "emirati-male",
"response_format": "wav"
}' \
--output speech.wav
Using Python (OpenAI client):
from openai import OpenAI

# Point to local server
client = OpenAI(
    api_key="not-needed",  # API key not required for local server
    base_url="http://localhost:8000/v1",
)

# Generate speech
response = client.audio.speech.create(
    model="tts-1",
    voice="emirati-male",
    input="مرحبا، كيف حالك؟",
    response_format="mp3",
    speed=1.0,
)

# Save to file
response.stream_to_file("output.mp3")
Using Python (requests):
import requests

url = "http://localhost:8000/v1/audio/speech"
data = {
    "model": "tts-1",
    "input": "السلام عليكم",
    "voice": "emirati-male",
    "response_format": "wav",
    "speed": 1.0,
}

response = requests.post(url, json=data)
with open("output.wav", "wb") as f:
    f.write(response.content)
API Parameters
- model: Model identifier (use "tts-1" or "tts-1-hd", both map to Emirati VITS)
- input: Text to synthesize (max 4096 characters)
- voice: Voice name (only "emirati-male" supported)
- response_format: Audio format - "mp3", "wav", "flac", "opus", "aac", or "pcm"
- speed: Playback speed (0.25 to 4.0, default: 1.0)
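As an illustration of how these constraints fit together (this is not the server's actual code), a request body could be validated and given its documented defaults like this:

```python
# Illustrative validation for a /v1/audio/speech request body, mirroring the
# parameter constraints documented above (not the server's actual code).

ALLOWED_FORMATS = {"mp3", "wav", "flac", "opus", "aac", "pcm"}

def validate_speech_request(body):
    """Apply defaults and enforce the documented limits, or raise ValueError."""
    if body.get("model") not in {"tts-1", "tts-1-hd"}:
        raise ValueError("model must be 'tts-1' or 'tts-1-hd'")
    text = body.get("input", "")
    if not text or len(text) > 4096:
        raise ValueError("input must be 1-4096 characters")
    if body.get("voice") != "emirati-male":
        raise ValueError("only the 'emirati-male' voice is supported")
    fmt = body.get("response_format", "mp3")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {fmt}")
    speed = float(body.get("speed", 1.0))
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {**body, "response_format": fmt, "speed": speed}

print(validate_speech_request({
    "model": "tts-1", "input": "مرحبا", "voice": "emirati-male",
}))
```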
Server Endpoints
- POST /v1/audio/speech - Text-to-speech synthesis (OpenAI compatible)
- GET /health - Health check endpoint
- GET / - API information
- GET /docs - Interactive API documentation (Swagger UI)
Model Configuration
Key parameters from hparams.yaml:
- Pitch Range: 50-400 Hz (male voice)
- Segment Size: 12288
- Mel Frequency Bins: 80
- FFT Size: 1024
- Hop Length: 256
- Window Size: 1024
- Spectral Normalization: Enabled
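These STFT settings fix the model's time resolution. A quick sanity check of what the hop and window sizes above mean at 22050 Hz:

```python
# Relationship between the hparams above and the mel-spectrogram time axis.
sample_rate = 22050   # Hz
hop_length = 256      # samples between successive frames
n_fft = 1024          # FFT size (window size matches)

frames_per_second = sample_rate / hop_length      # ~86.1 frames/s
frame_shift_ms = 1000 * hop_length / sample_rate  # ~11.6 ms per hop
window_ms = 1000 * n_fft / sample_rate            # ~46.4 ms analysis window

print(f"{frames_per_second:.1f} frames/s, "
      f"{frame_shift_ms:.1f} ms hop, {window_ms:.1f} ms window")
```

So each second of generated audio corresponds to roughly 86 mel frames, each covering a ~46 ms window shifted in ~11.6 ms steps.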
Text Normalization Features
The model uses custom Emirati Arabic text normalization (ar_ae language code) with:
- Sun letter assimilation: Enabled
- Vowel insertion: Enabled (vowel: 'a')
- English G2P fallback: Enabled for code-switching
- Emirati-specific number forms: e.g., "1500" → "ألف وخمس مية"
- Currency support: 20+ regional and international currencies
- Date normalization: Emirati dialect date expressions
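Sun letter assimilation means the "l" of the definite article "al-" assimilates to a following sun letter (e.g. "الشمس" is pronounced "ash-shams", not "al-shams"). A toy romanized sketch of the rule, assuming a simplified transliteration; the actual normalizer operates on Arabic script inside NeMo-text-processing:

```python
# Toy illustration of sun-letter assimilation on romanized Arabic
# (the real normalizer works on Arabic script and handles far more cases).

SUN_LETTERS = {"t", "th", "d", "dh", "r", "z", "s", "sh", "l", "n"}

def assimilate(word):
    """Assimilate the article's 'l' when 'al-' precedes a sun letter."""
    if not word.startswith("al-"):
        return word
    rest = word[3:]
    # Check two-letter digraphs (sh, th, dh) before single letters
    for onset in (rest[:2], rest[:1]):
        if onset in SUN_LETTERS:
            return "a" + onset + "-" + rest
    return word

print(assimilate("al-shams"))   # sun letter: "ash-shams"
print(assimilate("al-qamar"))   # moon letter: unchanged "al-qamar"
```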
The bilingual text normalizer (bilingual_text_normalizer.py) automatically:
- Detects script type (Arabic, English, or mixed)
- Routes to the appropriate normalizer (ar_ae for Arabic, en for English)
- Sanitizes Unicode punctuation and special characters
Note: Requires the custom NeMo-text-processing library with ar_ae support.
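The script-detection step can be sketched with Unicode block checks. This is an illustrative sketch, not the logic actually shipped in bilingual_text_normalizer.py:

```python
# Minimal script detection for routing text to the ar_ae or en normalizer
# (illustrative sketch; bilingual_text_normalizer.py may differ in detail).

def detect_script(text):
    """Classify text as 'arabic', 'english', or 'mixed' by Unicode blocks."""
    has_arabic = any("\u0600" <= ch <= "\u06FF" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_arabic and has_latin:
        return "mixed"
    return "arabic" if has_arabic else "english"

print(detect_script("مرحبا"))                    # arabic
print(detect_script("Hello"))                    # english
print(detect_script("أنا أعمل في Microsoft"))    # mixed
```

A "mixed" result would then trigger segment-by-segment routing, so each run of Arabic or English text reaches the normalizer for its own language.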
Limitations & Known Issues
- Dialect specificity: Optimized for Emirati Arabic; may not generalize well to other Arabic dialects
- Single speaker: Male voice only, no multi-speaker support
- Audio quality: 22050 Hz sample rate (standard quality, not high-fidelity)
- Code-switching: Works best with Arabic primary text and occasional English words
- Platform support: Pynini/OpenFst installation can be challenging on macOS/Windows (Linux recommended)
- Performance: CPU inference is slow; GPU significantly improves speed
Technical Requirements
- Python: 3.10 or later (tested with Python 3.14)
- RAM: Minimum 4GB, recommended 8GB+
- GPU: Optional but recommended (NVIDIA GPU with CUDA support)
- Disk space: ~500MB for model + dependencies
- Operating System: Linux (recommended), macOS (with caveats), Windows (WSL recommended)
License
This model is released under the Apache 2.0 license.
Citation
If you use this model in your research or applications, please cite the VITS paper:
@inproceedings{kim2021conditional,
  title={Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2021}
}
For this specific model:
@misc{emirati-vits-male-2026,
  author={Belsky, Vadim},
  title={Emirati VITS Male TTS Model},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/vadimbelsky/emirati-vits-male-1.0}}
}
Acknowledgments
- Built with NVIDIA NeMo framework
- Custom Emirati text normalization library
- Based on the VITS architecture by Kim et al.
Contact & Consultation
If you need help modifying or fine-tuning this model, I offer training and consulting services. Connect with me on LinkedIn or visit my blog.