Emirati VITS Male TTS Model
Bilingual VITS-based Text-to-Speech model with male voice for Emirati Arabic and English. Fine-tuned on 70 hours of bilingual audio data, this model delivers natural, conversational speech optimized for call center applications and general-purpose TTS in both languages.
Model Overview
Key Features:
- Bilingual: Native support for both Emirati Arabic and English
- Single-speaker male voice model
- 22050 Hz sample rate
- Emirati Arabic dialect-specific phonemization
- Seamless Arabic-English code-switching
- Emirati-specific text normalization (numbers, dates, currencies)
- Built with NVIDIA NeMo 2.6.1
Model Details
Architecture
- Model Type: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
- Framework: NVIDIA NeMo 2.6.1
- Hidden Channels: 192
- Filter Channels: 768
- Number of Layers: 6
- Attention Heads: 2
Training
- Epochs: 179
- Final Loss: 20.47
- Dataset: 70 hours of bilingual audio
- Use Case: Call center applications, conversational speech
- Sample Rate: 22050 Hz
- Mel Channels: 80
Language Support
- Bilingual Model: Equal support for Emirati Arabic and English
- Emirati Arabic: Gulf Arabic dialect with native phonemization
- English: Full English TTS support
- Code-switching: Seamless mixing of Arabic and English in the same sentence
- G2P: EmiratiG2P with dialect-specific phonemization (from custom NeMo fork)
- Text Tokenizer: IPATokenizer (International Phonetic Alphabet)
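Conceptually, the G2P plus IPATokenizer pipeline maps graphemes to IPA phonemes, then maps each phoneme to an integer ID for the model. A toy sketch with a hypothetical mini-lexicon; the real EmiratiG2P and IPATokenizer live in the custom NeMo fork and differ in detail:

```python
# Toy sketch of a G2P -> IPA-token pipeline (illustrative only; the actual
# EmiratiG2P/IPATokenizer classes in the custom NeMo fork differ in detail).

# Hypothetical mini phoneme lexicon: word -> IPA phoneme sequence
LEXICON = {
    "مرحبا": ["m", "a", "r", "ħ", "a", "b", "aː"],
    "hello": ["h", "ə", "l", "oʊ"],
}

def g2p(text):
    """Look up each word's IPA phonemes; unknown words fall back to letters."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON.get(word, list(word)))
    return phonemes

class ToyIPATokenizer:
    """Map IPA symbols to integer IDs, growing the vocabulary on demand."""
    def __init__(self):
        self.vocab = {}

    def encode(self, phonemes):
        return [self.vocab.setdefault(p, len(self.vocab)) for p in phonemes]

tokenizer = ToyIPATokenizer()
tokens = tokenizer.encode(g2p("hello مرحبا"))
print(tokens)
```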
Quick Start
macOS Users: See INSTALL_MAC.md for detailed installation instructions with quick start examples.
Prerequisites
Python: 3.10 or later (tested with Python 3.14)
# Install UV (modern Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create Python environment (3.10, 3.11, 3.12, 3.13, or 3.14)
uv venv --python 3.14
source .venv/bin/activate # On Windows: .venv\Scripts\activate
Installation
# Install PyTorch (CPU version shown, for GPU use CUDA-enabled version)
uv pip install torch torchvision torchaudio
# Install NeMo with TTS support
uv pip install nemo_toolkit[tts]
# Install audio libraries
uv pip install soundfile librosa
# Install Pynini (required for text normalization)
uv pip install pynini
# Install custom Emirati text normalization
uv pip install git+https://github.com/VadzimBelski-ScienceSoft/NeMo-text-processing.git
Inference
Using the provided inference.py CLI:
# Download the model files first
git clone https://huggingface.co/vadimbelsky/emirati-vits-male-1.0
cd emirati-vits-male-1.0
# Arabic example
python inference.py \
--text "مرحبا، كيف حالك اليوم؟" \
--out output_ar.wav \
--ckpt VITS_emirati_v3--loss_gen_all=20.4726-epoch=179-last.ckpt \
--hparams hparams.yaml \
--require-normalize
# English example
python inference.py \
--text "Hello, how are you today?" \
--out output_en.wav \
--ckpt VITS_emirati_v3--loss_gen_all=20.4726-epoch=179-last.ckpt \
--hparams hparams.yaml \
--require-normalize
Using Python directly:
import torch
from nemo.collections.tts.models import VitsModel
from bilingual_text_normalizer import BilingualTextNormalizer
import soundfile as sf

# Load model
model = VitsModel.restore_from(
    "VITS_emirati_v3--loss_gen_all=20.4726-epoch=179-last.ckpt",
    map_location=torch.device("cpu"),  # or "cuda" for GPU
)
model.eval()

# Initialize bilingual normalizer
normalizer = BilingualTextNormalizer(ar_lang="ar_ae", en_lang="en")

# Synthesize speech
text = "مرحبا، كيف حالك؟"  # or English text
normalized = normalizer.normalize(text)
with model.nemo_infer():
    tokens = model.parse(normalized)
    audio = model.convert_text_to_waveform(tokens=tokens)

# Save audio
sf.write("output.wav", audio.squeeze().cpu().numpy(), 22050)
Usage Examples
Example 1: Basic Arabic TTS
arabic_text = "اليوم الجو جميل في دبي"
normalized = normalizer.normalize(arabic_text)
with model.nemo_infer():
    tokens = model.parse(normalized)
    audio = model.convert_text_to_waveform(tokens=tokens)
sf.write("weather.wav", audio.squeeze().cpu().numpy(), 22050)
Example 2: Numbers in Emirati Dialect
# Numbers are automatically converted to Emirati pronunciation
text_with_numbers = "السعر 1500 درهم"  # Will pronounce as "ألف وخمس مية"
normalized = normalizer.normalize(text_with_numbers)
with model.nemo_infer():
    tokens = model.parse(normalized)
    audio = model.convert_text_to_waveform(tokens=tokens)
sf.write("price.wav", audio.squeeze().cpu().numpy(), 22050)
Example 3: Mixed Arabic-English Code-Switching
mixed_text = "أنا أعمل في Microsoft في Dubai"
normalized = normalizer.normalize(mixed_text)
with model.nemo_infer():
    tokens = model.parse(normalized)
    audio = model.convert_text_to_waveform(tokens=tokens)
sf.write("mixed.wav", audio.squeeze().cpu().numpy(), 22050)
OpenAI-Compatible TTS Server
For easy integration with applications expecting OpenAI-compatible APIs, we provide a FastAPI server that implements the /v1/audio/speech endpoint.
Installation
# Install server dependencies
uv pip install ".[server]"
Running the Server
# Start the server (default: http://localhost:8000)
python openai_tts_server.py
# Or with custom settings
python openai_tts_server.py \
--checkpoint VITS_emirati_v3--loss_gen_all=20.4726-epoch=179-last.ckpt \
--device cuda \
--host 0.0.0.0 \
--port 8000
Usage Examples
Using curl:
# Generate speech from Arabic text
curl http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"input": "مرحبا، كيف حالك اليوم؟",
"voice": "emirati-male",
"response_format": "mp3",
"speed": 1.0
}' \
--output speech.mp3
# Generate speech from English text
curl http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"input": "Hello, how are you today?",
"voice": "emirati-male",
"response_format": "wav"
}' \
--output speech.wav
Using Python (OpenAI client):
from openai import OpenAI

# Point to local server
client = OpenAI(
    api_key="not-needed",  # API key not required for local server
    base_url="http://localhost:8000/v1",
)

# Generate speech
response = client.audio.speech.create(
    model="tts-1",
    voice="emirati-male",
    input="مرحبا، كيف حالك؟",
    response_format="mp3",
    speed=1.0,
)

# Save to file
response.stream_to_file("output.mp3")
Using Python (requests):
import requests

url = "http://localhost:8000/v1/audio/speech"
data = {
    "model": "tts-1",
    "input": "السلام عليكم",
    "voice": "emirati-male",
    "response_format": "wav",
    "speed": 1.0,
}

response = requests.post(url, json=data)
with open("output.wav", "wb") as f:
    f.write(response.content)
API Parameters
- model: Model identifier (use "tts-1" or "tts-1-hd", both map to Emirati VITS)
- input: Text to synthesize (max 4096 characters)
- voice: Voice name (only "emirati-male" supported)
- response_format: Audio format - "mp3", "wav", "flac", "opus", "aac", or "pcm"
- speed: Playback speed (0.25 to 4.0, default: 1.0)
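As an illustration of how these constraints fit together (this is not the server's actual code), a request body could be validated and given its documented defaults like this:

```python
# Illustrative validation for a /v1/audio/speech request body, mirroring the
# parameter constraints documented above (not the server's actual code).

ALLOWED_FORMATS = {"mp3", "wav", "flac", "opus", "aac", "pcm"}

def validate_speech_request(body):
    """Apply defaults and enforce the documented limits, or raise ValueError."""
    if body.get("model") not in {"tts-1", "tts-1-hd"}:
        raise ValueError("model must be 'tts-1' or 'tts-1-hd'")
    text = body.get("input", "")
    if not text or len(text) > 4096:
        raise ValueError("input must be 1-4096 characters")
    if body.get("voice") != "emirati-male":
        raise ValueError("only the 'emirati-male' voice is supported")
    fmt = body.get("response_format", "mp3")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {fmt}")
    speed = float(body.get("speed", 1.0))
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {**body, "response_format": fmt, "speed": speed}

print(validate_speech_request({
    "model": "tts-1", "input": "مرحبا", "voice": "emirati-male",
}))
```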
Server Endpoints
- POST /v1/audio/speech - Text-to-speech synthesis (OpenAI compatible)
- GET /health - Health check endpoint
- GET / - API information
- GET /docs - Interactive API documentation (Swagger UI)
Model Configuration
Key parameters from hparams.yaml:
- Pitch Range: 50-400 Hz (male voice)
- Segment Size: 12288
- Mel Frequency Bins: 80
- FFT Size: 1024
- Hop Length: 256
- Window Size: 1024
- Spectral Normalization: Enabled
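These STFT settings fix the model's time resolution. A quick sanity check of what the hop and window sizes above mean at 22050 Hz:

```python
# Relationship between the hparams above and the mel-spectrogram time axis.
sample_rate = 22050   # Hz
hop_length = 256      # samples between successive frames
n_fft = 1024          # FFT size (window size matches)

frames_per_second = sample_rate / hop_length      # ~86.1 frames/s
frame_shift_ms = 1000 * hop_length / sample_rate  # ~11.6 ms per hop
window_ms = 1000 * n_fft / sample_rate            # ~46.4 ms analysis window

print(f"{frames_per_second:.1f} frames/s, "
      f"{frame_shift_ms:.1f} ms hop, {window_ms:.1f} ms window")
```

So each second of generated audio corresponds to roughly 86 mel frames, each covering a ~46 ms window shifted in ~11.6 ms steps.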
Text Normalization Features
The model uses custom Emirati Arabic text normalization (ar_ae language code) with:
- Sun letter assimilation: Enabled
- Vowel insertion: Enabled (vowel: 'a')
- English G2P fallback: Enabled for code-switching
- Emirati-specific number forms: e.g., "1500" → "ألف وخمس مية"
- Currency support: 20+ regional and international currencies
- Date normalization: Emirati dialect date expressions
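Sun letter assimilation means the "l" of the definite article "al-" assimilates to a following sun letter (e.g. "الشمس" is pronounced "ash-shams", not "al-shams"). A toy romanized sketch of the rule, assuming a simplified transliteration; the actual normalizer operates on Arabic script inside NeMo-text-processing:

```python
# Toy illustration of sun-letter assimilation on romanized Arabic
# (the real normalizer works on Arabic script and handles far more cases).

SUN_LETTERS = {"t", "th", "d", "dh", "r", "z", "s", "sh", "l", "n"}

def assimilate(word):
    """Assimilate the article's 'l' when 'al-' precedes a sun letter."""
    if not word.startswith("al-"):
        return word
    rest = word[3:]
    # Check two-letter digraphs (sh, th, dh) before single letters
    for onset in (rest[:2], rest[:1]):
        if onset in SUN_LETTERS:
            return "a" + onset + "-" + rest
    return word

print(assimilate("al-shams"))   # sun letter: "ash-shams"
print(assimilate("al-qamar"))   # moon letter: unchanged "al-qamar"
```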
The bilingual text normalizer (bilingual_text_normalizer.py) automatically:
- Detects script type (Arabic, English, or mixed)
- Routes to the appropriate normalizer (ar_ae for Arabic, en for English)
- Sanitizes Unicode punctuation and special characters
Note: Requires the custom NeMo-text-processing library with ar_ae support.
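The script-detection step can be sketched with Unicode block checks. This is an illustrative sketch, not the logic actually shipped in bilingual_text_normalizer.py:

```python
# Minimal script detection for routing text to the ar_ae or en normalizer
# (illustrative sketch; bilingual_text_normalizer.py may differ in detail).

def detect_script(text):
    """Classify text as 'arabic', 'english', or 'mixed' by Unicode blocks."""
    has_arabic = any("\u0600" <= ch <= "\u06FF" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_arabic and has_latin:
        return "mixed"
    return "arabic" if has_arabic else "english"

print(detect_script("مرحبا"))                    # arabic
print(detect_script("Hello"))                    # english
print(detect_script("أنا أعمل في Microsoft"))    # mixed
```

A "mixed" result would then trigger segment-by-segment routing, so each run of Arabic or English text reaches the normalizer for its own language.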
Limitations & Known Issues
- Dialect specificity: Optimized for Emirati Arabic; may not generalize well to other Arabic dialects
- Single speaker: Male voice only, no multi-speaker support
- Audio quality: 22050 Hz sample rate (standard quality, not high-fidelity)
- Code-switching: Works best with Arabic primary text and occasional English words
- Platform support: Pynini/OpenFst installation can be challenging on macOS/Windows (Linux recommended)
- Performance: CPU inference is slow; GPU significantly improves speed
Technical Requirements
- Python: 3.10 or later (tested with Python 3.14)
- RAM: Minimum 4GB, recommended 8GB+
- GPU: Optional but recommended (NVIDIA GPU with CUDA support)
- Disk space: ~500MB for model + dependencies
- Operating System: Linux (recommended), macOS (with caveats), Windows (WSL recommended)
License
This model is released under the Apache 2.0 license.
Citation
If you use this model in your research or applications, please cite the VITS paper:
@inproceedings{kim2021conditional,
  title={Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2021}
}
For this specific model:
@misc{emirati-vits-male-2026,
  author={Belsky, Vadim},
  title={Emirati VITS Male TTS Model},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/vadimbelsky/emirati-vits-male-1.0}}
}
Acknowledgments
- Built with NVIDIA NeMo framework
- Custom Emirati text normalization library
- Based on the VITS architecture by Kim et al.
Contact & Consultation
If you need help modifying or fine-tuning this model, I offer training and consulting services. Connect with me on LinkedIn or visit my blog.