Commit ee46407 by NMikka (verified)
1 Parent(s): 07835be

Update inference script for NeMo 2.7.2 with Georgian-aware chunking

Files changed (1)
  1. README.md +93 -24
README.md CHANGED
@@ -21,6 +21,7 @@ pipeline_tag: text-to-speech
  A fine-tuned [MagPIE TTS](https://huggingface.co/nvidia/magpie_tts_multilingual_357m) model for Georgian (ქართული) text-to-speech synthesis.

  This is the **open-source TTS model fine-tuned specifically for Georgian**, produced as part of the [Georgian TTS Benchmark](https://github.com/NikaGaworworw/TTS_pipelines).
  ## Evaluation Results

  Evaluated on the full [FLEURS Georgian](https://huggingface.co/datasets/google/fleurs) test set (979 samples) using round-trip intelligibility:
@@ -37,9 +38,8 @@ Evaluated on the full [FLEURS Georgian](https://huggingface.co/datasets/google/f
  ### Installation

  ```bash
- # MagPIE TTS requires NeMo 2.8+ (not yet on PyPI — install from source)
- git clone https://github.com/NVIDIA/NeMo.git
- cd NeMo && pip install -e ".[tts]"
  pip install huggingface_hub
  ```

@@ -48,29 +48,89 @@ pip install huggingface_hub
  ### Inference

  ```python
  import torch
  import torchaudio
  from huggingface_hub import hf_hub_download
- from nemo.collections.tts.models import MagpieTTSModel
- from nemo.collections.tts.parts.utils.tts_dataset_utils import chunk_text_for_inference

  # Download and load model
  nemo_path = hf_hub_download(repo_id="NMikka/Magpie-TTS-Geo-357m", filename="magpie_tts_georgian.nemo")
  model = MagpieTTSModel.restore_from(nemo_path, map_location="cpu")
  model = model.eval().cuda()

  # Synthesize
  text = "გამარჯობა, მე მქვია მაგპაი და ქართულად ვლაპარაკობ."

- chunked_tokens, chunked_tokens_len, _ = chunk_text_for_inference(
-     text=text,
-     language="ka",
-     tokenizer_name="text_ce_tokenizer",
-     text_tokenizer=model.tokenizer,
-     eos_token_id=model.eos_id,
- )

- chunk_state = model.create_chunk_state(batch_size=1)
  all_codes = []

  for i, (toks, toks_len) in enumerate(zip(chunked_tokens, chunked_tokens_len)):
@@ -80,7 +140,7 @@ for i, (toks, toks_len) in enumerate(zip(chunked_tokens, chunked_tokens_len)):
          "speaker_indices": 1, # speaker index (0-4)
      }
      with torch.no_grad():
-         output = model.generate_speech(
              batch,
              chunk_state=chunk_state,
              end_of_text=[i == len(chunked_tokens) - 1],
@@ -117,15 +177,14 @@ def synthesize(model, text, speaker=1, use_cfg=True):
      Returns:
          waveform (torch.Tensor): Audio tensor, shape (1, num_samples), 22050 Hz
      """
-     chunked_tokens, chunked_tokens_len, _ = chunk_text_for_inference(
-         text=text,
-         language="ka",
-         tokenizer_name="text_ce_tokenizer",
-         text_tokenizer=model.tokenizer,
-         eos_token_id=model.eos_id,
-     )
-
-     chunk_state = model.create_chunk_state(batch_size=1)
      all_codes = []

      for i, (toks, toks_len) in enumerate(zip(chunked_tokens, chunked_tokens_len)):
@@ -135,7 +194,7 @@ def synthesize(model, text, speaker=1, use_cfg=True):
              "speaker_indices": speaker,
          }
          with torch.no_grad():
-             output = model.generate_speech(
                  batch,
                  chunk_state=chunk_state,
                  end_of_text=[i == len(chunked_tokens) - 1],
@@ -172,6 +231,16 @@ MagPIE TTS is an **encoder-decoder transformer** (not a diffusion or flow model)

  **Classifier-Free Guidance (CFG)** runs two forward passes (with/without text conditioning) and interpolates. Set `use_cfg=False` for ~2x faster inference with slightly lower quality.

  ## Speakers

  The model has 5 baked speaker embeddings from pretraining. Set via `speaker_indices` in the batch dict.
 
  A fine-tuned [MagPIE TTS](https://huggingface.co/nvidia/magpie_tts_multilingual_357m) model for Georgian (ქართული) text-to-speech synthesis.

  This is the **open-source TTS model fine-tuned specifically for Georgian**, produced as part of the [Georgian TTS Benchmark](https://github.com/NikaGaworworw/TTS_pipelines).
+
  ## Evaluation Results

  Evaluated on the full [FLEURS Georgian](https://huggingface.co/datasets/google/fleurs) test set (979 samples) using round-trip intelligibility:
 
  ### Installation

  ```bash
+ # Requires NeMo 2.7.2 (install from source at the tested commit)
+ pip install nemo_toolkit[tts]@git+https://github.com/NVIDIA-NeMo/NeMo.git@3d73c48aca1ae3be44657267b81f25dc3201161a
  pip install huggingface_hub
  ```
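
To confirm that the environment actually has the NeMo version named in the install comment, a quick check along these lines may help. This is a minimal sketch and not part of the commit: the helper name `nemo_version_status` is hypothetical, and it assumes the package is registered under the distribution name `nemo_toolkit`, as in the `pip install` line.

```python
from importlib.metadata import PackageNotFoundError, version

EXPECTED_NEMO = "2.7.2"  # version named in the install comment


def nemo_version_status(expected: str = EXPECTED_NEMO, package: str = "nemo_toolkit") -> str:
    """Return a short, human-readable status for the installed package version."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package} is not installed"
    # Source installs may carry a suffix (for example "rc0"), so compare by prefix.
    if installed.startswith(expected):
        return f"{package} {installed} matches {expected}"
    return f"{package} {installed} differs from tested {expected}"


print(nemo_version_status())
```

The prefix comparison is deliberately loose; pin an exact string instead if stricter checking is needed.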
 
  ### Inference

  ```python
+ import re
  import torch
  import torchaudio
  from huggingface_hub import hf_hub_download
+ from nemo.collections.tts.models.magpietts import MagpieTTSModel

  # Download and load model
  nemo_path = hf_hub_download(repo_id="NMikka/Magpie-TTS-Geo-357m", filename="magpie_tts_georgian.nemo")
  model = MagpieTTSModel.restore_from(nemo_path, map_location="cpu")
  model = model.eval().cuda()

+ TOKENIZER_NAME = "text_ce_tokenizer"
+ MAX_TOKENS_PER_CHUNK = 400 # ~133 Georgian chars, keeps well under 500 decoder steps
+
+
+ def split_georgian_text(text: str) -> list[str]:
+     """Split Georgian text into chunks suitable for TTS inference.
+
+     Splitting priority:
+     1. Sentence-ending punctuation (. ! ?)
+     2. Clause-level punctuation (, ; : —)
+     3. Word boundaries (whitespace) as last resort for very long spans
+     """
+     sentences = re.split(r'(?<=[.!?])\s+', text)
+
+     chunks = []
+     for sentence in sentences:
+         est_tokens = len(sentence.encode('utf-8'))
+         if est_tokens <= MAX_TOKENS_PER_CHUNK:
+             chunks.append(sentence)
+             continue
+
+         clauses = re.split(r'(?<=[,;:—])\s+', sentence)
+         current = ""
+         for clause in clauses:
+             combined = f"{current} {clause}".strip() if current else clause
+             if len(combined.encode('utf-8')) <= MAX_TOKENS_PER_CHUNK:
+                 current = combined
+             else:
+                 if current:
+                     chunks.append(current)
+                 if len(clause.encode('utf-8')) > MAX_TOKENS_PER_CHUNK:
+                     words = clause.split()
+                     current = ""
+                     for word in words:
+                         combined = f"{current} {word}".strip() if current else word
+                         if len(combined.encode('utf-8')) <= MAX_TOKENS_PER_CHUNK:
+                             current = combined
+                         else:
+                             if current:
+                                 chunks.append(current)
+                             current = word
+                 else:
+                     current = clause
+         if current:
+             chunks.append(current)
+
+     return [c for c in chunks if c.strip()]
+
+
+ def tokenize_chunks(chunks: list[str], tokenizer, eos_id: int):
+     """Tokenize pre-split text chunks, appending EOS to each."""
+     chunked_tokens = []
+     chunked_tokens_len = []
+     for chunk in chunks:
+         tokens = tokenizer.encode(text=chunk, tokenizer_name=TOKENIZER_NAME)
+         tokens = tokens + [eos_id]
+         tokens = torch.tensor(tokens, dtype=torch.int32)
+         chunked_tokens.append(tokens)
+         chunked_tokens_len.append(tokens.shape[0])
+     return chunked_tokens, chunked_tokens_len
+
+
  # Synthesize
  text = "გამარჯობა, მე მქვია მაგპაი და ქართულად ვლაპარაკობ."

+ if text[-1] not in ".!?,:;":
+     text += "."

+ chunks = split_georgian_text(text)
+ chunked_tokens, chunked_tokens_len = tokenize_chunks(chunks, model.tokenizer, model.eos_id)
+
+ chunk_state = model.create_longform_chunk_state(batch_size=1)
  all_codes = []

  for i, (toks, toks_len) in enumerate(zip(chunked_tokens, chunked_tokens_len)):
 
          "speaker_indices": 1, # speaker index (0-4)
      }
      with torch.no_grad():
+         output = model.generate_long_form_speech(
              batch,
              chunk_state=chunk_state,
              end_of_text=[i == len(chunked_tokens) - 1],
 
      Returns:
          waveform (torch.Tensor): Audio tensor, shape (1, num_samples), 22050 Hz
      """
+     text = text.strip()
+     if text[-1] not in ".!?,:;":
+         text += "."
+
+     chunks = split_georgian_text(text)
+     chunked_tokens, chunked_tokens_len = tokenize_chunks(chunks, model.tokenizer, model.eos_id)
+
+     chunk_state = model.create_longform_chunk_state(batch_size=1)
      all_codes = []

      for i, (toks, toks_len) in enumerate(zip(chunked_tokens, chunked_tokens_len)):
 
              "speaker_indices": speaker,
          }
          with torch.no_grad():
+             output = model.generate_long_form_speech(
                  batch,
                  chunk_state=chunk_state,
                  end_of_text=[i == len(chunked_tokens) - 1],
 

  **Classifier-Free Guidance (CFG)** runs two forward passes (with/without text conditioning) and interpolates. Set `use_cfg=False` for ~2x faster inference with slightly lower quality.
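
The interpolation step can be illustrated with a small sketch. It uses the common classifier-free guidance formulation, `uncond + scale * (cond - uncond)`, on plain Python lists. Whether MagPIE TTS applies exactly this formula is an assumption, and `cfg_interpolate` and `cfg_scale` are hypothetical names, not NeMo API.

```python
def cfg_interpolate(cond_logits, uncond_logits, cfg_scale=2.5):
    """Blend conditional and unconditional logits as u + s * (c - u).

    cond_logits:   output of the pass with text conditioning
    uncond_logits: output of the pass without text conditioning
    cfg_scale:     guidance strength (illustrative default, not taken from the model)
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [u + cfg_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


cond = [2.0, 0.5, -1.0]    # toy logits from the conditioned pass
uncond = [1.0, 0.5, 0.0]   # toy logits from the unconditioned pass
guided = cfg_interpolate(cond, uncond, cfg_scale=2.0)
print(guided)
```

With a scale of 1.0 the result equals the conditioned logits, which is why skipping the second pass roughly halves the work.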

+ ## Text Chunking
+
+ Georgian text requires custom chunking because NeMo's built-in `split_by_sentence` doesn't handle Georgian properly (incorrect capitalization, no splitting of long sentences). The chunker included above splits text with this priority:
+
+ 1. **Sentence-ending punctuation** (`.` `!` `?`)
+ 2. **Clause-level punctuation** (`,` `;` `:` `—`)
+ 3. **Word boundaries** as a last resort
+
+ Each chunk is limited to 400 bytes (~133 Georgian characters), keeping well under the model's 500 decoder step limit.
+
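
The byte budget can be checked with a few lines of Python. The sketch below assumes only what the section states (a 400-byte limit and a first split on `.` `!` `?`); `utf8_len`, `MAX_BYTES_PER_CHUNK`, and the sample text are illustrative and not part of the README.

```python
import re

MAX_BYTES_PER_CHUNK = 400  # the per-chunk limit quoted in the section


def utf8_len(text: str) -> int:
    """Length of text in UTF-8 bytes, the estimate the chunker compares to the limit."""
    return len(text.encode("utf-8"))


# Georgian (Mkhedruli) letters take 3 bytes each in UTF-8,
# which is where the "~133 characters" figure comes from.
bytes_per_letter = utf8_len("ა")
max_letters = MAX_BYTES_PER_CHUNK // bytes_per_letter

# First splitting stage: break after sentence-ending punctuation.
sample = "გამარჯობა. როგორ ხარ? კარგად!"
sentences = re.split(r"(?<=[.!?])\s+", sample)

for sentence in sentences:
    within_budget = utf8_len(sentence) <= MAX_BYTES_PER_CHUNK
    print(sentence, utf8_len(sentence), within_budget)
```

Counting bytes rather than characters gives a conservative estimate for Georgian, since ASCII punctuation and spaces cost only one byte each.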
  ## Speakers

  The model has 5 baked speaker embeddings from pretraining. Set via `speaker_indices` in the batch dict.