Hi, I’m Seniru Epasinghe 👋

I’m a AI undergraduate and an AI enthusiast, working on machine learning projects and open-source contributions.
I enjoy exploring AI pipelines, natural language processing, and building tools that make development easier.

🌐 Connect with me

Sinscribe

This model is a fine-tuned version of openai/whisper-small on the Sinhala CSV + FLACs dataset. It achieves the following results on the evaluation set:

Loss: 0.0829
Wer: 29.1387

Intended uses & limitations

Can be used for Sinhala speech to text conversions. Make sure to input noise low audio to the model, to get the best outcome.

Training and evaluation data

Trained on the custom dataset - seniruk/sinscribe-sinhala-stt

Trained on above final dataset with 2 epochs on a device with below spec for 41:00:59 hours

16GB RAM
NVIDIA GeForce RTX 4060 Ti 16GB
12th Gen Intel(R) Core(TM) i5-12400F (2.50 GHz) - 64-bit

Training results

Training Loss	Epoch	Step	Validation Loss	Wer
0.1871	0.1102	1000	0.1834	51.9170
0.1429	0.2204	2000	0.1517	44.7541
0.1345	0.3307	3000	0.1336	41.0627
0.1183	0.4409	4000	0.1237	38.6625
0.114	0.5511	5000	0.1151	36.9654
0.1056	0.6613	6000	0.1080	35.2670
0.0968	0.7715	7000	0.1037	34.4457
0.1011	0.8817	8000	0.0986	33.2741
0.0971	0.9920	9000	0.0961	32.7147
0.0713	1.1022	10000	0.0947	32.0250
0.0706	1.2124	11000	0.0940	32.0766
0.0691	1.3226	12000	0.0907	31.2485
0.0684	1.4328	13000	0.0893	30.9512
0.0718	1.5430	14000	0.0875	30.3592
0.0642	1.6533	15000	0.0859	30.0388
0.0667	1.7635	16000	0.0842	29.5840
0.0667	1.8737	17000	0.0835	29.3193
0.0677	1.9839	18000	0.0829	29.1387

Inferencing

import torch
import soundfile
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

device = "cpu"
torchaudio.set_audio_backend("soundfile") 

model = WhisperForConditionalGeneration.from_pretrained("seniruk/whisper-small-si-cpu").to(device)
processor = WhisperProcessor.from_pretrained("seniruk/whisper-small-si-cpu")

def transcribe(audio_path):
    try:
        if audio_path is None:
            return "No audio received. Please record something."

        waveform, sample_rate = torchaudio.load(audio_path)

        if waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0, keepdim=True)

        if sample_rate != 16000:
            resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
            waveform = resampler(waveform)
            sample_rate = 16000

        waveform = waveform.squeeze().numpy()

        inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt").input_features.to("cpu")
        predicted_ids = model.generate(inputs)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

        return transcription

    except Exception as e:
        return f"Error during transcription: {e}"

print(transcribe('audio.wav'))

Gradio UI

import torch
import soundfile
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import gradio as gr

device = "cpu"
torchaudio.set_audio_backend("soundfile")

model = WhisperForConditionalGeneration.from_pretrained("seniruk/whisper-small-si-cpu").to(device)
processor = WhisperProcessor.from_pretrained("seniruk/whisper-small-si-cpu")

MAX_DURATION_SECONDS = 30  # Limit: 30 seconds of audio

def transcribe(audio_path):
    try:
        if audio_path is None:
            return "No audio received. Please record or upload a file."

        # Load audio
        waveform, sample_rate = torchaudio.load(audio_path)

        # Convert to mono
        if waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0, keepdim=True)

        # Duration check
        duration = waveform.shape[1] / sample_rate
        if duration > MAX_DURATION_SECONDS:
            return f"Audio too long ({duration:.1f}s). Please use an audio clip shorter than {MAX_DURATION_SECONDS}s."

        # Resample if necessary
        if sample_rate != 16000:
            resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
            waveform = resampler(waveform)
            sample_rate = 16000

        waveform = waveform.squeeze().numpy()

        # Process through Whisper
        inputs = processor(
            waveform,
            sampling_rate=sample_rate,
            return_tensors="pt"
        ).input_features.to(device)

        with torch.no_grad():
            predicted_ids = model.generate(inputs)
            transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

        return transcription

    except Exception as e:
        return f"Error during transcription: {e}"


iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath", label="Record or Upload Audio"),
    outputs=gr.Textbox(label="Transcription"),
    title="Whisper Small Sinhala (CPU)",
    description=(
        "🎙️ Sinhala speech-to-text using the fine-tuned Whisper Small model (Sinscribe).\n"
        "You can record or upload audio up to 30 seconds long."
    ),
)

iface.launch()

GPU runnable version of above model is below

seniruk/whisper-small-si

Framework versions

Transformers 4.48.0
Pytorch 2.5.1+cu121
Datasets 3.6.0
Tokenizers 0.21.4

Downloads last month: 14

Safetensors

Model size

0.2B params

Tensor type

F32

Model tree for seniruk/whisper-small-si-cpu

Base model

openai/whisper-small

Finetuned

(3490)

this model

seniruk
/

whisper-small-si-cpu