# bargein-classifier
Open-source barge-in detection for voice agents. Classifies whether a user is interrupting (barge-in) or just backchanneling ("uh-huh", "yeah") during an agent's turn.
Designed as a self-hosted alternative to proprietary adaptive interruption handling.
## Overview

| Property | Value |
|---|---|
| Architecture | 2D CNN on log-mel spectrograms |
| Model size | 373 KB (ONNX) |
| Input | 2 s window, 16 kHz mono PCM |
| Inference | ~5 ms CPU (ONNX Runtime) |
| Streaming | 100 ms hop, sliding window |
| Training data | AMI + ICSI meeting corpora (CC BY 4.0) |
## How It Works

Place the classifier downstream of VAD in a voice pipeline. When the user speaks while the agent is talking:

- VAD detects speech overlap
- This model scores the user's audio
- High probability = barge-in, agent should yield
- Low probability = backchannel or noise, agent keeps speaking

The model expects the user's audio in isolation (after echo cancellation), not a mixed signal.
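The handoff logic can be sketched as a sliding-window loop over the echo-cancelled user channel. `dummy_score` below is a placeholder for the actual ONNX inference call, and the threshold value is illustrative; both are assumptions for the sketch, not the shipped implementation.

```python
import numpy as np

SR = 16000
WIN = 2 * SR            # 2 s analysis window (32000 samples)
HOP = SR // 10          # 100 ms hop (1600 samples)
THRESHOLD = 0.339       # illustrative; use the threshold shipped with the model

def dummy_score(window: np.ndarray) -> float:
    # Placeholder for the ONNX classifier: maps a 2 s window to a
    # barge-in probability. Here, a crude energy heuristic for illustration.
    return float(np.clip(np.abs(window).mean() * 10.0, 0.0, 1.0))

def handle_overlap(user_audio: np.ndarray) -> str:
    """Slide a 2 s window over echo-cancelled user audio in 100 ms hops;
    tell the agent to yield as soon as any window scores as a barge-in."""
    for start in range(0, len(user_audio) - WIN + 1, HOP):
        if dummy_score(user_audio[start : start + WIN]) >= THRESHOLD:
            return "yield"          # barge-in: stop the agent's TTS
    return "keep_speaking"          # backchannel or noise: keep talking
```

A production pipeline would run this incrementally as audio arrives rather than over a buffered array, but the windowing and thresholding are the same.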
## Performance

### Detection accuracy
| Metric | Value |
|---|---|
| PR-AUC | 0.912 |
| Precision @ 95% recall | 0.772 |
| Majority baseline | 0.709 |
### Hard negative rejection
| Sound | False positive rate |
|---|---|
| Laugh | 13% |
| Cough | 22% |
| Throat clear | 18% |
| Breath-laugh | 8% |
### Cross-corpus generalization (AMI-only variant)
| Eval set | PR-AUC | Precision @ 95% recall |
|---|---|---|
| AMI (same-corpus) | 0.972 | 0.909 |
| ICSI (cross-corpus) | 0.979 | 0.894 |
## Quickstart

### ONNX Runtime

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("bargein.onnx", providers=["CPUExecutionProvider"])
meta = np.load("bargein.onnx.meta.npz", allow_pickle=True)
threshold = float(meta["threshold"][0])

# features: (1, 1, 64, 200) float32 log-mel spectrogram from 2 s of 16 kHz audio
logits = session.run(None, {"input": features})[0]
prob = 1.0 / (1.0 + np.exp(-logits[0]))
is_bargein = prob >= threshold
```
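The quickstart leaves `features` undefined. A minimal numpy-only featurizer consistent with the documented parameters (64-band log-mel, 512 FFT, 160 hop, 2 s at 16 kHz) might look like the sketch below; the exact windowing, padding, and log floor used in training are assumptions here, so check the repo's own feature extraction code before relying on it.

```python
import numpy as np

SR, N_FFT, HOP, N_MELS, N_FRAMES = 16000, 512, 160, 64, 200

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    # Triangular filters spaced evenly on the mel scale, 0 Hz to Nyquist.
    hz_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def log_mel(audio):
    # audio: 2 s of 16 kHz mono float32 PCM (32000 samples).
    # Center-pad so framing yields exactly N_FRAMES frames.
    x = np.pad(audio, (N_FFT // 2, N_FFT // 2))
    frames = np.stack([x[i * HOP : i * HOP + N_FFT] for i in range(N_FRAMES)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(N_FFT), axis=1)) ** 2
    mel = spec @ mel_filterbank().T
    return np.log(mel + 1e-6).T[None, None].astype(np.float32)  # (1, 1, 64, 200)
```

The output plugs directly into `session.run(None, {"input": features})` above.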
### HTTP Server

```bash
BARGEIN_MODEL_PATH=bargein.onnx uvicorn server.app:app --host 0.0.0.0 --port 8080

curl -X POST http://localhost:8080/bargein --data-binary @audio.raw
# {"is_bargein": true, "probability": 0.87, "threshold": 0.339, "prediction_duration_ms": 4.8}
```
## Deployment
| Requirement | Specification |
|---|---|
| Runtime | CPU-only (ONNX Runtime) |
| RAM | < 100 MB |
| GPU | Not required |
| Dependencies | onnxruntime, numpy |
## Training Data

~55K labeled events from two meeting corpora:

- AMI (138 meetings, ~29K events) – dialog-act annotations as weak supervision
- ICSI (75 meetings, ~26K events) – MRDA dialog-act tags
- Hard negatives (~5K) – laugh, cough, throat clear, breath sounds

Audio source: individual per-speaker headset channels, analogous to a user's microphone with echo cancellation.

Labels are weak supervision from dialog-act ontologies, not human-audited barge-in judgments.
## Limitations

- English only – trained on English meeting corpora
- Domain gap – trained on meeting audio, not voice agent audio. Deploy in shadow mode first to validate on your traffic.
- Single-speaker input – expects isolated user audio (with echo cancellation). Performance degrades on mixed/summed channels.
- Weak labels – there is a label noise ceiling from the dialog-act proxy. Human-audited fine-tuning would improve quality.
## Training Configuration
| Parameter | Value |
|---|---|
| Framework | PyTorch, exported to ONNX |
| Optimizer | Adam, lr=0.003 |
| Batch size | 32 |
| Epochs | 20 (early stopping, patience=5) |
| Features | 64-band log-mel, 512 FFT, 160 hop |
| Threshold | Recall-constrained sweep (recall >= 95%) |
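The feature parameters determine the model's (1, 1, 64, 200) input shape: a 160-sample hop at 16 kHz gives 100 frames per second, so a 2 s window spans 200 frames of 64 mel bands. A quick sanity check:

```python
sr = 16000          # sample rate (Hz)
hop = 160           # hop length in samples -> 10 ms per frame
n_mels = 64         # mel bands
window_s = 2.0      # analysis window length (s)

frames_per_second = sr // hop                 # 100
n_frames = int(window_s * frames_per_second)  # 200
input_shape = (1, 1, n_mels, n_frames)        # (1, 1, 64, 200)
```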
## Author
Borislav Novikov (bnovkov012@gmail.com)
## Citation

```bibtex
@software{novikov2026bargein,
  title={bargein-classifier: Open-source barge-in detection for voice agents},
  author={Novikov, Borislav},
  year={2026},
  url={https://huggingface.co/bnovikov/bargein-classifier}
}
```
### Training data

```bibtex
@article{carletta2005ami,
  title={The AMI meeting corpus: A pre-announcement},
  author={Carletta, Jean},
  journal={Machine Learning for Multimodal Interaction},
  year={2005}
}

@inproceedings{janin2003icsi,
  title={The ICSI meeting corpus},
  author={Janin, Adam and Baron, Don and Edwards, Jane and Ellis, Dan and Gelbart, David and Morgan, Nelson and Peskin, Barbara and Pfau, Thilo and Shriberg, Elizabeth and Stolcke, Andreas and Wooters, Chuck},
  booktitle={ICASSP},
  year={2003}
}
```
## License

Model weights: CC BY 4.0 (same as training data).