Data2Vec Audio (GGUF)
GGUF conversion of facebook/data2vec-audio-base-960h for use with CrispASR.
Model Details
- Architecture: Data2Vec Audio, a wav2vec2-style CNN (7 layers, 512-dim) followed by a 12-layer transformer (768-dim, 12 heads) and a CTC head
- Parameters: ~95M
- Training: Self-supervised pre-training on LibriSpeech 960h, fine-tuned with CTC loss
- Language: English only
- License: Apache 2.0
- WER: 1.89% (LibriSpeech test-clean), 4.07% (test-other)
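The CTC head emits one label per audio frame; decoding collapses repeated labels and drops the blank symbol. A minimal sketch of greedy CTC decoding follows. The toy vocabulary, the blank id, and the `|` word delimiter are assumptions for illustration; the real vocabulary ships with the model.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one comes from the model files.
BLANK_ID = 0
VOCAB = {0: "<pad>", 1: "|", 2: "A", 3: "S", 4: "K"}

def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame labels, drop blanks, map ids to text."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    chars = [vocab[i] for i in collapsed if i != blank_id]
    # "|" marks a word boundary in this toy vocabulary.
    return "".join(chars).replace("|", " ").strip()

# Per-frame argmax ids for a short utterance (made-up values).
frames = [0, 2, 2, 0, 3, 3, 3, 4, 0, 1, 0, 2, 0]
print(ctc_greedy_decode(frames))
```

Collapsing happens before blank removal, so a blank between two identical labels keeps both characters.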
Usage with CrispASR
# Uses the wav2vec2 backend. It is auto-detected from the GGUF architecture; --backend only makes the choice explicit.
crispasr --backend wav2vec2 -m data2vec-audio-base-960h-q4_k.gguf -f audio.wav
Architecture Notes
Data2Vec Audio differs from standard wav2vec2 in three ways, all handled by the converter:
- 5-layer positional convolution (vs. 1 layer for wav2vec2), each layer being Conv1d + LayerNorm (no affine) + GELU
- Global encoder LayerNorm applied before the transformer layers (vs. after for wav2vec2)
- Post-norm encoder despite using LayerNorm in the CNN (wav2vec2-large uses pre-norm)
All three are auto-detected from the HuggingFace model config and stored as GGUF metadata flags.
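The difference between the post-norm and pre-norm layouts can be sketched in plain Python. This is an illustration only: `sublayer` stands in for attention or the feed-forward block, `layer_norm` omits learned affine parameters, and none of these names come from CrispASR or the converter.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without affine parameters over a single feature vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_block(x, sublayer):
    """Post-norm: normalize after the residual add."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def pre_norm_block(x, sublayer):
    """Pre-norm: normalize the sublayer input, keep the residual path raw."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]

# Stand-in sublayer (hypothetical); a real layer would be attention or an MLP.
def double(v):
    return [2.0 * t for t in v]

x = [1.0, 2.0, 4.0]
post = post_norm_block(x, double)
pre = pre_norm_block(x, double)
print(post)  # normalized output: mean close to 0
print(pre)   # residual path keeps the input's mean
```

In the post-norm layout every block output is normalized, while in the pre-norm layout the unnormalized residual accumulates across layers, which is why a runtime has to know which layout the checkpoint uses.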
Files
| File | Size | JFK Transcription |
|---|---|---|
| data2vec-audio-base-960h-f16.gguf | 196 MB | perfect |
| data2vec-audio-base-960h-q4_k.gguf | 79 MB | perfect |
| data2vec-audio-base-960h-q8_0.gguf | 120 MB | perfect |
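
For a rough sense of what each quantization saves, the sizes in the table can be compared against the f16 file and against the ~95M parameter count. A small sketch, assuming MB means 10^6 bytes and that every parameter is stored in the file; the figures are approximate and include any GGUF metadata overhead.

```python
# File sizes in MB, copied from the table.
SIZES_MB = {"f16": 196, "q8_0": 120, "q4_k": 79}
PARAMS = 95e6  # approximate parameter count from the model details

def size_report(sizes_mb, params):
    """Return {variant: (ratio to f16, approx bits per parameter)}."""
    base = sizes_mb["f16"]
    report = {}
    for name, mb in sizes_mb.items():
        ratio = mb / base
        bits = mb * 1e6 * 8 / params
        report[name] = (round(ratio, 2), round(bits, 1))
    return report

for name, (ratio, bits) in size_report(SIZES_MB, PARAMS).items():
    print(f"{name}: {ratio:.2f}x of f16, ~{bits:.1f} bits/param")
```

These are ballpark numbers, since the parameter count itself is only approximate.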
Accuracy
Tested on an 11-second clip from the JFK inaugural address:
AND SO A MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU
ASK WHAT YOU CAN DO FOR YOUR COUNTRY
Identical to the Python HuggingFace reference output. All quantized variants produce the same transcription.
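One way to check that a quantized file matches the reference is to compute the word error rate between two transcripts. A minimal sketch using word-level edit distance; the function name `wer` and the misspelled sample hypothesis are illustrative and not part of CrispASR.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

reference = "ASK WHAT YOU CAN DO FOR YOUR COUNTRY"
print(wer(reference, reference))                              # identical transcripts
print(wer(reference, "ASK WHAT YOU CAN DO FOR YOUR COUNTY"))  # one substitution
```

A result of 0.0 between the reference output and a quantized variant's output corresponds to the "same transcription" claim.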
Citation
@inproceedings{baevski2022data2vec,
title={data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language},
author={Baevski, Alexei and Hsu, Wei-Ning and Xu, Qiantong and Babu, Arun and Gu, Jiatao and Auli, Michael},
booktitle={ICML},
year={2022}
}