Data2Vec Audio (GGUF)

GGUF conversion of facebook/data2vec-audio-base-960h for use with CrispASR.

Model Details

  • Architecture: Data2Vec Audio โ€” wav2vec2-style CNN (7L, 512-dim) + 12-layer transformer (768-dim, 12 heads) + CTC head
  • Parameters: ~95M
  • Training: Self-supervised pre-training on LibriSpeech 960h, fine-tuned with CTC loss
  • Language: English only
  • License: Apache 2.0
  • WER: 1.89% (LibriSpeech test-clean), 4.07% (test-other)

Usage with CrispASR

# Uses the wav2vec2 backend (auto-detected from GGUF architecture)
crispasr --backend wav2vec2 -m data2vec-audio-base-960h-q4_k.gguf -f audio.wav

Architecture Notes

Data2Vec Audio differs from standard wav2vec2 in three ways handled by the converter:

  1. 5-layer positional convolution (vs 1 for wav2vec2), each with Conv1d + LayerNorm(no affine) + GELU
  2. Global encoder LayerNorm BEFORE transformer layers (vs after for wav2vec2)
  3. POST-norm encoder despite using LayerNorm in CNN (wav2vec2-large uses pre-norm)

All three are auto-detected from the HuggingFace model config and stored as GGUF metadata flags.

Files

File Size JFK Transcription
data2vec-audio-base-960h-f16.gguf 196 MB perfect
data2vec-audio-base-960h-q4_k.gguf 79 MB perfect
data2vec-audio-base-960h-q8_0.gguf 120 MB perfect

Accuracy

Tested on JFK inaugural address (11s):

AND SO A MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU
ASK WHAT YOU CAN DO FOR YOUR COUNTRY

Identical to the Python HuggingFace reference output. All quantized variants produce the same transcription.

Citation

@inproceedings{baevski2022data2vec,
  title={data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language},
  author={Baevski, Alexei and Hsu, Wei-Ning and Xu, Qiantong and Babu, Arun and Gu, Jiatao and Auli, Michael},
  booktitle={ICML},
  year={2022}
}
Downloads last month
97
GGUF
Model size
93.9M params
Architecture
wav2vec2
Hardware compatibility
Log In to add your hardware

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for cstr/data2vec-audio-960h-GGUF

Quantized
(1)
this model