CommonLID
Common Crawl's language identification benchmark, sampled from real-world web text and human-validated across hundreds of language varieties.
Reference • License: common-crawl-tou • Main score: macro_f1
commonlid (sorted by Macro F1)
| Model | Macro F1 | Micro F1 | Mean FPR (%) | Languages | Samples/s |
|---|---|---|---|---|---|
| OpenLID-v2 | 60.4 | 70.4 | 0.07 | 1456 | 60383.4 |
- Model: Identifier of the language identification model.
- Macro F1: Unweighted mean of per-language F1 (x100), averaged over languages with at least one gold sample in this dataset (paper / gold-only definition). Higher is better. This is the headline ranking column (see the sketch after this list).
- Micro F1: Sample-weighted F1 (x100), i.e. pooled correct / pooled predictions across all gold samples. Less affected by rare languages than macro F1. Higher is better.
- Mean FPR (%): Mean per-language false-positive rate (paper-style), i.e. how often the model labels a non-target sentence as the target language. Lower is better.
- Languages: Number of distinct languages the model emitted on this dataset (set(gold) | set(pred)). Reflects the model's output vocabulary on this test, not the gold language count.
- Samples/s: Throughput during evaluation (samples processed per second). Hardware-dependent; useful for relative comparison only.
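The aggregate columns can be reproduced from two parallel label lists. The sketch below is a hedged reading of the definitions above, not the benchmark's reference implementation; in particular it takes the per-language table's FP / (FP + TN_correct_other) denominator (defined further down) literally, counting as true negatives only the samples of other languages that were themselves classified correctly.

```python
from collections import Counter

def aggregate_metrics(gold, pred):
    """Aggregate row (scores x100) from parallel gold/predicted ISO 639-3 lists.

    Function name and structure are illustrative, not the benchmark's code.
    """
    assert len(gold) == len(pred)
    gold_counts = Counter(gold)                      # GT per language
    pred_counts = Counter(pred)                      # Predictions per language
    correct = Counter(g for g, p in zip(gold, pred) if g == p)

    f1s, fprs = [], []
    for lang in gold_counts:                         # gold-only / paper definition
        tp = correct[lang]
        precision = tp / pred_counts[lang] if pred_counts[lang] else 0.0
        recall = tp / gold_counts[lang]
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        f1s.append(f1)
        fp = pred_counts[lang] - tp                  # other-language samples labelled `lang`
        # Assumed reading of the paper-style denominator: true negatives are
        # samples of *other* languages that were classified correctly.
        tn = sum(c for g, c in correct.items() if g != lang)
        fprs.append(fp / (fp + tn) if fp + tn else 0.0)

    macro_f1 = 100 * sum(f1s) / len(f1s)
    micro_f1 = 100 * sum(correct.values()) / len(gold)   # pooled correct / pooled total
    mean_fpr = 100 * sum(fprs) / len(fprs)
    return macro_f1, micro_f1, mean_fpr
```

Given the full gold/pred lists for a run, this yields the Macro F1, Micro F1 and Mean FPR (%) columns; Languages is simply len(set(gold) | set(pred)).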
Click a row to load per-language metrics.
| Language | F1 | Precision | Recall | FPR (%) | GT | Predictions | Correct |
|---|---|---|---|---|---|---|---|
- Language: ISO 639-3 code of the gold and/or predicted language.
- F1: Per-language F1 score (x100). Harmonic mean of precision and recall.
- Precision: Per-language precision (x100) = correct / predictions for this language. How often the model is right when it predicts this language.
- Recall: Per-language recall (x100) = correct / gold-count for this language. How much of this language's gold set the model recovers.
- FPR (%): Paper-style false-positive rate, FP / (FP + TN_correct_other), where TN_correct_other counts samples of other languages that were classified correctly. Measures how often samples in other languages are misclassified as this one (see the sketch after this list).
- GT: Ground-truth (gold) sample count for this language.
- Predictions: Number of times the model predicted this language.
- Correct: Predictions that match the gold label.
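For a single row of this table, the same counts reduce to a few lines. Again a sketch under the same assumed FPR reading; the dict keys mirror the column names, everything else (function name, return shape) is illustrative.

```python
def language_row(lang, gold, pred):
    """One per-language table row for `lang` (scores x100)."""
    gt = sum(g == lang for g in gold)                              # GT
    predictions = sum(p == lang for p in pred)                     # Predictions
    correct = sum(g == p == lang for g, p in zip(gold, pred))      # Correct
    precision = correct / predictions if predictions else 0.0
    recall = correct / gt if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    fp = predictions - correct                                     # false positives
    # Assumed denominator, as above: correctly classified other-language samples.
    tn = sum(g == p != lang for g, p in zip(gold, pred))
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"Language": lang, "F1": 100 * f1, "Precision": 100 * precision,
            "Recall": 100 * recall, "FPR (%)": 100 * fpr,
            "GT": gt, "Predictions": predictions, "Correct": correct}
```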
CommonLID (nano)
Common Crawl's language identification benchmark, sampled from real-world web text and human-validated across hundreds of language varieties. Nano slice: stratified sample (max 1000 + min 5 per language) of the parent benchmark, with the schema normalised to (index, text, language_iso639_3); see the sketch below.
Reference • License: common-crawl-tou • Main score: macro_f1
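A rough sketch of how such a nano slice could be drawn. The cap/floor reading of "max 1000 + min 5 per language" (at most 1000 rows per language, languages with fewer than 5 gold rows dropped), the seed, and the input row format are all assumptions; only the output schema (index, text, language_iso639_3) comes from the description above.

```python
import random
from collections import defaultdict

def nano_slice(rows, cap=1000, floor=5, seed=0):
    """Stratified sample normalised to (index, text, language_iso639_3)."""
    by_lang = defaultdict(list)
    for row in rows:                     # assumed input: dicts with text + language
        by_lang[row["language_iso639_3"]].append(row)
    rng = random.Random(seed)
    sampled = []
    for lang in sorted(by_lang):         # deterministic order across runs
        lang_rows = by_lang[lang]
        if len(lang_rows) < floor:       # assumed floor: drop tiny languages
            continue
        sampled.extend(rng.sample(lang_rows, min(cap, len(lang_rows))))
    # Re-index after sampling to produce the normalised schema
    return [{"index": i, "text": r["text"],
             "language_iso639_3": r["language_iso639_3"]}
            for i, r in enumerate(sampled)]
```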
commonlid_nano (sorted by Macro F1)
| Model | Macro F1 | Micro F1 | Mean FPR (%) | Languages | Samples/s |
|---|---|---|---|---|---|
| GPT-4o-mini | 72.5 | 77.3 | 0.24 | 130 | 358120.3 |
- Model: Identifier of the language identification model.
- Macro F1: Unweighted mean of per-language F1 (x100), averaged over languages with at least one gold sample in this dataset (paper / gold-only definition). Higher is better. This is the headline ranking column.
- Micro F1: Sample-weighted F1 (x100), i.e. pooled correct / pooled predictions across all gold samples. Less affected by rare languages than macro F1. Higher is better.
- Mean FPR (%): Mean per-language false-positive rate (paper-style), i.e. how often the model labels a non-target sentence as the target language. Lower is better.
- Languages: Number of distinct languages the model emitted on this dataset (set(gold) | set(pred)). Reflects the model's output vocabulary on this test, not the gold language count.
- Samples/s: Throughput during evaluation (samples processed per second). Hardware-dependent; useful for relative comparison only.
Click a row to load per-language metrics.
| Language | F1 | Precision | Recall | FPR (%) | GT | Predictions | Correct |
|---|---|---|---|---|---|---|---|
- Language: ISO 639-3 code of the gold and/or predicted language.
- F1: Per-language F1 score (x100). Harmonic mean of precision and recall.
- Precision: Per-language precision (x100) = correct / predictions for this language. How often the model is right when it predicts this language.
- Recall: Per-language recall (x100) = correct / gold-count for this language. How much of this language's gold set the model recovers.
- FPR (%): Paper-style false-positive rate, FP / (FP + TN_correct_other), where TN_correct_other counts samples of other languages that were classified correctly. Measures how often samples in other languages are misclassified as this one.
- GT: Ground-truth (gold) sample count for this language.
- Predictions: Number of times the model predicted this language.
- Correct: Predictions that match the gold label.
Source: commoncrawl/commonlid-results @ HEAD.