CommonLID Leaderboard

Results for the CommonLID and CommonLID-nano benchmarks. Headline metric: macro F1. Models are ranked by macro F1 within each tab; click a row to see per-language metrics.

๐Ÿ“ Blog post โ€ข ๐Ÿ“„ Paper

CommonLID

Common Crawl's language identification benchmark, sampled from real-world web text and human-validated across hundreds of language varieties.

Reference • License: common-crawl-tou • Main score: macro_f1

commonlid — sorted by Macro F1

| Model      | Macro F1 | Micro F1 | Mean FPR (%) | Languages | Samples/s |
|------------|----------|----------|--------------|-----------|-----------|
| OpenLID-v2 | 60.4     | 70.4     | 0.07         | 1456      | 60383.4   |
  • Model — Identifier of the language identification model.
  • Macro F1 — Unweighted mean of per-language F1 (×100), averaged over languages with at least one gold sample in this dataset (the paper's gold-only definition; see the sketch after this list). Higher is better. This is the headline ranking column.
  • Micro F1 — Sample-weighted F1 (×100): pooled correct predictions divided by pooled total predictions across all gold samples. Less affected by rare languages than macro F1. Higher is better.
  • Mean FPR (%) — Mean per-language false-positive rate (paper-style): how often the model labels a sample of another language as the target language. Lower is better.
  • Languages — Number of distinct languages in the union of gold and predicted labels (set(gold) | set(pred)). Reflects the label space exercised on this test, not just the gold language count.
  • Samples/s — Throughput during evaluation (samples processed per second). Hardware-dependent; useful for relative comparison only.
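
A minimal sketch of how these columns can be computed from two parallel lists of gold and predicted labels. The function name and structure are assumptions for illustration, not the benchmark's actual implementation; only the formulas come from the definitions above.

```python
from collections import Counter

def leaderboard_metrics(gold, pred):
    """Hypothetical helper: reproduce the table's columns from parallel
    gold/pred label lists, following the column definitions above."""
    n = len(gold)
    gt = Counter(gold)                                   # gold samples per language
    preds = Counter(pred)                                # predictions per language
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    total_correct = sum(correct.values())

    f1s, fprs = [], []
    for lang in gt:                                      # gold-only / paper definition
        tp = correct[lang]
        precision = tp / preds[lang] if preds[lang] else 0.0
        recall = tp / gt[lang]
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        fp = preds[lang] - tp                            # other-language samples labeled `lang`
        tn = total_correct - tp                          # correctly classified samples of other languages
        fprs.append(fp / (fp + tn) if fp + tn else 0.0)

    return {
        "macro_f1": 100 * sum(f1s) / len(f1s),
        "micro_f1": 100 * total_correct / n,             # pooled correct / pooled predictions
        "mean_fpr_pct": 100 * sum(fprs) / len(fprs),
        "languages": len(set(gold) | set(pred)),
    }

# Example: one confusion ("eng" predicted as "fra") among four samples.
print(leaderboard_metrics(["eng", "fra", "eng", "deu"],
                          ["eng", "fra", "fra", "deu"]))
```

Because macro F1 averages over languages rather than samples, a model that does well on a handful of high-resource languages but poorly on rare ones is penalized more here than under micro F1.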

Click a row to load per-language metrics.

| Language | F1 | Precision | Recall | FPR (%) | GT | Predictions | Correct |
|----------|----|-----------|--------|---------|----|-------------|---------|
  • Language — ISO 639-3 code of the gold and/or predicted language.
  • F1 — Per-language F1 score (×100): the harmonic mean of precision and recall (see the sketch after this list).
  • Precision — Per-language precision (×100) = correct / predictions for this language: how often the model is right when it predicts this language.
  • Recall — Per-language recall (×100) = correct / GT for this language: how much of this language's gold set the model recovers.
  • FPR (%) — Paper-style false-positive rate: FP / (FP + TN), where FP counts samples of other languages labeled as this one and TN counts samples of other languages the model classified correctly.
  • GT — Ground-truth sample count for this language.
  • Predictions — Number of times the model predicted this language.
  • Correct — Predictions that match the gold label.
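
The per-language row reduces to three counts (GT, Predictions, Correct); everything else is derived from them. A minimal sketch, again with a hypothetical helper name and the formulas taken from the definitions above:

```python
def per_language_row(lang, gold, pred):
    """Hypothetical helper: build one row of the per-language table for
    `lang` from parallel gold/pred label lists."""
    gt = sum(g == lang for g in gold)                       # GT
    predictions = sum(p == lang for p in pred)              # Predictions
    correct = sum(p == lang and g == lang
                  for g, p in zip(gold, pred))              # Correct
    precision = correct / predictions if predictions else 0.0
    recall = correct / gt if gt else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fp = predictions - correct                              # other-language samples labeled `lang`
    tn = sum(g == p for g, p in zip(gold, pred)) - correct  # correct on other languages
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"Language": lang,
            "F1": round(100 * f1, 1),
            "Precision": round(100 * precision, 1),
            "Recall": round(100 * recall, 1),
            "FPR (%)": round(100 * fpr, 2),
            "GT": gt, "Predictions": predictions, "Correct": correct}

print(per_language_row("eng",
                       ["eng", "fra", "eng", "deu"],
                       ["eng", "fra", "fra", "deu"]))
```

Note that precision and FPR respond to different failure modes: precision drops when this language's predictions are wrong, while FPR rises when samples of other languages are pulled into this label, so a rare language can show a near-zero FPR even with poor precision.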