AniFileBERT
AniFileBERT is a lightweight BERT token-classification parser for anime release filenames. It parses common release names into structured fields: release group, title, season, episode, resolution, source, and special tags.
This repository is the Hugging Face model repo used by MiruPlay as tools/anime_parser.
Model Details
| Item | Value |
|---|---|
| Architecture | BertForTokenClassification |
| Tokenizer | Custom character tokenizer in anifilebert/tokenizer.py |
| Parameters | 4,783,631 |
| Hidden size | 256 |
| Layers | 4 |
| Attention heads | 8 |
| Max sequence length | 128 |
| Labels | 37 BIO labels, schema v2: language-aware TITLE_*, path-aware PATH_TITLE_*/PATH_SEASON, plus EPISODE, GROUP, RESOLUTION, SOURCE, SPECIAL, TAG |
| Default checkpoint | Repository root files (config.json, model.safetensors, vocab.json, tokenizer_config.json) |
| ONNX export | exports/anime_filename_parser.onnx |
| Training lineage | reports/training_lineage.json |
The repository root is the published checkpoint; the old duplicate copy under model/ is no longer kept. The default parser is model logits + constrained BIO decoding + thin field normalization, and heavy structural rules are not enabled by default. Calling from_pretrained() directly loads only the token-classification weights.
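The "constrained BIO" step can be pictured with a small example. This is a minimal sketch, not the repository's decoder: it assumes a greedy left-to-right pass in which an I-X label is only allowed directly after B-X or I-X, and the five-label subset and the scores are made up for illustration.

```python
from typing import List, Sequence


def constrained_bio_decode(logits: Sequence[Sequence[float]], labels: List[str]) -> List[str]:
    """Greedy decode that never emits I-X unless the previous tag is B-X or I-X."""
    decoded: List[str] = []
    prev = "O"
    for row in logits:
        best_label, best_score = "O", float("-inf")
        for label, score in zip(labels, row):
            if label.startswith("I-"):
                entity = label[2:]
                if prev not in (f"B-{entity}", f"I-{entity}"):
                    continue  # illegal transition under the BIO scheme
            if score > best_score:
                best_label, best_score = label, score
        decoded.append(best_label)
        prev = best_label
    return decoded


# Hypothetical label subset and scores (the real model has 37 labels).
LABELS = ["O", "B-GROUP", "I-GROUP", "B-EPISODE", "I-EPISODE"]
toy_logits = [
    [0.1, 2.0, 0.3, 0.0, 0.0],  # argmax: B-GROUP
    [0.1, 0.2, 1.5, 0.0, 0.0],  # argmax: I-GROUP (legal after B-GROUP)
    [0.2, 0.0, 0.1, 0.5, 3.0],  # argmax: I-EPISODE, illegal here, so B-EPISODE wins
    [0.0, 0.0, 0.0, 0.1, 2.0],  # argmax: I-EPISODE (legal after B-EPISODE)
]
decoded_tags = constrained_bio_decode(toy_logits, LABELS)
print(decoded_tags)
```

A greedy pass is only one way to enforce the constraint; a Viterbi search over the same transition rules would be another.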
Intended Use
- Parse anime release filenames for media library scraping, classification, search, and display.
- Covers common layouts: [GROUP] TITLE - EP [META], dotted S01E07, Chinese animation multi-bracket titles, BD extras such as NCOP/NCED/IV05, long-running episode numbers, and second-season aliases.
- Not intended as a general natural-language NER model; this is a structured filename parsing task.
Install
uv sync
If the dataset submodule is missing:
git submodule update --init --recursive
Quick Start
Run the Python parser:
uv run python -m anifilebert.inference --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
Expected output:
{"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}
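Downstream code will usually want this JSON as a typed record. The sketch below is an illustration only: the ParsedFilename class is hypothetical (not part of the repository), and the field types are inferred from the single example above, where season and special are null and episode is a number.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedFilename:
    """Hypothetical container mirroring the keys of the parser's JSON output."""
    title: Optional[str]
    season: Optional[int]
    episode: Optional[int]
    group: Optional[str]
    resolution: Optional[str]
    source: Optional[str]
    special: Optional[str]


raw = (
    '{"title":"神印王座","season":null,"episode":200,"group":"GM-Team",'
    '"resolution":"1080P","source":"GB","special":null}'
)
parsed = ParsedFilename(**json.loads(raw))
print(parsed.title, parsed.episode, parsed.group)
```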
Load the raw Transformers model:
from transformers import BertForTokenClassification
model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
For complete field parsing, clone this repository and use python -m anifilebert.inference, because the tokenizer and postprocessing are custom.
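To give a feel for what the custom postprocessing involves, here is a minimal sketch of turning per-character BIO tags into field strings. It is not the code in anifilebert: the function name, the toy tags, and the integer conversion at the end are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple


def aggregate_entities(text: str, tags: List[str]) -> List[Tuple[str, str]]:
    """Join each B-X tag and the I-X tags that directly follow it into one (type, string) span."""
    spans: List[List[str]] = []
    open_type: Optional[str] = None
    for ch, tag in zip(text, tags):
        if tag.startswith("B-"):
            open_type = tag[2:]
            spans.append([open_type, ch])
        elif tag.startswith("I-") and tag[2:] == open_type:
            spans[-1][1] += ch
        else:
            open_type = None  # an O tag (or a stray I-X) closes the current span
    return [(entity, value) for entity, value in spans]


# Toy input: one tag per character, as a character tokenizer would produce.
text = "[GM] 07"
tags = ["O", "B-GROUP", "I-GROUP", "O", "O", "B-EPISODE", "I-EPISODE"]
entities = aggregate_entities(text, tags)
print(entities)

# Thin normalization, assumed here to mean converting the episode string to a number.
episode = int(dict(entities)["EPISODE"])
print(episode)
```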
ONNX Usage
The ONNX graph outputs token logits only. A complete parser still needs:
- custom character tokenization,
- constrained BIO decoding,
- field aggregation and thin string/number normalization.
This repository provides a minimal runnable example:
uv run python -m tools.onnx_inference "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
Static graph shapes:
- input_ids: int64[1,128]
- attention_mask: int64[1,128]
- logits: float32[1,128,37]
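Because the graph shapes are static, every filename has to be encoded to exactly 128 positions before it reaches the ONNX session. The sketch below shows one way to do that with a character vocabulary. It is an assumption-laden illustration: the [PAD] and [UNK] token names, the toy vocabulary, the absence of [CLS]/[SEP] tokens, and right-side truncation are all hypothetical, and the real behavior is defined by anifilebert/tokenizer.py.

```python
from typing import Dict, List, Tuple

MAX_LEN = 128  # static sequence length of the exported graph


def encode_chars(
    text: str, vocab: Dict[str, int], max_len: int = MAX_LEN
) -> Tuple[List[List[int]], List[List[int]]]:
    """Map each character to an id, then truncate or pad to a [1, max_len] batch."""
    pad_id, unk_id = vocab["[PAD]"], vocab["[UNK]"]  # assumed special-token names
    ids = [vocab.get(ch, unk_id) for ch in text][:max_len]
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return [ids + [pad_id] * padding], [mask + [0] * padding]


# Toy vocabulary for illustration; the published char vocab has 6199 entries.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[": 2, "]": 3, "A": 4, "B": 5}
input_ids, attention_mask = encode_chars("[AB]x", toy_vocab)
print(input_ids[0][:8], sum(attention_mask[0]))
```

Before calling an inference session, both lists would need to be converted to int64 arrays to match the declared input types.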
More details: docs/onnx.md and docs/android.md.
Evaluation
Current published checkpoint:
| Metric | Value |
|---|---|
| Fixed regression, model-only | 28/28 full match = 100% |
| Fixed regression, default thin runtime | 28/28 full match = 100% |
| Held-out parse, model-only | 2046/2048 full match = 99.90% |
| Held-out parse, default thin runtime | 2046/2048 full match = 99.90% |
| Token/entity eval | F1 0.9999, token accuracy 0.99996 |
| ONNX parity | max abs diff 4.8637e-05 |
| CPU thin-runtime latency | ONNX avg 12.18 ms, P95 14.45 ms |
The published checkpoint is a 37-label schema v2 character-tokenizer model whose final weights come from a repaired hard-focus encoded-cache training run; see reports/training_lineage.json for details. The headline quality numbers in this README are the model-only and default thin normalized-only results. The older structural rule-assist layer has been removed and is no longer used at runtime or as a quality baseline.
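The "full match" figures can be read as follows. This is a sketch under the assumption that a case counts as a match only when every parsed field equals the reference; the function name and toy records are hypothetical, and the authoritative definition is whatever tools.evaluate_parser_cases implements.

```python
from typing import Dict, List, Optional, Union

Fields = Dict[str, Optional[Union[str, int]]]


def full_match_accuracy(predictions: List[Fields], references: List[Fields]) -> float:
    """Fraction of cases where the predicted record equals the reference in every field."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    matches = sum(1 for pred, ref in zip(predictions, references) if pred == ref)
    return matches / len(references)


# Toy records: the second prediction gets the episode wrong, so it is not a full match.
refs = [{"title": "A", "episode": 1}, {"title": "B", "episode": 2}]
preds = [{"title": "A", "episode": 1}, {"title": "B", "episode": 3}]
toy_accuracy = full_match_accuracy(preds, refs)
print(toy_accuracy)

# The held-out figure from the table: 2046 of 2048 cases.
print(f"{2046 / 2048:.2%}")  # 99.90%
```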
Run regression:
uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
Performance
Benchmark command:
uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
Recorded local CPU benchmark on the 28-case fixed regression set, single-threaded, using the default thin runtime. Timings cover tokenization, the model/session forward pass, constrained BIO decoding, entity aggregation, and light string/number normalization:
| Backend | Load ms | Avg ms | P50 ms | P95 ms | P99 ms | files/s |
|---|---|---|---|---|---|---|
| PyTorch | 44.04 | 13.50 | 13.05 | 17.36 | 21.38 | 74.1 |
| ONNX Runtime | 45.64 | 12.18 | 11.97 | 14.45 | 15.96 | 82.1 |
This is end-to-end latency for the complete thin parser, not model-forward-only timing. Mobile implementations should reuse the ONNX session and keep tokenizer, BIO decoding, and thin-normalization behavior aligned with this runtime.
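The columns in the table relate to each other in a simple way, which the sketch below illustrates. It assumes nearest-rank percentiles and files/s computed as 1000 divided by the average latency in milliseconds; the second assumption is consistent with both table rows, but the exact percentile method used by tools.benchmark_inference is not stated here, and the function name is hypothetical.

```python
import math
import statistics
from typing import Dict, List


def summarize_latencies(latencies_ms: List[float]) -> Dict[str, float]:
    """Summarize per-file latencies: mean, nearest-rank percentiles, and throughput."""
    ordered = sorted(latencies_ms)

    def percentile(p: float) -> float:
        rank = max(1, math.ceil(p / 100 * len(ordered)))  # nearest-rank method
        return ordered[rank - 1]

    avg = statistics.fmean(ordered)
    return {
        "avg_ms": avg,
        "p50_ms": percentile(50),
        "p95_ms": percentile(95),
        "p99_ms": percentile(99),
        "files_per_s": 1000.0 / avg,
    }


summary = summarize_latencies([10.0, 20.0, 30.0, 40.0])
print(summary)

# Cross-check against the table: average latency to throughput.
print(round(1000 / 13.50, 1), round(1000 / 12.18, 1))  # 74.1 82.1
```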
Training
Training uses the dataset submodule at datasets/AnimeName.
Current release training uses schema v2 JSONL plus a Rust pre-encoded cache on the Windows RTX 5070 Ti worker.
The hard-focus JSONL and encoded-cache paths under data/ are generated local
artifacts and are intentionally ignored; regenerate them from the authoritative
dataset and schema before rerunning this exact training command.
.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
--data-file data/schema_v2_hard_focus_char_seed63.jsonl `
--vocab-file datasets/AnimeName/vocab.char.json `
--encoded-cache-dir data/encoded_cache/schema_v2_hard_focus_char_seed63_split995_repaired_v2 `
--save-dir checkpoints/ablation-schema-v2-hardfocus-cache-repaired-from-baseline-seed62-10epoch-rerun `
--init-model-dir checkpoints/ablation-schema-v2-baseline8h4l-cache-10epoch-seed62/final `
--epochs 10 `
--batch-size 512 `
--learning-rate 0.00004 `
--warmup-steps 120 `
--max-seq-length 128 `
--train-split 0.995 `
--checkpoint-steps 1000 `
--save-total-limit 3 `
--parse-eval-limit 2048 `
--case-eval-file data/parser_regression_cases.json `
--bf16 `
--no-periodic-eval `
--perf-log-steps 100 `
--perf-sample-interval 1.0 `
--seed 63 `
--experiment-name ablation-schema-v2-hardfocus-cache-repaired-from-baseline-seed62-10epoch-rerun
Build or rebuild encoded caches with tools/encoded_dataset_cache after changing
the JSONL, vocab, label schema, max length, split ratio, or seed. See
docs/training.md for the full cache-first flow and
synthetic augmentation follow-ups.
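One way to reason about when a cache is stale is to fingerprint everything it depends on. The sketch below is purely illustrative and is not how tools/encoded_dataset_cache works: the function name, the choice of SHA-256, and the settings layout are all assumptions. It only demonstrates that a change to any of the listed inputs yields a different key.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def cache_fingerprint(
    jsonl_path: str,
    vocab_path: str,
    labels: List[str],
    max_seq_length: int,
    train_split: float,
    seed: int,
) -> str:
    """Hash the data file, the vocab file, and the encoding settings into one short key."""
    digest = hashlib.sha256()
    for path in (jsonl_path, vocab_path):
        with Path(path).open("rb") as handle:
            # Read in 1 MiB chunks so large JSONL files are not loaded at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    settings = {
        "labels": labels,
        "max_seq_length": max_seq_length,
        "train_split": train_split,
        "seed": seed,
    }
    digest.update(json.dumps(settings, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:16]
```

If the fingerprint stored next to a cache directory differs from a freshly computed one, the cache would need to be rebuilt.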
python -m anifilebert.train writes:
- Hugging Face checkpoints under --save-dir,
- final/run_metadata.json,
- final/trainer_eval_metrics.json,
- final/parse_eval_metrics.json,
- final/case_metrics.json unless --no-case-eval is used,
- final/perf_metrics.json when --perf-log-steps is set,
- TensorBoard logs unless --no-tensorboard is used.
Full workflow: docs/training.md.
Dataset
Authoritative dataset snapshot:
datasets/AnimeName/dmhy_weak.jsonl
datasets/AnimeName/dmhy_weak_char.jsonl
datasets/AnimeName/vocab.json
datasets/AnimeName/vocab.char.json
Current snapshot:
- rows: 759738
- failed relabel rows: 0
- strict BIO violations: 0
- character vocab: 6199
- character coverage: 99.9952% with the published 6199-token char vocab
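A coverage figure like the one above can be computed with a few lines. The sketch assumes coverage is measured over character occurrences (not distinct characters) across all filenames; that definition, the function name, and the toy data are assumptions, and the dataset repository's own tooling is authoritative.

```python
from collections import Counter
from typing import Iterable, Set


def character_coverage(filenames: Iterable[str], vocab: Set[str]) -> float:
    """Share of character occurrences in the filenames that the vocabulary can represent."""
    counts = Counter(ch for name in filenames for ch in name)
    total = sum(counts.values())
    covered = sum(n for ch, n in counts.items() if ch in vocab)
    return covered / total if total else 1.0


# Toy data: six characters in total, one of which ("d") is missing from the vocab.
toy_coverage = character_coverage(["abc", "abd"], {"a", "b", "c"})
print(f"{toy_coverage:.4%}")
```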
datasets/AnimeName is a nested dataset repository. After updating the data, commit and push the dataset repository first, then commit the updated submodule pointer in this model repository.
Repository Layout
config.json
model.safetensors
tokenizer_config.json
vocab.json
training_args.bin
anifilebert/
tools/
data/parser_regression_cases.json
datasets/AnimeName/
exports/anime_filename_parser.onnx
docs/
reports/
Maintenance
See docs/maintenance.md for release steps, LFS order, dataset submodule updates, and MiruPlay integration notes.
Limitations
- Anime release names are not standardized; extreme OCR noise, mojibake, or non-anime names can still fail.
- The ONNX export contains model logits only, not the tokenizer, BIO decoding, or thin field normalization. Mobile runtimes must keep tokenizer, vocabulary, and config in sync with this repository.
- source is currently a single-valued field, while complex filenames may contain platform, release source, encoder, and language tags together.