AniFileBERT

AniFileBERT is a lightweight BERT token-classification parser for anime release filenames. It extracts structured fields from common release names: release group, title, season, episode, resolution, source, and special tags.

This repository is the Hugging Face model repo used by MiruPlay as tools/anime_parser.

Model Details

Item	Value
Architecture	`BertForTokenClassification`
Tokenizer	Custom character tokenizer in `anifilebert/tokenizer.py`
Parameters	4,783,631
Hidden size	256
Layers	4
Attention heads	8
Max sequence length	128
Labels	37 BIO labels, schema v2: language-aware `TITLE_`, path-aware `PATH_TITLE_`/`PATH_SEASON`, plus `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, `SPECIAL`, `TAG`
Default checkpoint	Repository root files (`config.json`, `model.safetensors`, `vocab.json`, `tokenizer_config.json`)
ONNX export	`exports/anime_filename_parser.onnx`
Training lineage	`reports/training_lineage.json`

The repository root is the published checkpoint; the old duplicate copy under model/ is no longer kept. The default parsing path is model logits + constrained BIO decoding + thin field normalization; heavy structural rules are not enabled by default. Calling from_pretrained() directly loads only the token-classification weights.

Intended Use

Parse anime release filenames for media library scraping, classification, search, and display.
Covers common layouts: [GROUP] TITLE - EP [META], dotted S01E07, Chinese animation titles with multiple brackets, BD extras such as NCOP/NCED/IV05, long-running episode numbers, and second-season aliases.
This is not a general natural-language NER model; it is a structured filename parser.

Install

uv sync

If the dataset submodule is missing:

git submodule update --init --recursive

Quick Start

Run the Python parser:

uv run python -m anifilebert.inference --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"

Expected output:

{"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}
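Because the parser prints a single JSON object, downstream code can read it with any JSON library. The following is a minimal sketch that assumes the output shape shown above; the variable names are illustrative and not part of the repository.

```python
import json

# Output copied verbatim from the example above.
raw = '{"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}'

fields = json.loads(raw)

# Fields the parser did not find are JSON null, which becomes None in Python.
season = fields["season"]

# Build a simple display label from the parsed title and episode number.
label = fields["title"] + " - " + str(fields["episode"]).zfill(3)
print(label)
```

Checking for None before formatting a field such as season avoids printing the literal text "None" in a media library UI.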

Load the raw Transformers model:

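from transformers import BertForTokenClassification

# Loads the token-classification weights only; the custom tokenizer and postprocessing are not included.
model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")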

For complete field parsing, clone this repository and use python -m anifilebert.inference, because the tokenizer and postprocessing are custom.

ONNX Usage

The ONNX graph outputs token logits only. A complete parser still needs:

custom character tokenization,
constrained BIO decoding,
field aggregation and thin string/number normalization.
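To illustrate the constrained BIO decoding step listed above, here is a small greedy sketch. It is an assumption about how such a constraint can be enforced, not the repository's actual decoder (which may, for example, use a Viterbi-style search), and the five-label set is a toy stand-in for the real 37-label schema.

```python
def constrained_bio_decode(logits, labels):
    """Greedy decode that never emits I-X unless the previous label is B-X or I-X."""
    decoded = []
    prev = "O"
    for row in logits:
        # Rank label indices by score, best first.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        for idx in order:
            label = labels[idx]
            if label.startswith("I-"):
                entity = label[2:]
                if prev not in ("B-" + entity, "I-" + entity):
                    continue  # illegal transition, fall back to the next-best label
            decoded.append(label)
            prev = label
            break  # "O" and every "B-" label are always legal, so a label is always found
    return decoded


# Toy label set and scores (hypothetical, for illustration only).
labels = ["O", "B-GROUP", "I-GROUP", "B-EPISODE", "I-EPISODE"]
logits = [
    [0.1, 0.2, 2.0, 0.0, 0.0],  # I-GROUP scores highest but cannot start a span
    [0.0, 0.1, 1.5, 0.0, 0.0],  # I-GROUP is legal after B-GROUP
    [3.0, 0.0, 0.0, 0.0, 0.0],  # O
    [0.0, 0.0, 0.0, 0.2, 1.0],  # I-EPISODE after O is illegal, so B-EPISODE wins
]
decoded = constrained_bio_decode(logits, labels)
print(decoded)
```

The point of the constraint is that every entity span starts with a B- label, so the later field aggregation step never sees an orphaned I- token.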

This repository provides a minimal runnable example:

uv run python -m tools.onnx_inference "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"

Static graph shapes:

input_ids: int64[1,128]
attention_mask: int64[1,128]
logits: float32[1,128,37]
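Because the graph shapes are static, every input has to be truncated or padded to exactly 128 positions before inference. The sketch below shows one way to do that for a character tokenizer. The toy vocabulary, the pad and unknown ids, and the absence of special tokens are all assumptions; the real behavior is defined by anifilebert/tokenizer.py and the published vocab.

```python
MAX_LEN = 128  # matches the static graph shape above


def encode_fixed(text, vocab, max_len=MAX_LEN, pad_id=0, unk_id=1):
    """Map characters to ids, then truncate or pad to exactly max_len."""
    ids = [vocab.get(ch, unk_id) for ch in text][:max_len]
    mask = [1] * len(ids)  # 1 marks real characters
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# Hypothetical toy vocabulary, not the published vocab.json.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "E": 2, "P": 3, "0": 4, "1": 5}
input_ids, attention_mask = encode_fixed("EP01x", toy_vocab)
print(len(input_ids), input_ids[:6], sum(attention_mask))
```

Before being fed to the ONNX session, both lists would be wrapped as int64 arrays with a leading batch dimension of 1 to match the int64[1,128] inputs.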

More details: docs/onnx.md and docs/android.md.

Evaluation

Current published checkpoint:

Metric	Value
Fixed regression, model-only	28/28 full match = `100%`
Fixed regression, default thin runtime	28/28 full match = `100%`
Held-out parse, model-only	2046/2048 full match = `99.90%`
Held-out parse, default thin runtime	2046/2048 full match = `99.90%`
Token/entity eval	F1 `0.9999`, token accuracy `0.99996`
ONNX parity	max abs diff `4.8637e-05`
CPU thin-runtime latency	ONNX avg `12.18 ms`, P95 `14.45 ms`
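The full-match percentages follow directly from the counts in the table. The sketch below assumes that "full match" means every parsed field equals the reference; the exact criterion used by tools.evaluate_parser_cases may differ, and the function name is illustrative.

```python
def full_match_rate(predictions, references):
    """Share of cases where the whole parsed record equals the reference record."""
    hits = sum(1 for p, r in zip(predictions, references) if p == r)
    return hits / len(references)


# Reproducing the held-out figure from the counts reported in the table.
held_out = 2046 / 2048
print(f"{held_out:.2%}")  # 99.90%

# Toy usage: one of two records differs in a single field, so only half match fully.
preds = [{"title": "A", "episode": 1}, {"title": "B", "episode": 2}]
refs = [{"title": "A", "episode": 1}, {"title": "B", "episode": 3}]
toy_rate = full_match_rate(preds, refs)
```

Under this definition a record with one wrong field counts as a complete miss, which makes the metric stricter than token accuracy.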

The published checkpoint is a 37-label schema v2 character-tokenizer model whose final weights come from a repaired hard-focus encoded-cache training run. See reports/training_lineage.json for details. The README quality numbers are based on the model-only and default thin normalized-only runtimes; the old structural rule assist layer has been removed and no longer serves as a runtime or a quality baseline.

Run regression:

uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json

Performance

Benchmark command:

uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json

Recorded local CPU benchmark on the 28-case fixed regression set, single-threaded, using the default thin runtime (tokenization, model/session forward pass, constrained BIO decoding, entity aggregation, and light string/number normalization):

Backend	Load ms	Avg ms	P50 ms	P95 ms	P99 ms	files/s
PyTorch	44.04	13.50	13.05	17.36	21.38	74.1
ONNX Runtime	45.64	12.18	11.97	14.45	15.96	82.1

This is the end-to-end latency of the full thin parser, not model-forward-only timing. Mobile implementations should reuse the ONNX session and keep tokenizer, BIO decoding, and thin-normalization behavior aligned.
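The files/s column is consistent with the average latency for sequential single-threaded processing, that is, throughput ≈ 1000 / avg ms. This is an observation from the reported numbers, not a statement about how the benchmark tool computes the column. A quick check:

```python
# Average end-to-end latency per file, copied from the table above.
avg_ms = {"PyTorch": 13.50, "ONNX Runtime": 12.18}

# One file at a time on one thread: files per second = 1000 ms / latency in ms.
throughput = {backend: 1000.0 / ms for backend, ms in avg_ms.items()}

for backend, fps in throughput.items():
    print(f"{backend}: {fps:.1f} files/s")
```

Both results round to the table values (74.1 and 82.1 files/s).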

Training

Training uses the dataset submodule at datasets/AnimeName.

Current release training uses schema v2 JSONL plus a Rust pre-encoded cache on the Windows RTX 5070 Ti worker.

The hard-focus JSONL and encoded-cache paths under data/ are generated local artifacts and are intentionally ignored; regenerate them from the authoritative dataset and schema before rerunning this exact training command.

.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
  --data-file data/schema_v2_hard_focus_char_seed63.jsonl `
  --vocab-file datasets/AnimeName/vocab.char.json `
  --encoded-cache-dir data/encoded_cache/schema_v2_hard_focus_char_seed63_split995_repaired_v2 `
  --save-dir checkpoints/ablation-schema-v2-hardfocus-cache-repaired-from-baseline-seed62-10epoch-rerun `
  --init-model-dir checkpoints/ablation-schema-v2-baseline8h4l-cache-10epoch-seed62/final `
  --epochs 10 `
  --batch-size 512 `
  --learning-rate 0.00004 `
  --warmup-steps 120 `
  --max-seq-length 128 `
  --train-split 0.995 `
  --checkpoint-steps 1000 `
  --save-total-limit 3 `
  --parse-eval-limit 2048 `
  --case-eval-file data/parser_regression_cases.json `
  --bf16 `
  --no-periodic-eval `
  --perf-log-steps 100 `
  --perf-sample-interval 1.0 `
  --seed 63 `
  --experiment-name ablation-schema-v2-hardfocus-cache-repaired-from-baseline-seed62-10epoch-rerun

Build or rebuild encoded caches with tools/encoded_dataset_cache after changing the JSONL, vocab, label schema, max length, split ratio, or seed. See docs/training.md for the full cache-first flow and synthetic augmentation follow-ups.
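One way to reason about when a cache is stale is to derive a key from every input listed above, so that changing any of them yields a different key. The sketch below is purely hypothetical: the function and its parameters are illustrative and do not describe how tools/encoded_dataset_cache actually names or validates caches.

```python
import hashlib
import json


def cache_key(jsonl_bytes, vocab_bytes, label_schema, max_seq_length, train_split, seed):
    """Hash every input that should invalidate an encoded cache when it changes."""
    h = hashlib.sha256()
    # Content hashes of the two data files.
    h.update(hashlib.sha256(jsonl_bytes).digest())
    h.update(hashlib.sha256(vocab_bytes).digest())
    # Scalar settings, serialized deterministically.
    settings = {
        "label_schema": label_schema,
        "max_seq_length": max_seq_length,
        "train_split": train_split,
        "seed": seed,
    }
    h.update(json.dumps(settings, sort_keys=True).encode("utf-8"))
    return h.hexdigest()[:16]


# Toy inputs; real runs would read the JSONL and vocab files from disk.
key_a = cache_key(b"rows", b"vocab", "v2", 128, 0.995, 63)
key_b = cache_key(b"rows", b"vocab", "v2", 128, 0.995, 62)
print(key_a != key_b)
```

With such a key embedded in the cache directory name, a mismatch between the key and the current inputs signals that the cache must be rebuilt.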

python -m anifilebert.train writes:

Hugging Face checkpoints under --save-dir,
final/run_metadata.json,
final/trainer_eval_metrics.json,
final/parse_eval_metrics.json,
final/case_metrics.json unless --no-case-eval is used,
final/perf_metrics.json when --perf-log-steps is set,
TensorBoard logs unless --no-tensorboard is used.

Full workflow: docs/training.md.

Dataset

Authoritative dataset snapshot:

datasets/AnimeName/dmhy_weak.jsonl
datasets/AnimeName/dmhy_weak_char.jsonl
datasets/AnimeName/vocab.json
datasets/AnimeName/vocab.char.json

Current snapshot:

rows: 759738
failed relabel rows: 0
strict BIO violations: 0
character vocab: 6199
character coverage: 99.9952% with the published 6199-token char vocab
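The README does not say whether character coverage is counted over character occurrences or over distinct characters. The sketch below assumes occurrences, which is the more common reading for a tokenizer vocabulary; the function name and toy data are illustrative.

```python
from collections import Counter


def char_coverage(lines, vocab):
    """Fraction of character occurrences whose character is in the vocab."""
    counts = Counter(ch for line in lines for ch in line)
    total = sum(counts.values())
    covered = sum(n for ch, n in counts.items() if ch in vocab)
    return covered / total if total else 1.0


# Toy corpus: 8 characters in total, and only "2" is missing from the vocab.
coverage = char_coverage(["EP01", "EP02"], {"E", "P", "0", "1"})
print(coverage)
```

Characters outside the vocab would map to the unknown token at inference time, so a coverage figure close to 100% means very few filename characters are lost this way.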

datasets/AnimeName is a nested dataset repository (a Git submodule). After updating the data, commit and push the dataset repository first, then commit the updated submodule pointer in this model repository.

Repository Layout

config.json
model.safetensors
tokenizer_config.json
vocab.json
training_args.bin
anifilebert/
tools/
data/parser_regression_cases.json
datasets/AnimeName/
exports/anime_filename_parser.onnx
docs/
reports/

Maintenance

See docs/maintenance.md for release steps, LFS order, dataset submodule updates, and MiruPlay integration notes.

Limitations

Anime release names are not standardized; extreme OCR noise, mojibake, or non-anime names can still fail.
The ONNX graph contains model logits only, not the tokenizer, BIO decoding, or thin field normalization. Mobile runtimes must keep the tokenizer, vocabulary, config, BIO decoding, and thin normalization in sync.
source is currently a single-valued field, while real filenames may contain platform, release source, codec, and language tags together.
