---
license: mit
language:
- en
library_name: onnx
tags:
- text-classification
- web-scraping
- boilerplate-removal
- content-extraction
- onnx
- tiny
pipeline_tag: text-classification
---

# WebRank

A 3.14M-parameter transformer that scores web text on a `[0, 1]` scale where **1 = real content** and **0 = boilerplate** (cookie banners, navs, footers, CTAs, error pages, JS placeholders, paywalls). Ships as a 3.2 MB INT8 ONNX file that runs anywhere ONNX Runtime runs: Python, JS (browser/node), Go, Rust, C++, Java, .NET.

Built as the post-processing filter for the [Keiro Browser](https://github.com/keirolabs) crawl pipeline, released as open source.

## Files

| File | Size | Description |
|------|------|-------------|
| `webrank.int8.onnx` | 3.2 MB | INT8-quantized model (recommended) |
| `webrank.onnx` | 12 MB | FP32 model |
| `tokenizer.json` | 1.1 MB | HuggingFace `tokenizers` BPE vocab |

## Architecture

```
input_ids [B, 256] int64
   ↓
token + position embeddings (dim=128)
   ↓
5 × { LayerNorm → MHA(8 heads, SDPA) → residual → LayerNorm → FFN(512) → residual }
   ↓
LayerNorm → mean-pool over non-pad tokens
   ↓
Linear(128→128) → GELU → Dropout → Linear(128→1) → sigmoid
   ↓
score [B] float32
```

- **Vocab:** 16,384 byte-level BPE
- **Max seq length:** 256 BPE tokens
- **Params:** 3,135,617
- **Pretraining:** masked language modeling
- **Fine-tuning:** binary classification with BCE loss

## Usage

### Python

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
sess = ort.InferenceSession("webrank.int8.onnx", providers=["CPUExecutionProvider"])

def encode(text, max_len=256):
    pad_id = tok.token_to_id("[PAD]")
    ids = tok.encode(text).ids[:max_len]  # post-processor adds [CLS]/[SEP]
    ids += [pad_id] * (max_len - len(ids))
    return np.array([ids], dtype=np.int64)

def score(text):
    out = sess.run(["score"], {"input_ids": encode(text)})[0]
    return float(out.flatten()[0])

print(score("Mitochondria are membrane-bound organelles found in eukaryotic cells."))  # 0.93
print(score("We use cookies to improve your experience. Accept all cookies."))  # 0.08
```

Batched:

```python
def score_batch(texts):
    ids = np.concatenate([encode(t) for t in texts], axis=0)
    return sess.run(["score"], {"input_ids": ids})[0].flatten()
```

### JavaScript (browser / Node)

```js
import * as ort from "onnxruntime-web";

const session = await ort.InferenceSession.create("/webrank.int8.onnx");
// tokenize text into a BigInt64Array of length 256 using a JS BPE
// library that loads tokenizer.json
const tensor = new ort.Tensor("int64", ids, [1, 256]);
const out = await session.run({ input_ids: tensor });
console.log(out.score.data[0]); // 0..1
```

### Go

```go
import (
	"fmt"

	ort "github.com/yalue/onnxruntime_go"
)

ort.SetSharedLibraryPath("libonnxruntime.so")
ort.InitializeEnvironment()
defer ort.DestroyEnvironment()

// ids is a []int64 of length 256 produced by your tokenizer
input, _ := ort.NewTensor(ort.NewShape(1, 256), ids)
output, _ := ort.NewEmptyTensor[float32](ort.NewShape(1))
sess, _ := ort.NewAdvancedSession(
	"webrank.int8.onnx",
	[]string{"input_ids"}, []string{"score"},
	[]ort.Value{input}, []ort.Value{output}, nil,
)
sess.Run()
fmt.Println(output.GetData()[0])
```

## Performance

Measured on a Ryzen 7 (CPU only, ONNX Runtime 1.20):

| Variant | Single-row | Batch-18 | Size |
|---|---|---|---|
| FP32 | 5.9 ms | 238 ms | 12 MB |
| INT8 | 6.6 ms | 222 ms | 3.2 MB |

INT8 is **3.8× smaller** with ≤0.024 score drift and identical predictions on every test case. Quantization overhead cancels matmul savings at 3M params, so single-row latency is roughly equivalent; INT8 wins on size and on batched throughput.

## Training data

- **Pretraining:** [`Salesforce/wikitext`](https://huggingface.co/datasets/Salesforce/wikitext) `wikitext-103-raw-v1`, ~29k articles, ~110M tokens.
- **Fine-tuning:** 30k labeled examples (15k positive / 15k negative).
  - Positives: 7.5k paragraph-level + 7.5k sentence-level extracts from wikitext articles, filtered for prose-like structure.
  - Negatives: synthetically generated boilerplate from 40+ templates (cookie banners, navs, footers, CTAs, JS placeholders, error pages, paywall stubs), with deliberately varied length (40% single template, 30% pair, 20% triple, 10% stack of 4–6).

The mixed-length sampling on both sides is important: without it the model learns to use sequence length as a shortcut.

## Training procedure

- **Pretraining:** masked language modeling (BERT-style 80/10/10 mask), AdamW (lr 3e-4, betas 0.9/0.95, wd 0.01), cosine schedule with 100-step warmup, gradient clipping 1.0, batch size 32, 800 steps total. ~75 minutes on CPU.
- **Fine-tuning:** binary classification head with BCE loss, AdamW (lr 5e-5), 3 epochs over 12k training rows, batch size 64. ~38 minutes on CPU.
- Training framework: PyTorch (vanilla, no HuggingFace `transformers` for the model itself).

## Evaluation

On a held-out 3,000-row validation split:

| Metric | Value |
|---|---|
| Accuracy | 1.000 |
| Precision | 1.000 |
| Recall | 0.999 |
| F1 | 1.000 |
| Loss | 0.0074 |

Held-out val is trivially separable because synthetic boilerplate vs wikitext prose is a fairly easy decision boundary. For a more honest read, on **18 hand-written real-world snippets** (none from the training distribution):

- **16 / 18 correct** on the binary cutoff.
- The 2 failures are:
  - `404 - Page not found. The page you are looking for might have been removed...` → 0.75 (false positive for content)
  - `This article is for subscribers only. Subscribe now to read the full story...` → 0.72 (false positive for content)

Both are paywall/error pages styled as natural prose: the synthetic templated negatives never showed the model that *prose-shaped* boilerplate exists. Closing this gap requires real-world hard-negative mining.

## Limitations

1. **English only.** The byte-level tokenizer tolerates other scripts but the classifier was never trained on them.
2. **Domain shift.** Trained on wikitext-103 (encyclopedic English).
   Short technical statements like *"PostgreSQL uses MVCC for transactions"* or casual writing score lower than they should because they don't match wikitext prose style.
3. **Prose-shaped boilerplate.** Paywalls, well-written 404 pages, and "subscribe to read" stubs can confuse it because the synthetic negatives are templated, not naturalistic.
4. **Sequence cap of 256 tokens.** Long documents must be chunked by the caller. The intended use is per-paragraph scoring during crawl post-processing, not whole-page classification.
5. **Pretraining cap of 800 steps.** Final MLM loss ~7.18 (16K vocab unigram baseline ≈ 7.2). The classifier still works because the binary task is easy enough that the trunk doesn't need a deeply converged language model, but a longer pretraining run would help the borderline cases.

## Intended use

Drop into a web crawler / scraper as a post-extraction quality filter. Score each paragraph or block, drop anything below ~0.5, keep the rest. Cheap enough (≈6 ms/paragraph on CPU) to run inline at crawl time.

**Not** intended as a general-purpose text classifier, content moderator, toxicity detector, or anything else. It does one thing.

## Reproducing

The full training pipeline is in the [GitHub repo](https://github.com/keirolabs). End-to-end on a Ryzen 7 takes ~115 minutes:

```bash
python collect.py         # 1 min   download wikitext, build labels
python tokenizer.py       # 1 min   train 16K BPE
python pretrain.py        # 75 min  MLM pretraining
python finetune.py        # 38 min  binary classification
python export.py          # 2 sec   PyTorch → ONNX FP32
python quantize_onnx.py   # 5 sec   ONNX FP32 → INT8
```

## License

MIT. Do whatever you want with it.

## Citation

```bibtex
@misc{webrank2026,
  title  = {WebRank: a 3M-parameter boilerplate classifier for web text},
  author = {Keirolabs},
  year   = {2026},
  url    = {https://huggingface.co/mannybr/Webrank-nano}
}
```
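
## Example: per-paragraph filtering

The workflow described under Intended use (score each paragraph, drop anything below ~0.5) could be wired up roughly as follows. This is a minimal sketch, not part of the released code: `split_paragraphs` and `filter_blocks` are hypothetical helper names, splitting on blank lines is an assumption about how the extracted page text is laid out, and the scorer is passed in as a function so the sketch runs without the model files.

```python
import re
from typing import Callable, Iterable, List, Tuple


def split_paragraphs(page_text: str) -> List[str]:
    """Split extracted page text into paragraphs on blank lines (assumed layout)."""
    parts = re.split(r"\n\s*\n", page_text)
    return [p.strip() for p in parts if p.strip()]


def filter_blocks(
    blocks: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (block, score) pairs whose score is at or above the threshold."""
    kept = []
    for block in blocks:
        s = score_fn(block)
        if s >= threshold:
            kept.append((block, s))
    return kept


if __name__ == "__main__":
    # Stand-in scorer so the sketch runs on its own; in practice pass the
    # `score` function from the Python usage section instead.
    def fake_score(text: str) -> float:
        return 0.1 if "cookies" in text.lower() else 0.9

    page = "We use cookies. Accept all cookies.\n\nMitochondria are organelles."
    for block, s in filter_blocks(split_paragraphs(page), fake_score):
        print(f"{s:.2f}  {block}")
```

Paragraphs longer than 256 tokens are still truncated by `encode`, so very long blocks may need further splitting before scoring.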