---
license: mit
language:
- en
library_name: onnx
tags:
- text-classification
- web-scraping
- boilerplate-removal
- content-extraction
- onnx
- tiny
pipeline_tag: text-classification
---

# WebRank

A 3.14M-parameter transformer that scores web text on a `[0, 1]` scale where
**1 = real content** and **0 = boilerplate** (cookie banners, navs, footers,
CTAs, error pages, JS placeholders, paywalls).

Ships as a 3.2 MB INT8 ONNX file that runs anywhere ONNX Runtime runs:
Python, JS (browser/Node), Go, Rust, C++, Java, .NET.

Built as the post-processing filter for the [Keiro Browser](https://github.com/keirolabs)
crawl pipeline, released as open source.

## Files

| File | Size | Description |
|------|------|-------------|
| `webrank.int8.onnx` | 3.2 MB | INT8-quantized model — recommended |
| `webrank.onnx` | 12 MB | FP32 model |
| `tokenizer.json` | 1.1 MB | HuggingFace `tokenizers` BPE vocab |

## Architecture

```
input_ids [B, 256]   int64
   ↓
token + position embeddings (dim=128)
   ↓
5 × { LayerNorm → MHA(8 heads, SDPA) → residual
                → LayerNorm → FFN(512) → residual }
   ↓
LayerNorm → mean-pool over non-pad tokens
   ↓
Linear(128→128) → GELU → Dropout → Linear(128→1) → sigmoid
   ↓
score [B]   float32
```

- **Vocab:** 16,384 byte-level BPE
- **Max seq length:** 256 BPE tokens
- **Params:** 3,135,617
- **Pretraining:** masked language modeling
- **Fine-tuning:** binary classification with BCE loss
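
As a sanity check on the parameter figure, the layer shapes in the diagram can be tallied by hand. The sketch below assumes a bias on every linear layer, learned position embeddings, and a scale and shift per LayerNorm. None of these details are stated in this card, so the result is an estimate rather than the exact layout.

```python
def estimate_params(vocab=16_384, seq_len=256, dim=128, ffn=512, layers=5):
    """Rough parameter tally for the architecture diagram (assumed layout)."""
    def linear(n_in, n_out):
        return n_in * n_out + n_out           # weight + bias (assumed)

    layer_norm = 2 * dim                      # scale + shift (assumed)
    embeddings = vocab * dim + seq_len * dim  # token + learned position (assumed)
    attention = 4 * linear(dim, dim)          # Q, K, V and output projections
    feed_forward = linear(dim, ffn) + linear(ffn, dim)
    block = attention + feed_forward + 2 * layer_norm
    head = linear(dim, dim) + linear(dim, 1)  # classifier MLP
    return embeddings + layers * block + layer_norm + head

total = estimate_params()
print(total, f"({abs(total - 3_135_617) / 3_135_617:.2%} off the reported count)")
```

Under these assumptions the tally is 3,138,177, within 0.1% of the reported 3,135,617. The token embedding table alone accounts for roughly two thirds of the total.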

## Usage

### Python

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
sess = ort.InferenceSession("webrank.int8.onnx",
                            providers=["CPUExecutionProvider"])

def encode(text, max_len=256):
    pad_id = tok.token_to_id("[PAD]")
    ids = tok.encode(text).ids[:max_len]   # post-processor adds [CLS]/[SEP]
    ids += [pad_id] * (max_len - len(ids))
    return np.array([ids], dtype=np.int64)

def score(text):
    out = sess.run(["score"], {"input_ids": encode(text)})[0]
    return float(out.flatten()[0])

print(score("Mitochondria are membrane-bound organelles found in eukaryotic cells."))
# 0.93

print(score("We use cookies to improve your experience. Accept all cookies."))
# 0.08
```

Batched:

```python
def score_batch(texts):
    ids = np.concatenate([encode(t) for t in texts], axis=0)
    return sess.run(["score"], {"input_ids": ids})[0].flatten()
```

### JavaScript (browser / Node)

```js
import * as ort from "onnxruntime-web";

const session = await ort.InferenceSession.create("/webrank.int8.onnx");
// `ids` must be a BigInt64Array of length 256: token ids produced by a JS BPE
// library that loads tokenizer.json, padded with the [PAD] id
const tensor = new ort.Tensor("int64", ids, [1, 256]);
const out = await session.run({ input_ids: tensor });
console.log(out.score.data[0]);  // 0..1
```

### Go

```go
import (
    "fmt"

    ort "github.com/yalue/onnxruntime_go"
)

// ids is a []int64 of length 256: token ids from tokenizer.json, [PAD]-padded.
ort.SetSharedLibraryPath("libonnxruntime.so")
if err := ort.InitializeEnvironment(); err != nil {
    panic(err)
}
defer ort.DestroyEnvironment()

input, _  := ort.NewTensor(ort.NewShape(1, 256), ids)
output, _ := ort.NewEmptyTensor[float32](ort.NewShape(1))
sess, err := ort.NewAdvancedSession(
    "webrank.int8.onnx",
    []string{"input_ids"}, []string{"score"},
    []ort.Value{input}, []ort.Value{output}, nil,
)
if err != nil {
    panic(err)
}
defer sess.Destroy()
if err := sess.Run(); err != nil {
    panic(err)
}
fmt.Println(output.GetData()[0]) // score in [0, 1]
```

## Performance

Measured on a Ryzen 7 (CPU only, ONNX Runtime 1.20):

| Variant | Single-row | Batch-18 | Size |
|---|---|---|---|
| FP32 | 5.9 ms | 238 ms | 12 MB |
| INT8 | 6.6 ms | 222 ms | 3.2 MB |

INT8 is **3.8× smaller** with ≤0.024 score drift and identical predictions
on every test case. At ~3M params, quantization overhead cancels the matmul
savings, so single-row latency is roughly equivalent (6.6 ms vs 5.9 ms).
INT8 wins on size and on batched throughput.
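
To re-measure these numbers on other hardware, a small timing harness is enough. The sketch below is an assumption about how such a benchmark might be structured, not the script behind the table. `run_once` is a hypothetical zero-argument callable, for example `lambda: sess.run(["score"], {"input_ids": batch})`.

```python
import statistics
import time

def bench(run_once, warmup=5, iters=50):
    """Return the median wall-clock latency of run_once() in milliseconds."""
    for _ in range(warmup):           # let caches and thread pools settle
        run_once()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Stand-in workload so the sketch runs without the model file.
print(f"{bench(lambda: sum(range(10_000))):.3f} ms")
```

The median is used rather than the mean so that a single slow iteration (GC pause, scheduler hiccup) does not skew the figure.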

## Training data

- **Pretraining:** [`Salesforce/wikitext`](https://huggingface.co/datasets/Salesforce/wikitext)
  `wikitext-103-raw-v1`, ~29k articles, ~110M tokens.
- **Fine-tuning:** 30k labeled examples (15k positive / 15k negative).
  - Positives: 7.5k paragraph-level + 7.5k sentence-level extracts from
    wikitext articles, filtered for prose-like structure.
  - Negatives: synthetically generated boilerplate from 40+ templates
    (cookie banners, navs, footers, CTAs, JS placeholders, error pages,
    paywall stubs), with deliberately varied length (40% single template,
    30% pair, 20% triple, 10% stack of 4–6).

Mixing lengths on both the positive and the negative side is important:
without it, the model learns to use sequence length as a shortcut.
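
The negative-side length mix can be sketched as a weighted draw over stack sizes. The template strings and the `make_negative` helper below are hypothetical stand-ins; the real pipeline uses 40+ templates that are not reproduced here, and drawing distinct templates per stack is an assumption.

```python
import random

# Hypothetical stand-ins for the real boilerplate templates.
TEMPLATES = [
    "We use cookies to improve your experience. Accept all cookies.",
    "Home | About | Contact | Privacy Policy",
    "Subscribe to our newsletter for the latest updates.",
    "Please enable JavaScript to view this page.",
    "Copyright Example Inc. All rights reserved.",
    "Sign up now and get started for free!",
]

def sample_stack_size(rng):
    """Pick a stack size: 40% one, 30% two, 20% three, 10% four to six."""
    bucket = rng.choices(["single", "pair", "triple", "stack"],
                         weights=[40, 30, 20, 10])[0]
    if bucket == "stack":
        return rng.randint(4, 6)      # inclusive on both ends
    return {"single": 1, "pair": 2, "triple": 3}[bucket]

def make_negative(rng):
    """Join a random stack of distinct templates into one negative example."""
    k = sample_stack_size(rng)
    return " ".join(rng.sample(TEMPLATES, k))   # distinct templates (assumed)

rng = random.Random(0)
for _ in range(3):
    print(make_negative(rng))
```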

## Training procedure

- **Pretraining:** masked language modeling (BERT-style 80/10/10 mask),
  AdamW (lr 3e-4, betas 0.9/0.95, wd 0.01), cosine schedule with 100-step
  warmup, gradient clipping 1.0, batch size 32, 800 steps total.
  ~75 minutes on CPU.
- **Fine-tuning:** binary classification head with BCE loss, AdamW
  (lr 5e-5), 3 epochs over the 27k training rows (the 30k labeled examples
  minus the 3,000-row validation split), batch size 64.
  ~38 minutes on CPU.
- Training framework: PyTorch (vanilla, no HuggingFace `transformers`
  for the model itself).
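
The pretraining schedule and masking rule can be sketched in plain Python. This assumes linear warmup, cosine decay to zero, a 15% masking rate, and `-100` as the ignored-label value. The card states only the peak learning rate, the 100-step warmup, the 800-step total, and the 80/10/10 split, so everything else here is illustrative.

```python
import math
import random

def lr_at(step, peak=3e-4, warmup=100, total=800):
    """Linear warmup to `peak`, then cosine decay to zero at `total` (assumed shape)."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * min(1.0, progress)))

def mask_tokens(ids, mask_id, vocab_size, rng, rate=0.15, special=frozenset()):
    """BERT-style masking: of the selected tokens, 80% become [MASK],
    10% become a random token, 10% stay unchanged. Returns (inputs, labels)."""
    inputs = list(ids)
    labels = [-100] * len(ids)        # -100 = position ignored by the loss (assumed)
    for i, tok in enumerate(ids):
        if tok in special or rng.random() >= rate:
            continue                  # not selected for prediction
        labels[i] = tok               # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)
        # otherwise keep the original token in place
    return inputs, labels

print([lr_at(s) for s in (0, 99, 450, 800)])
```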

## Evaluation

On a held-out 3,000-row validation split:

| Metric | Value |
|---|---|
| Accuracy  | 1.000 |
| Precision | 1.000 |
| Recall    | 0.999 |
| F1        | 1.000 |
| Loss      | 0.0074 |
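
Because the table is rounded to three decimals, recall can read 0.999 while accuracy and F1 read 1.000. The sketch below computes the four metrics from confusion counts. The counts in the example are hypothetical (this card does not publish the confusion matrix) and were chosen only to show one balanced 3,000-row outcome that rounds to the values in the table.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 1,500 positives with one miss, 1,500 negatives all correct.
m = binary_metrics(tp=1499, fp=0, fn=1, tn=1500)
print({k: round(v, 3) for k, v in m.items()})
# {'accuracy': 1.0, 'precision': 1.0, 'recall': 0.999, 'f1': 1.0}
```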

The held-out validation split is trivially separable, because synthetic
boilerplate and wikitext prose are easy to tell apart. For a more honest
read, the model was also scored on **18 hand-written real-world snippets**
(none from the training distribution):

- **16 / 18 correct** on the binary cutoff.
- The 2 failures are:
  - `404 - Page not found. The page you are looking for might have been removed...`  → 0.75 (false positive for content)
  - `This article is for subscribers only. Subscribe now to read the full story...`  → 0.72 (false positive for content)

One is an error page and the other a paywall stub, both written as natural
prose. The templated synthetic negatives never showed the model that
*prose-shaped* boilerplate exists, so closing this gap requires real-world
hard-negative mining.

## Limitations

1. **English only.** The byte-level tokenizer tolerates other scripts but
   the classifier was never trained on them.
2. **Domain shift.** Trained on wikitext-103 (encyclopedic English).
   Short technical statements like *"PostgreSQL uses MVCC for transactions"*
   or casual writing score lower than they should because they don't match
   wikitext prose style.
3. **Prose-shaped boilerplate.** Paywalls, well-written 404 pages,
   and "subscribe to read" stubs can confuse it because the synthetic
   negatives are templated, not naturalistic.
4. **Sequence cap of 256 tokens.** Long documents must be chunked by the
   caller. The intended use is per-paragraph scoring during crawl
   post-processing, not whole-page classification.
5. **Pretraining cap of 800 steps.** Final MLM loss is ~7.18, barely below
   the unigram baseline for the 16K vocab (≈ 7.2), so the trunk has learned
   little beyond token frequencies. The classifier still works because the
   binary task is easy enough that it does not need a well-converged
   language model, but a longer pretraining run would likely help the
   borderline cases.
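
Chunking for limitation 4 can be done at the token level before padding. The sketch below is one possible approach, not part of the released code: `chunk_ids` is a hypothetical helper that splits a list of token ids into windows of at most 256, with an optional overlap so that text cut at a boundary appears in both neighbouring windows.

```python
def chunk_ids(ids, max_len=256, overlap=0):
    """Split token ids into consecutive windows of at most max_len tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break                     # this window already reaches the end
    return chunks

windows = chunk_ids(list(range(600)), max_len=256, overlap=32)
print([len(w) for w in windows])      # [256, 256, 152]
```

Each window still needs padding to 256 before scoring, and windows cut from the middle of a document will not carry the `[CLS]`/`[SEP]` wrapping that `encode` adds. How to aggregate the per-window scores (mean, minimum) is left to the caller.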

## Intended use

Drop into a web crawler / scraper as a post-extraction quality filter.
Score each paragraph or block, drop anything below ~0.5, keep the rest.
Cheap enough (≈6 ms/paragraph on CPU) to run inline at crawl time.

**Not** intended as a general-purpose text classifier, content moderator,
toxicity detector, or anything else. It does one thing.
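
The filter described in this section can be sketched as a thin wrapper around a batch scorer. `filter_blocks` is a hypothetical helper, and `score_batch` is passed in as a callable so the same code works with the ONNX-backed function from the Usage section or with a stub.

```python
def filter_blocks(blocks, score_batch, threshold=0.5):
    """Keep the blocks whose score is at or above the threshold, in order."""
    blocks = [b for b in blocks if b.strip()]       # skip empty blocks
    if not blocks:
        return []
    scores = score_batch(blocks)
    return [b for b, s in zip(blocks, scores) if s >= threshold]

# Stub scorer with made-up scores so the sketch runs without the model.
fake_scores = {"Accept all cookies.": 0.08, "Mitochondria are organelles.": 0.93}
kept = filter_blocks(list(fake_scores), lambda bs: [fake_scores[b] for b in bs])
print(kept)   # ['Mitochondria are organelles.']
```

The 0.5 default mirrors the cutoff suggested above; raising it trades recall of real content for fewer boilerplate leaks.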

## Reproducing

The full training pipeline is in the [GitHub repo](https://github.com/keirolabs).
End-to-end on a Ryzen 7 takes ~115 minutes:

```bash
python collect.py        #  1 min   download wikitext, build labels
python tokenizer.py      #  1 min   train 16K BPE
python pretrain.py       # 75 min   MLM pretraining
python finetune.py       # 38 min   binary classification
python export.py         #  2 sec   PyTorch → ONNX FP32
python quantize_onnx.py  #  5 sec   ONNX FP32 → INT8
```

## License

MIT. Do whatever you want with it.

## Citation

```bibtex
@misc{webrank2026,
  title  = {WebRank: a 3M-parameter boilerplate classifier for web text},
  author = {Keirolabs},
  year   = {2026},
  url    = {https://huggingface.co/mannybr/Webrank-nano}
}
```