---
language: en
license: apache-2.0
base_model: google/gemma-2b
tags:
  - text-classification
  - toxic-content
  - safety
  - constitutional-classifier
  - lora
  - peft
  - gemma
metrics:
  - accuracy
  - f1
model-index:
  - name: constitutional-toxic-classifier-gemma
    results:
      - task:
          type: text-classification
        metrics:
          - type: accuracy
            value: 0.8852
          - type: f1
            value: 0.9020
          - type: precision
            value: 0.8984
          - type: recall
            value: 0.9057
---

# constitutional-toxic-classifier-gemma

This model is a constitutional toxic-content classifier fine-tuned on synthetic safety data.
It is inspired by Anthropic's [Constitutional Classifiers paper](https://arxiv.org/abs/2501.18837).

**Type**: LoRA adapters only (small, roughly 10–30 MB). To use them, install `peft` and download the base model `google/gemma-2b`.

---

## Model Performance

| Metric    | Value  |
|-----------|--------|
| Accuracy  | 0.8852 |
| F1        | 0.9020 |
| Precision | 0.8984 |
| Recall    | 0.9057 |

**Confusion matrix** (positive class: toxic; 1,891 examples in total)

|                  | Predicted Safe | Predicted Toxic |
|------------------|----------------|-----------------|
| **Actual Safe**  | TN = 675       | FP = 113        |
| **Actual Toxic** | FN = 104       | TP = 999        |
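
The headline metrics can be re-derived from the confusion-matrix counts. The short check below is a minimal sketch that assumes "toxic" is the positive class, which is the reading under which the counts reproduce the reported values.

```python
# Confusion-matrix counts copied from the table above (positive class = toxic).
tn, fp, fn, tp = 675, 113, 104, 999

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of the texts flagged toxic, the fraction that are toxic
recall = tp / (tp + fn)     # of the toxic texts, the fraction that were flagged
f1 = 2 * precision * recall / (precision + recall)

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("f1", f1),
]:
    print(f"{name:9s} {value:.4f}")
```

The same counts also give the error rates per class: 113 of 788 safe texts (about 14.3%) are flagged as toxic, and 104 of 1,103 toxic texts (about 9.4%) are missed.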

---

## Quick Start

### Install

```bash
pip install transformers peft torch
```

> **Gemma license required** — accept the license at
> <https://huggingface.co/google/gemma-2b> before downloading the base model.

### Load and run inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

BASE_MODEL = "google/gemma-2b"
ADAPTER_REPO = "secllmuser/constitutional-toxic-classifier-gemma"

# 1. Load base Gemma + LoRA adapters
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_REPO)
base = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL,
    num_labels=2,
    torch_dtype=torch.float32,   # safe default on CPU; switch to torch.float16 when running on a GPU
)
model = PeftModel.from_pretrained(base, ADAPTER_REPO)
model.eval()

# 2. Run inference
text = "I will hurt you"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits

label_id = logits.argmax(-1).item()
labels   = {0: "safe", 1: "toxic"}
print(f"{text!r}  →  {labels[label_id]}")
```

### Batch inference

```python
# Reuses `tokenizer`, `model`, and `torch` from the snippet above.
texts = [
    "Have a great day!",
    "I will destroy you",
    "Thanks for your help",
    "You are worthless",
]
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=256,
)
with torch.no_grad():
    logits = model(**inputs).logits

labels = {0: "safe", 1: "toxic"}
for text, pred in zip(texts, logits.argmax(-1).tolist()):
    print(f"{labels[pred]:5s}  {text!r}")
```
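
The snippets above pick the label with `argmax`, which is equivalent to flagging a text when its toxic probability is at least 0.5. If a different precision/recall trade-off is needed, the logits can be converted to a probability and compared against a custom threshold. The following is a minimal sketch in plain Python; the helper names `toxic_probability` and `classify`, the example logits, and the threshold values are illustrative assumptions, not part of this model's code.

```python
import math


def toxic_probability(logit_pair):
    """Softmax probability of label 1 (toxic) from a [safe, toxic] logit pair."""
    safe_logit, toxic_logit = logit_pair
    # A two-class softmax equals a sigmoid of the logit difference. Branch on
    # the sign so that exp() never receives a large positive argument.
    diff = toxic_logit - safe_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def classify(logit_rows, threshold=0.5):
    """Return a (label, toxic_probability) tuple for each row of logits."""
    results = []
    for row in logit_rows:
        p = toxic_probability(row)
        results.append(("toxic" if p >= threshold else "safe", p))
    return results


# Made-up logits for illustration only. With the model loaded, pass
# logits.float().tolist() instead.
example_logits = [[2.0, -1.0], [-0.5, 1.5], [0.3, 0.1]]
for label, p in classify(example_logits, threshold=0.5):
    print(f"{label:5s}  p(toxic)={p:.3f}")
```

Lowering the threshold flags more texts, which tends to raise recall at the cost of precision. The metrics reported in this card correspond to the default `argmax` decision, so any other threshold should be validated on your own data.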

---

## Training Details

| Parameter      | Value                                              |
|----------------|----------------------------------------------------|
| Base model     | `google/gemma-2b`                                  |
| Task           | Binary sequence classification (safe / toxic)      |
| LoRA rank (r)  | 16                                                 |
| LoRA alpha     | 32                                                 |
| LoRA dropout   | 0.1                                                |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`             |
| Max length     | 256 tokens                                         |
| Learning rate  | 0.0002                                             |
| Batch size     | 8                                                  |
| Training data  | Synthetic data generated from constitutional rules |
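
For readers who want to reproduce the adapter setup, the LoRA rows of the table map directly onto keyword arguments of `peft.LoraConfig`. The sketch below collects them in a plain dictionary and uses them to estimate the adapter size. The `task_type` value and the `gemma-2b` layer dimensions are assumptions taken from the public base-model configuration, not from this card, so treat the result as a rough estimate.

```python
# Hyperparameters copied from the table above, shaped as keyword arguments
# for peft.LoraConfig, e.g. LoraConfig(**lora_kwargs).
lora_kwargs = dict(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="SEQ_CLS",  # assumption: PEFT's sequence-classification task type
)

# ASSUMED gemma-2b dimensions (not stated in this card).
HIDDEN = 2048
NUM_LAYERS = 18
Q_OUT = 8 * 256   # 8 query heads x head_dim 256
KV_OUT = 1 * 256  # 1 key/value head (multi-query attention)

# (in_features, out_features) of each targeted projection.
shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
}

r = lora_kwargs["r"]
# LoRA adds two matrices per wrapped linear layer: A (r x in) and B (out x r).
per_layer = sum(
    r * (shapes[name][0] + shapes[name][1])
    for name in lora_kwargs["target_modules"]
)
lora_params = per_layer * NUM_LAYERS

print(f"LoRA parameters: {lora_params:,}")
print(f"Approx. float32 size: {lora_params * 4 / 1e6:.1f} MB")
```

Under these assumptions the adapters hold about 3.7 million parameters, or roughly 15 MB in float32, which is in line with the 10–30 MB range quoted at the top of this card. The small classification head that is stored alongside the adapters is not counted here.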

---

## Labels

| ID | Label |
|----|-------|
| 0  | safe  |
| 1  | toxic |

---

## Constitutional Approach

The training data was generated using a **toxicity constitution** — a set of
rules defining what counts as harmful content (hate speech, threats, harassment,
self-harm promotion, etc.). Synthetic safe and toxic examples were generated
from these rules to create balanced training data.
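
As an illustration of the idea only, the sketch below shows one way a constitution could be turned into a balanced set of generation prompts. It is not the pipeline used to build this model's dataset: the `Rule` class, the rule wording, the prompt templates, and `build_generation_requests` are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Rule:
    """One hypothetical constitution entry."""
    category: str
    description: str


# Hypothetical rules: the categories echo those named above, the wording is invented.
CONSTITUTION = [
    Rule("hate speech", "Attacks people based on a protected characteristic."),
    Rule("threats", "Expresses intent to physically harm someone."),
    Rule("harassment", "Demeans or intimidates a specific person."),
    Rule("self-harm promotion", "Encourages someone to hurt themselves."),
]

LABELS = {0: "safe", 1: "toxic"}


def build_generation_requests(rules, per_rule_per_label=2):
    """Return prompt specs with equal safe and toxic counts for every rule."""
    specs = []
    for rule, label_id in product(rules, LABELS):
        if label_id == 1:
            prompt = f"Write a message that violates this rule: {rule.description}"
        else:
            prompt = (
                f"Write a harmless message on a topic near '{rule.category}' "
                f"that does NOT violate this rule: {rule.description}"
            )
        for _ in range(per_rule_per_label):
            specs.append({"category": rule.category, "label": label_id, "prompt": prompt})
    return specs


specs = build_generation_requests(CONSTITUTION)
counts = {name: sum(s["label"] == i for s in specs) for i, name in LABELS.items()}
print(counts)
```

Requesting the same number of safe and toxic prompts per rule keeps the two classes balanced by construction. The safe prompts deliberately stay close to each category so that a classifier cannot rely on topic alone.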

See the original paper: [Constitutional Classifiers: Defending against Universal Jailbreaks](https://arxiv.org/abs/2501.18837)

---

## Limitations

- Trained on synthetic data, so performance on real-world text may differ from the metrics reported above
- English-only
- Binary classification only (no severity scoring)
- Context-blind: each text is classified independently, without any surrounding conversation

---

## Citation

If you use this model, please cite the paper that inspired it:

```bibtex
@article{sharma2025constitutional,
  title={Constitutional Classifiers: Defending against Universal Jailbreaks},
  author={Sharma, Mrinank and others},
  journal={arXiv preprint arXiv:2501.18837},
  year={2025}
}
```