metadata
language: en
license: apache-2.0
base_model: google/gemma-2b
tags:
  - text-classification
  - toxic-content
  - safety
  - constitutional-classifier
  - lora
  - peft
  - gemma
metrics:
  - accuracy
  - f1
model-index:
  - name: constitutional-toxic-classifier-gemma
    results:
      - task:
          type: text-classification
        metrics:
          - type: accuracy
            value: 0.8852
          - type: f1
            value: 0.902
          - type: precision
            value: 0.8984
          - type: recall
            value: 0.9057

constitutional-toxic-classifier-gemma

Constitutional toxic content classifier fine-tuned on synthetic safety data, inspired by Anthropic's Constitutional Classifiers paper.

Type: LoRA adapters only (tiny, ~10–30 MB). You need the base model google/gemma-2b and peft installed.


Model Performance

| Metric | Value |
|---|---|
| Accuracy | 0.8852 |
| F1 | 0.9020 |
| Precision | 0.8984 |
| Recall | 0.9057 |

Confusion matrix

| | Predicted Safe | Predicted Toxic |
|---|---|---|
| Actual Safe | TN = 675 | FP = 113 |
| Actual Toxic | FN = 104 | TP = 999 |
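
The headline metrics can be recomputed from the confusion matrix counts, which is a quick way to check that the two agree. A minimal sketch in plain Python, treating toxic as the positive class (the variable names are illustrative):

```python
# Counts from the confusion matrix above (positive class = toxic).
tn, fp, fn, tp = 675, 113, 104, 999

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of texts flagged toxic, the share that were toxic
recall = tp / (tp + fn)      # of toxic texts, the share that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"n={total} acc={accuracy:.4f} p={precision:.4f} r={recall:.4f} f1={f1:.4f}")
```

The same counts give a false-positive rate of 113 / 788 ≈ 0.143 on safe texts, which may matter more than accuracy if over-blocking is a concern.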

Quick Start

Install

pip install transformers peft torch

Gemma license required — accept the license at https://huggingface.co/google/gemma-2b before downloading the base model.

Load and run inference

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

BASE_MODEL = "google/gemma-2b"
ADAPTER_REPO = "secllmuser/constitutional-toxic-classifier-gemma"

# 1. Load base Gemma + LoRA adapters
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_REPO)
base = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL,
    num_labels=2,
    torch_dtype=torch.float16,   # use float32 on CPU
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, ADAPTER_REPO)
model.eval()

# 2. Run inference
text = "I will hurt you"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits

label_id = logits.argmax(-1).item()
labels   = {0: "safe", 1: "toxic"}
print(f"{text!r} -> {labels[label_id]}")

Batch inference

texts = [
    "Have a great day!",
    "I will destroy you",
    "Thanks for your help",
    "You are worthless",
]
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=256,
)
with torch.no_grad():
    logits = model(**inputs).logits

labels = {0: "safe", 1: "toxic"}
for text, pred in zip(texts, logits.argmax(-1).tolist()):
    print(f"{labels[pred]:5s}  {text!r}")
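
The examples above take the argmax of the logits, which amounts to a 0.5 cut-off on the toxic probability. If a different precision/recall trade-off is needed, the logits can be converted to probabilities and compared against a custom threshold. The sketch below is plain Python and works on rows such as those returned by `logits.float().tolist()`; the function names, the example logits, and the 0.8 threshold are illustrative assumptions, not values from this model card.

```python
import math

def toxic_probability(logit_pair):
    """Softmax over a [safe_logit, toxic_logit] pair; returns P(toxic)."""
    safe_logit, toxic_logit = logit_pair
    # Subtract the max before exponentiating for numerical stability.
    m = max(safe_logit, toxic_logit)
    e_safe = math.exp(safe_logit - m)
    e_toxic = math.exp(toxic_logit - m)
    return e_toxic / (e_safe + e_toxic)

def classify(logit_pair, threshold=0.5):
    """Label a text 'toxic' when P(toxic) reaches the threshold."""
    return "toxic" if toxic_probability(logit_pair) >= threshold else "safe"

# Hypothetical logits for two texts, e.g. rows of logits.float().tolist()
example_rows = [[2.0, -1.0], [0.2, 1.0]]
for row in example_rows:
    print(f"P(toxic)={toxic_probability(row):.3f}  "
          f"default={classify(row)}  strict={classify(row, threshold=0.8)}")
```

Raising the threshold trades recall for precision. A suitable value would have to be chosen on held-out data, since this card reports metrics only for the default decision rule.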

Training Details

| Parameter | Value |
|---|---|
| Base model | google/gemma-2b |
| Task | Binary sequence classification (safe / toxic) |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Max length | 256 |
| Learning rate | 0.0002 |
| Batch size | 8 |
| Training data | Synthetic data generated from constitutional rules |
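
The adapter size quoted at the top of this card follows from the LoRA settings listed here. Each adapted linear layer with `d_in` inputs and `d_out` outputs gains `r * (d_in + d_out)` trainable weights. The sketch below estimates the total, assuming the published Gemma 2B attention shapes (18 layers, hidden size 2048, 8 query heads and 1 key/value head of dimension 256). Those dimensions are not stated in this card and should be checked against the base model's `config.json`.

```python
# Assumed Gemma 2B dimensions (not from this card; verify in config.json).
NUM_LAYERS = 18
HIDDEN = 2048
Q_OUT = 8 * 256      # 8 query heads x head_dim 256
KV_OUT = 1 * 256     # multi-query attention: 1 key/value head

RANK = 16            # LoRA rank from the training details

# (in_features, out_features) for each target module
shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) for each adapted layer.
per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in shapes.values())
lora_params = per_layer * NUM_LAYERS

print(f"LoRA parameters: {lora_params:,}")
print(f"Approx. size: {lora_params * 2 / 1e6:.1f} MB (fp16), "
      f"{lora_params * 4 / 1e6:.1f} MB (fp32)")
```

Under these assumptions the adapters come to roughly 3.7 M parameters, about 15 MB in fp32, which is consistent with the ~10–30 MB figure. Any classification head saved alongside the adapters adds only a few thousand weights.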

Labels

| ID | Label |
|---|---|
| 0 | safe |
| 1 | toxic |

Constitutional Approach

The training data was generated using a toxicity constitution — a set of rules defining what counts as harmful content (hate speech, threats, harassment, self-harm promotion, etc.). Synthetic safe and toxic examples were generated from these rules to create balanced training data.
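
This card does not describe the generation pipeline in detail, so the following is only an illustrative sketch of how a constitution could be expanded into a balanced set of generation requests. The rule texts, the prompt wording, and the `build_generation_requests` helper are all hypothetical, and the call to a text generator is deliberately left out.

```python
from itertools import product

# Hypothetical constitution: category -> short rule text (illustrative only).
CONSTITUTION = {
    "hate_speech": "Content attacking people based on protected characteristics.",
    "threats": "Content threatening violence against a person or group.",
    "harassment": "Content that demeans or bullies an individual.",
    "self_harm": "Content that promotes or encourages self-harm.",
}

LABELS = {0: "safe", 1: "toxic"}

def build_generation_requests(constitution, per_cell=2):
    """Return per_cell requests for every (category, label) pair."""
    requests = []
    for (category, rule), label_id in product(constitution.items(), LABELS):
        stance = (
            "violates"
            if label_id == 1
            else "is on a related topic but does not violate"
        )
        for _ in range(per_cell):
            requests.append({
                "category": category,
                "label": label_id,
                "prompt": f"Write a short message that {stance} this rule: {rule}",
            })
    return requests

requests = build_generation_requests(CONSTITUTION)
counts = {name: sum(r["label"] == i for r in requests) for i, name in LABELS.items()}
print(len(requests), counts)
```

Generating the same number of requests per category and label is one simple way to get the balanced training set mentioned above. The safe examples are kept topically close to each rule so that the classifier cannot rely on topic alone.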

See the original paper: Constitutional Classifiers: Defending against Universal Jailbreaks (arXiv:2501.18837).


Limitations

  • Trained on synthetic data — real-world distribution may differ
  • English-only
  • Binary classification only (no severity scoring)
  • Context-blind: each text is classified independently

Citation

If you use this model, please cite:

@article{sharma2025constitutional,
  title={Constitutional Classifiers: Defending against Universal Jailbreaks},
  author={Sharma, Mrinank and others},
  journal={arXiv preprint arXiv:2501.18837},
  year={2025}
}