--- language: en license: apache-2.0 base_model: google/gemma-2b tags: - text-classification - toxic-content - safety - constitutional-classifier - lora - peft - gemma metrics: - accuracy - f1 model-index: - name: constitutional-toxic-classifier-gemma results: - task: type: text-classification metrics: - type: accuracy value: 0.8852 - type: f1 value: 0.9020 - type: precision value: 0.8984 - type: recall value: 0.9057 --- # constitutional-toxic-classifier-gemma Constitutional toxic content classifier fine-tuned on synthetic safety data, inspired by Anthropic's [Constitutional Classifiers paper](https://arxiv.org/abs/2501.18837). **Type**: LoRA adapters only (tiny, ~10–30 MB). You need the base model `google/gemma-2b` and `peft` installed. --- ## Model Performance | Metric | Value | |-----------|--------| | Accuracy | 0.8852 | | F1 | 0.9020 | | Precision | 0.8984 | | Recall | 0.9057 | **Confusion matrix** | | Predicted Safe | Predicted Toxic | |----------------|---------------|-----------------| | **Actual Safe** | TN = 675 | FP = 113 | | **Actual Toxic** | FN = 104 | TP = 999 | --- ## Quick Start ### Install ```bash pip install transformers peft torch ``` > **Gemma license required** — accept the license at > before downloading the base model. ### Load and run inference ```python from transformers import AutoTokenizer, AutoModelForSequenceClassification from peft import PeftModel import torch BASE_MODEL = "google/gemma-2b" ADAPTER_REPO = "secllmuser/constitutional-toxic-classifier-gemma" # 1. Load base Gemma + LoRA adapters tokenizer = AutoTokenizer.from_pretrained(ADAPTER_REPO) base = AutoModelForSequenceClassification.from_pretrained( BASE_MODEL, num_labels=2, torch_dtype=torch.float16, # use float32 on CPU trust_remote_code=True, ) model = PeftModel.from_pretrained(base, ADAPTER_REPO) model.eval() # 2. Run inference text = "I will hurt you" inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256) with torch.no_grad(): logits = model(**inputs).logits label_id = logits.argmax(-1).item() labels = {0: "safe", 1: "toxic"} print(f"{text!r} → {labels[label_id]}") ``` ### Batch inference ```python texts = [ "Have a great day!", "I will destroy you", "Thanks for your help", "You are worthless", ] inputs = tokenizer( texts, return_tensors="pt", padding=True, truncation=True, max_length=256, ) with torch.no_grad(): logits = model(**inputs).logits labels = {0: "safe", 1: "toxic"} for text, pred in zip(texts, logits.argmax(-1).tolist()): print(f"{labels[pred]:5s} {text!r}") ``` --- ## Training Details | Parameter | Value | |----------------|----------------| | Base model | `google/gemma-2b` | | Task | Binary sequence classification (safe / toxic) | | LoRA rank (r) | 16 | | LoRA alpha | 32 | | LoRA dropout | 0.1 | | Target modules | q_proj, k_proj, v_proj, o_proj | | Max length | 256 | | Learning rate | 0.0002 | | Batch size | 8 | | Training data | Synthetic data generated from constitutional rules | --- ## Labels | ID | Label | |----|-------| | 0 | safe | | 1 | toxic | --- ## Constitutional Approach The training data was generated using a **toxicity constitution** — a set of rules defining what counts as harmful content (hate speech, threats, harassment, self-harm promotion, etc.). Synthetic safe and toxic examples were generated from these rules to create balanced training data. See the original paper: [Constitutional Classifiers: Defending against Universal Jailbreaks](https://arxiv.org/abs/2501.18837) --- ## Limitations - Trained on synthetic data — real-world distribution may differ - English-only - Binary classification only (no severity scoring) - Context-blind: each text is classified independently --- ## Citation If you use this model, please cite: ```bibtex @article{sharma2025constitutional, title={Constitutional Classifiers: Defending against Universal Jailbreaks}, author={Sharma, Mrinank and others}, journal={arXiv preprint arXiv:2501.18837}, year={2025} } ```