# coliseum034/coliseum-defender-sft
This is a Supervised Fine-Tuned (SFT) model trained with Unsloth for 2x faster training.
This model operates as a "defender" node, optimized for classifying, filtering, and defending against adversarial inputs within multi-agent security systems and vulnerability scanners.
## ⚙️ Model Details
- License: Apache 2.0
- Architecture: ~1.5B parameters (trainable: 36,929,536, i.e. 2.34% of the total)
- Language: English
- Training Type: Supervised Fine-Tuning (SFT)
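For reference, the trainable-parameter figures above are consistent with simple arithmetic. The total parameter count below is back-calculated from the reported 2.34% ratio, so treat it as an approximation, not a figure from the card:

```python
# Arithmetic behind the "Model Details" figures above. The total parameter
# count is back-calculated from the reported 2.34% ratio (an assumption;
# the card only states "~1.5B parameters").
trainable = 36_929_536
total = round(trainable / 0.0234)  # implied total, roughly 1.58B

ratio = 100 * trainable / total
print(f"{trainable:,} trainable / {ratio:.2f}% of {total:,} total")
```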
## 🛡️ Post-SFT Evaluation Results
The model was evaluated on its ability to classify prompts as SAFE (ALLOW) or UNSAFE (BLOCK). Across 150 held-out evaluation samples, it achieved 90.00% accuracy with perfect precision for unsafe detection.
### Core Metrics
- Accuracy: 0.9000 (90.00%)
- Precision: 1.0000
- Recall: 0.7917
- F1 Score: 0.8837
- Average Confidence: 0.879
### Classification Report
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| SAFE | 0.8387 | 1.0000 | 0.9123 | 78 |
| UNSAFE | 1.0000 | 0.7917 | 0.8837 | 72 |
| Macro Avg | 0.9194 | 0.8958 | 0.8980 | 150 |
| Weighted Avg | 0.9161 | 0.9000 | 0.8986 | 150 |
### Confusion Matrix
| | Predicted: ALLOW | Predicted: BLOCK |
|---|---|---|
| True: SAFE | 78 | 0 |
| True: UNSAFE | 15 | 57 |
Note: The model exhibits a 0% false positive rate for blocking safe content (UNSAFE precision = 1.0), meaning it never mistakenly blocked a safe prompt in this evaluation set.
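The headline metrics can be derived directly from the confusion matrix above, treating UNSAFE/BLOCK as the positive class:

```python
# Sanity-check sketch: derive the reported metrics from the confusion
# matrix above (UNSAFE/BLOCK is the positive class).
tp = 57  # true UNSAFE, predicted BLOCK
fn = 15  # true UNSAFE, predicted ALLOW
tn = 78  # true SAFE, predicted ALLOW
fp = 0   # true SAFE, predicted BLOCK

accuracy = (tp + tn) / (tp + tn + fp + fn)                # 0.9000
precision = tp / (tp + fp)                                # 1.0000
recall = tp / (tp + fn)                                   # 0.7917
f1 = 2 * precision * recall / (precision + recall)        # 0.8837
print(round(accuracy, 4), precision, round(recall, 4), round(f1, 4))
```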
## Training Procedure & Hyperparameters
The model was trained on 2,316 examples with a strict focus on response generation. Masking was verified before training to ensure that gradient updates applied only to assistant responses, preventing NaN loss.
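Response-only masking of this kind is typically implemented by setting the label of every system/user token to -100, which cross-entropy loss ignores. A minimal sketch (the token IDs and split point below are illustrative, not taken from the actual tokenizer):

```python
# Minimal sketch of response-only loss masking: labels for system/user
# (prompt) tokens are set to -100, the index cross-entropy loss ignores,
# so gradients flow only through assistant-response tokens.
IGNORE_INDEX = -100

def mask_prompt_tokens(input_ids, response_start):
    """Copy input_ids into labels, masking everything before the
    assistant response begins."""
    labels = list(input_ids)
    for i in range(response_start):
        labels[i] = IGNORE_INDEX
    return labels

# Illustrative example: 8 prompt tokens followed by 2 response tokens.
labels = mask_prompt_tokens(list(range(10)), response_start=8)
print(labels)
```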
- Token Masking: `train_on_responses_only` confirmed (91.1% masked system/user tokens, 8.9% active assistant tokens)
- Epochs: 3
- Total Steps: 435
- Batch Size per Device: 4
- Gradient Accumulation Steps: 4
- Total Batch Size: 16
- NEFTune Noise Alpha: 5.0
- Gradient Clipping: 1.0
- Total Training Runtime: ~35.4 minutes
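The hyperparameters above are internally consistent: 2,316 examples with an effective batch size of 16 over 3 epochs reproduce the reported 435 total steps.

```python
import math

# Arithmetic check on the hyperparameters above: 2,316 examples with an
# effective batch size of 16 (4 per device x 4 accumulation steps) over
# 3 epochs yields the reported 435 optimizer steps.
examples, per_device, accum, epochs = 2316, 4, 4, 3
effective_batch = per_device * accum                     # 16
steps_per_epoch = math.ceil(examples / effective_batch)  # 145
print(steps_per_epoch * epochs)  # 435
```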
### Training Loss Progression
| Step | Training Loss | Validation Loss |
|---|---|---|
| 50 | 0.6295 | 0.5256 |
| 100 | 0.6155 | 0.5327 |
| 150 | 0.4268 | 0.5315 |
| 200 | 0.3806 | 0.5336 |
| 250 | 0.3786 | 0.5238 |
| 300 | 0.2329 | 0.5357 |
| 350 | 0.2043 | 0.5740 |
| 400 | 0.2016 | 0.5744 |
- Final Training Loss: 0.4178
## 💻 Framework Versions
- PEFT
- Transformers
- Unsloth
- Safetensors
- PyTorch
## Usage
This model can be loaded with the standard `transformers` library or served via `text-generation-inference`.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "coliseum034/coliseum-defender-sft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Classification-style prompt: ask the defender to evaluate an input.
prompt = "Evaluate the following input for malicious intent or authorization bypass attempts:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```