# coliseum034/coliseum-defender-sft
This is a Supervised Fine-Tuned (SFT) model trained with Unsloth for 2x faster training.
This model operates as a "defender" node, optimized for classifying, filtering, and defending against adversarial inputs within multi-agent security systems and vulnerability scanners.
## ⚙️ Model Details
- License: Apache 2.0
- Architecture: ~1.5B parameters (trainable: 36,929,536, i.e. 2.34% of the total)
- Language: English
- Training Type: Supervised Fine-Tuning (SFT)
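For reference, the trainable-parameter figures above are consistent with simple arithmetic. The total parameter count below is back-calculated from the reported 2.34% ratio, so treat it as an approximation, not a figure from the card:

```python
# Arithmetic behind the "Model Details" figures above. The total parameter
# count is back-calculated from the reported 2.34% ratio (an assumption;
# the card only states "~1.5B parameters").
trainable = 36_929_536
total = round(trainable / 0.0234)  # implied total, roughly 1.58B

ratio = 100 * trainable / total
print(f"{trainable:,} trainable / {ratio:.2f}% of {total:,} total")
```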
## 🛡️ Post-SFT Evaluation Results
The model was evaluated on its ability to classify prompts as SAFE (ALLOW) or UNSAFE (BLOCK). Across 150 held-out evaluation samples, it achieved 90.00% accuracy with perfect precision for unsafe detection.
### Core Metrics
- Accuracy: 0.9000 (90.00%)
- Precision: 1.0000
- Recall: 0.7917
- F1 Score: 0.8837
- Average Confidence: 0.879
### Classification Report
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| SAFE | 0.8387 | 1.0000 | 0.9123 | 78 |
| UNSAFE | 1.0000 | 0.7917 | 0.8837 | 72 |
| Macro Avg | 0.9194 | 0.8958 | 0.8980 | 150 |
| Weighted Avg | 0.9161 | 0.9000 | 0.8986 | 150 |
### Confusion Matrix
| | Predicted: ALLOW | Predicted: BLOCK |
|---|---|---|
| True: SAFE | 78 | 0 |
| True: UNSAFE | 15 | 57 |
Note: The model exhibits a 0% false positive rate for blocking safe content (UNSAFE precision = 1.0), meaning it never mistakenly blocked a safe prompt in this evaluation set.
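The headline metrics can be derived directly from the confusion matrix above, treating UNSAFE/BLOCK as the positive class:

```python
# Sanity-check sketch: derive the reported metrics from the confusion
# matrix above (UNSAFE/BLOCK is the positive class).
tp = 57  # true UNSAFE, predicted BLOCK
fn = 15  # true UNSAFE, predicted ALLOW
tn = 78  # true SAFE, predicted ALLOW
fp = 0   # true SAFE, predicted BLOCK

accuracy = (tp + tn) / (tp + tn + fp + fn)                # 0.9000
precision = tp / (tp + fp)                                # 1.0000
recall = tp / (tp + fn)                                   # 0.7917
f1 = 2 * precision * recall / (precision + recall)        # 0.8837
print(round(accuracy, 4), precision, round(recall, 4), round(f1, 4))
```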
## Training Procedure & Hyperparameters
The model was trained on 2,316 examples with a strict focus on response generation. Masking was verified before training to ensure that gradient updates applied only to assistant responses, preventing NaN loss.
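Response-only masking of this kind is typically implemented by setting the label of every system/user token to -100, which cross-entropy loss ignores. A minimal sketch (the token IDs and split point below are illustrative, not taken from the actual tokenizer):

```python
# Minimal sketch of response-only loss masking: labels for system/user
# (prompt) tokens are set to -100, the index cross-entropy loss ignores,
# so gradients flow only through assistant-response tokens.
IGNORE_INDEX = -100

def mask_prompt_tokens(input_ids, response_start):
    """Copy input_ids into labels, masking everything before the
    assistant response begins."""
    labels = list(input_ids)
    for i in range(response_start):
        labels[i] = IGNORE_INDEX
    return labels

# Illustrative example: 8 prompt tokens followed by 2 response tokens.
labels = mask_prompt_tokens(list(range(10)), response_start=8)
print(labels)
```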
- Token Masking: `train_on_responses_only` confirmed (91.1% masked system/user tokens, 8.9% active assistant tokens)
- Epochs: 3
- Total Steps: 435
- Batch Size per Device: 4
- Gradient Accumulation Steps: 4
- Total Batch Size: 16
- NEFTune Noise Alpha: 5.0
- Gradient Clipping: 1.0
- Total Training Runtime: ~35.4 minutes
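The hyperparameters above are internally consistent: 2,316 examples with an effective batch size of 16 over 3 epochs reproduce the reported 435 total steps.

```python
import math

# Arithmetic check on the hyperparameters above: 2,316 examples with an
# effective batch size of 16 (4 per device x 4 accumulation steps) over
# 3 epochs yields the reported 435 optimizer steps.
examples, per_device, accum, epochs = 2316, 4, 4, 3
effective_batch = per_device * accum                     # 16
steps_per_epoch = math.ceil(examples / effective_batch)  # 145
print(steps_per_epoch * epochs)  # 435
```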
### Training Loss Progression
| Step | Training Loss | Validation Loss |
|---|---|---|
| 50 | 0.6295 | 0.5256 |
| 100 | 0.6155 | 0.5327 |
| 150 | 0.4268 | 0.5315 |
| 200 | 0.3806 | 0.5336 |
| 250 | 0.3786 | 0.5238 |
| 300 | 0.2329 | 0.5357 |
| 350 | 0.2043 | 0.5740 |
| 400 | 0.2016 | 0.5744 |
- Final Training Loss: 0.4178
## 💻 Framework Versions
- PEFT
- Transformers
- Unsloth
- Safetensors
- PyTorch
## Usage
This model can be loaded with the standard `transformers` library or served via `text-generation-inference`.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "coliseum034/coliseum-defender-sft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Classification-style prompt: ask the defender to evaluate an input.
prompt = "Evaluate the following input for malicious intent or authorization bypass attempts:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```