coliseum034/coliseum-defender-sft

This is a Supervised Fine-Tuned (SFT) model trained with Unsloth for 2x faster training.

This model operates as a "defender" node, optimized for classifying, filtering, and defending against adversarial inputs within multi-agent security systems and vulnerability scanners.

βš™οΈ Model Details

  • License: Apache 2.0
  • Architecture: ~1.5B parameters (36,929,536 trainable, 2.34% of total)
  • Language: English
  • Training Type: Supervised Fine-Tuning (SFT)

πŸ›‘οΈ Post-SFT Evaluation Results

The model was evaluated on its ability to classify prompts as SAFE (ALLOW) or UNSAFE (BLOCK). Across 150 held-out evaluation samples, it achieved 90.00% accuracy with perfect precision on unsafe detection.

Core Metrics

  • Accuracy: 0.9000 (90.00%)
  • Precision: 1.0000
  • Recall: 0.7917
  • F1 Score: 0.8837
  • Average Confidence: 0.879

Classification Report

| Class        | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| SAFE         | 0.8387    | 1.0000 | 0.9123   | 78      |
| UNSAFE       | 1.0000    | 0.7917 | 0.8837   | 72      |
| Macro Avg    | 0.9194    | 0.8958 | 0.8980   | 150     |
| Weighted Avg | 0.9161    | 0.9000 | 0.8986   | 150     |

Confusion Matrix

|              | Predicted: ALLOW | Predicted: BLOCK |
|--------------|------------------|------------------|
| True: SAFE   | 78               | 0                |
| True: UNSAFE | 15               | 57               |

Note: A precision of 1.0 for the UNSAFE class means zero false positives for blocking: the model never mistakenly blocked a safe prompt in this evaluation set.
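The headline metrics can be reproduced directly from the confusion matrix above. A minimal sketch, treating UNSAFE (BLOCK) as the positive class:

```python
# Cell counts from the confusion matrix, UNSAFE (BLOCK) = positive class.
tp = 57  # true UNSAFE, predicted BLOCK
fn = 15  # true UNSAFE, predicted ALLOW
fp = 0   # true SAFE, predicted BLOCK
tn = 78  # true SAFE, predicted ALLOW

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9000 precision=1.0000 recall=0.7917 f1=0.8837
```

This matches the UNSAFE row of the classification report: with zero false positives, precision is exactly 1.0, and the 15 missed unsafe prompts pull recall down to 57/72.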

πŸ“Š Training Procedure & Hyperparameters

The model was trained on 2,316 examples with a strict focus on response generation. Masking was verified before training to ensure gradient updates applied only to assistant responses, preventing NaN loss.

  • Token Masking: train_on_responses_only confirmed (91.1% masked system/user tokens, 8.9% active assistant tokens).
  • Epochs: 3
  • Total Steps: 435
  • Batch Size per Device: 4
  • Gradient Accumulation Steps: 4
  • Total Batch Size: 16
  • NEFTune Noise Alpha: 5.0
  • Gradient Clipping: 1.0
  • Total Training Runtime: ~35.4 minutes
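Response-only training works by setting the label of every system/user token to the loss-ignore index (-100 in PyTorch cross-entropy), so only assistant tokens contribute gradients. A framework-free sketch of the idea (the token IDs and span positions below are illustrative, not from the actual tokenizer):

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy skips targets with this value

def mask_non_assistant(token_ids, assistant_spans):
    """Return labels where only tokens inside assistant spans are trained on.

    assistant_spans: list of (start, end) index pairs, end exclusive.
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in assistant_spans:
        labels[start:end] = token_ids[start:end]
    return labels

# Illustrative sequence: 10 system/user tokens followed by 2 assistant tokens.
tokens = list(range(100, 112))
labels = mask_non_assistant(tokens, [(10, 12)])
active = sum(label != IGNORE_INDEX for label in labels)
print(f"{active}/{len(labels)} tokens active")  # 2/12 tokens active
```

In the actual run, Unsloth's `train_on_responses_only` performed this masking, leaving 8.9% of tokens active, as noted above.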

Training Loss Progression

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 0.6295        | 0.5256          |
| 100  | 0.6155        | 0.5327          |
| 150  | 0.4268        | 0.5315          |
| 200  | 0.3806        | 0.5336          |
| 250  | 0.3786        | 0.5238          |
| 300  | 0.2329        | 0.5357          |
| 350  | 0.2043        | 0.5740          |
| 400  | 0.2016        | 0.5744          |
  • Final Training Loss: 0.4178

πŸ’» Framework Versions

  • PEFT
  • Transformers
  • Unsloth
  • Safetensors
  • PyTorch

πŸš€ Usage

This model can be loaded with the standard transformers library, or served via text-generation-inference.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "coliseum034/coliseum-defender-sft"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Evaluate the following input for malicious intent or authorization bypass attempts:"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
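Downstream code in a multi-agent pipeline typically needs a machine-readable verdict rather than free text. A hypothetical post-processing helper, assuming the model emits one of the SAFE/ALLOW or UNSAFE/BLOCK keywords used in evaluation (the exact output format is not guaranteed):

```python
def parse_verdict(generated_text: str) -> str:
    """Map the defender's free-text output to an ALLOW/BLOCK decision.

    Assumes the verdict keywords from the evaluation setup appear in the
    output; defaults to BLOCK (fail-closed) when none is found.
    """
    text = generated_text.upper()
    # Check UNSAFE before SAFE, since "UNSAFE" contains "SAFE" as a substring.
    if "UNSAFE" in text or "BLOCK" in text:
        return "BLOCK"
    if "SAFE" in text or "ALLOW" in text:
        return "ALLOW"
    return "BLOCK"  # fail closed on unrecognized output

print(parse_verdict("Verdict: SAFE (ALLOW)"))   # ALLOW
print(parse_verdict("Classification: UNSAFE"))  # BLOCK
```

Failing closed (defaulting to BLOCK) is a deliberate choice for a defender node: an unparseable verdict is treated as a block rather than a pass.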