Oracolo is a DeBERTa-based content moderation model trained to detect harmful content across multiple safety categories in Italian and English. The model analyzes both questions and answers in conversational contexts to determine potential safety risks.
The model was trained on the BeaverTails dataset, which contains labeled examples of safe and unsafe conversational content.
Comparison with PKU-Alignment/beaver-dam-7b on the test set:
| Model | Precision | Recall | F1-Score |
|---|---|---|---|
| beaver-dam-7b | 0.85 | 0.85 | 0.85 |
| Oracolo (DeBERTa) | 0.77 | 0.88 | 0.82 |
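As a sanity check, the F1 scores in the table are the harmonic mean of precision and recall; a minimal sketch reproducing them:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values from the comparison table above
print(round(f1_score(0.85, 0.85), 2))  # beaver-dam-7b -> 0.85
print(round(f1_score(0.77, 0.88), 2))  # Oracolo       -> 0.82
```

Oracolo trades some precision for higher recall, which is often the preferred operating point for moderation, where missing harmful content is costlier than a false positive.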
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

def preprocess_text(prompt, response=""):
    """Format text in the same way as during training."""
    return f"<prompt> {prompt} </prompt> <response> {response} </response>"

model = AutoModelForSequenceClassification.from_pretrained("path/to/oracolo")
tokenizer = AutoTokenizer.from_pretrained("path/to/oracolo")
model.eval()

prompt = "How do I make a walkway slippery?"
response = "I cannot provide advice that could lead to harm."

formatted_text = preprocess_text(prompt, response)
inputs = tokenizer(formatted_text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    # Unpack the tokenizer output into keyword arguments (input_ids, attention_mask, ...)
    outputs = model(**inputs)

# Sigmoid per category: this is a multi-label classifier
predictions = torch.sigmoid(outputs.logits).cpu().numpy()[0]

# Apply threshold (0.3 recommended based on validation)
class_predictions = (predictions > 0.3).astype(int)
```
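Because each category gets an independent sigmoid, the thresholding step works as in this self-contained sketch. The category names below are placeholders, not the model's actual label set (read `model.config.id2label` for the real mapping):

```python
import math

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

# Hypothetical raw logits, one per safety category (placeholder names)
logits = {"violence": -2.1, "self_harm": 0.4, "hate_speech": -3.5}

# Same 0.3 threshold as in the snippet above
flagged = [name for name, logit in logits.items() if sigmoid(logit) > 0.3]
print(flagged)  # only categories whose probability exceeds 0.3
```

A lower threshold like 0.3 (rather than the default 0.5) raises recall at some cost in precision, consistent with the metrics in the comparison table.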
Base model: `microsoft/mdeberta-v3-base`