TRACT: Transitive Reconciliation and Assignment of CRE Taxonomies
What Is This?
In plain English: Security frameworks like NIST 800-53, OWASP ASVS, and MITRE ATLAS each describe security controls in their own language. For example, NIST might say "The system enforces password complexity requirements" while OWASP says "Verify that passwords have a minimum length of 12 characters." These two controls are about the same thing, but they use different words and numbering systems.
OpenCRE is a public taxonomy that acts as a Rosetta Stone for security frameworks -- it organizes security concepts into ~522 "hubs" (topics like "Password policy", "Input validation", "Access control") and maps controls from different frameworks to these hubs.
TRACT is an AI model that automates this mapping. Give it any security control text, and it tells you which CRE hub(s) that control belongs to. This saves hundreds of hours of manual expert work when onboarding a new security framework.
Who is this for?
- Security professionals mapping frameworks for compliance crosswalks
- GRC (Governance, Risk, Compliance) teams harmonizing multiple standards
- Researchers studying relationships across security taxonomies
- Tool builders who need automated framework-to-framework translation
Quick Start
Installation
pip install sentence-transformers numpy
Basic Usage
from sentence_transformers import SentenceTransformer
import numpy as np, json
# Load the model and its bundled data
model = SentenceTransformer("rockCO78/tract-cre-assignment")
hub_ids = json.load(open("hub_ids.json"))
hub_emb = np.load("hub_embeddings.npy")
# Predict: what CRE hub does this control belong to?
query = model.encode(["Enforce password complexity requirements"], normalize_embeddings=True)
sims = (query @ hub_emb.T)[0]
for idx in np.argsort(sims)[-5:][::-1]:
    print(f" {hub_ids[idx]}: {sims[idx]:.3f}")
Full Inference with Calibration (Recommended)
The bundled predict.py script handles text sanitization, temperature-scaled confidence scores, and out-of-distribution detection:
# Single control
python predict.py "Ensure AI models are tested for adversarial robustness"
# Batch from file (one control per line)
python predict.py --file controls.txt --top-k 10
# JSON output for programmatic use
python predict.py --file controls.txt --top-k 5 --json
Example output:
555-083 (0.342) Testing against backdoor poisoning
011-322 (0.218) Testing against evasion
663-550 (0.147) Testing against model theft by inference
130-171 (0.089) Runtime model io integrity controls
234-123 (0.064) Weakening training set backdoors
Each line shows: hub_id (calibrated_confidence) hub_name. Higher confidence = stronger match. An [OOD] flag appears when the input is too dissimilar to anything the model has seen (see Out-of-Distribution Detection below).
How It Works
The Assignment Paradigm
TRACT uses an assignment approach, not a pairwise comparison:
g(control_text) --> CRE_hub_position
Each control is independently mapped to the CRE ontology. The model never compares two controls directly ("is control A similar to control B?"). Instead, it asks: "where in the CRE taxonomy does this control belong?"
This matters because:
- Scalability: Adding a new framework requires encoding its controls once, not comparing them against every existing control
- Consistency: The CRE hub assignment is independent of what other frameworks exist
- Transitivity: If NIST control X maps to hub H, and OWASP control Y also maps to hub H, then X and Y are implicitly related -- without ever comparing them directly
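To make the transitivity point concrete, here is a minimal sketch built on the bundled files from the Quick Start. The two control texts are the illustrative NIST- and ASVS-style examples from the introduction; whether they land on the same hub depends on the model, so the shared-hub check is guarded:
from sentence_transformers import SentenceTransformer
import numpy as np, json
model = SentenceTransformer("rockCO78/tract-cre-assignment")
hub_ids = json.load(open("hub_ids.json"))
hub_emb = np.load("hub_embeddings.npy")
def assign(text):
    # Independently map one control to its nearest CRE hub (no pairwise comparison)
    q = model.encode([text], normalize_embeddings=True)
    return hub_ids[int(np.argmax((q @ hub_emb.T)[0]))]
# Two controls from different frameworks, never compared to each other
hub_a = assign("The system enforces password complexity requirements")          # NIST-style wording
hub_b = assign("Verify that passwords have a minimum length of 12 characters")  # ASVS-style wording
# Transitive reconciliation: sharing a hub makes the controls implicitly related
if hub_a == hub_b:
    print(f"Both controls map to hub {hub_a}, so they are related via the CRE taxonomy")
Nothing in this snippet compares the two controls to each other; the relationship falls out of their independent hub assignments.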
Architecture (Technical)
Input text --> [Tokenizer] --> [BGE-large-en-v1.5 + LoRA] --> 1024-dim embedding
        |
        |  dot product with pre-computed hub embeddings (522 x 1024)
        v
cosine similarity scores
        |
        |  temperature scaling (T=0.0738)
        v
calibrated confidence
        |
        |  OOD check (threshold=0.568)
        v
ranked predictions
- Base model: BAAI/bge-large-en-v1.5 (335M parameters, 1024-dimensional embeddings)
- Fine-tuning method: LoRA (Low-Rank Adaptation) -- rank=16, alpha=32, dropout=0.1, applied to query/key/value attention matrices
- This release contains fully merged weights -- no adapter files needed, loads like any SentenceTransformer model
- Training objective: MNRL (Multiple Negatives Ranking Loss) with contrastive learning -- the model learns to place controls close to their correct hub and far from incorrect hubs in embedding space
- Text-aware batch sampling: Training batches group semantically similar controls together, creating harder negatives that force the model to make finer distinctions
- Training data: 4,237 framework-to-hub links from 22 OpenCRE-linked frameworks, producing 4,061 training pairs after deduplication
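As a rough illustration of that recipe, the sketch below wires LoRA and MNRL together using the standard peft and sentence-transformers APIs. It is not the project's training code: batch size and epoch count are placeholders, and the custom text-aware batch sampler and LOFO machinery are omitted -- see train.py and the TRACT repository for the real configuration.
from peft import LoraConfig, TaskType
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# LoRA hyperparameters from the list above; module names assume the
# BERT-style attention layers used by BGE-large-en-v1.5
lora_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=16, lora_alpha=32, lora_dropout=0.1,
    target_modules=["query", "key", "value"],
)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
model.add_adapter(lora_config)  # needs a recent sentence-transformers release with peft installed
# MNRL: each (control_text, hub_text) pair is a positive; every other item
# in the batch acts as an in-batch negative
train_examples = [
    InputExample(texts=["Enforce password complexity requirements", "Password policy"]),
    # ... one pair per framework-to-hub link (4,061 after deduplication)
]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)  # placeholder batch size
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)  # placeholder epoch count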
Evaluation
What Is LOFO Cross-Validation?
Standard train/test splits would leak information: if OWASP ASVS controls appear in both training and test sets, the model could memorize ASVS-specific language rather than learning general security concepts.
Leave-One-Framework-Out (LOFO) is stricter. For each evaluation fold:
- One entire framework is held out (e.g., all MITRE ATLAS controls)
- The model is trained on the remaining frameworks
- Hub firewall: Hub representations are rebuilt WITHOUT the held-out framework's data -- this prevents the model from "remembering" the held-out framework's contributions to hub embeddings
- The model predicts hub assignments for the held-out framework's controls
This simulates the real use case: mapping a brand-new framework the model has never seen.
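Schematically, the LOFO loop looks like the sketch below. It assumes the framework-to-hub links are available as simple records; train_model, build_hub_embeddings, and predict are hypothetical stand-ins for the actual training, hub-embedding, and inference code in the TRACT repository:
import numpy as np
def lofo_evaluate(links, frameworks, train_model, build_hub_embeddings, predict):
    # links: list of {"framework": ..., "control_text": ..., "hub_id": ...} records (assumed shape)
    for held_out in frameworks:
        train = [l for l in links if l["framework"] != held_out]
        test = [l for l in links if l["framework"] == held_out]
        model = train_model(train)  # fine-tuned only on the remaining frameworks
        # Hub firewall: hub embeddings are rebuilt from the training links only, so the
        # held-out framework contributes nothing to the hub representations
        hub_ids, hub_emb = build_hub_embeddings(model, train)
        preds = [predict(model, hub_ids, hub_emb, l["control_text"]) for l in test]
        hit1 = float(np.mean([p == l["hub_id"] for p, l in zip(preds, test)]))
        yield held_out, hit1, len(test)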
Results
| Fold | hit@1 | Zero-shot | Delta | hit@any | n |
|---|---|---|---|---|---|
| MITRE ATLAS | 0.279 | 0.273 | +0.006 | 0.279 | 43 |
| NIST AI 100-2 | 0.429 | 0.107 | +0.322 | 0.429 | 28 |
| OWASP AI Exchange | 0.762 | 0.619 | +0.143 | 0.762 | 63 |
| OWASP Top10 for LLM | 0.333 | 0.333 | +0.000 | 0.333 | 6 |
| OWASP Top10 for ML | 0.714 | 0.429 | +0.285 | 0.714 | 7 |
| Micro average | 0.537 | 0.400 | +0.138 | 0.537 | 147 |
Reading this table:
- hit@1: The model's top prediction matches the correct hub (strict accuracy)
- Zero-shot: Accuracy of the base model before any fine-tuning
- Delta: How much fine-tuning helped (positive = improvement)
- hit@any: The model's top prediction matches any of the control's correct hubs (since ~35% of controls belong to more than one hub, this is a fairer measure; a computation sketch follows below)
- n: Number of controls in that framework's test set
What the numbers mean:
- OWASP AI Exchange (76.2%): Strong performance -- the model correctly assigns 3 out of 4 AI security controls to their right hub on the first try
- MITRE ATLAS (27.9%): Weakest fold. ATLAS techniques are highly specific ("Adversarial Perturbation" vs. "Data Poisoning") and map to closely related hubs that are hard to disambiguate. The model often picks a neighboring hub rather than the exact one
- Micro average (53.7%): Overall, the model's top prediction is correct slightly more than half the time; for the roughly one in three controls that legitimately map to multiple hubs, the hit@any column is the fairer yardstick
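For reference, the two accuracy columns can be computed as below once each control's correct hub(s) are known. This is a sketch of the metric definitions, not the project's evaluation code; treating one gold hub as the single "primary" hub for hit@1 is an assumption here:
import numpy as np
def hit_metrics(predictions, primary_hubs, gold_hub_sets):
    # predictions: top-ranked hub ID per control
    # primary_hubs: one designated correct hub per control (assumption), used for hit@1
    # gold_hub_sets: all correct hubs per control, used for hit@any
    hit1 = float(np.mean([p == g for p, g in zip(predictions, primary_hubs)]))
    hit_any = float(np.mean([p in gold for p, gold in zip(predictions, gold_hub_sets)]))
    return hit1, hit_any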
Confidence Intervals
All metrics include bootstrap confidence intervals (10,000 resamples, 95% CI). The aggregate hit@1 CI is [0.462, 0.612], reflecting the relatively small evaluation set (147 controls across 5 AI frameworks).
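A percentile bootstrap over per-control hit indicators is enough to reproduce this kind of interval; the helper below is illustrative, not the project's evaluation code:
import numpy as np
def bootstrap_ci(correct, n_resamples=10_000, alpha=0.05, seed=0):
    # correct: 0/1 array with one entry per evaluated control (e.g., hit@1 indicators)
    correct = np.asarray(correct)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(correct), size=(n_resamples, len(correct)))
    means = correct[idx].mean(axis=1)
    return float(np.quantile(means, alpha / 2)), float(np.quantile(means, 1 - alpha / 2))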
Calibration: Understanding Confidence Scores
What Is Calibration?
Raw model outputs are cosine similarities (how close two vectors are). These are useful for ranking (higher = better match) but are NOT probabilities. A score of 0.85 does not mean "85% chance this is correct."
TRACT applies temperature scaling to convert rankings into better-calibrated confidence scores:
confidence = softmax(similarity / T)
where T=0.0738 (learned from a held-out calibration set of 420 traditional framework controls).
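One standard way to learn such a temperature is to minimize the negative log-likelihood of the correct hub on the calibration set. The sketch below (which additionally needs scipy) shows that general recipe; the project's exact fitting procedure may differ in detail:
import numpy as np
from scipy.optimize import minimize_scalar
def fit_temperature(sims, gold_idx):
    # sims: (n_items, n_hubs) cosine similarities on the calibration set
    # gold_idx: index of the correct hub for each item
    def nll(t):
        z = sims / t
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(gold_idx)), gold_idx].mean()
    return minimize_scalar(nll, bounds=(1e-3, 1.0), method="bounded").x
The deployed value is stored as t_deploy in the bundled calibration.json.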
Calibration Metrics
| Metric | Value | What It Means |
|---|---|---|
| Temperature (T) | 0.0738 | Sharpens the similarity distribution -- small T means the model is very "peaky" (strongly favors top matches) |
| ECE | 0.079 (95% CI [0.049, 0.111]) | Expected Calibration Error -- how far confidence scores deviate from true accuracy. 0.0 = perfectly calibrated. 0.079 means scores are off by ~8 percentage points on average |
| OOD threshold | 0.568 | If the maximum similarity is below this, the input is likely outside the model's knowledge (see below) |
| Conformal quantile | 0.9971 | 99.7% of correct predictions fall above this similarity threshold |
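For readers who want to compute ECE on their own data, a common equal-width-bin implementation looks like this (a sketch of the metric's definition; the reported 0.079 comes from the TRACT evaluation, whose binning details may differ):
import numpy as np
def expected_calibration_error(confidence, correct, n_bins=10):
    # confidence: top-1 calibrated confidence per item; correct: 0/1 per item
    confidence, correct = np.asarray(confidence), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidence[in_bin].mean())
    return ece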
Out-of-Distribution Detection
When you give the model text that is completely unrelated to security (e.g., a recipe or a news article), it will still produce predictions -- but they will all have low similarity scores. The model flags inputs as out-of-distribution (OOD) when:
max(similarity_to_any_hub) < 0.568
OOD predictions are marked with [OOD] in the output. Treat OOD predictions with extra skepticism -- they indicate the model is guessing rather than making an informed assignment.
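The gate itself is just a threshold on the best similarity. This snippet mirrors what predict.py and the detailed examples further down do, reading the threshold from the bundled calibration.json:
import json
import numpy as np
cal = json.load(open("calibration.json"))
def is_ood(sims):
    # sims: similarities of one input against all 522 hubs; threshold is 0.568 in this release
    return float(np.max(sims)) < cal["ood_threshold"]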
Bridge Analysis: Connecting AI and Traditional Security
Background
The CRE ontology contains 522 hubs. Some hubs are linked only by AI security frameworks (like MITRE ATLAS), some only by traditional frameworks (like NIST 800-53), and some by both:
| Category | Count | Example |
|---|---|---|
| AI-only | 21 | "Testing against evasion," "GenAI model alignment" |
| Traditional-only | 382 | "Input validation," "Password policy" |
| Naturally bridged (both) | 60 | "Data poisoning" (linked by both ATLAS and CWE) |
| Unlinked (structural) | 59 | Internal grouping nodes without framework links |
What Bridge Analysis Does
For the 21 AI-only hubs, the model identifies which traditional hubs are conceptually closest using embedding similarity. For example:
"Human AI oversight" (AI-only) ←→ "Security governance regarding people" (traditional) Cosine similarity: 0.774
Both hubs are about the same core concept: humans must remain accountable for security decisions, whether in AI systems or traditional security programs.
Method and Review Process
- Compute similarity matrix: 21 AI-only hubs x 382 traditional-only hubs (8,022 pairs)
- Extract top-3: For each AI-only hub, take the 3 most similar traditional hubs (63 candidates total)
- Expert review: A human security expert reviewed all 63 candidates and accepted or rejected each based on domain knowledge -- the similarity score is a ranking signal, not an automatic classifier
- Acceptance threshold: Candidates above the 99th percentile of the full similarity matrix (cosine >= 0.45) were considered; 4 additional candidates were rejected for specious LLM-rationalized connections
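A sketch of the candidate-generation step (items 1 and 2 above): it assumes you already know which hub indices are AI-only and which are traditional-only -- that categorization comes from framework links in the TRACT pipeline and is not a bundled field, so the two index lists are passed in:
import numpy as np
def bridge_candidates(hub_ids, hub_emb, ai_only_idx, trad_only_idx, k=3):
    # hub_emb rows are unit-normalized, so the matrix product gives cosine similarities
    sims = hub_emb[ai_only_idx] @ hub_emb[trad_only_idx].T  # e.g. 21 x 382
    candidates = []
    for row, ai in zip(sims, ai_only_idx):
        for j in np.argsort(row)[-k:][::-1]:  # k closest traditional hubs per AI-only hub
            candidates.append((hub_ids[ai], hub_ids[trad_only_idx[j]], float(row[j])))
    return candidates  # 63 candidates at k=3, handed to the expert for review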
Results
- Candidates evaluated: 63
- Accepted bridges: 46 (recorded as bidirectional related_hub_ids in the hierarchy)
- Rejected: 17
Accepted bridges are stored in cre_hierarchy.json as related_hub_ids. They represent lateral conceptual connections between AI and traditional security -- they do not change the hierarchical structure, model weights, or calibration.
Full bridge evidence, similarity scores, and review decisions are in bridge_report.json.
Bundled Files
This repository contains the model plus all data needed for standalone inference:
| File | Size | Purpose |
|---|---|---|
| 0_Transformer/model.safetensors | ~1.3 GB | Fully merged model weights (BGE-large + LoRA, no adapter needed) |
| predict.py | ~5 KB | Standalone inference script -- run without installing TRACT |
| train.py | ~3 KB | Reproduction guide with exact hyperparameters |
| hub_ids.json | ~12 KB | Ordered list of 522 hub IDs matching model output dimensions |
| hub_embeddings.npy | ~2 MB | Pre-computed 522 x 1024 hub embedding matrix |
| cre_hierarchy.json | ~800 KB | Full CRE taxonomy tree with bridge links |
| hub_descriptions.json | ~200 KB | Human-readable descriptions for each hub |
| calibration.json | ~1 KB | Temperature, OOD threshold, conformal quantile |
| bridge_report.json | ~15 KB | Bridge analysis evidence and review decisions |
Reproducing the Model
See train.py for the exact configuration. Full reproduction requires cloning the TRACT repository which contains custom training procedures (text-aware batch sampling, LOFO cross-validation with hub firewall, temperature-scaled contrastive loss).
Detailed Usage Examples
Example 1: Map a Single Control
from sentence_transformers import SentenceTransformer
import numpy as np
import json
# Load everything
model = SentenceTransformer("rockCO78/tract-cre-assignment")
hub_ids = json.load(open("hub_ids.json"))
hub_emb = np.load("hub_embeddings.npy") # shape: (522, 1024)
hierarchy = json.load(open("cre_hierarchy.json"))
cal = json.load(open("calibration.json"))
# Encode your control text (normalize_embeddings=True is required)
text = "The application must validate all user input before processing"
query = model.encode([text], normalize_embeddings=True) # shape: (1, 1024)
# Compute similarities (dot product = cosine for unit vectors)
sims = (query @ hub_emb.T)[0] # shape: (522,)
# Apply temperature scaling for calibrated confidence
def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()
confidence = softmax(sims / cal["t_deploy"])
# Get top-5 predictions
top5 = np.argsort(confidence)[-5:][::-1]
for idx in top5:
    hid = hub_ids[idx]
    hub = hierarchy["hubs"][hid]
    ood = " [OOD]" if float(np.max(sims)) < cal["ood_threshold"] else ""
    print(f" {hid} ({confidence[idx]:.3f}){ood} {hub['name']}")
    print(f"   Path: {hub['hierarchy_path']}")
Example 2: Batch-Map an Entire Framework
import json
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("rockCO78/tract-cre-assignment")
hub_ids = json.load(open("hub_ids.json"))
hub_emb = np.load("hub_embeddings.npy")
# Your framework controls (e.g., parsed from a CSV or JSON)
controls = [
    {"id": "AC-1", "text": "Access control policy and procedures"},
    {"id": "AC-2", "text": "Account management and provisioning"},
    {"id": "IA-5", "text": "Authenticator management including password rules"},
]
# Encode all controls at once (much faster than one at a time)
texts = [c["text"] for c in controls]
embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=True)
# Compute all similarities in one matrix multiply
all_sims = embeddings @ hub_emb.T # shape: (n_controls, 522)
# Build crosswalk
crosswalk = []
for i, ctrl in enumerate(controls):
    top_idx = int(np.argmax(all_sims[i]))
    crosswalk.append({
        "control_id": ctrl["id"],
        "control_text": ctrl["text"],
        "predicted_hub": hub_ids[top_idx],
        "similarity": round(float(all_sims[i, top_idx]), 4),
    })
# Save as JSON
with open("crosswalk.json", "w") as f:
    json.dump(crosswalk, f, indent=2)
Example 3: Find Related Hubs via Bridges
import json
hierarchy = json.load(open("cre_hierarchy.json"))
# Find all AI/traditional bridge connections
for hub_id, hub in hierarchy["hubs"].items():
    related = hub.get("related_hub_ids", [])
    if related:
        print(f"{hub['name']} ({hub_id})")
        for rid in related:
            rhub = hierarchy["hubs"][rid]
            print(f" <-> {rhub['name']} ({rid})")
        print()
Limitations and Known Issues
ATLAS fold performance (27.9% hit@1): MITRE ATLAS techniques map to closely related hubs (e.g., "Data Poisoning" vs. "Adversarial Perturbation") that are hard to disambiguate, so the model often predicts a neighboring hub rather than the exact one. hit@5 on this fold is also 27.9%, so widening to the top 5 does not recover the correct hub here.
Multi-hub controls (35%): About 1 in 3 controls legitimately maps to more than one hub. hit@1 alone understates performance -- the hit@any column in the evaluation table is a fairer measure.
Calibration is approximate: ECE=0.079 means confidence scores are off by ~8 percentage points on average. Treat them as ordinal rankings (higher = better), not as exact probabilities.
Training data scope: Calibrated on 420 traditional framework holdout items. Accuracy on AI-specific text may differ from the reported metrics, especially for concepts not well-represented in the 5 AI frameworks.
Not a replacement for expert judgment: Model predictions are a starting point for compliance crosswalks. A security professional should review all assignments, especially for high-stakes compliance work.
Language: English only. The base model (BGE-large-en-v1.5) and all training data are English.
What does NOT work for this task: DeBERTa-v3-NLI achieves hit@1=0.000 -- Natural Language Inference (textual entailment) is fundamentally different from semantic similarity for taxonomy assignment. Do not substitute NLI models.
Ethical Considerations
- This model is a decision-support tool, not an autonomous compliance engine. All predictions require human review before use in security assessments or regulatory filings.
- The model was trained on publicly available security framework data. No proprietary or confidential data was used.
- Active learning rounds during development used expert-reviewed predictions, not autonomous deployment.
- Bridge analysis connections were individually reviewed by a human security expert; automated connections were not added without review.
Environmental Impact
- Training compute: NVIDIA H100 GPU via RunPod, 4.2 GPU-hours total (including LOFO cross-validation, ablation studies, and final deployment model)
- Inference deployment: Runs on an NVIDIA Jetson Orin AGX edge device (~30W TDP). A single control prediction takes <100ms on consumer hardware.
- Carbon context: Estimated 1.3 kWh training energy (US average grid: ~0.5 kg CO2e)
Glossary
| Term | Definition |
|---|---|
| CRE | Common Requirements Enumeration -- a universal taxonomy of security topics maintained by OpenCRE.org |
| Hub | A node in the CRE taxonomy tree representing a security concept (e.g., "Input validation," "Access control") |
| LOFO | Leave-One-Framework-Out -- cross-validation method where an entire framework is held out for testing |
| Hub firewall | During LOFO evaluation, hub embeddings are rebuilt WITHOUT the held-out framework to prevent information leakage |
| hit@1 | The model's single best prediction matches the correct hub |
| hit@any | The model's top prediction matches ANY of the control's correct hubs (relevant for multi-hub controls) |
| ECE | Expected Calibration Error -- measures how well confidence scores match actual accuracy |
| OOD | Out-of-Distribution -- input text is too different from training data for reliable prediction |
| LoRA | Low-Rank Adaptation -- an efficient fine-tuning method that trains small adapter matrices instead of modifying all model weights |
| Bridge | A discovered conceptual connection between an AI-specific and a traditional CRE hub |
| Temperature scaling | A post-hoc calibration technique that sharpens or smooths the model's output distribution |
Citation
@software{tract2026,
title = {TRACT: Transitive Reconciliation and Assignment of CRE Taxonomies},
author = {Rock},
year = {2026},
url = {https://github.com/rockcyber/TRACT}
}
License
MIT License for model weights and code. The base model (BAAI/bge-large-en-v1.5) is also MIT licensed.
Bundled data files (CRE hierarchy, hub descriptions, bridge report) are sourced from publicly available security frameworks and OpenCRE.org, provided under CC0 1.0 Universal.