TRACT: Transitive Reconciliation and Assignment of CRE Taxonomies
What Is This?
In plain English: Security frameworks like NIST 800-53, OWASP ASVS, and MITRE ATLAS each describe security controls in their own language. For example, NIST might say "The system enforces password complexity requirements" while OWASP says "Verify that passwords have a minimum length of 12 characters." These two controls are about the same thing, but they use different words and numbering systems.
OpenCRE is a public taxonomy that acts as a Rosetta Stone for security frameworks -- it organizes security concepts into ~522 "hubs" (topics like "Password policy", "Input validation", "Access control") and maps controls from different frameworks to these hubs.
TRACT is an AI model that automates this mapping. Give it any security control text, and it tells you which CRE hub(s) that control belongs to. This saves hundreds of hours of manual expert work when onboarding a new security framework.
Who is this for?
- Security professionals mapping frameworks for compliance crosswalks
- GRC (Governance, Risk, Compliance) teams harmonizing multiple standards
- Researchers studying relationships across security taxonomies
- Tool builders who need automated framework-to-framework translation
Quick Start
Installation
pip install sentence-transformers numpy
Basic Usage
from sentence_transformers import SentenceTransformer
import numpy as np, json
# Load the model and its bundled data
model = SentenceTransformer("rockCO78/tract-cre-assignment")
hub_ids = json.load(open("hub_ids.json"))
hub_emb = np.load("hub_embeddings.npy")
# Predict: what CRE hub does this control belong to?
query = model.encode(["Enforce password complexity requirements"], normalize_embeddings=True)
sims = (query @ hub_emb.T)[0]
for idx in np.argsort(sims)[-5:][::-1]:
    print(f" {hub_ids[idx]}: {sims[idx]:.3f}")
Full Inference with Calibration (Recommended)
The bundled predict.py script handles text sanitization, temperature-scaled confidence scores, and out-of-distribution detection:
# Single control
python predict.py "Ensure AI models are tested for adversarial robustness"
# Batch from file (one control per line)
python predict.py --file controls.txt --top-k 10
# JSON output for programmatic use
python predict.py --file controls.txt --top-k 5 --json
Example output:
555-083 (0.342) Testing against backdoor poisoning
011-322 (0.218) Testing against evasion
663-550 (0.147) Testing against model theft by inference
130-171 (0.089) Runtime model io integrity controls
234-123 (0.064) Weakening training set backdoors
Each line shows: hub_id (calibrated_confidence) hub_name. Higher confidence = stronger match. An [OOD] flag appears when the input is too dissimilar to anything the model has seen (see Out-of-Distribution Detection below).
How It Works
The Assignment Paradigm
TRACT uses an assignment approach, not a pairwise comparison:
g(control_text) --> CRE_hub_position
Each control is independently mapped to the CRE ontology. The model never compares two controls directly ("is control A similar to control B?"). Instead, it asks: "where in the CRE taxonomy does this control belong?"
This matters because:
- Scalability: Adding a new framework requires encoding its controls once, not comparing them against every existing control
- Consistency: The CRE hub assignment is independent of what other frameworks exist
- Transitivity: If NIST control X maps to hub H, and OWASP control Y also maps to hub H, then X and Y are implicitly related -- without ever comparing them directly
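To make the transitivity point concrete, here is a minimal sketch built on the bundled files from the Quick Start. The two control texts are the illustrative NIST- and ASVS-style examples from the introduction; whether they land on the same hub depends on the model, so the shared-hub check is guarded:
from sentence_transformers import SentenceTransformer
import numpy as np, json
model = SentenceTransformer("rockCO78/tract-cre-assignment")
hub_ids = json.load(open("hub_ids.json"))
hub_emb = np.load("hub_embeddings.npy")
def assign(text):
    # Independently map one control to its nearest CRE hub (no pairwise comparison)
    q = model.encode([text], normalize_embeddings=True)
    return hub_ids[int(np.argmax((q @ hub_emb.T)[0]))]
# Two controls from different frameworks, never compared to each other
hub_a = assign("The system enforces password complexity requirements")          # NIST-style wording
hub_b = assign("Verify that passwords have a minimum length of 12 characters")  # ASVS-style wording
# Transitive reconciliation: sharing a hub makes the controls implicitly related
if hub_a == hub_b:
    print(f"Both controls map to hub {hub_a}, so they are related via the CRE taxonomy")
Nothing in this snippet compares the two controls to each other; the relationship falls out of their independent hub assignments.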
Architecture (Technical)
Input text --> [Tokenizer] --> [BGE-large-en-v1.5 + LoRA] --> 1024-dim embedding
        |
        |  dot product with pre-computed hub embeddings (522 x 1024)
        v
cosine similarity scores
        |
        |  temperature scaling (T=0.0738)
        v
calibrated confidence
        |
        |  OOD check (threshold=0.568)
        v
ranked predictions
- Base model: BAAI/bge-large-en-v1.5 (335M parameters, 1024-dimensional embeddings)
- Fine-tuning method: LoRA (Low-Rank Adaptation) -- rank=16, alpha=32, dropout=0.1, applied to query/key/value attention matrices
- This release contains fully merged weights -- no adapter files needed, loads like any SentenceTransformer model
- Training objective: MNRL (Multiple Negatives Ranking Loss) with contrastive learning -- the model learns to place controls close to their correct hub and far from incorrect hubs in embedding space
- Text-aware batch sampling: Training batches group semantically similar controls together, creating harder negatives that force the model to make finer distinctions
- Training data: 4,237 framework-to-hub links from 22 OpenCRE-linked frameworks, producing 4,061 training pairs after deduplication
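As a rough illustration of that recipe, the sketch below wires LoRA and MNRL together using the standard peft and sentence-transformers APIs. It is not the project's training code: batch size and epoch count are placeholders, and the custom text-aware batch sampler and LOFO machinery are omitted -- see train.py and the TRACT repository for the real configuration.
from peft import LoraConfig, TaskType
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# LoRA hyperparameters from the list above; module names assume the
# BERT-style attention layers used by BGE-large-en-v1.5
lora_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=16, lora_alpha=32, lora_dropout=0.1,
    target_modules=["query", "key", "value"],
)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
model.add_adapter(lora_config)  # needs a recent sentence-transformers release with peft installed
# MNRL: each (control_text, hub_text) pair is a positive; every other item
# in the batch acts as an in-batch negative
train_examples = [
    InputExample(texts=["Enforce password complexity requirements", "Password policy"]),
    # ... one pair per framework-to-hub link (4,061 after deduplication)
]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)  # placeholder batch size
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)  # placeholder epoch count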
Evaluation
What Is LOFO Cross-Validation?
Standard train/test splits would leak information: if OWASP ASVS controls appear in both training and test sets, the model could memorize ASVS-specific language rather than learning general security concepts.
Leave-One-Framework-Out (LOFO) is stricter. For each evaluation fold:
- One entire framework is held out (e.g., all MITRE ATLAS controls)
- The model is trained on the remaining frameworks
- Hub firewall: Hub representations are rebuilt WITHOUT the held-out framework's data -- this prevents the model from "remembering" the held-out framework's contributions to hub embeddings
- The model predicts hub assignments for the held-out framework's controls
This simulates the real use case: mapping a brand-new framework the model has never seen.
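Schematically, the LOFO loop looks like the sketch below. It assumes the framework-to-hub links are available as simple records; train_model, build_hub_embeddings, and predict are hypothetical stand-ins for the actual training, hub-embedding, and inference code in the TRACT repository:
import numpy as np
def lofo_evaluate(links, frameworks, train_model, build_hub_embeddings, predict):
    # links: list of {"framework": ..., "control_text": ..., "hub_id": ...} records (assumed shape)
    for held_out in frameworks:
        train = [l for l in links if l["framework"] != held_out]
        test = [l for l in links if l["framework"] == held_out]
        model = train_model(train)  # fine-tuned only on the remaining frameworks
        # Hub firewall: hub embeddings are rebuilt from the training links only, so the
        # held-out framework contributes nothing to the hub representations
        hub_ids, hub_emb = build_hub_embeddings(model, train)
        preds = [predict(model, hub_ids, hub_emb, l["control_text"]) for l in test]
        hit1 = float(np.mean([p == l["hub_id"] for p, l in zip(preds, test)]))
        yield held_out, hit1, len(test)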
Results
| Fold | hit@1 | Zero-shot | Delta | hit@any | n |
|---|---|---|---|---|---|
| MITRE ATLAS | 0.279 | 0.273 | +0.006 | 0.279 | 43 |
| NIST AI 100-2 | 0.429 | 0.107 | +0.322 | 0.429 | 28 |
| OWASP AI Exchange | 0.762 | 0.619 | +0.143 | 0.762 | 63 |
| OWASP Top10 for LLM | 0.333 | 0.333 | +0.000 | 0.333 | 6 |
| OWASP Top10 for ML | 0.714 | 0.429 | +0.285 | 0.714 | 7 |
| Micro average | 0.537 | 0.400 | +0.138 | 0.537 | 147 |
Reading this table:
- hit@1: The model's top prediction matches the correct hub (strict accuracy)
- Zero-shot: Accuracy of the base model before any fine-tuning
- Delta: How much fine-tuning helped (positive = improvement)
- hit@any: The model's top prediction matches any of the control's correct hubs (since ~35% of controls belong to more than one hub, this is a fairer measure; a computation sketch follows below)
- n: Number of controls in that framework's test set
What the numbers mean:
- OWASP AI Exchange (76.2%): Strong performance -- the model correctly assigns 3 out of 4 AI security controls to their right hub on the first try
- MITRE ATLAS (27.9%): Weakest fold. ATLAS techniques are highly specific ("Adversarial Perturbation" vs. "Data Poisoning") and map to closely related hubs that are hard to disambiguate. The model often picks a neighboring hub rather than the exact one
- Micro average (53.7%): Overall, the model's top prediction is correct slightly more than half the time; for the roughly one in three controls that legitimately map to multiple hubs, the hit@any column is the fairer yardstick
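For reference, the two accuracy columns can be computed as below once each control's correct hub(s) are known. This is a sketch of the metric definitions, not the project's evaluation code; treating one gold hub as the single "primary" hub for hit@1 is an assumption here:
import numpy as np
def hit_metrics(predictions, primary_hubs, gold_hub_sets):
    # predictions: top-ranked hub ID per control
    # primary_hubs: one designated correct hub per control (assumption), used for hit@1
    # gold_hub_sets: all correct hubs per control, used for hit@any
    hit1 = float(np.mean([p == g for p, g in zip(predictions, primary_hubs)]))
    hit_any = float(np.mean([p in gold for p, gold in zip(predictions, gold_hub_sets)]))
    return hit1, hit_any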
Confidence Intervals
All metrics include bootstrap confidence intervals (10,000 resamples, 95% CI). The aggregate hit@1 CI is [0.462, 0.612], reflecting the relatively small evaluation set (147 controls across 5 AI frameworks).
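A percentile bootstrap over per-control hit indicators is enough to reproduce this kind of interval; the helper below is illustrative, not the project's evaluation code:
import numpy as np
def bootstrap_ci(correct, n_resamples=10_000, alpha=0.05, seed=0):
    # correct: 0/1 array with one entry per evaluated control (e.g., hit@1 indicators)
    correct = np.asarray(correct)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(correct), size=(n_resamples, len(correct)))
    means = correct[idx].mean(axis=1)
    return float(np.quantile(means, alpha / 2)), float(np.quantile(means, 1 - alpha / 2))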
Calibration: Understanding Confidence Scores
What Is Calibration?
Raw model outputs are cosine similarities (how close two vectors are). These are useful for ranking (higher = better match) but are NOT probabilities. A score of 0.85 does not mean "85% chance this is correct."
TRACT applies temperature scaling to convert rankings into better-calibrated confidence scores:
confidence = softmax(similarity / T)
where T=0.0738 (learned from a held-out calibration set of 420 traditional framework controls).
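One standard way to learn such a temperature is to minimize the negative log-likelihood of the correct hub on the calibration set. The sketch below (which additionally needs scipy) shows that general recipe; the project's exact fitting procedure may differ in detail:
import numpy as np
from scipy.optimize import minimize_scalar
def fit_temperature(sims, gold_idx):
    # sims: (n_items, n_hubs) cosine similarities on the calibration set
    # gold_idx: index of the correct hub for each item
    def nll(t):
        z = sims / t
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(gold_idx)), gold_idx].mean()
    return minimize_scalar(nll, bounds=(1e-3, 1.0), method="bounded").x
The deployed value is stored as t_deploy in the bundled calibration.json.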
Calibration Metrics
| Metric | Value | What It Means |
|---|---|---|
| Temperature (T) | 0.0738 | Sharpens the similarity distribution -- small T means the model is very "peaky" (strongly favors top matches) |
| ECE | 0.079 (95% CI [0.049, 0.111]) | Expected Calibration Error -- how far confidence scores deviate from true accuracy. 0.0 = perfectly calibrated. 0.079 means scores are off by ~8 percentage points on average |
| OOD threshold | 0.568 | If the maximum similarity is below this, the input is likely outside the model's knowledge (see below) |
| Conformal quantile | 0.9971 | 99.7% of correct predictions fall above this similarity threshold |
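For readers who want to compute ECE on their own data, a common equal-width-bin implementation looks like this (a sketch of the metric's definition; the reported 0.079 comes from the TRACT evaluation, whose binning details may differ):
import numpy as np
def expected_calibration_error(confidence, correct, n_bins=10):
    # confidence: top-1 calibrated confidence per item; correct: 0/1 per item
    confidence, correct = np.asarray(confidence), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidence[in_bin].mean())
    return ece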
Out-of-Distribution Detection
When you give the model text that is completely unrelated to security (e.g., a recipe or a news article), it will still produce predictions -- but they will all have low similarity scores. The model flags inputs as out-of-distribution (OOD) when:
max(similarity_to_any_hub) < 0.568
OOD predictions are marked with [OOD] in the output. Treat OOD predictions with extra skepticism -- they indicate the model is guessing rather than making an informed assignment.
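The gate itself is just a threshold on the best similarity. This snippet mirrors what predict.py and the detailed examples further down do, reading the threshold from the bundled calibration.json:
import json
import numpy as np
cal = json.load(open("calibration.json"))
def is_ood(sims):
    # sims: similarities of one input against all 522 hubs; threshold is 0.568 in this release
    return float(np.max(sims)) < cal["ood_threshold"]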
Bridge Analysis: Connecting AI and Traditional Security
Background
The CRE ontology contains 522 hubs. Some hubs are linked only by AI security frameworks (like MITRE ATLAS), some only by traditional frameworks (like NIST 800-53), and some by both:
| Category | Count | Example |
|---|---|---|
| AI-only | 21 | "Testing against evasion," "GenAI model alignment" |
| Traditional-only | 382 | "Input validation," "Password policy" |
| Naturally bridged (both) | 60 | "Data poisoning" (linked by both ATLAS and CWE) |
| Unlinked (structural) | 59 | Internal grouping nodes without framework links |
What Bridge Analysis Does
For the 21 AI-only hubs, the model identifies which traditional hubs are conceptually closest using embedding similarity. For example:
"Human AI oversight" (AI-only) ←→ "Security governance regarding people" (traditional) Cosine similarity: 0.774
Both hubs are about the same core concept: humans must remain accountable for security decisions, whether in AI systems or traditional security programs.
Method and Review Process
- Compute similarity matrix: 21 AI-only hubs x 382 traditional-only hubs (8,022 pairs)
- Extract top-3: For each AI-only hub, take the 3 most similar traditional hubs (63 candidates total)
- Expert review: A human security expert reviewed all 63 candidates and accepted or rejected each based on domain knowledge -- the similarity score is a ranking signal, not an automatic classifier
- Acceptance threshold: Candidates above the 99th percentile of the full similarity matrix (cosine >= 0.45) were considered; 4 additional candidates were rejected for specious LLM-rationalized connections
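A sketch of the candidate-generation step (items 1 and 2 above): it assumes you already know which hub indices are AI-only and which are traditional-only -- that categorization comes from framework links in the TRACT pipeline and is not a bundled field, so the two index lists are passed in:
import numpy as np
def bridge_candidates(hub_ids, hub_emb, ai_only_idx, trad_only_idx, k=3):
    # hub_emb rows are unit-normalized, so the matrix product gives cosine similarities
    sims = hub_emb[ai_only_idx] @ hub_emb[trad_only_idx].T  # e.g. 21 x 382
    candidates = []
    for row, ai in zip(sims, ai_only_idx):
        for j in np.argsort(row)[-k:][::-1]:  # k closest traditional hubs per AI-only hub
            candidates.append((hub_ids[ai], hub_ids[trad_only_idx[j]], float(row[j])))
    return candidates  # 63 candidates at k=3, handed to the expert for review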
Results
- Candidates evaluated: 63
- Accepted bridges: 46 (recorded as bidirectional related_hub_ids in the hierarchy)
- Rejected: 17
Accepted bridges are stored in cre_hierarchy.json as related_hub_ids. They represent lateral conceptual connections between AI and traditional security -- they do not change the hierarchical structure, model weights, or calibration.
Full bridge evidence, similarity scores, and review decisions are in bridge_report.json.
Bundled Files
This repository contains the model plus all data needed for standalone inference:
| File | Size | Purpose |
|---|---|---|
| 0_Transformer/model.safetensors | ~1.3 GB | Fully merged model weights (BGE-large + LoRA, no adapter needed) |
| predict.py | ~5 KB | Standalone inference script -- run without installing TRACT |
| train.py | ~3 KB | Reproduction guide with exact hyperparameters |
| hub_ids.json | ~12 KB | Ordered list of 522 hub IDs matching model output dimensions |
| hub_embeddings.npy | ~2 MB | Pre-computed 522 x 1024 hub embedding matrix |
| cre_hierarchy.json | ~800 KB | Full CRE taxonomy tree with bridge links |
| hub_descriptions.json | ~200 KB | Human-readable descriptions for each hub |
| calibration.json | ~1 KB | Temperature, OOD threshold, conformal quantile |
| bridge_report.json | ~15 KB | Bridge analysis evidence and review decisions |
Reproducing the Model
See train.py for the exact configuration. Full reproduction requires cloning the TRACT repository which contains custom training procedures (text-aware batch sampling, LOFO cross-validation with hub firewall, temperature-scaled contrastive loss).
Detailed Usage Examples
Example 1: Map a Single Control
from sentence_transformers import SentenceTransformer
import numpy as np
import json
# Load everything
model = SentenceTransformer("rockCO78/tract-cre-assignment")
hub_ids = json.load(open("hub_ids.json"))
hub_emb = np.load("hub_embeddings.npy") # shape: (522, 1024)
hierarchy = json.load(open("cre_hierarchy.json"))
cal = json.load(open("calibration.json"))
# Encode your control text (normalize_embeddings=True is required)
text = "The application must validate all user input before processing"
query = model.encode([text], normalize_embeddings=True) # shape: (1, 1024)
# Compute similarities (dot product = cosine for unit vectors)
sims = (query @ hub_emb.T)[0] # shape: (522,)
# Apply temperature scaling for calibrated confidence
def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()
confidence = softmax(sims / cal["t_deploy"])
# Get top-5 predictions
top5 = np.argsort(confidence)[-5:][::-1]
for idx in top5:
    hid = hub_ids[idx]
    hub = hierarchy["hubs"][hid]
    ood = " [OOD]" if float(np.max(sims)) < cal["ood_threshold"] else ""
    print(f" {hid} ({confidence[idx]:.3f}){ood} {hub['name']}")
    print(f"   Path: {hub['hierarchy_path']}")
Example 2: Batch-Map an Entire Framework
import json
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("rockCO78/tract-cre-assignment")
hub_ids = json.load(open("hub_ids.json"))
hub_emb = np.load("hub_embeddings.npy")
# Your framework controls (e.g., parsed from a CSV or JSON)
controls = [
    {"id": "AC-1", "text": "Access control policy and procedures"},
    {"id": "AC-2", "text": "Account management and provisioning"},
    {"id": "IA-5", "text": "Authenticator management including password rules"},
]
# Encode all controls at once (much faster than one at a time)
texts = [c["text"] for c in controls]
embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=True)
# Compute all similarities in one matrix multiply
all_sims = embeddings @ hub_emb.T # shape: (n_controls, 522)
# Build crosswalk
crosswalk = []
for i, ctrl in enumerate(controls):
    top_idx = int(np.argmax(all_sims[i]))
    crosswalk.append({
        "control_id": ctrl["id"],
        "control_text": ctrl["text"],
        "predicted_hub": hub_ids[top_idx],
        "similarity": round(float(all_sims[i, top_idx]), 4),
    })
# Save as JSON
with open("crosswalk.json", "w") as f:
    json.dump(crosswalk, f, indent=2)
Example 3: Find Related Hubs via Bridges
import json
hierarchy = json.load(open("cre_hierarchy.json"))
# Find all AI/traditional bridge connections
for hub_id, hub in hierarchy["hubs"].items():
    related = hub.get("related_hub_ids", [])
    if related:
        print(f"{hub['name']} ({hub_id})")
        for rid in related:
            rhub = hierarchy["hubs"][rid]
            print(f" <-> {rhub['name']} ({rid})")
        print()
Limitations and Known Issues
ATLAS fold performance (27.9% hit@1): MITRE ATLAS techniques map to closely related hubs (e.g., "Data Poisoning" vs. "Adversarial Perturbation") that are hard to disambiguate, so the model often predicts a neighboring hub rather than the exact one. hit@5 on this fold is also 27.9%, so widening to the top 5 does not recover the correct hub here.
Multi-hub controls (35%): About 1 in 3 controls legitimately maps to more than one hub. hit@1 alone understates performance -- the hit@any column in the evaluation table is a fairer measure.
Calibration is approximate: ECE=0.079 means confidence scores are off by ~8 percentage points on average. Treat them as ordinal rankings (higher = better), not as exact probabilities.
Training data scope: Calibrated on 420 traditional framework holdout items. Accuracy on AI-specific text may differ from the reported metrics, especially for concepts not well-represented in the 5 AI frameworks.
Not a replacement for expert judgment: Model predictions are a starting point for compliance crosswalks. A security professional should review all assignments, especially for high-stakes compliance work.
Language: English only. The base model (BGE-large-en-v1.5) and all training data are English.
What does NOT work for this task: DeBERTa-v3-NLI achieves hit@1=0.000 -- Natural Language Inference (textual entailment) is fundamentally different from semantic similarity for taxonomy assignment. Do not substitute NLI models.
Ethical Considerations
- This model is a decision-support tool, not an autonomous compliance engine. All predictions require human review before use in security assessments or regulatory filings.
- The model was trained on publicly available security framework data. No proprietary or confidential data was used.
- Active learning rounds during development used expert-reviewed predictions, not autonomous deployment.
- Bridge analysis connections were individually reviewed by a human security expert; automated connections were not added without review.
Environmental Impact
- Training compute: NVIDIA H100 GPU via RunPod, 4.2 GPU-hours total (including LOFO cross-validation, ablation studies, and final deployment model)
- Inference deployment: Runs on an NVIDIA Jetson Orin AGX edge device (~30W TDP). A single control prediction takes <100ms on consumer hardware.
- Carbon context: Estimated 1.3 kWh training energy (US average grid: ~0.5 kg CO2e)
Glossary
| Term | Definition |
|---|---|
| CRE | Common Requirements Enumeration -- a universal taxonomy of security topics maintained by OpenCRE.org |
| Hub | A node in the CRE taxonomy tree representing a security concept (e.g., "Input validation," "Access control") |
| LOFO | Leave-One-Framework-Out -- cross-validation method where an entire framework is held out for testing |
| Hub firewall | During LOFO evaluation, hub embeddings are rebuilt WITHOUT the held-out framework to prevent information leakage |
| hit@1 | The model's single best prediction matches the correct hub |
| hit@any | The model's top prediction matches ANY of the control's correct hubs (relevant for multi-hub controls) |
| ECE | Expected Calibration Error -- measures how well confidence scores match actual accuracy |
| OOD | Out-of-Distribution -- input text is too different from training data for reliable prediction |
| LoRA | Low-Rank Adaptation -- an efficient fine-tuning method that trains small adapter matrices instead of modifying all model weights |
| Bridge | A discovered conceptual connection between an AI-specific and a traditional CRE hub |
| Temperature scaling | A post-hoc calibration technique that sharpens or smooths the model's output distribution |
Citation
@software{tract2026,
title = {TRACT: Transitive Reconciliation and Assignment of CRE Taxonomies},
author = {Rock},
year = {2026},
url = {https://github.com/rockcyber/TRACT}
}
License
MIT License for model weights and code. The base model (BAAI/bge-large-en-v1.5) is also MIT licensed.
Bundled data files (CRE hierarchy, hub descriptions, bridge report) are sourced from publicly available security frameworks and OpenCRE.org, provided under CC0 1.0 Universal.