TCH-Net: Multi-Branch IoT Botnet Detection on BRIDGE

Paper: BRIDGE and TCH-Net: Heterogeneous Benchmark and Multi-Branch Baseline for Cross-Domain IoT Botnet Detection
Authors: Ammar Bhilwarawala, Likhamba Rongmei, Harsh Sharma, Arya Jena, Kaushal Singh, Jayashree Piri, Raghunath Dey — KIIT University
Submitted to: Journal of Network and Computer Applications
Dataset: Ammar-ss/BRIDGE
Code: github.com/Ammar-ss/TCH-Net


What is this?

The IoT botnet detection field has a quiet problem. Almost every published system gets trained on one dataset, reports numbers in the high 90s, and calls it done. The trouble is those numbers don't travel. A model tuned to CICIDS-2017 will see completely different traffic statistics when you point it at Bot-IoT or N-BaIoT — different capture tools, different devices, different attack toolkits. The benchmark looked easy because it was a closed world.

TCH-Net is a multi-branch neural network built to handle this more honestly. It's trained and evaluated on BRIDGE, a unified benchmark that maps five structurally distinct public datasets into a shared 46-feature space. The goal was to build something that could survive being tested on genuinely heterogeneous data — and then to measure exactly how hard that actually is.

The architecture has three parallel branches. The Temporal branch (T) runs three paths simultaneously: a residual depthwise-separable convolutional BiGRU for local and medium-range patterns, a stride-downsampled BiGRU for coarser dynamics, and a full-resolution pre-LayerNorm Transformer covering all 32 timesteps for global context. Different botnet attack categories manifest at different temporal scales — DDoS flooding shows up in burst-level signatures, C&C beaconing in medium-scale periodic patterns, scan-then-exploit sequences in global ordering — and a single-resolution encoder has to trade one against the others. Three paths running in parallel resolve that.
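The three-path idea can be sketched in a few lines of PyTorch. This is an illustrative reconstruction, not the released implementation: layer sizes follow the Architecture section below, but the residual SE-conv blocks, learnable positional encoding, and CLS token are omitted, and the class name is ours.

```python
import torch
import torch.nn as nn

class ThreePathTemporalSketch(nn.Module):
    """Sketch of the T-branch's three parallel temporal paths (simplified)."""
    def __init__(self, nf=46, ws=32):
        super().__init__()
        # Path 1: conv front-end + BiGRU (local / medium-range patterns)
        self.conv = nn.Sequential(
            nn.Conv1d(nf, 128, kernel_size=3, padding=1), nn.GELU(),
            nn.MaxPool1d(4),                      # 32 timesteps -> 8
        )
        self.gru1 = nn.GRU(128, 128, num_layers=2, batch_first=True,
                           bidirectional=True)    # -> 8 x 256
        # Path 2: strided conv downsample + BiGRU (coarser dynamics)
        self.stride = nn.Conv1d(nf, 64, kernel_size=3, stride=2, padding=1)
        self.gru2 = nn.GRU(64, 64, batch_first=True, bidirectional=True)
        self.pool2 = nn.AdaptiveAvgPool1d(8)      # -> 8 x 128
        # Path 3: full-resolution pre-LN Transformer (global context)
        self.proj = nn.Linear(nf, 128)
        layer = nn.TransformerEncoderLayer(128, nhead=8, batch_first=True,
                                           norm_first=True)
        self.trans = nn.TransformerEncoder(layer, num_layers=2)
        self.pool3 = nn.AdaptiveAvgPool1d(8)      # -> 8 x 128

    def forward(self, x):                         # x: (B, 32, 46)
        p1, _ = self.gru1(self.conv(x.transpose(1, 2)).transpose(1, 2))
        p2, _ = self.gru2(self.stride(x.transpose(1, 2)).transpose(1, 2))
        p2 = self.pool2(p2.transpose(1, 2)).transpose(1, 2)
        p3 = self.pool3(self.trans(self.proj(x)).transpose(1, 2)).transpose(1, 2)
        merged = torch.cat([p1, p2, p3], dim=-1)  # (B, 8, 512)
        return merged.mean(dim=1)                 # (B, 512)
```

Each path ends at 8 aligned positions, so the 256 + 128 + 128 features concatenate cleanly to the 512-d merge the spec describes.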

The Statistical branch (H) mean-pools the window and runs it through an MLP. It captures distributional structure that doesn't depend on ordering at all, which is orthogonal to what the T-branch does. The Contextual branch (C) encodes the source dataset and device category as learned embeddings. On its own it's nearly random (AUC ≈ 0.50) — it doesn't predict attack labels independently. What it does is condition the fusion mechanism on where the input came from.
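Both branches are small enough to sketch directly. Sizes follow the Architecture section; the class name and the exact normalisation/dropout placement are our simplification.

```python
import torch
import torch.nn as nn

class StatContextSketch(nn.Module):
    """Sketch of the order-free H-branch (mean-pool + MLP) and the
    C-branch (dataset/device embeddings)."""
    def __init__(self, nf=46, n_ds=5, n_dev=6):
        super().__init__()
        # H-branch: distributional features, no temporal ordering
        self.h_mlp = nn.Sequential(
            nn.Linear(nf, 128), nn.BatchNorm1d(128), nn.GELU(), nn.Dropout(0.15),
            nn.Linear(128, 64),
        )
        # C-branch: 32-d embedding per context field, concatenated to 64d
        self.emb_ds = nn.Embedding(n_ds, 32)
        self.emb_dev = nn.Embedding(n_dev, 32)

    def forward(self, x, ctx):          # x: (B, 32, 46), ctx: (B, 2) long
        h = self.h_mlp(x.mean(dim=1))                     # (B, 64)
        c = torch.cat([self.emb_ds(ctx[:, 0]),
                       self.emb_dev(ctx[:, 1])], dim=-1)  # (B, 64)
        return h, c
```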

All three branches get fused through CB-GAF (Cross-Branch Gated Attention Fusion). Each branch queries the other two simultaneously via cross-attention, then a learned sigmoid vector gate — 128-dimensional, not a scalar — controls feature-wise how much cross-branch information gets absorbed. That vector gating is important in the heterogeneous setting. For a dataset like N-BaIoT where 85% of canonical features are zero-filled, the gate can learn to downweight the largely empty H-branch at specific dimensions rather than hard-coding that decision or just averaging in noise across the board.
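The gating step can be illustrated as follows. This is a simplified single-branch view with hypothetical module names, and it assumes the gate is computed from the concatenated self and cross features; the released CB-GAF module applies this per branch with its own projections.

```python
import torch
import torch.nn as nn

class VectorGatedFusionSketch(nn.Module):
    """Sketch of one CB-GAF fusion step: a branch attends over the other
    two at once, then a 128-d sigmoid gate mixes self vs. cross features
    dimension-wise (feature-wise, not a scalar)."""
    def __init__(self, d_f=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_f, heads, batch_first=True)
        self.gate = nn.Linear(2 * d_f, d_f)

    def forward(self, x_self, x_others):
        # x_self: (B, d_f); x_others: (B, 2, d_f) = the other two branches
        q = x_self.unsqueeze(1)                         # (B, 1, d_f)
        x_cross, _ = self.attn(q, x_others, x_others)   # query both at once
        x_cross = x_cross.squeeze(1)
        # g in (0,1)^128 controls, per dimension, how much cross info enters
        g = torch.sigmoid(self.gate(torch.cat([x_self, x_cross], dim=-1)))
        return g * x_self + (1 - g) * x_cross           # x_fused
```

Because `g` is a vector, the gate can suppress cross-branch input at exactly the dimensions where a branch is uninformative (e.g. zero-filled features) while still absorbing it elsewhere.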

The full model has 2,692,696 parameters, runs at 6.43 ms inference latency on a Tesla T4, and fits on NVIDIA Jetson hardware with room to spare.


Intended Use

In-Scope

  • IoT botnet detection research on network flow data
  • Cross-dataset generalisation benchmarking in heterogeneous IDS settings
  • Ablation or architecture comparison studies using the BRIDGE benchmark

Out-of-Scope

  • Production deployment without retraining: The LODO gap (0.2719 F1) indicates the model does not reliably generalise to unseen dataset distributions without adaptation. Do not deploy this in a live network without fine-tuning on in-distribution data.
  • Non-IoT or enterprise network traffic: BRIDGE covers IoT-specific datasets. Behaviour on corporate LAN/WAN traffic is not evaluated.
  • Real-time per-packet classification: TCH-Net operates on flow-level feature windows of 32 timesteps. It requires completed or windowed flows, not individual packets.
  • Unknown context at scale: The contextual branch requires dataset source and device category IDs. If these are unavailable, pass ctx = torch.zeros(B, 2, dtype=torch.long) — performance will degrade modestly but gracefully.

Limitations

  • LODO F1 = 0.5577. The model does not generalise well to unseen dataset distributions. This is the best LODO result across all evaluated architectures (+0.09–0.17 above baselines), but the gap is real and quantified.
  • N-BaIoT achieves high F1 (0.9854) largely because Mirai/BASHLITE signatures are statistically distinct in only 7 of 46 features. This is not a BRIDGE-wide pattern.
  • Edge-IIoTset is the hardest case (F1 = 0.6755) due to IIoT packet-level traffic structures differing from the flow-level distributions dominating training.
  • 85% zero-fill on N-BaIoT canonical features is a BRIDGE artefact — the canonical feature space was built around CICIDS-style flow features, which do not map cleanly to all source datasets.
  • Not validated on live traffic captures or real-world deployment scenarios.

Results (5 seeds: 42, 123, 456, 789, 2024)

Metric    TCH-Net            Best Baseline (Transformer-IDS)
F1        0.8296 ± 0.0028    0.7958 ± 0.0030
ROC-AUC   0.9380 ± 0.0025    0.9147 ± 0.0012
MCC       0.6972 ± 0.0056    0.6255 ± 0.0067
PR-AUC    0.8912 ± 0.0031    0.8699 ± 0.0041

TCH-Net outperforms all 12 baselines on all four metrics. All differences are statistically significant (p < 0.05, one-sided paired Wilcoxon signed-rank test).

Full Comparison Table

Model              F1                ROC-AUC   MCC      ΔF1
TCH-Net (Ours)     0.8296 ± 0.0028   0.9380    0.6972   —
Transformer-IDS    0.7958 ± 0.0030   0.9147    0.6255   +0.0338**
1D-CNN-IDS         0.7932 ± 0.0076   0.9076    0.6213   +0.0364*
CNN-LSTM           0.7919 ± 0.0137   0.9056    0.6208   +0.0377*
BiLSTM-IDS         0.7805 ± 0.0010   0.8975    0.5972   +0.0491**
BiGRU-IDS          0.7805 ± 0.0011   0.8962    0.5987   +0.0491**
DeepDefense        0.7627 ± 0.0011   0.8776    0.5638   +0.0669***
XGBoost            0.7265 ± 0.0014   0.8704    0.5542   +0.1031***
GraphSAGE-Approx   0.7097 ± 0.0004   0.8259    0.4465   +0.1199***
Kitsune-AE         0.7045 ± 0.0007   0.8200    0.4362   +0.1251***
MLP-IDS            0.7039 ± 0.0008   0.8152    0.4348   +0.1257***
IoT-DNN            0.7009 ± 0.0002   0.8146    0.4278   +0.1287***
Random Forest      0.4323 ± 0.0082   0.8005    0.3557   +0.3973***

Per-Dataset Performance

Dataset        Coverage   DetRate   False Alarm   F1
CICIDS-2017    93%        0.9433    0.0309        0.9505
CIC-IoT-2023   87%        0.8827    0.0257        0.9211
N-BaIoT        15%        0.9982    0.0206        0.9854
Edge-IIoTset   22%        0.6844    0.2589        0.6755

N-BaIoT achieves the highest F1 despite 85% of its features being zero-filled: Mirai and BASHLITE botnet traffic is statistically distinctive enough in just 7 features that the separation is stark. Edge-IIoTset is the hardest case, because IIoT packet-level traffic is structured differently from the flow-level distributions that dominate training.

Leave-One-Dataset-Out (LODO) Generalisation

The honest number. Train on four datasets, test on the fifth, repeated five times.

Held-Out       LODO F1          LODO AUC
CICIDS-2017    0.3128 ± 0.232   0.0509
CIC-IoT-2023   0.6013 ± 0.000   0.1440
Bot-IoT        0.5934 ± 0.011   0.5693
Edge-IIoTset   0.6791 ± 0.008   0.6841
N-BaIoT        0.6021 ± 0.000   0.8171
MEAN           0.5577           0.4531

Generalisation gap: random-split F1 (0.8296) − LODO mean (0.5577) = 0.2719.

This gap is not a TCH-Net problem. All five deep learning baselines scored between 0.39 and 0.47 LODO F1 under the same protocol. TCH-Net's 0.5577 is the highest LODO score across all evaluated architectures, +0.09 to +0.17 above baselines. The gap is a measurement of how hard the cross-dataset problem actually is. The BRIDGE LODO mean of 0.5577 is the first formally quantified community generalisation baseline in heterogeneous IoT intrusion detection.

Temporal Split Check

Split               F1        AUC       MCC
Random (5 seeds)    0.8296    0.9380    0.6972
Temporal (1 seed)   0.8203    0.9261    0.6831
Δ                   −0.0093   −0.0119   −0.0141

The small drop under temporal splitting confirms TCH-Net's performance is not driven by temporal leakage.


Architecture

Input: (B, 32, 46)  — batch × window × canonical features

Shared Feature Projection (residual):
  Linear(46→92) → LayerNorm → GELU → Dropout(δ/2)
  → Linear(92→46) → LayerNorm        X̃ = X + f_proj(X)

T-Branch — three parallel paths, merged to 512d:
  Path 1: ResConvSE×3 + MaxPool → BiGRU(128/dir, 2L)              → 8×256
  Path 2: StrideConv(s=2,64ch) → BiGRU(64/dir, 1L) → AvgPool(8)   → 8×128
  Path 3: Linear(46→128) + LearnablePE → TransEnc(Pre-LN,2L,8H)
           → strip CLS → AvgPool(8)                               → 8×128
  Merge:  concat → LayerNorm(512) → MHA(8 heads) → mean pool      → 512d

H-Branch:   mean(X̃, dim=time) → MLP(46→128→64, BN+GELU+Dropout)  → 64d

C-Branch:   Embed_ds(5,32)[c_ds] ‖ Embed_dev(6,32)[c_dev]         → 64d

CB-GAF (Cross-Branch Gated Attention Fusion):
  Project each branch to d_f=128
  Each branch queries both others simultaneously via cross-attention
  Per-branch vector gate g^i ∈ (0,1)^128  (feature-wise, not scalar)
  x_fused = g^i ⊙ x_self + (1−g^i) ⊙ x_cross
  concat(T_fused, C_fused, H_fused) → LayerNorm                   → 384d

Classifier (residual head):
  raw_proj(mean(X̃)) → 64d
  concat(384d fused, 64d raw) → z ∈ 448d
  Linear(448→256) → BN+GELU+Dropout
  Linear(256→128) + Wskip(448→128)  ← residual skip
  Linear(128→2) → softmax

Aux Decoder (training only, λ=0.05):
  MLP(384→64→46) — prevents information collapse in CB-GAF
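A sketch of how the auxiliary objective plugs into training. The decoder shape (384→64→46) and the λ=0.05 weight are from the spec above; the reconstruction target (here the time-averaged input window) and the MSE loss type are our assumptions, and the exact definition lives in the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Training-only auxiliary decoder: fused 384-d representation back to the
# 46-d canonical feature space, discouraging information collapse in CB-GAF.
aux_decoder = nn.Sequential(nn.Linear(384, 64), nn.GELU(), nn.Linear(64, 46))

def total_loss(cls_loss, fused, x, lam=0.05):
    """Combine the classification loss with the weighted reconstruction
    loss. Assumption: the target is the mean-pooled input window."""
    recon = aux_decoder(fused)                  # (B, 46)
    aux = F.mse_loss(recon, x.mean(dim=1))      # vs. time-averaged input
    return cls_loss + lam * aux
```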

Ablation Summary

Branch Ablation (2 seeds, CB-GAF replaced with concat)

Variant        F1       ΔF1
T+C+H (Full)   0.8296   —
T+H            0.7756   −0.0540
T+C            0.7752   −0.0544
T only         0.7753   −0.0543
H only         0.7054   −0.1242
C+H            0.7061   −0.1235
C only         0.6000   −0.2296

The C-branch alone gets near-random performance (AUC ≈ 0.50) — confirming its role is fusion conditioning, not independent prediction.

Novelty Component Ablation (2 seeds)

Variant                 F1       ΔF1
Full TCH-Net            0.8296   —
w/o CB-GAF              0.7759   −0.0537
w/o MSTE (Three-Path)   0.7760   −0.0536
w/o Aux Loss            0.7755   −0.0541
w/o All (v2 baseline)   0.7752   −0.0544

Removing any single novel component costs ~0.054 F1.


Computational Profile (NVIDIA Tesla T4)

Model             Params   Latency         Throughput   Memory    F1
TCH-Net           2.692M   6.43 ± 0.18ms   20.5k sps    10.27MB   0.8296
BiLSTM-IDS        0.609M   0.74 ± 0.02ms   34.2k sps    2.32MB    0.7805
Transformer-IDS   0.618M   1.22 ± 0.03ms   36.8k sps    2.36MB    0.7958
1D-CNN-IDS        0.068M   0.69 ± 0.03ms   406.9k sps   0.26MB    0.7932

The 6.43ms latency supports 20,000+ detections/second under batch processing. The 10.27MB footprint is deployable on NVIDIA Jetson hardware. For microcontroller-class endpoints (Cortex-M, ESP32), quantisation or knowledge distillation would be needed.


Installation

pip install torch scikit-learn numpy huggingface_hub

Tested with Python 3.9+. No specific version pinning required beyond a modern PyTorch (≥2.0 recommended for torch.compile compatibility).


Loading the Model

import torch
import pickle
import numpy as np
import torch.nn.functional as F
from huggingface_hub import hf_hub_download

# ── Download files ──────────────────────────────────────────────────────────
ckpt_path   = hf_hub_download("Ammar-ss/BRIDGE_and_TCH-Net", "tch_net_best.pth")
scaler_path = hf_hub_download("Ammar-ss/BRIDGE_and_TCH-Net", "scaler.pkl")

# ── Load scaler ──────────────────────────────────────────────────────────────
with open(scaler_path, "rb") as f:
    scaler = pickle.load(f)

# ── Load checkpoint ──────────────────────────────────────────────────────────
ckpt   = torch.load(ckpt_path, map_location="cpu")
config = ckpt["config"]

# ── Define TCHNet ─────────────────────────────────────────────────────────────
# Full class definition is in bridge-and-tch-net.ipynb and the GitHub repo.
# Paste or import TCHNet before instantiating:
#   from tch_net import TCHNet   (if using the GitHub repo)
#   OR copy the class from the notebook.

model = TCHNet(
    nf=config["n_features"],   # 46
    ws=config["window_size"],  # 32
    nc=config["n_classes"],    # 2
)
model.load_state_dict(ckpt["state_dict"])
model.eval()

# ── Preprocess ───────────────────────────────────────────────────────────────
# X_raw: np.ndarray of shape (N, 46) — raw canonical flow features
X_scaled = np.clip(scaler.transform(X_raw), -10, 10).astype(np.float32)

# ── Inference ────────────────────────────────────────────────────────────────
# x:   FloatTensor (B, 32, 46) — windowed, scaled flow features
# ctx: LongTensor  (B, 2)      — [dataset_source_id, device_category_id]
#
#   dataset_source_id:  0=CICIDS-2017  1=CIC-IoT-2023  2=Bot-IoT
#                       3=Edge-IIoTset  4=N-BaIoT
#   device_category_id: 0=sensor  1=camera  2=appliance  3=IIoT
#                       4=server  5=unknown
#
#   If context is unknown: ctx = torch.zeros(B, 2, dtype=torch.long)
#   The C-branch has no independent predictive power — unknown context
#   degrades gracefully, it does not break inference.

with torch.no_grad():
    logits, _ = model(x, ctx)
    probs = F.softmax(logits, dim=-1)
    preds = logits.argmax(dim=-1)  # 0 = benign, 1 = attack
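The snippet above scales flat (N, 46) feature rows, but the model consumes (B, 32, 46) windows. A minimal windowing helper, assuming rows are in temporal order within one stream and reusing the training stride S=4; the function name is ours, not part of the repo.

```python
import numpy as np

def make_windows(X_scaled, window=32, stride=4):
    """Slice a (N, 46) scaled feature matrix into overlapping
    (B, window, 46) windows, matching the training config W=32, S=4.
    Assumes rows are temporally ordered within one flow stream."""
    n = X_scaled.shape[0]
    if n < window:
        raise ValueError(f"need at least {window} rows, got {n}")
    starts = range(0, n - window + 1, stride)
    return np.stack([X_scaled[s:s + window] for s in starts])
```

Then feed `x = torch.from_numpy(make_windows(X_scaled))` into the inference block above.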

Files

File                                    Description
tch_net_best.pth                        Best checkpoint (highest F1 across all 5 seeds)
tch_net_seed_42.pth                     Per-seed checkpoint, seed 42
tch_net_seed_123.pth                    Per-seed checkpoint, seed 123
tch_net_seed_456.pth                    Per-seed checkpoint, seed 456
tch_net_seed_789.pth                    Per-seed checkpoint, seed 789
tch_net_seed_2024.pth                   Per-seed checkpoint, seed 2024
scaler.pkl                              RobustScaler (q5–q95) fitted on BRIDGE training split — required for inference
manifest.json                           Config, per-seed metrics, feature names
BRIDGE and TCH-Net (FULL PAPER).ipynb   Complete experimental notebook (all 12 baselines, branch ablation, novelty ablation, LODO, temporal split, adversarial robustness, HP sensitivity)
bridge-and-tch-net.ipynb                Clean training-only notebook (TCH-Net, 5 seeds, saves checkpoints)

Training Hyperparameters

Parameter               Value
Optimizer               AdamW
Learning rate           5×10⁻⁴
Weight decay            5×10⁻⁵
Scheduler               Cosine annealing, 2-epoch warmup
Loss                    Focal (γ=2.0, α-weighted, ε=0.05) + Aux (λ=0.05)
Batch size              512
Max epochs / patience   30 / 5
Sequence length         W=32, stride S=4
Dropout                 0.15
Input augmentation      Gaussian noise (σ=0.01, p=0.30, train only)
AMP                     fp16 on CUDA
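The loss row above can be made concrete. This is one common focal-loss formulation consistent with the listed settings (γ=2.0, optional per-class α weights, label smoothing ε=0.05); the exact variant used in training is defined in the notebook.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None, eps=0.05):
    """Focal loss with label smoothing (a common formulation, not
    necessarily the exact one used in training).
    logits: (B, C), targets: (B,) long, alpha: optional (C,) weights."""
    logp = F.log_softmax(logits, dim=-1)
    n_cls = logits.size(-1)
    # label smoothing: 1 - eps on the true class, eps spread over the rest
    smooth = torch.full_like(logp, eps / (n_cls - 1))
    smooth.scatter_(1, targets.unsqueeze(1), 1.0 - eps)
    pt = logp.exp()
    w = (1.0 - pt) ** gamma                 # down-weight easy examples
    if alpha is not None:
        w = w * alpha.unsqueeze(0)          # per-class alpha weighting
    return -(w * smooth * logp).sum(dim=-1).mean()
```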

Citation

@article{bhilwarawala2026bridge,
  title   = {{BRIDGE} and {TCH-Net}: Heterogeneous Benchmark and Multi-Branch
             Baseline for Cross-Domain {IoT} Botnet Detection},
  author  = {Bhilwarawala, Ammar and Rongmei, Likhamba and Sharma, Harsh
             and Jena, Arya and Singh, Kaushal and Piri, Jayashree and Dey, Raghunath},
  journal = {arXiv preprint arXiv:2604.11324},
  year    = {2026}
}

Model Card Authors

Ammar Bhilwarawala, KIIT University.
For questions or issues, open a discussion on this repository.
