TCH-Net: Multi-Branch IoT Botnet Detection on BRIDGE
Paper: BRIDGE and TCH-Net: Heterogeneous Benchmark and Multi-Branch Baseline for Cross-Domain IoT Botnet Detection
Authors: Ammar Bhilwarawala, Likhamba Rongmei, Harsh Sharma, Arya Jena, Kaushal Singh, Jayashree Piri, Raghunath Dey (KIIT University)
Submitted to: Journal of Network and Computer Applications
Dataset: Ammar-ss/BRIDGE
Code: github.com/Ammar-ss/TCH-Net
What is this?
The IoT botnet detection field has a quiet problem. Almost every published system gets trained on one dataset, reports numbers in the high 90s, and calls it done. The trouble is those numbers don't travel. A model tuned to CICIDS-2017 will see completely different traffic statistics when you point it at Bot-IoT or N-BaIoT: different capture tools, different devices, different attack toolkits. The benchmark looked easy because it was a closed world.
TCH-Net is a multi-branch neural network built to handle this more honestly. It's trained and evaluated on BRIDGE, a unified benchmark that maps five structurally distinct public datasets into a shared 46-feature space. The goal was to build something that could survive being tested on genuinely heterogeneous data, and then to measure exactly how hard that actually is.
The architecture has three parallel branches. The Temporal branch (T) runs three paths simultaneously: a residual depthwise-separable convolutional BiGRU for local and medium-range patterns, a stride-downsampled BiGRU for coarser dynamics, and a full-resolution pre-LayerNorm Transformer covering all 32 timesteps for global context. Different botnet attack categories manifest at different temporal scales (DDoS flooding shows up in burst-level signatures, C&C beaconing in medium-scale periodic patterns, scan-then-exploit sequences in global ordering), and a single-resolution encoder has to trade one against the others. Three paths running in parallel resolve that trade-off.
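To make the multi-scale intuition concrete, here is a deliberately simplified sketch of three parallel encoders reading the same 32-step window at different temporal resolutions. The layer types follow the description above, but all sizes and the `MultiResolutionSketch` name are illustrative assumptions, not the paper's exact stack.

```python
import torch
import torch.nn as nn

class MultiResolutionSketch(nn.Module):
    """Toy version of the T-branch idea: three parallel encoders at
    different temporal resolutions, pooled and concatenated.
    All layer sizes here are illustrative, not the paper's configuration."""
    def __init__(self, nf: int = 46):
        super().__init__()
        # Full-resolution recurrent path (local / medium-range patterns)
        self.local = nn.GRU(nf, 64, batch_first=True, bidirectional=True)
        # Stride-downsampled convolutional path (coarser dynamics)
        self.down = nn.Conv1d(nf, 64, kernel_size=4, stride=2)
        # Full-resolution attention path (global ordering)
        self.glob = nn.TransformerEncoderLayer(d_model=nf, nhead=2,
                                               batch_first=True)

    def forward(self, x):                    # x: (B, 32, 46)
        h1, _ = self.local(x)                # (B, 32, 128)
        h2 = self.down(x.transpose(1, 2))    # (B, 64, 15)
        h3 = self.glob(x)                    # (B, 32, 46)
        # Pool each path over time, then concatenate: (B, 128 + 64 + 46)
        return torch.cat([h1.mean(1), h2.mean(-1), h3.mean(1)], dim=-1)

model = MultiResolutionSketch()
feats = model(torch.randn(2, 32, 46))
```

The point of the sketch is only the structure: each path sees the same window but summarises it at a different granularity, so no single resolution has to carry all attack categories.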
The Statistical branch (H) mean-pools the window and runs it through an MLP. It captures distributional structure that doesn't depend on ordering at all, which is orthogonal to what the T-branch does. The Contextual branch (C) encodes the source dataset and device category as learned embeddings. On its own it's nearly random (AUC ≈ 0.50); it doesn't predict attack labels independently. What it does is condition the fusion mechanism on where the input came from.
All three branches get fused through CB-GAF (Cross-Branch Gated Attention Fusion). Each branch queries the other two simultaneously via cross-attention, then a learned sigmoid vector gate (128-dimensional, not a scalar) controls feature-wise how much cross-branch information gets absorbed. That vector gating matters in the heterogeneous setting. For a dataset like N-BaIoT, where 85% of canonical features are zero-filled, the gate can learn to downweight the largely empty H-branch at specific dimensions rather than hard-coding that decision or averaging in noise across the board.
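The gating step can be sketched in a few lines of PyTorch. The `VectorGatedFusion` name and the way the gate is computed from the concatenated inputs are illustrative assumptions; the actual CB-GAF implementation is in the repo notebook.

```python
import torch
import torch.nn as nn

class VectorGatedFusion(nn.Module):
    """Sketch of CB-GAF's feature-wise gate for one branch.

    x_self:  this branch's own 128-d representation
    x_cross: cross-attention summary of the other two branches
    A sigmoid gate g in (0,1)^128 mixes them per dimension:
        fused = g * x_self + (1 - g) * x_cross
    """
    def __init__(self, d_f: int = 128):
        super().__init__()
        # Gate is learned from both inputs (an assumption of this sketch)
        self.gate = nn.Linear(2 * d_f, d_f)

    def forward(self, x_self, x_cross):
        g = torch.sigmoid(self.gate(torch.cat([x_self, x_cross], dim=-1)))
        return g * x_self + (1 - g) * x_cross

fusion = VectorGatedFusion(d_f=128)
x_self = torch.randn(4, 128)   # e.g. H-branch features
x_cross = torch.randn(4, 128)  # cross-attention over T- and C-branches
fused = fusion(x_self, x_cross)
```

Because `g` is a 128-vector rather than a scalar, the model can absorb cross-branch information at some dimensions while ignoring it at others, which is exactly the behaviour needed when one branch is partially empty.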
2,692,696 parameters. Inference latency 6.43ms on Tesla T4. Fits on NVIDIA Jetson with room to spare.
Intended Use
In-Scope
- IoT botnet detection research on network flow data
- Cross-dataset generalisation benchmarking in heterogeneous IDS settings
- Ablation or architecture comparison studies using the BRIDGE benchmark
Out-of-Scope
- Production deployment without retraining: The LODO gap (0.2719 F1) indicates the model does not reliably generalise to unseen dataset distributions without adaptation. Do not deploy this in a live network without fine-tuning on in-distribution data.
- Non-IoT or enterprise network traffic: BRIDGE covers IoT-specific datasets. Behaviour on corporate LAN/WAN traffic is not evaluated.
- Real-time per-packet classification: TCH-Net operates on flow-level feature windows of 32 timesteps. It requires completed or windowed flows, not individual packets.
- Unknown context at scale: The contextual branch requires dataset source and device category IDs. If these are unavailable, pass `ctx = torch.zeros(B, 2, dtype=torch.long)`; performance will degrade modestly but gracefully.
Limitations
- LODO F1 = 0.5577. The model does not generalise well to unseen dataset distributions. This is the best LODO result across all evaluated architectures (+0.09 to +0.17 above baselines), but the gap is real and quantified.
- N-BaIoT achieves high F1 (0.9854) largely because Mirai/BASHLITE signatures are statistically distinct in only 7 of 46 features. This is not a BRIDGE-wide pattern.
- Edge-IIoTset is the hardest case (F1 = 0.6755) due to IIoT packet-level traffic structures differing from the flow-level distributions dominating training.
- 85% zero-fill on N-BaIoT canonical features is a BRIDGE artefact: the canonical feature space was built around CICIDS-style flow features, which do not map cleanly to all source datasets.
- Not validated on live traffic captures or real-world deployment scenarios.
Results (5 seeds: 42, 123, 456, 789, 2024)
| Metric | TCH-Net | Best Baseline (Transformer-IDS) |
|---|---|---|
| F1 | 0.8296 ± 0.0028 | 0.7958 ± 0.0030 |
| ROC-AUC | 0.9380 ± 0.0025 | 0.9147 ± 0.0012 |
| MCC | 0.6972 ± 0.0056 | 0.6255 ± 0.0067 |
| PR-AUC | 0.8912 ± 0.0031 | 0.8699 ± 0.0041 |
TCH-Net outperforms all 12 baselines on all four metrics. All differences are statistically significant (p < 0.05, one-sided paired Wilcoxon signed-rank test).
Full Comparison Table
| Model | F1 | ROC-AUC | MCC | ΔF1 |
|---|---|---|---|---|
| TCH-Net (Ours) | 0.8296 ± 0.0028 | 0.9380 | 0.6972 | — |
| Transformer-IDS | 0.7958 ± 0.0030 | 0.9147 | 0.6255 | +0.0338** |
| 1D-CNN-IDS | 0.7932 ± 0.0076 | 0.9076 | 0.6213 | +0.0364* |
| CNN-LSTM | 0.7919 ± 0.0137 | 0.9056 | 0.6208 | +0.0377* |
| BiLSTM-IDS | 0.7805 ± 0.0010 | 0.8975 | 0.5972 | +0.0491** |
| BiGRU-IDS | 0.7805 ± 0.0011 | 0.8962 | 0.5987 | +0.0491** |
| DeepDefense | 0.7627 ± 0.0011 | 0.8776 | 0.5638 | +0.0669*** |
| XGBoost | 0.7265 ± 0.0014 | 0.8704 | 0.5542 | +0.1031*** |
| GraphSAGE-Approx | 0.7097 ± 0.0004 | 0.8259 | 0.4465 | +0.1199*** |
| Kitsune-AE | 0.7045 ± 0.0007 | 0.8200 | 0.4362 | +0.1251*** |
| MLP-IDS | 0.7039 ± 0.0008 | 0.8152 | 0.4348 | +0.1257*** |
| IoT-DNN | 0.7009 ± 0.0002 | 0.8146 | 0.4278 | +0.1287*** |
| Random Forest | 0.4323 ± 0.0082 | 0.8005 | 0.3557 | +0.3973*** |
Per-Dataset Performance
| Dataset | Feature Coverage | Detection Rate | False-Alarm Rate | F1 |
|---|---|---|---|---|
| CICIDS-2017 | 93% | 0.9433 | 0.0309 | 0.9505 |
| CIC-IoT-2023 | 87% | 0.8827 | 0.0257 | 0.9211 |
| N-BaIoT | 15% | 0.9982 | 0.0206 | 0.9854 |
| Edge-IIoTset | 22% | 0.6844 | 0.2589 | 0.6755 |
N-BaIoT achieves the highest F1 despite 85% of features being zero-filled: Mirai and BASHLITE botnet traffic is statistically distinctive enough in just 7 features that the separation is stark. Edge-IIoTset is the hardest case; its IIoT packet-level traffic is structured differently from the flow-level distributions that dominate training.
Leave-One-Dataset-Out (LODO) Generalisation
The honest number. Train on four datasets, test on the fifth, repeated five times.
| Held-Out | LODO F1 | LODO AUC |
|---|---|---|
| CICIDS-2017 | 0.3128 ± 0.232 | 0.0509 |
| CIC-IoT-2023 | 0.6013 ± 0.000 | 0.1440 |
| Bot-IoT | 0.5934 ± 0.011 | 0.5693 |
| Edge-IIoTset | 0.6791 ± 0.008 | 0.6841 |
| N-BaIoT | 0.6021 ± 0.000 | 0.8171 |
| MEAN | 0.5577 | 0.4531 |
Generalisation gap: random-split F1 (0.8296) − LODO mean (0.5577) = 0.2719.
This gap is not a TCH-Net problem. All five deep learning baselines scored between 0.39 and 0.47 LODO F1 under the same protocol. TCH-Net's 0.5577 is the highest LODO score across all evaluated architectures, +0.09 to +0.17 above the baselines. The gap is a measurement of how hard the cross-dataset problem actually is. The BRIDGE LODO mean of 0.5577 is the first formally quantified community generalisation baseline in heterogeneous IoT intrusion detection.
Temporal Split Check
| Split | F1 | AUC | MCC |
|---|---|---|---|
| Random (5 seeds) | 0.8296 | 0.9380 | 0.6972 |
| Temporal (1 seed) | 0.8203 | 0.9261 | 0.6831 |
| Δ | −0.0093 | −0.0119 | −0.0141 |
The small drop under temporal splitting confirms TCH-Net's performance is not driven by temporal leakage.
Architecture
Input: (B, 32, 46) = batch × window × canonical features
Shared Feature Projection (residual):
    Linear(46→92) → LayerNorm → GELU → Dropout(δ/2)
    → Linear(92→46) → LayerNorm;   X̃ = X + f_proj(X)
T-Branch (three parallel paths, merged to 512d):
    Path 1: ResConvSE×3 + MaxPool → BiGRU(128/dir, 2L) → 8×256
    Path 2: StrideConv(s=2, 64ch) → BiGRU(64/dir, 1L) → AvgPool(8) → 8×128
    Path 3: Linear(46→128) + LearnablePE → TransEnc(Pre-LN, 2L, 8H)
            → strip CLS → AvgPool(8) → 8×128
    Merge: concat → LayerNorm(512) → MHA(8 heads) → mean pool → 512d
H-Branch: mean(X̃, dim=time) → MLP(46→128→64, BN+GELU+Dropout) → 64d
C-Branch: Embed_ds(5,32)[c_ds] ⊕ Embed_dev(6,32)[c_dev] → 64d
CB-GAF (Cross-Branch Gated Attention Fusion):
    Project each branch to d_f = 128
    Each branch queries both others simultaneously via cross-attention
    Per-branch vector gate g^i ∈ (0,1)^128 (feature-wise, not scalar)
    x_fused = g^i ⊙ x_self + (1 − g^i) ⊙ x_cross
    concat(T_fused, C_fused, H_fused) → LayerNorm → 384d
Classifier (residual head):
    raw_proj(mean(X̃)) → 64d
    concat(384d fused, 64d raw) → z ∈ 448d
    Linear(448→256) → BN+GELU+Dropout
    Linear(256→128) + Wskip(448→128) residual skip
    Linear(128→2) → softmax
Aux Decoder (training only, λ=0.05):
    MLP(384→64→46); prevents information collapse in CB-GAF
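As a concrete illustration of the auxiliary objective, here is a hedged sketch of a reconstruction head with the shape above and how it would enter the training loss. Variable names and the MSE choice are illustrative assumptions; see the notebook for the real implementation.

```python
import torch
import torch.nn as nn

# Illustrative aux decoder: reconstructs the time-averaged 46-d input
# from the 384-d fused representation, so CB-GAF cannot collapse the
# input information during fusion.
aux_decoder = nn.Sequential(
    nn.Linear(384, 64),
    nn.GELU(),
    nn.Linear(64, 46),
)

fused = torch.randn(8, 384)    # stand-in for the CB-GAF output
x_mean = torch.randn(8, 46)    # stand-in for the time-averaged input window
recon = aux_decoder(fused)
aux_loss = nn.functional.mse_loss(recon, x_mean)
# Total training loss (sketch): classification_loss + 0.05 * aux_loss
```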
Ablation Summary
Branch Ablation (2 seeds, CB-GAF replaced with concat)
| Variant | F1 | ΔF1 |
|---|---|---|
| T+C+H (Full) | 0.8296 | — |
| T+H | 0.7756 | −0.0540 |
| T+C | 0.7752 | −0.0544 |
| T only | 0.7753 | −0.0543 |
| H only | 0.7054 | −0.1242 |
| C+H | 0.7061 | −0.1235 |
| C only | 0.6000 | −0.2296 |
The C-branch alone gets near-random performance (AUC ≈ 0.50), confirming its role is fusion conditioning, not independent prediction.
Novelty Component Ablation (2 seeds)
| Variant | F1 | ΔF1 |
|---|---|---|
| Full TCH-Net | 0.8296 | — |
| w/o CB-GAF | 0.7759 | −0.0537 |
| w/o MSTE (Three-Path) | 0.7760 | −0.0536 |
| w/o Aux Loss | 0.7755 | −0.0541 |
| w/o All (v2 baseline) | 0.7752 | −0.0544 |
Removing any single novel component costs ~0.054 F1.
Computational Profile (NVIDIA Tesla T4)
| Model | Params | Latency | Throughput | Memory | F1 |
|---|---|---|---|---|---|
| TCH-Net | 2.692M | 6.43 ± 0.18 ms | 20.5k sps | 10.27MB | 0.8296 |
| BiLSTM-IDS | 0.609M | 0.74 ± 0.02 ms | 34.2k sps | 2.32MB | 0.7805 |
| Transformer-IDS | 0.618M | 1.22 ± 0.03 ms | 36.8k sps | 2.36MB | 0.7958 |
| 1D-CNN-IDS | 0.068M | 0.69 ± 0.03 ms | 406.9k sps | 0.26MB | 0.7932 |
The 6.43ms latency supports 20,000+ detections/second under batch processing. The 10.27MB footprint is deployable on NVIDIA Jetson hardware. For microcontroller-class endpoints (Cortex-M, ESP32), quantisation or knowledge distillation would be needed.
Installation
pip install torch scikit-learn numpy huggingface_hub
Tested with Python 3.9+. No specific version pinning required beyond a modern PyTorch (≥2.0 recommended for torch.compile compatibility).
Loading the Model
import torch
import pickle
import numpy as np
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
# ── Download files ───────────────────────────────────────────────────────────
ckpt_path = hf_hub_download("Ammar-ss/BRIDGE_and_TCH-Net", "tch_net_best.pth")
scaler_path = hf_hub_download("Ammar-ss/BRIDGE_and_TCH-Net", "scaler.pkl")
# ── Load scaler ──────────────────────────────────────────────────────────────
with open(scaler_path, "rb") as f:
scaler = pickle.load(f)
# ── Load checkpoint ──────────────────────────────────────────────────────────
ckpt = torch.load(ckpt_path, map_location="cpu")
config = ckpt["config"]
# ── Define TCHNet ────────────────────────────────────────────────────────────
# Full class definition is in bridge-and-tch-net.ipynb and the GitHub repo.
# Paste or import TCHNet before instantiating:
# from tch_net import TCHNet (if using the GitHub repo)
# OR copy the class from the notebook.
model = TCHNet(
nf=config["n_features"], # 46
ws=config["window_size"], # 32
nc=config["n_classes"], # 2
)
model.load_state_dict(ckpt["state_dict"])
model.eval()
# ── Preprocess ───────────────────────────────────────────────────────────────
# X_raw: np.ndarray of shape (N, 46) - raw canonical flow features
X_scaled = np.clip(scaler.transform(X_raw), -10, 10).astype(np.float32)
# ── Inference ────────────────────────────────────────────────────────────────
# x:   FloatTensor (B, 32, 46) - windowed, scaled flow features
# ctx: LongTensor  (B, 2)      - [dataset_source_id, device_category_id]
#
# dataset_source_id: 0=CICIDS-2017 1=CIC-IoT-2023 2=Bot-IoT
# 3=Edge-IIoTset 4=N-BaIoT
# device_category_id: 0=sensor 1=camera 2=appliance 3=IIoT
# 4=server 5=unknown
#
# If context is unknown: ctx = torch.zeros(B, 2, dtype=torch.long)
# The C-branch has no independent predictive power; unknown context
# degrades gracefully rather than breaking inference.
with torch.no_grad():
logits, _ = model(x, ctx)
probs = F.softmax(logits, dim=-1)
preds = logits.argmax(dim=-1) # 0 = benign, 1 = attack
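The snippet above assumes `x` is already windowed. A minimal numpy sketch of building (B, 32, 46) windows from the scaled (N, 46) matrix with the training stride of 4 follows; the exact windowing used in the paper (e.g. per-flow vs. global ordering) is not specified here, so treat this as an assumption and check the notebook.

```python
import numpy as np

def make_windows(x: np.ndarray, window: int = 32, stride: int = 4) -> np.ndarray:
    """Slice a (N, F) feature matrix into overlapping (B, window, F) windows."""
    starts = range(0, x.shape[0] - window + 1, stride)
    return np.stack([x[s:s + window] for s in starts])

# Illustrative input: 100 scaled canonical flow records
X_scaled = np.random.rand(100, 46).astype(np.float32)
windows = make_windows(X_scaled)   # shape (B, 32, 46)
```

The resulting array can be passed through `torch.from_numpy(windows)` to produce the `x` tensor expected by the inference snippet.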
Files
| File | Description |
|---|---|
| `tch_net_best.pth` | Best checkpoint (highest F1 across all 5 seeds) |
| `tch_net_seed_42.pth` | Per-seed checkpoint, seed 42 |
| `tch_net_seed_123.pth` | Per-seed checkpoint, seed 123 |
| `tch_net_seed_456.pth` | Per-seed checkpoint, seed 456 |
| `tch_net_seed_789.pth` | Per-seed checkpoint, seed 789 |
| `tch_net_seed_2024.pth` | Per-seed checkpoint, seed 2024 |
| `scaler.pkl` | RobustScaler (q5–q95) fitted on the BRIDGE training split; required for inference |
| `manifest.json` | Config, per-seed metrics, feature names |
| `BRIDGE and TCH-Net (FULL PAPER).ipynb` | Complete experimental notebook (all 12 baselines, branch ablation, novelty ablation, LODO, temporal split, adversarial robustness, HP sensitivity) |
| `bridge-and-tch-net.ipynb` | Clean training-only notebook (TCH-Net, 5 seeds, saves checkpoints) |
Training Hyperparameters
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5×10⁻⁴ |
| Weight decay | 5×10⁻⁵ |
| Scheduler | Cosine annealing, 2-epoch warmup |
| Loss | Focal (γ=2.0, α-weighted, ε=0.05) + Aux (λ=0.05) |
| Batch size | 512 |
| Max epochs / patience | 30 / 5 |
| Sequence length | W=32, stride S=4 |
| Dropout | 0.15 |
| Input augmentation | Gaussian noise (σ=0.01, p=0.30, train only) |
| AMP | fp16 on CUDA |
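For reference, the focal term in the loss row can be sketched as a generic two-class focal loss with α-weighting and ε label smoothing. This is a standard formulation under those hyperparameters, not necessarily the paper's exact code.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None, eps=0.05):
    """Generic two-class focal loss with label smoothing.

    logits:  (B, 2) raw class scores
    targets: (B,) labels in {0, 1}
    alpha:   optional per-class weights, shape (2,)
    """
    ce = F.cross_entropy(logits, targets, weight=alpha,
                         label_smoothing=eps, reduction="none")
    pt = torch.exp(-ce)  # approximate probability of the true class
    return ((1 - pt) ** gamma * ce).mean()

logits = torch.randn(16, 2)
targets = torch.randint(0, 2, (16,))
loss = focal_loss(logits, targets, alpha=torch.tensor([0.4, 0.6]))
```

The `(1 - pt)^gamma` factor downweights easy examples so training focuses on hard, ambiguous flows; the aux term (λ=0.05) would be added on top of this classification loss.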
Citation
@article{bhilwarawala2026bridge,
title = {{BRIDGE} and {TCH-Net}: Heterogeneous Benchmark and Multi-Branch
Baseline for Cross-Domain {IoT} Botnet Detection},
author = {Bhilwarawala, Ammar and Rongmei, Likhamba and Sharma, Harsh
and Jena, Arya and Singh, Kaushal and Piri, Jayashree and Dey, Raghunath},
journal = {arXiv preprint arXiv:2604.11324},
year = {2026}
}
Model Card Authors
Ammar Bhilwarawala, KIIT University.
For questions or issues, open a discussion on this repository.