Papers
arxiv:2604.24040

Improving Robustness of Tabular Retrieval via Representational Stability

Published on Apr 27
· Submitted by
Kushal Raj Bhandari
on Apr 28
Abstract

Transformer-based table retrieval systems flatten structured tables into token sequences, making retrieval sensitive to the choice of serialization even when table semantics remain unchanged. We show that semantically equivalent serializations, such as CSV, TSV, HTML, Markdown, and DDL, can produce substantially different embeddings and retrieval results across multiple benchmarks and retriever families. To address this instability, we treat serialization embeddings as noisy views of a shared semantic signal and use their centroid as a canonical target representation. We show that centroid averaging suppresses format-specific variation and can recover the semantic content common to different serializations when format-induced shifts differ across tables. Empirically, centroid representations outrank individual formats in aggregate pairwise comparisons across MPNet, BGE-M3, ReasonIR, and SPLADE. We further introduce a lightweight residual bottleneck adapter on top of a frozen encoder that maps single-serialization embeddings towards centroid targets while preserving variance and enforcing covariance regularization. The adapter improves robustness for several dense retrievers, though gains are model-dependent and weaker for sparse lexical retrieval. These results identify serialization sensitivity as a major source of retrieval variance and show the promise of post hoc geometric correction for serialization-invariant table retrieval. Our code, datasets, and models are available at https://github.com/KBhandari11/Centroid-Aligned-Table-Retrieval.

Community

Paper author Paper submitter

Transformer retrievers flatten tables into token sequences, and the choice of serialization format (CSV, HTML, DDL, Markdown, and so on) produces substantially different embeddings even when the underlying table data stays identical. This paper quantifies this instability across four retriever families and three benchmarks, then proposes averaging embeddings across serialization formats to compute a stable centroid representation. A lightweight residual bottleneck adapter then learns to approximate that centroid from a single serialization at inference time, keeping the base encoder frozen throughout.
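The centroid construction described above can be sketched in a few lines. Here `encode` is a hypothetical stand-in for any frozen dense encoder's embedding function, not an API from the paper's released code:

```python
import numpy as np

def centroid_embedding(serializations, encode):
    """Average one table's embeddings across serialization formats.

    `serializations` maps a format name (e.g. "csv", "html") to the
    serialized table text; `encode` is a hypothetical stand-in for a
    frozen dense encoder returning a unit-norm vector per string.
    """
    embs = np.stack([encode(text) for text in serializations.values()])
    centroid = embs.mean(axis=0)
    # Re-normalize so the centroid is usable as a cosine-similarity target.
    return centroid / np.linalg.norm(centroid)
```

Note that at inference time the method does not average formats on the fly; the adapter instead learns to map a single-format embedding toward this centroid target.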

Key findings:

  • Serialization format acts as a first-order retrieval variable, with Recall@1 swings as large as 0.26 on NQ-Tables for a single retriever across formats
  • Centroid averaging outranks every individual serialization format in aggregate pairwise comparisons across all tested models
  • The residual adapter delivers meaningful gains for dense retrievers, particularly MPNet and ReasonIR on syntactically heavy or structurally perturbed serializations
  • The subset adapter, trained only on WTQ and WikiSQL, transfers partially to unseen NQ-Tables data, suggesting the learned correction captures format-structural patterns rather than dataset-specific ones
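A minimal sketch of a residual bottleneck adapter with a VICReg-style objective, assuming an MSE alignment term toward the centroid targets plus variance-preservation and off-diagonal covariance penalties; the dimensions, loss weights, and initialization below are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, b_dim = 768, 64  # embedding dim and bottleneck dim (illustrative)

# Residual bottleneck adapter: x + up(relu(down(x))); the base encoder
# producing x stays frozen, only W_down / W_up would be trained.
W_down = rng.normal(0.0, 0.02, (d, b_dim))
W_up = rng.normal(0.0, 0.02, (b_dim, d))

def adapter(x):
    return x + np.maximum(x @ W_down, 0.0) @ W_up

def loss(x, centroids, lam_var=1.0, lam_cov=0.04):
    """Align adapted embeddings with centroid targets while keeping
    per-dimension variance and penalizing off-diagonal covariance
    (VICReg-style terms; the paper's exact weighting is not reproduced)."""
    z = adapter(x)
    align = ((z - centroids) ** 2).mean()
    zc = z - z.mean(axis=0)
    var = np.mean(np.maximum(0.0, 1.0 - np.sqrt(zc.var(axis=0) + 1e-4)))
    cov = (zc.T @ zc) / (len(z) - 1)
    off_diag = (cov ** 2).sum() - (np.diag(cov) ** 2).sum()
    return align + lam_var * var + lam_cov * off_diag / d
```

Because the correction is a small residual on top of the frozen embedding, a near-zero initialization leaves retrieval behavior unchanged at the start of training.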




Models citing this paper: 8
