In a Training Loop 🔄

John Smith

John6666

AI & ML interests

None yet

Recent Activity

reacted to tomaarsen's post with 🚀 about 3 hours ago
🌐 I've just published Sentence Transformers v5.4 to make the project fully multimodal for embeddings and reranking. The release also includes a modular CrossEncoder and automatic Flash Attention 2 input flattening.

Details: You can now use SentenceTransformer and CrossEncoder with text, images, audio, and video, with the same familiar API. That means you can compute embeddings for an image and a text query using model.encode(), compare them with model.similarity(), and it just works. Models like Qwen3-VL-Embedding-2B and jinaai/jina-reranker-m0 are supported out of the box.

Beyond multimodal, I also fully modularized the CrossEncoder class. It's now a torch.nn.Sequential of composable modules, just like SentenceTransformer has been. This unlocked support for generative rerankers (CausalLM-based models like mxbai-rerank-v2 and the Qwen3 rerankers) via a new LogitScore module, which wasn't possible before without custom code.

Also, Flash Attention 2 now automatically skips padding for text-only inputs. If your batch has a mix of short and long texts, this gives you a nice speedup and lower VRAM usage for free.

I wrote a blog post walking through the multimodal features with practical examples. Check it out if you want to get started, or just point your Agent to the URL: https://huggingface.co/blog/multimodal-sentence-transformers

This release has laid the groundwork for more easily introducing late-interaction models (both text-only and multimodal) into Sentence Transformers in the next major release. I'm looking forward to it!
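For readers who want to try this, here is a minimal sketch of the multimodal embedding flow the post describes, assuming Sentence Transformers v5.4+. The encode() and similarity() calls are the ones named above; the full Qwen3-VL-Embedding-2B repo id, the PIL image input, and the file path are illustrative assumptions.

```python
# Minimal sketch: multimodal embeddings with Sentence Transformers v5.4+.
# The repo id and image path are assumptions, not verbatim from the release.
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")  # repo id assumed

# Encode a text query and an image with the same familiar API.
text_emb = model.encode(["a cat sleeping on a windowsill"])
image_emb = model.encode([Image.open("cat.jpg")])  # placeholder local file

# similarity() returns a score matrix comparing the two sets of embeddings.
print(model.similarity(text_emb, image_emb))
```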
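Similarly, a minimal sketch of reranking with the modular CrossEncoder, assuming v5.4+. The model id jinaai/jina-reranker-m0 comes from the post; the query and documents are invented examples, and predict() over (query, document) pairs is the long-standing CrossEncoder API.

```python
# Minimal sketch: reranking with the (now modular) CrossEncoder.
# Model id is from the post; query and documents are invented examples.
from sentence_transformers import CrossEncoder

model = CrossEncoder("jinaai/jina-reranker-m0")

query = "how do I enable Flash Attention 2?"
documents = [
    "Flash Attention 2 support can be enabled when loading the model.",
    "The museum is open from 9 am to 5 pm on weekdays.",
]

# Score each (query, document) pair; higher scores mean more relevant.
scores = model.predict([(query, doc) for doc in documents])
print(scores)
```

Under the new modular design, CausalLM-based rerankers such as mxbai-rerank-v2 should load through this same CrossEncoder entry point (via the new LogitScore module), so calling code like the above stays unchanged.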

Organizations

Glide, open/ acc, mekasiu, Solving Real World Problems, FashionStash Group meeting, No More Copyright, SAGEA, XORTRON - Criminal Computing