Instructions to use rp-yu/Dimple-7B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use rp-yu/Dimple-7B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="rp-yu/Dimple-7B", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```
```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("rp-yu/Dimple-7B", trust_remote_code=True, dtype="auto")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use rp-yu/Dimple-7B with vLLM:
Install from pip and serve the model:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "rp-yu/Dimple-7B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "rp-yu/Dimple-7B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```
Use Docker:
```shell
docker model run hf.co/rp-yu/Dimple-7B
```
- SGLang
How to use rp-yu/Dimple-7B with SGLang:
Install from pip and serve the model:
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "rp-yu/Dimple-7B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "rp-yu/Dimple-7B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```
Use Docker images:
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "rp-yu/Dimple-7B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "rp-yu/Dimple-7B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```
- Docker Model Runner
How to use rp-yu/Dimple-7B with Docker Model Runner:
```shell
docker model run hf.co/rp-yu/Dimple-7B
```
🤗 Model | 💬 Demo: Chat with Dimple | 📑 Paper | ✨ Code
💧 Dimple-7B
Dimple is the first Discrete Diffusion Multimodal Large Language Model (DMLLM), trained with a hybrid paradigm that combines autoregressive and diffusion-based instruction tuning. Its architecture is similar to Qwen and LLaVA, but it introduces an autoregressive-then-diffusion training strategy:
- Stage 1: Autoregressive fine-tuning for alignment and initial instruction tuning.
- Stage 2: Diffusion-based fine-tuning for enhanced instruction-following capabilities.
Trained on the same dataset as LLaVA-NEXT, Dimple-7B surpasses LLaVA-NEXT-7B by 3.9%, demonstrating that diffusion-based multimodal language models can match their autoregressive counterparts under a similar training budget.
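The stage-2 corruption step can be illustrated with a toy sketch (the mask id, ratio, and function names here are illustrative, not Dimple's actual implementation): a random subset of response tokens is replaced by a mask id, and training then supervises only those positions.

```python
import random

MASK_ID = -1  # placeholder mask-token id (illustrative)

def corrupt_for_diffusion(tokens, mask_ratio, seed=0):
    """Stage-2-style corruption (sketch): mask a random subset of the
    response tokens; the model is trained to recover exactly those
    positions, rather than predicting left-to-right."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK_ID if i in masked_positions else t
                 for i, t in enumerate(tokens)]
    return corrupted, masked_positions

tokens = [11, 22, 33, 44, 55, 66]
corrupted, positions = corrupt_for_diffusion(tokens, mask_ratio=0.5)
print(corrupted, positions)
```

Stage 1 is ordinary next-token fine-tuning on the same data; only stage 2 switches to this masked-recovery objective.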
🔍 Highlights
- Hybrid Training: Combines autoregressive and diffusion training.
- Diffusion Decoding: Supports confident decoding, random decoding, maskgit-style decoding, and entropy-based decoding.
- Controllable Generation: Enables fine-grained control over format, structure, and length via structure priors.
- Autoregressive-like Prefilling: Enhances inference speed using prefilling techniques.
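The decoding strategies above share one loop: repeatedly predict every masked position, then commit those whose confidence clears a threshold, falling back to the single most confident position so each step makes progress. A minimal pure-Python sketch of that confident-decoding loop, with a toy predictor standing in for the model (the predictor and its confidence values are invented for illustration):

```python
MASK = None  # placeholder for a still-masked position

def confident_decode(predict, length, threshold=0.9):
    """Fill a fully masked sequence. `predict(seq)` returns one
    (token, confidence) pair per position. Each step commits every masked
    position whose confidence clears `threshold`; if none does, it commits
    the single most confident one so the loop always terminates."""
    seq = [MASK] * length
    steps = 0
    while MASK in seq:
        preds = predict(seq)
        picks = [i for i, s in enumerate(seq)
                 if s is MASK and preds[i][1] >= threshold]
        if not picks:
            picks = [max((i for i, s in enumerate(seq) if s is MASK),
                         key=lambda i: preds[i][1])]
        for i in picks:
            seq[i] = preds[i][0]
        steps += 1
    return seq, steps

# Toy predictor: fixed base confidences, boosted by unmasked neighbours.
TARGET = ["The", "cat", "sat", "down"]
BASE = [0.99, 0.60, 0.95, 0.75]

def toy_predict(seq):
    out = []
    for i in range(len(seq)):
        boost = 0.2 * sum(1 for j in (i - 1, i + 1)
                          if 0 <= j < len(seq) and seq[j] is not MASK)
        out.append((TARGET[i], BASE[i] + boost))
    return out

print(confident_decode(toy_predict, 4))  # all four tokens resolve in 2 steps
```

Because several positions can be committed in one step, the number of decoding steps can be much smaller than the sequence length, which is the source of the parallel-decoding speedup.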
📊 Evaluation Results
| Benchmark | Dimple-7B (ours) | LLaVA-1.5-7B | LLaVA-NEXT-7B | Eagle-7B | Eagle2-9B | Qwen-VL-7B | Qwen2.5-VL-7B |
|---|---|---|---|---|---|---|---|
| Training Samples | 1.3M | 1.2M | 1.3M | 2.4M | 27.8M | 1.5B | - |
| Training Tokens | 0.8B | - | - | - | - | - | 2.6T |
| Base LLM | Dream (Qwen2.5) | Vicuna | Vicuna-1.5 | Vicuna | Qwen2.5 | Qwen | Qwen2.5 |
| GQA | 59.2 | 62.0 | 64.8 | 64.9 | - | 59.3 | - |
| MMBench (en test) | 74.6 | 64.3 | 68.7 | 68.4 | - | - | 83.5 |
| MME (Perception) | 1514 | 1510 | 1519 | 1528 | - | - | - |
| MME (Cognition) | 432 | - | 332 | - | - | - | - |
| MME (Total) | 1946 | - | 1851 | - | - | - | 2347 |
| POPE | 86.2 | 85.8 | 86.7 | 88.8 | - | - | - |
| MMMU (val) | 45.2 | - | 35.8 | 36.3 | 56.1 | - | 58.6 |
| SQA (img) | 77.1 | 66.8 | 72.8 | 70.0 | - | - | - |
| AI2D | 74.4 | - | 65.4 | - | 83.9 | 62.3 | 83.9 |
| ChartQA | 63.4 | - | 54.9 | 67.7 | 86.4 | 65.7 | 87.3 |
| TextVQA | 61.6 | - | 64.8 | - | 83.0 | - | - |
| OCRBench | 565 | - | 490 | 529 | - | - | - |
| MathVista (mini) | 42.3 | - | 33.0 | - | 63.8 | 37.0 | 68.2 |
| MMVet | 41.2 | 31.1 | 47.3 | - | 62.2 | - | 67.1 |
🛠️ Environment
Make sure your environment includes the following versions:
```
transformers==4.46.2
torch==2.5.1
accelerate==1.6.0
```
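Assuming a pip-based environment, the pinned versions above can be installed in one command:

```shell
pip install transformers==4.46.2 torch==2.5.1 accelerate==1.6.0
```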
🚀 Inference Example
```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

model_name = "rp-yu/Dimple-7B"
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
messages = [
    [{"role": "user", "content": [
        {"type": "image", "image": image_url},
        {"type": "text", "text": "Describe this image."},
    ]}],
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, add_vision_id=False
)
images = [
    Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
]

inputs = processor(
    text=text,
    images=images,
    videos=None,
    padding="longest",
    return_tensors="pt",
)
input_ids = inputs.pop("input_ids")
output = model.diffusion_generate(
    input_ids,
    max_new_tokens=64,
    output_history=True,
    return_dict_in_generate=True,
    steps=64,
    temperature=0.2,
    top_p=0.95,
    alg="origin",
    use_cache=True,
    alg_p_threshold=0.95,
    use_original_confidence=True,
    decoding_pipeline="dim",
    **inputs,
)

# Strip the prompt tokens, decode each generation, and cut at the EOS token
generations = [
    processor.tokenizer.decode(g[len(p):].cpu().tolist())
    for p, g in zip(input_ids, output.sequences)
]
for j in range(len(messages)):
    print("output:", j, generations[j].split(processor.tokenizer.eos_token)[0])
# output: 0 In the image, a woman wearing a shirt with a plaid and a dog are sitting together on a beach. The sun appears to be setting in the background, creating a warm and serene atmosphere.
```
📚 Citation
```bibtex
@misc{dimple,
      title={Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding},
      author={Runpeng Yu and Xinyin Ma and Xinchao Wang},
      year={2025},
      eprint={2505.16990},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.16990},
}
```
Model tree for rp-yu/Dimple-7B
- Base model: Dream-org/Dream-v0-Instruct-7B