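This seems pretty underwhelming: a 7B model does not even fit in the VRAM of a 3090, and without the recommended acceleration packages it is painfully slow.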
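This is the best-performing open-source model right now.
It is indeed somewhat slow, roughly 10x slower than our internal framework. We are still investigating the cause in the HF version.
It loads fully on a 3090 with no problem. Did you forget to enable bf16? Try this:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
No problem at all on a 3090.
Simple multiplication: 4 bytes (float32) × 7B parameters = 28 GB > 24 GB. With fp16 it is 2 bytes × 7B = 14 GB < 24 GB, so it loads.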
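The multiplication above can be written as a small helper. This is only a rough sketch: it counts the weights alone (no activations, KV cache, or framework overhead), it uses the round figure of 7B parameters from the post, and the function names are made up for illustration.

```python
# Rough weight-only memory estimate: parameter count x bytes per parameter.
# Activations, KV cache and CUDA/framework overhead are NOT included.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits_in_vram(num_params: float, dtype: str, vram_gb: float) -> bool:
    """True if the weights alone are smaller than the available VRAM."""
    return weight_memory_gb(num_params, dtype) < vram_gb


if __name__ == "__main__":
    num_params = 7e9   # round figure used in the thread
    vram_gb = 24.0     # RTX 3090
    for dtype in BYTES_PER_PARAM:
        size = weight_memory_gb(num_params, dtype)
        verdict = "fits" if fits_in_vram(num_params, dtype, vram_gb) else "does not fit"
        print(f"{dtype:>8}: {size:5.1f} GB -> {verdict} in {vram_gb:.0f} GB")
```

Because only the weights are counted, a result close to the VRAM limit should be read as "probably does not fit" in practice.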
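Being slow by default is normal too: you are using the Transformers library, and that has nothing to do with the model itself.
Swap the 3090 for an A100 or H100 and it will be fast.
It seems to use only one CPU thread, which is probably the main reason it is slow.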
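> This is the best-performing open-source model right now.
Isn't the best-performing one GLM2-6B?
I have heard that the best-performing Chinese model is Baichuan's.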
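It runs blazingly fast with vLLM acceleration.
> It runs blazingly fast with vLLM acceleration.
What kind of GPU did you run the vLLM acceleration test on?
No problem on a 3090. I just ran the test program: 4-bit uses about 40% of RAM, 8-bit about 50%, and without bitsandbytes about 70%. It is fast.
> It runs blazingly fast with vLLM acceleration.
> What kind of GPU did you run the vLLM acceleration test on?
I tested on an A100.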
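For anyone who wants to try the vLLM route mentioned above, here is a minimal client sketch using only the Python standard library. It assumes a vLLM server is already running locally (for example, started with `vllm serve "Qwen/Qwen-7B-Chat"`, possibly with extra flags such as remote-code trust depending on your vLLM version) and listening on port 8000 with its OpenAI-compatible completions endpoint. The helper names `build_request` and `complete` are hypothetical.

```python
import json
import urllib.request

# Assumption: a local vLLM server exposing the OpenAI-compatible API on port 8000.
COMPLETIONS_URL = "http://localhost:8000/v1/completions"


def build_request(prompt, max_tokens=512, temperature=0.5, url=COMPLETIONS_URL):
    """Build (but do not send) a POST request for the completions endpoint."""
    payload = {
        "model": "Qwen/Qwen-7B-Chat",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Once the server is up, `print(complete("Once upon a time,"))` should print a continuation. Throughput will depend heavily on the GPU, as the replies above suggest.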
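It really does use a lot of VRAM. I ran chatglm2-6b and qwen-7b separately on a single Tesla V100 and asked questions using a long prompt template; qwen fails with an out-of-memory error.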
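One possible contributor to the difference with long prompts is the key/value cache, which grows linearly with the number of tokens. The sketch below estimates its size. All architecture numbers are assumptions recalled from the published configs (Qwen-7B: 32 layers, 32 KV heads, head dimension 128; ChatGLM2-6B: 28 layers, 2 KV groups via multi-query attention) and should be checked against each model's `config.json`. Attention buffers and the logits over the vocabulary add further memory that is not modelled here.

```python
def kv_cache_bytes(seq_len, batch_size=1, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Size of the KV cache in bytes.

    Defaults are ASSUMED values for Qwen-7B in fp16/bf16; verify them
    against the model's config.json before relying on the result.
    """
    # Factor 2: one key tensor and one value tensor per layer.
    return (2 * batch_size * seq_len * num_layers
            * num_kv_heads * head_dim * bytes_per_value)


if __name__ == "__main__":
    gib = 1024 ** 3
    for seq_len in (512, 2048, 8192):
        qwen = kv_cache_bytes(seq_len)
        # ASSUMED ChatGLM2-6B shape: 28 layers, 2 KV groups (multi-query attention).
        glm2 = kv_cache_bytes(seq_len, num_layers=28, num_kv_heads=2)
        print(f"{seq_len:5d} tokens: qwen-7b ~{qwen / gib:.3f} GiB, "
              f"chatglm2-6b ~{glm2 / gib:.3f} GiB")
```

Under these assumptions the cache costs about 0.5 MiB per token for qwen-7b, so a long template on top of roughly 14 GB of fp16 weights can exhaust a 16 GB card.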
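> It loads fully on a 3090 with no problem. Did you forget to enable bf16? Try this:
> from transformers import AutoModelForCausalLM
> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
In my earlier tests, loading and inference on a 3090 worked fine too, but I could not get fine-tuning to run. Has anyone fine-tuned this model, and how much VRAM does it need? max-length: 1024, bs: 1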
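As a rough way to reason about the fine-tuning question, the sketch below applies the common rule of thumb for mixed-precision training with Adam: about 16 bytes per trainable parameter (2 for fp16 weights, 2 for gradients, 12 for fp32 master weights and the two optimizer moments). Activations are ignored, the 7B figure is the round number used earlier in the thread, and the 1% trainable fraction in the second estimate is purely hypothetical, standing in for a parameter-efficient method.

```python
def full_finetune_gb(num_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Weights + gradients + Adam state for full fine-tuning, in GB (1e9 bytes).

    Rule-of-thumb estimate only; activation memory is NOT included.
    """
    return num_params * (weight_bytes + grad_bytes + optimizer_bytes) / 1e9


def partial_finetune_gb(num_params, trainable_fraction, frozen_bytes=2):
    """Frozen fp16 base weights plus full training state for the trainable part."""
    trainable = num_params * trainable_fraction
    return num_params * frozen_bytes / 1e9 + full_finetune_gb(trainable)


if __name__ == "__main__":
    num_params = 7e9  # round figure from the thread
    print(f"full fine-tuning:      ~{full_finetune_gb(num_params):.0f} GB")
    # HYPOTHETICAL: only 1% of the parameters are trained.
    print(f"1% trainable (sketch): ~{partial_finetune_gb(num_params, 0.01):.1f} GB")
```

By this estimate, full fine-tuning of a 7B model is far beyond a single 24 GB card even at batch size 1, which would be consistent with the failure described above. Training only a small fraction of the parameters brings the state much closer to the inference footprint.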