Article preview
A total of 2,052 Chinese and English user inputs were sampled from the moss-003-sft-data dataset. Using the vllm 0.6.1 inference framework, the inference speed of three models (Qwen2.5-7B-Instruct, Qwen2.5-7B-Instruct-AWQ, and Qwen2.5-7B-Instruct-GPTQ-Int4) was measured under different parameter configurations. The total elapsed times are shown in the table below:

| generation_config | Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct-AWQ | Qwen2.5-7B-Instruct-GPTQ-Int4 |
| --- | --- | --- | --- |
| temperature=0.7 top_p=0.8 repetition_penalty=1.05 max_tokens=512 | 353.88s | 498.58s | 460.70s |
| temperature=0 repetition_penalty=1.0 max_tokens=512 | 259.86s | 410.14s | 371.89s |

temperature=0.7 top_p=0.8 repetition_penalty=1.05 max_tokens=512:

| model | Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct-AWQ | Qwen2.5-7B-Instruct-GPTQ-Int4 |
| --- | --- | --- | --- |
| speed input | 115.08 toks/s | 81.66 toks/s | 88.38 toks/s |
| speed output | 2548.73 toks/s | 1818.71 toks/s | 1947.42 toks/s |

temperature=0 repetition_penalty=1.0 max_tokens=512:

| model | Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct-AWQ | Qwen2.5-7B-Instruct-GPTQ-Int4 |
| --- | --- | --- | --- |
| speed input | 156.78 toks/s | 99
………………………………
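A measurement like the one described above could be sketched as follows. This is only an illustration, not the author's actual script: the helper names `benchmark` and `run_vllm`, the config labels `sampling` and `greedy`, and the way throughput is derived (total tokens divided by wall-clock time) are all assumptions. The preview does not say whether a chat template was applied or whether model load time was included in the timings, so this sketch times only the generation call on raw prompts.

```python
import time

# The two generation configurations listed in the results table.
# The labels "sampling" and "greedy" are hypothetical names for this sketch.
GENERATION_CONFIGS = {
    "sampling": dict(temperature=0.7, top_p=0.8,
                     repetition_penalty=1.05, max_tokens=512),
    "greedy": dict(temperature=0, repetition_penalty=1.0, max_tokens=512),
}


def benchmark(generate, prompts):
    """Time one batch generation call and derive token throughput.

    `generate` is any callable that takes a list of prompts and returns
    objects shaped like vllm's RequestOutput: each has `prompt_token_ids`
    and a list `outputs` whose items have `token_ids`.
    """
    start = time.perf_counter()
    results = generate(prompts)
    # Guard against a zero interval when the callable returns instantly.
    elapsed = max(time.perf_counter() - start, 1e-9)

    input_tokens = sum(len(r.prompt_token_ids) for r in results)
    output_tokens = sum(len(c.token_ids) for r in results for c in r.outputs)
    return {
        "elapsed_s": elapsed,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "input_toks_per_s": input_tokens / elapsed,
        "output_toks_per_s": output_tokens / elapsed,
    }


def run_vllm(model_name, prompts, config_name):
    """Run the benchmark against a real vllm engine (requires a GPU).

    vllm is imported lazily so that `benchmark` stays usable without it.
    `model_name` is whatever model path or hub id you have available.
    """
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_name)
    params = SamplingParams(**GENERATION_CONFIGS[config_name])
    return benchmark(lambda batch: llm.generate(batch, params), prompts)
```

Calling `run_vllm` once per model and per configuration would produce one row of elapsed time and throughput for each combination. Because each call builds a fresh engine, run each model in its own process to avoid holding several models in GPU memory at once.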