Quantizing qwen/Qwen-1_8B-Chat with NVIDIA/TensorRT-LLM
Repo: https://github.com/NVIDIA/TensorRT-LLM
1 First install the NVIDIA Container Toolkit:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html
2 Download the model to be quantized ahead of time
I use qwen/Qwen-1_8B-Chat here.
It can be downloaded from Hugging Face,
or from https://www.modelscope.cn/,
and placed under the local path ./Qwen-1_8B-Chat
3 Then set up the NVIDIA/TensorRT-LLM docker environment
My command differs from the official one in two ways:
drop the --rm flag from docker run, so changes made inside the container are preserved after it exits;
add the -v ./Qwen-1_8B-Chat:/home/qwen1_8b_chat flag to mount the local model directory at /home/qwen1_8b_chat inside the container.
# Obtain and start the basic docker image environment.
docker run --runtime=nvidia --gpus all --entrypoint /bin/bash -v ./Qwen-1_8B-Chat:/home/qwen1_8b_chat -it nvidia/cuda:12.1.0-devel-ubuntu22.04
Install the dependencies inside the container:
# Install dependencies, TensorRT-LLM requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev
# Install the latest preview version (corresponding to the main branch) of TensorRT-LLM.
# If you want to install the stable version (corresponding to the release branch), please
# remove the `--pre` option.
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
# Check installation
python3 -c "import tensorrt_llm"
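As a small generalization of the one-liner check above, here is a hedged stdlib-only sketch that tests whether any module imports cleanly (inside the container you would pass "tensorrt_llm"; the demo names below are just placeholders):

```python
# Sketch: generic "can this module be imported?" check using only the
# standard library. Inside the container, call can_import("tensorrt_llm").
import importlib

def can_import(name: str) -> bool:
    """Return True if `name` imports without error, False otherwise."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

print(can_import("json"))                # a stdlib module -> True
print(can_import("no_such_module_xyz"))  # a missing module -> False
```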
Clone the code
Inside the container, run in the /home directory:
git clone https://github.com/NVIDIA/TensorRT-LLM.git
Go to the qwen example directory and install its dependencies:
cd /home/TensorRT-LLM/examples/qwen
pip install -r requirements.txt
Build the Qwen model on a single GPU with INT8 weight-only quantization
1 Set the environment variables on the command line, adjusting them for your own setup:
MODEL_DIR=/home/qwen1_8b_chat
OUT_CHECK_POINT_DIR=/home/checkpoints/qwen_checkpoint_1gpu_fp16_wq
TRT_ENGINES_DIR=/home/trt_engines/qwen1_8b_chat/weight_only/1-gpu/
2 Convert the model to a TensorRT-LLM checkpoint:
python3 convert_checkpoint.py --model_dir $MODEL_DIR \
--output_dir $OUT_CHECK_POINT_DIR \
--dtype float16 \
--use_weight_only \
--weight_only_precision int8
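Conceptually, the --use_weight_only --weight_only_precision int8 options store each weight row as int8 codes plus one float scale, and the weights are dequantized back to floating point when the matmul runs. A toy pure-Python sketch of that idea (illustrative only, not TensorRT-LLM internals):

```python
# Toy illustration of symmetric per-row INT8 weight-only quantization:
# each row is stored as int8 codes plus one float scale, and is
# dequantized back to floats when used. Not TensorRT-LLM code.

def quantize_row_int8(row):
    """scale = max(|w|)/127; codes = round(w/scale), clamped to [-127, 127]."""
    scale = max(abs(w) for w in row) / 127.0 or 1.0  # guard all-zero rows
    codes = [max(-127, min(127, round(w / scale))) for w in row]
    return codes, scale

def dequantize_row(codes, scale):
    return [c * scale for c in codes]

row = [0.12, -0.5, 0.33, 0.07]
codes, scale = quantize_row_int8(row)
approx = dequantize_row(codes, scale)
print(codes)   # int8 codes, all within [-127, 127]
print(approx)  # close to the original row; max error is about scale/2
```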
3 Build the TRT engine:
trtllm-build --checkpoint_dir $OUT_CHECK_POINT_DIR \
--output_dir $TRT_ENGINES_DIR \
--gemm_plugin float16
4 Run inference with the quantized model:
python3 ../run.py --input_text "你好,请问你叫什么?" \
--max_output_len=50 \
--tokenizer_dir $MODEL_DIR \
--engine_dir=$TRT_ENGINES_DIR
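run.py's --max_output_len flag caps how many tokens are generated; decoding also stops early if the model emits an end-of-sequence token. A toy greedy-decoding sketch of that contract (fake models stand in for the real one; this is not the TensorRT-LLM runtime):

```python
# Toy sketch of what --max_output_len bounds: a greedy decoding loop
# that stops at an end-of-sequence token or at the length cap,
# whichever comes first. Not TensorRT-LLM's actual runtime.

def greedy_decode(step_fn, prompt_ids, max_output_len, eos_id=0):
    """step_fn(ids) -> next token id; append until eos or the cap."""
    out = list(prompt_ids)
    for _ in range(max_output_len):
        nxt = step_fn(out)
        if nxt == eos_id:
            break
        out.append(nxt)
    return out

# Fake model: emits token 7 three times, then the eos token.
emissions = iter([7, 7, 7, 0])
print(greedy_decode(lambda ids: next(emissions), [1, 2], max_output_len=50))
# -> [1, 2, 7, 7, 7]
```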
Problems encountered along the way
1 flash-attn took a long time to build and eventually failed:
Building wheel for flash-attn (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [9 lines of output]
fatal: not a git repository (or any of the parent directories): .git
Solution
2 Library compatibility:
The versions below are what worked for me after repeated trials:
accelerate==0.25.0
auto-gptq==0.6.0
tensorrt-llm==0.10.0.dev2024041600
root@996e547c677a:/home/TensorRT-LLM/examples/qwen# pip list |grep accelerate
accelerate 0.25.0
root@996e547c677a:/home/TensorRT-LLM/examples/qwen# pip list |grep auto
auto-gptq 0.6.0
root@996e547c677a:/home/TensorRT-LLM/examples/qwen# pip list |grep tensorrt-llm
tensorrt-llm 0.10.0.dev2024041600
root@996e547c677a:/home/TensorRT-LLM/examples/qwen#