Quantizing qwen/Qwen-1_8B-Chat with NVIDIA/TensorRT-LLM
Repo: https://github.com/NVIDIA/TensorRT-LLM
1 First install the NVIDIA Container Toolkit:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html
2 Download the model to be quantized ahead of time
I use qwen/Qwen-1_8B-Chat here.
It can be downloaded from Hugging Face,
or from https://www.modelscope.cn/,
and placed under the local path ./Qwen-1_8B-Chat
3 Then set up the NVIDIA/TensorRT-LLM docker environment
My command differs from the official one in two ways:
drop the --rm flag from docker run, so changes made inside the container are preserved after it exits;
add the -v ./Qwen-1_8B-Chat:/home/qwen1_8b_chat flag to mount the local model directory at /home/qwen1_8b_chat inside the container.
# Obtain and start the basic docker image environment.
docker run --runtime=nvidia --gpus all --entrypoint /bin/bash -v ./Qwen-1_8B-Chat:/home/qwen1_8b_chat -it nvidia/cuda:12.1.0-devel-ubuntu22.04
Install the dependencies inside the container:
# Install dependencies, TensorRT-LLM requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev
# Install the latest preview version (corresponding to the main branch) of TensorRT-LLM.
# If you want to install the stable version (corresponding to the release branch), please
# remove the `--pre` option.
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
# Check installation
python3 -c "import tensorrt_llm"
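As a small generalization of the one-liner check above, here is a hedged stdlib-only sketch that tests whether any module imports cleanly (inside the container you would pass "tensorrt_llm"; the demo names below are just placeholders):

```python
# Sketch: generic "can this module be imported?" check using only the
# standard library. Inside the container, call can_import("tensorrt_llm").
import importlib

def can_import(name: str) -> bool:
    """Return True if `name` imports without error, False otherwise."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

print(can_import("json"))                # a stdlib module -> True
print(can_import("no_such_module_xyz"))  # a missing module -> False
```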
Clone the code
Inside the container, run in the /home directory:
git clone https://github.com/NVIDIA/TensorRT-LLM.git
Go to the qwen example directory and install its dependencies:
cd /home/TensorRT-LLM/examples/qwen
pip install -r requirements.txt
Build the Qwen model on a single GPU with INT8 weight-only quantization
1 Set the environment variables on the command line, adjusting them for your own setup:
MODEL_DIR=/home/qwen1_8b_chat
OUT_CHECK_POINT_DIR=/home/checkpoints/qwen_checkpoint_1gpu_fp16_wq
TRT_ENGINES_DIR=/home/trt_engines/qwen1_8b_chat/weight_only/1-gpu/
2 Convert the model to a TensorRT-LLM checkpoint:
python3 convert_checkpoint.py --model_dir $MODEL_DIR \
--output_dir $OUT_CHECK_POINT_DIR \
--dtype float16 \
--use_weight_only \
--weight_only_precision int8
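Conceptually, the --use_weight_only --weight_only_precision int8 options store each weight row as int8 codes plus one float scale, and the weights are dequantized back to floating point when the matmul runs. A toy pure-Python sketch of that idea (illustrative only, not TensorRT-LLM internals):

```python
# Toy illustration of symmetric per-row INT8 weight-only quantization:
# each row is stored as int8 codes plus one float scale, and is
# dequantized back to floats when used. Not TensorRT-LLM code.

def quantize_row_int8(row):
    """scale = max(|w|)/127; codes = round(w/scale), clamped to [-127, 127]."""
    scale = max(abs(w) for w in row) / 127.0 or 1.0  # guard all-zero rows
    codes = [max(-127, min(127, round(w / scale))) for w in row]
    return codes, scale

def dequantize_row(codes, scale):
    return [c * scale for c in codes]

row = [0.12, -0.5, 0.33, 0.07]
codes, scale = quantize_row_int8(row)
approx = dequantize_row(codes, scale)
print(codes)   # int8 codes, all within [-127, 127]
print(approx)  # close to the original row; max error is about scale/2
```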
3 Build the TRT engine:
trtllm-build --checkpoint_dir $OUT_CHECK_POINT_DIR \
--output_dir $TRT_ENGINES_DIR \
--gemm_plugin float16
4 Run inference with the quantized model:
python3 ../run.py --input_text "你好,请问你叫什么?" \
--max_output_len=50 \
--tokenizer_dir $MODEL_DIR \
--engine_dir=$TRT_ENGINES_DIR
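run.py's --max_output_len flag caps how many tokens are generated; decoding also stops early if the model emits an end-of-sequence token. A toy greedy-decoding sketch of that contract (fake models stand in for the real one; this is not the TensorRT-LLM runtime):

```python
# Toy sketch of what --max_output_len bounds: a greedy decoding loop
# that stops at an end-of-sequence token or at the length cap,
# whichever comes first. Not TensorRT-LLM's actual runtime.

def greedy_decode(step_fn, prompt_ids, max_output_len, eos_id=0):
    """step_fn(ids) -> next token id; append until eos or the cap."""
    out = list(prompt_ids)
    for _ in range(max_output_len):
        nxt = step_fn(out)
        if nxt == eos_id:
            break
        out.append(nxt)
    return out

# Fake model: emits token 7 three times, then the eos token.
emissions = iter([7, 7, 7, 0])
print(greedy_decode(lambda ids: next(emissions), [1, 2], max_output_len=50))
# -> [1, 2, 7, 7, 7]
```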
Problems encountered along the way
1 flash-attn took a long time to build and eventually failed:
Building wheel for flash-attn (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [9 lines of output]
fatal: not a git repository (or any of the parent directories): .git
Solution
2 Library compatibility:
The versions below are what worked for me after repeated trials:
accelerate==0.25.0
auto-gptq==0.6.0
tensorrt-llm==0.10.0.dev2024041600
root@996e547c677a:/home/TensorRT-LLM/examples/qwen# pip list |grep accelerate
accelerate 0.25.0
root@996e547c677a:/home/TensorRT-LLM/examples/qwen# pip list |grep auto
auto-gptq 0.6.0
root@996e547c677a:/home/TensorRT-LLM/examples/qwen# pip list |grep tensorrt-llm
tensorrt-llm 0.10.0.dev2024041600
root@996e547c677a:/home/TensorRT-LLM/examples/qwen#