Installing a GPU Environment

2026/7/4 21:01:03 · Category: Programming Study

1. Overview

These notes record the steps for installing the GPU driver and the related GPU software.

2. Installing the NVIDIA Driver

2.1 Check the available GPU drivers

# Install aplay (provided by alsa-utils); the ubuntu-drivers command calls it
sudo apt install alsa-utils
sudo ubuntu-drivers devices
Output:
udevadm hwdb is deprecated. Use systemd-hwdb instead.
[the same warning repeats many times]
ERROR:root:aplay command not found
== /sys/devices/pci0000:00/0000:00:0f.0 ==
modalias : pci:v000015ADd00000405sv000015ADsd00000405bc03sc00i00
vendor   : VMware
model    : SVGA II Adapter
driver   : open-vm-tools-desktop - distro free

== /sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0 ==
modalias : pci:v000010DEd00002204sv00001458sd0000403Bbc03sc00i00
vendor   : NVIDIA Corporation
model    : GA102 [GeForce RTX 3090]
manual_install: True
driver   : nvidia-driver-570-server - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-580-open - distro non-free
driver   : nvidia-driver-580 - distro non-free
driver   : nvidia-driver-570-open - distro non-free
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-590-server-open - distro non-free
driver   : nvidia-driver-535-server-open - distro non-free
driver   : nvidia-driver-535-open - distro non-free
driver   : nvidia-driver-580-server - distro non-free
driver   : nvidia-driver-580-server-open - distro non-free
driver   : nvidia-driver-570-server-open - distro non-free
driver   : nvidia-driver-570 - distro non-free
driver   : nvidia-driver-535 - distro non-free
driver   : nvidia-driver-590-server - distro non-free
driver   : nvidia-driver-535-server - distro non-free
driver   : nvidia-driver-590 - distro non-free
driver   : nvidia-driver-590-open - distro non-free recommended
driver   : xserver-xorg-video-nouveau - distro free builtin

2.2 Choose the driver with the -open suffix

  1. The output above recommends this driver: nvidia-driver-590-open - distro non-free recommended
sudo apt install nvidia-driver-590-open
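
The selection in this step can also be scripted. Below is a minimal sketch, assuming the `ubuntu-drivers devices` output keeps the `driver : <package> - ... recommended` layout shown in 2.1. The helper name `recommended_driver` is hypothetical and is not part of any Ubuntu or NVIDIA tool.

```shell
# Hypothetical helper: print the package marked "recommended" in
# `ubuntu-drivers devices` output read from standard input.
recommended_driver() {
    # Fields of a driver line: "driver" ":" <package> "-" "distro" ... "recommended"
    awk '$1 == "driver" && $NF == "recommended" { print $3; exit }'
}

# Usage on a real machine (requires the ubuntu-drivers command):
#   sudo ubuntu-drivers devices 2>/dev/null | recommended_driver
```

The function only reads text, so it can be tried on a saved copy of the output before it is used to drive `apt install`.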

2.3 Reboot after the installation

reboot

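3. Installing nvidia-fabricmanager (optional)

  • nvidia-fabricmanager is software dedicated to managing multiple NVIDIA GPUs that are interconnected through NVLink/NVSwitch.
  • If it is installed on a single-GPU machine, an error message is reported when the server starts.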

3.1 Check the driver version

nvidia-smi
Output:
root@ubuntu:~# nvidia-smi
Tue Jan 20 14:43:07 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:03:00.0 Off |                  N/A |
| 30%   26C    P8             10W /  350W |       4MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
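
Later steps need the driver version (590.48.01 here) and the CUDA version from the banner as plain strings. The sketch below pulls them out of the banner text; it assumes the banner keeps the `Driver Version: ...` and `CUDA Version: ...` labels shown above, and `smi_banner_field` is a hypothetical helper name. On a real machine, `nvidia-smi --query-gpu=driver_version --format=csv,noheader` is another way to get the driver version.

```shell
# Hypothetical helper: print the version number that follows "<label>:"
# in nvidia-smi banner text read from standard input.
smi_banner_field() {
    # $1 is the label, for example "Driver Version" or "CUDA Version"
    sed -n "s/.*$1: *\([0-9][0-9.]*\).*/\1/p" | head -n 1
}

# Usage on a machine with the driver installed:
#   DRIVER_VERSION=$(nvidia-smi | smi_banner_field "Driver Version")
#   MAX_CUDA=$(nvidia-smi | smi_banner_field "CUDA Version")
```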

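3.2 Download the fabricmanager package

NVIDIA official download site

1. Download the fabricmanager package whose version matches the driver version. Here that is 590.48.01.

  • nvidia-fabricmanager_*.deb: the main package, required to run the service.

  • nvidia-fabricmanager-dev_*.deb: the development package (header files and so on), needed only if you compile software against this component. It can be ignored here.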

# Example (file name that carries the driver major version as a suffix)
export DRIVER_VERSION=590.48.01
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2404/x86_64/nvidia-fabricmanager-$(echo $DRIVER_VERSION | awk -F'.' '{print $1}')_${DRIVER_VERSION}-1_amd64.deb
# File name used for 590.48.01 in these notes
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2404/x86_64/nvidia-fabricmanager_${DRIVER_VERSION}-0ubuntu1_amd64.deb
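
The two file-name patterns used in the commands above can be written as small functions, which makes it easier to see which part comes from the driver version. This is only a sketch: the function names are hypothetical, and whether a given file exists in the repository for a given driver version is an assumption that has to be checked against the repository listing.

```shell
# Hypothetical helpers that rebuild the two file-name patterns used above.
driver_major() {
    # "590.48.01" -> "590" (drop everything from the first dot onward)
    printf '%s\n' "${1%%.*}"
}

fabricmanager_deb() {
    # Pattern without a version suffix in the package name
    printf 'nvidia-fabricmanager_%s-0ubuntu1_amd64.deb\n' "$1"
}

fabricmanager_deb_suffixed() {
    # Pattern with the driver major version embedded in the package name
    printf 'nvidia-fabricmanager-%s_%s-1_amd64.deb\n' "$(driver_major "$1")" "$1"
}

# Usage:
#   fabricmanager_deb "$DRIVER_VERSION"
```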

3.3 Install fabricmanager

# Example (file name that carries the driver major version as a suffix)
dpkg -i nvidia-fabricmanager-$(echo $DRIVER_VERSION | awk -F'.' '{print $1}')_${DRIVER_VERSION}-1_amd64.deb
# File name used for 590.48.01 in these notes
dpkg -i nvidia-fabricmanager_590.48.01-0ubuntu1_amd64.deb
Output:
Selecting previously unselected package nvidia-fabricmanager.
(Reading database ... 147187 files and directories currently installed.)
Preparing to unpack nvidia-fabricmanager_590.48.01-0ubuntu1_amd64.deb ...
Unpacking nvidia-fabricmanager (590.48.01-0ubuntu1) ...
Setting up nvidia-fabricmanager (590.48.01-0ubuntu1) ...
Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-fabricmanager.service → /usr/lib/systemd/system/nvidia-fabricmanager.service.
Could not execute systemctl:  at /usr/bin/deb-systemd-invoke line 148.
Processing triggers for libc-bin (2.39-0ubuntu8.6) ...
  • Check whether the service is running normally
systemctl status nvidia-fabricmanager
Output:
● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Sun 2025-10-19 13:47:06 CST; 3 months 1 day ago
   Main PID: 3020 (nv-fabricmanage)
      Tasks: 19 (limit: 629145)
     Memory: 32.0M
        CPU: 1h 25min 41.831s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─3020 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

Oct 19 13:46:53 ubuntu22-172-027-003-002 systemd[1]: Starting NVIDIA fabric manager service...
Oct 19 13:46:53 ubuntu22-172-027-003-002 nvidia-fabricmanager-start.sh[2994]: Detected Pre-NVL5 system
Oct 19 13:46:55 ubuntu22-172-027-003-002 nv-fabricmanager[3020]: Connected to 1 node.
Oct 19 13:47:06 ubuntu22-172-027-003-002 nv-fabricmanager[3020]: Successfully configured all the available GPUs and NVSwitches to route NVL>
Oct 19 13:47:06 ubuntu22-172-027-003-002 nvidia-fabricmanager-start.sh[2994]: Started "Nvidia Fabric Manager"
Oct 19 13:47:06 ubuntu22-172-027-003-002 systemd[1]: Started NVIDIA fabric manager service.
  • Check the installed Fabric Manager version
dpkg -l | grep nvidia-fabricmanager
Output:
ii  nvidia-fabricmanager  590.48.01-0ubuntu1  amd64  Fabric Manager for NVSwitch based systems
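
Section 3.2 requires the Fabric Manager version to match the driver version. The following sketch expresses that check as a function; `fm_matches_driver` is a hypothetical name, and the check assumes the package version is the driver version followed by a `-<revision>` suffix, as in `590.48.01-0ubuntu1`.

```shell
# Hypothetical check: succeed when the package version (for example
# "590.48.01-0ubuntu1") starts with the driver version plus a dash.
fm_matches_driver() {
    pkg_version=$1
    driver_version=$2
    case "$pkg_version" in
        "$driver_version"-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a real machine:
#   pkg=$(dpkg-query -W -f='${Version}' nvidia-fabricmanager)
#   fm_matches_driver "$pkg" "590.48.01" && echo "versions match"
```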

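Disable automatic upgrades of nvidia-fabricmanager. Replace 590 below with the version you installed, and use the package name that dpkg -l reports on your system.

sudo apt-mark hold nvidia-fabricmanager-590
Output:
nvidia-fabricmanager-590 set on hold.

List the held packages. A package that appears in the output is on hold.

sudo apt-mark showhold
Output:
nvidia-fabricmanager-590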
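
Checking the hold by eye works for one package. A scripted version is sketched below, assuming `apt-mark showhold` prints one package name per line as in the output above; `is_held` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if package $1 appears in the list of held
# packages read from standard input (one package name per line).
is_held() {
    # -x matches whole lines only, -F treats the name as a fixed string
    grep -qxF -- "$1"
}

# Usage on a real machine:
#   apt-mark showhold | is_held nvidia-fabricmanager-590 && echo "on hold"
```

Matching whole lines keeps a name such as nvidia-fabricmanager from being reported as held just because nvidia-fabricmanager-590 is.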

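4. Installing Software

4.1 Installing CUDA Toolkit and NVIDIA Container Toolkit

  • Developing or running any application that calls the GPU directly -> CUDA Toolkit is required
  • Using the GPU inside Docker containers -> NVIDIA Container Toolkit is required
  • Many AI development and scientific computing scenarios -> both are required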

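4.2 About CUDA Toolkit

  1. CUDA Toolkit is the complete software platform from NVIDIA for developing and running GPU-accelerated applications. It includes a compiler, math libraries, debugging tools, and more. In detail:
  • CUDA driver (nvidia-driver): already installed as nvidia-driver-590-open.
  • CUDA Runtime and development tools (the nvcc compiler, libraries such as cuBLAS): the main body of the CUDA Toolkit package.
  • When it is needed: to compile or run C++/Python programs that call the GPU directly (for example, building PyTorch/TensorFlow from source or running a CUDA C++ project), CUDA Toolkit must be installed.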

4.3 Installing CUDA Toolkit

  1. Check the currently installed version (if one was installed before)
nvcc -V
Output:
(base) root@ubuntu:/public/software# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0

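Visit the official site: NVIDIA CUDA Toolkit download page
Official manual link

Select: Linux -> x86_64 -> Ubuntu -> 24.04 -> deb (network)

Follow the command-line instructions given on that page exactly. The network install method configures the APT source automatically; the toolkit version still has to be chosen to suit the installed driver, as noted further down.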

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
apt-cache search cuda-toolkit
Output:
cuda-toolkit-12 - CUDA Toolkit 12 meta-package
cuda-toolkit - CUDA Toolkit meta-package
cuda-toolkit-12-5 - CUDA Toolkit 12.5 meta-package
cuda-toolkit-12-5-config-common - Common config package for CUDA Toolkit 12.5.
cuda-toolkit-12-config-common - Common config package for CUDA Toolkit 12.
cuda-toolkit-config-common - Common config package for CUDA Toolkit.
cuda-toolkit-12-6 - CUDA Toolkit 12.6 meta-package
cuda-toolkit-12-6-config-common - Common config package for CUDA Toolkit 12.6.
cuda-toolkit-12-8 - CUDA Toolkit 12.8 meta-package
cuda-toolkit-12-8-config-common - Common config package for CUDA Toolkit 12.8.
cuda-toolkit-12-9 - CUDA Toolkit 12.9 meta-package
cuda-toolkit-12-9-config-common - Common config package for CUDA Toolkit 12.9.
cuda-toolkit-13-0 - CUDA Toolkit 13.0 meta-package
cuda-toolkit-13-0-config-common - Common config package for CUDA Toolkit 13.0.
cuda-toolkit-13 - CUDA Toolkit 13 meta-package
cuda-toolkit-13-config-common - Common config package for CUDA Toolkit 13.
cuda-toolkit-13-1 - CUDA Toolkit 13.1 meta-package
cuda-toolkit-13-1-config-common - Common config package for CUDA Toolkit 13.1.
nvidia-cuda-toolkit - NVIDIA CUDA development toolkit
nvidia-cuda-toolkit-doc - NVIDIA CUDA and OpenCL documentation
nvidia-cuda-toolkit-gcc - NVIDIA CUDA development toolkit (GCC compatibility)
  • List the versions that can currently be installed
apt list | grep -E "cuda-toolkit-[0-9]{2}-[0-9]{1,2}"
Output:
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

cuda-toolkit-11-7-config-common/unknown 11.7.99-1 all
cuda-toolkit-11-7/unknown 11.7.1-1 amd64
cuda-toolkit-11-8-config-common/unknown 11.8.89-1 all
cuda-toolkit-11-8/unknown 11.8.0-1 amd64
cuda-toolkit-12-0-config-common/unknown 12.0.146-1 all
cuda-toolkit-12-0/unknown 12.0.1-1 amd64
cuda-toolkit-12-1-config-common/unknown 12.1.105-1 all
cuda-toolkit-12-1/unknown 12.1.1-1 amd64
cuda-toolkit-12-2-config-common/unknown 12.2.140-1 all
cuda-toolkit-12-2/unknown 12.2.2-1 amd64
cuda-toolkit-12-3-config-common/unknown 12.3.101-1 all
cuda-toolkit-12-3/unknown 12.3.2-1 amd64
cuda-toolkit-12-4-config-common/unknown 12.4.127-1 all
cuda-toolkit-12-4/unknown 12.4.1-1 amd64
cuda-toolkit-12-5-config-common/unknown 12.5.82-1 all
cuda-toolkit-12-5/unknown 12.5.1-1 amd64
cuda-toolkit-12-6-config-common/unknown 12.6.77-1 all
cuda-toolkit-12-6/unknown 12.6.3-1 amd64
cuda-toolkit-12-8-config-common/unknown 12.8.90-1 all
cuda-toolkit-12-8/unknown 12.8.2-1 amd64
cuda-toolkit-12-9-config-common/unknown 12.9.79-1 all
cuda-toolkit-12-9/unknown 12.9.2-1 amd64
cuda-toolkit-13-0-config-common/unknown 13.0.96-1 all
cuda-toolkit-13-0/unknown 13.0.3-1 amd64
cuda-toolkit-13-1-config-common/unknown 13.1.80-1 all
cuda-toolkit-13-1/unknown 13.1.2-1 amd64
cuda-toolkit-13-2-config-common/unknown 13.2.75-1 all
cuda-toolkit-13-2/unknown 13.2.1-1 amd64
cuda-toolkit-13-3-config-common/unknown,now 13.3.29-1 all [installed,auto-removable]
cuda-toolkit-13-3/unknown 13.3.1-1 amd64
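  • The CUDA version shown at the top right of the nvidia-smi output is the highest version the current driver supports.
    Note: installing the newest toolkit version is not automatically the right choice.
    Install the version that the default prebuilt vLLM packages target (13.0 at the time of writing; this will change as vLLM evolves).
apt install cuda-toolkit-13-0 cuda-toolkit-13-0-config-common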
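
The rule above says the toolkit version must not exceed the CUDA version that nvidia-smi reports (13.1 in these notes). A minimal sketch of that comparison follows; `version_le` is a hypothetical helper name, and it relies on the `-V` (version sort) option of GNU sort, which is assumed to be available as it is on Ubuntu.

```shell
# Hypothetical helper: succeed if version $1 is less than or equal to $2.
version_le() {
    # sort -V orders version strings numerically component by component;
    # $1 <= $2 exactly when $1 comes out first (or both are equal).
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Usage: toolkit 13.0 against a driver that supports up to CUDA 13.1
#   version_le 13.0 13.1 && echo "toolkit version is supported by the driver"
```

A plain string comparison would get cases such as 13.10 versus 13.9 wrong, which is why the sketch uses version sort.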
  • Add environment variables
vim /etc/profile
# Append the following two lines to /etc/profile
export PATH="/usr/local/cuda-13.0/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH"
source /etc/profile
nvcc -V
Output:
(vllm-env) root@xunku:/public/vLLM# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:58:59_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
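
Editing /etc/profile by hand can leave duplicate export lines if the step is repeated. The sketch below appends a line only when it is not already present. `append_once` is a hypothetical helper name, and /etc/profile.d/cuda.sh is an assumed alternative location (a drop-in file read by /etc/profile on Ubuntu), not the file used in these notes.

```shell
# Hypothetical helper: append line $2 to file $1 unless an identical
# line is already there. The file is created if it does not exist.
append_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Usage (single quotes keep $PATH unexpanded until the profile is read):
#   append_once /etc/profile.d/cuda.sh 'export PATH="/usr/local/cuda-13.0/bin:$PATH"'
#   append_once /etc/profile.d/cuda.sh 'export LD_LIBRARY_PATH="/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH"'
```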
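  • Hold the driver and toolkit packages so that they are not upgraded automatically (the driver installed in 2.2 is nvidia-driver-590-open)
sudo apt-mark hold nvidia-driver-590-open cuda-toolkit-13-0 cuda-toolkit-13-0-config-common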

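4.4 About Docker GPU support

Docker GPU support (NVIDIA Container Toolkit) is a set of tools that lets Docker containers access and use the NVIDIA GPUs of the host. In essence it creates a compatibility layer that maps the host GPU driver into the container. In detail:

  • The main piece is the nvidia-container-toolkit package, which modifies the Docker configuration.
  • When it is needed: to run any GPU image inside a Docker container (for example, docker run --gpus all nvidia/cuda:12.1.1-base-ubuntu24.04, or the official PyTorch/TensorFlow Docker images), this toolkit must be installed.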

4.5 Installing Docker GPU support

Official installation guide

  1. Install the prerequisite tools
sudo apt-get update && sudo apt-get install -y --no-install-recommends curl gnupg2
Output:
Hit:1 http://mirrors.aliyun.com/ubuntu noble InRelease
Hit:2 http://mirrors.aliyun.com/ubuntu noble-updates InRelease
Hit:3 http://mirrors.aliyun.com/ubuntu noble-security InRelease
Hit:4 http://mirrors.aliyun.com/ubuntu noble-backports InRelease
Hit:5 http://mirrors.aliyun.com/ubuntu noble-proposed InRelease
Hit:6 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2404/x86_64  InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
curl is already the newest version (8.5.0-2ubuntu10.6).
gnupg2 is already the newest version (2.4.4-2ubuntu17.4).
The following packages were automatically installed and are no longer required:
  liburing2 mailcap plocate
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.
  2. Configure the repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  3. Update the package lists
sudo apt-get update
  4. Install the toolkit packages
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.1-1
sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
  5. Configure the container runtime
sudo nvidia-ctk runtime configure --runtime=docker
Output:
INFO[0000] Loading config from /etc/docker/daemon.json
INFO[0000] Wrote updated config to /etc/docker/daemon.json
INFO[0000] It is recommended that docker daemon be restarted.
  6. Restart Docker
sudo systemctl restart docker
  7. Run a test container
  • After a long search, this is an image that could actually be pulled
docker pull nvidia/cuda:13.0.1-runtime-ubuntu22.04
Output:
13.0.1-runtime-ubuntu22.04: Pulling from nvidia/cuda
60d98d907669: Pulling fs layer
7e9d3b636b44: Pulling fs layer
3a2ba8ed1759: Pulling fs layer
5eaf68e8e556: Waiting
c03b8ec8dd33: Waiting
47140273f24a: Waiting
f1e29f967bcf: Waiting
48feaf8fd5bd: Waiting
8006ce821e80: Waiting
13.0.1-runtime-ubuntu22.04: Pulling from nvidia/cuda
60d98d907669: Pull complete
7e9d3b636b44: Pull complete
3a2ba8ed1759: Pull complete
5eaf68e8e556: Pull complete
c03b8ec8dd33: Pull complete
47140273f24a: Pull complete
f1e29f967bcf: Pull complete
48feaf8fd5bd: Pull complete
8006ce821e80: Pull complete
Digest: sha256:e4511e846c49e5495ef3d80c82b8f5dd597c6ef5c7f355601ead776ae3e96c67
Status: Downloaded newer image for nvidia/cuda:13.0.1-runtime-ubuntu22.04
docker.io/nvidia/cuda:13.0.1-runtime-ubuntu22.04
docker run --rm --gpus all nvidia/cuda:13.0.1-runtime-ubuntu22.04 nvidia-smi
Output:
==========
== CUDA ==
==========

CUDA Version 13.0.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Tue Jan 20 13:37:44 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:03:00.0 Off |                  N/A |
| 30%   26C    P8             10W /  350W |       4MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
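
If the test container cannot see the GPU, one thing worth checking is whether the runtime configuration step actually changed /etc/docker/daemon.json. The sketch below looks for the runtime binary name in that file. It assumes that `nvidia-ctk runtime configure` registers a runtime whose path is `nvidia-container-runtime`; `has_nvidia_runtime` is a hypothetical helper name.

```shell
# Hypothetical check: succeed if the Docker daemon config file $1
# mentions the nvidia-container-runtime binary.
has_nvidia_runtime() {
    grep -q 'nvidia-container-runtime' "$1" 2>/dev/null
}

# Usage on a real machine:
#   has_nvidia_runtime /etc/docker/daemon.json && echo "nvidia runtime configured"
#   docker info | grep -i runtimes   # the running daemon should list it as well
```

This is a text match only, so it confirms that the entry was written, not that the Docker daemon has been restarted and picked it up.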