ubuntu-intall-cuda

Ubuntu20.04如何安装CUDA

  1. 安装显卡驱动
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
#更新系统,安装必须用到的工具
exxk@exxk:~$ sudo apt update && sudo apt upgrade -y
exxk@exxk:~$ sudo apt install -y curl wget gnupg lsb-release
#添加 NVIDIA 驱动的 PPA
exxk@exxk:~$ sudo add-apt-repository ppa:graphics-drivers/ppa
exxk@exxk:~$ sudo apt update
#检测可用驱动: 它会列出所有设备(如显卡)和相应的推荐驱动。如果知道安装什么去驱动了,可以不用安装
exxk@exxk:~$ sudo apt install -y ubuntu-drivers-common
exxk@exxk:~$ ubuntu-drivers devices
ERROR:root:could not open aplay -l
Traceback (most recent call last):
File "/usr/share/ubuntu-drivers-common/detect/sl-modem.py", line 35, in detect
aplay = subprocess.Popen(
File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'aplay' #系统试图使用aplay来检测音频设备,但该工具未安装,可以忽略该错误
== /sys/devices/pci0000:00/0000:00:10.0 ==
modalias : pci:v000010DEd000025E2sv000017AAsd0000382Dbc03sc00i00
vendor : NVIDIA Corporation
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-535-server - distro non-free
driver : nvidia-driver-535 - distro non-free
driver : nvidia-driver-550-open - third-party non-free
driver : nvidia-driver-535-open - distro non-free
driver : nvidia-driver-550 - third-party non-free
driver : nvidia-driver-560 - third-party non-free recommended #推荐的驱动
driver : nvidia-driver-535-server-open - distro non-free
driver : nvidia-driver-545-open - third-party non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-555-open - third-party non-free
driver : nvidia-driver-545 - third-party non-free
driver : nvidia-driver-555 - third-party non-free
driver : nvidia-driver-560-open - third-party non-free
driver : xserver-xorg-video-nouveau - distro free builtin
# exxk@exxk:~$ sudo apt install -y nvidia-driver-560 该安装方式会安装桌面,其实用不到,因此采用下面的安装方式
exxk@exxk:~$ sudo apt install -y nvidia-driver-560 --no-install-recommends
exxk@exxk:~$ sudo reboot
exxk@exxk:~$ nvidia-smi
Tue Nov 19 07:27:03 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3050 ... Off | 00000000:00:10.0 Off | N/A |
| N/A 45C P8 3W / 60W | 15MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 933 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
  1. 安装docker(可以在安装系统时勾选上docker,这一步就可以省略,建议不要省略,安装系统勾选安装的docker 版本较低)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
#安装 Docker 仓库密钥和工具
exxk@exxk:~$ sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
exxk@exxk:~$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
#添加 Docker 仓库
exxk@exxk:~$ echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
exxk@exxk:~$ sudo apt update
#安装 Docker
exxk@exxk:~$ sudo apt install -y docker-ce docker-ce-cli containerd.io
#将当前用户添加到 Docker 组
exxk@exxk:~$ sudo usermod -aG docker $USER
exxk@exxk:~$ sudo reboot
#查看docker版本,如果没有添加当前用户到docker组,需要加sudo
exxk@exxk:~$ docker version
Client: Docker Engine - Community
Version: 27.3.1
API version: 1.47
Go version: go1.22.7
Git commit: ce12230
Built: Fri Sep 20 11:41:03 2024
OS/Arch: linux/amd64
Context: default

Server: Docker Engine - Community
Engine:
Version: 27.3.1
API version: 1.47 (minimum version 1.24)
Go version: go1.22.7
Git commit: 41ca978
Built: Fri Sep 20 11:41:03 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.7.23
GitCommit: 57f17b0a6295a39009d861b89e3b3b87b005ca27
runc:
Version: 1.1.14
GitCommit: v1.1.14-0-g2c9f560
docker-init:
Version: 0.19.0
GitCommit: de40ad0
  1. 安装 NVIDIA Container Toolkit
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
# 配置生产存储库:
exxk@exxk:~$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
#(可选)配置存储库以使用实验性软件包:
exxk@exxk:~$ sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
#从存储库更新包列表:
exxk@exxk:~$ sudo apt-get update
#安装 NVIDIA Container Toolkit 软件包:
exxk@exxk:~$ sudo apt-get install -y nvidia-container-toolkit
#配置并重启 Docker
exxk@exxk:~$ sudo nvidia-ctk runtime configure --runtime=docker
exxk@exxk:~$ sudo systemctl restart docker
#运行一个简单的容器测试 GPU 是否可用
exxk@exxk:~$ docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
Tue Nov 19 07:58:08 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3050 ... Off | 00000000:00:10.0 Off | N/A |
| N/A 39C P8 3W / 60W | 15MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

CUDA使用

镜像如何选择

NVIDIA 提供了官方的 nvidia/cuda Docker 镜像,用于构建和运行 GPU 加速的容器化应用。

镜像说明: nvidia/cuda:<CUDA版本>--<镜像类型>-<基础系统>

  • 镜像类型:

    名称 标识 介绍 大小
    基础镜像 nvidia/cuda:-base 不包含运行时库,只包含基本的 CUDA 工具链(不包含运行时库,只包含基本的 CUDA 工具链) 100MB
    运行时镜像 nvidia/cuda:-runtime 部署已经编译好的应用程序(不包含开发工具) >2G
    开发镜像 nvidia/cuda:-devel 开发和调试 CUDA 应用程序 >3G
    完整镜像 nvidia/cuda:-full 包含开发镜像的所有内容,并添加了额外的示例代码和文档 没有
  • CUDA版本:运行和驱动对应版本,一般可以向下兼容,根据我的驱动推荐使用 CUDA 12.6(与您的驱动版本一致),也可以使用 CUDA 12.x 或 11.x

  • cuDNN:包含 NVIDIA 的 cuDNN 库(深度学习加速库),这是深度学习框架(如 TensorFlow、PyTorch)的核心组件

  • 基础系统:推荐使用宿主机的系统,例如:宿主机是Ubuntu就Ubuntu,宿主机 CentOS就 CentOS

根据我的环境,推荐镜像docker pull nvidia/cuda:12.6.2-cudnn-devel-ubuntu20.04

使用

1
2
3
4
5
6
7
8
9
#-w /workspace 指定工作目录,-p 8888:8888提前预设Jupyter的端口
docker run -it --gpus all --name cuda-dev \
-v /home/exxk/workspace:/workspace \
-w /workspace \
-p 8888:8888 \
nvidia/cuda:12.6.2-cudnn-devel-ubuntu20.04 bash
root@287c0288ad93:/workspace# apt update
root@287c0288ad93:/workspace# apt install -y python3 python3-pip
#上传代码到/home/exxk/workspace