Previous page: docker200322
Install docker on a machine with an nvidia card and try running GPU computation inside a container.
The target machine looks like this:
[root@docker ~]# cat /etc/redhat-release
Rocky Linux release 9.2 (Blue Onyx)
[root@docker ~]# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.104.05 Sat Aug 19 01:15:15 UTC 2023
GCC version: gcc version 11.3.1 20221121 (Red Hat 11.3.1-4) (GCC)
[root@docker ~]# ls -l /usr/local/cuda
ls: cannot access /usr/local/cuda: No such file or directory <-- the CUDA library is not installed
[root@docker ~]#
First, install docker not from the OS repositories but from the repository provided by Docker itself:
[root@docker ~]# dnf install yum-utils
[root@docker ~]# yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
Checking with yum, "docker.x86_64" appears to be the docker build from the OS repositories;
here we install "docker-ce.x86_64" instead, which is the package provided on the docker-ce side.
[root@docker ~]# dnf install docker-ce
(containerd.io, docker-ce-cli, docker-ce-rootless-extras, docker-compose-plugin, and docker-buildx-plugin are installed at the same time)
[root@docker ~]# systemctl start docker
Check the version:
[root@docker ~]# docker --version
Docker version 24.0.6, build ed223bc
[root@docker ~]#
Next, install the "NVIDIA Container Toolkit", which lets containers use the NVIDIA driver.
Upstream: https://github.com/NVIDIA/nvidia-docker
[root@docker ~]# curl -s -o /etc/yum.repos.d/nvidia-docker.repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
[root@docker ~]# cat /etc/yum.repos.d/nvidia-docker.repo
[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[nvidia-container-toolkit-experimental]
name=nvidia-container-toolkit-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[root@docker ~]#
[root@docker ~]# dnf install -y nvidia-container-toolkit
(libnvidia-container-tools, libnvidia-container1, and nvidia-container-toolkit-base are installed at the same time)
[root@docker ~]# systemctl restart docker
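Current NVIDIA install instructions additionally register the nvidia runtime with docker using the bundled nvidia-ctk helper. It was not strictly needed for the "--gpus" tests below, but for reference (a sketch; verify against your installed toolkit version):

```shell
# Write an "nvidia" runtime entry into /etc/docker/daemon.json,
# then restart the daemon so it picks up the change.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```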
The contents of nvidia-docker.repo are also covered at docker/NVIDIAContainerToolkit.
A quick test at this point:
[root@docker ~]# nvidia-container-cli info
NVRM version: 535.104.05
CUDA version: 12.2
Device Index: 0
Device Minor: 0
Model: NVIDIA RTX A2000
Brand: NvidiaRTX
GPU UUID: GPU-23cc3ee7-31d3-a068-2f61-5aa00052d084
Bus Location: 00000000:13:00.0
Architecture: 8.6
[root@docker ~]#
If things are fine, output like the above is shown.
If you get "nvidia-container-cli: initialization error: nvml error: driver not loaded", enabling "persistence mode" (see NVIDIA#o13e41e5) seems to avoid it.
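That persistence-mode workaround can be applied with nvidia-smi; a sketch, noting that "-pm" is the legacy flag and the nvidia-persistenced daemon is the newer mechanism:

```shell
# Keep the NVIDIA kernel module initialized even when no client
# holds the GPU, so nvml can always reach the driver.
sudo nvidia-smi -pm 1
# or run the daemon instead:
# sudo systemctl enable --now nvidia-persistenced
```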
Then a GPU test through docker:
[root@docker ~]# docker run --gpus all --rm nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.8.0-runtime-ubuntu22.04' locally
11.8.0-runtime-ubuntu22.04: Pulling from nvidia/cuda
6b851dcae6ca: Pull complete
:
:
Mon Sep 11 18:32:25 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A2000 Off | 00000000:13:00.0 Off | Off |
| 30% 41C P8 6W / 70W | 2MiB / 6138MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
[root@docker ~]# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
nvidia/cuda 11.8.0-runtime-ubuntu22.04 af0cef3d3ee9 2 months ago 2.65GB
[root@docker ~]#
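"--gpus all" exposes every GPU; the flag also accepts a count or specific devices (standard docker syntax; the nested quotes around device= are needed so the shell passes them through literally):

```shell
# One GPU, chosen by docker:
docker run --gpus 1 --rm nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi -L
# A specific device by index (or by GPU UUID):
docker run --gpus '"device=0"' --rm nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi -L
```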
To let an ordinary user run docker, it is enough to add the user to the "docker" group.
[root@docker ~]# id illya
id: ‘illya’: no such user
[root@docker ~]# cat /etc/subuid
[root@docker ~]#
[root@docker ~]# useradd -m illya
[root@docker ~]# passwd illya
Changing password for user illya.
New password:
BAD PASSWORD: The password is shorter than 8 characters
Retype new password:
passwd: all authentication tokens updated successfully.
[root@docker ~]#
[root@docker ~]# id illya
uid=1000(illya) gid=1000(illya) groups=1000(illya)
[root@docker ~]# cat /etc/subuid
illya:100000:65536
[root@docker ~]# cat /etc/subgid
illya:100000:65536
[root@docker ~]#
Creating the user populates /etc/subuid and /etc/subgid.
[root@docker ~]# su - illya
[illya@docker ~]$ id
uid=1000(illya) gid=1000(illya) groups=1000(illya) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[illya@docker ~]$ docker run --gpus all --rm nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi -L
docker: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/create": dial unix /var/run/docker.sock: connect: permission denied.
See 'docker run --help'.
[illya@docker ~]$
As-is the user has no permission, so add the user to the docker group:
[root@docker ~]# usermod -aG docker illya
[root@docker ~]# id illya
uid=1000(illya) gid=1000(illya) groups=1000(illya),986(docker)
[root@docker ~]# su - illya
[illya@docker ~]$ docker run --gpus all --rm nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi -L
==========
== CUDA ==
==========
CUDA Version 11.8.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
GPU 0: NVIDIA RTX A2000 (UUID: GPU-23cc3ee7-31d3-a068-2f61-5aa00052d084)
[illya@docker ~]$
[illya@docker ~]$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
nvidia/cuda 11.8.0-runtime-ubuntu22.04 af0cef3d3ee9 2 months ago 2.65GB
[illya@docker ~]$
Now it can be used.
Next, run docker as one's own process instead of root's (rootless mode). Up to here the docker daemon has been running under the root account.
Stop that root-owned docker daemon first:
[root@docker ~]# systemctl stop docker.socket
[root@docker ~]# systemctl disable docker.socket
[root@docker ~]# vigr <--- remove illya from the docker group
[root@docker ~]# reboot <--- so that "/var/run/docker/" and "/var/run/docker.sock" get removed
Here we use the "docker-ce-rootless-extras" package, which was already installed as part of "dnf install docker-ce".
"/etc/subuid" and "/etc/subgid" are populated when a user account is created locally; for an account coming from a NIS client or the like, they probably have to be filled in by hand.
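A sketch of that manual step, assuming the same 100000:65536 range that useradd assigned above (the range must not overlap any other entry in those files):

```shell
# Append subordinate uid/gid ranges for a user that exists only in
# NIS/LDAP, so rootless docker can map container uids onto them.
echo "illya:100000:65536" | sudo tee -a /etc/subuid
echo "illya:100000:65536" | sudo tee -a /etc/subgid
```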
Then configure it: simply run "dockerd-rootless-setuptool.sh".
[illya@docker ~]$ dockerd-rootless-setuptool.sh --help
Usage: /usr/bin/dockerd-rootless-setuptool.sh [OPTIONS] COMMAND
A setup tool for Rootless Docker (dockerd-rootless.sh).
Documentation: https://docs.docker.com/go/rootless/
Options:
-f, --force Ignore rootful Docker (/var/run/docker.sock)
--skip-iptables Ignore missing iptables
Commands:
check Check prerequisites
install Install systemd unit (if systemd is available) and show how to manage the service
uninstall Uninstall systemd unit
[illya@docker ~]$
[illya@docker ~]$ lsmod |grep ip_tables | wc -l
0
[illya@docker ~]$
[illya@docker ~]$ dockerd-rootless-setuptool.sh --skip-iptables install
[INFO] Creating /home/illya/.config/systemd/user/docker.service
[INFO] starting systemd service docker.service
+ systemctl --user start docker.service
+ sleep 3
+ systemctl --user --no-pager --full status docker.service
● docker.service - Docker Application Container Engine (Rootless)
Loaded: loaded (/home/illya/.config/systemd/user/docker.service; disabled; preset: disabled)
Active: active (running) since Wed 2023-09-13 03:35:24 JST; 3s ago
Docs: https://docs.docker.com/go/rootless/
Main PID: 1037 (rootlesskit)
Tasks: 39
Memory: 176.2M
CPU: 173ms
CGroup: /user.slice/user-1000.slice/user@1000.service/app.slice/docker.service
tq1037 rootlesskit --net=slirp4netns --mtu=65520 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver=builtin --copy-up=/etc --copy-up=/run --propagation=rslave /usr/bin/dockerd-rootless.sh --iptables=false
tq1049 /proc/self/exe --net=slirp4netns --mtu=65520 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver=builtin --copy-up=/etc --copy-up=/run --propagation=rslave /usr/bin/dockerd-rootless.sh --iptables=false
tq1070 slirp4netns --mtu 65520 -r 3 --disable-host-loopback --enable-sandbox --enable-seccomp 1049 tap0
tq1077 dockerd --iptables=false
mq1096 containerd --config /run/user/1000/docker/containerd/containerd.toml
+ DOCKER_HOST=unix:///run/user/1000/docker.sock
+ /usr/bin/docker version
Client: Docker Engine - Community
Version: 24.0.6
API version: 1.43
Go version: go1.20.7
Git commit: ed223bc
Built: Mon Sep 4 12:33:18 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 24.0.6
API version: 1.43 (minimum version 1.12)
Go version: go1.20.7
Git commit: 1a79695
Built: Mon Sep 4 12:31:49 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.22
GitCommit: 8165feabfdfe38c65b599c4993d227328c231fca
runc:
Version: 1.1.8
GitCommit: v1.1.8-0-g82f18fe
docker-init:
Version: 0.19.0
GitCommit: de40ad0
rootlesskit:
Version: 1.1.1
ApiVersion: 1.1.1
NetworkDriver: slirp4netns
PortDriver: builtin
StateDir: /tmp/rootlesskit3872238327
slirp4netns:
Version: 1.2.0
GitCommit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
+ systemctl --user enable docker.service
Created symlink /home/illya/.config/systemd/user/default.target.wants/docker.service → /home/illya/.config/systemd/user/docker.service.
[INFO] Installed docker.service successfully.
[INFO] To control docker.service, run: `systemctl --user (start|stop|restart) docker.service`
[INFO] To run docker.service on system startup, run: `sudo loginctl enable-linger illya`
[INFO] Creating CLI context "rootless"
Successfully created context "rootless"
[INFO] Using CLI context "rootless"
Current context is now "rootless"
[INFO] Make sure the following environment variable(s) are set (or add them to ~/.bashrc):
export PATH=/usr/bin:$PATH
[INFO] Some applications may require the following environment variable too:
export DOCKER_HOST=unix:///run/user/1000/docker.sock
[illya@docker ~]$
[illya@docker ~]$ export DOCKER_HOST=unix:///run/user/1000/docker.sock
[illya@docker ~]$ echo "export DOCKER_HOST=unix:///run/user/1000/docker.sock" >> ~/.bashrc <-- environment setup
The docker command itself is still the one installed under /usr/bin.
With this, docker is running under user privileges:
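The uid-1000 socket path above can also be derived generically, which avoids hard-coding the uid; a sketch (the loginctl line is optional and only needed if the daemon should survive logout):

```shell
# Rootless docker listens on a per-user socket under /run/user/<uid>/.
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock

# Optional: keep the user's systemd services (and thus dockerd) running
# after logout; mentioned in the setuptool output above.
# sudo loginctl enable-linger illya
```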
[illya@docker ~]$ systemctl --user status docker
● docker.service - Docker Application Container Engine (Rootless)
Loaded: loaded (/home/illya/.config/systemd/user/docker.service; enabled; preset: disabled)
Active: active (running) since Wed 2023-09-13 03:35:24 JST; 1min 14s ago
Docs: https://docs.docker.com/go/rootless/
Main PID: 1037 (rootlesskit)
Tasks: 39
Memory: 176.8M
CPU: 428ms
CGroup: /user.slice/user-1000.slice/user@1000.service/app.slice/docker.service
tq1037 rootlesskit --net=slirp4netns --mtu=65520 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver>
tq1049 /proc/self/exe --net=slirp4netns --mtu=65520 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-dri>
tq1070 slirp4netns --mtu 65520 -r 3 --disable-host-loopback --enable-sandbox --enable-seccomp 1049 tap0
tq1077 dockerd --iptables=false
mq1096 containerd --config /run/user/1000/docker/containerd/containerd.toml
[illya@docker ~]$
Docker images and related data are placed under "~/.local/share/docker/":
[illya@docker ~]$ ls -l ~/.local/share/docker/
total 48
drwx--x--x. 4 illya illya 4096 Sep 13 03:35 buildkit
drwx--x--x. 3 illya illya 4096 Sep 13 03:35 containerd
drwx--x---. 2 illya illya 4096 Sep 13 03:35 containers
-rw-------. 1 illya illya 36 Sep 13 03:35 engine-id
drwx--x---. 3 illya illya 4096 Sep 13 03:35 fuse-overlayfs
drwx------. 3 illya illya 4096 Sep 13 03:35 image
drwxr-x---. 3 illya illya 4096 Sep 13 03:35 network
drwx------. 4 illya illya 4096 Sep 13 03:35 plugins
drwx------. 2 illya illya 4096 Sep 13 03:35 runtimes
drwx------. 2 illya illya 4096 Sep 13 03:35 swarm
drwx------. 2 illya illya 4096 Sep 13 03:35 tmp
drwx-----x. 2 illya illya 4096 Sep 13 03:35 volumes
[illya@docker ~]$
Then a test:
[illya@docker ~]$ which docker
/usr/bin/docker
[illya@docker ~]$ docker --version
Docker version 24.0.6, build ed223bc
[illya@docker ~]$ docker run --gpus all --rm nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi -L
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
[illya@docker ~]$
Unfortunately this errors out. Passing "--gpus all" to use the GPU seems to be what fails.
https://github.com/NVIDIA/nvidia-docker/issues/1447 looked promising, but no luck.
Adding "systemd.unified_cgroup_hierarchy=0" to the kernel command line also did not help.
Changing "#no-cgroups = false" to "no-cgroups = true" in "/etc/nvidia-container-runtime/config.toml" worked around it:
[illya@docker ~]$ docker run --gpus all --rm nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi -L
:
:
GPU 0: NVIDIA RTX A2000 (UUID: GPU-23cc3ee7-31d3-a068-2f61-5aa00052d084)
[illya@docker ~]$
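The config.toml workaround above can also be scripted; a hypothetical one-liner that only rewrites the commented default line and keeps a .bak backup:

```shell
# Flip "#no-cgroups = false" to "no-cgroups = true" in place.
sudo sed -i.bak 's/^#no-cgroups = false/no-cgroups = true/' \
    /etc/nvidia-container-runtime/config.toml
```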