Past version of this page: docker200322
Install Docker on a machine with an NVIDIA card and run GPU computation inside a container.
The target machine looks like this:
[root@docker ~]# cat /etc/redhat-release
Rocky Linux release 9.5 (Blue Onyx)
[root@docker ~]# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 570.133.07 Fri Mar 14 13:12:07 UTC 2025
GCC version: gcc version 11.5.0 20240719 (Red Hat 11.5.0-2) (GCC)
[root@docker ~]# ls -l /usr/local/cuda
ls: cannot access /usr/local/cuda: No such file or directory <-- the CUDA libraries are not installed
[root@docker ~]#
First, install docker not from the OS repositories but from Docker's own repository.
[root@docker ~]# dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
[root@docker ~]# ls -l /etc/yum.repos.d/
total 24
-rw-r--r--. 1 root root 1919 Mar 28 02:30 docker-ce.repo <--- added
-rw-r--r--. 1 root root 6610 Nov 1 12:27 rocky-addons.repo
-rw-r--r--. 1 root root 1165 Nov 1 12:27 rocky-devel.repo
-rw-r--r--. 1 root root 2387 Nov 1 12:27 rocky-extras.repo
-rw-r--r--. 1 root root 3417 Nov 1 12:27 rocky.repo
[root@docker ~]#
For reference, dnf shows that "docker.x86_64" is the Docker build from the OS repositories; here we install "docker-ce.x86_64", the package provided by Docker (docker-ce) itself.
[root@docker ~]# dnf install docker-ce
(containerd.io, docker-ce-cli, docker-ce-rootless-extras, docker-compose-plugin, and docker-buildx-plugin are installed at the same time)
[root@docker ~]# systemctl enable docker --now
Check the version:
[root@docker ~]# docker --version
Docker version 28.0.4, build b8034c0
[root@docker ~]#
Next, install the NVIDIA Container Toolkit on top of docker.
Official docs: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html
Install guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
(installing the repository)
[root@docker ~]# curl -s -o /etc/yum.repos.d/nvidia-container-toolkit.repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
[root@docker ~]# cat /etc/yum.repos.d/nvidia-container-toolkit.repo
[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[nvidia-container-toolkit-experimental]
name=nvidia-container-toolkit-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[root@docker ~]#
(installing the NVIDIA Container Toolkit)
[root@docker ~]# dnf install -y nvidia-container-toolkit
(libnvidia-container-tools, libnvidia-container1, and nvidia-container-toolkit-base are installed at the same time)
[root@docker ~]# systemctl restart docker
For the contents of nvidia-docker.repo, see docker/NVIDIAContainerToolkit.
Now a quick test:
[root@docker ~]# nvidia-container-cli info
NVRM version: 570.133.07
CUDA version: 12.8
Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce GTX 1070
Brand: GeForce
GPU UUID: GPU-a49de51b-de1e-52f3-1e3f-ce704e159713
Bus Location: 00000000:06:10.0
Architecture: 6.1
[root@docker ~]#
If all is well, it prints output like the above.
If you get "nvidia-container-cli: initialization error: nvml error: driver not loaded", enabling "persistence mode" (see NVIDIA#o13e41e5) seems to work around it.
Next, a GPU test through docker:
[root@docker ~]# docker run --gpus all --rm nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.8.0-runtime-ubuntu22.04' locally
11.8.0-runtime-ubuntu22.04: Pulling from nvidia/cuda
:
:
Thu Mar 27 17:45:48 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1070 Off | 00000000:06:10.0 Off | N/A |
| 27% 29C P8 7W / 151W | 7MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
[root@docker ~]# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
nvidia/cuda 11.8.0-runtime-ubuntu22.04 d8fb74ecc8b2 16 months ago 2.65GB
[root@docker ~]#
To let a regular user run docker, simply add the user to the "docker" group.
[root@docker ~]# grep docker /etc/group
docker:x:978:
[root@docker ~]# usermod -aG docker saber
[root@docker ~]# grep docker /etc/group
docker:x:978:saber
[root@docker ~]# su - saber
[saber@docker ~]$ id
uid=1000(saber) gid=1000(saber) groups=1000(saber),978(docker) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[saber@docker ~]$ docker run --gpus all --rm nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi -L
:
GPU 0: NVIDIA GeForce GTX 1070 (UUID: GPU-a49de51b-de1e-52f3-1e3f-ce704e159713)
[saber@docker ~]$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
nvidia/cuda 11.8.0-runtime-ubuntu22.04 d8fb74ecc8b2 16 months ago 2.65GB
[saber@docker ~]$
If the user is not in the docker group, you get errors like the ones below.
[root@docker ~]# usermod -G `id -ng saber` saber
[root@docker ~]# grep docker /etc/group
docker:x:978:
[root@docker ~]# id saber
uid=1000(saber) gid=1000(saber) groups=1000(saber)
[root@docker ~]# su - saber
[saber@docker ~]$ docker run --gpus all --rm nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi -L
docker: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Head "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied
Run 'docker run --help' for more information
[saber@docker ~]$
[saber@docker ~]$ docker images
permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Head "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied
[saber@docker ~]$
To add: "usermod -aG docker <account>". To remove: "usermod -G `id -ng <account>` <account>".
Even with users added to the docker group, the docker daemon itself still runs with root privileges. Running docker under your own user privileges instead is what rootless docker provides.
It seems sufficient for the user to be listed in /etc/subuid and /etc/subgid. Entries are appended automatically when a local account is created, and these two files exist from the start regardless of whether docker is installed.
[root@docker ~]# id saber
uid=1000(saber) gid=1000(saber) groups=1000(saber)
[root@docker ~]# cat /etc/subuid
saber:100000:65536
[root@docker ~]# cat /etc/subgid
saber:100000:65536
[root@docker ~]#
If accounts come from another system (NIS, LDAP, samba-ad), /etc/subuid and /etc/subgid must be maintained by hand.
Next: the "docker-ce-rootless-extras" package pulled in by "dnf install docker-ce" contains "dockerd-rootless-setuptool.sh"; run it as the account that will use docker.
Log in to the machine over ssh: "dockerd-rootless-setuptool.sh" does not work correctly under "su - <account>".
[saber@docker ~]$ id
uid=1000(saber) gid=1000(saber) groups=1000(saber) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 <-- no longer in the docker group
[saber@docker ~]$
[saber@docker ~]$ lsmod |grep ip_tables | wc -l
0
[saber@docker ~]$
[saber@docker ~]$ dockerd-rootless-setuptool.sh --skip-iptables install
[INFO] Creating /home/saber/.config/systemd/user/docker.service
[INFO] starting systemd service docker.service
+ systemctl --user start docker.service
+ sleep 3
+ systemctl --user --no-pager --full status docker.service
● docker.service - Docker Application Container Engine (Rootless)
Loaded: loaded (/home/saber/.config/systemd/user/docker.service; disabled; preset: disabled)
Active: active (running) since Fri 2025-03-28 04:28:28 JST; 3s ago
Docs: https://docs.docker.com/go/rootless/
Main PID: 2458 (rootlesskit)
Tasks: 40
Memory: 65.5M
CPU: 235ms
CGroup: /user.slice/user-1000.slice/user@1000.service/app.slice/docker.service
tq2458 rootlesskit --state-dir=/run/user/1000/dockerd-rootless --net=slirp4netns --mtu=65520 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver=builtin --copy-up=/etc --copy-up=/run --propagation=rslave /usr/bin/dockerd-rootless.sh --iptables=false
tq2471 /proc/self/exe --state-dir=/run/user/1000/dockerd-rootless --net=slirp4netns --mtu=65520 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver=builtin --copy-up=/etc --copy-up=/run --propagation=rslave /usr/bin/dockerd-rootless.sh --iptables=false
tq2496 slirp4netns --mtu 65520 -r 3 --disable-host-loopback --enable-sandbox --enable-seccomp 2471 tap0
tq2506 dockerd --iptables=false
mq2527 containerd --config /run/user/1000/docker/containerd/containerd.toml
+ DOCKER_HOST=unix:///run/user/1000/docker.sock
+ /usr/bin/docker version
Client: Docker Engine - Community
Version: 28.0.4
API version: 1.48
Go version: go1.23.7
Git commit: b8034c0
Built: Tue Mar 25 15:08:34 2025
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 28.0.4
API version: 1.48 (minimum version 1.24)
Go version: go1.23.7
Git commit: 6430e49
Built: Tue Mar 25 15:06:50 2025
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.7.26
GitCommit: 753481ec61c7c8955a23d6ff7bc8e4daed455734
runc:
Version: 1.2.5
GitCommit: v1.2.5-0-g59923ef
docker-init:
Version: 0.19.0
GitCommit: de40ad0
rootlesskit:
Version: 2.3.4
ApiVersion: 1.1.1
NetworkDriver: slirp4netns
PortDriver: builtin
StateDir: /run/user/1000/dockerd-rootless
slirp4netns:
Version: 1.3.1
GitCommit: e5e368c4f5db6ae75c2fce786e31eef9da6bf236
+ systemctl --user enable docker.service
Created symlink /home/saber/.config/systemd/user/default.target.wants/docker.service → /home/saber/.config/systemd/user/docker.service. <-- the service will start automatically.
[INFO] Installed docker.service successfully.
[INFO] To control docker.service, run: `systemctl --user (start|stop|restart) docker.service`
[INFO] To run docker.service on system startup, run: `sudo loginctl enable-linger saber`
[INFO] CLI context "rootless" already exists
[INFO] Using CLI context "rootless"
Current context is now "rootless"
[INFO] Make sure the following environment variable(s) are set (or add them to ~/.bashrc):
export PATH=/usr/bin:$PATH
[INFO] Some applications may require the following environment variable too: <-- follow this and add the setting to ~/.bashrc
export DOCKER_HOST=unix:///run/user/1000/docker.sock
[saber@docker ~]$
[saber@docker ~]$ export DOCKER_HOST=unix:///run/user/1000/docker.sock
[saber@docker ~]$ echo "export DOCKER_HOST=unix:///run/user/1000/docker.sock" >> ~/.bashrc
Rootless docker is now running under the user's own privileges.
[saber@docker ~]$ systemctl --user status docker <-- check your own docker daemon
● docker.service - Docker Application Container Engine (Rootless)
Loaded: loaded (/home/saber/.config/systemd/user/docker.service; enabled; preset: disabled) <-- the service also comes up automatically after a reboot.
:
:
[saber@docker ~]$ docker info
Client: Docker Engine - Community
Version: 28.0.4
:
Docker Root Dir: /home/saber/.local/share/docker
:
[saber@docker ~]$
Docker images and related data are stored under "~/.local/share/docker/".
Having covered setup, removal should be covered too.
[saber@docker ~]$ systemctl --user stop docker
[saber@docker ~]$ systemctl --user disable docker
[saber@docker ~]$ dockerd-rootless-setuptool.sh --skip-iptables uninstall
[saber@docker ~]$ /usr/bin/rootlesskit rm -rf /home/saber/.local/share/docker
[saber@docker ~]$ vi ~/.bashrc <-- remove the DOCKER_HOST line
This completes the removal.
[saber@docker ~]$ which docker
/usr/bin/docker
[saber@docker ~]$ docker --version
Docker version 28.0.4, build b8034c0
[saber@docker ~]$ docker run --gpus all --rm nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi -L
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown
[saber@docker ~]$
Unfortunately, it fails.
Editing "/etc/nvidia-container-runtime/config.toml" so that the runtime does not use cgroups gets it working, but then can GPU resources still be managed? What happens under slurm/openpbs? Slightly worrying.
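A hedged reconstruction of that edit (the stock file ships the key commented out; exact contents vary by toolkit version):

```toml
# /etc/nvidia-container-runtime/config.toml (excerpt)
[nvidia-container-cli]
# the shipped default is "#no-cgroups = false"; uncomment and set it to true:
no-cgroups = true
```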
With that in place:
[saber@docker ~]$ docker run --gpus all --rm nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi -L
:
:
GPU 0: NVIDIA GeForce GTX 1070 (UUID: GPU-a49de51b-de1e-52f3-1e3f-ce704e159713)
[saber@docker ~]$