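On compute nodes equipped with GPUs, also install a gres.conf.
If "nvml" support is enabled in your Slurm build, distributing the following "gres.conf" to every node is enough, whether or not the node has a GPU.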
[root@slurm ~]# cat /opt/slurm/etc/gres.conf
#
AutoDetect=nvml
[root@slurm ~]#
Alternatively, to write a shared "gres.conf" without using "AutoDetect=nvml":
[root@slurm ~]# cat /opt/slurm/etc/gres.conf
#
#AutoDetect=nvml
NodeName=n1 Name=gpu File=/dev/nvidia0 COREs=0
NodeName=n2 Name=gpu File=/dev/nvidia0 COREs=0,1
[root@slurm ~]#
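write it as shown above.
For something a little more involved: on a NUMA machine, if you want to pair each GPU with the cores closest to it,
use the output of "nvidia-smi topo -m" as a guide and write the following.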
[root@slurm ~]#
NodeName=n1 Name=gpu File=/dev/nvidia0 COREs=0,1,2,3
NodeName=n1 Name=gpu File=/dev/nvidia1 COREs=4,5,6,7
NodeName=n1 Name=gpu File=/dev/nvidia2 COREs=8,9,10,11
NodeName=n1 Name=gpu File=/dev/nvidia3 COREs=12,13,14,15
# (or)
NodeName=n2 Name=gpu File=/dev/nvidia[0-1] COREs=0,1,2,3,4,5,6,7
NodeName=n2 Name=gpu File=/dev/nvidia[2-3] COREs=8,9,10,11,12,13,14,15
[root@slurm ~]#
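The per-GPU lines above follow a regular pattern, so they can be generated rather than typed by hand. The following is a minimal sketch, not part of the original setup: the function name `gres_lines` and the `gpu_cores` mapping are hypothetical, and it assumes every GPU appears as `/dev/nvidiaN`.

```python
def gres_lines(node, gpu_cores):
    """Render one gres.conf line per GPU.

    gpu_cores maps a GPU device index to the list of core indices
    that should be paired with it (hypothetical input format).
    """
    lines = []
    for index in sorted(gpu_cores):
        cores = ",".join(str(c) for c in gpu_cores[index])
        lines.append(
            f"NodeName={node} Name=gpu File=/dev/nvidia{index} COREs={cores}"
        )
    return lines


if __name__ == "__main__":
    # Four GPUs with four cores each, matching the n1 example above.
    layout = {i: list(range(4 * i, 4 * i + 4)) for i in range(4)}
    print("\n".join(gres_lines("n1", layout)))
```

Run as a script, it prints the four `NodeName=n1` lines shown in the example.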
For reference, here is the output of "nvidia-smi topo -m". With a single GPU it looks like this:
[root@e ~]# nvidia-smi topo -m
        GPU0    CPU Affinity    NUMA Affinity
GPU0     X      0-3             N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
[root@e ~]#
With two GPUs:
[root@s ~]# nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      PHB     0-7             N/A
GPU1    PHB      X      0-7             N/A
(rest omitted)
[root@s ~]#
it looks like this.
With four GPUs:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      SYS     SYS     SYS     0-15            N/A
GPU1    SYS      X      SYS     SYS     0-15            N/A
GPU2    SYS     SYS      X      SYS     0-15            N/A
GPU3    SYS     SYS     SYS      X      0-15            N/A
or
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NODE    SYS     SYS     0-15            0
GPU1    NODE     X      SYS     SYS     0-15            0
GPU2    SYS     SYS      X      NODE    16-31           1
GPU3    SYS     SYS     NODE     X      16-31           1
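Reading the "CPU Affinity" column by eye works for a few nodes, but it can also be extracted programmatically. The sketch below is an assumption-laden illustration rather than a robust parser: `parse_topo` and `expand_cpu_list` are hypothetical names, and it assumes the plain whitespace-separated layout shown above (a header containing "CPU Affinity" and one row per GPU). Real output may differ between driver versions.

```python
import re

# Matches CPU lists such as "0-15" or "0-3,8-11".
CPU_LIST = re.compile(r"\d+(-\d+)?(,\d+(-\d+)?)*")


def expand_cpu_list(spec):
    """Expand a CPU list such as '0-3' or '0-3,8' into core indices."""
    cores = []
    for part in spec.split(","):
        first, _, last = part.partition("-")
        cores.extend(range(int(first), int(last or first) + 1))
    return cores


def parse_topo(text):
    """Return {gpu_index: [core, ...]} from `nvidia-smi topo -m` text."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # The header has no row label, so the data column sits one to the right.
    header = next(r for r in rows if "CPU" in r and "Affinity" in r)
    cpu_col = header.index("CPU") + 1
    affinity = {}
    for row in rows:
        gpu = re.fullmatch(r"GPU(\d+)", row[0])
        if gpu and len(row) > cpu_col and CPU_LIST.fullmatch(row[cpu_col]):
            affinity[int(gpu.group(1))] = expand_cpu_list(row[cpu_col])
    return affinity


# The second four-GPU example from the text above.
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NODE    SYS     SYS     0-15            0
GPU1    NODE     X      SYS     SYS     0-15            0
GPU2    SYS     SYS      X      NODE    16-31           1
GPU3    SYS     SYS     NODE     X      16-31           1
"""

if __name__ == "__main__":
    for gpu, cores in parse_topo(SAMPLE).items():
        print(gpu, cores[0], cores[-1])
```

The resulting mapping gives, for each device, the candidate cores to list in that GPU's `COREs=` entry.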
I made something nearly identical to the one in OpenPBS/GPU#e002b507. As for the command name... "qload" would do, but "sload" probably fits better.
Install this as "/apps/local/bin/sload".
Running it looks like this:
[illya@slurm ~]$ sload
Queue    Run  wait  Host  CPU  usage                 GPU
workq      2     0  n1    1/1  ********************  0/0
                    n2    1/2  **********----------  1/1 *
[illya@slurm ~]$
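As an illustration of how a usage bar like the one in the output above could be rendered, here is a minimal sketch. It is not the sload script itself: the function names are hypothetical, and it assumes the per-node CPU counts come from `sinfo -N -h -o '%N %C'`, where `%C` reports CPUs as allocated/idle/other/total.

```python
def usage_bar(used, total, width=20):
    """Draw '*' for the allocated share and '-' for the rest."""
    if total <= 0:
        return "-" * width
    filled = round(width * used / total)
    return "*" * filled + "-" * (width - filled)


def format_node(line):
    """Format one row of `sinfo -N -h -o '%N %C'`, e.g. 'n2 1/1/0/2'."""
    host, cpus = line.split()
    alloc, _idle, _other, total = (int(x) for x in cpus.split("/"))
    return f"{host:<5}{alloc}/{total} {usage_bar(alloc, total)}"


if __name__ == "__main__":
    # Hard-coded rows standing in for real sinfo output.
    for row in ["n1 1/0/0/1", "n2 1/1/0/2"]:
        print(format_node(row))
```

The bar width of 20 characters matches the sample output; the GPU column would need a separate data source and is left out of this sketch.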