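On compute nodes equipped with GPUs, install a gres.conf file in addition to the usual configuration.
If NVML support is enabled in Slurm, it is enough to distribute the following gres.conf to every node, whether or not the node has a GPU.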

[root@slurm ~]# cat /opt/slurm/etc/gres.conf
#
AutoDetect=nvml
 
[root@slurm ~]#

Alternatively, to write a shared gres.conf without using AutoDetect=nvml:

[root@slurm ~]# cat /opt/slurm/etc/gres.conf
#
#AutoDetect=nvml
NodeName=n1 Name=gpu File=/dev/nvidia0 COREs=0
NodeName=n2 Name=gpu File=/dev/nvidia0 COREs=0,1
[root@slurm ~]#

This lists the GPU device file and the CPU cores for each node explicitly.

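For a slightly more involved case, on a NUMA machine where each GPU should be paired with the cores closest to it,
use the output of "nvidia-smi topo -m" as a guide and write the file as follows: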

[root@slurm ~]# cat /opt/slurm/etc/gres.conf
NodeName=n1 Name=gpu File=/dev/nvidia0 COREs=0,1,2,3
NodeName=n1 Name=gpu File=/dev/nvidia1 COREs=4,5,6,7
NodeName=n1 Name=gpu File=/dev/nvidia2 COREs=8,9,10,11
NodeName=n1 Name=gpu File=/dev/nvidia3 COREs=12,13,14,15
 
# (or)
NodeName=n2 Name=gpu File=/dev/nvidia[0-1] COREs=0,1,2,3,4,5,6,7
NodeName=n2 Name=gpu File=/dev/nvidia[2-3] COREs=8,9,10,11,12,13,14,15
[root@slurm ~]#
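
The per-GPU lines above can also be generated instead of typed by hand. The following is a minimal sketch, not part of the original setup: the helper names `expand_cpu_list` and `gres_line` are hypothetical, and it assumes that the CPU IDs shown in the "CPU Affinity" column of nvidia-smi can be used directly as core indices in gres.conf, which may not hold on nodes with hyper-threading or a different core numbering.

```python
def expand_cpu_list(spec):
    """Expand a CPU list such as "0-3,8" into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def gres_line(node, gpu_index, cpu_affinity):
    """Format one gres.conf line for a single GPU device file.

    Assumption: cpu_affinity is a string like "0-3" taken from
    `nvidia-smi topo -m`, and its IDs match Slurm's core indices.
    """
    cores = ",".join(str(c) for c in expand_cpu_list(cpu_affinity))
    return f"NodeName={node} Name=gpu File=/dev/nvidia{gpu_index} COREs={cores}"


if __name__ == "__main__":
    # Four GPUs on node n1, each close to a different block of four cores.
    for gpu, affinity in enumerate(["0-3", "4-7", "8-11", "12-15"]):
        print(gres_line("n1", gpu, affinity))
```

Run as a script, this prints the four n1 lines shown above. Check the result against the actual hardware before distributing it.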

For reference, here is what "nvidia-smi topo -m" prints. With a single GPU:

[root@e ~]# nvidia-smi topo -m
        GPU0    CPU Affinity    NUMA Affinity
GPU0     X      0-3             N/A
 
Legend:
 
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
[root@e ~]#

With two GPUs:

[root@s ~]# nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      PHB     0-7             N/A
GPU1    PHB      X      0-7             N/A
 
(legend omitted)
[root@s ~]#

With four GPUs, the output looks like this:

        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      SYS     SYS     SYS     0-15            N/A
GPU1    SYS      X      SYS     SYS     0-15            N/A
GPU2    SYS     SYS      X      SYS     0-15            N/A
GPU3    SYS     SYS     SYS      X      0-15            N/A

or, on a machine where the GPUs are split across two NUMA nodes:

        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NODE    SYS     SYS     0-15            0
GPU1    NODE     X      SYS     SYS     0-15            0
GPU2    SYS     SYS      X      NODE    16-31           1
GPU3    SYS     SYS     NODE     X      16-31           1
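
To see which GPUs can share one gres.conf line (the `File=/dev/nvidia[0-1]` form), the "CPU Affinity" column can be read out of this table and the GPUs grouped by it. The following is a rough sketch under stated assumptions: the function names are hypothetical, the input is the plain text of "nvidia-smi topo -m" in the layout shown on this page, and the parsing relies on whitespace-separated columns, so other driver versions may need adjustments.

```python
from collections import defaultdict

SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NODE    SYS     SYS     0-15            0
GPU1    NODE     X      SYS     SYS     0-15            0
GPU2    SYS     SYS      X      NODE    16-31           1
GPU3    SYS     SYS     NODE     X      16-31           1
"""


def cpu_affinity_by_gpu(topo_text):
    """Return {gpu_index: cpu_affinity_string} from `nvidia-smi topo -m` text."""
    lines = topo_text.splitlines()
    # The header row is the one that names the "CPU Affinity" column.
    header = next(ln.split() for ln in lines if "CPU Affinity" in ln)
    # Data rows carry one extra leading token (the row label, e.g. "GPU0").
    col = header.index("CPU") + 1
    result = {}
    for ln in lines:
        if "CPU Affinity" in ln:
            continue  # skip the header row itself
        tokens = ln.split()
        if len(tokens) > col and tokens[0].startswith("GPU") and tokens[0][3:].isdigit():
            result[int(tokens[0][3:])] = tokens[col]
    return result


def group_gpus_by_affinity(affinity):
    """Group GPU indices that report the same CPU affinity string."""
    groups = defaultdict(list)
    for gpu in sorted(affinity):
        groups[affinity[gpu]].append(gpu)
    return dict(groups)


if __name__ == "__main__":
    print(group_gpus_by_affinity(cpu_affinity_by_gpu(SAMPLE)))
```

For the two-NUMA-node table above, GPU0 and GPU1 fall into the "0-15" group and GPU2 and GPU3 into the "16-31" group, so each pair could be written as one gres.conf line.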

Last-modified: 2022-12-26 (Mon) 02:48:46