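Compute nodes equipped with GPUs additionally need a gres.conf.
If "nvml" support is enabled, distributing the "gres.conf" below to every node is sufficient, whether or not the node has a GPU.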

[root@slurm ~]# cat /opt/slurm/etc/gres.conf
#
AutoDetect=nvml
 
[root@slurm ~]#

Alternatively, to write a common "gres.conf" without using "AutoDetect=nvml":

[root@slurm ~]# cat /opt/slurm/etc/gres.conf
#
#AutoDetect=nvml
NodeName=n1 Name=gpu File=/dev/nvidia0 COREs=0
NodeName=n2 Name=gpu File=/dev/nvidia0 COREs=0,1
[root@slurm ~]#

Write it as shown above.

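In a slightly more involved case, on a NUMA machine where each GPU should be paired with the cores closest to it,
use the values from "nvidia-smi topo -m" as a guide and write the following: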

[root@slurm ~]# cat /opt/slurm/etc/gres.conf
NodeName=n1 Name=gpu File=/dev/nvidia0 COREs=0,1,2,3
NodeName=n1 Name=gpu File=/dev/nvidia1 COREs=4,5,6,7
NodeName=n1 Name=gpu File=/dev/nvidia2 COREs=8,9,10,11
NodeName=n1 Name=gpu File=/dev/nvidia3 COREs=12,13,14,15
 
# (or alternatively)
NodeName=n2 Name=gpu File=/dev/nvidia[0-1] COREs=0,1,2,3,4,5,6,7
NodeName=n2 Name=gpu File=/dev/nvidia[2-3] COREs=8,9,10,11,12,13,14,15
[root@slurm ~]#
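
The per-GPU lines above can also be generated rather than typed by hand. The following is a minimal Python sketch, not part of the original setup: the helper names `expand_range` and `gres_lines` are hypothetical, and it assumes that GPU index i maps to /dev/nvidia<i> and that the CPU ids printed by nvidia-smi can be used as-is as core indices (this may not hold, for example with hyper-threading enabled).

```python
def expand_range(spec):
    """Expand a CPU list such as "0-3,8" into [0, 1, 2, 3, 8]."""
    cores = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cores.extend(range(int(lo), int(hi) + 1))
        else:
            cores.append(int(part))
    return cores


def gres_lines(node, affinities):
    """Build one gres.conf line per GPU.

    affinities[i] is the "CPU Affinity" string reported for GPU i.
    Assumption: GPU i is exposed as /dev/nvidia<i>.
    """
    lines = []
    for idx, spec in enumerate(affinities):
        cores = ",".join(str(c) for c in expand_range(spec))
        lines.append(
            f"NodeName={node} Name=gpu File=/dev/nvidia{idx} COREs={cores}"
        )
    return lines


if __name__ == "__main__":
    # Four GPUs, each close to a different group of four cores
    for line in gres_lines("n1", ["0-3", "4-7", "8-11", "12-15"]):
        print(line)
```

Check the generated lines against the real device files and core numbering on the node before distributing them.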

For reference, this is what "nvidia-smi topo -m" prints. With one GPU:

[root@e ~]# nvidia-smi topo -m
        GPU0    CPU Affinity    NUMA Affinity
GPU0     X      0-3             N/A
 
Legend:
 
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
[root@e ~]#

With two GPUs:

[root@s ~]# nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      PHB     0-7             N/A
GPU1    PHB      X      0-7             N/A
 
(omitted)
[root@s ~]#

With four GPUs, the output looks like

        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      SYS     SYS     SYS     0-15            N/A
GPU1    SYS      X      SYS     SYS     0-15            N/A
GPU2    SYS     SYS      X      SYS     0-15            N/A
GPU3    SYS     SYS     SYS      X      0-15            N/A

or

        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NODE    SYS     SYS     0-15            0
GPU1    NODE     X      SYS     SYS     0-15            0
GPU2    SYS     SYS      X      NODE    16-31           1
GPU3    SYS     SYS     NODE     X      16-31           1
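
The "CPU Affinity" and "NUMA Affinity" columns are the part that matters for gres.conf. As a rough sketch, they can be pulled out of the text with a few lines of Python. The function name `parse_topo` is hypothetical, and the parser assumes the column layout shown in the samples on this page (device columns first, then "CPU Affinity", then "NUMA Affinity"); other driver versions may print extra columns.

```python
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NODE    SYS     SYS     0-15            0
GPU1    NODE     X      SYS     SYS     0-15            0
GPU2    SYS     SYS      X      NODE    16-31           1
GPU3    SYS     SYS     NODE     X      16-31           1
"""


def parse_topo(text):
    """Return {gpu_name: (cpu_affinity, numa_affinity)} from topo -m text."""
    result = {}
    n_dev = None  # number of device columns, learned from the header row
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if n_dev is None:
            # Header row: device names first, then "CPU Affinity"
            if tokens[0].startswith("GPU") and "CPU" in tokens:
                n_dev = tokens.index("CPU")
            continue
        if tokens[0] == "Legend:":
            break
        if tokens[0].startswith("GPU") and len(tokens) > n_dev + 1:
            cpu = tokens[n_dev + 1]
            numa = tokens[n_dev + 2] if len(tokens) > n_dev + 2 else "N/A"
            result[tokens[0]] = (cpu, numa)
    return result


if __name__ == "__main__":
    for gpu, (cpu, numa) in parse_topo(SAMPLE).items():
        print(gpu, cpu, numa)
```

In this sample GPU0 and GPU1 sit next to cores 0-15 and GPU2 and GPU3 next to cores 16-31, which is the grouping a two-line gres.conf entry per node would express.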

Something like qload for OpenPBS

I made something nearly identical to the tool in OpenPBS/GPU#e002b507. The command could be called qload as well, but "sload" is probably the better name.

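#!/bin/perl
# sload: show CPU and GPU usage per partition and node
use strict;

# Partition name => nodes to display (edit to match the cluster)
my %queue = (
        "workq"=>["n1","n2"]
         );
# Partitions in display order
my @q0 = ("workq");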
#------------------------------------------------#
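# CPUs per node: %c = CPUs on the node, %C = allocated/idle/other/total
# %tot holds the CPU count, %cpu the allocated count
my (%tot,%cpu);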
open(OUT, "sinfo --format='%N %P %c %C %E  %G' --Node --noheader|");
while(<OUT>){
  my @line = split(/\s+/,$_);
  $tot{$line[0]} = $line[2];
  my @cstat = split(/\//,$line[3]);
  $cpu{$line[0]}=$cstat[0];
}
close(OUT);
#------------------------------------------------#
# GPUs per node: %gpus = configured (Gres), %ugpus = in use (GresUsed)
my (%gpus,%ugpus);
open(OUT, "sinfo -O NodeList,Gres,GresUsed --Node --noheader|");
while(<OUT>){
  my @line = split(/\s+/,$_);
  if (    $line[1] eq "(null)" ){
      $gpus{$line[0]} = 0;
  }elsif( $line[1] =~ /^gpu:(\d+)\(/ ){
      $gpus{$line[0]} = $1;
  }
  if (    $line[2] eq "gpu:0" ){
      $ugpus{$line[0]} = 0;
  }elsif( $line[2] =~ /^gpu:\(null\):(\d+)\(IDX/ ){
      $ugpus{$line[0]} = $1;
  }
}
close(OUT);
#------------------------------------------------#
# running jobs per partition
# (sort first: uniq -c only merges adjacent identical lines)
my (%run);
open(OUT, "squeue -o '%P' -t RUNNING --noheader | sort | uniq -c | awk '{print \$2,\$1}'|");
while(<OUT>){
  my @line = split(/\s+/,$_);
  $run{$line[0]} = $line[1];
}
close(OUT);
#------------------------------------------------#
# pending jobs per partition
# (sort first: uniq -c only merges adjacent identical lines)
my (%wait);
open(OUT, "squeue -o '%P' -t PENDING --noheader | sort | uniq -c | awk '{print \$2,\$1}'|");
while(<OUT>){
  my @line = split(/\s+/,$_);
  $wait{$line[0]} = $line[1];
}
close(OUT);
#------------------------------------------------#
 
printf("%5s%4s%5s%6s%5s%16s%13s\n","Queue","Run", "wait", "Host","CPU","usage","GPU");
 
for(my $i = 0 ; $i <= $#q0 ; $i++ ){
  my $q = $q0[$i];
  my $n; my $job; my $np;
  foreach $n ( 0 .. $#{ $queue{$q} } ){
     if ( $n == 0 ){
        printf("%5s%4d%5d%6s%6s%22s%6s %s\n", $q, $run{$q},  $wait{$q}, $queue{$q}[$n],  $cpu{ $queue{$q}[$n] }."/".$tot{ $queue{$q}[$n] },
             &jobbar( $cpu{ $queue{$q}[$n] } , $tot{ $queue{$q}[$n] } ), $ugpus{ $queue{$q}[$n]  }."/".$gpus{ $queue{$q}[$n] },
             &jobbarG($ugpus{ $queue{$q}[$n] } , $gpus{ $queue{$q}[$n] } ) );
     } else {
        printf("%20s%6s%22s%6s %s\n",                                   $queue{$q}[$n],  $cpu{ $queue{$q}[$n] }."/".$tot{ $queue{$q}[$n] },
             &jobbar( $cpu{ $queue{$q}[$n] } , $tot{ $queue{$q}[$n] } ), $ugpus{ $queue{$q}[$n]  }."/".$gpus{ $queue{$q}[$n] },
             &jobbarG($ugpus{ $queue{$q}[$n] } , $gpus{ $queue{$q}[$n] } ) );
     }
  }
}
####
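# CPU usage bar: 20 characters wide, "*" for the used fraction, "-" for the rest
#   arguments: allocated CPUs, total CPUs
sub jobbar{
  my ($used, $total) = @_;
  # Avoid division by zero when sinfo reported nothing for the node
  return "-" x 20 unless $total;
  my $jobbar="";
  for ( my $i=0;$i<20;$i++){
         if ( $i/20 < $used/$total ){
            $jobbar=$jobbar."*";
         }else{
            $jobbar=$jobbar."-";
         }
  }
  return $jobbar;
}
# GPU usage bar: one character per GPU, "*" = in use, "-" = free
#   arguments: GPUs in use, total GPUs
sub jobbarG{
  my ($used, $total) = @_;
  # Nodes without GPUs get an empty bar
  return "" unless $total;
  return ("*" x $used) . ("-" x ($total - $used));
}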

Install this as "/apps/local/bin/sload".

Running it gives output like this:

[illya@slurm ~]$ sload
Queue Run wait  Host  CPU           usage          GPU
workq   2    0    n1   1/1  ********************   0/0
                  n2   1/2  **********----------   1/1 *
[illya@slurm ~]$
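
The GPU columns in this output come from the Gres and GresUsed fields of sinfo, whose text form varies. The Python sketch below shows one way to read the count out of such a field and draw the per-GPU bar. The function names `gpu_count` and `gpu_bar` are hypothetical. The strings "(null)", "gpu:0" and "gpu:(null):1(IDX:0)" are the forms the script above handles; the typed form "gpu:tesla:2(IDX:0-1)" is an assumption about how a node with a GPU type configured would be reported. Only a single GRES entry per field is handled.

```python
def gpu_count(field):
    """Return the GPU count encoded in one sinfo Gres or GresUsed field."""
    text = field.strip().replace("(null)", "")
    # Drop the trailing "(S:0)" or "(IDX:0)" part
    text = text.split("(", 1)[0]
    parts = text.split(":")
    # Expect "gpu:<count>" or "gpu:<type>:<count>"; anything else means no GPU
    if parts[0] != "gpu" or not parts[-1].isdigit():
        return 0
    return int(parts[-1])


def gpu_bar(used, total):
    """One character per GPU: "*" for in use, "-" for free."""
    return "*" * used + "-" * (total - used)


if __name__ == "__main__":
    total = gpu_count("gpu:2(S:0)")
    used = gpu_count("gpu:(null):1(IDX:0)")
    print(f"{used}/{total} {gpu_bar(used, total)}")
```

Taking the last colon-separated item before the first parenthesis avoids depending on whether a type name is present.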

Last-modified: 2023-09-08 (Fri) 16:12:56