When there are multiple GPUs, it seems the run won't work unless you specify a few extra options.

Assume a machine with 32 cores and 4 GPUs.

Running as a single process

This part is about a binary built without OpenMPI (i.e. the thread-MPI build).
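
As an aside (not part of the original runs): you can check which flavor of binary you have with "gmx --version"; the line of interest is "MPI library", which reads thread_mpi for this kind of build, while the library-MPI build used later is installed as "gmx_mpi".

$ gmx --version | grep "MPI library"     # for the non-OpenMPI build this should read "thread_mpi"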

$ gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log
 :
 :
Compiled SIMD: AVX2_256, but for this host/run AVX_512 might be better (see log).   <-- i.e. the compute node supports AVX-512, so a binary built for it would be preferable
Reading file adh_cubic.tpr, VERSION 2023.1 (single precision)
 
-------------------------------------------------------
Program:     gmx mdrun, version 2023.1
Source file: src/gromacs/taskassignment/resourcedivision.cpp (line 220)
 
Fatal error:
When using GPUs, setting the number of OpenMP threads without specifying the
number of ranks can lead to conflicting demands. Please specify the number of
thread-MPI ranks as well (option -ntmpi).
 
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
$ gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -ntmpi 1
 :
 :
Compiled SIMD: AVX2_256, but for this host/run AVX_512 might be better (see log).
Reading file adh_cubic.tpr, VERSION 2023.1 (single precision)
Changing nstlist from 10 to 80, rlist from 0.9 to 1.035
 
1 GPU selected for this run.                                                <-- this job uses one GPU
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0                                                                <-- the particle-particle (PP) calculation runs on GPU ID 0, and so does PME
PP tasks will do (non-perturbed) short-ranged interactions on the GPU 
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Using 32 OpenMP threads
 
starting mdrun 'NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water'
10000 steps,     20.0 ps.
step 5760: timed with pme grid 100 100 100, coulomb cutoff 0.900: 462.2 M-cycles
step 5920: timed with pme grid 84 84 84, coulomb cutoff 1.050: 220.3 M-cycles
step 6080: timed with pme grid 72 72 72, coulomb cutoff 1.225: 878.6 M-cycles
step 6240: timed with pme grid 80 80 80, coulomb cutoff 1.102: 230.8 M-cycles
step 6400: timed with pme grid 84 84 84, coulomb cutoff 1.050: 214.6 M-cycles
step 6560: timed with pme grid 96 96 96, coulomb cutoff 0.919: 188.3 M-cycles
step 6720: timed with pme grid 96 96 96, coulomb cutoff 0.919: 188.0 M-cycles
              optimal pme grid 96 96 96, coulomb cutoff 0.919
step 9900, remaining wall clock time:     0 s
Writing final coordinates.
step 10000, remaining wall clock time:     0 s
               Core t (s)   Wall t (s)        (%)
       Time:      459.583       14.377     3196.6
                 (ns/day)    (hour/ns)
Performance:      120.201        0.200
 
GROMACS reminds you: "Misslycka kan man med all kod" (Mats Nylen)

If you want to run on the 4th GPU listed by "nvidia-smi", use "-gpu_id 3".

$ gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -ntmpi 1 -gpu_id 3
 :
 :
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:3,PME:3
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Using 32 OpenMP threads
 
(snip)
step 10000, remaining wall clock time:     0 s
               Core t (s)   Wall t (s)        (%)
       Time:      454.374       14.218     3195.7
                 (ns/day)    (hour/ns)
Performance:      121.547        0.197
 
GROMACS reminds you: "Get Down In 3D" (George Clinton)
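
An alternative way to pin the run to that card (a sketch, assuming a CUDA build; not benchmarked here) is to hide the other devices with CUDA_VISIBLE_DEVICES. Note that the remaining GPU is then renumbered as 0 inside GROMACS:

$ CUDA_VISIBLE_DEVICES=3 gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -ntmpi 1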

Since we have multiple GPUs, naturally we want to use them, but:

$ gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -ntmpi 2
 :
 :
On host xxxx 2 GPUs selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:     <-- "-ntmpi 2" gives 2 thread-MPI ranks, and each rank is assigned a different GPU ID
  PP:0,PP:1
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
Using 2 MPI threads
Using 32 OpenMP threads per tMPI thread
 
 
WARNING: Oversubscribing the recommended max load of 32 logical CPUs with 64 threads.
         This will cause considerable performance loss.
 :
 :
               Core t (s)   Wall t (s)        (%)
       Time:     2417.065       37.832     6388.9
                 (ns/day)    (hour/ns)
Performance:       45.680        0.525
 :

But the speed dropped. The warning appears because each of the 2 ranks starts 32 OpenMP threads by default, oversubscribing the 32 logical CPUs with 64 threads. Setting "OMP_NUM_THREADS" so that the threads are divided up properly brings the speed back up, though still not to the level of the single-rank run.

$ OMP_NUM_THREADS=16 gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -ntmpi 2
 :
 :
On host xxxxxx 2 GPUs selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
  PP:0,PP:1
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
Using 2 MPI threads
Using 16 OpenMP threads per tMPI thread
 :
               Core t (s)   Wall t (s)        (%)
       Time:      646.740       20.229     3197.1
                 (ns/day)    (hour/ns)
Performance:       85.431        0.281
 :
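
The same thread count can also be set with mdrun's own "-ntomp" option instead of the environment variable (a sketch with the same intent; not benchmarked separately here):

$ gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -ntmpi 2 -ntomp 16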

The log says "Using 2 MPI threads" and 2 GPUs are used, but both shared the same PID: it is thread-based (thread-MPI), not separate processes.
So let's try using all 4 GPUs.

$ OMP_NUM_THREADS=8 gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -ntmpi 4 -gpu_id 0123
 :
 :
On host xxxxx  4 GPUs selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
  PP:0,PP:1,PP:2,PP:3
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
Using 4 MPI threads
Using 8 OpenMP threads per tMPI thread
 :
               Core t (s)   Wall t (s)        (%)
       Time:      547.835       17.136     3197.0
                 (ns/day)    (hour/ns)
Performance:      100.852        0.238

All 4 GPUs were assigned to the particle-particle (PP) calculation. If you want to dedicate one of them to PME:

$ OMP_NUM_THREADS=8 gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -gputasks 0123 -nb gpu -pme gpu -npme 1 -ntmpi 4
 :
 :
On host xxxxx 4 GPUs selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
  PP:0,PP:1,PP:2,PME:3
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 4 MPI threads
Using 8 OpenMP threads per tMPI thread
 :
               Core t (s)   Wall t (s)        (%)
       Time:      303.138        9.478     3198.3
                 (ns/day)    (hour/ns)
Performance:      182.335        0.132
 :
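
The "-gputasks" string assigns one GPU ID per GPU task on the node, in order (here the 3 PP tasks first, then the PME task), and the same ID may appear more than once to share a card. As a hedged two-GPU variant (not benchmarked here) that still gives PME its own card:

$ OMP_NUM_THREADS=16 gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -gputasks 01 -nb gpu -pme gpu -npme 1 -ntmpi 2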

Running multiple processes with MPI

In the previous section, even with "-ntmpi 4" only one process runs. Four GPU tasks run at the same time, but they all show the same PID.

So next, let's run the MPI version built with gromacs/mpi.

The machine here is a 20-core box with 4 GPUs (a different node from the 32-core one above).
First, just prepending mpirun with no extra options:

$ mpirun  gmx_mpi mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log
 :
On host aaaaaa  4 GPUs selected for this run.
Mapping of GPU IDs to the 20 GPU tasks in the 20 ranks on this node:
  PP:0,PP:0,PP:0,PP:0,PP:0,PP:1,PP:1,PP:1,PP:1,PP:1,PP:2,PP:2,PP:2,PP:2,PP:2,PP:3,PP:3,PP:3,PP:3,PP:3
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
Using 20 MPI processes
 
Non-default thread affinity set, disabling internal thread affinity
 
Using 20 OpenMP threads per MPI process
 :
               Core t (s)   Wall t (s)        (%)
       Time:   284989.419      712.497    39998.7
                 (ns/day)    (hour/ns)
Performance:        2.426        9.895
 :

It launches 20 MPI processes, one per core, each running 20 OpenMP threads, i.e. 400 threads in total.
The GPUs are also split across the 20 processes.
Probably because of that, it is slow.

Since there are 4 GPUs, let's use 4 MPI processes with 5 threads per process (20 cores / 4 ranks = 5).

$ OMP_NUM_THREADS=5 mpirun -n 4 gmx_mpi mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log
 :
On host aaaaaa  4 GPUs selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
  PP:0,PP:1,PP:2,PP:3
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
Using 4 MPI processes
 
Non-default thread affinity set, disabling internal thread affinity
 
Using 5 OpenMP threads per MPI process
 :
               Core t (s)   Wall t (s)        (%)
       Time:      668.254       33.421     1999.5
                 (ns/day)    (hour/ns)
Performance:       51.710        0.464
 :

OMP_NUM_THREADS sets the number of threads per MPI process, and "-n" sets the number of MPI processes.
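
With the library-MPI build, "-ntomp" works as well, and since the "Non-default thread affinity set, disabling internal thread affinity" message shows that mpirun's own binding overrides GROMACS' pinning, letting GROMACS pin the threads itself is worth a try (a sketch, assuming OpenMPI; whether it actually helps depends on the machine):

$ mpirun -n 4 --bind-to none gmx_mpi mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -ntomp 5 -pin on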

However, the PME calculation does not seem to be running on the GPU, so let's add that (a dedicated PME rank):

$ OMP_NUM_THREADS=5 mpirun -n 4  gmx_mpi mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log   -npme 1
 :
On host aaaaaa  4 GPUs selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
  PP:0,PP:1,PP:2,PME:3
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 4 MPI processes
 :
               Core t (s)   Wall t (s)        (%)
       Time:      407.216       20.366     1999.5
                 (ns/day)    (hour/ns)
Performance:       84.856        0.283
 :
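
If you prefer to spell out the offload targets rather than relying on the defaults, the same run can be written explicitly (a sketch with the same intent; it should map the tasks the same way as above):

$ OMP_NUM_THREADS=5 mpirun -n 4 gmx_mpi mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -nb gpu -pme gpu -npme 1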
