When multiple GPUs are present, GROMACS apparently does not run properly unless a few extra options are specified.
Assume a machine with 32 cores and 4 GPUs.
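For reference, the GPUs visible on the node can be listed with nvidia-smi (just a sketch; the actual output depends on the hardware):
$ nvidia-smi -L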
Running in a single process †
This section is about a binary built without OpenMPI enabled, i.e. the thread-MPI build.
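Which build a given gmx binary is can presumably be checked with gmx --version; the "MPI library" line should read "thread_mpi" for this build and "MPI" for the OpenMPI build used later:
$ gmx --version | grep -i 'MPI library'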
$ gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log
:
:
Compiled SIMD: AVX2_256, but for this host/run AVX_512 might be better (see log). <-- i.e. the compute node supports AVX-512, so it would be better to prepare a binary built for it
Reading file adh_cubic.tpr, VERSION 2023.1 (single precision)
-------------------------------------------------------
Program: gmx mdrun, version 2023.1
Source file: src/gromacs/taskassignment/resourcedivision.cpp (line 220)
Fatal error:
When using GPUs, setting the number of OpenMP threads without specifying the
number of ranks can lead to conflicting demands. Please specify the number of
thread-MPI ranks as well (option -ntmpi).
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
$ gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -ntmpi 1
:
:
Compiled SIMD: AVX2_256, but for this host/run AVX_512 might be better (see log).
Reading file adh_cubic.tpr, VERSION 2023.1 (single precision)
Changing nstlist from 10 to 80, rlist from 0.9 to 1.035
1 GPU selected for this run. <-- this job uses one GPU
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:0,PME:0 <-- the particle-particle (PP) calculation runs on GPU ID 0, and the PME calculation also runs on GPU ID 0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Using 32 OpenMP threads
starting mdrun 'NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water'
10000 steps, 20.0 ps.
step 5760: timed with pme grid 100 100 100, coulomb cutoff 0.900: 462.2 M-cycles
step 5920: timed with pme grid 84 84 84, coulomb cutoff 1.050: 220.3 M-cycles
step 6080: timed with pme grid 72 72 72, coulomb cutoff 1.225: 878.6 M-cycles
step 6240: timed with pme grid 80 80 80, coulomb cutoff 1.102: 230.8 M-cycles
step 6400: timed with pme grid 84 84 84, coulomb cutoff 1.050: 214.6 M-cycles
step 6560: timed with pme grid 96 96 96, coulomb cutoff 0.919: 188.3 M-cycles
step 6720: timed with pme grid 96 96 96, coulomb cutoff 0.919: 188.0 M-cycles
optimal pme grid 96 96 96, coulomb cutoff 0.919
step 9900, remaining wall clock time: 0 s
Writing final coordinates.
step 10000, remaining wall clock time: 0 s
Core t (s) Wall t (s) (%)
Time: 459.583 14.377 3196.6
(ns/day) (hour/ns)
Performance: 120.201 0.200
GROMACS reminds you: "Misslycka kan man med all kod" (Mats Nylen)
To run on the fourth GPU listed by "nvidia-smi", use "-gpu_id 3":
$ gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -ntmpi 1 -gpu_id 3
:
:
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:3,PME:3
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 1 MPI thread
Using 32 OpenMP threads
(snip)
step 10000, remaining wall clock time: 0 s
Core t (s) Wall t (s) (%)
Time: 454.374 14.218 3195.7
(ns/day) (hour/ns)
Performance: 121.547 0.197
GROMACS reminds you: "Get Down In 3D" (George Clinton)
Since there are several GPUs, naturally we want to use them, but:
$ gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -ntmpi 2
:
:
On host xxxx 2 GPUs selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node: <-- with [-ntmpi 2] there are two (thread-)MPI ranks, and each rank is assigned a different GPU ID
PP:0,PP:1
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
Using 2 MPI threads
Using 32 OpenMP threads per tMPI thread
WARNING: Oversubscribing the recommended max load of 32 logical CPUs with 64 threads.
This will cause considerable performance loss.
:
:
Core t (s) Wall t (s) (%)
Time: 2417.065 37.832 6388.9
(ns/day) (hour/ns)
Performance: 45.680 0.525
:
However, the speed dropped. There is also a warning; setting "OMP_NUM_THREADS" so that the threads are allocated appropriately improves the speed, though still not to the level of the single-rank run.
$ OMP_NUM_THREADS=16 gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -ntmpi 2
:
:
On host xxxxxx 2 GPUs selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
PP:0,PP:1
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
Using 2 MPI threads
Using 16 OpenMP threads per tMPI thread
:
Core t (s) Wall t (s) (%)
Time: 646.740 20.229 3197.1
(ns/day) (hour/ns)
Performance: 85.431 0.281
:
The log says "Using 2 MPI threads" and two GPUs are used, but the PID was the same for both; it appears to be thread-based.
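A quick way to confirm this (a sketch; the PID 12345 below is just a placeholder) is to check which PID shows up on the GPUs and how many threads that process has:
$ nvidia-smi --query-compute-apps=pid,gpu_uuid --format=csv    # the same PID should appear for both GPUs
$ ps -o nlwp= -p 12345                                         # thread count of that single gmx process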
In that case, let's try using all four GPUs:
$ OMP_NUM_THREADS=8 gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -ntmpi 4 -gpu_id 0123
:
:
On host xxxxx 4 GPUs selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
PP:0,PP:1,PP:2,PP:3
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
Using 4 MPI threads
Using 8 OpenMP threads per tMPI thread
:
Core t (s) Wall t (s) (%)
Time: 547.835 17.136 3197.0
(ns/day) (hour/ns)
Performance: 100.852 0.238
All four GPUs were assigned to the particle-particle (PP) calculation. If you want to dedicate one of them to PME:
$ OMP_NUM_THREADS=8 gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -gputasks 0123 -nb gpu -pme gpu -npme 1 -ntmpi 4
:
:
On host xxxxx 4 GPUs selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
PP:0,PP:1,PP:2,PME:3
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 4 MPI threads
Using 8 OpenMP threads per tMPI thread
:
Core t (s) Wall t (s) (%)
Time: 303.138 9.478 3198.3
(ns/day) (hour/ns)
Performance: 182.335 0.132
:
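For reference, a variant of the same command that also requests the bonded interactions and the coordinate update explicitly on the GPU; -bonded gpu and -update gpu are standard mdrun options, but their effect on this system was not benchmarked here:
$ OMP_NUM_THREADS=8 gmx mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -gputasks 0123 -nb gpu -pme gpu -bonded gpu -update gpu -npme 1 -ntmpi 4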
Running multiple processes with MPI †
In the previous section, even with "-ntmpi 4" only a single process runs. Four GPU tasks run at the same time, but they all show the same PID.
So next, let's run the MPI build made with gromacs/mpi.
This machine has 20 cores and 4 GPUs.
First, simply prefixing the command with mpirun:
$ mpirun gmx_mpi mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log
:
On host aaaaaa 4 GPUs selected for this run.
Mapping of GPU IDs to the 20 GPU tasks in the 20 ranks on this node:
PP:0,PP:0,PP:0,PP:0,PP:0,PP:1,PP:1,PP:1,PP:1,PP:1,PP:2,PP:2,PP:2,PP:2,PP:2,PP:3,PP:3,PP:3,PP:3,PP:3
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
Using 20 MPI processes
Non-default thread affinity set, disabling internal thread affinity
Using 20 OpenMP threads per MPI process
:
Core t (s) Wall t (s) (%)
Time: 284989.419 712.497 39998.7
(ns/day) (hour/ns)
Performance: 2.426 9.895
:
This launches one MPI process per core, i.e. 20 processes, each running with 20 OpenMP threads, for 400 threads in total.
The GPUs are also divided among the 20 processes.
Probably because of this, it is slow.
Since there are 4 GPUs, let's use 4 MPI processes with 5 threads per process:
$ OMP_NUM_THREADS=5 mpirun -n 4 gmx_mpi mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log
:
On host aaaaaa 4 GPUs selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
PP:0,PP:1,PP:2,PP:3
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
Using 4 MPI processes
Non-default thread affinity set, disabling internal thread affinity
Using 5 OpenMP threads per MPI process
:
Core t (s) Wall t (s) (%)
Time: 668.254 33.421 1999.5
(ns/day) (hour/ns)
Performance: 51.710 0.464
:
OMP_NUM_THREADS sets the number of threads per MPI process, and [-n] sets the number of MPI processes.
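The thread count per rank can presumably also be given to mdrun itself with -ntomp instead of the environment variable (a sketch, not benchmarked here):
$ mpirun -n 4 gmx_mpi mdrun -ntomp 5 -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log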
However, the PME calculation does not seem to be running on the GPU, so let's add that:
$ OMP_NUM_THREADS=5 mpirun -n 4 gmx_mpi mdrun -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -npme 1
:
On host aaaaaa 4 GPUs selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
PP:0,PP:1,PP:2,PME:3
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 4 MPI processes
:
Core t (s) Wall t (s) (%)
Time: 407.216 20.366 1999.5
(ns/day) (hour/ns)
Performance: 84.856 0.283
:
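Incidentally, the "Non-default thread affinity set, disabling internal thread affinity" message above suggests that mpirun's own CPU binding is getting in the way of GROMACS's thread pinning. One possible (untested) variant is to disable mpirun's binding and let mdrun pin the threads itself; --bind-to none is an Open MPI option and -pin on is a standard mdrun option:
$ OMP_NUM_THREADS=5 mpirun -n 4 --bind-to none gmx_mpi mdrun -pin on -v -s adh_cubic.tpr -o adh_cubic.trr -c adh_cubic.gro -e adh_cubic.edr -g adh_cubic.log -npme 1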