Skill by nvidia
mcore-run-on-slurm
How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions,
NVIDIA skill, Developer, AI Engineer, ML Engineer, HPC Developer, Megatron Core, AI and Machine Learning
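
As a rough illustration of the launch flow the description names, here is a minimal sbatch sketch under stated assumptions. It is not the skill's own script. The job name, node and GPU counts, time limit, port 29500, and the `GPUS_PER_NODE` variable are hypothetical placeholders. `CUDA_DEVICE_MAX_CONNECTIONS=1` is only a placeholder default, since the appropriate value depends on hardware and parallelism mode. `pretrain_gpt.py` stands in for the training entry point, and its training arguments are omitted. The script prints the launch command instead of running it, so it can be dry-run outside SLURM.

```shell
#!/bin/bash
#SBATCH --job-name=mcore-train      # hypothetical job name
#SBATCH --nodes=2                   # placeholder node count
#SBATCH --ntasks-per-node=1         # one launcher per node; it spawns the per-GPU workers
#SBATCH --gpus-per-node=8           # placeholder GPU count
#SBATCH --time=01:00:00             # placeholder time limit

# Derive rendezvous settings from SLURM; the defaults allow a dry run outside SLURM.
NNODES="${SLURM_JOB_NUM_NODES:-1}"
GPUS_PER_NODE="${GPUS_PER_NODE:-8}"   # hypothetical helper variable
if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
  # The first host in the allocation acts as the rendezvous endpoint.
  MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
else
  MASTER_ADDR="localhost"
fi
MASTER_PORT="${MASTER_PORT:-29500}"
export MASTER_ADDR MASTER_PORT

# Assumption: 1 is a placeholder default; the right value depends on the
# hardware and the parallelism mode in use.
export CUDA_DEVICE_MAX_CONNECTIONS="${CUDA_DEVICE_MAX_CONNECTIONS:-1}"

# Build the torch.distributed.run command; training arguments are omitted.
LAUNCH_CMD="python -m torch.distributed.run \
--nnodes=${NNODES} \
--nproc_per_node=${GPUS_PER_NODE} \
--rdzv_backend=c10d \
--rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
pretrain_gpt.py"

# Print the command for inspection. On a real cluster, the last step would be
# something like: srun bash -c "$LAUNCH_CMD"
echo "$LAUNCH_CMD"
```

With the `c10d` rendezvous backend, node ranks are assigned during rendezvous, so the sketch does not set a per-node rank explicitly. Container flags are left out because they depend on the cluster's container integration.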