Skill by nvidia
nemo-mbridge-multi-node-slurm
Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native versus uv run torch.distributed launch approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocations.
NVIDIA skill, Developer, AI Engineer, DevOps Engineer, ML Engineer, HPC Developer, AI and Machine Learning