SLURM GPU Workshop for Distributed LLM Training

This repository contains scripts and configurations for distributed training of Large Language Models (LLMs) using SLURM and containers. The workshop focuses on using LLaMA-Factory for fine-tuning Meta Llama 3 models across multiple nodes.

Prerequisites

  • Access to a SLURM-based HPC cluster with GPU nodes
  • NVIDIA container support (Pyxis/Enroot)
  • HuggingFace account with access to Meta Llama 3 models

Environment Setup

1. Container Setup

The workshop uses NVIDIA PyTorch containers. You have two options:

Option A: Use the shared container

# Create a directory for your personal containers
mkdir -p ~/slurm-gpu-workshop/containers

# Copy the shared container to your personal directory
cp /cm/shared/workspace/nvidia_pytorch_files/nvidia-pytorch-24.12-py312.sqsh ~/slurm-gpu-workshop/containers/llama_factory.sqsh

Option B: Create your own container

# Create a named container (here: pyxis_llama_factory) from the shared image
enroot create --name pyxis_llama_factory /cm/shared/workspace/nvidia_pytorch_files/nvidia-pytorch-24.12-py312.sqsh

# Export to a squashfs file
enroot export pyxis_llama_factory > ~/slurm-gpu-workshop/containers/llama_factory.sqsh
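
Either way, you can confirm that the container was created and exported correctly:

# List locally created enroot containers
enroot list

# Check the exported squashfs image
ls -lh ~/slurm-gpu-workshop/containers/llama_factory.sqsh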

2. HuggingFace Authentication

To access gated models like Meta Llama 3:

# Create a directory for your HuggingFace token
mkdir -p ~/.huggingface

# Store your token (replace with your actual token)
echo "hf_your_actual_token_here" > ~/.huggingface/token
chmod 600 ~/.huggingface/token  # Make it private

Important: Make sure your token has the necessary permissions to access Meta Llama 3 models. You can check this at https://huggingface.co/settings/tokens.
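
Most HuggingFace tooling also honors the HF_TOKEN environment variable; a minimal way to export the stored token in an interactive session (the file path matches the setup above):

# Export the token so huggingface_hub picks it up
export HF_TOKEN=$(cat ~/.huggingface/token)

# Optional sanity check (requires the huggingface_hub CLI)
huggingface-cli whoami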

Repository Structure

slurm-gpu-workshop/
├── LLaMA-Factory/         # LLaMA-Factory code (included as Git subtree)
├── containers/            # Container files
├── logs/                  # Job logs
├── nccl_logs/             # NCCL logs for debugging distributed training
└── slurm-scripts/         # SLURM job scripts
    ├── launcher.sh        # Script that runs inside the container
    ├── multi-node-job.slurm  # Multi-node job submission script
    └── simple-job.sbatch  # Simple single-node job without containers

Running Distributed Training

1. Getting Started

This repository includes LLaMA-Factory as a Git subtree, so you don't need to clone it separately. Simply clone this repository to get everything you need:

git clone https://github.com/aihpi/slurm-gpu-workshop.git
cd slurm-gpu-workshop

2. Choose Your Job Type

Option A: Multi-Node Distributed Training (with containers)

For training across multiple nodes with GPU acceleration:

cd ~/slurm-gpu-workshop
sbatch slurm-scripts/multi-node-job.slurm

Option B: Simple Single-Node Training (without containers)

For training on a single node without container overhead:

cd ~/slurm-gpu-workshop
sbatch slurm-scripts/simple-job.sbatch

3. Monitor Your Job

# Check job status
squeue -u $USER

# Monitor logs in real-time
tail -f logs/<job_id>/debug_output.log
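
# Show detailed information about a running or pending job
scontrol show job <job_id>

# Review accounting info after the job finishes (if accounting is enabled)
sacct -j <job_id> --format=JobID,State,Elapsed,NodeList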

Different Job Approaches

Multi-Node Job (with containers)

The multi-node approach provides:

  • Isolated environment with all dependencies pre-installed
  • Consistent runtime across different nodes
  • NCCL-optimized communication for distributed training
  • Ability to scale across multiple nodes
  • Support for InfiniBand for high-speed node communication

Simple Job (without containers)

The simple job approach:

  • Uses the host's Python environment directly
  • Simpler setup without container overhead
  • Limited to a single node
  • May require manual installation of dependencies
  • Good for initial testing and development

Troubleshooting

Container Issues

If you encounter container issues, check:

  1. Pyxis Plugin: Verify Pyxis is properly installed

    srun --container-image=alpine:latest echo "Hello from container"
  2. Container Path: Make sure your container path is correct

    ls -la ~/slurm-gpu-workshop/containers/
  3. Node Availability: Check if the nodes are available

    sinfo -p defq

HuggingFace Token Issues

If you see 403 errors when trying to access models:

  1. Token Permissions: Ensure your token has the correct permissions for gated models
  2. Token Validation: Test your token

    curl -I -H "Authorization: Bearer $(cat ~/.huggingface/token)" https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/config.json

Distributed Training Issues

  1. NCCL Logs: Check the NCCL logs for communication issues

    cat ~/slurm-gpu-workshop/nccl_logs/<job_id>_*.txt
  2. Network Configuration: Ensure InfiniBand is properly configured

    # Your NCCL settings should include
    export NCCL_IB_DISABLE=0
    export NCCL_IB_CUDA_SUPPORT=1
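
# Optional: verbose NCCL diagnostics (standard NCCL debug variables)
export NCCL_DEBUG=INFO
export NCCL_DEBUG_FILE=~/slurm-gpu-workshop/nccl_logs/%h_%p.txt  # %h=host, %p=pid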

Key Configuration Files

multi-node-job.slurm

This script sets up the environment and launches the distributed training:

  • Allocates nodes and GPUs
  • Mounts directories into the container
  • Sets up NCCL for distributed communication
  • Runs the launcher script inside the container
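
For orientation, here is a minimal sketch of the shape such a script can take; the node counts, partition, output path, and mount points below are illustrative, not the actual file:

#!/bin/bash
#SBATCH --job-name=llama-multinode
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --partition=defq
#SBATCH --output=logs/%j.out

# NCCL settings for InfiniBand (see Troubleshooting above)
export NCCL_IB_DISABLE=0
export NCCL_IB_CUDA_SUPPORT=1

# One task per node; Pyxis creates the container from the squashfs image
# and runs the launcher inside it
srun --container-image="$HOME/slurm-gpu-workshop/containers/llama_factory.sqsh" \
     --container-mounts="$HOME/slurm-gpu-workshop:/workspace" \
     bash /workspace/slurm-scripts/launcher.sh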

launcher.sh

This script runs inside the container and:

  • Activates the Python environment
  • Sets up HuggingFace authentication
  • Configures distributed training parameters
  • Launches the training process with proper rank configuration
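
A hedged sketch of the core steps (the rendezvous variables and the training entry point are illustrative; consult the actual script for the real invocation):

#!/bin/bash
# HuggingFace authentication from the mounted token file
export HF_TOKEN=$(cat ~/.huggingface/token)

# Rendezvous point for distributed training, derived from SLURM's environment
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500

# torchrun spawns one worker per GPU and assigns ranks across nodes
torchrun \
  --nnodes="$SLURM_NNODES" \
  --node_rank="$SLURM_NODEID" \
  --nproc_per_node="$SLURM_GPUS_PER_NODE" \
  --master_addr="$MASTER_ADDR" \
  --master_port="$MASTER_PORT" \
  /workspace/LLaMA-Factory/src/train.py "$CONFIG_PATH"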

simple-job.sbatch

This script provides a simpler approach:

  • Runs directly on the host without containers
  • Creates and activates a virtual environment
  • Sets up HuggingFace authentication
  • Runs training on a single node
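
In outline, such a script might look like this (the resource requests, package source, and entry point are illustrative):

#!/bin/bash
#SBATCH --job-name=llama-simple
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --output=logs/%j.out

# Create and activate a virtual environment on the host
python3 -m venv .venv
source .venv/bin/activate

# Install LLaMA-Factory from the bundled subtree
pip install -e ./LLaMA-Factory

# HuggingFace authentication
export HF_TOKEN=$(cat ~/.huggingface/token)

# Single-node training via LLaMA-Factory's CLI
llamafactory-cli train "$CONFIG_PATH"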

Advanced Usage

DeepSpeed ZeRO-3 Integration

To use DeepSpeed ZeRO-3 for larger models:

# Add this to your job submission
export USE_DEEPSPEED=1
sbatch slurm-scripts/multi-node-job.slurm
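
How USE_DEEPSPEED is consumed depends on launcher.sh; within LLaMA-Factory itself, ZeRO-3 is enabled by pointing the training YAML's deepspeed key at a ZeRO-3 JSON file, for example the one shipped in the subtree (path assumed):

# In your training YAML:
# deepspeed: LLaMA-Factory/examples/deepspeed/ds_z3_config.json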

Custom Configuration

Set the CONFIG_PATH environment variable to use your own training configuration:

export CONFIG_PATH="path/to/your/config.yaml"
sbatch slurm-scripts/multi-node-job.slurm
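
As a starting point, here is a minimal LoRA fine-tuning config in LLaMA-Factory's YAML schema; the model, dataset, and hyperparameter values below are illustrative:

# Write a config file and point the job at it
cat > my_config.yaml << 'EOF'
model_name_or_path: meta-llama/Meta-Llama-3-8B
stage: sft
do_train: true
finetuning_type: lora
dataset: alpaca_en_demo
template: llama3
output_dir: saves/llama3-8b-lora
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
EOF

export CONFIG_PATH="$PWD/my_config.yaml"
sbatch slurm-scripts/multi-node-job.slurm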

License

This project is provided as-is for educational purposes.
