This repository contains scripts and configurations for distributed training of Large Language Models (LLMs) using SLURM and containers. The workshop focuses on using LLaMA-Factory to fine-tune Meta Llama 3 models across multiple nodes.
- Access to a SLURM-based HPC cluster with GPU nodes
- NVIDIA container support (Pyxis/Enroot)
- HuggingFace account with access to Meta Llama 3 models
The workshop uses NVIDIA PyTorch containers. You have two options for obtaining the container image:
# Create a directory for your personal containers
mkdir -p ~/slurm-gpu-workshop/containers
# Option 1: Copy the shared, pre-built container image to your personal directory
cp /cm/shared/workspace/nvidia_pytorch_files/nvidia-pytorch-24.12-py312.sqsh ~/slurm-gpu-workshop/containers/llama_factory.sqsh
# Option 2: Create a named Enroot container from the shared image and export it yourself
enroot create --name pyxis_llama_factory /cm/shared/workspace/nvidia_pytorch_files/nvidia-pytorch-24.12-py312.sqsh
# Export the container to a squashfs file
enroot export --output ~/slurm-gpu-workshop/containers/llama_factory.sqsh pyxis_llama_factory
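Either way, it is worth sanity-checking the image before using it in a batch job. A quick interactive test via Pyxis might look like this; the partition and GPU request are placeholders for whatever your cluster provides:

```bash
# Run a one-off command inside the image through Pyxis; adjust partition/GPU flags to your site
srun --partition=defq --gres=gpu:1 \
     --container-image="$HOME/slurm-gpu-workshop/containers/llama_factory.sqsh" \
     python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```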
To access gated models like Meta Llama 3:
# Create a directory for your HuggingFace token
mkdir -p ~/.huggingface
# Store your token (replace with your actual token)
echo "hf_your_actual_token_here" > ~/.huggingface/token
chmod 600 ~/.huggingface/token # Make it private
Important: Make sure your token has the necessary permissions to access Meta Llama 3 models. You can check this at https://huggingface.co/settings/tokens.
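The workshop scripts handle authentication from this token file, but for interactive work you can optionally expose the token through the environment so that `huggingface_hub` and `transformers` pick it up automatically; this is a convenience, not a requirement:

```bash
# Optional: make the token available to huggingface_hub in the current shell
export HF_TOKEN=$(cat ~/.huggingface/token)

# Or log in once so the token is cached for future sessions
huggingface-cli login --token "$HF_TOKEN"
```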
slurm-gpu-workshop/
├── LLaMA-Factory/               # LLaMA-Factory code (included as a Git subtree)
├── containers/                  # Container files
├── logs/                        # Job logs
├── nccl_logs/                   # NCCL logs for debugging distributed training
└── slurm-scripts/               # SLURM job scripts
    ├── launcher.sh              # Script that runs inside the container
    ├── multi-node-job.slurm     # Multi-node job submission script
    └── simple-job.sbatch        # Simple single-node job without containers
This repository includes LLaMA-Factory as a Git subtree, so you don't need to clone it separately. Simply clone this repository to get everything you need:
git clone https://github.com/aihpi/slurm-gpu-workshop.git
cd slurm-gpu-workshop
For training across multiple nodes with GPU acceleration:
cd ~/slurm-gpu-workshop
sbatch slurm-scripts/multi-node-job.slurm
For simple training on a single node without container overhead:
cd ~/slurm-gpu-workshop
sbatch slurm-scripts/simple-job.sbatch
# Check job status
squeue -u $USER
# Monitor logs in real-time
tail -f logs/<job_id>/debug_output.log
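If job accounting is enabled on your cluster, the standard SLURM tools below give a bit more insight into a run; these are generic SLURM commands rather than workshop-specific scripts:

```bash
# Detailed view of a pending or running job
scontrol show job <job_id>

# Resource usage and exit state after the job finishes (requires SLURM accounting)
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,MaxRSS
```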
The multi-node approach provides:
- Isolated environment with all dependencies pre-installed
- Consistent runtime across different nodes
- NCCL-optimized communication for distributed training
- Ability to scale across multiple nodes
- Support for InfiniBand for high-speed node communication (a quick way to verify the fabric is shown below)
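If you want to confirm that InfiniBand and the GPU interconnect are actually visible from a compute node, the standard diagnostics below are a quick check. They assume `infiniband-diags` and the NVIDIA driver tools are installed on the node, which may not be the case on every cluster:

```bash
# Show InfiniBand port state on a compute node (links should report "Active")
srun --partition=defq ibstat

# Show how GPUs and network interfaces are connected on the node
srun --partition=defq --gres=gpu:1 nvidia-smi topo -m
```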
The simple job approach:
- Uses the host's Python environment directly
- Avoids container overhead, so setup is simpler
- Is limited to a single node
- May require manual installation of dependencies
- Is well suited to initial testing and development
If you encounter container issues, check:
- Pyxis Plugin: Verify Pyxis is properly installed
srun --container-image=alpine:latest echo "Hello from container"
- Container Path: Make sure your container path is correct
ls -la ~/slurm-gpu-workshop/containers/
- Node Availability: Check whether the nodes are available
sinfo -p defq
If you see 403 errors trying to access models:
- Token Permissions: Ensure your token has the correct permissions for gated models
- Token Validation: Test your token
curl -I -H "Authorization: Bearer $(cat ~/.huggingface/token)" https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/config.json
For problems with inter-node communication during distributed training, check:
- NCCL Logs: Inspect the NCCL logs for communication issues
cat ~/slurm-gpu-workshop/nccl_logs/<job_id>_*.txt
- Network Configuration: Ensure InfiniBand is properly configured
# Your NCCL settings should include:
export NCCL_IB_DISABLE=0
export NCCL_IB_CUDA_SUPPORT=1
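For more detail while debugging, NCCL's own logging can be turned up with standard environment variables; the file pattern below is only an example of where the per-process logs could be written:

```bash
# Verbose NCCL logging; %h expands to the hostname, %p to the process ID
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
export NCCL_DEBUG_FILE=$HOME/slurm-gpu-workshop/nccl_logs/nccl_%h_%p.txt
```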
The multi-node-job.slurm script sets up the environment and launches distributed training. As sketched below, it:
- Allocates nodes and GPUs
- Mounts directories into the container
- Sets up NCCL for distributed communication
- Runs the launcher script inside the container
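The actual `slurm-scripts/multi-node-job.slurm` in the repository is the reference; the sketch below only illustrates the overall shape of such a job script. The partition name, node and GPU counts, mount path, master port, and output file are placeholders to adapt to your site:

```bash
#!/bin/bash
#SBATCH --job-name=llama-factory-multinode
#SBATCH --partition=defq              # placeholder: your GPU partition
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1           # one launcher per node; torchrun starts the per-GPU workers
#SBATCH --gres=gpu:4                  # placeholder: GPUs per node
#SBATCH --output=logs/%x_%j.out       # placeholder; the real script writes under logs/<job_id>/

# NCCL settings for InfiniBand
export NCCL_IB_DISABLE=0
export NCCL_IB_CUDA_SUPPORT=1
export NCCL_DEBUG=INFO

# Rendezvous endpoint: the first node of the allocation
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# Run the launcher inside the container on every node, mounting the repository
srun --container-image="$HOME/slurm-gpu-workshop/containers/llama_factory.sqsh" \
     --container-mounts="$HOME/slurm-gpu-workshop:/workspace" \
     bash /workspace/slurm-scripts/launcher.sh
```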
The launcher.sh script runs inside the container. As sketched below, it:
- Activates the Python environment
- Sets up HuggingFace authentication
- Configures distributed training parameters
- Launches the training process with proper rank configuration
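As with the job script, `slurm-scripts/launcher.sh` itself is authoritative; the sketch below only shows the typical shape of such a launcher. The GPU count, mount path, and training entry point are assumptions (LLaMA-Factory can also be driven through `llamafactory-cli`), so check the real script for the exact invocation:

```bash
#!/bin/bash
set -euo pipefail

# HuggingFace authentication from the token file created earlier
export HF_TOKEN=$(cat "$HOME/.huggingface/token")

# Distributed-training parameters derived from SLURM's environment
NNODES=${SLURM_JOB_NUM_NODES:-1}
NODE_RANK=${SLURM_NODEID:-0}
GPUS_PER_NODE=${GPUS_PER_NODE:-4}     # placeholder

# One torchrun per node; MASTER_ADDR/MASTER_PORT are exported by the job script
# CONFIG_PATH points at a LLaMA-Factory training YAML (see the custom configuration section)
torchrun \
    --nnodes="$NNODES" \
    --node_rank="$NODE_RANK" \
    --nproc_per_node="$GPUS_PER_NODE" \
    --master_addr="$MASTER_ADDR" \
    --master_port="$MASTER_PORT" \
    /workspace/LLaMA-Factory/src/train.py "$CONFIG_PATH"
```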
The simple-job.sbatch script provides a simpler approach. As sketched below, it:
- Runs directly on the host without containers
- Creates and activates a virtual environment
- Sets up HuggingFace authentication
- Runs training on a single node
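Again, the actual `slurm-scripts/simple-job.sbatch` is the reference; this is only a minimal single-node sketch without containers. The virtual-environment location, GPU request, and YAML path are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=llama-factory-simple
#SBATCH --nodes=1
#SBATCH --gres=gpu:1                  # placeholder
#SBATCH --output=logs/%x_%j.out

# Create and activate a virtual environment on the host
python3 -m venv "$HOME/llama-factory-venv"
source "$HOME/llama-factory-venv/bin/activate"

# Install LLaMA-Factory from the subtree bundled with this repository
pip install -e "$HOME/slurm-gpu-workshop/LLaMA-Factory"

# HuggingFace authentication
export HF_TOKEN=$(cat "$HOME/.huggingface/token")

# Single-node training; the YAML path is a placeholder for your own config
llamafactory-cli train "$HOME/slurm-gpu-workshop/my_llama3_lora_sft.yaml"
```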
To use DeepSpeed ZeRO-3 for larger models:
# Add this to your job submission
export USE_DEEPSPEED=1
sbatch slurm-scripts/multi-node-job.slurm
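LLaMA-Factory ships example DeepSpeed configurations, and a ZeRO-3 file generally has the shape sketched below (written as a heredoc so it stays a shell snippet). The "auto" values are resolved by the HuggingFace Trainer integration; treat the exact keys and the file name as illustrative and prefer the configs bundled with LLaMA-Factory:

```bash
# Illustrative DeepSpeed ZeRO-3 configuration (adapt, or use LLaMA-Factory's bundled examples)
cat > ds_z3_config.json <<'EOF'
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
EOF
```

The training YAML then points at this file through its `deepspeed:` key, as in the examples shipped with LLaMA-Factory.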
Set the CONFIG_PATH environment variable to point to your own training configuration:
export CONFIG_PATH="path/to/your/config.yaml"
sbatch slurm-scripts/multi-node-job.slurm
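As a starting point, a LoRA fine-tuning configuration for LLaMA-Factory looks roughly like the sketch below. The key names follow the examples shipped with LLaMA-Factory, while the model, dataset, output path, and hyperparameters are illustrative and should be adapted to your run:

```bash
# Illustrative LLaMA-Factory training config; adjust before use
cat > my_llama3_lora_sft.yaml <<'EOF'
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset (names must exist in LLaMA-Factory's dataset registry)
dataset: alpaca_en_demo
template: llama3
cutoff_len: 1024

### output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500

### training hyperparameters
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
bf16: true
EOF

export CONFIG_PATH="$PWD/my_llama3_lora_sft.yaml"
```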
This project is provided as-is for educational purposes.