This repository contains scripts and configurations for distributed training of Large Language Models (LLMs) using SLURM and containers. The workshop focuses on using LLaMA-Factory to fine-tune Meta Llama 3 models across multiple nodes.
- Access to a SLURM-based HPC cluster with GPU nodes
- NVIDIA container support (Pyxis/Enroot)
- HuggingFace account with access to Meta Llama 3 models
The workshop uses NVIDIA PyTorch containers. You have two options for obtaining the container image:
# Create a directory for your personal containers
mkdir -p ~/slurm-gpu-workshop/containers
# Option 1: Copy the shared, pre-built container image to your personal directory
cp /cm/shared/workspace/nvidia_pytorch_files/nvidia-pytorch-24.12-py312.sqsh ~/slurm-gpu-workshop/containers/llama_factory.sqsh
# Option 2: Create a named Enroot container from the shared image and export it yourself
enroot create --name pyxis_llama_factory /cm/shared/workspace/nvidia_pytorch_files/nvidia-pytorch-24.12-py312.sqsh
# Export the container to a squashfs file
enroot export --output ~/slurm-gpu-workshop/containers/llama_factory.sqsh pyxis_llama_factory
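Either way, it is worth sanity-checking the image before using it in a batch job. A quick interactive test via Pyxis might look like this; the partition and GPU request are placeholders for whatever your cluster provides:

```bash
# Run a one-off command inside the image through Pyxis; adjust partition/GPU flags to your site
srun --partition=defq --gres=gpu:1 \
     --container-image="$HOME/slurm-gpu-workshop/containers/llama_factory.sqsh" \
     python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```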
To access gated models like Meta Llama 3:
# Create a directory for your HuggingFace token
mkdir -p ~/.huggingface
# Store your token (replace with your actual token)
echo "hf_your_actual_token_here" > ~/.huggingface/token
chmod 600 ~/.huggingface/token # Make it private
Important: Make sure your token has the necessary permissions to access Meta Llama 3 models. You can check this at https://huggingface.co/settings/tokens.
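The workshop scripts handle authentication from this token file, but for interactive work you can optionally expose the token through the environment so that `huggingface_hub` and `transformers` pick it up automatically; this is a convenience, not a requirement:

```bash
# Optional: make the token available to huggingface_hub in the current shell
export HF_TOKEN=$(cat ~/.huggingface/token)

# Or log in once so the token is cached for future sessions
huggingface-cli login --token "$HF_TOKEN"
```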
slurm-gpu-workshop/
├── LLaMA-Factory/               # LLaMA-Factory code (included as a Git subtree)
├── containers/                  # Container files
├── logs/                        # Job logs
├── nccl_logs/                   # NCCL logs for debugging distributed training
└── slurm-scripts/               # SLURM job scripts
    ├── launcher.sh              # Script that runs inside the container
    ├── multi-node-job.slurm     # Multi-node job submission script
    └── simple-job.sbatch        # Simple single-node job without containers
This repository includes LLaMA-Factory as a Git subtree, so you don't need to clone it separately. Simply clone this repository to get everything you need:
git clone https://github.com/aihpi/slurm-gpu-workshop.git
cd slurm-gpu-workshop
For training across multiple nodes with GPU acceleration:
cd ~/slurm-gpu-workshop
sbatch slurm-scripts/multi-node-job.slurm
For simple training on a single node without container overhead:
cd ~/slurm-gpu-workshop
sbatch slurm-scripts/simple-job.sbatch
# Check job status
squeue -u $USER
# Monitor logs in real-time
tail -f logs/<job_id>/debug_output.log
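If job accounting is enabled on your cluster, the standard SLURM tools below give a bit more insight into a run; these are generic SLURM commands rather than workshop-specific scripts:

```bash
# Detailed view of a pending or running job
scontrol show job <job_id>

# Resource usage and exit state after the job finishes (requires SLURM accounting)
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,MaxRSS
```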
The multi-node approach provides:
- Isolated environment with all dependencies pre-installed
- Consistent runtime across different nodes
- NCCL-optimized communication for distributed training
- Ability to scale across multiple nodes
- Support for InfiniBand for high-speed node communication (a quick way to verify the fabric is shown below)
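If you want to confirm that InfiniBand and the GPU interconnect are actually visible from a compute node, the standard diagnostics below are a quick check. They assume `infiniband-diags` and the NVIDIA driver tools are installed on the node, which may not be the case on every cluster:

```bash
# Show InfiniBand port state on a compute node (links should report "Active")
srun --partition=defq ibstat

# Show how GPUs and network interfaces are connected on the node
srun --partition=defq --gres=gpu:1 nvidia-smi topo -m
```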
The simple job approach:
- Uses the host's Python environment directly
- Avoids container overhead, so setup is simpler
- Is limited to a single node
- May require manual installation of dependencies
- Is well suited to initial testing and development
If you encounter container issues, check:
- Pyxis Plugin: Verify Pyxis is properly installed
srun --container-image=alpine:latest echo "Hello from container"
- Container Path: Make sure your container path is correct
ls -la ~/slurm-gpu-workshop/containers/
- Node Availability: Check whether the nodes are available
sinfo -p defq
If you see 403 errors trying to access models:
- Token Permissions: Ensure your token has the correct permissions for gated models
- Token Validation: Test your token
curl -I -H "Authorization: Bearer $(cat ~/.huggingface/token)" https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/config.json
For problems with inter-node communication during distributed training, check:
- NCCL Logs: Inspect the NCCL logs for communication issues
cat ~/slurm-gpu-workshop/nccl_logs/<job_id>_*.txt
- Network Configuration: Ensure InfiniBand is properly configured
# Your NCCL settings should include:
export NCCL_IB_DISABLE=0
export NCCL_IB_CUDA_SUPPORT=1
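For more detail while debugging, NCCL's own logging can be turned up with standard environment variables; the file pattern below is only an example of where the per-process logs could be written:

```bash
# Verbose NCCL logging; %h expands to the hostname, %p to the process ID
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
export NCCL_DEBUG_FILE=$HOME/slurm-gpu-workshop/nccl_logs/nccl_%h_%p.txt
```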
The multi-node-job.slurm script sets up the environment and launches distributed training. As sketched below, it:
- Allocates nodes and GPUs
- Mounts directories into the container
- Sets up NCCL for distributed communication
- Runs the launcher script inside the container
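The actual `slurm-scripts/multi-node-job.slurm` in the repository is the reference; the sketch below only illustrates the overall shape of such a job script. The partition name, node and GPU counts, mount path, master port, and output file are placeholders to adapt to your site:

```bash
#!/bin/bash
#SBATCH --job-name=llama-factory-multinode
#SBATCH --partition=defq              # placeholder: your GPU partition
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1           # one launcher per node; torchrun starts the per-GPU workers
#SBATCH --gres=gpu:4                  # placeholder: GPUs per node
#SBATCH --output=logs/%x_%j.out       # placeholder; the real script writes under logs/<job_id>/

# NCCL settings for InfiniBand
export NCCL_IB_DISABLE=0
export NCCL_IB_CUDA_SUPPORT=1
export NCCL_DEBUG=INFO

# Rendezvous endpoint: the first node of the allocation
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# Run the launcher inside the container on every node, mounting the repository
srun --container-image="$HOME/slurm-gpu-workshop/containers/llama_factory.sqsh" \
     --container-mounts="$HOME/slurm-gpu-workshop:/workspace" \
     bash /workspace/slurm-scripts/launcher.sh
```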
The launcher.sh script runs inside the container. As sketched below, it:
- Activates the Python environment
- Sets up HuggingFace authentication
- Configures distributed training parameters
- Launches the training process with proper rank configuration
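As with the job script, `slurm-scripts/launcher.sh` itself is authoritative; the sketch below only shows the typical shape of such a launcher. The GPU count, mount path, and training entry point are assumptions (LLaMA-Factory can also be driven through `llamafactory-cli`), so check the real script for the exact invocation:

```bash
#!/bin/bash
set -euo pipefail

# HuggingFace authentication from the token file created earlier
export HF_TOKEN=$(cat "$HOME/.huggingface/token")

# Distributed-training parameters derived from SLURM's environment
NNODES=${SLURM_JOB_NUM_NODES:-1}
NODE_RANK=${SLURM_NODEID:-0}
GPUS_PER_NODE=${GPUS_PER_NODE:-4}     # placeholder

# One torchrun per node; MASTER_ADDR/MASTER_PORT are exported by the job script
# CONFIG_PATH points at a LLaMA-Factory training YAML (see the custom configuration section)
torchrun \
    --nnodes="$NNODES" \
    --node_rank="$NODE_RANK" \
    --nproc_per_node="$GPUS_PER_NODE" \
    --master_addr="$MASTER_ADDR" \
    --master_port="$MASTER_PORT" \
    /workspace/LLaMA-Factory/src/train.py "$CONFIG_PATH"
```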
The simple-job.sbatch script provides a simpler approach. As sketched below, it:
- Runs directly on the host without containers
- Creates and activates a virtual environment
- Sets up HuggingFace authentication
- Runs training on a single node
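Again, the actual `slurm-scripts/simple-job.sbatch` is the reference; this is only a minimal single-node sketch without containers. The virtual-environment location, GPU request, and YAML path are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=llama-factory-simple
#SBATCH --nodes=1
#SBATCH --gres=gpu:1                  # placeholder
#SBATCH --output=logs/%x_%j.out

# Create and activate a virtual environment on the host
python3 -m venv "$HOME/llama-factory-venv"
source "$HOME/llama-factory-venv/bin/activate"

# Install LLaMA-Factory from the subtree bundled with this repository
pip install -e "$HOME/slurm-gpu-workshop/LLaMA-Factory"

# HuggingFace authentication
export HF_TOKEN=$(cat "$HOME/.huggingface/token")

# Single-node training; the YAML path is a placeholder for your own config
llamafactory-cli train "$HOME/slurm-gpu-workshop/my_llama3_lora_sft.yaml"
```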
To use DeepSpeed ZeRO-3 for larger models:
# Add this to your job submission
export USE_DEEPSPEED=1
sbatch slurm-scripts/multi-node-job.slurm
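LLaMA-Factory ships example DeepSpeed configurations, and a ZeRO-3 file generally has the shape sketched below (written as a heredoc so it stays a shell snippet). The "auto" values are resolved by the HuggingFace Trainer integration; treat the exact keys and the file name as illustrative and prefer the configs bundled with LLaMA-Factory:

```bash
# Illustrative DeepSpeed ZeRO-3 configuration (adapt, or use LLaMA-Factory's bundled examples)
cat > ds_z3_config.json <<'EOF'
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
EOF
```

The training YAML then points at this file through its `deepspeed:` key, as in the examples shipped with LLaMA-Factory.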
Set the CONFIG_PATH environment variable to point to your own training configuration:
export CONFIG_PATH="path/to/your/config.yaml"
sbatch slurm-scripts/multi-node-job.slurm
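As a starting point, a LoRA fine-tuning configuration for LLaMA-Factory looks roughly like the sketch below. The key names follow the examples shipped with LLaMA-Factory, while the model, dataset, output path, and hyperparameters are illustrative and should be adapted to your run:

```bash
# Illustrative LLaMA-Factory training config; adjust before use
cat > my_llama3_lora_sft.yaml <<'EOF'
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset (names must exist in LLaMA-Factory's dataset registry)
dataset: alpaca_en_demo
template: llama3
cutoff_len: 1024

### output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500

### training hyperparameters
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
bf16: true
EOF

export CONFIG_PATH="$PWD/my_llama3_lora_sft.yaml"
```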
This project is provided as-is for educational purposes.