slurm_sweep is the missing (small) piece for running hyperparameter sweeps efficiently on SLURM clusters. It combines the power of Weights & Biases (W&B) and simple_slurm, so you can parallelize sweeps with job arrays while tracking experiments and results on W&B. All you need is:
- A W&B account.
- A config.yaml file that defines your sweep.
- A train.py script that specifies the actual training and evaluation.
Create an account on W&B and take a look at our examples in the examples folder. Each example contains both a config.yaml file and a train.py script.
You need a config file in YAML format. This file should have three sections:
- general: you need to define at least the project_name and the entity for the sweep on W&B.
- slurm: any valid SLURM option. Which options apply depends on your cluster; see the simple_slurm docs.
- wandb: standard W&B config for a hyperparameter sweep.
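To make the three-section layout concrete, here is a minimal sketch that mirrors a config.yaml as a Python dictionary and checks only the requirements stated above. The section names, project_name, and entity come from the text; every value, the slurm option names, and the helper find_config_problems are illustrative assumptions, and this is not how slurm-sweep itself validates a config.

```python
# Python mirror of a config.yaml with the three sections described above.
# Only the section names, project_name, and entity come from the text;
# every concrete value below is a placeholder.
example_config = {
    "general": {
        "project_name": "my-sweep-project",  # placeholder
        "entity": "my-wandb-team",  # placeholder
    },
    "slurm": {
        # Assumed option names; use whatever your cluster and simple_slurm accept.
        "time": "01:00:00",
        "mem": "8G",
    },
    "wandb": {
        # Standard W&B sweep configuration keys.
        "method": "random",
        "metric": {"name": "val_loss", "goal": "minimize"},
        "parameters": {
            "learning_rate": {"min": 0.0001, "max": 0.1},
            "batch_size": {"values": [32, 64, 128]},
        },
    },
}

REQUIRED_SECTIONS = ("general", "slurm", "wandb")
REQUIRED_GENERAL_KEYS = ("project_name", "entity")


def find_config_problems(config: dict) -> list[str]:
    """Return readable problems; an empty list means the layout looks fine."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if not isinstance(config.get(section), dict):
            problems.append(f"missing or non-mapping section: {section}")
    general = config.get("general")
    if isinstance(general, dict):
        for key in REQUIRED_GENERAL_KEYS:
            if not general.get(key):
                problems.append(f"general.{key} is required")
    return problems


print(find_config_problems(example_config))  # prints []
```

In the YAML file, the same nesting is written as top-level general, slurm, and wandb keys with indented entries below each.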
The train.py file needs to be a Python script that defines the training and evaluation logic. It should call wandb.init() and retrieve its hyperparameters from wandb.config. It can log values using wandb.log. See the W&B docs.
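As a rough illustration of such a script, the sketch below separates a toy training loop from the W&B calls. The function train_and_evaluate, the parameter names learning_rate and epochs, and the metric val_loss are all hypothetical; only wandb.init(), wandb.config, and wandb.log are taken from the description above. The defaults passed to wandb.init() are an assumption that lets the script run outside a sweep as well.

```python
import random


def train_and_evaluate(learning_rate: float, epochs: int, log) -> float:
    """Toy stand-in for real training; calls log(dict) once per epoch."""
    rng = random.Random(0)  # fixed seed so the toy numbers are reproducible
    loss = 1.0
    for epoch in range(epochs):
        # Fake training step: shrink the loss and add a little noise.
        loss = loss * (1.0 - min(learning_rate, 0.9)) + rng.uniform(0.0, 0.01)
        log({"epoch": epoch, "val_loss": loss})
    return loss


def main() -> None:
    import wandb  # imported here so the toy logic above works without W&B

    # Defaults for a standalone run; a sweep overrides them with its own values.
    run = wandb.init(config={"learning_rate": 0.01, "epochs": 5})
    config = wandb.config
    final_loss = train_and_evaluate(
        learning_rate=config.learning_rate,
        epochs=config.epochs,
        log=wandb.log,
    )
    wandb.log({"final_val_loss": final_loss})
    run.finish()


# A real train.py would end with a call to main(); it is left out here so the
# snippet can be imported and tested without a W&B login.
```

Passing the logging function in as an argument keeps the training logic testable on its own, for example with `train_and_evaluate(0.1, 3, print)`.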
Once you're ready, you can test your config file using slurm-sweep validate_config config.yaml. If this passes, create a submission script using slurm-sweep configure-sweep config.yaml, and submit it with sbatch submit.sh.
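If you prefer to run these three steps from one place, a small driver along the following lines could chain them and stop at the first failure. This is only a sketch: the helper names workflow_commands and run_workflow are hypothetical, the commands are copied verbatim from the paragraph above, and it assumes submit.sh ends up in the current working directory.

```python
import subprocess


def workflow_commands(config_path: str, script: str = "submit.sh") -> list[list[str]]:
    """Return the validate, configure, and submit commands in order."""
    return [
        ["slurm-sweep", "validate_config", config_path],
        ["slurm-sweep", "configure-sweep", config_path],
        ["sbatch", script],
    ]


def run_workflow(config_path: str) -> None:
    """Run each step in turn; check=True aborts on the first non-zero exit."""
    for command in workflow_commands(config_path):
        print("running:", " ".join(command))
        subprocess.run(command, check=True)
```

Calling `run_workflow("config.yaml")` on a login node would then validate, configure, and submit in sequence, raising subprocess.CalledProcessError if any step fails.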
You need to have Python 3.10 or newer installed on your system. If you don't have Python installed, we recommend installing uv.
There are two alternative options to install slurm_sweep:
- Install the latest release from PyPI:
  pip install slurm_sweep
- Install the latest development version:
  pip install git+https://github.com/quadbio/slurm_sweep.git@main
See the changelog.
If you found a bug, please use the issue tracker.