
Slurm Docker Cluster

Slurm Docker Cluster is a multi-container Slurm cluster designed for rapid deployment using Docker Compose. This repository simplifies the process of setting up a robust Slurm environment for development, testing, or lightweight usage.

🏁 Getting Started

To get up and running with Slurm in Docker, make sure you have the following tools installed:

  • Docker
  • Docker Compose (v2)
  • Git
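You can quickly confirm the tooling is in place (any recent Docker release with the Compose v2 plugin should work):

docker --version
docker compose version
git --version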

Clone the repository:

git clone https://github.com/giovtorres/slurm-docker-cluster.git
cd slurm-docker-cluster

📦 Containers and Volumes

This setup consists of the following containers:

  • mysql: Stores job and cluster data.
  • slurmdbd: Manages the Slurm database.
  • slurmctld: The Slurm controller responsible for job and resource management.
  • c1, c2: Compute nodes (running slurmd).

Persistent Volumes:

  • etc_munge: Mounted to /etc/munge
  • etc_slurm: Mounted to /etc/slurm
  • slurm_jobdir: Mounted to /data
  • var_lib_mysql: Mounted to /var/lib/mysql
  • var_log_slurm: Mounted to /var/log/slurm
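For reference, each named volume is wired to its mount point in docker-compose.yml roughly like this (an abridged sketch of one service, not the full definition in the repo):

services:
  slurmctld:
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - slurm_jobdir:/data
      - var_log_slurm:/var/log/slurm

volumes:
  etc_munge:
  etc_slurm:
  slurm_jobdir:
  var_log_slurm: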

🛠️ Building the Docker Image

Both the Slurm version (SLURM_TAG) and the image version (IMAGE_TAG) can be set in a .env file, which Docker Compose picks up automatically.

Update the SLURM_TAG and IMAGE_TAG found in the .env file and build the image:

docker compose build
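For example, a .env matching the versions used below would contain:

SLURM_TAG=slurm-21-08-6-1
IMAGE_TAG=21.08.6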

Alternatively, you can build the image directly with docker build, passing SLURM_TAG as a build argument and tagging the image with a version (IMAGE_TAG):

docker build --build-arg SLURM_TAG="slurm-21-08-6-1" -t slurm-docker-cluster:21.08.6 .

🚀 Starting the Cluster

Once the image is built, deploy the cluster with the default Slurm version using Docker Compose:

docker compose up -d

To run a specific version and override what is configured in .env, set IMAGE_TAG on the command line:

IMAGE_TAG=21.08.6 docker compose up -d

This will start up all containers in detached mode. You can monitor their status using:

docker compose ps

📝 Register the Cluster

After the containers are up and running, register the cluster with SlurmDBD:

./register_cluster.sh

Tip: Wait a few seconds for the daemons to initialize before running the registration script to avoid connection errors like: sacctmgr: error: Problem talking to the database: Connection refused.
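Under the hood, registering a cluster is a sacctmgr operation; conceptually, the script does something like the following (a sketch; the cluster name must match the ClusterName set in slurm.conf, assumed here to be linux):

docker exec slurmctld bash -c "sacctmgr --immediate add cluster name=linux"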

For real-time cluster logs, use:

docker compose logs -f

🖥️ Accessing the Cluster

To interact with the Slurm controller, open a shell inside the slurmctld container:

docker exec -it slurmctld bash

Now you can run any Slurm command from inside the container:

[root@slurmctld /]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      2   idle c[1-2]
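You can also run one-off Slurm commands without opening an interactive shell:

docker exec slurmctld sinfo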

🧑‍💻 Submitting Jobs

The cluster mounts the slurm_jobdir volume across all nodes, making job files accessible from the /data directory. To submit a job:

[root@slurmctld /]# cd /data/
[root@slurmctld data]# sbatch --wrap="hostname"
Submitted batch job 2

Check the output of the job:

[root@slurmctld data]# cat slurm-2.out
c1
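For anything beyond a one-liner, submit a regular batch script the same way. As an illustrative example (the #SBATCH values here are arbitrary), save the following as /data/hello.sh:

#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=hello-%j.out
srun hostname

Then submit it with sbatch hello.sh; the %j in the output path expands to the job ID.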

🔄 Cluster Management

Stopping and Restarting:

Stop the cluster without removing the containers:

docker compose stop

Restart it later:

docker compose start

Deleting the Cluster:

To completely remove the containers and associated volumes:

docker compose down -v
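To remove the containers but keep the persistent volumes (so state such as the accounting database survives a rebuild), omit the -v flag:

docker compose down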

⚙️ Advanced Configuration

You can modify Slurm configurations (slurm.conf, slurmdbd.conf) on the fly without rebuilding the containers. Just run:

./update_slurmfiles.sh slurm.conf slurmdbd.conf
docker compose restart

This makes it easy to add/remove nodes or test new configuration settings dynamically.
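For example, the compute nodes and the normal partition shown in the sinfo output above come from slurm.conf entries along these lines (an illustrative sketch; the CPUs value is an assumption, and the repo's actual entries may differ):

NodeName=c[1-2] CPUs=1 State=UNKNOWN
PartitionName=normal Default=yes Nodes=c[1-2] MaxTime=5-00:00:00 State=UP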

🤝 Contributing

Contributions from the community are welcome! If you want to add features, fix bugs, or improve documentation:

  1. Fork this repo.
  2. Create a new branch: git checkout -b feature/your-feature.
  3. Submit a pull request.

📄 License

This project is licensed under the MIT License.