Slurm (Simple Linux Utility for Resource Management) Usage Guide: A Workload Manager Widely Used in Startups, Public Organizations, and Research Labs

In my last post, we learned about LSF.
Most big tech companies and large corporations use LSF, but its licenses are quite expensive.
From a user's perspective LSF is convenient, but Slurm can be just as good when used properly; in fact, people who are proficient with Slurm often find LSF frustrating.
That is why government agencies, universities, and startups tend to rely on free open-source tools, which is where Slurm comes in.
For detailed documentation, please refer to the official site below. I'll summarize the things I use most frequently.

Slurm Workload Manager - Overview: "Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions..."
Slurm is a job management and job scheduling system frequently used in High Performance Computing (HPC) environments.
1. Introduction to High Performance Computing (HPC) and Clusters
1.1. What is High Performance Computing (HPC)?
High Performance Computing (HPC) is technology that uses supercomputers or computer clusters to perform large-scale computational tasks. These tasks include semiconductor design, climate modeling, genetic research, complex simulations, machine learning, or large-scale data analysis. Since these tasks are difficult to process with a single computer or take too much time, HPC is needed to solve them quickly through parallel processing.
1.2. What is a Cluster?
A cluster is a structure where multiple computers (or nodes) are connected through a network to operate as a single system. Each node has its own CPU, memory, and disk space, but the entire cluster cooperates through shared storage and network to process tasks.
Clusters exist in various sizes, from small-scale (a few nodes) to large-scale supercomputers with thousands of nodes. In HPC environments, clusters enable parallel processing of tasks, which is essential in many scientific and engineering fields.
1.3. Summary: Why Use Clusters?
The reasons for using clusters are:
- Parallel Processing: Tasks can be distributed across multiple nodes for simultaneous processing.
- Large-scale Task Processing: Enables processing of large-scale data or complex calculations that cannot be handled by a single computer.
- Resource Sharing: Multiple users can simultaneously use cluster resources.

2. Introduction to Slurm
2.1. What is Slurm?
Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager that manages and schedules jobs on Linux-based clusters.
Slurm provides the following key functions:
- Resource Allocation: Allocates cluster nodes to users.
- Job Execution Framework: Starts, executes, and monitors jobs on allocated nodes.
- Job Queue Management: When multiple jobs are requested simultaneously, schedules jobs based on priority and resource availability.
Slurm is open source, fault-tolerant, and highly scalable, and it powers many of the world's supercomputers; according to USC, about 60% of TOP500 computing centers use Slurm.

2.2. Why Use Slurm?
Slurm is preferred in HPC environments for the following reasons:
- Scalability: Supports everything from small clusters to large-scale supercomputers.
- Reliability: Designed so that the entire system is not affected even if parts of the system fail.
- Resource Management: Maximizes cluster performance through efficient resource allocation and job scheduling.
- Plugin Support: Can be used flexibly through plugins that support various hardware and features.
- Community Support: As an open-source project, many users and developers contribute, with rich forums and documentation.
3. How Slurm Works
Slurm's architecture consists of the following components:
- slurmctld: The central controller of the cluster that manages the status of all nodes, jobs, and users. Responsible for job scheduling and resource allocation.
- slurmd: A daemon running on each compute node that receives instructions from slurmctld to execute jobs and report node status.
- slurmdbd (optional): A database daemon for job records and accounting. Used for resource limits and fair scheduling.
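To make these roles concrete, here is a rough, illustrative fragment of a `slurm.conf` (the configuration file read by both `slurmctld` and `slurmd`). The hostnames, node counts, and sizes are made up to match the example outputs later in this post; a real configuration needs more settings.

```
# Illustrative slurm.conf fragment (hypothetical hostnames and sizes)
ClusterName=mycluster
SlurmctldHost=head01                 # node running slurmctld (central controller)
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SchedulerType=sched/backfill

# Compute nodes run slurmd and advertise their resources
NodeName=node[1-12] CPUs=16 RealMemory=64000 State=UNKNOWN

# Partitions group nodes into queues that users submit jobs to
PartitionName=debug   Nodes=node[1-2]  Default=YES MaxTime=01:00:00 State=UP
PartitionName=compute Nodes=node[3-12] MaxTime=INFINITE State=UP
```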
The job workflow is as follows:
- Users submit jobs using `sbatch` or `srun` commands.
- `slurmctld` puts the job in a queue and waits until resources are available.
- When resources are allocated, `slurmctld` instructs the `slurmd` on the corresponding node to start the job.
- `slurmd` executes the job and reports to `slurmctld` when completed.
This structure enables efficient management of cluster resources and fair processing of multiple users' jobs.
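As a quick illustration of that lifecycle from the user's side (the job ID shown is made up; `--wrap` simply turns a single command into a batch job):

```bash
# Submit a trivial batch job; sbatch replies with the assigned job ID
sbatch --wrap="sleep 60"      # -> Submitted batch job 123

# While slurmctld has it queued or running, squeue shows PD or R in the ST column
squeue -j 123

# Once slurmd reports completion, the job leaves the queue
squeue -j 123                 # -> no longer listed
```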

4. Getting Started with Slurm
To use Slurm, you need to know a few basic commands. Below are essential Slurm commands for beginners:
4.1. Basic Commands
| Command | Description |
|---|---|
| `sinfo` | Shows cluster partitions and node status. |
| `squeue` | Shows the job queue (including running and waiting jobs). |
| `sbatch` | Submits batch scripts. |
| `scancel` | Cancels jobs. |
| `srun` | Executes commands on allocated nodes. |
| `salloc` | Allocates resources for interactive sessions. |
| `scontrol` | Views or modifies detailed information about jobs, nodes, etc. |
| `sacct` | Shows job accounting information. |
`sinfo`: Shows cluster partitions and node status.

```bash
sinfo
```

Example output:

```
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2   idle node[1-2]
compute      up   infinite     10  alloc node[3-12]
```
`squeue`: Shows the job queue.

```bash
squeue
```

Example output:

```
JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
   12   compute myjob user1  R  0:10     1 node3
   13     debug  test user2 PD  0:00     1 (Resources)
```
`sbatch`: Submits batch scripts.

```bash
sbatch myjob.sh
```
`scancel`: Cancels jobs.

```bash
scancel <job_id>
```
`srun`: Executes commands on allocated nodes.

```bash
srun hostname
```
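`srun` can also launch the same command as multiple parallel tasks; a small example (output order is not guaranteed):

```bash
# Run hostname as 2 parallel tasks; one line is printed per task
srun --ntasks=2 hostname
```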
`salloc`: Allocates resources for interactive sessions.

```bash
salloc --time=00:10:00 --ntasks=1
```
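A typical interactive session then looks roughly like this (assuming the allocation is granted right away):

```bash
# Request a 10-minute, single-task allocation; salloc opens a shell
# with the allocation's environment once resources are granted
salloc --time=00:10:00 --ntasks=1

# Inside that shell, srun runs commands on the allocated node(s)
srun hostname

# Exiting the shell releases the allocation
exit
```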
4.2. Writing Slurm Batch Scripts
Most jobs are submitted through batch scripts.
Common `#SBATCH` directives:
- `--job-name`: Sets the job name.
- `--output`: Specifies the output file.
- `--time`: Sets the maximum execution time for the job.
- `--ntasks`: Specifies the number of tasks to run.
- `--mem`: Specifies the required amount of memory.
Example batch script:
```bash
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=result_%j.out
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --mem=1G

echo "Starting job on $(hostname)"
# Your job commands here
python my_script.py
echo "Job completed"
```
5. Job Monitoring and Debugging
After submitting jobs, here's how to check status and debug problems:
- Check job status: Use `squeue -u <your_username>` to check your job status.
- Job details: Use `scontrol show job <job_id>` to check detailed job information.
- Check output files: Check the file specified in `--output` to see if the job executed successfully.
- Cancel jobs: Use `scancel <job_id>` to cancel jobs.
- Accounting information: Use `sacct -j <job_id>` to check job resource usage and exit codes.
When debugging, first check error messages in the output files. Common problems include syntax errors in batch scripts, resource request errors, or job command errors.
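For exit codes and resource usage, `sacct` can print selected columns; the fields below are standard `sacct` format names, and `<job_id>` is a placeholder:

```bash
# State and ExitCode tell you whether the job failed; Elapsed and MaxRSS
# show how much time and memory it actually used
sacct -j <job_id> --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS
```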
6. Conclusion
Slurm is an excellent choice for organizations that need powerful job scheduling capabilities without the high licensing costs of commercial alternatives. Whether you're working in a startup, government agency, or research lab, Slurm provides the scalability, reliability, and flexibility needed for modern HPC workloads.
Its open-source nature, extensive community support, and proven track record in the world's most powerful supercomputers make it an ideal solution for managing computational resources efficiently.