SLURM Setup Documentation for a Single Node Cluster

Here are the instructions to set up a single-node SLURM cluster on your machine. The following setup works on Ubuntu 22.04.3.

Step 1: Setup MariaDB Database for Accounting

Before you can start building your SLURM cluster, you must first set up a database to store accounting information and resource limits. This is necessary if you want to enforce resource limits on certain users or groups within your cluster. SLURM has built-in support for MySQL-compatible databases such as MariaDB. Here we show the steps for setting up a MariaDB database server on your system and creating a SLURM accounting database that your SLURM daemons can later use to store and access this information.

  • Install MariaDB.

    sudo apt install mariadb-server
    
  • Configure MySQL.

    sudo mysql_secure_installation
    

    When prompted, feel free to remove anonymous users and disable remote root login. You may also switch to unix_socket authentication; however, it's not required, as it should be enabled by default. This means that you can log in as the 'root' user in MariaDB only if you are the 'root' user on your system or have sudo privileges. You may choose to set a password if you wish to allow root login without sudo.
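
    If you want to check which authentication plugin the root account is currently using, you can query it directly (this is just a quick check; the same information is shown again in the verification step below):

    sudo mysql -u root -e "SELECT User, Host, plugin FROM mysql.user WHERE User='root';"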

  • Login as root user in MariaDB. Note that you need sudo privileges for this.

    sudo mysql -u root
    

    Following this you should be logged in as root inside your database.

  • Create SLURM accounting database and a SLURM user with complete access to modify this database. Feel free to set 'password' to something more secure if desired.

    CREATE DATABASE slurm_acct_db;
    CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'password';
    GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
    FLUSH PRIVILEGES;
    EXIT;
    

    You should now be logged out of MariaDB.
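
    To confirm that the new slurm database user can log in and see the accounting database, you can try the following (using the password you set above); the output should include slurm_acct_db:

    mysql -u slurm -p'password' -e "SHOW DATABASES;"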

  • Enable the MariaDB server to start automatically on system reboot and restart the database server.

    sudo systemctl enable mariadb
    sudo systemctl restart mariadb
    
  • You should now be able to see the slurm user in the mysql.user table and confirm that the slurm_acct_db database has been created.

    sudo mysql -u root
    
    SELECT User, Host, plugin FROM mysql.user;
    SHOW DATABASES;
    EXIT;
    
  • Optional: Change authentication plugin afterwards for root user.

    sudo mysql -u root
    
    ALTER USER 'root'@'localhost' IDENTIFIED WITH unix_socket;
    FLUSH PRIVILEGES;
    EXIT;
    

Step 2: Setup MUNGE for Authentication

SLURM uses the MUNGE authentication tool to allow for secure communication between the different SLURM daemons running on our cluster.

  • Install MUNGE.

    sudo apt install munge
    
  • Check that the installation is correct. You should see a STATUS: Success message when running the command below.

    munge -n | unmunge | grep STATUS
    

    You should also find a MUNGE key at /etc/munge/munge.key. If you can't find it, create one using the following command:

    sudo /usr/sbin/mungekey
    
  • Set appropriate permissions for all munge files. Note that a munge user should have been automatically created after installing MUNGE. You can check this with cat /etc/passwd.

    sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
    sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
    sudo chmod 0755 /run/munge/
    sudo chmod 0700 /etc/munge/munge.key
    sudo chown -R munge: /etc/munge/munge.key
    
  • Modify MUNGE configuration file to allow system logging.

    sudo sed -i '/# OPTIONS/aOPTIONS="--syslog"' /etc/default/munge
    
  • Enable the munge daemon to start on system reboot and restart it.

    sudo systemctl enable munge
    sudo systemctl restart munge
    

Step 3: Setup SLURM

  • Install SLURM.

    sudo apt install slurm-wlm
    sudo apt install slurmdbd
    

    The above should install 3 daemons in total:

    • slurmd - The daemon that runs on each compute node in the cluster.
    • slurmctld - The central controller daemon that communicates with each slurmd and manages all nodes in the cluster.
    • slurmdbd - The database daemon that will communicate with our MariaDB database to store accounting information and resource limits.

    Since our setup only includes one node in our cluster, all 3 daemons will be running on the same server.
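
    You can confirm that all three daemons (and the client tools) were installed, and see which SLURM version Ubuntu shipped, using the version flag:

    slurmd -V
    slurmctld -V
    slurmdbd -V
    sinfo -V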

  • Configure slurmdbd. Create the slurmdbd.conf file and set the permissions and configuration needed for accounting.

    sudo touch /etc/slurm/slurmdbd.conf
    sudo chown slurm:slurm /etc/slurm/slurmdbd.conf
    sudo chmod 600 /etc/slurm/slurmdbd.conf
    sudo sh -c "cat > /etc/slurm/slurmdbd.conf" <<EOF
    # archiving
    ArchiveEvents=no
    ArchiveJobs=no
    ArchiveSteps=no
    ArchiveSuspend=no
    
    # service
    DbdHost=$(hostname -s)
    SlurmUser=slurm
    AuthType=auth/munge
    
    # logging; remove this to use syslog
    LogFile=/var/log/slurm/slurmdbd.log
    
    # database backend
    StoragePass=password
    StorageUser=slurm
    StorageType=accounting_storage/mysql
    StorageLoc=slurm_acct_db
    EOF
    

    Please note that the ownership and permissions must be set exactly as above for slurmdbd to work; if the file is owned by another user or is readable by other users, the database daemon may refuse to start.
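
    A quick way to verify the ownership and mode is shown below; the file should appear as -rw------- and be owned by slurm:slurm.

    ls -l /etc/slurm/slurmdbd.conf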

  • Increase the InnoDB buffer pool and log file sizes, and raise the lock wait timeout so that longer queries can complete:

    sudo sh -c "cat >> /etc/mysql/my.cnf" <<EOF
    [mysqld]
    innodb_buffer_pool_size=4096M
    innodb_log_file_size=64M
    innodb_lock_wait_timeout=900
    max_allowed_packet=16M
    EOF
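
    MariaDB only reads my.cnf at startup, so restart it for these settings to take effect:

    sudo systemctl restart mariadb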
    
  • Configure SLURM. In multi-node clusters this config file must usually be present on the controller node as well as on all worker nodes. Make sure to set up the proper file permissions here as well. We set up the config file to allow the use of GPUs through GRES (Generic Resources).

    sudo touch /etc/slurm/slurm.conf
    sudo chown slurm:slurm /etc/slurm/slurm.conf
    sudo chmod 664 /etc/slurm/slurm.conf
    sudo sh -c "cat > /etc/slurm/slurm.conf" <<EOF
    # identification
    ClusterName=$(hostname -s)
    ControlMachine=$(hostname -s)
    
    # authentication
    AuthType=auth/munge
    
    # service
    # proctrack/cgroup controls the freezer cgroup
    SlurmUser=slurm
    SlurmctldPort=6817
    SlurmdPort=6818
    SlurmdSpoolDir=/var/lib/slurm/slurmd
    StateSaveLocation=/var/lib/slurm/slurmctld
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmdPidFile=/var/run/slurmd%n.pid
    SwitchType=switch/none
    ProctrackType=proctrack/linuxproc
    MpiDefault=none
    RebootProgram=/sbin/reboot
    
    # get back on track as soon as possible
    ReturnToService=2
    
    # logging
    SlurmctldLogFile=/var/log/slurm/slurmctld.log
    SlurmdLogFile=/var/log/slurm/slurmd.log
    
    # accounting
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=$(hostname -s)
    AccountingStorageEnforce=limits
    AccountingStorageTRES=gres/gpu
    JobAcctGatherType=jobacct_gather/linux
    
    # checkpointing
    #CheckpointType=checkpoint/blcr
    
    # scheduling
    SchedulerType=sched/backfill
    PriorityType=priority/multifactor
    PriorityDecayHalfLife=3-0
    PriorityMaxAge=7-0
    PriorityFavorSmall=YES
    PriorityWeightAge=1000
    PriorityWeightFairshare=0
    PriorityWeightJobSize=125
    PriorityWeightPartition=1000
    PriorityWeightQOS=0
    PreemptMode=suspend,gang
    PreemptType=preempt/partition_prio
    
    # wait 30 minutes before assuming that a node is dead
    SlurmdTimeout=1800
    
    # cores and memory are the scheduling units
    # task/cgroup controls cpuset, memory and devices cgroups
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
    TaskPlugin=task/affinity,task/cgroup
    
    # computing nodes
    GresTypes=gpu
    NodeName=$(hostname -s) Gres=gpu:$(nvidia-smi --query-gpu=gpu_name --format=csv,noheader | wc -l) RealMemory=$(grep "^MemTotal:" /proc/meminfo | awk '{print int($2/1024)}') Sockets=$(grep "^physical id" /proc/cpuinfo | sort -uf | wc -l) CoresPerSocket=$(grep "^cpu cores" /proc/cpuinfo | awk -F':' 'NR==1{print $2}' | xargs) ThreadsPerCore=$(lscpu | awk -F':' '/^Thread\(s\) per core/ {print $2}' | xargs) State=UNKNOWN
    
    # partitions
    PartitionName=DEFAULT  Nodes=$(hostname -s) OverSubscribe=FORCE:1 MaxTime=INFINITE State=UP
    PartitionName=intern  Priority=10 Default=No
    PartitionName=employee Priority=20 Default=Yes AllowAccounts=employee_account_1,employee_account_2,employee_account_3
    EOF
    

    Feel free to change the partition names and add more if required. Think of partitions as queues that users can submit SLURM jobs to. The above config creates two partitions - intern and employee - both inheriting the settings given in the DEFAULT partition entry. With this configuration, whenever a user submits a job without specifying a partition, it goes to the employee partition. Accounts are groups of users. When you set the AllowAccounts parameter for a particular partition, only users belonging to those accounts are allowed to submit jobs to that partition. The line 'AccountingStorageEnforce=limits' is required if you want to set resource limits on certain users within your cluster. The line 'GresTypes=gpu' is necessary for GPU support in SLURM; GRES stands for Generic Resources. The same goes for the line 'SelectType=select/cons_tres' (do NOT set this to cons_res if you want GPU support).
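
    As a sanity check on the NodeName line generated above, slurmd can print the hardware it detects on the local machine in slurm.conf format (it does not detect GPUs, so the Gres part still has to be added by hand):

    slurmd -C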

  • Create gres.conf file. This is required if you want to allow GPU support for SLURM.

    sudo touch /etc/slurm/gres.conf
    sudo chown slurm:slurm /etc/slurm/gres.conf
    sudo chmod 664 /etc/slurm/gres.conf
    sudo sh -c "cat > /etc/slurm/gres.conf" <<EOF
    Name=gpu File=/dev/nvidia[0-$(nvidia-smi --query-gpu=gpu_name --format=csv,noheader | wc -l | awk '{print $1 - 1}')]
    EOF
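
    To sanity-check the device file range that gres.conf points at, list the NVIDIA device nodes and compare them against the GPUs reported by nvidia-smi:

    ls /dev/nvidia[0-9]*
    nvidia-smi -L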
    
  • Enable and restart slurmd, slurmctld, and slurmdbd.

    sudo systemctl enable slurmd slurmdbd slurmctld
    sudo systemctl restart slurmd slurmdbd slurmctld
    

    You can check the status of all 5 of your daemons now and make sure that they're running:

    systemctl status slurmd slurmctld slurmdbd mariadb munge
    

    Finally, make sure to set the node as idle so that it's ready to accept jobs:

    sudo scontrol update nodename=$(hostname -s) state=idle
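
    Once the daemons are up and the node has been set to idle, sinfo should list the node in both partitions, and scontrol should show the CPUs, memory, and GPU gres that were configured:

    sinfo
    scontrol show node $(hostname -s)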
    

Step 4: Create Cluster, Accounts, and Users

  • Create cluster.

    sudo sacctmgr -i add cluster $(hostname -s)
    
  • Create accounts. Accounts are groups of users. Here we'll create 4 accounts: employee_account_1, employee_account_2, employee_account_3, and intern_account. Make sure the names match with the names you used for the AllowAccounts parameter in your slurm.conf file.

    sudo sacctmgr -i add account employee_account_1 Description="1st Employee Account" Organization="Org 1"
    sudo sacctmgr -i add account employee_account_2 Description="2nd Employee Account" Organization="Org 2"
    sudo sacctmgr -i add account employee_account_3 Description="3rd Employee Account" Organization="Org 3"
    sudo sacctmgr -i add account intern_account Description="Intern Account" Organization="Intern Org"
    
  • Create users. Users in SLURM must have the same username as the system users that will be submitting jobs. Thus, if your system username is 'johndoe', then to submit jobs to SLURM you must also create a SLURM user called 'johndoe'. Here we'll create 2 users - employee_1, who belongs to the accounts employee_account_1, employee_account_2, and employee_account_3, and intern_1, who belongs to the account intern_account.

    sudo sacctmgr add user employee_1 DefaultAccount=employee_account_1 Account=employee_account_1,employee_account_2,employee_account_3 AdminLevel=Admin Partition=employee
    sudo sacctmgr add user intern_1 DefaultAccount=intern_account Partition=intern
    

    The above commands set the default accounts for our new users as well as the default partitions to which they'll submit their SLURM jobs. Note that you can add multiple users to a single account. Also, a single user can be part of multiple accounts.

  • Check for newly created users, accounts, and cluster:

    sacctmgr show user
    sacctmgr show account
    sacctmgr show cluster
    
  • Modify and delete users, accounts, clusters. To modify / delete a created user, account or cluster you can use the following format:

    sudo sacctmgr modify <entity> set <options> where <specification>
    

    For example, to change the default account of the user employee_1, we can do:

    sudo sacctmgr modify user where name=employee_1 set DefaultAccount=employee_account_2
    

    Similarly, we can delete this user by running the following:

    sudo sacctmgr remove user where name=employee_1
    

    Check out the SLURM accounting page for more info: https://slurm.schedmd.com/accounting.html

  • Set user resource limits. Here we're limiting the amount of resources the user intern_1 can use. We give them 16 CPU threads, 100 GB of memory, and 1 GPU.

    sudo sacctmgr modify user intern_1 set GrpTRES=cpu=16,mem=100G,gres/gpu=1
    

    To unset these limits run the following:

    sudo sacctmgr modify user intern_1 set GrpTRES=cpu=-1,mem=-1,gres/gpu=-1
    
  • Check the user resource limits.

    sacctmgr show associations user=employee_1,intern_1 format=User%25,GrpTRES%50,Account,Partition,Cluster%25
    

    Note that an association is the most basic accounting unit in SLURM. It is a set consisting of a user, account, cluster, and optionally a partition.

Step 5: Submit Batch Jobs to your SLURM Cluster

  • In order to submit a job on your SLURM cluster, you must be logged in to your system as one of the users in your SLURM database. A SLURM job consists of a bash script with a bunch of SBATCH directives. Here is an example of our batch job script slurm_tester.sh:

    #!/bin/sh
    #SBATCH --time=200
    #SBATCH --gres=gpu:1
    #SBATCH --output=slurm_output.txt
    #SBATCH --job-name=SLURM_Testing
    #SBATCH --partition=employee
    #SBATCH --account=employee_account_1
    #SBATCH --cpus-per-task=1
    #SBATCH --ntasks=1
    #SBATCH --mem-per-gpu=32G
    srun -l python3 slurm_tester.py
    

    Here we run a simple python script called slurm_tester.py that we have defined in the same directory as slurm_tester.sh. Here is what it looks like:

    import time
    time.sleep(20) # Sleep for 20 seconds
    print("Hello World!")
    

    A batch script in SLURM consists of a number of #SBATCH directives, which are comments starting with #SBATCH that tell SLURM about the job requirements - including requested resources and account configuration - for this batch. Not all of these directives need to be defined for each job. Here are descriptions of the directives used above.

    • --time: Time requested for the job in minutes. If job execution does not complete within the specified time limit, the job will be cancelled.
    • --gres: Used to specify generic resources. Here it is used to request a single GPU for the batch job.
    • --output: Specify the output file.
    • --job-name: Specify job name.
    • --partition: Specify the partition to submit the job to.
    • --account: Specify the account to submit the job to.
    • --ntasks: Number of tasks to be launched. A task is an instance of your program that is launched on a compute node. Tasks can run in parallel and use MPI to communicate with one another.
    • --cpus-per-task: Number of CPU threads required per task.
    • --mem-per-gpu: Amount of memory per GPU required for the job. Here we set it to 32 GB per GPU. When no unit is specified after the number, the default units are in MB. Note that alternatives include --mem for setting memory allocation and --mem-per-cpu for setting memory per CPU. Note that only one of these three directives can be used at a time.

    More directives can be found in the official SLURM docs: https://slurm.schedmd.com/sbatch.html

    The srun command launches a job step in SLURM; job steps are sets of (possibly parallel) tasks within a job. We can have multiple srun statements within our batch script.
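
    For example, a batch script with two sequential job steps might look like the sketch below (the scripts preprocess.py and train.py are just placeholders for your own programs):

    #!/bin/sh
    #SBATCH --job-name=two_steps
    #SBATCH --partition=employee
    #SBATCH --account=employee_account_1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=8G
    #SBATCH --time=60
    # step 0: prepare the data
    srun -l python3 preprocess.py
    # step 1: run the main workload; starts only after the first step finishes
    srun -l python3 train.py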

    To submit our batch job to SLURM, run the following command from your terminal:

    sbatch slurm_tester.sh
    
  • Check the status of your submitted job with squeue.

    squeue
    

    Once a job is submitted to SLURM, it either starts executing if there are enough resources available, or it's placed in the queue to wait for resources to become free.
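
    To list only your own jobs, filter by user; the ST column shows the job state (for example, PD for pending and R for running):

    squeue -u $USER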

  • See information about submitted jobs, including job ID, status, name, associated user, account, partition, and allocated resources:

    sudo sacct --format=JobID,JobName,User,Account,Partition,AllocCPUS,State,ExitCode,AllocTRES%50
    
  • Cancel a running job:

    scancel <JobID>
    

    For example, to cancel the job with JobID 32, run:

    scancel 32
    
  • For troubleshooting, try looking at the slurmctld, slurmd, and slurmdbd log files:

    sudo cat /var/log/slurm/slurmctld.log
    sudo cat /var/log/slurm/slurmd.log
    sudo cat /var/log/slurm/slurmdbd.log
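
    If the log files are empty or missing, the daemons may be logging to the systemd journal instead; in that case you can check:

    sudo journalctl -u slurmctld -u slurmd -u slurmdbd --since "1 hour ago"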
    

Optional: Set up a more secure password for slurm database user

  • Create password with pwgen:
    sudo apt install pwgen
    DB_SLURM_PASS=$(pwgen -s 16 1)
    
  • Change slurm user password:
    sudo mysql -u root -e "ALTER USER 'slurm'@'localhost' IDENTIFIED BY '${DB_SLURM_PASS}';"
    
  • Store the password in the /etc/slurm directory for safe-keeping, and change its permissions so that only the slurm user or root / superuser can access it. This is important because anyone who can read this password could modify the accounting database and thus get access to more resources than they are allocated:
    echo $DB_SLURM_PASS | sudo tee /etc/slurm/slurm_db_password.txt > /dev/null
    sudo chmod 600 /etc/slurm/slurm_db_password.txt
    
  • Change StoragePass in slurmdbd.conf:
    sudo sed -i "s/StoragePass=.*/StoragePass=$DB_SLURM_PASS/" /etc/slurm/slurmdbd.conf
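
    slurmdbd only reads its configuration at startup, so restart it for the new password to take effect:

    sudo systemctl restart slurmdbd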
    

Additional Resources