A managed Slurm cluster turns the GPU nodes you already reserved into a Slurm scheduler you reach over SSH. Ornn runs the scheduler and login node; you submit batch and distributed jobs with the standard Slurm tools. Logins use the SSH keys saved on your account.

When to use Slurm

  • You run batch jobs or distributed training and want a queue and scheduler.
  • Your team shares a pool of GPUs and needs fair scheduling across jobs.
  • You already have sbatch job scripts.
Prefer containers and orchestration? Use Kubernetes access instead. Want a single host you SSH into? See VM or Bare Metal.

Prerequisites

  • Your bid has been promoted to a confirmed reservation that is visible in your portfolio.
  • Checkout and payment for the reservation are complete.
  • Your account has at least one active SSH public key registered — Slurm logins use your account keys. See Manage SSH keys.

Launch the cluster

1. Open Clusters

From the console, go to Clusters (/clusters) and start a new cluster, or open your reservation in Portfolio and choose the Slurm access mode.

2. Choose Slurm and a network mode

Pick Slurm, then a network mode:
  • Public: the SSH login is reachable over the internet.
  • Private: the SSH login is reachable only over your reservation’s WireGuard VPN.

3. Launch

Launch the cluster. The first launch after an enrollment takes a few extra minutes while the GPU images download. The scheduler starts once every node is GPU-ready; your SSH login details appear on the manage page when it’s ready.

Connect over SSH

When the scheduler is ready, the cluster panel shows your SSH login host and port, a ready-to-run Connect command, and the login host key so you can verify it on first connect. Copy those values and connect:
ssh <login-user>@<login-host> -p <login-port>

Submit a job

On the login node, inspect the cluster and claim GPUs with the standard Slurm tools:
sinfo                          # partitions and node states
sacct -n -X --format=JobID,State,Elapsed   # accounting summary of your jobs: ID, state, elapsed time

srun -N1 hostname              # single-node smoke; use -N<node-count> to span nodes
ssh <node-name> hostname       # pick a node name shown by sinfo or srun

srun --gpus=1 nvidia-smi       # grab 1 GPU and print the GPU table

sbatch my-job.sh               # submit a batch job
squeue --me                    # watch your queued and running jobs
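When scripting around the commands above, it can help to reduce sinfo output to a single number. The sketch below sums the nodes reported as exactly idle. The name count_idle_nodes is a hypothetical helper, and the input is assumed to come from sinfo -h -o "%D %T" (node count, then state, no header).

```shell
# Hypothetical helper: sum the node counts whose state is exactly "idle".
# Expects lines of "<node-count> <state>" on stdin, the shape produced by
#   sinfo -h -o "%D %T"
# States that carry a suffix (for example "idle*") are deliberately not counted.
count_idle_nodes() {
  awk '$2 == "idle" { n += $1 } END { print n + 0 }'
}
```

On the login node this would be used as: sinfo -h -o "%D %T" | count_idle_nodes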
A minimal GPU sbatch script:
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --gpus=1
#SBATCH --output=train-%j.out

srun python train.py
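For jobs that span nodes, the same script shape extends with a node count and a per-node GPU count. The following is a minimal sketch, not a recommended layout: write_multinode_job is a hypothetical helper, and the job name, train.py, and the one-task-per-node arrangement are assumptions. A distributed training framework typically needs its own launcher settings on top of this.

```shell
# Hypothetical helper: write_multinode_job <nodes> <gpus-per-node> <out-file>
# Writes an sbatch script that starts one task per node. The job name,
# train.py, and the one-task-per-node layout are illustrative assumptions.
write_multinode_job() {
  nodes=$1
  gpus=$2
  out=$3
  cat > "$out" <<EOF
#!/bin/bash
#SBATCH --job-name=train-multi
#SBATCH --nodes=${nodes}
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=${gpus}
#SBATCH --output=train-%j.out

# srun launches one copy of the command per task, so one per node here.
srun python train.py
EOF
}
```

For example, write_multinode_job 2 8 multi.sh followed by sbatch multi.sh would request two nodes with eight GPUs each, assuming the cluster has that capacity.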

Private clusters: bring up WireGuard first

Private clusters expose nothing publicly — the SSH login is reachable only over your reservation’s WireGuard tunnel. Set it up once:
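# 1. Generate a WireGuard keypair locally. The subshell's umask 077 keeps the
#    private key file readable only by you; the public key is printed.
(umask 077; wg genkey | tee privatekey | wg pubkey)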

# 2. Paste the printed PUBLIC key into the cluster panel's Connect section and
#    generate a config.
# 3. Save the returned config (fill in your private key) and bring the tunnel up.
sudo wg-quick up ./ornn-wg.conf
Once the tunnel is up, the SSH login above works unchanged.
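The "fill in your private key" step can be scripted. This is a sketch under one assumption: that the generated config contains a PrivateKey = line holding a placeholder. Inspect the real file first, since its layout is not documented here. The name fill_private_key is a hypothetical helper.

```shell
# Hypothetical helper: fill_private_key <config-file> <private-key-file>
# Assumes the config has a "PrivateKey = ..." line with a placeholder value.
# Only that line is rewritten; every other line is copied through unchanged.
fill_private_key() {
  conf=$1
  keyfile=$2
  key=$(cat "$keyfile") || return 1
  tmp=$(mktemp) || return 1
  awk -v key="$key" '
    /^[ \t]*PrivateKey[ \t]*=/ { print "PrivateKey = " key; next }
    { print }
  ' "$conf" > "$tmp" || { rm -f "$tmp"; return 1; }
  # The config now holds a secret, so restrict it to the owner.
  mv "$tmp" "$conf" && chmod 600 "$conf"
}
```

With the files from the steps above, that would be: fill_private_key ./ornn-wg.conf ./privatekey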

Tear down

Tearing down stops the scheduler, revokes SSH login access, drains the workers, and returns the GPU nodes to your reservation. You can relaunch a torn-down or failed cluster in place.
Teardown interrupts running jobs and can’t be undone. Persist checkpoints and data to your own object storage first.
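Because teardown interrupts running jobs, it may be worth checking first for jobs that are still queued or running. This is a minimal sketch, assuming input from squeue --me -h -o "%i %T" (job ID, then state, no header). The name list_active_jobs is a hypothetical helper.

```shell
# Hypothetical helper: print the IDs of jobs that teardown would interrupt
# or drop from the queue.
# Expects lines of "<job-id> <state>" on stdin, the shape produced by
#   squeue --me -h -o "%i %T"
list_active_jobs() {
  awk '$2 == "RUNNING" || $2 == "COMPLETING" || $2 == "PENDING" { print $1 }'
}
```

On the login node: squeue --me -h -o "%i %T" | list_active_jobs. Empty output suggests nothing of yours is queued or running.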

Troubleshooting

If SSH rejects your key, remember that Slurm logins use your account SSH keys. Confirm that the private key your SSH client offers matches a public key registered on your account, and that the cluster has finished issuing login access (the panel shows the SSH login once it’s ready).
If the SSH login details have not appeared yet, the scheduler is probably still starting: it starts only after every node reports GPU-ready. Wait for the manage page to show the SSH login host and port; on a first launch, most of the wait is the GPU image pull.
If SSH warns that the host key has changed when you reconnect after a relaunch, remove the old known_hosts entry and reconnect: ssh-keygen -R <login-host>. Verify the new key against the host key shown on the cluster panel.
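One detail of the host-key step: OpenSSH records a host reached on a port other than 22 as [host]:port in known_hosts, so removing the bare hostname may not match the stored entry. The sketch below builds the right name for ssh-keygen -R. The name known_hosts_name is a hypothetical helper.

```shell
# Hypothetical helper: known_hosts_name <host> [port]
# Prints the name OpenSSH uses for this host in known_hosts:
# the bare host on port 22 (the default), "[host]:port" otherwise.
known_hosts_name() {
  host=$1
  port=${2:-22}
  if [ "$port" = "22" ]; then
    printf '%s\n' "$host"
  else
    printf '[%s]:%s\n' "$host" "$port"
  fi
}
```

Used with the values from the cluster panel: ssh-keygen -R "$(known_hosts_name <login-host> <login-port>)"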
On a private cluster, make sure the WireGuard tunnel is up (sudo wg show) before connecting. If your public key changed, regenerate the peer config from the Connect section.

What’s next

Kubernetes access

Prefer container orchestration? Run a managed Kubernetes cluster on your reserved GPUs instead.

Manage SSH keys

Register the account SSH keys your Slurm logins use.