Compute Node Access via SSH
This guide explains how to connect to compute nodes in the Odin cluster. All compute nodes are dynamically provisioned — they don’t have persistent instances until you submit a job requesting them.
Understanding Dynamic Nodes
The Odin cluster uses auto-scaling to reduce costs:
- Nodes are created on-demand when you submit a job requesting resources
- Nodes automatically shut down after they become idle
- First job submission may take 2-5 minutes while the node instance starts
- Node DNS names are automatically registered when instances are assigned
This means you cannot SSH to a compute node unless:
- A job has been submitted that requests those compute resources
- The node has been allocated by SLURM (assigned a physical instance)
- Your job is currently running or pending on that node
Check Node Status
View All Nodes and Their Status
sinfo
Output example:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu* up infinite 10 idle c7i-node-[01-10]
gpu-inferencing up infinite 5 idle g5-node-[01-05]
odin up infinite 2 idle p5-node-[01-02]
albus up infinite 2 idle p5-node-[03-04]
- idle: Nodes exist but no jobs running (may or may not have physical instance)
- allocated: Node is in use by a running job
- down: Node is offline
Check if a Node Has a Physical Instance
To determine if a compute node currently has a running EC2 instance:
# SSH to a login node first
ssh login1
# Check node details using scontrol
scontrol show node c7i-node-01
Look for these indicators in the output:
NodeName=c7i-node-01
State=idle
RealMemory=61440
AllocMemory=0
BootTime=2025-01-15T12:34:56
Key fields:
- BootTime: When the instance started (if recent, instance is running; if old, instance is likely shut down)
- RealMemory: Total memory (only populated if instance is running)
- State: Current state (idle/allocated/down)
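If you only need a couple of these fields, filtering the scontrol output with grep is usually enough (a minimal sketch using the example node name above):
# Show just the state, memory, and boot time lines for one node
scontrol show node c7i-node-01 | grep -E 'State=|RealMemory=|BootTime='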
Quick check for all nodes:
# Show brief status
sinfo -N -l
# Show only allocated nodes (have jobs running)
sinfo -t alloc
# Show nodes that are down or drained, with the reason
sinfo -R
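For a compact one-line-per-node summary, sinfo's format string works well (a sketch; the format codes below are standard sinfo specifiers):
# Name, state, CPUs (allocated/idle/other/total), and memory per node
sinfo -N -o "%N %T %C %m"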
Submit an Interactive Job to Allocate a Node
To allocate a compute node and SSH to it, submit an interactive job using salloc:
Basic Interactive Job (CPU)
# Allocate 1 CPU node for 1 hour
salloc --partition=cpu --nodes=1 --time=01:00:00
Output:
salloc: Granted job allocation 12345
salloc: Waiting for node configuration
salloc: Nodes c7i-node-01 are ready for job
Your prompt is now a shell on the compute node, and you have an active job allocation.
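A quick sanity check confirms you are inside the allocation (a minimal sketch; SLURM exports these variables automatically in the salloc session):
# hostname should print the compute node's name, not the login node's
hostname
echo "Job: $SLURM_JOB_ID  Nodes: $SLURM_JOB_NODELIST"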
Interactive Job with GPUs
# Allocate 1 GPU node with 2 H100 GPUs for 2 hours
salloc --partition=odin --nodes=1 --gres=gpu:2 --time=02:00:00
# Allocate all 8 GPUs on one H100 node
salloc --partition=odin --nodes=1 --gres=gpu:8 --time=04:00:00
# Allocate A10G for inference
salloc --partition=gpu-inferencing --nodes=1 --gres=gpu:1 --time=01:00:00
Interactive Job with Specific Resources
# CPUs only
salloc --partition=cpu --nodes=1 --cpus-per-task=16 --mem=32G --time=02:00:00
# GPUs with CPU count
salloc --partition=odin --nodes=1 --gres=gpu:2 --cpus-per-task=48 --mem=500G --time=04:00:00
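If your site's salloc configuration leaves you with a shell on the login node rather than the compute node, srun with a pseudo-terminal is a common alternative for getting an interactive shell (a sketch; the same resource flags apply):
# Request resources and open a bash shell directly on the allocated node
srun --partition=cpu --nodes=1 --cpus-per-task=16 --mem=32G --time=02:00:00 --pty bash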
SSH to a Compute Node
Once your interactive job is allocated, you have a shell on the compute node. However, if you need to SSH to a node where a job is already running:
Get the Node Name from Your Job
From your login node (login1 or login2), run:
# View your running jobs
squeue -u $USER
Output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 odin train-job user R 5:32 1 p5-node-01
The node name is in the NODELIST column.
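To grab the node name without reading the table by eye, squeue can print just the node list (a minimal sketch; substitute your own job ID):
# Print only the node list for a specific job, with no header row
squeue -j 12345 -h -o "%N"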
SSH from Login Node to Compute Node
# From login node (login1 or login2), SSH to the compute node
ssh p5-node-01
# Or using FQDN
ssh p5-node-01.odin.cluster.local
Note: Compute nodes are not reachable directly from your laptop. SSH to a login node first, then hop from the login node to the compute node.
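If you connect often, a ProxyJump entry in your laptop's SSH config makes the two-hop connection transparent. This is a sketch that assumes your SSH Setup already defines a working login1 host entry; the username below is a placeholder:
# Append a jump-host block to your laptop's SSH config (illustrative values)
cat >> ~/.ssh/config << 'EOF'
# Route compute-node connections through the login node
# "your-username" is a placeholder for your cluster username
Host p5-node-* c7i-node-* g5-node-*
    ProxyJump login1
    User your-username
EOF
With this in place, ssh p5-node-01 from your laptop hops through login1 automatically (the node still has to be allocated to one of your jobs).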
Verify GPU Allocation (if applicable)
Once logged into the compute node:
# Check which GPUs are allocated to your job
echo $CUDA_VISIBLE_DEVICES
# Output: 0,1 (if you requested 2 GPUs)
# Verify GPU visibility
nvidia-smi
# See detailed info
nvidia-smi -L
The CUDA_VISIBLE_DEVICES environment variable is automatically set by SLURM and restricts which GPUs your process can see and use.
Connect via VS Code Remote
VS Code can connect directly to compute nodes, allowing you to edit code and run terminals directly on the allocated resources.
Prerequisites
- VS Code with Remote - SSH extension
- SSH key configured (see SSH Setup)
- Active SLURM job allocation (salloc)
Connect to Compute Node from Login Node
- On your laptop/local machine: Open VS Code
- Open the Command Palette (Cmd+Shift+P on Mac, Ctrl+Shift+P on Linux/Windows)
- Type: “Remote-SSH: Connect to Host”
- Enter the SSH connection string: login1 (to jump through the login node)
- In the remote terminal on login1, verify your job is running: squeue -u $USER # Find the node name, e.g., p5-node-01
- Open a new remote window from the VS Code command palette again
- Type: “Remote-SSH: Connect to Host”
- Enter: p5-node-01@login1 or p5-node-01
- Select Linux as the platform
- Wait for VS Code to install remote server (first time only)
VS Code now runs directly on the compute node with full access to allocated resources.
How Resources Are Bounded
When you allocate resources with salloc, SLURM enforces hard limits on what your session can access:
GPU Bounding (Most Common Case)
Example: You allocate 2 GPUs
salloc --partition=odin --nodes=1 --gres=gpu:2 --time=04:00:00
SLURM does the following:
- Sets CUDA_VISIBLE_DEVICES=0,1 — Only GPUs 0 and 1 are visible to your process
- GPU Isolation — Even if the node has 8 H100 GPUs, processes in your job can only see and use GPUs 0 and 1
- Memory Enforcement — SLURM tracks GPU memory usage per job and will kill processes exceeding their allocation
Verification in VS Code terminal:
nvidia-smi
# Output shows ONLY the 2 GPUs you requested
# GPU 0: ...
# GPU 1: ...
# (GPUs 2-7 are NOT listed)
echo $CUDA_VISIBLE_DEVICES
# Output: 0,1
What happens if you try to use more GPUs:
# In VS Code terminal, try this Python code
import torch
print(torch.cuda.device_count()) # Returns: 2 (only your 2 GPUs)
print(torch.cuda.is_available()) # Returns: True
# Trying to access GPU 2 will fail
torch.cuda.set_device(2) # ERROR: Invalid device ordinal
CPU Bounding
Example: You allocate 8 CPUs
salloc --partition=cpu --nodes=1 --cpus-per-task=8 --mem=32G --time=02:00:00
SLURM sets:
- CPU Affinity — Processes are pinned to the 8 allocated CPUs
- Memory Limit — Your job cannot use more than 32GB RAM (processes killed if exceeded)
- Task Accounting — SLURM tracks all processes under your job
Verification in VS Code:
# See your job's CPU and memory limits
scontrol show job $SLURM_JOB_ID
# Check CPU affinity
taskset -cp $$ # Shows which CPUs your shell is bound to
# Check memory usage
free -h # System shows total, but your job is limited to 32GB
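Because free reports the whole node, tools that respect CPU affinity give a truer picture of what your processes can actually use (a minimal sketch; assumes python3 is installed on the node):
# nproc honours the affinity mask SLURM applied, so it reports your allocation
nproc
# The same count from Python
python3 -c "import os; print(len(os.sched_getaffinity(0)))"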
Mixed Resources (Common GPU Use Case)
Example: You allocate 2 GPUs + 48 CPUs + 500GB memory
salloc --partition=odin --nodes=1 --gres=gpu:2 --cpus-per-task=48 --mem=500G --time=04:00:00
All three are bounded:
- GPUs: CUDA_VISIBLE_DEVICES=0,1 (only 2 of 8 H100s)
- CPUs: 48 CPU cores available (out of 192 on p5.48xlarge)
- Memory: 500GB cap (out of 1.9TB on the node)
In VS Code Remote
When you connect VS Code to the compute node within your salloc session:
- VS Code process inherits SLURM limits — The VS Code remote server process runs under your SLURM job
- All integrated terminals see the same bounds — Any terminal you open in VS Code is bounded by your allocation
- Debugging respects limits — If you debug a GPU training script, it can only use your 2 allocated GPUs
- Background tasks are bounded — Any task you run from VS Code is subject to the same resource limits
Example workflow:
# 1. From login node: Allocate 2 H100 GPUs for 4 hours
salloc --partition=odin --nodes=1 --gres=gpu:2 --time=04:00:00
# Output: Granted job allocation 54321
# Output: Nodes p5-node-02 are ready
# 2. In VS Code: Connect to p5-node-02
# (Remote-SSH: Connect to Host → p5-node-02)
# 3. In VS Code integrated terminal, verify GPUs:
nvidia-smi
# Shows ONLY 2 H100 GPUs
# 4. Create training script
cat > train.py << 'EOF'
import torch
import torch.nn as nn
device = torch.device('cuda:0') # Use GPU 0
model = nn.Linear(1000, 1000).to(device)
# This works - the model itself runs on GPU 0; both allocated GPUs are
# visible, and you would use DataParallel/DDP to train across both
print(f"Using {torch.cuda.device_count()} GPUs")
EOF
# 5. Run in VS Code terminal:
python train.py
# Output: Using 2 GPUs
Resource Enforcement & Limits
SLURM doesn’t just suggest limits — it actively enforces them:
| Resource | What Happens When Exceeded |
|---|---|
| GPUs | Processes cannot access GPUs beyond CUDA_VISIBLE_DEVICES |
| Memory | Job is killed (OOMKilled) |
| Time Limit | Job is automatically cancelled at time limit |
| CPUs | Processes are confined to the allocated cores; they are not killed, but they contend for fewer CPUs and run slower |
Example: Running out of memory
# You allocated 32GB
salloc --partition=cpu --nodes=1 --mem=32G --time=01:00:00
# In VS Code terminal, try to allocate and touch 50GB:
python -c "import numpy as np; x = np.ones((50 * 1024**3,), dtype=np.uint8)"
# Result: Job killed by SLURM
# salloc: Job 12345 exceeded 32GB memory limit
# salloc: Job step aborted
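If a job disappears and you are unsure why, SLURM's accounting records usually show the final state and memory high-water mark (a sketch; this assumes job accounting is enabled on the cluster, and the job ID is an example):
# Compare requested memory against the peak actually used
sacct -j 12345 --format=JobID,State,ReqMem,MaxRSS,Elapsed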
Ending Your Session
Release Interactive Allocation
# If you used salloc, exit the shell
exit
# Or cancel the job
scancel <job-id>
The job ends, the node becomes idle, and if no other jobs use it, the EC2 instance will shut down.
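A quick way to confirm the allocation is really gone (a minimal sketch using a node name from the examples above):
# No rows for your user means no active allocations remain
squeue -u $USER
# The node should return to idle, and later power down if nothing else uses it
sinfo -n c7i-node-01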
Exit VS Code Remote Session
In VS Code, click the remote status indicator (bottom-left corner) and select “Close Remote Connection”. This doesn’t cancel your SLURM job — you must exit or scancel from the terminal.
Troubleshooting
“Permission denied” when SSH to node
Problem: SSH connection rejected
Solution: Verify you have SSH access configured (see SSH Setup)
Node name not found / DNS error
Problem: ssh: Could not resolve hostname
Solution: The node doesn’t have a physical instance yet. Check node status:
sinfo -N -l
scontrol show node <node-name>
If the node shows down state, it may not have been allocated yet. Submit a job first.
Can’t SSH — node is idle
Problem: Node shows in sinfo but you can’t connect
Solution: Idle nodes without jobs may not have active EC2 instances. You must submit a job (or interactive salloc) to bring the instance online:
salloc --partition=<partition> --nodes=1 --time=01:00:00
GPU not visible in VS Code session
Problem: nvidia-smi shows no GPUs
Solution: Verify SLURM allocated GPUs:
echo $CUDA_VISIBLE_DEVICES
scontrol show job $SLURM_JOB_ID
Make sure your job requested GPUs with --gres=gpu:N.
VS Code connection times out
Problem: Remote connection fails after several seconds
Solution:
- Ensure you can SSH manually to the node first
- Check that the node has internet access
- Verify SSH key is correctly configured
- Check VS Code Remote SSH logs in the Output panel
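If the manual SSH test itself fails, verbose output usually shows which step (DNS, key, or authentication) is the problem (a sketch; run it from the login node):
# Verbose SSH prints each connection and authentication step
ssh -vv p5-node-01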
Related Documentation
- SLURM Job Management - Job submission guide
- GPU Jobs - Detailed GPU job examples
- SSH Setup - SSH configuration
- Cluster Access - General cluster access