Compute Node Access via SSH
This guide explains how to connect to compute nodes in the Odin cluster. All compute nodes are dynamically provisioned — they don’t have persistent instances until you submit a job requesting them.
Understanding Dynamic Nodes
The Odin cluster uses auto-scaling to reduce costs:
- Nodes are created on-demand when you submit a job requesting resources
- Nodes automatically shut down after they become idle
- First job submission may take 2-5 minutes while the node instance starts
- Node DNS names are automatically registered when instances are assigned
This means you cannot SSH to a compute node unless:
- A job has been submitted that requests those compute resources
- The node has been allocated by SLURM (assigned a physical instance)
- Your job is currently running or pending on that node
Check Node Status
View All Nodes and Their Status
sinfo
Output example:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu* up infinite 10 idle c7i-node-[01-10]
gpu-inferencing up infinite 5 idle g5-node-[01-05]
odin up infinite 2 idle p5-node-[01-02]
albus up infinite 2 idle p5-node-[03-04]
- idle: Nodes exist but no jobs running (may or may not have physical instance)
- allocated: Node is in use by a running job
- down: Node is offline
Check if a Node Has a Physical Instance
To determine if a compute node currently has a running EC2 instance:
# SSH to a login node first
ssh login1
# Check node details using scontrol
scontrol show node c7i-node-01
Look for these indicators in the output:
NodeName=c7i-node-01
State=idle
RealMemory=61440
AllocMemory=0
BootTime=2025-01-15T12:34:56
Key fields:
- BootTime: When the instance started (if recent, instance is running; if old, instance is likely shut down)
- RealMemory: Total memory (only populated if instance is running)
- State: Current state (idle/allocated/down)
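If you only need a couple of these fields, filtering the scontrol output with grep is usually enough (a minimal sketch using the example node name above):
# Show just the state, memory, and boot time lines for one node
scontrol show node c7i-node-01 | grep -E 'State=|RealMemory=|BootTime='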
Quick check for all nodes:
# Show brief status
sinfo -N -l
# Show only allocated nodes (have jobs running)
sinfo -t alloc
# Show nodes that are down or drained, with the reason
sinfo -R
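For a compact one-line-per-node summary, sinfo's format string works well (a sketch; the format codes below are standard sinfo specifiers):
# Name, state, CPUs (allocated/idle/other/total), and memory per node
sinfo -N -o "%N %T %C %m"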
Submit an Interactive Job to Allocate a Node
To allocate a compute node and SSH to it, submit an interactive job using salloc:
Basic Interactive Job (CPU)
# Allocate 1 CPU node for 1 hour
salloc --partition=cpu --nodes=1 --time=01:00:00
Output:
salloc: Granted job allocation 12345
salloc: Waiting for node configuration
salloc: Nodes c7i-node-01 are ready for job
Your prompt is now a shell on the compute node, and you have an active job allocation.
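A quick sanity check confirms you are inside the allocation (a minimal sketch; SLURM exports these variables automatically in the salloc session):
# hostname should print the compute node's name, not the login node's
hostname
echo "Job: $SLURM_JOB_ID  Nodes: $SLURM_JOB_NODELIST"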
Interactive Job with GPUs
# Allocate 1 GPU node with 2 H100 GPUs for 2 hours
salloc --partition=odin --nodes=1 --gres=gpu:2 --time=02:00:00
# Allocate all 8 GPUs on one H100 node
salloc --partition=odin --nodes=1 --gres=gpu:8 --time=04:00:00
# Allocate A10G for inference
salloc --partition=gpu-inferencing --nodes=1 --gres=gpu:1 --time=01:00:00
Interactive Job with Specific Resources
# CPUs only
salloc --partition=cpu --nodes=1 --cpus-per-task=16 --mem=32G --time=02:00:00
# GPUs with CPU count
salloc --partition=odin --nodes=1 --gres=gpu:2 --cpus-per-task=48 --mem=500G --time=04:00:00
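If your site's salloc configuration leaves you with a shell on the login node rather than the compute node, srun with a pseudo-terminal is a common alternative for getting an interactive shell (a sketch; the same resource flags apply):
# Request resources and open a bash shell directly on the allocated node
srun --partition=cpu --nodes=1 --cpus-per-task=16 --mem=32G --time=02:00:00 --pty bash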
SSH to a Compute Node
Once your interactive job is allocated, you have a shell on the compute node. However, if you need to SSH to a node where a job is already running:
Get the Node Name from Your Job
From your login node (login1 or login2), run:
# View your running jobs
squeue -u $USER
Output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 odin train-job user R 5:32 1 p5-node-01
The node name is in the NODELIST column.
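To grab the node name without reading the table by eye, squeue can print just the node list (a minimal sketch; substitute your own job ID):
# Print only the node list for a specific job, with no header row
squeue -j 12345 -h -o "%N"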
SSH from Login Node to Compute Node
# From login node (login1 or login2), SSH to the compute node
ssh p5-node-01
# Or using FQDN
ssh p5-node-01.odin.cluster.local
Note: Compute nodes are not reachable directly from your laptop. SSH to a login node first, then hop from the login node to the compute node.
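If you connect often, a ProxyJump entry in your laptop's SSH config makes the two-hop connection transparent. This is a sketch that assumes your SSH Setup already defines a working login1 host entry; the username below is a placeholder:
# Append a jump-host block to your laptop's SSH config (illustrative values)
cat >> ~/.ssh/config << 'EOF'
# Route compute-node connections through the login node
# "your-username" is a placeholder for your cluster username
Host p5-node-* c7i-node-* g5-node-*
    ProxyJump login1
    User your-username
EOF
With this in place, ssh p5-node-01 from your laptop hops through login1 automatically (the node still has to be allocated to one of your jobs).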
Verify GPU Allocation (if applicable)
Once logged into the compute node:
# Check which GPUs are allocated to your job
echo $CUDA_VISIBLE_DEVICES
# Output: 0,1 (if you requested 2 GPUs)
# Verify GPU visibility
nvidia-smi
# See detailed info
nvidia-smi -L
The CUDA_VISIBLE_DEVICES environment variable is automatically set by SLURM and restricts which GPUs your process can see and use.
Connect via VS Code Remote
VS Code can connect directly to compute nodes, allowing you to edit code and run terminals directly on the allocated resources.
Prerequisites
- VS Code with Remote - SSH extension
- SSH key configured (see SSH Setup)
- Active SLURM job allocation (salloc)
Connect to Compute Node from Login Node
- On your laptop/local machine: Open VS Code
- Open the Command Palette (Cmd+Shift+P on Mac, Ctrl+Shift+P on Linux/Windows)
- Type: “Remote-SSH: Connect to Host”
- Enter the SSH connection string: login1 (to jump through the login node)
- In the remote terminal on login1, verify your job is running: squeue -u $USER # Find the node name, e.g., p5-node-01
- Open a new remote window from the VS Code command palette again
- Type: “Remote-SSH: Connect to Host”
- Enter: p5-node-01@login1 or p5-node-01
- Select Linux as the platform
- Wait for VS Code to install remote server (first time only)
VS Code now runs directly on the compute node with full access to allocated resources.
How Resources Are Bounded
When you allocate resources with salloc, SLURM enforces hard limits on what your session can access:
GPU Bounding (Most Common Case)
Example: You allocate 2 GPUs
salloc --partition=odin --nodes=1 --gres=gpu:2 --time=04:00:00
SLURM does the following:
- Sets CUDA_VISIBLE_DEVICES=0,1 — Only GPUs 0 and 1 are visible to your process
- GPU Isolation — Even if the node has 8 H100 GPUs, processes in your job can only see and use GPUs 0 and 1
- Memory Enforcement — SLURM tracks GPU memory usage per job and will kill processes exceeding their allocation
Verification in VS Code terminal:
nvidia-smi
# Output shows ONLY the 2 GPUs you requested
# GPU 0: ...
# GPU 1: ...
# (GPUs 2-7 are NOT listed)
echo $CUDA_VISIBLE_DEVICES
# Output: 0,1
What happens if you try to use more GPUs:
# In VS Code terminal, try this Python code
import torch
print(torch.cuda.device_count()) # Returns: 2 (only your 2 GPUs)
print(torch.cuda.is_available()) # Returns: True
# Trying to access GPU 2 will fail
torch.cuda.set_device(2) # ERROR: Invalid device ordinal
CPU Bounding
Example: You allocate 8 CPUs
salloc --partition=cpu --nodes=1 --cpus-per-task=8 --mem=32G --time=02:00:00
SLURM sets:
- CPU Affinity — Processes are pinned to the 8 allocated CPUs
- Memory Limit — Your job cannot use more than 32GB RAM (processes killed if exceeded)
- Task Accounting — SLURM tracks all processes under your job
Verification in VS Code:
# See your job's CPU and memory limits
scontrol show job $SLURM_JOB_ID
# Check CPU affinity
taskset -cp $$ # Shows which CPUs your shell is bound to
# Check memory usage
free -h # System shows total, but your job is limited to 32GB
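Because free reports the whole node, tools that respect CPU affinity give a truer picture of what your processes can actually use (a minimal sketch; assumes python3 is installed on the node):
# nproc honours the affinity mask SLURM applied, so it reports your allocation
nproc
# The same count from Python
python3 -c "import os; print(len(os.sched_getaffinity(0)))"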
Mixed Resources (Common GPU Use Case)
Example: You allocate 2 GPUs + 48 CPUs + 500GB memory
salloc --partition=odin --nodes=1 --gres=gpu:2 --cpus-per-task=48 --mem=500G --time=04:00:00
All three are bounded:
- GPUs: CUDA_VISIBLE_DEVICES=0,1 (only 2 of 8 H100s)
- CPUs: 48 CPU cores available (out of 192 on p5.48xlarge)
- Memory: 500GB cap (out of 1.9TB on the node)
In VS Code Remote
When you connect VS Code to the compute node within your salloc session:
- VS Code process inherits SLURM limits — The VS Code remote server process runs under your SLURM job
- All integrated terminals see the same bounds — Any terminal you open in VS Code is bounded by your allocation
- Debugging respects limits — If you debug a GPU training script, it can only use your 2 allocated GPUs
- Background tasks are bounded — Any task you run from VS Code is subject to the same resource limits
Example workflow:
# 1. From login node: Allocate 2 H100 GPUs for 4 hours
salloc --partition=odin --nodes=1 --gres=gpu:2 --time=04:00:00
# Output: Granted job allocation 54321
# Output: Nodes p5-node-02 are ready
# 2. In VS Code: Connect to p5-node-02
# (Remote-SSH: Connect to Host → p5-node-02)
# 3. In VS Code integrated terminal, verify GPUs:
nvidia-smi
# Shows ONLY 2 H100 GPUs
# 4. Create training script
cat > train.py << 'EOF'
import torch
import torch.nn as nn
device = torch.device('cuda:0') # Use GPU 0
model = nn.Linear(1000, 1000).to(device)
# This works - the model itself runs on GPU 0; both allocated GPUs are
# visible, and you would use DataParallel/DDP to train across both
print(f"Using {torch.cuda.device_count()} GPUs")
EOF
# 5. Run in VS Code terminal:
python train.py
# Output: Using 2 GPUs
Resource Enforcement & Limits
SLURM doesn’t just suggest limits — it actively enforces them:
| Resource | What Happens When Exceeded |
|---|---|
| GPUs | Processes cannot access GPUs beyond CUDA_VISIBLE_DEVICES |
| Memory | Job is killed (OOMKilled) |
| Time Limit | Job is automatically cancelled at time limit |
| CPUs | Processes are confined to the allocated cores; they are not killed, but they contend for fewer CPUs and run slower |
Example: Running out of memory
# You allocated 32GB
salloc --partition=cpu --nodes=1 --mem=32G --time=01:00:00
# In VS Code terminal, try to allocate and touch 50GB:
python -c "import numpy as np; x = np.ones((50 * 1024**3,), dtype=np.uint8)"
# Result: Job killed by SLURM
# salloc: Job 12345 exceeded 32GB memory limit
# salloc: Job step aborted
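If a job disappears and you are unsure why, SLURM's accounting records usually show the final state and memory high-water mark (a sketch; this assumes job accounting is enabled on the cluster, and the job ID is an example):
# Compare requested memory against the peak actually used
sacct -j 12345 --format=JobID,State,ReqMem,MaxRSS,Elapsed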
Ending Your Session
Release Interactive Allocation
# If you used salloc, exit the shell
exit
# Or cancel the job
scancel <job-id>
The job ends, the node becomes idle, and if no other jobs use it, the EC2 instance will shut down.
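A quick way to confirm the allocation is really gone (a minimal sketch using a node name from the examples above):
# No rows for your user means no active allocations remain
squeue -u $USER
# The node should return to idle, and later power down if nothing else uses it
sinfo -n c7i-node-01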
Exit VS Code Remote Session
In VS Code, click the remote status indicator (bottom-left corner) and select “Close Remote Connection”. This doesn’t cancel your SLURM job — you must exit or scancel from the terminal.
Troubleshooting
“Permission denied” when SSH to node
Problem: SSH connection rejected
Solution: Verify you have SSH access configured (see SSH Setup)
Node name not found / DNS error
Problem: ssh: Could not resolve hostname
Solution: The node doesn’t have a physical instance yet. Check node status:
sinfo -N -l
scontrol show node <node-name>
If the node shows down state, it may not have been allocated yet. Submit a job first.
Can’t SSH — node is idle
Problem: Node shows in sinfo but you can’t connect
Solution: Idle nodes without jobs may not have active EC2 instances. You must submit a job (or interactive salloc) to bring the instance online:
salloc --partition=<partition> --nodes=1 --time=01:00:00
GPU not visible in VS Code session
Problem: nvidia-smi shows no GPUs
Solution: Verify SLURM allocated GPUs:
echo $CUDA_VISIBLE_DEVICES
scontrol show job $SLURM_JOB_ID
Make sure your job requested GPUs with --gres=gpu:N.
VS Code connection times out
Problem: Remote connection fails after several seconds
Solution:
- Ensure you can SSH manually to the node first
- Check that the node has internet access
- Verify SSH key is correctly configured
- Check VS Code Remote SSH logs in the Output panel
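If the manual SSH test itself fails, verbose output usually shows which step (DNS, key, or authentication) is the problem (a sketch; run it from the login node):
# Verbose SSH prints each connection and authentication step
ssh -vv p5-node-01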
Related Documentation
- SLURM Job Management - Job submission guide
- GPU Jobs - Detailed GPU job examples
- SSH Setup - SSH configuration
- Cluster Access - General cluster access