Troubleshooting
Common issues and solutions for the Odin HPC cluster.
SSH Connection Issues
Permission Denied
Symptoms: Permission denied (publickey) when connecting.
Solutions:
- Verify your key permissions:
chmod 600 ~/.ssh/id_rsa
ls -la ~/.ssh/
- Ensure your public key is in users.yaml:
grep YOUR_USERNAME odin/terraform/users.yaml
- Check that the user sync workflow has been run
- Verify SSH agent is running:
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa
ssh-add -l # Should list your key
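OpenSSH ignores a private key that is readable by group or others, so a quick mode check can rule that cause out first. The following is a minimal sketch; the helper name `key_perms_ok` is hypothetical, and it assumes GNU or BSD `stat` is available.

```shell
# Hypothetical helper: succeed only if the key file grants no access to
# group or others (for example mode 600 or 400).
key_perms_ok() {
    key_file=$1
    # GNU stat uses -c '%a'; BSD/macOS stat uses -f '%Lp'.
    mode=$(stat -c '%a' "$key_file" 2>/dev/null ||
           stat -f '%Lp' "$key_file" 2>/dev/null) || return 2
    case "$mode" in
        *00) return 0 ;;   # group and other digits are both 0
        *)   return 1 ;;   # some group/other bit is set
    esac
}

# Example use:
#   key_perms_ok ~/.ssh/id_rsa || chmod 600 ~/.ssh/id_rsa
```

The function returns 2 when the file cannot be read by `stat` at all, which usually means the key path is wrong.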
Host Key Changed
Symptoms: WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!
Solution: Clear the Odin-specific known hosts file:
rm ~/.ssh/known_hosts.odin
Cannot Resolve Hostname
Symptoms: Could not resolve hostname login1.odin.cluster.local
Solutions:
- Ensure VPN is connected
- Test DNS from jump host:
ssh jump-host "nslookup login1.odin.cluster.local"
- Verify ProxyJump is configured in SSH config
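One way to verify the ProxyJump setting without opening the config file is to ask the SSH client which options it would apply to the host. This sketch assumes an OpenSSH client that supports `ssh -G`; the helper name `proxyjump_of` is hypothetical.

```shell
# Hypothetical helper: read `ssh -G <host>` output on stdin and print the
# configured jump host. Fails if no proxyjump option is present.
proxyjump_of() {
    awk 'tolower($1) == "proxyjump" { print $2; found = 1 }
         END { exit !found }'
}

# Example use:
#   ssh -G login1.odin.cluster.local | proxyjump_of ||
#       echo "No ProxyJump configured for this host"
```

If nothing is printed, the Host entry for the login node is probably missing or does not match the hostname you typed.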
Connection Timed Out
Symptoms: Connection hangs or times out.
Solutions:
- Check if instances are running:
cd odin/terraform
terraform output odin_pcluster_headnode
- Test connectivity from jump host:
ssh jump-host "ping -c 3 login1.odin.cluster.local"
User Account Issues
User Not Found
Symptoms: id: username: no such user
Solutions:
- Verify user is in users.yaml
- Run the update-users workflow
- Check user sync logs in GitHub Actions
Groups Not Assigned
Symptoms: User missing from expected groups.
Solution: Verify group configuration in users.yaml:
users:
  username:
    groups: ["users", "docker"]
    group_ids: [2010, 100]
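To confirm what the node actually sees, the groups from users.yaml can be compared with the output of `id -Gn`. This is a rough sketch; `missing_groups` is a hypothetical helper, and both lists are passed as space-separated strings.

```shell
# Hypothetical helper: print each expected group that is absent from the
# actual group list.
#   $1: space-separated expected groups (from users.yaml)
#   $2: space-separated actual groups (for example from `id -Gn username`)
missing_groups() {
    for want in $1; do
        found=no
        for have in $2; do
            [ "$want" = "$have" ] && found=yes
        done
        [ "$found" = yes ] || printf '%s\n' "$want"
    done
}

# Example use on a cluster node:
#   missing_groups "users docker" "$(id -Gn username)"
```

Empty output means every expected group is assigned; any printed name points at a group that the sync has not applied yet.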
Storage Issues
FSx Mount Not Available
Symptoms: /mnt/odin or /mnt/qcs not accessible.
Solutions:
- Check mount status:
ssh login1 "df -h | grep mnt"
ssh login1 "mount | grep fsx"
- Verify FSx filesystem status in AWS Console
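A `grep mnt` can match unrelated entries, so an exact mount-point comparison is sometimes clearer. The sketch below assumes a Linux node where `/proc/mounts` is readable; the helper name `is_mounted` is hypothetical.

```shell
# Hypothetical helper: succeed if the given path appears as a mount point.
#   $1: mount point to look for, for example /mnt/odin
#   $2: mount table to read (defaults to /proc/mounts)
is_mounted() {
    # Field 2 of each mount table line is the mount point.
    awk -v p="$1" '$2 == p { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# Example use on the login node:
#   for m in /mnt/odin /mnt/qcs; do
#       is_mounted "$m" || echo "$m is not mounted"
#   done
```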
Files Not Appearing from S3
Symptoms: Files uploaded to S3 don’t appear in FSx.
Solutions:
- Wait a few minutes for auto-import
- Check DRA status:
lfs hsm_state /mnt/odin/bucket-name/filename
- Force refresh (requires admin):
sudo lfs hsm_restore /mnt/odin/bucket-name/filename
Files Not Syncing to S3
Symptoms: Files written to FSx don’t appear in S3.
Solutions:
- Check HSM state:
lfs hsm_state /mnt/odin/bucket-name/filename
- Force archive:
sudo lfs hsm_archive /mnt/odin/bucket-name/filename
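The flags printed by `lfs hsm_state` indicate which of the two sync problems applies. The sketch below assumes output of the form `path: (0x0000000d) released exists archived, archive_id:1`, which is an assumption about the format on this cluster; `hsm_summary` is a hypothetical helper name.

```shell
# Hypothetical helper: classify one line of `lfs hsm_state` output into a
# single keyword.
hsm_summary() {
    # Drop the "path: (" prefix so words in the file name are not matched.
    flags=${1##*": ("}
    case "$flags" in
        *released*) echo released ;;  # contents freed from FSx, still in S3
        *dirty*)    echo dirty ;;     # changed since the last archive
        *archived*) echo archived ;;  # a copy exists in S3
        *)          echo none ;;      # no HSM flags: not exported yet
    esac
}

# Example use:
#   hsm_summary "$(lfs hsm_state /mnt/odin/bucket-name/filename)"
```

A result of `released` suggests running `hsm_restore`, while `none` or `dirty` suggests the file still needs `hsm_archive`.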
SLURM Issues
Job Pending Forever
Symptoms: Job stays in PENDING state.
Solutions:
- Check job reason:
squeue -j <job-id> -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
- Common reasons:
  - Resources - Waiting for nodes to start (normal)
  - Priority - Job is waiting behind higher-priority jobs
  - QOSMaxJobsPerUserLimit - User job limit reached
- Check partition status:
sinfo -p <partition>
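The reason column can also be translated in a script, which helps when checking many jobs at once. This is only a sketch covering the three reasons named in this section; `explain_pending_reason` is a hypothetical helper name.

```shell
# Hypothetical helper: map a squeue reason (the %R column) to a short
# explanation. Only the reasons listed in this section are covered.
explain_pending_reason() {
    # squeue prints the reason in parentheses, for example "(Resources)".
    reason=${1#\(}
    reason=${reason%\)}
    case "$reason" in
        Resources)              echo "waiting for nodes to start (normal)" ;;
        Priority)               echo "other jobs have higher priority" ;;
        QOSMaxJobsPerUserLimit) echo "user job limit reached" ;;
        *)                      echo "unrecognized reason: $reason" ;;
    esac
}

# Example use (replace <job-id> with a real job ID):
#   explain_pending_reason "$(squeue -j <job-id> -h -o %R)"
```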
Job Failed Immediately
Symptoms: Job exits immediately with error.
Solutions:
- Check job output file:
cat job-<id>.out
cat job-<id>.err
- Check job exit code:
sacct -j <job-id> --format=JobID,State,ExitCode
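The ExitCode column has the form `code:signal`, where a non-zero signal means the job was killed rather than exiting on its own. A small sketch for reading it; the helper name `explain_exit` is hypothetical.

```shell
# Hypothetical helper: interpret a sacct ExitCode value such as "1:0".
explain_exit() {
    code=${1%%:*}     # part before the colon: exit status
    signal=${1##*:}   # part after the colon: terminating signal, 0 if none
    if [ "$signal" -ne 0 ]; then
        echo "killed by signal $signal"
    elif [ "$code" -ne 0 ]; then
        echo "exited with status $code"
    else
        echo "completed successfully"
    fi
}

# Example use:
#   explain_exit "0:9"
```

Signal 9 often points at an out-of-memory kill or a cancelled job, whereas a plain non-zero status usually means the error is in the job's own output files.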
Cannot Find GPUs
Symptoms: CUDA_VISIBLE_DEVICES empty or nvidia-smi fails.
Solutions:
- Verify --gres=gpu:N in job script
- Check you’re on a GPU partition
- Verify driver is loaded:
nvidia-smi
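A small test job can check all three points in one submission. The script below is a sketch: the partition name `gpu` is a placeholder (use a GPU partition shown by `sinfo`), and `count_visible_gpus` is a hypothetical helper.

```shell
#!/bin/bash
# The partition name "gpu" below is a placeholder, not a confirmed Odin partition.
#SBATCH --job-name=gpu-check
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --output=job-%j.out

# Count the devices listed in a comma-separated CUDA_VISIBLE_DEVICES value.
count_visible_gpus() {
    [ -n "$1" ] || { echo 0; return; }
    printf '%s\n' "$1" | tr ',' '\n' | wc -l | tr -d ' '
}

echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
echo "Visible GPUs: $(count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}")"

# Only call nvidia-smi if the driver tools are installed on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi failed: driver may not be loaded"
else
    echo "nvidia-smi not found on this node"
fi
```

Submit it with `sbatch` and read the resulting `job-<id>.out`; a count of 0 suggests the GPU request or the partition is the problem rather than the driver.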
Data Manager Issues
RDP Connection Failed
Symptoms: Cannot connect to Windows Data Manager via RDP.
Solutions:
- Verify SSH tunnel is running:
ssh -L 3389:data-manager-windows.odin.cluster.local:3389 jump-host -N
- Check Windows instance status:
ssh jump-host "ping -c 3 data-manager-windows.odin.cluster.local"
- Verify port forwarding:
netstat -an | grep 3389 # Should show LISTEN
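A plain `grep 3389` also matches ports such as 33890, so a stricter filter can avoid false positives. A minimal sketch, assuming the usual Linux or macOS `netstat -an` line layout; `port_listening` is a hypothetical helper.

```shell
# Hypothetical helper: read `netstat -an` output on stdin and succeed if a
# socket is listening on exactly the given port.
port_listening() {
    # The port must follow ":" (Linux) or "." (macOS) and end at whitespace.
    grep -E "[.:]$1[[:space:]].*LISTEN" >/dev/null
}

# Example use:
#   netstat -an | port_listening 3389 && echo "tunnel is up"
```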
SMB Share Access Denied
Symptoms: Cannot mount Samba shares from Windows.
Solutions:
- Verify credentials:
aws secretsmanager get-secret-value \
  --secret-id odin/samba-users/YOUR_USERNAME \
  --query 'SecretString' --output text | jq -r '.password'
- Check Samba service:
ssh data-manager-linux "sudo systemctl status smbd"
- Verify user exists in Samba:
ssh data-manager-linux "sudo pdbedit -L"
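`pdbedit -L` prints one `user:uid:full name` line per account, so the listing can be filtered for a single user instead of being read by eye. A sketch; `samba_user_exists` is a hypothetical helper name.

```shell
# Hypothetical helper: read `pdbedit -L` output on stdin and succeed if the
# given user name appears as the first colon-separated field.
samba_user_exists() {
    awk -F: -v u="$1" '$1 == u { found = 1 } END { exit !found }'
}

# Example use:
#   ssh data-manager-linux "sudo pdbedit -L" | samba_user_exists YOUR_USERNAME ||
#       echo "User is not registered in Samba"
```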
Getting Help
If issues persist:
- Check terraform outputs:
cd odin/terraform
terraform output
- Review infrastructure status in AWS Console
- Contact the Odin infrastructure team on Slack: #qcs-infra-notification