Troubleshooting

Common issues and solutions for the Odin HPC cluster.

SSH Connection Issues

Permission Denied

Symptoms: Permission denied (publickey) when connecting.

Solutions:

  1. Verify your key permissions:
    chmod 600 ~/.ssh/id_rsa
    ls -la ~/.ssh/
    
  2. Ensure your public key is in users.yaml:
    grep YOUR_USERNAME odin/terraform/users.yaml
    
  3. Check that the user sync workflow has run

  4. Start the SSH agent and confirm your key is loaded:
    eval "$(ssh-agent -s)"
    ssh-add ~/.ssh/id_rsa
    ssh-add -l  # Should list your key
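
The permission check in step 1 can be scripted. The following is a minimal sketch, not part of the Odin tooling: the helper names file_mode and key_mode_ok are hypothetical, and it assumes either GNU or BSD/macOS stat is available.

```shell
# Print the permission bits of a file in octal (for example 600).
# Tries the GNU stat syntax first, then the BSD/macOS syntax.
file_mode() {
    stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1" 2>/dev/null
}

# Succeed only if the key has no group or other permission bits set.
key_mode_ok() {
    case "$(file_mode "$1")" in
        600|400) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example (hypothetical usage):
#   key_mode_ok ~/.ssh/id_rsa || chmod 600 ~/.ssh/id_rsa
```

OpenSSH refuses private keys that are readable by other users, so a failing check here is worth fixing before looking at users.yaml.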
    

Host Key Changed

Symptoms: WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!

Solution: If the change is expected (for example, the node was redeployed), clear the Odin-specific known hosts file:

rm ~/.ssh/known_hosts.odin
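
If you would rather keep the old entries for comparison, the file can be moved aside instead of deleted. This is a sketch under that assumption; reset_known_hosts is a hypothetical helper name and the backup naming scheme is arbitrary.

```shell
# Move a known hosts file aside instead of deleting it, so the old key
# can still be compared with the new one. Prints the backup path.
reset_known_hosts() {
    file=$1
    [ -f "$file" ] || return 0                    # nothing to clear
    backup="$file.bak.$(date +%Y%m%d%H%M%S)"
    mv "$file" "$backup" && printf '%s\n' "$backup"
}

# Example (hypothetical usage):
#   reset_known_hosts ~/.ssh/known_hosts.odin
```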

Cannot Resolve Hostname

Symptoms: Could not resolve hostname login1.odin.cluster.local

Solutions:

  1. Ensure VPN is connected

  2. Test DNS from jump host:
    ssh jump-host "nslookup login1.odin.cluster.local"
    
  3. Verify ProxyJump is configured in SSH config
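
One way to confirm step 3 is to inspect the configuration that OpenSSH actually resolves for the host, which `ssh -G` prints. The sketch below only parses that output; proxyjump_target is a hypothetical helper name.

```shell
# Read `ssh -G <host>` output on stdin and print the ProxyJump target.
# Exits non-zero if no jump host is configured for that host.
proxyjump_target() {
    awk 'tolower($1) == "proxyjump" && $2 != "none" { print $2; found = 1 }
         END { exit !found }'
}

# Example (hypothetical usage):
#   ssh -G login1.odin.cluster.local | proxyjump_target
```

If nothing is printed, the SSH config has no ProxyJump entry matching that hostname.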

Connection Timed Out

Symptoms: Connection hangs or times out.

Solutions:

  1. Check if instances are running:
    cd odin/terraform
    terraform output odin_pcluster_headnode
    
  2. Test connectivity from jump host:
    ssh jump-host "ping -c 3 login1.odin.cluster.local"
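
To decide which of the SSH subsections above applies, the error text from a failed connection can be matched against the symptoms listed in this guide. This is a rough sketch; classify_ssh_error is a hypothetical helper, and the patterns only cover the messages quoted above.

```shell
# Map an SSH error message to the matching subsection of this guide.
classify_ssh_error() {
    case "$1" in
        *"Permission denied"*)                      echo "Permission Denied" ;;
        *"REMOTE HOST IDENTIFICATION HAS CHANGED"*) echo "Host Key Changed" ;;
        *"Could not resolve hostname"*)             echo "Cannot Resolve Hostname" ;;
        *"timed out"*)                              echo "Connection Timed Out" ;;
        *)                                          echo "unknown" ;;
    esac
}

# Example (hypothetical usage): capture only stderr from a quick probe.
#   err=$(ssh -o BatchMode=yes -o ConnectTimeout=10 login1 true 2>&1 >/dev/null)
#   classify_ssh_error "$err"
```

BatchMode stops ssh from prompting, and ConnectTimeout turns a hang into an explicit timeout message.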
    

User Account Issues

User Not Found

Symptoms: id: username: no such user

Solutions:

  1. Verify user is in users.yaml
  2. Run the update-users workflow
  3. Check user sync logs in GitHub Actions

Groups Not Assigned

Symptoms: User missing from expected groups.

Solution: Verify group configuration in users.yaml:

users:
  username:
    groups: ["users", "docker"]
    group_ids: [2010, 100]
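
To see which configured groups are actually missing on a node, the expected list can be compared with the output of `id -nG`. A minimal sketch; missing_groups is a hypothetical helper name.

```shell
# Print each expected group that is absent from the actual group list.
# Both arguments are space-separated group names.
missing_groups() {
    expected=$1
    actual=" $2 "
    for g in $expected; do
        case "$actual" in
            *" $g "*) ;;                      # group is present
            *) printf '%s\n' "$g" ;;          # group is missing
        esac
    done
}

# Example (hypothetical usage):
#   missing_groups "users docker" "$(ssh login1 id -nG YOUR_USERNAME)"
```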

Storage Issues

FSx Mount Not Available

Symptoms: /mnt/odin or /mnt/qcs not accessible.

Solutions:

  1. Check mount status:
    ssh login1 "df -h | grep mnt"
    ssh login1 "mount | grep fsx"
    
  2. Verify FSx filesystem status in AWS Console
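
The mount check in step 1 can also be done without grep by matching exact mount points. This sketch assumes a Linux node where /proc/mounts is available; is_mounted is a hypothetical helper name.

```shell
# Succeed if the given path is listed as a mount point in a mounts table.
# The table defaults to /proc/mounts; another file can be passed as $2.
is_mounted() {
    awk -v path="$1" '$2 == path { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# Example (hypothetical usage, run on the login node):
#   for m in /mnt/odin /mnt/qcs; do
#       is_mounted "$m" || echo "$m is not mounted"
#   done
```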

Files Not Appearing from S3

Symptoms: Files uploaded to S3 don’t appear in FSx.

Solutions:

  1. Wait a few minutes for auto-import

  2. Check the file's HSM state under the data repository association (DRA):
    lfs hsm_state /mnt/odin/bucket-name/filename
    
  3. Force a restore of the file contents from S3 (requires admin):
    sudo lfs hsm_restore /mnt/odin/bucket-name/filename
    

Files Not Syncing to S3

Symptoms: Files written to FSx don’t appear in S3.

Solutions:

  1. Check HSM state:
    lfs hsm_state /mnt/odin/bucket-name/filename
    
  2. Force archive:
    sudo lfs hsm_archive /mnt/odin/bucket-name/filename
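
The `lfs hsm_state` output used in both S3 sections can be reduced to a single word for scripting. This sketch assumes the usual Lustre output shape, `<path>: (0x...) <flags>`, for example `(0x00000009) exists archived`; verify that against the output on your cluster. hsm_summary is a hypothetical helper name.

```shell
# Summarize one line of `lfs hsm_state` output as a single word:
#   released  archived to S3, contents not currently loaded on FSx
#   dirty     modified on FSx since the last archive
#   archived  archived copy exists and is current
#   none      no HSM flags set (not archived yet)
hsm_summary() {
    # Drop the path and the hex mask so a path containing a flag word
    # cannot be mistaken for a flag.
    flags=$(printf '%s\n' "$1" | sed 's/^.*(0x[0-9a-fA-F]*)//')
    case "$flags" in
        *released*) echo "released" ;;
        *dirty*)    echo "dirty" ;;
        *archived*) echo "archived" ;;
        *)          echo "none" ;;
    esac
}

# Example (hypothetical usage):
#   hsm_summary "$(lfs hsm_state /mnt/odin/bucket-name/filename)"
```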
    

SLURM Issues

Job Pending Forever

Symptoms: Job stays in PENDING state.

Solutions:

  1. Check job reason:
    squeue -j <job-id> -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
    
  2. Common reasons:
    • Resources - Waiting for nodes to start (normal)
    • Priority - Higher-priority jobs are ahead in the queue
    • QOSMaxJobsPerUserLimit - User job limit reached
  3. Check partition status:
    sinfo -p <partition>
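
The reason codes from step 2 can be looked up directly from the scheduler. This sketch maps only the three reasons listed above; explain_pending_reason is a hypothetical helper name, and other Slurm reason codes fall through to the default branch.

```shell
# Translate a Slurm pending reason into the explanation used in this guide.
explain_pending_reason() {
    case "$1" in
        Resources)              echo "Waiting for nodes to start (normal)" ;;
        Priority)               echo "Higher-priority jobs are ahead in the queue" ;;
        QOSMaxJobsPerUserLimit) echo "User job limit reached" ;;
        *)                      echo "Unrecognized reason: $1" ;;
    esac
}

# Example (hypothetical usage; -h drops the header, %r prints the reason):
#   explain_pending_reason "$(squeue -h -j <job-id> -o %r)"
```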
    

Job Failed Immediately

Symptoms: Job exits immediately with error.

Solutions:

  1. Check the job output and error files:
    cat job-<id>.out
    cat job-<id>.err
    
  2. Check job exit code:
    sacct -j <job-id> --format=JobID,State,ExitCode
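
The ExitCode column printed by sacct has the form `exit:signal`. The sketch below splits that value to show whether the job returned an error or was killed by a signal; describe_exit is a hypothetical helper name.

```shell
# Interpret a Slurm ExitCode value of the form "<exit>:<signal>".
describe_exit() {
    code=${1%%:*}       # part before the colon: the exit status
    signal=${1##*:}     # part after the colon: the terminating signal
    if [ "$signal" -ne 0 ]; then
        echo "terminated by signal $signal"
    elif [ "$code" -ne 0 ]; then
        echo "exited with code $code"
    else
        echo "completed successfully"
    fi
}

# Example (hypothetical usage):
#   describe_exit "1:0"
```

A non-zero signal (for example 9) usually means the job was killed from outside rather than failing on its own.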
    

Cannot Find GPUs

Symptoms: CUDA_VISIBLE_DEVICES empty or nvidia-smi fails.

Solutions:

  1. Verify --gres=gpu:N in job script
  2. Check you’re on a GPU partition
  3. Verify driver is loaded:
    nvidia-smi
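
Inside a job, a quick way to see how many GPUs were actually granted is to count the entries in CUDA_VISIBLE_DEVICES. A minimal sketch, assuming the variable holds a comma-separated device list; visible_gpu_count is a hypothetical helper name.

```shell
# Count the devices in a comma-separated list. With no argument the
# value of CUDA_VISIBLE_DEVICES is used; an empty list counts as 0.
visible_gpu_count() {
    devices=${1-${CUDA_VISIBLE_DEVICES-}}
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    printf '%s\n' "$devices" | awk -F, '{ print NF }'
}

# Example (hypothetical usage inside a job script):
#   echo "GPUs visible: $(visible_gpu_count)"
```

A count of 0 in a job that requested GPUs points back to steps 1 and 2.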
    

Data Manager Issues

RDP Connection Failed

Symptoms: Cannot connect to Windows Data Manager via RDP.

Solutions:

  1. Verify SSH tunnel is running:
    ssh -L 3389:data-manager-windows.odin.cluster.local:3389 jump-host -N
    
  2. Check that the Windows instance is reachable:
    ssh jump-host "ping -c 3 data-manager-windows.odin.cluster.local"
    
  3. Verify port forwarding:
    netstat -an | grep 3389  # Should show LISTEN
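
The port check in step 3 can be made stricter than a plain grep, which also matches ports such as 33890 or established connections. This sketch handles both the Linux (`addr:port`) and macOS (`addr.port`) netstat layouts; port_listening is a hypothetical helper name.

```shell
# Read `netstat -an` output on stdin and succeed if some socket is in
# LISTEN state on the given port.
port_listening() {
    awk -v port="$1" '
        /LISTEN/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ ("[.:]" port "$")) found = 1
        }
        END { exit !found }'
}

# Example (hypothetical usage):
#   netstat -an | port_listening 3389 || echo "tunnel is not listening"
```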
    

SMB Share Access Denied

Symptoms: Cannot mount Samba shares from Windows.

Solutions:

  1. Verify credentials:
    aws secretsmanager get-secret-value \
      --secret-id odin/samba-users/YOUR_USERNAME \
      --query 'SecretString' --output text | jq -r '.password'
    
  2. Check Samba service:
    ssh data-manager-linux "sudo systemctl status smbd"
    
  3. Verify user exists in Samba:
    ssh data-manager-linux "sudo pdbedit -L"
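
The `pdbedit -L` listing prints one `username:uid:full name` entry per line, so step 3 can be narrowed to a single user. A minimal sketch; samba_user_listed is a hypothetical helper name.

```shell
# Read `pdbedit -L` output on stdin and succeed if the user is listed.
samba_user_listed() {
    awk -F: -v user="$1" '$1 == user { found = 1 } END { exit !found }'
}

# Example (hypothetical usage):
#   ssh data-manager-linux "sudo pdbedit -L" | samba_user_listed YOUR_USERNAME \
#       || echo "user is not in the Samba database"
```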
    

Getting Help

If issues persist:

  1. Check terraform outputs:
    cd odin/terraform
    terraform output
    
  2. Review infrastructure status in AWS Console

  3. Contact the Odin infrastructure team on Slack: #qcs-infra-notification