Notifications Overview

The QCS HPC cluster provides automated notifications through Slack to keep you informed about important cluster events. All notifications are sent to the #qcs-infra-notification channel.

Channel Access

The #qcs-infra-notification Slack channel is private. To join the channel and receive notifications:

  1. Contact the QCS HPC Admin or QCS HPC Owner
  2. Request access to the #qcs-infra-notification channel
  3. Once added, you’ll receive all cluster notifications

Notification Types

1. Node Creation Events

Notifications are sent when compute nodes are created as part of auto-scaling operations.

Trigger: ParallelCluster auto-scaling creates new compute nodes

Information Includes:

  • Node name and type (CPU, GPU, etc.)
  • Instance ID
  • IP address
  • Partition assigned
  • Status (CONFIGURING, IDLE, etc.)

2. Job Submission and Completion Events

Notifications track the lifecycle of Slurm jobs submitted to the cluster, from submission through completion.

Job Submission:

  • Job ID
  • Job name
  • Submitting user
  • Partition
  • Resource requirements (nodes, CPUs, GPUs)
  • Status: SUBMITTED

Job Completion:

  • Job ID
  • Completion status (COMPLETED, FAILED, CANCELLED, TIMEOUT)
  • Execution time
  • Exit code
  • Associated compute nodes
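
If you want to pick out the completion events that deserve a closer look, the fields listed above are enough to do it. The following is a minimal sketch, not the cluster's own tooling: the JobCompletion class, its field names, and the needs_attention helper are hypothetical names chosen for illustration, and only the four status values come from this page.

```python
from dataclasses import dataclass
from typing import List

# Completion statuses listed in this document.
KNOWN_STATUSES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


@dataclass
class JobCompletion:
    """Hypothetical container for the fields of a job completion event."""
    job_id: int
    status: str
    run_seconds: int
    exit_code: int
    nodes: List[str]


def needs_attention(job: JobCompletion) -> bool:
    """Return True when a completion event likely needs a follow-up."""
    if job.status not in KNOWN_STATUSES:
        # An unrecognized status is treated as suspicious (assumption).
        return True
    # Anything other than a clean COMPLETED with exit code 0 is flagged.
    return job.status != "COMPLETED" or job.exit_code != 0


ok_job = JobCompletion(125, "COMPLETED", 45 * 60, 0,
                       ["gpu-inferencing-dy-gpu-inferencing-compute-2"])
timed_out_job = JobCompletion(126, "TIMEOUT", 3600, 0, ["cpu-dy-cpu-compute-5"])
```

Here needs_attention(ok_job) is False, while the timed-out job is flagged even though its exit code is 0.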

3. CloudWatch Alarms

Automated alarms monitor the health and performance of the cluster infrastructure.

EC2 Instance Alarms:

  • Instance status checks
  • CPU utilization thresholds
  • Network connectivity issues
  • Storage capacity warnings

ParallelCluster Alarms:

  • Cluster creation/deletion events
  • Auto-scaling activities
  • Node initialization failures
  • Configuration errors
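
For readers curious what a CPU utilization alarm looks like in practice, the sketch below builds the arguments for the CloudWatch put_metric_alarm API. It is an illustration under assumptions, not the cluster's actual configuration: the alarm name, the 90% threshold, the evaluation window, and the SNS topic ARN are all placeholder values, and routing the alarm to Slack through an SNS topic is assumed rather than confirmed by this page.

```python
def cpu_alarm_params(instance_id: str, sns_topic_arn: str,
                     threshold: float = 90.0) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The threshold and timing values are illustrative assumptions.
    """
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # seconds per evaluation period
        "EvaluationPeriods": 2,    # two consecutive periods must breach
        "Threshold": threshold,    # percent CPU
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # placeholder topic ARN
    }


params = cpu_alarm_params(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:example-topic",
)

# With boto3 installed and AWS credentials configured, the alarm could be
# created like this (not executed here):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```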

Notification Examples

Node Creation

🔔 Node Created
Cluster: odin-rnd-us
Node: cpu-dy-cpu-compute-5
Type: t3.xlarge
Instance ID: i-0123456789abcdef0
IP Address: 10.0.50.25
Partition: cpu
Status: CONFIGURING
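
The layout above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the event arrives as a dictionary; the key names, the render_node_created and slack_payload helpers, and the use of a Slack incoming webhook style JSON body are illustrative assumptions, not a description of how the cluster actually sends its messages.

```python
import json


def render_node_created(node: dict) -> str:
    """Render a node-creation event in the layout of the example above."""
    lines = [
        "🔔 Node Created",
        f"Cluster: {node['cluster']}",
        f"Node: {node['name']}",
        f"Type: {node['instance_type']}",
        f"Instance ID: {node['instance_id']}",
        f"IP Address: {node['ip_address']}",
        f"Partition: {node['partition']}",
        f"Status: {node['status']}",
    ]
    return "\n".join(lines)


def slack_payload(text: str) -> str:
    """Wrap text in the JSON body format used by Slack incoming webhooks."""
    return json.dumps({"text": text}, ensure_ascii=False)


# Values copied from the example notification; the key names are hypothetical.
example_node = {
    "cluster": "odin-rnd-us",
    "name": "cpu-dy-cpu-compute-5",
    "instance_type": "t3.xlarge",
    "instance_id": "i-0123456789abcdef0",
    "ip_address": "10.0.50.25",
    "partition": "cpu",
    "status": "CONFIGURING",
}

message = render_node_created(example_node)
payload = slack_payload(message)
```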

Job Submitted

📋 Job Submitted
Job ID: 125
Job Name: my-analysis
User: username
Partition: gpu-inferencing
CPUs: 8
GPUs: 1
Nodes: 1
Status: SUBMITTED

Job Completed

✅ Job Completed
Job ID: 125
Job Name: my-analysis
Status: COMPLETED
Run Time: 45 minutes
Exit Code: 0
Nodes: gpu-inferencing-dy-gpu-inferencing-compute-2
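
Because each notification is a title line followed by "Key: Value" lines, it is straightforward to turn one into structured data, for example to filter messages in your own scripts. The sketch below assumes the exact layout shown in these examples; parse_notification is a hypothetical helper, not part of any cluster tooling.

```python
def parse_notification(text: str):
    """Split a notification into its title and a dict of its fields."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    # The first line is "<emoji> <title>"; drop the leading emoji token.
    _, _, title = lines[0].partition(" ")
    fields = {}
    for line in lines[1:]:
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that are not "Key: Value"
            fields[key] = value
    return title, fields


sample = """
✅ Job Completed
Job ID: 125
Job Name: my-analysis
Status: COMPLETED
Run Time: 45 minutes
Exit Code: 0
Nodes: gpu-inferencing-dy-gpu-inferencing-compute-2
"""

title, fields = parse_notification(sample)
```

All values come back as strings, so convert fields such as "Job ID" or "Exit Code" with int() if you need numbers.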

Notification Settings

Notifications are configured automatically and cannot be disabled at the cluster level. To reduce or stop the notifications you personally receive, use any of the following options:

  1. Mute the #qcs-infra-notification channel in Slack
  2. Adjust your Slack notification preferences for that channel
  3. Contact QCS HPC Admin if you need to be removed from the channel

Support

For questions about notifications or to report issues:

  • QCS HPC Admin: [contact information]
  • Slack: Mention @qcs-hpc-admins in #qcs-infra-notification
  • Email: [support email if available]