RWTH High Performance Computing (HPC)
You can find more information about the service in our documentation portal.
[CLAIX-2025] Issues with respect to reachability
The CLAIX-2025 fabric is currently experiencing reachability issues. We are working to identify the root cause.
Our hardware vendor has identified an issue that needs further analysis.
[CLAIX-2025] Large CDU active - GPU restrictions will be removed soon
Due to limited cooling capacity, 12 GPU nodes were excluded from batch operation. Now that the large CDU has been installed and activated today, these restrictions can be lifted soon.
Recently expired reports
RegApp maintenance
The RegApp will be briefly unavailable due to system changes.
Login to the HPC and connected services will be unavailable.
[CLAIX-2025] Gateway Reconfiguration
To increase bandwidth and redundancy, the gateways to the CLAIX-2025 HPC fabric must be reconfigured.
Since the IP addresses will change, all fabric-only nodes will be temporarily unreachable. Batch operation will therefore be interrupted as well.
The dialog nodes are still reachable via Ethernet.
We are updating the configuration now. Please note that a disruption will occur at short notice.
The reconfiguration was fixed overnight, and operation resumed immediately.
[CLAIX-2025] Installation of the Large CDU
The large CDU will be installed to replace the temporarily installed small CDUs. During the installation, the load on the cluster must be reduced. Disruptions to test operation cannot be ruled out.
The large CDU cannot be connected at the moment. The maintenance has to be postponed.
[CLAIX-2025] Fabric Reboot
Attention
The test operation of CLAIX-2025 must be temporarily suspended.
ALL NODES WILL BE SET TO DOWN DUE TO A REQUIRED FABRIC REBOOT
Due to a switch malfunction, the affected switch must be removed from the CLAIX-2025 fabric and the fabric rebooted to restore stable operation. A replacement is not possible at the moment. Once the switch has been removed, the following nodes cannot be used in the batch system until further notice:
i25s[0011-0022],n25l[0001-0040],n25t[0001-0004]
The cluster is available again for test operation. The aforementioned nodes remain unavailable until further notice.
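To check whether a particular node is on the unavailable list in the notice above, the bracketed ranges can be expanded into individual node names. The following is a minimal sketch, assuming the list follows the bracketed range notation common in cluster batch systems; the helper name `expand_hostlist` is hypothetical and not part of any CLAIX tooling.

```python
import re


def expand_hostlist(hostlist: str) -> list[str]:
    """Expand entries such as 'n25t[0001-0004]' into individual node names."""
    nodes = []
    # Split on commas that are not inside square brackets.
    for entry in re.split(r",(?![^\[]*\])", hostlist):
        match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", entry)
        if match is None:
            nodes.append(entry)  # plain node name without a range
            continue
        prefix, ranges = match.groups()
        for part in ranges.split(","):
            first, _, last = part.partition("-")
            last = last or first  # single index such as [0007]
            width = len(first)    # preserve the zero padding
            for index in range(int(first), int(last) + 1):
                nodes.append(f"{prefix}{index:0{width}d}")
    return nodes


# Node list copied from the notice above.
affected = expand_hostlist("i25s[0011-0022],n25l[0001-0040],n25t[0001-0004]")
print(len(affected))              # 56
print(affected[0], affected[-1])  # i25s0011 n25t0004
```

A membership test such as `"n25l0017" in affected` then tells you whether a given node is among those removed from the batch system.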
[CLAIX-2025] Switch replacement
One replacement switch was faulty and must be replaced again. At least nodes n25s0001..0064 are expected to be affected and cannot be used during the maintenance.
The maintenance has to be extended to all batch nodes, since all switches need to be rebooted and a diagnosis is required after the replacement. The test operation of CLAIX-2025 will be suspended until the end of the maintenance.
The switch replacement is finished, and the fabric seems to be stable.
[Jupyterhub] Reboot required
To mitigate a security issue, the Jupyterhub will be rebooted at 08:45 CEST.