Our GPFS global filesystem needs to be updated, which will make the entire CLAIX HPC system unavailable. Please note the following:
- User access to the HPC system through login nodes, HPC JupyterHub, or any other connection will not be possible during the maintenance.
- No Slurm jobs or filesystem-dependent tasks will be able to run during the maintenance.
- Before the maintenance, Slurm will only start jobs that are guaranteed to finish before the maintenance begins; any still-running jobs must finish by then or may be terminated.
- Nodes might therefore remain empty in the lead-up to the maintenance, as Slurm drains user jobs from the nodes.
- Waiting times before and after the maintenance might be higher than usual, as nodes are emptied beforehand and the queue of waiting jobs grows afterwards.
- Files in your personal or project directories will not be available during the maintenance.
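To check whether a job can still fit before the window, you can compare the remaining time until the maintenance start with the wall time you request. A minimal sketch (the maintenance start date below is a placeholder, not the actual date; `my_job.sh` is a hypothetical job script):

```shell
#!/usr/bin/env bash
# Placeholder maintenance start; substitute the date from the announcement.
maint_start="2099-01-01 08:00"

# Seconds until the maintenance window opens (GNU date).
now=$(date +%s)
start=$(date -d "$maint_start" +%s)
hours_left=$(( (start - now) / 3600 ))
echo "Hours until maintenance: ${hours_left}"

# Request a wall time that fits within the remaining window, e.g.:
# sbatch --time=02:00:00 my_job.sh
#
# Maintenance windows are usually visible as Slurm reservations:
# scontrol show reservation
```

Jobs whose `--time` limit would overlap the maintenance reservation stay pending until after the window, which is why nodes drain beforehand.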
Unfortunately, the maintenance work has to be extended. We hope to finish as soon as possible. We apologize for the inconvenience.
We must unfortunately postpone the release of the HPC system for normal use until Wednesday. We apologize for the delay.
As part of the maintenance, a system update that was already pending due to security issues is being applied as well. However, due to the large number of nodes, the update still requires some time. The cluster will be made available as soon as possible. Unfortunately, we cannot give an exact estimate of when the updates will be finished.
All updates should be completed later this evening. We expect the cluster to be available by 10:00 a.m. tomorrow: the frontend nodes should become available earlier, ahead of the batch service, which is expected to resume by 11:00 a.m. We apologize once again for the unforeseen inconvenience.
The updates are still not completed and require additional time. We estimate that they will be finished this afternoon. The frontends are already available again.
The global maintenance tasks have been completed, and we are now putting the cluster back into operation. However, several nodes will temporarily remain under maintenance due to issues that could not yet be resolved.
Operation of most nodes has been restored. The few remaining nodes will be brought back soon.
The CLAIX-2023 copy nodes copy23-1 and copy23-2 will be reinstalled with Rocky Linux 9. During the reinstallation, the nodes will not be available.
All nodes are currently draining due to a malfunction. We are working with the vendor to resolve it.
The problem has been resolved; the cluster is back in operation.