Rechner-Cluster

More information about this service can be found in our documentation portal.

Bug with Jobs landing on _low partitions

Teilstörung
Donnerstag, 20.11.2025 12:05 - Donnerstag, 20.11.2025 17:19

We are currently experiencing issues with jobs that use the default account landing on the _low partitions.
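
A minimal sketch, assuming the standard Slurm command line tools are available on the login nodes, that lists your own jobs with their account and partition so you can spot jobs routed to a *_low partition (on older Slurm versions, replace --me with -u $USER):

```python
import subprocess

# Hedged sketch, not an official tool: show job ID, account, partition and
# state for your own jobs and flag those sitting on a *_low partition.
out = subprocess.run(
    ["squeue", "--me", "--noheader", "--format=%i %a %P %T"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    job_id, account, partition, state = line.split()
    if partition.endswith("_low"):
        print(f"job {job_id} (account {account}, {state}) landed on {partition}")
```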

20.11.2025 13:06
Updates

You might experience unusual submission errors while we fix the issue. Please re-submit affected jobs.

20.11.2025 15:56

Full system maintenance.

Wartung
Dienstag, 18.11.2025 08:00 - Dienstag, 18.11.2025 19:00

Due to various necessary maintenance works, the entire CLAIX HPC System will be unavailable.

The initial phase of the maintenance should last until 12:00, after which the filesystems and login nodes should be available again.
Jobs will not run until the full maintenance works are completed.
We aim to also upgrade the Slurm scheduler during the downtime.

Please note the following:
- User access to the HPC system through the login nodes, HPC JupyterHub or any other connection will not be possible during the initial part of the maintenance.
- No Slurm jobs will be able to run during the entire maintenance.
- Before the maintenance, Slurm will only start jobs that are guaranteed to finish before the maintenance begins; any running jobs must finish by then or may be terminated. (A short sketch for choosing a fitting time limit follows this list.)
- Nodes may therefore remain idle in the run-up to the maintenance, as Slurm drains user jobs from them.
- Waiting times before and after the maintenance may be longer than usual, as nodes are drained beforehand and the queue of waiting jobs grows afterwards.
- Files in your personal or project directories will not be available during the initial part of the maintenance.
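
A minimal sketch of the idea behind the scheduling rule above, assuming Python is available on the login nodes and using the announced start time of Tuesday, 18.11.2025, 08:00: a job whose requested wall time ends before that point can still be backfilled, everything else waits until after the maintenance.

```python
from datetime import datetime

# Hypothetical helper, not an official tool: compute the largest --time value
# (in minutes) that still fits before the announced maintenance start.
MAINTENANCE_START = datetime(2025, 11, 18, 8, 0)

def minutes_until_maintenance(now=None):
    """Whole minutes left until the maintenance window opens (0 if it has started)."""
    now = now or datetime.now()
    return max(int((MAINTENANCE_START - now).total_seconds() // 60), 0)

if __name__ == "__main__":
    limit = minutes_until_maintenance()
    # A job submitted e.g. with "sbatch --time=<limit> job.sh" can still be
    # scheduled before the downtime; larger requests will be held back.
    print(f"Request at most --time={limit} (minutes) to finish before the maintenance.")
```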

10.11.2025 14:35
Updates

The first part of the maintenance is taking longer than expected. At the moment, we cannot estimate when it will be finished.

18.11.2025 11:56

The network maintenance is still ongoing.

18.11.2025 16:01

The maintenance work was unfortunately delayed due to circumstances outside of our control. This will delay the HPC system's availability by a few hours.

18.11.2025 17:15

Due to issues during the network maintenance and a short-term failure of the storage backend of an infrastructure server required for the maintenance, the maintenance tasks were delayed and some of them had to be postponed. The cluster is operational again.

18.11.2025 19:08

RegApp Maintenance

Teilwartung
Donnerstag, 23.10.2025 14:30 - Donnerstag, 23.10.2025 15:30

The RegApp service will be temporarily unavailable due to OS updates and to prepare for upcoming changes.
Login to the HPC frontends and to perfmon.hpc.itc.rwth-aachen.de will be unavailable.
It is recommended to log in to the services beforehand.

20.10.2025 09:38

Temporary Suspension of GPU Batch Service

Teilwartung
Freitag, 17.10.2025 08:05 - Montag, 20.10.2025 06:05

Due to urgent security updates, the batch service will be temporarily suspended for all GPU nodes. After the updates are deployed, the service will be resumed. Running jobs are not affected and will continue. The dialog node login23-g-1 is not affected.

The updates are installed automatically as soon as the running jobs finish; the nodes will then become available again at short notice.

17.10.2025 08:35
Updates

The majority of jobs on the GPU nodes finished early, so the nodes could be upgraded without a noticeable interruption of the service.

20.10.2025 10:38

HOME filesystem unavailable

Störung
Donnerstag, 09.10.2025 22:15 - Freitag, 10.10.2025 09:30

The users' HOME directories were temporarily unavailable on both the login and the batch nodes. As a consequence, existing sessions trying to access these directories would hang, and new connections to the cluster could not be established. Batch jobs that started or ran during this time frame may have failed as a result. If your job crashed, please investigate its output, and in case of error messages related to files or directories underneath /home, resubmit the affected jobs. Some of the batch nodes were automatically disabled as a result and are being put back into operation as of now.
We apologize for the inconvenience this has caused.
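
A minimal sketch, assuming Python on the cluster and the default slurm-<jobid>.out naming, of how one might scan job output for the /home-related error messages mentioned above before resubmitting:

```python
import glob

# Hedged sketch, not an official tool: look through Slurm output files in the
# current directory for I/O errors that mention /home, to decide which jobs
# to resubmit. Adjust the glob pattern if your jobs write output elsewhere.
ERRORS = ("No such file or directory", "Input/output error", "Stale file handle")

for path in glob.glob("slurm-*.out"):
    with open(path, errors="replace") as fh:
        if any("/home" in line and any(err in line for err in ERRORS) for line in fh):
            print(f"{path}: possible /home-related failure, consider resubmitting this job")
```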

09.10.2025 16:45
Updates

The underlying problem has turned out to be more persistent than expected and we are consulting with the vendor for a fix. As a result, the majority of batch nodes remain out of operation until we can guarantee stable access to the filesystem. We are working to remedy the situation as soon as possible.

09.10.2025 17:25

The problem has been solved and the cluster is back in operation.

09.10.2025 20:08

The issue has re-occurred and persists at the moment. We are currently working on a solution.

10.10.2025 07:21

The issues could be resolved.

10.10.2025 09:34

Full global maintenance of the HPC CLAIX Systems

Teilwartung
Dienstag, 30.09.2025 15:35 - Freitag, 03.10.2025 19:25

Our global GPFS filesystem needs to be updated, which will make the entire CLAIX HPC system unavailable.
Please note the following:
- User access to the HPC system through login nodes, HPC JupyterHub or any other connections will not be possible during the maintenance.
- No Slurm jobs or filesystem-dependent tasks will be able to run during the maintenance.
- Before the maintenance, Slurm will only start jobs that are guaranteed to finish before the maintenance begins; any running jobs must finish by then or may be terminated.
- Nodes may therefore remain idle in the run-up to the maintenance, as Slurm drains user jobs from them.
- Waiting times before and after the maintenance may be longer than usual, as nodes are drained beforehand and the queue of waiting jobs grows afterwards.
- Files on your personal or project directories will not be available during the maintenance.

16.09.2025 15:31
Updates

Unfortunately the maintenance works will have to be extended. We hope to be done as soon as possible. We apologize for the inconvenience.

30.09.2025 15:08

We must unfortunately postpone the release of the HPC system for normal use until Wednesday.
We apologise for the delays.

30.09.2025 20:12

As part of the maintenance, a pending system update addressing security issues is being carried out as well. However, due to the large number of nodes, the update still requires some time. The cluster will be made available as soon as possible. Unfortunately, we cannot give an exact estimate of when the updates will be finished.

01.10.2025 17:09

All updates should be completed later this evening. We aim for the cluster to be available by 10:00 a.m. tomorrow: the frontend nodes should be available earlier, ahead of the batch service, which is expected to resume by 11:00 a.m.
We apologize once again for the unforeseen inconvenience.

01.10.2025 18:18

The updates are still not completed and require additional time. We estimate to be finished this afternoon. The Frontends are already available again.

02.10.2025 10:25

The global maintenance tasks have been completed, and we are putting the cluster back into operation now. However, several nodes will temporarily remain in maintenance due to issues that could not yet be solved.

02.10.2025 15:39

Operation of most of the nodes could be restored. The remaining few nodes will be processed soon.

03.10.2025 19:26

Migration to Rocky Linux 9

Teilwartung
Mittwoch, 01.10.2025 08:00 - Mittwoch, 01.10.2025 15:30

The CLAIX-2023 copy nodes copy23-1 and copy23-2 will be reinstalled with Rocky Linux 9. During the Reinstallation, the nodes will not be available.

30.09.2025 17:45

NFS disruption of the GPFS servers

Störung
Mittwoch, 17.09.2025 21:35 - Donnerstag, 18.09.2025 10:06

All nodes are currently draining due to a disruption. We are working on it with the vendor.

18.09.2025 09:16
Updates

The problem has been solved; the cluster is back in operation.

18.09.2025 10:06

$HOME and $WORK filesystems are again unavailable

Störung
Freitag, 29.08.2025 09:45 - Freitag, 29.08.2025 11:00

Due to issues with the underlying filesystem servers for $HOME and $WORK, the batch nodes are currently unavailable, and access to $HOME and $WORK on the login nodes is not possible.

29.08.2025 09:55
Updates

Access to the filesystems has been restored. We apologize for the inconvenience.

29.08.2025 11:38

$HOME and $WORK unavailable on login23-g-1

Störung
Dienstag, 26.08.2025 15:00 - Mittwoch, 27.08.2025 11:30

Due to issues with the GPFS filesystem, $HOME and $WORK are not available on login23-g-1.
The issue has been resolved.

27.08.2025 11:40

login23-1 unreachable

Teilwartung
Dienstag, 26.08.2025 08:45 - Dienstag, 26.08.2025 10:00

For troubleshooting and fault analysis of the frontend node, the dialog node login23-1 is temporarily unavailable.

26.08.2025 09:01

Reboot of CLAIX-2023 copy nodes

Teilwartung
Montag, 25.08.2025 08:00 - Montag, 25.08.2025 16:16

Due to a pending kernel update, the CLAIX-2023 copy nodes copy23-1 and copy23-2 will be rebooted on Monday, 2025-08-25.

22.08.2025 12:38
Updates

Due to pending mandatory firmware updates, the maintenance needs to be prolonged.

25.08.2025 08:36

Due to issues with respect to the firmware update, the end of the maintenance is delayed.

25.08.2025 09:52

copy23-1 is available again.

25.08.2025 12:00

The Maintenance is completed.

25.08.2025 16:16

GPU login node unavailable due to maintenance

Teilwartung
Freitag, 22.08.2025 08:00 - Freitag, 22.08.2025 15:45

Due to mandatory maintenance work, the login GPU node n23-g-1 will not be available during the maintenance.
During the maintenance, the node's firmware will be updated which takes several hours to complete.

21.08.2025 14:36
Updates

The Firmware upgrades are completed. However, there is an issue with respect to the SNC configuration. We were unable to enable the cluster-wide standard setting SNC4 on the dialog node. Further investigation is required.
Notwithstanding this, the GPU login node can be used until further notice.

22.08.2025 15:40

$HOME and $WORK filesystems are again unavailable

Störung
Dienstag, 19.08.2025 19:10 - Mittwoch, 20.08.2025 09:30

Due to issues with the underlying filesystem servers for $HOME and $WORK, the majority of batch nodes are currently unavailable, and access to $HOME and $WORK on the login nodes is not possible.

19.08.2025 19:37
Updates

The issue has been resolved and the affected machines are back in operation. We are actively working with the vendor to remedy the situation and apologize for the repeated downtimes.

20.08.2025 10:16

File System Issues

Störung
Montag, 18.08.2025 16:52 - Montag, 18.08.2025 17:59

Due to issues with the shared filesystems, all services with respect to the HPC cluster are substantially impacted. The issues are under current investigation, and we are working on a solution.

18.08.2025 17:41
Updates

On most of the nodes, the shared filesystems' function could be restored.

19.08.2025 08:12

$HOME and $WORK filesystems unavailable

Störung
Freitag, 15.08.2025 13:00 - Freitag, 15.08.2025 15:00

Due to issues with the GPFS file system, the $HOME and $WORK directories may currently be unavailable on the login nodes. Additionally, some compute nodes are temporarily not available due to these issues. Our team is working to resolve the problem as quickly as possible.

15.08.2025 14:19
Updates

The issues could be resolved.

15.08.2025 15:09

$HOME and $WORK filesystems unavailable, causing jobs to fail

Teilstörung
Mittwoch, 13.08.2025 12:45 - Mittwoch, 13.08.2025 14:47

Due to issues with the underlying filesystem servers for $HOME and $WORK (GPFS), some batch nodes are currently unavailable and the login nodes might be unstable. This may have caused running jobs to fail.
We are working to resolve the problem as quickly as possible and apologize for any inconvenience caused.
EDIT: Systems seem to be back online and operating normally.

13.08.2025 13:51

Emergency Shutdown of CLAIX-2023

Störung
Mittwoch, 30.07.2025 12:30 - Mittwoch, 30.07.2025 18:50

Due to cooling issues, the CLAIX-2023 cluster has been shut down. Accessing the cluster is currently not possible.
Please note that running jobs may have been terminated due to the shutdown.
We are actively working on identifying the root cause and resolving the issue as quickly as possible.
We apologize for any inconvenience this may cause and thank you for your understanding.

30.07.2025 12:36
Updates

The cluster is resuming operation now that the cooling system has been cleared for load. All batch nodes should be back in operation shortly.

30.07.2025 18:33

Migration of login23-3 and login23-x-1 to Rocky 9

Wartung
Montag, 28.07.2025 07:00 - Montag, 28.07.2025 18:00

The remaining Rocky 8 login nodes login23-3 and login23-x-1 will be upgraded to Rocky 9. These nodes will be unavailable during the upgrade process. Please use the other available login nodes in the meantime. The copy nodes are unaffected and will remain Rocky 8 nodes until further notice.

25.07.2025 11:15

Login fails

Störung
Dienstag, 22.07.2025 11:30 - Dienstag, 22.07.2025 11:44

Logging in via RWTH Single Sign-On is currently not possible.
Existing sessions are not affected.
We are working on resolving the problem.

22.07.2025 11:37
Updates

The disruption has been resolved.

22.07.2025 11:45

p7zip replaced with 7zip

Hinweis
Dienstag, 22.07.2025 09:30 - Dienstag, 22.07.2025 09:30

Due to security concerns, the previously installed file archiver "p7zip" (a fork of "7zip") was replaced by the original program.
p7zip was forked from 7zip two decades ago because 7zip initially lacked Unix and Linux support. Unfortunately, its development lags severely behind the upstream code, so issues are fixed only partially, if at all. Since upstream 7zip also contains several functional and performance improvements in addition to the bug fixes, overall performance should improve for users as well. Usage should not change apart from minor differences due to the diverged development history.

22.07.2025 13:22

$HOME and $WORK filesystems are again unavailable

Störung
Montag, 14.07.2025 09:45 - Montag, 14.07.2025 11:30

Due to issues with the underlying filesystem servers for $HOME and $WORK, all batch nodes are currently unavailable and the login nodes are not usable. This may have caused running jobs to fail, and no new jobs can start during the downtime.
We are working to resolve the problem as quickly as possible and apologize for any inconvenience caused.

14.07.2025 10:20
Updates

The $HOME and $WORK filesystems are back online

14.07.2025 11:45

$HOME and $WORK filesystems unavailable

Störung
Donnerstag, 10.07.2025 16:30 - Donnerstag, 10.07.2025 18:30

Due to issues with the underlying filesystem servers for $HOME and $WORK, all batch nodes are currently unavailable and the login nodes are not usable. This may have caused running jobs to fail, and no new jobs can start during the downtime.
We are working to resolve the problem as quickly as possible and apologize for any inconvenience caused.

10.07.2025 16:47
Updates

The filesystems have been properly re-mounted on all nodes and the system is back in normal operation. If your jobs have crashed due to I/O errors, please resubmit them. We again apologize for the inconvenience.

10.07.2025 19:04

Firmware Upgrade of Frontend Nodes

Teilwartung
Montag, 07.07.2025 06:00 - Montag, 07.07.2025 07:30

The Firmware of login23-3 and login23-4 must be upgraded. During the Upgrade, the nodes will not be available. Please use the other dialog nodes in the meantime.

02.07.2025 12:56
Updates

The firmware upgrades are completed.

07.07.2025 07:36

Old ticket without title

Teilstörung
Samstag, 05.07.2025 12:15 - Samstag, 05.07.2025 12:45

During the stated period, the RWTH Single Sign-On was partially disrupted. After entering the credentials, the screen keeps loading and then an Internal Server Error appears.
This affects all services that use the Single Sign-On login.

07.07.2025 11:45

User namespaces on all Rocky 9 systems deactivated

Teilstörung
Montag, 16.06.2025 14:15 - Donnerstag, 03.07.2025 18:00

Due to an open security issue we have deactivated user namespaces on all Rocky 9 systems.
This feature is mainly used by containerization software and affects the way apptainer containers will behave.
Most users should not experience any interruptions. If you experience any problems, please contact us as
usual via servicedesk@itc.rwth-aachen.de with a precise description of the features you are using.
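
For users who want to check the current state of a node themselves, a minimal sketch assuming the usual Linux sysctl layout (a value of 0 means user namespaces are disabled, as during this incident):

```python
from pathlib import Path

# Hedged sketch: read the kernel limit for user namespaces on the node you
# are logged in to; 0 means the feature is currently disabled.
sysctl = Path("/proc/sys/user/max_user_namespaces")
if sysctl.exists():
    limit = int(sysctl.read_text().strip())
    print("user namespaces enabled" if limit > 0 else "user namespaces disabled")
else:
    print("this kernel does not expose the user namespace sysctl")
```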

16.06.2025 14:26
Updates

User namespaces can be re-activated once the Rocky 9 nodes are upgraded to Rocky 9.6.
The dialog systems (cf. the other ticket) have already been upgraded. We are currently performing the upgrade to Rocky 9.6 in the background to minimize the downtime. Nodes that have already been upgraded have user namespaces enabled by default again.

25.06.2025 09:32

The remaining Rocky 9.5 nodes are planned to be updated to Rocky Linux 9.6 by 2025-07-04.

26.06.2025 15:48

Usability of Desktop Environments on CLAIX-2023 Frontend Nodes Restricted due to Security Issues

Hinweis
Freitag, 20.06.2025 09:15 - Donnerstag, 03.07.2025 12:00

Due to high-risk security issues within some components, the affected packages had to be removed from the cluster to keep it operational until the security issues can be fixed. Unfortunately, the removal breaks the desktop environments XFCE and Mate due to tight dependencies on the removed packages. Consequently, the file managers "Thunar" and "Caja" as well as some other utilities cannot be used at the moment when using a GUI login.
If a GUI login is mandatory, please use the IceWM environment instead. However, some of the GUI applications might not work as expected (see above).
The text-based/terminal usage of the frontend nodes is not affected by this temporary change.

20.06.2025 09:26
Updates

The upgraded Rocky 9 login nodes have received patches that solve the security issues. On these nodes, Mate and XFCE can be used as desktop environments again. However, the Rocky 8 login nodes (i.e., login23-3, login23-x-1) are still awaiting the respective patches. Until the patches are available, the usage of these environments remains limited.

24.06.2025 11:01

The Rocky 8 frontend nodes received the pending update and can now be used with Mate and XFCE again.

03.07.2025 13:02

Login via RWTH Single Sign-On disrupted

Störung
Donnerstag, 26.06.2025 15:45 - Donnerstag, 26.06.2025 16:16

The RWTH Single Sign-On is currently experiencing disruptions. After entering the credentials, the screen keeps loading and then an Internal Server Error appears.
This affects all services that use the Single Sign-On login.
The responsible department has been informed and is working on a fix.

26.06.2025 15:53
Updates

The disruption has been resolved.

26.06.2025 16:16

HPC JupyterHub down for maintenance

Wartung
Mittwoch, 25.06.2025 09:00 - Mittwoch, 25.06.2025 11:00

The HPC JupyterHub will be down for maintenance between 9:00 and 11:00.

24.06.2025 15:53

Upgrade of Login Nodes

Teilwartung
Montag, 23.06.2025 06:00 - Mittwoch, 25.06.2025 09:51

On Monday, June 23rd, the CLAIX-2023 Rocky 9 frontend nodes login23-1, login23-2 and login23-4 will be upgraded to Rocky Linux 9.6.
During the upgrade, these nodes will temporarily not be available. Please use the Rocky 8 frontend nodes in the meantime.

20.06.2025 13:44
Updates

login23-g-1 and login23-x-2 will be reinstalled with Rocky Linux 9.6. Until the reinstallation is finished, these nodes are temporarily unavailable.

23.06.2025 05:50

The GPU login node login23-g-1 is reinstalled and available with Rocky 9.6. The other nodes require further work.

23.06.2025 08:56

The login nodes login23-1, login23-2, login23-4 are updated and available again. login23-x-2 requires some further work.

23.06.2025 10:33

All Rocky 9-login nodes are upgraded to Rocky Linux 9.6.

25.06.2025 09:51

Job submission to some new Slurm projects/accounts fails

Teilstörung
Samstag, 07.06.2025 10:00 - Montag, 16.06.2025 14:16

Our second Slurm controller machine has lost memory modules and is currently pending maintenance.
This machine is responsible for keeping our Slurm DB in a current state.
Without it, submission to some new Slurm projects/accounts will fail.
Please open a ticket if the problem persists.

10.06.2025 14:04

login23-2 currently not available

Teilstörung
Donnerstag, 12.06.2025 17:30 - Montag, 16.06.2025 10:20

login23-2 is currently not available. Please use one of the other dialog systems.

12.06.2025 17:39
Updates

login23-2 is available again.

16.06.2025 10:22

System maintenance for the hpcwork filesystem

Teilwartung
Mittwoch, 11.06.2025 07:00 - Donnerstag, 12.06.2025 17:40

During the maintenance, the dialog systems will continue to be available
most of the time but *without access to the hpcwork directories*.
Batch jobs will not be running.

02.06.2025 14:59
Updates

Due to unexpected problems with the filesystem update, we have to prolong the maintenance until tomorrow. As of now, we presume that the problem will be solved by noon and will release the batch queues as soon as possible. We apologize for the inconvenience.

11.06.2025 16:42

All maintenance tasks have been completed and both HPCWORK and the batch queues are operational again.

12.06.2025 17:57

HPC password change unavailable

Teilstörung
Donnerstag, 12.06.2025 10:15 - Donnerstag, 12.06.2025 11:00

Due to issues with the new RegApp version, changing the HPC password may fail.

12.06.2025 10:20
Updates

The issue has been identified and a fix has been deployed.

12.06.2025 11:06

RegApp Maintenance

Teilwartung
Mittwoch, 11.06.2025 09:30 - Mittwoch, 11.06.2025 10:00

The RegApp software will be updated to the newest version. During this time, login to the frontends and perfmon may be unavailable. It is recommended to log in beforehand.

04.06.2025 14:52

cp command of hpcwork files may fail on Rocky Linux 9 systems

Teilstörung
Donnerstag, 05.06.2025 11:30 - Mittwoch, 11.06.2025 07:00

On Rocky Linux 9 systems (especially login23-1 or login23-4), copying hpcwork files with the cp command may fail with the error "No data available". Current workarounds: either use "cp --reflink=never ..." to copy files, or run the cp command on one of the Rocky Linux 8 nodes, e.g. copy23-1 or copy23-2.
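
A minimal sketch of the documented workaround, assuming Python is available; the paths are placeholders and must be replaced with your own files:

```python
import subprocess

# Hedged sketch, not an official tool: copy a file out of hpcwork on a
# Rocky Linux 9 node with reflink cloning disabled, as suggested above.
def copy_without_reflink(src: str, dst: str) -> None:
    subprocess.run(["cp", "--reflink=never", src, dst], check=True)

# Placeholder paths for illustration only.
copy_without_reflink("/hpcwork/ab123456/input.dat", "/home/ab123456/input.dat")
```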

05.06.2025 11:44
Updates

During system maintenance on 11.06. we will install a new version of the filesystem client software which will fix the problem. Until then please use "cp --reflink=never ..." to copy hpcwork files or copy files on Rocky 8 systems (e.g. copy23-1 or copy23-2).

06.06.2025 11:01

Routing via the new XWiN routers

Wartung
Dienstag, 10.06.2025 21:00 - Dienstag, 10.06.2025 22:13

During this period, routing will be switched from the current XWiN routers (Nexus 7700) to the new XWiN routers (Catalyst 9600). These routers are essential for RWTH's network connectivity. The switchover also requires migrating the DFN uplinks, which are connected redundantly to Frankfurt and Hannover, as well as the RWTH firewall to the new systems.
Within the maintenance window there will be full or partial outages of the external connectivity. All RWTH services (e.g. VPN, email, RWTHonline, RWTHmoodle) will be unavailable during this period. Services within the RWTH network may also be temporarily unreachable due to limited DNS functionality.

07.05.2025 12:47
Updates

The uplink to Frankfurt has been successfully switched over to the new system.

10.06.2025 21:07

Reconfiguration of the uplink to Hannover is starting.

10.06.2025 21:36

The uplink to Hannover has been moved to the new system.

10.06.2025 21:45

BGP v4/v6 to Frankfurt and Hannover is now functional via the new routers.

10.06.2025 21:57

A few minor follow-up tasks remain.

10.06.2025 22:03

The maintenance is complete. Traffic is now running entirely over the new routers!

10.06.2025 22:12

A problem with the connection to the physics department has been identified; it will be fixed tomorrow morning.

10.06.2025 23:50

Partial Migration of CLAIX-2023 to Rocky Linux 9

Wartung
Montag, 02.06.2025 09:00 - Montag, 02.06.2025 19:00

During the maintenance, we will migrate half of the CLAIX-2023 nodes (ML & HPC) and the entire devel partition to Rocky Linux 9. Additionally, the dialog systems login23-1, login23-x-1, and login23-x-2 will be migrated.

30.05.2025 11:14

Currently no home and work directories on login23-2

Teilstörung
Freitag, 16.05.2025 11:30 - Freitag, 23.05.2025 15:00

The home and work directories are currently not available on login23-2. Please switch to one of the other dialog systems.

16.05.2025 11:38

Swapping MFA backend of RegApp

Teilwartung
Mittwoch, 21.05.2025 15:00 - Mittwoch, 21.05.2025 16:00

We will change the MFA backend of the Regapp. Therefore, new logins will not be possible during the maintenance.

16.05.2025 15:31

Swapping MFA backend of RegApp

Wartung
Mittwoch, 14.05.2025 12:45 - Mittwoch, 14.05.2025 12:45

We will change the MFA backend of the Regapp. Therefore, new logins will not be possible during the maintenance.

09.05.2025 08:37
Updates

We have to postpone the maintenance.

14.05.2025 12:55

Cooling disrupted, emergency shutdown of CLAIX23

Störung
Montag, 12.05.2025 13:15 - Montag, 12.05.2025 16:15

We have a problem with the external cooling. We therefore have to perform an emergency shutdown of CLAIX23.

12.05.2025 13:24
Updates

We are now powering on the cluster again. The frontend (login) nodes will be available again soon. Until further notice, the batch system for CLAIX23 remains stopped. We hope to resolve the issue today.

12.05.2025 14:49

HPC JupyterHub maintenance

Wartung
Dienstag, 06.05.2025 09:00 - Dienstag, 06.05.2025 17:00

The HPC JupyterHub will be down for maintenance to update some internal software packages.

02.05.2025 10:14

Access to hpcwork directories hanging

Teilstörung
Montag, 05.05.2025 16:00 - Montag, 05.05.2025 17:15

Currently the access to the hpcwork directories may hang due to problems with the file servers. The problem is being worked on.

29.04.2025 10:02
Updates

Access to $HPCWORK should work again now.

29.04.2025 10:47

The access to the hpcwork directories may hang again.

05.05.2025 16:48

Access to $HPCWORK should work again.

06.05.2025 08:01

HPCJupyterHub Profile Installation Unavailable

Teilstörung
Freitag, 28.03.2025 13:15 - Dienstag, 29.04.2025 11:27

The installation of HPCJupyterHub Profiles is currently unavailable due to issues with our HPC container system.
A downtime of at least two weeks is expected. We apologize for the inconvenience.
Update: New Profiles can be installed now, after Apptainer changes.

28.03.2025 13:30

System Upgrade of Login Node

Teilwartung
Montag, 24.03.2025 05:00 - Donnerstag, 17.04.2025 15:25

The CLAIX-2023 login node login23-4 will be upgraded to Rocky Linux 9.5 for assisting the migration to Rocky Linux 9. During the version upgrade, the login node will not be available.

21.03.2025 11:20
Updates

The CLAIX-2023 dialog node login23-4 is currently unavailable due to testing with respect to the planned migration to Rocky Linux 9. Please use another dialog node until access to this node is available again.

25.03.2025 15:01

The new Rocky 9 login node login23-4 will be available soon. We are resolving the last issues before it can be used for the Rocky 9 pilot phase.

17.04.2025 15:22

The Rocky 9 login node login23-4 is available now.

17.04.2025 15:27

SSH Command Key Approval Currently Not Possible

Hinweis
Donnerstag, 10.04.2025 08:00 - Mittwoch, 16.04.2025 14:35

Due to a bug in the RegApp, SSH command keys cannot be approved at the moment. We are working on a solution.
Update 16.04 14:35:
The cause of the bug has been identified and fixed.

10.04.2025 14:57

[RegApp] hotfix deployment

Teilwartung
Mittwoch, 16.04.2025 14:30 - Mittwoch, 16.04.2025 14:35

The bug affecting SSH command key approval has been identified.
A hotfix will be deployed momentarily; two-factor logins on the HPC will be temporarily unavailable.

16.04.2025 14:13

Maintenance of the RegApp application

Teilwartung
Mittwoch, 09.04.2025 09:00 - Mittwoch, 09.04.2025 11:00

No new logins are possible during the maintenance work. Users who are already logged in will not be disturbed and the maintenance will not affect the rest of the cluster. All frontends are still available, as is the batch system with its computing nodes.

10.03.2025 11:02

Bad Filesystem Performance

Teilstörung
Donnerstag, 23.01.2025 12:30 - Freitag, 28.03.2025 09:31

At the moment, we are observing issues with the file systems that impact performance; file access can consequently be severely delayed. The issue is currently under investigation.

24.01.2025 14:38
Updates

We were not able to identify the root cause of the observed issues and are still working on a solution.

10.03.2025 13:30

We changed some network settings on the GPFS servers, but that did not change anything. We are still working with the manufacturer to get a solution.

20.03.2025 12:08

Additional changes have been made to the configuration. Initial tests tend to show an improvement in the GPFS performance. A full qualitative analysis of the performance is pending

21.03.2025 11:23

The performance of the HOME and WORK filesystems is much better now.

28.03.2025 09:33

Global Maintenance of CLAIX-2023

Wartung
Mittwoch, 05.03.2025 08:00 - Freitag, 07.03.2025 15:00

Due to maintenance tasks on a cluster-wide scale that cannot be performed online, the whole batch service will be suspended for the maintenance.
The Regapp is also affected by the maintenance.

18.02.2025 14:56
Updates

The infrastructure of the HOME and WORK filesystems will also be under maintenance. Hence, the frontend nodes will be inaccessible as well until the pending tasks are completed.

26.02.2025 06:19

Due to delays in servicing the file systems, the cluster maintenance needs to be prolonged.

06.03.2025 12:38

Due to delays in deploying the updated file system, the maintenance had to be prolonged.

07.03.2025 13:57

Jupyterhub Temporarily Unavailable Due To Kernel Update

Teilwartung
Montag, 17.02.2025 06:00 - Montag, 17.02.2025 09:00

Due to a scheduled kernel update, the jupyterhub node is temporarily unavailable. The node will be available again, as soon as the update is completed.

17.02.2025 06:16

Brief SSO disruption

Störung
Mittwoch, 12.02.2025 20:25 - Mittwoch, 12.02.2025 21:00

This evening, between approximately 20:25 and 21:00, the SSO service suffered a performance slump and was examined by the responsible department during that time. The problem has been resolved and fast logins without waiting times are possible again.

12.02.2025 21:41

Login nodes unavailable due to kernel update

Teilwartung
Montag, 10.02.2025 05:45 - Montag, 10.02.2025 07:45

Due to a scheduled kernel update, the login nodes, i.e. login23-*, are temporarily unavailable during the update. The nodes will be available again as soon as the update is completed.

10.02.2025 00:47
Updates

The maintenance had to be prolonged by a few minutes.

10.02.2025 07:25

Some CLAIX23 Nodes Unavailable Due To Update Issues

Teilwartung
Freitag, 24.01.2025 07:00 - Freitag, 24.01.2025 13:35

During a kernel update, issues occurred which led to delays in re-deploying the nodes to the batch service. Consequently, a large number of compute nodes are unavailable at the moment (reserved and/or down). We are currently working on handling the respective nodes and will release them as quickly as possible.

24.01.2025 09:52
Updates

The updates could be completed. The nodes are available again.

24.01.2025 14:34

HPC JupyterHub unavailable

Störung
Mittwoch, 22.01.2025 13:30 - Mittwoch, 22.01.2025 15:14

The HPC JupyterHub is currently unavailable due to unforeseen errors in the filesystems.
A solution is being worked on.

22.01.2025 13:41

Node unavailable due to kernel update

Teilwartung
Montag, 20.01.2025 07:45 - Montag, 20.01.2025 08:45

The kernel of the CLAIX23 copy node claix23-2 has to be updated. The node will be available again as soon as the update is finished.

20.01.2025 07:46

RegApp disruption

Störung
Dienstag, 14.01.2025 09:00 - Dienstag, 14.01.2025 16:40

It is currently not possible to change the HPC password in the RegApp.
Creating new accounts does not work either.

14.01.2025 15:25

RegApp Maintenance

Wartung
Dienstag, 14.01.2025 08:00 - Dienstag, 14.01.2025 10:10

During the stated period, a short maintenance of the RegApp will take place. No logins are possible during this time.
Update:
The RegApp has been reachable again since 9:00, but communication with the second factor is disrupted. Accordingly, 2FA login to the HPC does not work yet.
Update 10:10:
The disruption has been resolved; 2FA login to the HPC works again.

06.01.2025 15:31

Maintenance of the RegApp

Wartung
Montag, 23.12.2024 09:00 - Montag, 23.12.2024 09:30

During the stated period, a short maintenance of the RegApp will take place. No login to the RegApp is possible during this time.

19.12.2024 11:04

Slurm hiccup possible

Warnung
Donnerstag, 19.12.2024 15:50 - Donnerstag, 19.12.2024 16:00

We are migrating the Slurm controller to a new host. Short timeouts may occur; we will try to minimize them as much as possible.
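
A hedged sketch, not an official tool, of how a script could ride out such short controller timeouts by simply retrying a Slurm command a few times (on older Slurm versions, replace --me with -u $USER):

```python
import subprocess
import time

# Retry a Slurm query a few times in case the controller is briefly
# unreachable during the migration (e.g. "Socket timed out" errors).
def run_with_retry(cmd, attempts=5, delay=30):
    for _ in range(attempts):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        time.sleep(delay)
    raise RuntimeError(f"{' '.join(cmd)} kept failing: {result.stderr.strip()}")

print(run_with_retry(["squeue", "--me"]))
```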

19.12.2024 09:53
Updates

The first try was not successful; we are back on the old master. We are analyzing the problems that occurred and will try again later.

19.12.2024 10:01

We are making another attempt.

19.12.2024 15:52

Issues regarding Availability

Teilstörung
Montag, 02.12.2024 11:15 - Mittwoch, 18.12.2024 16:00

There may currently be login problems with various login nodes. We are working on a solution.

02.12.2024 11:58
Updates

The observed issues affect the batch service as well. Consequently many batch jobs may have failed.

02.12.2024 12:18

The observed problems can be traced back to power issues. We cannot rule out that further systems may have to be shut down temporarily in a controlled manner as part of the problem resolution. However, we hope the issues can be resolved without additional measures.

02.12.2024 14:31

The cluster can be accessed again; cf. ticket https://maintenance.itc.rwth-aachen.de/ticket/status/messages/14/show_ticket/9521 for further details.
Several nodes, however, are still unavailable due to the consequences of the aforementioned issues. We are currently working on resolving the issues.

04.12.2024 12:57

InfiniBand issues leading to unreachable nodes

Teilstörung
Freitag, 13.12.2024 21:30 - Montag, 16.12.2024 10:57

Due to InfiniBand problems that still need to be analyzed, many nodes, including the whole GPU cluster, are not reachable at the moment. We are working together with the manufacturer to solve the problems.

16.12.2024 08:04
Updates

The problem has been fixed.

16.12.2024 10:57

Single Sign-On and MFA disrupted

Teilstörung
Mittwoch, 11.12.2024 08:00 - Mittwoch, 11.12.2024 09:30

The Single Sign-On and multi-factor authentication are currently sporadically disrupted. We are already working on a solution and ask for your patience.

11.12.2024 08:12

Nodes drained due to filesystem issues

Teilstörung
Sonntag, 08.12.2024 06:45 - Sonntag, 08.12.2024 17:15

Dear users, on Sunday, 08.12.2024, at 06:53 AM, the home filesystems of most nodes went offline.
This may have crashed some jobs, and no new jobs can start during the downtime.
We are actively working on the issue.

08.12.2024 16:22
Updates

Most nodes are coming back online. Apologies for the troubles. We expect most nodes to be usable by 18:00.

08.12.2024 16:56

Lost filesystem connection to $HOME and $WORK

Hinweis
Donnerstag, 05.12.2024 22:00 - Freitag, 06.12.2024 10:00

Due to a problem in our network, some nodes lost their connection to the $HOME and $WORK file system. This included the login23-1 and login23-2 nodes. The issue has been resolved now.

06.12.2024 14:00

Emergency Shutdown of CLAIX-2023 Due to Failed Cooling

Störung
Montag, 02.12.2024 15:15 - Mittwoch, 04.12.2024 06:00

CLAIX-2023 was shut down in an emergency to prevent damage to the hardware: due to severe power issues, the cooling facilities failed and could not provide sufficient heat dissipation.
The cluster will be operational again once the underlying issues are resolved.

02.12.2024 15:20
Updates

Both CDUs are active again and cooling could be restored. The cluster will be booted again for damage analysis only. Until further notice, the batch service remains suspended until all issues are resolved and all power security checks are positive.

03.12.2024 10:34

The cooling system is now fully operational again. Additionally, we have implemented further measures to enhance stability in the future. The queues were reopened last night; however, we are currently conducting a detailed investigation into some specific nodes regarding their cooling and performance.
Once these investigations are complete, the affected nodes will be made available through the batch system again.

04.12.2024 10:41

New file server for home and work directories

Wartung
Freitag, 22.11.2024 12:00 - Dienstag, 26.11.2024 13:45

We are putting a new file server for the home and work directories into operation. For this purpose we will carry out a system maintenance in order to finally synchronise all data over the weekend.

20.11.2024 09:32
Updates

The maintenance needs to be extended.

25.11.2024 13:32

Due to some issues preventing a normal batch service, the maintenance had to be extended.

26.11.2024 13:33

Limited Usability of CLAIX-2023

Teilstörung
Donnerstag, 21.11.2024 09:45 - Donnerstag, 21.11.2024 17:15

Due to ongoing external disruptions in the RWTH network, the cluster is only partially reachable and functional. The responsible network department is already working on a solution to the problems.

21.11.2024 12:47
Updates

The issues could not be resolved so far and may persist throughout tomorrow as well.

21.11.2024 16:00

The issues have been resolved.

22.11.2024 06:52

Power disruption / Stromausfall

Teilstörung
Freitag, 15.11.2024 17:00 - Samstag, 16.11.2024 14:00

At 17:00 there was a brief power outage in the Aachen area. Power has been restored, but the majority of the compute nodes failed as a result. It is unclear when operation can be resumed.
We are currently working on securing and restoring critical services.

15.11.2024 18:43
Updates

After the critical infrastructure for operating the systems had been restored, the HPC cluster was brought back up and released again. However, as a consequence of the power outage, a larger number of GPU nodes are no longer available. We are working on fixing the problems but cannot yet say when and whether these systems will be available again.

15.11.2024 21:06

The majority of the ML systems (GPUs) could be brought back up today and handed over to batch operation.

16.11.2024 14:04

Scheduler Hiccup

Störung
Donnerstag, 14.11.2024 10:45 - Donnerstag, 14.11.2024 10:55

Our Slurm workload manager crashed for an unknown reason. Functionality was restored at short notice. Further investigations are ongoing.

14.11.2024 10:59

GPU Malfunction on GPU Login Node

Teilstörung
Dienstag, 12.11.2024 09:15 - Dienstag, 12.11.2024 10:35

Currently, a GPU of the GPU login node login23-g-1 shows an issue. The node is unavailable until the issue is resolved.

12.11.2024 09:29
Updates

The issues could be resolved.

12.11.2024 10:36

Login malfunction

Teilstörung
Mittwoch, 23.10.2024 17:00 - Donnerstag, 24.10.2024 08:40

It is currently not possible to log in to the login23-* frontends. There is a problem with the two-factor authentication.

24.10.2024 08:09

Top500 run for GPU nodes

Hinweis
Freitag, 27.09.2024 08:00 - Freitag, 27.09.2024 11:00

We are doing a new top500 run for the ML partition of CLAIX23.
The GPU nodes will not be available during that run.
Other nodes and login23-g-1 might also be unavailable:
i23m[0027-0030],r23m[0023-0026,0095-0098,0171-0174],n23i0001

26.09.2024 15:03

Certificate expired

Störung
Montag, 23.09.2024 08:00 - Montag, 23.09.2024 09:03

Due to the expired certificate for idm.rwth-aachen.de, IdM applications and applications connected via RWTH Single Sign-On cannot be accessed.
- When accessing IdM applications, a warning about an insecure connection is displayed.
- When accessing applications via RWTH Single Sign-On, a message about missing permissions is displayed.
We are working at full speed to resolve the problem.

23.09.2024 08:11
Updates

The certificate has been renewed and the applications can be accessed again. Please clear your browser cache before accessing the pages again.

23.09.2024 09:06

RegApp disruption -> no login to the cluster possible

Teilstörung
Montag, 09.09.2024 10:45 - Montag, 09.09.2024 11:30

Unfortunately, the RegApp was disrupted during the stated period, so logging in to the cluster frontends was not possible. Existing connections were not affected. The problem has been resolved.

09.09.2024 15:40

Old ticket without title

Teilwartung
Montag, 09.09.2024 08:00 - Montag, 09.09.2024 09:00

copy23-2 data transfer system will be unavailable for maintenance.

02.09.2024 09:01
Updates

The maintenance is completed.

09.09.2024 09:00

Firmware Update of InfiniBand Gateways

Teilwartung
Donnerstag, 05.09.2024 15:00 - Freitag, 06.09.2024 12:15

The firmware of the InfiniBand Gateways will be updated. The firmware update will be performed in background and should not cause any interruption of service.

05.09.2024 15:31
Updates

The updates are completed.

06.09.2024 13:19

Hosts der RWTH Aachen teilweise nicht aus Netzen von anderen Providern erreichbar

Teilstörung
Samstag, 24.08.2024 20:15 - Sonntag, 25.08.2024 21:00

Due to a DNS disruption, the nameservers of various providers currently do not return an IP address for hosts under *.rwth-aachen.de.
As a workaround, you can configure alternative DNS servers in your connection settings, e.g. the Level3 nameservers (4.2.2.2 and 4.2.2.1) or those of Comodo (8.26.56.26 and 8.20.247.20). If the RWTH VPN server is reachable, you can also use VPN.
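
A hedged sketch for verifying that one of the alternative nameservers mentioned above resolves an RWTH host; it assumes the third-party dnspython package (version 2.0 or newer, installable via pip) is available, which is not part of the Python standard library:

```python
import dns.resolver  # third-party package "dnspython"

# Query the Level3 nameservers suggested above directly, bypassing the
# system resolver configuration.
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["4.2.2.2", "4.2.2.1"]

for record in resolver.resolve("www.rwth-aachen.de", "A"):
    print(record.address)
```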

25.08.2024 10:34
Updates

Instructions for configuring an alternative DNS server under Windows can be found via the following links:
https://www.ionos.de/digitalguide/server/konfiguration/windows-11-dns-aendern/
https://www.netzwelt.de/galerie/25894-dns-einstellungen-windows-10-11-aendern.html
As an alternative, you can also use VPN. If you cannot reach the VPN server, you can edit the hosts file under Windows following the guide below; this makes the server vpn.rwth-aachen.de reachable. The following entry has to be added:
134.130.5.231 vpn.rwth-aachen.de
https://www.windows-faq.de/2022/10/04/windows-11-hosts-datei-bearbeiten/

25.08.2024 13:20

The RWTH Aachen hosts are now reachable again from outside the RWTH network as well.

25.08.2024 21:10

Even after the disruption was resolved on 25.8. at 21:00, individual users may still have experienced problems. All follow-up work was completed on 26.8. at 9:00, so no further problems should occur.

26.08.2024 15:25

MPI/CPU Jobs Failed to start overnight

Teilstörung
Montag, 19.08.2024 17:15 - Dienstag, 20.08.2024 08:15

Many nodes suffered an issue after our updates on 19.08.2024, resulting in jobs failing on the CPU partitions.
If your job failed to start or failed on startup, please consider requeueing it if necessary (a small requeue sketch follows the list below). The following jobs were identified as possibly affected by the issue:
48399558,48468084,48470374,48470676,48473716,48473739,48473807,48473831,
48475599,48475607_0,48475607_1,48475607_2,48475607_3,48475607_4,48475607_5,
48475607_6,48475607_7,48475607_8,48475607_9,48475607_10,48475607_11,48475607_12,
48475607_13,48475607_14,48475607_15,48475607_16,48475607_17,48475607_18,48475607_19,
48476753,48482255,48485168,48486404,48488874_5,48488874_6,48488874_7,48488874_8,
48488874_9,48488874_10,48488874_11,48488875_9,48488875_10,48488875_11,48489133_1,
48489133_2,48489133_3,48489133_4,48489133_5,48489133_6,48489133_7,48489133_8,48489133_9,
48489133_10,48489154_0,48489154_1,48489154_2,48489154_3,48489154_4,48489154_5,48489154_6,48489154_7,
48489154_8,48489154_9,48489154_10,48489154_11,48489154_12,48489154_13,48489154_14,48489154_15,
48489154_16,48489154_17,48489154_18,48489154_19,48489154_20,48489154_21,48489154_22,48489154_23,
48489154_24,48489154_25,48489154_26,48489154_27,48489154_28,48489154_29,48489154_30,48489154_31,
48489154_32,48489154_33,48489154_34,48489154_35,48489154_36,48489154_37,48489154_38,48489154_39,
48489154_40,48489154_41,48489154_42,48489154_43,48489154_44,48489154_45,48489154_46,48489154_47,
48489154_100,48489154_101,48489154_102,48489154_103,48489154_104,48489154_105,48489154_106,48489154_107,
48489154_108,48489154_109,48489154_110,48489154_111,48489154_112,48489154_113,48489154_114,48489154_115,
48489154_116,48489154_117,48489154_118,48489154_119,48489154_120,48489154_121,48489154_122,48489154_123,
48489154_124,48489154_125,48489154_126,48489154_127,48489154_128,48489154_129,48489154_130,48489154_131,
48489154_132,48489154_133,48489154_134,48489154_135,48489154_136,48489154_137,48489154_138,48489154_139,
48489154_140,48489154_141,48489154_142,48489154_143,48489154_144,48489154_145,48489154_146,48489154_147,
48489154_148,48489154_149,48489154_150,48489154_151,48489154_152,48489154_153,48489154_154,48489154_155,
48489154_156,48489154_157,48489154_158,48489154_159,48489154_160,48489154_161,48489154_162,48489154_163,
48489154_164,48489154_165,48489154_166,48489154_167,48489154_168,48489154_169,48489154_170,48489154_171,
48489154_172,48489154_173,48489154_174,48489154_175,48489154_176,48489154_177,48489154_178,48489154_179,
48489154_180,48489154_181,48489154_182,48489154_183,48489154_184,48489154_185,48489154_186,48489154_187,
48489154_188,48489154_189,48489154_190,48489154_191,48489154_192,48489154_193,48489154_194,48489154_195,
48489618_1,48489618_2,48489618_3,48489618_4,48489618_5,48489618_6,48489618_7,48489618_8,48489618_9,48489618_10,
48489776,48489806_6,48489806_55,48489806_69,48489806_98,48489842,48489843,48489844,48489845,48489882_1,48489882_2,
48489882_3,48489882_4,48489882_5,48489882_6,48489882_7,48489882_8,48489882_9,48489882_10,48494481,48494490,48494752,
48494753,48494754,48494755,48494756,48494757,48494758,48494759,48494760
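
A hedged helper, not an official tool, for requeueing the jobs listed above via scontrol; if a job is no longer known to Slurm, the requeue fails and the original batch script should simply be resubmitted with sbatch:

```python
import subprocess

# Try to requeue each affected job ID; report the ones that need a manual
# resubmission instead.
affected = ["48399558", "48468084", "48475607_0"]  # shortened here; use the full list above

for job in affected:
    result = subprocess.run(["scontrol", "requeue", job],
                            capture_output=True, text=True)
    if result.returncode != 0:
        print(f"{job}: requeue failed ({result.stderr.strip()}); resubmit its batch script manually")
```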

20.08.2024 11:34

Maintenance

Wartung
Montag, 19.08.2024 07:00 - Montag, 19.08.2024 16:00

Due to updates to our compute nodes, the HPC system will be unavailable for maintenance.
The login nodes will be available at noon without interruptions, but the batch queue for jobs won't be usable during the maintenance work.
As soon as the maintenance work has been completed, batch operation will be enabled again.
These jobs should be requeued if necessary:
48271714,48271729,48271731,48463405,48463406,48463407,48466930,
48466932,48468086,48468087,48468088,48468089,48468090,48468091,
48468104,48468105,48468108,48468622,48469133,48469262,48469404,
48469708,48469734,48469740,48469754,48469929,48470011,48470017,
48470032,48470042,48470045,48474641,48474666,48475362,48489829,
48489831,48489833_2,48489838

09.08.2024 11:01

Old HPCJupyterHub GPU profiles might run slower on the new c23g nodes.

Hinweis
Freitag, 24.05.2024 11:00 - Freitag, 09.08.2024 13:46

Please migrate your notebooks to work with newer c23 GPU Profiles!
The migration of the GPU Profiles to Claix 2023 and the new nodes of c23g has made the old python packages use non optimal settings on the new GPUs.
Redeployment of these old profiles is necessary and will take some time.

24.05.2024 11:15

MPI jobs may crash

Teilstörung
Dienstag, 16.07.2024 16:12 - Donnerstag, 01.08.2024 09:15

Since the cluster maintenance, random MPI job crashes are observed. We are currently investigating the issue and are working on a solution.

22.07.2024 09:37
Updates

We have identified the issue and are currently testing workarounds with the affected users.

24.07.2024 12:41

After successful tests with affected users, we have rolled out a workaround that automatically prevents this issue for our IntelMPI installations. We advise users to remove any custom workarounds from their job scripts to ensure compatibility with future changes.

01.08.2024 10:28

c23i Partition is DOWN for the HPC JupyterHub

Teilstörung
Donnerstag, 18.07.2024 15:15 - Montag, 29.07.2024 10:14

The c23i partition is DOWN due to unforeseen behaviour of our monitoring system, which automatically downs the only node in the partition.
A solution is currently unknown and will be investigated.
The HPC JupyterHub will not be able to use the partition until the issue is resolved.

18.07.2024 15:29

Temporary Deactivation of User Namespaces

Teilstörung
Montag, 08.07.2024 14:15 - Donnerstag, 18.07.2024 13:00

Due to a security vulnerability in the Linux Kernel, user namespaces are temporarily deactivated. Upon the kernel update, user namespaces can be used again.

08.07.2024 14:32
Updates

User namespaces are available again.

18.07.2024 13:00

Quotas on HPCWORK may not work correctly

Teilstörung
Donnerstag, 27.06.2024 14:30 - Donnerstag, 18.07.2024 12:30

The quota system on HPCWORK may not work correctly. There may be an error "Disk quota exceeded" if trying to create files although the r_quota command reports that enough quota should be available. The supplier of the filesystem has been informed and is working on a solution.

27.06.2024 14:40
Updates

File quotas for all hpcwork directories were increased to one million.

18.07.2024 12:39

Reconfiguration of File Systems and Kernel Update

Wartung
Montag, 15.07.2024 07:00 - Dienstag, 16.07.2024 16:11

During the maintenance, $HPCWORK will be reconfigured so that the CLAIX23 nodes can access it via RDMA over InfiniBand instead of over Ethernet. At the same time, the kernel will be updated. After the kernel update, the previously deactivated user namespaces will be re-activated.

10.07.2024 09:43
Updates

The maintenance had to be extended for final filesystem tasks

15.07.2024 15:24

Due to unforeseen problems, the maintenance has to be extended to tomorrow, 16.07.2024, 18:00. We do not expect the filesystem manufacturer to take that long and hope to open the cluster earlier.

15.07.2024 17:24

The maintenance could be ended successfully. Once again, sorry for the long delay.

16.07.2024 16:12

HPCJupyterHub down due to update to 5.0.0

Störung
Mittwoch, 26.06.2024 15:00 - Donnerstag, 27.06.2024 16:00

HPCJupyterHub is down after a failed update to 5.0.0 and will stay down until the update is complete.
Update: HPCJupyterHub could not be updated to 5.0.0 and remains at version 4.1.5.

26.06.2024 15:04

FastX web servers on login18-x-1 and login18-x-2 stopped

Warnung
Mittwoch, 15.05.2024 14:00 - Donnerstag, 27.06.2024 14:26

The FastX web servers on login18-x-1 and login18-x-2 have been stopped, i.e. the addresses https://login18-x-1.hpc.itc.rwth-aachen.de:3300 and https://login18-x-2.hpc.itc.rwth-aachen.de:3300 are not available anymore. Please use login23-x-1 or login23-x-2 instead.

15.05.2024 14:38
Updates

login18-x-1 and login18-x-2 have been decommissioned.

27.06.2024 14:29

Maintenance

Wartung
Mittwoch, 26.06.2024 08:00 - Mittwoch, 26.06.2024 16:00

Due to maintenance work on the water cooling system, Claix23 must be empty during the specified period. As soon as the maintenance work has been completed, batch operation will be enabled again. The dialog systems are not affected by the maintenance work.

12.06.2024 07:48
Updates

Additionally, between 10 and 11 o'clock, there will be a maintenance of the RegApp. During this time, new logins will not be possible, existing connections will not be disturbed.

25.06.2024 14:04

Upgrade to Rocky Linux 8.10

Teilwartung
Donnerstag, 13.06.2024 11:15 - Mittwoch, 26.06.2024 16:00

Since Rocky 8.9 has reached its EOL, the MPI nodes of CLAIX23 must be upgraded to Rocky 8.10. The upgrade is performed in the background during production to minimize the downtime of the cluster. However, during the upgrade, free nodes will be taken out of service selectively and will not be available for job submission until their upgrade is completed.
Please keep in mind that the installed library versions will likely change with the update; thus, performance and application behaviour may vary compared to earlier runs.

13.06.2024 11:49
Updates

Starting now, all new jobs will be scheduled to Rocky 8.10 nodes. The remaining nodes that still need to be updated are unavailable for job submission. These nodes will be upgraded as soon as possible after their jobs complete.

14.06.2024 18:22

The update of the frontend and batch nodes is completed. The remaining nodes (i.e. integrated hosting and service nodes) will be updated during the cluster maintenance scheduled for 2024-06-26.

20.06.2024 08:49

Update of Frontend Nodes

Teilwartung
Mittwoch, 26.06.2024 08:00 - Mittwoch, 26.06.2024 10:00

The dialog nodes (i.e. login23-1/2/3/4, login23-x-1/2) will be updated to Rocky 8.10 today within the weekly reboot. The upgrade of copy23-1/2 will follow.

17.06.2024 05:08
Updates

The copy frontend nodes (copy23-1, copy23-2) will be updated to Rocky Linux 8.10 during the cluster maintenance on 2024-06-26.

24.06.2024 09:13

The update of the remaining frontend nodes is completed.

26.06.2024 11:12

Old ticket without title

Teilstörung
Dienstag, 25.06.2024 10:00 - Dienstag, 25.06.2024 17:00

Due to technical problems, it is not possible to create, change or delete HPC accounts or projects. We are working on the issue.

25.06.2024 16:11

Error on user/project management

Teilstörung
Donnerstag, 20.06.2024 10:00 - Montag, 24.06.2024 10:32

Due to technical problems, it is not possible to create, change or delete HPC accounts or projects. We are working on the issue.

20.06.2024 12:09
Updates

The issue has been resolved.

24.06.2024 10:33

Project management

Teilstörung
Mittwoch, 29.05.2024 15:30 - Mittwoch, 12.06.2024 16:30

During this period no RWTH-S, THESIS, LECTURE or WestAI projects can be granted. We apologize for the inconvenience.

29.05.2024 15:42

RegApp Maintenance

Teilwartung
Mittwoch, 12.06.2024 09:00 - Mittwoch, 12.06.2024 10:00

Due to maintenance of the RegApp Identity Provider, it is not possible to establish new connections to the cluster during the specified period. Existing connections and batch operation are not affected by the maintenance.

04.06.2024 14:28

Deactivation of User Namespaces

Hinweis
Mittwoch, 27.03.2024 08:15 - Montag, 29.04.2024 18:00

Due to a pending security issue, we have to temporarily deactivate so-called user namespaces on the cluster. This feature is mainly used by container virtualization software such as Apptainer, and the deactivation affects how these containers are set up internally. Most users should not be directly affected by these changes and should be able to continue working seamlessly. Should you nevertheless encounter problems, please contact us via servicedesk@itc.rwth-aachen.de and describe exactly how you start your containers. As soon as we can apply a patch for the vulnerability, we will re-enable user namespaces.

27.03.2024 08:14
Updates

A kernel update addressing the issue was released upstream and will be available on the compute cluster soon. Once the update is applied, user namespaces can be enabled again.

04.04.2024 11:11

We are planning to re-enable user namespaces on April 29th after some final adjustments.

24.04.2024 17:22

Performance Problems on HPCWORK

Hinweis
Montag, 08.04.2024 11:00 - Mittwoch, 24.04.2024 17:00

We are currently observing recurring performance degradation on the HPCWORK directories, which might be partly worsened by the ongoing migration process leading up to the filesystem migration on April 17th. The problems cannot be traced back to a single cause but are being actively investigated.

12.04.2024 11:35
Updates

Due to technical problems, we will have to postpone the maintenance (and the final lustre migration step) to 23.04.2024 07:00.

16.04.2024 16:21

HPC JupyterHub update

Wartung
Dienstag, 23.04.2024 07:00 - Mittwoch, 24.04.2024 12:00

During the CLAIX HPC system maintenance, the HPC JupyterHub will be updated to a newer version.
This will improve Claix 2023 support and include mandatory security updates.
The whole cluster needs to be updated with a new kernel.

23.04.2024 07:03
Updates

The migration was successfully completed.

24.04.2024 13:40

Migration from lustre18 to lustre22

Teilwartung
Dienstag, 23.04.2024 07:00 - Mittwoch, 24.04.2024 12:00

In the last weeks, we started migrating all HPCWORK data to a new filesystem. In this Maintenance we will do the final migration step. HPCWORK will not be available during this maintenance.

10.04.2024 11:26
Updates

Due to technical problems, we will have to postpone the maintenance (and the final lustre migration step) to 23.04.2024 07:00.

16.04.2024 16:23

System Maintenance

Wartung
Dienstag, 23.04.2024 07:00 - Mittwoch, 24.04.2024 12:00

The whole cluster needs to be updated with a new kernel so that user namespaces can be re-enabled; cf. https://maintenance.itc.rwth-aachen.de/ticket/status/messages/14/show_ticket/8929
At the same time, the InfiniBand stack will be updated for better performance and stability.
During this maintenance, the dialog systems and the batch system will not be available. The dialog systems are expected to be reopened in the early morning.
We do not expect the maintenance to last the whole day and hope to open the cluster earlier.

10.04.2024 11:22
Updates

Due to technical problems, we will have to postpone the maintenance to 23.04.2024 07:00.

16.04.2024 16:22

Unfortunately, unplanned complications have arisen during maintenance, so that maintenance will have to be extended until midday tomorrow. We will endeavor to complete the work by then.
We apologize for any inconvenience this may cause.

23.04.2024 16:27

Top500 - Benchmark

Warnung
Donnerstag, 11.04.2024 17:00 - Freitag, 12.04.2024 09:10

During the stated time Claix-2023 will not be available due to a benchmark run for the Top500 list[1]. Batch jobs which cannot finish before the start of this downtime or which are scheduled during this time period will be kept in queue and started after the cluster resumes operation.
[1] https://www.top500.org

11.04.2024 17:09
Updates

The nodes are now available again.

12.04.2024 09:27

Longer waiting times in the ML partition

Hinweis
Mittwoch, 03.04.2024 16:00 - Donnerstag, 11.04.2024 13:11

There are currently longer waiting times in the ML partition as the final steps of the acceptance process are still being carried out.

04.04.2024 10:09
Updates

The waiting times should be shorter now.

11.04.2024 13:11

RegApp Service Update

Wartung
Mittwoch, 03.04.2024 14:00 - Mittwoch, 03.04.2024 14:30

On 03.04.2024, the RegApp will be updated. The service may be briefly interrupted during the update window. Active sessions should not be affected.

27.03.2024 13:59

Problems with submitting jobs

Teilstörung
Mittwoch, 03.04.2024 12:00 - Mittwoch, 03.04.2024 14:03

There are currently problems when submitting jobs. We are working on fixing the problems and apologize for the inconvenience.

03.04.2024 12:36
Updates

The problem is solved now.

03.04.2024 14:03

Deactivation of User Namespaces

Hinweis
Freitag, 12.01.2024 10:30 - Donnerstag, 08.02.2024 08:00

Due to a pending security issue, we have to temporarily deactivate so-called user namespaces on the cluster. This feature is mainly used by container virtualization software such as Apptainer, and the deactivation affects how these containers are set up internally. Most users should not be directly affected by these changes and should be able to continue working seamlessly. Should you nevertheless encounter problems, please contact us via servicedesk@itc.rwth-aachen.de and describe exactly how you start your containers. As soon as we can apply a patch for the vulnerability, we will re-enable user namespaces.
Update:
We have installed a bugfix for the affected software component and re-enabled user namespaces.

12.01.2024 10:43

Directory "hpcwork" is empty

Teilstörung
Montag, 29.01.2024 10:15 - Montag, 29.01.2024 11:34

Currently, no data is shown under /hpcwork.
The responsible department has been informed and is working on a solution.

29.01.2024 10:26
Updates

The disruption has been resolved.

29.01.2024 11:34

Scheduled Reboot of CLAIX18 Copy Nodes

Störung
Montag, 29.01.2024 06:00 - Montag, 29.01.2024 07:15

Both CLAIX18 copy nodes will be rebooted on Monday, January 29th, at 6:00 am (CET) due to a scheduled kernel upgrade. The systems will be temporarily unavailable and cannot be used until the kernel update is finished.

26.01.2024 17:15

Network problems

Warnung
Freitag, 19.01.2024 19:45 - Samstag, 20.01.2024 09:30

Due to network problems, use of the cluster may have been impaired during the stated period.

22.01.2024 07:45

Connection to the Windows cluster not possible

Teilstörung
Freitag, 29.12.2023 14:45 - Montag, 01.01.2024 00:00

Currently, no connection to the Windows cluster can be established.
The colleagues responsible have been informed and are working on fixing the problem.

29.12.2023 14:55
Updates

The disruption has been resolved. Connecting to the Windows cluster is possible again.

03.01.2024 11:46

jupyterhub.hpc.itc.rwth-aachen.de DNS Temporary out of Service

Störung
Donnerstag, 14.12.2023 15:30 - Donnerstag, 14.12.2023 15:55

The DNS entry jupyterhub.hpc.itc.rwth-aachen.de is temporarily out of service for 20 minutes. Problems accessing the HPC JupyterHub might arise from this failure. Please wait until the system comes back online.

14.12.2023 15:33

Maintenance of the HPC user management

Wartung
Dienstag, 05.12.2023 10:00 - Dienstag, 05.12.2023 12:00

Due to maintenance work, the setup of HPC accounts is delayed. Password changes are not possible.

05.12.2023 09:55

login18-x-2 disrupted

Teilstörung
Montag, 27.11.2023 12:45 - Dienstag, 28.11.2023 14:40

login18-x-2 is defective and therefore currently unavailable.

28.11.2023 12:50
Updates

The system is OK again.

28.11.2023 14:40