Rechner-Cluster

More information about this service can be found in our documentation portal.

Bug with Jobs landing on _low partitions

Partial Outage
Thursday 11/20/2025 12:05 PM - Thursday 11/20/2025 05:19 PM

We are currently experiencing an issue where jobs submitted under the default account land on the _low partitions.
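As a possible interim step while the issue is being investigated, the intended project can be stated explicitly at submission time instead of relying on the default account; whether this avoids the bug is not confirmed in this ticket, and the project name rwth1234 below is only a placeholder:

    # request the project account explicitly instead of relying on the default
    sbatch --account=rwth1234 jobscript.sh
    # or, inside the job script:
    #SBATCH --account=rwth1234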

20.11.2025 13:06
Updates

Unusual submission errors might occur while we fix the issue. Please re-submit affected jobs.

20.11.2025 15:56

Full system maintenance.

Maintenance
Tuesday 11/18/2025 08:00 AM - Tuesday 11/18/2025 07:00 PM

Due to various necessary maintenance works, the entire CLAIX HPC System will be unavailable.

The initial phase of the maintenance should last until 12:00, after which the filesystems and login nodes should be available again.
Jobs will not run until the full maintenance works are completed.
We aim to also upgrade the Slurm scheduler during the downtime.

Please note the following:
- User access to the HPC system through login nodes, HPC JupyterHub or any other connections will not be possible during the initial part of the maintenance.
- No Slurm jobs will be able to run during the entire maintenance.
- Before the maintenance, Slurm will only start jobs that are guaranteed to finish before the start of the maintenance; any running jobs must finish by then or may be terminated (see the example below this list).
- Nodes may therefore remain idle in the run-up to the maintenance, as Slurm drains them of user jobs.
- Waiting times before and after the maintenance may be longer than usual, as nodes are drained beforehand and the queue of waiting jobs grows afterwards.
- Files in your personal or project directories will not be available during the initial part of the maintenance.
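To illustrate the scheduling behaviour described above: a job can still start before the maintenance if its requested walltime ends before the maintenance window begins. This is a minimal sketch with example values only; the 8-hour limit must be adapted to the time actually remaining until 2025-11-18 08:00, and jobscript.sh is a placeholder:

    # request a walltime short enough to finish before the maintenance starts
    sbatch --time=08:00:00 jobscript.sh
    # check whether a pending job is being held back (e.g. by the maintenance reservation)
    squeue --me --format="%i %j %T %r"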

10.11.2025 14:35
Updates

The first part of the maintenance is taking longer than expected. At the moment, we cannot estimate when it will be finished.

18.11.2025 11:56

The network maintenance is still ongoing.

18.11.2025 16:01

The maintenance work was unfortunately delayed due to circumstances beyond our control. This will delay the HPC system's availability by a few hours.

18.11.2025 17:15

Due to issues during the network maintenance and a short-term failure of the storage backend of an infrastructure server required for the maintenance, the maintenance tasks were delayed and some of them had to be postponed. The cluster is operational again.

18.11.2025 19:08

RegApp maintenance

Partial Maintenance
Thursday 10/23/2025 02:30 PM - Thursday 10/23/2025 03:30 PM

The RegApp service will be temporarily unavailable due to OS updates and to prepare for upcoming changes.
Login to the HPC frontends and to perfmon.hpc.itc.rwth-aachen.de will be unavailable.
It is recommended to log in to the services beforehand.

20.10.2025 09:38

Temporary Suspension of GPU Batch Service

Partial Maintenance
Friday 10/17/2025 08:05 AM - Monday 10/20/2025 06:05 AM

Due to urgent security updates, the batch service will be temporarily suspended for all GPU nodes. After the updates are deployed, the service will be resumed. Running jobs are not affected and will continue. The dialog node login23-g-1 is not affected.

The updates are installed automatically once the running jobs on a node have finished; the nodes then become available again at short notice.

17.10.2025 08:35
Updates

The majority of jobs on the GPU nodes finished early, so the nodes could be upgraded without a noticeable interruption of the service.

20.10.2025 10:38

HOME filesystem unavailable

Outage
Thursday 10/09/2025 10:15 PM - Friday 10/10/2025 09:30 AM

The users' HOME directories were temporarily unavailable on both the login and the batch nodes. As a consequence, existing sessions trying to access these directories would hang, and new connections to the cluster could not be established. Batch jobs that started or ran during this time frame may have failed as a result. If your job crashed, please inspect its output; in case of error messages related to files or directories underneath /home, please resubmit the affected jobs. Some batch nodes were automatically disabled as a result and are being put back into operation as of now.
We apologize for the inconveniences this has caused.
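For users who want to check whether their jobs fall into the affected window, the Slurm accounting database can be queried; this is a generic sketch using standard Slurm commands, with the time window matching this incident and jobscript.sh as a placeholder:

    # list jobs that failed between the evening of 09.10. and the morning of 10.10.
    sacct --starttime=2025-10-09T22:00:00 --endtime=2025-10-10T09:30:00 \
          --state=FAILED,NODE_FAIL --format=JobID,JobName,State,ExitCode,End
    # resubmit a job whose output shows /home-related I/O errors
    sbatch jobscript.sh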

09.10.2025 16:45
Updates

The underlying problem has turned out to be more persistent than expected and we are consulting with the vendor for a fix. As a result, the majority of batch nodes remain out of operation until we can guarantee stable access to the filesystem. We are working to remedy the situation as soon as possible.

09.10.2025 17:25

The problem has been solved and the cluster is back in operation.

09.10.2025 20:08

The issue re-occurred and currently persists. We are working on a solution.

10.10.2025 07:21

The issues have been resolved.

10.10.2025 09:34

Full global maintenance of the HPC CLAIX Systems

Partial Maintenance
Tuesday 09/30/2025 03:35 PM - Friday 10/03/2025 07:25 PM

Our global GPFS filesystem needs to be updated, which will cause the entire CLAIX HPC System to be unavailable.
Please note the following:
- User access to the HPC system through login nodes, HPC JupyterHub or any other connections will not be possible during the maintenance.
- No Slurm jobs or filesystem-dependent tasks will be able to run during the maintenance.
- Before the maintenance, Slurm will only start jobs that are guaranteed to finish before the start of the maintenance; any running jobs must finish by then or may be terminated.
- Nodes may therefore remain idle in the run-up to the maintenance, as Slurm drains them of user jobs.
- Waiting times before and after the maintenance may be longer than usual, as nodes are drained beforehand and the queue of waiting jobs grows afterwards.
- Files in your personal or project directories will not be available during the maintenance.

16.09.2025 15:31
Updates

Unfortunately the maintenance works will have to be extended. We hope to be done as soon as possible. We apologize for the inconvenience.

30.09.2025 15:08

We must unfortunately postpone the release of the HPC system for normal use until Wednesday.
We apologise for the delays.

30.09.2025 20:12

As part of the maintenance, a pending system update addressing security issues is being performed as well. However, due to the large number of nodes, the update still requires some time. The cluster will be made available as soon as possible. Unfortunately, we cannot give an exact estimate of when the updates will be finished.

01.10.2025 17:09

All updates should be completed later this evening. We expect the cluster to be available tomorrow by 10:00 a.m. The frontend nodes should be available earlier; the batch service is expected to resume by 11:00 a.m.
We apologize once again for the unforeseen inconveniences.

01.10.2025 18:18

The updates are still not completed and require additional time. We estimate that they will be finished this afternoon. The frontends are already available again.

02.10.2025 10:25

The global maintenance tasks have been completed, and we are now putting the cluster back into operation. However, several nodes will temporarily remain in maintenance due to issues that could not yet be resolved.

02.10.2025 15:39

Operation of most nodes has been restored. The remaining few nodes will follow soon.

03.10.2025 19:26

Migration to Rocky Linux 9

Partial Maintenance
Wednesday 10/01/2025 08:00 AM - Wednesday 10/01/2025 03:30 PM

The CLAIX-2023 copy nodes copy23-1 and copy23-2 will be reinstalled with Rocky Linux 9. During the reinstallation, the nodes will not be available.

30.09.2025 17:45

NFS disruption of the GPFS servers

Outage
Wednesday 09/17/2025 09:35 PM - Thursday 09/18/2025 10:06 AM

All nodes are currently draining due to a disruption. We are working on it with the vendor.

18.09.2025 09:16
Updates

The problem has been solved; the cluster is back in operation.

18.09.2025 10:06

$HOME and $WORK filesystems are again unavailable

Outage
Friday 08/29/2025 09:45 AM - Friday 08/29/2025 11:00 AM

Due to issues with the underlying filesystem servers for $HOME and $WORK, the batch nodes are currently unavailable, and access to $HOME and $WORK on the login nodes is not possible.

29.08.2025 09:55
Updates

Access to the filesystems has been restored. We apologize for the inconvenience.

29.08.2025 11:38

$HOME and $WORK unavailable on login23-g-1

Outage
Tuesday 08/26/2025 03:00 PM - Wednesday 08/27/2025 11:30 AM

Due to issues with the GPFS filesystem, $HOME and $WORK are not available on login23-g-1.
The issue has been resolved.

27.08.2025 11:40

login23-1 unreachable

Partial Maintenance
Tuesday 08/26/2025 08:45 AM - Tuesday 08/26/2025 10:00 AM

For issue resolution and fault analysis of the frontend node, the dialog node login23-1 is temporarily unavailable.

26.08.2025 09:01

Reboot of CLAIX-2023 copy nodes

Partial Maintenance
Monday 08/25/2025 08:00 AM - Monday 08/25/2025 04:16 PM

Due to a pending kernel update, the CLAIX-2023 copy nodes copy23-1 and copy23-2 will be rebooted on Monday, 2025-08-25.

22.08.2025 12:38
Updates

Due to pending mandatory firmware updates, the maintenance needs to be prolonged.

25.08.2025 08:36

Due to issues with respect to the firmware update, the end of the maintenance is delayed.

25.08.2025 09:52

copy23-1 is available again.

25.08.2025 12:00

The Maintenance is completed.

25.08.2025 16:16

GPU login node unavailable due to maintenance

Partial Maintenance
Friday 08/22/2025 08:00 AM - Friday 08/22/2025 03:45 PM

Due to mandatory maintenance work, the GPU login node n23-g-1 will not be available during the maintenance.
During the maintenance, the node's firmware will be updated which takes several hours to complete.

21.08.2025 14:36
Updates

The firmware upgrades are completed. However, there is an issue with the SNC configuration: we were unable to enable the cluster-wide standard setting SNC4 on the dialog node. Further investigation is required.
Notwithstanding this, the GPU login node can be used until further notice.

22.08.2025 15:40

$HOME and $WORK filesystems are again unavailable

Outage
Tuesday 08/19/2025 07:10 PM - Wednesday 08/20/2025 09:30 AM

Due to issues with the underlying filesystem servers for $HOME and $WORK, the majority of batch nodes are currently unavailable, and access to $HOME and $WORK on the login nodes is not possible.

19.08.2025 19:37
Updates

The issue has been resolved and the affected machines are back in operation. We are actively working with the vendor to remedy the situation and apologize for the repeated downtimes.

20.08.2025 10:16

File System Issues

Outage
Monday 08/18/2025 04:52 PM - Monday 08/18/2025 05:59 PM

Due to issues with the shared filesystems, all services with respect to the HPC cluster are substantially impacted. The issues are under current investigation, and we are working on a solution.

18.08.2025 17:41
Updates

On most of the nodes, the shared filesystems' function could be restored.

19.08.2025 08:12

$HOME and $WORK filesystems unavailable

Outage
Friday 08/15/2025 01:00 PM - Friday 08/15/2025 03:00 PM

Due to issues with the GPFS file system, the $HOME and $WORK directories may currently be unavailable on the login nodes. Additionally, some compute nodes are temporarily not available due to these issues. Our team is working to resolve the problem as quickly as possible.

15.08.2025 14:19
Updates

The issues could be resolved.

15.08.2025 15:09

$HOME and $WORK filesystems unavailable and cause jobs to fail.

Partial Outage
Wednesday 08/13/2025 12:45 PM - Wednesday 08/13/2025 02:47 PM

Due to issues with the underlying filesystem servers for $HOME and $WORK (GPFS), some batch nodes are currently unavailable and the login nodes may be unstable. This may have caused running jobs to fail.
We are working to resolve the problem as quickly as possible and apologize for any inconvenience caused.
EDIT: Systems seem to be back online and operating normally.

13.08.2025 13:51

Emergency Shutdown of CLAIX-2023

Outage
Wednesday 07/30/2025 12:30 PM - Wednesday 07/30/2025 06:50 PM

Due to cooling issues, the CLAIX-2023 cluster has been shut down. Accessing the cluster is currently not possible.
Please note that running jobs may have been terminated due to the shutdown.
We are actively working on identifying the root cause and resolving the issue as quickly as possible.
We apologize for any inconvenience this may cause and thank you for your understanding.

30.07.2025 12:36
Updates

The cluster is resuming operation now that the cooling system has been cleared for load. All batch nodes should be back in operation shortly.

30.07.2025 18:33

Migration of login23-3 and login23-x-1 to Rocky 9

Maintenance
Monday 07/28/2025 07:00 AM - Monday 07/28/2025 06:00 PM

The remaining Rocky 8 login nodes login23-3 and login23-x-1 will be upgraded to Rocky 9. These nodes will be unavailable during the upgrade process. Please use the other available login nodes in the meantime. The copy nodes are unaffected and will remain Rocky 8 nodes until further notice.

25.07.2025 11:15

Login not possible

Outage
Tuesday 07/22/2025 11:30 AM - Tuesday 07/22/2025 11:44 AM

RWTH Single Sign-On login is not possible at the moment.
Existing sessions are not affected.
We are working on fixing the problem.

22.07.2025 11:37
Updates

The problem has been solved.

22.07.2025 11:45

p7zip replaced with 7zip

Notice
Tuesday 07/22/2025 09:30 AM - Tuesday 07/22/2025 09:30 AM

Due to security concerns, the previously installed file archiver "p7zip" (a fork of "7zip") was replaced by the original program.
p7zip was forked from 7zip two decades ago because 7zip initially lacked Unix and Linux support. Unfortunately, its development lags severely behind the upstream code, so issues are only partially fixed, if at all. Since upstream 7zip also contains several functional and performance improvements in addition to the bug fixes, overall performance should improve for users as well. Usage should not change aside from minor differences due to the diverging development history.

22.07.2025 13:22

$HOME and $WORK filesystems are again unavailable

Outage
Monday 07/14/2025 09:45 AM - Monday 07/14/2025 11:30 AM

Due to issues with the underlying filesystem servers for $HOME and $WORK, all batch nodes are currently unavailable and the login nodes are not usable. This may have caused running jobs to fail, and no new jobs can start during the downtime.
We are working to resolve the problem as quickly as possible and apologize for any inconvenience caused.

14.07.2025 10:20
Updates

The $HOME and $WORK filesystems are back online.

14.07.2025 11:45

$HOME and $WORK filesystems unavailable

Outage
Thursday 07/10/2025 04:30 PM - Thursday 07/10/2025 06:30 PM

Due to issues with the underlying filesystem servers for $HOME and $WORK, all batch nodes are currently unavailable and the login nodes are not usable. This may have caused running jobs to fail, and no new jobs can start during the downtime.
We are working to resolve the problem as quickly as possible and apologize for any inconvenience caused.

10.07.2025 16:47
Updates

The filesystems have been properly re-mounted on all nodes and the system is back in normal operation. If your jobs have crashed due to I/O errors, please resubmit them. We again apologize for the inconvenience.

10.07.2025 19:04

Firmware Upgrade of Frontend Nodes

Partial Maintenance
Monday 07/07/2025 06:00 AM - Monday 07/07/2025 07:30 AM

The firmware of login23-3 and login23-4 must be upgraded. During the upgrade, the nodes will not be available. Please use the other dialog nodes in the meantime.

02.07.2025 12:56
Updates

The firmware upgrades are completed.

07.07.2025 07:36

Old ticket without a title

Partial Outage
Saturday 07/05/2025 12:15 PM - Saturday 07/05/2025 12:45 PM

There were disruptions to RWTH Single Sign-On in some cases during the specified period. After entering your login details, the screen loads and then an Internal Server Error appears. This affects all services that use the RWTH Single Sign-On login.

07.07.2025 11:45

User namespaces on all Rocky 9 systems deactivated

Partial Outage
Monday 06/16/2025 02:15 PM - Thursday 07/03/2025 06:00 PM

Due to an open security issue we have deactivated user namespaces on all Rocky 9 systems.
This feature is mainly used by containerization software and affects the way apptainer containers will behave.
Most users should not experience any interruptions. If you experience any problems, please contact us as
usual via servicedesk@itc.rwth-aachen.de with a precise description of the features you are using.
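To check whether user namespaces are currently available on a given node, the following quick test can be used. This is a generic sketch that assumes the deactivation is visible through the standard kernel interfaces, which is not stated explicitly in this ticket:

    # a non-zero value means unprivileged user namespaces are permitted
    cat /proc/sys/user/max_user_namespaces
    # alternatively, try to create a user namespace; this fails while the feature is disabled
    unshare --user --map-root-user true && echo "user namespaces available"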

16.06.2025 14:26
Updates

The user namespaces can be activated again once the Rocky 9 nodes are upgraded to Rocky 9.6.
The dialog systems (cf. other ticket) have already been upgraded. We are currently rolling out the upgrade to Rocky 9.6 in the background to minimize the downtime. Nodes that have already been upgraded have user namespaces enabled by default again.

25.06.2025 09:32

The remaining Rocky 9.5 nodes are planned to be updated to Rocky Linux 9.6 by 2025-07-04.

26.06.2025 15:48

Usability of Desktop Environments on CLAIX-2023 Frontend Nodes Restricted due to Security Issues

Notice
Friday 06/20/2025 09:15 AM - Thursday 07/03/2025 12:00 PM

Due to high-risk security issues within some components, the affected packages had to be removed from the cluster to keep it operational until the security issues can be fixed. Unfortunately, the removal breaks the desktop environments XFCE and Mate due to tight dependencies on the removed packages. Consequently, the file managers "Thunar" and "Caja" as well as some other utilities cannot be used at the moment when using a GUI login.
If a GUI login is mandatory, please use the IceWM environment instead. However, some of the GUI applications might not work as expected (see above).
The text-based/terminal usage of the frontend nodes is not affected by this temporary change.

20.06.2025 09:26
Updates

The upgraded Rocky 9 login nodes have received patches that solve the security issues. On these nodes, Mate and XFCE can be used as desktop environments again. However, the Rocky 8 login nodes (i.e., login23-3, login23-x-1) are still awaiting the respective patches. Until the patches are available, the usage of these environments remains limited.

24.06.2025 11:01

The Rocky 8 frontend nodes received the pending update and can now be used with Mate and XFCE again.

03.07.2025 13:02

Login via RWTH Single Sign-On not working

Outage
Thursday 06/26/2025 03:45 PM - Thursday 06/26/2025 04:16 PM

There are currently disruptions to RWTH Single Sign-On. After entering your login details, the screen loads and then an Internal Server Error appears. This affects all services that use the RWTH Single Sign-On login.
The specialist department has been informed and is working to resolve the issue.

26.06.2025 15:53
Updates

The problem has been fixed.

26.06.2025 16:16

HPC JupyterHub down for maintenance

Maintenance
Wednesday 06/25/2025 09:00 AM - Wednesday 06/25/2025 11:00 AM

The HPC JupyterHub will be down for maintenance between 9:00 and 11:00.

24.06.2025 15:53

Upgrade of Login Nodes

Partial Maintenance
Monday 06/23/2025 06:00 AM - Wednesday 06/25/2025 09:51 AM

On Monday, June 23rd, the CLAIX-2023 Rocky 9 frontend nodes login23-1, login23-2 and login23-4 will be upgraded to Rocky Linux 9.6.
During the upgrade, these nodes will temporarily not be available. Please use the Rocky 8 frontend nodes in the meantime.

20.06.2025 13:44
Updates

login23-g-1 and login23-x-2 will be reinstalled with Rocky Linux 9.6. Until the reinstallation is finished, these nodes are temporarily unavailable.

23.06.2025 05:50

The GPU login node login23-g-1 is reinstalled and available with Rocky 9.6. The other nodes require further work.

23.06.2025 08:56

The login nodes login23-1, login23-2, login23-4 are updated and available again. login23-x-2 requires some further work.

23.06.2025 10:33

All Rocky 9 login nodes have been upgraded to Rocky Linux 9.6.

25.06.2025 09:51

Job submission to some new Slurm projects/accounts fails

Partial Outage
Saturday 06/07/2025 10:00 AM - Monday 06/16/2025 02:16 PM

Our second Slurm controller machine has lost memory modules and is currently pending maintenance.
This machine is responsible for keeping our Slurm DB in a current state.
Without it, submission to some new Slurm projects/accounts will fail.
Please open a ticket if the problem persists.

10.06.2025 14:04

login23-2 currently not available

Partial Outage
Thursday 06/12/2025 05:30 PM - Monday 06/16/2025 10:20 AM

login23-2 is currently not available. Please use one of the other dialog systems.

12.06.2025 17:39
Updates

login23-2 is available again.

16.06.2025 10:22

System maintenance for the hpcwork filesystem

Partial Maintenance
Wednesday 06/11/2025 07:00 AM - Thursday 06/12/2025 05:40 PM

During the maintenance, the dialog systems will continue to be available
most of the time but *without access to the hpcwork directories*.
Batch jobs will not be running.

02.06.2025 14:59
Updates

Due to unexpected problems with the filesystem update, we will have to prolong the maintenance until tomorrow. As of now, we presume that the problem will be solved by noon and will release the batch queues as soon as possible. We apologize for the inconvenience.

11.06.2025 16:42

All maintenance tasks have been completed and both HPCWORK and the batch queues are operational again.

12.06.2025 17:57

HPC password change unavailable

Partial Outage
Thursday 06/12/2025 10:15 AM - Thursday 06/12/2025 11:00 AM

Due to issues with the new RegApp version, changing the HPC password may fail.

12.06.2025 10:20
Updates

The issue has been identified and a fix has been deployed.

12.06.2025 11:06

RegApp Maintenance

Partial Maintenance
Wednesday 06/11/2025 09:30 AM - Wednesday 06/11/2025 10:00 AM

The RegApp software will be updated to the newest version. During this time, login to the frontends and perfmon may be unavailable. It is recommended to log in beforehand.

04.06.2025 14:52

cp command of hpcwork files may fail on Rocky Linux 9 systems

Partial Outage
Thursday 06/05/2025 11:30 AM - Wednesday 06/11/2025 07:00 AM

On Rocky Linux 9 systems (especially login23-1 or login23-4), copying hpcwork files with the cp command may fail with the error "No data available". Current workarounds: either use "cp --reflink=never ..." to copy files, or run the cp command on one of the Rocky Linux 8 nodes, e.g. copy23-1 or copy23-2.
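To illustrate the first workaround mentioned above, the copy can be forced to avoid reflink cloning; the file path below is only a placeholder:

    # force a regular copy instead of a reflink clone
    cp --reflink=never $HPCWORK/results/output.dat $HOME/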

05.06.2025 11:44
Updates

During system maintenance on 11.06. we will install a new version of the filesystem client software which will fix the problem. Until then please use "cp --reflink=never ..." to copy hpcwork files or copy files on Rocky 8 systems (e.g. copy23-1 or copy23-2).

06.06.2025 11:01

Routing via new XWiN routers

Maintenance
Tuesday 06/10/2025 09:00 PM - Tuesday 06/10/2025 10:13 PM

During this period, the routing of the previous XWiN routers (Nexus 7700) will be switched to the new XWiN routers (Catalyst 9600). These routers are essential for RWTH's network connection. This changeover also requires the migration of the DFN connection, which is switched redundantly to Frankfurt and Hannover, and the RWTH firewall to the new systems.
There will be complete or partial outages of the external connection during the maintenance window. All RWTH services (e.g. VPN, email, RWTHonline, RWTHmoodle) will not be available during this period. The accessibility of services within the RWTH network will be temporarily unavailable due to limited DNS functionality.

07.05.2025 12:47
Updates

The uplink to Frankfurt has been successfully switched over to the new system.

10.06.2025 21:07

Migration of the uplink to Hannover is starting.

10.06.2025 21:36

The uplink to Hannover has been migrated to the new system.

10.06.2025 21:45

BGP v4/v6 to Frankfurt and Hannover is now functional via the new routers.

10.06.2025 21:57

A few minor follow-up tasks remain.

10.06.2025 22:03

The maintenance is complete. Traffic is now running entirely over the new routers!

10.06.2025 22:12

A problem with the connection to the Physics department has been identified; it will be resolved tomorrow morning.

10.06.2025 23:50

Partial Migration of CLAIX-2023 to Rocky Linux 9

Maintenance
Monday 06/02/2025 09:00 AM - Monday 06/02/2025 07:00 PM

During the maintenance, we will migrate half of the CLAIX-2023 nodes (ML & HPC) and the entire devel partition to Rocky Linux 9. Additionally, the dialog systems login23-1, login23-x-1, and login23-x-2 will be migrated.

30.05.2025 11:14

Currently no home and work directories on login23-2

Partial Outage
Friday 05/16/2025 11:30 AM - Friday 05/23/2025 03:00 PM

The home and work directories are currently not available on login23-2. Please switch to one of the other dialog systems.

16.05.2025 11:38

Swapping MFA backend of RegApp

Partial Maintenance
Wednesday 05/21/2025 03:00 PM - Wednesday 05/21/2025 04:00 PM

We will change the MFA backend of the Regapp. Therefore, new logins will not be possible during the maintenance.

16.05.2025 15:31

Swapping MFA backend of RegApp

Maintenance
Wednesday 05/14/2025 12:45 PM - Wednesday 05/14/2025 12:45 PM

We will change the MFA backend of the Regapp. Therefore, new logins will not be possible during the maintenance.

09.05.2025 08:37
Updates

We have to postpone the maintenance.

14.05.2025 12:55

Cooling disrupted, emergency shutdown of CLAIX23

Outage
Monday 05/12/2025 01:15 PM - Monday 05/12/2025 04:15 PM

We have a problem with the external cooling system. We therefore have to perform an emergency shutdown of CLAIX23.

12.05.2025 13:24
Updates

We are now powering the cluster on again. The frontend (login) nodes will be available again soon. Until further notice, the batch system for CLAIX23 remains stopped. We hope to resolve the issue today.

12.05.2025 14:49

HPC JupyterHub maintenance

Maintenance
Tuesday 05/06/2025 09:00 AM - Tuesday 05/06/2025 05:00 PM

The HPC JupyterHub will be down for maintenance to update some internal software packages.

02.05.2025 10:14

Access to hpcwork directories hanging

Partial Outage
Monday 05/05/2025 04:00 PM - Monday 05/05/2025 05:15 PM

Currently the access to the hpcwork directories may hang due to problems with the file servers. The problem is being worked on.

29.04.2025 10:02
Updates

Access to $HPCWORK should work again now.

29.04.2025 10:47

The access to the hpcwork directories may hang again.

05.05.2025 16:48

Access to $HPCWORK should work again.

06.05.2025 08:01

HPCJupyterHub Profile Installation Unavailable

Partial Outage
Friday 03/28/2025 01:15 PM - Tuesday 04/29/2025 11:27 AM

The installation of HPCJupyterHub profiles is currently unavailable due to issues with our HPC container system.
A downtime of at least two weeks is expected. We apologize for the inconvenience.
Update: New profiles can now be installed, after changes to Apptainer.

28.03.2025 13:30

System Upgrade of Login Node

Partial Maintenance
Monday 03/24/2025 05:00 AM - Thursday 04/17/2025 03:25 PM

The CLAIX-2023 login node login23-4 will be upgraded to Rocky Linux 9.5 to assist the migration to Rocky Linux 9. During the version upgrade, the login node will not be available.

21.03.2025 11:20
Updates

The CLAIX-2023 dialog node login23-4 is currently unavailable due to testing related to the planned migration to Rocky Linux 9. Please use another dialog node until access to this node is available again.

25.03.2025 15:01

The new Rocky 9 login node login23-4 will be available soon. We are resolving the last issues before it can be used for the Rocky 9 pilot phase.

17.04.2025 15:22

The Rocky 9 login node login23-4 is available now.

17.04.2025 15:27

SSH Command Key Approval Currently Not Possible

Notice
Thursday 04/10/2025 08:00 AM - Wednesday 04/16/2025 02:35 PM

Due to a bug in the RegApp, SSH command keys cannot be approved at the moment. We are working on a solution.
Update 16.04 14:35:
The cause of the bug has been identified and fixed.

10.04.2025 14:57

[RegApp] hotfix deployment

Partial Maintenance
Wednesday 04/16/2025 02:30 PM - Wednesday 04/16/2025 02:35 PM

The bug affecting SSH command key approval has been identified.
A hotfix will be deployed momentarily; two-factor logins on the HPC will be temporarily unavailable.

16.04.2025 14:13

Maintenance of the RegApp application

Partial Maintenance
Wednesday 04/09/2025 09:00 AM - Wednesday 04/09/2025 11:00 AM

No new logins are possible during the maintenance work. Users who are already logged in will not be disturbed and the maintenance will not affect the rest of the cluster. All frontends are still available, as is the batch system with its computing nodes.

10.03.2025 11:02

Bad Filesystem Performance

Partial Outage
Thursday 01/23/2025 12:30 PM - Friday 03/28/2025 09:31 AM

At the moment, we are observing issues with the file systems that impact performance. File access can consequently be severely delayed. The issue is currently under investigation.

24.01.2025 14:38
Updates

We were not able to identify the root cause of the observed issues and are still working on a solution.

10.03.2025 13:30

We changed some network settings on the GPFS servers, but that did not improve the situation. We are still working with the manufacturer on a solution.

20.03.2025 12:08

Additional changes have been made to the configuration. Initial tests tend to show an improvement in the GPFS performance. A full qualitative analysis of the performance is pending.

21.03.2025 11:23

The performance of the HOME and WORK filesystems is much better now.

28.03.2025 09:33

Global Maintenance of CLAIX-2023

Maintenance
Wednesday 03/05/2025 08:00 AM - Friday 03/07/2025 03:00 PM

Due to maintenance tasks on a cluster-wide scale that cannot be performed online, the whole batch service will be suspended for the maintenance.
The RegApp is also affected by the maintenance.

18.02.2025 14:56
Updates

The infrastructure of the HOME and WORK filesystems will also be under maintenance. Hence, the frontend nodes will be inaccessible as well until the pending tasks are completed.

26.02.2025 06:19

Due to delays in servicing the file systems, the cluster maintenance needs to be prolonged.

06.03.2025 12:38

Due to delays in deploying the updated file system, the maintenance had to be prolonged.

07.03.2025 13:57

Jupyterhub Temporarily Unavailable Due To Kernel Update

Partial Maintenance
Monday 02/17/2025 06:00 AM - Monday 02/17/2025 09:00 AM

Due to a scheduled kernel update, the jupyterhub node is temporarily unavailable. The node will be available again, as soon as the update is completed.

17.02.2025 06:16

Brief SSO disruption

Outage
Wednesday 02/12/2025 08:25 PM - Wednesday 02/12/2025 09:00 PM

Between around 20:25 and 21:00 this evening, the SSO service experienced degraded performance and was being checked by the responsible department during this time. The problem has been solved and fast logins without waiting times are possible again.

12.02.2025 21:41

Login nodes unavailable due to kernel update

Partial Maintenance
Monday 02/10/2025 05:45 AM - Monday 02/10/2025 07:45 AM

Due to a scheduled kernel update, the login nodes, i.e. login23-*, are temporarily unavailable during the update. The nodes will be available again as soon as the update is completed.

10.02.2025 00:47
Updates

The maintenance had to be prolonged by a few minutes.

10.02.2025 07:25

Some CLAIX23 Nodes Unavailable Due To Update Issues

Partial Maintenance
Friday 01/24/2025 07:00 AM - Friday 01/24/2025 01:35 PM

During a kernel update, issues occurred which led to delays in re-deploying the nodes to the batch service. Consequently, a large number of compute nodes is currently unavailable (reserved and/or down). We are working on handling the affected nodes and will release them as fast as possible.

24.01.2025 09:52
Updates

The updates have been completed. The nodes are available again.

24.01.2025 14:34

HPC JupyterHub unavailable

Outage
Wednesday 01/22/2025 01:30 PM - Wednesday 01/22/2025 03:14 PM

The HPC JupyterHub is currently unavailable due to unforeseen errors in the filesystems.
A solution is being worked on.

22.01.2025 13:41

Node unavailable due to kernel update

Partial Maintenance
Monday 01/20/2025 07:45 AM - Monday 01/20/2025 08:45 AM

The kernel of the CLAIX23 copy node claix23-2 has to be updated. The node will be available again as soon as the update is finished.

20.01.2025 07:46

RegApp disruption

Outage
Tuesday 01/14/2025 09:00 AM - Tuesday 01/14/2025 04:40 PM

It is currently not possible to change your HPC password in the RegApp.
Creating new accounts is not possible.
Update 16:40:
The password problem has been fixed and the RegApp service is operating as usual again.

14.01.2025 15:25

RegApp maintenance

Maintenance
Tuesday 01/14/2025 08:00 AM - Tuesday 01/14/2025 10:10 AM

A short maintenance of the RegApp will take place during the stated period. No logins are possible during this time.
Update:
The RegApp has been reachable again since 9:00, but communication with the second factor is disrupted. Accordingly, the 2FA login to the HPC does not work yet.
Update 10:10:
The disruption has been resolved; the 2FA login to the HPC works again.

06.01.2025 15:31

Maintenance of the RegApp

Maintenance
Monday 12/23/2024 09:00 AM - Monday 12/23/2024 09:30 AM

A short maintenance of the RegApp will take place during the stated period. During this time, no login to the RegApp is possible.

19.12.2024 11:04

Slurm hiccup possible

Warning
Thursday 12/19/2024 03:50 PM - Thursday 12/19/2024 04:00 PM

We are migrating the Slurm controller to a new host. Short timeouts may occur. We will try to minimize these as much as possible.

19.12.2024 09:53
Updates

The first attempt was not successful; we are back on the old master. We are analyzing the problems that occurred and will try again later.

19.12.2024 10:01

We are making another attempt.

19.12.2024 15:52

Issues regarding Availability

Partial Outage
Monday 12/02/2024 11:15 AM - Wednesday 12/18/2024 04:00 PM

There may currently be login problems with various login nodes. We are working on a solution.

02.12.2024 11:58
Updates

The observed issues affect the batch service as well. Consequently, many batch jobs may have failed.

02.12.2024 12:18

The observed problems can be traced back to power issues. We cannot rule out that further systems may have to be shut down temporarily in a controlled manner as part of the problem resolution. However, we hope the issues can be resolved without any additional measures.

02.12.2024 14:31

The cluster can be accessed again; cf. ticket https://maintenance.itc.rwth-aachen.de/ticket/status/messages/14/show_ticket/9521 for further details.
Several nodes, however, are still unavailable due to the consequences of the aforementioned issues. We are currently working on resolving them.

04.12.2024 12:57

InfiniBand issues leading to unreachable nodes

Partial Outage
Friday 12/13/2024 09:30 PM - Monday 12/16/2024 10:57 AM

Due to InfiniBand problems that still need to be analyzed, many nodes, including the whole GPU cluster, are not reachable at the moment. We are working together with the manufacturer to solve the problems.

16.12.2024 08:04
Updates

The problem has been fixed.

16.12.2024 10:57

Single sign-on and MFA malfunction

Partial Outage
Wednesday 12/11/2024 08:00 AM - Wednesday 12/11/2024 09:30 AM

At the moment, single sign-on and multi-factor authentication are sporadically disrupted. We are already working on a solution and ask for your patience.

11.12.2024 08:12

Nodes drained due to filesystem issues.

Partial Outage
Sunday 12/08/2024 06:45 AM - Sunday 12/08/2024 05:15 PM

Dear users, on Sunday, 08.12.2024, at 06:53 AM, the home filesystems of most nodes went offline.
This may have crashed some jobs, and no new jobs can start during the downtime.
We are actively working on the issue.

08.12.2024 16:22
Updates

Most nodes are coming back online. Apologies for the troubles. We expect most nodes to be usable by 18:00.

08.12.2024 16:56

lost filesystem $HOME $WORK connection

Notice
Thursday 12/05/2024 10:00 PM - Friday 12/06/2024 10:00 AM

Due to a problem in our network, some nodes lost their connection to the $HOME and $WORK file system. This included the login23-1 and login23-2 nodes. The issue has been resolved now.

06.12.2024 14:00

Emergency Shutdown of CLAIX-2023 Due to Failed Cooling

Outage
Monday 12/02/2024 03:15 PM - Wednesday 12/04/2024 06:00 AM

CLAIX-2023 was shut down in an emergency to prevent damage to the hardware. Due to severe power issues, the cooling facilities failed and could not provide sufficient heat dissipation.
The cluster will be operational again once the underlying issues have been resolved.

02.12.2024 15:20
Updates

Both CDUs are active again and cooling could be restored. The cluster will be booted again for damage analysis only. Until further notice, the batch service remains suspended until all issues are resolved and all power security checks are positive.

03.12.2024 10:34

The cooling system is now fully operational again. Additionally, we have implemented further measures to enhance stability in the future. The queues were reopened last night; however, we are currently conducting a detailed investigation into some specific nodes regarding their cooling and performance.
Once these investigations are complete, the affected nodes will be made available through the batch system again.

04.12.2024 10:41

New file server for home and work directories

Maintenance
Friday 11/22/2024 12:00 PM - Tuesday 11/26/2024 01:45 PM

We are putting a new file server for the home and work directories into operation. For this purpose we will carry out a system maintenance in order to finally synchronise all data over the weekend.

20.11.2024 09:32
Updates

The maintenance needs to be extended.

25.11.2024 13:32

Due to some issues preventing a normal batch service, the maintenance had to be extended.

26.11.2024 13:33

Limited Usability of CLAIX-2023

Partial Outage
Thursday 11/21/2024 09:45 AM - Thursday 11/21/2024 05:15 PM

Due to concurrent external issues regarding the RWTH Aachen network, access and usability of the Compute Cluster is limited at the moment.
The respective network department is currently working on a solution.

21.11.2024 12:47
Updates

The issues could not yet be resolved and may persist throughout tomorrow as well.

21.11.2024 16:00

The issues have been resolved.

22.11.2024 06:52

Power disruption / Stromausfall

Partial Outage
Friday 11/15/2024 05:00 PM - Saturday 11/16/2024 02:00 PM

At 17:00, there was a brief interruption of the power supply in the Aachen area. Power is available again; however, most of the compute nodes consequently went down. It is currently unclear when the service can be resumed. At the moment, critical services are receiving special attention and are being restored where required.

15.11.2024 18:43
Updates

After restoring critical operational infrastructure services, the HPC service is resumed. However, a large portion of the GPU nodes remain unavailable until further notice due to the impact of the blackout.

15.11.2024 21:06

The majority of the ML systems (GPUs) were restarted today and are back in batch operation.

16.11.2024 14:04

Scheduler Hiccup

Outage
Thursday 11/14/2024 10:45 AM - Thursday 11/14/2024 10:55 AM

Our Slurm workload manager crashed for an unknown reason. Functionality was restored at short notice. Further investigations are ongoing.

14.11.2024 10:59

GPU Malfunction on GPU Login Node

Partial Outage
Tuesday 11/12/2024 09:15 AM - Tuesday 11/12/2024 10:35 AM

Currently, a GPU of the GPU login node login23-g-1 shows an issue. The node is unavailable until the issue is resolved.

12.11.2024 09:29
Updates

The issues could be resolved.

12.11.2024 10:36

Login malfunction

Partial Outage
Wednesday 10/23/2024 05:00 PM - Thursday 10/24/2024 08:40 AM

It is currently not possible to log in to the login23-* frontends. There is a problem with two-factor authentication.

24.10.2024 08:09

Top500 run for GPU nodes

Notice
Friday 09/27/2024 08:00 AM - Friday 09/27/2024 11:00 AM

We are performing a new Top500 run for the ML partition of CLAIX23.
The GPU nodes will not be available during that run.
Other nodes and login23-g-1 might also be unavailable:
i23m[0027-0030],r23m[0023-0026,0095-0098,0171-0174],n23i0001
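Whether any of the listed nodes are currently drained or down can be checked with the scheduler's node overview; this is a generic sketch using standard Slurm options and the node list from above:

    # show the current state of, and the reason for, the affected nodes
    sinfo --nodes=i23m[0027-0030],r23m[0023-0026,0095-0098,0171-0174],n23i0001 --format="%N %T %E"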

26.09.2024 15:03

Expired certificate

Outage
Monday 09/23/2024 08:00 AM - Monday 09/23/2024 09:03 AM

Due to the expired certificate for idm.rwth-aachen.de, neither the IdM applications nor the applications that use RWTH Single Sign-On can be accessed.
We are working on a solution.
- An insecure connection message is displayed when calling up IdM applications.
- When calling up applications with access via RWTH Single Sign-On, a message about missing authorisations is displayed.

23.09.2024 08:11
Updates

The certificate has been updated and the applications can be accessed again. Please delete the browser cache before accessing the pages again.

23.09.2024 09:06

RegApp disruption -> no login to the cluster possible

Partial Outage
Monday 09/09/2024 10:45 AM - Monday 09/09/2024 11:30 AM

Unfortunately, there was a disruption of the RegApp during the stated period, so it was not possible to log in to the cluster frontends. Existing connections were not affected. The problem has been resolved.

09.09.2024 15:40

Old ticket without a title

Partial Maintenance
Monday 09/09/2024 08:00 AM - Monday 09/09/2024 09:00 AM

The copy23-2 data transfer system will be unavailable for maintenance.

02.09.2024 09:01
Updates

The maintenance is completed.

09.09.2024 09:00

Firmware Update of InfiniBand Gateways

Partial Maintenance
Thursday 09/05/2024 03:00 PM - Friday 09/06/2024 12:15 PM

The firmware of the InfiniBand gateways will be updated. The firmware update will be performed in the background and should not cause any interruption of service.

05.09.2024 15:31
Updates

The updates are completed.

06.09.2024 13:19

Hosts of RWTH Aachen University partly not accessible from networks of other providers

Partial Outage
Saturday 08/24/2024 08:15 PM - Sunday 08/25/2024 09:00 PM

Due to DNS disruption, the name servers of various providers are currently not returning an IP address for hosts under *.rwth-aachen.de.
As a workaround, you can configure alternative DNS servers in your connection settings, e.g. the Level3 name servers (4.2.2.2 and 4.2.2.1) or Comodo (8.26.56.26 and 8.20.247.20). It may also be possible to reach the RWTH VPN server, in which case please use VPN.

25.08.2024 10:34
Updates

Instructions for configuring an alternative DNS server under Windows can be found via the following links:
https://www.ionos.de/digitalguide/server/konfiguration/windows-11-dns-aendern/
https://www.netzwelt.de/galerie/25894-dns-einstellungen-windows-10-11-aendern.html
You can also use VPN as an alternative. If you cannot reach the VPN server, you can adjust the hosts file under Windows according to the following instructions. This will allow you to reach the server vpn.rwth-aachen.de. To do this, the following entry must be added:
134.130.5.231 vpn.rwth-aachen.de
https://www.windows-faq.de/2022/10/04/windows-11-hosts-datei-bearbeiten/

25.08.2024 13:20

The hosts of RWTH Aachen University can now be reached again from outside the RWTH network.

25.08.2024 21:10

Individual users may have experienced problems even after the fault was rectified on 25 August at 9 pm. On 26.8. at 9 a.m. all follow-up work was completed, so there should be no further problems.

26.08.2024 15:25

MPI/CPU Jobs Failed to start overnight

Partial Outage
Monday 08/19/2024 05:15 PM - Tuesday 08/20/2024 08:15 AM

Many nodes suffered an issue after our updates on 19.08.2024, resulting in jobs failing on the CPU partitions.
If your job failed to start or failed on startup, please consider requeuing it if necessary (see the example after the list). The following jobs were identified as possibly being affected by the issue:
48399558,48468084,48470374,48470676,48473716,48473739,48473807,48473831,
48475599,48475607_0,48475607_1,48475607_2,48475607_3,48475607_4,48475607_5,
48475607_6,48475607_7,48475607_8,48475607_9,48475607_10,48475607_11,48475607_12,
48475607_13,48475607_14,48475607_15,48475607_16,48475607_17,48475607_18,48475607_19,
48476753,48482255,48485168,48486404,48488874_5,48488874_6,48488874_7,48488874_8,
48488874_9,48488874_10,48488874_11,48488875_9,48488875_10,48488875_11,48489133_1,
48489133_2,48489133_3,48489133_4,48489133_5,48489133_6,48489133_7,48489133_8,48489133_9,
48489133_10,48489154_0,48489154_1,48489154_2,48489154_3,48489154_4,48489154_5,48489154_6,48489154_7,
48489154_8,48489154_9,48489154_10,48489154_11,48489154_12,48489154_13,48489154_14,48489154_15,
48489154_16,48489154_17,48489154_18,48489154_19,48489154_20,48489154_21,48489154_22,48489154_23,
48489154_24,48489154_25,48489154_26,48489154_27,48489154_28,48489154_29,48489154_30,48489154_31,
48489154_32,48489154_33,48489154_34,48489154_35,48489154_36,48489154_37,48489154_38,48489154_39,
48489154_40,48489154_41,48489154_42,48489154_43,48489154_44,48489154_45,48489154_46,48489154_47,
48489154_100,48489154_101,48489154_102,48489154_103,48489154_104,48489154_105,48489154_106,48489154_107,
48489154_108,48489154_109,48489154_110,48489154_111,48489154_112,48489154_113,48489154_114,48489154_115,
48489154_116,48489154_117,48489154_118,48489154_119,48489154_120,48489154_121,48489154_122,48489154_123,
48489154_124,48489154_125,48489154_126,48489154_127,48489154_128,48489154_129,48489154_130,48489154_131,
48489154_132,48489154_133,48489154_134,48489154_135,48489154_136,48489154_137,48489154_138,48489154_139,
48489154_140,48489154_141,48489154_142,48489154_143,48489154_144,48489154_145,48489154_146,48489154_147,
48489154_148,48489154_149,48489154_150,48489154_151,48489154_152,48489154_153,48489154_154,48489154_155,
48489154_156,48489154_157,48489154_158,48489154_159,48489154_160,48489154_161,48489154_162,48489154_163,
48489154_164,48489154_165,48489154_166,48489154_167,48489154_168,48489154_169,48489154_170,48489154_171,
48489154_172,48489154_173,48489154_174,48489154_175,48489154_176,48489154_177,48489154_178,48489154_179,
48489154_180,48489154_181,48489154_182,48489154_183,48489154_184,48489154_185,48489154_186,48489154_187,
48489154_188,48489154_189,48489154_190,48489154_191,48489154_192,48489154_193,48489154_194,48489154_195,
48489618_1,48489618_2,48489618_3,48489618_4,48489618_5,48489618_6,48489618_7,48489618_8,48489618_9,48489618_10,
48489776,48489806_6,48489806_55,48489806_69,48489806_98,48489842,48489843,48489844,48489845,48489882_1,48489882_2,
48489882_3,48489882_4,48489882_5,48489882_6,48489882_7,48489882_8,48489882_9,48489882_10,48494481,48494490,48494752,
48494753,48494754,48494755,48494756,48494757,48494758,48494759,48494760
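If one of your jobs appears in the list above, it can be put back into the queue; this is a minimal sketch using standard Slurm commands, where the job IDs are taken from the list and jobscript.sh is a placeholder:

    # requeue a batch job by its ID, provided it is still known to the scheduler
    scontrol requeue 48399558
    # array tasks can be addressed individually
    scontrol requeue 48475607_3
    # if the job has already left the queue, simply resubmit the original script
    sbatch jobscript.sh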

20.08.2024 11:34

Maintenance

Maintenance
Monday 08/19/2024 07:00 AM - Monday 08/19/2024 04:00 PM

Due to updates to our compute nodes, the HPC system will be unavailable for maintenance.
The login nodes will be available at noon without interruptions, but the batch queue for jobs won't be usable during the maintenance work.
As soon as the maintenance work has been completed, batch operation will be enabled again.
These jobs should be requeued if necessary:
48271714,48271729,48271731,48463405,48463406,48463407,48466930,
48466932,48468086,48468087,48468088,48468089,48468090,48468091,
48468104,48468105,48468108,48468622,48469133,48469262,48469404,
48469708,48469734,48469740,48469754,48469929,48470011,48470017,
48470032,48470042,48470045,48474641,48474666,48475362,48489829,
48489831,48489833_2,48489838

09.08.2024 11:01

Old HPCJupyterHub GPU profiles might run slower on the new c23g nodes.

Notice
Friday 05/24/2024 11:00 AM - Friday 08/09/2024 01:46 PM

Please migrate your notebooks to work with newer c23 GPU Profiles!
The migration of the GPU profiles to Claix 2023 and the new c23g nodes has caused the old Python packages to use non-optimal settings on the new GPUs.
Redeployment of these old profiles is necessary and will take some time.

24.05.2024 11:15

MPI jobs may crash

Partial Outage
Tuesday 07/16/2024 04:12 PM - Thursday 08/01/2024 09:15 AM

Since the cluster maintenance, random MPI job crashes are observed. We are currently investigating the issue and are working on a solution.

22.07.2024 09:37
Updates

We have identified the issue and are currently testing workarounds with the affected users.

24.07.2024 12:41

After successful tests with affected users, we have rolled out a workaround that automatically prevents this issue for our IntelMPI installations. We advise users to remove any custom workarounds from their job scripts to ensure compatibility with future changes.

01.08.2024 10:28

c23i Partition is DOWN for the HPC JupyterHub

Partial Outage
Thursday 07/18/2024 03:15 PM - Monday 07/29/2024 10:14 AM

The c23i partition is DOWN due to unforeseen behaviour of our monitoring system, which automatically marks the only node in the partition as down.
A solution is currently unknown and will be investigated.
The HPC JupyterHub will not be able to use the partition until this is resolved.

18.07.2024 15:29

Temporary Deactivation of User Namespaces

Partial Outage
Monday 07/08/2024 02:15 PM - Thursday 07/18/2024 01:00 PM

Due to a security vulnerability in the Linux Kernel, user namespaces are temporarily deactivated. Upon the kernel update, user namespaces can be used again.

08.07.2024 14:32
Updates

User namespaces are available again.

18.07.2024 13:00

Quotas on HPCWORK may not work correctly

Partial Outage
Thursday 06/27/2024 02:30 PM - Thursday 07/18/2024 12:30 PM

The quota system on HPCWORK may not work correctly. The error "Disk quota exceeded" may occur when trying to create files, even though the r_quota command reports that enough quota should be available. The supplier of the filesystem has been informed and is working on a solution.

27.06.2024 14:40
Updates

File quotas for all hpcwork directories were increased to one million.

18.07.2024 12:39

Reconfiguration of File Systems and Kernel Update

Maintenance
Monday 07/15/2024 07:00 AM - Tuesday 07/16/2024 04:11 PM

During the maintenance, $HPCWORK will be reconfigured so that it can be accessed from the CLAIX23 nodes via RDMA over InfiniBand instead of over Ethernet. At the same time, the kernel will be updated. After the kernel update, the previously deactivated user namespaces will be re-activated.

10.07.2024 09:43
Updates

The maintenance had to be extended for final filesystem tasks.

15.07.2024 15:24

Due to unforeseen problems, the maintenance has to be extended until tomorrow, 16.07.2024, 18:00. We do not expect the filesystem manufacturer to need that long and expect to open the cluster earlier.

15.07.2024 17:24

The maintenance has been completed successfully. Once again, sorry for the long delay.

16.07.2024 16:12

HPCJupyterHub down due to update to 5.0.0

Outage
Wednesday 06/26/2024 03:00 PM - Thursday 06/27/2024 04:00 PM

HPCJupyterHub is down after a failed update to 5.0.0 and will stay down until the update is complete.
HPCJupyterHub could not be updated to 5.0.0 and remains at 4.1.5.

26.06.2024 15:04

FastX web servers on login18-x-1 and login18-x-2 stopped

Warning
Wednesday 05/15/2024 02:00 PM - Thursday 06/27/2024 02:26 PM

The FastX web servers on login18-x-1 and login18-x-2 have been stopped, i.e. the addresses https://login18-x-1.hpc.itc.rwth-aachen.de:3300 and https://login18-x-2.hpc.itc.rwth-aachen.de:3300 are not available anymore. Please use login23-x-1 or login23-x-2 instead.

15.05.2024 14:38
Updates

login18-x-1 and login18-x-2 have been decommissioned.

27.06.2024 14:29

Maintenance

Maintenance
Wednesday 06/26/2024 08:00 AM - Wednesday 06/26/2024 04:00 PM

Due to maintenance work on the water cooling system, Claix23 must be empty during the specified period. As soon as the maintenance work has been completed, batch operation will be enabled again. The dialog systems are not affected by the maintenance work.

12.06.2024 07:48
Updates

Additionally, between 10 and 11 o'clock, there will be a maintenance of the RegApp. During this time, new logins will not be possible; existing connections will not be disturbed.

25.06.2024 14:04

Upgrade to Rocky Linux 8.10

Partial Maintenance
Thursday 06/13/2024 11:15 AM - Wednesday 06/26/2024 04:00 PM

Because Rocky 8.9 has reached its end of life, the MPI nodes of CLAIX23 must be upgraded to Rocky 8.10. The upgrade is performed in the background during production to minimize the downtime of the cluster. However, during the upgrade, free nodes will be selectively removed and will not be available for job submission until the upgrade is completed.
Please keep in mind that the installed library versions will likely change with the update. Thus, performance and application behaviour may vary compared to earlier runs.

13.06.2024 11:49
Updates

Starting now, all new jobs will be scheduled to Rocky 8.10 nodes. The remaining nodes that still need to be updated are unavailable for job submission. These nodes will be upgraded as soon as possible after their jobs' completion.

14.06.2024 18:22

The update of the frontend and batch nodes is completed. The remaining nodes (i.e. integrated hosting and service nodes) will be updated during the cluster maintenance scheduled for 2024-06-26.

20.06.2024 08:49

Update of Frontend Nodes

Partial Maintenance
Wednesday 06/26/2024 08:00 AM - Wednesday 06/26/2024 10:00 AM

The dialog nodes (i.e. login23-1/2/3/4, login23-x-1/2) will be updated to Rocky 8.10 today within the weekly reboot. The upgrade of copy23-1/2 will follow.

17.06.2024 05:08
Updates

The copy frontend nodes (copy23-1, copy23-2) will be updated to Rocky Linux 8.10 during the cluster maintenance on 2024-06-26.

24.06.2024 09:13

The update of the remaining frontend nodes is completed.

26.06.2024 11:12

Old ticket without a title

Partial Outage
Tuesday 06/25/2024 10:00 AM - Tuesday 06/25/2024 05:00 PM

Due to technical problems, it is not possible to create, change, or delete HPC accounts or projects. We are working on this issue.

25.06.2024 16:11

Error on user/project management

Partial Outage
Thursday 06/20/2024 10:00 AM - Monday 06/24/2024 10:32 AM

Due to technical problems, it is not possible to create, change, or delete HPC accounts or projects. We are working on this issue.

20.06.2024 12:09
Updates

The issue has been resolved.

24.06.2024 10:33

Project management

Partial Outage
Wednesday 05/29/2024 03:30 PM - Wednesday 06/12/2024 04:30 PM

During this period no RWTH-S, THESIS, LECTURE or WestAI projects can be granted. We apologize for the inconvenience.

29.05.2024 15:42

RegApp Maintenance

Partial Maintenance
Wednesday 06/12/2024 09:00 AM - Wednesday 06/12/2024 10:00 AM

Due to maintenance of the RegApp Identity Provider, it is not possible to establish new connections to the cluster during the specified period. Existing connections and batch operation are not affected by the maintenance.

04.06.2024 14:28

Deactivation of User Namespaces

Notice
Wednesday 03/27/2024 08:15 AM - Monday 04/29/2024 06:00 PM

Due to an open security issue we are required to disable the feature of so-called user namespaces on the cluster. This feature is mainly used by containerization software and affects the way apptainer containers will behave. The changes are effective immediately. Most users should not experience any interruptions. If you experience any problems, please contact us as usual via servicedesk@itc.rwth-aachen.de with a precise description of the features you are using. We will reactivate user namespaces as soon as we can install the necessary fixes for the aforementioned vulnerability.

27.03.2024 08:14
Updates

A kernel update addressing the issue has been released upstream and will be available on the compute cluster soon. Once the update is applied, user namespaces can be enabled again.

04.04.2024 11:11

We are planning to re-enable user namespaces on April 29th after some final adjustments.

24.04.2024 17:22

Performance Problems on HPCWORK

Notice
Monday 04/08/2024 11:00 AM - Wednesday 04/24/2024 05:00 PM

We are currently registering recurring performance degradation on HPCWORK directories, which might be partly worsened by the ongoing migration process leading up to the filesystem migration on April 17th. The problems cannot be traced back to a single cause but are being actively investigated.

12.04.2024 11:35
Updates

Due to technical problems, we will have to postpone the maintenance (and the final lustre migration step) to 23.04.2024 07:00.

16.04.2024 16:21

HPC JupyterHub update

Maintenance
Tuesday 04/23/2024 07:00 AM - Wednesday 04/24/2024 12:00 PM

During the Claix HPC system maintenance, the HPC JupyterHub will be updated to a newer version.
This will improve Claix 2023 support and include mandatory security updates.
The whole cluster needs to be updated with a new kernel.

23.04.2024 07:03
Updates

The migration was successfully completed.

24.04.2024 13:40

Migration from lustre18 to lustre22

Partial Maintenance
Tuesday 04/23/2024 07:00 AM - Wednesday 04/24/2024 12:00 PM

In the last weeks, we started migrating all HPCWORK data to a new filesystem. In this Maintenance we will do the final migration step. HPCWORK will not be available during this maintenance.

10.04.2024 11:26
Updates

Due to technical problems, we will have to postpone the maintenance (and the final lustre migration step) to 23.04.2024 07:00.

16.04.2024 16:23

System Maintenance

Maintenance
Tuesday 04/23/2024 07:00 AM - Wednesday 04/24/2024 12:00 PM

The whole cluster needs to be updated with a new kernel so that user namespaces can be re-enabled; please compare https://maintenance.itc.rwth-aachen.de/ticket/status/messages/14/show_ticket/8929
Simultaneously, the InfiniBand stack will be updated for better performance and stability.
During this maintenance, the dialog systems and the batch system will not be available. The dialog systems are expected to be reopened in the early morning.
We do not believe that the maintenance will last the whole day and expect the cluster to open earlier.

10.04.2024 11:22
Updates

Due to technical problems, we will have to postpone the maintenance to 23.04.2024 07:00.

16.04.2024 16:22

Unfortunately, unplanned complications have arisen during the maintenance, so it will have to be extended until midday tomorrow. We will endeavor to complete the work by then.
We apologize for any inconvenience this may cause.

23.04.2024 16:27

Top500 - Benchmark

Warning
Thursday 04/11/2024 05:00 PM - Friday 04/12/2024 09:10 AM

During the stated time, Claix-2023 will not be available due to a benchmark run for the Top500 list [1]. Batch jobs which cannot finish before the start of this downtime or which are scheduled during this time period will be kept in the queue and started after the cluster resumes operation.
[1] https://www.top500.org

11.04.2024 17:09
Updates

The nodes are now available again.

12.04.2024 09:27

Longer waiting times in the ML partition

Notice
Wednesday 04/03/2024 04:00 PM - Thursday 04/11/2024 01:11 PM

There are currently longer waiting times in the ML partition as the final steps of the acceptance process are still being carried out.

04.04.2024 10:09
Updates

The waiting times should be shorter now.

11.04.2024 13:11

RegApp Service Update

Maintenance
Wednesday 04/03/2024 02:00 PM - Wednesday 04/03/2024 02:30 PM

The RegApp will be updated on 2024-04-03. During the update window, the service will be unavailable for short time intervals. Active sessions should not be affected.

27.03.2024 13:59

Problems with submitting jobs

Partial Outage
Wednesday 04/03/2024 12:00 PM - Wednesday 04/03/2024 02:03 PM

There are currently problems when submitting jobs. We are working on fixing the problems and apologize for the inconvenience.

03.04.2024 12:36
Updates

The problem is solved now.

03.04.2024 14:03

Deactivation of User Namespaces

Notice
Friday 01/12/2024 10:30 AM - Thursday 02/08/2024 08:00 AM

Due to an open security issue we are required to disable the feature of so-called user namespaces on the cluster. This feature is mainly used by containerization software and affects the way apptainer containers will behave. The changes are effective immediately. Most users should not experience any interruptions. If you experience any problems, please contact us as usual via servicedesk@itc.rwth-aachen.de with a precise description of the features you are using. We will reactivate user namespaces as soon as we can install the necessary fixes for the aforementioned vulnerability.
Update:
We have installed a bugfix release for the affected software component and enabled user namespaces again.

12.01.2024 10:43

hpcwork directory is empty

Partial Outage
Monday 01/29/2024 10:15 AM - Monday 01/29/2024 11:34 AM

At the moment, no data are shown on /hpcwork.
We are working on a solution of the problem.

29.01.2024 10:26
Updates

The problem has been solved.

29.01.2024 11:34

Scheduled Reboot of CLAIX18 Copy Nodes

Outage
Monday 01/29/2024 06:00 AM - Monday 01/29/2024 07:15 AM

Both CLAIX18 copy nodes will be rebooted on Monday, January 29th, 6:00 am (CET) due to a scheduled kernel upgrade. The systems will be temporarily unavailable and cannot be used until the kernel update is finished.

26.01.2024 17:15

Network problems

Warning
Friday 01/19/2024 07:45 PM - Saturday 01/20/2024 09:30 AM

Due to network problems, there may have been issues using the cluster during the stated period.

22.01.2024 07:45

Connection to the Windows cluster is not possible

Partial Outage
Friday 12/29/2023 02:45 PM - Monday 01/01/2024 12:00 AM

At the moment it is not possible to connect to the Windows cluster.
We are working on a solution to the problem.

29.12.2023 14:55
Updates

The error has been resolved. You can connect to the Windows cluster again.

03.01.2024 11:46

jupyterhub.hpc.itc.rwth-aachen.de DNS Temporarily out of Service

Outage
Thursday 12/14/2023 03:30 PM - Thursday 12/14/2023 03:55 PM

The DNS entry for jupyterhub.hpc.itc.rwth-aachen.de is temporarily out of service for 20 minutes. Problems accessing the HPC JupyterHub might arise from this failure. Please wait until the system comes back online.

14.12.2023 15:33

Maintenance of the HPC user administration

Maintenance
Tuesday 12/05/2023 10:00 AM - Tuesday 12/05/2023 12:00 PM

Due to maintenance work, the creation of HPC accounts is delayed. Password changes are not possible.

05.12.2023 09:55

login18-x-2 disrupted

Partial Outage
Monday 11/27/2023 12:45 PM - Tuesday 11/28/2023 02:40 PM

login18-x-2 is defective and is therefore currently not available.

28.11.2023 12:50
Updates

The system is OK again.

28.11.2023 14:40