The c23i partition is DOWN because of an unforeseen side effect of our monitoring system, which automatically downs the only node in the partition. A solution is not yet known and is being investigated. The HPC JupyterHub will not be able to use the partition until the issue is resolved.
Since the cluster maintenance, random MPI job crashes have been observed. We are investigating the issue and working on a solution.
We have identified the issue and are currently testing workarounds with the affected users.
Please migrate your notebooks to work with the newer c23 GPU Profiles! -- Since the migration of the GPU Profiles to Claix 2023 and the new c23g nodes, the old Python packages use non-optimal settings on the new GPUs. These old profiles have to be redeployed, which will take some time.
Due to a security vulnerability in the Linux kernel, user namespaces are temporarily deactivated. Once the kernel has been updated, user namespaces can be used again.
User namespaces are available again.
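To check from a node whether user namespaces are currently usable, a minimal sketch along the following lines may help. It reads the standard Linux sysctl `/proc/sys/user/max_user_namespaces`; whether the cluster actually disables user namespaces through this setting is an assumption, and the function names are purely illustrative.

```python
from pathlib import Path

# Standard Linux sysctl that caps the number of user namespaces; a value of 0
# means none can be created. That the cluster uses this particular setting to
# deactivate user namespaces is an assumption.
MAX_USERNS = Path("/proc/sys/user/max_user_namespaces")


def userns_allowed(raw: str) -> bool:
    """Interpret the sysctl contents: any positive limit permits user namespaces."""
    return int(raw.strip()) > 0


def check_user_namespaces(path: Path = MAX_USERNS):
    """Return True or False, or None when the sysctl cannot be read or parsed."""
    try:
        return userns_allowed(path.read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    labels = {True: "enabled", False: "disabled", None: "unknown"}
    print("user namespaces:", labels[check_user_namespaces()])
```

A result of "unknown" only means the file could not be read; it says nothing about other mechanisms that may restrict user namespaces.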
The quota system on HPCWORK may not work correctly. Creating files may fail with the error "Disk quota exceeded" even though the r_quota command reports that enough quota is available. The supplier of the filesystem has been informed and is working on a solution.
File quotas for all hpcwork directories were increased to one million.
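For users who want to see how close a directory is to the file quota, or to recognize the "Disk quota exceeded" error in their own scripts, the following is a rough sketch. It assumes that every file and directory counts once against the quota, which may not match how the filesystem accounts for usage, and the helper names are illustrative only. The output of r_quota remains the authoritative figure.

```python
import errno
import os

FILE_LIMIT = 1_000_000  # file quota stated in the announcement


def count_entries(root: str) -> int:
    """Count files and directories below root (assumed to consume one quota unit each)."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total


def is_quota_error(exc: OSError) -> bool:
    """True when the operating system reported 'Disk quota exceeded' (EDQUOT)."""
    return exc.errno == errno.EDQUOT


if __name__ == "__main__":
    # Falls back to the current directory when $HPCWORK is not set.
    root = os.environ.get("HPCWORK", ".")
    used = count_entries(root)
    print(f"{used} entries under {root} ({used / FILE_LIMIT:.1%} of the file quota)")
```

Walking a very large directory tree generates considerable metadata load, so a script like this should be run sparingly.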
During the maintenance, $HPCWORK will be reconfigured so that the CLAIX23 nodes can access it via RDMA over InfiniBand (IB) instead of over Ethernet. At the same time, the kernel will be updated. After the kernel update, the previously deactivated user namespaces will be reactivated.
The maintenance had to be extended for final filesystem tasks.
Due to unforeseen problems, the maintenance has to be extended until tomorrow, 16.07.2024, 18:00. We do not expect the manufacturer of the filesystem to need that long and expect to reopen the cluster earlier.
The maintenance has been completed successfully. Once again, sorry for the long delay.