Date: 18/01/2021
Window Time: 45 minutes between 11:00am and 12:00pm
A CPU has failed in one of the hardware nodes, so we need to shut the server down and replace the failed CPU.
We have already evacuated all VPSs from the hardware node in question, so they will be unaffected.
The hardware node does, however, contain 6 TB of SSD storage for the cluster. The cluster keeps 3 copies of all data, so when we shut down the node, the VPS servers will continue to function normally using the other two copies.
Once the node is stopped, the cluster will start re-replicating the data held on it, spreading that data across all of the other nodes in order to maintain 3 copies.
This replication will put a high load on the storage cluster, resulting in higher than normal data access latency. This may in turn cause higher than normal load on the VPS servers until the replication process is completed.
Therefore, during this maintenance window, the VPS servers should be considered at risk of higher than normal load and increased response times.
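As a rough illustration of why the replication can take a while, the sketch below estimates how long it would take to re-create the stopped node's data at a given aggregate rate. The 6 TB figure comes from this notice and is treated as an upper bound (it assumes the node's storage is completely full); the throughput values are hypothetical and are not measurements from our cluster.

```python
def rereplication_minutes(data_tb: float, throughput_mb_s: float) -> float:
    """Estimate minutes needed to re-create `data_tb` terabytes of data
    at an aggregate rate of `throughput_mb_s` megabytes per second."""
    data_mb = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return data_mb / throughput_mb_s / 60


# 6 TB is the node's SSD capacity from the notice (upper bound on data held).
# The rates below are hypothetical examples, not measured cluster figures.
for rate in (500, 1000, 2000):
    minutes = rereplication_minutes(6, rate)
    print(f"{rate} MB/s -> about {minutes:.0f} minutes")
```

The estimate ignores throttling and competing VPS traffic, so actual replication is likely to take longer than these figures suggest.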
We have again experienced a CPU failure on this hardware node (node03). All VPSs were evacuated and the storage cluster is currently re-replicating the data. All services are running as normal, although there is some increased latency in storage access times.
The logs for hardware node 03 show that CPU cores on cpu0 had been going offline, and just after 10 am the CPU was flagged as overheating and the system shut down.
The CPU has been replaced and the node brought back online. It will be monitored before it is repopulated with VPS servers.
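One simple way to watch for cores going offline again is sketched below. It assumes the node runs Linux, which this notice does not state, and reads the kernel's list of online logical CPUs from /sys/devices/system/cpu/online. The helper names and the expected core count are hypothetical, and this is not a description of the monitoring actually in place.

```python
from pathlib import Path

# Standard Linux sysfs file listing online logical CPUs, e.g. "0-3,8-11".
ONLINE_PATH = Path("/sys/devices/system/cpu/online")


def parse_cpu_list(text: str) -> set[int]:
    """Parse a kernel CPU list such as "0-3,8-11" into a set of CPU numbers."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def offline_cores(expected_count: int, online_text: str) -> list[int]:
    """Return the logical CPUs in 0..expected_count-1 that are not online."""
    return sorted(set(range(expected_count)) - parse_cpu_list(online_text))


def check_node(expected_count: int) -> list[int]:
    """Read the sysfs file on the node itself and report missing cores."""
    return offline_cores(expected_count, ONLINE_PATH.read_text())
```

Running check_node on the node with the number of logical CPUs it should have (a value we would need to supply) returns an empty list while every core is online. Note that this reports logical cores, not physical CPU sockets.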