[RESOLVED] Emergency maintenance on the storage cluster.

Posted on January 18, 2021 by swiftinternet — 2 Comments

Date: 18/01/2021
Window Time: 45 minutes between 11:00am and 12:00pm

A CPU has failed in one of the hardware nodes, so we need to shut the server down and replace the failed CPU.
We have already evacuated all VPSs from the hardware node in question, so they will be unaffected.
The hardware node does, however, contain 6 TB of SSD storage for the cluster. There are 3 copies of all data, so when we shut down the node, the VPS servers will continue to function normally using the other two copies.
The cluster will, however, start re-replicating all of the data held on the stopped node in order to maintain 3 copies, and will spread that data across all of the other nodes.
This replication will put a high load on the storage cluster, causing higher than normal data access latency. This may in turn cause higher than normal load on the VPS servers until the replication process is complete.
Therefore, during this maintenance window, the VPS servers should be considered at risk of higher than normal load and increased response times.
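
As a rough illustration of why the re-replication load matters, the sketch below estimates how much data each remaining node would receive and how long the rebuild might take. It is a back-of-envelope calculation only: the number of surviving nodes and the per-node write rate are hypothetical values not stated in this notice, the 6 TB figure is the node's capacity (the data actually stored may be less), and the function name is made up for this example.

```python
def rereplication_estimate(data_tb, surviving_nodes, per_node_mb_s):
    """Estimate the rebuild after one storage node is stopped.

    Assumes the lost copies are spread evenly across the surviving
    nodes and that each node's sustained write rate is the bottleneck.
    Returns (TB received per node, hours to finish).
    """
    if surviving_nodes < 1 or per_node_mb_s <= 0:
        raise ValueError("need at least one surviving node and a positive rate")
    share_tb = data_tb / surviving_nodes
    # 1 TB = 1,000,000 MB (decimal units)
    seconds = share_tb * 1_000_000 / per_node_mb_s
    return share_tb, seconds / 3600


if __name__ == "__main__":
    # Hypothetical inputs: 5 surviving nodes, 200 MB/s sustained per node.
    share_tb, hours = rereplication_estimate(6.0, 5, 200.0)
    print(f"{share_tb:.1f} TB per node, about {hours:.1f} h")
```

With these assumed inputs the rebuild would outlast the 45 minute window itself, which is why increased latency can persist until replication finishes.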

Posted in Planned Maintenance, Servers

2 comments on “[RESOLVED] Emergency maintenance on the storage cluster.”

swiftinternet says:

January 20, 2021 at 11:33 am

We have again experienced a CPU failure on this hardware node (node03). All VPSs were evacuated and the storage cluster is currently re-replicating the data. All services are running as normal, although there is some increased latency in storage access times.

swiftinternet says:

January 20, 2021 at 12:14 pm

The logs for hardware node 03 show that CPU cores on cpu0 had been going offline, and that just after 10 am the CPU was flagged as overheating and the system shut down.

The CPU has been replaced and the node brought back online. It will be monitored before populating it with VPS servers.

