Our network supplier has been experiencing a network issue which is causing packet loss and high latency on certain internet routes. This is affecting our hosting platform and connectivity products.
The problem appears to have been resolved, although we have not yet received confirmation.
We will provide an update once we have further information.
The network supplier has confirmed that the issue is resolved and will report on the cause.
Our network supplier has finished investigating the network incident of 21/03/2013 and has issued a technical report.
The network was chosen for its reliability and performance. It is a highly resilient network with multiple failover systems and multiple paths, so in the event of a router or connection failure, data is immediately redirected over alternate paths with little or no loss of service.
During the incident on 21/03, however, the whole network was effectively inaccessible, which should never happen and is highly unusual.
The supplier has traced the cause to human error in the configuration of a device, which sent incorrect data to every router in the network. Additional measures are being put in place to prevent a repeat of this issue.
Please accept our apologies for any inconvenience this issue caused.
An extract from the supplier's report follows:
——————————————–start report
“The cause of the issue was traced at approximately 19.15 hrs, when our engineers identified a TCAM memory overrun in one of our core router logs. This indicated that the core routers were seeing excessive BGP routes, hence the overrun. This was rapidly traced to our lab environment within our HQ, where a stress test of new hardware and equipment was taking place. The traffic and routes from the stress test that were causing the issue were being advertised on our core network; these would normally be filtered by our edge BGP filters. However, a misconfigured filter was in place, which meant that the traffic was able to affect our core.
The stress test was immediately halted in the expectation that the network would return to normal. This did not happen. Further diagnostics showed that, as a result of the overruns, the core routers were no longer using hardware forwarding to exchange routes and had reverted to lower-capacity software forwarding. Consequently, services were not resumed when the stress test ceased.
Our engineers unsuccessfully attempted to force the routers back into the normal hardware forwarding state. The decision was therefore taken to reboot our core routers individually, as this would restore service in the shortest possible timeframe.
A series of reboots was then undertaken, and the bulk of services was restored by 20.45 hrs.
Actions taken:
The immediate action taken to rectify this problem was to correct the edge BGP filters between our lab equipment and our core. This is now in place and will remain indefinitely. We are currently reviewing whether a second level of filtering can be introduced as an extra layer of protection, and will implement this when possible.
Summary:
We appreciate that this was a major incident for both us and the customers who rely on our network.
We sincerely apologise for the incident and assure you that it is being treated with the greatest priority.”
——————————————–end report
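For readers unfamiliar with the edge BGP filtering the report refers to, the short Python sketch below illustrates the general idea: routes originating from a lab or test environment are dropped at the edge before they can be advertised into the core. The prefixes, example routes, and filter logic here are purely illustrative assumptions, not the supplier's actual configuration, which would normally be implemented with router prefix lists or route maps rather than application code.

# Illustrative sketch only: a toy model of an edge BGP prefix filter.
# The lab prefixes and example routes below are hypothetical, not the supplier's.
import ipaddress

# Hypothetical prefixes assumed to belong to the lab / stress-test environment.
LAB_PREFIXES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def edge_filter(advertised_routes):
    """Drop any route that falls inside a lab prefix before it reaches the core."""
    accepted = []
    for route in advertised_routes:
        net = ipaddress.ip_network(route)
        if any(net.subnet_of(lab) for lab in LAB_PREFIXES):
            continue  # filtered at the edge: never advertised into the core
        accepted.append(net)
    return accepted

# Example: one legitimate customer route and one stray lab route.
routes = ["192.0.2.0/24", "198.51.100.128/25"]
print(edge_filter(routes))  # only 192.0.2.0/24 is passed on to the core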