Important update about Pixel Internet’s outage
As you may be aware, we recently suffered the worst single incident in our history due to a power outage at our Leeds data centre on Wednesday afternoon.
Emergency maintenance work was being carried out on the load transfer module, which feeds power from our external energy supplies to the data centre hall that holds the majority of our servers. The data centre has 2 dual feed uninterruptible supplies both backed by diesel generators in case of National Grid outages.
Unfortunately, a safety mechanism within the device triggered incorrectly, and resulted in a power outage of fewer than 9 minutes. Subsequently, this caused approximately 15,000 servers to be hard booted. Beyond a fire, this is the worst possible event that a hosting company can face. A full post mortem is currently being carried out to determine how power was lost on both supplies despite working with the external engineer from the hardware manufacturer.
What happens when servers hard reboot?
Web servers and virtual servers typically perform database transactions at a very high rate, meaning that the risk of database or file system corruption is quite high when a hard reboot occurs.
Following the restoration of power, our first priority was to get our primary infrastructure boxes back online, then our managed and unmanaged platforms. Our managed platforms are built to be resilient, so although we lost a number of servers in the reboot, the majority of our platforms came up cleanly. We faced some issues with our Premium Hosting load balancers, which needed repairing, so some customer sites were off for longer than we would have hoped. We are adding additional redundant load balancers and modifying the failover procedure over the next 7 days as an extra precaution for us and our customers.
On our shared hosting platform, a number of NAS drives, which sit behind the front-end web servers and hold customer website data, crashed and could not be recovered. However, they are set up in fully redundant pairs and the NAS drives themselves contain 8+ disk RAID 10 arrays. In every case but one, at least one server in each pair came back up cleanly, or in an easily repairable state, and customer websites were back online within 2-3 hours.
In a single case, the cluster containing web 75-79, representing just under 2% of our entire shared platform, both NAS drives failed to come back up. Following our disaster recovery procedure, we commenced attempts to restore the drives, whilst simultaneously building new NAS drives should they be required. Unfortunately, the servers gave a strong, but false, indication that they could be brought back into a functioning state, so we prioritised attempts to repair the file system.
Regrettably, following a ‘successful’ repair, performance was incredibly poor due to the damage to the file system, and we were forced to proceed to the next rung of our disaster recovery procedure. The further we step into the disaster recovery process, the greater the recovery time, and here we were looking at a total 4TB restore from on-site backups to new NAS drives. (For your information the steps following that are to restore from offsite backup and finally restore from tape backup although we did not need to enact these steps.) At this point, it became apparent that the issue would take days rather than hours to resolve, and the status page was updated with an ETA. We restored sites to the new NAS drives alphabetically in a read-only state and the restoration completed late Sunday afternoon.
A full shared cluster restore from backups to new NAS is a critical incident for us, and we routinely train our engineers on disaster recovery steps. Our disaster recovery process functioned correctly, but because the event did not occur in isolation, we were unable to offer the level of individual service that we really wanted to, and that you would expect from us (e.g. individual site migration during restoration).
Given the magnitude of this event, we are currently investigating plans to split our platform and infrastructure servers across two data centre halls, which would allow us to continue running in the event of complete power loss to one. This added reliability is an extra step that we feel is necessary to put in place to ensure that this never happens again for our customers.
VPS and Dedicated Servers
For our unmanaged platforms (VPS and Dedicated Servers), the damage was more severe, as by default these servers are not redundant or backed up. In particular, one type of VPS was more susceptible to data corruption in the event of a power loss due to the type of caching the host servers use. We have remedied this issue on all re-built VPS involved in the outage, and no active or newly built VPS now suffer from this issue.
Support and Communications
During this incident, we have worked our hardest to ensure that our entire customer base was kept informed of our progress through our status page.
Given the scale of the issue, the load on our Customer Services team was far in excess of normal levels. On a standard day, we handle approximately 200 support tickets, which can rise to 500 during a fairly major incident. At absolute capacity, we can handle approximately 750 new tickets per day.
This event was unprecedented, so during and following the incident we received in excess of 1200 new support tickets every day (excluding old tickets that were re-opened), and the ticket complexity was far higher than usual. Our admin system was not set up to handle this number of requests (being poll heavy to give our team quick updates on our ticket queue). This heavily impacted the performance of our control panel and ticketing system until we made alterations to make it far less resource intensive.
After this, we took immediate steps to ameliorate the incredible support load via automated updates to affected customers, but most of the tickets required in-depth investigation and server repairs that require a high level of technical capability, so could only be addressed by our second line and sysadmin staff. It will take some time to clear our entire ticket backlog and restore normal ticket SLAs.
In conclusion we’d like to apologise to you, and to your customers. We know as much as anyone how important staying online is to your business. The best thing we can do to regain your trust is to offer good, uninterrupted service long into the future, and that is now our utmost priority.