Some of you may have noticed that there was a hosting issue last week and sites were offline; clearly this is upsetting.
A Bit of History
When I first started working on websites in the mid 1990s I had a small server in a business centre in Batley, West Yorkshire. It was a ropey old desktop machine that sat with a few others in a server cupboard. Most of the time it was fine, but occasionally it crashed late on a Friday and was only rebooted the following Monday morning. This meant that I was unable to access the sites or work on them over the weekend.
Sometimes, though, when I was in the centre I went into the server room just to look at it; sad but true. Since that era things have moved on greatly.
We currently host with a UK-based company; the choice was based on a number of considerations, with reliability high on the list. DesignCredo don’t view hosting as a revenue earner; it is sold on pretty much at cost. We have previously been reassured by the company’s proactive internet security endeavours and have found the ongoing service to be very good.
That said, last week they had problems: a power loss took the websites offline for a period of time. So, from the hosts themselves, this is what happened last week:
Our UK-based data centre facility is one of the most efficient and resilient data centres in Europe. We carry out tests on a regular basis and test our generators every weekend to make sure that everything is in order. Additionally, we have expert technicians staffing our data centre 24/7, every day of the year. We also have two UPS systems serving each data centre hall.
So, what happened?
On Wednesday, one of our data centre halls suffered a power loss which affected the facility for less than 9 minutes.
Each data centre hall has two UPSs (uninterruptible power supplies) which feed into an LTM (load transfer module), which manages the feed of power from the UPSs to the data centre hall, where your servers are housed.
This piece of hardware automatically switches the power between the two external supplies, should one fail. This is part of our redundancy commitment to you.
The LTM showed a fault on the primary power supply and was running on its backup. Our teams followed the guidelines and contacted the manufacturer, and an expert in this piece of hardware was sent to our facility to investigate and fix it.
The fault code indicated that there was a problem with the voltage monitor. As a safety procedure, this automatically shuts down primary power, even though the power supply itself is likely fine.
The engineer on site assisted with the fitting of the replacement part. The procedure was followed to turn off the already disabled switch and to change the part safely. Unfortunately, a safety mechanism in the device triggered incorrectly, which led to the data centre losing power.
The sites went offline mid-afternoon and were mostly restored by late afternoon the same day; we monitored the progress closely. Unfortunately just one of our sites, this one, remained offline for an extended period of time; it is hosted on one of four servers (out of several hundred) that appear to have suffered physical damage as a result of the above.
Of course this was very frustrating. The hosts maintained good communication throughout the process, but inevitably there is always going to be an element of the unknown with issues such as this.
What About Other Provisions?
Last year a client who has their site hosted with a different company experienced a failure. Once we located the issue, something totally out of our control, and realised it wasn’t fixable on their host’s server, we migrated the site to our own more flexible hosting provision as a temporary ‘home’ whilst a long-term solution was found; the site has now been returned to their own host.
Had the site been on our host we would have been able to instigate a temporary fix immediately. However, I wouldn’t use this as an excuse to suggest that the other host is anything other than very good.
Why Didn’t We Do This For Our Own Site?
Last week, once it was apparent that our Design Credo site was to endure an extended outage, a backup of the site was indeed loaded to a different server (here), but we decided not to fully migrate it. Rest assured, if this had been a client’s site we would have had it back up and running very quickly. There were a number of reasons for not making the switch for ourselves.
Unfortunately technology isn’t perfect; even Amazon was offline for an hour or so last week. Inevitably, when issues such as this happen (ditto road works and police investigating RTAs), Social Media Heroes have a good old rant; I guess it makes them feel good.
Of course there is a bit of me that feels annoyed; it’s something I can’t control. However, through the years I have learned to be pragmatic. I have utter faith in the endeavours of the hosting company last week, and I remind myself that whoever one hosts with, we all pay a similar amount, roughly equating to the cost of a loaf of bread per week. That doesn’t mean we shouldn’t strive for improvement and perfection, but I genuinely believe that we get a lot for our money with web hosting.
So, What Did We Do Whilst Our Site Was Down?
Friday morning we had a meeting with a client to discuss a website proposal. By Friday afternoon we had purchased the domain, and an hour later we had a basic website hosted on the servers that had by then been restored to full service. Monday morning the client was presented with a fully operational site. I can’t show it to you yet, and that makes me really frustrated!
This was the feedback:
We both love it! (as an aside, I’m hoping to get ****** to go for a website too…)