At around 11:19pm PST on March 8, 2011, our production server became unresponsive. Our host, Rackspace, notified us shortly afterward that the cause was a hardware failure on their end. By about 1:18am PST on March 9, 2011, the server was back online.
What we’ve learned from the experience:
1) Our server-monitoring systems are working great. As soon as our site went down, our team received several notifications indicating there was a problem.
2) Having a redundant server infrastructure is critical for moments like this. A proper server setup should ensure that a common problem, like a hardware failure on a single machine, will not take down the entire website.
What we’re doing to prevent these types of outages in the future:
1) We’ll be adding redundant database and web application servers so that if a server goes down, a backup will be in place to take over for it.
2) Additionally, we plan to set up a secondary alerting system for our customers, such as a Twitter or Tumblr account, or possibly host our blog on a separate server, so that in the event of a major catastrophe we can still keep communication open with our users.