Posted by: Anil | June 15, 2008

Outage in San Jose, California – Internap

We had a major outage on Friday evening in our San Jose, CA data center. Our data center provider tells us that it’s the biggest in 11 years. It was also our own biggest outage to date. The cause turned out to be a configuration issue at Internap (we use Internap bandwidth). No hardware failed.

I am not sure why Internap did not know that this change had caused a major network outage until our network provider escalated, and even that came after a long time. Internap is a premium provider, so this is a question we are still waiting for our network provider and Internap to answer. It is unacceptable.

To our customers: I am very disappointed and concerned about what happened. Entic.net is a startup marketing to the high-quality hosting market, so this kind of thing throws the work we do out the door, especially since we put a lot of money into maximizing reliability on the servers themselves. Sorry!

I also wanted to take this opportunity to remind our customers that there is no server or network out there that can be 100% reliable with 100% uptime (even excluding planned outages).

Things can and will go wrong: human error, software problems, power failures, and more. Even companies with lots of money for redundancy still see things fail.

All that said, what can you do to make your web site less prone to outages?

Run two web servers on two different machines (maybe in different data centers). Set up the database in a clustered environment. Here is an example for MySQL. Set up DNS so that http://www.domain.com (your web site URL) points to both web servers; if one goes down, quickly update DNS to remove the affected server.
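As a rough sketch of the DNS step above, the following Python lists the addresses a host name currently resolves to and works out which ones should stay in DNS after a failure. The function names, the plain TCP check on port 80, and the fallback rule are assumptions made for illustration. Actually changing the records depends on your DNS provider and is not shown.

```python
import socket


def resolve_addresses(hostname, port=80):
    """Return the sorted, de-duplicated IPv4 addresses a host name resolves to."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def tcp_check(address, port=80, timeout=3.0):
    """Return True if a TCP connection to address:port succeeds within the timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def records_to_keep(addresses, is_up=tcp_check):
    """Return the addresses that pass the health check, i.e. those to leave in DNS.

    Assumed fallback: if every server fails the check, keep all of them,
    since publishing no address at all guarantees the site is unreachable.
    """
    healthy = [addr for addr in addresses if is_up(addr)]
    return healthy if healthy else list(addresses)
```

For example, `records_to_keep(resolve_addresses("www.domain.com"))` would give the list of addresses to leave in place; anything missing from that list is a candidate for removal.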

The DNS solution above can work, but it requires manual intervention to update DNS. Once DNS is updated and the change has propagated, things will work without a problem. The other, more automated solution is to use hardware load balancers. However, this solution is a lot more expensive and is (usually) set up with both web servers in the same data center. If the whole data center has problems, then this solution will be of no use; it only provides automated, seamless failover of your web site from one server to another.
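To picture what a load balancer automates, here is a toy model in Python: requests rotate across the web servers, and any server marked as down is skipped. This is only an illustration of the idea, not how any particular hardware load balancer works; the class name and the server labels are made up.

```python
import itertools


class RoundRobinBalancer:
    """Toy model of a load balancer: rotate across backends, skipping any marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.down = set()
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        """Record that a backend failed its health check."""
        self.down.add(backend)

    def mark_up(self, backend):
        """Record that a backend is healthy again."""
        self.down.discard(backend)

    def next_backend(self):
        """Return the next healthy backend, or None if all are down."""
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate not in self.down:
                return candidate
        return None
```

Note that if both servers sit in the same data center and that data center goes dark, `next_backend()` has nothing left to return, which is the limitation described above.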

You can also set up good monitoring of your web site, so that if something does fail, you get paged and can update DNS right away. Depending on the DNS setup, this can limit the outage to perhaps 30 minutes to an hour.
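A minimal monitoring sketch in Python might look like the following. It checks a URL, counts consecutive failures so that one blip does not page anyone, and gives a rough upper bound on the outage. All names, the three-failure threshold, and the simple formula (detection time plus reaction time plus DNS TTL) are assumptions; real resolvers do not always honor the TTL exactly, and the paging itself is left out.

```python
import urllib.error
import urllib.request


def site_is_up(url, timeout=10.0):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts, and HTTP errors.
        return False


class FailureCounter:
    """Track consecutive failed checks and signal when an alert is due."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, is_up):
        """Record one check result; return True exactly when the alert should fire."""
        if is_up:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.threshold


def worst_case_outage_seconds(check_interval, threshold, response_time, dns_ttl):
    """Rough upper bound: time to detect, plus time to react, plus DNS TTL."""
    return check_interval * threshold + response_time + dns_ttl
```

With made-up figures of a check every 60 seconds, three failures before paging, ten minutes to react, and a 30-minute TTL, `worst_case_outage_seconds(60, 3, 600, 1800)` comes to 2580 seconds, or 43 minutes. A shorter TTL on the web site's records is the easiest number to shrink.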

Keep in mind that the DNS setup above, combined with database clustering, could result in higher bandwidth costs for you (since the database is replicated over the network to another data center).

These solutions depend on how much redundancy you want and how much you are willing to spend on it. Hope that helps.

Back to Entic.net. We are looking to expand into another data center, and we hope to offer our v.DS (Solaris VPS) service there in the coming weeks. For those wanting to run two web servers in two different data centers, this will be an option for you very soon.

Again, we sincerely apologize for the outage and we’ll try to get the answers we want from Internap as soon as possible. It hurts us just as much as it hurts your business.
