How Azure Web Sites Sucked in Production
A couple of months ago, we started a major redesign of our blog. Since the new theme required a much higher level of customizability, we had to migrate the blog from Blogger to WordPress and choose a new host.
Having played around with Windows Azure quite a lot in the past, we decided to give Azure Web Sites a shot. Until then, we had only used the old compute model: for example, our main website, eleks.com, is a regular Azure Web Role on ASP.NET MVC 3. When Azure Web Sites came out, it gave the impression of a service that had all the features we had been missing for a long time.
And it delivered! In just two minutes, we created a site and a database, deployed a WordPress template, set up Git deployment, and were ready to proceed with the development. Since we needed a custom domain and did not expect heavy traffic, Shared Plan ($10/mo) was a reasonable option.
Everything was awesome. We implemented and tested the new design and were ready to go live. And then it began.
Strike One: Memory Quota
Being aware that the Shared plan had certain quotas for CPU, storage and RAM, we added php_value memory_limit 128M to .htaccess (edit: which was our own mistake, since WAWS ignores .htaccess. Instead, it uses web.config for server configuration and .user.ini for PHP settings).
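For illustration, the same setting could be expressed in a .user.ini file in the site root, using PHP's INI syntax. This is only a minimal sketch: the 128M value is the one from our .htaccess attempt, and raising PHP's per-script limit would presumably not change the plan's own RAM quota.

```ini
; .user.ini in the site root (assumed location)
; Per-script PHP memory limit, same value we had tried to set via .htaccess
memory_limit = 128M
```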
One day, we were adding some final strokes, when suddenly we saw this:
We went to the Management Portal and…
So, despite the configuration values, one developer and two editors were able to take down a Shared website due to a RAM quota. The troubling thing is that we received no prior notification whatsoever. The site just went offline for an hour.
With traditional hosting, even on a free plan, you can at least expect an email like: “Dear Customer. Not cool. Do something about your resource usage or next time we’ll suspend you”. In our case, we were on a paid $10 plan, and Azure just pulled the plug. To see a warning before this happens, you need to be watching the Management Portal. Luckily, this happened at the pre-release stage, so we went to the Scale tab and added 3 more instances, which, I suppose, increased our monthly bill from $10 to $40.
Good thing we did. When we launched and served about 10K pageviews during the first day, the RAM meter fluctuated between 1.25 GB and 1.7 GB (edit: this is not our typical daily load; we owed it to the HN effect, and the traffic is usually much lower).
All in all, how the memory is measured remains a mystery so far. It looks like Azure hosts PHP applications on IIS and measures the memory footprint of w3wp.exe. And if the PHP handler leaks under IIS in production, there’s very little you can do as a user.
Strike Two: DB Outages
Unlike a traditional hosting environment, Azure Web Sites relies on another cloud provider, ClearDB, to provide MySQL infrastructure. ClearDB describes their platform as ‘Completely Fault Tolerant, with Global Multi-Master Design, 100% uptime guarantee’. So we were a bit surprised to see this:
We looked into the logs, and…
[10-Apr-2014 20:58:59 UTC] WordPress database error Server shutdown in progress for query SELECT t.term_id FROM wp_terms [...text truncated...]
Okay, accidents happen. But then we noticed that almost every day the blog could not establish a DB connection for 30-60 minutes.
Is there a quota for DB queries that was not mentioned on the Azure website? Sure enough:
[11-Mar-2014 11:15:44 UTC] WordPress database error User 'b73a2dfc60748c' has exceeded the 'max_questions' resource (current value: 3600) for query [...text truncated...]
From the ClearDB FAQ:
The max_questions resource is defined by how many queries you may issue to your database in an hour. Our free plans start with 3,600 queries per hour and increase to 18,000 upon purchasing a paid plan with us.
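To get a feel for what these numbers mean, here is a rough back-of-the-envelope sketch. The two quotas come from the FAQ quoted above; the function name and the figure of 30 queries per page view are purely hypothetical assumptions for illustration, not something we measured on our blog.

```python
# Quotas quoted in the ClearDB FAQ above (queries per hour).
FREE_PLAN_MAX_QUESTIONS = 3600
PAID_PLAN_MAX_QUESTIONS = 18000

# ASSUMPTION: an illustrative number of DB queries per WordPress page view.
ASSUMED_QUERIES_PER_PAGEVIEW = 30


def pageviews_per_hour(max_questions: int, queries_per_pageview: int) -> int:
    """Return how many whole page views fit into the hourly query quota."""
    if queries_per_pageview <= 0:
        raise ValueError("queries_per_pageview must be positive")
    return max_questions // queries_per_pageview


def sustained_queries_per_second(max_questions: int) -> float:
    """Return the average query rate that exactly uses up the hourly quota."""
    return max_questions / 3600


if __name__ == "__main__":
    for name, quota in [("free", FREE_PLAN_MAX_QUESTIONS),
                        ("paid", PAID_PLAN_MAX_QUESTIONS)]:
        views = pageviews_per_hour(quota, ASSUMED_QUERIES_PER_PAGEVIEW)
        rate = sustained_queries_per_second(quota)
        print(f"{name}: {views} page views/hour, {rate:.1f} queries/second")
```

Under that assumption, the free quota works out to one query per second on average, which a handful of uncached page loads can exhaust well before the hour is over.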
Strike Three: Phantom HTTP 500 Errors on AJAX
Our new design utilises AJAX heavily, and the developers noticed that, from time to time, while testing the site hosted on Azure, the AJAX calls returned HTTP 500 with no detectable patterns or steps to reproduce. Of course, this was true not only for the developers:
My only theory was that it had something to do with 4 instances of our site running simultaneously (or is it just 4 times the quota, but 1 worker process?). Anyway, we had very little information to reason about this issue, so we simply copied the same code and DB to a traditional host. Sure enough, everything ran smoothly there, and we haven’t noticed any errors since.
Sorry WAWS, You’re Out
One of the most disturbing things about these issues was the fact that we could not contact Azure support about any of them, because in order to ask technical questions you need a $29/mo support plan, which we didn’t have on that particular subscription. If you tally up the figures, here’s what it would look like to make everything work:
4 x $10 (instances) + $29 (support) = $69/month
With traditional hosting, you get the same hosting model, with support, for roughly a tenth of the price, though probably with fewer features. We absolutely didn’t mind paying a few extra bucks for more productivity with Azure (see Penny Pinching in the Cloud: When do Azure Websites make sense? by Scott Hanselman for details), provided that the primary features worked reliably.
I’ve always loved Microsoft for their excellent developer products and tools, and I still do. But when it comes to production environments, having a good night’s sleep would take priority over any productivity boosting feature. So WAWS will still be at the top of my development list, but for production needs I’d give it a couple of years to mature.