Scalable doesn't mean fast by Mark Seemann
Recently I spent a couple of days with Thomas Jespersen who's working towards a launch of spiir.dk - on Windows Azure. The reason I got to talk to him was to see if I could help with some performance issues he had with Azure Table Storage.
The scenario is really simple: the application needs to load all of a user's bank transactions into memory to enable pretty advanced sorting and filtering. That sounds like a lot, but really isn't more than approximately 200 kB of data retrieved through a single query - so: there are no 1+N problems in play here, but even so it originally took more than two seconds. That's a bit long to wait before you can even start rendering a web page.
By tweaking his partitioning strategy and using parallel queries, Thomas managed to bring down the data retrieval time to approximately one second. Although stress testing indicated that this duration was very stable, even under load, it is still too slow. So we met to see what could be done.
Thomas had done a great job tweaking the query, so I couldn't really suggest some sort of secret API that would make it run significantly faster. Basically, we have to deal with Azure storage being based on REST and that there are a lot of things about run-time behavior we cannot control. Apart from designing a proper partitioning strategy, we can't add indexes to Azure Table Storage.
It was time to take a different approach.
As far as I can tell, Windows Azure is designed to be very scalable. However, just because scalability implies that you can handle an insane amount of work within acceptable time frames, it doesn't mean that you can extrapolate it to mean that under a light load, everything will be lightning fast. That's not the case at all.
Scalability means that performance characteristics remain stable from light to heavy load.
Consequently this means that if performance is adequate under heavy load, it will also be adequate under a light load. Azure Storage is first and foremost designed to be scalable, and as a second priority, as fast as possible.
As Thomas discovered, Azure Table Storage isn't particularly fast.
It may be a masochistic side of me that I'm not otherwise aware of, but I actually appreciate that. It makes us reassess our most basic assumptions.
The data that Thomas needs to read isn't particularly dynamic, so what if we take a snapshot of it? In short, we loaded all of a user's data into memory and serialized it to Azure Blob Storage.
Loading the same data from a binary serialized Blob took only 1/6 of the time it did to load it from Table Storage.
As it turns out, Thomas doesn't even need all the columns from the Table to populate the view, so we could even make the serialized Blob smaller yet.
At this point, however, we now have two representations of the same data: The original data in Table Storage, and a persistent cache in Blob Storage. The remaining challenge is to figure out how to keep these in sync.
This may seem like a hack, but is really represents a paradigm shift. Letting go of ACID opens up a lot of new opportunities.
Actually, I spend most of the next day trying to convince Thomas that CQRS would be the best approach, or that we could at least pick up some of the techniques from asynchronous, messaging based architectures, but that's another story.
The morale here is that on Azure, things may be slower than you are used to, but storage is (relatively) cheap, so denormalization can save you a lot of execution time.
Comments
Your comment about you letting go of ACID strikes me as odd. Basically you are just adding a cache in front of your table storage like people have done for ages so where in lies the paradigm shift in "letting go of ACID", are you speaking to people who have never cached data? Even people not using a cache will let go of ACID in most cases because they keep stale data in objects in their application and in best case do optimistic concurrency checks and worst case just let the last-to-write-win.
The same goes for the ACID comment. People don't seem to have too big of a problem with in-memory caches because somehow they know that these can't possibly be consistent. However, as soon as you start writing to a persistent store, you encounter knee-jerk reactions that all persisted data must be written and updated within transactions.
I'm not disagreeing with you. This is the basic premise behind CQRS and other scalable architectures. It's just not particularly widely known yet.