Cache Strategies for High Performance Web Serving


I've recently become very interested in cache strategies, because I want to be able to recommend and implement high-volume web sites with the end-user experience in mind. Companies like Akamai will do edge caching for you, but using an outside company can be expensive and time consuming, and it can add layers of complexity.

For dynamic content, there are three separate strategies: Content Blind Caching, Content Aware Caching, and Replication.

For static content, there are two strategies: Replication and Proxy Caching.

Content Blind Caching

In a content blind environment, you have a central database. You place cache servers at the edge, close to the users you are targeting, and those servers run generic queries against your high-transaction tables on a regular schedule. The results are cached at the edge and are queried on demand as users request data from the web interface.
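The refresh cycle described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `ContentBlindCache`, the `run_query` callable standing in for the generic query against the central database, and the `interval_s` refresh period are all hypothetical, and rows are assumed to be dictionaries with an `id` key.

```python
import time
from typing import Callable, Dict, List, Optional


class ContentBlindCache:
    """Edge cache that reloads a whole result set on a fixed interval.

    It never asks whether individual rows changed. It simply re-runs the
    same generic query and replaces its local snapshot.
    """

    def __init__(
        self,
        run_query: Callable[[], List[dict]],  # hypothetical call to the central database
        interval_s: float,                    # assumed refresh period in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._run_query = run_query
        self._interval_s = interval_s
        self._clock = clock
        self._snapshot: Dict[int, dict] = {}
        self._loaded_at: Optional[float] = None

    def refresh_if_due(self) -> bool:
        """Reload the snapshot if the interval has elapsed; return True if reloaded."""
        now = self._clock()
        if self._loaded_at is not None and now - self._loaded_at < self._interval_s:
            return False
        rows = self._run_query()
        self._snapshot = {row["id"]: row for row in rows}
        self._loaded_at = now
        return True

    def get(self, row_id: int) -> Optional[dict]:
        """Serve a user request from the local snapshot only (no WAN traffic)."""
        return self._snapshot.get(row_id)
```

In this sketch a scheduler on the edge server would call `refresh_if_due()` periodically, while web requests only ever call `get()`. Data served between refreshes can be up to `interval_s` seconds old, which is the data latency this strategy trades for speed.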

Content Aware Caching

In a content aware environment, queries are pushed out to edge servers on demand. For each request, the edge server checks with the central server farm for fresher data by comparing modification dates. If newer data exists, it is pulled from the central server farm. If not, the response is served from the edge cache.
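A minimal sketch of that check-then-serve flow, assuming the central farm exposes two operations: a cheap "when was this last modified?" lookup and a full fetch. The names `ContentAwareCache`, `get_modified`, and `fetch` are made up for illustration and do not refer to any particular product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Entry:
    modified: float  # modification timestamp reported by the central farm
    body: str        # cached payload


class ContentAwareCache:
    """Edge cache that validates every request against the origin's modification date."""

    def __init__(
        self,
        get_modified: Callable[[str], float],       # hypothetical cheap freshness check
        fetch: Callable[[str], Tuple[float, str]],  # hypothetical full fetch: (modified, body)
    ) -> None:
        self._get_modified = get_modified
        self._fetch = fetch
        self._entries: Dict[str, Entry] = {}

    def get(self, key: str) -> str:
        # Every request costs one round trip to the central farm for this check.
        remote_modified = self._get_modified(key)
        entry = self._entries.get(key)
        if entry is None or remote_modified > entry.modified:
            # Cache miss or stale copy: pull the new data from the central farm.
            modified, body = self._fetch(key)
            entry = Entry(modified, body)
            self._entries[key] = entry
        return entry.body
```

Note that `get()` always calls `get_modified()` before answering, even on a cache hit. The payload transfer is saved when nothing has changed, but the remote check is not.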

Replication

In a replication environment, all data is replicated from your central server farm to your local edges. Replication can be real-time or it can occur at timed intervals, depending on need.
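For the timed-interval case, one simple approach is to copy only the rows that changed since the previous run. The sketch below assumes each row carries a modification timestamp. The `replicate` function and the dictionary-based "databases" are stand-ins for illustration, and deletions are deliberately not handled.

```python
from typing import Dict, Tuple

Row = Tuple[float, str]  # (modification timestamp, value)


def replicate(central: Dict[str, Row], edge: Dict[str, Row], since: float) -> int:
    """Copy rows modified after `since` from central to edge.

    Returns the number of rows copied. Rows deleted at the central farm
    are NOT removed from the edge in this simplified sketch.
    """
    copied = 0
    for key, (modified, value) in central.items():
        if modified > since:
            edge[key] = (modified, value)
            copied += 1
    return copied
```

Each scheduled run would pass the timestamp of the previous successful run as `since`. Shortening the interval moves this toward real-time replication, at the cost of more frequent load on the central farm.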

The Best Solution

Each of these solutions has drawbacks, and you may have noticed that I've talked in general terms about data that needs to be pushed out to the edges, but not about client data that needs to be inserted or modified at the server farm. The most realistic solution for dynamic web applications is Content Blind Caching. Content Aware Caching has great promise for certain types of applications, but the fact that each request incurs a remote WAN request is too cumbersome to overlook. High-frequency replication can place too high a load on your central server farm while providing relatively little benefit, because each edge server also has to carry additional network and CPU load. Content Blind Caching offers the best performance in general terms, but realistically it has a high degree of data latency, which makes it unsuitable for applications with a lot of interactivity between geographically dispersed users.

Reverse caching is a possibility for highly interactive applications, but it raises dataset portability concerns that need to be addressed on a requirement-by-requirement basis. Whether a reverse caching structure is truly needed is questionable in most environments, unless there are wide area network connectivity issues. Whatever latency exists on the path to the central server farm, if the outbound load is minimized, the latency experienced with inbound activity should be acceptable in most scenarios.

Static Content

In a static content world, edge caching is exceptionally simple, and you really have two options - replicate your content, or use a proxy cache solution like Squid. Squid is the sexier of the two solutions. Wikipedia uses a whole boatload of Squid caches to handle its traffic loads. But Squid and other proxy caches have their limitations. The transaction volume that any proxy environment can handle is going to be significantly lower than that of a static web server serving local content. Even the largest of sites could be transmitted quickly over a 10MB connection, so it really doesn't make sense to proxy if you have the resources to manage replication. The downside to replication is that you have to make a conscious effort to get it done, whereas a proxy cache will pick up the latest content on its own (assuming cache control headers are transmitted properly).
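To make the cache control caveat concrete, here is a simplified sketch of how a cache might decide whether a stored copy is still fresh from the HTTP `Cache-Control` header. It only looks at `max-age`, `no-store`, and `no-cache`. Real proxy caches also consider `s-maxage`, `Expires`, validators, and heuristics, so treat the helper names `max_age_seconds` and `is_fresh` as illustrative only.

```python
import re
from typing import Optional


def max_age_seconds(cache_control: str) -> Optional[int]:
    """Return the max-age value in seconds, or None if the directive is absent."""
    match = re.search(r"(?:^|,)\s*max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None


def is_fresh(cache_control: str, age_seconds: float) -> bool:
    """Decide whether a stored response may be served without contacting the origin.

    Simplification: no-cache is treated the same as no-store here, although
    a real cache may keep the copy and revalidate it with the origin.
    """
    directives = {d.strip().lower() for d in cache_control.split(",")}
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = max_age_seconds(cache_control)
    # Without an explicit max-age this sketch never serves from cache.
    return max_age is not None and age_seconds < max_age
```

If the origin omits or misstates these headers, the proxy either keeps serving stale pages or goes back to the origin on every request, which is why the proxy approach depends on them being set correctly.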

I know that there are plenty of application scenarios that would require a database edge caching solution, but there are many more that are better suited to static replication. A lot of content is managed via a database without that being absolutely necessary. If you can output your site to HTML content - even if you have to schedule it to happen hourly - that is a much more efficient solution than any edge cache could provide.
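As a rough illustration of outputting a database-managed site to HTML, the sketch below renders rows to static files that can then be replicated to the edge. The row fields (`slug`, `title`, `body`), the page template, and the function names are assumptions made for this example. In practice the rows would come from your own database query, and the export would be run from a scheduler such as cron.

```python
import html
from pathlib import Path
from typing import Iterable, List, Mapping


def render_page(title: str, body: str) -> str:
    """Render one record as a complete, minimal HTML page (values are escaped)."""
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{html.escape(title)}</title></head>\n"
        f"<body><h1>{html.escape(title)}</h1><p>{html.escape(body)}</p></body></html>\n"
    )


def export_site(rows: Iterable[Mapping[str, str]], out_dir: Path) -> List[Path]:
    """Write one .html file per row into out_dir and return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for row in rows:
        # Assumes `slug` is already a safe file name; validate it in real use.
        path = out_dir / f"{row['slug']}.html"
        path.write_text(render_page(row["title"], row["body"]), encoding="utf-8")
        written.append(path)
    return written
```

Once the files exist, any plain static web server at the edge can serve them, and the database is only touched when the export runs.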


