Revisiting Voting System Design


Voting systems are common in "Web 2.0" social networking sites. People vote on articles, images, discussion comments, polls, web links, and even on other members of their community. Knowing that eventually I will want to implement a voting system of some sort within one of my communities, I have been peripherally thinking about the design structure of such a voting system. I don't know how other sites are implementing their voting systems, but from the voting systems I've taken a peek at within various pieces of software I've installed, I haven't seen a design yet that I would trust to scale to the degree of popularity that some of the major sites are enjoying. At least not without a significant investment in infrastructure.

I'm not big on major infrastructure investments for small web site operators, but I am big on the idea that if a small site becomes popular it should scale easily along with the growth of the community. I can see throwing extra commodity servers at a site and implementing load balancing, dedicated database servers, edge caching, or other forms of redundancy, but an extra commodity server can be had for a one time investment of under a couple of thousand dollars or a monthly fee of under $150. These low risk kind of investments are in line with the kind of revenue boosts that a webmaster would see along with the gradual growth of their communities. Investing in a $20,000.00 + dedicated database server with high performance everything is not a smart investment for a webmaster experiencing gradual growth - especially if the main purpose of the investment is to enable functionality that is as simple as a voting system.

The design of the systems I have paid attention to is typically one of the following:

  • Vote counts are incremented as they come in, with no tracking of the source
  • Individual votes are stored and related to the original object in a many to one relationship

The problem comes into play with a large site such as Digg. They publicly display vote counts for each article they have posted, which I believe is numbered at over 1 million stories. Only registred users can vote, and they can vote on each article up to one time. Further, each article has a set of comments, and the comments can be voted on individually up to one time per user. I can only imagine that they have registered users numbering in the hundreds of thousands if not a more significant number. Obviously, the first above referenced design isn't used, because you cannot vote on anything more than once. The second design would require a voting record table with a number of records that is at least one million records long multiplied by the average number of diggs that an article receives. If comment voting is stored in a secondary table, a similar size for that table would be in place multiplied again by the number of comments each article receives on average.

It certainly is possible to retain all of that data within a single dataset. It doesn't seem very efficient, although a very small table with twenty million records is certainly within the capability of most databases, and would return data efficiently with sufficient hardware. Consider again that the number of queries hitting this database table would number at least 10 times per page view on their main page (10 articles displayed), plus a similar number of additional queries for each additional page view that each visitor makes. As the site increases in popularity and as community interaction increases (voting, article submission and commenting) the strain on the database increases as well. If the site were to double in size over the course of a year, the 20 million record table would be rapidly approaching 100 million records - even with very clean indexed searches and an ultra-lite table size peak load times with a table of that size would put quite a strain on a database server.

So what sort of design does make sense? I think you have to look at things from each activity perspective, optimize for that particular activity, replicate data and data structures appropriately, and disperse storage requirements to the point where no one table would have the potential of reaching a straining point.

The first point to be made in regards to strategy would be that an archiving strategy needs to be implemented. At a certain point in time, voting needs to close. When voting is closed, data concerning anything other than the ending vote count can be archived offline or discarded. Total vote counts for archived items can be consolidated and maintained within a data structure associated with the archiving time frame or the original vote date. This limits your scalability requirements and makes administrators aware of the limitations that you intend to support. Information can be relayed to the community and expectations can be managed.

Secondly, cacheing the most frequently accessed information in a separate set of tables is probably most desirable. While database platforms do a good job of caching frequently accessed data, those caches are subject to automated performance enhancement routines that would be vulnerable to periodic strange activity from your user base. It's best to maintain a small subset of information in high performance tables so that if the database caching routines were to dump your most popular data out of memory, the recovery of that data to memory would be handled most efficiently. The exact amount of what you would want in your high performance structures would be a factor of user activity and hardware, but a safe place to start would be to keep a set of your most recent 7 days of active material inside a small structure. It would also be a good idea to store the activity records for your active users within high performance tables separate from activity records for the entire userbase which could include a large amount of data for users who no longer frequent the site.

I'll return to this subject later with further thoughts as to design strategy, but this is a halfway decent introduction to the concepts. Off the top of my head some further subjects of discussion would be figuring out a reasonably scalable storage mechanism for votes, and a system for vote validation. The storage mechanism should be able to translate the data to a presentation structure for the client interface very efficiently, including the possibility of IPC caching the data structures in a format as needed by the client interface. Separation of current vote counts versus verified vote counts is also a possibility. There's also the matter of what information should be stored with a vote - whether it's just the IP and a vote record, or it is something more complex, such as user information, what the vote was, timestamp of the vote, ip of the vote, etc. Anti-spamming measures should also be in place, as any reasonably popular system would be a target of automation attacks.



What visitors have to say about Revisiting Voting System Design