Database Wheel
From KallestadWiki
Pooling Database Connections
Connecting to a database in order to process a transaction involves a good deal of overhead that is not always necessary given the fact that your database interactivity is in many situations predictable. Think of the steps that must occur for an encapsulated database interaction.
- Connect to the database
- Prepare SQL Statement
- Execute SQL Statement
- Return Data
These four steps can be very fast, but multiplied over 100,000 interactions, you start to think about how to improve performance. The obvious step is persistence. Before we go there, let's associate some large arbitrary times to each of these steps.
- Connect - 5 seconds
- Prepare - 5 seconds
- Execute - 20 seconds
- Return - 5 seconds
These are extraordinary large times, but go with them just for the sake of easy conversation. If the connect step accounts for 14% of your processing time, connecting to the database persistently can chop of a significant amount of your overhead very easily. Of course, if you are processing 100,000 interactions, you probably aren't doing that in a single process, which means you have to have 1 connection per open process. In more situations than not, one open connection per process is overkill.
Caching common results can alternatively shave 85% of your processing time for those requests, since all you have to do at that point is return data. Of course, it's not that clean - with any cacheing mechanism there is a certain amount of overhead, but regardless the cacheing mechanism should incur significantly less overhead than database connectivity. Let's call it a 60% savings.
For other transactions, you can save an additional 14% of your processing time by cacheing prepared statements. This is a hit and miss game if you are trying to do it dynamically, but as an application designer, you know what queries hit the system with a great deal of regularity. For system generated and commonly expected queries, you can save a significant amount of time.
To optimize execution, it's really a matter of attacking each individual query. What we can do is identify query wait times and put out a log of queries that exceed a predefined limit so that those queries can be analyzed and proper action can be taken as a result. Action could consist of:
- Application redesign
- Query Optimization
- Index Optimization
- End user training
- Results cacheing
It doesn't make sense to cache prepared statements for 100 connections all across those connections. Startup costs would exceed any performance benefit. The best way that I can come up with to acheive the maximum benefit is to create a service that provides optimization and to have the end user interaction interface proxy through that service rather than connect to the database directly - a database connection pool.
Design
I don't want to get into threads vs. processes at this point. I don't know the best answer to that question. Threads are nice, but select based event loops are very efficient as well. Each has their own benefit.
Essentially, we want a management node listening for requests. That management node provides a single point of interaction for clients. The management node then passes the request off to an appropriate worker node to process. The worker node returns the results either to the management node or directly to the client. The management node would be responsible for starting up worker nodes as appropriate within predefined limits. It would also be responsible for ending worker nodes appropriately.
