Active Data vs. Reporting
When coming up with application requirements, access to data that is not necessarily working data inevitably comes up. Managers want to know what happened last week, last month, and last year in order to understand their needs as far as staffing levels and training requirements. They want to be able to understand an individual's performance and identify environmental problems - whether it's a problem customer, a problematic piece of hardware or what have you.In a great many environments, the solution to these data access requirements is to maintain historical data in your working tables. It's the easiest thing to do from a development perspective - all you have to do is enable a query engine of some sort, something that is likely already in place, and you have ready access to all of the data you need.
This isn't the best solution, but before I get into why, let me talk briefly about active data vs. reporting data. Active data drives the daily tasks for end users - whether it's a message board or a help desk - those data elements that are currently in use are the ones that are most important. They must be created, accessed, updated, and
Reporting data is different. With reporting data, high speed access is important, but data creation, updates, and deletes are not. Likely, with reporting data, you want to prevent end users from creating, updating, and removing information. Reporting data serves a different purpose. The data is a historical record of activity.
With reporting data, you want to be able to bend it and shape it in any number of ways to achieve a meaningful representation of what happened at a particular point in time or what happened over a range of time.
With active data you have a fairly rigid set of things that you are doing. You are creating records in response to interaction and enabling a defined set of application functionality.
Performance
Indexes are an important part of database strategy. They allow for quick searches on specific data elements without having to perform a complete scan of each record in the database. They must be maintained, of course. With every index you have defined in a given table, the database has something to do when you insert, update, or delete records. Defining too many indexes means that any interactions other than reading data will be adversely affected.Many high end platforms are self-regulating. They manage the data based on end user interactions - caching queries, optimize indexes, and structure storage ideally in such a way that the most commonly performed tasks are the most quickly performed tasks.
With archived data, we want as many indexes as we have potential search points. The data sets are extremely large and the data is updated at known intervals. We can leverage this knowledge greatly. One strategy might be to drop all indexes prior to a data update (speeding up the data import) and rebuild them afterwards (making them current and well structured).
With active data, indexing strategy is different. We don't want too many indexes, and with small sets of data indexes can actually be detrimental to performance. With active data we know all of the search points defined by our application. For each of these we can determine the best indexing strategy. We can also predict the search points most commonly accessed by users based on their work tasks and come up with an appropriate strategy.
Aside from indexing, there is the question of where to physically place data from a hardware perspective. From a reporting perspective, large chunks of data are likely to be accessed by individual users in sparse intervals. From an active data perspective, smaller sets of data are likely to be interacted with by a large amount of users with great regularity. Mixing the two elements on the same disk through the same controller will inevitably result in an I/O bottleneck. Write-level caching is great for interactive data and entirely useless for searching through archives. Conversely read-ahead caching techniques will be much less effective or not effective at all for active data in a great many situations.
Segmenting your archived data also enables the use of edge-caching techniques to reduce network latency for geographically disparate clients in highly scalable environments.
Retention is Bad
Data retention in your primary working tables is bad for several reasons. The smaller your working tables, the more likely they are to be cached entirely in memory. If they aren't cached and they are small, they can be placed in memory quickly - so a single user's interaction with the data can benefit other users who are on similar paths within a reasonable time frame.Data retention is also bad from the perspective of a self-tuning database. End users that look at inappropriately retained data skew the statistics that are used to tune the database. This results in optimization that is not geared towards primary usage.
In Conclusion
I'm not advocating the removal of all historical data from primary working tables. There are reasons that you want to retain a certain set of information. What I am saying is that you do need to analyze your architecture and segment responsibility appropriately. Write your software in such a way that data mining does not adversely affect the primary purpose of the application.Performance and scalability will become an issue for any application that achieves a high degree of success. The best time to address these issues is early on in application design - when your options are not limited by existing end user expectations. Considering performance and scalability early on alleviates many pitfalls on the road to success.
The best solution will change from application to application, user base to user base. Don't rely on advancements in hardware to make your solution viable.
Discuss Archival vs. Active Usage
