Categorization and N-Dimensional Space


When working with database-centric applications - which describes most applications in both the web-centric world and the client-server space - you inevitably want to categorize data. A categorization can be as simple as a keyword tag, or something slightly more complex such as a tier-structured menu, a collaborative tagging or voting system, or an elaborate combination of multiple fields.

These "simple" categorization systems certainly add value and do their job of structuring data, but what about predictive modeling? Perhaps you want to figure out what books a user might enjoy based on past voting history, or maybe you want to make a knowledge base system more effective based on the self-help history of previous users.

N-Dimensional Space

What is N-Dimensional Space? I suppose the best way to understand it is to first talk about two-dimensional space. In the two-dimensional world, you have an X axis and a Y axis. If you know two points in two-dimensional space, you can draw a line between them, calculate the distance between them, or do any number of other things. With three points, you can potentially draw a triangle, calculate the area of that triangle, and so on. In three-dimensional space you have an X, a Y, and a Z axis - corresponding to length, height, and depth. The mathematical possibilities for three-dimensional space expand greatly. Anything beyond three dimensions can be difficult to visualize mentally, but mathematically the work is the same: every extra dimension is just one more coordinate per point. The concept remains easiest to visualize if you choose to display only three dimensions at a time.
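To make "mathematically the work is the same" concrete, here is a minimal sketch of the distance calculation in Python. The function name `distance` is my own for illustration, and I'm assuming plain straight-line (Euclidean) distance; other distance measures would work on the same principle.

```python
import math

def distance(p, q):
    """Straight-line (Euclidean) distance between two points.

    Works for any number of dimensions, as long as both points
    have the same number of coordinates.
    """
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(distance((0, 0), (3, 4)))                    # 2 dimensions -> 5.0
print(distance((0, 0, 0), (1, 2, 2)))              # 3 dimensions -> 3.0
print(distance((1, 0, 2, 5, 3), (1, 4, 2, 2, 3)))  # 5 dimensions -> 5.0
```

The same few lines handle 2, 3, or 5 dimensions; only the length of the tuples changes.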

Categorization

I spoke briefly about various categorization systems above, but none of those examples excludes the others. A given data set can have an unlimited number of categorization methods and, depending on the implementation, an unlimited number of categorizations associated with each individual data entity. Further, you can categorize not only your data set but your end user as well - through past history, a brief questionnaire, or by monitoring system interaction in real time. This can give you an idea of what information the end user is looking for before they actually go looking for it.

Let's take an internet search engine as an example. We all search for random things at one time or another, but I suspect that if you examined your own usage you would find that the majority of your searches lie along a small number of vectors - mine might be computer programming information, webmaster information, technical news topics, and databases. Further, on any given day a vector that I follow may be tied to a problem I'm trying to resolve - a hard disk failure, an iTunes bug, or something like that.
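As a rough sketch of how a search history could be turned into a position in space, the Python below counts past searches per topic and normalizes the counts into one coordinate per axis. The axis names, the `build_profile` function, and the assumption that each past search has already been labeled with a topic are all mine, purely for illustration.

```python
from collections import Counter

# Hypothetical axes, one per topic I tend to search for.
AXES = ["programming", "webmaster", "tech_news", "databases", "other"]

def build_profile(history, axes=AXES):
    """Turn a list of topic labels into one coordinate per axis.

    Each coordinate is the share of past searches on that axis,
    so the coordinates add up to 1 (or are all 0 with no history).
    Labels that match no axis are ignored.
    """
    counts = Counter(history)
    total = sum(counts[axis] for axis in axes)
    if total == 0:
        return [0.0] * len(axes)
    return [counts[axis] / total for axis in axes]

history = ["programming", "databases", "programming", "other",
           "tech_news", "programming", "webmaster", "programming"]
profile = build_profile(history)
print(profile)  # [0.5, 0.125, 0.125, 0.125, 0.125]
```

Half of this made-up history is programming, so the profile sits furthest out along that axis.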

Whereas a keyword based search engine would provide me with a fairly generic result set, an intelligent search engine might see that I'm often searching for technical information and float technical results closer to the top of the rankings. A very intelligent engine would float error resolution results that are related to hard disks to the top if I happened to be sending a series of data recovery keywords followed by a search for a specific error message.

Predictive Systems

In an n-dimensional predictive system, the center of the world would contain the most broadly utilized result set and the most typical end user. Beyond that, anything goes. A particular axis could be simply a positive or negative value. It could be a keyword-targeted result set. It could also be a general idea axis in which ideas are represented as points, as lines, or as shapes.

Start out with what you know about your user and assign that user a point, a line, a vector, or an n-dimensional shape. Overlay the data set. Within that user's space, and branching out from there in degrees of relevance, is the information they are looking for. When you have input, first search the data set within that user's space. Branch out from there, or leave the region entirely if necessary (preferably with a result set ordered by distance from your user's center), and you should yield the highest-value results.
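Here is one way that search-then-branch-out step might look in Python. This is only a sketch under my own assumptions: the user is a single point, every item has already matched the query by some other means, and "branching out" means doubling a search radius until enough results turn up. The names `search_near_user`, `radius`, `wanted`, and `max_radius` are hypothetical.

```python
import math

def distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def search_near_user(user_center, items, radius=1.0, wanted=2, max_radius=16.0):
    """Return item names nearest the user first, widening as needed.

    items maps a name to its point in the same space as user_center.
    The radius doubles until at least `wanted` items fall inside it
    or `max_radius` is reached.
    """
    ranked = sorted((distance(user_center, point), name)
                    for name, point in items.items())
    while True:
        hits = [name for d, name in ranked if d <= radius]
        if len(hits) >= wanted or radius >= max_radius:
            return hits
        radius *= 2

# Hypothetical two-axis space: (technical, cooking).
items = {
    "disk recovery guide": (1.0, 0.5),
    "raid error codes": (2.0, 0.0),
    "kitchen knife care": (1.0, 3.0),
    "sourdough basics": (0.0, 4.0),
}
user = (1.0, 0.0)

print(search_near_user(user, items, radius=0.75, wanted=2))
# ['disk recovery guide', 'raid error codes']
print(search_near_user(user, items, radius=0.75, wanted=3))
# ['disk recovery guide', 'raid error codes', 'kitchen knife care']
```

Sorting every item by distance is fine for a toy data set; a large one would presumably need some kind of spatial index, which I have not tried to show here.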

The higher the number of dimensions, and the more accurately categorized they are, the better your result set.

The Future

I've been wondering what kinds of applications could take advantage of a high number of processing cores. Finding a way to split most applications into 64 simultaneously processing threads is difficult to imagine beyond heavy-duty graphics work such as video rendering. N-dimensional spatial analysis is a prime candidate for this kind of hardware - although the number of real-world applications that could take advantage of such techniques on such a large scale would be quite limited.

Potential Problems

The biggest potential problem with predictive analysis has already happened. Just do a Google search for "My Tivo thinks I'm gay". Making false assumptions when categorizing end users, or including a random segment of activity that does not relate to the end user's profile as a whole, can be problematic when trying to understand what information they are looking for. These problems can be reduced by continued analysis of end user activity and by allowing the end user to see and modify their own profile variables - and perhaps by allowing them to wall off certain activity into a sub-profile.

Proximity as an Aid and Not a Solution

The proximity of n-dimensional points, given a data set, an activity profile, and a set of query criteria, can provide very meaningful results, but not all of the relevant results. N-dimensional analysis should be used as a floating mechanism to push the most relevant results to the top of the stack, not necessarily as a filtering mechanism to remove results that do not fall within a particular spatial region. Legacy results - whether they come from a keyword-based full-text search, a SQL query, or some other means - will provide your end user with what they expect.
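The difference between floating and filtering can be sketched in a few lines of Python. This is an illustration under my own assumptions, not a finished ranking algorithm: `legacy_results` stands for whatever your existing search returns, `positions` maps categorized results to points, and `float_by_proximity` is a hypothetical name.

```python
import math

def distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def float_by_proximity(legacy_results, positions, user_center):
    """Reorder legacy results by closeness to the user; drop nothing.

    Results with no known position sink to the bottom but keep
    their original legacy order relative to each other.
    """
    def sort_key(indexed):
        legacy_rank, name = indexed
        point = positions.get(name)
        d = distance(user_center, point) if point is not None else math.inf
        return (d, legacy_rank)

    ordered = sorted(enumerate(legacy_results), key=sort_key)
    return [name for _, name in ordered]

# Hypothetical two-axis space: (technical, cooking).
positions = {
    "disk recovery guide": (1.0, 0.5),
    "raid error codes": (2.0, 0.0),
    "sourdough basics": (0.0, 4.0),
}
legacy = ["sourdough basics", "raid error codes",
          "unindexed forum post", "disk recovery guide"]

print(float_by_proximity(legacy, positions, user_center=(1.0, 0.0)))
# ['disk recovery guide', 'raid error codes', 'sourdough basics', 'unindexed forum post']
```

Every legacy result is still in the list, including the one that was never categorized; the spatial analysis only changes the order.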

Until you have had a lot of time testing real world end user experiences and have tweaked and tested your algorithms completely, you should not ignore the results that come from your existing workhorse systems.


