Referential Integrity


Referential Integrity is something that's simply not addressed by a lot of new developers, and it is one of the primary elements that separates a utility project from something that is comercially viable.

Let's say you have two tables. The first table, we'll call it the PeopleTable has a list of names. The second table has a list of phone numbers for said people, we'll call it PhoneNumbersTable. People generally have the capability of posessing either zero, one, or more than one (many) phone numbers. These two tables would have a relationship of "one to zero, one, or many" - meaning that for every person in PeopleTable, there are zero, one, or many phone numbers in PhoneNumbersTable. An important part of that relationship is that for every phone number in PhoneNumbersTable there exists one, and only one, person in the PeopleTable.

Let's not stumble on practicality here, I do realize that some phone numbers are associated with two people, and some phone numbers are not associated with any people, but I don't want to get into the complexities of those kinds of relationships just yet.

Referential Integrity means that you (or really, you're application) is aware of the defined relationship between the two tables, and that the relationship as defined cannot be broken.

The "One to Zero, One, or Many" relationship requires that any time you are deleting a person from PeopleTable, you delete any associated phone numbers from PhoneNumbersTable. That way you don't end up with any "orphaned" records in PhoneNumbersTable. This relationship does not work both ways - because you can have a zero relationship, you may delete phone numbers from PhoneNumbersTable with impunity.

Following Referential Integrity rules ensures that your data set is stable, meaning that any reporting you do on your data set is valid. You can certainly have whatever referential integrity rules you want based on the needs of any given requirement, but in the end you want to ensure that your data is structured to meet all of your requirements. If you are maintaining orphaned records because you have a data retention requirement, you want to be able to flag those orphaned records so that any reporting on said data can take into account that orphaned records exist.

The "Referential" part of the term implies a reference. Any data set that incurs a referential integrity requirement must have a unique key or identifier for each and every record - this can be as simple as a counter incremented with each insertion, or something more complex that would be suitable for a high transaction, large user base environment. In a one to many relationship, the unique key of the left side (one), is stored with each associated record on the right side (many). That way, you can have limitless references. In a many to many relationship, the key references must be stored in a separate table.

Of course, I've been talking tables, but this architecture is applicable in all situations, whether you are storing a hash, a complex data structure, or if you are working with a simple spreadsheet. Having referential integrity rules not only is important for reporting, but also in developing the workflow of an application where the developer often will make assumptions about the data set. Communicating these rules to other developers is important, as well as keeping track of the reasoning behind the rules so that when a change to the rule set is required you can address any existing workflow that would be affected.



Referential Integrity Commentary