The Data Quality Paradox
The mistaken use of unique keys as the unique identifiers of records is perhaps the greatest paradox of data quality and the primary cause of duplication of entries in any database.
The problem starts when data analysts use codes as the Unique Identifiers (UIDs) of data entities in the data modelling stage of analysis.
Data Modeling 101: Codes are not identifiers!
Data Modelling 102: The unique identifier of a Data Entity is NEVER A CODE!!
Many data analysts may have apoplexy at these statements, as it is a practice that they have followed most of their modelling lives, but simple logic can show the practice to be totally flawed.
For something to be a unique identifier it must first of all qualify as an identifier. Codes do not identify! Don’t believe me?
A Simple Test
If I were to take you to the middle of a warehouse with 400 unlabeled parts laid out on the floor and ask you to go and pick up part 14563 you would not be able to proceed. No. Why? Because 14563 does not in any way identify any of the parts. If I say “pick up a motor” then you will be able to make a start. You might need to ask “do you want a hydraulic motor or an electric motor?” But you will be able to identify a motor among the parts.
We see from this that codes are not identifiers. If they are not identifiers then they CANNOT be unique identifiers.
What is a UID?
A unique identifier enables us to answer the question; “what is it, with respect to this enterprise, that makes one occurrence of a data entity uniquely different from every other occurrence of that data entity?” If we know this then we will have the unique identifier for that data entity.
With regard to a finance company that loans money for the purchase of motor vehicles, lets ask the following question. “With regard to this enterprise, what is it that makes one customer uniquely different from another customer?” For the purposes of this example we will stick to customers who are individuals, as opposed to corporate entities.
Straight away, some data analysts will shout out “it’s the customer number!” Wrong!
Example UID Problem
To demonstrate this I could take you to many finance companies and show you the customer data files with records that would look very similar to the following.
Cust No Forename Surname Street Town D.O.B
1001 John Smith 123 High St Bristol 22 Dec 1950
1002 John Smith 123 High St Bristol 22 Dec 1950
1003 John Smith 123 High St Bristol 20 Aug 1971
1004 Mary Jones 27 West St Walsall 12 May 1975
1005 Mary Green 27 West St Walsall 12 May 1975
1006 Mary Jones 22 Grove St London 12 May 1975
Every customer has a different Customer Number, but do these six records represent six distinct customers?
It is highly likely that customer No 1001 and 1002 are the same person. But what about 1003? The same person again with an error in the date of birth or a younger person of the same name at that address; the son of the older John Smith perhaps?
Looking at Customer Numbers 1004 and 1005, we ask is Mary Green the same person as Mary Jones – now married to a Mr Green perhaps? Or is she an entirely different person who just happens to have the same date of birth?
What about Customer 1006? Is this the same Mary Jones who has moved house? If she had a bad debt record at 27 West St would you want to do business with her at 22 Grove St?
Data Uniqueness is a Business Decision
This set of records forces us to repeat the question; “What is it, in the context of this enterprise, that makes one customer uniquely different from every other customer?”
One thing that is for certain is that it is NOT going to be the Customer Number!
Uniqueness must be defined by the enterprise. It is the job of data analysts to help the enterprise arrive at this definition.
Database designers have the job to implement this uniqueness in a manner that prevents duplicate records being created in the database.
It is impossible for database designers to implement uniqueness if the enterprise has not defined uniqueness from a business point of view and the data analysts have not modelled this.
Unique Keys are Not UIDs
The use of the word “unique” in the term “unique key” and in “unique identifier” at some point in the past got misinterpreted completely by data analysts and database designers, in that they thought that unique keys were a mechanism for implementing unique identity in a database.
Because of this, what they did was to introduce to every database a mechanism that would, above all else, allow duplication to occur!
Amazingly, this error is still practiced worldwide on a daily basis, and not just by those analysts and designers who made the mistake many years ago but also, sadly, by those coming new to data analysis and database design.
Let’s Stop it Now!
Removing Duplicate Records
How can all of these duplicate records be removed from your database? You could buy a clever piece of software that will do it for you?
You could, but it would simply be changing the problem not solving it. You would either end up with a database with no duplicates removed or with a whole lot of records merged that should never have been merged. Try to sort that one out!
The fact is that, if the enterprise has not defined what it sees as the elements that make a data entity unique, then no piece of software can.
The data shown above is only a short, simplified and denormalised example, but is typical of data held by various finance companies for whom I have done consultancy.
I am not making an argument against unique primary keys in tables, as these are essential, but against confusing unique keys with unique identifiers.