You interact with data models all day long. Most of the things you do on the internet, certainly anything transactional like buying something or making a reservation or taking a survey, involves a database-backed website. All databases have explicit data models. But many of the data models you interact with are hidden. Bus schedules, nightclub lists, birth certificates, and death certificates all have implicit data models. And when a data model is broken or inadequate, systems based on that model fail. How can you spot a broken data model, and why is it hard to get them right in the first place?
Like any model, a data model is an abstracted, simplified representation of something else. The first ingredient of a data model is the field, which is a single piece of data constrained by a name and a type (is it a number, or a word, or a time?) and maybe other things such as maximum or minimum length.
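To make the idea of a field concrete, here is a minimal sketch in Python. The `Field` class, its `accepts` method, and the three-character bus line limit are all made up for illustration; real systems express the same constraints in many different ways.

```python
from collections.abc import Sized
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Field:
    """A single piece of data constrained by a name, a type, and optional length limits."""
    name: str
    type: type
    min_length: Optional[int] = None
    max_length: Optional[int] = None

    def accepts(self, value: Any) -> bool:
        """Return True if value satisfies the type and length constraints."""
        if not isinstance(value, self.type):
            return False
        if isinstance(value, Sized):
            # Length limits only make sense for values that have a length.
            if self.min_length is not None and len(value) < self.min_length:
                return False
            if self.max_length is not None and len(value) > self.max_length:
                return False
        return True


# Hypothetical field: a bus line label of one to three characters.
bus_line = Field(name="bus_line", type=str, min_length=1, max_length=3)

print(bus_line.accepts("L"))      # a short label fits
print(bus_line.accepts("LOCAL"))  # too long for this model
print(bus_line.accepts(38))       # wrong type: a number, not a word
```

Notice that the model has already taken a side: a bus line that is "really" a number has to be written as text to fit.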
One way a data model can go awry is if a field is of the wrong type. If you are looking at a bus schedule at a bus stop, the name of the bus line is a field. But is it a letter or a number or a name? And does "L" mean Limited or Local? The timing of the bus is another field, and it could be either of two common types: a sequence of times when the bus is scheduled to pass this stop, or a frequency with which buses are expected to pass. If the bus actually comes whenever the driver wakes up, neither of these traditional models is going to fit well.
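The two timing types can be sketched as two small Python classes. The names `FixedTimes`, `Frequency`, and `describe` are assumptions for illustration, not anything a real transit system uses. The point is that a bus that comes whenever the driver wakes up fits neither class.

```python
from dataclasses import dataclass
from datetime import time, timedelta
from typing import List, Union


@dataclass(frozen=True)
class FixedTimes:
    """The bus is scheduled to pass this stop at these times."""
    times: List[time]


@dataclass(frozen=True)
class Frequency:
    """A bus is expected to pass once per interval."""
    interval: timedelta


# The timing field must be one of the two models, nothing else.
Timing = Union[FixedTimes, Frequency]


def describe(timing: Timing) -> str:
    """Render the timing field the way a sign at the stop might."""
    if isinstance(timing, FixedTimes):
        return "Passes at " + ", ".join(t.strftime("%H:%M") for t in timing.times)
    if isinstance(timing, Frequency):
        minutes = int(timing.interval.total_seconds() // 60)
        return f"Passes every {minutes} minutes"
    raise TypeError("timing fits neither model")


print(describe(FixedTimes([time(7, 5), time(7, 35)])))
print(describe(Frequency(timedelta(minutes=12))))
```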
Note that we're not talking yet about computers; any data model is going
to have fields, I suspect, just as any human language is still going to
have some deep grammar in common with other languages, like the
concepts of noun and verb. The more specific and constrained a model
is, the less power it has to describe outlying phenomena, but the more
power it has to analyze and control what it does capture. For example, death certificates capture one or two causes of death,
but this is grossly inadequate for capturing the reality. A New Yorker article describes some of the epistemological challenges (e.g., which is the most accurate cause of death: lack of oxygen to the brain, or the cause of that, heart attack, or the cause of that, extreme loss of blood, or the cause of that, getting shot? Or is the cause lack of medical attention after getting shot, because of international politics or, e.g., racism?). Cause of death is captured in a rigid list of codes, and as we go from paper-based data modeling to computerized data modeling, it's easy to get authoritative, precise numbers which are very, very wrong, such as under-reported cyclist deaths, undercounted athlete deaths, or undercounted child deaths.
So overly narrow data models can lead to wrong data; when they are computerized, they can also have so much power that people change their lives to accommodate the limitations of the computer. For example, if your name is hyphenated, "many computer login systems don’t accept hyphens, so you have to decide between a space, no space, or dropping one of your names." And "The editor of the Irish Voice newspaper could book the flight only by giving up his national identity. "I dropped the apostrophe and ran my name as 'ODowd,'" he said." (ABC News). And these two problems only scratch the surface of the computers vs humans name problem, as this list of false programmer assumptions makes clear:
32. People’s names are assigned at birth.
33. OK, maybe not at birth, but at least pretty close to birth.
34. Alright, alright, within a year or so of birth.
35. Five years?
36. You’re kidding me, right?
39. People whose names break my system are weird outliers. They should have had solid, acceptable names, like 田中太郎.
Of course these are not computer vs human problems, since humans make computers. They are problems that the people commissioning, designing, building, testing, and maintaining computers deal with as they try to make computer systems within the traditional trilemma (faster, better, cheaper). I.e., this system can be almost on budget, or it can be done this year, or it can handle Korean names; pick two out of three. Blame the manager, not the programmer, if you disagree with the tradeoff that actually got made.
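Here is a rough sketch of how the name problem tends to show up in code. The pattern and both function names are made up for illustration; I am not claiming any particular login or booking system works this way. The narrow rule is the kind that rejects hyphens and apostrophes, and the permissive one simply stores whatever non-blank text the person typed.

```python
import re

# A narrow rule: ASCII letters only, with single spaces between words.
NARROW_NAME = re.compile(r"[A-Za-z]+( [A-Za-z]+)*")


def narrow_accepts(name: str) -> bool:
    """Accept only names that fit the narrow pattern."""
    return NARROW_NAME.fullmatch(name) is not None


def permissive_accepts(name: str) -> bool:
    """Accept any non-blank text and leave it exactly as entered."""
    return bool(name.strip())


# "Smith-Jones" is a made-up hyphenated name; the others come from the text above.
for name in ["O'Dowd", "ODowd", "Smith-Jones", "田中太郎"]:
    print(name, narrow_accepts(name), permissive_accepts(name))
```

Under the narrow rule only "ODowd" gets through, which is exactly the name the editor had to invent for himself.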
Not all problems with data models point to broken data models,
however. When you are arguing with a bouncer because your name isn't on
the nightclub list, you both agree that there is a paper list in the
bouncer's hand. You agree on the data model the list embodies: a list
of names representing human beings and a time constraint, that the list
is, let's say, for tonight only. And you agree on the semiotics of the
list: that the data model accurately reflects the conceptual model of
nightclub access. Nothing's wrong with the data model; you just
disagree about the contents.
Another example of a not-really-broken data model: You get onto an
airplane and somebody else has a ticket for the same seat as yours. You
think that any data model that allows two people to have tickets for
the same seat on the same flight must be broken. But airline tickets
actually model who has been sold rights, not who gets to sit in each
seat. As long as some percentage of passengers are no-shows, it's not
unreasonable (from the airline's perspective) to sell tickets for more than
100% of the seats. The data model isn't broken, necessarily; it just serves the
airline's needs, not the passengers'.
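A small sketch of what such a ticket model might look like, assuming a made-up `Ticket` record and a made-up flight number. Nothing in the record says a seat can belong to only one passenger, so two tickets for the same seat are perfectly valid data; the conflict only appears when you go looking for it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Ticket:
    """A right that was sold to a passenger, not a guarantee of a seat."""
    passenger: str
    flight: str
    seat: str  # printed on the ticket, but not required to be unique per flight


def double_booked_seats(tickets: List[Ticket]) -> Dict[str, List[str]]:
    """Map each 'flight seat' held by more than one ticket to its passengers."""
    by_seat = defaultdict(list)
    for ticket in tickets:
        by_seat[(ticket.flight, ticket.seat)].append(ticket.passenger)
    return {
        f"{flight} {seat}": passengers
        for (flight, seat), passengers in by_seat.items()
        if len(passengers) > 1
    }


tickets = [
    Ticket("you", "XX100", "14C"),
    Ticket("somebody else", "XX100", "14C"),
    Ticket("a third person", "XX100", "15A"),
]
print(double_booked_seats(tickets))
```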
Yet another thing that can go wrong: The representation of the data model can substitute for the
actual data model in an argument. For example, someone might argue that a gay
couple cannot marry because the marriage license has fields labeled
"husband" and "wife." The disagreement is not about the form, but about
the social data model the form represents. The normatively heterosexual
data model may not be broken, since it accurately models a relationship
many people have, but it is inadequate to document new kinds of marriages. Is the problem in the data model or in the reality being modeled?
The other big ingredient of data models is the relationship. The fields are attributed to entities; how are the entities related? And which entities are related to which other entities, directly and indirectly? A simple example: You own a bunch of gas stations. Okay, each gas station is an entity. Each type of gas is an entity. The fact that a particular gas station sells a particular type of gas? You might think that that's just implied by saying that there's a relationship between a certain gas station and a type of gas. But don't you want to keep track of how much gas, and when a gas station is out of diesel? And you may want to track maintenance of individual pumps, so is the entity the pump or the gas station? What if, at some of your stores, all pumps draw from a single tank for gas, but each diesel pump has a dedicated tank? If you are building a program to help gas station owners track and predict fuel usage, you are probably going to want to visit a fair number of gas stations and ask a lot of questions before committing to your answers. This post is long enough already, so just imagine on your own time all of the kinds of things that can go wrong with data model relationships.
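Here is one possible answer to the gas station questions, sketched in Python. Every class, field, and number below is an assumption for illustration, and a visit to a few real gas stations would probably change it. In this version the tank is its own entity, so several pumps can share one tank or each pump can have a dedicated one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tank:
    fuel_type: str
    capacity_liters: float
    level_liters: float


@dataclass
class Pump:
    pump_id: str
    tank: Tank  # several pumps may point at the same tank


@dataclass
class Station:
    name: str
    tanks: List[Tank]
    pumps: List[Pump]

    def is_out_of(self, fuel_type: str) -> bool:
        """True if the station has tanks for this fuel and all of them are empty."""
        levels = [t.level_liters for t in self.tanks if t.fuel_type == fuel_type]
        return bool(levels) and all(level <= 0 for level in levels)

    def dry_pumps(self) -> List[str]:
        """IDs of pumps whose tank is empty."""
        return [p.pump_id for p in self.pumps if p.tank.level_liters <= 0]


# Two gas pumps share one tank; each diesel pump has a dedicated tank.
regular = Tank("regular", 40000, 12000)
diesel_a = Tank("diesel", 10000, 0)
diesel_b = Tank("diesel", 10000, 3500)
station = Station(
    name="Main Street",
    tanks=[regular, diesel_a, diesel_b],
    pumps=[Pump("1", regular), Pump("2", regular), Pump("3", diesel_a), Pump("4", diesel_b)],
)
print(station.is_out_of("diesel"))  # one diesel tank still has fuel
print(station.dry_pumps())          # but one diesel pump is dry
```

Even this toy version shows the ambiguity: the station is not out of diesel, but a customer at pump 3 would disagree.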
Bank accounts ...
Sidebar: Another saga of recursive task breakdown
By the way, I started writing this blog post because I got stuck on a design problem on my project, WhatNext. While I was getting prepared to write, I noticed a browser tab that reminded me that my IFTTT is broken. That's a website I use to automatically post on Facebook when I blog. It's broken because I have two different accounts (for stupid reasons, although I'm still not totally sure if the stupidity was mine or theirs) and I'm not sure which to use going forward. So I log in to fix IFTTT, and I decide which one to keep, but it's the one that doesn't also post to Twitter, so I realize I should set it up for Twitter. Which means logging in to Twitter, at which point I decide both that I should switch to a real password for Twitter and that I might as well set up two-factor authentication. While waiting for the text message for that, I realize I never plugged in my phone after it died yesterday.
Programming is predicting the future
The data model is the foundation for any data-based computer program. And like any foundation, you have to build it before you are completely sure what will go on it, or how what it supports will be used throughout the life of the foundation. To judge if the model is correct, you really have to know exactly what it will be used for so that you can see if it supports that use or not. And since you can't know the future with certainty, you must rely on past examples, some generalized as gut feelings or heuristics or best practices.
For example, you may not bother setting the timestamp on your camera because it takes pictures just fine without it and you don't need to know when your pictures were taken. Then one day you want to transfer some pictures to somebody else, but the camera's memory card contains a mix of pictures you want to share and pictures you absolutely do not want to share; if you had set the timestamp, you could exclude all pictures from 2011, if that was your bad-hair year. Instead you have to check each picture one at a time. Of course, you couldn't foresee this particular use, but since setting the timestamp is so easy, it probably would have been worth it. On the other hand, if your camera were broken such that you had to manually reset the timestamp every time you used it, it probably wouldn't have been worth it. On the third hand, your camera also does GPS; do you need to buy and carry around a heavy, expensive GPS unit just to GPS-stamp all of your pictures? What are the chances that some day you'll photograph a distinctive tree in a forest, and then years later recognize the tree in a treasure map?
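The camera example can be sketched in a few lines of Python. The `Photo` class, the function names, and the filenames are made up for illustration. The timestamp is an optional field, and whether it was filled in decides whether the later filtering can be automated.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Photo:
    filename: str
    taken_at: Optional[datetime] = None  # None if the camera clock was never set


def shareable(photos: List[Photo], excluded_year: int) -> List[Photo]:
    """Photos known to be from outside the excluded year."""
    return [
        p for p in photos
        if p.taken_at is not None and p.taken_at.year != excluded_year
    ]


def needs_manual_check(photos: List[Photo]) -> List[Photo]:
    """Photos with no timestamp, which have to be looked at one at a time."""
    return [p for p in photos if p.taken_at is None]


photos = [
    Photo("beach.jpg", datetime(2010, 7, 4, 15, 30)),
    Photo("bad_hair.jpg", datetime(2011, 3, 12, 9, 0)),
    Photo("mystery.jpg"),
]
print([p.filename for p in shareable(photos, excluded_year=2011)])
print([p.filename for p in needs_manual_check(photos)])
```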
Computer programmers, especially when constructing data models, are constantly making more expensive decisions than this on much more speculative ground. Is this system ever going to have people entering Korean characters? Is the definition of marriage going to change during the lifespan of this computer system?
Problem at hand in WhatNext
when adding a new task,
Why do I need to solve this? Until I solve this, I don't know the best way to track whether to do this blog post or ...