This week there were some work discussions about data types and sort ordering, which might be interesting to write down.
In one part of the system we have a set of records that are numbered in a hierarchy: “1.1”, “1.2” and so on. Another form is prefixed with a code, e.g. “AC-6.1”, “AC-18.5”.
Can we sort them in the natural order? Can we fiddle.
Alphabetical sorting places 1.10 before 1.2, and that tends to really mess up the view. It’s also not possible to just use floating-point numerics, because of the prefixes, but also because sometimes there are more than two levels: “4.3.2.4.2”.
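To make that concrete, here is a small Python sketch of both failure modes. The label list is made up from the examples in this post, and `as_float` is a hypothetical helper written only for this illustration.

```python
labels = ["1.2", "1.10", "1.1", "AC-18.5", "AC-6.1", "4.3.2.4.2"]

# Plain string sorting compares character by character,
# so "1.10" lands before "1.2" and "AC-18.5" before "AC-6.1".
alphabetical = sorted(labels)
print(alphabetical)


def as_float(label):
    """Hypothetical helper: try to read a label as a float, or return None."""
    try:
        return float(label)
    except ValueError:
        return None


# Floats cannot tell "1.1" and "1.10" apart, and they reject
# prefixed or multi-level labels outright.
for label in labels:
    print(label, "->", as_float(label))
```

The string sort gives 1.1, 1.10, 1.2, and the float route collapses "1.1" and "1.10" into the same value while failing on everything with a prefix or a third level.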
One bad but natural cop-out is to make a string with padded zeros: “DSS05.04”, “APO10.05” or whatever. That’s just ugly, and runs the risk that adding just one more entry in a section (the hundredth, under two-digit padding) will force renumbering of the entire scheme - which literally defeats the whole purpose. (COBIT, I’m looking at you.) I mean, it’s fine as an internal hack, but not a good way forward.
The thing we need has been around a while, and is called a tumbler. This is a great example of a fundamentally simple and useful “primary data type” that’s obviously missing from just about all our tools. If Python and Java and Go and PostgreSQL had a tumbler type, we’d just load in the data and it would behave naturally. No zero-padding required. (Then we’d also get tumbler arithmetic! Better hyperlinks, even!)
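Lacking a built-in, the sorting part is easy enough to fake. What follows is a minimal Python sketch, not a full tumbler implementation: the `Tumbler` class, its `prefix` and `parts` fields, and the `parse` method are all names invented for this post, and it assumes a label is an optional code, a hyphen, then dot-separated non-negative integers. It covers ordering only, with no arithmetic.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True, order=True)
class Tumbler:
    """Illustrative only: an optional prefix plus a tuple of integers."""
    prefix: str
    parts: Tuple[int, ...]

    @classmethod
    def parse(cls, text):
        # Split off an optional "AC-" style prefix at the last hyphen.
        prefix, _, digits = text.rpartition("-")
        return cls(prefix, tuple(int(p) for p in digits.split(".")))

    def __str__(self):
        dotted = ".".join(str(p) for p in self.parts)
        return f"{self.prefix}-{dotted}" if self.prefix else dotted


labels = ["1.10", "1.2", "1.1", "AC-18.5", "AC-6.1", "4.3.2.4.2"]

# order=True compares (prefix, parts) field by field, and tuples of
# integers compare level by level, which is the ordering we wanted.
in_order = [str(t) for t in sorted(Tumbler.parse(s) for s in labels)]
print(in_order)
```

Because each level is compared as an integer, 1.2 comes before 1.10 and AC-6.1 before AC-18.5, with any number of levels and no padding. Unprefixed labels sort first here simply because the empty string sorts before any code; that is a side effect of the sketch rather than a considered choice.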
The primary data types that we do have are all broken, too. Integers are restricted to arbitrary ranges and often wrap around at the edges; some representations (ones’ complement, say) even distinguish positive and negative zero, as IEEE 754 floats still do. Floating-point values often don’t add up or compare the way you’d expect. (The Posit type is potentially an improvement, but IEEE 754 is baked into many pieces of hardware and software.) Complex numbers and tensors are basically glued together, not as kintsugi but with painter’s tape. Don’t even get me started on dates, or strings.
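A few of those complaints are easy to reproduce. This is a small Python sketch using only the standard library; Python’s own integers are unbounded, so it borrows a 32-bit C integer from `ctypes` to show the wrap-around. The variable names are just for illustration.

```python
import ctypes
import math

# A fixed-width integer wraps: one past the largest 32-bit value goes negative.
wrapped = ctypes.c_int32(2**31 - 1 + 1).value
print(wrapped)

# Binary floating point cannot represent 0.1, 0.2 or 0.3 exactly,
# so the sum is not equal to the literal, only very close to it.
total = 0.1 + 0.2
exact_match = (total == 0.3)
close_match = math.isclose(total, 0.3)
print(total, exact_match, close_match)

# IEEE 754 has two zeros that compare equal but carry different signs.
zeros_equal = (0.0 == -0.0)
zero_sign = math.copysign(1.0, -0.0)
print(zeros_equal, zero_sign)

# NaN is not equal to anything, including itself.
nan = float("nan")
nan_equals_itself = (nan == nan)
print(nan_equals_itself)
```

None of this is a bug in Python; it is the behaviour the underlying representations specify, which is rather the point.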
Our descriptions of data are broken because they’re conventional, developmental, interim, provisional, disarrayed, cultural artifacts: created by humans. It’s not like we can magically “just do better”; when the Pope introduced an improved calendar in 1582, it took hundreds of years to get widespread adoption, and caused all sorts of strife. Missed days and missing paychecks. Literal riots. Confusion that persists today about exactly when the holidays should happen. And he didn’t even need to deal with time-zones.
It gets worse! If I revise the definition of “date”, or “integer”, how should I label that revision? There is no numbering scheme that works for it! (Tumblers are approximately best, but SemVer will not save you!) Look at Python: there’s a trail of well-intentioned PEPs for package version identification, and they’re absolutely not done yet. We can’t even manage to catalog the evolution of things, let alone major-version them in the real world.
So, it’s a work in progress. Let’s just keep on keeping on. Sometimes we get a chance to try to apply leverage to one of the fundamentals, and that’s very exciting. Most of the time, we’re just patching around the gaps, trying to make incremental change.