Masks as Error Codes
The idea introduced in the profiling chapter — that a mask can be used as a key to retrieve records of a particular structural type — leads to what is perhaps the most important insight in the entire DQOR framework: every mask is an implicit data quality error code.
If you know what masks are "correct" for a column, then every other mask is an error. And unlike a generic boolean flag ("valid" or "invalid"), the mask itself tells you what kind of error it represents. The mask 99999 appearing in a name column does not just say "this is wrong" — it says "this is a numeric value where text was expected." The mask A/A does not just say "this fails validation" — it says "this is a two-character abbreviation with a slash, probably a placeholder like N/A." The mask is the diagnosis.
This leads to a broader conclusion: mask-based profiling can anchor a general framework for data quality control and remediation, applied as data flows through the reading pipeline. The approach has several useful properties that are worth setting out explicitly.
Allow Lists and Exclusion Lists
The simplest way to operationalise masks as error codes is through allow lists and exclusion lists.
An allow list defines the acceptable masks for a column. Any value whose mask does not appear in the allow list is flagged as an anomaly. For a UK postcode column, the allow list might contain:
A9 9AA
A99 9AA
A9A 9AA
AA9 9AA
AA99 9AA
AA9A 9AA
These six masks cover all valid UK postcode formats. Any value that produces a different mask — aaaa (lowercase text), 99999 (numeric), A/A (placeholder), or an empty string — is automatically flagged, and the mask tells you exactly what structural form the offending value takes.
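The allow-list check is a few lines of code. The sketch below assumes a minimal character-class mask function (digits → 9, uppercase → A, lowercase → a, everything else kept verbatim, i.e. HU grain); the function and variable names are illustrative, not part of the DQOR framework itself:

```python
# Minimal HU-grain mask: digits -> 9, uppercase -> A, lowercase -> a,
# all other characters (spaces, punctuation) kept as-is.
def mask(value: str) -> str:
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)

# The six valid UK postcode masks from the allow list above.
POSTCODE_ALLOW = {"A9 9AA", "A99 9AA", "A9A 9AA",
                  "AA9 9AA", "AA99 9AA", "AA9A 9AA"}

def check(value: str) -> tuple[bool, str]:
    """Return (is_valid, mask) -- the mask doubles as the error code."""
    m = mask(value)
    return m in POSTCODE_ALLOW, m

print(check("SW1A 1AA"))   # (True, 'AA9A 9AA')
print(check("99999"))      # (False, '99999')
print(check("N/A"))        # (False, 'A/A')
```

Note that the second element of the result is the diagnosis: the caller learns not just that a value failed, but what structural form it took.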
An exclusion list takes the opposite approach: it defines masks that are known to be problematic, and flags any value that matches. This is useful when the set of valid formats is large or open-ended (as with free-text name fields), but certain structural patterns are reliably indicative of errors:
9999 → numeric value in a text field
(empty) → empty string (zero-length value)
a → single lowercase character
aaaa://aaa.aaa → URL in a name field
This is not a theoretical exercise. In the UK Companies House profiling (see the Worked Example appendix), the RegAddress.PostCode field at LU grain produces just two dominant masks — A9 9A (88.3%, e.g. L23 0RG) and A9A 9A (7.3%, e.g. W1W 7LT) — which together cover all six standard UK postcode formats when expanded at HU grain. These two masks plus the empty value (4.4%) account for 99.96% of the data. An allow list of {A9 9A, A9A 9A, (empty)} at LU grain would instantly flag the remaining 0.04% — records containing missing spaces (GU478QN), trailing punctuation (BR7 5HF.), embedded semicolons (L;N9 6NE), and values that are not postcodes at all (BLOCK 3, 2L ONE). The allow list is three entries. The error detection is comprehensive.
In practice, allow lists are more useful for format-controlled fields (postcodes, phone numbers, dates, identifiers) where the set of valid patterns is finite and known. Exclusion lists are more useful for free-text fields where the valid patterns are diverse but certain structural types are reliably wrong.
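An exclusion-list check is the mirror image of the allow-list check, and operates on masks that have already been computed. A minimal sketch, using the illustrative exclusion entries listed above (the empty string is the mask of a zero-length value):

```python
# Masks known to indicate errors in a free-text name field
# (illustrative entries; "" is the mask of an empty string).
NAME_EXCLUDE = {"9999", "", "a", "aaaa://aaa.aaa"}

def flag_bad_masks(masks: list[str]) -> list[str]:
    """Return the observed masks that match the exclusion list."""
    return [m for m in masks if m in NAME_EXCLUDE]

observed = ["Aaaa Aaaaa", "9999", "", "Aaaa", "aaaa://aaa.aaa"]
print(flag_bad_masks(observed))  # ['9999', '', 'aaaa://aaa.aaa']
```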
Building Quality Gates
The combination of population analysis and mask-based error codes creates a natural quality gate for incoming data:
- Profile the column using mask-based profiling at the appropriate grain level.
- Compare each mask against the allow list (or exclusion list) for that column.
- Check population thresholds — is the proportion of "good" masks above the minimum acceptable level? Has a previously rare "bad" mask suddenly increased in frequency?
- Route errors by mask — different masks may require different handling. A placeholder (A/A) might be replaced with a null. An all-caps name (AAAA AAAAA) might be normalised to title case. A numeric value in a name field (99999) might be quarantined for manual review.
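The four steps above can be sketched as a single function. All allow-list entries, routing targets, and the threshold are hypothetical placeholders chosen for illustration:

```python
from collections import Counter

# Hypothetical routing table: bad mask -> treatment name.
ROUTES = {
    "A/A": "replace_with_null",
    "AAAA AAAAA": "title_case",
    "99999": "quarantine",
}
ALLOW = {"Aaaa Aaaaa", "Aaaa", "Aa Aaaa"}   # valid name masks (illustrative)
MIN_GOOD_FRACTION = 0.95                     # assumed population threshold

def quality_gate(masks: list[str]) -> dict:
    counts = Counter(masks)                                  # step 1: profile
    total = sum(counts.values())
    good = sum(n for m, n in counts.items() if m in ALLOW)   # step 2: compare
    routed = {m: ROUTES.get(m, "investigate")                # step 4: route
              for m in counts if m not in ALLOW}
    return {
        "good_fraction": good / total,
        "passes": good / total >= MIN_GOOD_FRACTION,         # step 3: threshold
        "routes": routed,
    }

report = quality_gate(["Aaaa Aaaaa"] * 96 + ["A/A"] * 3 + ["99999"])
print(report["passes"], report["routes"])
```

A mask that was never seen before falls through to the default "investigate" route, which is exactly the drift-detection behaviour described below: a new structural type is never silently accepted.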
The French lobbyist registry provides a concrete example of routing by mask. The director's role field (dirigeants.fonction) produces masks that reveal three casing conventions in use: Aa Aa for title case (Directeur Général, 92 records), Aa a for French grammatical case (Directeur général, 74 records), and A A for uppercase (DIRECTEUR GENERAL, 29 records). A quality gate on this field would not flag any of these as errors — they are all valid role descriptions. But it would route each casing variant to a normalisation function, ensuring that downstream analytics do not create three separate categories for what is semantically the same role. The mask is not just an error detector; it is a router. (See the Worked Example: Profiling the French Lobbyist Registry appendix.)
The quality gate can run automatically on every new batch of data, providing a continuous structural health check. When the profile of incoming data drifts — a new mask appears that was not seen before, or the population of a known-bad mask increases — the gate flags it for investigation.
This approach maps directly to the Data Quality Controls capability described in enterprise data operating models, where dataset registration, profiling for outliers, column-level validation, alerts and notifications, bad data quarantine, and DQ remediation rules are all core components. Mask-based profiling provides a single mechanism that addresses all of these capabilities, because the mask itself serves as the registration key, the outlier detector, the validation check, the alert trigger, the quarantine criterion, and the remediation lookup key — all from one pass over the data.
Masks as Provenance
There is a secondary benefit to treating masks as error codes that is easy to overlook: they provide provenance for quality decisions. When a downstream consumer asks "why was this record flagged?" or "why was this value changed?", the mask provides a clear, reproducible answer. The record was flagged because its mask was 99999 and the allow list for the name column does not include numeric masks. The value was changed because its mask was AAAA AAAAA and the treatment function for that mask is title-case normalisation.
This audit trail is built into the mechanism by construction. No additional logging or documentation is required — the mask is both the detection method and the explanation. In regulated environments where data lineage and transformation justification are compliance requirements, this property is particularly valuable.
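The audit record needs nothing beyond what the check already computed. A sketch, with hypothetical field names, of what a flagged record might carry:

```python
from dataclasses import dataclass

@dataclass
class QualityDecision:
    value: str    # the offending value
    mask: str     # the error code -- also the detection method
    action: str   # what was done to the record
    reason: str   # the explanation, derivable from mask + rule alone

# The mask is both the detection and the explanation.
decision = QualityDecision(
    value="99999",
    mask="99999",
    action="quarantine",
    reason="mask '99999' not in allow list for column 'name'",
)
print(decision.reason)
```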
Text-Encoded Numeric Ranges
A particularly instructive pattern occurs when numeric data is encoded as text ranges rather than as actual numbers. In the French lobbyist registry (HATVP), the expenditure field contains values like 50000 à 99999 euros and 10000 à 24999 euros. These are not numbers — they are text descriptions of numeric bands. The mask at HU grain is something like 99999 a 99999 aaaaa, which clearly reveals the structure: digits, then the French word "à", then more digits, then a unit label.
This is a mask-as-error-code in a subtle sense. The mask is not "wrong" — the data faithfully represents what was reported. But the mask tells you that this field cannot be aggregated numerically without transformation. You cannot sum these ranges, compute averages, or join them to numeric thresholds. The mask diagnoses the field as requiring a treatment function that either extracts the midpoint, maps the range to a numeric band code, or flags it for domain-specific interpretation.
This pattern generalises beyond French expenditure data. Survey responses ("18-24 years", "25-34 years"), salary bands ("£30,000-£39,999"), and classification ranges ("Category A-C") all encode numeric information as text. Mask-based profiling surfaces these immediately because the structural fingerprint — digits mixed with letters and delimiters — is visually distinct from either pure numeric or pure text fields. The mask doesn't just flag the anomaly; it tells you the exact encoding scheme being used.
The treatment for text-encoded ranges depends on the consumer. A statistical analysis team might extract numeric boundaries and compute midpoints. A reporting team might preserve the original text labels. A downstream database might map each range to an enumerated code. The mask identifies the pattern; the treatment is domain-specific — consistent with the DQOR principle of suggestions, never mandates.
From Detection to Treatment
The logical next step, once masks have been classified as "good" or "bad" for a given column, is to define what happens to the records that fall into each category. That is the subject of the next chapter: treatment functions and the data quality loop.