Worked Example: Profiling the French Lobbyist Registry (HATVP)
This appendix profiles a real French government transparency dataset — the lobbyist registry maintained by the Haute Autorité pour la transparence de la vie publique (HATVP). Where the Companies House example demonstrated mask-based profiling on tabular CSV data and the JMA earthquake example tackled nested JSON with mixed scripts, this dataset brings a different set of challenges: deeply nested JSON with French-language text, accented characters, text-encoded numeric ranges masquerading as quantitative fields, and casing inconsistencies rooted in French administrative conventions. The result is a worked example that shows how bytefreq profiling surfaces issues that neither schemas nor simple validation rules would catch.
The Dataset
The HATVP publishes a consolidated JSON file of all organisations registered in the French lobbyist registry, updated nightly and freely available under the Licence Ouverte (Etalab), France's standard open data licence. The file is available at hatvp.fr/agora/opendata/agora_repertoire_opendata.json and weighs in at approximately 116 MB.
Each record represents a registered lobbying organisation and contains its denomination, address, national identifier, directors, collaborators, clients, sector classifications, multi-year exercise declarations with nested activity reports, expenditure bands, revenue bands, and contact information. The nesting is substantial: a single organisation record can contain arrays of directors (each with name, title, and role), arrays of collaborators, arrays of clients, and multiple annual exercise declarations each containing their own nested activity structures.
For this example we sampled 405 records from the full file. After flattening the nested JSON (collapsing array indices so that dirigeants[0].nom and dirigeants[1].nom both become dirigeants.nom), these 405 records produced 81 unique field paths — a reflection of the structural depth of the data. Key paths and their value counts illustrate the one-to-many relationships: denomination yields 405 values (one per organisation), dirigeants.nom yields 760 (multiple directors per organisation), collaborateurs.nom yields 946, clients.denomination yields 1,130, and activites.listSecteursActivites.label yields 2,119 sector tags spread across the sample.
Running the Profile
Because this is nested JSON rather than flat tabular data, we use the same flatten-then-profile approach described in the JMA earthquake example. The JSON is first flattened into field-path/value pairs, collapsing array indices, and then profiled using bytefreq in LU (Low-grain Unicode) mode. The flattening preserves the hierarchical field names — exercices.publicationCourante.montantDepense rather than a generic column number — which makes the profile output immediately readable without needing to cross-reference a schema.
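The flattening step can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the tooling used to produce the profile; the function name flatten and the sample record (assembled from example values that appear later in this appendix) are assumptions made for the example.

```python
from typing import Any, Iterator

def flatten(node: Any, path: str = "") -> Iterator[tuple[str, Any]]:
    """Yield (field_path, value) pairs, dropping array indices from the path."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for child in node:
            # Every array element shares its parent's path, so
            # dirigeants[0].nom and dirigeants[1].nom both become dirigeants.nom.
            yield from flatten(child, path)
    else:
        yield path, node

# Hypothetical record built from example values, not a real registry entry.
record = {
    "denomination": "OTRE GIRONDE",
    "dirigeants": [
        {"nom": "DENIZOT", "prenom": "Carole"},
        {"nom": "LE LETTY", "prenom": "Marc-Antoine"},
    ],
}
pairs = list(flatten(record))
for field_path, value in pairs:
    print(field_path, "=", value)
```

One organisation yields one denomination pair but two dirigeants.nom pairs, which is why value counts for nested paths exceed the number of records.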
The LU grain is the right starting point here. It collapses consecutive characters of the same class (uppercase, lowercase, digit, punctuation) into single representative characters, giving us a compact set of structural masks for each field. Where we need finer discrimination — as we will see with the dirigeants.civilite field — we can drill into HU (High-grain Unicode) mode for specific fields.
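To make the collapsing rule concrete, here is a simplified approximation of LU masking in Python. It is not bytefreq's implementation. The character classes are inferred from the masks printed in this appendix: uppercase becomes A, lowercase becomes a, digits become 9, spaces, dots and hyphens are kept literally, and any other character becomes _. That last rule in particular is an assumption.

```python
def lu_mask(value: str) -> str:
    """Approximate a low-grain mask: classify each character, collapse runs."""
    out = []
    for ch in value:
        if ch.isupper():
            symbol = "A"
        elif ch.islower():
            symbol = "a"          # accented lowercase letters land here too
        elif ch.isdigit():
            symbol = "9"
        elif ch in " .-":
            symbol = ch           # assumption: these three are kept literally
        else:
            symbol = "_"          # assumption: all other punctuation
        if not out or out[-1] != symbol:
            out.append(symbol)    # collapse consecutive repeats
    return "".join(out)

for sample in ["MCDONALD'S FRANCE", "EC1R4QB", "Neuilly-sur-Seine", "1.0"]:
    print(f"{lu_mask(sample):10} {sample}")
```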
Structure Discovery: Field Population Analysis
Before examining individual field values, we profile the field paths themselves. For each dot-notation path (with array indices collapsed), we count how many of the 405 lobbyist records contain that path and express it as a percentage. This is the structural discovery step — it tells us the shape of the data before we look at what is in it.
Count   % Populated   Field Path
-----   -----------   ----------
  405        100.0%   denomination
  405        100.0%   typeIdentifiantNational
  405        100.0%   identifiantNational
  405        100.0%   codePostal
  405        100.0%   ville
  405        100.0%   pays
  405        100.0%   dateCreation
  405        100.0%   datePremierePublication
  405        100.0%   categorieOrganisation.code
  405        100.0%   categorieOrganisation.label
  405        100.0%   categorieOrganisation.categorie
  405        100.0%   activites.listSecteursActivites.code
  405        100.0%   activites.listSecteursActivites.label
  405        100.0%   activites.listNiveauIntervention.code
  366         90.4%   dirigeants.nom
  366         90.4%   dirigeants.prenom
  366         90.4%   dirigeants.civilite
  366         90.4%   dirigeants.fonction
  366         90.4%   collaborateurs.nom
  366         90.4%   collaborateurs.civilite
  271         66.9%   collaborateurs.fonction
  403         99.5%   exercices.publicationCourante.dateDebut
  403         99.5%   exercices.publicationCourante.dateFin
  315         77.8%   exercices.publicationCourante.nombreSalaries
  315         77.8%   exercices.publicationCourante.montantDepense
  163         40.2%   exercices.publicationCourante.chiffreAffaire
  304         75.1%   lienSiteWeb
  306         75.6%   adresse
  196         48.4%   lienPageLinkedin
  169         41.7%   emailDeContact
  140         34.6%   lienPageTwitter
  132         32.6%   telephoneDeContact
  116         28.6%   lienPageFacebook
   88         21.7%   clients.denomination
  205         50.6%   nomUsage
  301         74.3%   dateDernierePublicationActivite
   21          5.2%   lienListeTiers
   44         10.9%   nomUsageHatvp
   51         12.6%   sigleHatvp
   39          9.6%   dateCessation
   39          9.6%   motifDesinscription
    2          0.5%   ancienNomHatvp
  151         37.3%   exercices.publicationCourante.activites...actionsRepresentationInteret.observation
The registration core — name, identifier, postal code, city, country, dates, category, sectors — is 100% populated across all 405 records. These are mandatory registration fields, the skeleton that every lobbyist record shares. When we see a block of fields all at 100%, it confirms that the registration system enforces these as required inputs, and it gives us a stable foundation to work from when we start profiling values.
Director and collaborator names are 90.4% populated, but collaborateurs.fonction drops to 66.9%. Because these percentages count records rather than individual array elements, the gap means that 95 of the 366 organisations listing collaborators (about a quarter) declare no role for any of them. This is a data completeness issue hiding inside the nesting: the person exists in the array, but their function field is missing. A flat schema would show this as a null column. In nested JSON, the key simply is not present in some array elements. The profiler treats both the same way, which is why the flatten-then-profile approach works so well for this kind of discovery.
Financial data tells a story of progressive disclosure. exercices.publicationCourante.dateDebut is 99.5% populated (nearly universal), but nombreSalaries and montantDepense drop to 77.8%, and chiffreAffaire (revenue) falls to just 40.2%. Almost every organisation declares an exercise period, but fewer supply each successive level of financial detail. Revenue is the least disclosed field: fewer than half of the records include it. The population percentages quantify that gradient, from near-universal to minority disclosure, in a single column of numbers.
Contact and social media fields follow a clear hierarchy: website (75.1%) > LinkedIn (48.4%) > email (41.7%) > Twitter (34.6%) > phone (32.6%) > Facebook (28.6%). The ordering plausibly reflects institutional communication preferences: a website is the most common channel, LinkedIn is the leading social platform, and Facebook is the least used. The field population percentages tell you this without reading a single value.
At the bottom of the table, operational fields appear: dateCessation and motifDesinscription (9.6%) mark organisations that have de-registered from the lobbying register, ancienNomHatvp (0.5%) records name changes — just 2 out of 405. These sparse fields are invisible in a casual inspection of the data but the population analysis surfaces them immediately. They are the kind of fields that cause edge-case bugs in downstream processing because developers never encounter them during testing.
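The population analysis above can be sketched as follows. This is an illustrative reconstruction rather than the code used for the table: the function names are hypothetical, and treating None and empty strings as "not populated" is an assumption the source does not state. Each path is counted at most once per record, which is what makes the percentages comparable across nested and top-level fields.

```python
from collections import Counter

def paths_in(node, path=""):
    """Return the set of populated dot-notation paths in one record."""
    found = set()
    if isinstance(node, dict):
        for key, child in node.items():
            found |= paths_in(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for child in node:
            found |= paths_in(child, path)      # array indices collapsed
    elif node is not None and node != "":       # assumption: these count as empty
        found.add(path)
    return found

def population(records):
    """Map each path to (records containing it, percentage of all records)."""
    counts = Counter()
    for rec in records:
        counts.update(paths_in(rec))            # a set, so once per record
    total = len(records)
    return {p: (c, round(100 * c / total, 1)) for p, c in counts.items()}

# Three made-up records: one collaborator has a role, two do not.
records = [
    {"denomination": "A", "collaborateurs": [{"nom": "X", "fonction": "Y"}, {"nom": "Z"}]},
    {"denomination": "B", "collaborateurs": [{"nom": "W"}]},
    {"denomination": "C"},
]
pop = population(records)
for path, (count, pct) in sorted(pop.items(), key=lambda kv: -kv[1][0]):
    print(f"{count:>5}  {pct:>5}%  {path}")
```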
Field-by-Field Analysis
Organisation Name (denomination)
denomination
Mask           Count  Example
A A               65  OTRE GIRONDE
A                 62  DOMISERVE
A A A             55  REUSABLE PACKAGING EUROPE
A A A A           31  BNP PARIBAS PERSONAL FINANCE
A A A A A A       24  OTRE DES PAYS DE LA LOIRE
A A A A A         23  UNION DES ENTREPRISES CORSES
A A A A A A A     12  NESTLE EXCELLENCE SAS PRODUITS PETI
A_A A              3  MCDONALD'S FRANCE
A _                2  TUKAZZA !
A 9                2  FNSEA 17
The dominant masks are exactly what we would expect for organisation names in uppercase: one to seven words separated by spaces, all collapsing to A tokens. The interesting patterns are at the bottom. A_A A (3 records, e.g. MCDONALD'S FRANCE) — the apostrophe is a punctuation character, creating a distinct structural mask. A _ (2 records, e.g. TUKAZZA !) — an exclamation mark, which bytefreq maps to the punctuation class. And A 9 (2 records, e.g. FNSEA 17) — a numeric suffix, likely a regional chapter number.
None of these are errors per se — they are legitimate organisation names. But the masks tell us immediately that any downstream process relying on "organisation names are alphabetic words separated by spaces" will need to account for apostrophes, punctuation marks, and trailing numbers. The mask frequency table is the specification that the data never came with.
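A mask frequency table of this kind is straightforward to build once a masking function exists. The sketch below uses a compact, simplified stand-in for LU masking (an approximation inferred from the masks shown here, not bytefreq itself) and keeps the first value seen for each mask as its example. The name mask_table is hypothetical.

```python
import re
from collections import Counter

def lu_mask(value):
    """Simplified low-grain mask: classify characters, then collapse runs."""
    classes = "".join(
        "A" if c.isupper() else
        "a" if c.islower() else
        "9" if c.isdigit() else
        c if c in " .-" else "_"      # assumption: other punctuation shown as _
        for c in value
    )
    return re.sub(r"(.)\1+", r"\1", classes)

def mask_table(values):
    """Return (mask, count, first example) rows, most frequent mask first."""
    counts, examples = Counter(), {}
    for v in values:
        m = lu_mask(v)
        counts[m] += 1
        examples.setdefault(m, v)
    return [(m, n, examples[m]) for m, n in counts.most_common()]

names = ["OTRE GIRONDE", "DOMISERVE", "MCDONALD'S FRANCE", "TUKAZZA !",
         "FNSEA 17", "BNP PARIBAS PERSONAL FINANCE", "REUSABLE PACKAGING EUROPE"]
table = mask_table(names)
for mask, count, example in table:
    print(f"{mask:10} {count:>3}  {example}")
```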
Address (adresse)
adresse
Mask       Count  Example
9 A A A       76  169 RUE D'ANJOU
9 A A         50  60 BOULEVARD VOLTAIRE
A A           30  ZONE INDUSTRIELLE
A             29  AMYNOS
A A A         27  ASSOCIATION DES CONSOMMATEURS
9 A A A A     21  49 RUE EVARISTE GALOIS
9 A           17  75 BDVOLTAIRE
A A A A       17  CITE DE L INDUSTRIE
A 9           13  CS 70044
9 a Aa         7  79 rue Perrier
9 a Aa Aa      6  2 avenue Tony Garnier
A 9 A A A      2  BP 123 CHERBOURG EN COTENTIN
306 of 405 records have an address; 99 are empty (a 24.4% null rate). The dominant patterns start with a street number followed by uppercase street names (9 A A A: 169 RUE D'ANJOU), which is the standard French address format. But several things stand out.
First, mixed casing. The majority of addresses are in uppercase (169 RUE D'ANJOU, 60 BOULEVARD VOLTAIRE), which is the traditional French postal convention for addresses. But 13 records use title case or mixed case (79 rue Perrier, 2 avenue Tony Garnier). The masks 9 a Aa and 9 a Aa Aa are structurally different from 9 A A precisely because of this casing inconsistency — the profiler is separating records that a human might gloss over as "same thing, different capitalisation."
Second, non-address content. The masks A (29 records, e.g. AMYNOS) and A A A (27 records, e.g. ASSOCIATION DES CONSOMMATEURS) contain organisation names rather than street addresses. The address field is being used to store building names or organisation references.
Third, postal box codes. A 9 (13 records, e.g. CS 70044) represents CEDEX sorting codes — a French postal routing system. A 9 A A A (2 records, e.g. BP 123 CHERBOURG EN COTENTIN) combines a boîte postale (PO box) number with a city name, packing two logical fields into one.
Postal Code (codePostal)
codePostal
Mask   Count  Example
9        401  75019
 9         2  1000
A9A9A      1  EC1R4QB
Three masks, and each tells a different story. The dominant 9 (401 records, 99.0%) represents standard five-digit French postal codes like 75019. Clean, consistent, no issues.
The second mask (2 records, e.g. 1000) is a 9 preceded by a leading space, which is easy to miss in the table. These are four-digit codes padded with a space, likely Belgian postcodes (Belgium uses four-digit postal codes). Two organisations, probably Belgian, are registered in the French lobbyist registry, and the source system padded their codes with a leading space rather than handling the shorter format.
And then there is A9A9A (1 record: EC1R4QB). That is a UK postcode — an alphanumeric format that is structurally unmistakable in a field of French five-digit codes. A British organisation registered in the French lobbyist registry, and the postal code field accepted whatever was submitted. The mask catches it instantly because the structural pattern is completely unlike the surrounding data.
City (ville)
ville
Mask      Count  Example
A           211  BEAUNE
A 9          52  PARIS 16
A A          29  NANTERRE CEDEX
A-A          24  LAMBALLE-ARMOR
Aa           21  Paris
A-A-A        15  (various hyphenated)
A A A        10  LE BOURGET CEDEX
A A 9        10  PARIS CEDEX 07
Aa-a-Aa       3  Neuilly-sur-Seine
Aa a Aa       2  Neuilly sur Seine
A 9A          2  LYON 3EME
a             1  avignon
A - A A       1  COURBEVOIE - LA DEFENSE
A A_ A        1  VILLEBON S/ YVETTE
A A Aa 9      1  LE MANS Cedex 2
This field is a catalogue of French address conventions and casing inconsistency, all visible through the masks.
Casing: A (211 records, BEAUNE) is uppercase. Aa (21 records, Paris) is title case. a (1 record, avignon) is entirely lowercase. Three different casing conventions for the same type of data, in the same field, in the same dataset.
CEDEX variations: A A (29 records, NANTERRE CEDEX), A A A (10 records, LE BOURGET CEDEX), A A 9 (10 records, PARIS CEDEX 07), A A Aa 9 (1 record, LE MANS Cedex 2). The postal routing suffix CEDEX appears in uppercase (CEDEX) and in title case (Cedex), and the number that follows it is sometimes present, sometimes not.
Hyphenation: A-A (24 records, LAMBALLE-ARMOR) and A-A-A (15 records) are hyphenated town names in uppercase. Aa-a-Aa (3 records, Neuilly-sur-Seine) is hyphenated in title case. Aa a Aa (2 records, Neuilly sur Seine) is the same town name without hyphens. The profiler reveals that Neuilly-sur-Seine and Neuilly sur Seine coexist in the data — same place, different punctuation, different masks.
And then the distinctly French conventions: A A_ A (1 record, VILLEBON S/ YVETTE) uses S/ as an abbreviation for "sur" (on/upon), a convention specific to French administrative addressing. A 9A (2 records, LYON 3EME) uses the arrondissement suffix 3EME (3rd) — the ordinal marker EME being the French equivalent of English "rd" or "th."
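Once the masks have shown that casing and hyphenation vary, the duplicated spellings can be listed by grouping values on a normalised key. The sketch below is one possible approach, with a hypothetical function name; it folds case and treats hyphens as spaces, which is an assumption that suits town names but would not suit every field.

```python
from collections import defaultdict

def variant_groups(values):
    """Group values that differ only by case, hyphens, or extra spaces."""
    groups = defaultdict(set)
    for v in values:
        key = " ".join(v.casefold().replace("-", " ").split())
        groups[key].add(v)
    # Keep only keys that were written in more than one way.
    return {k: sorted(vs) for k, vs in groups.items() if len(vs) > 1}

towns = ["Neuilly-sur-Seine", "Neuilly sur Seine", "BEAUNE",
         "Paris", "PARIS 16", "avignon"]
variants = variant_groups(towns)
for key, spellings in variants.items():
    print(key, "->", spellings)
```

Suffixes such as CEDEX or an arrondissement number are left in the key on purpose: whether PARIS 16 should merge with Paris is a business decision, not a formatting one.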
Country (pays)
pays
Mask  Count  Example
A       375  FRANCE
Aa       22  France
A-A       1  ROYAUME-UNI
a         1  france
Four masks, essentially two country values. FRANCE appears in three casing variants: uppercase (375), title case (22), and lowercase (1). The fourth mask, A-A, is ROYAUME-UNI — the French name for the United Kingdom, hyphenated as is standard in French. This is the same British organisation whose UK postcode we found in the codePostal field.
The real issue here is not the lone UK record — it is the casing inconsistency. 375 records say FRANCE, 22 say France, 1 says france. These are not three different countries. A downstream join or group-by on this field will produce three separate buckets for the same value unless casing is normalised first. The profiler makes this immediately obvious because each casing variant produces a different mask.
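The effect on a group-by is easy to demonstrate. The snippet below rebuilds the pays column from the counts in the table above and compares the bucket counts before and after a simple normalisation. Upper-casing is only one possible target convention, chosen here because it is already the majority form.

```python
from collections import Counter

# Reconstructed from the mask table above: 375 + 22 + 1 + 1 values.
raw = ["FRANCE"] * 375 + ["France"] * 22 + ["france"] + ["ROYAUME-UNI"]

before = Counter(raw)
after = Counter(v.strip().upper() for v in raw)

print(len(before), "buckets before:", dict(before))
print(len(after), "buckets after: ", dict(after))
```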
Organisation Category (categorieOrganisation.label)
categorieOrganisation.label
Mask                          Count  Example
Aa a a a _a a a a_a a a a a_    128  Société commerciale et civile (autre que cabinet d'avocats et société de conseil)
Aa                               89  Association
Aa a                             83  Fédération professionnelle
Aa a a a                         58  Organisation non gouvernementale
Aa a a                           40  Cabinet de conseil
Aa a a _a a_                      2  Groupe de recherche (think tank)
Aa a a a a a a a                  2  Établissement public ou organisme consultatif
French category labels with accented characters (Société, Fédération, Établissement), apostrophes (d'avocats), and parenthetical qualifiers ((autre que cabinet d'avocats et société de conseil), (think tank)). This is a controlled vocabulary — seven distinct values with consistent formatting. The masks here are doing their job: confirming that the reference data is clean and internally consistent.
Note that LU mode treats accented characters (é, è, ê) the same as their unaccented counterparts — they are all lowercase letters, collapsing to a. This is the correct behaviour for structural profiling: we care about the shape of the data, not the specific diacritics.
Directors: Title (dirigeants.civilite)
dirigeants.civilite
Mask  Count  Example
A       760  M
A single mask: A. Every value collapses to uppercase alpha. But this field contains two distinct values — M (Monsieur) and MME (Madame) — which LU mode cannot distinguish because both are uppercase alphabetic strings. The mask A covers both a one-character and a three-character value.
This is a case where you would drill into HU (High-grain Unicode) mode, which preserves character count, to separate M from MME and get the gender distribution. At LU grain, the field looks perfectly uniform. At HU grain, the two populations would separate cleanly. It is a useful reminder that profiling grain is a choice, and the right grain depends on the question you are asking.
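The difference between the two grains can be illustrated with a simplified high-grain mask that maps each character to its class without collapsing runs. As with the LU sketches, this approximates the idea rather than reproducing bytefreq's HU mode, and the sample values are illustrative, not the real distribution.

```python
from collections import Counter

def hu_mask(value):
    """Simplified high-grain mask: one class symbol per character, no collapsing."""
    return "".join(
        "A" if c.isupper() else
        "a" if c.islower() else
        "9" if c.isdigit() else
        c if c in " .-" else "_"      # assumption: other punctuation shown as _
        for c in value
    )

civilites = ["M", "MME", "M", "MME", "M"]   # made-up sample
by_mask = Counter(hu_mask(v) for v in civilites)
print(by_mask)
```

Because length is preserved, M and MME now fall into separate masks (A and AAA) and can be counted independently.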
Directors: Surname (dirigeants.nom)
dirigeants.nom
Mask     Count  Example
A          684  DENIZOT
A A         43  LE LETTY
A-A         22  VESQUE-JEANCARD
A_A          4  N'GOADMY
A A A        3  DUBARRY DE LASSALLE
A A A A      2  VAN LIDTH DE JEUDE
A A_A        1  TEYSSIER D'ORFEUIL
French surname patterns, each structurally distinct and all legitimate. Single surnames (A, 684 records) dominate. Compound surnames with particles appear in several forms: space-separated (A A: LE LETTY, A A A: DUBARRY DE LASSALLE, A A A A: VAN LIDTH DE JEUDE), hyphenated (A-A: VESQUE-JEANCARD), and apostrophe-linked (A_A: N'GOADMY, A A_A: TEYSSIER D'ORFEUIL).
The apostrophe in French surnames (as in D'ORFEUIL, N'GOADMY) is structurally significant — it creates a different mask from a space-separated particle. Any normalisation logic that strips apostrophes or treats them as word boundaries will mangle these names. The mask frequency table is essentially a specification for a surname parser: here are the seven structural patterns you need to handle.
Directors: First Name (dirigeants.prenom)
dirigeants.prenom
Mask   Count  Example
Aa       697  Carole
Aa-Aa     50  Marc-Antoine
Aa Aa     11  Marie Christine
Aa_a       1  Ro!and
The first three masks are expected: simple first names in title case (Aa, 697 records), hyphenated compound first names (Aa-Aa, 50 records — a very common French pattern, as in Marc-Antoine, Jean-Pierre), and space-separated compound first names (Aa Aa, 11 records — Marie Christine, where the hyphen was omitted).
The fourth mask is the standout of the entire dataset: Aa_a (1 record: Ro!and), with an exclamation mark where the letter l should be. The intended name is almost certainly Roland, and a data entry error, perhaps a keying slip or a lookalike substitution, has replaced the lowercase l with !. The mask catches it instantly because ! is a punctuation character, not a letter, so the structural pattern Aa_a (uppercase, lowercase, punctuation, lowercase) is fundamentally different from the expected Aa (uppercase, lowercase). One character wrong, and the mask is completely different.
This single record is worth the entire profiling exercise as a demonstration. No schema would catch it — the field is a valid string. No length check would catch it — Ro!and is six characters, perfectly reasonable for a first name. No lookup table would catch it unless you had an exhaustive dictionary of every possible French first name. But the structural profile catches it immediately, because the shape of the data is wrong. That is the core proposition of mask-based profiling, illustrated in a single record.
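Finding records like this one can be automated by flagging values whose mask is rare within its field. The sketch below rebuilds the prenom column from the counts in the table above and uses a hypothetical threshold, max_share, of 1%. The masking function is the same simplified approximation used elsewhere in this appendix, not bytefreq's own.

```python
import re
from collections import Counter

def lu_mask(value):
    """Simplified low-grain mask: classify characters, then collapse runs."""
    classes = "".join(
        "A" if c.isupper() else
        "a" if c.islower() else
        "9" if c.isdigit() else
        c if c in " .-" else "_"      # assumption: other punctuation shown as _
        for c in value
    )
    return re.sub(r"(.)\1+", r"\1", classes)

def rare_masks(values, max_share=0.01):
    """Return (value, mask) pairs whose mask covers at most max_share of values."""
    masks = [lu_mask(v) for v in values]
    counts = Counter(masks)
    total = len(values)
    return [(v, m) for v, m in zip(values, masks) if counts[m] / total <= max_share]

prenoms = (["Carole"] * 697 + ["Marc-Antoine"] * 50
           + ["Marie Christine"] * 11 + ["Ro!and"])
flagged = rare_masks(prenoms)
print(flagged)
```

The threshold is a judgment call: set it too high and legitimate minority patterns such as Marie Christine are flagged as well, so the output is a review list rather than a list of errors.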
Directors: Role (dirigeants.fonction)
dirigeants.fonction
Mask      Count  Example
Aa          278  Secrétaire
Aa Aa        92  Directeur Général
Aa a         74  Directeur général
A            44  PRESIDENT
A A          29  DIRECTEUR GENERAL
Aa a Aa      20  Président du Conseil
Aa-Aa        20  Vice-Président
Aa a a       19  Directeur exécutif
Aa Aa Aa     18  Directeur Général Adjoint
Aa-a         10  Vice-président
a             6  président
Aa-Aa Aa      6  Vice-Président Exécutif
A-A           4  CO-PRÉSIDENT
9a Aa-Aa      4  2ème Vice-Président
Three casing conventions dominate this single field. Title case with every word capitalised (Directeur Général, mask Aa Aa, 92 records). French sentence case, where only the first word is capitalised (Directeur général, mask Aa a, 74 records). And all-caps (PRESIDENT, mask A, 44 records; DIRECTEUR GENERAL, mask A A, 29 records).
The mask pair Aa-Aa (20 records, Vice-Président) versus Aa-a (10 records, Vice-président) is particularly revealing: the same role, with the only difference being whether the element after the hyphen is capitalised. The profiler separates them because Aa-Aa and Aa-a are structurally different, which suggests that different data entry operators or different source systems applied different capitalisation rules.
The 9a Aa-Aa mask (4 records, 2ème Vice-Président) captures a digit followed by the French ordinal suffix ème (equivalent to English "nd" or "th"). And the a mask (6 records, président) reveals a fourth convention: entries entirely in lowercase, with no initial capital at all.
A treatment function for this field would need to normalise casing (choosing one convention), handle hyphenated roles, and decide what to do with ordinal prefixes. The mask frequency table tells you exactly what rules to write.
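As a starting point, such a treatment function might look like the sketch below. It assumes French sentence case as the target convention (one of several reasonable choices) and leaves ordinal prefixes such as 2ème untouched. The function name is hypothetical. Note that it cannot restore accents that were dropped in all-caps entries.

```python
def normalise_role(role: str) -> str:
    """Normalise a role label to sentence case: first letter upper, rest lower."""
    text = " ".join(role.split()).lower()       # also squeezes repeated spaces
    if text and text[0].isalpha():
        text = text[0].upper() + text[1:]       # skip when a digit leads (2ème ...)
    return text

for raw in ["Directeur Général", "DIRECTEUR GENERAL", "Vice-Président",
            "président", "2ème Vice-Président"]:
    print(f"{raw:22} -> {normalise_role(raw)}")
```

DIRECTEUR GENERAL becomes Directeur general, without the accent, so a full treatment would still need a small lookup of known role names to merge it with Directeur général.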
Email (emailDeContact)
emailDeContact
Mask     Count  Example
a_a.a       66  contact@cdcf.com
a.a_a.a     23  jean.dupont@example.fr
a_a-a.a     23  contact@france-industrie.org
a_a.a.a      8  info@cabinet.avocat.fr
a_a9.a       6  contact@euro4t.fr
a-9_a.a      1  udtr-12@otre.fr
169 of 405 records have an email address; 236 are empty (58.3% null rate). The masks show the structural variation in email formats. In the masks shown here, the @ symbol appears as _, the same placeholder used for apostrophes and exclamation marks in earlier fields, while dots and hyphens are kept literally. The dominant pattern a_a.a (66 records) represents the simplest form: local@domain.tld.
Variations include dots in the local part (a.a_a.a: jean.dupont@example.fr), hyphens in the domain (a_a-a.a: contact@france-industrie.org), multi-level domains (a_a.a.a: info@cabinet.avocat.fr), numbers in the domain (a_a9.a: contact@euro4t.fr), and numbers with hyphens in the local part (a-9_a.a: udtr-12@otre.fr).
No structural errors here — the patterns all represent valid email formats. The 58.3% null rate is the main finding: more than half of registered lobbying organisations have not provided a contact email.
Expenditure (exercices.publicationCourante.montantDepense)
exercices.publicationCourante.montantDepense
Mask                     Count  Example
_ _ 9 9 a a _ 9 9 a        580  >= 75 000 euros et < 100 000 euros
_ 9 9 a                    455  < 10 000 euros
_ _ 9 9 9 a a _ 9 9 9 a      8  >= 3 250 000 euros et < 5 000 000 euros
_ _ 9 9 a a _ 9 9 9 a        2  >= 900 000 euros et < 1 000 000 euros
_ _ 9 9 9 a                  1  >= 10 000 000 euros
This is one of the most instructive fields in the entire dataset. The expenditure column does not contain numbers. It contains French-language text descriptions of expenditure bands: >= 75 000 euros et < 100 000 euros — "greater than or equal to 75,000 euros and less than 100,000 euros."
A schema will tell you this field is a string. A null check will tell you it is populated. A length check will tell you nothing useful. But the mask tells you immediately that this is not a numeric field — the presence of a (lowercase alpha) characters in the mask means there are words mixed in with the numbers. You cannot sum this column, you cannot compute averages, you cannot do arithmetic of any kind without first parsing the range text.
The formatting follows French conventions: spaces as thousand separators (75 000, not 75,000), euros as the currency word (not a symbol), and et (French for "and") as the conjunction between the lower and upper bounds. The five masks are structural shapes rather than individual bands: bands whose bounds have the same digit grouping share a mask, so one mask may cover several bands. Together they span the range from < 10 000 euros to >= 10 000 000 euros.
This pattern — encoding quantitative information as text ranges — is not uncommon in government datasets where the exact figure is considered sensitive but the band is public. The profiler reveals it immediately because the structural pattern of a text range is fundamentally different from the structural pattern of a number. A column of actual euro amounts would produce masks like 9 or 9.9 — not _ _ 9 9 a a _ 9 9 a.
Revenue Band (exercices.publicationCourante.chiffreAffaire)
exercices.publicationCourante.chiffreAffaire
Mask                   Count  Example
_ _ 9 9 9 a              225  >= 1 000 000 euros
_ 9 9 a                  101  < 100 000 euros
_ _ 9 9 a a _ 9 9 a       65  >= 100 000 euros et < 500 000 euros
_ _ 9 9 a a _ 9 9 9 a     41  >= 500 000 euros et < 1 000 000 euros
The same text-range pattern as expenditure, this time with four masks rather than five and an open-ended top band (>= 1 000 000 euros). The same French formatting conventions apply: space thousands, text currency, et conjunction.
The consistency between this field and montantDepense suggests a systematic encoding choice by the HATVP, not a one-off formatting quirk. Both financial fields use the same text-range approach, and both would need the same parsing treatment to extract usable numeric bounds.
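A parser for these band descriptions could be sketched as below. It assumes only the shapes visible in the examples above (an optional >= bound, an optional < bound, the word euros, and whitespace as the thousands separator) and returns a half-open interval. The names BOUND and parse_band are hypothetical, and anything that does not match raises an error rather than being guessed at.

```python
import re

# An operator, then digits grouped by whitespace, then the word "euros".
BOUND = re.compile(r"(>\s*=|<)\s*(\d[\d\s]*?)\s*euros")

def parse_band(text):
    """Return (lower, upper) in euros; None marks an open-ended side."""
    lower = upper = None
    for op, digits in BOUND.findall(text):
        amount = int(re.sub(r"\s", "", digits))   # "75 000" -> 75000
        if op == "<":
            upper = amount                        # exclusive upper bound
        else:
            lower = amount                        # inclusive lower bound
    if lower is None and upper is None:
        raise ValueError(f"unrecognised band: {text!r}")
    return lower, upper

for band in [">= 75 000 euros et < 100 000 euros",
             "< 10 000 euros",
             ">= 10 000 000 euros"]:
    print(parse_band(band), "<-", band)
```

Keeping both bounds, rather than collapsing each band to a midpoint, preserves what the registry actually publishes: the true figure is only known to lie somewhere in the interval.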
Employee Count (exercices.publicationCourante.nombreSalaries)
exercices.publicationCourante.nombreSalaries
Mask  Count  Example
9.9   1,046  1.0
A single mask: 9.9. Every value is a number with a decimal point, and the examples all end in a zero: 1.0, 25.0, 350.0. These are integer counts that have been serialised as floating-point numbers. Employee counts are whole numbers by nature, so somewhere between the source system and the published JSON the values were presumably converted to floats, and the output faithfully records 1.0 instead of 1.
This is a common issue with JSON produced by systems that handle numbers loosely. Python's json.dumps, for example, writes 1.0 for any value held as a float, and Java serialisers that map Number objects to double behave similarly. The profiler catches it because the .0 suffix creates a structural pattern (9.9) that is different from what we would expect for integer counts (9).
The treatment is straightforward: parse as float, check that the fractional part is zero, then cast to integer. But you need to know the issue exists before you can treat it, and the mask tells you on the first profiling run.
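A minimal sketch of that treatment, with a hypothetical function name, might be:

```python
def to_int_count(raw):
    """Convert a float-serialised count such as "25.0" or 25.0 to an int."""
    value = float(raw)
    if not value.is_integer():          # also rejects NaN and infinities
        raise ValueError(f"not a whole number: {raw!r}")
    return int(value)

print(to_int_count("1.0"), to_int_count(350.0))
```

Raising on a non-zero fractional part, rather than rounding silently, keeps the validation step explicit: a value such as 2.5 would point to a different upstream problem and should not be hidden.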
Website (lienSiteWeb)
lienSiteWeb
Mask        Count  Example
a_a.a.a_       80  https://www.example.com/fr
a_a.a.a        51  https://www.example.com
a_a.a_         26  https://lfde.com/
a_a.a-a.a_     23  https://www.france-industrie.org/
a_a.a          12  https://lfde.com
a_a-a.a_       12  http://france-biotech.fr/
a_a9.a_         6  http://cci47.fr/
304 of 405 records have a website; 101 are empty. The masks capture several URL structure variations: with and without www prefix, with and without trailing slash, hyphens in domain names, numbers in domain names, and path suffixes (e.g. /fr for French-language landing pages).
One thing the LU masks cannot show is the scheme. Both http and https collapse to a single a token, so a mask such as a_a-a.a_ (12 records, e.g. http://france-biotech.fr/) may mix the two. The examples show that some organisations still publish plain http:// links (without TLS). That is not a data quality issue per se, but a value-level check on the scheme would separate them cleanly and could feed a notification to affected organisations.
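The URL scheme is a property of the value rather than of its low-grain shape, so counting http against https is best done on the parsed URL. A short sketch using the standard library, with the example URLs from the table above and a hypothetical function name:

```python
from collections import Counter
from urllib.parse import urlsplit

def scheme_counts(urls):
    """Count URL schemes; values without a scheme are grouped under '(none)'."""
    return Counter(urlsplit(u.strip()).scheme or "(none)" for u in urls)

urls = ["https://www.example.com", "https://lfde.com/",
        "http://france-biotech.fr/", "http://cci47.fr/"]
schemes = scheme_counts(urls)
print(schemes)
```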
Dates (exercices.publicationCourante.dateDebut)
exercices.publicationCourante.dateDebut
Mask   Count  Example
9-9-9  1,953  01-04-2025
1,953 of 1,954 values share the same mask: 9-9-9, representing the DD-MM-YYYY format with dashes. One value is presumably empty or structurally different, a single anomaly in nearly two thousand records. This is a well-controlled field with consistent formatting: the day-first order follows French convention, and the dash separator is applied throughout.
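The mask confirms the shape of the dates but not their validity, since 99-99-9999 would produce the same 9-9-9 mask. A strict parse closes that gap. The sketch below assumes the DD-MM-YYYY reading described above; the function name is hypothetical.

```python
from datetime import date, datetime

def parse_date(text: str) -> date:
    """Parse a DD-MM-YYYY string; raises ValueError for anything else."""
    return datetime.strptime(text.strip(), "%d-%m-%Y").date()

print(parse_date("01-04-2025").isoformat())
```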
National Identifier (identifiantNational)
identifiantNational
Mask  Count  Example
9       371  834715807
A9       33  H810503325
Two masks, two distinct identifier systems. 9 (371 records) represents SIREN numbers, the nine-digit identifiers assigned by INSEE (the national statistics office) to businesses and other registered legal units. A9 (33 records) represents RNA numbers, identifiers from the Répertoire National des Associations, France's national register of non-profit associations. RNA numbers have a letter prefix (typically W) followed by digits.
The mask separates the two identifier systems instantly, without needing a lookup table: a single letter at the start of the identifier is enough to tell them apart, and the profiler surfaces it automatically. Interpreting the split still takes a little domain knowledge. The sample holds 89 Association labels but only 33 letter-prefixed identifiers, so the mask distinguishes identifier types rather than legal forms.
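The two shapes can be turned into a simple classifier. The sketch below goes one step beyond the mask for the numeric case by applying the Luhn checksum, which SIREN numbers are designed to satisfy. The letter-prefixed branch checks shape only, and the function names and category labels are hypothetical.

```python
import re

def luhn_valid(digits: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def classify_identifier(value: str) -> str:
    v = value.strip()
    if re.fullmatch(r"\d{9}", v):
        return "SIREN" if luhn_valid(v) else "nine digits, checksum fails"
    if re.fullmatch(r"[A-Z]\d+", v):
        return "letter-prefixed (RNA-style)"   # shape only, no checksum applied
    return "other"

for ident in ["834715807", "H810503325"]:
    print(ident, "->", classify_identifier(ident))
```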
Summary of Findings
Issues discovered through mask-based profiling of 405 HATVP lobbyist registry records:
Text-encoded numeric ranges:
- montantDepense (expenditure) and chiffreAffaire (revenue) store French-language band descriptions, not numbers → Treatment: parse range text to extract numeric bounds
- Euro formatting uses French conventions: space thousands (75 000), text currency (euros), French conjunction (et) → any parser must handle these
Data entry errors:
- Ro!and in dirigeants.prenom: exclamation mark substituted for lowercase l → Treatment: manual correction to Roland
Casing inconsistency:
- pays: three casings of FRANCE / France / france → Treatment: normalise to single form
- ville: uppercase (BEAUNE), title case (Paris), lowercase (avignon) → Treatment: normalise casing
- dirigeants.fonction: title case (Directeur Général), French grammatical case (Directeur général), all-caps (PRESIDENT), lowercase (président) → Treatment: normalise casing
- adresse: mixed uppercase and title case (RUE D'ANJOU vs rue Perrier) → Treatment: normalise casing
Float serialisation of integers:
- nombreSalaries: all values have a .0 suffix (1.0, 25.0) → Treatment: cast to integer after validation
Foreign data in domestic fields:
- codePostal: one UK postcode (EC1R4QB) and two likely Belgian codes with leading spaces → Flag: legitimate foreign registrations, but may need special handling
- pays: one ROYAUME-UNI (United Kingdom) record → Accept: legitimate
French address conventions:
- CEDEX postal routing suffixes in multiple forms (NANTERRE CEDEX, PARIS CEDEX 07, LE MANS Cedex 2)
- S/ abbreviation for "sur" (VILLEBON S/ YVETTE)
- Hyphenated vs unhyphenated town names (Neuilly-sur-Seine vs Neuilly sur Seine)
- Arrondissement suffixes (PARIS 16, LYON 3EME)
High null rates:
- emailDeContact: 58.3% empty
- adresse: 24.4% empty
Structural consistency (no issues):
- dateDebut: near-perfect DD-MM-YYYY consistency (1,953 of 1,954 values)
- identifiantNational: clean separation of SIREN (numeric) and RNA (alphanumeric) identifiers
- categorieOrganisation.label: consistent controlled vocabulary
Lessons Learned
1. Text-encoded numeric ranges are invisible to schemas but obvious to masks. The expenditure and revenue fields store French-language band descriptions — >= 75 000 euros et < 100 000 euros — that look like strings to any schema validator and pass every null or length check. But the mask _ _ 9 9 a a _ 9 9 a immediately reveals the presence of alphabetic characters mixed with digits, signalling that this is not a straightforward numeric field. Any team ingesting this data and attempting arithmetic on these columns would discover the problem only at query time, possibly after building dashboards on meaningless aggregations. The profiler surfaces it in the first pass.
2. One character substitution, caught by structural profiling. Ro!and, with a single exclamation mark where an l should be, produces the mask Aa_a, which is structurally different from every other first name in the dataset (all of which match Aa, Aa-Aa, or Aa Aa). No schema or length check would catch this, and a regex would catch it only if someone had already written an allowlist of valid name characters, which means anticipating the error in advance. The mask catches it because the structural signature of the error is different from the structural signature of correct data. This is the essence of mask-based profiling: you do not need to know what errors to look for. You look at the structure, and the errors announce themselves.
3. Casing inconsistency is pervasive in French administrative data. The dataset contains uppercase (FRANCE, BEAUNE, PRESIDENT), title case (France, Paris, Directeur Général), French grammatical case (Directeur général, where only the first word is capitalised), and lowercase (france, avignon, président). These are not random — they reflect different data entry conventions, different source systems, and different interpretations of French typographic rules. The profiler separates them all because each casing pattern produces a different mask, turning an invisible consistency problem into a visible, countable one.
4. Float serialisation of integers is a silent data type issue. The nombreSalaries field contains values like 1.0 and 25.0: integers that were serialised as floating-point numbers somewhere in the data pipeline. JSON has a single number type, so whether a value is written as 1 or 1.0 depends entirely on the serialiser, and this kind of silent type promotion is common. The mask 9.9 (with a decimal point) is different from 9 (without), and that difference is the signal. Left undetected, these values might cause type errors in strongly-typed systems or be loaded as floats where downstream code expects integers.
5. A UK postcode in a French dataset is not an error — it is a fact. EC1R4QB in the codePostal field is a legitimate British postal code belonging to a UK organisation registered in the French lobbyist registry. The mask A9A9A is unmistakable against a background of five-digit numeric French codes. The profiler does not tell you whether this is right or wrong — it tells you that it is structurally different, and gives you the example so you can decide. In this case the decision is clear: the data is correct, and the system needs to accommodate foreign postal code formats.
6. French address conventions create legitimate structural diversity. CEDEX postal routing suffixes, the S/ abbreviation for "sur", hyphenated commune names, arrondissement numbers, and space-separated thousands in currency amounts are all standard French conventions. They are not errors, but they create structural variation that any downstream consumer needs to understand. The mask frequency table is an inventory of these conventions — a specification extracted from the data itself, rather than imposed by a schema that someone wrote based on what they thought the data looked like.