Using bytefreq: Installation, Build, and Command-Line Reference
This chapter covers the practical side of bytefreq: how to install it, how to build it from source, and how to use it from the command line. If the previous chapters described the what and why of mask-based profiling, this chapter covers the how.
Prerequisites
bytefreq is written in Rust and built using Cargo, Rust's package manager and build system. If you do not already have Rust installed, the standard installation method is:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
This installs rustc (the compiler), cargo (the build tool), and rustup (the toolchain manager). Follow the on-screen prompts — the defaults are fine for most systems. After installation, restart your terminal or run source $HOME/.cargo/env to make the tools available.
Verify the installation:
rustc --version
cargo --version
Installation
There are two ways to install bytefreq.
From GitHub (recommended)
cargo install --git https://github.com/minkymorgan/bytefreq
This clones the repository, compiles the release binary, and installs it to ~/.cargo/bin/, which should already be on your PATH if Rust is installed correctly.
From a local clone
git clone https://github.com/minkymorgan/bytefreq.git
cd bytefreq
cargo build --release
cargo install --path .
Building from a local clone is useful if you intend to modify the code — for example, to add custom assertion rules as described in the previous chapter.
Verify
bytefreq --version
Input Formats
bytefreq supports four input formats:
Tabular (-f tabular, the default) — delimited text where the first line is a header row and subsequent lines are data records. The delimiter defaults to pipe (|) but can be set to any character using the -d flag. Tabular data is read from standard input (stdin).
JSON (-f json) — newline-delimited JSON (NDJSON), where each line is a complete JSON object. bytefreq flattens nested structures using dot-notation paths (e.g., customer.address.postcode), handling arrays and nested objects to a configurable depth. JSON data is read from standard input.
Excel (-f excel --excel-path file.xlsx) — native Excel file support for .xlsx, .xls, .xlsb, and .ods formats. Requires building with the excel feature flag (cargo install --git https://github.com/minkymorgan/bytefreq --features excel). By default, bytefreq reads the first sheet; use --excel-sheet to select a specific sheet by index (0-based).
Parquet (-f parquet --parquet-path file.parquet) — native Apache Parquet file support. Requires building with the parquet feature flag (cargo install --git https://github.com/minkymorgan/bytefreq --features parquet). Parquet files are converted internally to JSON lines, so all JSON features — dot-notation nested paths, array index handling (-a), path depth limiting (-p), and enhanced output — work with Parquet data. Nested structs produce dot-notation paths (user.address.city), and list columns produce indexed array paths (scores[0], scores[1]). Timestamps are automatically converted to ISO 8601 strings, and all standard Arrow data types are supported.
To install with all optional format support:
cargo install --git https://github.com/minkymorgan/bytefreq --features parquet,excel
A note on CSV: bytefreq defaults to pipe-delimited input rather than comma-delimited, because pipe characters appear far less frequently in real-world data values and thus produce fewer parsing ambiguities. If your data is comma-delimited, pass -d ','. For complex CSV files with quoted fields, escaped delimiters, or embedded newlines, bytefreq uses a proper CSV parser that handles quoted fields and escape sequences correctly.
Command-Line Reference
bytefreq [OPTIONS]
OPTIONS:
-g, --grain <GRAIN> Masking grain level [default: LU]
H - High grain ASCII (A/a/9)
L - Low grain ASCII (compressed)
U - High grain Unicode (HU)
LU - Low grain Unicode (compressed)
-d, --delimiter <DELIM> Field delimiter [default: |]
-f, --format <FORMAT> Input format [default: tabular]
tabular - Delimited text with header
json - Newline-delimited JSON
excel - Excel file (requires --excel-path)
parquet - Parquet file (requires --parquet-path)
--excel-path <PATH> Path to Excel file (with -f excel)
--excel-sheet <INDEX> Sheet index, 0-based [default: 0]
--parquet-path <PATH> Path to Parquet file (with -f parquet)
-r, --report <REPORT> Report type [default: DQ]
DQ - Data Quality (mask frequencies)
CP - Character Profiling (byte/codepoint frequencies)
-p, --pathdepth <DEPTH> JSON nesting depth [default: 9]
-a, --remove-array-numbers Collapse array indices in JSON paths
-e, --enhanced-output Output flat enhanced JSON (nested format)
-E, --flat-enhanced Output flat enhanced JSON (flattened format)
-h, --help Print help
-V, --version Print version
Basic Profiling
The most common use case is profiling a delimited file at the default grain level (Low Unicode):
cat data.csv | bytefreq -d ','
The output is a human-readable frequency report, organised by column. For each column, bytefreq lists the unique masks found, their occurrence counts, and a randomly sampled example value for each mask (selected using reservoir sampling to ensure a truly random representative):
=== Column: postcode ===
Mask Count Example
A9 9A 8,412 SW1A 1AA
A99 9A 1,203 M60 1NW
A9A 9A 892 W1D 3QU
AA9 9A 567 EC2R 8AH
9 312 N/A
44
The example column is particularly useful during exploratory profiling — it lets you see an actual value behind each mask without having to go back to the raw data.
Grain Levels in Practice
Low Unicode (LU) — the default
cat data.csv | bytefreq -d ',' -g LU
Consecutive characters of the same Unicode class are collapsed. Good for initial discovery: how many structural families exist in each column?
High Unicode (HU) — exact formats
cat data.csv | bytefreq -d ',' -g HU
Every character maps individually. Good for precision work: what exact postcode formats are present? What date formats are in use?
High ASCII (H) and Low ASCII (L) — legacy modes
cat data.csv | bytefreq -d ',' -g H
cat data.csv | bytefreq -d ',' -g L
The original A/a/9 masks without Unicode awareness. All non-ASCII characters are left unmapped. Useful when profiling data known to be ASCII-only, or when comparing results against the legacy awk-based bytefreq.
Character Profiling
The -r CP flag switches from mask-based profiling to character-level frequency analysis:
cat data.csv | bytefreq -d ',' -r CP
This reports the frequency of every Unicode code point found in the file, alongside the character itself and its Unicode name. The output is sorted by frequency and grouped by Unicode General Category (Letter, Number, Punctuation, Symbol, Separator, Other).
Character profiling is the forensic tool. Use it when you need to:
- Determine the encoding of an unknown file — UTF-8, Latin-1, Windows-1252, and mixed encodings each produce characteristic byte patterns.
- Find invisible characters — zero-width spaces, byte order marks, soft hyphens, and other non-printing characters that cause subtle parsing failures.
- Detect control characters — tabs, carriage returns, null bytes, and other control characters in fields that should contain only printable text.
- Understand the script composition — what proportion of the text is Latin, Cyrillic, CJK, Arabic, or other scripts?
JSON Profiling
For JSON data, use -f json:
cat data.ndjson | bytefreq -f json
bytefreq expects newline-delimited JSON — one complete JSON object per line. It flattens nested structures into dot-notation paths:
{"customer": {"address": {"postcode": "SW1A 1AA"}}}
becomes a column named customer.address.postcode with value SW1A 1AA.
Controlling nesting depth
For deeply nested JSON, the -p flag controls how many levels of nesting bytefreq will traverse. Consider this input:
{"org": {"dept": {"team": {"lead": {"name": "Alice"}}}}}
With the default depth (-p 9), this produces a column named org.dept.team.lead.name. Limiting the depth changes what bytefreq sees:
# Full depth — profiles org.dept.team.lead.name
cat data.ndjson | bytefreq -f json
# Depth 3 — profiles org.dept.team (stops here, treats remaining nesting as a value)
cat data.ndjson | bytefreq -f json -p 3
# Depth 1 — profiles org (the entire nested object as a single JSON string)
cat data.ndjson | bytefreq -f json -p 1
Limiting depth is useful for very complex JSON structures where the full path depth produces an unmanageable number of columns. Start shallow and increase depth as needed.
Collapsing array indices
JSON arrays produce indexed paths by default. Given this input:
{"items": [{"name": "Widget"}, {"name": "Gadget"}, {"name": "Doohickey"}]}
bytefreq generates separate columns: items.0.name, items.1.name, items.2.name. The -a flag collapses the array index, treating all array elements as the same column:
# Without -a: items.0.name, items.1.name, items.2.name (3 separate columns)
cat data.ndjson | bytefreq -f json
# With -a: items.name (1 column, all array elements pooled together)
cat data.ndjson | bytefreq -f json -a true
This produces items.name instead of separate columns per array position, which is usually what you want for profiling the structural patterns within array elements. The collapsed column's mask frequency table then reflects the patterns across all array elements, not just those at a specific index.
Enhanced Output
The -e and -E flags switch bytefreq from profiling mode to enhanced output mode. Instead of producing a frequency report, the tool processes every record and outputs the flat enhanced format described in Chapter 9.
Nested enhanced (-e)
cat data.csv | bytefreq -d ',' -e
Produces one JSON object per input row, with each field expanded into a nested structure:
{
"postcode": {
"raw": "SW1A 1AA",
"HU": "AA9A 9AA",
"LU": "A9A 9A",
"Rules": {
"string_length": 8,
"is_uk_postcode": true,
"poss_postal_country": ["UK"]
}
}
}
Flat enhanced (-E)
cat data.csv | bytefreq -d ',' -E
Produces the same information but flattened to dot-notation keys — one level deep, no nesting:
{
"postcode.raw": "SW1A 1AA",
"postcode.HU": "AA9A 9AA",
"postcode.LU": "A9A 9A",
"postcode.Rules.string_length": 8,
"postcode.Rules.is_uk_postcode": true,
"postcode.Rules.poss_postal_country": ["UK"]
}
The flat format is easier to load into columnar tools (Pandas, DuckDB, Parquet) because every key maps directly to a column name without requiring nested JSON parsing.
Pipeline Recipes
bytefreq is designed for Unix pipelines. Here are some common patterns:
Profile the first 10,000 rows of a large file
head -10001 data.csv | bytefreq -d ','
(10,001 to include the header row.)
Profile compressed data
zcat data.csv.gz | bytefreq -d ','
Profile a remote API response
curl -s 'https://api.example.com/data' | bytefreq -f json
Generate flat enhanced output and load into DuckDB
cat data.csv | bytefreq -d ',' -E > enhanced.ndjson
duckdb -c "SELECT * FROM read_ndjson_auto('enhanced.ndjson') LIMIT 10;"
Profile only specific columns (using pre-processing)
cat data.csv | cut -d',' -f1,3,5 | bytefreq -d ','
Compare two files structurally
diff <(cat file1.csv | bytefreq -d ',') <(cat file2.csv | bytefreq -d ',')
This shows which columns have gained or lost structural patterns between two versions of the same dataset — useful for detecting format drift over time.
Profile an Excel file (native)
bytefreq -f excel --excel-path data.xlsx
To profile a specific sheet (0-based index):
bytefreq -f excel --excel-path data.xlsx --excel-sheet 2
(Requires building with --features excel. Alternatively, DataRadar handles Excel files natively in the browser.)
Profile a Parquet file
bytefreq -f parquet --parquet-path data.parquet
Nested structs produce dot-notation paths and list columns produce indexed array paths, just like JSON. Use -a to collapse array indices:
bytefreq -f parquet --parquet-path data.parquet -a
Generate flat enhanced output from Parquet:
bytefreq -f parquet --parquet-path data.parquet -E > enhanced.ndjson
(Requires building with --features parquet.)
Understanding the Output
The standard DQ report output follows a consistent format:
=== Column: field_name ===
Mask Count Example
aaaa.aaaa@aaaa.aaa 45,231 john.smith@email.com
aaaa@aaaa.aaa 8,102 jane@company.org
Aaaa Aaaaa 312 John Smith
99999 45 12345
12
--------END OF REPORT--------
Each section corresponds to one column in the input. Masks are sorted by descending frequency, so the most common patterns appear first. The example value is a true random sample selected using reservoir sampling — not the first occurrence, but a statistically representative one.
The --------END OF REPORT-------- marker signals the end of the output, which is useful when piping to downstream tools.
Performance
bytefreq uses Rayon for multi-threaded processing, so it will utilise all available CPU cores when generating enhanced output. For standard DQ profiling, the bottleneck is typically I/O rather than computation — the mask function is simple enough that CPU time is negligible compared to the time spent reading input.
On a modern machine, expect throughput of several hundred thousand rows per second for tabular data, depending on the number of columns and the average field length. For most datasets under a few million rows, profiling completes in seconds.