Your Manager’s Spreadsheet Is Broken. Here’s the 3-Minute Fix.

Summary: How xlfilldown turns nightmare Excel files into clean, queryable data — without loading the whole thing into memory.

You know the spreadsheet. Every data engineer knows the spreadsheet.

It arrives on a Tuesday morning, attached to a Slack message that says “Hey, can you load this into the database? Should be pretty straightforward.”

You open it. And there it is.

Region	Country	City	Revenue
Europe
	Germany
		Bonn	10
		Berlin	12
Asia
	Japan
		Osaka	7
		Kyoto	8

The tiered, merged, visually-grouped spreadsheet. Beautiful on screen. Absolutely catastrophic for a pipeline.

Your manager doesn’t see a problem — the hierarchy is obvious to a human eye. But your database doesn’t have eyes. It needs every row to carry its full context: Europe, Germany, Bonn, 10. Not just Bonn, 10 and a prayer.

The Trap Everyone Falls Into

The obvious fix is pandas ffill(). Five seconds on Stack Overflow and you’ve got:

df[["Region", "Country", "City"]] = df[["Region", "Country", "City"]].ffill()



Code language: JavaScript (javascript)

Ship it? Not so fast.

What happens when a new Region appears but Country hasn’t been set yet? ffill() carries the old Country forward into the new Region. Now Osaka is in Europe and nobody notices until the quarterly board deck has wrong numbers and someone’s asking “why does our Germany revenue include Japanese cities?”

This isn’t a hypothetical. This is a Wednesday.

Fill-Down Has a Hierarchy Problem

The real world doesn’t have flat data. It has tiers. Region → Country → City. When the Region changes, Country and City should reset, not carry forward. ffill() doesn’t know that. It treats every column as independent.

This is the exact problem I built xlfilldown to solve.

One Command. Clean Data.

pip install xlfilldown

xlfilldown db \
  --infile data.xlsx \
  --insheet "Sheet1" \
  --header-row 1 \
  --fill-cols '["Region","Country","City"]' \
  --require-non-null '["Revenue"]' \
  --drop-blank-rows \
  --db output.db \
  --table clean_revenue \
  --row-hash \
  --excel-row-numbers



Code language: JavaScript (javascript)

That’s it. Here’s what just happened:

Hierarchical fill-down: Region resets Country and City. Country resets City. No stale data leaks.
Spacer rows dropped: Those blank visual separators your manager loves? Gone.
Group headers filtered out: Rows without a Revenue value (the “Europe” and “Germany” label rows) are excluded automatically via --require-non-null.
SHA-256 row hash: Every output row gets a deterministic hash for audit trails and deduplication.
Original row numbers preserved: Need to trace an output row back to the exact Excel row? It’s there.
Streamed in constant memory: Got a 500MB workbook? No problem. It won’t eat your RAM.

What lands in SQLite:

excel_row	Region	Country	City	Revenue
4	Europe	Germany	Bonn	10
5	Europe	Germany	Berlin	12
9	Asia	Japan	Osaka	7
10	Asia	Japan	Kyoto	8

Every row has its full context. No leaks. No stale cities. No 3 AM debugging.

“But My Columns Aren’t Hierarchical”

Fair. Sometimes you’ve got columns that really are independent — a “Last Seen Date” and an “Owner” that should each carry forward on their own without resetting each other.

Just flip the mode:

--fill-mode independent

Now it behaves like pandas ffill(), but with all the extras: streaming, row hashes, row numbers, and null filtering. Two modes, one tool, you choose.

Tip: Use --fill-mode hierarchical (the default) when your columns have a parent-child relationship. Use --fill-mode independent when each column should fill forward on its own.

Don’t Want to Touch the Terminal?

Use the Python API with the same options:

NOTE: CLI flags use hyphens (e.g., --fill-cols) while the Python API uses underscores (e.g., fill_cols). This follows common conventions: CLIs typically use hyphens, and Python identifiers use underscores (many parsers also map - → _). When moving between CLI and API, swap - for _.

from xlfilldown.api import ingest_excel_to_sqlite

summary = ingest_excel_to_sqlite(
    file="data.xlsx",
    sheet="Sheet1",
    header_row=1,
    fill_cols=["Region", "Country", "City"],
    fill_mode="hierarchical",
    db="output.db",
    table="clean_revenue",
    drop_blank_rows=True,
    require_non_null=["Revenue"],
    row_hash=True,
    excel_row_numbers=True,
    if_exists="replace",
)



Code language: PHP (php)

Tip: Need Excel output instead of SQLite? Swap ingest_excel_to_sqlite for ingest_excel_to_excel and pass outfile="clean.xlsx". Same API, same options.

Why Not Just Fix It in pandas?

You can. But here’s what you’ll end up rebuilding:

Hierarchical reset logic (the hard part)
Streaming for large files (openpyxl read_only mode, row-by-row processing)
Blank row detection and filtering
Null-after-fill filtering for group header rows
Deterministic hashing for audit trails
Original row number tracking for traceability
Schema validation for append mode
Column letter → header name resolution for messy headers

That’s not a Friday afternoon script. That’s a maintenance burden. xlfilldown is 40 characters on the command line.

The Messy Header Trick

Real spreadsheets have headers like "Q3 Revenue (USD, adjusted)" or worse, headers with commas and special characters. Nobody wants to type those.

Point to columns by letter instead:

xlfilldown db \
  --infile messy.xlsx \
  --header-row 3 \
  --fill-cols-letters A C AE \
  --db output.db



Code language: CSS (css)

Letters are resolved against whatever’s in your header row. If a letter points to a blank header, you get a clear error, not silent corruption.

Who This Is For

Data engineers who receive “the spreadsheet” on a regular basis and need it in a database, clean, with an audit trail.
Analysts who want to query tiered Excel data in SQLite or load it into another tool without manual cleanup.
Anyone building pipelines that ingest Excel files from non-technical stakeholders who format for humans, not machines.

Install It Now

pip install xlfilldown

Usage instructions here: https://github.com/RexBytes/xlfilldown/blob/main/README.md

Python 3.9+. One dependency (openpyxl). Runs anywhere Python runs.

Next time the spreadsheet arrives, you’ll have clean data in your database before your coffee gets cold.

xlfilldown on PyPI · Python ≥ 3.9 · MIT License

Your Manager’s Spreadsheet Is Broken. Here’s the 3-Minute Fix.

The Trap Everyone Falls Into

Fill-Down Has a Hierarchy Problem

One Command. Clean Data.

“But My Columns Aren’t Hierarchical”

Don’t Want to Touch the Terminal?

Why Not Just Fix It in pandas?

The Messy Header Trick

Who This Is For

Install It Now

Like this:

GitHub Actions CI/CD (6/10): Integration Tests — Real Dependencies in CI

Like this:

pgmonkey: One Config File to Rule All Your PostgreSQL Connections

Like this:

GitHub Actions CI/CD (2/10): Unit Tests — Fast Feedback

Like this:

Python Concurrency, Multi-threading & Multi-processing

Like this:

Jenkins CI/CD (1/11): Prerequisites and Mental Model

Like this:

Packaging Your Own Software for a PPA (From Scratch)

Like this:

Leave a Reply Cancel reply

The Trap Everyone Falls Into

Fill-Down Has a Hierarchy Problem

One Command. Clean Data.

“But My Columns Aren’t Hierarchical”

Don’t Want to Touch the Terminal?

Why Not Just Fix It in pandas?

The Messy Header Trick

Who This Is For

Install It Now

Share this:

Like this:

Similar Posts

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Leave a Reply Cancel reply