Your Manager’s Spreadsheet Is Broken. Here’s the 3-Minute Fix.

Summary: How xlfilldown turns nightmare Excel files into clean, queryable data — without loading the whole thing into memory.


You know the spreadsheet. Every data engineer knows the spreadsheet.

It arrives on a Tuesday morning, attached to a Slack message that says “Hey, can you load this into the database? Should be pretty straightforward.”

You open it. And there it is.

RegionCountryCityRevenue
Europe
Germany
Bonn10
Berlin12
Asia
Japan
Osaka7
Kyoto8

The tiered, merged, visually-grouped spreadsheet. Beautiful on screen. Absolutely catastrophic for a pipeline.

Your manager doesn’t see a problem — the hierarchy is obvious to a human eye. But your database doesn’t have eyes. It needs every row to carry its full context: Europe, Germany, Bonn, 10. Not just Bonn, 10 and a prayer.


The Trap Everyone Falls Into

The obvious fix is pandas ffill(). Five seconds on Stack Overflow and you’ve got:

df[["Region", "Country", "City"]] = df[["Region", "Country", "City"]].ffill()



Code language: JavaScript (javascript)

Ship it? Not so fast.

What happens when a new Region appears but Country hasn’t been set yet? ffill() carries the old Country forward into the new Region. Now Osaka is in Europe and nobody notices until the quarterly board deck has wrong numbers and someone’s asking “why does our Germany revenue include Japanese cities?”

This isn’t a hypothetical. This is a Wednesday.


Fill-Down Has a Hierarchy Problem

The real world doesn’t have flat data. It has tiers. Region → Country → City. When the Region changes, Country and City should reset, not carry forward. ffill() doesn’t know that. It treats every column as independent.

This is the exact problem I built xlfilldown to solve.


One Command. Clean Data.

pip install xlfilldown



xlfilldown db \
  --infile data.xlsx \
  --insheet "Sheet1" \
  --header-row 1 \
  --fill-cols '["Region","Country","City"]' \
  --require-non-null '["Revenue"]' \
  --drop-blank-rows \
  --db output.db \
  --table clean_revenue \
  --row-hash \
  --excel-row-numbers



Code language: JavaScript (javascript)

That’s it. Here’s what just happened:

  • Hierarchical fill-down: Region resets Country and City. Country resets City. No stale data leaks.
  • Spacer rows dropped: Those blank visual separators your manager loves? Gone.
  • Group headers filtered out: Rows without a Revenue value (the “Europe” and “Germany” label rows) are excluded automatically via --require-non-null.
  • SHA-256 row hash: Every output row gets a deterministic hash for audit trails and deduplication.
  • Original row numbers preserved: Need to trace an output row back to the exact Excel row? It’s there.
  • Streamed in constant memory: Got a 500MB workbook? No problem. It won’t eat your RAM.

What lands in SQLite:

excel_rowRegionCountryCityRevenue
4EuropeGermanyBonn10
5EuropeGermanyBerlin12
9AsiaJapanOsaka7
10AsiaJapanKyoto8

Every row has its full context. No leaks. No stale cities. No 3 AM debugging.


“But My Columns Aren’t Hierarchical”

Fair. Sometimes you’ve got columns that really are independent — a “Last Seen Date” and an “Owner” that should each carry forward on their own without resetting each other.

Just flip the mode:

--fill-mode independent



Now it behaves like pandas ffill(), but with all the extras: streaming, row hashes, row numbers, and null filtering. Two modes, one tool, you choose.

Tip: Use --fill-mode hierarchical (the default) when your columns have a parent-child relationship. Use --fill-mode independent when each column should fill forward on its own.


Don’t Want to Touch the Terminal?

Use the Python API with the same options:

NOTE: CLI flags use hyphens (e.g., --fill-cols) while the Python API uses underscores (e.g., fill_cols). This follows common conventions: CLIs typically use hyphens, and Python identifiers use underscores (many parsers also map -_). When moving between CLI and API, swap - for _.

from xlfilldown.api import ingest_excel_to_sqlite

summary = ingest_excel_to_sqlite(
    file="data.xlsx",
    sheet="Sheet1",
    header_row=1,
    fill_cols=["Region", "Country", "City"],
    fill_mode="hierarchical",
    db="output.db",
    table="clean_revenue",
    drop_blank_rows=True,
    require_non_null=["Revenue"],
    row_hash=True,
    excel_row_numbers=True,
    if_exists="replace",
)



Code language: PHP (php)

Tip: Need Excel output instead of SQLite? Swap ingest_excel_to_sqlite for ingest_excel_to_excel and pass outfile="clean.xlsx". Same API, same options.


Why Not Just Fix It in pandas?

You can. But here’s what you’ll end up rebuilding:

  1. Hierarchical reset logic (the hard part)
  2. Streaming for large files (openpyxl read_only mode, row-by-row processing)
  3. Blank row detection and filtering
  4. Null-after-fill filtering for group header rows
  5. Deterministic hashing for audit trails
  6. Original row number tracking for traceability
  7. Schema validation for append mode
  8. Column letter → header name resolution for messy headers

That’s not a Friday afternoon script. That’s a maintenance burden. xlfilldown is 40 characters on the command line.


The Messy Header Trick

Real spreadsheets have headers like "Q3 Revenue (USD, adjusted)" or worse, headers with commas and special characters. Nobody wants to type those.

Point to columns by letter instead:

xlfilldown db \
  --infile messy.xlsx \
  --header-row 3 \
  --fill-cols-letters A C AE \
  --db output.db



Code language: CSS (css)

Letters are resolved against whatever’s in your header row. If a letter points to a blank header, you get a clear error, not silent corruption.


Who This Is For

  • Data engineers who receive “the spreadsheet” on a regular basis and need it in a database, clean, with an audit trail.
  • Analysts who want to query tiered Excel data in SQLite or load it into another tool without manual cleanup.
  • Anyone building pipelines that ingest Excel files from non-technical stakeholders who format for humans, not machines.

Install It Now

pip install xlfilldown



Usage instructions here: https://github.com/RexBytes/xlfilldown/blob/main/README.md

Python 3.9+. One dependency (openpyxl). Runs anywhere Python runs.

Next time the spreadsheet arrives, you’ll have clean data in your database before your coffee gets cold.


xlfilldown on PyPI · Python ≥ 3.9 · MIT License

Similar Posts

Leave a Reply