|

xlfilldown: The Complete Guide to Excel Fill Down

Summary: Everything you need to know about xlfilldown — the CLI tool and Python library that streams Excel data into SQLite or a new Excel workbook, with hierarchical fill-down, row hashing, and constant-memory processing.

KeyValue
Toolxlfilldown
Python version3.9+
Dependenciesopenpyxl
Installpip install xlfilldown
Output targetsSQLite, Excel (.xlsx)
Fill modeshierarchical, independent

0. Prerequisites

  • Python 3.9 or later installed on your system
  • Basic familiarity with the command line
  • An Excel .xlsx file you want to process

Install xlfilldown with pip or pipx:

Or if you prefer an isolated install:


1. Why Fill Down Is Tricky

Typical spreadsheets use tiered headers: a high-level “group” column appears once, and lower-level detail rows sit beneath it. The hierarchy is obvious to a human eye but catastrophic for a database — every row needs its full context.

The naive fix is pandas ffill(), but that carries each column forward independently. When a new Region appears before Country is set, ffill() drags the old Country into the new Region. Now Osaka is in Europe and nobody notices until the quarterly report is wrong.

xlfilldown solves this with two fill modes:

  • Hierarchical (default) — higher tiers reset lower tiers when they change. When Region changes, Country and City both reset. This prevents stale context from leaking across groups.
  • Independent — each listed column carries forward on its own, like pandas ffill(). Use this when your columns are genuinely unrelated.

Choose the mode with --fill-mode hierarchical or --fill-mode independent. Column order matters only in hierarchical mode, where left equals highest tier.


2. Your First Command

Say Sheet1 in data.xlsx looks like this:

RegionCountryCityValue
Europe
Germany
Bonn10
Berlin12
Asia
Japan
Osaka7
Kyoto8

The goal: fill Region, Country, and City so every row carries its full context, then write the result to SQLite.

After running that command, the fact_locations table in out.db contains:

row_hashexcel_rowRegionCountryCityValue
aa93f736…4EuropeGermanyBonn10
2ef01629…5EuropeGermanyBerlin12
9c9f43c6…9AsiaJapanOsaka7
55d85bf2…10AsiaJapanKyoto8

Key details about the output:

  • All columns are stored as TEXT, including numeric values like Value.
  • Numbers and dates are written as their string representation (e.g., 10 becomes "10").
  • row_hash is a SHA-256 computed over all data columns in that row.
  • excel_row is the 1-based row number from the original spreadsheet.
  • Blank spacer rows and group-header rows (missing Value) were dropped.

3. Using Column Letters Instead of Names

If your header names are long or contain special characters, point to columns by their Excel letter instead:

  • Letters are resolved to header names using the header row you specify.
  • If a letter points to an empty header cell, xlfilldown raises a clear error.
  • --fill-cols and --fill-cols-letters are mutually exclusive — pick one.

4. Hierarchical vs Independent Fill

4.1 Hierarchical (default)

When Region changes, both Country and City reset on that row. When Country changes, City resets. This prevents stale values from leaking into a new group.

Use hierarchical mode when your data has natural tiers: Region, Country, City or Department, Team, Employee.

4.2 Independent

Each column carries forward on its own, regardless of what happens in other columns. This behaves like pandas ffill().

Use independent mode when your columns are unrelated — for example, Last Seen Date and Owner have no hierarchical relationship.

Tip: If you see old City names leaking into a new Country, switch to hierarchical mode.


5. Dropping Unwanted Rows

Two filters apply after fill-down:

Drop blank spacer rows — removes rows that are blank across all your fill columns:

This is useful when your source sheet has empty rows used as visual separators.

Require non-null — drops any row where a listed column is still blank after filling:

You can also specify this by letter:

Name-based and letter-based lists can be combined. They are merged and de-duplicated internally.


6. Writing to Excel Instead of SQLite

Replace the db subcommand with xlsx:

The --if-exists flag controls what happens when the output sheet already exists:

ValueBehavior
errorRaise an error if the sheet exists (default)
replaceDelete and recreate the sheet
appendAppend rows after validating that headers match exactly, including optional columns like row_hash and excel_row

Tip: When creating a new workbook, xlfilldown uses openpyxl’s write_only=True mode for streaming speed.


7. Append Semantics

Appending only works if the destination columns match exactly and in order:

  • SQLite — column names and order must match the existing table schema.
  • Excel — the header row in the destination sheet must equal the expected column list, including any optional columns like row_hash and excel_row if enabled.

This strictness prevents silent data drift. If your schema has changed (for example, you added --row-hash after the initial load), use --if-exists replace or adjust the destination to match.


8. Raw Ingest (No Fill Down)

Sometimes you want to load a sheet exactly as-is, without applying any fill-down logic. This is useful when you just need audit columns (row_hash, excel_row) and plan to handle nulls later.

Pass --ingest-mode raw to skip fill-down entirely:

The same flag works with the xlsx subcommand:


9. Python API

Both output targets are available as Python functions with the same options as the CLI.

9.1 Write to SQLite

9.2 Write to Excel

9.3 Raw ingest via Python

Use fill_cols_letters=["A", "B", "C"] instead of fill_cols to specify columns by letter. Use require_non_null_letters=["A", "D"] for the same with the non-null filter. Name-based and letter-based arguments can be combined — they are merged and de-duplicated.

The return value is a summary object with row counts and flags indicating whether hash and row-number columns were written.


10. Verifying Results

10.1 SQLite

10.2 Excel

Open the output workbook and inspect the header row. If you enabled --row-hash or --excel-row-numbers, those columns appear after the data headers.


11. Recipes

11.1 Group headers with detail rows — keep only detail

Your sheet has “group header” rows that contain just a Region value and detail rows that contain a Value. Drop the group headers:

The --require-non-null '["Value"]' flag ensures only rows with an actual Value survive.

11.2 Messy header names — use letters

Headers like "anothercolumn,3" are awkward to type in JSON. Use letters instead:

11.3 Strict filtering after fill

Drop any row where Region or City is still missing after fill-down:

11.4 Subset columns in the output

xlfilldown ingests all non-empty headers by design (this keeps hashing stable and predictable). To narrow the output, create a view in SQLite:

Or do a second-pass filter in your downstream tool.


12. Performance Tips

  • --batch-size 1000 is a good default. Increase it for very wide tables or slow disks.
  • When --row-hash or --excel-row-numbers are enabled, xlfilldown creates helpful indexes automatically.
  • The source workbook is opened with openpyxl’s read_only=True. New output workbooks use write_only=True.
  • Processing is row-streamed and runs in constant memory, so it handles large sheets without issue.

13. Troubleshooting

Common error messages and what to do about them:

  • “Use only one of –fill-cols or –fill-cols-letters” — You passed both. Pick one input style per invocation.
  • “Column letter X refers to an empty header cell” — The cell at that letter in your header row is blank, whitespace, or nan. Either populate it in Excel or switch to --fill-cols with explicit names.
  • “Duplicate header names found” — Rename the duplicates in your sheet. xlfilldown requires unique headers after trimming whitespace and normalizing nan values.
  • “Header row N exceeds sheet max row M” — You pointed at the wrong header row. Check your --header-row value.
  • Append failed with schema mismatch — The destination columns no longer match. You may have added --row-hash or changed header order since the original load. Use --if-exists replace or adjust the destination schema to match.

14. FAQ

Q: Does fill-down apply to completely empty rows?

A: No. Rows that are empty across all headers are preserved as-is (or dropped if you pass --drop-blank-rows). The carry state persists past them, so the next non-empty row still gets filled correctly.

Q: Why is everything stored as TEXT?

A: Determinism. Values are canonicalized to stable text for consistent storage and hashing. Cast to your desired types in downstream tools.

Q: Does hashing treat 1, 1.0, and Decimal('1.00') the same?

A: Yes. They all canonicalize to "1", so they produce the same hash.

Q: Can I keep Excel’s original row numbers?

A: Yes. Pass --excel-row-numbers to add an excel_row TEXT column containing the 1-based row number from the source sheet.

Q: Can I preview data before writing?

A: Not built-in yet. A common pattern is writing to a temporary SQLite database and inspecting it with sqlite3 or DB Browser for SQLite.


15. Cheat Sheets

15.1 CLI — SQLite

15.2 CLI — Excel

15.3 Python


Summary

xlfilldown takes messy, tiered Excel spreadsheets and turns them into clean, flat data — either in SQLite or a new Excel workbook. The key concepts covered in this guide:

  • Hierarchical fill resets lower tiers when higher tiers change, preventing stale context from leaking across groups.
  • Independent fill carries each column forward separately, like pandas ffill().
  • Row hashing and Excel row numbers provide audit trails back to the source data.
  • Raw ingest loads sheets as-is when you want audit columns without fill-down logic.
  • The CLI and Python API offer identical functionality — pick whichever fits your workflow.

For quick reference, see the cheat sheets in section 15.

Similar Posts

One Comment

Leave a Reply