TL;DR — Key Takeaways

  • 1,000,000-row stress test, real Excel limits
    Flattened a full-size workbook (1,000,000 rows × 5 cols) on a laptop without a single memory spike.
  • Performance
    Excel → Excel: 72.6 s total ≈ 13,800 rows/sec
    Excel → SQLite: 22.4 s total ≈ 44,700 rows/sec
  • Constant memory design
    xlfilldown streams row-by-row, never loads the workbook into RAM.
    Memory stayed flat from start to finish in both tests.
  • Deterministic + auditable
    Each run preserves Excel row numbers and computes per-row SHA-256 hashes.
    Same input → same hashes → verifiable reproducibility.
  • Why it matters
    You can safely process any Excel file up to Excel’s 1,048,576-row cap:
    no pandas-style memory blowups, no VBA hacks, no data leakage across parents.
  • Result
    ✅ Fully flattened, hierarchy-aware, audit-ready datasets
    ✅ Runs anywhere (laptop, CI, or cloud)
    ✅ Predictable speed and zero RAM drift

xlfilldown

When I first started building xlfilldown, my goal wasn’t just correctness; it was speed and predictability. I wanted something that could chew through large Excel files without blowing up memory, no matter how messy or hierarchical the data looked.

So I ran a real-world stress test: 1,000,000 rows of hierarchical data, filled down and streamed into both Excel and SQLite, all from a modest laptop.

Here’s how it went. I provide code so you can play along.

Data Preparation

Step 1. Generate test data.

I generated a synthetic dataset based on GeoNames: countries, towns, and product codes arranged in a hierarchical format (think grouped Excel reports).
Each row belongs to a country, town, and product, but many cells are left blank, mimicking the kind of spreadsheets analysts create every day.
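
To make that shape concrete, here is a toy example (values invented for illustration) of what the generated rows look like before and after fill-down:

# Raw rows as generated: hierarchy cells are blank after their first appearance.
raw = [
    ["Germany", "Berlin", "A12345", "2021-03-14", "812.50"],
    ["",        "",       "",       "2022-07-01", "99.10"],    # same country/town/product
    ["",        "",       "k9x2p",  "2020-11-23", "431.00"],   # new product, same town
    ["",        "Bremen", "77310",  "2023-01-05", "1205.75"],  # new town, same country
]

# After fill-down, every row carries its full Country/Town/Product context:
filled = [
    ["Germany", "Berlin", "A12345", "2021-03-14", "812.50"],
    ["Germany", "Berlin", "A12345", "2022-07-01", "99.10"],
    ["Germany", "Berlin", "k9x2p",  "2020-11-23", "431.00"],
    ["Germany", "Bremen", "77310",  "2023-01-05", "1205.75"],
]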

https://download.geonames.org/export/zip/

Download and extract the file allCountries.zip if you want to play along.
Keep this file and the following scripts in the same directory.
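
If you prefer to script the download, here is a small standard-library helper. It assumes the archive at the URL above contains allCountries.txt, which is what the generator below expects:

#!/usr/bin/env python3
"""Fetch and unpack the GeoNames postal-code dump into the current directory."""
import urllib.request
import zipfile
from pathlib import Path

URL = "https://download.geonames.org/export/zip/allCountries.zip"

def fetch_geonames(dest_dir: Path = Path(".")) -> Path:
    zip_path = dest_dir / "allCountries.zip"
    if not zip_path.exists():
        urllib.request.urlretrieve(URL, zip_path)      # skip the download if already present
    with zipfile.ZipFile(zip_path) as zf:
        zf.extract("allCountries.txt", dest_dir)       # the TSV the generator reads
    return dest_dir / "allCountries.txt"

if __name__ == "__main__":
    print(fetch_geonames())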

Create the file gen_geonames_hier.py and add the following code.

#!/usr/bin/env python3

import argparse
import csv
import random
from datetime import datetime, timedelta
from pathlib import Path
from typing import Iterator, List, Optional, Tuple

# -------- ISO 3166-1 alpha-2 → Country name (baked-in) -----------------------
ISO_TO_COUNTRY = {
    "AD":"Andorra","AE":"United Arab Emirates","AF":"Afghanistan","AG":"Antigua and Barbuda","AI":"Anguilla",
    "AL":"Albania","AM":"Armenia","AO":"Angola","AR":"Argentina","AS":"American Samoa","AT":"Austria","AU":"Australia",
    "AW":"Aruba","AZ":"Azerbaijan","BA":"Bosnia and Herzegovina","BB":"Barbados","BD":"Bangladesh","BE":"Belgium",
    "BF":"Burkina Faso","BG":"Bulgaria","BH":"Bahrain","BI":"Burundi","BJ":"Benin","BM":"Bermuda","BN":"Brunei Darussalam",
    "BO":"Bolivia","BR":"Brazil","BS":"Bahamas","BT":"Bhutan","BW":"Botswana","BY":"Belarus","BZ":"Belize","CA":"Canada",
    "CD":"Democratic Republic of the Congo","CF":"Central African Republic","CG":"Republic of the Congo","CH":"Switzerland",
    "CI":"Côte d’Ivoire","CL":"Chile","CM":"Cameroon","CN":"China","CO":"Colombia","CR":"Costa Rica","CU":"Cuba",
    "CV":"Cabo Verde","CY":"Cyprus","CZ":"Czechia","DE":"Germany","DJ":"Djibouti","DK":"Denmark","DM":"Dominica",
    "DO":"Dominican Republic","DZ":"Algeria","EC":"Ecuador","EE":"Estonia","EG":"Egypt","ER":"Eritrea","ES":"Spain",
    "ET":"Ethiopia","FI":"Finland","FJ":"Fiji","FM":"Micronesia","FR":"France","GA":"Gabon","GB":"United Kingdom",
    "GD":"Grenada","GE":"Georgia","GH":"Ghana","GL":"Greenland","GM":"Gambia","GN":"Guinea","GQ":"Equatorial Guinea",
    "GR":"Greece","GT":"Guatemala","GU":"Guam","GW":"Guinea-Bissau","GY":"Guyana","HK":"Hong Kong","HN":"Honduras",
    "HR":"Croatia","HT":"Haiti","HU":"Hungary","ID":"Indonesia","IE":"Ireland","IL":"Israel","IN":"India","IQ":"Iraq",
    "IR":"Iran","IS":"Iceland","IT":"Italy","JM":"Jamaica","JO":"Jordan","JP":"Japan","KE":"Kenya","KG":"Kyrgyzstan",
    "KH":"Cambodia","KI":"Kiribati","KM":"Comoros","KN":"Saint Kitts and Nevis","KP":"North Korea","KR":"South Korea",
    "KW":"Kuwait","KY":"Cayman Islands","KZ":"Kazakhstan","LA":"Laos","LB":"Lebanon","LC":"Saint Lucia",
    "LI":"Liechtenstein","LK":"Sri Lanka","LR":"Liberia","LS":"Lesotho","LT":"Lithuania","LU":"Luxembourg","LV":"Latvia",
    "LY":"Libya","MA":"Morocco","MC":"Monaco","MD":"Moldova","ME":"Montenegro","MG":"Madagascar","MH":"Marshall Islands",
    "MK":"North Macedonia","ML":"Mali","MM":"Myanmar","MN":"Mongolia","MO":"Macao","MR":"Mauritania","MT":"Malta",
    "MU":"Mauritius","MV":"Maldives","MW":"Malawi","MX":"Mexico","MY":"Malaysia","MZ":"Mozambique","NA":"Namibia",
    "NE":"Niger","NG":"Nigeria","NI":"Nicaragua","NL":"Netherlands","NO":"Norway","NP":"Nepal","NR":"Nauru",
    "NZ":"New Zealand","OM":"Oman","PA":"Panama","PE":"Peru","PG":"Papua New Guinea","PH":"Philippines","PK":"Pakistan",
    "PL":"Poland","PT":"Portugal","PW":"Palau","PY":"Paraguay","QA":"Qatar","RO":"Romania","RS":"Serbia","RU":"Russia",
    "RW":"Rwanda","SA":"Saudi Arabia","SB":"Solomon Islands","SC":"Seychelles","SD":"Sudan","SE":"Sweden","SG":"Singapore",
    "SI":"Slovenia","SK":"Slovakia","SL":"Sierra Leone","SM":"San Marino","SN":"Senegal","SO":"Somalia","SR":"Suriname",
    "SS":"South Sudan","ST":"Sao Tome and Principe","SV":"El Salvador","SY":"Syria","SZ":"Eswatini","TD":"Chad","TG":"Togo",
    "TH":"Thailand","TJ":"Tajikistan","TL":"Timor-Leste","TM":"Turkmenistan","TN":"Tunisia","TO":"Tonga","TR":"Turkey",
    "TT":"Trinidad and Tobago","TV":"Tuvalu","TZ":"Tanzania","UA":"Ukraine","UG":"Uganda","US":"United States",
    "UY":"Uruguay","UZ":"Uzbekistan","VA":"Vatican City","VC":"Saint Vincent and the Grenadines","VE":"Venezuela",
    "VN":"Vietnam","VU":"Vanuatu","WS":"Samoa","YE":"Yemen","ZA":"South Africa","ZM":"Zambia","ZW":"Zimbabwe"
}
# -----------------------------------------------------------------------------

def read_geonames(tsv_path: Path) -> Iterator[Tuple[str, str]]:
    """Stream (iso, town) from GeoNames allCountries.txt (tab-delimited)."""
    with tsv_path.open("r", encoding="utf-8", newline="") as f:
        rd = csv.reader(f, delimiter="\t")
        for row in rd:
            if len(row) < 3:
                continue
            iso = row[0].strip()
            town = row[2].strip()
            if not iso or not town:
                continue
            yield iso, town

def country_name(iso2: str) -> str:
    return ISO_TO_COUNTRY.get(iso2, iso2)

def random_product_code(rng: random.Random) -> str:
    # Mix of numeric and alphanumeric codes
    patterns = [
        lambda: f"{rng.randint(10000, 99999)}",
        lambda: "".join(rng.choice("abcdefghijklmnopqrstuvwxyz0123456789") for _ in range(rng.randint(5,7))),
        lambda: f"{rng.choice('ABCDEFGHJKMNPQRSTUVWXYZ')}{rng.randint(10000,99999)}",
    ]
    return patterns[rng.randrange(len(patterns))]()

def iter_rows(tsv_path: Path, max_rows: int, seed: int = 42,
              min_products: int = 2, max_products: int = 6,
              min_repeat: int = 2, max_repeat: int = 15,
              start_date: str = "2020-01-01", years: int = 5) -> Iterator[List[str]]:
    """
    Yield header + up to max_rows data rows with blanks in ALL THREE hierarchy columns:
    Country, Town, Product Code.

    For every (iso,town) encountered (streamed), we:
      - Generate K product codes (K ∈ [min_products,max_products])
      - For each product, emit R rows (R ∈ [min_repeat,max_repeat])
         * 1st row of 1st product under the town: Country, Town, Product shown
         * 1st row of subsequent products: Country='', Town='', Product shown
         * Remaining rows in a product: Country='', Town='', Product=''
    """
    rng = random.Random(seed)
    base = datetime.fromisoformat(start_date)
    span_days = 365 * years

    yield ["Country", "Town", "Product Code", "Date", "Value"]

    prev_iso: Optional[str] = None
    prev_town: Optional[str] = None
    rows_out = 0

    for iso, town in read_geonames(tsv_path):
        ctry_full = country_name(iso)

        # country/town blanking for the *first* product row in this town
        first_country_out = "" if iso == prev_iso else ctry_full
        if iso != prev_iso:
            prev_iso = iso
            prev_town = None

        first_town_out = "" if town == prev_town else town
        if town != prev_town:
            prev_town = town

        num_products = rng.randint(min_products, max_products)
        for pi in range(num_products):
            product = random_product_code(rng)
            repeats = rng.randint(min_repeat, max_repeat)

            for rep in range(repeats):
                date = (base + timedelta(days=rng.randrange(span_days))).strftime("%Y-%m-%d")
                value = f"{rng.uniform(10.0, 5000.0):.2f}"

                if rep == 0:
                    if pi == 0:
                        # 1st product, 1st row of this town: show country & town & product
                        row = [first_country_out, first_town_out, product, date, value]
                    else:
                        # new product under same town: only product visible
                        row = ["", "", product, date, value]
                else:
                    # subsequent rows within product: all blanks in hierarchy
                    row = ["", "", "", date, value]

                yield row
                rows_out += 1
                if rows_out >= max_rows:
                    return

def write_csv(out_path: Path, rows: Iterator[List[str]]) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8", newline="") as f:
        w = csv.writer(f)
        for r in rows:
            w.writerow(r)

def write_xlsx(out_path: Path, rows: Iterator[List[str]], sheet: str = "Sheet1") -> None:
    try:
        from openpyxl import Workbook
    except ImportError:
        raise SystemExit("openpyxl not installed. Try: pip install openpyxl")
    wb = Workbook(write_only=True)
    ws = wb.create_sheet(title=sheet)
    if "Sheet" in wb.sheetnames and sheet != "Sheet":
        wb.remove(wb["Sheet"])
    for r in rows:
        ws.append(r)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    wb.save(out_path)

def main():
    ap = argparse.ArgumentParser(description="Generate hierarchy Country→Town→Product Code with all three needing fill-down.")
    ap.add_argument("--in", dest="infile", required=True, help="GeoNames allCountries.txt (TSV)")
    ap.add_argument("--out", dest="outfile", default="geo_hier_products.csv", help="Output .csv or .xlsx (default: geo_hier_products.csv)")
    ap.add_argument("--rows", type=int, default=1_000_000, help="Number of DATA rows to emit (excl. header)")
    ap.add_argument("--seed", type=int, default=42)
    ap.add_argument("--min-products", type=int, default=2)
    ap.add_argument("--max-products", type=int, default=6)
    ap.add_argument("--min-repeat", type=int, default=2)
    ap.add_argument("--max-repeat", type=int, default=15)
    ap.add_argument("--start-date", default="2020-01-01")
    ap.add_argument("--years", type=int, default=5)
    ap.add_argument("--sheet", default="Sheet1", help="Sheet name if writing .xlsx")
    args = ap.parse_args()

    tsv = Path(args.infile)
    if not tsv.exists():
        raise SystemExit(f"Input not found: {tsv}")

    rows_iter = iter_rows(
        tsv_path=tsv,
        max_rows=args.rows,
        seed=args.seed,
        min_products=args.min_products,
        max_products=args.max_products,
        min_repeat=args.min_repeat,
        max_repeat=args.max_repeat,
        start_date=args.start_date,
        years=args.years,
    )

    out = Path(args.outfile)
    if out.suffix.lower() == ".xlsx":
        write_xlsx(out, rows_iter, sheet=args.sheet)
    else:
        write_csv(out, rows_iter)

    print(f"Done. Wrote {args.rows} data rows (+ header) to {out}")

if __name__ == "__main__":
    main()

Here is my run. I use the time command to measure how long it takes.
I generate 1,000,000 rows, which is just under Excel’s 1,048,576-row limit.

time ./gen_geonames_hier.py --in allCountries.txt --out geo_hier_products.csv --rows 1000000
Done. Wrote 1000000 data rows (+ header) to geo_hier_products.csv

real	0m3,171s
user	0m3,135s
sys	0m0,036s

Just double-check that the file is as large as it should be (1,000,000 data rows plus the header line).

cat geo_hier_products.csv | wc -l
1000001

Step 2. Convert CSV to Excel

We now need to convert the synthetic CSV data to Excel.

Create a file named csv_to_excel_stream.py and add the following code.

#!/usr/bin/env python3

"""
Stream a large CSV into an Excel (.xlsx) file using XlsxWriter constant_memory.
This avoids the openpyxl write_only 'max_row=0' issue some readers run into.

Usage:
  python csv_to_excel_stream.py --in geo_hier_products.csv --out geo_hier_products.xlsx --sheet Sheet1
"""

import argparse
import csv
from pathlib import Path

def csv_to_xlsx(in_path: Path, out_path: Path, sheet_name: str = "Sheet1", encoding: str = "utf-8"):
    try:
        import xlsxwriter
    except ImportError:
        raise SystemExit("XlsxWriter not installed. Try: pip install XlsxWriter")

    # Create workbook with constant_memory to keep RAM tiny
    wb = xlsxwriter.Workbook(out_path.as_posix(), {'constant_memory': True})
    ws = wb.add_worksheet(sheet_name)

    row_idx = 0
    with in_path.open("r", encoding=encoding, newline="") as fin:
        reader = csv.reader(fin)
        for row in reader:
            ws.write_row(row_idx, 0, row)   # write entire CSV row starting at column A
            row_idx += 1
            if row_idx % 100_000 == 0:
                print(f"[CSV→XLSX] {row_idx:,} rows written...")

    wb.close()
    print(f"Done. Rows written: {row_idx:,}")
    print(f"Output: {out_path}")

def main():
    ap = argparse.ArgumentParser(description="Stream CSV → XLSX (robust dimensions, low memory).")
    ap.add_argument("--in", dest="infile", required=True, help="Input CSV file")
    ap.add_argument("--out", dest="outfile", required=True, help="Output .xlsx file")
    ap.add_argument("--sheet", default="Sheet1", help="Sheet name (default: Sheet1)")
    ap.add_argument("--encoding", default="utf-8", help="CSV encoding (default: utf-8)")
    args = ap.parse_args()

    in_path = Path(args.infile)
    out_path = Path(args.outfile)

    if not in_path.exists():
        raise SystemExit(f"Input CSV not found: {in_path}")
    if out_path.suffix.lower() != ".xlsx":
        raise SystemExit("Output must end with .xlsx")

    csv_to_xlsx(in_path, out_path, sheet_name=args.sheet, encoding=args.encoding)

if __name__ == "__main__":
    main()

Here is my run.

time ./csv_to_excel_stream.py --in geo_hier_products.csv --out geo_hier_products.xlsx --sheet Sheet1
[CSV→XLSX] 100,000 rows written...
[CSV→XLSX] 200,000 rows written...
[CSV→XLSX] 300,000 rows written...
[CSV→XLSX] 400,000 rows written...
[CSV→XLSX] 500,000 rows written...
[CSV→XLSX] 600,000 rows written...
[CSV→XLSX] 700,000 rows written...
[CSV→XLSX] 800,000 rows written...
[CSV→XLSX] 900,000 rows written...
[CSV→XLSX] 1,000,000 rows written...
Done. Rows written: 1,000,001
Output: geo_hier_products.xlsx

real	0m12,894s
user	0m12,535s
sys	0m0,359s

Stress Test

Now the fun part: flattening the hierarchy while preserving Excel row numbers and computing hashes.

Excel to Excel (with hierarchical fill-down)

time xlfilldown xlsx --infile geo_hier_products.xlsx --insheet Sheet1 --header-row 1 --fill-cols '["Country","Town","Product Code"]' --outfile filled.xlsx --outsheet Processed
✅ Wrote 'Sheet1' → filled.xlsx:Processed (cols=5; rows=1000000; if_exists=fail)

real	1m12,564s
user	1m14,276s
sys	0m0,688s
  • 1,000,000 rows processed in 72.6 seconds (~13,800 rows/sec)
  • Memory stable throughout.
  • Result: fully flattened, audit-ready dataset.

Excel to SQLite

time xlfilldown db --infile geo_hier_products.xlsx --insheet Sheet1 --header-row 1 --fill-cols '["Country","Town","Product Code"]' --db out.db --table Processed
✅ Loaded 'Sheet1' → out.db:Processed (cols=5; rows=1000000; if_exists=fail)

real	0m22,359s
user	0m24,609s
sys	0m0,169s
  • 1,000,000 rows in 22.4 seconds (~44,700 rows/sec)
  • Includes table creation and batched inserts.
  • Memory footprint flat from start to finish.
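
A quick way to sanity-check the SQLite output is to count the rows and list the columns. Only the table name Processed is assumed here, taken from the command above; the column names are whatever xlfilldown created:

#!/usr/bin/env python3
"""Sanity-check the SQLite output of the stress test."""
import sqlite3

con = sqlite3.connect("out.db")
count = con.execute("SELECT COUNT(*) FROM Processed").fetchone()[0]
print(f"rows in Processed: {count:,}")            # expect 1,000,000

# List the columns without assuming their names.
for cid, name, coltype, *_ in con.execute("PRAGMA table_info(Processed)"):
    print(f"  column {cid}: {name} ({coltype})")
con.close()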

Why This Matters

xlfilldown doesn’t load your whole workbook into RAM.
It streams, reading one row at a time, normalizing cells, filling columns intelligently, and writing output immediately.

That means:

  • You can process arbitrarily large Excel sheets (limited only by Excel’s own row cap of 1,048,576).
  • It scales gracefully on laptops, CI pipelines, or cloud jobs.
  • It’s deterministic. Same data, same hash, every time (a toy illustration of the idea follows below).
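
Here is that toy illustration: hash a canonical form of each row with SHA-256, and two runs over the same data can be compared digest by digest. The canonicalization below (tab-joining the cell text) is my own simplification, not necessarily the exact scheme xlfilldown uses.

import hashlib

def row_hash(cells):
    """Toy per-row fingerprint: tab-join the cell texts and SHA-256 the result."""
    canonical = "\t".join("" if c is None else str(c) for c in cells)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

row = ["Germany", "Berlin", "A12345", "2021-03-14", "812.50"]
print(row_hash(row))  # identical input always yields the identical digest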

Traditional Excel automation scripts (e.g., pandas or openpyxl loops) often read entire sheets into memory before processing. On large workbooks, that means slowdowns, memory spikes, and potential crashes.

xlfilldown avoids all that.
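
To make the streaming approach concrete, here is a minimal sketch of carry-forward fill-down over a row iterator. It is not xlfilldown’s actual code, just the shape of the idea: one row in, one row out, with only the last seen value per fill column held in memory. (A real hierarchy-aware implementation also resets child columns when a parent changes; that is skipped here.)

import csv
from typing import Iterable, Iterator, List, Sequence

def fill_down(rows: Iterable[List[str]], fill_idx: Sequence[int]) -> Iterator[List[str]]:
    """Carry the last non-blank value forward in the given column indexes."""
    last = {}
    for row in rows:
        for i in fill_idx:
            if row[i].strip():
                last[i] = row[i]      # a new group value: remember it
            elif i in last:
                row[i] = last[i]      # a blank cell: fill from the row above
        yield row                     # emit immediately; nothing else is buffered

# Usage sketch against the generated CSV (columns 0-2 = Country, Town, Product Code):
# with open("geo_hier_products.csv", newline="") as f:
#     reader = csv.reader(f)
#     header = next(reader)
#     for out_row in fill_down(reader, fill_idx=(0, 1, 2)):
#         ...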

The Real Benefit

When your spreadsheets flatten consistently, when hashes match, and when you can stream data straight into a database without “Excel gymnastics,” you’ve removed an entire category of failure from your workflow. It’s about trust.

Whether you’re uploading raw data to SQLite or prepping it for analysis in pandas, xlfilldown gives you clean, reproducible, and auditable results. Every. Time.
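
For example, pulling the flattened table straight into pandas for analysis takes only a couple of lines against the out.db produced above (again, only the table name Processed is taken from the earlier command):

import sqlite3
import pandas as pd

con = sqlite3.connect("out.db")
df = pd.read_sql_query("SELECT * FROM Processed", con)
con.close()

print(df.shape)   # expect 1,000,000 rows, ready for analysis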

Closing Notes

This test wasn’t run on a server. Just on a regular laptop.
The results bear out what I designed for: constant memory, strong throughput, and reliability under load.

If you’d like to reproduce the results, the generator and benchmark commands are shown above. Copy, paste, and flatten away.

pip install xlfilldown
