TL;DR — Key Takeaways
- 1,000,000-row stress test, real Excel limits
  Flattened a full-size workbook (1,000,000 rows × 5 cols) on a laptop without a single memory spike.
- Performance
  Excel → Excel: 72.6 s total ≈ 13,800 rows/sec
  Excel → SQLite: 22.4 s total ≈ 44,700 rows/sec
- Constant-memory design
  xlfilldown streams row-by-row and never loads the workbook into RAM.
  Memory stayed flat from start to finish in both tests.
- Deterministic + auditable
  Each run preserves Excel row numbers and computes per-row SHA-256 hashes.
  Same input → same hashes → verifiable reproducibility.
- Why it matters
  You can safely process any Excel file up to Excel’s 1,048,576-row cap:
  no pandas-style memory blowups, no VBA hacks, no data leakage across parents.
- Result
  ✅ Fully flattened, hierarchy-aware, audit-ready datasets
  ✅ Runs anywhere (laptop, CI, or cloud)
  ✅ Predictable speed and zero RAM drift
xlfilldown
When I first started building xlfilldown, my goal wasn’t just correctness; it was speed and predictability. I wanted something that could chew through large Excel files without blowing up your memory, no matter how messy or hierarchical the data looked.
So I ran a real-world stress test: 1,000,000 rows of hierarchical data, filled down and streamed into both Excel and SQLite, all from a modest laptop.
Here’s how it went. I provide code so you can play along.
Data Preparation
Step 1. Generate test data.
I generated a synthetic dataset based on GeoNames: countries, towns, and product codes arranged in a hierarchical format (think grouped Excel reports).
Each row belongs to a region and country, but many cells are left blank, mimicking the kind of spreadsheets analysts create every day.
https://download.geonames.org/export/zip/
Download and extract the file allCountries.zip if you want to play along.
Keep this file and the following scripts in the same directory.
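For orientation, the first few generated rows look roughly like this (illustrative values only; the actual towns, product codes, dates, and amounts depend on the GeoNames file and the random seed):

Country,Town,Product Code,Date,Value
Andorra,Canillo,K48213,2021-05-14,2314.07
,,,2023-11-02,187.55
,,,2020-08-19,940.10
,,x7p29qa,2022-03-30,77.25
,,,2024-06-08,4123.96

Only the first row of a town carries Country and Town, and only the first row of each product carries the Product Code; everything else is blank and will need filling down.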
Create the file gen_geonames_hier.py
#!/usr/bin/env python3
import argparse
import csv
import random
from datetime import datetime, timedelta
from pathlib import Path
from typing import Iterator, List, Optional, Tuple
# -------- ISO 3166-1 alpha-2 → Country name (baked-in) -----------------------
ISO_TO_COUNTRY = {
"AD":"Andorra","AE":"United Arab Emirates","AF":"Afghanistan","AG":"Antigua and Barbuda","AI":"Anguilla",
"AL":"Albania","AM":"Armenia","AO":"Angola","AR":"Argentina","AS":"American Samoa","AT":"Austria","AU":"Australia",
"AW":"Aruba","AZ":"Azerbaijan","BA":"Bosnia and Herzegovina","BB":"Barbados","BD":"Bangladesh","BE":"Belgium",
"BF":"Burkina Faso","BG":"Bulgaria","BH":"Bahrain","BI":"Burundi","BJ":"Benin","BM":"Bermuda","BN":"Brunei Darussalam",
"BO":"Bolivia","BR":"Brazil","BS":"Bahamas","BT":"Bhutan","BW":"Botswana","BY":"Belarus","BZ":"Belize","CA":"Canada",
"CD":"Democratic Republic of the Congo","CF":"Central African Republic","CG":"Republic of the Congo","CH":"Switzerland",
"CI":"Côte d’Ivoire","CL":"Chile","CM":"Cameroon","CN":"China","CO":"Colombia","CR":"Costa Rica","CU":"Cuba",
"CV":"Cabo Verde","CY":"Cyprus","CZ":"Czechia","DE":"Germany","DJ":"Djibouti","DK":"Denmark","DM":"Dominica",
"DO":"Dominican Republic","DZ":"Algeria","EC":"Ecuador","EE":"Estonia","EG":"Egypt","ER":"Eritrea","ES":"Spain",
"ET":"Ethiopia","FI":"Finland","FJ":"Fiji","FM":"Micronesia","FR":"France","GA":"Gabon","GB":"United Kingdom",
"GD":"Grenada","GE":"Georgia","GH":"Ghana","GL":"Greenland","GM":"Gambia","GN":"Guinea","GQ":"Equatorial Guinea",
"GR":"Greece","GT":"Guatemala","GU":"Guam","GW":"Guinea-Bissau","GY":"Guyana","HK":"Hong Kong","HN":"Honduras",
"HR":"Croatia","HT":"Haiti","HU":"Hungary","ID":"Indonesia","IE":"Ireland","IL":"Israel","IN":"India","IQ":"Iraq",
"IR":"Iran","IS":"Iceland","IT":"Italy","JM":"Jamaica","JO":"Jordan","JP":"Japan","KE":"Kenya","KG":"Kyrgyzstan",
"KH":"Cambodia","KI":"Kiribati","KM":"Comoros","KN":"Saint Kitts and Nevis","KP":"North Korea","KR":"South Korea",
"KW":"Kuwait","KY":"Cayman Islands","KZ":"Kazakhstan","LA":"Laos","LB":"Lebanon","LC":"Saint Lucia",
"LI":"Liechtenstein","LK":"Sri Lanka","LR":"Liberia","LS":"Lesotho","LT":"Lithuania","LU":"Luxembourg","LV":"Latvia",
"LY":"Libya","MA":"Morocco","MC":"Monaco","MD":"Moldova","ME":"Montenegro","MG":"Madagascar","MH":"Marshall Islands",
"MK":"North Macedonia","ML":"Mali","MM":"Myanmar","MN":"Mongolia","MO":"Macao","MR":"Mauritania","MT":"Malta",
"MU":"Mauritius","MV":"Maldives","MW":"Malawi","MX":"Mexico","MY":"Malaysia","MZ":"Mozambique","NA":"Namibia",
"NE":"Niger","NG":"Nigeria","NI":"Nicaragua","NL":"Netherlands","NO":"Norway","NP":"Nepal","NR":"Nauru",
"NZ":"New Zealand","OM":"Oman","PA":"Panama","PE":"Peru","PG":"Papua New Guinea","PH":"Philippines","PK":"Pakistan",
"PL":"Poland","PT":"Portugal","PW":"Palau","PY":"Paraguay","QA":"Qatar","RO":"Romania","RS":"Serbia","RU":"Russia",
"RW":"Rwanda","SA":"Saudi Arabia","SB":"Solomon Islands","SC":"Seychelles","SD":"Sudan","SE":"Sweden","SG":"Singapore",
"SI":"Slovenia","SK":"Slovakia","SL":"Sierra Leone","SM":"San Marino","SN":"Senegal","SO":"Somalia","SR":"Suriname",
"SS":"South Sudan","ST":"Sao Tome and Principe","SV":"El Salvador","SY":"Syria","SZ":"Eswatini","TD":"Chad","TG":"Togo",
"TH":"Thailand","TJ":"Tajikistan","TL":"Timor-Leste","TM":"Turkmenistan","TN":"Tunisia","TO":"Tonga","TR":"Turkey",
"TT":"Trinidad and Tobago","TV":"Tuvalu","TZ":"Tanzania","UA":"Ukraine","UG":"Uganda","US":"United States",
"UY":"Uruguay","UZ":"Uzbekistan","VA":"Vatican City","VC":"Saint Vincent and the Grenadines","VE":"Venezuela",
"VN":"Vietnam","VU":"Vanuatu","WS":"Samoa","YE":"Yemen","ZA":"South Africa","ZM":"Zambia","ZW":"Zimbabwe"
}
# -----------------------------------------------------------------------------
def read_geonames(tsv_path: Path) -> Iterator[Tuple[str, str]]:
"""Stream (iso, town) from GeoNames allCountries.txt (tab-delimited)."""
with tsv_path.open("r", encoding="utf-8", newline="") as f:
rd = csv.reader(f, delimiter="\t")
for row in rd:
if len(row) < 3:
continue
iso = row[0].strip()
town = row[2].strip()
if not iso or not town:
continue
yield iso, town
def country_name(iso2: str) -> str:
return ISO_TO_COUNTRY.get(iso2, iso2)
def random_product_code(rng: random.Random) -> str:
# Mix of numeric and alphanumeric codes
patterns = [
lambda: f"{rng.randint(10000, 99999)}",
lambda: "".join(rng.choice("abcdefghijklmnopqrstuvwxyz0123456789") for _ in range(rng.randint(5,7))),
lambda: f"{rng.choice('ABCDEFGHJKMNPQRSTUVWXYZ')}{rng.randint(10000,99999)}",
]
return patterns[rng.randrange(len(patterns))]()
def iter_rows(tsv_path: Path, max_rows: int, seed: int = 42,
min_products: int = 2, max_products: int = 6,
min_repeat: int = 2, max_repeat: int = 15,
start_date: str = "2020-01-01", years: int = 5) -> Iterator[List[str]]:
"""
Yield header + up to max_rows data rows with blanks in ALL THREE hierarchy columns:
Country, Town, Product Code.
For every (iso,town) encountered (streamed), we:
- Generate K product codes (K ∈ [min_products,max_products])
- For each product, emit R rows (R ∈ [min_repeat,max_repeat])
* 1st row of 1st product under the town: Country, Town, Product shown
* 1st row of subsequent products: Country='', Town='', Product shown
* Remaining rows in a product: Country='', Town='', Product=''
"""
rng = random.Random(seed)
base = datetime.fromisoformat(start_date)
span_days = 365 * years
yield ["Country", "Town", "Product Code", "Date", "Value"]
prev_iso: Optional[str] = None
prev_town: Optional[str] = None
rows_out = 0
for iso, town in read_geonames(tsv_path):
ctry_full = country_name(iso)
# country/town blanking for the *first* product row in this town
first_country_out = "" if iso == prev_iso else ctry_full
if iso != prev_iso:
prev_iso = iso
prev_town = None
first_town_out = "" if town == prev_town else town
if town != prev_town:
prev_town = town
num_products = rng.randint(min_products, max_products)
for pi in range(num_products):
product = random_product_code(rng)
repeats = rng.randint(min_repeat, max_repeat)
for rep in range(repeats):
date = (base + timedelta(days=rng.randrange(span_days))).strftime("%Y-%m-%d")
value = f"{rng.uniform(10.0, 5000.0):.2f}"
if rep == 0:
if pi == 0:
# 1st product, 1st row of this town: show country & town & product
row = [first_country_out, first_town_out, product, date, value]
else:
# new product under same town: only product visible
row = ["", "", product, date, value]
else:
# subsequent rows within product: all blanks in hierarchy
row = ["", "", "", date, value]
yield row
rows_out += 1
if rows_out >= max_rows:
return
def write_csv(out_path: Path, rows: Iterator[List[str]]) -> None:
out_path.parent.mkdir(parents=True, exist_ok=True)
with out_path.open("w", encoding="utf-8", newline="") as f:
w = csv.writer(f)
for r in rows:
w.writerow(r)
def write_xlsx(out_path: Path, rows: Iterator[List[str]], sheet: str = "Sheet1") -> None:
try:
from openpyxl import Workbook
except ImportError:
raise SystemExit("openpyxl not installed. Try: pip install openpyxl")
wb = Workbook(write_only=True)
ws = wb.create_sheet(title=sheet)
if "Sheet" in wb.sheetnames and sheet != "Sheet":
wb.remove(wb["Sheet"])
for r in rows:
ws.append(r)
out_path.parent.mkdir(parents=True, exist_ok=True)
wb.save(out_path)
def main():
ap = argparse.ArgumentParser(description="Generate hierarchy Country→Town→Product Code with all three needing fill-down.")
ap.add_argument("--in", dest="infile", required=True, help="GeoNames allCountries.txt (TSV)")
ap.add_argument("--out", dest="outfile", default="geo_hier_products.csv", help="Output .csv or .xlsx (default: geo_hier_products.csv)")
ap.add_argument("--rows", type=int, default=1_000_000, help="Number of DATA rows to emit (excl. header)")
ap.add_argument("--seed", type=int, default=42)
ap.add_argument("--min-products", type=int, default=2)
ap.add_argument("--max-products", type=int, default=6)
ap.add_argument("--min-repeat", type=int, default=2)
ap.add_argument("--max-repeat", type=int, default=15)
ap.add_argument("--start-date", default="2020-01-01")
ap.add_argument("--years", type=int, default=5)
ap.add_argument("--sheet", default="Sheet1", help="Sheet name if writing .xlsx")
args = ap.parse_args()
tsv = Path(args.infile)
if not tsv.exists():
raise SystemExit(f"Input not found: {tsv}")
rows_iter = iter_rows(
tsv_path=tsv,
max_rows=args.rows,
seed=args.seed,
min_products=args.min_products,
max_products=args.max_products,
min_repeat=args.min_repeat,
max_repeat=args.max_repeat,
start_date=args.start_date,
years=args.years,
)
out = Path(args.outfile)
if out.suffix.lower() == ".xlsx":
write_xlsx(out, rows_iter, sheet=args.sheet)
else:
write_csv(out, rows_iter)
print(f"Done. Wrote {args.rows} data rows (+ header) to {out}")
if __name__ == "__main__":
main()
Here is my run. I use the time command to measure how long it takes.
I generate 1 million rows, which is just under Excel’s row limit.
time ./gen_geonames_hier.py --in allCountries.txt --out geo_hier_products.csv --rows 1000000
Done. Wrote 1000000 data rows (+ header) to geo_hier_products.csv
real 0m3,171s
user 0m3,135s
sys 0m0,036s
Just double-check that the file is as large as it should be (1,000,000 data rows plus the header line).
cat geo_hier_products.csv | wc -l
1000001
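If you also want to confirm the hierarchical blanking, a short Python check counts how many data rows have an empty Country cell (a quick sanity check, not part of the benchmark; the exact count depends on the seed):

import csv

# Count data rows whose Country cell is blank, i.e. rows that will need fill-down.
blank_country = 0
total = 0
with open("geo_hier_products.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        total += 1
        if row[0] == "":
            blank_country += 1
print(f"{blank_country:,} of {total:,} data rows have a blank Country cell")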
Step 2. Convert CSV to Excel
We now need to convert the synthetic CSV data to Excel.
Create a file named csv_to_excel_stream.py and add the following code.
#!/usr/bin/env python3
"""
Stream a large CSV into an Excel (.xlsx) file using XlsxWriter constant_memory.
This avoids the openpyxl write_only 'max_row=0' issue some readers run into.
Usage:
python csv_to_excel_stream.py --in geo_hier_products.csv --out geo_hier_products.xlsx --sheet Sheet1
"""
import argparse
import csv
from pathlib import Path
def csv_to_xlsx(in_path: Path, out_path: Path, sheet_name: str = "Sheet1", encoding: str = "utf-8"):
try:
import xlsxwriter
except ImportError:
raise SystemExit("XlsxWriter not installed. Try: pip install XlsxWriter")
# Create workbook with constant_memory to keep RAM tiny
wb = xlsxwriter.Workbook(out_path.as_posix(), {'constant_memory': True})
ws = wb.add_worksheet(sheet_name)
row_idx = 0
with in_path.open("r", encoding=encoding, newline="") as fin:
reader = csv.reader(fin)
for row in reader:
ws.write_row(row_idx, 0, row) # write entire CSV row starting at column A
row_idx += 1
if row_idx % 100_000 == 0:
print(f"[CSV→XLSX] {row_idx:,} rows written...")
wb.close()
print(f"Done. Rows written: {row_idx:,}")
print(f"Output: {out_path}")
def main():
ap = argparse.ArgumentParser(description="Stream CSV → XLSX (robust dimensions, low memory).")
ap.add_argument("--in", dest="infile", required=True, help="Input CSV file")
ap.add_argument("--out", dest="outfile", required=True, help="Output .xlsx file")
ap.add_argument("--sheet", default="Sheet1", help="Sheet name (default: Sheet1)")
ap.add_argument("--encoding", default="utf-8", help="CSV encoding (default: utf-8)")
args = ap.parse_args()
in_path = Path(args.infile)
out_path = Path(args.outfile)
if not in_path.exists():
raise SystemExit(f"Input CSV not found: {in_path}")
if out_path.suffix.lower() != ".xlsx":
raise SystemExit("Output must end with .xlsx")
csv_to_xlsx(in_path, out_path, sheet_name=args.sheet, encoding=args.encoding)
if __name__ == "__main__":
main()
Here is the run.
time ./csv_to_excel_stream.py --in geo_hier_products.csv --out geo_hier_products.xlsx --sheet Sheet1
[CSV→XLSX] 100,000 rows written...
[CSV→XLSX] 200,000 rows written...
[CSV→XLSX] 300,000 rows written...
[CSV→XLSX] 400,000 rows written...
[CSV→XLSX] 500,000 rows written...
[CSV→XLSX] 600,000 rows written...
[CSV→XLSX] 700,000 rows written...
[CSV→XLSX] 800,000 rows written...
[CSV→XLSX] 900,000 rows written...
[CSV→XLSX] 1,000,000 rows written...
Done. Rows written: 1,000,001
Output: geo_hier_products.xlsx
real 0m12,894s
user 0m12,535s
sys 0m0,359s
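Before moving on, you can confirm the workbook’s dimensions without loading it into memory. openpyxl’s read-only mode reads the recorded sheet dimensions cheaply (a quick sanity check; it assumes the dimension record is present, which XlsxWriter writes by default):

from openpyxl import load_workbook

# Read-only mode: the 1,000,001-row sheet is not loaded into RAM.
wb = load_workbook("geo_hier_products.xlsx", read_only=True)
ws = wb["Sheet1"]
print(f"rows={ws.max_row}, cols={ws.max_column}")  # expect rows=1000001, cols=5
wb.close()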
Stress Test
Now the fun part: flattening the hierarchy while preserving Excel row numbers and computing hashes.
Excel to Excel (with hierarchical fill down)
time xlfilldown xlsx --infile geo_hier_products.xlsx --insheet Sheet1 --header-row 1 --fill-cols '["Country","Town","Product Code"]' --outfile filled.xlsx --outsheet Processed
✅ Wrote 'Sheet1' → filled.xlsx:Processed (cols=5; rows=1000000; if_exists=fail)
real 1m12,564s
user 1m14,276s
sys 0m0,688s
- 1,000,000 rows processed in 72.6 seconds (~13,800 rows/sec)
- Memory stable throughout.
- Result: fully flattened, audit-ready dataset.
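To see the effect, peek at the first few rows of filled.xlsx and confirm that Country, Town, and Product Code are now populated on every data row (a minimal check; any extra audit columns depend on the options you pass to xlfilldown):

from itertools import islice
from openpyxl import load_workbook

# Stream the first few rows of the flattened workbook without loading it all.
wb = load_workbook("filled.xlsx", read_only=True)
ws = wb["Processed"]
for row in islice(ws.iter_rows(values_only=True), 6):
    print(row)
wb.close()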
Excel to SQLite
time xlfilldown db --infile geo_hier_products.xlsx --insheet Sheet1 --header-row 1 --fill-cols '["Country","Town","Product Code"]' --db out.db --table Processed
✅ Loaded 'Sheet1' → out.db:Processed (cols=5; rows=1000000; if_exists=fail)
real 0m22,359s
user 0m24,609s
sys 0m0,169s
- 1,000,000 rows in 22.4 seconds (~44,700 rows/sec)
- Includes table creation and batched inserts.
- Memory footprint flat from start to finish.
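A quick way to verify the load is to query the table directly (a minimal sketch; only the table name Processed and the row count are assumed here, since the exact column layout depends on your xlfilldown options):

import sqlite3

# Confirm the row count and inspect the schema xlfilldown created.
con = sqlite3.connect("out.db")
count, = con.execute('SELECT COUNT(*) FROM "Processed"').fetchone()
print(f"rows in Processed: {count:,}")  # expect 1,000,000
for _, name, coltype, *_ in con.execute('PRAGMA table_info("Processed")'):
    print(name, coltype)
con.close()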
Why This Matters
xlfilldown doesn’t load your whole workbook into RAM.
It streams, reading one row at a time, normalizing cells, filling the hierarchy columns, and writing output immediately; the general pattern is sketched below.
That means:
- You can process arbitrarily large Excel sheets (limited only by Excel’s own row cap of 1,048,576).
- It scales gracefully on laptops, CI pipelines, or cloud jobs.
- It’s deterministic. Same data, same hash, every time.
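To make the idea concrete, here is a minimal sketch of the general stream-and-fill-down pattern using openpyxl’s read-only and write-only modes. It illustrates the approach only and is not xlfilldown’s implementation; in particular, this naive per-column fill omits xlfilldown’s guard against leaking values across parent groups, its row-number preservation, and its hashing:

from openpyxl import Workbook, load_workbook

def fill_down(in_path, out_path, fill_cols=("Country", "Town", "Product Code")):
    # Illustrative only: stream rows, carry forward the last non-blank value in each
    # hierarchy column, and write every filled row immediately so memory stays flat.
    src = load_workbook(in_path, read_only=True)
    ws_in = src["Sheet1"]
    out = Workbook(write_only=True)
    ws_out = out.create_sheet("Processed")

    rows = ws_in.iter_rows(values_only=True)
    header = list(next(rows))
    ws_out.append(header)
    fill_idx = [header.index(c) for c in fill_cols]
    carried = {i: None for i in fill_idx}  # last seen value per hierarchy column

    for row in rows:
        row = list(row)
        for i in fill_idx:
            if row[i] in (None, ""):
                row[i] = carried[i]   # blank cell: reuse the parent value
            else:
                carried[i] = row[i]   # non-blank cell: a new group starts here
        ws_out.append(row)

    out.save(out_path)
    src.close()

fill_down("geo_hier_products.xlsx", "filled_sketch.xlsx")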
Traditional Excel automation scripts (e.g., pandas or openpyxl loops) often read entire sheets into memory before processing. On large workbooks, that means slowdowns, memory spikes, and potential crashes.
xlfilldown avoids all that.
The Real Benefit
When your spreadsheets flatten consistently, when hashes match, and when you can stream data straight into a database without “Excel gymnastics,” you’ve removed an entire category of failure from your workflow. It’s about trust.
Whether you’re uploading raw data to SQLite or prepping it for analysis in pandas, xlfilldown gives you clean, reproducible, and auditable results. Every. Time.
Closing Notes
This test wasn’t run on a server. Just on a regular laptop.
The results prove what I designed for: constant memory, strong throughput, and reliability under load.
If you’d like to reproduce the results, the generator and benchmark commands are shown above. Copy, paste, and flatten away.
pip install xlfilldown
