Parsing the Unparseable: Adaptive BOM Ingestion for Customer Spreadsheets

Customers send spreadsheets that look nothing alike. Hardcoded column indices fail on day two. Here is the four-step heuristic pipeline that handles anything.

Published: March 16, 2026
Author: Hrushiekesh Kanjula Reddy
Read time: ~6 min
Category: engineering

#python #data-engineering #excel #bom #heuristics #manufacturing

Adaptive BOM ingestion handling unpredictable customer spreadsheets

The first BOM parser I wrote assumed data started on row 1, column A. It worked for exactly one customer. The second customer's spreadsheet had a logo in rows 1–4, headers in row 5, and data from row 6. The third customer merged cells. The fourth used a different header name for every column. By the fifth customer, I had a choice: maintain five separate parsers, or build one adaptive engine that could handle all of them.

The adaptive engine is harder to build and infinitely more maintainable. Here is how it works.

Why Hardcoded Parsers Fail

A BOM (Bill of Materials) spreadsheet communicates the same information — reference designator, quantity, part number, manufacturer, description — but there is no enforced standard for how that information is laid out. Customers use their own ERP templates. Engineers export from different CAD tools. Procurement teams add custom columns. The result is a family of files that are semantically identical but structurally diverse.

Hardcoded parsers fail because they encode assumptions that are only true for one customer. df.iloc[3:, 2] assumes data starts at row 3, column 2. The moment a customer adds a revision note at the top or exports from a different tool, the parser silently misreads the file — often without raising an exception, just producing garbage output.

The cost of silent misreads in manufacturing is high. A misidentified component reference leads to a misplaced part on a PCB, which leads to a field return. The parser needs to either get it right or fail loudly — never silently wrong.

Step 1: Density Auto-Detection

Instead of assuming data starts at a fixed row, the density auto-detector reads the first 30 rows and scores each one by the density of BOM-relevant keywords it contains. Keywords are drawn from a curated list: ["qty", "quantity", "ref", "designator", "part", "number", "description", "manufacturer", "value", "package"].

def detect_header_row(df_raw: pd.DataFrame) -> int:
    keyword_set = {"qty", "quantity", "ref", "designator", 
                   "part", "number", "description", "value"}
    scores = []
    for i, row in df_raw.head(30).iterrows():
        row_text = " ".join(str(v).lower() for v in row if pd.notna(v))
        score = sum(kw in row_text for kw in keyword_set)
        scores.append((score, i))
    return max(scores, key=lambda x: x[0])[1]

The row with the highest keyword density is the header row. This correctly handles logos, revision tables, title blocks, and any other preamble that customers embed above their data.

Density scoring across rows to automatically locate the header row

Step 2: Weighted Column Scoring

Once the header row is identified, each column needs to be mapped to an internal schema field. The header name is a strong signal — a column called "Qty" is almost certainly the quantity field — but it's not always available (some exports use numeric column identifiers) and not always reliable (some customers use non-standard names).

The weighted column scorer combines two signals: header name similarity (fuzzy-matched against a dictionary of known names for each field) and data pattern matching (a column of strings like "C101", "R47", "U3" is almost certainly a reference designator column regardless of its header name).

Each column gets a confidence score for each possible schema field. Assignments are made greedily from highest confidence to lowest, preventing the same field from being assigned twice. Columns that score below a confidence threshold are flagged as unknown and excluded from the import.

Step 3: Array Dimensionality Handling

This one bit me hard in production. When reading Excel files with xlwings, cells that contain a single value return a scalar in Python. Cells in a range return a list. A range with one row returns a 1D list. A range with multiple rows returns a 2D list of lists.

If your code assumes 2D and receives 1D, you get an index error. If your code assumes 1D and receives 2D, you iterate over inner lists as if they were values. The fix is a normalization wrapper that coerces any cell range output to a consistent 2D structure before processing:

def normalize_xlwings_output(raw) -> list[list]:
    if not isinstance(raw, (list, tuple)):
        return [[raw]]
    if raw and not isinstance(raw[0], (list, tuple)):
        return [[v] for v in raw]
    return [list(row) for row in raw]

One function call at the ingestion boundary eliminates an entire class of runtime errors.

xlwings array dimensionality: scalar, 1D, and 2D output cases normalized to 2D

Step 4: Component Deduplication

Customer BOMs frequently contain duplicates — the same reference designator listed twice, or the same part number across multiple line items that should be consolidated. The final ingestion step runs a deduplication pass that:

Groups rows by reference designator
For groups with conflicting part numbers, flags the conflict for human review
For groups with matching part numbers, merges the quantities
Assigns a unique internal BOM line ID to each resolved entry

The deduplication pass produces a clean, unique-keyed dataset ready for the normalization pipeline described in the previous post. The two pipelines are intentionally separate — ingestion handles structure, normalization handles content.

Confidence Scores and the Rejection Queue

Every BOM import produces a confidence report: per-column mapping confidence, header detection confidence, and an overall import confidence score. Imports above 90% confidence are auto-approved and passed to normalization. Imports between 70–90% are flagged for engineer review with the specific low-confidence decisions highlighted. Imports below 70% are rejected with a diagnostic report explaining what the system couldn't figure out.

Import confidence scoring and the three-tier approval pipeline

This tiered approach means engineers only see files that actually need human judgment — the system handles the routine cases automatically and escalates the genuinely ambiguous ones. In practice, approximately 80% of BOM imports auto-approve on first pass. The 20% that need review are almost always legacy files from customers who haven't updated their ERP export templates in a decade.

The adaptive ingestion engine is one of the highest-leverage investments in the assembly hub. It's invisible when it works — which is most of the time — and saves hours of manual reformatting labor on every exception.

← All posts