An LLM-Based ETL Architecture for Semantic Normalization of Unstructured Data
Author
Shreyan Gupta
Domain
AI / ML / NLP
Conference
IEEE MIT URTC 2025 (Preprint)
Date
August 2025
Abstract
Automating extract–transform–load (ETL) pipelines for scanned business documents typically demands costly, fine-tuned, layout-aware models. We present a cloud-native architecture that transforms heterogeneous documents into a unified, structured JSON schema without any model fine-tuning. Our pipeline combines off-the-shelf OCR (Azure Document Intelligence) with a schema-constrained large language model (LLM), guided by type-checked Pydantic outputs and a one-pass swap heuristic for efficient few-shot prompting. Evaluated on the FUNSD (form) and CORD (receipt) corpora, the system achieves fuzzy key–value (KV) F1 scores of 0.60 and 0.83, respectively, while processing each page in under eight seconds and for under $0.004 on standard cloud quota. Scaling to a larger LLM boosts CORD accuracy to 0.89 F1 at under $0.02 per page. The entire pipeline (code, prompts, and metric scripts) is open-sourced, enabling lightweight, fully deployable semantic ETL for small- to medium-scale workloads.
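
To illustrate the schema-constrained validation step mentioned above, the following is a minimal sketch of how type-checked Pydantic outputs can gate an LLM's JSON response. The field names (`doc_type`, `fields`) and the rejection-on-failure behavior are illustrative assumptions, not the paper's exact unified schema, which is defined in the open-sourced repository.

```python
# Minimal sketch: validating LLM output against a Pydantic schema.
# Schema fields here are hypothetical stand-ins for the paper's unified JSON schema.
from pydantic import BaseModel, ValidationError


class KVPair(BaseModel):
    key: str
    value: str


class DocumentRecord(BaseModel):
    doc_type: str          # e.g. "form" or "receipt"
    fields: list[KVPair]   # normalized key-value pairs extracted from the page


def parse_llm_output(raw_json: str) -> DocumentRecord | None:
    """Type-check the LLM's raw JSON; reject anything that violates the schema."""
    try:
        return DocumentRecord.model_validate_json(raw_json)
    except ValidationError:
        return None  # caller may re-prompt, optionally feeding the error back


if __name__ == "__main__":
    record = parse_llm_output(
        '{"doc_type": "receipt", "fields": [{"key": "total", "value": "12.50"}]}'
    )
    print(record)
```

In a pipeline of this kind, the schema serves double duty: it constrains the prompt (the model is shown the expected structure) and it filters the response (malformed output never reaches downstream storage).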