An LLM-Based ETL Architecture for Semantic Normalization of Unstructured Data
Author
Shreyan Gupta
Domain
AI / ML / NLP
Conference
IEEE MIT URTC 2025 (Preprint)
Date
August 2025
Abstract
Automating extract–transform–load (ETL) pipelines for scanned business documents typically demands costly, fine-tuned, layout-aware models. We present a cloud-native architecture that transforms heterogeneous documents into a unified, structured JSON schema without any model fine-tuning. Our pipeline combines off-the-shelf OCR (Azure Document Intelligence) with a schema-constrained large language model (LLM), guided by type-checked Pydantic outputs and a one-pass swap heuristic for efficient few-shot prompting. Evaluated on the FUNSD (form) and CORD (receipt) corpora, the system achieves fuzzy key–value (KV) F1 scores of 0.60 and 0.83, respectively, while processing each page in under eight seconds and for under $0.004 on standard cloud quota. Scaling to a larger LLM boosts CORD accuracy to 0.89 F1 at under $0.02 per page. The entire pipeline (code, prompts, and metric scripts) is open-sourced, enabling lightweight, fully deployable semantic ETL for small- to medium-scale workloads.
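
To illustrate the schema-constrained validation step mentioned above, the following is a minimal sketch of how type-checked Pydantic outputs can gate an LLM's JSON response. The field names (`doc_type`, `fields`) and the rejection-on-failure behavior are illustrative assumptions, not the paper's exact unified schema, which is defined in the open-sourced repository.

```python
# Minimal sketch: validating LLM output against a Pydantic schema.
# Schema fields here are hypothetical stand-ins for the paper's unified JSON schema.
from pydantic import BaseModel, ValidationError


class KVPair(BaseModel):
    key: str
    value: str


class DocumentRecord(BaseModel):
    doc_type: str          # e.g. "form" or "receipt"
    fields: list[KVPair]   # normalized key-value pairs extracted from the page


def parse_llm_output(raw_json: str) -> DocumentRecord | None:
    """Type-check the LLM's raw JSON; reject anything that violates the schema."""
    try:
        return DocumentRecord.model_validate_json(raw_json)
    except ValidationError:
        return None  # caller may re-prompt, optionally feeding the error back


if __name__ == "__main__":
    record = parse_llm_output(
        '{"doc_type": "receipt", "fields": [{"key": "total", "value": "12.50"}]}'
    )
    print(record)
```

In a pipeline of this kind, the schema serves double duty: it constrains the prompt (the model is shown the expected structure) and it filters the response (malformed output never reaches downstream storage).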