
Document Intelligence with Docling + GenAI

Context

I built an end‑to‑end pipeline that ingests messy PDFs and office docs, normalizes structure with Docling, and answers domain questions via GenAI.

Case Study Summary

Domain: Document AI / RAG
Focus: Reliable parsing, chunking, retrieval, and answer synthesis
Primary Tools: Docling, Python, vector DB, OpenAI/GenAI models

Highlights:
- Robust structure extraction: sections, tables, figures, and references
- High‑signal chunking with metadata for faithful retrieval
- Guardrails: citation links and refusal patterns to reduce hallucinations
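The citation and refusal guardrail can be illustrated with a small check that runs after generation. This is a minimal sketch, not the pipeline's actual code: the `Chunk` type, the bracketed `[chunk_id]` citation format, and the refusal message are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str  # hypothetical anchor format, e.g. "policy.pdf#p4"
    text: str


# Assumed refusal wording; a real system would tune this message.
REFUSAL = "I can't answer that from the provided documents."

# Assumed citation syntax: an identifier wrapped in square brackets.
CITATION = re.compile(r"\[([^\[\]]+)\]")


def enforce_grounding(answer: str, retrieved: list[Chunk]) -> str:
    """Return the answer only if every citation points at a retrieved chunk."""
    allowed = {chunk.chunk_id for chunk in retrieved}
    cited = set(CITATION.findall(answer))
    # Refuse when nothing is cited, or when any citation is outside the
    # set of chunks that were actually retrieved for this question.
    if not cited or not cited <= allowed:
        return REFUSAL
    return answer
```

A check like this only verifies that citations resolve to retrieved chunks. It does not prove the cited text supports the claim, so it complements prompt-level grounding rather than replacing it.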

Enterprise documents are noisy and inconsistent. The system emphasizes resilient parsing and traceable answers.

Challenge

Extract authoritative answers from heterogeneous documents while preserving traceability back to the source.

Approach

  1. Parsing: Use Docling to extract clean text, headings, tables, and figure captions.
  2. Normalization: Convert to a unified schema; preserve layout metadata for context.
  3. Chunking: Adaptive windowing based on sections/tables; attach source anchors.
  4. Retrieval: Index in a vector database; hybrid BM25 + dense retrieval.
  5. Generation: Prompt the model to answer with citations; enforce grounding and refuse when the retrieved context does not support an answer.
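Steps 1 and 2 can be sketched as follows. The Docling call uses the library's `DocumentConverter` and `export_to_markdown()`; everything after that is an assumption for illustration: the `Section` type, the `split_sections` helper, and the choice to normalize via Markdown headings rather than Docling's richer document model.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    level: int  # heading depth: 1 for "#", 2 for "##", ...
    lines: list[str] = field(default_factory=list)


HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def parse_to_markdown(path: str) -> str:
    """Convert a PDF or office document to Markdown with Docling."""
    # Imported lazily so the normalization helper works without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(path)
    return result.document.export_to_markdown()


def split_sections(markdown: str) -> list[Section]:
    """Group Markdown lines under their nearest preceding heading."""
    sections = [Section(title="(preamble)", level=0)]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append(Section(title=match.group(2), level=len(match.group(1))))
        else:
            sections[-1].lines.append(line)
    # Drop the synthetic preamble when no content precedes the first heading.
    if not any(line.strip() for line in sections[0].lines):
        sections = sections[1:]
    return sections
```

This simplified splitter treats any line starting with `#` as a heading, so a `#` comment inside a fenced code block would be misread. A production normalizer would more likely walk Docling's structured output directly.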

Results & Impact

  • Faster time‑to‑answer for policy/technical queries
  • Higher confidence in answers through inline citations to the original pages and tables
  • Portable pipeline deployable to cloud or on‑prem environments

Solution Overview

ingest/
  doc_watch/             # new files
  parsers/docling.py     # structure extraction
pipeline/
  normalize.py           # schema + metadata
  chunk.py               # adaptive windows
  index.py               # vector + keyword index
serve/
  query.py               # retrieve + generate + citations
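As an illustration of what the "adaptive windows" in chunk.py might look like, the sketch below packs text blocks into chunks that never cross a section boundary, keeps each table as its own chunk, and records section and page range as the source anchor. The `Block` and `Chunk` types and the `max_chars` budget are hypothetical; the real module may size windows by tokens or by other rules.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str     # "text" or "table"
    text: str
    page: int
    section: str


@dataclass
class Chunk:
    text: str
    section: str            # source anchor: section title
    pages: tuple[int, int]  # source anchor: first and last page
    kind: str


def chunk_blocks(blocks: list[Block], max_chars: int = 800) -> list[Chunk]:
    """Pack blocks into section-bounded chunks; tables stay whole."""
    chunks: list[Chunk] = []
    buf: list[Block] = []

    def flush() -> None:
        # Emit the buffered text blocks as one chunk, then reset the buffer.
        if buf:
            chunks.append(Chunk(
                text="\n".join(b.text for b in buf),
                section=buf[0].section,
                pages=(buf[0].page, buf[-1].page),
                kind="text",
            ))
            buf.clear()

    for block in blocks:
        if block.kind == "table":
            # A table is never split or merged with surrounding prose.
            flush()
            chunks.append(Chunk(block.text, block.section, (block.page, block.page), "table"))
            continue
        size = sum(len(b.text) for b in buf)
        if buf and (block.section != buf[0].section or size + len(block.text) > max_chars):
            flush()
        buf.append(block)
    flush()
    return chunks
```

Keeping the anchor on every chunk is what lets the generation step cite a page or table without a second lookup.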

Tech Stack

  • Docling, Python
  • OpenAI/GenAI models
  • Vector DB (e.g., FAISS/Pinecone), BM25
  • FastAPI for API surface, Docker for packaging
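One common way to combine a BM25 ranking with a dense-vector ranking is reciprocal rank fusion (RRF). The case study does not say which fusion method was used, so treat this as one plausible option rather than the method behind the results. The function name and the `k = 60` constant (a conventional default for RRF) are choices made for the sketch.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks add more.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c1", "c2", "c3"]   # hypothetical keyword results
dense_hits = ["c2", "c4", "c1"]  # hypothetical vector results
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

RRF works on ranks alone, so BM25 scores and cosine similarities never need to be normalized onto a shared scale.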

Additional Context

  • Timeline: 6–8 weeks to MVP, ongoing hardening
  • Role: ML/Document AI Engineer
  • Deliverables: ingestion → retrieval → generation pipeline with evaluation harness
  • Goal: trustworthy document QA


    Let’s design a Docling + GenAI pipeline for your corpus with citations-first answers.
