Outclassing Frontier LLMs at Extracting Information

About This Session

Accurately extracting information from documents has been a decades-old dream. Important workflows — from automated back-office processing to enterprise RAG — depend on it. LLMs promise to fulfill this dream but currently fall short: they hallucinate information, struggle with long documents, and break down on complex layouts. The solution: LLMs specialized in information extraction. In this talk, I will present: - **NuExtract** — the first LLM specialized in extracting structured information (JSON output) - **NuMarkdown** — the first reasoning OCR LLM (RAG-ready Markdown output). **These low-hallucination [open-source] models outclass frontier LLMs like GPT-5 and Gemini 2.5 while being orders of magnitude smaller**, enabling private usage. I will demonstrate the abilities of these LLMs, show how to use them at scale, and discuss what’s coming next in information extraction.