Blog


X-ROCKET to the moon


Felix Brunner


These are the voyages of the encoder model X-ROCKET. Its continuing mission: to explore strange, new time series; to seek out new explanations and new interpretations; to boldly seek meaning where no one has sought before. Previously in this series, we completed our training in the basics of time series classification in part one and learned how to operate X-ROCKET in part two. But enough with all the talking, it is time to fire up the X-ROCKET engines and see this model in action. Let's rocket!

Data, prepare for takeoff!

We will use the "AsphaltPavementTypeCoordinates" dataset from Souza (2018) as an example. This dataset consists of 2,111 examples of accelerometer data recorded from cars passing over various types of pavement. Every time series example in the dataset has three channels (corresponding to the X, Y, and Z directions), each of which is measured at 100 Hz. The length of the recordings varies from 66 time observations up to 2,371. The classes are "flexible" (38.6%), "cobblestone" (25.0%), and "dirt road" (36.4%). According to the dataset description, the best model achieved an accuracy of 80.66% on this task, which we will use as a benchmark. So, Houston, we have our problem — a relatively balanced three-way multivariate time series classification problem, to be precise. The aeon module provides a simple way to load this dataset for our machine learning task. We will also use scikit-learn to follow the original authors and divide the full dataset into equally-sized train and test splits:

    from aeon.datasets import load_classification
    from sklearn.model_selection import train_test_split

    X, y, meta = load_classification("AsphaltPavementTypeCoordinates")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=0
    )

How to build a ROCKET

Next, let's put together a suitable vessel to encode this dataset. Having installed the xrocket module with its dependencies in our environment, we can immediately import the full encoder module. Then, all we have to do is initialize an instance of it with suitable parameters for our problem. Since our dataset has three channels, the choice of in_channels is clear. Next, as the time series length varies widely within our dataset, it makes sense to set max_kernel_span to a value that also suits the shorter examples; let's go with 100 in this case. Finally, we leave combination_order and feature_cap at their default values of one and 10,000 for now:

    from xrocket import XRocket

    encoder = XRocket(
        in_channels=3,
        max_kernel_span=100,
        combination_order=1,
        feature_cap=10_000,
    )

Given these inputs, our encoder is automatically set up with the usual 84 MiniROCKET kernels at 12 distinct dilation values. With three data channels, X-ROCKET chooses three pooling thresholds for each kernel-dilation-channel combination to stay within the feature_cap. Hence, the embedding dimension is 84 × 12 × 3 × 3 = 9,072. To finally prepare this contraption for boarding, all we have to do is find suitable values for the 9,072 pooling thresholds. We do this by fitting our XRocket instance to a data example. As the model operates on PyTorch tensors, where the first dimension is reserved for stacking multiple examples in a batch, all we have to do is transform the data from a 2D numpy array into a 3D tensor and feed it to the encoder:

    from torch import Tensor

    encoder.fit(Tensor(X_train[0]).unsqueeze(0))

Punch it!

Now that our X-ROCKET is calibrated, let's start the countdown.
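As a quick sanity check before we transform the full dataset, we can push a single example through the fitted encoder and confirm that it comes out as a flat feature vector of the expected size. This is just a minimal sketch; the exact output shape is an assumption here, based on the 9,072-dimensional embedding described above:

    from torch import Tensor

    # encode a single training example; the leading dimension is the batch dimension
    single_embedding = encoder(Tensor(X_train[0]).unsqueeze(0))
    print(single_embedding.shape)  # assumed to be (1, 9072), matching the embedding size above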
Again, inputs need to be in the 3D tensor format, so we need to transform the examples to PyTorch tensors before passing them to the model. Due to the varying time series lengths, we cannot easily concatenate multiple examples into a batch. Therefore, it is more convenient to encode the examples one by one and collect the embeddings in two lists, one for the training set and one for the test set. Time to go to full thrust, godspeed!

    embed_train, embed_test = [], []
    for x in X_train:
        embed_train.append(encoder(Tensor(x).unsqueeze(0)))
    for x in X_test:
        embed_test.append(encoder(Tensor(x).unsqueeze(0)))

8.02 seconds on a moderately fast consumer-grade CPU later, the embeddings of both the train and the test set are ready. That is, we now have a representation of the varying-size input data in fixed-dimensional vectors. Hence, the time has come to make this a tabular problem with named features stored in a DataFrame. The encoder provides the attribute feature_names, which readily contains the name of each embedding value as a tuple of (pattern, dilation, channel, threshold). Let's put these tuples in an index and name them accordingly. Then, finally, we create the frames to store the transformed datasets. Who said time series classification had to be rocket science?

    from torch import concat
    import pandas as pd

    feature_names = pd.Index(encoder.feature_names)
    df_train = pd.DataFrame(data=concat(embed_train), columns=feature_names)
    df_test = pd.DataFrame(data=concat(embed_test), columns=feature_names)

Giving X-ROCKET a purpose

As with so many things in the universe, X-ROCKET struggles to find its way without a head. To make sure it can follow its trajectory to the intended destination — time series classification — let's find a suitable prediction head that delivers the payload. As mentioned before, any prediction model that fits the intended purpose is fine in principle. Note that in theory, this also includes deep PyTorch feed-forward neural networks, which would allow running backpropagation end to end, all the way back to the X-ROCKET weights, to improve its embeddings. But don't panic, it is possible to find answers even without Deep Thought! Since we are eventually interested in the explainability of the predictions, let's pick a simple and explainable classification model instead. Scikit-learn's RandomForestClassifier is a solid start on that end; all we have to do is load it and fit it on our training data:

    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(random_state=0)
    clf.fit(df_train, y_train)

Wow, it almost went off like a rocket! Just 3.13 seconds later, we have our classifier. Let's see how it does on the dataset. Since the original work claims to achieve 80.66% accuracy, let's score our model on the hold-out set in the same way as they did:

    from sklearn.metrics import accuracy_score

    pred_test = clf.predict(df_test)
    acc_test = accuracy_score(y_test, pred_test)

And there we have it, our model achieves an accuracy of 90.19% on the test set! Not bad, but is it enough to make a little rocket man proud? To conclusively answer that question, of course, more rigorous comparisons are warranted. Nevertheless, this appears to have been a successful launch!

Where no ROCKET man has gone before

The time has come to take X-ROCKET to the final frontier on its ultimate quest for meaning. Since the model seems to work acceptably well, it is valid to also analyze the explanations it provides about its predictions.
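Before reading too much into those explanations, it can be worth checking that the 90.19% accuracy is not carried by a single class. A minimal sketch of such a check, reusing the pred_test predictions from above:

    from sklearn.metrics import classification_report

    # per-class precision, recall, and F1 for the three pavement types
    print(classification_report(y_test, pred_test))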
Luckily, the random forest classifier we chose provides an attribute called feature_importances_, which ascribes importance scores to all features of the model. Since we have stored the corresponding index in feature_names, we can easily bring both arrays together:

    feature_importances = pd.Series(
        data=clf.feature_importances_,
        index=feature_names,
    )

As it is, analyzing this object is only so useful. For example, we can see that the most important embedding value for our model is the pattern HLHLLLHLL at dilation two in the Y-channel with a pooling threshold of -10.84. An H in the pattern indicates a high value, while an L indicates a low one, so the pattern looks something like |_|___|__ . However, it is now easy to pool importance values to examine the relative importances of, say, the input channels. Summing over each channel, we get the importance scores below. Since X-ROCKET removes the randomness in how the embeddings are put together, the same features are extracted from each channel and each dilation value. Hence, grouping feature importances this way offers a fair comparison.

Relative importances of the input channels for the predictions.

That is, the Y-channel seems to be the clear favorite, followed by the X-channel. Similarly, if we sum over the various dilation values, a clear insight is that higher frequencies are the ones that matter. With entries recorded at 100 Hz, a dilation value of 2 corresponds to a frequency of 50 Hz, for example. As can be seen in the image below, most information is contained in these higher frequencies, that is, the ones with smaller dilation values.

Relative importances of various frequency dilations for the predictions.

What did the doctor say to the ROCKET? "Time to get your booster shot!"

Accordingly, one might wonder how to give this rocket ship an extra performance boost. In machine learning space, of course, the possibilities are endless. For example, one could try alternative model heads such as gradient boosting algorithms, or better optimize the corresponding hyperparameters. On a different route, one could think about how to improve the data quality or augment the existing dataset with artificial examples. However, this is beyond the scope of this simple demonstration. What would be interesting to see, though, is whether the encoder can be further improved to gain additional insight into the drivers of predictiveness by also considering multi-channel features besides the previously seen univariate ones. So let's leave everything unchanged except the encoder, setting combination_order=2 and slightly increasing the number of features with feature_cap=15_000 when initializing X-ROCKET. The resulting embedding is now 12,096-dimensional, with 6 channel combinations instead of only the 3 single channels, and 2 pooling thresholds for each output. Besides a slight increase in test set accuracy to 91.13%, we observe that the Y-channel again seems to be the most important, but combinations of Y with the other channels now carry increased importance:

Relative importance of input channel combinations for the predictions.

Conclusions

In this series of articles, we have seen how an existing time series encoder framework can be restructured to derive new insight into the prediction drivers. Part one has shed light on some of the advances in machine learning for the time series domain.
Then, part two and this third part presented X-ROCKET, an explainable time series encoder, both technically and with a practical usage example. While this construct has completed its mission in the example here, it is important to point out that the explanations provided by X-ROCKET are only as good as the model's prediction capabilities on the respective problem. That is, there is no point in interpreting a model that does not perform well enough in terms of its predictions. Hence, there is no guarantee that the same approach works equally well in different settings, in particular if there is little signal in the input data. Nonetheless, rockets are cool, there is no getting around that!

References

Dempster, A., Schmidt, D. F., & Webb, G. I. (2021, August). MiniRocket: A very fast (almost) deterministic transform for time series classification. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (pp. 248–257).

Souza, V. M. (2018). Asphalt pavement classification using smartphone accelerometer and complexity invariant distance. Engineering Applications of Artificial Intelligence, 74, 198–211.

This article was created within the "AI-gent3D — AI-supported, generative 3D-Printing" project, funded by the German Federal Ministry of Education and Research (BMBF) with the funding reference 02P20A501 under the coordination of PTKA Karlsruhe.

Inside X-ROCKET: Explaining the explainable ROCKET


Felix Brunner


Welcome to the bridge, pilot! In this second part of our three-part journey, we will have a detailed look at the interior of the X-ROCKET implementation. After setting the stage for time series classification and giving a basic introduction to the ROCKET model in part one, this article provides a tour of the mechanisms needed for explainable embeddings, before part three launches X-ROCKET into a bumpy space race on a real dataset.

The blueprint to explainability

Again, our goal is to add explainability to a potent time series encoder, the ROCKET. One way to achieve this is by tracing each element of the embedding vectors back to its origins and thereby attaching meaning to it. Put differently, if we manage to meaningfully name each embedding element, we effectively transform downstream tasks into tabular problems. With the complexities and nonlinearities of neural networks, this is usually easier said than done. In the case of ROCKET, however, the architecture is shallow enough to shed light on its inner workings with a little bit of engineering and trickery. More precisely, the MiniROCKET of Dempster et al. (2021) will serve as a starting point, to which we add transparency by fully backtracking its encoding mechanisms. While convolutions do not necessarily need to be implemented in a deep-learning framework, doing so can help computational speed by leveraging GPUs. Accordingly, there already exist good implementations of various ROCKET variants in Python. For example, the original authors' numpy code is part of the sktime library, and tsai contains a GPU-ready PyTorch version of it. However, although these implementations are already computationally very efficient, our endeavors require a few changes that are more easily achieved after restructuring the model.

Let's dive into the technical details of the X-ROCKET implementation. As mentioned before, ROCKET architectures resemble very simple CNNs, so why not also structure their implementation like a neural network? That is, let's treat the steps of the calculation as layer objects and plug them together in line with the ideas behind ROCKET. More precisely, we define modules for each calculation step such that it is easier to understand the underlying computational graph. The diagram below schematically presents the full architecture of X-ROCKET. An input time series is served to several dilation blocks in parallel, each of which consists of a convolutional module, a channel mixing module, and a threshold pooling module. After processing the data sequentially in its submodules, each dilation block outputs a vector of embeddings. Finally, these embeddings are concatenated to form the full X-ROCKET output embedding, which downstream models can pick up to produce a prediction — in our case a classification. Note that the interpretability of the final prediction depends on how explainable the downstream prediction model is. While explainable AI (XAI) is a very active field of research with a whole literature dedicated to making algorithms explainable, we will follow the original authors' suggestion to use relatively simple prediction heads that are explainable without any additional sophistication.

Full overview of the X-ROCKET architecture.

In what follows, I provide a more detailed look at the various modules that make up X-ROCKET.

ROCKET convolutions

The first step in processing the data is to apply convolutional kernels that scan for fixed patterns in the data.
As we are dealing with time series, 1-dimensional kernels are the appropriate choice. The drawing below illustrates how the convolutions are applied. Given a sequence of input data, convolutional kernels are applied by sliding them over the input and summing the element-wise products in the respective window. Effectively, this scans the input for the prevalence of the respective pattern and results in an output that has the same shape as the input. Note how in the image below the output sequence always has large values when there is a peak in the input. Conversely, the output is negative if there is a dip in the input. This is because in this example, the input is filtered for the pattern [-1, 2, -1], which has the shape of a spike itself. X-ROCKET uses the same 84 filters with a length of nine values as suggested in Dempster et al. (2021), but in contrast to the original authors, we always pad the inputs to obtain identical-length output sequences. To maintain explainability in this step, it is enough to store the kernel corresponding to each output sequence.

Illustration of a 1D convolution.

Channel mixing

When dealing with multivariate time series, that is, time series with multiple channels, one might want to consider correlations of patterns across multiple channels. While the original implementation mainly focuses on the univariate case and suggests naïvely adding random combinations of ROCKET convolutions together, we want to provide a balanced comparison of features. Therefore, X-ROCKET removes the randomness and instead provides the option to expand the feature pool with channel combinations up to a chosen order. As an additional option, channels can be combined multiplicatively instead of additively for closer resemblance to the concept of a correlation. Explainability in this step is ensured by remembering the channels each mixed output is built from.

Illustration of channel combinations.

PPV threshold pooling

The transformations up to this point have done anything but reduce the size of the data. That is, applying multiple convolutional filters to each channel and adding combinations of the input channels on top of the single-channel convolutional outputs results in a far greater number of equal-length output channels than were originally put in. Therefore, it is time to collapse the time dimension through a pooling mechanism. Following the original paper's suggestions, X-ROCKET applies proportion-of-positive-values pooling (PPV). More precisely, the values in each intermediary channel are thresholded at one or more bias values per channel, where the bias values are automatically chosen based on representative examples in an initial fitting step. Then, PPV counts the fraction of values that surpass the respective threshold across the timeline. Finally, the resulting percentages directly serve as feature values in the embedding vector. Hence, for explainability, elements in the embedding can be unambiguously linked to a combination of convolutional kernel, one or more input channels, and a threshold value.

Illustration of proportion-of-positive-values pooling via thresholds.

Dilation blocks

With the considered convolutional kernels only spanning nine observations, the capacity of the model is so far limited to detecting a very narrow set of input characteristics. To change that, multiple dilation values are applied to identical kernels simultaneously to widen their receptive fields.
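To make these ideas concrete before we put the blocks together, here is a small, self-contained sketch (illustrative only, not taken from the X-ROCKET code) that applies the spike-shaped example kernel from above at several dilation values and then collapses each activation sequence with PPV pooling; the threshold of 0.5 is an arbitrary choice for this toy example:

    import torch
    import torch.nn.functional as F

    # toy input: a single-channel series of length 20 with one spike in the middle,
    # shaped (batch, channels, time) as PyTorch's conv1d expects
    x = torch.zeros(1, 1, 20)
    x[0, 0, 10] = 1.0

    # the spike-shaped example kernel from above, shaped (out_channels, in_channels, width)
    kernel = torch.tensor([[[-1.0, 2.0, -1.0]]])

    for dilation in (1, 2, 4):
        # padding=dilation keeps the output the same length as the input for a width-3 kernel
        activation = F.conv1d(x, kernel, padding=dilation, dilation=dilation)
        # PPV pooling: the fraction of activations that exceed the chosen threshold
        ppv = (activation > 0.5).float().mean().item()
        print(f"dilation={dilation}: PPV={ppv:.2f}")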
X-ROCKET achieves this in practice by executing the aforementioned sequence of convolution, channel mixing, and PPV thresholding in multiple dilation blocks in parallel. In principle, dilations are a standard procedure in the context of CNNs, but most architectures only use a single value at each step. Having said that, a similar idea has recently shown promise in drastically improving the contextual capabilities of LLMs by enlarging context windows through dilated attention (see Ding et al. (2023)). To better understand how filter dilation works, consider the drawing below. Applying a dilation value spreads the kernel over a longer period of time, thereby scanning lower frequencies for the respective patterns. For example, the resulting activation with a dilation value of two indicates the occurrence of the pattern at half the data frequency. For explainability, it is therefore important to also store the dilation value corresponding to each embedding element.

Illustration of frequency dilations.

The full model

Coming back to the full model, we can now put the pieces together. To initialize the encoder, we need to choose a few hyperparameters that determine the exact structure of the model. First, the number of input channels in_channels needs to be specified according to the number of channels in the data. Second, to automatically choose the dilation values to consider, the model requires an upper bound on the width of the convolutional receptive fields, called max_kernel_span. Typically, X-ROCKET then picks 20–30 distinct frequencies to consider. Next, the combination_order determines how many channels are combined when looking for correlations. By default, this keyword argument is set to 1 for simplicity. Finally, the feature_cap limits the dimensionality of the output to 10,000 features by default. X-ROCKET then builds the feature pool deterministically, that is, it is careful to include all channel-dilation-kernel combinations. Hence, the resulting number of features needs to be a multiple of the number of possible combinations and is not necessarily close to the specified value. If there is room within the feature cap, multiple thresholds are applied to each channel-dilation-kernel combination in the pooling step to create additional features.

Finally, to turn the embeddings into predictions, the encoder needs to be combined with a prediction model. As we are interested in interpretability, explainable models are the suggested choice here. Since the X-ROCKET encoder effectively gives the problem a tabular structure, many models for tabular data are valid candidates. For example, scikit-learn offers a large selection of insightful algorithms for tabular data. Similarly, gradient boosting algorithms such as XGBoost are high-performance alternatives. Note that standardizing the embedding vectors may be an essential intermediary processing step to ensure the interpretability of some of these prediction algorithms. And with the X-ROCKET code living in the PyTorch framework, it is also easy to combine the encoder with a deep feed-forward neural network. However, anything beyond a single linear layer might again be difficult to interpret in this case.

In the next and final part, I will show a simple usage example of the X-ROCKET implementation that also illustrates what kind of insight one can derive from X-ROCKET besides pure predictive performance.

References

Dempster, A., Schmidt, D. F., & Webb, G. I. (2021, August).
MiniRocket: A very fast (almost) deterministic transform for time series classification. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (pp. 248–257).

Ding, J., Ma, S., Dong, L., Zhang, X., Huang, S., Wang, W., & Wei, F. (2023). LongNet: Scaling Transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486.

Drawings were created in excalidraw.

This article was created within the "AI-gent3D — AI-supported, generative 3D-Printing" project, funded by the German Federal Ministry of Education and Research (BMBF) with the funding reference 02P20A501 under the coordination of PTKA Karlsruhe.

Computer Vision


Early Classification of Crop Fields through Satellite Image Time Series


Tiago Sanona


In a fast-paced and ever-changing global economy, classifying crop fields via remote sensing only at the end of a growth cycle does not provide the immediate insight that decision makers need. To address this problem, we developed a model that allows continuous classification of crop fields at any point in time and improves its predictions as more data becomes available. In practice, we developed a single model capable of delivering predictions about which crops are growing at any point in time based on satellite data. The data available at the time of inference could be a few images at the beginning of the year or a full time series of images from a complete growing cycle. This exceeds the capabilities of current deep learning solutions, which either only offer predictions at the end of the growing cycle or have to use multiple models specialized to return results at pre-specified points in time. This article details the key changes we made to the model described in a previous blog post, “Classification of Crop fields through Satellite Image Time Series”, which extend its functionality and improve its performance. The results presented in this article are based on a research paper recently published by dida. For more detailed information about this topic and other experiments on this model, please check out the original manuscript: “Early Crop Classification via Multi-Modal Satellite Data Fusion and Temporal Attention”.

Leveraging Machine Learning for Environmental Protection


Edit Szügyi


Machine Learning has been solving complex problems for decades. Just think about how Computer Vision methods can reliably predict life-threatening diseases, how self-driving cars are on their way to revolutionizing traffic safety, or how automatic translation gives us the ability to talk to just about anyone on the planet. The power of Machine Learning has been embraced by many branches of industry and science. There are some areas, however, where the potential of Machine Learning is harder to see and less utilized. One of these is environmental protection. Protecting the natural environment is one of the biggest challenges our generation is facing, with pressing issues such as climate change, plastic pollution or resource depletion. Let us now look at how Machine Learning has been and can be used as a tool in environmental protection.

Introductions


LLM strategies part 1: Possibilities of implementing Large Language Models in your organization


David Berscheid


Large Language Models (LLMs) are a highly discussed topic in current strategy meetings of organizations across all industries. This article is the first part of two, providing some guidelines for organizations to determine their LLM strategy. It will help you identify the strategy with the most benefits while finding ways of solving associated complexities. For more content on LLMs, see our LLM hub .

How ChatGPT is fine-tuned using Reinforcement Learning


Thanh Long Phan


At the end of 2022, OpenAI released ChatGPT (a Transformer-based language model) to the public. Although based on the already widely discussed GPT-3, it launched an unprecedented boom in generative AI. It is capable of generating human-like text and has a wide range of applications, including language translation, language modeling, and generating text for applications such as chatbots. Feel free to also read our introduction to LLMs . ChatGPT seems to be so powerful that many people consider it to be a substantial step towards artificial general intelligence. The main reason for the recent successes of language models such as ChatGPT lies in their size (in terms of trainable parameters). But making language models bigger does not inherently make them better at following a user's intent. A bigger model can also become more toxic and more likely to "hallucinate". To mitigate these issues and to more generally align models to user intentions, one option is to apply Reinforcement Learning. In this blog post, we will present an overview of the training process of ChatGPT, and have a closer look at the use of Reinforcement Learning in language modeling. Also interesting: Our aggregated collection of LLM content .

Natural Language Processing


LLM strategies part 1: Possibilities of implementing Large Language Models in your organization


David Berscheid


Large Language Models (LLMs) are a highly discussed topic in current strategy meetings of organizations across all industries. This article is the first part of two, providing some guidelines for organizations to determine their LLM strategy. It will help you identify the strategy with the most benefits while finding ways of solving associated complexities. For more content on LLMs, see our LLM hub .

Extend the knowledge of your Large Language Model with RAG


Thanh Long Phan, Fabian Dechent


Large Language Models (LLMs) have rapidly gained popularity in Natural Language tasks due to their remarkable human-like ability to understand and generate text. Amidst great advances, there are still challenges to be solved on the way to building perfectly reliable assistants. LLMs are known to make up answers, often producing text that adheres to the expected style but lacks accuracy or factual grounding. Generated words and phrases are chosen because they are likely to follow previous text, where the likelihood is adjusted to fit the training corpus as closely as possible. This gives rise to the possibility that a piece of information is outdated if the corpus is not updated and the model retrained, or that it is simply factually incorrect, while the generated words still sound correct and match the required genre. The core problem here is that the LLM does not know what it does not know. In addition, even if a piece of information is correct, it is hard to track its source in order to enable fact-checking. In this article, we introduce RAG (Retrieval-Augmented Generation) as a method that addresses both problems and thus aims to enhance the reliability and accuracy of information generated by LLMs.


Extracting information from technical drawings


Frank Weilandt (PhD)


Did you ever need to combine data about an object from two different sources, say, images and text? We often face such challenges in our work at dida. Here we present an example from the realm of technical drawings. Such drawings are used in many fields for specialists to share information. They follow very specific guidelines so that every specialist can understand what is depicted on them. Normally, technical drawings are given in formats that allow indexing, such as svg, html, dwg, dwf, etc., but many, especially older ones, only exist in image format (jpeg, png, bmp, etc.), for example from book scans. Such drawings are hard to access automatically, which makes using them difficult and time-consuming. In this regard, automatic detection tools could be used to facilitate the search. In this blogpost, we will demonstrate how both traditional and deep-learning-based computer vision techniques can be applied for information extraction from exploded-view drawings. We assume that such a drawing is given together with some textual information for each object on the drawing. The objects can be identified by numbers connected to them. Here is a rather simple example of such a drawing: An electric drill machine. There are three key components on each drawing: the numbers, the objects, and the auxiliary lines. The auxiliary lines are used to connect the objects to the numbers. The task at hand will be to find all objects of a certain kind / class over a large number of drawings, e.g. the socket with number 653 in the image above appears in several drawings and even in drawings from other manufacturers. This is a typical classification task, but with a caveat: since there is additional information for each object accessible through the numbers, we need to assign each number on the image to the corresponding object first. Next, we describe how this auxiliary task can be solved using traditional computer vision techniques.

21 questions we ask our clients: Starting a successful ML project


Emilius Richter


Automating processes using machine learning (ML) algorithms can increase the efficiency of a system beyond human capacity and is thus becoming more and more popular in many industries. But between an idea and a well-defined project, there are several points that need to be considered in order to properly assess the economic potential and technical complexity of the project. Especially for companies like dida that offer custom workflow automation software, a well-prepared project helps to quickly assess the feasibility and the overall technical complexity of the project goals, which, in turn, makes it possible to deliver software that fulfills the client's requirements. In this article, we discuss which topics should be considered in advance and why the questions we ask are important for starting a successful ML software project.

Remote Sensing


Early Classification of Crop Fields through Satellite Image Time Series


Tiago Sanona


In a fast-paced and ever-changing global economy, classifying crop fields via remote sensing only at the end of a growth cycle does not provide the immediate insight that decision makers need. To address this problem, we developed a model that allows continuous classification of crop fields at any point in time and improves its predictions as more data becomes available. In practice, we developed a single model capable of delivering predictions about which crops are growing at any point in time based on satellite data. The data available at the time of inference could be a few images at the beginning of the year or a full time series of images from a complete growing cycle. This exceeds the capabilities of current deep learning solutions, which either only offer predictions at the end of the growing cycle or have to use multiple models specialized to return results at pre-specified points in time. This article details the key changes we made to the model described in a previous blog post, “Classification of Crop fields through Satellite Image Time Series”, which extend its functionality and improve its performance. The results presented in this article are based on a research paper recently published by dida. For more detailed information about this topic and other experiments on this model, please check out the original manuscript: “Early Crop Classification via Multi-Modal Satellite Data Fusion and Temporal Attention”.

The best (Python) tools for remote sensing


Emilius Richter


An estimated 906 Earth observation satellites are currently in orbit, providing science and industry with many terabytes of data every day. The satellites operate with both radar and optical sensors and cover different spectral ranges with varying spectral, spatial, and temporal resolutions. Due to this broad spectrum of geospatial data, it is possible to find new applications for remote sensing methods in many industrial and governmental institutions. On our website, you can find some projects in which we have successfully used satellite data and possible use cases of remote sensing methods for various industries. Well-known satellite systems and programs include Sentinel-1 (radar) and Sentinel-2 (optical) from ESA, Landsat (optical) from NASA, TerraSAR-X and TanDEM-X (both radar) from DLR, and PlanetScope (optical) from Planet. There are basically two types of geospatial data: raster data and vector data. Raster data are a grid of regularly spaced pixels, where each pixel is associated with a geographic location, and are represented as a matrix. The pixel values depend on the type of information that is stored, e.g., brightness values for digital images or temperature values for thermal images. The size of the pixels also determines the spatial resolution of the raster. Geospatial raster data are thus used to represent satellite imagery. Raster images usually contain several bands or channels, e.g. a red, green, and blue channel. In satellite data, there are also often infrared and/or ultraviolet bands. Vector data represent geographic features on the earth's surface, such as cities, country borders, roads, bodies of water, property rights, etc. Such features are represented by one or more connected vertices, where a vertex defines a position in space by x-, y-, and z-values. A single vertex is a point, multiple connected vertices form a line, and multiple (>3) connected and closed vertices are called polygons. The x-, y-, and z-values always refer to the corresponding coordinate reference system (CRS), which is stored in vector files as meta information. The most common file formats for vector data are GeoJSON, KML, and SHAPEFILE. In order to process and analyze these data, various tools are required. In the following, I will present the tools we at dida have had the best experience with and which are regularly used in our remote sensing projects, grouped into the following sections: requesting satellite data (EOBrowser, Sentinelsat, Sentinelhub), processing raster data (Rasterio, Pyproj, SNAP, pyroSAR, Rioxarray), processing vector data (Shapely, Python-geojson, Geojson.io, Geopandas, Fiona), providing geospatial data (QGIS, GeoServer, Leafmap), and processing meteorological satellite data (Wetterdienst, Wradlib).

Software Development


Managing layered requirements with pip-tools


Augusto Stoffel (PhD)


When building Python applications for production, it's good practice to pin all dependency versions, a process also known as “freezing the requirements”. This makes the deployments reproducible and predictable. (For libraries and user applications, the needs are quite different; in this case, one should support a large range of versions for each dependency, in order to reduce the potential for conflicts.) In this post, we explain how to manage a layered requirements setup without forgoing the improved conflict resolution algorithm introduced recently in pip. We provide a Makefile that you can use right away in any of your projects!

Project proposals - the first step to a successful ML project


Emilius Richter


Many machine learning (ML) projects are doomed to fail. This can be due to various reasons, which often occur in combination. To avoid failure, all involved stakeholders need to understand the technical and organizational requirements of the project. Besides all the preliminary discussions that define the project, it is important to summarize the project-relevant information in a comprehensive proposal. It should cover the technical and organizational requirements, possible problem areas, and technical restrictions. In this article, I will describe the most important modules of machine learning project proposals. For a software provider like dida, the project proposal is the first step towards meeting the needs of the customer.



Theory & Algorithms


Deep Learning vs Machine Learning: What is the difference?


Serdar Palaoglu


In the realm of artificial intelligence, two fundamental concepts, Machine Learning and Deep Learning, have emerged as key components in the advancement of computer-based learning systems. Machine Learning serves as a foundational principle where computers gain the ability to learn from data without explicit programming. Deep Learning, an evolution within the Machine Learning framework, utilizes artificial neural networks inspired by the human brain to achieve complex data analysis. This article delves into a comprehensive exploration of these domains, elucidating their differences, practical applications, and significance in artificial intelligence.

How ChatGPT is fine-tuned using Reinforcement Learning


Thanh Long Phan


At the end of 2022, OpenAI released ChatGPT (a Transformer-based language model) to the public. Although based on the already widely discussed GPT-3, it launched an unprecedented boom in generative AI. It is capable of generating human-like text and has a wide range of applications, including language translation, language modeling, and generating text for applications such as chatbots. Feel free to also read our introduction to LLMs . ChatGPT seems to be so powerful that many people consider it to be a substantial step towards artificial general intelligence. The main reason for the recent successes of language models such as ChatGPT lies in their size (in terms of trainable parameters). But making language models bigger does not inherently make them better at following a user's intent. A bigger model can also become more toxic and more likely to "hallucinate". To mitigate these issues and to more generally align models to user intentions, one option is to apply Reinforcement Learning. In this blog post, we will present an overview of the training process of ChatGPT, and have a closer look at the use of Reinforcement Learning in language modeling. Also interesting: Our aggregated collection of LLM content .


Managing layered requirements with pip-tools


Augusto Stoffel (PhD)


When building Python applications for production, it's good practice to pin all dependency versions, a process also known as “freezing the requirements”. This makes the deployments reproducible and predictable. (For libraries and user applications, the needs are quite different; in this case, one should support a large range of versions for each dependency, in order to reduce the potential for conflicts.) In this post, we explain how to manage a layered requirements setup without forgoing the improved conflict resolution algorithm introduced recently in pip. We provide a Makefile that you can use right away in any of your projects!

The best (Python) tools for remote sensing


Emilius Richter


An estimated 906 Earth observation satellites are currently in orbit, providing science and industry with many terabytes of data every day. The satellites operate with both radar and optical sensors and cover different spectral ranges with varying spectral, spatial, and temporal resolutions. Due to this broad spectrum of geospatial data, it is possible to find new applications for remote sensing methods in many industrial and governmental institutions. On our website, you can find some projects in which we have successfully used satellite data and possible use cases of remote sensing methods for various industries. Well-known satellite systems and programs include Sentinel-1 (radar) and Sentinel-2 (optical) from ESA, Landsat (optical) from NASA, TerraSAR-X and TanDEM-X (both radar) from DLR, and PlanetScope (optical) from Planet. There are basically two types of geospatial data: raster data and vector data. Raster data are a grid of regularly spaced pixels, where each pixel is associated with a geographic location, and are represented as a matrix. The pixel values depend on the type of information that is stored, e.g., brightness values for digital images or temperature values for thermal images. The size of the pixels also determines the spatial resolution of the raster. Geospatial raster data are thus used to represent satellite imagery. Raster images usually contain several bands or channels, e.g. a red, green, and blue channel. In satellite data, there are also often infrared and/or ultraviolet bands. Vector data represent geographic features on the earth's surface, such as cities, country borders, roads, bodies of water, property rights, etc. Such features are represented by one or more connected vertices, where a vertex defines a position in space by x-, y-, and z-values. A single vertex is a point, multiple connected vertices form a line, and multiple (>3) connected and closed vertices are called polygons. The x-, y-, and z-values always refer to the corresponding coordinate reference system (CRS), which is stored in vector files as meta information. The most common file formats for vector data are GeoJSON, KML, and SHAPEFILE. In order to process and analyze these data, various tools are required. In the following, I will present the tools we at dida have had the best experience with and which are regularly used in our remote sensing projects, grouped into the following sections: requesting satellite data (EOBrowser, Sentinelsat, Sentinelhub), processing raster data (Rasterio, Pyproj, SNAP, pyroSAR, Rioxarray), processing vector data (Shapely, Python-geojson, Geojson.io, Geopandas, Fiona), providing geospatial data (QGIS, GeoServer, Leafmap), and processing meteorological satellite data (Wetterdienst, Wradlib).