Semantic Search for Public Administration


Our algorithm helps citizens through the bureaucracy of registering a business.

Input

Free-text business descriptions

Output

The corresponding industry codes

Goal

Simplify business registration for citizens


Introduction


The digitalization of public administration services is gaining momentum throughout Europe. When establishing a new digital public service, it is important to allow for easy and intuitive interaction between users and the service in order to ensure democratic and widespread adoption.
However, public authorities often use their own jargon, which is far from intuitive for most citizens. Furthermore, unlike during an on-site visit, citizens using a digital service cannot directly ask an official about open questions. A language barrier can therefore quickly arise. This can lead to frustration on the part of the citizens and, in the worst case, to a lack of acceptance of the digital service.
To remedy this problem, dida developed an AI-based algorithm that extracts relevant information from authority documents.


Starting Point


Citizens registering a new business in Germany have to provide an industry code along with their registration. This industry code is chosen from a list of over 800 different codes, each described and defined in complicated “public administration language”. Finding the correct code from all these options is hard, especially if someone is not accustomed to the language used in these descriptions.

Our client PublicPlan developed an online service portal for the federal state of North Rhine-Westphalia which enables citizens to access public administration services. PublicPlan wanted to enhance its functionality by offering an intuitive search function for industry codes, integrated both in the portal's chatbot and the business registration assistant.

The solution as part of the business registration assistant


Challenges


The input to the algorithm is the citizen’s free-text description of the business they want to register.

As authorities often use words and turns of phrase that differ widely from colloquial language, finding the correct industry code for a specific business registration is a non-trivial task.

Therefore, simple text search algorithms are not sufficient to find the correct industry code.
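To illustrate why literal matching falls short, the following minimal sketch counts the words that a colloquial description shares with two official-style code descriptions. The code names and all texts are invented for illustration and are not taken from the real code list or from the client's data.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-zäöüß]+", text.lower()))


def keyword_overlap(query: str, document: str) -> int:
    """Count the distinct words that the query and the document share."""
    return len(tokenize(query) & tokenize(document))


# Hypothetical examples, invented for illustration only.
OFFICIAL_DESCRIPTIONS = {
    "code_a": "Manufacture of bakery products",
    "code_b": "Retail sale of bread, cakes and confectionery in specialised stores",
}

query = "I want to open a small shop where I bake cupcakes"

overlaps = {
    code: keyword_overlap(query, description)
    for code, description in OFFICIAL_DESCRIPTIONS.items()
}
```

Both overlaps are zero although both descriptions are plausibly relevant: to a literal matcher, "bake" and "bakery" or "cupcakes" and "cakes" are simply different tokens.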


Solution


We adapted and trained an AI architecture that is particularly well suited to Natural Language Processing. For a given colloquial business description, the trained model suggests the relevant industry codes.

The training data consisted of historical colloquial business descriptions and the corresponding industry codes. This data was provided by the client.

The final product can be deployed in various settings. Currently, it is used as a feature of the client's existing chatbot and business registration assistant. Our solution is flexible and easy to maintain: new industry codes can be integrated with very little effort.

Below you can see three example outputs of the algorithm for different business descriptions. Because Machine Learning models for Natural Language Processing are usually language-specific, the examples are in German.

Automatic output of industry codes by the AI-powered chatbot


Technical Background


The Client's Requirements

The client had an existing solution (chatbot and business registration assistant) that could serve as the interface: citizens type in their business descriptions in colloquial language and receive the five most relevant industry codes as a response.

Both components can route different user questions to corresponding API endpoints, meaning that we received the colloquial business descriptions written by users as API calls. Our algorithm was supposed to respond to these API calls with the five most relevant industry codes.
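The request and response contract described above might look roughly like the following sketch. The source lists FastAPI and Pydantic in the stack; to stay self-contained, this example uses only the standard library. The field names (`description`, `codes`, `code`, `score`), the `handle_request` function and the stub model are assumptions made for illustration, not the actual interface.

```python
import json
from typing import Callable, List, Tuple

Suggestion = Tuple[str, float]  # (industry code, relevance score)


def handle_request(body: str, suggest: Callable[[str], List[Suggestion]]) -> str:
    """Validate a JSON request body and build the JSON response."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return json.dumps({"error": "request body is not valid JSON"})

    description = payload.get("description") if isinstance(payload, dict) else None
    if not isinstance(description, str) or not description.strip():
        return json.dumps({"error": "'description' must be a non-empty string"})

    # The suggest callable is expected to return suggestions in ranked order.
    suggestions = suggest(description.strip())
    return json.dumps(
        {"codes": [{"code": code, "score": score} for code, score in suggestions]}
    )


def stub_suggest(description: str) -> List[Suggestion]:
    """Stand-in for the trained model: returns five made-up suggestions."""
    return [(f"code_{i}", round(1.0 - 0.1 * i, 1)) for i in range(5)]


ok = json.loads(handle_request('{"description": "I bake cupcakes"}', stub_suggest))
bad = json.loads(handle_request('{"description": "   "}', stub_suggest))
```

Passing the model in as a callable keeps the input validation independent of the classifier, so the same handler could sit behind the chatbot and the registration assistant.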

An earlier search solution based on basic word embeddings often produced unsatisfactory results. This indicated that a better semantic understanding was needed, both of the definitions and descriptions of the industry codes and of the colloquial business descriptions.

Our Solution

Backend: Python, spaCy, PyTorch, NumPy, Pandas, FastAPI, Pydantic, Docker, Elasticsearch
Infrastructure: GCloud (Training), Git, DVC, TensorBoard

The basis for our algorithm is a version of BERT, a neural network architecture developed by Google, which was already pre-trained on a large German text corpus. We fine-tuned the existing layers and enriched BERT's output with custom features. By adding a few new layers and postprocessing steps, we were able to build a text classifier that leverages the full semantic capabilities of BERT while being performant enough to run on a CPU.

This final classifier was then trained on historical pairs of business descriptions and industry codes. It takes a business description as input and outputs a relevance score for each industry code. The five highest-scoring industry codes are then sent back in ranked order as the response to the original user request.
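The final selection step, turning one relevance score per industry code into a ranked shortlist, can be sketched as follows. This is a minimal illustration that assumes the scores arrive as a plain sequence aligned with the list of codes; the function name `top_k_codes` and the example codes and scores are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple


def top_k_codes(
    codes: Sequence[str], scores: Sequence[float], k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k (code, score) pairs with the highest scores, best first."""
    if len(codes) != len(scores):
        raise ValueError("codes and scores must have the same length")
    return heapq.nlargest(k, zip(codes, scores), key=lambda pair: pair[1])


# Hypothetical codes and scores, invented for illustration only.
codes = ["A", "B", "C", "D", "E", "F", "G"]
scores = [0.05, 0.40, 0.10, 0.02, 0.25, 0.15, 0.03]

ranked = top_k_codes(codes, scores)
```

With a list of several hundred codes, `heapq.nlargest` avoids sorting the full list, although a plain sort would be equally adequate at this size.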
