How to extract text from PDF files

Text Extraction Natural Language Processing PDF Processing

In NLP projects the input documents often come as PDFs. Sometimes the PDFs already contain underlying text information, which makes it possible to extract text without the use of OCR tools. In the following I want to present some open-source PDF tools available in Python that can be used to extract text. I will compare their features and point out some drawbacks.


Those tools are PyPDF2, pdfminer and PyMuPDF.


There are other Python PDF libraries which are either not able to extract text or are focused on other tasks. Furthermore, there are tools that can extract text from PDF documents but are not available in Python. Neither group is discussed here.

Introduction

We have already discussed different OCR tools for automatically extracting text from documents. Although there are well-performing tools, they still make errors. So, when extracting information from documents, one either has to build robust models that can cope with small errors or look for alternative ways of extracting the text. For images and documents with no underlying text information, there is no alternative to OCR tools. But for PDF documents with underlying text, the question arises whether one can access this text information directly and thereby avoid possible OCR errors. I want to discuss this and provide insights from our experiences in recent projects.


First of all, it should be mentioned that PDF was not designed for retrieving text information. PDF stands for Portable Document Format and was developed by Adobe. The main goal was to exchange information independently of the platform while preserving and protecting the content and layout of a document. As a result, PDFs are hard to edit and it is difficult to extract information from them, although that does not mean it is impossible.


Second, one has to decide how much information is actually needed. Do you only need the plain text, do you also need the position of the text, and do you maybe also want some font information? These questions are also important when deciding on a suitable OCR tool. Everything is possible, but the task gets more complex and messier with each additional layer of information needed.


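We will test the three libraries on three simple sample PDFs: a short text with a bulleted list (Sample 1), a text with a bulleted list and a chart (Sample 2), and a data table (Sample 3).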

PyPDF2

PyPDF2 is a pure Python PDF library capable of splitting, merging, cropping, and transforming pages of different PDF files. It can retrieve metadata from PDFs, such as author, creator, creation date and others. It can also retrieve the PDF text as found in the content stream. This means that the text might not come out in logical order if it is not stored in logical order in the stream object associated with the PDF. Illogical ordering should not happen in general, but as documents get more complex, the ordering of the text might too. The code for retrieving the plain text is rather simple:

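import PyPDF2

pdf_path = "sample1.pdf"  # placeholder: path to the PDF to read

with open(pdf_path, "rb") as f:
    # PdfReader, .pages and extract_text() are the newer names for
    # PdfFileReader, getPage() and extractText(), which PyPDF2 3.x no longer accepts.
    reader = PyPDF2.PdfReader(f)
    page = reader.pages[0]      # first page only
    text = page.extract_text()
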
Sample performance

Let's look at the output we get for the different PDFs:

  • Sample 1: "Adobe Acrobat PDF Files\n \nAdobe® Portable Document Format (PDF) is a universal file format that preserves all \nof the fonts, formatting, colours and graphics of any source document, regardless of the \napplication and platform used to create it.\n \nAdobe PDF is an ideal format for electr\nonic document distribution as it overcomes the \nproblems commonly encountered with electronic file sharing.\n \n\n \nAnyone, anywhere\n \ncan open a PDF file. All you need is the free Adobe Acrobat \nReader. Recipients of other file formats sometimes can't open files beca\nuse they \ndon't have the applications used to create the documents.\n \n\n \nPDF files \nalways print correctly\n \non any printing device.\n \n\n \nPDF files always display \nexactly\n \nas created, regardless of fonts, software, and \noperating systems. Fonts, and graphics are not lost \ndue to platform, software, and \nversion incompatibilities.\n \n\n \nThe free Acrobat Reader is easy to download and can be freely distributed by \nanyone.\n \n\n \nCompact PDF files are smaller than their source files and download a page at a time \nfor fast display on the Web.\n \n"
  • Sample 2: "\n\n\n\n\n\n\nˇˇˇ\nˇ\n\n\n\nˆˇ\n˝\n\nˇ˛\nˇ\n\n˚\n˜˙ˆ\n\nˇˆ\n\n\n\nˆ\nˇˇ\n·˘\n·\n· ˜\n·ˆˇ!\n˜ˇ\n\n\n·ˆ\n·ˆˇ"ˇ\n\n\n\n\n\n\n\n\n\n"
  • Sample 3: "Example table\n This is an example of a data table.\n Disability \nCategory\n Participants\n Ballots \nCompleted\n Ballots \nIncomplete/\n Terminated\n Results\n Accuracy\n Time to \ncomplete\n Blind\n 5 1 4 34.5%, n=1\n 1199 sec, n=1\n Low Vision\n 5 \n2 \n3 \n98.3% n=2\n (97.7%, n=3)\n 1716 \nsec, n=3\n (1934 sec, n=2)\n Dexterity\n 5 \n4 \n1 \n98.3%, n=4\n 1672.1 sec, n=4\n Mobility\n 3 \n3 \n0 \n95.4%, n=3\n 1416 sec, n=3\n"

The output for Sample 2 is unusable. PyPDF2 seems to have problems with this file, although it looks quite normal in a PDF viewer. This can happen if the PDF creation software fails to link some font information when creating the PDF. Some more sophisticated PDF viewers and packages can handle such files, but PyPDF2 fails on this particular document. Sample 1 also contains newline characters (\n) where there should not be any, for example around the bold text. Sample 3 looks fine.
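
Output like the one for Sample 2 can be flagged automatically before it reaches a downstream model. The following is a minimal sketch of such a check, not part of any of the libraries discussed here. The function name looks_garbled and the threshold of 0.6 are assumptions chosen for illustration, and the check assumes mostly English (ASCII) text, so it would need adjusting for other languages.

```python
def looks_garbled(text: str, min_ratio: float = 0.6) -> bool:
    """Heuristic: True if too few visible characters are ASCII letters or digits.

    Hypothetical helper; the 0.6 threshold is an arbitrary starting point.
    """
    # Ignore whitespace such as the many "\n" in the extracted strings.
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True  # nothing usable was extracted at all
    readable = sum(ch.isascii() and ch.isalnum() for ch in visible)
    return readable / len(visible) < min_ratio


print(looks_garbled("\n\n\u02c7\u02c7\u02c7\n\u02c7\n\u02c6\u02c7\n\u02dd\n"))  # symbols as in Sample 2
print(looks_garbled("Adobe Acrobat PDF Files\n \nAdobe PDF is an ideal format"))
```

A page that fails this check could then be passed to another library or to an OCR tool instead.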

On missing text information or too much text information

Errors like the one PyPDF2 makes on Sample 2 can also occur with more sophisticated PDF libraries and can be hard to detect. Furthermore, things become much more difficult if the PDF mixes text that has underlying text information with scan-like areas, where text is visible but no text information can be obtained. In that case you will miss some of the text. The other way round is also possible: areas with no visible text but obtainable underlying text information. Then you get too much text (which can be a good thing if you aim at detecting such hidden information). Both cases are hard to detect and difficult to cope with. Again, PDFs are a messy business.
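
The simplest of these cases, a page that yields almost no text at all, can at least be flagged cheaply. The sketch below is an illustration under stated assumptions: the helper name pages_without_text and the limit of 20 characters are made up for this example, and the input is assumed to be a list with one extracted string per page. It does not detect pages that mix real text with scan-like areas, and it does not detect hidden text.

```python
def pages_without_text(page_texts, min_chars=20):
    """Return the indices of pages with fewer than min_chars visible characters.

    Hypothetical helper: such pages are candidates for OCR.
    """
    flagged = []
    for index, text in enumerate(page_texts):
        # Count only non-whitespace characters; extracted text is often full of "\n".
        visible = sum(not ch.isspace() for ch in text)
        if visible < min_chars:
            flagged.append(index)
    return flagged


pages = [
    "Example table\nThis is an example of a data table.",
    "",          # e.g. a scanned page without underlying text
    " \n \n",    # only whitespace was extracted
]
print(pages_without_text(pages))
```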

pdfminer

In contrast to PyPDF2, pdfminer does not take the ordering of the text from the content stream, but extracts additional information such as text coordinates. Using these, it tries to merge all available characters into words, the words into text lines and the lines into paragraph-like objects. The image below illustrates this process. The blue boxes are the word objects, the green boxes are the text line objects and the red box delineates the paragraph-like object (not all of them are labeled here).
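
To make the idea of grouping by coordinates more concrete, here is a toy sketch of the step from words to text lines. It is not pdfminer's actual algorithm. The Word class, the function names and the overlap threshold of 0.5 are assumptions for illustration, and coordinates follow the PDF convention that y grows upwards.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge


def vertical_overlap(a: Word, b: Word) -> float:
    """Overlap of the vertical extents, relative to the smaller word height."""
    overlap = min(a.y1, b.y1) - max(a.y0, b.y0)
    smaller = min(a.y1 - a.y0, b.y1 - b.y0)
    return overlap / smaller if smaller > 0 else 0.0


def group_into_lines(words, min_overlap=0.5):
    """Put words that overlap vertically into the same line, top line first."""
    lines = []
    # Visit words from the top of the page downwards, then left to right.
    for word in sorted(words, key=lambda w: (-w.y1, w.x0)):
        for line in lines:
            if vertical_overlap(line[-1], word) >= min_overlap:
                line.append(word)
                break
        else:
            lines.append([word])  # no existing line fits, start a new one
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x0)) for line in lines]


words = [
    Word("PDF", 10, 100, 30, 112),
    Word("files", 35, 100, 60, 112),
    Word("Adobe", 10, 120, 45, 132),
    Word("Acrobat", 50, 121, 95, 133),
]
print(group_into_lines(words))
```

The same principle can be repeated one level up, merging lines that are close to each other into paragraph-like blocks.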

This geometric analysis can be configured in order to influence how pdfminer finds words, text lines and text blocks. The code for retrieving the plain text is somewhat more involved than the one for PyPDF2:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTFigure, LTTextBox
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

pdf_path = "sample1.pdf"  # placeholder: path to the PDF to read
text = ""

with open(pdf_path, "rb") as f:
    parser = PDFParser(f)
    doc = PDFDocument(parser)
    # Only the first page is processed here.
    page = next(PDFPage.create_pages(doc))
    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    interpreter.process_page(page)
    layout = device.get_result()

    # Walk the layout objects; figures can contain further objects,
    # so their children are queued and inspected as well.
    stack = list(layout)
    while stack:
        obj = stack.pop(0)
        if isinstance(obj, LTTextBox):
            text += obj.get_text()
        elif isinstance(obj, LTFigure):
            stack += list(obj)

This is a high-level example that only extracts the plain text. The geometric analysis can be tuned with the LAParams() object, and the geometric information mentioned above, as well as some font information, can additionally be retrieved from the layout objects.
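
As a hedged illustration of how the geometric analysis might be tuned, the sketch below collects LAParams settings in a dictionary and passes them to the extract_text helper. It assumes the pdfminer.six distribution, which provides pdfminer.high_level.extract_text. The parameter names and default values are the ones documented for pdfminer.six as far as I know and should be checked against the installed version. The helper names layout_settings and extract_with_layout are made up for this example.

```python
# Defaults of LAParams as documented for pdfminer.six (an assumption, verify for your version).
DEFAULT_LAYOUT = {
    "char_margin": 2.0,   # characters closer than this are merged into one line
    "word_margin": 0.1,   # gap from which a space is inserted between characters
    "line_margin": 0.5,   # lines closer than this are merged into one text box
    "boxes_flow": 0.5,    # weight of horizontal vs. vertical position for box order
    "all_texts": False,   # whether to run layout analysis on text inside figures
}


def layout_settings(**overrides):
    """Return the defaults with some values replaced; unknown names are rejected."""
    unknown = set(overrides) - set(DEFAULT_LAYOUT)
    if unknown:
        raise ValueError(f"unknown layout parameter(s): {sorted(unknown)}")
    return {**DEFAULT_LAYOUT, **overrides}


def extract_with_layout(pdf_path, settings):
    """Extract the text of a PDF using the given layout settings."""
    # Imported here so that layout_settings() works without pdfminer installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    return extract_text(pdf_path, laparams=LAParams(**settings))


print(layout_settings(line_margin=0.3, all_texts=True))
```

A call such as extract_with_layout("sample3.pdf", layout_settings(line_margin=0.3)) would then run the extraction with a smaller line margin. Here "sample3.pdf" is a placeholder path.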

Sample performance

Let's look at the output we get for the different PDFs:

  • Sample 1: "Adobe Acrobat PDF Files \nAdobe® Portable Document Format (PDF) is a universal file format that preserves all \nof the fonts, formatting, colours and graphics of any source document, regardless of the \napplication and platform used to create it. \nAdobe PDF is an ideal format for electronic document distribution as it overcomes the \nproblems commonly encountered with electronic file sharing. \n• Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat \nReader. Recipients of other file formats sometimes can't open files because they \ndon't have the applications used to create the documents. \n• PDF files always print correctly on any printing device. \n• PDF files always display exactly as created, regardless of fonts, software, and \noperating systems. Fonts, and graphics are not lost due to platform, software, and \nversion incompatibilities. \n• The free Acrobat Reader is easy to download and can be freely distributed by \nanyone. \n• Compact PDF files are smaller than their source files and download a page at a time \nfor fast display on the Web. \n"
  • Sample 2: "Lorem ipsum \nLorem ipsum dolor sit amet, consectetur adipiscing \nelit. Nunc ac faucibus odio. \nVestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut\nvarius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum\ncondimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus\nconvallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis,\nvulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec luctus\nnisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis ipsum,\nac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi sit amet\ntortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus sit amet\nmauris tempus fringilla.\nMaecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus.\n• Maecenas non lorem quis tellus placerat varius. \n• Nulla facilisi. \n• Aenean congue fringilla justo ut aliquam. \n• Mauris id ex erat. Nunc vulputate neque vitae justo facilisis, non condimentum ante\nsagittis. \n• Morbi viverra semper lorem nec molestie. \n• Maecenas tincidunt est efficitur ligula euismod, sit amet ornare est vulputate.\n12\n10\n8\n6\n4\n2\n0\nColumn 1\nColumn 2\nColumn 3\nRow 1\nRow 2\nRow 3\nRow 4\n"
  • Sample 3: "Example table \nThis is an example of a data table. \nDisability \nCategory \nParticipants \nBallots \nCompleted \nBallots \nIncomplete/ \nTerminated \nBlind \nLow Vision \nDexterity \nMobility \n \n5 \n5 \n5 \n3 \n1 \n2 \n4 \n3 \n4 \n3 \n1 \n0 \nResults \nAccuracy \nTime to \ncomplete \n34.5%, n=1 \n1199 sec, n=1 \n98.3% n=2 \n1716 sec, n=3 \n(97.7%, n=3) \n(1934 sec, n=2) \n98.3%, n=4 \n1672.1 sec, n=4 \n95.4%, n=3 \n1416 sec, n=3 \n"

This looks good. pdfminer is able to extract the text in Sample 2 too, and it also extracts the text from the figure in it (which can be turned off). For Sample 1, the font information could be accessed as well, which results in cleaner text than with PyPDF2, which sets bold text apart with "\n". However, the code is not as straightforward as with PyPDF2.

PyMuPDF

Both pdfminer and PyPDF2 are pure Python libraries. In contrast, PyMuPDF is based on MuPDF, a lightweight but extensive PDF viewer. This has huge advantages when it comes to handling difficult PDFs, but the licensing is stricter, since MuPDF is dual-licensed under the AGPL and a commercial license. Additionally, PyMuPDF claims to be significantly faster than pdfminer and PyPDF2 in various tasks. Like pdfminer, PyMuPDF can extract geometrical text information and font information, but, like PyPDF2, it can also extract the plain text directly. In contrast to pdfminer, it offers no way to influence the algorithm of the geometric text analysis. PyMuPDF groups the text into text blocks and text lines as done by MuPDF.

The code for just retrieving the plain text looks as follows:

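import fitz  # PyMuPDF is imported under the name fitz

pdf_path = "sample1.pdf"  # placeholder: path to the PDF to read

doc = fitz.open(pdf_path)
page = doc[0]  # first page only
# get_text() is the current name of the method formerly called getText().
text = page.get_text("text")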

This is simple and straightforward.

Sample performance

Let's look at the output:

  • Sample 1: "Adobe Acrobat PDF Files \nAdobe® Portable Document Format (PDF) is a universal file format that preserves all \nof the fonts, formatting, colours and graphics of any source document, regardless of the \napplication and platform used to create it. \nAdobe PDF is an ideal format for electronic document distribution as it overcomes the \nproblems commonly encountered with electronic file sharing. \n• \nAnyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat \nReader. Recipients of other file formats sometimes can't open files because they \ndon't have the applications used to create the documents. \n• \nPDF files always print correctly on any printing device. \n• \nPDF files always display exactly as created, regardless of fonts, software, and \noperating systems. Fonts, and graphics are not lost due to platform, software, and \nversion incompatibilities. \n• \nThe free Acrobat Reader is easy to download and can be freely distributed by \nanyone. \n• \nCompact PDF files are smaller than their source files and download a page at a time \nfor fast display on the Web. \n"
  • Sample 2: "Lorem ipsum \nLorem ipsum dolor sit amet, consectetur adipiscing \nelit. Nunc ac faucibus odio. \nVestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut\nvarius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum\ncondimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus\nconvallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis,\nvulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec luctus\nnisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis ipsum,\nac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi sit amet\ntortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus sit amet\nmauris tempus fringilla.\nMaecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus.\n•\nMaecenas non lorem quis tellus placerat varius. \n•\nNulla facilisi. \n•\nAenean congue fringilla justo ut aliquam. \n•\nMauris id ex erat. Nunc vulputate neque vitae justo facilisis, non condimentum ante\nsagittis. \n•\nMorbi viverra semper lorem nec molestie. \n•\nMaecenas tincidunt est efficitur ligula euismod, sit amet ornare est vulputate.\nRow 1\nRow 2\nRow 3\nRow 4\n0\n2\n4\n6\n8\n10\n12\nColumn 1\nColumn 2\nColumn 3\n"
  • Sample 3: "Example table \nThis is an example of a data table. \nDisability \nCategory \nParticipants \nBallots \nCompleted \nBallots \nIncomplete/ \nTerminated \nResults \nAccuracy \nTime to \ncomplete \nBlind \n5 \n1 \n4 \n34.5%, n=1 \n1199 sec, n=1 \nLow Vision \n5 \n2 \n3 \n98.3% n=2 \n(97.7%, n=3) \n1716 sec, n=3 \n(1934 sec, n=2) \nDexterity \n5 \n4 \n1 \n98.3%, n=4 \n1672.1 sec, n=4 \nMobility \n3 \n3 \n0 \n95.4%, n=3 \n1416 sec, n=3 \n \n"

This looks pretty much the same as for pdfminer. Again, the text from every document could be extracted. With different parameters like "dict", "rawdict" or "xml" one can obtain different output formats with additional information such as text coordinates, font, and text level (text block or text line).
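
As a rough sketch of how the "dict" output could be used, the function below walks a page dictionary and yields one record per text span. It assumes the nesting that PyMuPDF documents for this format, that is blocks containing lines containing spans, with text blocks marked by type 0. The function name iter_spans and the hand-written sample dictionary are made up for this example. With a real page, the dictionary would come from page.get_text("dict").

```python
def iter_spans(page_dict):
    """Yield text, font, size and bounding box for every span on a page."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip image blocks, keep text blocks
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield {
                    "text": span["text"],
                    "font": span["font"],
                    "size": span["size"],
                    "bbox": tuple(span["bbox"]),
                }


# Hand-written stand-in for the dictionary of a real page (values are invented).
sample_page = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"text": "Example table", "font": "Helvetica-Bold", "size": 14.0,
             "bbox": (72.0, 72.0, 180.0, 88.0)},
        ]}]},
        {"type": 1, "bbox": (72.0, 100.0, 300.0, 250.0)},  # an image block
    ]
}

for span in iter_spans(sample_page):
    print(span["text"], span["font"], span["size"], span["bbox"])
```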

Conclusion

To sum up, there are different tools with different methodologies and functionalities available in Python for PDF text extraction. Since PDF documents are quite messy, I would always go for a library based on an existing PDF viewer rather than a pure Python implementation. Nevertheless, each of the tools has advantages and disadvantages compared to the others. If you can, obtain the information you are trying to extract in a more direct way and avoid the detour of writing it to a PDF and extracting it again.