Docling is an open-source PDF parsing library that converts PDFs to Markdown with the document structure intact. It started inside IBM Research and now lives under the Linux Foundation's LF AI & Data umbrella. If you have ever fed raw PDF text into an LLM and watched the table layouts dissolve into an unreadable jumble, this is the library you wanted.
This post walks through installing it, parsing your first document, dealing with tables and OCR, and the issues you will hit in production. Docling also handles DOCX, PPTX, HTML, and images, but PDF is where it shines and where most of the configuration lives.
Installation
Docling needs Python 3.10 or newer. Support for 3.9 was dropped in version 2.70.
```shell
pip install docling
```
That single line is enough for most cases. The package declares PyTorch as a dependency, so there is no separate torch install step. On a CPU-only Linux server you can keep things lean by pointing pip at the CPU wheels:
```shell
pip install docling --extra-index-url https://download.pytorch.org/whl/cpu
```
The first time you call the converter, Docling pulls model weights from Hugging Face: layout detection, table structure, OCR. Plan for a few hundred megabytes on first run. After that, models are cached at ~/.cache/docling/models and the cold start drops from minutes to a few seconds.
GPU works without extra configuration. If a CUDA-capable PyTorch sees a GPU, Docling uses it. A single document on CPU finishes in seconds; batches are where the GPU pays for itself.
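To confirm what Docling will see on your machine, check the same signal it relies on. This assumes PyTorch is importable, which it is once docling is installed:

```python
import torch

# Docling runs its models on GPU when CUDA-capable PyTorch detects one,
# and falls back to CPU otherwise.
print("CUDA available:", torch.cuda.is_available())
```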
Your first parse
Five lines of code and you have Markdown:
```python
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
doc = converter.convert(source).document
print(doc.export_to_markdown())
```
For the Docling technical report linked above, the first lines of export_to_markdown() look something like:
```
## Docling Technical Report
Version 1.0
Christoph Auer · Maksym Lysak · Ahmed Nassar · ...
## Abstract
This technical report introduces Docling, an easy to use, self-contained,
MIT-licensed open-source package for PDF document conversion...
```
source accepts a URL, a local file path, or a pathlib.Path. The converter runs the full pipeline (layout analysis, table structure, OCR where needed) and returns a DoclingDocument Pydantic object. From there:
- doc.export_to_markdown() for Markdown
- doc.export_to_html() for HTML
- doc.export_to_dict() for the structural JSON
- doc.export_to_doctags() for Docling's lossless format
The first conversion in a process is slow because models load into memory. Subsequent calls reuse the same converter instance and run much faster. If you are processing many files, build the DocumentConverter once and pass documents through it.
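A minimal sketch of that pattern. convert_batch is my helper name, not a Docling API; it only assumes the convert(source).document interface shown above:

```python
from typing import Iterable


def convert_batch(converter, sources: Iterable[str]) -> dict[str, str]:
    # Reuse one converter so model weights load once per process,
    # not once per document.
    results: dict[str, str] = {}
    for src in sources:
        doc = converter.convert(src).document
        results[src] = doc.export_to_markdown()
    return results
```

With Docling this is convert_batch(DocumentConverter(), paths); the converter also exposes a convert_all method that serves the same purpose if you prefer the built-in route.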
Handling tables
Tables are where most PDF parsers fall apart. Visually a table is rows and columns; in the underlying PDF it is just a bag of positioned glyphs and lines. Docling uses TableFormer, a model trained specifically for table reconstruction, and it preserves merged cells and column alignment instead of flattening everything into one long line.
A good test document is the DocLayNet paper, which has several real tables:
```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
doc = converter.convert("https://arxiv.org/pdf/2206.01062").document

for table in doc.tables:
    print(table.export_to_markdown())
    print()
```
The output is real Markdown table syntax. Roughly (illustrative, your exact output will depend on the PDF and Docling version):
```
| class label | Count  | Train  | Test  | Val   |
|-------------|--------|--------|-------|-------|
| Caption     | 22524  | 18288  | 2855  | 1381  |
| Footnote    | 6318   | 5044   | 832   | 442   |
| Formula     | 25027  | 20355  | 3017  | 1655  |
| List-item   | 185660 | 151123 | 21887 | 12650 |
```
TableFormer has two modes, set on PdfPipelineOptions.table_structure_options.mode: ACCURATE (default) and FAST. ACCURATE is roughly twice as slow but noticeably better on dense financial tables. FAST is fine for most well-structured documents.
```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.table_structure_options.mode = TableFormerMode.FAST

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```
Table extraction is the slowest stage of the pipeline, so this is the first knob to turn if throughput matters.
OCR for scanned PDFs
OCR is on by default. PdfPipelineOptions.do_ocr is True, and the default ocr_options is OcrAutoOptions(), which probes the runtime at startup and picks EasyOCR when a GPU is available, Tesseract otherwise. For text-native PDFs the OCR step is a no-op, since Docling only OCRs pages or regions without an embedded text layer.
To pin a specific engine instead of letting Docling choose:
```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions()

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```
Tesseract needs the tesserocr extra and the system binary: pip install "docling[tesserocr]" plus apt-get install tesseract-ocr (or brew install tesseract on macOS). Other supported engines are RapidOcrOptions, OcrMacOptions (macOS only), and EasyOcrOptions if you want to force EasyOCR regardless of hardware.
If you know your inputs are clean text PDFs, set do_ocr = False. It cuts a noticeable chunk off the per-document runtime.
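If you are not sure whether a batch is text-native, a cheap pre-check can route documents before you commit to an OCR-free pipeline. This is my own crude heuristic, not part of Docling: text-native PDFs embed font resources, while pure image scans usually do not:

```python
def probably_text_native(pdf_path: str, sample_bytes: int = 2_000_000) -> bool:
    # Look for font objects in the raw bytes. A rough signal only --
    # robust detection needs a real PDF parser.
    with open(pdf_path, "rb") as f:
        data = f.read(sample_bytes)
    return b"/Font" in data
```

Documents that fail the check go through the default OCR-enabled converter; the rest skip OCR entirely.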
Common issues
Conversion is slow
You are almost certainly on CPU. Layout detection and TableFormer both want a GPU. If GPU is not an option, set TableFormerMode.FAST and disable OCR for text-native inputs. The Heron layout model became the default in v2.50 and is faster than the previous default, so if you are pinned to an older Docling version, upgrading is the easiest speedup.
ImportError: libGL.so.1
The classic Linux/Docker error, triggered by opencv-python on hosts without a graphics stack:
```shell
pip uninstall -y opencv-python opencv-python-headless
pip install --no-cache-dir opencv-python-headless
```
Or install the system library: apt-get install libgl1.
Models redownload every run
Your cache is not persisting between runs. The default location is ~/.cache/docling/models. In a Docker image, either mount that path as a volume or bake the models into the image with:
```shell
docling-tools models download
```
You can also override the location with the DOCLING_ARTIFACTS_PATH environment variable, or pass PdfPipelineOptions(artifacts_path="/srv/models") directly.
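Put together, a Dockerfile that bakes the models in might look like this sketch. The base image and paths are assumptions; verify the download location against your Docling version:

```dockerfile
FROM python:3.12-slim
RUN pip install --no-cache-dir docling --extra-index-url https://download.pytorch.org/whl/cpu
# Bake model weights into the image so containers start warm;
# by default they land in the user cache under ~/.cache/docling/models
RUN docling-tools models download
```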
It does not run on Lambda, Vercel, or Cloud Run
Docling pulls in PyTorch, OpenCV, and a few hundred megabytes of model weights. The full footprint sits in the 2-4 GB range, which puts you past the typical layer size limit before you start. There is no GPU on these platforms, and cold starts are unworkable even when the package fits. If your deployment target is serverless, do not try to package Docling into it. Run it on a long-lived server you control, or call it through a hosted API like Parsebridge.
When to self-host vs. use a hosted API
Self-host when you control your runtime and can leave a process resident. A long-lived container on Fly, Render, EC2, or your own hardware is what Docling is built for. If compliance rules out third-party APIs, self-hosting is the only option. At scale, a single GPU instance beats per-page pricing.
A hosted API makes more sense if you are deploying to serverless, or your volume is too low to justify a 24/7 instance with a GPU attached.
Full disclosure: I built Parsebridge for the hosted case. It runs Docling behind an API so a Lambda function or a Vercel route can call it without packaging the dependencies. If self-hosting fits your setup, do that instead. The library is good and free.
DocumentConverter().convert(source).document.export_to_markdown() covers most use cases. Drop into PdfPipelineOptions when you need to tune for speed, OCR, or table extraction. Run it against the messiest PDF you have on hand (a scanned contract, a financial filing with footnoted tables, a research paper with equations) and see what comes out. That tells you more than any benchmark.
Useful references: the Docling documentation, the GitHub repo, and GitHub Discussions for support. Questions on this post: support@parsebridge.com.