Running the Docling API yourself with Docling Serve

How to host the Docling REST API with Docling Serve. Install, call /v1/convert/source, run async jobs, and survive the production gotchas.

Docling Serve is the official way to host the Docling API on your own infrastructure. It wraps the Docling PDF parser in a FastAPI service so other apps can hit it over HTTP. Version 1.18.0 shipped on May 7, 2026 alongside docling 2.93.0, and the stable Docling REST API now lives under /v1. Anything you find online that still posts to /v1alpha is dead.

This post covers installing the server, the working request and response shape, auth, async jobs, and the production issues you will hit in the first week.

Installing Docling Serve

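Two install paths. Pip:

pip install "docling-serve[ui]"
docling-serve run --enable-ui

Docker, with the prebuilt image:

docker run -p 5001:5001 -e DOCLING_SERVE_ENABLE_UI=1 quay.io/docling-project/docling-serve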

Either way you get three things on port 5001:

http://127.0.0.1:5001 for the API
http://127.0.0.1:5001/docs for the OpenAPI spec
http://127.0.0.1:5001/ui for a browser playground

Official images live at quay.io/docling-project/docling-serve and ghcr.io/docling-project/docling-serve. The base image is about 4.4 GB on arm64 and 8.7 GB on amd64 because PyTorch and the layout and table models are baked in. A CPU-only variant (docling-serve-cpu) is around 4.4 GB. The CUDA image (docling-serve-cu128) is roughly 11.4 GB and does not publish a latest tag, so pin a version or use main.

Your first conversion

Two sync endpoints: POST /v1/convert/source for a URL, POST /v1/convert/file for multipart upload.
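As a rough sketch of the URL case, the snippet below builds and sends a POST /v1/convert/source request with nothing but the Python standard library. The helper names, the url key inside the source entry, and the 130-second client timeout are assumptions made for illustration, so check the exact field names against the OpenAPI spec at /docs.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5001"  # default local address of the server


def build_convert_request(url, to_formats=("md",), base_url=BASE_URL):
    """Build (but do not send) a POST /v1/convert/source request for one URL."""
    payload = {
        "options": {"to_formats": list(to_formats)},
        # One entry per document; "url" is the assumed key for http sources.
        "sources": [{"kind": "http", "url": url}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/convert/source",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert(url, to_formats=("md",), timeout=130):
    """Send the request to a running server and return the decoded JSON body."""
    request = build_convert_request(url, to_formats)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Splitting the builder from the sender keeps the payload easy to inspect or log before anything goes over the wire. With a server running, convert("https://example.com/sample.pdf") would perform the actual call; the example.com URL is a placeholder.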

The response is JSON. Its status field is one of success, partial_success, skipped, or failure. The format fields are populated only for the formats you asked for in options.to_formats, so request md and html together if you need both.
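A minimal sketch of handling that response, assuming the converted formats arrive under a document object in fields named like md_content and html_content, with problems reported in an errors list. Those field names are assumptions to confirm at /docs, and the sample dictionary is hand-written, not captured from a server.

```python
def extract_outputs(response):
    """Return the populated *_content fields from a convert response.

    Raises RuntimeError when the conversion produced no usable output.
    """
    status = response.get("status")
    if status not in ("success", "partial_success"):
        errors = response.get("errors")
        raise RuntimeError(f"conversion ended with status {status!r}: {errors}")
    document = response.get("document") or {}
    # Formats that were not requested come back empty, so drop them.
    return {
        name: content
        for name, content in document.items()
        if name.endswith("_content") and content
    }


# Hand-written example: only Markdown was requested.
sample = {
    "status": "success",
    "document": {"md_content": "# Title", "html_content": None},
    "errors": [],
}
outputs = extract_outputs(sample)
```

Treating partial_success as usable is a design choice; stricter pipelines may prefer to reject it.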

If you’re porting from a v1alpha example, the payload shape changed. http_sources and file_sources are gone, replaced by a single sources array where each entry has a kind of http or file. The v1 migration notes have the full diff.

For a local file you have two options: a multipart upload to POST /v1/convert/file, or the file sent inline as base64 to the same /v1/convert/source endpoint.
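The inline route could look roughly like this. The sketch assumes a file source entry carries the encoded bytes in base64_string alongside a filename; both key names and the helper names are assumptions to verify against /docs.

```python
import base64
import json


def file_source(filename, data):
    """Wrap raw file bytes as a 'file' source entry (assumed key names)."""
    return {
        "kind": "file",
        "base64_string": base64.b64encode(data).decode("ascii"),
        "filename": filename,
    }


def inline_payload(filename, data, to_formats=("md",)):
    """JSON body for POST /v1/convert/source carrying one inline file."""
    return json.dumps(
        {
            "options": {"to_formats": list(to_formats)},
            "sources": [file_source(filename, data)],
        }
    )
```

Base64 inflates the body by about a third, so multipart upload is usually the better fit for large files.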

Options that matter

Everything Docling supports in Python is exposed through the options block. The knobs you actually reach for:

from_formats, to_formats
do_ocr, force_ocr, ocr_lang, ocr_preset
pdf_backend
table_mode: fast or accurate
page_range
document_timeout, abort_on_error
include_images, images_scale

Turn do_ocr off for text-native PDFs to cut a noticeable chunk of per-document latency. Use table_mode: "fast" when throughput matters more than perfect table fidelity. The full list lives in the usage docs.
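To make the two tips above concrete, here is one possible options block for text-native PDFs. The option names come from the list above; the helper name is hypothetical, and the [first, last] shape for page_range is an assumption to confirm in the usage docs.

```python
def text_native_pdf_options(first_page=None, last_page=None):
    """Options tuned for throughput on PDFs that already have a text layer."""
    options = {
        "to_formats": ["md"],
        "do_ocr": False,       # skip OCR, the text layer is already there
        "table_mode": "fast",  # trade some table fidelity for speed
    }
    if first_page is not None and last_page is not None:
        # Assumed shape: a [first, last] pair of page numbers.
        options["page_range"] = [first_page, last_page]
    return options
```

The returned dictionary goes under the options key of the request body.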

Authentication

Off by default. Gate the API behind a shared secret by setting an env var on the server:

DOCLING_SERVE_API_KEY=changeme docling-serve run

Then pass it in every request, in the X-Api-Key header.

That is the whole auth story. One shared key, no per-tenant credentials, no rate limiting, no quota tracking. If you need any of that you build it in front of Docling Serve, not inside it.
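On the client side this amounts to one extra header. The sketch below assumes the key travels in X-Api-Key, which is worth confirming against your server version; the helper name is hypothetical.

```python
import json
import urllib.request


def authed_post(url, payload, api_key):
    """Build a JSON POST request that carries the shared API key."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Api-Key": api_key,  # assumed header name for the shared secret
        },
        method="POST",
    )
```

Read the key from your own secret store or environment at call time instead of hard-coding it next to the request.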

Async jobs

Sync requests have a hard ceiling at DOCLING_SERVE_MAX_SYNC_WAIT, default 120 seconds. A long PDF will blow through that. For anything that might run long, use async.

Submit the same payload to the async endpoint, POST /v1/convert/source/async.

For multipart uploads, the async equivalent is POST /v1/convert/file/async, same form fields as the sync version.

You get back a task_id. Poll it:

curl http://127.0.0.1:5001/v1/status/poll/{task_id}

Or open a websocket at /v1/status/ws/{task_id} and skip polling. When status is success, fetch the result:

curl http://127.0.0.1:5001/v1/result/{task_id}

These jobs live in-process. There is no Redis or persistent queue. Restart the container and pending tasks vanish. If that is a problem for you, you need a real queue in front (which is what drmingler/docling-api does with Celery, more on that below).
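The submit, poll, fetch cycle can be sketched as follows. It assumes the poll response reports progress in a task_status field with the values success and failure; that field name, the helper names, and the polling interval are assumptions, so adjust them to what /docs shows for your version.

```python
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:5001"


def http_json(url, payload=None, timeout=30):
    """GET url, or POST payload as JSON when one is given; return decoded JSON."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {} if payload is None else {"Content-Type": "application/json"}
    request = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def convert_async(payload, call=http_json, base_url=BASE_URL,
                  interval=2.0, max_polls=300):
    """Submit a job, poll until it finishes, and return the result body."""
    task_id = call(f"{base_url}/v1/convert/source/async", payload)["task_id"]
    for _ in range(max_polls):
        status = call(f"{base_url}/v1/status/poll/{task_id}")
        state = status.get("task_status")  # assumed field name
        if state == "success":
            return call(f"{base_url}/v1/result/{task_id}")
        if state == "failure":
            raise RuntimeError(f"task {task_id} failed: {status}")
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} still running after {max_polls} polls")
```

The call parameter exists so the loop can be exercised with a stub in tests. Because tasks are lost on restart, callers may also want to treat an unknown task_id as a signal to resubmit.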

Production reality

A few things will bite you in the first week.

Cold starts

The image is multi-GB and PyTorch takes time to initialize. DOCLING_SERVE_LOAD_MODELS_AT_BOOT defaults to True, which warms the first request at the cost of slower startup. On a fresh container, expect 30 to 60 seconds before the first conversion completes.

Memory floor

The CPU image is 4.4 GB on disk, and resident model memory pushes practical RAM to about 8 GB for a single worker. VLM pipelines, OCR with image upscaling, or high batch sizes need more. Running multiple uvicorn workers multiplies model memory; it does not amortize it.

Model cache

Point DOCLING_SERVE_ARTIFACTS_PATH at a directory on a persistent volume. Otherwise every deploy redownloads weights and adds minutes to the first request after rollout.

GPU not picked up

The base image runs CPU code paths. To use a GPU you need the docling-serve-cu128 image and a runtime that exposes the device. Even then, GPU OCR is uneven. The official GPU guide calls out RapidOCR with the torch backend as the known-working path. Layout and table models pick up the GPU automatically.

Throughput

Community-reported throughput across hardware: without OCR, roughly 1.2 to 1.5 pages/sec on CPU. The same workload hits about 4.2 pages/sec on an RTX 5070 and 7.9 pages/sec on an RTX 5090. AWS g6e.2xlarge with an L40S sits around 3.1 pages/sec. The GraniteDocling VLM pipeline is slower: about 2.4 pages/sec on g6e.2xlarge, 3.8 on a 5090, 2.0 on a 5070. Verify on your own hardware before sizing.
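For a first sizing pass, those rates translate into wall-clock time with simple arithmetic. The sketch assumes a sustained rate with no queueing, retries, or cold starts, and the corpus size is a hypothetical figure.

```python
def hours_to_convert(total_pages, pages_per_sec):
    """Wall-clock hours to convert total_pages at a sustained rate."""
    if pages_per_sec <= 0:
        raise ValueError("pages_per_sec must be positive")
    return total_pages / pages_per_sec / 3600


corpus_pages = 1_000_000  # hypothetical corpus size
for label, rate in [("CPU, low end", 1.2), ("RTX 5070", 4.2), ("RTX 5090", 7.9)]:
    print(f"{label}: {hours_to_convert(corpus_pages, rate):.1f} h")
```

Swap in the rate you measure on your own hardware before committing to an instance type.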

Concurrency

Configure with env vars, not CLI flags. CLI flags get silently dropped under multi-worker or reload modes, which is the kind of bug that costs an afternoon.

Common errors

`404` on `/v1alpha/...`

Stale tutorial. Move to /v1 and switch to the sources payload shape.

`504` from a sync endpoint

You exceeded DOCLING_SERVE_MAX_SYNC_WAIT. Move to /v1/convert/source/async.

`ImportError: libGL.so.1`

The classic OpenCV native dependency miss on slim base images. Use the official Docling Serve image, install libgl1, or swap to opencv-python-headless.

GPU visible but unused

You are on the base CPU image, or ONNX Runtime fell back to its CPU provider. Pull docling-serve-cu128 and verify with nvidia-smi from inside the container.

Memory grows over time

A known pattern in long-running workers. Set worker recycle limits in uvicorn or gunicorn and monitor RSS. The docling-serve issue tracker has open reports.

What about drmingler/docling-api?

The other thing you find on GitHub is drmingler/docling-api, a FastAPI app that wraps Docling with Celery, Redis, and Flower. The surface is slightly different: /documents/convert, /conversion-jobs, /batch-conversion-jobs. Worth knowing it exists, especially if you already run Celery and want batching out of the box. The tradeoffs are that the API is unofficial and there are more moving parts. For most setups today, the official docling-serve is the sharper choice.

When to self-host vs use a hosted Docling API

Self-host when you have a long-lived host, control over the runtime, and ideally a GPU. Compliance constraints push you here too, since running Docling Serve in your own VPC gives you data residency for free. At high enough volume, a single GPU instance beats per-page pricing.

A hosted Docling API makes sense when your caller is serverless (Lambda, Vercel, Cloud Run) and cannot ship a 4 to 8 GB image, or when volume is too low to justify keeping a GPU instance warm 24/7. IBM’s Serverless Fleets with GPUs on Code Engine covers batch corpus conversion well, but it is not built for low-latency request and response.

If running Docling Serve sounds like more ops than you want to own, Parsebridge runs the Docling API as a managed service. Same parser, no PyTorch in your Dockerfile.