Docling Serve is the official way to run Docling in Docker. It ships container images on GHCR and Quay, and the smallest working command is one line: docker run -p 5001:5001 quay.io/docling-project/docling-serve
That gets you a FastAPI server on port 5001 with the full Docling pipeline behind it. Everything that matters comes after that line: which image you pick, where the models live, how much memory you give it, and which environment variables you set before real traffic arrives.
This post is the production checklist. Version v1.18.0, released May 7, 2026, is the reference throughout.
Picking a Docling Docker image
There are four official images, mirrored on ghcr.io/docling-project/... and quay.io/docling-project/...:
| Image | Torch flavor | Arch | Approx size |
|---|---|---|---|
| docling-serve | PyPI torch | amd64, arm64 | 8.7 GB (amd64) / 4.4 GB (arm64) |
| docling-serve-cpu | CPU-only torch wheels | amd64, arm64 | 4.4 GB |
| docling-serve-cu128 | CUDA 12.8 wheels | amd64 | 11.4 GB |
| docling-serve-cu130 | CUDA 13.0 wheels | amd64, arm64 | larger |
Use docling-serve-cpu if you do not have a GPU. The base docling-serve image is roughly twice the size on amd64 because the PyPI torch build pulls CUDA libraries you will not use. On a CPU host they are dead weight.
For GPU, match the tag to your driver. CUDA 12.8 wheels need a host driver that supports the 12.8 runtime, CUDA 13.0 wheels need 13.0. The CUDA images intentionally do not publish a latest tag, so pin explicitly: docling-serve-cu128:v1.18.0. Use :main only for testing; it tracks unreleased commits.
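The selection rules above are mechanical enough to encode in a deploy script. A minimal sketch, assuming a hypothetical helper named pick_image that returns only the image name and tag (the registry prefix is left to the caller, and the amd64-only restriction on cu128 is not checked):

```python
from typing import Optional

# Image names as listed in the table above.
CPU_IMAGE = "docling-serve-cpu"
CUDA_IMAGES = {"12.8": "docling-serve-cu128", "13.0": "docling-serve-cu130"}


def pick_image(cuda_runtime: Optional[str], tag: str) -> str:
    """Return 'name:tag' for a host; cuda_runtime is None on CPU-only hosts."""
    if cuda_runtime is None:
        return f"{CPU_IMAGE}:{tag}"
    if cuda_runtime not in CUDA_IMAGES:
        raise ValueError(f"no official image for CUDA {cuda_runtime}")
    if tag == "latest":
        # The CUDA images do not publish a latest tag, so insist on a pin.
        raise ValueError("CUDA images have no 'latest' tag; pin a release")
    return f"{CUDA_IMAGES[cuda_runtime]}:{tag}"


print(pick_image(None, "v1.18.0"))    # docling-serve-cpu:v1.18.0
print(pick_image("12.8", "v1.18.0"))  # docling-serve-cu128:v1.18.0
```

Failing loudly on latest keeps an unpinned tag from slipping into a GPU deployment through a copied config.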
Persistent model cache
The first request triggers a download of layout, table, and OCR weights. In the official containers they land at /opt/app-root/src/.cache/docling/models. If your container is ephemeral (Cloud Run, Fly Machines, anything with a fresh disk per boot), every cold start will redownload several hundred megabytes from Hugging Face and your readiness check will fail under any timeout you would actually want.
Two options. Bake the models into the image:
docling-tools models download --all pulls every model variant, including multiple OCR engines and the VLM weights. Use it if you want to avoid any runtime download under any pipeline configuration. The default download is enough for the standard pipeline.
The alternative to baking is mounting a persistent volume and prewarming it once, which is what the Compose setup below does.
Two gotchas worth knowing either way. First, mounting an empty named volume over a path that already contains baked-in models hides them; the volume is empty until you populate it. Second, DOCLING_SERVE_ARTIFACTS_PATH is the env var for Docling Serve and DOCLING_ARTIFACTS_PATH is for the underlying Docling library. Set the right one for the layer you are configuring.
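The empty-volume gotcha is cheap to catch before the server starts. A minimal sketch of a pre-start check, assuming a hypothetical helper named cache_is_populated; it only confirms that the directory holds at least one file, not that the right models are present:

```python
from pathlib import Path


def cache_is_populated(models_dir: str) -> bool:
    """True if models_dir exists and contains at least one regular file."""
    root = Path(models_dir)
    if not root.is_dir():
        return False
    # An empty named volume mounted over the path shows up as an empty directory.
    return any(p.is_file() for p in root.rglob("*"))


if __name__ == "__main__":
    # Path used by the official containers, per the text above.
    path = "/opt/app-root/src/.cache/docling/models"
    print("model cache populated:", cache_is_populated(path))
```

Running something like this in an entrypoint wrapper turns a silent redownload into an explicit startup failure or warning.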
Docling Docker Compose setup
The Compose pattern is a one-shot model-cache service that exits after the download, plus the long-running server that mounts the populated volume read-only:
The exec-form healthcheck avoids shell quote nesting. start_period: 90s is the important number: model load on cold disk takes 30 to 90 seconds, and a shorter grace period will mark the container unhealthy before it ever serves a request. stop_grace_period: 2m belongs under the service, not at the top level; behind a rolling update it is the difference between zero-downtime deploys and a wave of 502s while in-flight conversions get killed.
Health and readiness
Docling Serve registers four health-related endpoints in docling_serve/app.py:
- /health and /livez: liveness, return immediately
- /ready and /readyz: readiness, verify models are loaded and the orchestrator is reachable
- /metrics: Prometheus
- /version: package versions, unless disabled
/livez and /readyz are the Kubernetes-conventional aliases and are hidden from the OpenAPI schema, but they exist. Wire your load balancer to /ready, not /health. Model load can take 30 to 90 seconds depending on disk and CPU; if you let traffic in based on liveness alone, the first few requests pile up and time out.
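The same rule applies to deploy scripts and smoke tests: gate on readiness, not liveness. A minimal sketch using only the standard library; the function names, the 120-second deadline (chosen to sit above the 30 to 90 second load window), and the localhost URL in the usage note are assumptions for illustration:

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True when a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx answers all count as not ready.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    deadline_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it returns True or deadline_s elapses."""
    give_up_at = clock() + deadline_s
    while True:
        if probe():
            return True
        if clock() >= give_up_at:
            return False
        sleep(interval_s)
```

Typical use would be wait_until_ready(lambda: http_probe("http://localhost:5001/ready")) before sending the first conversion request. The probe, sleep, and clock are injectable so the loop can be exercised without a running server.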
GPU passthrough
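You need three things on the host: an NVIDIA driver that matches the CUDA wheels in the image, the nvidia-container-toolkit, and the --gpus all flag (or the Compose equivalent). Then point Docling at the device by setting DOCLING_DEVICE=cuda in the container environment.
In Compose, the equivalent of --gpus all is a device reservation under deploy.resources (driver nvidia, capability gpu), set alongside the same DOCLING_DEVICE variable.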
Some hosts (older Docker, certain orchestrators) still need runtime: nvidia instead of or alongside the deploy.resources block. If nvidia-smi works on the host and inside the container, the GPU is wired up. If Docling still runs on CPU, check that the image is a cu* tag and that DOCLING_DEVICE=cuda is set.
Sizing
Docling is memory-hungry. The model graph is several GB once loaded, and a single OCR-heavy page can spike well past that. Real numbers to start from:
- Smoke test: 2 vCPU, 4 to 8 GB RAM
- CPU production, low concurrency: 4 vCPU, 8 to 16 GB RAM
- CPU production, OCR-heavy or large PDFs: 8+ vCPU, 16 to 32 GB RAM
- GPU production: 4 to 8 vCPU, 16 GB+ RAM, 8 GB+ VRAM
- VLM workloads: 16 GB RAM floor, GPU memory determined by the chosen model
This is more than the docs imply. In docling-serve#366 a 4.4 MB PDF on a g5.xlarge (4 vCPU, 16 GB) drove resident memory to roughly 12 GB and held around 11 GB after the request finished. In docling#2635 a 5-page VLM conversion on an 8 CPU / 16 GB Compose limit timed out past 15 minutes. Small caps look fine in dev and quietly OOM in prod.
Load test with the largest, ugliest PDFs you actually expect. The dev fixtures will lie to you.
Workers and concurrency
Stay at one Uvicorn worker until you have evidence you need more. Each worker loads its own copy of the model graph unless you opt in to sharing, so two workers means roughly twice the memory for no extra throughput on a single-document bottleneck.
The env vars worth knowing:
DOCLING_NUM_THREADS and OMP_NUM_THREADS cap the thread pools used inside conversion. Set them at or below your container’s CPU limit, otherwise PyTorch oversubscribes and you spend cycles in context switches. DOCLING_SERVE_ENG_LOC_NUM_WORKERS is the local engine’s worker count for parallel conversions; pair it with SHARE_MODELS=true to keep memory sane.
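One way to keep the thread caps in step with the container limit is to derive them at startup. A minimal sketch, assuming a cgroup v2 host where the limit is exposed in a cpu.max file (typically /sys/fs/cgroup/cpu.max); the helper names are hypothetical, and the resulting values have to be exported before the server process starts:

```python
import math
import os
from typing import Dict, Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Parse cgroup v2 cpu.max ('<quota> <period>' or 'max <period>') into CPUs."""
    quota, period = text.split()
    if quota == "max":
        return None  # no limit set on this cgroup
    return int(quota) / int(period)


def thread_env(cpu_limit: Optional[float]) -> Dict[str, str]:
    """Thread caps at or below the CPU limit, never less than 1."""
    if cpu_limit is None:
        cpus = os.cpu_count() or 1
    else:
        cpus = max(1, math.floor(cpu_limit))
    return {"DOCLING_NUM_THREADS": str(cpus), "OMP_NUM_THREADS": str(cpus)}


print(thread_env(parse_cpu_max("400000 100000")))
# {'DOCLING_NUM_THREADS': '4', 'OMP_NUM_THREADS': '4'}
```

Rounding down is deliberate: a limit of 1.5 CPUs gets one thread, which stays under the quota instead of oversubscribing it.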
Production env vars
The full list lives in the Docling Serve configuration reference. The handful of limits that prevent bad clients from holding the server hostage:
MAX_SYNC_WAIT is the one most people miss. The synchronous endpoint holds the connection until the document finishes; without a cap, a 500-page scan can park a worker for an hour. Raising the cap is fine; making it explicit is the point.
For real production, skip the synchronous endpoint and use the async API. It returns a job id immediately and you poll. That moves the “this request takes 8 minutes” problem out of your load balancer’s timeout window.
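The submit-then-poll loop looks roughly like this on the client side. This is a sketch, not the official client: the route paths, the task_id and task_status field names, and the status strings are assumptions to verify against your server's OpenAPI schema, and the HTTP call is injected as a hypothetical callable so the loop itself stays independent of any HTTP library:

```python
import time
from typing import Any, Callable, Dict

# Assumed route shapes and status strings; check them against your server's schema.
SUBMIT_PATH = "/v1/convert/source/async"
POLL_PATH = "/v1/status/poll/{task_id}"
DONE, FAILED = "success", "failure"

# (method, path, body) -> parsed JSON response
HttpJson = Callable[[str, str, Any], Dict[str, Any]]


def convert_async(
    http: HttpJson,
    payload: Dict[str, Any],
    poll_every_s: float = 2.0,
    give_up_after_s: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Submit a job, then poll its status until it finishes or the budget runs out."""
    task_id = http("POST", SUBMIT_PATH, payload)["task_id"]
    waited = 0.0
    while True:
        status = http("GET", POLL_PATH.format(task_id=task_id), None)
        if status["task_status"] == DONE:
            return status
        if status["task_status"] == FAILED:
            raise RuntimeError(f"task {task_id} failed")
        if waited >= give_up_after_s:
            raise TimeoutError(f"task {task_id} still running after {waited:.0f}s")
        sleep(poll_every_s)
        waited += poll_every_s
```

Each poll is a short request, so no single connection has to outlive the load balancer's timeout; the client-side budget (give_up_after_s) is the only long timer left.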
Logs go to stdout/stderr through Python logging and Uvicorn. INFO is the right default; drop to WARNING once you trust the deployment, bump to DEBUG only when chasing something specific.
Scaling past one container
A single container with the limits above will hold up under controlled traffic. Past that, the next step is a queue. DOCLING_SERVE_ENG_KIND selects the async engine: local (default, in-process), rq (Redis Queue), or kfp (Kubeflow Pipelines).
The RQ engine is Redis-backed and is the easier one to operate. Recommended Redis pool sizing:
- 1 to 4 workers: default pool, 50 connections
- 5 to 10 workers: DOCLING_SERVE_ENG_RQ_REDIS_MAX_CONNECTIONS=100
- 10+ workers: 150 to 200
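If worker counts are set by a deploy script, the pool tiers above can be derived rather than hand-edited. A minimal sketch with a hypothetical helper; picking 150 for the top tier is a choice made here (the lower end of the suggested range), not something the guidance fixes:

```python
from typing import Dict

VAR = "DOCLING_SERVE_ENG_RQ_REDIS_MAX_CONNECTIONS"


def redis_pool_env(workers: int) -> Dict[str, str]:
    """Map an RQ worker count to the pool-size override suggested above."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if workers <= 4:
        return {}  # the default pool of 50 connections is enough
    if workers <= 10:
        return {VAR: "100"}
    # Above 10 workers the suggestion is 150 to 200; this sketch picks the lower bound.
    return {VAR: "150"}
```

Returning an empty mapping for small deployments leaves the default untouched instead of pinning it.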
The KFP engine dispatches each conversion as a Kubeflow Pipelines run and uses a self-callback URL so the pipeline can post status back to Docling Serve. In-cluster it discovers the KFP endpoint at https://NAME.NAMESPACE.svc.cluster.local:8888 and picks up a ServiceAccount token automatically. It is the right pick if you already operate KFP. If not, run RQ.
At this point you are not running Docling in Docker so much as running a small distributed system around Docling. That is fine, and the official engines do most of the work. It is also where most teams decide whether they want to keep operating it.
If you would rather not run a model cache, GPU drivers, queue workers, health probes, and autoscaling yourself, Parsebridge runs Docling Serve as a hosted API. Same library, same output, no Compose file.