Running Docling on AWS Lambda, Vercel, and serverless platforms

What actually works when you try to run Docling on AWS Lambda or Vercel. Hard limits, cold starts, OCR filesystem traps, and the patterns that survive production.

Docling on AWS Lambda is possible only as a container image, and only if you accept multi-second cold starts and CPU-only execution. Docling on Vercel functions does not work at all. Docling serverless deployments that survive production usually run on something container-native like AWS Fargate, IBM Code Engine, or a hosted API. Everything below is the detail behind those three sentences: hard limits, the OCR filesystem traps, and the GitHub issues you will hit on the way.

If you have not parsed your first document yet, start with getting started with Docling in Python and come back when deployment becomes the question.

Why Docling Lambda zip deployments cannot work

AWS publishes the Lambda quotas directly:

Zip upload: 50 MB compressed, 250 MB unzipped (including layers).
Container image: 10 GB uncompressed.
Memory: 128 MB to 10,240 MB. CPU scales with memory, with one vCPU equivalent at 1,769 MB.
Timeout: 900 seconds.
Ephemeral /tmp: 512 MB to 10,240 MB.

Docling pulls in PyTorch, OpenCV, transformers, and a layout/table/OCR model set. A bare pip install docling on Python 3.12 lands around 2 GB on disk before you add any model weights. The 250 MB unzipped ceiling is not a soft limit you can negotiate with, so zip packaging is out before you write any code. See docling-project/docling#1134, where a user hits the 250 MB wall and switches to a container image.

Container deployment is the only Lambda path that works.

Lambda container images for Docling

The official docling-serve images give you a useful size benchmark:

docling-serve (arm64): 4.4 GB
docling-serve (amd64): 8.7 GB
docling-serve-cpu: 4.4 GB
docling-serve-cu128 (CUDA): 11.4 GB

The CUDA image is already over the 10 GB Lambda container ceiling, and Lambda does not offer GPUs in any case. The CPU image fits. Community reports in the same #1134 thread put a hand-rolled Lambda Docker image at around 3 GB after stripping everything not needed at runtime, with cold starts of 8 to 10 seconds on new instances.
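The packaging arithmetic is simple enough to check mechanically. The sketch below compares the sizes quoted in this article against the two Lambda packaging quotas. The function names are illustrative, and it treats the published image sizes as if they were uncompressed sizes, which is an assumption worth verifying against your own build.

```python
# Lambda packaging quotas quoted above (decimal units for simplicity).
ZIP_UNZIPPED_LIMIT_MB = 250
CONTAINER_IMAGE_LIMIT_GB = 10

# Sizes quoted in this article, in GB. Treated as uncompressed (assumption).
ARTIFACTS_GB = {
    "pip install docling, no weights": 2.0,
    "docling-serve (arm64)": 4.4,
    "docling-serve (amd64)": 8.7,
    "docling-serve-cpu": 4.4,
    "docling-serve-cu128": 11.4,
    "hand-rolled image from #1134": 3.0,
}


def fits_zip(size_gb: float) -> bool:
    """True if the artifact fits the 250 MB unzipped zip-deployment limit."""
    return size_gb * 1000 <= ZIP_UNZIPPED_LIMIT_MB


def fits_container(size_gb: float) -> bool:
    """True if the artifact fits the 10 GB container image limit."""
    return size_gb <= CONTAINER_IMAGE_LIMIT_GB


def packaging_report(artifacts: dict) -> dict:
    """Map each artifact name to its zip and container verdicts."""
    return {
        name: {"zip": fits_zip(size), "container": fits_container(size)}
        for name, size in artifacts.items()
    }


if __name__ == "__main__":
    for name, verdict in packaging_report(ARTIFACTS_GB).items():
        print(f"{name}: zip={verdict['zip']} container={verdict['container']}")
```

Nothing in the table fits the zip limit, and only the CUDA image fails the container limit.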

A minimal Dockerfile for Lambda looks roughly like this:
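The sketch below is one way to fill that in, not a tested build. It assumes the AWS-provided Python 3.13 base image, a handler module named app.py exposing a function called handler, and /opt/docling-models as the bake location; all three names are illustrative. The CPU-only PyTorch index keeps CUDA wheels out of the image, and depending on the OCR engine you may need extra system packages that are not shown here.

```dockerfile
# Assumption: AWS-provided Lambda base image for Python 3.13.
FROM public.ecr.aws/lambda/python:3.13

# CPU-only PyTorch wheels keep the image well under the 10 GB ceiling.
RUN pip install --no-cache-dir docling \
    --extra-index-url https://download.pytorch.org/whl/cpu

# Bake model weights into the image at build time.
# /opt/docling-models is a hypothetical location chosen for this sketch.
RUN docling-tools models download -o /opt/docling-models

# Load weights from the baked directory; send any runtime caches to /tmp,
# the only writable path on Lambda.
ENV DOCLING_ARTIFACTS_PATH=/opt/docling-models \
    HF_HOME=/tmp/hf

# app.py is a hypothetical module that defines handler(event, context).
COPY app.py ${LAMBDA_TASK_ROOT}/
CMD ["app.handler"]
```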

The docling-tools models download step bakes the model weights into the image. Without it, the first request after a cold start triggers a Hugging Face download, which then tries to write into /var/lang/lib/python3.13/site-packages/... and fails because Lambda’s package directories are read-only. Pre-baking models is non-optional.

HF_HOME and the OCR module paths must point at /tmp. Lambda’s filesystem layout is documented in Lambda configuration filesystem: /var/task and /var/lang are read-only, /tmp is writable. If any model loader tries to download or cache to a package-relative path, you get an OSError: [Errno 30] Read-only file system and the request fails.
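One way to enforce this from Python is to set the cache variables before anything imports Docling or its dependencies, since some libraries read them at import time. The sketch below is a minimal version of that idea. HF_HOME, TORCH_HOME, and EASYOCR_MODULE_PATH are the variables assumed to matter here; other OCR engines may use different settings, so check the one you pin.

```python
import os
from pathlib import Path


def redirect_caches(base: str = "/tmp") -> dict:
    """Point model caches at a writable directory and create the folders.

    Call this before importing docling, transformers, or an OCR engine.
    """
    root = Path(base)
    targets = {
        "HF_HOME": root / "hf",                  # Hugging Face cache
        "TORCH_HOME": root / "torch",            # torch hub cache
        "EASYOCR_MODULE_PATH": root / "easyocr", # assumed EasyOCR setting
    }
    for name, path in targets.items():
        path.mkdir(parents=True, exist_ok=True)
        os.environ[name] = str(path)
    return {name: str(path) for name, path in targets.items()}


def is_writable(path: str) -> bool:
    """True if this process can create files inside the directory."""
    return os.access(path, os.W_OK | os.X_OK)
```

Calling is_writable on each returned path at startup turns a mid-request Errno 30 into an immediate, obvious failure.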

The RapidOCR case is open right now. docling-project/docling#2507, opened October 22, 2025, tracks RapidOCR writing model assets to /var/lang/lib/python3.13/site-packages/rapidocr/models/... on Lambda. Pin a known-working OCR engine and version until it closes. Tesseract or a pinned EasyOCR build are the safe choices.

The handler should not call DocumentConverter() inside the function body. Build it once at module scope so warm invocations skip the model load:
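A minimal sketch of that shape follows. The event field source is a hypothetical name for the path or URL to parse, and the response format is illustrative. The guard on AWS_LAMBDA_FUNCTION_NAME, a variable the Lambda runtime sets, lets the module be imported on a machine without Docling while still building the converter at module scope inside Lambda.

```python
import json
import os


def build_converter():
    """Construct the Docling converter. This is the expensive model load."""
    # Imported here so the module still imports where docling is absent.
    from docling.document_converter import DocumentConverter
    return DocumentConverter()


def make_handler(converter):
    """Return a Lambda handler that reuses one already-built converter."""
    def handler(event, context):
        # event["source"] is a hypothetical field: a local path or a URL.
        result = converter.convert(event["source"])
        markdown = result.document.export_to_markdown()
        return {"statusCode": 200, "body": json.dumps({"markdown": markdown})}
    return handler


# Module scope: inside Lambda this runs once per execution environment,
# during init, so warm invocations reuse the same converter.
if "AWS_LAMBDA_FUNCTION_NAME" in os.environ:
    handler = make_handler(build_converter())
```

Separating make_handler from build_converter also makes the handler easy to unit test with a stub converter.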

Do not run this with less than 4 GB of memory. TableFormer alone will blow through 2 GB on a moderately complex PDF. Set memory to at least 4096 MB, and go to 8192 MB or 10240 MB if you see OOM kills. CPU scales with memory on Lambda at one vCPU equivalent per 1,769 MB, so 4096 MB already buys a little over two vCPUs and the higher settings add more.
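The memory-to-CPU relationship can be estimated from the quota quoted earlier. The sketch below assumes CPU scales linearly at one vCPU equivalent per 1,769 MB, which is an approximation rather than a documented formula.

```python
MB_PER_VCPU = 1769  # one vCPU equivalent, per the Lambda quota above
MIN_MEMORY_MB, MAX_MEMORY_MB = 128, 10_240


def estimated_vcpus(memory_mb: int) -> float:
    """Linear estimate of vCPU equivalents for a Lambda memory setting."""
    if not MIN_MEMORY_MB <= memory_mb <= MAX_MEMORY_MB:
        raise ValueError(f"memory must be {MIN_MEMORY_MB}-{MAX_MEMORY_MB} MB")
    return memory_mb / MB_PER_VCPU


if __name__ == "__main__":
    for setting in (4096, 8192, 10_240):
        print(setting, round(estimated_vcpus(setting), 1))
```

Under that assumption, 4096 MB works out to about 2.3 vCPUs, 8192 MB to about 4.6, and 10240 MB to about 5.8.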

Lambda caps execution at 900 seconds. A clean text PDF parses in single-digit seconds on Lambda CPU, but a 100-page scanned contract with OCR can hit that wall. For anything user-uploaded with no length bound, put SQS in front and parse asynchronously. Cold starts land in the 8 to 10 second range for a fresh container, per the #1134 thread. Provisioned concurrency cuts that, but it also undercuts the cost argument that brought you to Lambda in the first place.
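With SQS in front, the handler receives batches of records and can hand individual messages back to the queue. The sketch below shows that loop. It assumes partial batch responses are enabled on the event source mapping, that each message body is a JSON job description, and that parse is your own function; the 30-second margin is an arbitrary illustrative value.

```python
import json

SAFETY_MARGIN_MS = 30_000  # hypothetical: take no new work with <30 s left


def process_batch(event, context, parse):
    """Run parse(job) for each SQS record and report the ones to retry."""
    failures = []
    for record in event.get("Records", []):
        message_id = record["messageId"]
        # Too little time budget left: return the message to the queue
        # instead of starting a parse that the timeout would kill.
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            failures.append({"itemIdentifier": message_id})
            continue
        try:
            parse(json.loads(record["body"]))
        except Exception:
            # Any failure marks only this message for redelivery.
            failures.append({"itemIdentifier": message_id})
    return {"batchItemFailures": failures}
```

The time check only prevents starting new work. It cannot interrupt a parse that is already running, so a single document that needs more than 900 seconds still requires a different runtime.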

An alternative to baking weights into the image is mounting an EFS filesystem at /mnt/models and pointing DOCLING_ARTIFACTS_PATH at it. That keeps the image smaller and lets you swap models without rebuilding, at the cost of VPC plumbing, an access point, security groups, IAM, and network filesystem latency on cold start. For most teams, baking models in is simpler.
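Whichever option you pick, it helps to fail fast when the model directory is missing rather than fall through to a runtime download. The sketch below prefers the EFS mount and falls back to a baked directory. The path /opt/docling-models is a hypothetical bake location, and the helper names are illustrative.

```python
import os
from pathlib import Path


def select_artifacts_path(candidates):
    """Return the first candidate that is an existing, non-empty directory."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir() and any(path.iterdir()):
            return str(path)
    return None


def configure_artifacts(candidates=("/mnt/models", "/opt/docling-models")):
    """Set DOCLING_ARTIFACTS_PATH, or raise if no model directory exists."""
    chosen = select_artifacts_path(candidates)
    if chosen is None:
        raise RuntimeError("no model directory found; refusing to download")
    os.environ["DOCLING_ARTIFACTS_PATH"] = chosen
    return chosen
```

Call configure_artifacts before constructing the converter so the variable is in place when the models load.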

Docling on Vercel: do not try

Vercel functions are governed by the Vercel function limits:

Bundle size: 250 MB uncompressed for Node, 500 MB for Python.
Memory: 2 GB on Hobby, 4 GB on Pro and Enterprise.
Duration with Fluid Compute: 300 seconds on Hobby, 800 seconds on Pro and Enterprise.

Docling does not fit in 500 MB even with aggressive pruning. There is no container deployment model for Vercel functions, so the workarounds available on Lambda do not apply. The memory ceiling is also tight: 4 GB is the upper bound for any Vercel function, and that is the lower bound for Docling running anything beyond toy PDFs.

The right pattern on Vercel is to keep the function thin. Upload to S3 or R2, enqueue a job, and call out to wherever the actual parsing runs. Your Next.js route handler should be a few hundred lines of orchestration, not a PyTorch host.

That pattern works regardless of whether the parsing endpoint is your own Fargate service, an IBM Code Engine job, or a hosted Docling API.
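The thin function amounts to building one request and forwarding it. The sketch below shows that shape in Python, Docling's own language; the same structure applies to a Next.js route handler. The endpoint URL, the JSON payload with a url field, and the bearer-token header are all hypothetical and depend on the parsing service you call.

```python
import json
import urllib.request


def build_parse_request(file_url: str, endpoint: str, api_key: str):
    """Build the POST that hands an uploaded file's URL to the parser."""
    # The payload shape is hypothetical; match it to your parsing service.
    payload = json.dumps({"url": file_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def submit(request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON reply (for example a job id)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The function never touches the file bytes or any model, so it stays well inside the bundle and memory limits listed above.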

Better serverless fits than Lambda

If the reason you want serverless is that you do not want to run a server 24/7, Lambda is not the only option.

IBM Code Engine is the closest thing to a Lambda equivalent that actually fits Docling. The Code Engine limits allow up to 48 GB memory and 48 GB ephemeral storage per app or job, with job timeouts up to 24 hours. IBM also announced Serverless Fleets with GPUs in October 2025, aimed at exactly this kind of batch ML workload. Code Engine is container-native and does not enforce a Lambda-style read-only package directory, so the OCR filesystem problems from #2507 do not exist there.

AWS Fargate and AWS Batch give you long-running task definitions, no zip limits, no /tmp-only writes, and the option of a GPU through ECS on EC2 if you need TableFormer at speed. The cost story is worse than Lambda for spiky workloads and better for sustained ones, and the operational story is much saner.

Full disclosure on the third option: a hosted Docling API is what Parsebridge sells. We run Docling behind an API so a Lambda function or Vercel route can call it without packaging any of this. If you have the volume and ops capacity for a long-lived GPU server, self-hosting on Fargate or your own hardware is cheaper at scale. If you are deploying to serverless and just want the Markdown out, a hosted endpoint is the smaller commitment.

What to actually do

If you are on Lambda and the inputs are bounded text PDFs, build the container image, bake the models, set memory to 4 GB or 8 GB, and accept the cold start. Put SQS in front for anything async, and add S3 for inputs and outputs. Pin your OCR engine and version, and watch the #2507 issue for the RapidOCR fix.

If you are on Vercel, do not run Docling inside the function. Move the parser to a long-lived process or call a hosted API from the route handler.

If you have not picked a target platform yet and your traffic is bursty, IBM Code Engine and AWS Fargate take more attention up front but cause less pain later than Lambda. Both give you the container-native runtime Docling actually wants.

For the hosted route, the Parsebridge Docling API takes a file or URL and returns Markdown with the structure intact, so the Lambda function or Vercel route stays under a hundred lines and the model loading stops being your problem.