Docling vs LlamaParse vs Marker for PDF parsing

A criteria-first comparison of Docling, LlamaParse, and Marker for parsing PDFs in RAG pipelines. Licensing, output, cost per 1k pages, and what to actually test.

Three tools dominate the PDF-parser-for-RAG conversation: Docling, LlamaParse, and Marker. They win on different axes, so a Docling vs LlamaParse vs Marker comparison has no overall winner. The right pick depends on licensing, deployment target, and the worst document in your corpus.

At a glance

Criterion	Docling	LlamaParse	Marker
Latest version	`docling 2.94.0` (May 18, 2026)	Hosted, no public engine version	`marker-pdf 1.10.2` (Jan 31, 2026)
License	MIT code, model licenses vary	Commercial SaaS	GPLv3 code, model weights have commercial restrictions
Hosting	Local, self-host, Docling Serve	SaaS, enterprise VPC	Local, Datalab hosted/on-prem
Output	Markdown, HTML, JSON, DocTags, text	Markdown, JSON, XLSX, HTML tables, annotated PDF	Markdown, JSON, chunks, HTML
Formats	PDF, DOCX, PPTX, XLSX, HTML, images, audio, WebVTT, LaTeX, XBRL	130+ claimed	PDF/images default, more via full install
OCR	EasyOCR, Tesseract, RapidOCR, macOS Vision, VLM	Built into hosted modes	Surya OCR, force-OCR option
Tables	TableFormer	Strongest in higher-cost modes	TableConverter, optional LLM mode
Infra burden	High	Low	Medium to high
Main risk	Deployment complexity	Cost and vendor dependency	GPL and model licensing

Docling

Open-source PDF parser that started inside IBM Research and now lives under the Linux Foundation’s LF AI & Data umbrella. MIT-licensed code, with model weights under their own licenses. Current release is 2.94.0, published May 18, 2026 (PyPI).

Tables run through TableFormer, which preserves merged cells instead of flattening them into one line. OCR is pluggable: EasyOCR, Tesseract, RapidOCR, macOS Vision, or a VLM. Output is Markdown, HTML, JSON, or DocTags (a lossless format for round-tripping).

The cost is infrastructure. Docling pulls in PyTorch and a few hundred megabytes of model weights, so the runtime footprint sits in the 2-4 GB range. Lambda, Vercel, and Cloud Run cannot package it directly. The intended deployment is a long-lived container on Fly, Render, EC2, or your own hardware.

Pick Docling if you need a permissive open-source license, can run a long-lived container, and want one library that covers PDF, DOCX, PPTX, XLSX, and HTML with consistent output. Skip it if you are shipping to serverless and do not want a hosted layer in front.

LlamaParse

Hosted SaaS from LlamaIndex. There is no open-source engine to inspect: you send files and get parsed output back. The service is documented in the LlamaParse docs.

Four modes, priced in credits per page. Credits run $0.00125 each at the public rate, which is what gets you the dollar column:

Mode	Credits/page	$ / 1k pages
Parse without AI	1	$1.25
Cost-effective	3	$3.75
Agentic	10	$12.50
Agentic Plus	45	$56.25

Quality scales with the mode. Agentic Plus is the tier that handles dense financial tables and complex forms; Cost-effective is closer to a fast generic parser. Credit pricing has shifted before, so verify against current LlamaCloud pricing before you size a budget.
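Because the per-credit rate can change, it helps to keep it as a parameter rather than hard-coding the dollar column. The sketch below reproduces the table's arithmetic; the function and constant names are illustrative, and the $0.00125 rate is the public figure quoted above, which should be treated as an assumption until checked against current pricing.

```python
from decimal import Decimal

# Public per-credit rate quoted in this article; an assumption to re-verify.
USD_PER_CREDIT = Decimal("0.00125")

# Credits per page for each mode, copied from the table above.
CREDITS_PER_PAGE = {
    "Parse without AI": 1,
    "Cost-effective": 3,
    "Agentic": 10,
    "Agentic Plus": 45,
}


def usd_per_1k_pages(credits_per_page: int,
                     usd_per_credit: Decimal = USD_PER_CREDIT) -> Decimal:
    """Dollar cost of parsing 1,000 pages at a given credits-per-page rate."""
    return (credits_per_page * usd_per_credit * 1000).quantize(Decimal("0.01"))


if __name__ == "__main__":
    for mode, credits in CREDITS_PER_PAGE.items():
        print(f"{mode}: ${usd_per_1k_pages(credits)} per 1k pages")
```

If the rate changes, pass the new value as `usd_per_credit` and the dollar figures follow.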

130+ formats claimed, including XLSX, EPUB, and audio. Output options include Markdown, plain text, JSON, HTML tables, and an annotated PDF for debugging.

Pick LlamaParse if you want zero infrastructure, can pay per page, and parse a wide mix of formats. Skip if compliance rules out third-party APIs, or if your volume makes per-page pricing painful at the agentic tiers.

Marker

Open-source PDF-to-Markdown parser from Datalab. Current release is marker-pdf 1.10.2, published January 31, 2026 (PyPI, GitHub).

Fast on CPU, faster on GPU. Particularly strong on academic and math-heavy documents, which is what it was built for. The TableConverter handles standard tables, and an optional LLM mode improves results on harder tables and forms at the cost of a hosted model call.

Two licensing details to read carefully before you ship:

The code is GPLv3.
Surya OCR and layout model weights have commercial restrictions. Datalab sells a commercial license for production use.

Pick Marker if you parse research papers, math-heavy PDFs, or similar academic content, want the fastest local Markdown output, and can either accept GPL or buy the commercial license. Skip if you need permissive licensing for redistribution.

Why benchmarks don’t pick the best PDF parser for RAG

Most published comparisons score parsers against DocLayNet (80,863 annotated pages), PubLayNet, DocBank (token-level layout, 500k pages), or FinTabNet (financial tables). These measure layout-box detection or table-cell structure. None of them measure whether a downstream LLM can answer questions correctly from the parsed output.

ParseBench is closer to what RAG cares about, with metrics for faithfulness, grounding, and table understanding. It is also new and LlamaIndex-affiliated, so weight the results with that in mind.

The test that predicts production behavior is your own corpus. Pick five PDFs from your worst cases, for example a multi-column research paper with equations and references, a financial filing with merged-cell tables and footnotes, and a scanned invoice or contract with stamps and rotation. Run each parser, embed the output, ask a dozen questions you already know the answers to, and score on whether the answers come back right. Markdown prettiness is not the metric.
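One way to structure that test is a small harness that treats each parser and your retrieval-plus-LLM stack as plug-in functions. Everything here is a hypothetical sketch: `ParseFn` and `AnswerFn` are placeholder signatures you would wire to your own parser wrappers and RAG pipeline, and the substring match is a deliberately crude stand-in for whatever grading you trust.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signatures: adapt these to your own parser wrappers and RAG stack.
ParseFn = Callable[[str], str]        # PDF path -> parsed text (e.g. Markdown)
AnswerFn = Callable[[str, str], str]  # (parsed text, question) -> model answer
QA = Tuple[str, str]                  # (question, expected answer substring)


def score_parser(parse: ParseFn, answer: AnswerFn,
                 corpus: Dict[str, List[QA]]) -> float:
    """Fraction of known-answer questions answered correctly across the corpus."""
    correct = 0
    total = 0
    for pdf_path, qa_pairs in corpus.items():
        parsed = parse(pdf_path)  # parse each document once
        for question, expected in qa_pairs:
            got = answer(parsed, question)
            # Crude grading: expected string appears in the answer, ignoring case.
            correct += int(expected.lower() in got.lower())
            total += 1
    return correct / total if total else 0.0


def compare(parsers: Dict[str, ParseFn], answer: AnswerFn,
            corpus: Dict[str, List[QA]]) -> Dict[str, float]:
    """Score every parser against the same documents and questions."""
    return {name: score_parser(fn, answer, corpus) for name, fn in parsers.items()}
```

Keeping the questions and the answering step fixed while swapping only the parser is the point of the design: any score difference is then attributable to the parsed output.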

Cost per 1k pages

Option	$ / 1k pages
LlamaParse Parse without AI	$1.25
LlamaParse Cost-effective	$3.75
LlamaParse Agentic	$12.50
LlamaParse Agentic Plus	$56.25
ParseBridge Basic ($17 / 5k pages)	$3.40
ParseBridge Growth ($79 / 60k pages)	$1.32
ParseBridge Scale ($259 / 300k pages)	$0.86
Self-hosted Docling, GPU always-on	Depends on utilization

The self-host number is the one that fools people. A g5.xlarge on AWS runs roughly $734/month always-on. At 1M pages/month that works out to $0.73 per 1k and you win. At 50k pages/month it is $14.68 per 1k and you lose to almost every hosted option. Break-even depends on the hosted rate you compare against: roughly 200k pages per month against the $3.40 to $3.75 tiers, and roughly 550k to 850k against the cheapest rates in the table ($0.86 to $1.32 per 1k). Either way, the math only holds if you can keep the GPU busy.
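The self-host arithmetic is simple enough to check with your own numbers. This is a minimal sketch with illustrative function names; the $734/month figure is the article's estimate, and the model ignores engineering time, storage, and whether one instance can actually sustain the volume.

```python
def self_host_usd_per_1k(monthly_usd: float, pages_per_month: int) -> float:
    """Effective cost per 1,000 pages for a fixed-cost, always-on instance."""
    return monthly_usd / pages_per_month * 1000


def break_even_pages(monthly_usd: float, hosted_usd_per_1k: float) -> float:
    """Monthly page volume at which self-hosting matches a hosted per-1k rate."""
    return monthly_usd / hosted_usd_per_1k * 1000


if __name__ == "__main__":
    gpu_monthly = 734.0  # the article's g5.xlarge estimate; an assumption
    for pages in (50_000, 1_000_000):
        cost = self_host_usd_per_1k(gpu_monthly, pages)
        print(f"{pages:>9,} pages/month: ${cost:.2f} per 1k")
    for rate in (3.75, 1.25, 0.86):
        pages = break_even_pages(gpu_monthly, rate)
        print(f"break-even vs ${rate:.2f}/1k: {pages:,.0f} pages/month")
```

Below the break-even volume the hosted rate is cheaper; above it the fixed-cost instance is, provided it stays utilized.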

Pick this when

Scenario	Choice
Strict local or privacy requirement	Docling or Marker
Permissive OSS license needed	Docling
Math-heavy research papers	Marker
Broad format mix, hosted	LlamaParse
Serverless deployment that wants Docling output	Hosted Docling API
Worst-case complex tables, willing to pay	LlamaParse Agentic Plus
Embedded in a commercial OSS product	Docling, after legal review of model weights
High volume with a 24/7 GPU	Self-hosted Docling or Marker
Lowest marginal cost at high scale	Self-hosted Docling/Marker, or ParseBridge Scale

Where ParseBridge fits

ParseBridge is hosted Docling, not a fourth parser. It runs Docling behind an API so a Lambda function or a Vercel route can call Docling without packaging it: no PyTorch and no multi-gigabyte runtime in your deploy artifact. It accepts PDFs only. If you self-host Docling already and the deployment is fine, stay there. If you tried and gave up on the infrastructure, the hosted Docling API is the same parser with the ops removed.