Three tools dominate the PDF-parser-for-RAG conversation: Docling, LlamaParse, and Marker. They win on different axes, so a Docling vs LlamaParse vs Marker comparison has no overall winner. The right pick depends on licensing, deployment target, and the worst document in your corpus.
At a glance
| | Docling | LlamaParse | Marker |
|---|---|---|---|
| Latest version | docling 2.94.0 (May 18, 2026) | Hosted, no public engine version | marker-pdf 1.10.2 (Jan 31, 2026) |
| License | MIT code, model licenses vary | Commercial SaaS | GPLv3 code, model weights have commercial restrictions |
| Hosting | Local, self-host, Docling Serve | SaaS, enterprise VPC | Local, Datalab hosted/on-prem |
| Output | Markdown, HTML, JSON, DocTags, text | Markdown, JSON, XLSX, HTML tables, annotated PDF | Markdown, JSON, chunks, HTML |
| Formats | PDF, DOCX, PPTX, XLSX, HTML, images, audio, WebVTT, LaTeX, XBRL | 130+ claimed | PDF/images default, more via full install |
| OCR | EasyOCR, Tesseract, RapidOCR, macOS Vision, VLM | Built into hosted modes | Surya OCR, force-OCR option |
| Tables | TableFormer | Strongest in higher-cost modes | TableConverter, optional LLM mode |
| Infra burden | High | Low | Medium to high |
| Main risk | Deployment complexity | Cost and vendor dependency | GPL and model licensing |
Docling
Open-source PDF parser that started inside IBM Research and now lives under the Linux Foundation’s LF AI & Data umbrella. MIT-licensed code, with model weights under their own licenses. Current release is 2.94.0, published May 18, 2026 (PyPI).
Tables run through TableFormer, which preserves merged cells instead of flattening them into one line. OCR is pluggable: EasyOCR, Tesseract, RapidOCR, macOS Vision, or a VLM. Output is Markdown, HTML, JSON, or DocTags (a lossless format for round-tripping).
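As a rough illustration of the basic flow, the sketch below converts one file and exports Markdown. It assumes the `docling` package is installed and uses the library's `DocumentConverter` entry point with default settings. The helper name `pdf_to_markdown` and the file path in the usage comment are placeholders, and OCR and table options are left at their defaults.

```python
def pdf_to_markdown(source: str) -> str:
    """Convert one document with Docling's default pipeline and return Markdown."""
    # Imported lazily: the import pulls in PyTorch-backed model code and is slow.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()      # default pipeline options
    result = converter.convert(source)   # local file path or URL
    return result.document.export_to_markdown()


# Usage (placeholder path, requires docling and its model downloads):
#   markdown = pdf_to_markdown("report.pdf")
#   print(markdown[:500])
```

The first call is typically the slow one, because the models have to be fetched and loaded before any page is processed.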
The cost is infrastructure. Docling pulls in PyTorch and a few hundred megabytes of model weights, so the runtime footprint sits in the 2-4 GB range. Lambda, Vercel, and Cloud Run cannot package it directly. The intended deployment is a long-lived container on Fly, Render, EC2, or your own hardware.
Pick Docling if you need a permissive open-source license, can run a long-lived container, and want one library that covers PDF, DOCX, PPTX, XLSX, and HTML with consistent output. Skip it if you are shipping to serverless and do not want a hosted layer in front.
LlamaParse
Hosted SaaS from LlamaIndex. There is no open-source engine to inspect: you send files and get parsed output back. Documented at the LlamaParse docs.
Four modes, priced in credits per page. Credits run $0.00125 each at the public rate, which is what gets you the dollar column:
| Mode | Credits/page | $ / 1k pages |
|---|---|---|
| Parse without AI | 1 | $1.25 |
| Cost-effective | 3 | $3.75 |
| Agentic | 10 | $12.50 |
| Agentic Plus | 45 | $56.25 |
Quality scales with the mode. Agentic Plus is the tier that handles dense financial tables and complex forms; Cost-effective is closer to a fast generic parser. Credit pricing has shifted before, so verify against current LlamaCloud pricing before you size a budget.
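The dollar column in the mode table is plain arithmetic: credits per page, times the credit price, times 1,000 pages. A minimal sketch of that calculation follows, assuming the $0.00125 public rate quoted above still holds. The function names are illustrative and are not part of any LlamaParse SDK.

```python
from decimal import Decimal

# Public per-credit rate quoted in this article; verify before budgeting.
CREDIT_PRICE_USD = Decimal("0.00125")

# Credits per page for each mode, copied from the table above.
CREDITS_PER_PAGE = {
    "Parse without AI": 1,
    "Cost-effective": 3,
    "Agentic": 10,
    "Agentic Plus": 45,
}


def usd_per_1k_pages(credits_per_page: int,
                     credit_price: Decimal = CREDIT_PRICE_USD) -> Decimal:
    """Dollar cost of parsing 1,000 pages at a given credits-per-page rate."""
    return (credits_per_page * credit_price * 1000).quantize(Decimal("0.01"))


def job_cost_usd(pages: int, mode: str) -> Decimal:
    """Dollar cost of parsing `pages` pages in one of the listed modes."""
    total = pages * CREDITS_PER_PAGE[mode] * CREDIT_PRICE_USD
    return total.quantize(Decimal("0.01"))


if __name__ == "__main__":
    for mode, credits in CREDITS_PER_PAGE.items():
        print(f"{mode}: ${usd_per_1k_pages(credits)} per 1k pages")
```

Using `Decimal` avoids floating-point noise in the cents column. If the credit price changes, only `CREDIT_PRICE_USD` needs updating.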
130+ formats claimed, including XLSX, EPUB, and audio. Output options include Markdown, plain text, JSON, HTML tables, and an annotated PDF for debugging.
Pick LlamaParse if you want zero infrastructure, can pay per page, and parse a wide mix of formats. Skip if compliance rules out third-party APIs, or if your volume makes per-page pricing painful at the agentic tiers.
Marker
Open-source PDF-to-Markdown parser from Datalab. Current release is marker-pdf 1.10.2, published January 31, 2026 (PyPI, GitHub).
Fast on CPU, faster on GPU. Particularly strong on academic and math-heavy documents, which is what it was built for. The TableConverter handles standard tables, and an optional LLM mode improves results on harder tables and forms at the cost of a hosted model call.
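For orientation, a local conversion with Marker's Python API might look like the sketch below. It assumes `marker-pdf` is installed and follows the converter pattern from the project README. The wrapper name `pdf_to_markdown_with_marker` is a placeholder, and the optional LLM mode is not enabled here.

```python
def pdf_to_markdown_with_marker(path: str) -> str:
    """Convert one PDF to Markdown with Marker's default PDF converter."""
    # Imported lazily: loading Marker also loads the Surya models.
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered

    converter = PdfConverter(artifact_dict=create_model_dict())
    rendered = converter(path)
    # text_from_rendered returns the text, a format hint, and extracted images.
    text, _, _images = text_from_rendered(rendered)
    return text


# Usage (placeholder path, requires marker-pdf and its model downloads):
#   print(pdf_to_markdown_with_marker("paper.pdf")[:500])
```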
Two licensing details to read carefully before you ship:
- The code is GPLv3.
- Surya OCR and layout model weights have commercial restrictions. Datalab sells a commercial license for production use.
Pick Marker if you parse research papers, math-heavy PDFs, or similar academic content, want the fastest local Markdown output, and can either accept GPL or buy the commercial license. Skip if you need permissive licensing for redistribution.
Why benchmarks don’t pick the best PDF parser for RAG
Most published comparisons score parsers against DocLayNet (80,863 annotated pages), PubLayNet, DocBank (token-level layout, 500k pages), or FinTabNet (financial tables). These measure layout-box detection or table-cell structure. None of them measure whether a downstream LLM can answer questions correctly from the parsed output.
ParseBench is closer to what RAG cares about, with metrics for faithfulness, grounding, and table understanding. It is also new and LlamaIndex-affiliated, so weight the results with that in mind.
The test that predicts production behavior is your own corpus. Pick five PDFs from your worst cases, for example a multi-column research paper with equations and references, a financial filing with merged-cell tables and footnotes, and a scanned invoice or contract with stamps and rotation. Run each parser, embed the output, ask a dozen questions you already know the answers to, and score on whether the answers come back right. Markdown prettiness is not the metric.
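One way to structure that test is a small harness that treats each parser and the question-answering step as pluggable functions. The sketch below is only a skeleton under stated assumptions: `score_parsers`, the `Parser` and `Answerer` types, and the substring-match scoring rule are all illustrative choices, and the embedding and retrieval logic is assumed to live inside whatever `answer` function you supply.

```python
from typing import Callable, Dict, List, Tuple

Parser = Callable[[str], str]          # file path -> parsed text (e.g. Markdown)
Answerer = Callable[[str, str], str]   # (parsed text, question) -> answer
Case = Tuple[str, str, str]            # (file path, question, expected answer)


def score_parsers(parsers: Dict[str, Parser],
                  answer: Answerer,
                  cases: List[Case]) -> Dict[str, float]:
    """Return the fraction of known-answer questions each parser gets right."""
    scores: Dict[str, float] = {}
    for name, parse in parsers.items():
        parsed: Dict[str, str] = {}    # parse each file once per parser
        correct = 0
        for path, question, expected in cases:
            if path not in parsed:
                parsed[path] = parse(path)
            got = answer(parsed[path], question)
            # Crude scoring rule (an assumption): the expected answer must
            # appear, case-insensitively, somewhere in the returned answer.
            if expected.strip().lower() in got.strip().lower():
                correct += 1
        scores[name] = correct / len(cases) if cases else 0.0
    return scores
```

In practice `parsers` would map names such as "docling" or "marker" to thin wrappers around each tool, and `answer` would wrap your actual RAG pipeline. Substring matching is deliberately strict about numbers and names, which is where parser errors tend to show up. Swap in a stricter or looser rule if your answers are free-form.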
Cost per 1k pages
| Option | $ / 1k pages |
|---|---|
| LlamaParse Parse without AI | $1.25 |
| LlamaParse Cost-effective | $3.75 |
| LlamaParse Agentic | $12.50 |
| LlamaParse Agentic Plus | $56.25 |
| ParseBridge Basic ($17 / 5k pages) | $3.40 |
| ParseBridge Growth ($79 / 60k pages) | $1.32 |
| ParseBridge Scale ($259 / 300k pages) | $0.86 |
| Self-hosted Docling, GPU always-on | Depends on utilization |
The self-host number is the one that fools people. A g5.xlarge on AWS runs roughly $734/month always-on. At 1M pages/month that works out to $0.73 per 1k and you win. At 50k pages/month it is $14.68 per 1k and you lose to almost every hosted option. Break-even against a hosted API usually sits in the low hundreds of thousands of pages per month, and only if you can keep the GPU busy.
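The self-host figures above can be reproduced with two one-line formulas. The sketch below assumes the $734/month GPU figure from this section and ignores engineering time, storage, and whether a single GPU can actually sustain the volume. The function names are illustrative.

```python
GPU_MONTHLY_USD = 734.0  # always-on g5.xlarge estimate used in this article


def cost_per_1k_pages(monthly_cost_usd: float, pages_per_month: int) -> float:
    """Effective self-hosting cost per 1,000 pages at a given monthly volume."""
    return monthly_cost_usd / pages_per_month * 1000


def break_even_pages(monthly_cost_usd: float, hosted_usd_per_1k: float) -> float:
    """Monthly page volume at which self-hosting matches a hosted per-1k price."""
    return monthly_cost_usd / hosted_usd_per_1k * 1000


if __name__ == "__main__":
    for pages in (50_000, 1_000_000):
        per_1k = cost_per_1k_pages(GPU_MONTHLY_USD, pages)
        print(f"{pages:>9,} pages/month: ${per_1k:.2f} per 1k")
    # Compare against the hosted rates listed in the cost table.
    for label, rate in (("$1.25 per 1k", 1.25), ("$3.75 per 1k", 3.75)):
        pages = break_even_pages(GPU_MONTHLY_USD, rate)
        print(f"break-even vs {label}: {pages:,.0f} pages/month")
```

Against the $3.75 tier the break-even lands a little under 200k pages per month, and against the $1.25 tier closer to 590k, so the result depends on which hosted price you are comparing with.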
Pick this when
| Scenario | Choice |
|---|---|
| Strict local or privacy requirement | Docling or Marker |
| Permissive OSS license needed | Docling |
| Math-heavy research papers | Marker |
| Broad format mix, hosted | LlamaParse |
| Serverless deployment that wants Docling output | Hosted Docling API |
| Worst-case complex tables, willing to pay | LlamaParse Agentic Plus |
| Embedded in a commercial OSS product | Docling, after legal review of model weights |
| High volume with a 24/7 GPU | Self-hosted Docling or Marker |
| Lowest marginal cost at high scale | Self-hosted Docling/Marker, or ParseBridge Scale |
Where ParseBridge fits
ParseBridge is hosted Docling, not a fourth parser. It runs Docling behind an API so a Lambda function or a Vercel route can call Docling without packaging it: no PyTorch and no multi-gigabyte runtime in your deploy artifact. PDF only. If you self-host Docling already and the deployment is fine, stay there. If you tried and gave up on the infrastructure, the hosted Docling API is the same parser with the ops removed.