Docling vs LlamaParse vs Marker for PDF parsing

A criteria-first comparison of Docling, LlamaParse, and Marker for parsing PDFs in RAG pipelines. Licensing, output, cost per 1k pages, and what to actually test.

Three tools dominate the PDF-parser-for-RAG conversation: Docling, LlamaParse, and Marker. They win on different axes, so there is no overall winner. The right pick depends on licensing, deployment target, and the worst document in your corpus.

At a glance

| | Docling | LlamaParse | Marker |
|---|---|---|---|
| Latest version | docling 2.94.0 (May 18, 2026) | Hosted, no public engine version | marker-pdf 1.10.2 (Jan 31, 2026) |
| License | MIT code, model licenses vary | Commercial SaaS | GPLv3 code, model weights have commercial restrictions |
| Hosting | Local, self-host, Docling Serve | SaaS, enterprise VPC | Local, Datalab hosted/on-prem |
| Output | Markdown, HTML, JSON, DocTags, text | Markdown, JSON, XLSX, HTML tables, annotated PDF | Markdown, JSON, chunks, HTML |
| Formats | PDF, DOCX, PPTX, XLSX, HTML, images, audio, WebVTT, LaTeX, XBRL | 130+ claimed | PDF/images default, more via full install |
| OCR | EasyOCR, Tesseract, RapidOCR, macOS Vision, VLM | Built into hosted modes | Surya OCR, force-OCR option |
| Tables | TableFormer | Strongest in higher-cost modes | TableConverter, optional LLM mode |
| Infra burden | High | Low | Medium to high |
| Main risk | Deployment complexity | Cost and vendor dependency | GPL and model licensing |

Docling

Open-source PDF parser that started inside IBM Research and now lives under the Linux Foundation’s LF AI & Data umbrella. MIT-licensed code, with model weights under their own licenses. Current release is 2.94.0, published May 18, 2026 (PyPI).



Tables run through TableFormer, which preserves merged cells instead of flattening them into one line. OCR is pluggable: EasyOCR, Tesseract, RapidOCR, macOS Vision, or a VLM. Output is Markdown, HTML, JSON, or DocTags (a lossless format for round-tripping).
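
As a rough illustration of the library's shape, here is a minimal sketch of a PDF-to-Markdown conversion with Docling's default pipeline. The wrapper name `pdf_to_markdown` and the input file `report.pdf` are hypothetical. OCR engine selection and table options are left at their defaults, so check the Docling documentation for the installed release before relying on this.

```python
from pathlib import Path


def pdf_to_markdown(pdf_path: str) -> str:
    """Convert one PDF to Markdown using Docling's default pipeline options."""
    # Imported inside the function so this module still loads
    # on machines where Docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()        # default pipeline options
    result = converter.convert(pdf_path)   # local path or URL
    return result.document.export_to_markdown()


if __name__ == "__main__":
    source = "report.pdf"  # hypothetical input file
    markdown = pdf_to_markdown(source)
    Path(source).with_suffix(".md").write_text(markdown, encoding="utf-8")
```

The first call is slow because the converter has to load its models. In a service, build the converter once and reuse it across requests.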

The cost is infrastructure. Docling pulls in PyTorch and a few hundred megabytes of model weights, so the runtime footprint sits in the 2-4 GB range. Lambda, Vercel, and Cloud Run cannot package it directly. The intended deployment is a long-lived container on Fly, Render, EC2, or your own hardware.

Pick Docling if you need a permissive open-source license, can run a long-lived container, and want one library that covers PDF, DOCX, PPTX, XLSX, and HTML with consistent output. Skip it if you are shipping to serverless and do not want a hosted layer in front.

LlamaParse

Hosted SaaS from LlamaIndex. There is no open-source engine to inspect: you send files and get parsed output back. Documented at the LlamaParse docs.

Four modes, priced in credits per page. Credits run $0.00125 each at the public rate, which is what gets you the dollar column:

| Mode | Credits/page | $ / 1k pages |
|---|---|---|
| Parse without AI | 1 | $1.25 |
| Cost-effective | 3 | $3.75 |
| Agentic | 10 | $12.50 |
| Agentic Plus | 45 | $56.25 |

Quality scales with the mode. Agentic Plus is the tier that handles dense financial tables and complex forms; Cost-effective is closer to a fast generic parser. Credit pricing has shifted before, so verify against current LlamaCloud pricing before you size a budget.
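
The dollar column is plain arithmetic on the credit price, so it is easy to recompute when the rate changes. A minimal sketch, assuming the $0.00125 public rate and the credits per page quoted above (the names `USD_PER_CREDIT`, `CREDITS_PER_PAGE`, and `cost_usd` are illustrative, not part of any LlamaParse SDK):

```python
from decimal import Decimal

# Public rate quoted in this post; verify against current LlamaCloud pricing.
USD_PER_CREDIT = Decimal("0.00125")

# Credits per page for each mode, copied from the table above.
CREDITS_PER_PAGE = {
    "Parse without AI": 1,
    "Cost-effective": 3,
    "Agentic": 10,
    "Agentic Plus": 45,
}


def cost_usd(mode: str, pages: int) -> Decimal:
    """Dollar cost of parsing `pages` pages in the given mode."""
    return CREDITS_PER_PAGE[mode] * USD_PER_CREDIT * pages


if __name__ == "__main__":
    for mode in CREDITS_PER_PAGE:
        print(f"{mode:<18} ${cost_usd(mode, 1000):.2f} per 1k pages")
```

`Decimal` keeps the per-credit price exact, which matters once page counts reach the millions.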

130+ formats claimed, including XLSX, EPUB, and audio. Output options include Markdown, plain text, JSON, HTML tables, and an annotated PDF for debugging.

Pick LlamaParse if you want zero infrastructure, can pay per page, and parse a wide mix of formats. Skip if compliance rules out third-party APIs, or if your volume makes per-page pricing painful at the agentic tiers.

Marker

Open-source PDF-to-Markdown parser from Datalab. Current release is marker-pdf 1.10.2, published January 31, 2026 (PyPI, GitHub).

Fast on CPU, faster on GPU. Particularly strong on academic and math-heavy documents, which is what it was built for. The TableConverter handles standard tables, and an optional LLM mode improves results on harder tables and forms at the cost of a hosted model call.

Two licensing details to read carefully before you ship:

  1. The code is GPLv3.
  2. Surya OCR and layout model weights have commercial restrictions. Datalab sells a commercial license for production use.

Pick Marker if you parse research papers, math-heavy PDFs, or similar academic content, want the fastest local Markdown output, and can either accept GPL or buy the commercial license. Skip if you need permissive licensing for redistribution.

Why benchmarks don’t pick the best PDF parser for RAG

Most published comparisons score parsers against DocLayNet (80,863 annotated pages), PubLayNet, DocBank (token-level layout, 500k pages), or FinTabNet (financial tables). These measure layout-box detection or table-cell structure. None of them measure whether a downstream LLM can answer questions correctly from the parsed output.

ParseBench is closer to what RAG cares about, with metrics for faithfulness, grounding, and table understanding. It is also new and LlamaIndex-affiliated, so weight the results with that in mind.

The test that predicts production behavior is your own corpus. Pick five PDFs from your worst cases, for example a multi-column research paper with equations and references, a financial filing with merged-cell tables and footnotes, and a scanned invoice or contract with stamps and rotation. Run each parser, embed the output, ask a dozen questions you already know the answers to, and score on whether the answers come back right. Markdown prettiness is not the metric.
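
The scoring loop described above can be sketched in a few lines. This is a skeleton under stated assumptions, not a finished harness: `ParseFn` and `AnswerFn` are hypothetical signatures standing in for your own parser wrappers and RAG pipeline, and a case-insensitive substring match is a deliberately crude stand-in for grading answers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    pdf_path: str
    question: str
    expected: str  # short string a correct answer must contain


# Hypothetical signatures: plug in your own parser wrappers and RAG pipeline.
ParseFn = Callable[[str], str]        # pdf path -> parsed text
AnswerFn = Callable[[str, str], str]  # (parsed text, question) -> answer


def score_parsers(
    parsers: dict[str, ParseFn],
    answer: AnswerFn,
    cases: list[Case],
) -> dict[str, float]:
    """Return the fraction of questions answered correctly for each parser."""
    scores: dict[str, float] = {}
    for name, parse in parsers.items():
        parsed: dict[str, str] = {}  # parse each PDF once per parser
        correct = 0
        for case in cases:
            if case.pdf_path not in parsed:
                parsed[case.pdf_path] = parse(case.pdf_path)
            reply = answer(parsed[case.pdf_path], case.question)
            # Crude grading: the expected string must appear in the reply.
            if case.expected.lower() in reply.lower():
                correct += 1
        scores[name] = correct / len(cases) if cases else 0.0
    return scores
```

Keeping the answer function fixed across parsers is the point of the design: the only variable is the parsed text, so a score difference is attributable to the parser.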

Cost per 1k pages

| Option | $ / 1k pages |
|---|---|
| LlamaParse Parse without AI | $1.25 |
| LlamaParse Cost-effective | $3.75 |
| LlamaParse Agentic | $12.50 |
| LlamaParse Agentic Plus | $56.25 |
| ParseBridge Basic ($17 / 5k pages) | $3.40 |
| ParseBridge Growth ($79 / 60k pages) | $1.32 |
| ParseBridge Scale ($259 / 300k pages) | $0.86 |
| Self-hosted Docling, GPU always-on | Depends on utilization |

The self-host number is the one that fools people. A g5.xlarge on AWS runs roughly $734/month always-on. At 1M pages/month that works out to $0.73 per 1k and you win. At 50k pages/month it is $14.68 per 1k and you lose to almost every hosted option. Break-even against a hosted API usually sits in the low hundreds of thousands of pages per month, and only if you can keep the GPU busy.
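
The arithmetic behind those figures is worth running with your own numbers. A minimal sketch, assuming a fixed monthly instance cost and ignoring engineering time, storage, and egress (the function names are illustrative):

```python
GPU_MONTHLY_USD = 734.0  # rough always-on g5.xlarge figure used in this post


def cost_per_1k(monthly_cost_usd: float, pages_per_month: int) -> float:
    """Effective self-hosted cost per 1,000 pages at a given monthly volume."""
    return monthly_cost_usd / pages_per_month * 1000


def break_even_pages(monthly_cost_usd: float, hosted_usd_per_1k: float) -> float:
    """Monthly page volume at which self-hosting matches a hosted per-1k price."""
    return monthly_cost_usd / hosted_usd_per_1k * 1000


if __name__ == "__main__":
    for pages in (50_000, 1_000_000):
        rate = cost_per_1k(GPU_MONTHLY_USD, pages)
        print(f"{pages:>9,} pages/month: ${rate:.2f} per 1k")
    for hosted_rate in (3.75, 1.25):
        pages = break_even_pages(GPU_MONTHLY_USD, hosted_rate)
        print(f"break-even vs ${hosted_rate:.2f} per 1k: {pages:,.0f} pages/month")
```

Against the $3.75 tier the break-even lands just under 200k pages per month, and against the $1.25 tier it is closer to 590k. Both assume a single instance can actually process that volume.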

Pick this when

| Scenario | Choice |
|---|---|
| Strict local or privacy requirement | Docling or Marker |
| Permissive OSS license needed | Docling |
| Math-heavy research papers | Marker |
| Broad format mix, hosted | LlamaParse |
| Serverless deployment that wants Docling output | Hosted Docling API |
| Worst-case complex tables, willing to pay | LlamaParse Agentic Plus |
| Embedded in a commercial OSS product | Docling, after legal review of model weights |
| High volume with a 24/7 GPU | Self-hosted Docling or Marker |
| Lowest marginal cost at high scale | Self-hosted Docling/Marker, or ParseBridge Scale |

Where ParseBridge fits

ParseBridge is hosted Docling, not a fourth parser. It runs Docling behind an API, so a Lambda function or a Vercel route can call it without packaging PyTorch and a multi-gigabyte runtime into the deploy artifact. It handles PDF only. If you already self-host Docling and the deployment is fine, stay there. If you tried and gave up on the infrastructure, the hosted Docling API is the same parser with the ops removed.