n8n is good at moving things around. It can pick up a PDF from Gmail, Slack, Drive, or a webhook and push the result into Notion, Postgres, or Pinecone. What it does poorly is the middle step: turning that PDF into Markdown an LLM can read. The built-in Extract From File node hands you a single string of text in which tables, columns, and reading order are lost. This post wires Docling into n8n so the conversion step matches the quality of the rest of your pipeline.
There are two ways to do this. Run docling-serve somewhere and call it from an HTTP Request node, or install the verified Parsebridge community node, which runs Docling on hosted infrastructure and exposes it as a first-class node. The second is shorter, so that’s what we’ll set up first.
Installing the n8n Docling node
The node is published as n8n-nodes-parsebridge (v1.0.5 at the time of writing, requires n8n 1.60.0 or newer). Install it from the n8n UI as an instance owner or admin:
- Settings, Community Nodes, Install.
- Enter n8n-nodes-parsebridge.
- Agree to the verified-node prompt.
After install, every member of the instance can drop it into a workflow. Self-hosted users on older n8n versions need to enable community nodes in their environment first; the verified install docs cover the flags.
Create a credential of type Parsebridge API and paste in a key from the Parsebridge dashboard. The credential test hits GET https://api.parsebridge.com/v1/status, so a green check means the key is live.
Your first parse
The node has two operations, labelled “Parse PDF From URL” (parseUrl) and “Parse PDF From File” (parseFile) in the picker. parseFile takes binary data coming out of another node; parseUrl fetches a publicly reachable PDF URL. A minimal URL workflow is two nodes:
Manual Trigger → Parsebridge (parseUrl, url = https://arxiv.org/pdf/2408.09869)
Execute it. The node returns a JSON item with a content field, plus metadata such as pageCount and creditsUsed.
content is Markdown, with tables as real | ... | rows and headings preserved. Send it anywhere downstream that accepts text and you have an n8n PDF to Markdown pipeline in under a minute.
Gmail attachment to Notion
Trigger on inbound mail carrying a PDF, parse the attachment, write the Markdown to a Notion page.
Nodes, left to right:
- Gmail Trigger, search query has:attachment filename:pdf.
- Parsebridge, operation: parseFile, binaryPropertyName: data. Gmail’s attachment binary lands on data by default.
- Notion create page. Map ={{$json.content}} into the body, ={{$('Gmail Trigger').item.json.subject}} into the title, and push pageCount and creditsUsed into properties if you want them.
The node’s parameter object for this step looks like:
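A sketch of that object, reconstructed from the two settings listed above and written as a JavaScript object literal. The variable name is hypothetical, and an exported workflow may carry additional fields:

```javascript
// Reconstructed from the settings named in the node list; not copied from an exported workflow.
const parsebridgeParameters = {
  operation: 'parseFile',      // the "Parse PDF From File" operation
  binaryPropertyName: 'data',  // binary property that holds the attachment
};

console.log(JSON.stringify(parsebridgeParameters, null, 2));
```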
Tables in invoices and contracts survive intact, which is the part that usually breaks when people try this with the native n8n PDF parser.
Google Drive to Pinecone
The most common ask: a folder you can drop PDFs into and end up with searchable, chunked embeddings.
1. Google Drive Trigger, watching a folder for new files.
2. Google Drive download, binary property data.
3. Parsebridge, parseFile on data.
4. Code node to split the Markdown along headings into 1,500 to 3,000 character chunks. Splitting on ## boundaries keeps related rows of a table together.
5. Embeddings node (OpenAI, Cohere, whatever you use).
6. Pinecone upsert. Stash file name, Drive file ID, folder, upload time, page count, and chunk index in the metadata so you can filter later.
A small chunker for step 4:
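A minimal sketch for an n8n Code node, assuming the Parsebridge output sits on json.content. It splits on ## headings, then cuts any section longer than the limit at a fixed character count. The function name chunkMarkdown, the MAX_CHARS value, and the output field names are assumptions for illustration, and the sketch does not merge short sections up to the 1,500 character floor.

```javascript
// Upper bound per chunk, taken from the 1,500 to 3,000 character range in the text.
const MAX_CHARS = 3000;

function chunkMarkdown(markdown, maxChars = MAX_CHARS) {
  // Split right before each level-2 heading so every section keeps its own heading.
  // "### " and deeper headings do not match and stay inside their parent section.
  const sections = markdown.split(/^(?=## )/m).filter((s) => s.trim() !== '');

  const chunks = [];
  for (const section of sections) {
    // Hard slice on character count inside the section.
    for (let start = 0; start < section.length; start += maxChars) {
      chunks.push(section.slice(start, start + maxChars).trim());
    }
  }
  // Drop slices that contained only whitespace.
  return chunks.filter((c) => c !== '');
}

// Inside the Code node ("Run Once for All Items"), usage would look roughly like:
// return $input.all().flatMap((item) =>
//   chunkMarkdown(item.json.content).map((text, chunkIndex) => ({
//     json: { text, chunkIndex },
//   })),
// );
```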
The chunker above slices on a hard character count inside each section. It is fine for illustration; a production chunker should respect sentence boundaries and table rows. RAG retrieval quality is bounded by the parser, though. If table rows arrive concatenated into one line, no clever chunking strategy saves it.
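For the Pinecone upsert step, each chunk can be turned into one record with the metadata fields listed earlier. This is a rough sketch: the helper name toPineconeRecord, the shape of the file object, and the metadata key names are all assumptions, and only the id / values / metadata record layout comes from Pinecone itself.

```javascript
// Hypothetical helper: builds one Pinecone record per chunk.
// The keys inside `metadata` are illustrative; Pinecone does not require these names.
function toPineconeRecord(file, chunkIndex, embedding) {
  return {
    // Deterministic id, so re-processing the same file overwrites rather than duplicates.
    id: `${file.driveFileId}#${chunkIndex}`,
    values: embedding,
    metadata: {
      fileName: file.name,
      driveFileId: file.driveFileId,
      folder: file.folder,
      uploadedAt: file.uploadedAt, // ISO 8601 string
      pageCount: file.pageCount,
      chunkIndex,
    },
  };
}

// Example with made-up values.
const record = toPineconeRecord(
  {
    name: 'contract.pdf',
    driveFileId: 'abc123',
    folder: 'inbox',
    uploadedAt: '2025-01-01T00:00:00Z',
    pageCount: 12,
  },
  0,
  [0.1, 0.2, 0.3],
);
console.log(record.id);
```

A deterministic id built from the Drive file ID and the chunk index means a re-run of the workflow updates existing vectors instead of piling up copies.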
Slack PDF upload to a thread summary
The Slack file URL is private, so parseUrl will not work. Download the bytes with an HTTP Request node carrying Authorization: Bearer xoxb-... and pass the binary into parseFile.
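A small sketch of that download request, written as plain JavaScript so the pieces are visible. The helper name buildSlackDownloadRequest, the example URL, and the example token are hypothetical; url_private_download and url_private are fields on Slack file objects, and the bot token is assumed to carry the files:read scope. In n8n the same URL and header go into the HTTP Request node, with the response format set to file.

```javascript
// Hypothetical helper: builds the request for downloading a private Slack file.
function buildSlackDownloadRequest(file, botToken) {
  // Prefer the direct-download URL, fall back to the plain private URL.
  const url = file.url_private_download || file.url_private;
  if (!url) {
    throw new Error('Slack file object has no private URL');
  }
  return {
    method: 'GET',
    url,
    headers: { Authorization: `Bearer ${botToken}` },
  };
}

// Example with a made-up file object; a real token would come from an n8n credential.
const request = buildSlackDownloadRequest(
  { url_private_download: 'https://files.slack.com/files-pri/T000-F000/download/report.pdf' },
  'xoxb-example-token',
);
console.log(request.method, request.url);
```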
A summary prompt that holds up over time: five-bullet TL;DR, key dates and obligations, any numeric tables flattened to a list, and an explicit “uncertainty” line where the model is guessing. A polished summary written on top of a bad extraction will hide the parser quality problem, which is the trap with the built-in node.
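One way to assemble such a prompt in a Code node is sketched below. The function name buildSummaryPrompt and the exact wording are assumptions; only the four requested sections come from the description above.

```javascript
// Hypothetical helper: wraps the parsed Markdown in a fixed summary prompt.
function buildSummaryPrompt(markdown) {
  return [
    'Summarize the document below for a Slack thread. Include, in this order:',
    '1. A five-bullet TL;DR.',
    '2. Key dates and obligations.',
    '3. Any numeric tables, flattened to a list.',
    '4. An "Uncertainty" line naming anything you are guessing at.',
    '',
    '--- DOCUMENT (Markdown) ---',
    markdown,
  ].join('\n');
}

// Example with a tiny stand-in document.
const prompt = buildSummaryPrompt('## Terms\n\n| Item | Due |\n| --- | --- |\n| Deposit | 2025-03-01 |');
console.log(prompt);
```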
Why not the built-in Extract From File node
n8n ships an Extract From File core node with an Extract From PDF operation. As an n8n PDF parser it works, and for plain text-native PDFs with one column and no tables, it’s fine. It falls over on:
- Multi-column papers and reports, where reading order collapses.
- Tables, where rows become one long line of space-separated values.
- Scanned PDFs, with no OCR step at all.
- Layout-heavy documents like filings, invoices, or lab results, where the semantic structure an LLM needs is gone.
Docling was built for these. Layout detection runs first, then TableFormer reconstructs each table, then OCR fills in any pages without an embedded text layer. The result is Markdown that preserves what a human sees on the page. Run the Docling technical report through both nodes and the difference is obvious.
Troubleshooting
Node does not show up in the picker
This usually means the install was attempted from a regular user account on a self-hosted instance. Only owners and admins can install community nodes; everyone can use them after.
parseFile says no binary data
Check the previous node’s output panel for the actual binary property name. Gmail and Drive default to data, but an HTTP Request node uses data only when Put Output File in Field is left at its default.
parseUrl returns a 4xx
The URL has to be reachable from Parsebridge’s servers. Gmail, Slack, and private Drive URLs are not. Download to binary first, then use parseFile.
Imported workflow JSON missing credentials
n8n strips API keys when exporting. After importing, open the Parsebridge node and reselect (or create) the credential.
Markdown too long for Notion
Notion has block size limits. Store the full Markdown in Postgres, S3, or R2, and write the title plus a link or a short summary into Notion.
Self-hosted alternative
If you would rather run Docling yourself, the project ships docling-serve, a FastAPI wrapper around the library. Stand it up on a long-lived box with a GPU and call it from an HTTP Request node. The trade-off is the usual one: you handle model downloads, OCR engines, GPU drivers, and uptime. The Getting started with Docling in Python guide walks through the library itself if you want to know what you’re signing up for.
The Parsebridge node calls the same Docling pipeline as the hosted Docling API, so you can prototype on the node and switch to self-hosted later without changing the shape of the workflow. The Parsebridge n8n integration page has the API key signup and the install steps in one place.