Zero-dependency TypeScript PDF parser designed from scratch for RAG pipelines. Fast, tiny, runs everywhere.
$ npm install docutext
| Library | Small PDF (31 KB) | Large PDF (1.3 MB) | Bundle (gzip) | Deps |
|---|---|---|---|---|
| docutext | 3 ms | 40 ms | ~24 KB | 0 |
| pdfjs-dist | 5 ms | 244 ms | ~1.3 MB | 0 (large) |
| pdf-parse | 5 ms | 279 ms | ~780 KB | 1 |
| unpdf | 6 ms | 232 ms | ~320 KB | 2 |
Median of 3 runs, Node.js, Apple Silicon.
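The "median of 3 runs" methodology can be reproduced with a small harness. The following is a minimal sketch, not the benchmark script behind the table; `median`, `medianRuntimeMs`, and the injectable `now` clock are hypothetical names introduced here for illustration.

```typescript
// Median of a non-empty list of samples (does not mutate the input).
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error('median of empty sample set');
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Run `task` `runs` times and report the median wall-clock duration.
// `now` defaults to Date.now (millisecond resolution); pass a
// higher-resolution clock if the runtime provides one.
async function medianRuntimeMs(
  task: () => unknown,
  runs = 3,
  now: () => number = Date.now,
): Promise<number> {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = now();
    await task(); // works for both sync and async tasks
    samples.push(now() - start);
  }
  return median(samples);
}
```

Using the median rather than the mean keeps a single slow run (cold cache, garbage collection pause) from skewing the reported figure.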
// Extract plain text from a PDF file
import { DocuText } from 'docutext';

const doc = await DocuText.load('document.pdf');
console.log(doc.text);

// Render structured markdown instead of plain text
import { DocuText } from 'docutext';
import { docToMarkdown } from 'docutext/markdown';

const doc = await DocuText.load('document.pdf');
console.log(docToMarkdown(doc)); // headings, bold, links

// Browser: parse bytes fetched over HTTP
import { DocuText } from 'docutext'; // auto-resolves browser entry

const res = await fetch('/document.pdf');
const bytes = new Uint8Array(await res.arrayBuffer());
const doc = DocuText.fromBuffer(bytes);
console.log(doc.text);
No PDF.js, no WASM, no native addons. Node.js build has zero runtime dependencies. Browser requires only fflate (~3 KB).
~24 KB gzipped. Over 50x smaller than pdfjs-dist. Ideal for serverless and edge deployments.
Roughly 6x faster than the alternatives on the 1.3 MB benchmark PDF (40 ms vs. 232 to 279 ms). Pure TypeScript with no initialization overhead.
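The size and speed multiples can be checked directly against the benchmark table. This sketch only redoes that arithmetic; the `ratio` helper is hypothetical, and 1.3 MB is taken as 1300 KB, which is an assumption about how the table rounds.

```typescript
// How many times larger `other` is than `baseline`.
function ratio(other: number, baseline: number): number {
  return other / baseline;
}

// Large PDF (1.3 MB) timings in milliseconds, copied from the table.
const docutextMs = 40;
const pdfjsMs = 244;
const pdfParseMs = 279;
const unpdfMs = 232;

console.log(ratio(pdfjsMs, docutextMs).toFixed(1));    // 6.1
console.log(ratio(pdfParseMs, docutextMs).toFixed(1)); // 7.0
console.log(ratio(unpdfMs, docutextMs).toFixed(1));    // 5.8

// Gzipped bundle sizes in KB (1.3 MB assumed to be 1300 KB).
console.log(ratio(1300, 24).toFixed(1));               // 54.2
```

On the small 31 KB PDF the gap is narrower: 3 ms against 5 to 6 ms, or about 2x.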
Opt-in structured markdown with inferred headings, bold/italic, link extraction, and column-aware text flow.
Text is extracted on first access. Skip pages you don't need for instant startup.
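Extract-on-first-access is commonly implemented as a cached getter. The sketch below illustrates that general pattern only; `LazyPage` and its `extract` callback are hypothetical and do not describe docutext's internals.

```typescript
// A page whose text is computed the first time it is read, then cached.
class LazyPage {
  private cached: string | undefined = undefined;
  private readonly extract: () => string;

  constructor(extract: () => string) {
    this.extract = extract;
  }

  get text(): string {
    if (this.cached === undefined) {
      this.cached = this.extract(); // the expensive work happens here, once
    }
    return this.cached;
  }
}

// Building the pages costs nothing; only pages that are read get extracted.
let extractions = 0;
const pages = [0, 1, 2].map(
  (n) => new LazyPage(() => { extractions++; return `text of page ${n}`; }),
);
console.log(pages[1].text, extractions); // text of page 1 1
```

With this shape, a caller that needs only one page of a long document pays for one extraction rather than all of them.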
XRef tables/streams, FlateDecode/LZW/ASCII85, ToUnicode CMaps, font encodings, form XObjects.
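Of the stream filters listed, ASCII85 is the simplest to show in full. The following is a minimal sketch of the decoding rules as the PDF format defines them (five base-85 digits per four bytes, `z` for four zero bytes, `~>` as the end marker); `decodeAscii85` is a hypothetical helper, not a function exported by docutext.

```typescript
function decodeAscii85(input: string): Uint8Array {
  const out: number[] = [];
  let group: number[] = [];

  // Convert one 5-digit base-85 group to bytes, keeping the first `keep`.
  const flush = (digits: number[], keep: number): void => {
    let value = 0;
    for (const d of digits) value = value * 85 + d;
    // Plain arithmetic avoids 32-bit signed overflow in bitwise operators.
    const bytes = [
      Math.floor(value / 0x1000000) % 256,
      Math.floor(value / 0x10000) % 256,
      Math.floor(value / 0x100) % 256,
      value % 256,
    ];
    for (let i = 0; i < keep; i++) out.push(bytes[i]);
  };

  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    if (ch === '~') break;         // "~>" marks end of data
    if (/\s/.test(ch)) continue;   // whitespace is ignored
    if (ch === 'z' && group.length === 0) {
      out.push(0, 0, 0, 0);        // "z" abbreviates four zero bytes
      continue;
    }
    const digit = ch.charCodeAt(0) - 33; // "!" (33) through "u" (117)
    if (digit < 0 || digit > 84) {
      throw new Error(`invalid ASCII85 character: ${ch}`);
    }
    group.push(digit);
    if (group.length === 5) {
      flush(group, 4);
      group = [];
    }
  }

  if (group.length === 1) throw new Error('truncated ASCII85 group');
  if (group.length > 1) {
    const keep = group.length - 1;           // n digits encode n - 1 bytes
    while (group.length < 5) group.push(84); // pad with "u"
    flush(group, keep);
  }
  return Uint8Array.from(out);
}

// "9jqo^" is the ASCII85 encoding of the four bytes "Man ".
console.log(Array.from(decodeAscii85('9jqo^~>'))); // [ 77, 97, 110, 32 ]
```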
Handles permission-encrypted PDFs (empty password) with RC4 and AES-128 decryption. Node.js only.
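For orientation, RC4 itself is a short stream cipher. The sketch below is the textbook algorithm (key scheduling followed by keystream generation), not docutext's implementation, and it leaves out the PDF-specific key derivation that turns the (empty) password and document metadata into a per-object key. The `rc4` function name is an assumption made for this example.

```typescript
// Textbook RC4: the same call encrypts and decrypts.
function rc4(key: Uint8Array, data: Uint8Array): Uint8Array {
  if (key.length === 0) throw new Error('RC4 key must not be empty');

  // Key-scheduling algorithm (KSA): permute 0..255 based on the key.
  const s = new Uint8Array(256);
  for (let i = 0; i < 256; i++) s[i] = i;
  let j = 0;
  for (let i = 0; i < 256; i++) {
    j = (j + s[i] + key[i % key.length]) & 0xff;
    const tmp = s[i];
    s[i] = s[j];
    s[j] = tmp;
  }

  // Pseudo-random generation (PRGA): XOR the keystream into the data.
  const out = new Uint8Array(data.length);
  let i = 0;
  j = 0;
  for (let k = 0; k < data.length; k++) {
    i = (i + 1) & 0xff;
    j = (j + s[i]) & 0xff;
    const tmp = s[i];
    s[i] = s[j];
    s[j] = tmp;
    out[k] = data[k] ^ s[(s[i] + s[j]) & 0xff];
  }
  return out;
}
```

Because RC4 is a pure XOR stream cipher, applying `rc4` twice with the same key returns the original bytes, which is why a decryptor needs no separate code path.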
Written in TypeScript with full type definitions. ESM only, tree-shakeable.
[Symbol.dispose]().
Open a PDF to see docutext in action. Toggle between plain text and structured markdown output. All processing happens locally in your browser; no files are uploaded.