PDF text extraction,
built for AI

Zero-dependency TypeScript PDF parser designed from scratch for RAG pipelines. Fast, tiny, runs everywhere.

6x
Faster extraction
~24 KB
Gzipped bundle
0
Dependencies
Try the Playground
$ npm install docutext

Performance

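| Library    | Small PDF (31 KB) | Large PDF (1.3 MB) | Bundle (gzip) | Deps      |
|------------|-------------------|--------------------|---------------|-----------|
| docutext   | 3 ms              | 40 ms              | ~24 KB        | 0         |
| pdfjs-dist | 5 ms              | 244 ms             | ~1.3 MB       | 0 (large) |
| pdf-parse  | 5 ms              | 279 ms             | ~780 KB       | 1         |
| unpdf      | 6 ms              | 232 ms             | ~320 KB       | 2         |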

Median of 3 runs, Node.js, Apple Silicon.
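
The "median of 3 runs" methodology can be approximated with a small timing harness. This is a minimal sketch, not the script that produced the table above; `median` and `timeMedian` are hypothetical helper names, and the workload is whatever function you pass in.

```typescript
// Hypothetical helpers for a median-of-N timing harness (not part of docutext).

/** Median of a non-empty list of numbers. */
function median(values: number[]): number {
  if (values.length === 0) throw new Error('median of empty list');
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

/** Run `work` `runs` times and return the median wall-clock time in ms. */
async function timeMedian(work: () => unknown, runs = 3): Promise<number> {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await work(); // awaits the result if `work` returns a promise
    samples.push(performance.now() - start);
  }
  return median(samples);
}
```

To time an extraction, one would pass something like `() => DocuText.load('document.pdf').then((doc) => doc.text)` as `work`. Reading `doc.text` matters here because text is extracted lazily.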

Quick Start

Install

npm install docutext

Plain text

import { DocuText } from 'docutext';

const doc = await DocuText.load('document.pdf');
console.log(doc.text);

Structured Markdown

import { DocuText } from 'docutext';
import { docToMarkdown } from 'docutext/markdown';

const doc = await DocuText.load('document.pdf');
console.log(docToMarkdown(doc)); // headings, bold, links

Browser

import { DocuText } from 'docutext'; // auto-resolves browser entry

const res = await fetch('/document.pdf');
if (!res.ok) throw new Error(`Failed to fetch PDF: ${res.status}`);
const bytes = new Uint8Array(await res.arrayBuffer());
const doc = DocuText.fromBuffer(bytes); // synchronous: parses from memory
console.log(doc.text);

Key Features

Zero Dependencies

No PDF.js, no WASM, no native addons. The Node.js build has zero runtime dependencies; the browser build requires only fflate (~3 KB).

Tiny Bundle

~24 KB gzipped. Over 50x smaller than pdfjs-dist. Ideal for serverless and edge deployments.

Fast Extraction

Roughly 6x faster than pdfjs-dist, pdf-parse, and unpdf on the 1.3 MB benchmark PDF (see Performance). Pure TypeScript with no initialization overhead.

Markdown Output

Opt-in structured markdown with inferred headings, bold/italic, link extraction, and column-aware text flow.

Lazy Processing

Text is extracted on first access. Skip pages you don't need for instant startup.
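
Because page text is computed on first access, reading only the pages you need avoids the extraction cost for the rest. The following is a minimal sketch that relies only on the documented `number` and `text` members; `LazyPage` and `textOfPages` are illustrative names declared locally so the example stands alone.

```typescript
// Structural stand-in for the documented PDFPage surface (illustrative).
interface LazyPage {
  readonly number: number; // 1-based page number
  readonly text: string;   // lazy in docutext: computed on first access
}

/** Return the text of the requested 1-based page numbers only. */
function textOfPages(
  pages: readonly LazyPage[],
  wanted: readonly number[],
): string[] {
  const wantedSet = new Set(wanted);
  return pages
    .filter((page) => wantedSet.has(page.number)) // reads `number`, never `text`
    .map((page) => page.text);                    // extraction is triggered here
}
```

With a loaded document this would be called as `textOfPages(doc.pages, [1, 2])`, leaving every other page unextracted.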

Full PDF Parsing

XRef tables/streams, FlateDecode/LZW/ASCII85, ToUnicode CMaps, font encodings, form XObjects.

Encrypted PDFs

Handles permission-encrypted PDFs (empty password) with RC4 and AES-128 decryption. Node.js only.

TypeScript First

Written in TypeScript with full type definitions. ESM only, tree-shakeable.

API Reference

DocuText

static DocuText.load(path: string, options?: LoadOptions): Promise<DocuText>
Load a PDF from a file path (Node.js only).
static DocuText.fromBuffer(data: Uint8Array, options?: LoadOptions): DocuText
Parse a PDF from a byte array. Works in both Node.js and browser.
get text: string
Full document text, all pages concatenated with double newlines.
get pages: readonly PDFPage[]
Array of page objects.
get pageCount: number
Number of pages in the document.
get metadata: DocumentMetadata
Document metadata (title, author, subject, creator, producer, dates).
dispose(): void
Release internal references for garbage collection. Also supports [Symbol.dispose]().
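
Since `dispose()` releases internal references, it is worth calling even when processing throws. Below is a minimal sketch using `try`/`finally`; `DisposableDoc` and `withDocument` are hypothetical names, and the loader is passed in so the example assumes nothing beyond the documented `text` and `dispose()` members.

```typescript
// Structural stand-in for the parts of DocuText used here (illustrative).
interface DisposableDoc {
  readonly text: string;
  dispose(): void;
}

/** Open a document, run `use` on it, and always dispose it afterwards. */
async function withDocument<D extends DisposableDoc, R>(
  open: () => D | Promise<D>,
  use: (doc: D) => R | Promise<R>,
): Promise<R> {
  const doc = await open();
  try {
    return await use(doc);
  } finally {
    doc.dispose(); // runs on success and on error
  }
}
```

Usage would look like `const text = await withDocument(() => DocuText.load('document.pdf'), (doc) => doc.text);`. On runtimes and compilers that support explicit resource management, the documented `[Symbol.dispose]()` method may allow a `using` declaration instead.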

PDFPage

get number: number
1-based page number.
get text: string
Extracted text for this page. Lazy: computed on first access.
get textItems: TextItem[]
Raw text items with position, font, and size information.
get error: Error | null
Any error encountered during extraction for this page.
dispose(): void
Release page data for garbage collection.
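
Each page exposes its own `error`, so one damaged page need not abort a whole ingestion run. The sketch below turns pages into per-page chunks for a RAG index and collects failures separately. `ExtractedPage`, `PageChunk`, `PageFailure`, and `collectChunks` are hypothetical names. It also assumes that `error` is populated once `text` has been read, which is inferred from the lazy-extraction description rather than confirmed.

```typescript
// Structural stand-in for the documented PDFPage surface (illustrative).
interface ExtractedPage {
  readonly number: number;
  readonly text: string;
  readonly error: Error | null;
}

interface PageChunk { page: number; text: string }
interface PageFailure { page: number; message: string }

/** Split pages into usable text chunks and a list of per-page failures. */
function collectChunks(
  pages: readonly ExtractedPage[],
): { chunks: PageChunk[]; failures: PageFailure[] } {
  const chunks: PageChunk[] = [];
  const failures: PageFailure[] = [];
  for (const page of pages) {
    // Read text first so lazy extraction runs before `error` is checked (assumption).
    const text = page.text.trim();
    if (page.error !== null) {
      failures.push({ page: page.number, message: page.error.message });
    } else if (text.length > 0) {
      chunks.push({ page: page.number, text }); // blank pages are skipped
    }
  }
  return { chunks, failures };
}
```

Keeping the page number on each chunk lets a retrieval result cite where in the PDF the text came from.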

docutext/markdown

docToMarkdown(doc: DocuText, separator?: string): string
Convert an entire document to structured markdown.
pageToMarkdown(page: PDFPage): string
Convert a single page to structured markdown.
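
When only some pages are needed, or when blank pages should be dropped, per-page conversion can be combined by hand. This is a minimal sketch: `joinPagesAsMarkdown` is a hypothetical helper, the renderer is passed in (in practice it would be `pageToMarkdown`), and the default separator shown is an assumption, not the value `docToMarkdown` uses.

```typescript
/** Render each page, drop empty results, and join with a separator. */
function joinPagesAsMarkdown<P>(
  pages: readonly P[],
  render: (page: P) => string,
  separator = '\n\n---\n\n', // assumed value; docutext's default is not documented here
): string {
  return pages
    .map(render)
    .filter((md) => md.trim().length > 0) // skip pages that produce no markdown
    .join(separator);
}
```

A call might look like `joinPagesAsMarkdown(doc.pages.slice(0, 5), pageToMarkdown)`.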

LoadOptions

stripFormPlaceholders?: boolean
Remove form field placeholder text (default: true).
includeInvisibleText?: boolean
Include invisible (render mode 3) text (default: false).
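
The documented defaults can be made explicit in calling code, which helps when options come from configuration. This is a minimal sketch: `LoadOptionsLike`, `DOCUMENTED_DEFAULTS`, and `resolveLoadOptions` are hypothetical names that mirror the two options listed above, and docutext itself may resolve defaults differently.

```typescript
// Local mirror of the documented LoadOptions fields (illustrative).
interface LoadOptionsLike {
  stripFormPlaceholders?: boolean;
  includeInvisibleText?: boolean;
}

// Defaults as stated in the API reference above.
const DOCUMENTED_DEFAULTS: Required<LoadOptionsLike> = {
  stripFormPlaceholders: true,
  includeInvisibleText: false,
};

/** Fill in any option the caller left unset with its documented default. */
function resolveLoadOptions(
  overrides: LoadOptionsLike = {},
): Required<LoadOptionsLike> {
  return {
    stripFormPlaceholders:
      overrides.stripFormPlaceholders ?? DOCUMENTED_DEFAULTS.stripFormPlaceholders,
    includeInvisibleText:
      overrides.includeInvisibleText ?? DOCUMENTED_DEFAULTS.includeInvisibleText,
  };
}
```

The resolved object can be passed as the second argument, for example `DocuText.load('scan.pdf', resolveLoadOptions({ includeInvisibleText: true }))`. Scanned PDFs with an OCR layer often store that layer as invisible text, so enabling this option may be worth trying for them.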

Playground

Open a PDF to see docutext in action. Toggle between plain text and structured markdown output. All processing happens locally in your browser; no files are uploaded.
