PDF text extraction, built for AI

Zero-dependency TypeScript PDF parser designed from scratch for RAG pipelines. Fast, tiny, runs everywhere.

~6x faster extraction

~24 KB gzipped bundle

0 dependencies


$ npm install docutext

Performance

Library	Small PDF (31 KB)	Large PDF (1.3 MB)	Bundle size (gzip)	Dependencies
docutext	3 ms	40 ms	~24 KB	0
pdfjs-dist	5 ms	244 ms	~1.3 MB	0 (large)
pdf-parse	5 ms	279 ms	~780 KB	1
unpdf	6 ms	232 ms	~320 KB	2
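
The headline ratios can be checked against the table with plain arithmetic. The sketch below only restates the figures above; `BenchmarkRow` and `ratio` are illustrative names, not part of docutext, and 1.3 MB is taken as roughly 1300 KB.

```typescript
// Figures copied from the benchmark table above.
interface BenchmarkRow {
  library: string;
  largePdfMs: number; // extraction time for the 1.3 MB PDF
  bundleKb: number;   // approximate gzipped bundle size (1.3 MB taken as 1300 KB)
}

const docutext: BenchmarkRow = { library: "docutext", largePdfMs: 40, bundleKb: 24 };

const others: BenchmarkRow[] = [
  { library: "pdfjs-dist", largePdfMs: 244, bundleKb: 1300 },
  { library: "pdf-parse", largePdfMs: 279, bundleKb: 780 },
  { library: "unpdf", largePdfMs: 232, bundleKb: 320 },
];

// Hypothetical helper: how many times larger `other` is than `baseline`.
function ratio(baseline: number, other: number): number {
  return other / baseline;
}

for (const row of others) {
  const time = ratio(docutext.largePdfMs, row.largePdfMs).toFixed(1);
  const size = ratio(docutext.bundleKb, row.bundleKb).toFixed(1);
  console.log(`${row.library}: ${time}x the extraction time, ${size}x the bundle size`);
}
```

On the large PDF the time ratios fall between roughly 5.8x and 7x. The small PDF shows a narrower gap (3 ms versus 5 to 6 ms).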

Quick Start

Install

npm install docutext

Plain text

import { DocuText } from 'docutext';

const doc = await DocuText.load('document.pdf');
console.log(doc.text);

Structured Markdown

import { DocuText } from 'docutext';
import { docToMarkdown } from 'docutext/markdown';

const doc = await DocuText.load('document.pdf');
console.log(docToMarkdown(doc)); // headings, bold, links

Browser

import { DocuText } from 'docutext'; // auto-resolves browser entry

const res = await fetch('/document.pdf');
const bytes = new Uint8Array(await res.arrayBuffer());
const doc = DocuText.fromBuffer(bytes);
console.log(doc.text);

Key Features

Zero Dependencies

No PDF.js, no WASM, no native addons. The Node.js build has zero runtime dependencies; the browser build requires only fflate (~3 KB).

Tiny Bundle

About 24 KB gzipped, over 50x smaller than pdfjs-dist (~1.3 MB). Ideal for serverless and edge deployments.

Fast Extraction

About 6x faster than pdfjs-dist, pdf-parse, and unpdf on a 1.3 MB benchmark PDF (40 ms versus 232 to 279 ms). Pure TypeScript with no initialization overhead.

Markdown Output

Opt-in structured markdown with inferred headings, bold/italic, link extraction, and column-aware text flow.

Lazy Processing

Text is extracted on first access. Skip pages you don't need for instant startup.
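
Because page text is computed on first access, reading only the pages you need should avoid extraction work for the rest. A minimal sketch, assuming only the documented `number` and `text` getters; `PageLike` and `textOfPages` are illustrative names, not part of docutext.

```typescript
// Structural view of the documented PDFPage getters used here.
interface PageLike {
  readonly number: number; // 1-based page number
  readonly text: string;   // lazy: extracted on first access
}

// Hypothetical helper: return text only for the requested 1-based page numbers.
// Pages outside `wanted` are never read, so their text is never requested.
function textOfPages(pages: readonly PageLike[], wanted: readonly number[]): string[] {
  const wantedSet = new Set(wanted);
  const out: string[] = [];
  for (const page of pages) {
    if (wantedSet.has(page.number)) {
      out.push(page.text);
    }
  }
  return out;
}
```

With a loaded document this would be called as `textOfPages(doc.pages, [1, 2])`.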

Full PDF Parsing

Supports XRef tables and streams, FlateDecode/LZW/ASCII85 filters, ToUnicode CMaps, font encodings, and form XObjects.

Encrypted PDFs

Handles permission-encrypted PDFs (empty password) with RC4 and AES-128 decryption. Node.js only.

TypeScript First

Written in TypeScript with full type definitions. ESM only, tree-shakeable.

API Reference

DocuText

static DocuText.load(path: string, options?: LoadOptions): Promise<DocuText>

Load a PDF from a file path (Node.js only).

static DocuText.fromBuffer(data: Uint8Array, options?: LoadOptions): DocuText

Parse a PDF from a byte array. Works in both Node.js and browser.

get text: string

Full document text, all pages concatenated with double newlines.
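
For RAG pipelines it is often more useful to keep page boundaries than to work from one concatenated string. The sketch below first mirrors the documented joining behavior, then builds one chunk per non-empty page. It assumes a plain `"\n\n"` join as described above; `PageLike`, `Chunk`, `joinPages`, and `pageChunks` are illustrative names, not part of docutext.

```typescript
// Structural view of the documented PDFPage getters used here.
interface PageLike {
  readonly number: number;
  readonly text: string;
}

interface Chunk {
  page: number; // 1-based page number, kept as citation metadata
  text: string;
}

// Mirrors the documented behavior of `doc.text`: page texts joined by a blank line.
function joinPages(pages: readonly PageLike[]): string {
  return pages.map((p) => p.text).join("\n\n");
}

// Hypothetical RAG helper: one chunk per page, skipping pages with no text.
function pageChunks(pages: readonly PageLike[]): Chunk[] {
  const chunks: Chunk[] = [];
  for (const p of pages) {
    const text = p.text.trim();
    if (text.length > 0) {
      chunks.push({ page: p.number, text });
    }
  }
  return chunks;
}
```

Keeping the page number on each chunk lets a retrieval step point back to the source page.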

get pages: readonly PDFPage[]

Array of page objects.

get pageCount: number

Number of pages in the document.

get metadata: DocumentMetadata

Document metadata (title, author, subject, creator, producer, dates).

dispose(): void

Release internal references for garbage collection. Also supports [Symbol.dispose]().
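
One way to make sure `dispose()` always runs, even when processing throws, is a `try`/`finally` wrapper. A sketch, assuming only the documented `dispose()` method; `DisposableLike` and `withDisposal` are illustrative names, not part of docutext.

```typescript
// Anything exposing the documented dispose() method.
interface DisposableLike {
  dispose(): void;
}

// Hypothetical helper: run `use` and always dispose the resource afterwards.
// `use` must be synchronous; an async callback would see the resource
// disposed before its promise settles.
function withDisposal<T extends DisposableLike, R>(resource: T, use: (resource: T) => R): R {
  try {
    return use(resource);
  } finally {
    resource.dispose();
  }
}
```

With a loaded document this would look like `withDisposal(doc, (d) => d.text)`. In TypeScript 5.2 or later, a `using` declaration should achieve the same effect through `[Symbol.dispose]()`.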

PDFPage

get number: number

1-based page number.

get text: string

Extracted text for this page. Lazy -- computed on first access.

get textItems: TextItem[]

Raw text items with position, font, and size information.

get error: Error | null

Any error encountered during extraction for this page.
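
The per-page `error` getter suggests that one bad page need not abort the whole document. The sketch below separates readable pages from failed ones. It assumes that `error` is populated only after `text` has been requested (extraction is lazy) and that reading `text` on a failed page does not throw; neither is confirmed here. `PageLike`, `ExtractionReport`, and `extractWithReport` are illustrative names, not part of docutext.

```typescript
// Structural view of the documented PDFPage getters used here.
interface PageLike {
  readonly number: number;
  readonly text: string;
  readonly error: Error | null;
}

interface ExtractionReport {
  texts: Map<number, string>;    // page number -> extracted text
  failures: Map<number, string>; // page number -> error message
}

// Hypothetical helper: extract every page and record which ones failed.
function extractWithReport(pages: readonly PageLike[]): ExtractionReport {
  const texts = new Map<number, string>();
  const failures = new Map<number, string>();
  for (const page of pages) {
    // Read text first: with lazy extraction, `error` is assumed to be set
    // only once extraction has been attempted.
    const text = page.text;
    const err = page.error;
    if (err !== null) {
      failures.set(page.number, err.message);
    } else {
      texts.set(page.number, text);
    }
  }
  return { texts, failures };
}
```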

dispose(): void

Release page data for garbage collection.

docutext/markdown

docToMarkdown(doc: DocuText, separator?: string): string

Convert an entire document to structured markdown.

pageToMarkdown(page: PDFPage): string

Convert a single page to structured markdown.

LoadOptions

stripFormPlaceholders?: boolean

Remove form field placeholder text (default: true).

includeInvisibleText?: boolean

Include invisible (render mode 3) text (default: false).
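
The documented defaults can be written out as a small options-resolution sketch. The `LoadOptions` shape and default values come from the reference above; `DEFAULT_LOAD_OPTIONS` and `resolveLoadOptions` are illustrative names and do not describe how docutext applies defaults internally.

```typescript
// Shape of the documented options.
interface LoadOptions {
  stripFormPlaceholders?: boolean;
  includeInvisibleText?: boolean;
}

// Defaults as stated in the reference above.
const DEFAULT_LOAD_OPTIONS: Required<LoadOptions> = {
  stripFormPlaceholders: true,
  includeInvisibleText: false,
};

// Hypothetical helper: the effective options once defaults are applied.
// `??` is used instead of object spread so that an explicit `undefined`
// falls back to the default rather than overriding it.
function resolveLoadOptions(options: LoadOptions = {}): Required<LoadOptions> {
  return {
    stripFormPlaceholders:
      options.stripFormPlaceholders ?? DEFAULT_LOAD_OPTIONS.stripFormPlaceholders,
    includeInvisibleText:
      options.includeInvisibleText ?? DEFAULT_LOAD_OPTIONS.includeInvisibleText,
  };
}
```

In practice the options object is passed as the second argument, for example `DocuText.load('document.pdf', { includeInvisibleText: true })`.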

Playground

Open a PDF to see docutext in action. Toggle between plain text and structured markdown output. All processing happens locally in your browser -- no files are uploaded.

