API reference

Convert documents programmatically

Upload, poll, download. The API is versioned at /api/v1/. Same credit balance as the web UI; no separate plan, no per-key surcharge.

Authentication

Generate a key from the dashboard. Pass it as a Bearer token:

Authorization: Bearer dpk_live_<your_key>

Each key is shown once at creation; only its hash is stored on the server. Rate limit: 60 requests/minute per key.
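Any HTTP client works. A minimal sketch in Python with the requests library, hitting the job-status route documented below (key and job id are placeholders):

import requests

API = "https://docparser.app/api/v1"
HEADERS = {"Authorization": "Bearer dpk_live_<your_key>"}

resp = requests.get(f"{API}/jobs/<job_id>", headers=HEADERS, timeout=30)
resp.raise_for_status()   # 401 UNAUTHENTICATED if the key is missing or revoked
print(resp.json()["status"])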

POST /api/v1/jobs

Upload a document for conversion. Credits are charged automatically once the estimate is computed.

Request

POST /api/v1/jobs
Authorization: Bearer dpk_live_...
Content-Type: multipart/form-data

file:        <binary>          # the document
outputs[]:   markdown           # one or more (repeated field)
outputs[]:   docx

Supported targets per source

Source                           Targets
.pdf                             markdown, chunks, images, docx, xlsx, pptx
.docx                            markdown, chunks, pdf
.pptx                            markdown, chunks, pdf
.xlsx                            markdown, chunks, csv, pdf
.csv, .txt, .md, .eml            markdown, chunks
.jpg, .jpeg, .png, .heic, .heif  jpg, png, pdf, image, markdown
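For client-side validation before upload, this table translates directly into a lookup keyed by extension. A sketch in Python (names are illustrative; the mapping mirrors the table above):

import pathlib

TEXT_TARGETS  = {"markdown", "chunks"}
IMAGE_TARGETS = {"jpg", "png", "pdf", "image", "markdown"}
TARGETS = {
    ".pdf":  {"markdown", "chunks", "images", "docx", "xlsx", "pptx"},
    ".docx": {"markdown", "chunks", "pdf"},
    ".pptx": {"markdown", "chunks", "pdf"},
    ".xlsx": {"markdown", "chunks", "csv", "pdf"},
    **{e: TEXT_TARGETS for e in (".csv", ".txt", ".md", ".eml")},
    **{e: IMAGE_TARGETS for e in (".jpg", ".jpeg", ".png", ".heic", ".heif")},
}

def check_outputs(path: str, outputs: list[str]) -> None:
    # Reject unsupported targets locally instead of burning a request.
    ext = pathlib.Path(path).suffix.lower()
    bad = set(outputs) - TARGETS.get(ext, set())
    if bad:
        raise ValueError(f"{ext} cannot produce: {sorted(bad)}")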

Response (202)

{
  "job_id": "6ec9b16d-c551-48fc-aebe-b151fe3d20d8",
  "status": "PENDING_CLASSIFICATION",
  "page_count": 3,
  "requested_outputs": ["markdown", "docx"],
  "credits_estimated": 6
}
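The same request in Python, sketched with the requests library (a list value for outputs[] is encoded as a repeated form field):

import requests

with open("contract.pdf", "rb") as f:
    resp = requests.post(
        "https://docparser.app/api/v1/jobs",
        headers={"Authorization": "Bearer dpk_live_..."},
        files={"file": f},                          # multipart upload
        data={"outputs[]": ["markdown", "docx"]},   # repeated field
        timeout=120,
    )
resp.raise_for_status()
job = resp.json()                                   # the 202 body above
print(job["job_id"], job["credits_estimated"])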

GET /api/v1/jobs/:id

Poll job status. The job moves through this state machine:

PENDING_CLASSIFICATION → CLASSIFYING → AWAITING_CONFIRMATION
                                     → PROCESSING → DONE
                                     → ERROR

Response (200) — DONE

{
  "job_id": "6ec9b16d-...",
  "status": "DONE",
  "page_count": 3,
  "credits_charged": 6,
  "duration_ms": 5661,
  "download_url": "https://.../outputs/.../all.zip?...",
  "error_code": null,
  "error_message": null
}

The download URL is a presigned link valid for 7 days. The zip contains a subdirectory per requested output (e.g. markdown/sample.md, docx/sample.docx).
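A polling loop in Python, sketched with requests; terminal states and the presigned download URL are as described above. The wait_for helper is illustrative, not part of any SDK:

import time
import requests

API = "https://docparser.app/api/v1"
HEADERS = {"Authorization": "Bearer dpk_live_..."}

def wait_for(job_id: str, interval: float = 2.0) -> dict:
    # Poll until the job reaches a terminal state (DONE or ERROR).
    while True:
        job = requests.get(f"{API}/jobs/{job_id}", headers=HEADERS, timeout=30).json()
        if job["status"] in ("DONE", "ERROR"):
            return job
        time.sleep(interval)

job = wait_for("<job_id>")
if job["status"] == "DONE":
    # The download URL is presigned: no Authorization header needed.
    zip_bytes = requests.get(job["download_url"], timeout=300).content
    with open("output.zip", "wb") as out:
        out.write(zip_bytes)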

Chunks (RAG)

Structure-aware chunks for RAG. Drop-in JSONL with breadcrumbs and heading paths included — not bolted on after the fact. Bring your own embedding model.

Each chunk carries its position in the document hierarchy (breadcrumb, heading_path), its type (text, table, etc.), and the strategy that produced it. Document type is detected automatically — no parameter tuning required.

Request

POST /api/v1/jobs
Authorization: Bearer dpk_live_...
Content-Type: multipart/form-data

file:        <binary>          # the document
outputs[]:   chunks

Output: chunks/chunks.jsonl

The first line is a manifest; each subsequent line is one chunk.

{"_manifest": true, "schema_version": 1, "source_filename": "paper.pdf",
 "chunk_count": 42, "strategy": "general", "created_at": "2026-04-30T13:36:00Z"}
{"id": "8f6c...", "text": "...", "chunk_index": 0, "token_count": 312,
 "breadcrumb": "Chapter 2 > Section 2.1",
 "heading_path": ["Chapter 2", "Section 2.1"],
 "chunk_type": "text", "source_filename": "paper.pdf",
 "strategy_used": "general", "metadata": {}}

Schema

Field            Type      Notes
id               string    Stable within a job; uuid4 across jobs
text             string    Chunk content (markdown)
chunk_index      int       0-based position in the document
token_count      int       Estimated tokens
breadcrumb       string    e.g. "Chapter 2 > Section 2.1"
heading_path     string[]  Ancestor headings, top to bottom
chunk_type       enum      text | table | whole_doc | cross_reference | formula_annotation
source_filename  string    Original upload filename
strategy_used    string    general | manual | table | one | spreadsheet_advanced
metadata         object    Strategy-specific extras (opaque)
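As a typed record the schema maps to something like this Python TypedDict (a sketch; the schema is additive-only, so tolerate unknown keys):

from typing import Any, Literal, TypedDict

class Chunk(TypedDict):
    id: str
    text: str                      # markdown
    chunk_index: int               # 0-based
    token_count: int               # estimated
    breadcrumb: str
    heading_path: list[str]        # ancestor headings, top to bottom
    chunk_type: Literal["text", "table", "whole_doc",
                        "cross_reference", "formula_annotation"]
    source_filename: str
    strategy_used: str             # general | manual | table | one | spreadsheet_advanced
    metadata: dict[str, Any]       # opaque, strategy-specific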

Embed and ingest (OpenAI + Chroma)

import json

import chromadb
import openai

client = openai.OpenAI()
col = chromadb.PersistentClient("./db").get_or_create_collection("docs")

with open("chunks/chunks.jsonl", encoding="utf-8") as f:
    for line in f:
        r = json.loads(line)
        if r.get("_manifest"):      # skip the manifest header line
            continue
        vec = client.embeddings.create(
            input=r["text"], model="text-embedding-3-small"
        ).data[0].embedding
        col.add(
            ids=[r["id"]],
            embeddings=[vec],
            documents=[r["text"]],
            metadatas=[{"breadcrumb": r["breadcrumb"], "source": r["source_filename"]}],
        )

The schema is v1 and additive-only: new fields may appear, but existing fields will not be renamed or removed without a version bump. Chunk ids are not stable across re-runs of the same job; if you re-ingest, dedupe on (source_filename, chunk_index).
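Building on the ingestion example above, a sketch of that dedupe rule: derive a deterministic id from (source_filename, chunk_index) and upsert rather than add (Chroma collections support upsert):

def stable_id(chunk: dict) -> str:
    # Deterministic across re-runs, unlike the per-job uuid in chunk["id"].
    return f"{chunk['source_filename']}:{chunk['chunk_index']}"

col.upsert(                         # replaces col.add(...) in the loop above
    ids=[stable_id(r)],
    embeddings=[vec],
    documents=[r["text"]],
    metadatas=[{"breadcrumb": r["breadcrumb"], "source": r["source_filename"]}],
)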

Error codes

HTTP  Code                  Meaning
400   MISSING_FILE          Request did not include a 'file' field
400   BAD_JSON              Body is malformed JSON
401   UNAUTHENTICATED       Missing, malformed, or revoked API key
402   INSUFFICIENT_CREDITS  Account balance below estimate (set on the job after classification)
403   FORBIDDEN             Job belongs to another account
404   NOT_FOUND             Job not found
413   TOO_LARGE             File exceeds 32 MB
415   BAD_EXTENSION         Extension not in allowlist
415   BAD_MAGIC             Detected MIME does not match an allowed type
422   PAGE_LIMIT            PDF exceeds 600 pages
422   ZIP_BOMB              Office archive ratio or uncompressed size exceeds limits
429   RATE_LIMITED          Per-key rate limit exceeded (60 req/min)

Operational errors discovered during conversion land on the job record as status=ERROR with a descriptive error_code and error_message. Any credits charged for the job are auto-refunded.
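That leaves two failure surfaces to handle: HTTP status codes at submit time, and the job record afterwards. A sketch reusing the wait_for helper from the polling example:

import requests

API = "https://docparser.app/api/v1"
HEADERS = {"Authorization": "Bearer dpk_live_..."}

with open("contract.pdf", "rb") as f:
    resp = requests.post(f"{API}/jobs", headers=HEADERS,
                         files={"file": f}, data={"outputs[]": ["markdown"]})

if resp.status_code == 402:          # INSUFFICIENT_CREDITS
    raise SystemExit("balance below the job estimate; top up first")
if resp.status_code == 429:          # RATE_LIMITED (60 req/min per key)
    raise SystemExit("rate limited; retry after backing off")
resp.raise_for_status()              # any other 4xx

# Operational failures surface on the job record, not as HTTP errors.
job = wait_for(resp.json()["job_id"])
if job["status"] == "ERROR":
    print(job["error_code"], job["error_message"])   # credits auto-refunded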

Example (curl)

# upload + auto-charge + auto-process
curl -sS -X POST https://docparser.app/api/v1/jobs \
  -H "Authorization: Bearer dpk_live_..." \
  -F "file=@./contract.pdf" \
  -F "outputs[]=markdown" \
  -F "outputs[]=docx"

# poll until done
JOB_ID=...
while true; do
  STATUS=$(curl -sS "https://docparser.app/api/v1/jobs/$JOB_ID" \
    -H "Authorization: Bearer dpk_live_..." | jq -r .status)
  [[ "$STATUS" == "DONE" || "$STATUS" == "ERROR" ]] && break
  sleep 2
done

# fetch the result
curl -sS "https://docparser.app/api/v1/jobs/$JOB_ID" \
  -H "Authorization: Bearer dpk_live_..." | jq -r .download_url | \
  xargs curl -sS -o output.zip