API reference

Convert documents programmatically

Upload, poll, download. The API is versioned at /api/v1/. Same credit balance as the web UI; no separate plan, no per-key surcharge.

Authentication

Generate a key from the dashboard. Pass it as a Bearer token:

Authorization: Bearer dpk_live_<your_key>

Each key is shown once at creation; only its hash is stored on the server. Rate limit: 60 requests/minute per key.
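Any HTTP client works. A minimal sketch in Python with the requests library, hitting the job-status route documented below (key and job id are placeholders):

import requests

API = "https://docparser.app/api/v1"
HEADERS = {"Authorization": "Bearer dpk_live_<your_key>"}

resp = requests.get(f"{API}/jobs/<job_id>", headers=HEADERS, timeout=30)
resp.raise_for_status()   # 401 UNAUTHENTICATED if the key is missing or revoked
print(resp.json()["status"])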

POST /api/v1/jobs

Upload a document for conversion. Credits are charged automatically once the estimate is computed.

Request

POST /api/v1/jobs
Authorization: Bearer dpk_live_...
Content-Type: multipart/form-data

file:        <binary>          # the document
outputs[]:   markdown           # one or more (repeated field)
outputs[]:   docx

Supported targets per source

Source                           Targets
.pdf                             markdown, chunks, images, docx, xlsx, pptx
.docx                            markdown, chunks, pdf
.pptx                            markdown, chunks, pdf
.xlsx                            markdown, chunks, csv, pdf
.csv, .txt, .md, .eml            markdown, chunks
.jpg, .jpeg, .png, .heic, .heif  jpg, png, pdf, image, markdown
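For client-side validation before upload, this table translates directly into a lookup keyed by extension. A sketch in Python (names are illustrative; the mapping mirrors the table above):

import pathlib

TEXT_TARGETS  = {"markdown", "chunks"}
IMAGE_TARGETS = {"jpg", "png", "pdf", "image", "markdown"}
TARGETS = {
    ".pdf":  {"markdown", "chunks", "images", "docx", "xlsx", "pptx"},
    ".docx": {"markdown", "chunks", "pdf"},
    ".pptx": {"markdown", "chunks", "pdf"},
    ".xlsx": {"markdown", "chunks", "csv", "pdf"},
    **{e: TEXT_TARGETS for e in (".csv", ".txt", ".md", ".eml")},
    **{e: IMAGE_TARGETS for e in (".jpg", ".jpeg", ".png", ".heic", ".heif")},
}

def check_outputs(path: str, outputs: list[str]) -> None:
    # Reject unsupported targets locally instead of burning a request.
    ext = pathlib.Path(path).suffix.lower()
    bad = set(outputs) - TARGETS.get(ext, set())
    if bad:
        raise ValueError(f"{ext} cannot produce: {sorted(bad)}")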

Response (202)

{
  "job_id": "6ec9b16d-c551-48fc-aebe-b151fe3d20d8",
  "status": "PENDING_CLASSIFICATION",
  "page_count": 3,
  "requested_outputs": ["markdown", "docx"],
  "credits_estimated": 6
}
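The same request in Python, sketched with the requests library (a list value for outputs[] is encoded as a repeated form field):

import requests

with open("contract.pdf", "rb") as f:
    resp = requests.post(
        "https://docparser.app/api/v1/jobs",
        headers={"Authorization": "Bearer dpk_live_..."},
        files={"file": f},                          # multipart upload
        data={"outputs[]": ["markdown", "docx"]},   # repeated field
        timeout=120,
    )
resp.raise_for_status()
job = resp.json()                                   # the 202 body above
print(job["job_id"], job["credits_estimated"])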

GET /api/v1/jobs/:id

Poll job status. The job moves through this state machine:

PENDING_CLASSIFICATION → CLASSIFYING → AWAITING_CONFIRMATION
                                     → PROCESSING → DONE
                                     → ERROR

Response (200) — DONE

{
  "job_id": "6ec9b16d-...",
  "status": "DONE",
  "page_count": 3,
  "credits_charged": 6,
  "duration_ms": 5661,
  "download_url": "https://.../outputs/.../all.zip?...",
  "error_code": null,
  "error_message": null
}

The download URL is a presigned link valid for 7 days. The zip contains a subdirectory per requested output (e.g. markdown/sample.md, docx/sample.docx).
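A polling loop in Python, sketched with requests; terminal states and the presigned download URL are as described above. The wait_for helper is illustrative, not part of any SDK:

import time
import requests

API = "https://docparser.app/api/v1"
HEADERS = {"Authorization": "Bearer dpk_live_..."}

def wait_for(job_id: str, interval: float = 2.0) -> dict:
    # Poll until the job reaches a terminal state (DONE or ERROR).
    while True:
        job = requests.get(f"{API}/jobs/{job_id}", headers=HEADERS, timeout=30).json()
        if job["status"] in ("DONE", "ERROR"):
            return job
        time.sleep(interval)

job = wait_for("<job_id>")
if job["status"] == "DONE":
    # The download URL is presigned: no Authorization header needed.
    zip_bytes = requests.get(job["download_url"], timeout=300).content
    with open("output.zip", "wb") as out:
        out.write(zip_bytes)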

Chunks (RAG)

Structure-aware chunks for RAG. Drop-in JSONL with breadcrumbs and heading paths included — not bolted on after the fact. Bring your own embedding model.

Each chunk carries its position in the document hierarchy (breadcrumb, heading_path), its type (text, table, etc.), and the strategy that produced it. Document type is detected automatically — no parameter tuning required.

Request

POST /api/v1/jobs
Authorization: Bearer dpk_live_...
Content-Type: multipart/form-data

file:        <binary>          # the document
outputs[]:   chunks

Output: chunks/chunks.jsonl

The first line is a manifest; each subsequent line is one chunk.

{"_manifest": true, "schema_version": 1, "source_filename": "paper.pdf",
 "chunk_count": 42, "strategy": "general", "created_at": "2026-04-30T13:36:00Z"}
{"id": "8f6c...", "text": "...", "chunk_index": 0, "token_count": 312,
 "breadcrumb": "Chapter 2 > Section 2.1",
 "heading_path": ["Chapter 2", "Section 2.1"],
 "chunk_type": "text", "source_filename": "paper.pdf",
 "strategy_used": "general", "metadata": {}}

Schema

Field            Type      Notes
id               string    Stable within a job; uuid4 across jobs
text             string    Chunk content (markdown)
chunk_index      int       0-based position in the document
token_count      int       Estimated tokens
breadcrumb       string    e.g. "Chapter 2 > Section 2.1"
heading_path     string[]  Ancestor headings, top to bottom
chunk_type       enum      text | table | whole_doc | cross_reference | formula_annotation
source_filename  string    Original upload filename
strategy_used    string    general | manual | table | one | spreadsheet_advanced
metadata         object    Strategy-specific extras (opaque)
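As a typed record the schema maps to something like this Python TypedDict (a sketch; the schema is additive-only, so tolerate unknown keys):

from typing import Any, Literal, TypedDict

class Chunk(TypedDict):
    id: str
    text: str                      # markdown
    chunk_index: int               # 0-based
    token_count: int               # estimated
    breadcrumb: str
    heading_path: list[str]        # ancestor headings, top to bottom
    chunk_type: Literal["text", "table", "whole_doc",
                        "cross_reference", "formula_annotation"]
    source_filename: str
    strategy_used: str             # general | manual | table | one | spreadsheet_advanced
    metadata: dict[str, Any]       # opaque, strategy-specific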

Embed and ingest (OpenAI + Chroma)

import json

import chromadb
import openai

client = openai.OpenAI()
col = chromadb.PersistentClient("./db").get_or_create_collection("docs")

with open("chunks/chunks.jsonl", encoding="utf-8") as f:
    for line in f:
        r = json.loads(line)
        if r.get("_manifest"):      # skip the manifest header line
            continue
        vec = client.embeddings.create(
            input=r["text"], model="text-embedding-3-small"
        ).data[0].embedding
        col.add(
            ids=[r["id"]],
            embeddings=[vec],
            documents=[r["text"]],
            metadatas=[{"breadcrumb": r["breadcrumb"], "source": r["source_filename"]}],
        )

The schema is v1 and additive-only: new fields may appear, but existing fields will not be renamed or removed without a version bump. Chunk ids are not stable across re-runs of the same job; if you re-ingest, dedupe on (source_filename, chunk_index).
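Building on the ingestion example above, a sketch of that dedupe rule: derive a deterministic id from (source_filename, chunk_index) and upsert rather than add (Chroma collections support upsert):

def stable_id(chunk: dict) -> str:
    # Deterministic across re-runs, unlike the per-job uuid in chunk["id"].
    return f"{chunk['source_filename']}:{chunk['chunk_index']}"

col.upsert(                         # replaces col.add(...) in the loop above
    ids=[stable_id(r)],
    embeddings=[vec],
    documents=[r["text"]],
    metadatas=[{"breadcrumb": r["breadcrumb"], "source": r["source_filename"]}],
)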

Error codes

HTTP  Code                  Meaning
400   MISSING_FILE          Request did not include a 'file' field
400   BAD_JSON              Body is malformed JSON
401   UNAUTHENTICATED       Missing, malformed, or revoked API key
402   INSUFFICIENT_CREDITS  Account balance below estimate (set on the job after classification)
403   FORBIDDEN             Job belongs to another account
404   NOT_FOUND             Job not found
413   TOO_LARGE             File exceeds 32 MB
415   BAD_EXTENSION         Extension not in allowlist
415   BAD_MAGIC             Detected MIME does not match an allowed type
422   PAGE_LIMIT            PDF exceeds 600 pages
422   ZIP_BOMB              Office archive ratio or uncompressed size exceeds limits
429   RATE_LIMITED          Per-key rate limit exceeded (60 req/min)

Operational errors discovered during conversion land on the job record as status=ERROR with a descriptive error_code and error_message. Any credits charged for the job are auto-refunded.
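That leaves two failure surfaces to handle: HTTP status codes at submit time, and the job record afterwards. A sketch reusing the wait_for helper from the polling example:

import requests

API = "https://docparser.app/api/v1"
HEADERS = {"Authorization": "Bearer dpk_live_..."}

with open("contract.pdf", "rb") as f:
    resp = requests.post(f"{API}/jobs", headers=HEADERS,
                         files={"file": f}, data={"outputs[]": ["markdown"]})

if resp.status_code == 402:          # INSUFFICIENT_CREDITS
    raise SystemExit("balance below the job estimate; top up first")
if resp.status_code == 429:          # RATE_LIMITED (60 req/min per key)
    raise SystemExit("rate limited; retry after backing off")
resp.raise_for_status()              # any other 4xx

# Operational failures surface on the job record, not as HTTP errors.
job = wait_for(resp.json()["job_id"])
if job["status"] == "ERROR":
    print(job["error_code"], job["error_message"])   # credits auto-refunded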

Example (curl)

# upload + auto-charge + auto-process
curl -sS -X POST https://docparser.app/api/v1/jobs \
  -H "Authorization: Bearer dpk_live_..." \
  -F "file=@./contract.pdf" \
  -F "outputs[]=markdown" \
  -F "outputs[]=docx"

# poll until done
JOB_ID=...
while true; do
  STATUS=$(curl -sS "https://docparser.app/api/v1/jobs/$JOB_ID" \
    -H "Authorization: Bearer dpk_live_..." | jq -r .status)
  [[ "$STATUS" == "DONE" || "$STATUS" == "ERROR" ]] && break
  sleep 2
done

# fetch the result
curl -sS "https://docparser.app/api/v1/jobs/$JOB_ID" \
  -H "Authorization: Bearer dpk_live_..." | jq -r .download_url | \
  xargs curl -sS -o output.zip