skills/liteparse/references/output_formats.md
## Text (`--format text`)

- CLI: `--format text` with `-o file`.
- Python: `ParseResult.text` holds the full document; each `ParsedPage.text` holds page-level text.

Use text output when feeding chunkers, summarizers, or keyword search that do not need coordinates.
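
As a rough illustration of feeding plain text to a chunker, the sketch below packs blank-line-separated paragraphs into size-bounded chunks. The `chunk_text` helper and the sample string are hypothetical and not part of LiteParse; in practice the input would be `result.text` or a page's `text`.

```python
def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    A chunk never exceeds max_chars unless a single paragraph is
    longer than max_chars, in which case it becomes its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue  # skip empty paragraphs
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


# Stand-in for result.text (hypothetical sample, not real parser output)
sample = "Intro paragraph.\n\nSecond paragraph with details.\n\nThird."
chunks = chunk_text(sample, max_chars=40)
```

Because no coordinates survive in text output, chunks produced this way cannot be mapped back to a region of the page; use JSON output when that is needed.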
## JSON (`--format json`)

```bash
lit parse document.pdf --format json -o document.json
```
The CLI serializes the native parse result. Structure aligns with the Python object model below.
After parser.parse(path), use result.pages and result.text. To emit JSON manually:
```python
import json

# `result` is the object returned by parser.parse(path)

# Simple serialization pattern (adapt fields as needed)
def page_to_dict(page):
    return {
        "page_num": page.page_num,
        "width": page.width,
        "height": page.height,
        "text": page.text,
        "text_items": [
            {
                "text": item.text,
                "x": item.x,
                "y": item.y,
                "width": item.width,
                "height": item.height,
                "font_name": item.font_name,
                "font_size": item.font_size,
                "confidence": item.confidence,
            }
            for item in page.text_items
        ],
    }

payload = {
    "text": result.text,
    "pages": [page_to_dict(p) for p in result.pages],
}

# Use a context manager so the file is flushed and closed
with open("out.json", "w", encoding="utf-8") as f:
    json.dump(payload, f, indent=2)
```
Example JSON shape:

```json
{
  "text": "Full document text...\n",
  "pages": [
    {
      "page_num": 1,
      "width": 612.0,
      "height": 792.0,
      "text": "Page 1 text...",
      "text_items": [
        {
          "text": "Introduction",
          "x": 72.0,
          "y": 100.0,
          "width": 120.0,
          "height": 14.0,
          "font_name": "Times-Bold",
          "font_size": 12.0,
          "confidence": null
        },
        {
          "text": "scanned phrase",
          "x": 80.0,
          "y": 400.0,
          "width": 200.0,
          "height": 12.0,
          "font_name": null,
          "font_size": null,
          "confidence": 0.94
        }
      ]
    }
  ]
}
```
Exact CLI JSON keys follow upstream serialization and may differ from the example shown here; treat `text_items` geometry as authoritative for grounding.
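
For consumers of the JSON, a minimal sketch of reading the payload and flagging low-confidence OCR items might look like the following. It assumes the key names shown in this document (`pages`, `text_items`, `confidence`); the `low_confidence_items` helper, the 0.8 threshold, and the third sample item are invented for illustration.

```python
import json

# Inline stand-in for the contents of document.json (illustrative values)
doc = json.loads("""
{"text": "...", "pages": [{"page_num": 1, "width": 612.0, "height": 792.0,
  "text": "...",
  "text_items": [
    {"text": "Introduction", "x": 72.0, "y": 100.0,
     "width": 120.0, "height": 14.0, "confidence": null},
    {"text": "scanned phrase", "x": 80.0, "y": 400.0,
     "width": 200.0, "height": 12.0, "confidence": 0.94},
    {"text": "sm udged", "x": 80.0, "y": 420.0,
     "width": 90.0, "height": 12.0, "confidence": 0.41}
  ]}]}
""")


def low_confidence_items(doc, threshold=0.8):
    """Yield (page_num, item) for items whose confidence is below threshold.

    Items whose confidence is None (JSON null) are skipped.
    """
    for page in doc.get("pages", []):
        for item in page.get("text_items", []):
            conf = item.get("confidence")
            if conf is not None and conf < threshold:
                yield page["page_num"], item


flagged = [(num, item["text"]) for num, item in low_confidence_items(doc)]
```

The `.get()` lookups are deliberate: since the exact keys depend on upstream serialization, missing fields are treated as absent rather than raising.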
- `TextItem` uses `(x, y, width, height)`: top-left corner plus size in page units (typically PDF points).
- Corner-style boxes `[x1, y1, x2, y2]` are normalized by LiteParse into `x, y, width, height` internally:

```python
x1, y1, x2, y2 = bbox
x, y, width, height = x1, y1, x2 - x1, y2 - y1
```
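
The conversion between the two box conventions can be wrapped in a pair of small helpers, which is handy when downstream tools expect corner coordinates. The function names below are hypothetical and not part of the LiteParse API.

```python
def corners_to_xywh(bbox):
    """Convert [x1, y1, x2, y2] corners to (x, y, width, height)."""
    x1, y1, x2, y2 = bbox
    return x1, y1, x2 - x1, y2 - y1


def xywh_to_corners(x, y, width, height):
    """Convert (x, y, width, height) back to [x1, y1, x2, y2] corners."""
    return [x, y, x + width, y + height]


xywh = corners_to_xywh([80.0, 400.0, 280.0, 412.0])
corners = xywh_to_corners(*xywh)
```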
- `confidence` is reported on `text_items` (since upstream v1.4.0).
- It is `null` for native PDF text extraction.

Use `search_items()` when a query spans multiple `text_items`:
```python
from liteparse import search_items

hits = search_items(page.text_items, "Supplementary Table 1")
for hit in hits:
    # hit.text: matched phrase
    # hit.x, hit.y, hit.width, hit.height: merged bbox
    print(hit.text, (hit.x, hit.y, hit.width, hit.height))
```
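
To make "merged bbox" concrete, here is one plausible way to compute a box covering several items: take the union of their rectangles. This is an illustrative sketch only; the `Box` class and `union_box` helper are hypothetical, and LiteParse's own merging logic may differ.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for the geometry fields of a text item."""
    x: float
    y: float
    width: float
    height: float


def union_box(boxes):
    """Return the smallest Box that contains every box in `boxes`."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("union_box() needs at least one box")
    left = min(b.x for b in boxes)
    top = min(b.y for b in boxes)
    right = max(b.x + b.width for b in boxes)
    bottom = max(b.y + b.height for b in boxes)
    return Box(left, top, right - left, bottom - top)


# Three adjacent words on roughly the same line (made-up coordinates)
merged = union_box([
    Box(72.0, 100.0, 90.0, 12.0),
    Box(166.0, 100.0, 40.0, 12.0),
    Box(210.0, 101.0, 10.0, 12.0),
])
```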
- Chunk on `page.text`, or group `text_items` by vertical bands.
- Store `(page_num, x, y, width, height)` with each chunk.
- Pair chunks with `screenshot()` PNGs for the same `page_num`.
- Flag or filter items with `confidence` below a threshold on OCR-heavy pages.
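
Grouping by vertical bands and storing a grounding tuple with each chunk could be sketched as follows. It operates on plain dicts shaped like the JSON `text_items`; the helper names, the fixed 20-unit band height, and the sample items are assumptions for illustration, not LiteParse behavior.

```python
def group_by_bands(items, band_height=20.0):
    """Bucket items by floor(y / band_height).

    Returns bands ordered top to bottom, each sorted left to right.
    """
    bands = {}
    for item in items:
        bands.setdefault(int(item["y"] // band_height), []).append(item)
    return [sorted(bands[k], key=lambda it: it["x"]) for k in sorted(bands)]


def band_to_chunk(page_num, band):
    """Join a band's text and attach (page_num, x, y, width, height)."""
    left = min(it["x"] for it in band)
    top = min(it["y"] for it in band)
    right = max(it["x"] + it["width"] for it in band)
    bottom = max(it["y"] + it["height"] for it in band)
    return {
        "text": " ".join(it["text"] for it in band),
        "grounding": (page_num, left, top, right - left, bottom - top),
    }


# Made-up items, deliberately out of reading order
items = [
    {"text": "World", "x": 110.0, "y": 102.0, "width": 40.0, "height": 12.0},
    {"text": "Hello", "x": 72.0, "y": 100.0, "width": 36.0, "height": 12.0},
    {"text": "Footer", "x": 72.0, "y": 700.0, "width": 50.0, "height": 10.0},
]
chunks = [band_to_chunk(1, band) for band in group_by_bands(items)]
```

A fixed grid is the simplest choice but can split a line whose items straddle a band boundary; clustering on gaps between sorted `y` values is a common alternative when that matters.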