examples/sec_edgar_analytics/README.md
Semantic relevance and keyword presence combined, not either alone — in plain async Python.
</p> <p align="center"> <strong>Star us ❤️ →</strong> <a href="https://github.com/cocoindex-io/cocoindex" title="Star CocoIndex on GitHub"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-light.svg"></picture></a> · <a href="https://cocoindex.io/docs/examples/sec-edgar-analytics/" title="Read the full walkthrough"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-light.svg"></picture></a> · <a href="https://discord.com/invite/zpA9S2DR7s" title="Join the CocoIndex Discord"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-light.svg"></picture></a> </p> <div align="center"> </div>

SEC filings come in many shapes: narrative 10-K risk factors as text, structured financials as XBRL JSON, exhibits as PDF. This pipeline pulls those formats into a single searchable index in Apache Doris, with both a vector index for semantic search and a full-text index for keyword search. You declare the transformation in native Python, `target_state = transformation(source_state)`, and the Rust engine handles incremental processing, so adding a filing embeds and loads only its chunks.
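
To make the `target_state = transformation(source_state)` idea concrete, here is a conceptual sketch in plain Python of what reconciling a declared target state could look like. It is not CocoIndex's Rust engine or its API: `reconcile` and the dict-based state are hypothetical names used only to illustrate why adding one filing touches only that filing's rows.

```python
def reconcile(previous: dict[str, str], desired: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Return the (upserts, deletes) needed to move `previous` to `desired`.

    Both arguments map a row key (e.g. a chunk id) to row content.
    Hypothetical helper for illustration only.
    """
    # Rows that are new, or whose content changed, must be written.
    upserts = {key: row for key, row in desired.items() if previous.get(key) != row}
    # Rows that are no longer declared must be removed.
    deletes = [key for key in previous if key not in desired]
    return upserts, deletes


previous = {"chunk-a": "old text", "chunk-b": "unchanged"}
desired = {"chunk-b": "unchanged", "chunk-c": "new filing text"}
upserts, deletes = reconcile(previous, desired)
```

In this toy run only `chunk-c` is written and only `chunk-a` is removed; the unchanged row is left alone.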
Two source formats fan into one chunk table:
- `*.txt` 10-K filings and `*.json` XBRL company facts (the JSON is rendered to searchable text first).
- Each chunk is tagged with `RISK:*` / `TOPIC:*` labels.

The magic is in `mount_table_target`: the same table gets a vector index (for `l2_distance` semantic search) and an inverted index (for `MATCH_ANY` keyword search). Read it in `main.py`:
```python
table = await doris.mount_table_target(
    DORIS_DB, TABLE,
    await doris.TableSchema.from_class(FilingChunk, primary_key=["chunk_id"]),
    vector_indexes=[doris.VectorIndexDef(field_name="embedding", metric_type="l2_distance")],
    inverted_indexes=[doris.InvertedIndexDef(field_name="text", parser="unicode")],
)

# PII is redacted *before* chunking, so it never enters the index. Both source
# formats funnel into one shared path: scrub, chunk, embed, tag, declare a row per chunk.
async def _index_text(text, source_type, filename, cik, filing_date, form_type, table):
    embedder = coco.use_context(EMBEDDER)
    for chunk in _splitter.split(_scrub_pii(text), chunk_size=1000, chunk_overlap=200, language="markdown"):
        table.declare_row(row=FilingChunk(
            chunk_id=_chunk_id(filename, chunk.start.char_offset, chunk.end.char_offset),
            source_type=source_type, doc_filename=filename, cik=cik, filing_date=filing_date,
            form_type=form_type, text=chunk.text, topics=_extract_topics(chunk.text),
            embedding=await embedder.embed(chunk.text),
        ))
```
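
The body of `_scrub_pii` is not shown in this README. As a rough idea of what scrubbing before chunking can look like, here is a minimal sketch that assumes regex-based redaction of US Social Security numbers and email addresses. The patterns, the placeholders, and the name `scrub_pii` are illustrative assumptions, not the example's actual rules.

```python
import re

# Hypothetical patterns; the real _scrub_pii rules are not shown in this README.
_PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
]


def scrub_pii(text: str) -> str:
    """Replace each PII match with a placeholder before the text is chunked."""
    for pattern, placeholder in _PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


scrubbed = scrub_pii("Contact jane.doe@example.com, SSN 123-45-6789.")
```

Because the redaction runs on the whole document first, a match can never be split across a chunk boundary and slip through.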
Both sources declare_row into the same Doris table — chunk_id is a stable uuid5 of the file and chunk offsets, so re-running reconciles cleanly instead of duplicating.
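
A stable id of this kind can be sketched with the standard library's `uuid.uuid5`. The namespace string and the `filename:start:end` key format below are assumptions for illustration; the actual `_chunk_id` in `main.py` may build its key differently.

```python
import uuid

# Hypothetical namespace; the example's real namespace is not shown in this README.
_CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "sec-edgar-analytics/chunks")


def chunk_id(filename: str, start_offset: int, end_offset: int) -> str:
    """Derive a deterministic id from the file name and the chunk's character offsets."""
    return str(uuid.uuid5(_CHUNK_NAMESPACE, f"{filename}:{start_offset}:{end_offset}"))


first_run = chunk_id("apple_10k.txt", 0, 1000)
second_run = chunk_id("apple_10k.txt", 0, 1000)
next_chunk = chunk_id("apple_10k.txt", 800, 1800)
```

The same file and offsets always hash to the same id, so a re-run overwrites the existing row, while a different span of the same file gets its own id.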
Step-by-step walkthrough with the row schema, the dual-index target, PII scrubbing, and the hybrid RRF query.
</p>

- `mount_table_target` declares both a vector (ANN) and a full-text (inverted) index: the foundation for hybrid retrieval, with no second store to keep in sync.
- Both source formats funnel through `_index_text` (see Manuals to Structured Data).
- `chunk_id` reconciles edits in place instead of duplicating.
- `search.py` ranks by vector distance and by keyword match, then combines them with Reciprocal Rank Fusion in one SQL query.

Needs Apache Doris 4.0+ for vector-index support; a ready `docker-compose.yml` is included.
1. Start Doris (FE + BE):

   ```sh
   docker compose up -d fe be
   ```

2. Fetch sample data, configure, and install (synthetic 10-K filings + XBRL company facts):

   ```sh
   python download.py
   cp .env.example .env   # Doris host/ports
   pip install -e .
   ```

3. Build the index:

   ```sh
   cocoindex update main
   ```
On the sample data this loads 4 chunks (2 filings + 2 company-facts) into Doris, creating both idx_vec_embedding (ANN) and idx_inv_text (INVERTED). Topics come out as you'd expect — Apple tagged RISK:CYBER, RISK:CLIMATE, RISK:SUPPLY, RISK:REGULATORY, TOPIC:AI; Microsoft RISK:CYBER, RISK:REGULATORY, TOPIC:AI, TOPIC:CLOUD.
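
How `_extract_topics` arrives at those labels is not shown here. One simple way to produce labels of this shape is keyword matching, sketched below. The label names come from the output above, but the keyword lists, the list return type, and the name `extract_topics` are assumptions made for illustration.

```python
# Hypothetical keyword rules; the real _extract_topics logic is not shown in this README.
_TOPIC_KEYWORDS = {
    "RISK:CYBER": ("cybersecurity", "data breach", "ransomware"),
    "RISK:CLIMATE": ("climate",),
    "RISK:SUPPLY": ("supply chain",),
    "RISK:REGULATORY": ("regulation", "regulatory", "antitrust"),
    "TOPIC:AI": ("artificial intelligence", "machine learning"),
    "TOPIC:CLOUD": ("cloud",),
}


def extract_topics(text: str) -> list[str]:
    """Return every label whose keyword list has at least one hit in the text."""
    lowered = text.lower()
    return [
        label
        for label, keywords in _TOPIC_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


topics = extract_topics(
    "Our cloud services rely on artificial intelligence and face regulatory scrutiny."
)
```

Labels come back in the order the rules are declared, which keeps the tags deterministic from run to run.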
4. Hybrid search (vector + keyword, fused with RRF):

   ```sh
   python search.py "cloud computing and AI risk"
   ```
On the sample data that ranks Microsoft's cloud-and-AI filing first (it carries both TOPIC:CLOUD and TOPIC:AI), Apple's second, and the company-facts rows below.
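
`search.py` performs the fusion inside a single SQL query. The arithmetic behind Reciprocal Rank Fusion is easier to see outside SQL, so here is a minimal Python sketch of it. The function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the toy document ids are assumptions, not values taken from `search.py`.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_ranking = ["a", "b", "c"]   # best-first by vector distance
keyword_ranking = ["b", "d", "a"]  # best-first by keyword match
fused = rrf_fuse([vector_ranking, keyword_ranking])
```

Here `b` ends up first because it ranks near the top of both lists, which is the behavior hybrid search relies on: a document that is both semantically close and a keyword hit beats one that is strong on only one signal.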
<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/sec-edgar-analytics/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples"><b>See all examples →</b></a>
</p>