skills/langextract-usage/SKILL.md
LangExtract extracts structured information from unstructured text using LLMs.
The main entry point is `lx.extract()`. Every extraction needs at least one
example and input text; `prompt_description` is optional but recommended.
## Installation

```bash
pip install langextract
# For OpenAI model support:
pip install "langextract[openai]"
```

## API keys

```bash
# Gemini (checks GEMINI_API_KEY then LANGEXTRACT_API_KEY)
export GEMINI_API_KEY="your_key"
# OpenAI (checks OPENAI_API_KEY then LANGEXTRACT_API_KEY)
export OPENAI_API_KEY="your_key"
# Ollama: no API key needed, just a running Ollama server
```
Auto-routing + env-default resolution is tuned for GPT-style model IDs
(gpt-4*, gpt-5*). For OpenAI-compatible endpoints or non-GPT IDs,
pass an explicit ModelConfig — see references/providers.md.
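As a sketch of the explicit-config route (the model ID and endpoint URL below are placeholders, and the exact `ModelConfig` fields and `config=` parameter should be confirmed against references/providers.md):

```python
import langextract as lx

# Sketch: explicitly route a non-GPT model ID to the OpenAI provider on an
# OpenAI-compatible endpoint. "my-local-model" and the base_url are
# hypothetical; provider_kwargs fields depend on the provider.
config = lx.factory.ModelConfig(
    model_id="my-local-model",
    provider="OpenAILanguageModel",
    provider_kwargs={"base_url": "http://localhost:8000/v1"},
)

result = lx.extract(
    text_or_documents=text,      # text and examples as defined elsewhere
    examples=examples,
    config=config,
)
```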
## Quick start

```python
import langextract as lx

examples = [
    lx.data.ExampleData(
        text="Patient takes lisinopril 10mg daily for hypertension.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="lisinopril",
                attributes={"dose": "10mg", "frequency": "daily"},
            ),
            lx.data.Extraction(
                extraction_class="condition",
                extraction_text="hypertension",
                attributes={"status": "active"},
            ),
        ],
    )
]

result = lx.extract(
    text_or_documents="Patient is prescribed metformin 500mg twice daily.",
    prompt_description="Extract medications and conditions with attributes.",
    examples=examples,
    model_id="gemini-2.5-flash",
)

for e in result.extractions:
    print(e.extraction_class, e.extraction_text)
    print(f"  char_interval: {e.char_interval}")
    print(f"  attributes: {e.attributes}")
```
See examples/basic_extraction.py for a runnable version and
examples/relationship_extraction.py for using extraction_class +
attributes to encode relationships between entities.
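As a sketch of the relationship-encoding idea (the `medication_group` attribute name here is illustrative, not a special key; see examples/relationship_extraction.py for the real script), entities that belong together share an attribute value:

```python
import langextract as lx

# Illustrative: link a dosage to its medication by giving both
# extractions the same "medication_group" attribute value.
example = lx.data.ExampleData(
    text="Took aspirin 81mg and metformin 500mg.",
    extractions=[
        lx.data.Extraction(
            extraction_class="medication",
            extraction_text="aspirin",
            attributes={"medication_group": "aspirin"},
        ),
        lx.data.Extraction(
            extraction_class="dosage",
            extraction_text="81mg",
            attributes={"medication_group": "aspirin"},
        ),
    ],
)
```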
## Writing examples

Examples drive model behavior. Follow these rules:

- `extraction_text` must be verbatim from the example text, not paraphrased
- `extraction_class` should be consistent across examples

LangExtract can raise prompt-alignment warnings when an example's
`extraction_text` values don't align cleanly to the example's text
(failed alignment, or fuzzy/lesser rather than exact). The validator does
not enforce rules 2–4 above; those are guidance to steer the model's
output, not checked invariants.
To fail fast on alignment issues during development, pass these kwargs
directly to `lx.extract()` (both default to permissive):

```python
from langextract.prompt_validation import PromptValidationLevel

result = lx.extract(
    ...,
    prompt_validation_level=PromptValidationLevel.ERROR,  # default: WARNING
    prompt_validation_strict=True,  # default: False
)
```

See references/prompt-validation.md for level semantics and strict-mode
behavior.
## Key parameters

```python
result = lx.extract(
    text_or_documents=text,       # str, URL, or list of Documents
    prompt_description=prompt,    # what to extract (optional)
    examples=examples,            # few-shot examples (required)
    model_id="gemini-2.5-flash",  # model to use
    extraction_passes=1,          # >1 for higher recall (costs more)
    max_char_buffer=1000,         # chunk size
    batch_length=10,              # chunks per batch
    max_workers=10,               # parallel workers (provider-dependent)
    context_window_chars=None,    # cross-chunk context for coreference
)
```
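To build intuition for how `max_char_buffer` and `context_window_chars` relate, here is a standalone sketch of the idea; it mimics the concept only and is not LangExtract's actual chunking implementation:

```python
def chunk_with_context(text: str, max_char_buffer: int = 1000,
                       context_window_chars: int = 200) -> list[tuple[str, str]]:
    """Illustrative only: split text into fixed-size chunks, pairing each
    chunk with trailing context from the text that precedes it."""
    chunks = []
    for start in range(0, len(text), max_char_buffer):
        context = text[max(0, start - context_window_chars):start]
        body = text[start:start + max_char_buffer]
        chunks.append((context, body))
    return chunks

pairs = chunk_with_context("x" * 2500)
print([(len(ctx), len(body)) for ctx, body in pairs])
# [(0, 1000), (200, 1000), (200, 500)]
```

The first chunk has no preceding context; every later chunk carries a 200-character tail of the prior text, which is what lets the model resolve references that straddle a chunk boundary.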
Notes:

- `text_or_documents` accepts a URL string (`fetch_urls=True` by default).
- Keep `batch_length >= max_workers`. Parallel processing via `max_workers`
  is provider-dependent; some providers parallelize batched prompts, while
  others (such as the current Ollama provider) process them sequentially.
- `context_window_chars` includes characters from the previous chunk as
  context for the current one, which helps with coreference and entity
  continuity across chunk boundaries.
- To process multiple inputs, pass a list of `lx.data.Document`; see
  examples/multiple_documents.py.

## Working with results

```python
# Filter to grounded extractions only (have source positions)
grounded = [e for e in result.extractions if e.char_interval]

# Access source position
for e in grounded:
    start = e.char_interval.start_pos
    end = e.char_interval.end_pos
    matched_text = result.text[start:end]

# Save to JSONL
lx.io.save_annotated_documents(
    [result], output_name="results.jsonl", output_dir="."
)

# Visualize from JSONL (shows first document)
html = lx.visualize("results.jsonl")
with open("visualization.html", "w") as f:
    if hasattr(html, "data"):
        f.write(html.data)  # Jupyter/Colab
    else:
        f.write(html)

# Or visualize an AnnotatedDocument directly
html = lx.visualize(result)
```
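The saved file is JSON Lines (one JSON object per line), so it can be post-processed without LangExtract. A minimal reader sketch; it assumes nothing about the record schema, so inspect the keys of your own output to see what was written:

```python
import json

def read_jsonl(path: str) -> list[dict]:
    """Read a JSON Lines file: one JSON document per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# e.g. records = read_jsonl("results.jsonl"); print(records[0].keys())
```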
The default provider is Gemini. For other providers, see
references/providers.md, which covers:

- OpenAI and Ollama setup (model_url)
- `ModelConfig` for advanced provider_kwargs (custom base_url, etc.)
- registering custom provider plugins via `router.register()`

## Troubleshooting

**Extractions with `char_interval=None`:** the extraction could not be
located in the source text. Common causes include paraphrased output,
alignment misses, or hallucinated entities. Filter with
`[e for e in result.extractions if e.char_interval]`. To tune alignment,
see references/resolver-params.md.
**Prompt alignment warnings:** your examples have `extraction_text` that
doesn't match the example text verbatim. Fix the examples, or see
references/prompt-validation.md to fail fast during development.

**Slow on long documents:** `extraction_passes > 1` multiplies processing
time. Start with 1 and increase only if recall is insufficient.

**Model selection:** gemini-2.5-flash is recommended for most tasks;
gemini-2.5-pro for complex reasoning.
## See also

- references/providers.md: OpenAI, Ollama, ModelConfig, custom plugins
- references/resolver-params.md: fuzzy alignment tuning
- references/prompt-validation.md: catching example issues early
- examples/: runnable scripts for each scenario above