LangExtract Usage

LangExtract extracts structured information from unstructured text using LLMs. The main entry point is lx.extract(): every call needs input text and at least one few-shot example, while prompt_description is optional but recommended.

Install

bash
pip install langextract

# For OpenAI model support:
pip install langextract[openai]

API keys

bash
# Gemini (checks GEMINI_API_KEY then LANGEXTRACT_API_KEY)
export GEMINI_API_KEY="your_key"

# OpenAI (checks OPENAI_API_KEY then LANGEXTRACT_API_KEY)
export OPENAI_API_KEY="your_key"

# Ollama: no API key needed, just a running Ollama server

Auto-routing and environment-default key resolution are tuned for GPT-style model IDs (gpt-4*, gpt-5*). For OpenAI-compatible endpoints or non-GPT model IDs, pass an explicit ModelConfig; see references/providers.md.
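
A minimal sketch of that shape, assuming the factory.ModelConfig fields, the provider name, and the config= keyword described in references/providers.md; the base_url and model ID are placeholders:

python
import langextract as lx
from langextract import factory

# Hypothetical OpenAI-compatible server; swap in your endpoint and model ID.
config = factory.ModelConfig(
    model_id="my-served-model",
    provider="OpenAILanguageModel",  # provider name per references/providers.md
    provider_kwargs={"base_url": "http://localhost:8000/v1"},
)

result = lx.extract(
    text_or_documents="Patient is prescribed metformin 500mg twice daily.",
    prompt_description="Extract medications and conditions with attributes.",
    examples=examples,  # same few-shot examples as in the basic example below
    config=config,      # explicit config bypasses model_id auto-routing
)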

Basic extraction

python
import langextract as lx

examples = [
    lx.data.ExampleData(
        text="Patient takes lisinopril 10mg daily for hypertension.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="lisinopril",
                attributes={"dose": "10mg", "frequency": "daily"},
            ),
            lx.data.Extraction(
                extraction_class="condition",
                extraction_text="hypertension",
                attributes={"status": "active"},
            ),
        ],
    )
]

result = lx.extract(
    text_or_documents="Patient is prescribed metformin 500mg twice daily.",
    prompt_description="Extract medications and conditions with attributes.",
    examples=examples,
    model_id="gemini-2.5-flash",
)

for e in result.extractions:
    print(e.extraction_class, e.extraction_text)
    print(f"  char_interval: {e.char_interval}")
    print(f"  attributes: {e.attributes}")

See examples/basic_extraction.py for a runnable version and examples/relationship_extraction.py for using extraction_class + attributes to encode relationships between entities.

Writing good examples

Examples drive model behavior. Follow these rules:

  1. extraction_text must be verbatim from the example text, not paraphrased (see the sketch after this list)
  2. List extractions in order of appearance in the text
  3. Each extraction_class should be consistent across examples
  4. Include attributes that match what you want extracted at runtime
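
Rule 1 is the one most often violated in practice. A minimal illustration (the example text is invented for this sketch):

python
import langextract as lx

# BAD: "taking aspirin" paraphrases the text, which says "takes aspirin".
bad = lx.data.ExampleData(
    text="Patient takes aspirin 81mg every morning.",
    extractions=[
        lx.data.Extraction(
            extraction_class="medication",
            extraction_text="taking aspirin",  # not verbatim -> alignment warning
        ),
    ],
)

# GOOD: extraction_text is copied character-for-character from text.
good = lx.data.ExampleData(
    text="Patient takes aspirin 81mg every morning.",
    extractions=[
        lx.data.Extraction(
            extraction_class="medication",
            extraction_text="aspirin 81mg",
        ),
    ],
)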

LangExtract can raise prompt-alignment warnings when an example's extraction_text values don't align cleanly with that example's text (no alignment at all, or only a fuzzy or partial match rather than an exact one). The validator does not enforce rules 2–4 above; those are guidance to steer the model's output, not checked invariants.

To fail fast on alignment issues during development, pass these kwargs directly to lx.extract() (both default to permissive):

python
from langextract.prompt_validation import PromptValidationLevel

result = lx.extract(
    ...,
    prompt_validation_level=PromptValidationLevel.ERROR,  # default: WARNING
    prompt_validation_strict=True,                          # default: False
)

See references/prompt-validation.md for level semantics and strict-mode behavior.

Key parameters

python
result = lx.extract(
    text_or_documents=text,          # str, URL, or list of Documents
    prompt_description=prompt,        # what to extract (optional)
    examples=examples,                # few-shot examples (required)
    model_id="gemini-2.5-flash",     # model to use
    extraction_passes=1,              # >1 for higher recall (costs more)
    max_char_buffer=1000,             # chunk size
    batch_length=10,                  # chunks per batch
    max_workers=10,                   # parallel workers (provider-dependent)
    context_window_chars=None,        # cross-chunk context for coreference
)
  • text_or_documents accepts a URL string (fetch_urls=True by default).
  • For full parallelism, keep batch_length >= max_workers. Parallel processing via max_workers is provider-dependent; some providers parallelize batched prompts, while others (such as the current Ollama provider) process them sequentially.
  • context_window_chars includes characters from the previous chunk as context for the current one, which helps with coreference and entity continuity across chunk boundaries.
  • For multiple documents, pass a list of lx.data.Document, as sketched below; examples/multiple_documents.py has a runnable version.
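
A minimal multi-document sketch, assuming the Document fields shown here and that a list input yields one annotated document per input (the patient texts are placeholders):

python
import langextract as lx

docs = [
    lx.data.Document(text="Patient A takes metformin 500mg daily.", document_id="patient_a"),
    lx.data.Document(text="Patient B takes lisinopril 10mg daily.", document_id="patient_b"),
]

results = lx.extract(
    text_or_documents=docs,
    prompt_description="Extract medications and conditions with attributes.",
    examples=examples,  # the ExampleData list from "Basic extraction"
    model_id="gemini-2.5-flash",
)

for doc in results:  # one annotated document per input Document
    print(doc.document_id, [e.extraction_text for e in doc.extractions])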

Working with results

python
# Filter to grounded extractions only (have source positions)
grounded = [e for e in result.extractions if e.char_interval]

# Access source position
for e in grounded:
    start = e.char_interval.start_pos
    end = e.char_interval.end_pos
    matched_text = result.text[start:end]

# Save to JSONL
lx.io.save_annotated_documents(
    [result], output_name="results.jsonl", output_dir="."
)

# Visualize from JSONL (shows first document)
html = lx.visualize("results.jsonl")
with open("visualization.html", "w") as f:
    if hasattr(html, "data"):
        f.write(html.data)  # Jupyter/Colab
    else:
        f.write(html)

# Or visualize an AnnotatedDocument directly
html = lx.visualize(result)

Provider selection

The default is Gemini. For other providers, see references/providers.md, which covers:

  • OpenAI (JSON mode; fence behavior auto-configured)
  • Ollama (local models, model_url; a sketch follows this list)
  • ModelConfig for advanced provider_kwargs (custom base_url, etc.)
  • Custom provider plugins via router.register()
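
For local inference, the usual shape is an Ollama model ID plus the server URL. A minimal sketch, assuming a model already pulled into a locally running Ollama server (the model ID is a placeholder):

python
import langextract as lx

result = lx.extract(
    text_or_documents="Patient is prescribed metformin 500mg twice daily.",
    prompt_description="Extract medications and conditions with attributes.",
    examples=examples,            # the ExampleData list from "Basic extraction"
    model_id="gemma2:2b",         # any model available on your Ollama server
    model_url="http://localhost:11434",
)

Note that the current Ollama provider processes batched prompts sequentially, so raising max_workers will not speed it up (see Key parameters above).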

Common issues

Extractions with char_interval=None: the extraction could not be located in the source text. Common causes include paraphrased output, alignment misses, or hallucinated entities. Filter with [e for e in result.extractions if e.char_interval]. To tune alignment, see references/resolver-params.md.

Prompt alignment warnings: your examples have extraction_text that doesn't match the example text verbatim. Fix the examples, or see references/prompt-validation.md to fail fast during development.

Slow on long documents: extraction_passes > 1 multiplies processing time. Start with 1 and increase only if recall is insufficient.

Model selection: gemini-2.5-flash is recommended for most tasks; gemini-2.5-pro for complex reasoning.

Further reading

  • references/providers.md — OpenAI, Ollama, ModelConfig, custom plugins
  • references/resolver-params.md — fuzzy alignment tuning
  • references/prompt-validation.md — catching example issues early
  • examples/ — runnable scripts for each scenario above