.. Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
.. _howto/operator:llm_file_analysis:
``LLMFileAnalysisOperator`` & ``@task.llm_file_analysis``
=========================================================

Use :class:`~airflow.providers.common.ai.operators.llm_file_analysis.LLMFileAnalysisOperator`
or the ``@task.llm_file_analysis`` decorator to analyze files from object storage
or local storage with a single prompt.

The operator resolves ``file_path`` through
:class:`~airflow.providers.common.compat.sdk.ObjectStoragePath`, reads supported
formats in a read-only manner, injects file metadata and normalized content into
the prompt, and optionally attaches images or PDFs as multimodal inputs.
.. seealso::
    :ref:`Connection configuration <howto/connection:pydanticai>`
Analyze a text-like file or prefix with one prompt:
.. exampleinclude:: /../../ai/src/airflow/providers/common/ai/example_dags/example_llm_file_analysis.py
    :language: python
    :start-after: [START howto_operator_llm_file_analysis_basic]
    :end-before: [END howto_operator_llm_file_analysis_basic]
Use a directory or object-storage prefix when you want the operator to analyze
multiple files in one request. ``max_files`` bounds how many resolved files are
included in the request, while the size and text limits
(``max_file_size_bytes``, ``max_total_size_bytes``, and ``max_text_chars``) cap
how much data is read and sent to the LLM:
.. exampleinclude:: /../../ai/src/airflow/providers/common/ai/example_dags/example_llm_file_analysis.py
    :language: python
    :start-after: [START howto_operator_llm_file_analysis_prefix]
    :end-before: [END howto_operator_llm_file_analysis_prefix]
.. note::

    Prefix resolution enumerates objects under the supplied path and checks each
    candidate to find files before ``max_files`` is applied. For very large
    object-store prefixes, prefer a more specific path or a narrower prefix to
    avoid expensive listing and stat calls.
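The cost described in the note can be illustrated with a small local sketch. It assumes a hypothetical helper, ``resolve_prefix``, that walks a directory with ``pathlib``, keeps only regular files, sorts them, and applies the cap last. The recursive walk and the sort order are assumptions made for illustration and are not necessarily what the operator does.

```python
import tempfile
from pathlib import Path


def resolve_prefix(prefix: Path, max_files: int) -> tuple[list[Path], int]:
    """Hypothetical helper: list everything under ``prefix``, keep files, then cap."""
    # Every entry is enumerated and checked with is_file() before the cap is
    # applied, so the work grows with the size of the prefix, not with max_files.
    files = sorted(p for p in prefix.rglob("*") if p.is_file())
    selected = files[:max_files]
    omitted = len(files) - len(selected)
    return selected, omitted


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    for name in ("a.log", "b.log", "nested/c.log", "nested/d.log"):
        (root / name).write_text("line\n")

    selected, omitted = resolve_prefix(root, max_files=3)
    selected_names = [p.relative_to(root).as_posix() for p in selected]
    print(selected_names, omitted)
```

All four files and the ``nested`` directory are visited even though only three files are kept, which is why narrowing the prefix saves more work than lowering ``max_files``.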
Set ``multi_modal=True`` for PNG/JPG/PDF inputs so they are sent as binary
attachments to a vision-capable model:
.. exampleinclude:: /../../ai/src/airflow/providers/common/ai/example_dags/example_llm_file_analysis.py
    :language: python
    :start-after: [START howto_operator_llm_file_analysis_multimodal]
    :end-before: [END howto_operator_llm_file_analysis_multimodal]
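As a rough illustration of how inputs might be routed by suffix, the sketch below uses a hypothetical ``classify`` helper built from the suffixes listed for supported formats in this guide. The helper and its category labels are assumptions for illustration and are not part of the provider's API.

```python
from pathlib import PurePosixPath

# Suffix groups copied from the supported-format list in this guide.
TEXT_SUFFIXES = {".log", ".json", ".csv", ".parquet", ".avro"}
GZIP_TEXT_SUFFIXES = {".log.gz", ".json.gz", ".csv.gz"}
MULTIMODAL_SUFFIXES = {".png", ".jpg", ".jpeg", ".pdf"}


def classify(path: str, multi_modal: bool = False) -> str:
    """Hypothetical helper: decide how a path would be handled, by suffix only."""
    suffixes = [s.lower() for s in PurePosixPath(path).suffixes]
    last = suffixes[-1] if suffixes else ""
    last_two = "".join(suffixes[-2:])
    if last_two in GZIP_TEXT_SUFFIXES:
        return "prompt text (gzip)"
    if last in TEXT_SUFFIXES:
        return "prompt text"
    if last in MULTIMODAL_SUFFIXES:
        # Images and PDFs are only accepted when multi_modal=True.
        return "attachment" if multi_modal else "needs multi_modal=True"
    return "unsupported"


for candidate in ("logs/app.log.gz", "data/events.parquet", "scans/invoice.pdf"):
    print(candidate, "->", classify(candidate, multi_modal=True))
```

In this sketch a path such as ``data/events.parquet.gz`` falls through to ``"unsupported"`` because only the three gzip suffixes in the list are recognized.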
Set ``output_type`` to a Pydantic ``BaseModel`` when you want a typed response
back from the LLM instead of a plain string. The model instance is pushed to
XCom unchanged, so downstream tasks can type-hint the class directly. The
declared ``output_type`` (and any ``BaseModel`` reachable from
``Union``/``Optional``/``list`` shapes) is registered for deserialization by the
worker when it loads the DAG. Define the class at module scope and bind it to
an attribute matching its ``__name__``: classes nested in functions or built
dynamically cannot be re-imported, so they are skipped at worker startup and
fail to deserialize at the consumer. Downstream tasks in the same DAG need no
configuration, and the UI XCom viewer renders the value through the stringify
path, also without configuration. Cross-DAG ``xcom_pull`` consumers still need
the class qualname added to ``[core] allowed_deserialization_classes`` (see the
``LLMOperator`` guide for details).
.. exampleinclude:: /../../ai/src/airflow/providers/common/ai/example_dags/example_llm_file_analysis.py
    :language: python
    :start-after: [START howto_operator_llm_file_analysis_structured_output_class]
    :end-before: [END howto_operator_llm_file_analysis_structured_output_class]

.. exampleinclude:: /../../ai/src/airflow/providers/common/ai/example_dags/example_llm_file_analysis.py
    :language: python
    :start-after: [START howto_operator_llm_file_analysis_structured]
    :end-before: [END howto_operator_llm_file_analysis_structured]
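The module-scope requirement can be checked before deploying a DAG. The following is a minimal sketch using only the standard library: ``is_reimportable`` is a hypothetical helper (not an Airflow or Pydantic API) that approximates the rule by testing whether a class can be found again from its module under its own ``__name__``. Plain classes stand in for Pydantic models here so the example has no third-party dependencies.

```python
import importlib
from collections import OrderedDict


def is_reimportable(cls: type) -> bool:
    """Hypothetical check: can ``cls`` be looked up again as ``module.__name__``?"""
    # Classes defined inside a function carry "<locals>" in their qualname.
    if "<locals>" in cls.__qualname__:
        return False
    try:
        module = importlib.import_module(cls.__module__)
    except (ImportError, ValueError):
        return False
    # The module attribute must exist and be this exact class object.
    return getattr(module, cls.__name__, None) is cls


def make_nested() -> type:
    class Nested:  # defined in function scope, so it cannot be re-imported
        pass

    return Nested


# Built at runtime and never bound to a module attribute named "Dynamic".
dynamic_cls = type("Dynamic", (), {})

print(is_reimportable(OrderedDict))    # module-scope class
print(is_reimportable(make_nested()))  # nested in a function
print(is_reimportable(dynamic_cls))    # dynamically built
```

Only the module-scope class passes the check. The nested and dynamically built classes are the kinds the paragraph above describes as skipped at worker startup.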
The ``@task.llm_file_analysis`` decorator wraps the operator. The function
returns the prompt string; file settings are passed to the decorator:
.. exampleinclude:: /../../ai/src/airflow/providers/common/ai/example_dags/example_llm_file_analysis.py
    :language: python
    :start-after: [START howto_decorator_llm_file_analysis]
    :end-before: [END howto_decorator_llm_file_analysis]
- ``prompt``: The analysis request to send to the LLM (operator) or the return
  value of the decorated function (decorator).
- ``llm_conn_id``: Airflow connection ID for the LLM provider.
- ``file_path``: File or prefix to analyze.
- ``file_conn_id``: Optional connection ID for the storage backend. Overrides a
  connection embedded in ``file_path``.
- ``multi_modal``: Allow PNG/JPG/PDF inputs as binary attachments. Default ``False``.
- ``max_files``: Maximum number of files included from a prefix. Extra files are
  omitted and noted in the prompt. Default ``20``.
- ``max_file_size_bytes``: Maximum size of any single input file. Default 5 MiB.
- ``max_total_size_bytes``: Maximum cumulative size across all resolved files.
  Default 20 MiB.
- ``max_text_chars``: Maximum normalized text context sent to the LLM after
  sampling and truncation. Default ``100000``.
- ``sample_rows``: Maximum number of sampled rows or records included for CSV,
  Parquet, and Avro inputs. This controls structural preview depth, while
  ``max_file_size_bytes`` and ``max_total_size_bytes`` are byte-level read
  guards and ``max_text_chars`` is the final prompt-text budget. Default ``10``.
- ``model_id``: Model identifier (e.g. ``"openai:gpt-5"``). Overrides the
  connection's extra field.
- ``system_prompt``: System-level instructions appended to the operator's
  built-in read-only guidance.
- ``output_type``: Expected output type (default: ``str``). Set to a Pydantic
  ``BaseModel`` for structured output.
- ``agent_params``: Additional keyword arguments passed to the pydantic-ai
  ``Agent`` constructor (e.g. ``retries``, ``model_settings``).

Supported file formats:

- Text-like inputs: ``.log``, ``.json``, ``.csv``, ``.parquet``, ``.avro``
- Multimodal inputs: ``.png``, ``.jpg``, ``.jpeg``, ``.pdf`` when ``multi_modal=True``
- Gzip-compressed text inputs are supported for ``.log.gz``, ``.json.gz``, and
  ``.csv.gz``.
- Gzip is not supported for ``.parquet``, ``.avro``, image, or PDF inputs.

Parquet and Avro readers require their corresponding optional extras:
.. code-block:: bash

    pip install apache-airflow-providers-common-ai[parquet]
    pip install apache-airflow-providers-common-ai[avro]
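To make the relationship between ``sample_rows`` and ``max_text_chars`` concrete, here is a minimal sketch for CSV input using only the standard library. ``preview_csv`` is a hypothetical helper, and the exact preview format the operator produces is not specified in this guide. The sketch only shows the order of operations described in the parameter list: sample rows first, then apply the final character budget.

```python
import csv
import io
from itertools import islice


def preview_csv(raw: str, sample_rows: int = 10, max_text_chars: int = 100_000) -> str:
    """Hypothetical helper: keep the header plus ``sample_rows`` records, then cap the text."""
    reader = csv.reader(io.StringIO(raw))
    header = next(reader, [])
    # Structural preview depth: read at most sample_rows records after the header.
    rows = list(islice(reader, sample_rows))
    # Naive re-join for illustration; fields with embedded commas would need csv.writer.
    lines = [",".join(header)] + [",".join(row) for row in rows]
    text = "\n".join(lines)
    # Final prompt-text budget: truncate whatever survived sampling.
    return text[:max_text_chars]


raw_csv = "id,status\n1,ok\n2,failed\n3,ok\n"
print(preview_csv(raw_csv, sample_rows=2))
print(repr(preview_csv(raw_csv, sample_rows=2, max_text_chars=12)))
```

With ``sample_rows=2`` the third record never reaches the prompt, and with ``max_text_chars=12`` the sampled text is cut again regardless of how many rows were kept.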