docs/guides/agent/agent_quickstarts/ingestion_pipeline_quickstart.md
RAGFlow's ingestion pipeline is a customizable, step-by-step workflow that prepares your documents for high-quality AI retrieval and answering. You can think of it as building blocks: you connect different processing "components" to create a pipeline tailored to your specific documents and needs.
RAGFlow is an open-source RAG platform with strong document processing capabilities. Its built-in module, DeepDoc, uses intelligent parsing to split documents for accurate retrieval. To handle diverse real-world needs—like varied file sources, complex layouts, and richer semantics—RAGFlow now introduces the ingestion pipeline.
The ingestion pipeline lets you customize every step of document processing: parsing, chunking, transforming, and indexing.
This flexible pipeline adapts to your data, improving answer quality in RAG.
Now let's build a typical ingestion pipeline!
A Parser component converts your files into structured text while preserving layout, tables, headers, and other formatting. It supports 8 categories and 23+ file formats, including PDF, Image, Audio, Video, Email, Spreadsheet (Excel), Word, PPT, HTML, and Markdown.
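To make "structured text" concrete, here is a minimal sketch of the kind of result a layout-aware parser produces. The field names (`kind`, `text`, `page`) are illustrative assumptions, not RAGFlow's actual output schema:

```python
# Illustrative only: field names are assumptions, not RAGFlow's real schema.
from dataclasses import dataclass

@dataclass
class ParsedBlock:
    kind: str   # e.g., "heading", "paragraph", "table", "figure"
    text: str   # extracted text, with table cells flattened to rows
    page: int   # page number in the source document

# A layout-aware parse of a PDF might yield blocks like these,
# preserving reading order and structural roles:
parsed = [
    ParsedBlock(kind="heading", text="Quarterly Report", page=1),
    ParsedBlock(kind="paragraph", text="Revenue grew 12% year over year.", page=1),
    ParsedBlock(kind="table", text="Region | Q1 | Q2\nEMEA | 4.1 | 4.4", page=2),
]
```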
The Chunker component splits text intelligently. Its goal is to prevent AI context window overflow and to improve semantic accuracy in hybrid search. There are two core methods, which can be used sequentially:

- **Token**: splits by token count. A recommended delimiter is `\n` (newlines), so the text splits at natural paragraph boundaries first, avoiding mid-sentence cuts.
- **Title**: splits along document headings, keeping each section's content together.

:::caution IMPORTANT
In the current design, if using both the Token and Title methods, connect the Token chunker component first, then the Title chunker component. Connecting a Title chunker directly to a Parser may cause format errors for Email, Image, Spreadsheet, and Text files.
:::
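To illustrate the newline-first Token strategy, here is a minimal sketch of token-budgeted chunking. It is not RAGFlow's implementation; the `tiktoken` tokenizer and the 512-token budget are assumptions for the example:

```python
# A minimal sketch of newline-first, token-budgeted chunking.
# Not RAGFlow's implementation; tokenizer and budget are assumptions.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_by_tokens(text: str, max_tokens: int = 512) -> list[str]:
    chunks, current = [], ""
    # Split on newlines first so paragraphs stay intact where possible.
    for para in text.split("\n"):
        candidate = f"{current}\n{para}" if current else para
        if len(enc.encode(candidate)) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A paragraph longer than the budget would need a hard split;
            # omitted here to keep the sketch short.
            current = para
    if current:
        chunks.append(current)
    return chunks
```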
A Transformer component is designed to bridge the "semantic gap": it uses AI models to add semantic metadata, making your content more discoverable during retrieval. It has four generation types. If you use multiple generation types, use a separate Transformer component for each function (e.g., one for Summary, another for Keywords).
The following are some key configurations:

- **Input**: reference the upstream data to transform by typing `/` and selecting the specific output (e.g., `/{Parser.output}` or `/{Chunker.output}`).
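As an illustration of what a keyword-generation Transformer does conceptually, here is a sketch that asks an LLM to tag a chunk with search keywords. The model name and prompt are assumptions, and this is not RAGFlow's internal code:

```python
# Conceptual sketch of a keyword-style Transformer step.
# Not RAGFlow internals; model name and prompt are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def add_keywords(chunk: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "List 5 comma-separated search keywords for this text:\n" + chunk,
        }],
    )
    keywords = [k.strip() for k in resp.choices[0].message.content.split(",")]
    # The keywords travel with the chunk so the Indexer can store them.
    return {"text": chunk, "keywords": keywords}
```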
The Indexer component builds the search index for optimal retrieval. As the final step, it writes the processed data to the search engine (such as Infinity, Elasticsearch, or OpenSearch). A key configuration is the embedding model:
:::caution IMPORTANT
To search across multiple datasets simultaneously, all selected datasets must use the same embedding model.
:::
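For intuition, here is a minimal sketch of what the indexing step amounts to when Elasticsearch is the backing engine: each chunk is stored with its text, keywords, and embedding vector. The index name, mapping, and 1024-dimension vector size are assumptions for the example, not RAGFlow's actual schema:

```python
# Minimal sketch of writing chunks to Elasticsearch.
# Index name, mapping, and vector dims are assumptions, not RAGFlow's schema.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="my_dataset_chunks",
    mappings={
        "properties": {
            "text": {"type": "text"},          # full-text search
            "keywords": {"type": "keyword"},   # exact-match filtering
            "embedding": {"type": "dense_vector", "dims": 1024},  # vector search
        }
    },
)

# One document per chunk; "embedding" comes from the dataset's embedding model.
es.index(
    index="my_dataset_chunks",
    document={
        "text": "Revenue grew 12% year over year.",
        "keywords": ["revenue", "growth"],
        "embedding": [0.0] * 1024,  # placeholder vector
    },
)
```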
Click Run on your pipeline canvas to upload a sample file and see the step-by-step results.
Now, any files uploaded to this dataset will be processed by your custom pipeline.
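If you prefer to upload programmatically, the sketch below uses the RAGFlow Python SDK (`ragflow-sdk`). Treat the exact method names and parameters as assumptions that may differ across SDK versions; check the SDK reference for your release:

```python
# Sketch of uploading a file to a dataset via the RAGFlow Python SDK.
# Method names/parameters are assumptions; verify against your SDK version.
from ragflow_sdk import RAGFlow

rag = RAGFlow(api_key="<YOUR_API_KEY>", base_url="http://localhost:9380")

# Fetch the dataset that is bound to your custom ingestion pipeline.
dataset = rag.list_datasets(name="my_dataset")[0]

# Upload a document; the dataset's pipeline then processes it.
with open("quarterly_report.pdf", "rb") as f:
    dataset.upload_documents(
        [{"display_name": "quarterly_report.pdf", "blob": f.read()}]
    )
```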