docs/pages/app/index/file.md
The file index stores files in a local folder and indexes them for retrieval. It provides the following infrastructure to support indexing:
The indexing and retrieval pipelines are encouraged to use the above software infrastructure.
ktem has a default indexing pipeline: `ktem.index.file.pipelines.IndexDocumentPipeline`.
This default pipeline works as follows:
If your indexing process is close to the default pipeline, you can customize the default pipeline. If your logic differs too much, you can create your own indexing pipeline.
The default pipeline exposes the following customization points in `flowsettings.py`:

- `FILE_INDEX_PIPELINE_FILE_EXTRACTORS`. Supply overriding file extractors, based on file extension. Example: `{".pdf": "path.to.PDFReader", ".xlsx": "path.to.ExcelReader"}`
- `FILE_INDEX_PIPELINE_SPLITTER_CHUNK_SIZE`. The expected number of characters of each text segment. Example: 1024.
- `FILE_INDEX_PIPELINE_SPLITTER_CHUNK_OVERLAP`. The expected number of characters that consecutive text segments should overlap with each other. Example: 256.

Your indexing pipeline will subclass `BaseFileIndexIndexing`.
You should define the following methods:

- `run(self, file_paths)`: run the indexing on the given file paths.
- `get_pipeline(cls, user_settings, index_settings)`: return the fully-initialized pipeline, ready to be used by ktem.
    - `user_settings`: a dictionary containing user settings (e.g. `{"pdf_mode": True, "num_retrieval": 5}`). You can declare these settings in the `get_user_settings` classmethod. ktem will collect these settings into the app Settings page and supply them to your `get_pipeline` method.
    - `index_settings`: a dictionary. Currently it's empty for File Index.
- `get_user_settings`: to declare user settings, return a dictionary.

By subclassing `BaseFileIndexIndexing`, you will have access to the following resources:

- `self._Source`: the source table
- `self._Index`: the index table
- `self._VS`: the vector store
- `self._DS`: the docstore

Once you have prepared your pipeline, register it in `flowsettings.py`: `FILE_INDEX_PIPELINE = "<python.path.to.your.pipeline>"`.
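
The three methods described above can be sketched as follows. This is a minimal illustration, not ktem's implementation: `PlainTextIndexing` is a hypothetical class, the `BaseFileIndexIndexing` defined here is only a stand-in stub so the sketch runs without ktem installed, and the plain-value format returned by `get_user_settings` is an assumption (check what your ktem version expects). The chunking only illustrates how a chunk size and overlap could be applied; a real pipeline would also write to the tables and stores listed above.

```python
from pathlib import Path


class BaseFileIndexIndexing:
    """Stand-in stub so this sketch runs without ktem installed.

    In a real app, import the base class from ktem instead; it is what
    supplies self._Source, self._Index, self._VS and self._DS.
    """


class PlainTextIndexing(BaseFileIndexIndexing):
    """Hypothetical pipeline: split text files into overlapping chunks."""

    def __init__(self, chunk_size=1024, chunk_overlap=256):
        if chunk_overlap >= chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    @classmethod
    def get_user_settings(cls):
        # Illustrative names and defaults; the exact schema ktem expects
        # for each setting is an assumption not covered in this page.
        return {"chunk_size": 1024, "chunk_overlap": 256}

    @classmethod
    def get_pipeline(cls, user_settings, index_settings):
        # index_settings is currently empty for File Index, so it is unused.
        defaults = cls.get_user_settings()
        return cls(
            chunk_size=user_settings.get("chunk_size", defaults["chunk_size"]),
            chunk_overlap=user_settings.get(
                "chunk_overlap", defaults["chunk_overlap"]
            ),
        )

    def run(self, file_paths):
        # Each chunk starts `step` characters after the previous one, so
        # consecutive chunks share `chunk_overlap` characters.
        step = self.chunk_size - self.chunk_overlap
        chunks_by_file = {}
        for file_path in file_paths:
            text = Path(file_path).read_text(encoding="utf-8")
            chunks_by_file[str(file_path)] = [
                text[start:start + self.chunk_size]
                for start in range(0, len(text), step)
            ]
        # A real pipeline would persist the file record and its chunks via
        # self._Source, self._Index, self._VS and self._DS at this point.
        return chunks_by_file
```

With such a class saved in an importable module, the dotted path to `PlainTextIndexing` is what `FILE_INDEX_PIPELINE` would point to.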
ktem has a default retrieval pipeline:
`ktem.index.file.pipelines.DocumentRetrievalPipeline`. This pipeline works as
follows:
Your retrieval pipeline will subclass `BaseFileIndexRetriever`. The retriever
has the same access to the database, vector store and docstore as the indexing
pipeline.
You should define the following methods:

- `run(self, query, file_ids)`: retrieve relevant documents relating to the query. If `file_ids` is given, you should restrict your search to these file ids.
- `get_pipeline(cls, user_settings, index_settings, selected)`: return the fully-initialized pipeline, ready to be used by ktem.
    - `user_settings`: a dictionary containing user settings (e.g. `{"pdf_mode": True, "num_retrieval": 5}`). You can declare these settings in the `get_user_settings` classmethod. ktem will collect these settings into the app Settings page and supply them to your `get_pipeline` method.
    - `index_settings`: a dictionary. Currently it's empty for File Index.
    - `selected`: a list of file ids selected by the user. If the user doesn't select anything, this variable will be `None`.
- `get_user_settings`: to declare user settings, return a dictionary.

Once you build the retrieval pipeline class, you can register it in `flowsettings.py`: `FILE_INDEXING_RETRIEVER_PIPELINES = ["path.to.retrieval.pipeline"]`. Because there can be multiple parallel pipelines within an index, this variable takes a list of strings rather than a single string.
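
A retriever following the method list above might look like the sketch below. It is illustrative only: `KeywordRetriever` and its `DOCUMENTS` list are hypothetical, the `BaseFileIndexRetriever` defined here is a stand-in stub so the code runs without ktem installed, and simple keyword overlap is swapped in for a real vector-store search. Falling back to `selected` when `file_ids` is not given is a design choice of this sketch, not documented ktem behavior.

```python
class BaseFileIndexRetriever:
    """Stand-in stub so this sketch runs without ktem installed.

    In a real app, import the base class from ktem instead.
    """


class KeywordRetriever(BaseFileIndexRetriever):
    """Hypothetical retriever ranking documents by keyword overlap."""

    # Hypothetical in-memory corpus standing in for the vector store and
    # docstore; each entry is {"file_id": ..., "text": ...}.
    DOCUMENTS = []

    def __init__(self, top_k=5, selected=None):
        self.top_k = top_k
        self.selected = selected

    @classmethod
    def get_user_settings(cls):
        # Illustrative default; the schema ktem expects is an assumption.
        return {"num_retrieval": 5}

    @classmethod
    def get_pipeline(cls, user_settings, index_settings, selected):
        # index_settings is currently empty for File Index, so it is unused.
        return cls(
            top_k=user_settings.get("num_retrieval", 5),
            selected=selected,
        )

    def run(self, query, file_ids=None):
        # Fall back to the files selected by the user, if any.
        if file_ids is None:
            file_ids = self.selected
        candidates = [
            doc for doc in self.DOCUMENTS
            if file_ids is None or doc["file_id"] in file_ids
        ]
        # Score each candidate by the number of distinct query terms it contains.
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(doc["text"].lower().split())), doc)
            for doc in candidates
        ]
        hits = [pair for pair in scored if pair[0] > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in hits[: self.top_k]]
```

Because the registration setting takes a list, several such retriever classes could be registered side by side for the same index.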
| Infra | Access | Schema | Ref |
|---|---|---|---|
| SQL table Source | self._Source | - id (int): id of the source (auto) |