pgml-cms/docs/open-source/pgml/api/pgml.chunk.md
Chunks are pieces of a document produced by a specified splitter. Chunking is typically done before embedding.
pgml.chunk(
splitter TEXT, -- splitter name
text TEXT, -- text to split into chunks
kwargs JSON -- optional arguments (see below)
)
SELECT pgml.chunk('recursive_character', 'test');
SELECT pgml.chunk('recursive_character', 'test', '{"chunk_size": 1000, "chunk_overlap": 40}'::jsonb);
SELECT pgml.chunk('markdown', '# Some test');
Note that the input text in these examples is too short to be split at all. A real-world example would look more like:
SELECT pgml.chunk('recursive_character', content) FROM documents;
Here, documents is a table with a text column called content.
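To make the table-based query easier to try out, here is a minimal sketch that builds a throwaway documents table and chunks a longer piece of text. The id column, the sample text, and the specific chunk_size and chunk_overlap values are assumptions chosen for illustration; only the content column and the pgml.chunk call forms come from the examples above.

```sql
-- Hypothetical temporary table; only the content column is taken from the text above.
-- A TEMP table is used so the sketch does not touch an existing documents table.
CREATE TEMP TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL
);

-- Insert one row whose text is long enough that a splitter has something to split.
INSERT INTO documents (content)
VALUES (repeat('PostgresML can split long text into chunks before embedding. ', 100));

-- Chunk every row, keeping the (assumed) id column so each chunk
-- can be traced back to the document it came from.
SELECT
    id,
    pgml.chunk(
        'recursive_character',
        content,
        '{"chunk_size": 200, "chunk_overlap": 20}'::jsonb
    )
FROM documents;
```

Selecting id alongside the function call is one way to keep track of which document each chunk came from. The shape of the value returned by pgml.chunk is not described here, so the query returns it as-is rather than unpacking it.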
We support the following splitters:
- recursive_character
- latex
- markdown
- nltk
- python
- spacy

For more information on splitters, see LangChain's docs.