Reference: .chunk() | RAG

docs/src/content/en/reference/rag/chunk.mdx

2025-12-18, 11.4 KB
.chunk()

The .chunk() function splits documents into smaller segments using various strategies and options.

Example

typescript
import { MDocument } from '@mastra/rag'

const doc = MDocument.fromMarkdown(`
# Introduction
This is a sample document that we want to split into chunks.

## Section 1
Here is the first section with some content.

## Section 2 
Here is another section with different content.
`)

// Basic chunking with defaults
const chunks = await doc.chunk()

// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: 'markdown',
  headers: [
    ['#', 'title'],
    ['##', 'section'],
  ],
  extract: {
    summary: true, // Extract summaries with default settings
    keywords: true, // Extract keywords with default settings
  },
})

Parameters

The following general parameters are accepted by every chunking strategy. Important: Each strategy uses only the subset of these parameters that is relevant to it; the rest are ignored.

<PropertiesTable content={[ { name: 'strategy', type: "'recursive' | 'character' | 'token' | 'markdown' | 'semantic-markdown' | 'html' | 'json' | 'latex' | 'sentence'", isOptional: true, description: "The chunking strategy to use. If not specified, the default is chosen from the document type: .md files → 'markdown', .html/.htm → 'html', .json → 'json', .tex → 'latex', others → 'recursive'. Each strategy accepts additional strategy-specific options.", }, { name: 'maxSize', type: 'number', isOptional: true, defaultValue: '4000', description: 'Maximum size of each chunk. Note: Some strategy configurations (markdown with headers, HTML with headers) ignore this parameter.', }, { name: 'overlap', type: 'number', isOptional: true, defaultValue: '50', description: 'Number of characters/tokens that overlap between chunks.', }, { name: 'lengthFunction', type: '(text: string) => number', isOptional: true, description: 'Function to calculate text length. Defaults to character count.', }, { name: 'separatorPosition', type: "'start' | 'end'", isOptional: true, description: "Where to position the separator in chunks. 'start' attaches to beginning of next chunk, 'end' attaches to end of current chunk. If not specified, separators are discarded.", }, { name: 'addStartIndex', type: 'boolean', isOptional: true, defaultValue: 'false', description: 'Whether to add start index metadata to chunks.', }, { name: 'stripWhitespace', type: 'boolean', isOptional: true, defaultValue: 'true', description: 'Whether to strip whitespace from chunks.', }, { name: 'extract', type: 'ExtractParams', isOptional: true, description: 'Metadata extraction configuration.', }, ]} />
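
The default-strategy rule in the strategy description can be read as a lookup on the file extension. The sketch below is a standalone illustration of that rule, not code from @mastra/rag: `defaultStrategyFor` and `ChunkStrategy` are hypothetical names, and matching the extension case-insensitively is an assumption.

```typescript
// Hypothetical helper mirroring the documented defaults; not part of @mastra/rag.
type ChunkStrategy = 'recursive' | 'markdown' | 'html' | 'json' | 'latex'

function defaultStrategyFor(filename: string): ChunkStrategy {
  const dot = filename.lastIndexOf('.')
  // Assumption: extensions are compared case-insensitively.
  const extension = dot === -1 ? '' : filename.slice(dot + 1).toLowerCase()
  switch (extension) {
    case 'md':
      return 'markdown'
    case 'html':
    case 'htm':
      return 'html'
    case 'json':
      return 'json'
    case 'tex':
      return 'latex'
    default:
      // Everything else falls back to the recursive strategy.
      return 'recursive'
  }
}

const readmeStrategy = defaultStrategyFor('README.md')
const plainTextStrategy = defaultStrategyFor('notes.txt')
```

Passing strategy explicitly always overrides this lookup, which is the safer choice when the document type is ambiguous.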

See ExtractParams reference for details on the extract parameter.
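
To make maxSize, overlap, and addStartIndex concrete, here is a minimal character-based sliding-window sketch. It is not the library's implementation: `slidingWindowChunks` and `SketchChunk` are hypothetical names, and the real strategies split on separators rather than at fixed offsets.

```typescript
// Conceptual sketch only; names and behavior are illustrative assumptions.
interface SketchChunk {
  text: string
  startIndex: number // what addStartIndex-style metadata could record
}

function slidingWindowChunks(text: string, maxSize: number, overlap: number): SketchChunk[] {
  if (maxSize <= 0) throw new Error('maxSize must be positive')
  if (overlap < 0 || overlap >= maxSize) throw new Error('overlap must be in [0, maxSize)')

  const chunks: SketchChunk[] = []
  const step = maxSize - overlap // each window starts `overlap` characters before the previous one ends
  for (let start = 0; start < text.length; start += step) {
    chunks.push({ text: text.slice(start, start + maxSize), startIndex: start })
    if (start + maxSize >= text.length) break // the last window already reached the end
  }
  return chunks
}

const windows = slidingWindowChunks('abcdefghij', 4, 1)
// windows: 'abcd' (start 0), 'defg' (start 3), 'ghij' (start 6)
```

Each chunk repeats the last character of the previous one, which is the role overlap plays: context at a chunk boundary is not lost entirely.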

Strategy-specific options

Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:

typescript
// Each example uses its own variable name so the snippets can live in one file.

// Character strategy example
const characterChunks = await doc.chunk({
  strategy: 'character',
  separator: '.', // Character-specific option
  isSeparatorRegex: false, // Character-specific option
  maxSize: 300, // General option
})

// Recursive strategy example
const recursiveChunks = await doc.chunk({
  strategy: 'recursive',
  separators: ['\n\n', '\n', ' '], // Recursive-specific option
  language: 'markdown', // Recursive-specific option
  maxSize: 500, // General option
})

// Sentence strategy example
const sentenceChunks = await doc.chunk({
  strategy: 'sentence',
  maxSize: 450, // Required for sentence strategy
  minSize: 50, // Sentence-specific option
  sentenceEnders: ['.'], // Sentence-specific option
  fallbackToCharacters: false, // Sentence-specific option
})

// HTML strategy example
const htmlChunks = await doc.chunk({
  strategy: 'html',
  headers: [
    ['h1', 'title'],
    ['h2', 'subtitle'],
  ], // HTML-specific option
})

// Markdown strategy example
const markdownChunks = await doc.chunk({
  strategy: 'markdown',
  headers: [
    ['#', 'title'],
    ['##', 'section'],
  ], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
})

// Semantic Markdown strategy example
const semanticChunks = await doc.chunk({
  strategy: 'semantic-markdown',
  joinThreshold: 500, // Semantic Markdown-specific option
  modelName: 'gpt-3.5-turbo', // Semantic Markdown-specific option
})

// Token strategy example
const tokenChunks = await doc.chunk({
  strategy: 'token',
  encodingName: 'gpt2', // Token-specific option
  modelName: 'gpt-3.5-turbo', // Token-specific option
  maxSize: 1000, // General option
})

The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.

Character

<PropertiesTable content={[ { name: 'separator', type: 'string', isOptional: true, description: 'Separator to split on. Treated as a regular expression when isSeparatorRegex is true.', }, { name: 'isSeparatorRegex', type: 'boolean', isOptional: true, defaultValue: 'false', description: 'Whether the separator is a regex pattern', }, ]} />
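
The difference that isSeparatorRegex makes is easiest to see in a small standalone sketch. The code below only illustrates the split step (a real character strategy would also merge pieces up to maxSize); `splitOnSeparator` and `escapeRegExp` are hypothetical helpers, not functions from @mastra/rag.

```typescript
// Conceptual sketch, not the library's code.
function escapeRegExp(literal: string): string {
  // Escape every character that has a special meaning in a regular expression.
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

function splitOnSeparator(text: string, separator: string, isSeparatorRegex = false): string[] {
  // A literal separator is escaped first, so '.' matches a period and not "any character".
  const pattern = new RegExp(isSeparatorRegex ? separator : escapeRegExp(separator))
  return text
    .split(pattern)
    .map(piece => piece.trim())
    .filter(piece => piece.length > 0)
}

const literalPieces = splitOnSeparator('One. Two. Three.', '.')
const regexPieces = splitOnSeparator('a1b22c', '\\d+', true)
```

With the flag off, `'.'` is a literal period; with it on, the same string would match any character, so the flag should only be set when the separator really is a pattern.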

Recursive

<PropertiesTable content={[ { name: 'separators', type: 'string[]', isOptional: true, description: 'Array of separators to try in order of preference. The strategy will attempt to split on the first separator, then fall back to subsequent ones.', }, { name: 'isSeparatorRegex', type: 'boolean', isOptional: true, defaultValue: 'false', description: 'Whether the separators are regex patterns', }, { name: 'language', type: 'Language', isOptional: true, description: 'Programming or markup language for language-specific splitting behavior. See Language enum for supported values.', }, ]} />
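
The "try the first separator, then fall back" behavior described for separators can be sketched as a short recursive function. This is a simplified illustration under stated assumptions, not the @mastra/rag implementation: `recursiveSplit` is a hypothetical name, separators are treated as literal strings, overlap is ignored, and separators are kept only where pieces are merged back together.

```typescript
// Conceptual sketch of recursive splitting with separator fallback.
function recursiveSplit(text: string, separators: string[], maxSize: number): string[] {
  if (text.length <= maxSize) return text.length > 0 ? [text] : []

  if (separators.length === 0) {
    // No separators left: fall back to fixed-size cuts.
    const pieces: string[] = []
    for (let i = 0; i < text.length; i += maxSize) pieces.push(text.slice(i, i + maxSize))
    return pieces
  }

  const separator = separators[0] as string
  const remaining = separators.slice(1)
  const chunks: string[] = []
  let current = ''

  for (const part of text.split(separator)) {
    if (part.length > maxSize) {
      // This part is still too large: flush what we have, then retry with finer separators.
      if (current.length > 0) chunks.push(current)
      current = ''
      for (const piece of recursiveSplit(part, remaining, maxSize)) chunks.push(piece)
      continue
    }
    const candidate = current.length > 0 ? current + separator + part : part
    if (candidate.length <= maxSize) {
      current = candidate // keep merging small parts into one chunk
    } else {
      if (current.length > 0) chunks.push(current)
      current = part
    }
  }
  if (current.length > 0) chunks.push(current)
  return chunks
}

const recursivePieces = recursiveSplit('aaa bbb\n\nccc ddd eee', ['\n\n', ' '], 7)
// recursivePieces: ['aaa bbb', 'ccc ddd', 'eee']
```

Ordering separators from coarse to fine (paragraphs, then lines, then spaces) keeps chunks aligned with the largest structural unit that still fits within maxSize.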

Sentence

<PropertiesTable content={[ { name: 'maxSize', type: 'number', description: 'Maximum size of each chunk (required for sentence strategy)', }, { name: 'minSize', type: 'number', isOptional: true, defaultValue: '50', description: 'Minimum size of each chunk. Chunks smaller than this will be merged with adjacent chunks when possible.', }, { name: 'targetSize', type: 'number', isOptional: true, description: 'Preferred target size for chunks. Defaults to 80% of maxSize. The strategy will try to create chunks close to this size.', }, { name: 'sentenceEnders', type: 'string[]', isOptional: true, defaultValue: "['.', '!', '?']", description: 'Array of characters that mark sentence endings for splitting boundaries.', }, { name: 'fallbackToWords', type: 'boolean', isOptional: true, defaultValue: 'true', description: 'Whether to fall back to word-level splitting for sentences that exceed maxSize.', }, { name: 'fallbackToCharacters', type: 'boolean', isOptional: true, defaultValue: 'true', description: 'Whether to fall back to character-level splitting for words that exceed maxSize. Only applies if fallbackToWords is enabled.', }, ]} />
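
As a rough illustration of how sentenceEnders and maxSize interact, the sketch below splits text at sentence-ending characters and then greedily packs whole sentences into chunks. It is an assumption-laden simplification, not the library's algorithm: `splitSentences` and `packSentences` are hypothetical names, and minSize, targetSize, and the word and character fallbacks are left out.

```typescript
// Conceptual sketch of sentence-based chunking.
function splitSentences(text: string, sentenceEnders: string[] = ['.', '!', '?']): string[] {
  const sentences: string[] = []
  let current = ''
  for (const char of text) {
    current += char
    if (sentenceEnders.indexOf(char) !== -1) {
      // A sentence ender closes the current sentence.
      if (current.trim().length > 0) sentences.push(current.trim())
      current = ''
    }
  }
  if (current.trim().length > 0) sentences.push(current.trim())
  return sentences
}

function packSentences(sentences: string[], maxSize: number): string[] {
  const chunks: string[] = []
  let current = ''
  for (const sentence of sentences) {
    const candidate = current.length > 0 ? `${current} ${sentence}` : sentence
    if (candidate.length <= maxSize) {
      current = candidate
    } else {
      if (current.length > 0) chunks.push(current)
      current = sentence // an oversized sentence is kept whole in this sketch
    }
  }
  if (current.length > 0) chunks.push(current)
  return chunks
}

const packed = packSentences(splitSentences('Hi there. How are you? Fine!'), 22)
// packed: ['Hi there. How are you?', 'Fine!']
```

Restricting sentenceEnders to `['.']` would treat "How are you? Fine!" as a single sentence, which is why the default includes `!` and `?`.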

HTML

<PropertiesTable content={[ { name: 'headers', type: 'Array<[string, string]>', description: 'Array of [selector, metadata key] pairs for header-based splitting', }, { name: 'sections', type: 'Array<[string, string]>', description: 'Array of [selector, metadata key] pairs for section-based splitting', }, { name: 'returnEachLine', type: 'boolean', isOptional: true, description: 'Whether to return each line as a separate chunk', }, ]} />

Important: When using the HTML strategy, all general options are ignored. Use headers for header-based splitting or sections for section-based splitting. If both headers and sections are provided, sections is ignored.

Markdown

<PropertiesTable content={[ { name: 'headers', type: 'Array<[string, string]>', isOptional: true, description: 'Array of [header level, metadata key] pairs', }, { name: 'stripHeaders', type: 'boolean', isOptional: true, description: 'Whether to remove headers from the output', }, { name: 'returnEachLine', type: 'boolean', isOptional: true, description: 'Whether to return each line as a separate chunk', }, ]} />

Important: When using the headers option, the markdown strategy ignores all general options and content is split based on the markdown header structure. To use size-based chunking with markdown, omit the headers parameter.
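
Header-based splitting can be pictured as walking the document line by line and starting a new chunk at every configured header. The sketch below is a simplified illustration, not the library's code: `splitByHeaders` and `HeaderChunk` are hypothetical names, header lines are always dropped from the chunk text, and fenced code blocks containing `#` lines are not handled.

```typescript
// Conceptual sketch of header-based markdown splitting.
interface HeaderChunk {
  text: string
  metadata: Record<string, string>
}

function splitByHeaders(markdown: string, headers: Array<[string, string]>): HeaderChunk[] {
  const chunks: HeaderChunk[] = []
  let metadata: Record<string, string> = {}
  let body: string[] = []

  const flush = () => {
    const text = body.join('\n').trim()
    if (text.length > 0) chunks.push({ text, metadata: { ...metadata } })
    body = []
  }

  for (const line of markdown.split('\n')) {
    const match = /^(#+)\s+(.*)$/.exec(line)
    const level = match ? match[1] : ''
    const configured = headers.filter(([marker]) => marker === level)
    if (match && configured.length > 0) {
      flush()
      // Keep metadata from shallower headers; drop same-level and deeper entries.
      const next: Record<string, string> = {}
      for (const [marker, key] of headers) {
        if (marker.length < level.length && key in metadata) next[key] = metadata[key]
      }
      next[configured[0][1]] = match[2].trim()
      metadata = next
    } else {
      body.push(line)
    }
  }
  flush()
  return chunks
}

const headerChunks = splitByHeaders(
  '# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.',
  [
    ['#', 'title'],
    ['##', 'section'],
  ],
)
```

Chunk boundaries here come only from the header structure, which matches the note above that size-related general options do not apply once headers is set.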

Semantic Markdown

<PropertiesTable content={[ { name: 'joinThreshold', type: 'number', isOptional: true, defaultValue: '500', description: 'Maximum token count for merging related sections. Sections exceeding this limit individually are left intact, but smaller sections are merged with siblings or parents if the combined size stays under this threshold.', }, { name: 'modelName', type: 'string', isOptional: true, description: "Name of the model for tokenization. If provided, the model's underlying tokenization encodingName will be used.", }, { name: 'encodingName', type: 'string', isOptional: true, defaultValue: 'cl100k_base', description: 'Name of the token encoding to use. Derived from modelName if available.', }, { name: 'allowedSpecial', type: "Set<string> | 'all'", isOptional: true, description: "Set of special tokens allowed during tokenization, or 'all' to allow all special tokens", }, { name: 'disallowedSpecial', type: "Set<string> | 'all'", isOptional: true, defaultValue: 'all', description: "Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens", }, ]} />
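
The effect of joinThreshold can be approximated with a greedy merge over adjacent sections. This sketch is deliberately simplified and is not the library's algorithm: `mergeSmallSections` and `countTokensApprox` are hypothetical names, whitespace-separated words stand in for real tokens, the heading hierarchy (siblings versus parents) is ignored, and treating a combined size equal to the threshold as mergeable is an assumption.

```typescript
// Conceptual sketch of threshold-based section merging.
function countTokensApprox(text: string): number {
  // Assumption: whitespace-separated words approximate tokens.
  const trimmed = text.trim()
  return trimmed.length === 0 ? 0 : trimmed.split(/\s+/).length
}

function mergeSmallSections(sections: string[], joinThreshold: number): string[] {
  const merged: string[] = []
  for (const section of sections) {
    const last = merged.length > 0 ? merged[merged.length - 1] : undefined
    if (last !== undefined && countTokensApprox(last) + countTokensApprox(section) <= joinThreshold) {
      // Small neighbors are joined while the combined size stays within the threshold.
      merged[merged.length - 1] = `${last}\n\n${section}`
    } else {
      // Sections that would exceed the threshold (or already exceed it alone) stay separate and intact.
      merged.push(section)
    }
  }
  return merged
}

const mergedSections = mergeSmallSections(['a b', 'c d', 'e f g h i', 'j'], 4)
// mergedSections: ['a b\n\nc d', 'e f g h i', 'j']
```

A larger joinThreshold yields fewer, bigger chunks; a smaller one keeps sections closer to the original heading structure.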

Token

<PropertiesTable content={[ { name: 'encodingName', type: 'string', isOptional: true, description: 'Name of the token encoding to use', }, { name: 'modelName', type: 'string', isOptional: true, description: 'Name of the model for tokenization', }, { name: 'allowedSpecial', type: "Set<string> | 'all'", isOptional: true, description: "Set of special tokens allowed during tokenization, or 'all' to allow all special tokens", }, { name: 'disallowedSpecial', type: "Set<string> | 'all'", isOptional: true, description: "Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens", }, ]} />

JSON

<PropertiesTable content={[ { name: 'maxSize', type: 'number', description: 'Maximum size of each chunk', }, { name: 'minSize', type: 'number', isOptional: true, description: 'Minimum size of each chunk', }, { name: 'ensureAscii', type: 'boolean', isOptional: true, description: 'Whether to ensure ASCII encoding', }, { name: 'convertLists', type: 'boolean', isOptional: true, description: 'Whether to convert lists in the JSON', }, ]} />
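
The table does not spell out what convertLists and ensureAscii do, so the following is only one plausible reading, modeled on how similar JSON splitters behave: arrays are rewritten as objects keyed by index so they can be split like any other object, and non-ASCII characters are written as `\uXXXX` escapes. `listsToObjects` and `escapeNonAscii` are hypothetical helpers, not @mastra/rag functions.

```typescript
// Conceptual sketch under the assumptions stated above.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json }

function listsToObjects(value: Json): Json {
  if (Array.isArray(value)) {
    // ['a', 'b'] becomes { '0': 'a', '1': 'b' }
    const result: { [key: string]: Json } = {}
    value.forEach((item, index) => {
      result[String(index)] = listsToObjects(item)
    })
    return result
  }
  if (value !== null && typeof value === 'object') {
    const result: { [key: string]: Json } = {}
    for (const key of Object.keys(value)) result[key] = listsToObjects(value[key])
    return result
  }
  return value
}

function escapeNonAscii(json: string): string {
  // Replace every UTF-16 code unit outside ASCII with a \uXXXX escape.
  return json.replace(/[\u0080-\uffff]/g, ch => '\\u' + ('0000' + ch.charCodeAt(0).toString(16)).slice(-4))
}

const converted = JSON.stringify(listsToObjects({ tags: ['a', 'b'] }))
const asciiOnly = escapeNonAscii(JSON.stringify({ name: 'café' }))
```

The escaped output is still valid JSON and parses back to the original string, so this kind of escaping changes the encoding of a chunk but not its content.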

Latex

The LaTeX strategy uses only the general chunking options listed above. It provides LaTeX-aware splitting suited to mathematical and academic documents.

Return value

Returns an MDocument instance containing the chunked documents. Each chunk has the following shape:

typescript
interface DocumentNode {
  text: string
  metadata: Record<string, any>
  embedding?: number[]
}
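
Because each chunk carries its own metadata, downstream code can filter or group chunks without re-parsing the text. The sketch below works on hand-written sample data rather than real .chunk() output; `groupTextByKey` is a hypothetical helper, and the `section` metadata key assumes a markdown headers configuration like the one in the example at the top of this page.

```typescript
// Same shape as the DocumentNode interface above, repeated so the sketch is self-contained.
interface DocumentNode {
  text: string
  metadata: Record<string, any>
  embedding?: number[]
}

// Hypothetical helper: group chunk text by one metadata key.
function groupTextByKey(nodes: DocumentNode[], key: string): Record<string, string[]> {
  const groups: Record<string, string[]> = {}
  for (const node of nodes) {
    const value = node.metadata[key]
    const label: string = typeof value === 'string' ? value : 'unlabeled'
    if (!groups[label]) groups[label] = []
    groups[label].push(node.text)
  }
  return groups
}

// Hand-written sample chunks (not produced by the library).
const sampleNodes: DocumentNode[] = [
  { text: 'This is a sample document.', metadata: { title: 'Introduction' } },
  { text: 'Here is the first section.', metadata: { title: 'Introduction', section: 'Section 1' } },
  { text: 'More on the first section.', metadata: { title: 'Introduction', section: 'Section 1' } },
]

const bySection = groupTextByKey(sampleNodes, 'section')
```

The embedding field is optional, so code that consumes chunks should not assume it is present until an embedding step has run.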