.chunk()

The .chunk() method splits a document into smaller segments according to the chosen strategy and its options.

Example

typescript

import { MDocument } from '@mastra/rag'

const doc = MDocument.fromMarkdown(`
# Introduction
This is a sample document that we want to split into chunks.

## Section 1
Here is the first section with some content.

## Section 2 
Here is another section with different content.
`)

// Basic chunking with defaults
const chunks = await doc.chunk()

// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: 'markdown',
  headers: [
    ['#', 'title'],
    ['##', 'section'],
  ],
  extract: {
    summary: true, // Extract summaries with default settings
    keywords: true, // Extract keywords with default settings
  },
})

Parameters

The following parameters are available for all chunking strategies. Important: each strategy uses only the subset of these parameters that is relevant to it.

<PropertiesTable content={[ { name: 'strategy', type: "'recursive' | 'character' | 'token' | 'markdown' | 'semantic-markdown' | 'html' | 'json' | 'latex' | 'sentence'", isOptional: true, description: "The chunking strategy to use. If not specified, the default is chosen from the document type. Each strategy accepts additional options of its own. Defaults: .md files → 'markdown', .html/.htm → 'html', .json → 'json', .tex → 'latex', others → 'recursive'", }, { name: 'maxSize', type: 'number', isOptional: true, defaultValue: '4000', description: 'Maximum size of each chunk. Note: Some strategy configurations (markdown with headers, HTML with headers) ignore this parameter.', }, { name: 'overlap', type: 'number', isOptional: true, defaultValue: '50', description: 'Number of characters/tokens that overlap between chunks.', }, { name: 'lengthFunction', type: '(text: string) => number', isOptional: true, description: 'Function to calculate text length. Defaults to character count.', }, { name: 'separatorPosition', type: "'start' | 'end'", isOptional: true, description: "Where to position the separator in chunks. 'start' attaches to beginning of next chunk, 'end' attaches to end of current chunk. If not specified, separators are discarded.", }, { name: 'addStartIndex', type: 'boolean', isOptional: true, defaultValue: 'false', description: 'Whether to add start index metadata to chunks.', }, { name: 'stripWhitespace', type: 'boolean', isOptional: true, defaultValue: 'true', description: 'Whether to strip whitespace from chunks.', }, { name: 'extract', type: 'ExtractParams', isOptional: true, description: 'Metadata extraction configuration.', }, ]} />
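
To make the interaction between maxSize and overlap concrete, here is a minimal sketch of a fixed-size character window. It is an illustration only, not the @mastra/rag implementation: the helper name splitWithOverlap is hypothetical, and the real strategies also take separators and document structure into account.

```typescript
// Illustrative only: a simplified fixed-window splitter, not the library's algorithm.
function splitWithOverlap(text: string, maxSize: number, overlap: number): string[] {
  if (maxSize <= 0) throw new Error('maxSize must be positive')
  if (overlap < 0 || overlap >= maxSize) {
    throw new Error('overlap must be at least 0 and smaller than maxSize')
  }

  // Each window starts (maxSize - overlap) characters after the previous one,
  // so neighbouring chunks share `overlap` characters.
  const step = maxSize - overlap
  const chunks: string[] = []

  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxSize))
    // Stop once a window has reached the end of the text.
    if (start + maxSize >= text.length) break
  }

  return chunks
}

const pieces = splitWithOverlap('abcdefghij', 4, 1)
console.log(pieces) // ['abcd', 'defg', 'ghij']
```

With maxSize 4 and overlap 1, every chunk begins with the last character of the previous one, which is the behavior the overlap parameter describes.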

See ExtractParams reference for details on the extract parameter.
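
Among the general options, lengthFunction is the one most likely to need custom code. The sketch below assumes a rough heuristic of about four characters per token; approxTokenLength and that ratio are illustrative assumptions, not part of the library. Only the option names come from the parameter table.

```typescript
// Hypothetical heuristic: roughly four characters per token.
// Swap in a real tokenizer if exact counts matter.
const approxTokenLength = (text: string): number => Math.ceil(text.length / 4)

// Plain options object built from the general parameters.
const chunkOptions = {
  strategy: 'recursive' as const,
  maxSize: 200, // assumed to be measured with lengthFunction, per its description
  overlap: 20,
  lengthFunction: approxTokenLength,
}

// Usage, assuming `doc` is an MDocument as in the example at the top of this page:
// const chunks = await doc.chunk(chunkOptions)
```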

Strategy-specific options

Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:

typescript

// Each example uses its own variable name so the snippets can live in one file.

// Character strategy example
const characterChunks = await doc.chunk({
  strategy: 'character',
  separator: '.', // Character-specific option
  isSeparatorRegex: false, // Character-specific option
  maxSize: 300, // General option
})

// Recursive strategy example
const recursiveChunks = await doc.chunk({
  strategy: 'recursive',
  separators: ['\n\n', '\n', ' '], // Recursive-specific option
  language: 'markdown', // Recursive-specific option
  maxSize: 500, // General option
})

// Sentence strategy example
const sentenceChunks = await doc.chunk({
  strategy: 'sentence',
  maxSize: 450, // Required for sentence strategy
  minSize: 50, // Sentence-specific option
  sentenceEnders: ['.'], // Sentence-specific option
  fallbackToCharacters: false, // Sentence-specific option
})

// HTML strategy example
const htmlChunks = await doc.chunk({
  strategy: 'html',
  headers: [
    ['h1', 'title'],
    ['h2', 'subtitle'],
  ], // HTML-specific option
})

// Markdown strategy example
const markdownChunks = await doc.chunk({
  strategy: 'markdown',
  headers: [
    ['#', 'title'],
    ['##', 'section'],
  ], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
})

// Semantic Markdown strategy example
const semanticChunks = await doc.chunk({
  strategy: 'semantic-markdown',
  joinThreshold: 500, // Semantic Markdown-specific option
  modelName: 'gpt-3.5-turbo', // Semantic Markdown-specific option
})

// Token strategy example
const tokenChunks = await doc.chunk({
  strategy: 'token',
  encodingName: 'gpt2', // Token-specific option
  modelName: 'gpt-3.5-turbo', // Token-specific option
  maxSize: 1000, // General option
})

The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.

Character

Recursive

Sentence

HTML

<PropertiesTable content={[ { name: 'headers', type: 'Array<[string, string]>', description: 'Array of [selector, metadata key] pairs for header-based splitting', }, { name: 'sections', type: 'Array<[string, string]>', description: 'Array of [selector, metadata key] pairs for section-based splitting', }, { name: 'returnEachLine', type: 'boolean', isOptional: true, description: 'Whether to return each line as a separate chunk', }, ]} />

Important: When using the HTML strategy, all general options are ignored. Use headers for header-based splitting or sections for section-based splitting. If used together, sections will be ignored.
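
The precedence rule above can be written down as a small helper. This is a sketch of the documented behavior only: effectiveHtmlSplit and the 'none' result are hypothetical names, and treating an empty array as "not provided" is an assumption.

```typescript
type SelectorPair = [string, string]

interface HtmlOptionsSketch {
  headers?: SelectorPair[]
  sections?: SelectorPair[]
}

interface HtmlSplitChoice {
  mode: 'headers' | 'sections' | 'none'
  pairs: SelectorPair[]
}

// Hypothetical helper mirroring the documented rule: when both are given, headers win.
function effectiveHtmlSplit(options: HtmlOptionsSketch): HtmlSplitChoice {
  if (options.headers && options.headers.length > 0) {
    return { mode: 'headers', pairs: options.headers }
  }
  if (options.sections && options.sections.length > 0) {
    return { mode: 'sections', pairs: options.sections }
  }
  // What the library does with neither option is not covered here.
  return { mode: 'none', pairs: [] }
}

const choice = effectiveHtmlSplit({
  headers: [['h1', 'title']],
  sections: [['section', 'part']],
})
console.log(choice.mode) // 'headers'
```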

Markdown

<PropertiesTable content={[ { name: 'headers', type: 'Array<[string, string]>', isOptional: true, description: 'Array of [header level, metadata key] pairs', }, { name: 'stripHeaders', type: 'boolean', isOptional: true, description: 'Whether to remove headers from the output', }, { name: 'returnEachLine', type: 'boolean', isOptional: true, description: 'Whether to return each line as a separate chunk', }, ]} />

Important: When using the headers option, the markdown strategy ignores all general options and content is split based on the markdown header structure. To use size-based chunking with markdown, omit the headers parameter.
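
To show what header-based splitting means in practice, the following sketch splits markdown on the configured header markers and records each header's text under its metadata key. It is a simplified illustration, not the library's code: splitByHeaders is a hypothetical helper, it always drops header lines (similar in spirit to stripHeaders: true), and it ignores edge cases such as headers inside code fences.

```typescript
type HeaderPair = [string, string] // [header marker, metadata key]

interface SketchChunk {
  text: string
  metadata: Record<string, string>
}

function splitByHeaders(markdown: string, headers: HeaderPair[]): SketchChunk[] {
  const chunks: SketchChunk[] = []
  const active: Record<string, string> = {} // header values currently in scope
  let buffer: string[] = []

  // Emit the buffered body lines as one chunk tagged with the active headers.
  const flush = () => {
    const text = buffer.join('\n').trim()
    if (text) chunks.push({ text, metadata: { ...active } })
    buffer = []
  }

  for (const line of markdown.split('\n')) {
    const match = headers.find(([marker]) => line.startsWith(marker + ' '))
    if (!match) {
      buffer.push(line)
      continue
    }
    flush()
    const [marker, key] = match
    // A new header clears any deeper headers (longer markers) still in scope.
    for (const [otherMarker, otherKey] of headers) {
      if (otherMarker.length > marker.length) delete active[otherKey]
    }
    active[key] = line.slice(marker.length + 1).trim()
  }
  flush()
  return chunks
}

const sample = [
  '# Introduction',
  'Intro text.',
  '',
  '## Section 1',
  'First section.',
  '',
  '## Section 2',
  'Second section.',
].join('\n')

const headerChunks = splitByHeaders(sample, [
  ['#', 'title'],
  ['##', 'section'],
])
console.log(headerChunks.length) // 3
```

Chunk sizes here follow the header structure alone, which is why size-related general options have no effect in this mode.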

Semantic Markdown

<PropertiesTable content={[ { name: 'joinThreshold', type: 'number', isOptional: true, defaultValue: '500', description: 'Maximum token count for merging related sections. Sections exceeding this limit individually are left intact, but smaller sections are merged with siblings or parents if the combined size stays under this threshold.', }, { name: 'modelName', type: 'string', isOptional: true, description: "Name of the model for tokenization. If provided, the model's underlying tokenization encodingName will be used.", }, { name: 'encodingName', type: 'string', isOptional: true, defaultValue: 'cl100k_base', description: 'Name of the token encoding to use. Derived from modelName if available.', }, { name: 'allowedSpecial', type: "Set<string> | 'all'", isOptional: true, description: "Set of special tokens allowed during tokenization, or 'all' to allow all special tokens", }, { name: 'disallowedSpecial', type: "Set<string> | 'all'", isOptional: true, defaultValue: 'all', description: "Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens", }, ]} />
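
The effect of joinThreshold can be sketched with a greedy merge over a flat list of sections. This is a deliberately simplified model under stated assumptions: it merges adjacent sections only (the real strategy also considers parent sections), token counts are supplied directly instead of coming from a tokenizer, and treating a combined size equal to the threshold as mergeable is a guess. The names SectionSketch and mergeAdjacent are hypothetical.

```typescript
interface SectionSketch {
  title: string
  tokens: number // token count, assumed to be precomputed
}

// Merge each section into the previous one while the combined size fits the threshold.
function mergeAdjacent(sections: SectionSketch[], joinThreshold: number): SectionSketch[] {
  const merged: SectionSketch[] = []

  for (const section of sections) {
    const last = merged[merged.length - 1]
    if (last && last.tokens + section.tokens <= joinThreshold) {
      merged[merged.length - 1] = {
        title: `${last.title} + ${section.title}`,
        tokens: last.tokens + section.tokens,
      }
    } else {
      // Oversized sections land here untouched: they are never split, only left alone.
      merged.push({ ...section })
    }
  }

  return merged
}

const mergedSections = mergeAdjacent(
  [
    { title: 'A', tokens: 120 },
    { title: 'B', tokens: 200 },
    { title: 'C', tokens: 700 },
    { title: 'D', tokens: 100 },
    { title: 'E', tokens: 150 },
  ],
  500,
)
console.log(mergedSections.map((s) => s.tokens)) // [320, 700, 250]
```

Section C exceeds the threshold on its own and stays intact, while the small sections on either side are combined.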

Token

<PropertiesTable content={[ { name: 'encodingName', type: 'string', isOptional: true, description: 'Name of the token encoding to use', }, { name: 'modelName', type: 'string', isOptional: true, description: 'Name of the model for tokenization', }, { name: 'allowedSpecial', type: "Set<string> | 'all'", isOptional: true, description: "Set of special tokens allowed during tokenization, or 'all' to allow all special tokens", }, { name: 'disallowedSpecial', type: "Set<string> | 'all'", isOptional: true, description: "Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens", }, ]} />

JSON

Latex

The Latex strategy uses only the general chunking options listed above. It provides LaTeX-aware splitting optimized for mathematical and academic documents.

Return value

Returns an MDocument instance containing the chunked documents. Each chunk has the following shape:

typescript

interface DocumentNode {
  text: string
  metadata: Record<string, any>
  embedding?: number[]
}
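
As a sketch of how chunks with this shape might be consumed, the snippet below groups chunk text by a metadata key and finds chunks that still lack an embedding. The sample data and the helper groupByMetadata are hypothetical; only the DocumentNode shape comes from this page.

```typescript
interface DocumentNode {
  text: string
  metadata: Record<string, any>
  embedding?: number[]
}

// Sample data standing in for real chunk output.
const nodes: DocumentNode[] = [
  { text: 'Intro text.', metadata: { title: 'Introduction' } },
  { text: 'First section.', metadata: { title: 'Introduction', section: 'Section 1' } },
  {
    text: 'Second section.',
    metadata: { title: 'Introduction', section: 'Section 2' },
    embedding: [0.1, 0.2],
  },
]

// Group chunk texts by a metadata key, skipping chunks that lack a string value for it.
function groupByMetadata(items: DocumentNode[], key: string): Map<string, string[]> {
  const groups = new Map<string, string[]>()
  for (const node of items) {
    const value = node.metadata[key]
    if (typeof value !== 'string') continue
    const texts = groups.get(value) ?? []
    texts.push(node.text)
    groups.set(value, texts)
  }
  return groups
}

const bySection = groupByMetadata(nodes, 'section')

// embedding is optional, so check for it before sending chunks to a vector store.
const needEmbedding = nodes.filter((node) => node.embedding === undefined)
console.log(bySection.size, needEmbedding.length) // 2 2
```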
