Class TokenTextSplitter

Namespace
Mythosia.AI.Rag.Splitters
Assembly
Mythosia.AI.Rag.dll

Splits documents into chunks based on approximate token count using whitespace tokenization. For precise token counting with a specific model's tokenizer, consider extending this class.
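To show what "approximate token count using whitespace tokenization" means in practice, here is a minimal sketch. `CountWhitespaceTokens` is a hypothetical helper written for this example, not a member of Mythosia.AI.Rag, and it assumes empty entries are discarded. A model-specific subword tokenizer would generally report a different count for the same text.

```csharp
using System;

// Hypothetical helper (not part of the library): counts tokens by
// splitting on common whitespace characters and ignoring empty entries.
static int CountWhitespaceTokens(string text) =>
    text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries).Length;

string sample = "Retrieval-augmented generation  splits\tlong documents\ninto chunks.";

// Runs of whitespace collapse, and punctuation stays attached to its word.
Console.WriteLine(CountWhitespaceTokens(sample)); // prints 7
```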

public class TokenTextSplitter : ITextSplitter
Inheritance
object
TokenTextSplitter
Implements
ITextSplitter

Constructors

TokenTextSplitter()

public TokenTextSplitter()

TokenTextSplitter(int, int)

public TokenTextSplitter(int maxTokensPerChunk, int tokenOverlap = 50)

Parameters

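maxTokensPerChunk int

Maximum number of tokens per chunk.

tokenOverlap int

Number of overlapping tokens between consecutive chunks. Defaults to 50.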

Properties

MaxTokensPerChunk

Maximum number of tokens per chunk.

public int MaxTokensPerChunk { get; set; }

Property Value

int

TokenOverlap

Number of overlapping tokens between consecutive chunks.

public int TokenOverlap { get; set; }

Property Value

int
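The interaction between MaxTokensPerChunk and TokenOverlap can be pictured as a sliding window over the token sequence. The sketch below is an illustration only, not the library's implementation: `ChunkTokens` is a hypothetical function, and it assumes each window advances by `maxTokensPerChunk - tokenOverlap` tokens and that the overlap must be smaller than the chunk size.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch (not the library's code): fixed-size windows over
// a token array, where consecutive windows share `tokenOverlap` tokens.
static List<string> ChunkTokens(string[] tokens, int maxTokensPerChunk, int tokenOverlap)
{
    if (maxTokensPerChunk <= 0)
        throw new ArgumentOutOfRangeException(nameof(maxTokensPerChunk));
    if (tokenOverlap < 0 || tokenOverlap >= maxTokensPerChunk)
        throw new ArgumentOutOfRangeException(nameof(tokenOverlap));

    int step = maxTokensPerChunk - tokenOverlap; // how far each window advances
    var result = new List<string>();
    for (int start = 0; start < tokens.Length; start += step)
    {
        result.Add(string.Join(" ", tokens.Skip(start).Take(maxTokensPerChunk)));
        if (start + maxTokensPerChunk >= tokens.Length)
            break; // this window already reached the last token
    }
    return result;
}

string[] words = "a b c d e f g h i j".Split(' ');
List<string> windows = ChunkTokens(words, maxTokensPerChunk: 4, tokenOverlap: 1);
foreach (string window in windows)
    Console.WriteLine(window);
// prints:
// a b c d
// d e f g
// g h i j
```

In this sketch, a larger overlap repeats more context across chunk boundaries at the cost of producing more chunks for the same input.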

TokenSeparators

Characters used to split text into tokens. Default is whitespace.

public char[] TokenSeparators { get; set; }

Property Value

char[]
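The effect of changing the separator set can be seen with plain `string.Split`, independent of the library. This is a sketch under assumptions: the separator values are hypothetical choices for illustration, and whether TokenTextSplitter discards empty entries the way this example does is not stated in this reference.

```csharp
using System;

string text = "alpha,beta gamma;delta";

// Whitespace only: punctuation stays inside the tokens.
string[] byWhitespace = text.Split(
    new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

// Hypothetical custom separator set that also breaks on ',' and ';'.
char[] customSeparators = { ' ', ',', ';' };
string[] byCustom = text.Split(customSeparators, StringSplitOptions.RemoveEmptyEntries);

Console.WriteLine(byWhitespace.Length); // prints 2
Console.WriteLine(byCustom.Length);     // prints 4
```

Adding separators raises the token count for the same text, so fewer characters fit within a given MaxTokensPerChunk.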

Methods

Split(RagDocument)

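Splits a document into chunks. The ITextSplitter contract lets implementations split by character count, token count, sentence boundary, and so on; this class splits by approximate token count, with tokens delimited by TokenSeparators.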

public IReadOnlyList<RagChunk> Split(RagDocument document)

Parameters

document RagDocument

The document to split.

Returns

IReadOnlyList<RagChunk>

An ordered list of chunks.