Class TokenTextSplitter
Splits documents into chunks based on approximate token count using whitespace tokenization. For precise token counting with a specific model's tokenizer, consider extending this class.
public class TokenTextSplitter : ITextSplitter
- Inheritance
- object
TokenTextSplitter
- Implements
- ITextSplitter
Constructors
TokenTextSplitter()
public TokenTextSplitter()
TokenTextSplitter(int, int)
public TokenTextSplitter(int maxTokensPerChunk, int tokenOverlap = 50)
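Parameters
- maxTokensPerChunk int
Maximum number of tokens per chunk.
- tokenOverlap int
Number of overlapping tokens between consecutive chunks. Defaults to 50.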
Properties
MaxTokensPerChunk
Maximum number of tokens per chunk.
public int MaxTokensPerChunk { get; set; }
Property Value
- int
TokenOverlap
Number of overlapping tokens between consecutive chunks.
public int TokenOverlap { get; set; }
Property Value
- int
TokenSeparators
Characters used to split text into tokens. Default is whitespace.
public char[] TokenSeparators { get; set; }
Property Value
- char[]
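The effect of a separator set can be illustrated with a minimal sketch. The `WhitespaceTokens.Tokenize` helper below is hypothetical and not part of this library; it only assumes that tokens are obtained by splitting on the separator characters and discarding empty entries, which may differ from what TokenTextSplitter does internally.

```csharp
using System;

public static class WhitespaceTokens
{
    // Hypothetical helper (not a library API): splits text on the given
    // separators and drops the empty entries that consecutive separators
    // would otherwise produce.
    public static string[] Tokenize(string text, char[] separators)
    {
        return text.Split(separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For example, `WhitespaceTokens.Tokenize("alpha  beta\tgamma\n", new[] { ' ', '\t', '\n' })` yields the three tokens `alpha`, `beta`, and `gamma`. In .NET, passing an empty separator array to `string.Split` makes it split on white-space characters.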
Methods
Split(RagDocument)
Splits a document into chunks. Implementations may split by character count, token count, sentence boundary, etc.
public IReadOnlyList<RagChunk> Split(RagDocument document)
Parameters
- document RagDocument
The document to split.
Returns
- IReadOnlyList<RagChunk>
An ordered list of chunks.
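How a maximum chunk size and an overlap interact can be sketched with a standalone sliding window over whitespace tokens. `OverlapChunker.Chunk` is a hypothetical stand-in that works on plain strings rather than RagDocument and RagChunk. The stride of `maxTokensPerChunk - tokenOverlap` and the argument checks are assumptions, so the library's actual algorithm and edge-case handling may differ.

```csharp
using System;
using System.Collections.Generic;

public static class OverlapChunker
{
    // Hypothetical stand-in for the splitting step, not the library's code.
    public static List<string> Chunk(string text, int maxTokensPerChunk, int tokenOverlap)
    {
        if (maxTokensPerChunk <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxTokensPerChunk));
        // Assumption: the overlap must be smaller than the chunk size,
        // otherwise the window would never advance.
        if (tokenOverlap < 0 || tokenOverlap >= maxTokensPerChunk)
            throw new ArgumentOutOfRangeException(nameof(tokenOverlap));

        // An empty separator array makes string.Split use white space.
        string[] tokens = text.Split(Array.Empty<char>(), StringSplitOptions.RemoveEmptyEntries);

        var chunks = new List<string>();
        int stride = maxTokensPerChunk - tokenOverlap;
        for (int start = 0; start < tokens.Length; start += stride)
        {
            // The last window may hold fewer than maxTokensPerChunk tokens.
            int count = Math.Min(maxTokensPerChunk, tokens.Length - start);
            chunks.Add(string.Join(" ", tokens, start, count));

            // Stop once the window has reached the final token.
            if (start + count >= tokens.Length) break;
        }
        return chunks;
    }
}
```

With `OverlapChunker.Chunk("a b c d e f g", 4, 1)` the window advances three tokens at a time and produces two chunks, `a b c d` and `d e f g`, which share the token `d`.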