Class Bm25Tokenizer

Namespace
Mythosia.VectorDb
Assembly
Mythosia.VectorDb.Abstractions.dll

BM25 tokenizer backed by the Lucene.Net analyzer Lucene.Net.Analysis.Standard.StandardAnalyzer. Used by the in-memory BM25 index and by the Qdrant sparse vector builder.

public static class Bm25Tokenizer
Inheritance
Bm25Tokenizer

Methods

Analyze(string)

Analyzes text in a single pass and returns both normalized tokens and term frequencies.

public static Bm25Tokenizer.AnalysisResult Analyze(string text)

Parameters

text string

The input text to analyze.

Returns

Bm25Tokenizer.AnalysisResult

An analysis result containing the normalized tokens and their term-frequency map.

ComputeTermFrequencies(IReadOnlyList<string>)

Computes term frequencies for a tokenized document.

public static Dictionary<string, int> ComputeTermFrequencies(IReadOnlyList<string> tokens)

Parameters

tokens IReadOnlyList<string>

The tokens to compute frequencies for.

Returns

Dictionary<string, int>

A dictionary mapping each distinct token to the number of times it occurs in tokens.
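
As a rough illustration of what the returned map contains, the sketch below counts occurrences with a plain loop. It is a stand-alone approximation, not the library's implementation: the TermFrequencySketch class and its Count method are hypothetical names, and ordinal (case-sensitive) key comparison is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in that mirrors the documented signature of
// Bm25Tokenizer.ComputeTermFrequencies; not the library's actual code.
public static class TermFrequencySketch
{
    public static Dictionary<string, int> Count(IReadOnlyList<string> tokens)
    {
        // Assumption: keys are compared ordinally (case-sensitive).
        var frequencies = new Dictionary<string, int>(StringComparer.Ordinal);
        foreach (var token in tokens)
        {
            // TryGetValue leaves count at 0 when the token has not been seen yet.
            frequencies.TryGetValue(token, out var count);
            frequencies[token] = count + 1;
        }
        return frequencies;
    }
}

public static class Program
{
    public static void Main()
    {
        // Tokens as they might look after tokenization (sample data).
        var tokens = new[] { "quick", "brown", "fox", "quick" };
        var frequencies = TermFrequencySketch.Count(tokens);

        foreach (var pair in frequencies)
        {
            Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
    }
}
```

For the sample input, "quick" maps to 2 and the other two tokens map to 1.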

Tokenize(string)

Tokenizes the input text into a list of normalized terms.

public static IReadOnlyList<string> Tokenize(string text)

Parameters

text string

The input text to tokenize.

Returns

IReadOnlyList<string>

A list of normalized, non-stopword tokens.
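
A minimal usage sketch that chains the two methods documented above. It assumes the project references the Mythosia.VectorDb.Abstractions assembly; the Bm25TokenizerUsage class and the sample text are illustrative only. No expected output is shown because the exact tokens depend on how the library configures its analyzer.

```csharp
using System;
using System.Collections.Generic;
using Mythosia.VectorDb;

// Illustrative wrapper class; only the Bm25Tokenizer calls come from the library.
public static class Bm25TokenizerUsage
{
    public static void Main()
    {
        // Arbitrary sample text.
        const string text = "The quick brown fox jumps over the lazy dog. The fox runs.";

        // Step 1: split the text into normalized, non-stopword tokens.
        IReadOnlyList<string> tokens = Bm25Tokenizer.Tokenize(text);

        // Step 2: count how often each token occurs.
        Dictionary<string, int> frequencies = Bm25Tokenizer.ComputeTermFrequencies(tokens);

        Console.WriteLine(string.Join(", ", tokens));
        foreach (var pair in frequencies)
        {
            Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
    }
}
```

When both the token list and the frequency map are needed, Analyze(string) is documented to produce them in a single pass, which avoids the separate counting step.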