Token Counting

Accurate token counting is crucial for RAG applications to ensure chunks fit within LLM context windows. chunkx allows you to customize how "size" is calculated.

Built-in Counters

chunkx comes with several built-in counters for common use cases.

SimpleTokenCounter (Default)

Splits text on whitespace to estimate the token count. Fast, and sufficient for rough estimates or when exact token limits aren't strict.

ByteCounter

Counts the number of bytes in the chunk content. Useful when storage or transmission limits are byte-based.

import (
    "github.com/gomantics/chunkx"
    "github.com/gomantics/chunkx/languages"
)

chunks, err := chunker.Chunk(code,
    chunkx.WithLanguage(languages.Python),
    chunkx.WithMaxSize(4096),
    chunkx.WithTokenCounter(&chunkx.ByteCounter{}),
)

LineCounter

Counts the number of lines. Useful for display purposes or simple splitting.

chunks, err := chunker.Chunk(code,
    chunkx.WithTokenCounter(&chunkx.LineCounter{}),
)

Custom Token Counters

You can implement the TokenCounter interface to provide your own logic.

type TokenCounter interface {
    CountTokens(text string) (int, error)
}

Example of a custom counter:

type WordCounter struct{}

func (w *WordCounter) CountTokens(text string) (int, error) {
    return len(strings.Fields(text)), nil
}

// Usage
chunker.Chunk(code, chunkx.WithTokenCounter(&WordCounter{}))
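Another counter you might write (a sketch, not part of chunkx) counts Unicode code points instead of bytes, which matters for multibyte text such as CJK or emoji:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// RuneCounter counts Unicode code points rather than bytes,
// so each multibyte character counts as one.
type RuneCounter struct{}

func (r *RuneCounter) CountTokens(text string) (int, error) {
	return utf8.RuneCountInString(text), nil
}

func main() {
	c := &RuneCounter{}
	n, _ := c.CountTokens("héllo") // 5 runes, but 6 bytes
	fmt.Println(n)
}
```

Plug it in the same way: `chunkx.WithTokenCounter(&RuneCounter{})`.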

OpenAI Integration (tiktoken)

For production applications using OpenAI models (GPT-4, GPT-3.5), you should use a tokenizer that matches the model's vocabulary. The tiktoken-go library is recommended.

import (
    "github.com/pkoukk/tiktoken-go"
    "github.com/gomantics/chunkx"
)

type TikTokenCounter struct {
    encoding *tiktoken.Tiktoken
}

func NewTikTokenCounter(model string) (*TikTokenCounter, error) {
    encoding, err := tiktoken.EncodingForModel(model)
    if err != nil {
        return nil, err
    }
    return &TikTokenCounter{encoding: encoding}, nil
}

func (t *TikTokenCounter) CountTokens(text string) (int, error) {
    // Encode returns the tokens; we just need the count
    tokens := t.encoding.Encode(text, nil, nil)
    return len(tokens), nil
}

func main() {
    counter, err := NewTikTokenCounter("gpt-4")
    if err != nil {
        panic(err)
    }

    chunks, err := chunker.Chunk(code,
        chunkx.WithMaxSize(8192), // the original GPT-4 context window is 8,192 tokens
        chunkx.WithTokenCounter(counter),
    )
}

This approach keeps chunks as large as possible without exceeding the model's context window.
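If you'd rather avoid the tiktoken dependency, a common rule of thumb is that English text averages roughly four characters per OpenAI token. A minimal sketch of such a counter (the ratio is an approximation and varies by language and content, so leave headroom in your max size):

```go
package main

import "fmt"

// ApproxCounter estimates tokens as ceil(bytes / 4), a rough
// heuristic for English text; treat the result as an estimate only.
type ApproxCounter struct{}

func (a *ApproxCounter) CountTokens(text string) (int, error) {
	return (len(text) + 3) / 4, nil
}

func main() {
	c := &ApproxCounter{}
	n, _ := c.CountTokens("Accurate token counting matters.") // 32 bytes
	fmt.Println(n) // prints 8
}
```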