Usage

chunkx provides a flexible API for chunking code. This guide covers common usage patterns and configuration options.

Basic Usage

The core interface is the Chunker. Create one with NewChunker(), then call its Chunk method with the source code and any options.

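package main

import (
    "fmt"
    "log"

    "github.com/gomantics/chunkx"
    "github.com/gomantics/chunkx/languages"
)

func main() {
    chunker := chunkx.NewChunker()

    code := "// Your source code here..."

    // Specify the language explicitly
    chunks, err := chunker.Chunk(code, chunkx.WithLanguage(languages.Python))
    if err != nil {
        log.Fatal(err)
    }

    // Use the result so the program compiles and reports something useful
    fmt.Printf("produced %d chunks\n", len(chunks))
}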

File-based Chunking

If you are processing files from the filesystem, ChunkFile is a convenient helper that reads the file and automatically detects the language based on the file extension.

// Automatically detects language (e.g., .go -> languages.Go)
chunks, err := chunker.ChunkFile("path/to/main.go")

If the file extension is not recognized, chunkx falls back to a generic line-based chunking strategy.
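
The extension lookup and fallback described above can be pictured with a small sketch. This is not chunkx's actual implementation: the languageByExt table, the detectLanguage function, and the use of plain strings instead of languages.LanguageName are all hypothetical, chosen only to show the idea of "known extension, else fall back".

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// languageByExt is a hypothetical lookup table for illustration only.
// The real table used by chunkx is not shown in this guide.
var languageByExt = map[string]string{
	".go": "go",
	".py": "python",
}

// detectLanguage returns the language for path and reports whether
// the file extension was recognized.
func detectLanguage(path string) (string, bool) {
	ext := strings.ToLower(filepath.Ext(path))
	lang, ok := languageByExt[ext]
	return lang, ok
}

func main() {
	for _, p := range []string{"path/to/main.go", "notes.txt"} {
		if lang, ok := detectLanguage(p); ok {
			fmt.Printf("%s: %s\n", p, lang)
		} else {
			// An unrecognized extension is where a generic
			// line-based strategy would take over.
			fmt.Printf("%s: unrecognized, fall back to line-based chunking\n", p)
		}
	}
}
```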

Configuration Options

You can customize the chunking behavior using functional options passed to Chunk or ChunkFile.

Max Chunk Size

Control the maximum size of each chunk. The size is measured in tokens, so the exact unit depends on the token counter in use; the default is a simple whitespace-based counter.

chunks, err := chunker.Chunk(code,
    chunkx.WithLanguage(languages.Go),
    chunkx.WithMaxSize(1000), // Limit to 1000 tokens
)
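
To get a feel for what a whitespace-based count looks like, here is a minimal sketch. It assumes that "whitespace-based" means splitting on runs of whitespace; countTokens is a hypothetical helper, not part of the chunkx API, and the library's default counter may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
)

// countTokens approximates a whitespace-based token counter by
// splitting the input on runs of whitespace. Hypothetical helper.
func countTokens(s string) int {
	return len(strings.Fields(s))
}

func main() {
	code := "func add(a, b int) int {\n\treturn a + b\n}"
	fmt.Println(countTokens(code)) // prints 11
}
```

Under this counting, punctuation attached to a word (such as "add(a,") counts as part of a single token, so a limit of 1000 is only a rough proxy for what a model tokenizer would report.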

Overlap

You can set an overlap between consecutive chunks to preserve context across boundaries. The value is a percentage from 0 to 100.

chunks, err := chunker.Chunk(code,
    chunkx.WithLanguage(languages.Go),
    chunkx.WithOverlap(10), // 10% overlap
)
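
As a rough illustration of the arithmetic, the sketch below converts a percentage into a token count. It assumes the percentage is taken relative to the maximum chunk size and rounded down; the guide does not state how chunkx computes this, and overlapTokens is a hypothetical helper.

```go
package main

import "fmt"

// overlapTokens converts an overlap percentage into a token count,
// assuming the percentage applies to the maximum chunk size.
// Values outside 0-100 are clamped; integer division rounds down.
func overlapTokens(maxSize, percent int) int {
	if percent < 0 {
		percent = 0
	}
	if percent > 100 {
		percent = 100
	}
	return maxSize * percent / 100
}

func main() {
	fmt.Println(overlapTokens(1000, 10)) // prints 100
}
```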

The Chunk Structure

The Chunk method returns a slice of Chunk structs containing metadata about each segment.

type Chunk struct {
    Content    string                // The actual code content
    StartLine  int                   // Starting line number (1-based)
    EndLine    int                   // Ending line number (1-based)
    StartByte  int                   // Starting byte offset
    EndByte    int                   // Ending byte offset
    NodeTypes  []string              // AST node types included (e.g., "function_declaration")
    Language   languages.LanguageName // Language of the chunked source
}

This metadata is useful for:

  • Highlighting: Using line numbers to show the source.
  • Filtering: Using NodeTypes to filter out specific constructs.
  • Debugging: Verifying exactly what code corresponds to a chunk.
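
The filtering use case can be sketched as follows. To keep the example compilable without the library, it declares a local Chunk type that mirrors only some of the fields listed above; filterByNodeType and the sample data are hypothetical and exist purely for illustration.

```go
package main

import "fmt"

// Chunk mirrors a subset of the fields described in this guide so the
// example is self-contained. It is not the chunkx type itself.
type Chunk struct {
	Content   string
	StartLine int
	EndLine   int
	NodeTypes []string
}

// filterByNodeType returns the chunks whose NodeTypes include nodeType.
func filterByNodeType(chunks []Chunk, nodeType string) []Chunk {
	var out []Chunk
	for _, c := range chunks {
		for _, t := range c.NodeTypes {
			if t == nodeType {
				out = append(out, c)
				break
			}
		}
	}
	return out
}

func main() {
	// Made-up sample data standing in for the result of a Chunk call.
	chunks := []Chunk{
		{Content: "import \"fmt\"", StartLine: 3, EndLine: 3,
			NodeTypes: []string{"import_declaration"}},
		{Content: "func main() {}", StartLine: 5, EndLine: 5,
			NodeTypes: []string{"function_declaration"}},
	}

	for _, c := range filterByNodeType(chunks, "function_declaration") {
		// Line numbers are 1-based, so they can be shown to users as-is.
		fmt.Printf("lines %d-%d: %s\n", c.StartLine, c.EndLine, c.Content)
	}
}
```

With the real library, the same loop would run over the slice returned by Chunk or ChunkFile instead of the sample data.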