chunkx

chunkx is a Go library for AST-based code chunking that implements the cAST (Chunking via Abstract Syntax Trees) algorithm. It is designed to improve code Retrieval-Augmented Generation (RAG) by splitting code into semantically meaningful units rather than at arbitrary line boundaries.

The implementation is based on the paper "cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree".
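
To give a feel for structural chunking, here is a minimal sketch of a split-then-merge pass over a toy syntax tree. It is not chunkx's implementation: the Node type and the chunkNodes, source, and size helpers are hypothetical names invented for this illustration, and size is assumed to be the count of non-whitespace characters. Sibling nodes are packed greedily into a chunk until the limit would be exceeded, and a node that is too large on its own is split by recursing into its children.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Node is a hypothetical, simplified syntax-tree node for this sketch:
// a leaf carries source text, an inner node only groups its children.
type Node struct {
	Text     string
	Children []*Node
}

// source reassembles the text covered by a node.
func source(n *Node) string {
	if len(n.Children) == 0 {
		return n.Text
	}
	var b strings.Builder
	for _, c := range n.Children {
		b.WriteString(source(c))
	}
	return b.String()
}

// size counts non-whitespace characters (an assumed size measure).
func size(s string) int {
	n := 0
	for _, r := range s {
		if !unicode.IsSpace(r) {
			n++
		}
	}
	return n
}

// chunkNodes greedily merges sibling nodes into chunks of at most limit
// characters and recurses into any node that is too large on its own.
func chunkNodes(nodes []*Node, limit int) []string {
	var chunks []string
	var cur strings.Builder
	curSize := 0
	flush := func() {
		if cur.Len() > 0 {
			chunks = append(chunks, cur.String())
			cur.Reset()
			curSize = 0
		}
	}
	for _, n := range nodes {
		text := source(n)
		s := size(text)
		switch {
		case curSize+s <= limit: // fits: merge into the current chunk
			cur.WriteString(text)
			curSize += s
		case s <= limit: // fits alone: start a new chunk
			flush()
			cur.WriteString(text)
			curSize = s
		case len(n.Children) > 0: // too large: split along its children
			flush()
			chunks = append(chunks, chunkNodes(n.Children, limit)...)
		default: // oversized leaf: emit it as its own chunk
			flush()
			chunks = append(chunks, text)
		}
	}
	flush()
	return chunks
}

func main() {
	tree := &Node{Children: []*Node{
		{Text: "import a\n"},
		{Text: "import b\n"},
		{Children: []*Node{
			{Text: "func f() {\n"},
			{Text: "  x := 1\n"},
			{Text: "  y := 2\n"},
			{Text: "}\n"},
		}},
	}}
	for i, c := range chunkNodes(tree.Children, 16) {
		fmt.Printf("chunk %d: %q\n", i+1, c)
	}
}
```

With a limit of 16, the two imports merge into one chunk, while the function node (17 non-whitespace characters) is too large and is split along its children. A real implementation would walk a tree-sitter parse tree rather than a hand-built one.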

Features

  • Syntax-aware chunking: Respects code structure (functions, classes, methods) instead of splitting at arbitrary line boundaries
  • Multi-language support: Works with 30+ languages via tree-sitter parsers
  • Generic fallback: Automatically falls back to line-based chunking for unsupported file types
  • Configurable chunk sizes: Set maximum chunk size in tokens, bytes, or lines
  • Custom token counters: Pluggable interface for custom tokenization strategies
  • Overlap support: Optional chunk overlapping for better context preservation
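
The fallback, size-measure, and overlap features can be pictured with a small standalone sketch. It does not use chunkx's API: Counter, wordCount, and chunkLines are hypothetical names, overlap is assumed to be measured in whole lines, and the size measure is assumed to be additive across lines. The sketch groups consecutive lines until a pluggable counter reports the limit would be exceeded, then starts the next chunk a few lines back.

```go
package main

import (
	"fmt"
	"strings"
)

// Counter is a hypothetical size measure for this sketch; it is not
// chunkx's actual token-counter interface.
type Counter func(text string) int

// wordCount approximates tokens as whitespace-separated words.
func wordCount(text string) int { return len(strings.Fields(text)) }

// chunkLines groups consecutive lines so that each chunk measures at most
// limit under count, and repeats the last overlap lines of a chunk at the
// start of the next one. A single line larger than limit becomes its own chunk.
func chunkLines(src string, limit, overlap int, count Counter) []string {
	lines := strings.SplitAfter(src, "\n")
	// SplitAfter leaves a trailing empty element when src ends in "\n".
	if n := len(lines); n > 0 && lines[n-1] == "" {
		lines = lines[:n-1]
	}
	var chunks []string
	start := 0
	for start < len(lines) {
		end, size := start, 0
		for end < len(lines) {
			s := count(lines[end])
			if end > start && size+s > limit {
				break
			}
			size += s
			end++
		}
		chunks = append(chunks, strings.Join(lines[start:end], ""))
		if end == len(lines) {
			break
		}
		// Step back by overlap lines, but always advance at least one line.
		next := end - overlap
		if next <= start {
			next = start + 1
		}
		start = next
	}
	return chunks
}

func main() {
	src := "a b\nc d\ne f\ng h\n"
	for i, c := range chunkLines(src, 4, 1, wordCount) {
		fmt.Printf("chunk %d: %q\n", i+1, c)
	}
}
```

Passing a different Counter (bytes, lines, or a real tokenizer) changes how the limit is interpreted without touching the grouping logic, which is the idea behind a pluggable token counter.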

Installation

Install chunkx using go get:

go get github.com/gomantics/chunkx

Quick Start

Here's a simple example that chunks a string of Go source code:

package main

import (
    "fmt"
    "github.com/gomantics/chunkx"
    "github.com/gomantics/chunkx/languages"
)

func main() {
    // Create a new chunker
    chunker := chunkx.NewChunker()

    code := `package main

func hello() {
    fmt.Println("Hello, World!")
}`

    // Chunk the code
    chunks, err := chunker.Chunk(code,
        chunkx.WithLanguage(languages.Go),
        chunkx.WithMaxSize(50),
    )
    if err != nil {
        panic(err)
    }

    for i, chunk := range chunks {
        fmt.Printf("Chunk %d:\n%s\n", i+1, chunk.Content)
    }
}