Supported Languages

chunkx uses Tree-sitter to parse source code into abstract syntax trees (ASTs). Working from the AST lets it chunk code according to its structure in more than 30 languages.

Supported List

The following languages are fully supported with AST-based chunking:

Language    Key                   Extensions
Bash        languages.Bash        .sh, .bash
C           languages.C           .c, .h
C++         languages.CPP         .cpp, .cc, .cxx, .hpp, .h, .hh, .hxx
C#          languages.CSharp      .cs
CSS         languages.CSS         .css
CUE         languages.Cue         .cue
Dockerfile  languages.Dockerfile  Dockerfile, .dockerfile
Elixir      languages.Elixir      .ex, .exs
Elm         languages.Elm         .elm
Go          languages.Go          .go
Groovy      languages.Groovy      .groovy, .gradle
HCL         languages.HCL         .hcl, .tf
HTML        languages.HTML        .html, .htm
Java        languages.Java        .java
JavaScript  languages.JavaScript  .js, .jsx, .mjs, .cjs
Kotlin      languages.Kotlin      .kt, .kts
Lua         languages.Lua         .lua
Markdown    languages.Markdown    .md, .markdown
OCaml       languages.OCaml       .ml, .mli
PHP         languages.PHP         .php, .phtml
Protobuf    languages.Protobuf    .proto
Python      languages.Python      .py, .pyi, .pyw
Ruby        languages.Ruby        .rb, .rake, .gemspec
Rust        languages.Rust        .rs
Scala       languages.Scala       .scala, .sc
SQL         languages.SQL         .sql
Svelte      languages.Svelte      .svelte
Swift       languages.Swift       .swift
TOML        languages.TOML        .toml
TypeScript  languages.TypeScript  .ts, .tsx
YAML        languages.YAML        .yaml, .yml

Language Detection

When using ChunkFile, chunkx automatically detects the language from the file extension. If a file matches one of the extensions (or file names, as with Dockerfile) listed above, the corresponding Tree-sitter parser is used.
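
The lookup described above can be pictured with a small Go sketch. It is an illustration only, not chunkx's actual implementation: the extensionToLanguage map, the detectLanguage function, the plain string return values, and the choice to lower-case the extension are all assumptions made for this example, and the map covers only a few rows of the table.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extensionToLanguage is a hypothetical lookup table covering a few rows of
// the supported list. It is not chunkx's internal data structure.
var extensionToLanguage = map[string]string{
	".go":         "Go",
	".py":         "Python",
	".pyi":        "Python",
	".rs":         "Rust",
	".ts":         "TypeScript",
	".tsx":        "TypeScript",
	".yaml":       "YAML",
	".yml":        "YAML",
	".dockerfile": "Dockerfile",
}

// detectLanguage returns a language name for path, or "Generic" when no
// entry matches. Lower-casing the extension is an assumption of this sketch.
func detectLanguage(path string) string {
	// A file named exactly "Dockerfile" has no extension, so match it by name.
	if filepath.Base(path) == "Dockerfile" {
		return "Dockerfile"
	}
	ext := strings.ToLower(filepath.Ext(path))
	if lang, ok := extensionToLanguage[ext]; ok {
		return lang
	}
	return "Generic"
}

func main() {
	for _, p := range []string{"cmd/main.go", "Dockerfile", "notes.txt"} {
		fmt.Printf("%s -> %s\n", p, detectLanguage(p))
	}
}
```

A map keyed by extension keeps the lookup constant-time and makes the unrecognized case explicit, which is where the fallback strategy described next takes over.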

Fallback Strategy

If a language is not recognized, or if you explicitly use languages.Generic, chunkx falls back to a line-based chunking strategy.

The fallback strategy:

  1. Does not parse the code (no AST).
  2. Splits code based on line counts or token limits.
  3. Prefers to break at paragraph boundaries (empty lines) where possible, so that related lines stay together.

This ensures that chunkx can still be used to process any text file, even if semantic structure is not available.