A pure-Kotlin byte-level BPE tokenizer library. GPT-4 compatible, zero ML dependencies, and a minimal API designed to be consumed as a library.
Tessera focuses on doing one thing exceptionally well: turning text into tokens and back again.
Built with Kotlin stdlib only. No DJL, no KInference, no ML frameworks. Every line is readable, auditable, and dependency-free — by design.
The 256-byte base vocabulary guarantees that any UTF-8 string — ASCII, CJK, Arabic, emoji, math symbols — can always be encoded and decoded perfectly.
Uses the exact cl100k_base pre-tokenization regex from GPT-4, producing tokenization that mirrors OpenAI's approach.
decode(encode(text)) == text holds for any valid UTF-8 input. Verified against 2,212 random strings across 6 Unicode categories.
Save a trained tokenizer to JSON and load it back in milliseconds. The vocabulary is fully reconstructed from the merge list — no redundant data stored.
Strings like <|endoftext|> must be explicitly unlocked per encode call. Untrusted input is never misinterpreted as a control token.
Strict explicitApi() enforcement in Gradle. Everything not in the public API is internal. The surface is stable and minimal.
CI on every PR with ktlint, detekt, Kover (≥ 80% coverage threshold), and automated SemVer releases via Conventional Commits.
Four deterministic stages transform any UTF-8 string into a compact token sequence.
The cl100k_base regex splits text into semantic chunks — words, contractions, numbers, punctuation — preventing merges across word boundaries.
Each chunk is encoded to UTF-8 bytes. Every byte (0–255) maps to a base token. This guarantees full coverage of any text without an "unknown" token.
The greedy encode loop applies the learned merge with the lowest rank first, collapsing frequent byte pairs into composite tokens until no known pair remains.
The resulting integer array is compact, deterministic, and reversible. Decoding is a simple vocabulary lookup followed by UTF-8 string construction.
Step-by-step walkthroughs of the encode and decode pipelines.
Pre-tokenize with cl100k_base regex → convert to bytes → greedy BPE merges by lowest rank → compact IntArray.
Vocab lookup per ID → collect byte arrays → concatenate → ByteArray.toString(UTF_8). O(n), fully reversible.
Add the dependency, train on your corpus, encode text. That's the whole API.
// settings.gradle.kts dependencyResolutionManagement { repositories { maven { url = uri("https://jitpack.io") } } } // build.gradle.kts dependencies { implementation("com.github.HectorIFC:tessera:v0.0.7") }
import dev.tessera.BpeTokenizer import dev.tessera.Trainer import dev.tessera.TrainingConfig fun main() { // 1. Train from a corpus val tokenizer = Trainer(TrainingConfig(numMerges = 10_000)) .trainFromFile("corpus/text.txt") // 2. Save for later reuse tokenizer.save("tessera.json") // 3. Load and encode val loaded = BpeTokenizer.load("tessera.json") val ids = loaded.encode("Hello, world!") val text = loaded.decode(ids) println("$ids → $text") // → [72, 101, 281, 111, 44, 319, 33] → Hello, world! }
Production-grade quality gates on every pull request. No exceptions.
explicitApi() enforced on tessera-core.Tessera is the foundation of a larger ecosystem. The next codebase — an embeddings library — will depend on tessera-core as a Gradle dependency. Join early.