v0.0.7

Tokenization,
rebuilt from bytes.

A pure-Kotlin byte-level BPE tokenizer library. GPT-4 compatible, zero ML dependencies, and a minimal API designed to be consumed as a library.

0
automated tests
0%
line coverage
0
strings fuzz tested
0%
round-trip accuracy
0
ML dependencies
0
base vocabulary tokens

Everything you need.
Nothing you don't.

Tessera focuses on doing one thing exceptionally well: turning text into tokens and back again.

🔷

Pure Kotlin

Built with Kotlin stdlib only. No DJL, no KInference, no ML frameworks. Every line is readable, auditable, and dependency-free — by design.

🌐

Full UTF-8 Coverage

The 256-byte base vocabulary guarantees that any UTF-8 string — ASCII, CJK, Arabic, emoji, math symbols — can always be encoded and decoded perfectly.

GPT-4 Compatible

Uses the exact cl100k_base pre-tokenization regex from GPT-4, producing tokenization that mirrors OpenAI's approach.

🔄

Round-trip Guarantee

decode(encode(text)) == text holds for any valid UTF-8 input. Verified against 2,212 random strings across 6 Unicode categories.

💾

Persist & Reload

Save a trained tokenizer to JSON and load it back in milliseconds. The vocabulary is fully reconstructed from the merge list — no redundant data stored.

🔒

Safe Special Tokens

Strings like <|endoftext|> must be explicitly unlocked per encode call. Untrusted input is never misinterpreted as a control token.

📦

Library-first Design

Strict explicitApi() enforcement in Gradle. Everything not in the public API is internal. The surface is stable and minimal.

🛠️

Full Toolchain

CI on every PR with ktlint, detekt, Kover (≥ 80% coverage threshold), and automated SemVer releases via Conventional Commits.

The tokenization pipeline

Four deterministic stages transform any UTF-8 string into a compact token sequence.

Step 01

Pre-tokenize

The cl100k_base regex splits text into semantic chunks — words, contractions, numbers, punctuation — preventing merges across word boundaries.

Step 02

Byte conversion

Each chunk is encoded to UTF-8 bytes. Every byte (0–255) maps to a base token. This guarantees full coverage of any text without an "unknown" token.

Step 03

BPE merge

The greedy encode loop applies the learned merge with the lowest rank first, collapsing frequent byte pairs into composite tokens until no known pair remains.

Step 04

Token IDs

The resulting integer array is compact, deterministic, and reversible. Decoding is a simple vocabulary lookup followed by UTF-8 string construction.

See it in action

Step-by-step walkthroughs of the encode and decode pipelines.

encode Text → Token IDs

Pre-tokenize with cl100k_base regex → convert to bytes → greedy BPE merges by lowest rank → compact IntArray.

decode Token IDs → Text

Vocab lookup per ID → collect byte arrays → concatenate → ByteArray.toString(UTF_8). O(n), fully reversible.

Up and running in minutes

Add the dependency, train on your corpus, encode text. That's the whole API.

build.gradle.kts
// settings.gradle.kts
dependencyResolutionManagement {
    repositories {
        maven { url = uri("https://jitpack.io") }
    }
}

// build.gradle.kts
dependencies {
    implementation("com.github.HectorIFC:tessera:v0.0.7")
}
Main.kt
import dev.tessera.BpeTokenizer
import dev.tessera.Trainer
import dev.tessera.TrainingConfig

fun main() {
    // 1. Train from a corpus
    val tokenizer = Trainer(TrainingConfig(numMerges = 10_000))
        .trainFromFile("corpus/text.txt")

    // 2. Save for later reuse
    tokenizer.save("tessera.json")

    // 3. Load and encode
    val loaded = BpeTokenizer.load("tessera.json")
    val ids     = loaded.encode("Hello, world!")
    val text    = loaded.decode(ids)

    println("$ids → $text")
    // → [72, 101, 281, 111, 44, 319, 33] → Hello, world!
}
Kotlin 2.1.0 JVM 21 Gradle 8.14.1 kotlinx.serialization 1.7.3 Kotest 5.9.1 ktlint 12.1.2 detekt 1.23.7 Kover 0.8.3

Numbers don't lie.

Production-grade quality gates on every pull request. No exceptions.

Line coverage
95%
Measured with Kover. Threshold enforced at 80% in CI.
✓ CI enforced
Test suite
48
8 test classes covering round-trip, BPE correctness, persistence, and API contract.
0 failures
Fuzz round-trip
2,212
Random strings across ASCII, Latin, CJK, emoji, and mixed Unicode. All pass.
100% pass rate
Static analysis
0
Zero detekt violations. Complexity, magic numbers, and naming enforced.
detekt clean
Code style
0
Zero ktlint violations. IntelliJ IDEA code style enforced on every commit.
ktlint clean
Compiler warnings
0
Clean Kotlin 2.1.0 compilation. explicitApi() enforced on tessera-core.
clean build

Built in the open.
Ready for contributors.

Tessera is the foundation of a larger ecosystem. The next codebase — an embeddings library — will depend on tessera-core as a Gradle dependency. Join early.