Tessera — Byte-level BPE Tokenizer for Kotlin

Features

Everything you need.
Nothing you don't.

Tessera focuses on doing one thing exceptionally well: turning text into tokens and back again.

🔷

Pure Kotlin

Built with Kotlin stdlib only. No DJL, no KInference, no ML frameworks. Every line is readable, auditable, and dependency-free — by design.

🌐

Full UTF-8 Coverage

The 256-byte base vocabulary guarantees that any UTF-8 string — ASCII, CJK, Arabic, emoji, math symbols — can always be encoded and decoded perfectly.

⚡

GPT-4 Compatible

Uses the exact cl100k_base pre-tokenization regex from GPT-4, producing tokenization that mirrors OpenAI's approach.

🔄

Round-trip Guarantee

decode(encode(text)) == text holds for any valid UTF-8 input. Verified against 2,212 random strings across 6 Unicode categories.

💾

Persist & Reload

Save a trained tokenizer to JSON and load it back in milliseconds. The vocabulary is fully reconstructed from the merge list — no redundant data stored.

🔒

Safe Special Tokens

Strings like <|endoftext|> must be explicitly unlocked per encode call. Untrusted input is never misinterpreted as a control token.

📦

Library-first Design

Strict explicitApi() enforcement in Gradle. Everything not in the public API is internal. The surface is stable and minimal.

🛠️

Full Toolchain

CI on every PR with ktlint, detekt, Kover (≥ 80% coverage threshold), and automated SemVer releases via Conventional Commits.

How it works

The tokenization pipeline

Four deterministic stages transform any UTF-8 string into a compact token sequence.

Step 01

Pre-tokenize

The cl100k_base regex splits text into semantic chunks — words, contractions, numbers, punctuation — preventing merges across word boundaries.

Step 02

Byte conversion

Each chunk is encoded to UTF-8 bytes. Every byte (0–255) maps to a base token. This guarantees full coverage of any text without an "unknown" token.

Step 03

BPE merge

The greedy encode loop applies the learned merge with the lowest rank first, collapsing frequent byte pairs into composite tokens until no known pair remains.

Step 04

Token IDs

The resulting integer array is compact, deterministic, and reversible. Decoding is a simple vocabulary lookup followed by UTF-8 string construction.

Visualized

See it in action

Step-by-step walkthroughs of the encode and decode pipelines.

encode Text → Token IDs

Pre-tokenize with cl100k_base regex → convert to bytes → greedy BPE merges by lowest rank → compact IntArray.

decode Token IDs → Text

Vocab lookup per ID → collect byte arrays → concatenate → ByteArray.toString(UTF_8). O(n), fully reversible.

Quick start

Up and running in minutes

Add the dependency, train on your corpus, encode text. That's the whole API.

build.gradle.kts

// settings.gradle.kts
dependencyResolutionManagement {
    repositories {
        maven { url = uri("https://jitpack.io") }
    }
}

// build.gradle.kts
dependencies {
    implementation("com.github.HectorIFC:tessera:v0.0.7")
}

Main.kt

import dev.tessera.BpeTokenizer
import dev.tessera.Trainer
import dev.tessera.TrainingConfig

fun main() {
    // 1. Train from a corpus
    val tokenizer = Trainer(TrainingConfig(numMerges = 10_000))
        .trainFromFile("corpus/text.txt")

    // 2. Save for later reuse
    tokenizer.save("tessera.json")

    // 3. Load and encode
    val loaded = BpeTokenizer.load("tessera.json")
    val ids     = loaded.encode("Hello, world!")
    val text    = loaded.decode(ids)

    println("$ids → $text")
    // → [72, 101, 281, 111, 44, 319, 33] → Hello, world!
}

Tech stack

Kotlin 2.1.0 JVM 21 Gradle 8.14.1 kotlinx.serialization 1.7.3 Kotest 5.9.1 ktlint 12.1.2 detekt 1.23.7 Kover 0.8.3

Quality

Numbers don't lie.

Production-grade quality gates on every pull request. No exceptions.

Line coverage

95%

Measured with Kover. Threshold enforced at 80% in CI.

✓ CI enforced

Test suite

8 test classes covering round-trip, BPE correctness, persistence, and API contract.

0 failures

Fuzz round-trip

2,212

Random strings across ASCII, Latin, CJK, emoji, and mixed Unicode. All pass.

100% pass rate

Static analysis

Zero detekt violations. Complexity, magic numbers, and naming enforced.

detekt clean

Code style

Zero ktlint violations. IntelliJ IDEA code style enforced on every commit.

ktlint clean

Compiler warnings

Clean Kotlin 2.1.0 compilation. explicitApi() enforced on tessera-core.

clean build

Tokenization,
rebuilt from bytes.

Everything you need.
Nothing you don't.

Pure Kotlin

Full UTF-8 Coverage

GPT-4 Compatible

Round-trip Guarantee

Persist & Reload

Safe Special Tokens

Library-first Design

Full Toolchain

The tokenization pipeline

Pre-tokenize

Byte conversion

BPE merge

Token IDs

See it in action

Up and running in minutes

Numbers don't lie.

Built in the open.
Ready for contributors.

Tokenization,rebuilt from bytes.

Everything you need.Nothing you don't.

Pure Kotlin

Full UTF-8 Coverage

GPT-4 Compatible

Round-trip Guarantee

Persist & Reload

Safe Special Tokens

Library-first Design

Full Toolchain

The tokenization pipeline

Pre-tokenize

Byte conversion

BPE merge

Token IDs

See it in action

Up and running in minutes

Numbers don't lie.

Built in the open.Ready for contributors.

Tokenization,
rebuilt from bytes.

Everything you need.
Nothing you don't.

Built in the open.
Ready for contributors.