Mosaic — Lookup-based token embeddings for the JVM, in pure Kotlin

Built for storage, lookup, and similarity — nothing more

Mosaic does one thing well: it owns the matrix that turns token IDs into vectors. No training, no GPU, no surprise allocations.

Flat 1D storage

Embeddings live in a single contiguous FloatArray. Cache-friendly row access, ~1 % overhead over the theoretical bound.
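
To make the size claim concrete, here is a small sketch of the arithmetic. It assumes the "theoretical bound" means the raw float32 payload (vocabSize × embeddingDim × 4 bytes); the function name is illustrative and not part of Mosaic.

```kotlin
// Assumption: the theoretical bound is the raw float32 payload,
// i.e. no object headers, padding, or per-row arrays.
fun payloadBytes(vocabSize: Int, embeddingDim: Int): Long =
    vocabSize.toLong() * embeddingDim * Float.SIZE_BYTES

// A flat layout needs one array for the whole table, whereas an
// Array<FloatArray> layout would pay one object header per row.
fun flatArrayLength(vocabSize: Int, embeddingDim: Int): Int =
    Math.multiplyExact(vocabSize, embeddingDim) // throws on Int overflow
```

For a 10 K × 128 table, payloadBytes(10_000, 128) comes to 5,120,000 bytes, a little under 5 MiB.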

6 initializers

uniformDefault, uniform, xavier, he, zeros, constant — all seedable for reproducible runs.
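
A minimal sketch of what seedable initialization can look like. The helper names are hypothetical, and Mosaic's actual Initializer may use different formulas or distributions. The sketch assumes the uniform variants of Xavier and He, with fanIn = vocabSize and fanOut = embeddingDim.

```kotlin
import kotlin.math.sqrt
import kotlin.random.Random

// Hypothetical helpers, not Mosaic's API.
// Fills an array with values drawn uniformly from [-limit, limit).
// The same seed always produces the same array.
fun uniformInit(size: Int, limit: Double, seed: Long): FloatArray {
    val rng = Random(seed)
    return FloatArray(size) { ((rng.nextDouble() * 2.0 - 1.0) * limit).toFloat() }
}

// Xavier/Glorot uniform bound: sqrt(6 / (fanIn + fanOut)).
fun xavierLimit(fanIn: Int, fanOut: Int): Double = sqrt(6.0 / (fanIn + fanOut))

// He/Kaiming uniform bound: sqrt(6 / fanIn).
fun heLimit(fanIn: Int): Double = sqrt(6.0 / fanIn)
```

Calling uniformInit twice with the same size, limit, and seed yields identical arrays, which is what makes a run reproducible.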

Top-K nearest

O(N log K) mostSimilar via a fixed-size min-heap. ~3 ms on 10 K vocab × 128 dim.
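
The heap-based search can be sketched as follows. This is an illustration of the technique under stated assumptions, not Mosaic's implementation: the function name, the Pair-based result type, and the zero-norm handling are all hypothetical.

```kotlin
import java.util.PriorityQueue
import kotlin.math.sqrt

// Hypothetical sketch: cosine similarity of `query` against every row of a
// flat table, keeping only the best `topK` rows in a min-heap.
fun mostSimilarSketch(
    data: FloatArray,
    dim: Int,
    query: FloatArray,
    topK: Int,
): List<Pair<Int, Float>> {
    require(query.size == dim) { "query must have $dim components" }
    require(topK > 0) { "topK must be positive" }

    var qSq = 0.0
    for (x in query) qSq += x.toDouble() * x
    val qNorm = sqrt(qSq)

    // Min-heap ordered by score: the root is the weakest of the current top K.
    val heap = PriorityQueue<Pair<Int, Float>>(topK, compareBy<Pair<Int, Float>> { it.second })
    for (row in 0 until data.size / dim) {
        val off = row * dim
        var dot = 0.0
        var rSq = 0.0
        for (j in 0 until dim) {
            val v = data[off + j].toDouble()
            dot += v * query[j]
            rSq += v * v
        }
        val denom = sqrt(rSq) * qNorm
        // Assumption: a zero vector scores 0 instead of NaN.
        val score = if (denom == 0.0) 0f else (dot / denom).toFloat()
        if (heap.size < topK) {
            heap.add(row to score)
        } else if (score > heap.peek().second) {
            heap.poll()          // evict the weakest candidate, O(log K)
            heap.add(row to score)
        }
    }
    return heap.sortedByDescending { it.second }
}
```

Each of the N rows costs at most one O(log K) heap operation, which is where the O(N log K) bound comes from. The heap never holds more than K entries, so memory stays fixed regardless of vocabulary size.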

Tessera-native

TesseraEmbeddings wires a BPE tokenizer straight into the table. Vocab-size mismatches caught at construction.
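
A construction-time check like the one described could look roughly like this. SketchTokenizer and SketchPipeline are hypothetical stand-ins, since the real TesseraEmbeddings and BpeTokenizer signatures are not shown here.

```kotlin
// Hypothetical stand-in for a tokenizer; not Tessera's actual interface.
interface SketchTokenizer {
    val vocabSize: Int
    fun encode(text: String): IntArray
}

// Hypothetical pipeline: fails fast if the tokenizer and table disagree.
class SketchPipeline(
    private val tokenizer: SketchTokenizer,
    private val vocabSize: Int,
    private val dim: Int,
    private val data: FloatArray,
) {
    init {
        require(data.size == vocabSize * dim) {
            "table holds ${data.size} floats, expected ${vocabSize * dim}"
        }
        // Catch the mismatch here instead of as an out-of-bounds read later.
        require(tokenizer.vocabSize == vocabSize) {
            "tokenizer vocab ${tokenizer.vocabSize} != table vocab $vocabSize"
        }
    }

    // One copied vector per token ID produced by the tokenizer.
    fun encode(text: String): List<FloatArray> =
        tokenizer.encode(text).map { id -> data.copyOfRange(id * dim, (id + 1) * dim) }
}
```

Checking in the constructor means a wrong tokenizer/table pairing fails immediately with a clear message, not on the first rare token ID.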

Binary persistence

16-byte header + raw float32 LE + JSON sidecar with SHA-256. Save/load round-trips exactly.
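
The sketch below shows one way such a layout can be written and read back. The header fields (magic, version, vocabSize, embeddingDim as four little-endian ints) are an assumption chosen to fill 16 bytes, since Mosaic's real header layout is not documented here. The JSON sidecar is omitted, and only the SHA-256 it would carry is computed.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.security.MessageDigest

const val HEADER_BYTES = 16
const val SKETCH_MAGIC = 0x4D4F5341 // hypothetical magic number, not Mosaic's

// Assumed layout: 4 little-endian ints, then raw float32 values.
fun encodeTable(vocabSize: Int, dim: Int, data: FloatArray): ByteArray {
    require(data.size == vocabSize * dim) { "data does not match vocabSize x dim" }
    val buf = ByteBuffer.allocate(HEADER_BYTES + data.size * Float.SIZE_BYTES)
        .order(ByteOrder.LITTLE_ENDIAN)
    buf.putInt(SKETCH_MAGIC).putInt(1).putInt(vocabSize).putInt(dim)
    for (x in data) buf.putFloat(x)
    return buf.array()
}

// Returns (vocabSize, dim, data) after validating the header.
fun decodeTable(bytes: ByteArray): Triple<Int, Int, FloatArray> {
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    require(buf.getInt() == SKETCH_MAGIC) { "bad magic" }
    require(buf.getInt() == 1) { "unsupported version" }
    val vocabSize = buf.getInt()
    val dim = buf.getInt()
    val data = FloatArray(vocabSize * dim) { buf.getFloat() }
    return Triple(vocabSize, dim, data)
}

// Hex digest that a sidecar file could store for integrity checks.
fun sha256Hex(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-256").digest(bytes)
        .joinToString("") { "%02x".format(it) }
```

The floats are copied without any text conversion, so decoding returns the values that were encoded. Comparing sha256Hex of the file against the stored digest would detect a corrupted or truncated file before it is loaded.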

No ML deps

Pure Kotlin stdlib + kotlinx.serialization + Tessera. No DJL, no KInference, no DL4J.

How a lookup actually happens

Four steps from a token ID to a copy of its embedding vector.

Token ID

A non-negative integer in [0, vocabSize), usually produced by Tessera.

Flat offset

Mosaic computes id × embeddingDim to index into the contiguous backing array.

Range copy

System.arraycopy pulls embeddingDim floats into a fresh buffer — no shared references.

Vector out

You get a FloatArray that you can mutate freely. The table is untouched.
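
The four steps above can be sketched in a few lines. FlatTable is a hypothetical stand-in used for illustration; it is not Mosaic's EmbeddingTable, whose exact API is not shown here.

```kotlin
// Hypothetical stand-in for a flat embedding table.
class FlatTable(val vocabSize: Int, val embeddingDim: Int) {
    // Row i occupies [i * embeddingDim, (i + 1) * embeddingDim).
    val data = FloatArray(vocabSize * embeddingDim)

    fun lookup(tokenId: Int): FloatArray {
        // Step 1: the token ID must lie in [0, vocabSize).
        require(tokenId in 0 until vocabSize) {
            "token ID $tokenId out of range [0, $vocabSize)"
        }
        // Step 2: flat offset of the row.
        val offset = tokenId * embeddingDim
        // Step 3: copy embeddingDim floats into a fresh buffer.
        val out = FloatArray(embeddingDim)
        System.arraycopy(data, offset, out, 0, embeddingDim)
        // Step 4: the caller owns `out`; the table is not aliased.
        return out
    }
}
```

Returning a copy costs one small allocation per lookup, but callers can then normalize or scale the vector in place without corrupting the table.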

Visualized

The two hot paths, animated.

Lookup: token ID → flat offset → vector copy.

mostSimilar(query, topK=10): cosine vs every row, top-K in a min-heap.

Quick start

Add the JitPack dependency, create a table, encode some text.

build.gradle.kts Gradle

// settings.gradle.kts
dependencyResolutionManagement {
    repositories {
        mavenCentral()
        maven { url = uri("https://jitpack.io") }
    }
}

// build.gradle.kts
dependencies {
    implementation("com.github.HectorIFC:mosaic:mosaic-core-v0.0.4")
}

Main.kt Kotlin

import dev.mosaic.EmbeddingTable
import dev.mosaic.Initializer
import dev.mosaic.TesseraEmbeddings
import dev.tessera.BpeTokenizer

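fun main() {
    // Load a BPE tokenizer from its JSON file.
    val tokenizer = BpeTokenizer.load("tessera.json")

    // The table needs one row per token, so its vocab size comes from the tokenizer.
    val table = EmbeddingTable.create(
        vocabSize    = tokenizer.vocabSize,
        embeddingDim = 128,
        initializer  = Initializer.uniformDefault(seed = 42L), // fixed seed for reproducible runs
    )

    // Wire tokenizer and table together, then turn text into embedding vectors.
    val pipeline = TesseraEmbeddings(tokenizer, table)
    val vectors = pipeline.encode("Hello, mosaic!")

    // Persist the table in Mosaic's binary format.
    table.save("mosaic.bin")
}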

Stack

Minimal by design.

Kotlin: 2.3.21

JVM: target 21

Gradle: 8.14.1

Kotest: 6.1.11

kotlinx.serialization

Tessera: v0.0.6

ktlint: 12.1.2

detekt: 1.23.8

kover: 0.8.3

Quality

Coverage, linting, and benchmarks aren't an afterthought.

97.6%: line coverage on mosaic-core (threshold 80 %)

91: tests, all passing on every PR

~3 ms: mostSimilar at 10 K vocab × 128 dim

explicitApi: every public symbol carries a KDoc