Building below the framework layer. Where latency and hardware control matter more than abstraction comfort.

CUDA · PTX · Hopper · Blackwell · Inference Runtimes

Telos is my attempt to build a runtime where the machine is treated as the subject, not as a backend to hide behind.

I · The Cathedral

The Machine

substrate · ground truth · ceremony of names

i. Hopper: SM_90 · Tensor Memory Accelerator
ii. Blackwell: SM_100 · 5th-gen Tensor
iii. PTX: parallel thread execution
iv. TMA: async global ↔ shared
v. tcgen05: 5th-gen tensor core mma
vi. mbarriers: phase-tracked synchronization
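
To make "phase-tracked" concrete, here is a minimal host-side C++ model of the bookkeeping an mbarrier performs: a pending-arrival count plus a phase bit that flips each time the count reaches zero. It is a single-threaded illustration under stated assumptions, not device code. `PhaseBarrier`, `arrive`, and `phase_done` are hypothetical names, and the hardware object also tracks transaction bytes for TMA loads, which this sketch leaves out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of mbarrier-style phase tracking.
// Not device code: it only mirrors the count-and-parity bookkeeping.
struct PhaseBarrier {
    std::uint32_t expected;  // arrivals required to complete one phase
    std::uint32_t pending;   // arrivals still missing in the current phase
    std::uint32_t phase;     // parity (0 or 1) of the phase in progress

    explicit PhaseBarrier(std::uint32_t count)
        : expected(count), pending(count), phase(0) {}

    // Record one arrival. When the last arrival lands, the phase
    // completes: the parity flips and the count re-arms for reuse.
    void arrive() {
        assert(pending > 0);
        if (--pending == 0) {
            phase ^= 1u;
            pending = expected;
        }
    }

    // A waiter remembers the parity of the phase it is waiting on.
    // That phase is complete once the barrier has moved past it.
    bool phase_done(std::uint32_t waited_parity) const {
        return phase != waited_parity;
    }
};
```

Because only one parity bit is kept, a waiter has to flip its own expected parity after each successful wait. That is what lets the same barrier be reused across pipeline iterations without being reinitialized.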

II · The Runtime

a serving runtime, opinionated about the metal

Telos

multi-gpu llm serving · blackwell-only

Telos is intended to become a high-performance, multi-GPU LLM serving runtime focused only on Blackwell. It is powered by Telos-owned inference kernels, and everything that feeds those kernels (the scheduler, KV cache, metadata, sampler, and result handling) is meant to add as little overhead as possible.

Kernels · KV Cache · Scheduler · Sampler · Graph buckets
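
"Graph buckets" is not defined on this page. One common reading in serving runtimes, assumed here, is that one CUDA graph is captured ahead of time per fixed batch size, and each incoming batch is padded up to the nearest captured size. A minimal C++ sketch of that lookup follows. `pick_bucket` and the bucket sizes are hypothetical and are not taken from Telos.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper: given the sorted batch sizes for which graphs
// were captured, return the smallest bucket that fits `batch`.
// Returns std::nullopt when the batch exceeds every bucket, in which
// case a runtime would need some fallback path (assumption).
std::optional<std::size_t> pick_bucket(const std::vector<std::size_t>& buckets,
                                       std::size_t batch) {
    // lower_bound finds the first bucket >= batch; buckets must be sorted.
    auto it = std::lower_bound(buckets.begin(), buckets.end(), batch);
    if (it == buckets.end()) return std::nullopt;
    return *it;
}
```

With buckets `{1, 2, 4, 8, 16}`, a batch of 3 would run in the size-4 graph with one padded slot. The trade-off in this scheme is between padding waste (few, coarse buckets) and capture time and memory (many, fine buckets).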

Latency is the architecture. Everything else is a convenience that pays rent in microseconds.

III · The Primitive Layer

expose the machine · remove paperwork

Hexel

cuda / ptx primitive library

Hexel is my CUDA/PTX primitive library. Its purpose is to make brutal low-level GPU programming cleaner without hiding the hardware or taking control away from the kernel author. Hexel should simplify mechanics, not semantics.

Expose the machine · Remove paperwork · Never steal the wheel

IV · The Altar

The Stack


L5 Applications ↓

L4 Telos Runtime ↓

L3 Telos-Owned Kernels ↓

L2 Hexel Core Primitives ↓

L1 CUDA · PTX · Hopper · Blackwell ↓

L1 — Foundation

CUDA, PTX, Hopper, Blackwell. The ground truth every layer above is in conversation with.

V · The Proof

Receipts.

work in the open · benchmarks · notes

01 · CUDA Course on H100 · Course ↗
02 · Telos: Blackwell-only inference runtime · In progress ↗
03 · Hexel: CUDA/PTX primitive library · In progress ↗
04 · Notes on TMA, mbarriers and tcgen05 mma · Writing ↗
05 · Benchmarks: kernel latency on Blackwell · Soon ↗