blog-references

126 Million Steps Per Second (But Why?) — Supplementary Material

Interactive execution traces and implementation code for the blog post.

Interactive Traces

Self-contained HTML pages showing step-by-step execution on the compiled transformer stack machine. Every step is a parabolic attention head operation — dot-product → argmax → value extraction on compiled weight matrices.

Trace Problem Instructions Steps Result
hungarian-trace.html Min-cost perfect matching (10×10) 682 23,167 Cost = 206
sudoku-trace.html Sudoku (constraint propagation search + verification) 7,147 6,787 60/60 verified

Hungarian Algorithm

The full Kuhn-Munkres algorithm runs on the stack machine: dual potentials, augmenting path search, traceback. The 10×10 matrix has cost entries stored in heap memory at addresses 0–99, with potential arrays u[], v[], assignment p[], and working arrays at addresses 100–209.

The trace shows uneven iteration lengths (1,312 to 5,701 steps per row) because later augmenting paths must examine more already-assigned columns.

Sudoku

Two phases: Norvig’s constraint propagation (Python) finds the solution through 172 guesses and 162 backtracks, then the stack machine verifies every placement by loading all 20 peer cells from heap memory and comparing. 1,200 I32.LOAD operations and 1,200 EQ comparisons, each a parabolic attention head lookup.

Implementation Code

The program generators that compile these algorithms to stack machine instructions:

Gotchas

Things we learned building these: