Appearance
CE-Games Chess Engine — Deep Technical Profile
Build timeline — ~10 active days across 4 phases (Feb 10 – Feb 25, 2026)
- Pre-chess scaffold (3 days) — Pong with AI, Sudoku, chess UI shell, CLAUDE.md
- Core engine build-out (3 days) — movegen + perft, search/eval/TT, Polyglot book, PVS/aspiration/futility, staged movegen, sprite pieces
- Eval tuning + optimization + tournaments (5 days) — Texel pipeline, pawn hash cache, sentinel ray optimization, SoA move pool, hand-written eZ80 asm, emu_uci tournament harness (~2083 Elo), difficulty levels, threefold repetition
- Final polish (2 days, after 2-day gap) — check extension + TT ordering bug fix, README
Table of Contents
- The Chess Engine Core
- The GUI
- Technical Tradeoffs & eZ80 Constraints
- The Optimization Journey
- Texel Tuning Pipeline
- Testing & Tournament Infrastructure
- Major Bugs & Debugging Stories
- AI Agent Orchestration & Usage Stats
- Development Timeline
- Key Files Reference
1. Project Overview
A chess engine for the TI-84 Plus CE graphing calculator (Zilog eZ80 @ 48 MHz, 256KB RAM, 3.5MB Flash), claiming to be the strongest chess engine ever built for this platform. The engine achieves ~2083 Elo at its highest difficulty (27s/move) and ~2700 Elo when cross-compiled for desktop.
By the numbers:
- 111 commits across 15 PRs, built in 16 days (Feb 10–25, 2026)
- 8 core engine C files + 1 hand-written eZ80 assembly file
- 13 iteration/variant repos tracking the optimization journey
- ~370 NPS on the eZ80 vs ~5.1M NPS on desktop ARM64 (14,000× slower)
- ~62KB RAM usage out of 256KB available
- Texel tuning on 1M Lichess positions: +52 Elo (single largest gain)
- 100-position benchmark suite with cycle-accurate profiling
- Automated tournament system running the actual eZ80 binary vs Stockfish
The project also includes Pong (AI opponent, 5 levels) and Sudoku (3 difficulties, pencil marks, undo), but chess dominates the engineering effort.
Libraries & Frameworks
Engine core (chess/engine/, C targeting eZ80)
- CE C/C++ Toolchain (v12+) — cross-compiler, linker, and stdlib targeting the TI-84 Plus CE's eZ80; every Makefile pulls
cedev-config --makefile. - Standard C headers only (
stdlib.h,string.h,stdio.h,math.h,sys/timers.h) — no third-party C deps. - Hand-written eZ80 assembly —
engine/src/pick_best.asm, inlineasminzobrist.cfor XOR tables.
Desktop/native engine build
- GNU Make — top-level and per-target Makefiles; native targets include
perft,uci,test-search,test-integration,bench,eval-fen,ablation.
TI-84 Plus CE runtime libraries (pulled in as .8xv AppVars)
- GraphX — 8-bit palette graphics + 60 FPS double-buffering for the on-device UI.
- KeypadC — calculator key scanning.
- FileIOC — flash AppVar access (used for the multi-AppVar Polyglot opening book).
- FontLibC — text rendering.
- LibLoad — dynamic
.8xvlibrary loader.
Tuning & tournament scripts (engine/tuning/, scripts/, Python 3)
- python-chess — PGN parsing for the Texel dataset, UCI subprocess driver for tournament play, board representation everywhere.
- NumPy — gradient math in
texel_tune.py(Adam-style descent on ~1M positions). - zstd CLI — decompresses
.pgn.zstLichess broadcast dumps during dataset build. - Stockfish (external UCI binary) — opponent in
tournament.py,emu_tournament.py, andablation_vs_sf.py.
Emulator-in-the-loop testing
- hunterchen7/ti84ce — the sibling Rust emulator from this portfolio, run as a subprocess by
emu_tournament.pyto measure Elo on cycle-accurate hardware. A customemu_ucieZ80 bridge is patched at the 512-byte@@CMDSLOT@@sentinel so FEN commands reach the engine.
Not used
- No NNUE, no PyTorch/TensorFlow, no neural eval — evaluation is hand-crafted Texel-tuned PeSTO-derived piece-square tables with tapered middlegame/endgame scoring.
2. Pre-Implementation Research & Planning
Before any code was written, I conducted extensive research across multiple Claude.ai conversations that produced the hardware constraints reference document, the full 16-section architecture plan, the Elo measurement methodology, and the competitive landscape analysis. This research phase is what made the 16-day implementation sprint possible.
2.1 Hardware Constraints Research (originated from a DOOM porting investigation)
The foundational platform knowledge actually originated from a separate investigation into porting real DOOM (the actual id Software source, not a raycaster recreation) to the TI-84 CE. That research produced a comprehensive hardware constraints document that was then reused as the starting context for the chess engine planning.
Key hardware facts established through this research:
- CPU: eZ80 @ 48 MHz, 8-bit ALU, 24-bit addressing (ADL mode).
intis 24 bits (3 bytes), not 32. Range ±8,388,607. No hardware multiply — multiplication synthesized from shifts and adds, ~350 cycles for 32×32. Effective throughput for 32-bit-heavy code: ~5-15 MIPS. - RAM partitioning (~133 KB usable): NOT a flat arena.
userMem~65 KB (where .8xp programs load), BSS+Heap ~60 KB shared (static globals and malloc growing toward each other), Stack ~4 KB, OS/libraries ~17 KB. Programs exceeding ~65 KB auto-split by toolchain into companion AppVars. - Flash: 4 MB total, ~3.5 MB usable. Code CANNOT execute from flash (copied to RAM first), but data CAN be read freely. Archived AppVars accessed via
ti_GetDataPtr()return direct flash pointers — zero RAM cost. Post-2019 hardware has a flash cache; sequential reads can outperform RAM because RAM competes with LCD DMA bandwidth. - Display: 320×240 pixels, 8-bit indexed color, VRAM memory-mapped at
0xD40000. - Toolchain: CE C/C++ toolchain (LLVM-based). Compile with
-Ozfor size.
The key implication I identified for chess: opening books and lookup tables go to flash as archived AppVars (essentially free), search/eval code must fit in ~65 KB userMem, transposition table competes with board state and move lists in the ~60 KB heap, and the int = 24-bit gotcha means bitboards require explicit uint64_t with expensive multi-byte operations.
2.2 The 16-Section Architecture Plan
I then conducted a full architecture planning session, working through every major design decision from first principles before implementation began. This produced the chess/TI84_Chess_Engine_Plan.md document.
Why 0x88 over bitboards — the first and most fundamental decision. Bitboards require 64-bit integers; every 64-bit operation on the eZ80 compiles into multi-byte library calls costing dozens of instructions. 0x88 uses a 128-element array where off-board detection is a single AND: if (square & 0x88). Direction arithmetic is just adding offsets. All operations are byte-width — perfect for the 8-bit ALU. I worked through the full index layout and confirmed all coordinate math was native byte operations.
Piece encoding settled on single-byte: type in bits 0-2 (PAWN=1 through KING=6), color in bit 7 (WHITE=0x00, BLACK=0x80). Translation from the Umy int8_t format is a small helper.
Board state worked out to ~430 bytes: squares[128], piece_list[2][16], king_sq[2] (avoids scanning for king during check detection), Zobrist hash, castling rights (4 bits packed), en passant target. Pre-computed castling_rights[128] flash table maps each square to a mask.
Move pool with stack pointer — instead of allocating per-ply move arrays (which would blow the 4 KB stack), a single global pool: pool_moves[2048] + pool_scores[2048] with a stack pointer tracking allocation. Each ply pushes/pops from the pool.
eZ80 optimization notes from planning: Minimize function parameters (eZ80 passes all args on stack, no register calling convention), make search-state variables static globals rather than recursion parameters, use uint8_t/int8_t aggressively since byte-wide operations are the native sweet spot.
Realistic Elo ceiling discussion: With heroic optimization (hand-tuned assembly, Texel tuning, large opening book), the ceiling is ~1800-2000 Elo. The hard wall is search depth — at 2000+ Elo, decisive positions need 12-15 plies, and even at 10K NPS with perfect pruning you can't reach that depth in complex middlegames. Historical validation: best 1980s commercial chess computers (Mephisto MM IV, Fidelity Mach IV) topped out around 2000-2100 on similar effective throughput. The final measured result of ~2083 Elo at 27s/move confirms my pre-implementation estimate was accurate.
2.3 Elo Measurement Methodology
I separately researched how to defensibly measure the engine's Elo. The investigation pulled in Marco Meloni's empirical data (110,000+ games at various node limits, calibrated against the SSDF human rating scale) and established the methodology:
Stockfish's UCI_LimitStrength mode was chosen over raw time throttling. UCI_Elo ranges from 1320-3190, calibrated at 120s+1s, anchored to CCRL 40/4 ±100 Elo. It works by intentionally selecting suboptimal moves with calibrated randomness — much more human-like than just limiting time. Below skill level 10 (~2596 Elo), the rating is hardware/time-control independent.
This methodology makes the engine's Elo claims defensible: anyone can reproduce the results by running the same Stockfish UCI_LimitStrength matches. The qualifier "based on testing against calibrated Stockfish" makes it a fact rather than a subjective claim.
2.4 Competitive Landscape Analysis
I researched existing TI-84 CE chess engines to validate the "strongest engine" claim:
- ChessCE (MateoConLechuga, 2016-2017): ~40K downloads, built on TSCP (~1724 desktop Elo). Known bugs — crashes on checkmate, no draw mechanics. Lacks LMR, futility pruning, most modern search techniques.
- Chess84 (thewarrenjames, June 2025): Most recent competitor, implements many of the same search techniques but author estimates "half the time like an 800, half like a 1200." Tiny opening book (6 hand-curated lines vs my 131K Polyglot entries). No formal Elo testing.
- PAndaContron/ce-chess: Minimal project, 5 commits, not a serious contender.
The decisive differentiator: my engine is the only one with rigorous Elo testing against Stockfish's calibrated strength modes. Even at the lowest time setting (2s/move), the ~1355 Elo already exceeds Chess84's self-assessment, and at 5s+ it's comfortably in the 1700-2100 range.
2.5 Chess Engine Architecture Philosophy
I also explored the fundamental architectural question of eval/search compatibility through a thought experiment: what happens when you swap evaluation functions and search algorithms across engines?
Neural net eval + alpha-beta: The most problematic combination. Alpha-beta needs millions of cheap evaluations; a transformer-style network is expensive per eval. Search depth collapses. Stockfish's NNUE is the pragmatic point on this spectrum — a lightweight neural net designed to be fast enough for deep alpha-beta search.
Classical/NNUE eval + MCTS: MCTS was designed for expensive-but-high-quality evals. Classical eval is cheap but noisy — it benefits from depth, not selectivity. Plugging cheap eval into MCTS means being selective with evaluations that aren't accurate enough to justify selectivity.
The deeper principle: Eval and search co-evolved as point solutions in a tradeoff space. You can't swap one for the other. This understanding informed why the TI-84 CE engine uses a hand-crafted evaluation (cheap, benefits from depth via alpha-beta) rather than attempting to run a neural network on the eZ80.
3. Architecture
3.1 Engine/UI Separation
The chess engine is cleanly separated into a portable C engine (chess/engine/src/, 8 files) and a TI-84 CE GUI (chess/src/main.c). The engine.h API serves as the boundary — the GUI never touches internal board representation directly.
chess/
├── src/main.c — GUI (320×240 LCD, sprite rendering, input)
├── engine/
│ ├── src/
│ │ ├── board.c/h — 0x88 board, piece lists, make/unmake
│ │ ├── movegen.c/h — Pseudo-legal move generation
│ │ ├── eval.c/h — Texel-tuned HCE (PeSTO + features)
│ │ ├── search.c/h — Negamax, PVS, LMR, null-move, killers
│ │ ├── tt.c/h — 4096-entry transposition table (32KB)
│ │ ├── zobrist.c/h — 24-bit Zobrist hashing with inline asm
│ │ ├── engine.c/h — Public API for GUI integration
│ │ ├── book.c/h — Polyglot opening book (multi-AppVar)
│ │ ├── types.h — Core types, 0x88 constants, move encoding
│ │ ├── directions.h — Knight/bishop/rook/king offset arrays
│ │ ├── chdata.h — AppVar layout (sprites + randoms)
│ │ └── pick_best.asm — Hand-written eZ80 assembly
│ ├── test/ — perft, search tests, integration tests, bench
│ ├── tuning/ — Texel tuning pipeline (Python)
│ ├── uci/ — UCI protocol for desktop play
│ └── Makefile — Desktop builds (GCC)
├── bench/ — On-calculator benchmark harness
├── emu_uci/ — Emulator UCI bridge for tournaments
├── tools/ — Book generation, data packing
└── Makefile — CE SDK build for CHESS.8xp3.2 Dual-Target Compilation
The same engine source compiles for both:
- TI-84 CE (CE C toolchain, eZ80 target,
-Oz,-DPAWN_CACHE_SIZE=16 -DSPRITES_EXTERNAL) - Desktop (GCC/Clang, x86/ARM64 target,
-O2,-DNO_BOOK)
This enables desktop benchmarking (fast iteration, Stockfish tournaments) while validating on the actual target via the cycle-accurate eZ80 emulator. A key finding: some optimizations help on eZ80 and hurt on desktop (e.g., sentinel rays: +1.3% on eZ80, -8.4% on desktop).
3.3 Memory Budget (~62KB of 256KB)
| Component | RAM (bytes) |
|---|---|
| Transposition table (4096 × 8) | 32,768 |
| Move pool (2048 × 5, SoA) | 10,240 |
| Pawn cache (32 entries) | ~5,000 |
| Piece sprites (12 × 402) | ~4,800 |
| Zobrist tables | ~3,200 |
| Lock tables | ~1,600 |
| Position history (256 × 3) | 768 |
| History heuristic (2 × 128 × 2) | 512 |
board_t | ~440 |
| Killers (64 × 2 × 3) | 384 |
| UI state + misc | ~2,000 |
| Total | ~62KB |
BSS limit measured at 60,690 bytes. The remaining ~194KB is for TI-OS, stack (~4KB), and other overhead.
Chess Engine Concepts — Quick Glossary
Reference for the terms used throughout this section. Each is a standard technique in computer chess; the interesting part is usually the eZ80-specific tradeoff, not the concept itself.
Board representation
- 0x88 board: A 128-square array where off-board squares have bit 0x88 set. Checking
sq & 0x88 == 0tests "is this square on the board?" in one AND + compare, no bounds check. Used since the 1970s; still the right choice when you don't have fast 64-bit bitwise ops. - Bitboards: The modern alternative — represent the board as 64-bit integers, one bit per square. Great on 64-bit CPUs with hardware
popcountandbitscan. Impractical on the eZ80 (24-bit, no SIMD, 64-bit ops require expensive multi-word library calls). - Piece lists: Per-side arrays of where each piece lives. Avoids scanning all 64 squares when you need "where are the white knights?"
- Mailbox: The generic term for any array-of-squares board representation. 0x88 is a specific mailbox variant.
Hashing
- Zobrist hashing: Precompute a random 64-bit (or 24-bit on eZ80) key for every (piece, square) combination, plus keys for side-to-move, castling rights, en passant square. The board's hash is the XOR of all currently-applicable keys. Incrementally updated on every move: XOR out the old piece's key, XOR in the new. Two positions that look different but hash to the same key are (probabilistically) the same position — enables transposition tables.
- Transposition table (TT): A hash table keyed on the Zobrist hash. Stores the search result for each position (score, best move, depth searched, bound type). When the search hits a position it's already seen (same position reached via a different move order = "transposition"), it can reuse the prior result instead of re-searching. Typical hit rates are 15-30%.
Evaluation
- Piece-Square Tables (PST): A 64-entry score table per (piece type, game phase). Captures "knight on e5 is worth more than knight on a1." Incrementally maintained during move-making for zero-cost evaluation of the positional component.
- PeSTO: Ronald Friederich's publicly-available PST set, tuned via Texel's method against a large database of positions. Widely used as a baseline because the tables are high-quality and free.
- Tapered evaluation: Separate middlegame (mg) and endgame (eg) scores, interpolated by game phase (a scalar 0-24 based on remaining non-pawn material). Lets the eval say "a rook on the 7th rank is worth a lot in the middlegame, less in the endgame" smoothly.
- Texel tuning: A method for automatically fitting eval parameters. Take millions of positions with known game outcomes, compute the eval's predicted result via a sigmoid, minimize mean-squared error between prediction and outcome. Named after Peter Österlund's Texel engine.
Search
- Negamax: The symmetric form of minimax — since chess is zero-sum,
min(a,b) = -max(-a,-b), so you only implement one function with sign flipping. Every modern engine uses negamax. - Alpha-beta pruning: The classic search optimization. Track upper/lower bounds (alpha, beta); prune any branch that can't improve on the best known line. Reduces effective branching factor from ~35 to ~6 with good move ordering.
- Iterative deepening: Search to depth 1, then 2, then 3, ... instead of jumping straight to the target depth. Gives you a best move at every depth (graceful time-up), and the shallower searches populate the TT with good move ordering for the deeper searches.
- PVS (Principal Variation Search): Assume the first move (from TT or iterative deepening) is best; search it with a full window. For every other move, search with a zero-width "null window" — if it fails high, you know the move is good and re-search it properly. Saves time when your move ordering is good (which it usually is, thanks to TT + killers + history).
- Aspiration windows: Instead of searching with
(-∞, +∞)every iteration, search a narrow window around the previous iteration's score (e.g., ±25 centipawns). If the result falls inside, you just saved time. If it falls outside ("fails high/low"), re-search with a wider window. Net win because most positions' scores don't swing dramatically between depths. - Quiescence search: Once you hit your target depth, don't just return the static eval — keep searching captures (and sometimes checks) until the position is "quiet." Prevents the horizon effect where the engine "stops looking" one move before getting its queen taken.
- Delta pruning: Inside quiescence, skip captures that can't possibly raise the score enough to matter (e.g., capturing a pawn won't save a position where you're down a rook). Standard formulation: if
stand_pat + captured_piece_value + margin < alpha, skip the capture.
Move ordering (makes alpha-beta effective)
- TT move: Try the best move from the previous TT hit first. If you searched this position before and found move X was best, it's probably still best.
- MVV-LVA: "Most Valuable Victim, Least Valuable Aggressor." Try capturing moves in order of
victim_value - attacker_value(capturing a queen with a pawn > capturing a pawn with a queen). Crude but extremely fast. - Killer moves: A 2-slot cache per search depth of "quiet moves that caused beta cutoffs at this depth." Empirical observation: a move that refuted one line often refutes sibling lines at the same depth.
- History heuristic: Counter table indexed by (from_sq, to_sq) or (piece, to_sq). Increment when a move causes a beta cutoff. Encodes "this move works well in general."
Pruning/reductions
- Null-move pruning: At each node, try "passing the turn" (a null move). If the opponent still can't improve their position below some bound, the current position is so good we can prune. Saves massive time in quiet positions. Has edge cases — breaks in zugzwang (positions where not moving helps), so it's disabled at low material counts.
- LMR (Late Move Reductions): For moves that aren't the first few tried (so probably not the best), reduce search depth by 1-2 ply. If the reduced search beats alpha, re-search at full depth. Assumes good move ordering so late moves are genuinely worse. Massive elo gain per line of code.
- Check extensions: Extend depth by 1 when giving check. Tactics tend to involve forcing sequences, so you don't want to stop searching in the middle of a check.
eZ80-specific terms
- Sentinel ray: Sliding piece move generation (bishop, rook, queen) normally loops until it goes off the board. Instead of checking
SQ_VALID(sq)at each step, extend the board array to 256 bytes withOFFBOARD = 0xFFfilling the unused indices. The loop terminates by checkingsquares[sq] != OFFBOARD— same check as "is this square empty or enemy piece?", so you get the off-board test for free. Works on eZ80 where branch prediction doesn't exist; can be a regression on modern CPUs where the extra array size hurts cache. int= 24-bit: The eZ80's native word size is 24-bit.sizeof(int)is 3 bytes. Array strides of 4 (shift-by-2) are faster than strides of 3 (multiply-by-3), which drives some weird layout choices — e.g., Zobrist tables stayuint32_teven though only 24 bits are used.
4. The Chess Engine Core
4.1 Board Representation (board.c/h)
0x88 board — the classic scheme where sq & 0x88 detects off-board squares:
squares[256]: The board array. Off-board indices filled withOFFBOARD(0xFF) sentinel, enabling sliding piece loops to terminate by checking for non-PIECE_NONEwithout per-stepSQ_VALID(). The sentinel was extended from 128 to 256 bytes specifically for the eZ80, where the branch predictor doesn't exist and eliminating theSQ_VALIDcheck saves cycles.piece_list[2][16]: Square of each piece, per side. Max 16 pieces.piece_index[128]: Reverse map from square to piece_list index (O(1) lookup).0xFFif empty. Enables O(1) piece removal via swap-with-last.bishop_count[2]: Tracked incrementally for bishop pair bonus.king_sq[2]: Cached king squares for fast check detection.- Incrementally maintained scores:
mg[2],eg[2](PST+material per side),phase(24=opening, 0=endgame),hash(24-bit Zobrist),lock(16-bit TT verification),pawn_hash(separate pawn-only hash for pawn cache).
board_make() (~150 lines): The critical hot-path function. Incrementally updates hash, lock, pawn_hash, PST scores, phase, bishop_count, king_sq, castling rights, and EP square. Uses a castling_mask[128] lookup table — a 128-byte table where corner/king squares AND out the relevant castling bits.
board_unmake(): Reverses make by restoring saved state from undo_t struct (10-11 bytes: captured piece, castling, EP, halfmove, fullmove, hash, lock, pawn_hash).
4.2 Move Generation (movegen.c)
Pseudo-legal generation iterating the piece list (O(piece_count), not O(64)):
- Staged generation:
GEN_CAPTURESfirst, thenGEN_QUIETS. Enables search to try captures before quiets without generating all moves. is_square_attacked(): Pure query function. Checks knight (8 offsets), pawn (2 diagonals), king (8 adjacent), then bishop/queen and rook/queen sliding rays. Uses the OFFBOARD sentinel for loop termination.- Legality fast path:
compute_legal_info()precomputes check, pin, and checker information by scanning rays from the king.move_needs_legality_check()returns 1 only for king moves, EP, pinned piece moves, and in-check moves — saving ~50% of expensiveis_square_attackedcalls.
4.3 Search (search.c)
Negamax with alpha-beta, iterative deepening, and these enhancements:
| Technique | Details |
|---|---|
| PVS | First move gets full window; others get null window [-alpha-1, -alpha], re-search on fail-high |
| Aspiration windows | ±25 from previous iteration's score, widen to full on fail |
| Null-move pruning | R=2 reduction, depth ≥ 3, not in check, non-pawn material required |
| LMR | After 4+ legal moves, depth ≥ 3, for quiet non-promotion moves: reduce by 1 |
| Futility pruning | Depth ≤ 2, static eval + margin ≤ alpha. Margins: 200 (d=1), 500 (d=2) |
| Check extensions | If in check: +1 depth. Limited to 2 per path |
| Quiescence | Max depth 8. Delta pruning (stand pat + 1100 < alpha). Captures only when not in check. |
| Move ordering | TT move (30000) > MVV-LVA captures (10000+) > Killer 1 (9000) > Killer 2 (8000) > Queen promos (+5000) > History heuristic |
| Killers | 2 per ply, FIFO replacement |
| History heuristic | Gravity-based: val += bonus - val * bonus / 16384, clamped to [-4000, 4000] |
| Repetition detection | Walks position history backward by 2, stops at last irreversible move |
Global move pool (SoA layout): pool_moves[2048] and pool_scores[2048]. Plies share this via a stack pointer (move_sp). This avoids ~2KB stack allocation per ply — critical on eZ80 with only ~4KB stack.
Stack overflow guard (eZ80 only): Reads the OS stack limit from port 0xE0003A, adds STACK_RESERVE=400 bytes. Each node checks the frame pointer via lea %0, ix+0 inline asm and bails to evaluate() if too close.
Move variance (for difficulty levels): After search completes, finds all root moves within N centipawns of best and picks one randomly. The PVS window at root is widened by variance + 1 to ensure near-best moves get accurate scores (fixing a critical bug where PVS null-window clipping made all non-best moves score identically).
pick_best_score(): Selection sort to find the highest-scored move. On eZ80, replaced by hand-written assembly (pick_best.asm) that's 2.8× faster than the compiler output. Uses a signed-to-unsigned comparison trick (XOR high byte with 0x80) and keeps the flipped best_score in a register permanently.
4.4 Evaluation (eval.c)
Texel-tuned PeSTO piece-square tables with Stockfish-inspired features:
- Material + PST (from incremental
mg[WHITE] - mg[BLACK]andeg). Values: pawn mg=82/eg=94, knight 337/281, bishop 365/297, rook 477/512, queen 1025/936. - Bishop pair: mg=19, eg=58.
- Tempo: mg=10, eg=9 (side-to-move bonus).
- Pawn structure (via pawn cache):
- Doubled pawns: mg=-12, eg=-3
- Isolated pawns: mg=-12, eg=-17
- Connected pawns: rank-indexed bonus (0, 9, 10, 16, 39, 65)
- Passed pawns: rank-indexed mg/eg bonus (up to 85/229)
- Rook bonuses: Open file (mg=38, eg=24), semi-open (mg=23, eg=11).
- Mobility: Knight (0-8 squares) and bishop (0-13 squares) with Stockfish classical values. Uses pawn attack bitmap from cache for safe-square counting.
- Pawn shield: mg=6 per friendly pawn in front of king.
- Tapered eval:
score = (mg * phase + eg * (24 - phase)) / 24.
Pawn cache: 4-way set-associative, 32 entries. Keyed by pawn_hash. Stores pawn mg/eg scores, per-file rank bitmasks, and a 128-byte pawn_atk[] bitmap (per-square pawn attack info). The pawn_atk array MUST be static, not stack-allocated — putting 128 bytes on the eZ80 stack pushes local variable offsets beyond the efficient IX+displacement range (signed 8-bit: -128 to +127), causing a 14% performance regression across all eval sections.
4.5 Transposition Table (tt.c)
4096 entries × 8 bytes = 32KB. Always-replace policy.
Entry: lock16 (2B, independent 16-bit verification), score (2B), best_move (2B, packed), depth (1B), flag (1B: EXACT/ALPHA/BETA).
Move packing: 6 bits from_sq64 + 6 bits to_sq64 + 1 bit promotion + 2 bits promo type = 16 bits. Capture/EP/double-push flags are NOT stored — caller must verify legality.
The 24-bit hash + 16-bit lock gives 40 bits total of collision resistance (1 in ~1 trillion false positive rate).
4.6 Zobrist Hashing (zobrist.c)
Tables stored as uint32_t (4-byte stride for fast array indexing), but all state variables use zhash_t = unsigned int = 24 bits on eZ80. This hybrid approach avoids both expensive __lxor library calls (from 32-bit state operations) and slow multiply-by-3 array strides (from 3-byte tables).
ZHASH_XOR inline assembly (eZ80 only): Memory-to-memory 3-byte XOR using ld a,(de); xor a,(hl); ld (hl),a repeated 3 times (~24 cycles). The compiler's library call __ixor costs ~45 cycles. A C macro approach was also tried but generated worse code (+20%) due to bad pointer arithmetic for uint8_t* casts.
4.7 Opening Book (book.c)
Polyglot format stored in TI-OS AppVars (flash, max ~65KB each), split across multiple segments. Five tiers: S/M/L/XL/XXL with up to 99 numbered segments each. Auto-detected at runtime from largest to smallest.
Book entries are accessed directly from flash pointers — zero RAM cost. Binary search + weighted random selection with the position hash as seed.
The book trimming pipeline (tools/trim_book.py) uses depth-aware BFS: shallow positions (≤8 ply) get multiple alternatives for variety, deep positions get 1 move for maximum depth. Tournament testing showed book vs no-book is essentially equal at 6000 nodes (~1950 Elo both ways).
5. The GUI
chess/src/main.c renders on the 320×240 LCD using the CE C toolchain's graphx.h library:
- Board: 208×208 pixels (8×26px squares), offset at (16, 16). 20-color indexed palette.
- Double-buffered:
gfx_SetDrawBuffer()/gfx_SwapDraw(). Partial redraw optimization for cursor-only moves. - Pre-rendered sprites: 12 sprites (6 types × 2 colors) with contour reinforcement for crisp edges at 20×20 pixels.
- Piece animation: Linear interpolation, 50ms/square (max 200ms). Piece removed from board during animation, drawn at interpolated position each frame.
- State machine: MENU → DIFFICULTY_SELECT → COLOR_SELECT → PLAYING → PROMOTION/GAMEOVER.
- AI integration: Two-phase render — frame 1 draws "Thinking..." (so display updates), frame 2 calls
engine_think()(blocking). - Board flipping: When playing as Black, all coordinates and D-pad directions are inverted.
6. Technical Tradeoffs & eZ80 Constraints
6.1 Why 0x88 Over Bitboards
The eZ80 is a 24-bit CPU with no SIMD. Bitboard operations (popcount, bitscan, 64-bit AND/OR/XOR) would require expensive multi-word library calls. The 0x88 scheme uses only 8-bit comparisons and additions — native eZ80 operations.
6.2 The 24-bit Integer Problem
On eZ80, unsigned int = 24 bits. This creates a unique optimization landscape:
| Type | eZ80 size | Speed |
|---|---|---|
uint8_t | 1 byte | Native, fast |
int / unsigned int | 3 bytes | Native, fast |
int16_t | 2 bytes | Non-native, slow (masking/sign-extension) |
int32_t / uint32_t | 4 bytes | Library calls (__lxor, __ladd) |
Rules discovered empirically:
- Widen register-bound locals to native
int— beneficial - Do NOT widen struct/array fields — harmful (increases memory traffic)
int16_tfor search scores in struct fields is okay;intfor local variables during computation is better- History table MUST stay
int16_t(widening caused +44.8% moveorder regression) - Array stride 4 (shift-by-2) is faster than stride 3 (multiply-by-3), so Zobrist tables stay
uint32_teven though only 24 bits are used
6.3 -Oz vs -O2
-Oz (optimize for size) decisively beats -O2 on the eZ80: 7,053 cy/node vs 10,005. The larger -O2 binary causes flash cache misses, especially in move ordering (+103.9% slower). Code size dominates performance because the eZ80's flash has 10-cycle latency on cache misses (197 cycles for a full miss).
6.4 Sentinel-Based Sliding Rays
Expanding squares[] from 128 to 256 bytes, filling off-board indices with OFFBOARD (0xFF), lets sliding piece inner loops drop the SQ_VALID() check:
c
// Before (two conditions per iteration):
while (SQ_VALID(target) && squares[target] == PIECE_NONE) target += dir;
// After (one condition per iteration):
while (squares[target] == PIECE_NONE) target += dir;eZ80 result: -1.29% cy/node (the eZ80 has no branch predictor, so eliminating a branch saves cycles predictably).
Desktop result: +83% slower movegen, +109% slower is_square_attacked, -8.4% throughput. The larger board_t (448B vs 320B) hurts L1 cache utilization on modern CPUs where branches are essentially free.
This is the most striking example of platform-divergent optimization in the project.
6.5 Stack Limitations
The TI-OS provides only ~4KB of stack. The chess engine mitigates this with:
- Global move pool instead of per-ply stack arrays (~2KB saved per ply)
- Static
pawn_atk[128]instead of stack-allocated (safe becauseevaluate()is never recursive) board_tandmoves[]declared static in benchmark harness- Runtime stack guard reading the OS stack limit register
6.6 Flash-Resident Data
Opening book entries and Polyglot random numbers are accessed directly from AppVar flash pointers — zero RAM cost. Piece sprites are also loaded from the CHDATA AppVar when SPRITES_EXTERNAL is defined, saving ~2400 bytes of code flash.
6.7 Strategic / "why do this at all" tradeoffs
- Why a new engine when ChessCE exists — ChessCE (~40K downloads, TSCP-based, ~1200–1500 Elo) already occupies the "there's a chess program on your calculator" slot. A new one only makes sense as a demonstrable-best artifact — a hard-to-fake "strongest chess engine ever for this platform" claim that functions as portfolio evidence of engineering judgment, not as a product competing for downloads. You give up the easy "ship something useful" win for a defensible ceiling whose value is in the story of how it was built. Inferred.
- Why target a 48 MHz eZ80 in 2026 — the eZ80 is strategically valuable because it's hostile: no branch predictor, no SIMD, no 64-bit ops, 24-bit
int, ~60 KB usable RAM. That hostility makes optimization work visible — every textbook technique has to be re-evaluated from first principles. A modern target would mean ~5M NPS for free, reducing the project to a Stockfish-feature-list reimplementation. Inferred. - Chasing Elo at a ~370 NPS ceiling instead of shipping a weak-but-simple engine — most users can't tell 1500 from 2000 Elo. Choosing to crowd the ceiling is where the interesting engineering lives: at weak targets you can ignore cache layout, integer width, and hand-written asm; at the ceiling every one of those is load-bearing. The last ~300 Elo cost roughly as much engineering as the first 1500 — maximum returns in portfolio depth, minimum returns in user-visible quality. Inferred.
UCI_LimitStrengthagainst Stockfish as the validation methodology — self-reported Elo from in-house play is unfalsifiable and the standard critique of hobbyist engine claims. Anchoring to Stockfish's CCRL-calibrated skill levels converts the claim from subjective to reproducible. The cost: absolute numbers have caveats (engine-pool Elo vs FIDE, calibrated randomness vs true strength throttling) — but relative comparisons against ChessCE/Chess84 under the same methodology become defensible, which is the only claim that matters. Evidenced in doc §2.4.- Eval and search co-evolved as a locked joint tradeoff — the NNUE investigation surfaced a general principle: eval and search aren't independent, they're point solutions in a joint space. Cheap/noisy eval wants depth (alpha-beta); expensive/accurate eval wants selectivity (MCTS). On eZ80 only one quadrant is physically reachable, so every downstream decision (24-bit Zobrist, PST-only updates, futility margins) is locked in upstream. Evidenced in §2 + NNUE shelving.
6.8 Additional architectural tradeoffs worth naming
- Global move pool + static search state instead of per-ply recursion frames — one 2048-entry pool, shared stack pointer, search globals not passed as parameters. TI-OS gives ~4 KB of stack and the eZ80 ABI passes all arguments on the stack with no register calling convention, so every recursion parameter is a memory write+read. Textbook negamax with per-ply
move_t[256]arrays blows the budget in ~8 plies. Cost: no reentrancy, no thread-safety — irrelevant on a single-core eZ80. Evidenced in planning notes. - 24-bit Zobrist hash + independent 16-bit "lock," not a single 64-bit key — 64-bit XOR on eZ80 is a ~45-cycle library call; 24-bit XOR is ~24 cycles inline asm. Bare 24-bit would collide at 1-in-16M (unacceptable for a TT probed millions of times); splitting into 24+16 pays the cheap XOR twice and lands at ~1-in-10¹² practical collision rate, matching full Zobrist. Inferred from
zobrist.c+tt.c. - Pseudo-legal movegen + fast-path legality filter — only invokes
is_square_attackedon king moves, EP, pinned pieces, or moves made in check (~50% of calls elided). Fully-legal movegen would need bitboard-style pin rays (expensive without 64-bit ops); naive pseudo-legal + test-on-every-move wastes a make/unmake per illegal move. Hybrid is the right point on the eZ80 cost curve. Evidenced inmovegen.c. - Dual-target build (eZ80 + desktop GCC) as first-class infrastructure — same source compiles both; Stockfish tournaments and Texel's 1M-position pipeline would be infeasible at 370 NPS, but running ablations on desktop catches eZ80-only regressions only if you also validate on target (which is what the
emu_ucibridge exists for — runs the actual .8xp binary in a cycle-accurate emulator). Evidence this matters: sentinel rays gained +1.3% on eZ80 but regressed -8.4% on desktop. Desktop alone would have shipped the wrong decision. Evidenced in ablation table. - Texel tuning on Lichess broadcast games, not general Lichess games — 1M positions from GM-level tournament broadcasts, filtered to plies 12–100, not-after-capture, not-in-check. Random Lichess is polluted by blunders at every rating; the eval would learn to exploit patterns that don't exist in strong play. Filter rules exclude positions where tactical search dominates static eval, since Texel is fitting the quiet eval. Inferred from data pipeline.
6.9 Code-level tradeoffs visible in the source
- Two independent Makefiles sharing
engine/src/, not one CMake with target detection —chess/Makefile(cedev/eZ80,-Oz, links asm, includesbook.c) andchess/engine/Makefile(gcc,-O2,-DNO_BOOK) both pull from the same source list.-DNO_BOOKwith#ifndef NO_BOOKbracketing is the minimum possible coupling — the desktop toolchain never needsfileioc.h. CMake + guarded compilation would have been more "correct" but invited a marriage of cedev and standard gcc configs that neither tool is happy about. Cost: two build systems to keep in sync. Evidenced. - Ablation as a build-matrix target, not a runtime flag —
make ablationcompilesuci_full+uci_no_<feat>binaries via-DNO_$$featfor each feature (tempo, pawns, mobility, etc.). A runtimesetoptionflag would have been simpler but leaves dead branches in the binary and muddies per-feature cost measurement. Compile-time dead-code elimination means each variant measures true feature cost. Cost: N binaries to keep in sync. Evidenced inengine/Makefile. emu_ucias a binary-patching bridge, not a socket or virtual serial port —emu_uci/src/main.cdeclaresstatic char cmd_slot[512] = "@@CMDSLOT@@";.emu_tournament.pyfinds the sentinel, writes command args + FEN in-place, recomputes the 8xp checksum, and invokes the emulator once per move. A persistent emulator with piped UCI would have preserved TT across moves, but required either emulator mods or a virtualized serial transport. One-shot-per-move is the cheapest protocol on top of a stateless debug harness — validates Elo of the actual target binary, not throughput. Evidenced.- Public engine API uses
board[8][8]row/col; internals use 0x88 —engine.hexposes a 2D-grid shape to the UI andemu_uciFEN parser;types.hkeeps the 0x88 indexing withOFFBOARD=0xFFsentinel internal. Exposing 0x88 directly would have saved one conversion, but the UI thinks in row/col naturally and keeping 0x88 out of the public API means internals can change indexing without breaking callers. Evidenced. - Hand-written
pick_best.asmas a Makefile source, not__asm__inline —chess/MakefilelistsEXTRA_ASM_SOURCES = engine/src/pick_best.asm; desktop build has no asm at all and the C fallback compiles in. Inline__asm__inside an#ifdef __ez80__would have kept the two implementations in one file. Separate translation units mean the desktop link simply ignores the asm and the C version wins by default — no preprocessor goo. Evidenced. - Per-harness
main()test binaries, not a framework —engine/test/has seven standalone.cfiles (perft, bench, eval_fen, diag_qxa2, …) each a Makefile target. CTest/Unity/Criterion driving one binary would have been more "professional." Each test is a single concern and tournament logic wants Python + python-chess + Stockfish's UCI_LimitStrength, so the language boundary lands where concerns diverge naturally. Evidenced.
7. The Optimization Journey
The engine went through 13 variant repos tracking systematic optimization. Here's the complete narrative:
7.1 Phase 1: Engine Bootstrap (Feb 13)
Four commits in one evening built the entire engine from nothing:
- Move generation with 30/30 perft validation
- Search + eval + TT (negamax, alpha-beta, iterative deepening)
- UI integration
- Integration tests
7.2 Phase 2: Eval Features + Ablation (Feb 15)
Added 6 Stockfish-inspired evaluation features with compile-time #ifndef guards for ablation testing:
| Feature | Impact when removed (vs SF-1700) |
|---|---|
| Mobility | Worst — ~35% swing |
| Pawn structure | Moderate |
| Passed pawns | Moderate |
| Rook on files | Small |
| Pawn shield | Small |
| Tempo | Small |
7.3 Phase 3: Movegen Optimizations (Feb 15)
Five sequential optimizations, each measured on 5 positions × 1000 iterations:
| # | Optimization | Impact |
|---|---|---|
| 1 | O(1) piece-list updates via square index map | -5.1% perft |
| 2 | Staged movegen (captures then quiets) | Helps search pruning |
| 3 | Skip castling attack probes when in check | +6.7% movegen overhead (but saves attack calls in real games) |
| 4 | Precompute check/pin info | ~50% fewer legality checks |
| 5 | Track bishop counts incrementally | -5.9% eval |
7.4 Phase 4: Texel Tuning (+52 Elo) (Feb 16)
The single largest Elo gain. See §8 (Texel Tuning Pipeline) for full details.
7.5 Phase 5: Integer Type Optimization (Feb 16)
Widened hot-path score locals to native int (24-bit). Narrowed phase to uint8_t. Combined: -2.4% cy/node, -379 bytes binary. Widening struct fields was tried and reverted (-5.6% make/unmake regression).
7.6 Phase 6: Pawn Hash Cache (Feb 16-17)
Progressive refinement:
- 2-way set-associative: -14.3% total eval cycles
- 4-way set-associative: additional -6.6%
- Qsearch capture-only fast path: -4.8%
- Combined
build_pawn_info+ pawn attack bitmap: -28.5% eval, -15.7% cy/node overall
7.7 Phase 7: Cycle-Level Micro-Optimizations (Feb 17)
| Change | cy/node Impact | Notes |
|---|---|---|
Native 24-bit zhash_t (hybrid approach) | -0.8% | Tables stay uint32_t, state uses zhash_t |
| Sentinel sliding rays | -1.3% | eZ80-only win; desktop regression |
Hand-written pick_best_score asm | -11.8% (in that function) | 2.8× faster than compiler |
Inline asm ZHASH_XOR | -2.8% make/unmake | Memory-to-memory 3-byte XOR |
| SoA move pool | -1.1% | Eliminated pool_copy overhead |
| Eval loop merging (4→2 loops) | -4.5% profiling | ~1.1% real-world |
sprintf/printf elimination | -7.3KB binary | Purely size reduction |
7.8 Optimizations Tried and Rejected
| Optimization | Result | Why |
|---|---|---|
| Division-free tapered eval | 0% | Compiler already uses multiply-by-reciprocal for /24 |
| 24-bit PRNG | Skipped | Both PRNGs called <2000 times per game |
| Eval deduplication (parameterized loops) | -2.3% | Prevents BIT instruction for constant color checks |
| Local pointer caching | 0% | LTO sees through the alias |
| Counter-move heuristic | -1% | Committed without testing, immediately reverted |
| Stack-allocated pawn_atk[128] | -14% | IX+displacement exceeds efficient signed 8-bit range |
| Pure zhash_t tables (3-byte stride) | -1.2% | multiply-by-3 slower than shift-by-2 |
| NNUE neural network eval | Not merged | Too expensive for 48MHz without SIMD |
7.9 NNUE Exploration (Shelved)
Architecture: 768→32→1 with SCReLU activation, int8 quantized weights (24,772 bytes). Implemented on feature/nnue branch but never merged. The 32-neuron hidden layer required 32 multiply-accumulate operations per eval call on a CPU with no SIMD, and the 24KB feature transformer weight matrix would dominate flash reads.
7.10 Final Performance Profile (100 positions, 1000 nodes each)
| Position Category | cy/node | eval% | movegen% | make% |
|---|---|---|---|---|
| Positional middlegames | 470,211 | 38% | 21% | 13% |
| Tactical middlegames | 337,437 | 34% | 19% | 14% |
| Complex endgames | 272,686 | 29% | 16% | 16% |
| Simple endgames | 133,155 | 23% | 14% | 17% |
Average ~200K cy/node = ~2.7ms/node at 48MHz = ~370 NPS.
8. Texel Tuning Pipeline
8.1 Methodology
Texel tuning optimizes evaluation parameters by minimizing mean squared error between sigmoid(K × static_eval) and actual game outcomes:
MSE = mean( (sigmoid(K × eval) - label)² )Where label is 1.0 (win), 0.5 (draw), 0.0 (loss) from side-to-move perspective.
8.2 Data Source
Lichess broadcast database (GM-level tournament games, not random Lichess games). 25 months from 2024-01 through 2026-01. Positions filtered: ply 12-100, not after capture, not in check. Two datasets: 60K (pilot) and 1M positions (production).
8.3 Optimization Algorithm
Adam gradient descent with analytical gradients (not scipy). Key settings:
- Sigmoid slope K: found via golden-section 1D search (~0.00652)
- Learning rate: 0.02
- L2 regularization centered at scale=1.0 (pulling toward baseline)
- 90/10 train/validation split
- Best model selected by validation MSE
8.4 Parameters Tuned
Scale factors for 14 scalar params (bishop pair, tempo, doubled/isolated pawns, rook files, shield — each mg/eg), 12 PST row scales, rank-indexed pawn bonuses, and mobility bucket tables. Expanded from 14 to 108 parameters across rounds.
8.5 Results Across Rounds
| Round | Games | Elo vs Baseline | Significant? |
|---|---|---|---|
| Round 1 (1M positions) | 4000 | +52 (±10) | Yes |
| Round 2 vs Round 1 | 4000 | -1.6 (±10) | No |
| Round 3 vs Round 1 | 4000 | -5.2 (±10) | No |
| Round 4 expanded (96 params) | 1000 | +11 (±20) | No |
| Round 4b constrained | 4000 | +0.1 (±10) | No |
| Round 5 (108 params, material split) | 4000 | +8 (±11) | Borderline |
Key takeaway: Round 1 delivered a clearly significant +52 Elo. All subsequent rounds produced marginal improvements. Further tuning was abandoned.
8.6 Application
apply_texel_params.py reads JSON output and patches eval.c using regex-based text substitution — replaces #define values and array initializers directly in C source.
9. Testing & Tournament Infrastructure
9.1 Test Suites
| Tool | Purpose | Key detail |
|---|---|---|
perft.c | Move generation correctness | 7 standard + 22 edge-case positions |
test_search.c | Search/eval correctness | 8 categories: mate finding, stalemate, draws, incremental eval consistency, 200-halfmove self-play |
test_integration.c | Public API testing | 11 categories covering the full engine.h API |
bench.c (desktop) | Component-level benchmarking | 100 positions from 14 standard suites, ns/call for each component |
bench/ (on-calc) | eZ80 cycle profiling | Same 100 positions but with 48MHz hardware timer, per-section cycle breakdown |
9.2 Tournament Infrastructure
Desktop tournaments (engine/tournament.py): Compiles UCI engine, plays against Stockfish at configurable Elo levels (1320-2420, step 100). 20 games per level, 5 concurrent matches, 0.1s/move.
Emulator tournaments (chess/emu_tournament.py): Runs the actual eZ80 binary inside the cycle-accurate TI-84 CE emulator against Stockfish. The key innovation:
- Reads the EMUUCI.8xp binary
- Finds the
@@CMDSLOT@@sentinel (512 bytes) - Patches each position's FEN + settings into the slot
- Recomputes .8xp file checksum
- Launches the emulator subprocess
- Parses
"MOVE xxxx"from debug output - Feeds the move to Stockfish via UCI
This measures the engine at its true eZ80 calculation speed (~370 NPS), not desktop speed.
Ablation testing (engine/ablation.py): Plays the full engine against versions compiled with -DNO_<FEATURE> to measure each eval feature's contribution.
9.3 Elo Results
| Platform | Config | Elo | Method |
|---|---|---|---|
| eZ80 | Easy (900ms, var=10) | ~1320 | 30 games vs SF-1320 |
| eZ80 | Medium (3s) | ~1442 | 30 games vs SF-1500 |
| eZ80 | Hard (9s) | ~1712 | 30 games vs SF-1700 |
| eZ80 | Expert (13.5s) | ~1958 | 30 games vs SF-1900 |
| eZ80 | Master (27s) | ~2083 | 100 games vs SF-2100 |
| Desktop | 0.01s/move | ~2583 | 100 games vs SF-2600 |
| Desktop | 0.1s/move | ~2700 | 100+189 games vs SF-2600/2700 |
10. Major Bugs & Debugging Stories
10.1 The Queen Hang Bug (PVS Scoring)
Symptom: Engine played Qxa2, hanging the queen to Rxa2.
Root cause: PVS null-window search at root scored terrible moves identically to the best. When move_variance > 0, the engine randomly selected among "equally scored" root moves — but PVS clips all non-PV moves to the same alpha value, making Qxa2 (losing the queen) look as good as the best move.
Investigation: Spanned multiple sessions. First confirmed the bug exists on desktop (not eZ80-specific). Quiescence correctly saw Rxa2, but PVS window clipping masked the true score at every search depth (1-8).
Fix attempts:
- Full window at root: catastrophic — -708 Elo (PVS exists for a reason)
- Wider PVS window by variance amount: worked.
pvs_floor = alpha - varianceinstead ofalpha
Variance sweep: var=0: 0 Elo loss. var=5: 0 Elo loss (sweet spot). var=10: -147 Elo. var=15: catastrophic.
10.2 The 0-30 Tournament Disaster
Symptom: After implementing partial depth reporting, engine went from competitive to 0-30 in tournaments.
Root cause: search_best_root_move was being used to override result.best_move at timeout, but if no root move was scored in the current iteration, the fallback was the first generated move (a2a3 or h7h6), not the PV move.
Fix: Added search_root_move_scored flag. Documented in MEMORY.md as a "Dangerous Pattern."
10.3 Check Extension + TT Ordering Bug (PR #15)
Symptom: Engine played Be7 (losing to mate-in-2) instead of finding the defense.
Root cause: Check extensions increase depth, so the TT caches results at the extended depth. On the next iteration, the TT probe returns the cached result BEFORE the check extension can trigger, locking the engine to depth 1 in check positions.
Discovery: I noticed the bad move and directed Claude to investigate. Claude initially got the chess analysis wrong; I corrected: "there's a mate-in-2 with both moves being check."
10.4 The Timer Failure Cascade
Symptom: Late-endgame tournament positions terminated with emu_error.
Root cause chain: check_time() checked the node limit FIRST with early return, preventing the time check from running. With time_fn = NULL and search_deadline = 0, search never stopped on time. Fell back to 30K node limit, which at ~300 NPS took ~96 seconds, exceeding the emulator timeout.
Fix: Added search_node_deadline as a hard safety fallback (~1 node/ms with 3× headroom). Also made search_stopped volatile (LTO was optimizing away reads).
10.5 The Book That Never Worked
Multiple rounds of debugging:
- Missing CHBKRN.8xv (shared randoms file)
- Tier mismatch: baked large book but compiled with
BOOK=medium const char *pointer arrays had incorrect behavior on eZ80 24-bit target- After fixing all of the above, still failed on the custom emulator — worked fine on CEmu reference emulator
- Root cause: custom emulator's VAT reconstruction didn't properly expose AppVars
11. AI Agent Orchestration & Usage Stats
11.1 Session Data
| Metric | Value |
|---|---|
| Main sessions | 19 session directories |
| Main JSONL files | 1 file, 2.2 MB |
| Subagent files | 174 files, ~18 MB |
| Compact summaries | 71 files |
| Date range | Feb 10 – Mar 27, 2026 |
11.2 Human-AI Collaboration
I directed every architectural and strategic decision; the AI implemented code per my direction. Key aspects of my leadership:
- Architectural decisions: Chose 0x88 board representation (over bitboards — impractical on 24-bit CPU), Polyglot book format, the phased engine development approach (movegen → search/eval → UI integration → tests)
- Benchmarking methodology: Designed the empirical approach — insisted on eZ80-only benchmarks ("desktop benchmarks are pointless"), established the 100-position benchmark suite as the single source of truth, required every optimization to be measured before committing
- Quality enforcement: Caught the AI committing code without testing ("yes i want to actually test things before making commits"), demanded accurate numbers, rejected over-engineered explanations
- Strategic direction: Explored NNUE as a potential approach (my idea, not the AI's), evaluated the tradeoffs, and made the call to shelve it. Designed the difficulty/variance system concept. Directed the Texel tuning pipeline.
- Course corrections: Caught hallucinated performance ratios ("100,000× slower" — actual: ~14,000×), corrected chess analysis errors (the AI got a mate-in-2 analysis wrong), stopped the AI from overclaiming ("let's not say chess is solved")
Specific instances where I corrected the AI:
- AI committed counter-move heuristic without benchmarking — I enforced the test-first rule
- AI overcomplicated explanations of benchmark paradoxes — I cut through the noise
- AI hallucinated performance ratios ("100,000× slower") — I caught it, actual ratio was ~14,000×
- AI misattributed NNUE exploration as "AI proposed, human rejected" — I corrected: it was my idea to explore, my evaluation to shelve
- AI miscalculated emulator tournament timing ("15-30 minutes each") — I corrected: games run at emulator speed, not real-time
- AI initially got the check extension bug chess analysis wrong — I corrected: "there's a mate-in-2 with both moves being check"
11.3 Key Methodology
my development loop: implement → benchmark on 100 positions → commit or revert. Every optimization has a measured cy/node or Elo impact. The RESULTS.md file serves as the single source of truth. The AI executes this loop; I reviews the results and decides what ships.
12. Development Timeline
| Date | Day | Key Achievement |
|---|---|---|
| Feb 10 | 1 | Pong game complete |
| Feb 11 | 2 | Sudoku game (PR #1) |
| Feb 12 | 3 | Chess UI only (PR #2) |
| Feb 13 | 4 | Full chess engine — movegen, search, eval, TT, integration |
| Feb 15 | 6 | Eval features + ablation. Movegen optimizations. Opening book. |
| Feb 16 | 7 | Texel tuning (+52 Elo). Integer type optimization. Pawn cache. |
| Feb 17 | 8 | 5 PRs in one day. Sentinel rays, difficulty system, PVS bug fix, tournament system. |
| Feb 18 | 9 | Emulator tournament system. CHDATA AppVar consolidation. |
| Feb 19 | 10 | Hand-written eZ80 assembly (-11.8% pick_best). SoA move pool. |
| Feb 20-25 | 11-16 | Check extension bug fix. Threefold repetition. Final polish. |
Total: 111 commits, 15 PRs (14 merged), 16 days. From nothing to ~2083 Elo on a calculator.
13. Key Files Reference
Engine Core
| File | Lines | Purpose |
|---|---|---|
engine/src/search.c | ~800 | Negamax, PVS, LMR, null-move, quiescence, iterative deepening |
engine/src/board.c | ~600 | 0x88 board, piece lists, make/unmake |
engine/src/eval.c | ~500 | Texel-tuned HCE, pawn cache, mobility, tapered eval |
engine/src/movegen.c | ~400 | Pseudo-legal generation, attack detection, legality fast path |
engine/src/book.c | ~350 | Polyglot book, multi-AppVar, binary search, weighted selection |
engine/src/engine.c | ~350 | Public API, game status, difficulty configuration |
engine/src/zobrist.c | ~150 | Zobrist tables, inline asm XOR, PRNG |
engine/src/tt.c | ~100 | 4096-entry TT, move packing, always-replace |
engine/src/pick_best.asm | ~60 | Hand-written eZ80 assembly for move scoring |
src/main.c | ~1200 | GUI: rendering, input, animation, state machine |
Testing & Tournaments
| File | Purpose |
|---|---|
engine/test/bench.c (25KB) | 100-position desktop benchmark |
bench/src/main.c | On-calculator cycle profiling benchmark |
bench/RESULTS.md | Complete optimization history with measurements |
engine/test/perft.c | 29-position move generation validation |
engine/test/test_search.c | Search correctness (8 categories) |
engine/test/test_integration.c | Full API integration tests (11 categories) |
engine/tournament.py | Desktop Stockfish tournaments |
emu_tournament.py | Emulator-based eZ80 tournaments |
engine/ablation.py | Eval feature ablation testing |
Texel Tuning
| File | Purpose |
|---|---|
engine/tuning/texel_tune.py (32KB) | Adam gradient descent on 1M positions |
engine/tuning/build_texel_dataset.py | Lichess broadcast PGN → CSV dataset |
engine/tuning/apply_texel_params.py | JSON results → C source patching |