CE-Games Chess Engine — Deep Technical Profile
Table of Contents
- Project Overview
- Pre-Implementation Research & Planning
- Architecture
- The Chess Engine Core
- The GUI
- Technical Tradeoffs & eZ80 Constraints
- The Optimization Journey
- Texel Tuning Pipeline
- Testing & Tournament Infrastructure
- Major Bugs & Debugging Stories
- AI Agent Orchestration & Usage Stats
- Development Timeline
- Key Files Reference
1. Project Overview
A chess engine for the TI-84 Plus CE graphing calculator (Zilog eZ80 @ 48 MHz, 256KB RAM, 3.5MB Flash), claiming to be the strongest chess engine ever built for this platform. The engine achieves ~2083 Elo at its highest difficulty (27s/move) and ~2700 Elo when cross-compiled for desktop.
By the numbers:
- 111 commits across 15 PRs, built in 16 days (Feb 10–25, 2026)
- 8 core engine C files + 1 hand-written eZ80 assembly file
- 13 iteration/variant repos tracking the optimization journey
- ~370 NPS on the eZ80 vs ~5.1M NPS on desktop ARM64 (14,000× slower)
- ~62KB RAM usage out of 256KB available
- Texel tuning on 1M Lichess positions: +52 Elo (single largest gain)
- 100-position benchmark suite with cycle-accurate profiling
- Automated tournament system running the actual eZ80 binary vs Stockfish
The project also includes Pong (AI opponent, 5 levels) and Sudoku (3 difficulties, pencil marks, undo), but chess dominates the engineering effort.
2. Pre-Implementation Research & Planning
Before any code was written, Hunter conducted extensive research across multiple Claude.ai conversations that produced the hardware constraints reference document, the full 16-section architecture plan, the Elo measurement methodology, and the competitive landscape analysis. This research phase is what made the 16-day implementation sprint possible.
2.1 Hardware Constraints Research (originated from a DOOM porting investigation)
The foundational platform knowledge actually originated from a separate investigation into porting real DOOM (the actual id Software source, not a raycaster recreation) to the TI-84 CE. That research produced a comprehensive hardware constraints document that was then reused as the starting context for the chess engine planning.
Key hardware facts established through this research:
- CPU: eZ80 @ 48 MHz, 8-bit ALU, 24-bit addressing (ADL mode). `int` is 24 bits (3 bytes), not 32 — range ±8,388,607. No hardware multiply — multiplication is synthesized from shifts and adds, ~350 cycles for a 32×32. Effective throughput for 32-bit-heavy code: ~5-15 MIPS.
- RAM partitioning (~133 KB usable): NOT a flat arena. `userMem` ~65 KB (where .8xp programs load), BSS+heap ~60 KB shared (static globals and malloc growing toward each other), stack ~4 KB, OS/libraries ~17 KB. Programs exceeding ~65 KB are auto-split by the toolchain into companion AppVars.
- Flash: 4 MB total, ~3.5 MB usable. Code CANNOT execute from flash (it is copied to RAM first), but data CAN be read freely. Archived AppVars accessed via `ti_GetDataPtr()` return direct flash pointers — zero RAM cost. Post-2019 hardware has a flash cache; sequential reads can outperform RAM because RAM competes with LCD DMA for bandwidth.
- Display: 320×240 pixels, 8-bit indexed color, VRAM memory-mapped at `0xD40000`.
- Toolchain: CE C/C++ toolchain (LLVM-based). Compile with `-Oz` for size.
The key implication Hunter identified for chess: opening books and lookup tables go to flash as archived AppVars (essentially free), search/eval code must fit in ~65 KB userMem, transposition table competes with board state and move lists in the ~60 KB heap, and the int = 24-bit gotcha means bitboards require explicit uint64_t with expensive multi-byte operations.
2.2 The 16-Section Architecture Plan
Hunter then conducted a full architecture planning session, working through every major design decision from first principles before implementation began. This produced the chess/TI84_Chess_Engine_Plan.md document.
Why 0x88 over bitboards — the first and most fundamental decision. Bitboards require 64-bit integers; every 64-bit operation on the eZ80 compiles into multi-byte library calls costing dozens of instructions. 0x88 uses a 128-element array where off-board detection is a single AND: if (square & 0x88). Direction arithmetic is just adding offsets. All operations are byte-width — perfect for the 8-bit ALU. Hunter worked through the full index layout and confirmed all coordinate math was native byte operations.
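The core trick is compact enough to sketch in C. A minimal illustration — macro and function names here (`SQ`, `SQ_VALID`, `step`) are assumptions, not necessarily the engine's actual identifiers:

```c
#include <stdint.h>

/* 0x88 square index: file in bits 0-2, rank in bits 4-6.
   Bit 3 or bit 7 set means the square is off the board. */
#define SQ(file, rank) (((rank) << 4) | (file))
#define SQ_VALID(sq)   (((sq) & 0x88) == 0)
#define SQ_FILE(sq)    ((sq) & 7)
#define SQ_RANK(sq)    ((sq) >> 4)

/* Direction arithmetic is plain byte addition; e.g. 0x21 is
   "up two ranks, right one file" (a knight move). */
static int step(int sq, int offset) { return sq + offset; }
```

A knight on `0x37` (the h-file) stepped by `0x21` lands on `0x58`, and the single AND with `0x88` flags it as off-board — no per-file or per-rank bounds checks are needed.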
Piece encoding settled on single-byte: type in bits 0-2 (PAWN=1 through KING=6), color in bit 7 (WHITE=0x00, BLACK=0x80). Translation from the UI's int8_t format is a small helper.
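That single-byte layout can be sketched as follows (macro names are illustrative):

```c
#include <stdint.h>

/* Type in bits 0-2 (PAWN=1 .. KING=6), color in bit 7. */
enum { PIECE_NONE = 0, PAWN = 1, KNIGHT, BISHOP, ROOK, QUEEN, KING };
#define COLOR_WHITE 0x00
#define COLOR_BLACK 0x80

#define MAKE_PIECE(color, type) ((uint8_t)((color) | (type)))
#define PIECE_TYPE(p)           ((p) & 0x07)
#define PIECE_COLOR(p)          ((p) & 0x80)
```

Both extractions compile to a single byte-wide AND, the eZ80's native operation width.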
Board state worked out to ~430 bytes: squares[128], piece_list[2][16], king_sq[2] (avoids scanning for king during check detection), Zobrist hash, castling rights (4 bits packed), en passant target. Pre-computed castling_rights[128] flash table maps each square to a mask.
Move pool with stack pointer — instead of allocating per-ply move arrays (which would blow the 4 KB stack), a single global pool: pool_moves[2048] + pool_scores[2048] with a stack pointer tracking allocation. Each ply pushes/pops from the pool.
eZ80 optimization notes from planning: Minimize function parameters (eZ80 passes all args on stack, no register calling convention), make search-state variables static globals rather than recursion parameters, use uint8_t/int8_t aggressively since byte-wide operations are the native sweet spot.
Realistic Elo ceiling discussion: With heroic optimization (hand-tuned assembly, Texel tuning, large opening book), the ceiling is ~1800-2000 Elo. The hard wall is search depth — at 2000+ Elo, decisive positions need 12-15 plies, and even at 10K NPS with perfect pruning you can't reach that depth in complex middlegames. Historical validation: best 1980s commercial chess computers (Mephisto MM IV, Fidelity Mach IV) topped out around 2000-2100 on similar effective throughput. The final measured result of ~2083 Elo at 27s/move confirms Hunter's pre-implementation estimate was accurate.
2.3 Elo Measurement Methodology
Hunter separately researched how to defensibly measure the engine's Elo. The investigation pulled in Marco Meloni's empirical data (110,000+ games at various node limits, calibrated against the SSDF human rating scale) and established the methodology:
Stockfish's UCI_LimitStrength mode was chosen over raw time throttling. UCI_Elo ranges from 1320-3190, calibrated at 120s+1s, anchored to CCRL 40/4 ±100 Elo. It works by intentionally selecting suboptimal moves with calibrated randomness — much more human-like than just limiting time. Below skill level 10 (~2596 Elo), the rating is hardware/time-control independent.
This methodology makes the engine's Elo claims defensible: anyone can reproduce the results by running the same Stockfish UCI_LimitStrength matches. The qualifier "based on testing against calibrated Stockfish" turns the strength claim into a reproducible measurement rather than a subjective assessment.
2.4 Competitive Landscape Analysis
Hunter researched existing TI-84 CE chess engines to validate the "strongest engine" claim:
- ChessCE (MateoConLechuga, 2016-2017): ~40K downloads, built on TSCP (~1724 desktop Elo). Known bugs — crashes on checkmate, no draw mechanics. Lacks LMR, futility pruning, most modern search techniques.
- Chess84 (thewarrenjames, June 2025): Most recent competitor, implements many of the same search techniques but author estimates "half the time like an 800, half like a 1200." Tiny opening book (6 hand-curated lines vs Hunter's 131K Polyglot entries). No formal Elo testing.
- PAndaContron/ce-chess: Minimal project, 5 commits, not a serious contender.
The decisive differentiator: Hunter's engine is the only one with rigorous Elo testing against Stockfish's calibrated strength modes. Even at the lowest time setting (2s/move), the ~1355 Elo already exceeds Chess84's self-assessment, and at 5s+ it's comfortably in the 1700-2100 range.
2.5 Chess Engine Architecture Philosophy
Hunter also explored the fundamental architectural question of eval/search compatibility through a thought experiment: what happens when you swap evaluation functions and search algorithms across engines?
Neural net eval + alpha-beta: The most problematic combination. Alpha-beta needs millions of cheap evaluations; a transformer-style network is expensive per eval. Search depth collapses. Stockfish's NNUE is the pragmatic point on this spectrum — a lightweight neural net designed to be fast enough for deep alpha-beta search.
Classical/NNUE eval + MCTS: MCTS was designed for expensive-but-high-quality evals. Classical eval is cheap but noisy — it benefits from depth, not selectivity. Plugging cheap eval into MCTS means being selective with evaluations that aren't accurate enough to justify selectivity.
The deeper principle: Eval and search co-evolved as point solutions in a tradeoff space. You can't swap one for the other. This understanding informed why the TI-84 CE engine uses a hand-crafted evaluation (cheap, benefits from depth via alpha-beta) rather than attempting to run a neural network on the eZ80.
3. Architecture
3.1 Engine/UI Separation
The chess engine is cleanly separated into a portable C engine (chess/engine/src/, 8 files) and a TI-84 CE GUI (chess/src/main.c). The engine.h API serves as the boundary — the GUI never touches internal board representation directly.
chess/
├── src/main.c — GUI (320×240 LCD, sprite rendering, input)
├── engine/
│ ├── src/
│ │ ├── board.c/h — 0x88 board, piece lists, make/unmake
│ │ ├── movegen.c/h — Pseudo-legal move generation
│ │ ├── eval.c/h — Texel-tuned HCE (PeSTO + features)
│ │ ├── search.c/h — Negamax, PVS, LMR, null-move, killers
│ │ ├── tt.c/h — 4096-entry transposition table (32KB)
│ │ ├── zobrist.c/h — 24-bit Zobrist hashing with inline asm
│ │ ├── engine.c/h — Public API for GUI integration
│ │ ├── book.c/h — Polyglot opening book (multi-AppVar)
│ │ ├── types.h — Core types, 0x88 constants, move encoding
│ │ ├── directions.h — Knight/bishop/rook/king offset arrays
│ │ ├── chdata.h — AppVar layout (sprites + randoms)
│ │ └── pick_best.asm — Hand-written eZ80 assembly
│ ├── test/ — perft, search tests, integration tests, bench
│ ├── tuning/ — Texel tuning pipeline (Python)
│ ├── uci/ — UCI protocol for desktop play
│ └── Makefile — Desktop builds (GCC)
├── bench/ — On-calculator benchmark harness
├── emu_uci/ — Emulator UCI bridge for tournaments
├── tools/ — Book generation, data packing
└── Makefile — CE SDK build for CHESS.8xp

3.2 Dual-Target Compilation
The same engine source compiles for both:
- TI-84 CE: CE C toolchain, eZ80 target, `-Oz`, `-DPAWN_CACHE_SIZE=16 -DSPRITES_EXTERNAL`
- Desktop: GCC/Clang, x86/ARM64 target, `-O2`, `-DNO_BOOK`
This enables desktop benchmarking (fast iteration, Stockfish tournaments) while validating on the actual target via the cycle-accurate eZ80 emulator. A key finding: some optimizations help on eZ80 and hurt on desktop (e.g., sentinel rays: +1.3% on eZ80, -8.4% on desktop).
3.3 Memory Budget (~62KB of 256KB)
| Component | RAM (bytes) |
|---|---|
| Transposition table (4096 × 8) | 32,768 |
| Move pool (2048 × 5, SoA) | 10,240 |
| Pawn cache (32 entries) | ~5,000 |
| Piece sprites (12 × 402) | ~4,800 |
| Zobrist tables | ~3,200 |
| Lock tables | ~1,600 |
| Position history (256 × 3) | 768 |
| History heuristic (2 × 128 × 2) | 512 |
| `board_t` | ~440 |
| Killers (64 × 2 × 3) | 384 |
| UI state + misc | ~2,000 |
| Total | ~62KB |
BSS limit measured at 60,690 bytes. The remaining ~194KB is for TI-OS, stack (~4KB), and other overhead.
4. The Chess Engine Core
4.1 Board Representation (board.c/h)
0x88 board — the classic scheme where sq & 0x88 detects off-board squares:
- `squares[256]`: The board array. Off-board indices are filled with the `OFFBOARD` (0xFF) sentinel, enabling sliding piece loops to terminate by checking for non-`PIECE_NONE` without a per-step `SQ_VALID()`. The sentinel area was extended from 128 to 256 bytes specifically for the eZ80, where there is no branch predictor and eliminating the `SQ_VALID` check saves cycles.
- `piece_list[2][16]`: Square of each piece, per side. Max 16 pieces.
- `piece_index[128]`: Reverse map from square to piece_list index (O(1) lookup). `0xFF` if empty. Enables O(1) piece removal via swap-with-last.
- `bishop_count[2]`: Tracked incrementally for the bishop pair bonus.
- `king_sq[2]`: Cached king squares for fast check detection.
- Incrementally maintained scores: `mg[2]`, `eg[2]` (PST+material per side), `phase` (24=opening, 0=endgame), `hash` (24-bit Zobrist), `lock` (16-bit TT verification), `pawn_hash` (separate pawn-only hash for the pawn cache).
board_make() (~150 lines): The critical hot-path function. Incrementally updates hash, lock, pawn_hash, PST scores, phase, bishop_count, king_sq, castling rights, and EP square. Uses a castling_mask[128] lookup table — a 128-byte table where corner/king squares AND out the relevant castling bits.
board_unmake(): Reverses make by restoring saved state from undo_t struct (10-11 bytes: captured piece, castling, EP, halfmove, fullmove, hash, lock, pawn_hash).
4.2 Move Generation (movegen.c)
Pseudo-legal generation iterating the piece list (O(piece_count), not O(64)):
- Staged generation: `GEN_CAPTURES` first, then `GEN_QUIETS`. Enables search to try captures before quiets without generating all moves.
- `is_square_attacked()`: Pure query function. Checks knight (8 offsets), pawn (2 diagonals), king (8 adjacent), then bishop/queen and rook/queen sliding rays. Uses the OFFBOARD sentinel for loop termination.
- Legality fast path: `compute_legal_info()` precomputes check, pin, and checker information by scanning rays from the king. `move_needs_legality_check()` returns 1 only for king moves, EP, pinned piece moves, and in-check moves — saving ~50% of expensive `is_square_attacked` calls.
4.3 Search (search.c)
Negamax with alpha-beta, iterative deepening, and these enhancements:
| Technique | Details |
|---|---|
| PVS | First move gets full window; others get null window [-alpha-1, -alpha], re-search on fail-high |
| Aspiration windows | ±25 from previous iteration's score, widen to full on fail |
| Null-move pruning | R=2 reduction, depth ≥ 3, not in check, non-pawn material required |
| LMR | After 4+ legal moves, depth ≥ 3, for quiet non-promotion moves: reduce by 1 |
| Futility pruning | Depth ≤ 2, static eval + margin ≤ alpha. Margins: 200 (d=1), 500 (d=2) |
| Check extensions | If in check: +1 depth. Limited to 2 per path |
| Quiescence | Max depth 8. Delta pruning (stand pat + 1100 < alpha). Captures only when not in check. |
| Move ordering | TT move (30000) > MVV-LVA captures (10000+) > Killer 1 (9000) > Killer 2 (8000) > Queen promos (+5000) > History heuristic |
| Killers | 2 per ply, FIFO replacement |
| History heuristic | Gravity-based: val += bonus - val * bonus / 16384, clamped to [-4000, 4000] |
| Repetition detection | Walks position history backward by 2, stops at last irreversible move |
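The history-gravity row above is a one-line formula; a minimal sketch of the update, assuming an `int16_t` history cell:

```c
#include <stdint.h>

/* Gravity-based history update: val += bonus - val * bonus / 16384,
   clamped to [-4000, 4000]. The subtractive term decays large values,
   so stale history fades instead of saturating. */
static int16_t history_update(int16_t val, int bonus) {
    int v = val + bonus - (int)val * bonus / 16384;
    if (v > 4000)  v = 4000;
    if (v < -4000) v = -4000;
    return (int16_t)v;
}
```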
Global move pool (SoA layout): pool_moves[2048] and pool_scores[2048]. Plies share this via a stack pointer (move_sp). This avoids ~2KB stack allocation per ply — critical on eZ80 with only ~4KB stack.
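A minimal sketch of that shared pool (function names `pool_mark`/`pool_push`/`pool_release` are illustrative, not the engine's API):

```c
#include <stdint.h>

#define POOL_SIZE 2048

/* SoA layout: parallel move and score arrays plus one stack pointer. */
static uint16_t pool_moves[POOL_SIZE];
static int16_t  pool_scores[POOL_SIZE];
static int      move_sp;

/* A ply records the pool top on entry... */
static int pool_mark(void) { return move_sp; }

static int pool_push(uint16_t move, int16_t score) {
    if (move_sp >= POOL_SIZE) return 0;   /* pool exhausted */
    pool_moves[move_sp]  = move;
    pool_scores[move_sp] = score;
    move_sp++;
    return 1;
}

/* ...and restores it on exit, freeing its moves in O(1). */
static void pool_release(int mark) { move_sp = mark; }
```

Because plies unwind in strict LIFO order, restoring the saved mark is all the "deallocation" a ply ever needs.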
Stack overflow guard (eZ80 only): Reads the OS stack limit from port 0xE0003A, adds STACK_RESERVE=400 bytes. Each node checks the frame pointer via lea %0, ix+0 inline asm and bails to evaluate() if too close.
Move variance (for difficulty levels): After search completes, finds all root moves within N centipawns of best and picks one randomly. The PVS window at root is widened by variance + 1 to ensure near-best moves get accurate scores (fixing a critical bug where PVS null-window clipping made all non-best moves score identically).
pick_best_score(): Selection sort to find the highest-scored move. On eZ80, replaced by hand-written assembly (pick_best.asm) that's 2.8× faster than the compiler output. Uses a signed-to-unsigned comparison trick (XOR high byte with 0x80) and keeps the flipped best_score in a register permanently.
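The real routine is eZ80 assembly, but the sign-flip trick can be modeled in C: XOR-ing the sign bit of a two's-complement score maps signed order onto unsigned order, so the selection loop needs only unsigned compares (names here are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Flipping bit 15 turns signed int16_t order into unsigned order:
   -32768 -> 0x0000, 0 -> 0x8000, 32767 -> 0xFFFF. */
static uint16_t flip(int16_t s) { return (uint16_t)s ^ 0x8000u; }

static size_t pick_best_index(const int16_t *scores, size_t n) {
    size_t best = 0;
    uint16_t best_flipped = flip(scores[0]); /* stays "in a register" */
    for (size_t i = 1; i < n; ++i) {
        uint16_t f = flip(scores[i]);
        if (f > best_flipped) { best_flipped = f; best = i; }
    }
    return best;
}
```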
4.4 Evaluation (eval.c)
Texel-tuned PeSTO piece-square tables with Stockfish-inspired features:
- Material + PST (from incremental `mg[WHITE] - mg[BLACK]` and `eg`). Values: pawn mg=82/eg=94, knight 337/281, bishop 365/297, rook 477/512, queen 1025/936.
- Bishop pair: mg=19, eg=58.
- Tempo: mg=10, eg=9 (side-to-move bonus).
- Pawn structure (via pawn cache):
- Doubled pawns: mg=-12, eg=-3
- Isolated pawns: mg=-12, eg=-17
- Connected pawns: rank-indexed bonus (0, 9, 10, 16, 39, 65)
- Passed pawns: rank-indexed mg/eg bonus (up to 85/229)
- Rook bonuses: Open file (mg=38, eg=24), semi-open (mg=23, eg=11).
- Mobility: Knight (0-8 squares) and bishop (0-13 squares) with Stockfish classical values. Uses pawn attack bitmap from cache for safe-square counting.
- Pawn shield: mg=6 per friendly pawn in front of king.
- Tapered eval: `score = (mg * phase + eg * (24 - phase)) / 24`.
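The tapered blend is a direct transcription of the formula:

```c
/* phase runs from 24 (all pieces on the board) down to 0 (bare endgame);
   the score slides linearly from the mg value to the eg value. */
static int taper(int mg, int eg, int phase) {
    return (mg * phase + eg * (24 - phase)) / 24;
}
```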
Pawn cache: 4-way set-associative, 32 entries. Keyed by pawn_hash. Stores pawn mg/eg scores, per-file rank bitmasks, and a 128-byte pawn_atk[] bitmap (per-square pawn attack info). The pawn_atk array MUST be static, not stack-allocated — putting 128 bytes on the eZ80 stack pushes local variable offsets beyond the efficient IX+displacement range (signed 8-bit: -128 to +127), causing a 14% performance regression across all eval sections.
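A minimal sketch of a 4-way set-associative probe keyed by the pawn hash. The entry layout is simplified and the eviction policy is an assumption — the text above does not specify replacement:

```c
#include <stdint.h>

/* 32 entries = 8 sets x 4 ways; key 0 marks an empty way. */
#define PC_SETS 8
#define PC_WAYS 4

typedef struct { uint32_t key; int16_t mg, eg; } pawn_entry_t;
static pawn_entry_t pc[PC_SETS][PC_WAYS];

static pawn_entry_t *pawn_cache_probe(uint32_t key, int *hit) {
    pawn_entry_t *set = pc[key % PC_SETS];
    for (int w = 0; w < PC_WAYS; ++w)
        if (set[w].key == key) { *hit = 1; return &set[w]; }
    *hit = 0;                 /* miss: caller fills the returned slot */
    set[0].key = key;         /* naive way-0 eviction, for the sketch */
    return &set[0];
}
```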
4.5 Transposition Table (tt.c)
4096 entries × 8 bytes = 32KB. Always-replace policy.
Entry: lock16 (2B, independent 16-bit verification), score (2B), best_move (2B, packed), depth (1B), flag (1B: EXACT/ALPHA/BETA).
Move packing: 6 bits from_sq64 + 6 bits to_sq64 + 1 promotion flag bit + 2 bits promo type = 15 bits, packed into a 16-bit field. Capture/EP/double-push flags are NOT stored — the caller must verify legality.
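A sketch of that packing (the field order within the 16-bit word is an assumption):

```c
#include <stdint.h>

/* 6-bit from square, 6-bit to square, 1 promotion flag, 2-bit promo type. */
static uint16_t tt_pack(int from64, int to64, int promo, int ptype) {
    return (uint16_t)(from64 | (to64 << 6) | (promo << 12) | (ptype << 13));
}
static int tt_from(uint16_t m)  { return m & 63; }
static int tt_to(uint16_t m)    { return (m >> 6) & 63; }
static int tt_promo(uint16_t m) { return (m >> 12) & 1; }
static int tt_ptype(uint16_t m) { return (m >> 13) & 3; }
```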
The 24-bit hash + 16-bit lock gives 40 bits total of collision resistance (1 in ~1 trillion false positive rate).
4.6 Zobrist Hashing (zobrist.c)
Tables stored as uint32_t (4-byte stride for fast array indexing), but all state variables use zhash_t = unsigned int = 24 bits on eZ80. This hybrid approach avoids both expensive __lxor library calls (from 32-bit state operations) and slow multiply-by-3 array strides (from 3-byte tables).
ZHASH_XOR inline assembly (eZ80 only): Memory-to-memory 3-byte XOR using ld a,(de); xor a,(hl); ld (hl),a repeated 3 times (~24 cycles). The compiler's library call __ixor costs ~45 cycles. A C macro approach was also tried but generated worse code (+20%) due to bad pointer arithmetic for uint8_t* casts.
4.7 Opening Book (book.c)
Polyglot format stored in TI-OS AppVars (flash, max ~65KB each), split across multiple segments. Five tiers: S/M/L/XL/XXL with up to 99 numbered segments each. Auto-detected at runtime from largest to smallest.
Book entries are accessed directly from flash pointers — zero RAM cost. Binary search + weighted random selection with the position hash as seed.
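The weighted pick can be sketched as follows (entry layout simplified from the Polyglot format; seeding with the position hash makes the "random" choice deterministic per position):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint16_t move; uint16_t weight; } book_entry_t;

/* Draw r in [0, total_weight) from the seed, then walk the entries. */
static int book_pick(const book_entry_t *e, size_t n, uint32_t seed) {
    uint32_t total = 0, r;
    for (size_t i = 0; i < n; ++i) total += e[i].weight;
    if (total == 0) return -1;
    r = seed % total;
    for (size_t i = 0; i < n; ++i) {
        if (r < e[i].weight) return (int)i;
        r -= e[i].weight;
    }
    return -1; /* unreachable when weights sum to total */
}
```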
The book trimming pipeline (tools/trim_book.py) uses depth-aware BFS: shallow positions (≤8 ply) get multiple alternatives for variety, deep positions get 1 move for maximum depth. Tournament testing showed book vs no-book is essentially equal at 6000 nodes (~1950 Elo both ways).
5. The GUI
chess/src/main.c renders on the 320×240 LCD using the CE C toolchain's graphx.h library:
- Board: 208×208 pixels (8×8 grid of 26px squares), offset at (16, 16). 20-color indexed palette.
- Double-buffered: `gfx_SetDrawBuffer()` / `gfx_SwapDraw()`. Partial redraw optimization for cursor-only moves.
- Pre-rendered sprites: 12 sprites (6 types × 2 colors) with contour reinforcement for crisp edges at 20×20 pixels.
- Piece animation: Linear interpolation, 50ms/square (max 200ms). Piece removed from board during animation, drawn at interpolated position each frame.
- State machine: MENU → DIFFICULTY_SELECT → COLOR_SELECT → PLAYING → PROMOTION/GAMEOVER.
- AI integration: Two-phase render — frame 1 draws "Thinking..." (so the display updates), frame 2 calls `engine_think()` (blocking).
- Board flipping: When playing as Black, all coordinates and D-pad directions are inverted.
6. Technical Tradeoffs & eZ80 Constraints
6.1 Why 0x88 Over Bitboards
The eZ80 is a 24-bit CPU with no SIMD. Bitboard operations (popcount, bitscan, 64-bit AND/OR/XOR) would require expensive multi-word library calls. The 0x88 scheme uses only 8-bit comparisons and additions — native eZ80 operations.
6.2 The 24-bit Integer Problem
On eZ80, unsigned int = 24 bits. This creates a unique optimization landscape:
| Type | eZ80 size | Speed |
|---|---|---|
| `uint8_t` | 1 byte | Native, fast |
| `int` / `unsigned int` | 3 bytes | Native, fast |
| `int16_t` | 2 bytes | Non-native, slow (masking/sign-extension) |
| `int32_t` / `uint32_t` | 4 bytes | Library calls (`__lxor`, `__ladd`) |
Rules discovered empirically:
- Widen register-bound locals to native `int` — beneficial
- Do NOT widen struct/array fields — harmful (increases memory traffic)
- `int16_t` for search scores in struct fields is okay; `int` for local variables during computation is better
- History table MUST stay `int16_t` (widening caused a +44.8% moveorder regression)
- Array stride 4 (shift-by-2) is faster than stride 3 (multiply-by-3), so Zobrist tables stay `uint32_t` even though only 24 bits are used
6.3 -Oz vs -O2
-Oz (optimize for size) decisively beats -O2 on the eZ80: 7,053 cy/node vs 10,005. The larger -O2 binary causes flash cache misses, especially in move ordering (+103.9% slower). Code size dominates performance because the eZ80's flash has 10-cycle latency on cache misses (197 cycles for a full miss).
6.4 Sentinel-Based Sliding Rays
Expanding squares[] from 128 to 256 bytes, filling off-board indices with OFFBOARD (0xFF), lets sliding piece inner loops drop the SQ_VALID() check:
```c
// Before (two conditions per iteration):
while (SQ_VALID(target) && squares[target] == PIECE_NONE) target += dir;

// After (one condition per iteration):
while (squares[target] == PIECE_NONE) target += dir;
```

eZ80 result: -1.29% cy/node (the eZ80 has no branch predictor, so eliminating a branch saves cycles predictably).
Desktop result: +83% slower movegen, +109% slower is_square_attacked, -8.4% throughput. The larger board_t (448B vs 320B) hurts L1 cache utilization on modern CPUs where branches are essentially free.
This is the most striking example of platform-divergent optimization in the project.
6.5 Stack Limitations
The TI-OS provides only ~4KB of stack. The chess engine mitigates this with:
- Global move pool instead of per-ply stack arrays (~2KB saved per ply)
- Static `pawn_atk[128]` instead of stack-allocated (safe because `evaluate()` is never recursive)
- `board_t` and `moves[]` declared static in the benchmark harness
- Runtime stack guard reading the OS stack limit register
6.6 Flash-Resident Data
Opening book entries and Polyglot random numbers are accessed directly from AppVar flash pointers — zero RAM cost. Piece sprites are also loaded from the CHDATA AppVar when SPRITES_EXTERNAL is defined, saving ~2400 bytes of code flash.
7. The Optimization Journey
The engine went through 13 variant repos tracking systematic optimization. Here's the complete narrative:
7.1 Phase 1: Engine Bootstrap (Feb 13)
Four commits in one evening built the entire engine from nothing:
- Move generation with 30/30 perft validation
- Search + eval + TT (negamax, alpha-beta, iterative deepening)
- UI integration
- Integration tests
7.2 Phase 2: Eval Features + Ablation (Feb 15)
Added 6 Stockfish-inspired evaluation features with compile-time #ifndef guards for ablation testing:
| Feature | Impact when removed (vs SF-1700) |
|---|---|
| Mobility | Worst — ~35% swing |
| Pawn structure | Moderate |
| Passed pawns | Moderate |
| Rook on files | Small |
| Pawn shield | Small |
| Tempo | Small |
7.3 Phase 3: Movegen Optimizations (Feb 15)
Five sequential optimizations, each measured on 5 positions × 1000 iterations:
| # | Optimization | Impact |
|---|---|---|
| 1 | O(1) piece-list updates via square index map | -5.1% perft |
| 2 | Staged movegen (captures then quiets) | Helps search pruning |
| 3 | Skip castling attack probes when in check | +6.7% movegen overhead (but saves attack calls in real games) |
| 4 | Precompute check/pin info | ~50% fewer legality checks |
| 5 | Track bishop counts incrementally | -5.9% eval |
7.4 Phase 4: Texel Tuning (+52 Elo) (Feb 16)
The single largest Elo gain. See Section 8 for full details.
7.5 Phase 5: Integer Type Optimization (Feb 16)
Widened hot-path score locals to native int (24-bit). Narrowed phase to uint8_t. Combined: -2.4% cy/node, -379 bytes binary. Widening struct fields was tried and reverted (-5.6% make/unmake regression).
7.6 Phase 6: Pawn Hash Cache (Feb 16-17)
Progressive refinement:
- 2-way set-associative: -14.3% total eval cycles
- 4-way set-associative: additional -6.6%
- Qsearch capture-only fast path: -4.8%
- Combined
build_pawn_info+ pawn attack bitmap: -28.5% eval, -15.7% cy/node overall
7.7 Phase 7: Cycle-Level Micro-Optimizations (Feb 17)
| Change | cy/node Impact | Notes |
|---|---|---|
| Native 24-bit `zhash_t` (hybrid approach) | -0.8% | Tables stay `uint32_t`, state uses `zhash_t` |
| Sentinel sliding rays | -1.3% | eZ80-only win; desktop regression |
| Hand-written `pick_best_score` asm | -11.8% (in that function) | 2.8× faster than compiler |
| Inline asm `ZHASH_XOR` | -2.8% make/unmake | Memory-to-memory 3-byte XOR |
| SoA move pool | -1.1% | Eliminated pool_copy overhead |
| Eval loop merging (4→2 loops) | -4.5% profiling | ~1.1% real-world |
| `sprintf`/`printf` elimination | -7.3KB binary | Purely size reduction |
7.8 Optimizations Tried and Rejected
| Optimization | Result | Why |
|---|---|---|
| Division-free tapered eval | 0% | Compiler already uses multiply-by-reciprocal for /24 |
| 24-bit PRNG | Skipped | Both PRNGs called <2000 times per game |
| Eval deduplication (parameterized loops) | -2.3% | Prevents BIT instruction for constant color checks |
| Local pointer caching | 0% | LTO sees through the alias |
| Counter-move heuristic | -1% | Committed without testing, immediately reverted |
| Stack-allocated pawn_atk[128] | -14% | IX+displacement exceeds efficient signed 8-bit range |
| Pure zhash_t tables (3-byte stride) | -1.2% | multiply-by-3 slower than shift-by-2 |
| NNUE neural network eval | Not merged | Too expensive for 48MHz without SIMD |
7.9 NNUE Exploration (Shelved)
Architecture: 768→32→1 with SCReLU activation, int8 quantized weights (24,772 bytes). Implemented on feature/nnue branch but never merged. The 32-neuron hidden layer required 32 multiply-accumulate operations per eval call on a CPU with no SIMD, and the 24KB feature transformer weight matrix would dominate flash reads.
7.10 Final Performance Profile (100 positions, 1000 nodes each)
| Position Category | cy/node | eval% | movegen% | make% |
|---|---|---|---|---|
| Positional middlegames | 470,211 | 38% | 21% | 13% |
| Tactical middlegames | 337,437 | 34% | 19% | 14% |
| Complex endgames | 272,686 | 29% | 16% | 16% |
| Simple endgames | 133,155 | 23% | 14% | 17% |
Averaged over real games, this works out to roughly 130K cy/node = ~2.7ms/node at 48MHz = ~370 NPS.
8. Texel Tuning Pipeline
8.1 Methodology
Texel tuning optimizes evaluation parameters by minimizing mean squared error between sigmoid(K × static_eval) and actual game outcomes:
MSE = mean( (sigmoid(K × eval) - label)² )

where label is 1.0 (win), 0.5 (draw), or 0.0 (loss) from the side-to-move perspective.
8.2 Data Source
Lichess broadcast database (GM-level tournament games, not random Lichess games). 25 months from 2024-01 through 2026-01. Positions filtered: ply 12-100, not after capture, not in check. Two datasets: 60K (pilot) and 1M positions (production).
8.3 Optimization Algorithm
Adam gradient descent with analytical gradients (not scipy). Key settings:
- Sigmoid slope K: found via golden-section 1D search (~0.00652)
- Learning rate: 0.02
- L2 regularization centered at scale=1.0 (pulling toward baseline)
- 90/10 train/validation split
- Best model selected by validation MSE
8.4 Parameters Tuned
Scale factors for 14 scalar params (bishop pair, tempo, doubled/isolated pawns, rook files, shield — each mg/eg), 12 PST row scales, rank-indexed pawn bonuses, and mobility bucket tables. Expanded from 14 to 108 parameters across rounds.
8.5 Results Across Rounds
| Round | Games | Elo vs Baseline | Significant? |
|---|---|---|---|
| Round 1 (1M positions) | 4000 | +52 (±10) | Yes |
| Round 2 vs Round 1 | 4000 | -1.6 (±10) | No |
| Round 3 vs Round 1 | 4000 | -5.2 (±10) | No |
| Round 4 expanded (96 params) | 1000 | +11 (±20) | No |
| Round 4b constrained | 4000 | +0.1 (±10) | No |
| Round 5 (108 params, material split) | 4000 | +8 (±11) | Borderline |
Key takeaway: Round 1 delivered a clearly significant +52 Elo. All subsequent rounds produced marginal improvements. Further tuning was abandoned.
8.6 Application
apply_texel_params.py reads JSON output and patches eval.c using regex-based text substitution — replaces #define values and array initializers directly in C source.
9. Testing & Tournament Infrastructure
9.1 Test Suites
| Tool | Purpose | Key detail |
|---|---|---|
| `perft.c` | Move generation correctness | 7 standard + 22 edge-case positions |
| `test_search.c` | Search/eval correctness | 8 categories: mate finding, stalemate, draws, incremental eval consistency, 200-halfmove self-play |
| `test_integration.c` | Public API testing | 11 categories covering the full engine.h API |
| `bench.c` (desktop) | Component-level benchmarking | 100 positions from 14 standard suites, ns/call for each component |
| `bench/` (on-calc) | eZ80 cycle profiling | Same 100 positions but with 48MHz hardware timer, per-section cycle breakdown |
9.2 Tournament Infrastructure
Desktop tournaments (engine/tournament.py): Compiles UCI engine, plays against Stockfish at configurable Elo levels (1320-2420, step 100). 20 games per level, 5 concurrent matches, 0.1s/move.
Emulator tournaments (chess/emu_tournament.py): Runs the actual eZ80 binary inside the cycle-accurate TI-84 CE emulator against Stockfish. The key innovation:
- Reads the EMUUCI.8xp binary
- Finds the `@@CMDSLOT@@` sentinel (512 bytes)
- Patches each position's FEN + settings into the slot
- Recomputes the .8xp file checksum
- Launches the emulator subprocess
- Parses `"MOVE xxxx"` from debug output
- Feeds the move to Stockfish via UCI
This measures the engine at its true eZ80 calculation speed (~370 NPS), not desktop speed.
Ablation testing (engine/ablation.py): Plays the full engine against versions compiled with -DNO_<FEATURE> to measure each eval feature's contribution.
9.3 Elo Results
| Platform | Config | Elo | Method |
|---|---|---|---|
| eZ80 | Easy (900ms, var=10) | ~1320 | 30 games vs SF-1320 |
| eZ80 | Medium (3s) | ~1442 | 30 games vs SF-1500 |
| eZ80 | Hard (9s) | ~1712 | 30 games vs SF-1700 |
| eZ80 | Expert (13.5s) | ~1958 | 30 games vs SF-1900 |
| eZ80 | Master (27s) | ~2083 | 100 games vs SF-2100 |
| Desktop | 0.01s/move | ~2583 | 100 games vs SF-2600 |
| Desktop | 0.1s/move | ~2700 | 100+189 games vs SF-2600/2700 |
10. Major Bugs & Debugging Stories
10.1 The Queen Hang Bug (PVS Scoring)
Symptom: Engine played Qxa2, hanging the queen to Rxa2.
Root cause: PVS null-window search at root scored terrible moves identically to the best. When move_variance > 0, the engine randomly selected among "equally scored" root moves — but PVS clips all non-PV moves to the same alpha value, making Qxa2 (losing the queen) look as good as the best move.
Investigation: Spanned multiple sessions. First confirmed the bug exists on desktop (not eZ80-specific). Quiescence correctly saw Rxa2, but PVS window clipping masked the true score at every search depth (1-8).
Fix attempts:
- Full window at root: catastrophic — -708 Elo (PVS exists for a reason)
- Wider PVS window by variance amount: worked. `pvs_floor = alpha - variance` instead of `alpha`.
Variance sweep: var=0: 0 Elo loss. var=5: 0 Elo loss (sweet spot). var=10: -147 Elo. var=15: catastrophic.
10.2 The 0-30 Tournament Disaster
Symptom: After implementing partial depth reporting, engine went from competitive to 0-30 in tournaments.
Root cause: search_best_root_move was being used to override result.best_move at timeout, but if no root move was scored in the current iteration, the fallback was the first generated move (a2a3 or h7h6), not the PV move.
Fix: Added search_root_move_scored flag. Documented in MEMORY.md as a "Dangerous Pattern."
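The guard described in the fix can be sketched as follows. The variable names `search_best_root_move` and `search_root_move_scored` come from the writeup; the surrounding structure is a reconstruction:

```c
#include <stdbool.h>

/* Reconstruction of the timeout-fallback guard.  Only let the current
 * iteration's root best override the previous iteration's PV move if
 * at least one root move actually finished scoring before the clock
 * ran out; otherwise the old fallback handed back the first generated
 * move (a2a3 / h7h6), which caused the 0-30 run. */
typedef struct { int best_move; } result_t;

static int  search_best_root_move;
static bool search_root_move_scored;

static void on_timeout(result_t *result) {
    if (search_root_move_scored)
        result->best_move = search_best_root_move; /* safe override */
    /* else: keep result->best_move from the last completed iteration */
}
```

The "Dangerous Pattern" is general: any partial-iteration result must carry an explicit validity flag before it is allowed to overwrite a complete one.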
10.3 Check Extension + TT Ordering Bug (PR #15)
Symptom: Engine played Be7 (losing to mate-in-2) instead of finding the defense.
Root cause: Check extensions increase depth, so the TT caches results at the extended depth. On the next iteration, the TT probe returns the cached result BEFORE the check extension can trigger, locking the engine to depth 1 in check positions.
Discovery: Hunter noticed the bad move and directed Claude to investigate. Claude initially got the chess analysis wrong; Hunter corrected: "there's a mate-in-2 with both moves being check."
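The writeup states the root cause but not the fix, so the following is a hedged sketch of the ordering hazard and one consistent resolution (probe the TT with the already-extended depth so the cached draft and the requested draft agree), with illustrative names:

```c
#include <stdbool.h>

/* Sketch of the ordering hazard (names illustrative, not the engine's).
 * If the check extension bumps depth AFTER the TT probe, an entry
 * cached at the extended depth satisfies the shallower pre-extension
 * probe, so the extension never fires and the node stays at depth 1.
 * Computing the extended depth first keeps probe and cache consistent. */
static int effective_depth(int depth, bool in_check) {
    return in_check ? depth + 1 : depth; /* check extension */
}

static bool tt_cutoff(int cached_depth, int requested_depth) {
    return cached_depth >= requested_depth; /* draft sufficiency check */
}
```

Under the buggy ordering, a depth-2 cached entry cuts off a depth-1 probe in a check position; probing at the extended depth instead requires the cache to actually cover the deeper search.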
10.4 The Timer Failure Cascade
Symptom: Late-endgame tournament positions terminated with emu_error.
Root cause chain: check_time() checked the node limit FIRST with early return, preventing the time check from running. With time_fn = NULL and search_deadline = 0, search never stopped on time. Fell back to 30K node limit, which at ~300 NPS took ~96 seconds, exceeding the emulator timeout.
Fix: Added search_node_deadline as a hard safety fallback (~1 node/ms with 3× headroom). Also made search_stopped volatile (LTO was optimizing away reads).
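A reconstruction of the repaired stop check, assuming the names from the writeup (`check_time`, `search_deadline`, `search_node_deadline`, `search_stopped`); the function body is a sketch, not the engine's source:

```c
#include <stdbool.h>

/* Reconstruction of the fixed stop check.  The original bug: an early
 * return on the node limit skipped the time check entirely, so with
 * time_fn == NULL and search_deadline == 0 nothing ever stopped the
 * search.  Evaluating every limit (no early return) plus a hard node
 * deadline closes the gap.  volatile matters on the real build:
 * LTO was optimizing away reads of the stop flag. */
typedef unsigned long u32;

static volatile bool search_stopped;
static u32 search_deadline;       /* 0 = no wall-clock limit */
static u32 search_node_deadline;  /* hard fallback: ~1 node/ms, 3x headroom */

static void check_time(u32 now, u32 nodes) {
    if (search_node_deadline && nodes >= search_node_deadline)
        search_stopped = true;    /* hard safety fallback */
    if (search_deadline && now >= search_deadline)
        search_stopped = true;    /* time check always runs */
}
```

The design lesson is that independent stop conditions should be ORed into one flag, never chained behind early returns where a disabled limit can mask an enabled one.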
10.5 The Book That Never Worked
Multiple rounds of debugging:
- Missing CHBKRN.8xv (shared randoms file)
- Tier mismatch: baked the large book but compiled with `BOOK=medium`
- `const char *` pointer arrays had incorrect behavior on the eZ80 24-bit target
- After fixing all of the above, the book still failed on the custom emulator while working fine on the CEmu reference emulator
- Root cause: custom emulator's VAT reconstruction didn't properly expose AppVars
11. AI Agent Orchestration & Usage Stats
11.1 Session Data
| Metric | Value |
|---|---|
| Main sessions | 19 session directories |
| Main JSONL files | 1 file, 2.2 MB |
| Subagent files | 174 files, ~18 MB |
| Compact summaries | 71 files |
| Date range | Feb 10 – Mar 27, 2026 |
11.2 Human-AI Collaboration
Hunter directed every architectural and strategic decision; the AI implemented code per his direction. Key aspects of Hunter's leadership:
- Architectural decisions: Chose 0x88 board representation (over bitboards — impractical on 24-bit CPU), Polyglot book format, the phased engine development approach (movegen → search/eval → UI integration → tests)
- Benchmarking methodology: Designed the empirical approach — insisted on eZ80-only benchmarks ("desktop benchmarks are pointless"), established the 100-position benchmark suite as the single source of truth, required every optimization to be measured before committing
- Quality enforcement: Caught the AI committing code without testing ("yes i want to actually test things before making commits"), demanded accurate numbers, rejected over-engineered explanations
- Strategic direction: Explored NNUE as a potential approach (Hunter's idea, not the AI's), evaluated the tradeoffs, and made the call to shelve it. Designed the difficulty/variance system concept. Directed the Texel tuning pipeline.
- Course corrections: Caught hallucinated performance ratios ("100,000× slower" — actual: ~14,000×), corrected chess analysis errors (the AI got a mate-in-2 analysis wrong), stopped the AI from overclaiming ("let's not say chess is solved")
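The 0x88 decision mentioned above rests on a classic trick that suits a 24-bit CPU with no cheap 64-bit operations. A minimal illustration (not the engine's code):

```c
/* The classic 0x88 trick: squares live on a conceptual 16x8 board
 * (index = rank * 16 + file), so any index with bit 0x88 set is off
 * the real 8x8 board.  Off-board detection is a single AND, where
 * bitboards would need 64-bit operations the eZ80's 24-bit registers
 * cannot do cheaply.  Illustrative sketch only. */
static int sq088(int rank, int file) {
    return rank * 16 + file;
}

static int on_board(int sq) {
    return (sq & 0x88) == 0;
}
```

For example, stepping one file past h1 (`sq088(0,7) + 1 = 0x08`) sets bit 0x08, so the single AND immediately flags the move as off-board.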
Specific instances where Hunter corrected the AI:
- AI committed counter-move heuristic without benchmarking — Hunter enforced the test-first rule
- AI overcomplicated explanations of benchmark paradoxes — Hunter cut through the noise
- AI hallucinated performance ratios ("100,000× slower") — Hunter caught it, actual ratio was ~14,000×
- AI misattributed NNUE exploration as "AI proposed, human rejected" — Hunter corrected: it was his idea to explore, his evaluation to shelve
- AI miscalculated emulator tournament timing ("15-30 minutes each") — Hunter corrected: games run at emulator speed, not real-time
- AI initially got the check extension bug chess analysis wrong — Hunter corrected: "there's a mate-in-2 with both moves being check"
11.3 Key Methodology
Hunter's development loop: implement → benchmark on 100 positions → commit or revert. Every optimization has a measured cy/node or Elo impact. The RESULTS.md file serves as the single source of truth. The AI executes this loop; Hunter reviews the results and decides what ships.
12. Development Timeline
| Date | Day | Key Achievement |
|---|---|---|
| Feb 10 | 1 | Pong game complete |
| Feb 11 | 2 | Sudoku game (PR #1) |
| Feb 12 | 3 | Chess UI only (PR #2) |
| Feb 13 | 4 | Full chess engine — movegen, search, eval, TT, integration |
| Feb 15 | 6 | Eval features + ablation. Movegen optimizations. Opening book. |
| Feb 16 | 7 | Texel tuning (+52 Elo). Integer type optimization. Pawn cache. |
| Feb 17 | 8 | 5 PRs in one day. Sentinel rays, difficulty system, PVS bug fix, tournament system. |
| Feb 18 | 9 | Emulator tournament system. CHDATA AppVar consolidation. |
| Feb 19 | 10 | Hand-written eZ80 assembly (-11.8% pick_best). SoA move pool. |
| Feb 20-25 | 11-16 | Check extension bug fix. Threefold repetition. Final polish. |
Total: 111 commits, 15 PRs (14 merged), 16 days. From nothing to ~2083 Elo on a calculator.
13. Key Files Reference
Engine Core
| File | Lines | Purpose |
|---|---|---|
| engine/src/search.c | ~800 | Negamax, PVS, LMR, null-move, quiescence, iterative deepening |
| engine/src/board.c | ~600 | 0x88 board, piece lists, make/unmake |
| engine/src/eval.c | ~500 | Texel-tuned HCE, pawn cache, mobility, tapered eval |
| engine/src/movegen.c | ~400 | Pseudo-legal generation, attack detection, legality fast path |
| engine/src/book.c | ~350 | Polyglot book, multi-AppVar, binary search, weighted selection |
| engine/src/engine.c | ~350 | Public API, game status, difficulty configuration |
| engine/src/zobrist.c | ~150 | Zobrist tables, inline asm XOR, PRNG |
| engine/src/tt.c | ~100 | 4096-entry TT, move packing, always-replace |
| engine/src/pick_best.asm | ~60 | Hand-written eZ80 assembly for move scoring |
| src/main.c | ~1200 | GUI: rendering, input, animation, state machine |
Testing & Tournaments
| File | Purpose |
|---|---|
| engine/test/bench.c (25KB) | 100-position desktop benchmark |
| bench/src/main.c | On-calculator cycle profiling benchmark |
| bench/RESULTS.md | Complete optimization history with measurements |
| engine/test/perft.c | 29-position move generation validation |
| engine/test/test_search.c | Search correctness (8 categories) |
| engine/test/test_integration.c | Full API integration tests (11 categories) |
| engine/tournament.py | Desktop Stockfish tournaments |
| emu_tournament.py | Emulator-based eZ80 tournaments |
| engine/ablation.py | Eval feature ablation testing |
Texel Tuning
| File | Purpose |
|---|---|
| engine/tuning/texel_tune.py (32KB) | Adam gradient descent on 1M positions |
| engine/tuning/build_texel_dataset.py | Lichess broadcast PGN → CSV dataset |
| engine/tuning/apply_texel_params.py | JSON results → C source patching |