CE-Games Chess Engine — Deep Technical Profile

Table of Contents

  1. Project Overview
  2. Pre-Implementation Research & Planning
  3. Architecture
  4. The Chess Engine Core
  5. The GUI
  6. Technical Tradeoffs & eZ80 Constraints
  7. The Optimization Journey
  8. Texel Tuning Pipeline
  9. Testing & Tournament Infrastructure
  10. Major Bugs & Debugging Stories
  11. AI Agent Orchestration & Usage Stats
  12. Development Timeline
  13. Key Files Reference

1. Project Overview

A chess engine for the TI-84 Plus CE graphing calculator (Zilog eZ80 @ 48 MHz, 256KB RAM, 3.5MB Flash), claiming to be the strongest chess engine ever built for this platform. The engine achieves ~2083 Elo at its highest difficulty (27s/move) and ~2700 Elo when cross-compiled for desktop.

By the numbers:

  • 111 commits across 15 PRs, built in 16 days (Feb 10–25, 2026)
  • 8 core engine C files + 1 hand-written eZ80 assembly file
  • 13 iteration/variant repos tracking the optimization journey
  • ~370 NPS on the eZ80 vs ~5.1M NPS on desktop ARM64 (14,000× slower)
  • ~62KB RAM usage out of 256KB available
  • Texel tuning on 1M Lichess positions: +52 Elo (single largest gain)
  • 100-position benchmark suite with cycle-accurate profiling
  • Automated tournament system running the actual eZ80 binary vs Stockfish

The project also includes Pong (AI opponent, 5 levels) and Sudoku (3 difficulties, pencil marks, undo), but chess dominates the engineering effort.


2. Pre-Implementation Research & Planning

Before any code was written, Hunter conducted extensive research across multiple Claude.ai conversations that produced the hardware constraints reference document, the full 16-section architecture plan, the Elo measurement methodology, and the competitive landscape analysis. This research phase is what made the 16-day implementation sprint possible.

2.1 Hardware Constraints Research (originated from a DOOM porting investigation)

The foundational platform knowledge actually originated from a separate investigation into porting real DOOM (the actual id Software source, not a raycaster recreation) to the TI-84 CE. That research produced a comprehensive hardware constraints document that was then reused as the starting context for the chess engine planning.

Key hardware facts established through this research:

  • CPU: eZ80 @ 48 MHz, 8-bit ALU, 24-bit addressing (ADL mode). int is 24 bits (3 bytes), not 32. Range ±8,388,607. No hardware multiply — multiplication synthesized from shifts and adds, ~350 cycles for 32×32. Effective throughput for 32-bit-heavy code: ~5-15 MIPS.
  • RAM partitioning (~133 KB usable): NOT a flat arena. userMem ~65 KB (where .8xp programs load), BSS+Heap ~60 KB shared (static globals and malloc growing toward each other), Stack ~4 KB, OS/libraries ~17 KB. Programs exceeding ~65 KB auto-split by toolchain into companion AppVars.
  • Flash: 4 MB total, ~3.5 MB usable. Code CANNOT execute from flash (copied to RAM first), but data CAN be read freely. Archived AppVars accessed via ti_GetDataPtr() return direct flash pointers — zero RAM cost. Post-2019 hardware has a flash cache; sequential reads can outperform RAM because RAM competes with LCD DMA bandwidth.
  • Display: 320×240 pixels, 8-bit indexed color, VRAM memory-mapped at 0xD40000.
  • Toolchain: CE C/C++ toolchain (LLVM-based). Compile with -Oz for size.

The key implication Hunter identified for chess: opening books and lookup tables go to flash as archived AppVars (essentially free), search/eval code must fit in ~65 KB userMem, transposition table competes with board state and move lists in the ~60 KB heap, and the int = 24-bit gotcha means bitboards require explicit uint64_t with expensive multi-byte operations.

2.2 The 16-Section Architecture Plan

Hunter then conducted a full architecture planning session, working through every major design decision from first principles before implementation began. This produced the chess/TI84_Chess_Engine_Plan.md document.

Why 0x88 over bitboards — the first and most fundamental decision. Bitboards require 64-bit integers; every 64-bit operation on the eZ80 compiles into multi-byte library calls costing dozens of instructions. 0x88 uses a 128-element array where off-board detection is a single AND: if (square & 0x88). Direction arithmetic is just adding offsets. All operations are byte-width — perfect for the 8-bit ALU. Hunter worked through the full index layout and confirmed all coordinate math was native byte operations.
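The 0x88 mechanics described above can be sketched in a few lines of portable C (macro and array names here are illustrative, not necessarily the engine's actual identifiers):

```c
#include <stdint.h>

/* 0x88 board: conceptually 16 files per rank, but only files 0-7 are
 * real squares, so sq = rank*16 + file. A square is off-board iff
 * (sq & 0x88) != 0: bit 3 set means file >= 8, bit 7 set means rank >= 8. */
#define SQ(rank, file)  ((uint8_t)(((rank) << 4) | (file)))
#define SQ_VALID(sq)    (((sq) & 0x88) == 0)
#define SQ_RANK(sq)     ((sq) >> 4)
#define SQ_FILE(sq)     ((sq) & 0x07)

/* Direction arithmetic is plain byte addition; e.g. the knight offsets
 * are combinations of +/-16 (one rank) and +/-1..2 (files). */
static const int8_t knight_offsets[8] = {
    33, 31, 18, 14, -14, -18, -31, -33
};
```

Every operation here is an 8-bit AND, shift, or add, which is exactly why the scheme suits the eZ80's byte-wide ALU.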

Piece encoding settled on single-byte: type in bits 0-2 (PAWN=1 through KING=6), color in bit 7 (WHITE=0x00, BLACK=0x80). Translation from the UI's int8_t format is a small helper.
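That encoding reduces to a handful of single-byte masks (macro names are illustrative):

```c
#include <stdint.h>

/* One byte per piece: type in bits 0-2, color in bit 7. */
enum { PIECE_NONE = 0, PAWN = 1, KNIGHT, BISHOP, ROOK, QUEEN, KING };
#define COLOR_WHITE 0x00
#define COLOR_BLACK 0x80

#define PIECE_TYPE(p)    ((uint8_t)((p) & 0x07))
#define PIECE_COLOR(p)   ((uint8_t)((p) & 0x80))
#define MAKE_PIECE(c, t) ((uint8_t)((c) | (t)))
```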

Board state worked out to ~430 bytes: squares[128], piece_list[2][16], king_sq[2] (avoids scanning for king during check detection), Zobrist hash, castling rights (4 bits packed), en passant target. Pre-computed castling_rights[128] flash table maps each square to a mask.

Move pool with stack pointer — instead of allocating per-ply move arrays (which would blow the 4 KB stack), a single global pool: pool_moves[2048] + pool_scores[2048] with a stack pointer tracking allocation. Each ply pushes/pops from the pool.
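A minimal sketch of the pool discipline, assuming illustrative names (the real engine's helpers may differ):

```c
#include <stdint.h>

#define POOL_SIZE 2048
typedef uint16_t move_t;          /* illustrative packed-move type */

/* Structure-of-arrays pool shared by all plies. */
static move_t  pool_moves[POOL_SIZE];
static int16_t pool_scores[POOL_SIZE];
static int     move_sp = 0;       /* stack pointer into the pool */

/* On entering a ply: remember the current top of the pool. */
static int pool_mark(void) { return move_sp; }

/* Generate a move into the pool; fails if the pool is exhausted. */
static int pool_push(move_t m) {
    if (move_sp >= POOL_SIZE) return 0;
    pool_moves[move_sp]  = m;
    pool_scores[move_sp] = 0;
    move_sp++;
    return 1;
}

/* On leaving a ply: pop everything this ply generated at once. */
static void pool_release(int mark) { move_sp = mark; }
```

Because each ply only needs a saved mark (one int) on the real stack, recursion depth no longer multiplies move-array storage.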

eZ80 optimization notes from planning: Minimize function parameters (eZ80 passes all args on stack, no register calling convention), make search-state variables static globals rather than recursion parameters, use uint8_t/int8_t aggressively since byte-wide operations are the native sweet spot.

Realistic Elo ceiling discussion: With heroic optimization (hand-tuned assembly, Texel tuning, large opening book), the ceiling is ~1800-2000 Elo. The hard wall is search depth — at 2000+ Elo, decisive positions need 12-15 plies, and even at 10K NPS with perfect pruning you can't reach that depth in complex middlegames. Historical validation: best 1980s commercial chess computers (Mephisto MM IV, Fidelity Mach IV) topped out around 2000-2100 on similar effective throughput. The final measured result of ~2083 Elo at 27s/move confirms Hunter's pre-implementation estimate was accurate.

2.3 Elo Measurement Methodology

Hunter separately researched how to defensibly measure the engine's Elo. The investigation pulled in Marco Meloni's empirical data (110,000+ games at various node limits, calibrated against the SSDF human rating scale) and established the methodology:

Stockfish's UCI_LimitStrength mode was chosen over raw time throttling. UCI_Elo ranges from 1320-3190, calibrated at 120s+1s, anchored to CCRL 40/4 ±100 Elo. It works by intentionally selecting suboptimal moves with calibrated randomness — much more human-like than just limiting time. Below skill level 10 (~2596 Elo), the rating is hardware/time-control independent.

This methodology makes the engine's Elo claims defensible: anyone can reproduce the results by running the same Stockfish UCI_LimitStrength matches. The qualifier "based on testing against calibrated Stockfish" makes it a fact rather than a subjective claim.

2.4 Competitive Landscape Analysis

Hunter researched existing TI-84 CE chess engines to validate the "strongest engine" claim:

  • ChessCE (MateoConLechuga, 2016-2017): ~40K downloads, built on TSCP (~1724 desktop Elo). Known bugs — crashes on checkmate, no draw mechanics. Lacks LMR, futility pruning, most modern search techniques.
  • Chess84 (thewarrenjames, June 2025): Most recent competitor, implements many of the same search techniques but author estimates "half the time like an 800, half like a 1200." Tiny opening book (6 hand-curated lines vs Hunter's 131K Polyglot entries). No formal Elo testing.
  • PAndaContron/ce-chess: Minimal project, 5 commits, not a serious contender.

The decisive differentiator: Hunter's engine is the only one with rigorous Elo testing against Stockfish's calibrated strength modes. Even at the lowest time setting (2s/move), the ~1355 Elo already exceeds Chess84's self-assessment, and at 5s+ it's comfortably in the 1700-2100 range.

2.5 Chess Engine Architecture Philosophy

Hunter also explored the fundamental architectural question of eval/search compatibility through a thought experiment: what happens when you swap evaluation functions and search algorithms across engines?

Neural net eval + alpha-beta: The most problematic combination. Alpha-beta needs millions of cheap evaluations; a transformer-style network is expensive per eval. Search depth collapses. Stockfish's NNUE is the pragmatic point on this spectrum — a lightweight neural net designed to be fast enough for deep alpha-beta search.

Classical/NNUE eval + MCTS: MCTS was designed for expensive-but-high-quality evals. Classical eval is cheap but noisy — it benefits from depth, not selectivity. Plugging cheap eval into MCTS means being selective with evaluations that aren't accurate enough to justify selectivity.

The deeper principle: Eval and search co-evolved as point solutions in a tradeoff space. You can't swap one for the other. This understanding informed why the TI-84 CE engine uses a hand-crafted evaluation (cheap, benefits from depth via alpha-beta) rather than attempting to run a neural network on the eZ80.


3. Architecture

3.1 Engine/UI Separation

The chess engine is cleanly separated into a portable C engine (chess/engine/src/, 8 files) and a TI-84 CE GUI (chess/src/main.c). The engine.h API serves as the boundary — the GUI never touches internal board representation directly.

chess/
├── src/main.c              — GUI (320×240 LCD, sprite rendering, input)
├── engine/
│   ├── src/
│   │   ├── board.c/h       — 0x88 board, piece lists, make/unmake
│   │   ├── movegen.c/h     — Pseudo-legal move generation
│   │   ├── eval.c/h        — Texel-tuned HCE (PeSTO + features)
│   │   ├── search.c/h      — Negamax, PVS, LMR, null-move, killers
│   │   ├── tt.c/h          — 4096-entry transposition table (32KB)
│   │   ├── zobrist.c/h     — 24-bit Zobrist hashing with inline asm
│   │   ├── engine.c/h      — Public API for GUI integration
│   │   ├── book.c/h        — Polyglot opening book (multi-AppVar)
│   │   ├── types.h         — Core types, 0x88 constants, move encoding
│   │   ├── directions.h    — Knight/bishop/rook/king offset arrays
│   │   ├── chdata.h        — AppVar layout (sprites + randoms)
│   │   └── pick_best.asm   — Hand-written eZ80 assembly
│   ├── test/               — perft, search tests, integration tests, bench
│   ├── tuning/             — Texel tuning pipeline (Python)
│   ├── uci/                — UCI protocol for desktop play
│   └── Makefile            — Desktop builds (GCC)
├── bench/                  — On-calculator benchmark harness
├── emu_uci/                — Emulator UCI bridge for tournaments
├── tools/                  — Book generation, data packing
└── Makefile                — CE SDK build for CHESS.8xp

3.2 Dual-Target Compilation

The same engine source compiles for both:

  • TI-84 CE (CE C toolchain, eZ80 target, -Oz, -DPAWN_CACHE_SIZE=16 -DSPRITES_EXTERNAL)
  • Desktop (GCC/Clang, x86/ARM64 target, -O2, -DNO_BOOK)

This enables desktop benchmarking (fast iteration, Stockfish tournaments) while validating on the actual target via the cycle-accurate eZ80 emulator. A key finding: some optimizations help on eZ80 and hurt on desktop (e.g., sentinel rays: +1.3% on eZ80, -8.4% on desktop).

3.3 Memory Budget (~62KB of 256KB)

| Component | RAM (bytes) |
|---|---|
| Transposition table (4096 × 8) | 32,768 |
| Move pool (2048 × 5, SoA) | 10,240 |
| Pawn cache (32 entries) | ~5,000 |
| Piece sprites (12 × 402) | ~4,800 |
| Zobrist tables | ~3,200 |
| Lock tables | ~1,600 |
| Position history (256 × 3) | 768 |
| History heuristic (2 × 128 × 2) | 512 |
| board_t | ~440 |
| Killers (64 × 2 × 3) | 384 |
| UI state + misc | ~2,000 |
| Total | ~62KB |

BSS limit measured at 60,690 bytes. The remaining ~194KB is for TI-OS, stack (~4KB), and other overhead.


4. The Chess Engine Core

4.1 Board Representation (board.c/h)

0x88 board — the classic scheme where sq & 0x88 detects off-board squares:

  • squares[256]: The board array. Off-board indices filled with OFFBOARD (0xFF) sentinel, enabling sliding piece loops to terminate by checking for non-PIECE_NONE without per-step SQ_VALID(). The sentinel was extended from 128 to 256 bytes specifically for the eZ80, where the branch predictor doesn't exist and eliminating the SQ_VALID check saves cycles.
  • piece_list[2][16]: Square of each piece, per side. Max 16 pieces.
  • piece_index[128]: Reverse map from square to piece_list index (O(1) lookup). 0xFF if empty. Enables O(1) piece removal via swap-with-last.
  • bishop_count[2]: Tracked incrementally for bishop pair bonus.
  • king_sq[2]: Cached king squares for fast check detection.
  • Incrementally maintained scores: mg[2], eg[2] (PST+material per side), phase (24=opening, 0=endgame), hash (24-bit Zobrist), lock (16-bit TT verification), pawn_hash (separate pawn-only hash for pawn cache).
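The piece_list/piece_index pair above enables O(1) removal by swapping the last entry into the freed slot. A sketch of that mechanism for one side (names are illustrative):

```c
#include <stdint.h>

#define MAX_PIECES 16
#define EMPTY_IDX  0xFF

static uint8_t piece_list[MAX_PIECES]; /* slot -> square */
static uint8_t piece_count = 0;
static uint8_t piece_index[128];       /* square -> slot, 0xFF if empty */

static void piece_add(uint8_t sq) {
    piece_list[piece_count] = sq;
    piece_index[sq] = piece_count++;
}

/* O(1) removal: move the last entry into the freed slot, fix its
 * reverse-map entry, then mark the vacated square empty. */
static void piece_remove(uint8_t sq) {
    uint8_t slot = piece_index[sq];
    uint8_t last = piece_list[--piece_count];
    piece_list[slot]  = last;
    piece_index[last] = slot;
    piece_index[sq]   = EMPTY_IDX;
}
```

No scan of the list is ever required, which matters when captures happen at every node of the search.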

board_make() (~150 lines): The critical hot-path function. Incrementally updates hash, lock, pawn_hash, PST scores, phase, bishop_count, king_sq, castling rights, and EP square. Uses a castling_mask[128] lookup table — a 128-byte table where corner/king squares AND out the relevant castling bits.

board_unmake(): Reverses make by restoring saved state from undo_t struct (10-11 bytes: captured piece, castling, EP, halfmove, fullmove, hash, lock, pawn_hash).

4.2 Move Generation (movegen.c)

Pseudo-legal generation iterating the piece list (O(piece_count), not O(64)):

  • Staged generation: GEN_CAPTURES first, then GEN_QUIETS. Enables search to try captures before quiets without generating all moves.
  • is_square_attacked(): Pure query function. Checks knight (8 offsets), pawn (2 diagonals), king (8 adjacent), then bishop/queen and rook/queen sliding rays. Uses the OFFBOARD sentinel for loop termination.
  • Legality fast path: compute_legal_info() precomputes check, pin, and checker information by scanning rays from the king. move_needs_legality_check() returns 1 only for king moves, EP, pinned piece moves, and in-check moves — saving ~50% of expensive is_square_attacked calls.

4.3 Search (search.c)

Negamax with alpha-beta, iterative deepening, and these enhancements:

| Technique | Details |
|---|---|
| PVS | First move gets full window; others get null window [-alpha-1, -alpha], re-search on fail-high |
| Aspiration windows | ±25 from previous iteration's score, widen to full on fail |
| Null-move pruning | R=2 reduction, depth ≥ 3, not in check, non-pawn material required |
| LMR | After 4+ legal moves, depth ≥ 3, for quiet non-promotion moves: reduce by 1 |
| Futility pruning | Depth ≤ 2, static eval + margin ≤ alpha. Margins: 200 (d=1), 500 (d=2) |
| Check extensions | If in check: +1 depth. Limited to 2 per path |
| Quiescence | Max depth 8. Delta pruning (stand pat + 1100 < alpha). Captures only when not in check. |
| Move ordering | TT move (30000) > MVV-LVA captures (10000+) > Killer 1 (9000) > Killer 2 (8000) > Queen promos (+5000) > History heuristic |
| Killers | 2 per ply, FIFO replacement |
| History heuristic | Gravity-based: val += bonus - val * bonus / 16384, clamped to [-4000, 4000] |
| Repetition detection | Walks position history backward by 2, stops at last irreversible move |
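The gravity-based history update from the table can be written out directly; the 16384 scale and ±4000 clamp come straight from the description above:

```c
/* History gravity: the subtracted val*bonus/16384 term shrinks as the
 * entry approaches its limit, so values saturate smoothly instead of
 * overflowing the int16_t history table. */
#define HIST_LIMIT 4000
#define HIST_SCALE 16384

static int history_gravity(int val, int bonus) {
    val += bonus - val * bonus / HIST_SCALE;
    if (val >  HIST_LIMIT) val =  HIST_LIMIT;
    if (val < -HIST_LIMIT) val = -HIST_LIMIT;
    return val;
}
```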

Global move pool (SoA layout): pool_moves[2048] and pool_scores[2048]. Plies share this via a stack pointer (move_sp). This avoids ~2KB stack allocation per ply — critical on eZ80 with only ~4KB stack.

Stack overflow guard (eZ80 only): Reads the OS stack limit from port 0xE0003A, adds STACK_RESERVE=400 bytes. Each node checks the frame pointer via lea %0, ix+0 inline asm and bails to evaluate() if too close.

Move variance (for difficulty levels): After search completes, finds all root moves within N centipawns of best and picks one randomly. The PVS window at root is widened by variance + 1 to ensure near-best moves get accurate scores (fixing a critical bug where PVS null-window clipping made all non-best moves score identically).

pick_best_score(): Selection sort to find the highest-scored move. On eZ80, replaced by hand-written assembly (pick_best.asm) that's 2.8× faster than the compiler output. Uses a signed-to-unsigned comparison trick (XOR high byte with 0x80) and keeps the flipped best_score in a register permanently.
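The signed-to-unsigned comparison trick is worth seeing in C before imagining the assembly. XORing the sign bit maps signed order onto unsigned order (INT16_MIN becomes 0x0000, 0 becomes 0x8000, INT16_MAX becomes 0xFFFF), so the selection loop can compare with plain unsigned arithmetic. This is a C-level illustration, not the actual pick_best.asm:

```c
#include <stdint.h>

/* Flip the sign bit: signed comparison becomes unsigned comparison. */
static uint16_t flip_sign(int16_t score) {
    return (uint16_t)score ^ 0x8000u;
}

/* Selection step: index of the highest-scored move (n >= 1 assumed).
 * The asm version keeps the flipped best score in a register. */
static int pick_best_index(const int16_t *scores, int n) {
    uint16_t best = flip_sign(scores[0]);
    int best_i = 0;
    for (int i = 1; i < n; i++) {
        uint16_t f = flip_sign(scores[i]);
        if (f > best) { best = f; best_i = i; }
    }
    return best_i;
}
```

On the eZ80 this lets the inner loop compare the high bytes first with ordinary unsigned instructions, avoiding signed-overflow handling entirely.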

4.4 Evaluation (eval.c)

Texel-tuned PeSTO piece-square tables with Stockfish-inspired features:

  1. Material + PST (from incremental mg[WHITE] - mg[BLACK] and eg). Values: pawn mg=82/eg=94, knight 337/281, bishop 365/297, rook 477/512, queen 1025/936.
  2. Bishop pair: mg=19, eg=58.
  3. Tempo: mg=10, eg=9 (side-to-move bonus).
  4. Pawn structure (via pawn cache):
    • Doubled pawns: mg=-12, eg=-3
    • Isolated pawns: mg=-12, eg=-17
    • Connected pawns: rank-indexed bonus (0, 9, 10, 16, 39, 65)
    • Passed pawns: rank-indexed mg/eg bonus (up to 85/229)
  5. Rook bonuses: Open file (mg=38, eg=24), semi-open (mg=23, eg=11).
  6. Mobility: Knight (0-8 squares) and bishop (0-13 squares) with Stockfish classical values. Uses pawn attack bitmap from cache for safe-square counting.
  7. Pawn shield: mg=6 per friendly pawn in front of king.
  8. Tapered eval: score = (mg * phase + eg * (24 - phase)) / 24.
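The tapered blend in item 8 is a one-liner; with phase running from 24 (opening) down to 0 (endgame), the middlegame and endgame scores are linearly interpolated:

```c
/* Tapered eval: phase 24 = pure middlegame weight, 0 = pure endgame. */
static int tapered(int mg, int eg, int phase) {
    return (mg * phase + eg * (24 - phase)) / 24;
}
```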

Pawn cache: 4-way set-associative, 32 entries. Keyed by pawn_hash. Stores pawn mg/eg scores, per-file rank bitmasks, and a 128-byte pawn_atk[] bitmap (per-square pawn attack info). The pawn_atk array MUST be static, not stack-allocated — putting 128 bytes on the eZ80 stack pushes local variable offsets beyond the efficient IX+displacement range (signed 8-bit: -128 to +127), causing a 14% performance regression across all eval sections.

4.5 Transposition Table (tt.c)

4096 entries × 8 bytes = 32KB. Always-replace policy.

Entry: lock16 (2B, independent 16-bit verification), score (2B), best_move (2B, packed), depth (1B), flag (1B: EXACT/ALPHA/BETA).

Move packing: 6 bits from_sq64 + 6 bits to_sq64 + 1 promotion flag bit + 2 bits promo type = 15 bits, stored in a 16-bit field. Capture/EP/double-push flags are NOT stored — caller must verify legality.
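A sketch of that packing; the exact bit positions of the fields are my assumption for illustration, the real layout may order them differently:

```c
#include <stdint.h>

/* 6+6+1+2 = 15 bits packed into a uint16_t (field order assumed). */
static uint16_t tt_pack_move(unsigned from64, unsigned to64,
                             unsigned is_promo, unsigned promo_type) {
    return (uint16_t)((from64 & 0x3F)
                    | (to64 & 0x3F) << 6
                    | (is_promo & 1) << 12
                    | (promo_type & 3) << 13);
}

static unsigned tt_from(uint16_t m)  { return m & 0x3F; }
static unsigned tt_to(uint16_t m)    { return (m >> 6) & 0x3F; }
static unsigned tt_promo(uint16_t m) { return (m >> 12) & 1; }
static unsigned tt_ptype(uint16_t m) { return (m >> 13) & 3; }
```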

The 24-bit hash + 16-bit lock gives 40 bits total of collision resistance (1 in ~1 trillion false positive rate).

4.6 Zobrist Hashing (zobrist.c)

Tables stored as uint32_t (4-byte stride for fast array indexing), but all state variables use zhash_t = unsigned int = 24 bits on eZ80. This hybrid approach avoids both expensive __lxor library calls (from 32-bit state operations) and slow multiply-by-3 array strides (from 3-byte tables).

ZHASH_XOR inline assembly (eZ80 only): Memory-to-memory 3-byte XOR using ld a,(de); xor a,(hl); ld (hl),a repeated 3 times (~24 cycles). The compiler's library call __ixor costs ~45 cycles. A C macro approach was also tried but generated worse code (+20%) due to bad pointer arithmetic for uint8_t* casts.
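What those nine asm instructions compute is easiest to see as its C equivalent: a 3-byte memory-to-memory XOR of a table entry into the hash state (on the little-endian eZ80, the low 3 bytes of the 4-byte uint32_t table entry). This is an illustration of the operation, not the macro itself:

```c
#include <stdint.h>

/* C equivalent of ZHASH_XOR: XOR a 24-bit Zobrist table entry into a
 * 24-bit state variable, one byte at a time, exactly as the unrolled
 * ld a,(de) / xor a,(hl) / ld (hl),a sequence does. */
static void zhash_xor3(uint8_t *dst, const uint8_t *src) {
    dst[0] ^= src[0];
    dst[1] ^= src[1];
    dst[2] ^= src[2];
}
```

XOR being its own inverse is what makes incremental make/unmake hashing work: applying the same entry twice restores the original state.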

4.7 Opening Book (book.c)

Polyglot format stored in TI-OS AppVars (flash, max ~65KB each), split across multiple segments. Five tiers: S/M/L/XL/XXL with up to 99 numbered segments each. Auto-detected at runtime from largest to smallest.

Book entries are accessed directly from flash pointers — zero RAM cost. Binary search + weighted random selection with the position hash as seed.
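The weighted random selection step can be sketched as follows. The struct is a simplification (real Polyglot entries are 16 bytes: key, move, weight, learn), and the function name is illustrative; the engine's code will differ in detail:

```c
#include <stdint.h>

typedef struct {
    uint16_t move;
    uint16_t weight;   /* Polyglot entries carry a 16-bit weight */
} book_entry_t;

/* Pick an entry with probability proportional to its weight.
 * r must lie in [0, sum_of_weights); deriving r from the position
 * hash makes the choice reproducible for a given position. */
static int book_pick_weighted(const book_entry_t *e, int n, uint32_t r) {
    uint32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += e[i].weight;
        if (r < acc) return i;
    }
    return n - 1;      /* r out of range: fall back to last entry */
}
```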

The book trimming pipeline (tools/trim_book.py) uses depth-aware BFS: shallow positions (≤8 ply) get multiple alternatives for variety, deep positions get 1 move for maximum depth. Tournament testing showed book vs no-book is essentially equal at 6000 nodes (~1950 Elo both ways).


5. The GUI

chess/src/main.c renders on the 320×240 LCD using the CE C toolchain's graphx.h library:

  • Board: 208×208 pixels (8×26px squares), offset at (16, 16). 20-color indexed palette.
  • Double-buffered: gfx_SetDrawBuffer() / gfx_SwapDraw(). Partial redraw optimization for cursor-only moves.
  • Pre-rendered sprites: 12 sprites (6 types × 2 colors) with contour reinforcement for crisp edges at 20×20 pixels.
  • Piece animation: Linear interpolation, 50ms/square (max 200ms). Piece removed from board during animation, drawn at interpolated position each frame.
  • State machine: MENU → DIFFICULTY_SELECT → COLOR_SELECT → PLAYING → PROMOTION/GAMEOVER.
  • AI integration: Two-phase render — frame 1 draws "Thinking..." (so display updates), frame 2 calls engine_think() (blocking).
  • Board flipping: When playing as Black, all coordinates and D-pad directions are inverted.

6. Technical Tradeoffs & eZ80 Constraints

6.1 Why 0x88 Over Bitboards

The eZ80 is a 24-bit CPU with no SIMD. Bitboard operations (popcount, bitscan, 64-bit AND/OR/XOR) would require expensive multi-word library calls. The 0x88 scheme uses only 8-bit comparisons and additions — native eZ80 operations.

6.2 The 24-bit Integer Problem

On eZ80, unsigned int = 24 bits. This creates a unique optimization landscape:

| Type | eZ80 size | Speed |
|---|---|---|
| uint8_t | 1 byte | Native, fast |
| int / unsigned int | 3 bytes | Native, fast |
| int16_t | 2 bytes | Non-native, slow (masking/sign-extension) |
| int32_t / uint32_t | 4 bytes | Library calls (__lxor, __ladd) |

Rules discovered empirically:

  1. Widen register-bound locals to native int — beneficial
  2. Do NOT widen struct/array fields — harmful (increases memory traffic)
  3. int16_t for search scores in struct fields is okay; int for local variables during computation is better
  4. History table MUST stay int16_t (widening caused +44.8% moveorder regression)
  5. Array stride 4 (shift-by-2) is faster than stride 3 (multiply-by-3), so Zobrist tables stay uint32_t even though only 24 bits are used

6.3 -Oz vs -O2

-Oz (optimize for size) decisively beats -O2 on the eZ80: 7,053 cy/node vs 10,005. The larger -O2 binary causes flash cache misses, especially in move ordering (+103.9% slower). Code size dominates performance because the eZ80's flash has 10-cycle latency on cache misses (197 cycles for a full miss).

6.4 Sentinel-Based Sliding Rays

Expanding squares[] from 128 to 256 bytes, filling off-board indices with OFFBOARD (0xFF), lets sliding piece inner loops drop the SQ_VALID() check:

```c
// Before (two conditions per iteration):
while (SQ_VALID(target) && squares[target] == PIECE_NONE) target += dir;
// After (one condition per iteration):
while (squares[target] == PIECE_NONE) target += dir;
```

eZ80 result: -1.29% cy/node (the eZ80 has no branch predictor, so eliminating a branch saves cycles predictably).

Desktop result: +83% slower movegen, +109% slower is_square_attacked, -8.4% throughput. The larger board_t (448B vs 320B) hurts L1 cache utilization on modern CPUs where branches are essentially free.

This is the most striking example of platform-divergent optimization in the project.

6.5 Stack Limitations

The TI-OS provides only ~4KB of stack. The chess engine mitigates this with:

  • Global move pool instead of per-ply stack arrays (~2KB saved per ply)
  • Static pawn_atk[128] instead of stack-allocated (safe because evaluate() is never recursive)
  • board_t and moves[] declared static in benchmark harness
  • Runtime stack guard reading the OS stack limit register

6.6 Flash-Resident Data

Opening book entries and Polyglot random numbers are accessed directly from AppVar flash pointers — zero RAM cost. Piece sprites are also loaded from the CHDATA AppVar when SPRITES_EXTERNAL is defined, saving ~2400 bytes of code flash.


7. The Optimization Journey

The engine went through 13 variant repos tracking systematic optimization. Here's the complete narrative:

7.1 Phase 1: Engine Bootstrap (Feb 13)

Four commits in one evening built the entire engine from nothing:

  1. Move generation with 30/30 perft validation
  2. Search + eval + TT (negamax, alpha-beta, iterative deepening)
  3. UI integration
  4. Integration tests

7.2 Phase 2: Eval Features + Ablation (Feb 15)

Added 6 Stockfish-inspired evaluation features with compile-time #ifndef guards for ablation testing:

| Feature | Impact when removed (vs SF-1700) |
|---|---|
| Mobility | Worst — ~35% swing |
| Pawn structure | Moderate |
| Passed pawns | Moderate |
| Rook on files | Small |
| Pawn shield | Small |
| Tempo | Small |

7.3 Phase 3: Movegen Optimizations (Feb 15)

Five sequential optimizations, each measured on 5 positions × 1000 iterations:

| # | Optimization | Impact |
|---|---|---|
| 1 | O(1) piece-list updates via square index map | -5.1% perft |
| 2 | Staged movegen (captures then quiets) | Helps search pruning |
| 3 | Skip castling attack probes when in check | +6.7% movegen overhead (but saves attack calls in real games) |
| 4 | Precompute check/pin info | ~50% fewer legality checks |
| 5 | Track bishop counts incrementally | -5.9% eval |

7.4 Phase 4: Texel Tuning (+52 Elo) (Feb 16)

The single largest Elo gain. See Section 8 for full details.

7.5 Phase 5: Integer Type Optimization (Feb 16)

Widened hot-path score locals to native int (24-bit). Narrowed phase to uint8_t. Combined: -2.4% cy/node, -379 bytes binary. Widening struct fields was tried and reverted (-5.6% make/unmake regression).

7.6 Phase 6: Pawn Hash Cache (Feb 16-17)

Progressive refinement:

  • 2-way set-associative: -14.3% total eval cycles
  • 4-way set-associative: additional -6.6%
  • Qsearch capture-only fast path: -4.8%
  • Combined build_pawn_info + pawn attack bitmap: -28.5% eval, -15.7% cy/node overall

7.7 Phase 7: Cycle-Level Micro-Optimizations (Feb 17)

| Change | cy/node impact | Notes |
|---|---|---|
| Native 24-bit zhash_t (hybrid approach) | -0.8% | Tables stay uint32_t, state uses zhash_t |
| Sentinel sliding rays | -1.3% | eZ80-only win; desktop regression |
| Hand-written pick_best_score asm | -11.8% (in that function) | 2.8× faster than compiler |
| Inline asm ZHASH_XOR | -2.8% make/unmake | Memory-to-memory 3-byte XOR |
| SoA move pool | -1.1% | Eliminated pool_copy overhead |
| Eval loop merging (4→2 loops) | -4.5% profiling | ~1.1% real-world |
| sprintf/printf elimination | -7.3KB binary | Purely size reduction |

7.8 Optimizations Tried and Rejected

| Optimization | Result | Why |
|---|---|---|
| Division-free tapered eval | 0% | Compiler already uses multiply-by-reciprocal for /24 |
| 24-bit PRNG | Skipped | Both PRNGs called <2000 times per game |
| Eval deduplication (parameterized loops) | -2.3% | Prevents BIT instruction for constant color checks |
| Local pointer caching | 0% | LTO sees through the alias |
| Counter-move heuristic | -1% | Committed without testing, immediately reverted |
| Stack-allocated pawn_atk[128] | -14% | IX+displacement exceeds efficient signed 8-bit range |
| Pure zhash_t tables (3-byte stride) | -1.2% | multiply-by-3 slower than shift-by-2 |
| NNUE neural network eval | Not merged | Too expensive for 48MHz without SIMD |

7.9 NNUE Exploration (Shelved)

Architecture: 768→32→1 with SCReLU activation, int8 quantized weights (24,772 bytes). Implemented on feature/nnue branch but never merged. The 32-neuron hidden layer required 32 multiply-accumulate operations per eval call on a CPU with no SIMD, and the 24KB feature transformer weight matrix would dominate flash reads.

7.10 Final Performance Profile (100 positions, 1000 nodes each)

| Position category | cy/node | eval % | movegen % | make % |
|---|---|---|---|---|
| Positional middlegames | 470,211 | 38% | 21% | 13% |
| Tactical middlegames | 337,437 | 34% | 19% | 14% |
| Complex endgames | 272,686 | 29% | 16% | 16% |
| Simple endgames | 133,155 | 23% | 14% | 17% |

Average ~200K cy/node = ~2.7ms/node at 48MHz = ~370 NPS.


8. Texel Tuning Pipeline

8.1 Methodology

Texel tuning optimizes evaluation parameters by minimizing mean squared error between sigmoid(K × static_eval) and actual game outcomes:

MSE = mean( (sigmoid(K × eval) - label)² )

Where label is 1.0 (win), 0.5 (draw), 0.0 (loss) from side-to-move perspective.

8.2 Data Source

Lichess broadcast database (GM-level tournament games, not random Lichess games). 25 months from 2024-01 through 2026-01. Positions filtered: ply 12-100, not after capture, not in check. Two datasets: 60K (pilot) and 1M positions (production).

8.3 Optimization Algorithm

Adam gradient descent with analytical gradients (not scipy). Key settings:

  • Sigmoid slope K: found via golden-section 1D search (~0.00652)
  • Learning rate: 0.02
  • L2 regularization centered at scale=1.0 (pulling toward baseline)
  • 90/10 train/validation split
  • Best model selected by validation MSE

8.4 Parameters Tuned

Scale factors for 14 scalar params (bishop pair, tempo, doubled/isolated pawns, rook files, shield — each mg/eg), 12 PST row scales, rank-indexed pawn bonuses, and mobility bucket tables. Expanded from 14 to 108 parameters across rounds.

8.5 Results Across Rounds

| Round | Games | Elo vs baseline | Significant? |
|---|---|---|---|
| Round 1 (1M positions) | 4000 | +52 (±10) | Yes |
| Round 2 vs Round 1 | 4000 | -1.6 (±10) | No |
| Round 3 vs Round 1 | 4000 | -5.2 (±10) | No |
| Round 4 expanded (96 params) | 1000 | +11 (±20) | No |
| Round 4b constrained | 4000 | +0.1 (±10) | No |
| Round 5 (108 params, material split) | 4000 | +8 (±11) | Borderline |

Key takeaway: Round 1 delivered a clearly significant +52 Elo. All subsequent rounds produced marginal improvements. Further tuning was abandoned.

8.6 Application

apply_texel_params.py reads JSON output and patches eval.c using regex-based text substitution — replaces #define values and array initializers directly in C source.


9. Testing & Tournament Infrastructure

9.1 Test Suites

| Tool | Purpose | Key detail |
|---|---|---|
| perft.c | Move generation correctness | 7 standard + 22 edge-case positions |
| test_search.c | Search/eval correctness | 8 categories: mate finding, stalemate, draws, incremental eval consistency, 200-halfmove self-play |
| test_integration.c | Public API testing | 11 categories covering the full engine.h API |
| bench.c (desktop) | Component-level benchmarking | 100 positions from 14 standard suites, ns/call for each component |
| bench/ (on-calc) | eZ80 cycle profiling | Same 100 positions but with 48MHz hardware timer, per-section cycle breakdown |

9.2 Tournament Infrastructure

Desktop tournaments (engine/tournament.py): Compiles UCI engine, plays against Stockfish at configurable Elo levels (1320-2420, step 100). 20 games per level, 5 concurrent matches, 0.1s/move.

Emulator tournaments (chess/emu_tournament.py): Runs the actual eZ80 binary inside the cycle-accurate TI-84 CE emulator against Stockfish. The key innovation:

  1. Reads the EMUUCI.8xp binary
  2. Finds the @@CMDSLOT@@ sentinel (512 bytes)
  3. Patches each position's FEN + settings into the slot
  4. Recomputes .8xp file checksum
  5. Launches the emulator subprocess
  6. Parses "MOVE xxxx" from debug output
  7. Feeds the move to Stockfish via UCI
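Step 4 (recomputing the checksum) follows from the TI-8x linking format: as commonly documented, the trailing 2-byte checksum is the lower 16 bits of the byte sum over the variable data section (everything after the 55-byte header). The header length and function name here are stated as assumptions about that format, not taken from the project's code:

```c
#include <stdint.h>
#include <stddef.h>

/* TI-8x (.8xp/.8xv) checksum: low 16 bits of the sum of all bytes in
 * the data section. After patching bytes inside @@CMDSLOT@@, the file
 * is only valid again once this value is recomputed. */
static uint16_t ti8x_checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return (uint16_t)sum;
}
```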

This measures the engine at its true eZ80 calculation speed (~370 NPS), not desktop speed.

Ablation testing (engine/ablation.py): Plays the full engine against versions compiled with -DNO_<FEATURE> to measure each eval feature's contribution.

9.3 Elo Results

| Platform | Config | Elo | Method |
|---|---|---|---|
| eZ80 | Easy (900ms, var=10) | ~1320 | 30 games vs SF-1320 |
| eZ80 | Medium (3s) | ~1442 | 30 games vs SF-1500 |
| eZ80 | Hard (9s) | ~1712 | 30 games vs SF-1700 |
| eZ80 | Expert (13.5s) | ~1958 | 30 games vs SF-1900 |
| eZ80 | Master (27s) | ~2083 | 100 games vs SF-2100 |
| Desktop | 0.01s/move | ~2583 | 100 games vs SF-2600 |
| Desktop | 0.1s/move | ~2700 | 100+189 games vs SF-2600/2700 |

10. Major Bugs & Debugging Stories

11.1 The Queen Hang Bug (PVS Scoring)

Symptom: Engine played Qxa2, hanging the queen to Rxa2.

Root cause: PVS null-window search at root scored terrible moves identically to the best. When move_variance > 0, the engine randomly selected among "equally scored" root moves — but PVS clips all non-PV moves to the same alpha value, making Qxa2 (losing the queen) look as good as the best move.

Investigation: Spanned multiple sessions. First confirmed the bug exists on desktop (not eZ80-specific). Quiescence correctly saw Rxa2, but PVS window clipping masked the true score at every search depth (1-8).

Fix attempts:

  • Full window at root: catastrophic — -708 Elo (PVS exists for a reason)
  • Wider PVS window by variance amount: worked. pvs_floor = alpha - variance instead of alpha

Variance sweep: var=0: 0 Elo loss. var=5: 0 Elo loss (sweet spot). var=10: -147 Elo. var=15: catastrophic.

10.2 The 0-30 Tournament Disaster

Symptom: After implementing partial depth reporting, engine went from competitive to 0-30 in tournaments.

Root cause: search_best_root_move was being used to override result.best_move at timeout, but if no root move was scored in the current iteration, the fallback was the first generated move (a2a3 or h7h6), not the PV move.

Fix: Added search_root_move_scored flag. Documented in MEMORY.md as a "Dangerous Pattern."

10.3 Check Extension + TT Ordering Bug (PR #15)

Symptom: Engine played Be7 (losing to mate-in-2) instead of finding the defense.

Root cause: Check extensions increase depth, so the TT caches results at the extended depth. On the next iteration, the TT probe returns the cached result BEFORE the check extension can trigger, locking the engine to depth 1 in check positions.

Discovery: Hunter noticed the bad move and directed Claude to investigate. Claude initially got the chess analysis wrong; Hunter corrected: "there's a mate-in-2 with both moves being check."

10.4 The Timer Failure Cascade

Symptom: Late-endgame tournament positions terminated with emu_error.

Root cause chain: check_time() checked the node limit FIRST with early return, preventing the time check from running. With time_fn = NULL and search_deadline = 0, search never stopped on time. Fell back to 30K node limit, which at ~300 NPS took ~96 seconds, exceeding the emulator timeout.

Fix: Added search_node_deadline as a hard safety fallback (~1 node/ms with 3× headroom). Also made search_stopped volatile (LTO was optimizing away reads).

10.5 The Book That Never Worked

Multiple rounds of debugging:

  1. Missing CHBKRN.8xv (shared randoms file)
  2. Tier mismatch: baked large book but compiled with BOOK=medium
  3. const char * pointer arrays misbehaved on the 24-bit eZ80 target
  4. After fixing all of the above, still failed on the custom emulator — worked fine on CEmu reference emulator
  5. Root cause: custom emulator's VAT reconstruction didn't properly expose AppVars

11. AI Agent Orchestration & Usage Stats

11.1 Session Data

| Metric | Value |
| --- | --- |
| Main sessions | 19 session directories |
| Main JSONL files | 1 file, 2.2 MB |
| Subagent files | 174 files, ~18 MB |
| Compact summaries | 71 files |
| Date range | Feb 10 – Mar 27, 2026 |

11.2 Human-AI Collaboration

Hunter directed every architectural and strategic decision; the AI implemented code per his direction. Key aspects of Hunter's leadership:

  • Architectural decisions: Chose 0x88 board representation (over bitboards — impractical on 24-bit CPU), Polyglot book format, the phased engine development approach (movegen → search/eval → UI integration → tests)
  • Benchmarking methodology: Designed the empirical approach — insisted on eZ80-only benchmarks ("desktop benchmarks are pointless"), established the 100-position benchmark suite as the single source of truth, required every optimization to be measured before committing
  • Quality enforcement: Caught the AI committing code without testing ("yes i want to actually test things before making commits"), demanded accurate numbers, rejected over-engineered explanations
  • Strategic direction: Explored NNUE as a potential approach (Hunter's idea, not the AI's), evaluated the tradeoffs, and made the call to shelve it. Designed the difficulty/variance system concept. Directed the Texel tuning pipeline.
  • Course corrections: Caught hallucinated performance ratios ("100,000× slower" — actual: ~14,000×), corrected chess analysis errors (the AI got a mate-in-2 analysis wrong), stopped the AI from overclaiming ("let's not say chess is solved")

Specific instances where Hunter corrected the AI:

  • AI committed counter-move heuristic without benchmarking — Hunter enforced the test-first rule
  • AI overcomplicated explanations of benchmark paradoxes — Hunter cut through the noise
  • AI hallucinated performance ratios ("100,000× slower") — Hunter caught it, actual ratio was ~14,000×
  • AI misattributed NNUE exploration as "AI proposed, human rejected" — Hunter corrected: it was his idea to explore, his evaluation to shelve
  • AI miscalculated emulator tournament timing ("15-30 minutes each") — Hunter corrected: games run at emulator speed, not real-time
  • AI initially got the check extension bug chess analysis wrong — Hunter corrected: "there's a mate-in-2 with both moves being check"

11.3 Key Methodology

Hunter's development loop: implement → benchmark on 100 positions → commit or revert. Every optimization has a measured cy/node or Elo impact. The RESULTS.md file serves as the single source of truth. The AI executes this loop; Hunter reviews the results and decides what ships.


12. Development Timeline

| Date | Day | Key Achievement |
| --- | --- | --- |
| Feb 10 | 1 | Pong game complete |
| Feb 11 | 2 | Sudoku game (PR #1) |
| Feb 12 | 3 | Chess UI only (PR #2) |
| Feb 13 | 4 | Full chess engine — movegen, search, eval, TT, integration |
| Feb 15 | 6 | Eval features + ablation. Movegen optimizations. Opening book. |
| Feb 16 | 7 | Texel tuning (+52 Elo). Integer type optimization. Pawn cache. |
| Feb 17 | 8 | 5 PRs in one day. Sentinel rays, difficulty system, PVS bug fix, tournament system. |
| Feb 18 | 9 | Emulator tournament system. CHDATA AppVar consolidation. |
| Feb 19 | 10 | Hand-written eZ80 assembly (-11.8% pick_best). SoA move pool. |
| Feb 20-25 | 11-16 | Check extension bug fix. Threefold repetition. Final polish. |

Total: 111 commits, 15 PRs (14 merged), 16 days. From nothing to ~2083 Elo on a calculator.


13. Key Files Reference

Engine Core

| File | Lines | Purpose |
| --- | --- | --- |
| engine/src/search.c | ~800 | Negamax, PVS, LMR, null-move, quiescence, iterative deepening |
| engine/src/board.c | ~600 | 0x88 board, piece lists, make/unmake |
| engine/src/eval.c | ~500 | Texel-tuned HCE, pawn cache, mobility, tapered eval |
| engine/src/movegen.c | ~400 | Pseudo-legal generation, attack detection, legality fast path |
| engine/src/book.c | ~350 | Polyglot book, multi-AppVar, binary search, weighted selection |
| engine/src/engine.c | ~350 | Public API, game status, difficulty configuration |
| engine/src/zobrist.c | ~150 | Zobrist tables, inline asm XOR, PRNG |
| engine/src/tt.c | ~100 | 4096-entry TT, move packing, always-replace |
| engine/src/pick_best.asm | ~60 | Hand-written eZ80 assembly for move scoring |
| src/main.c | ~1200 | GUI: rendering, input, animation, state machine |
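For a sense of scale, a 4096-entry always-replace table in the shape tt.c describes fits in a few tens of kilobytes. A hedged sketch follows; the field layout and hash width here are assumptions for illustration, not the project's actual struct:

```c
#include <stdint.h>
#include <stddef.h>

#define TT_SIZE 4096  /* power of two: index by masking the low hash bits */

typedef struct {
    uint32_t key;    /* verification hash (width is an assumption) */
    int16_t  score;
    uint16_t move;   /* packed move: from/to/promotion bits */
    uint8_t  depth;
    uint8_t  flag;   /* exact / lower bound / upper bound */
} TTEntry;

TTEntry tt[TT_SIZE];

/* Always-replace: no aging or depth-preferred buckets; the newest
 * result simply overwrites whatever occupies the slot. */
void tt_store(uint32_t hash, int16_t score, uint16_t move,
              uint8_t depth, uint8_t flag)
{
    TTEntry *e = &tt[hash & (TT_SIZE - 1)];
    e->key = hash; e->score = score; e->move = move;
    e->depth = depth; e->flag = flag;
}

/* Probe returns NULL on a key mismatch (different position, same slot). */
const TTEntry *tt_probe(uint32_t hash)
{
    const TTEntry *e = &tt[hash & (TT_SIZE - 1)];
    return e->key == hash ? e : NULL;
}
```

Always-replace trades hit rate for the simplest possible store path, a sensible choice when the whole table must fit alongside the GUI in 256KB of RAM.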

Testing & Tournaments

| File | Purpose |
| --- | --- |
| engine/test/bench.c (25KB) | 100-position desktop benchmark |
| bench/src/main.c | On-calculator cycle profiling benchmark |
| bench/RESULTS.md | Complete optimization history with measurements |
| engine/test/perft.c | 29-position move generation validation |
| engine/test/test_search.c | Search correctness (8 categories) |
| engine/test/test_integration.c | Full API integration tests (11 categories) |
| engine/tournament.py | Desktop Stockfish tournaments |
| emu_tournament.py | Emulator-based eZ80 tournaments |
| engine/ablation.py | Eval feature ablation testing |

Texel Tuning

| File | Purpose |
| --- | --- |
| engine/tuning/texel_tune.py (32KB) | Adam gradient descent on 1M positions |
| engine/tuning/build_texel_dataset.py | Lichess broadcast PGN → CSV dataset |
| engine/tuning/apply_texel_params.py | JSON results → C source patching |