CE-Games Chess Engine — Deep Technical Profile

Build timeline — ~10 active days across 4 phases (Feb 10 – Feb 25, 2026)
Pre-chess scaffold (3 days) — Pong with AI, Sudoku, chess UI shell, CLAUDE.md
Core engine build-out (3 days) — movegen + perft, search/eval/TT, Polyglot book, PVS/aspiration/futility, staged movegen, sprite pieces
Eval tuning + optimization + tournaments (5 days) — Texel pipeline, pawn hash cache, sentinel ray optimization, SoA move pool, hand-written eZ80 asm, emu_uci tournament harness (~2083 Elo), difficulty levels, threefold repetition
Final polish (2 days, after 2-day gap) — check extension + TT ordering bug fix, README

Project Overview

Libraries & Frameworks

Pre-Implementation Research & Planning
Architecture

Chess Engine Concepts — Quick Glossary

The Chess Engine Core
The GUI
Technical Tradeoffs & eZ80 Constraints
The Optimization Journey
Texel Tuning Pipeline
Testing & Tournament Infrastructure
Major Bugs & Debugging Stories
AI Agent Orchestration & Usage Stats
Development Timeline
Key Files Reference

1. Project Overview

A chess engine for the TI-84 Plus CE graphing calculator (Zilog eZ80 @ 48 MHz, 256KB RAM, 3.5MB Flash), claiming to be the strongest chess engine ever built for this platform. The engine achieves ~2083 Elo at its highest difficulty (27s/move) and ~2700 Elo when cross-compiled for desktop.

By the numbers:

111 commits across 15 PRs, built in 16 days (Feb 10–25, 2026)
8 core engine C files + 1 hand-written eZ80 assembly file
13 iteration/variant repos tracking the optimization journey
~370 NPS on the eZ80 vs ~5.1M NPS on desktop ARM64 (14,000× slower)
~62KB RAM usage out of 256KB available
Texel tuning on 1M Lichess positions: +52 Elo (single largest gain)
100-position benchmark suite with cycle-accurate profiling
Automated tournament system running the actual eZ80 binary vs Stockfish

The project also includes Pong (AI opponent, 5 levels) and Sudoku (3 difficulties, pencil marks, undo), but chess dominates the engineering effort.

Libraries & Frameworks

Engine core (`chess/engine/`, C targeting eZ80)

CE C/C++ Toolchain (v12+) — cross-compiler, linker, and stdlib targeting the TI-84 Plus CE's eZ80; every Makefile pulls cedev-config --makefile.
Standard C headers only (stdlib.h, string.h, stdio.h, math.h, sys/timers.h) — no third-party C deps.
Hand-written eZ80 assembly — engine/src/pick_best.asm, inline asm in zobrist.c for XOR tables.

Desktop/native engine build

GNU Make — top-level and per-target Makefiles; native targets include perft, uci, test-search, test-integration, bench, eval-fen, ablation.

TI-84 Plus CE runtime libraries (pulled in as `.8xv` AppVars)

GraphX — 8-bit palette graphics + 60 FPS double-buffering for the on-device UI.
KeypadC — calculator key scanning.
FileIOC — flash AppVar access (used for the multi-AppVar Polyglot opening book).
FontLibC — text rendering.
LibLoad — dynamic .8xv library loader.

Tuning & tournament scripts (`engine/tuning/`, `scripts/`, Python 3)

python-chess — PGN parsing for the Texel dataset, UCI subprocess driver for tournament play, board representation everywhere.
NumPy — gradient math in texel_tune.py (Adam-style descent on ~1M positions).
zstd CLI — decompresses .pgn.zst Lichess broadcast dumps during dataset build.
Stockfish (external UCI binary) — opponent in tournament.py, emu_tournament.py, and ablation_vs_sf.py.

Emulator-in-the-loop testing

hunterchen7/ti84ce — the sibling Rust emulator from this portfolio, run as a subprocess by emu_tournament.py to measure Elo on cycle-accurate hardware. A custom emu_uci eZ80 bridge is patched at the 512-byte @@CMDSLOT@@ sentinel so FEN commands reach the engine.

Not used

No NNUE, no PyTorch/TensorFlow, no neural eval — evaluation is hand-crafted Texel-tuned PeSTO-derived piece-square tables with tapered middlegame/endgame scoring.

2. Pre-Implementation Research & Planning

Before any code was written, I conducted extensive research across multiple Claude.ai conversations that produced the hardware constraints reference document, the full 16-section architecture plan, the Elo measurement methodology, and the competitive landscape analysis. This research phase is what made the 16-day implementation sprint possible.

2.1 Hardware Constraints Research (originated from a DOOM porting investigation)

The foundational platform knowledge actually originated from a separate investigation into porting real DOOM (the actual id Software source, not a raycaster recreation) to the TI-84 CE. That research produced a comprehensive hardware constraints document that was then reused as the starting context for the chess engine planning.

Key hardware facts established through this research:

CPU: eZ80 @ 48 MHz, 8-bit ALU, 24-bit addressing (ADL mode). int is 24 bits (3 bytes), not 32. Range ±8,388,607. No hardware multiply — multiplication synthesized from shifts and adds, ~350 cycles for 32×32. Effective throughput for 32-bit-heavy code: ~5-15 MIPS.
RAM partitioning (~133 KB usable): NOT a flat arena. userMem ~65 KB (where .8xp programs load), BSS+Heap ~60 KB shared (static globals and malloc growing toward each other), Stack ~4 KB, OS/libraries ~17 KB. Programs exceeding ~65 KB auto-split by toolchain into companion AppVars.
Flash: 4 MB total, ~3.5 MB usable. Code CANNOT execute from flash (copied to RAM first), but data CAN be read freely. Archived AppVars accessed via ti_GetDataPtr() return direct flash pointers — zero RAM cost. Post-2019 hardware has a flash cache; sequential reads can outperform RAM because RAM competes with LCD DMA bandwidth.
Display: 320×240 pixels, 8-bit indexed color, VRAM memory-mapped at 0xD40000.
Toolchain: CE C/C++ toolchain (LLVM-based). Compile with -Oz for size.

The key implication I identified for chess: opening books and lookup tables go to flash as archived AppVars (essentially free), search/eval code must fit in ~65 KB userMem, transposition table competes with board state and move lists in the ~60 KB heap, and the int = 24-bit gotcha means bitboards require explicit uint64_t with expensive multi-byte operations.

2.2 The 16-Section Architecture Plan

I then conducted a full architecture planning session, working through every major design decision from first principles before implementation began. This produced the chess/TI84_Chess_Engine_Plan.md document.

Why 0x88 over bitboards — the first and most fundamental decision. Bitboards require 64-bit integers; every 64-bit operation on the eZ80 compiles into multi-byte library calls costing dozens of instructions. 0x88 uses a 128-element array where off-board detection is a single AND: if (square & 0x88). Direction arithmetic is just adding offsets. All operations are byte-width — perfect for the 8-bit ALU. I worked through the full index layout and confirmed all coordinate math was native byte operations.

Piece encoding settled on single-byte: type in bits 0-2 (PAWN=1 through KING=6), color in bit 7 (WHITE=0x00, BLACK=0x80). Translation from the Umy int8_t format is a small helper.

Board state worked out to ~430 bytes: squares[128], piece_list[2][16], king_sq[2] (avoids scanning for king during check detection), Zobrist hash, castling rights (4 bits packed), en passant target. Pre-computed castling_rights[128] flash table maps each square to a mask.

Move pool with stack pointer — instead of allocating per-ply move arrays (which would blow the 4 KB stack), a single global pool: pool_moves[2048] + pool_scores[2048] with a stack pointer tracking allocation. Each ply pushes/pops from the pool.

eZ80 optimization notes from planning: Minimize function parameters (eZ80 passes all args on stack, no register calling convention), make search-state variables static globals rather than recursion parameters, use uint8_t/int8_t aggressively since byte-wide operations are the native sweet spot.

Realistic Elo ceiling discussion: With heroic optimization (hand-tuned assembly, Texel tuning, large opening book), the ceiling is ~1800-2000 Elo. The hard wall is search depth — at 2000+ Elo, decisive positions need 12-15 plies, and even at 10K NPS with perfect pruning you can't reach that depth in complex middlegames. Historical validation: best 1980s commercial chess computers (Mephisto MM IV, Fidelity Mach IV) topped out around 2000-2100 on similar effective throughput. The final measured result of ~2083 Elo at 27s/move confirms my pre-implementation estimate was accurate.

2.3 Elo Measurement Methodology

I separately researched how to defensibly measure the engine's Elo. The investigation pulled in Marco Meloni's empirical data (110,000+ games at various node limits, calibrated against the SSDF human rating scale) and established the methodology:

Stockfish's UCI_LimitStrength mode was chosen over raw time throttling. UCI_Elo ranges from 1320-3190, calibrated at 120s+1s, anchored to CCRL 40/4 ±100 Elo. It works by intentionally selecting suboptimal moves with calibrated randomness — much more human-like than just limiting time. Below skill level 10 (~2596 Elo), the rating is hardware/time-control independent.

This methodology makes the engine's Elo claims defensible: anyone can reproduce the results by running the same Stockfish UCI_LimitStrength matches. The qualifier "based on testing against calibrated Stockfish" makes it a fact rather than a subjective claim.

2.4 Competitive Landscape Analysis

I researched existing TI-84 CE chess engines to validate the "strongest engine" claim:

ChessCE (MateoConLechuga, 2016-2017): ~40K downloads, built on TSCP (~1724 desktop Elo). Known bugs — crashes on checkmate, no draw mechanics. Lacks LMR, futility pruning, most modern search techniques.
Chess84 (thewarrenjames, June 2025): Most recent competitor, implements many of the same search techniques but author estimates "half the time like an 800, half like a 1200." Tiny opening book (6 hand-curated lines vs my 131K Polyglot entries). No formal Elo testing.
PAndaContron/ce-chess: Minimal project, 5 commits, not a serious contender.

The decisive differentiator: my engine is the only one with rigorous Elo testing against Stockfish's calibrated strength modes. Even at the lowest time setting (2s/move), the ~1355 Elo already exceeds Chess84's self-assessment, and at 5s+ it's comfortably in the 1700-2100 range.

2.5 Chess Engine Architecture Philosophy

I also explored the fundamental architectural question of eval/search compatibility through a thought experiment: what happens when you swap evaluation functions and search algorithms across engines?

Neural net eval + alpha-beta: The most problematic combination. Alpha-beta needs millions of cheap evaluations; a transformer-style network is expensive per eval. Search depth collapses. Stockfish's NNUE is the pragmatic point on this spectrum — a lightweight neural net designed to be fast enough for deep alpha-beta search.

Classical/NNUE eval + MCTS: MCTS was designed for expensive-but-high-quality evals. Classical eval is cheap but noisy — it benefits from depth, not selectivity. Plugging cheap eval into MCTS means being selective with evaluations that aren't accurate enough to justify selectivity.

The deeper principle: Eval and search co-evolved as point solutions in a tradeoff space. You can't swap one for the other. This understanding informed why the TI-84 CE engine uses a hand-crafted evaluation (cheap, benefits from depth via alpha-beta) rather than attempting to run a neural network on the eZ80.

3. Architecture

3.1 Engine/UI Separation

The chess engine is cleanly separated into a portable C engine (chess/engine/src/, 8 files) and a TI-84 CE GUI (chess/src/main.c). The engine.h API serves as the boundary — the GUI never touches internal board representation directly.

chess/
├── src/main.c              — GUI (320×240 LCD, sprite rendering, input)
├── engine/
│   ├── src/
│   │   ├── board.c/h       — 0x88 board, piece lists, make/unmake
│   │   ├── movegen.c/h     — Pseudo-legal move generation
│   │   ├── eval.c/h        — Texel-tuned HCE (PeSTO + features)
│   │   ├── search.c/h      — Negamax, PVS, LMR, null-move, killers
│   │   ├── tt.c/h          — 4096-entry transposition table (32KB)
│   │   ├── zobrist.c/h     — 24-bit Zobrist hashing with inline asm
│   │   ├── engine.c/h      — Public API for GUI integration
│   │   ├── book.c/h        — Polyglot opening book (multi-AppVar)
│   │   ├── types.h         — Core types, 0x88 constants, move encoding
│   │   ├── directions.h    — Knight/bishop/rook/king offset arrays
│   │   ├── chdata.h        — AppVar layout (sprites + randoms)
│   │   └── pick_best.asm   — Hand-written eZ80 assembly
│   ├── test/               — perft, search tests, integration tests, bench
│   ├── tuning/             — Texel tuning pipeline (Python)
│   ├── uci/                — UCI protocol for desktop play
│   └── Makefile            — Desktop builds (GCC)
├── bench/                  — On-calculator benchmark harness
├── emu_uci/                — Emulator UCI bridge for tournaments
├── tools/                  — Book generation, data packing
└── Makefile                — CE SDK build for CHESS.8xp

3.2 Dual-Target Compilation

The same engine source compiles for both:

TI-84 CE (CE C toolchain, eZ80 target, -Oz, -DPAWN_CACHE_SIZE=16 -DSPRITES_EXTERNAL)
Desktop (GCC/Clang, x86/ARM64 target, -O2, -DNO_BOOK)

This enables desktop benchmarking (fast iteration, Stockfish tournaments) while validating on the actual target via the cycle-accurate eZ80 emulator. A key finding: some optimizations help on eZ80 and hurt on desktop (e.g., sentinel rays: +1.3% on eZ80, -8.4% on desktop).

3.3 Memory Budget (~62KB of 256KB)

Component	RAM (bytes)
Transposition table (4096 × 8)	32,768
Move pool (2048 × 5, SoA)	10,240
Pawn cache (32 entries)	~5,000
Piece sprites (12 × 402)	~4,800
Zobrist tables	~3,200
Lock tables	~1,600
Position history (256 × 3)	768
History heuristic (2 × 128 × 2)	512
`board_t`	~440
Killers (64 × 2 × 3)	384
UI state + misc	~2,000
Total	~62KB

BSS limit measured at 60,690 bytes. The remaining ~194KB is for TI-OS, stack (~4KB), and other overhead.

Chess Engine Concepts — Quick Glossary

Reference for the terms used throughout this section. Each is a standard technique in computer chess; the interesting part is usually the eZ80-specific tradeoff, not the concept itself.

Board representation

0x88 board: A 128-square array where off-board squares have bit 0x88 set. Checking sq & 0x88 == 0 tests "is this square on the board?" in one AND + compare, no bounds check. Used since the 1970s; still the right choice when you don't have fast 64-bit bitwise ops.
Bitboards: The modern alternative — represent the board as 64-bit integers, one bit per square. Great on 64-bit CPUs with hardware popcount and bitscan. Impractical on the eZ80 (24-bit, no SIMD, 64-bit ops require expensive multi-word library calls).
Piece lists: Per-side arrays of where each piece lives. Avoids scanning all 64 squares when you need "where are the white knights?"
Mailbox: The generic term for any array-of-squares board representation. 0x88 is a specific mailbox variant.

Hashing

Zobrist hashing: Precompute a random 64-bit (or 24-bit on eZ80) key for every (piece, square) combination, plus keys for side-to-move, castling rights, en passant square. The board's hash is the XOR of all currently-applicable keys. Incrementally updated on every move: XOR out the old piece's key, XOR in the new. Two positions that look different but hash to the same key are (probabilistically) the same position — enables transposition tables.
Transposition table (TT): A hash table keyed on the Zobrist hash. Stores the search result for each position (score, best move, depth searched, bound type). When the search hits a position it's already seen (same position reached via a different move order = "transposition"), it can reuse the prior result instead of re-searching. Typical hit rates are 15-30%.

Evaluation

Piece-Square Tables (PST): A 64-entry score table per (piece type, game phase). Captures "knight on e5 is worth more than knight on a1." Incrementally maintained during move-making for zero-cost evaluation of the positional component.
PeSTO: Ronald Friederich's publicly-available PST set, tuned via Texel's method against a large database of positions. Widely used as a baseline because the tables are high-quality and free.
Tapered evaluation: Separate middlegame (mg) and endgame (eg) scores, interpolated by game phase (a scalar 0-24 based on remaining non-pawn material). Lets the eval say "a rook on the 7th rank is worth a lot in the middlegame, less in the endgame" smoothly.
Texel tuning: A method for automatically fitting eval parameters. Take millions of positions with known game outcomes, compute the eval's predicted result via a sigmoid, minimize mean-squared error between prediction and outcome. Named after Peter Österlund's Texel engine.

Search

Negamax: The symmetric form of minimax — since chess is zero-sum, min(a,b) = -max(-a,-b), so you only implement one function with sign flipping. Every modern engine uses negamax.
Alpha-beta pruning: The classic search optimization. Track upper/lower bounds (alpha, beta); prune any branch that can't improve on the best known line. Reduces effective branching factor from ~35 to ~6 with good move ordering.
Iterative deepening: Search to depth 1, then 2, then 3, ... instead of jumping straight to the target depth. Gives you a best move at every depth (graceful time-up), and the shallower searches populate the TT with good move ordering for the deeper searches.
PVS (Principal Variation Search): Assume the first move (from TT or iterative deepening) is best; search it with a full window. For every other move, search with a zero-width "null window" — if it fails high, you know the move is good and re-search it properly. Saves time when your move ordering is good (which it usually is, thanks to TT + killers + history).
Aspiration windows: Instead of searching with (-∞, +∞) every iteration, search a narrow window around the previous iteration's score (e.g., ±25 centipawns). If the result falls inside, you just saved time. If it falls outside ("fails high/low"), re-search with a wider window. Net win because most positions' scores don't swing dramatically between depths.
Quiescence search: Once you hit your target depth, don't just return the static eval — keep searching captures (and sometimes checks) until the position is "quiet." Prevents the horizon effect where the engine "stops looking" one move before getting its queen taken.
Delta pruning: Inside quiescence, skip captures that can't possibly raise the score enough to matter (e.g., capturing a pawn won't save a position where you're down a rook). Standard formulation: if stand_pat + captured_piece_value + margin < alpha, skip the capture.

Move ordering (makes alpha-beta effective)

TT move: Try the best move from the previous TT hit first. If you searched this position before and found move X was best, it's probably still best.
MVV-LVA: "Most Valuable Victim, Least Valuable Aggressor." Try capturing moves in order of victim_value - attacker_value (capturing a queen with a pawn > capturing a pawn with a queen). Crude but extremely fast.
Killer moves: A 2-slot cache per search depth of "quiet moves that caused beta cutoffs at this depth." Empirical observation: a move that refuted one line often refutes sibling lines at the same depth.
History heuristic: Counter table indexed by (from_sq, to_sq) or (piece, to_sq). Increment when a move causes a beta cutoff. Encodes "this move works well in general."

Pruning/reductions

Null-move pruning: At each node, try "passing the turn" (a null move). If the opponent still can't improve their position below some bound, the current position is so good we can prune. Saves massive time in quiet positions. Has edge cases — breaks in zugzwang (positions where not moving helps), so it's disabled at low material counts.
LMR (Late Move Reductions): For moves that aren't the first few tried (so probably not the best), reduce search depth by 1-2 ply. If the reduced search beats alpha, re-search at full depth. Assumes good move ordering so late moves are genuinely worse. Massive elo gain per line of code.
Check extensions: Extend depth by 1 when giving check. Tactics tend to involve forcing sequences, so you don't want to stop searching in the middle of a check.

eZ80-specific terms

Sentinel ray: Sliding piece move generation (bishop, rook, queen) normally loops until it goes off the board. Instead of checking SQ_VALID(sq) at each step, extend the board array to 256 bytes with OFFBOARD = 0xFF filling the unused indices. The loop terminates by checking squares[sq] != OFFBOARD — same check as "is this square empty or enemy piece?", so you get the off-board test for free. Works on eZ80 where branch prediction doesn't exist; can be a regression on modern CPUs where the extra array size hurts cache.
int = 24-bit: The eZ80's native word size is 24-bit. sizeof(int) is 3 bytes. Array strides of 4 (shift-by-2) are faster than strides of 3 (multiply-by-3), which drives some weird layout choices — e.g., Zobrist tables stay uint32_t even though only 24 bits are used.

4. The Chess Engine Core

4.1 Board Representation (board.c/h)

0x88 board — the classic scheme where sq & 0x88 detects off-board squares:

squares[256]: The board array. Off-board indices filled with OFFBOARD (0xFF) sentinel, enabling sliding piece loops to terminate by checking for non-PIECE_NONE without per-step SQ_VALID(). The sentinel was extended from 128 to 256 bytes specifically for the eZ80, where the branch predictor doesn't exist and eliminating the SQ_VALID check saves cycles.
piece_list[2][16]: Square of each piece, per side. Max 16 pieces.
piece_index[128]: Reverse map from square to piece_list index (O(1) lookup). 0xFF if empty. Enables O(1) piece removal via swap-with-last.
bishop_count[2]: Tracked incrementally for bishop pair bonus.
king_sq[2]: Cached king squares for fast check detection.
Incrementally maintained scores: mg[2], eg[2] (PST+material per side), phase (24=opening, 0=endgame), hash (24-bit Zobrist), lock (16-bit TT verification), pawn_hash (separate pawn-only hash for pawn cache).

board_make() (~150 lines): The critical hot-path function. Incrementally updates hash, lock, pawn_hash, PST scores, phase, bishop_count, king_sq, castling rights, and EP square. Uses a castling_mask[128] lookup table — a 128-byte table where corner/king squares AND out the relevant castling bits.

board_unmake(): Reverses make by restoring saved state from undo_t struct (10-11 bytes: captured piece, castling, EP, halfmove, fullmove, hash, lock, pawn_hash).

4.2 Move Generation (movegen.c)

Pseudo-legal generation iterating the piece list (O(piece_count), not O(64)):

Staged generation: GEN_CAPTURES first, then GEN_QUIETS. Enables search to try captures before quiets without generating all moves.
is_square_attacked(): Pure query function. Checks knight (8 offsets), pawn (2 diagonals), king (8 adjacent), then bishop/queen and rook/queen sliding rays. Uses the OFFBOARD sentinel for loop termination.
Legality fast path: compute_legal_info() precomputes check, pin, and checker information by scanning rays from the king. move_needs_legality_check() returns 1 only for king moves, EP, pinned piece moves, and in-check moves — saving ~50% of expensive is_square_attacked calls.

4.3 Search (search.c)

Negamax with alpha-beta, iterative deepening, and these enhancements:

Technique	Details
PVS	First move gets full window; others get null window `[-alpha-1, -alpha]`, re-search on fail-high
Aspiration windows	±25 from previous iteration's score, widen to full on fail
Null-move pruning	R=2 reduction, depth ≥ 3, not in check, non-pawn material required
LMR	After 4+ legal moves, depth ≥ 3, for quiet non-promotion moves: reduce by 1
Futility pruning	Depth ≤ 2, static eval + margin ≤ alpha. Margins: 200 (d=1), 500 (d=2)
Check extensions	If in check: +1 depth. Limited to 2 per path
Quiescence	Max depth 8. Delta pruning (stand pat + 1100 < alpha). Captures only when not in check.
Move ordering	TT move (30000) > MVV-LVA captures (10000+) > Killer 1 (9000) > Killer 2 (8000) > Queen promos (+5000) > History heuristic
Killers	2 per ply, FIFO replacement
History heuristic	Gravity-based: `val += bonus - val * bonus / 16384`, clamped to [-4000, 4000]
Repetition detection	Walks position history backward by 2, stops at last irreversible move

Global move pool (SoA layout): pool_moves[2048] and pool_scores[2048]. Plies share this via a stack pointer (move_sp). This avoids ~2KB stack allocation per ply — critical on eZ80 with only ~4KB stack.

Stack overflow guard (eZ80 only): Reads the OS stack limit from port 0xE0003A, adds STACK_RESERVE=400 bytes. Each node checks the frame pointer via lea %0, ix+0 inline asm and bails to evaluate() if too close.

Move variance (for difficulty levels): After search completes, finds all root moves within N centipawns of best and picks one randomly. The PVS window at root is widened by variance + 1 to ensure near-best moves get accurate scores (fixing a critical bug where PVS null-window clipping made all non-best moves score identically).

pick_best_score(): Selection sort to find the highest-scored move. On eZ80, replaced by hand-written assembly (pick_best.asm) that's 2.8× faster than the compiler output. Uses a signed-to-unsigned comparison trick (XOR high byte with 0x80) and keeps the flipped best_score in a register permanently.

4.4 Evaluation (eval.c)

Texel-tuned PeSTO piece-square tables with Stockfish-inspired features:

Material + PST (from incremental mg[WHITE] - mg[BLACK] and eg). Values: pawn mg=82/eg=94, knight 337/281, bishop 365/297, rook 477/512, queen 1025/936.
Bishop pair: mg=19, eg=58.
Tempo: mg=10, eg=9 (side-to-move bonus).
Pawn structure (via pawn cache):
- Doubled pawns: mg=-12, eg=-3
- Isolated pawns: mg=-12, eg=-17
- Connected pawns: rank-indexed bonus (0, 9, 10, 16, 39, 65)
- Passed pawns: rank-indexed mg/eg bonus (up to 85/229)
Rook bonuses: Open file (mg=38, eg=24), semi-open (mg=23, eg=11).
Mobility: Knight (0-8 squares) and bishop (0-13 squares) with Stockfish classical values. Uses pawn attack bitmap from cache for safe-square counting.
Pawn shield: mg=6 per friendly pawn in front of king.
Tapered eval: score = (mg * phase + eg * (24 - phase)) / 24.

Pawn cache: 4-way set-associative, 32 entries. Keyed by pawn_hash. Stores pawn mg/eg scores, per-file rank bitmasks, and a 128-byte pawn_atk[] bitmap (per-square pawn attack info). The pawn_atk array MUST be static, not stack-allocated — putting 128 bytes on the eZ80 stack pushes local variable offsets beyond the efficient IX+displacement range (signed 8-bit: -128 to +127), causing a 14% performance regression across all eval sections.

4.5 Transposition Table (tt.c)

4096 entries × 8 bytes = 32KB. Always-replace policy.

Entry: lock16 (2B, independent 16-bit verification), score (2B), best_move (2B, packed), depth (1B), flag (1B: EXACT/ALPHA/BETA).

Move packing: 6 bits from_sq64 + 6 bits to_sq64 + 1 bit promotion + 2 bits promo type = 16 bits. Capture/EP/double-push flags are NOT stored — caller must verify legality.

The 24-bit hash + 16-bit lock gives 40 bits total of collision resistance (1 in ~1 trillion false positive rate).

4.6 Zobrist Hashing (zobrist.c)

Tables stored as uint32_t (4-byte stride for fast array indexing), but all state variables use zhash_t = unsigned int = 24 bits on eZ80. This hybrid approach avoids both expensive __lxor library calls (from 32-bit state operations) and slow multiply-by-3 array strides (from 3-byte tables).

ZHASH_XOR inline assembly (eZ80 only): Memory-to-memory 3-byte XOR using ld a,(de); xor a,(hl); ld (hl),a repeated 3 times (~24 cycles). The compiler's library call __ixor costs ~45 cycles. A C macro approach was also tried but generated worse code (+20%) due to bad pointer arithmetic for uint8_t* casts.

4.7 Opening Book (book.c)

Polyglot format stored in TI-OS AppVars (flash, max ~65KB each), split across multiple segments. Five tiers: S/M/L/XL/XXL with up to 99 numbered segments each. Auto-detected at runtime from largest to smallest.

Book entries are accessed directly from flash pointers — zero RAM cost. Binary search + weighted random selection with the position hash as seed.

The book trimming pipeline (tools/trim_book.py) uses depth-aware BFS: shallow positions (≤8 ply) get multiple alternatives for variety, deep positions get 1 move for maximum depth. Tournament testing showed book vs no-book is essentially equal at 6000 nodes (~1950 Elo both ways).

5. The GUI

chess/src/main.c renders on the 320×240 LCD using the CE C toolchain's graphx.h library:

Board: 208×208 pixels (8×26px squares), offset at (16, 16). 20-color indexed palette.
Double-buffered: gfx_SetDrawBuffer() / gfx_SwapDraw(). Partial redraw optimization for cursor-only moves.
Pre-rendered sprites: 12 sprites (6 types × 2 colors) with contour reinforcement for crisp edges at 20×20 pixels.
Piece animation: Linear interpolation, 50ms/square (max 200ms). Piece removed from board during animation, drawn at interpolated position each frame.
State machine: MENU → DIFFICULTY_SELECT → COLOR_SELECT → PLAYING → PROMOTION/GAMEOVER.
AI integration: Two-phase render — frame 1 draws "Thinking..." (so display updates), frame 2 calls engine_think() (blocking).
Board flipping: When playing as Black, all coordinates and D-pad directions are inverted.

6. Technical Tradeoffs & eZ80 Constraints

6.1 Why 0x88 Over Bitboards

The eZ80 is a 24-bit CPU with no SIMD. Bitboard operations (popcount, bitscan, 64-bit AND/OR/XOR) would require expensive multi-word library calls. The 0x88 scheme uses only 8-bit comparisons and additions — native eZ80 operations.

6.2 The 24-bit Integer Problem

On eZ80, unsigned int = 24 bits. This creates a unique optimization landscape:

Type	eZ80 size	Speed
`uint8_t`	1 byte	Native, fast
`int` / `unsigned int`	3 bytes	Native, fast
`int16_t`	2 bytes	Non-native, slow (masking/sign-extension)
`int32_t` / `uint32_t`	4 bytes	Library calls (`__lxor`, `__ladd`)

Rules discovered empirically:

Widen register-bound locals to native int — beneficial
Do NOT widen struct/array fields — harmful (increases memory traffic)
int16_t for search scores in struct fields is okay; int for local variables during computation is better
History table MUST stay int16_t (widening caused +44.8% moveorder regression)
Array stride 4 (shift-by-2) is faster than stride 3 (multiply-by-3), so Zobrist tables stay uint32_t even though only 24 bits are used

6.3 -Oz vs -O2

-Oz (optimize for size) decisively beats -O2 on the eZ80: 7,053 cy/node vs 10,005. The larger -O2 binary causes flash cache misses, especially in move ordering (+103.9% slower). Code size dominates performance because the eZ80's flash has 10-cycle latency on cache misses (197 cycles for a full miss).

6.4 Sentinel-Based Sliding Rays

Expanding squares[] from 128 to 256 bytes, filling off-board indices with OFFBOARD (0xFF), lets sliding piece inner loops drop the SQ_VALID() check:

// Before (two conditions per iteration):
while (SQ_VALID(target) && squares[target] == PIECE_NONE) target += dir;
// After (one condition per iteration):
while (squares[target] == PIECE_NONE) target += dir;

eZ80 result: -1.29% cy/node (the eZ80 has no branch predictor, so eliminating a branch saves cycles predictably).

Desktop result: +83% slower movegen, +109% slower is_square_attacked, -8.4% throughput. The larger board_t (448B vs 320B) hurts L1 cache utilization on modern CPUs where branches are essentially free.

This is the most striking example of platform-divergent optimization in the project.

6.5 Stack Limitations

The TI-OS provides only ~4KB of stack. The chess engine mitigates this with:

Global move pool instead of per-ply stack arrays (~2KB saved per ply)
Static pawn_atk[128] instead of stack-allocated (safe because evaluate() is never recursive)
board_t and moves[] declared static in benchmark harness
Runtime stack guard reading the OS stack limit register

6.6 Flash-Resident Data

Opening book entries and Polyglot random numbers are accessed directly from AppVar flash pointers — zero RAM cost. Piece sprites are also loaded from the CHDATA AppVar when SPRITES_EXTERNAL is defined, saving ~2400 bytes of code flash.

6.7 Strategic / "why do this at all" tradeoffs

Why a new engine when ChessCE exists — ChessCE (~40K downloads, TSCP-based, ~1200–1500 Elo) already occupies the "there's a chess program on your calculator" slot. A new one only makes sense as a demonstrable-best artifact — a hard-to-fake "strongest chess engine ever for this platform" claim that functions as portfolio evidence of engineering judgment, not as a product competing for downloads. You give up the easy "ship something useful" win for a defensible ceiling whose value is in the story of how it was built. Inferred.
Why target a 48 MHz eZ80 in 2026 — the eZ80 is strategically valuable because it's hostile: no branch predictor, no SIMD, no 64-bit ops, 24-bit int, ~60 KB usable RAM. That hostility makes optimization work visible — every textbook technique has to be re-evaluated from first principles. A modern target would mean ~5M NPS for free, reducing the project to a Stockfish-feature-list reimplementation. Inferred.
Chasing Elo at a ~370 NPS ceiling instead of shipping a weak-but-simple engine — most users can't tell 1500 from 2000 Elo. Choosing to crowd the ceiling is where the interesting engineering lives: at weak targets you can ignore cache layout, integer width, and hand-written asm; at the ceiling every one of those is load-bearing. The last ~300 Elo cost roughly as much engineering as the first 1500 — maximum returns in portfolio depth, minimum returns in user-visible quality. Inferred.
UCI_LimitStrength against Stockfish as the validation methodology — self-reported Elo from in-house play is unfalsifiable and the standard critique of hobbyist engine claims. Anchoring to Stockfish's CCRL-calibrated skill levels converts the claim from subjective to reproducible. The cost: absolute numbers have caveats (engine-pool Elo vs FIDE, calibrated randomness vs true strength throttling) — but relative comparisons against ChessCE/Chess84 under the same methodology become defensible, which is the only claim that matters. Evidenced in doc §2.4.
Eval and search co-evolved as a locked joint tradeoff — the NNUE investigation surfaced a general principle: eval and search aren't independent, they're point solutions in a joint space. Cheap/noisy eval wants depth (alpha-beta); expensive/accurate eval wants selectivity (MCTS). On eZ80 only one quadrant is physically reachable, so every downstream decision (24-bit Zobrist, PST-only updates, futility margins) is locked in upstream. Evidenced in §2 + NNUE shelving.

6.8 Additional architectural tradeoffs worth naming

Global move pool + static search state instead of per-ply recursion frames — one 2048-entry pool, shared stack pointer, search globals not passed as parameters. TI-OS gives ~4 KB of stack and the eZ80 ABI passes all arguments on the stack with no register calling convention, so every recursion parameter is a memory write+read. Textbook negamax with per-ply move_t[256] arrays blows the budget in ~8 plies. Cost: no reentrancy, no thread-safety — irrelevant on a single-core eZ80. Evidenced in planning notes.
24-bit Zobrist hash + independent 16-bit "lock," not a single 64-bit key — 64-bit XOR on eZ80 is a ~45-cycle library call; 24-bit XOR is ~24 cycles inline asm. Bare 24-bit would collide at 1-in-16M (unacceptable for a TT probed millions of times); splitting into 24+16 pays the cheap XOR twice and lands at ~1-in-10¹² practical collision rate, matching full Zobrist. Inferred from zobrist.c + tt.c.
Pseudo-legal movegen + fast-path legality filter — only invokes is_square_attacked on king moves, EP, pinned pieces, or moves made in check (~50% of calls elided). Fully-legal movegen would need bitboard-style pin rays (expensive without 64-bit ops); naive pseudo-legal + test-on-every-move wastes a make/unmake per illegal move. Hybrid is the right point on the eZ80 cost curve. Evidenced in movegen.c.
Dual-target build (eZ80 + desktop GCC) as first-class infrastructure — same source compiles both; Stockfish tournaments and Texel's 1M-position pipeline would be infeasible at 370 NPS, but running ablations on desktop catches eZ80-only regressions only if you also validate on target (which is what the emu_uci bridge exists for — runs the actual .8xp binary in a cycle-accurate emulator). Evidence this matters: sentinel rays gained +1.3% on eZ80 but regressed -8.4% on desktop. Desktop alone would have shipped the wrong decision. Evidenced in ablation table.
Texel tuning on Lichess broadcast games, not general Lichess games — 1M positions from GM-level tournament broadcasts, filtered to plies 12–100, not-after-capture, not-in-check. Random Lichess is polluted by blunders at every rating; the eval would learn to exploit patterns that don't exist in strong play. Filter rules exclude positions where tactical search dominates static eval, since Texel is fitting the quiet eval. Inferred from data pipeline.

6.9 Code-level tradeoffs visible in the source

Two independent Makefiles sharing engine/src/, not one CMake with target detection — chess/Makefile (cedev/eZ80, -Oz, links asm, includes book.c) and chess/engine/Makefile (gcc, -O2, -DNO_BOOK) both pull from the same source list. -DNO_BOOK with #ifndef NO_BOOK bracketing is the minimum possible coupling — the desktop toolchain never needs fileioc.h. CMake + guarded compilation would have been more "correct" but invited a marriage of cedev and standard gcc configs that neither tool is happy about. Cost: two build systems to keep in sync. Evidenced.
Ablation as a build-matrix target, not a runtime flag — make ablation compiles uci_full + uci_no_<feat> binaries via -DNO_$$feat for each feature (tempo, pawns, mobility, etc.). A runtime setoption flag would have been simpler but leaves dead branches in the binary and muddies per-feature cost measurement. Compile-time dead-code elimination means each variant measures true feature cost. Cost: N binaries to keep in sync. Evidenced in engine/Makefile.
emu_uci as a binary-patching bridge, not a socket or virtual serial port — emu_uci/src/main.c declares static char cmd_slot[512] = "@@CMDSLOT@@";. emu_tournament.py finds the sentinel, writes command args + FEN in-place, recomputes the 8xp checksum, and invokes the emulator once per move. A persistent emulator with piped UCI would have preserved TT across moves, but required either emulator mods or a virtualized serial transport. One-shot-per-move is the cheapest protocol on top of a stateless debug harness — validates Elo of the actual target binary, not throughput. Evidenced.
Public engine API uses board[8][8] row/col; internals use 0x88 — engine.h exposes a 2D-grid shape to the UI and emu_uci FEN parser; types.h keeps the 0x88 indexing with OFFBOARD=0xFF sentinel internal. Exposing 0x88 directly would have saved one conversion, but the UI thinks in row/col naturally and keeping 0x88 out of the public API means internals can change indexing without breaking callers. Evidenced.
Hand-written pick_best.asm as a Makefile source, not __asm__ inline — chess/Makefile lists EXTRA_ASM_SOURCES = engine/src/pick_best.asm; desktop build has no asm at all and the C fallback compiles in. Inline __asm__ inside an #ifdef __ez80__ would have kept the two implementations in one file. Separate translation units mean the desktop link simply ignores the asm and the C version wins by default — no preprocessor goo. Evidenced.
Per-harness main() test binaries, not a framework — engine/test/ has seven standalone .c files (perft, bench, eval_fen, diag_qxa2, …) each a Makefile target. CTest/Unity/Criterion driving one binary would have been more "professional." Each test is a single concern and tournament logic wants Python + python-chess + Stockfish's UCI_LimitStrength, so the language boundary lands where concerns diverge naturally. Evidenced.

7. The Optimization Journey

The engine went through 13 variant repos tracking systematic optimization. Here's the complete narrative:

7.1 Phase 1: Engine Bootstrap (Feb 13)

Four commits in one evening built the entire engine from nothing:

Move generation with 30/30 perft validation
Search + eval + TT (negamax, alpha-beta, iterative deepening)
UI integration
Integration tests

7.2 Phase 2: Eval Features + Ablation (Feb 15)

Added 6 Stockfish-inspired evaluation features with compile-time #ifndef guards for ablation testing:

Feature	Impact when removed (vs SF-1700)
Mobility	Worst — ~35% swing
Pawn structure	Moderate
Passed pawns	Moderate
Rook on files	Small
Pawn shield	Small
Tempo	Small

7.3 Phase 3: Movegen Optimizations (Feb 15)

Five sequential optimizations, each measured on 5 positions × 1000 iterations:

#	Optimization	Impact
1	O(1) piece-list updates via square index map	-5.1% perft
2	Staged movegen (captures then quiets)	Helps search pruning
3	Skip castling attack probes when in check	+6.7% movegen overhead (but saves attack calls in real games)
4	Precompute check/pin info	~50% fewer legality checks
5	Track bishop counts incrementally	-5.9% eval

7.4 Phase 4: Texel Tuning (+52 Elo) (Feb 16)

The single largest Elo gain. See §8 (Texel Tuning Pipeline) for full details.

7.5 Phase 5: Integer Type Optimization (Feb 16)

Widened hot-path score locals to native int (24-bit). Narrowed phase to uint8_t. Combined: -2.4% cy/node, -379 bytes binary. Widening struct fields was tried and reverted (-5.6% make/unmake regression).

7.6 Phase 6: Pawn Hash Cache (Feb 16-17)

Progressive refinement:

2-way set-associative: -14.3% total eval cycles
4-way set-associative: additional -6.6%
Qsearch capture-only fast path: -4.8%
Combined build_pawn_info + pawn attack bitmap: -28.5% eval, -15.7% cy/node overall

7.7 Phase 7: Cycle-Level Micro-Optimizations (Feb 17)

Change	cy/node Impact	Notes
Native 24-bit `zhash_t` (hybrid approach)	-0.8%	Tables stay uint32_t, state uses zhash_t
Sentinel sliding rays	-1.3%	eZ80-only win; desktop regression
Hand-written `pick_best_score` asm	-11.8% (in that function)	2.8× faster than compiler
Inline asm `ZHASH_XOR`	-2.8% make/unmake	Memory-to-memory 3-byte XOR
SoA move pool	-1.1%	Eliminated pool_copy overhead
Eval loop merging (4→2 loops)	-4.5% profiling	~1.1% real-world
`sprintf`/`printf` elimination	-7.3KB binary	Purely size reduction

7.8 Optimizations Tried and Rejected

Optimization	Result	Why
Division-free tapered eval	0%	Compiler already uses multiply-by-reciprocal for `/24`
24-bit PRNG	Skipped	Both PRNGs called <2000 times per game
Eval deduplication (parameterized loops)	-2.3%	Prevents BIT instruction for constant color checks
Local pointer caching	0%	LTO sees through the alias
Counter-move heuristic	-1%	Committed without testing, immediately reverted
Stack-allocated pawn_atk[128]	-14%	IX+displacement exceeds efficient signed 8-bit range
Pure zhash_t tables (3-byte stride)	-1.2%	multiply-by-3 slower than shift-by-2
NNUE neural network eval	Not merged	Too expensive for 48MHz without SIMD

7.9 NNUE Exploration (Shelved)

Architecture: 768→32→1 with SCReLU activation, int8 quantized weights (24,772 bytes). Implemented on feature/nnue branch but never merged. The 32-neuron hidden layer required 32 multiply-accumulate operations per eval call on a CPU with no SIMD, and the 24KB feature transformer weight matrix would dominate flash reads.

7.10 Final Performance Profile (100 positions, 1000 nodes each)

Position Category	cy/node	eval%	movegen%	make%
Positional middlegames	470,211	38%	21%	13%
Tactical middlegames	337,437	34%	19%	14%
Complex endgames	272,686	29%	16%	16%
Simple endgames	133,155	23%	14%	17%

Average ~200K cy/node = ~2.7ms/node at 48MHz = ~370 NPS.

8. Texel Tuning Pipeline

8.1 Methodology

Texel tuning optimizes evaluation parameters by minimizing mean squared error between sigmoid(K × static_eval) and actual game outcomes:

MSE = mean( (sigmoid(K × eval) - label)² )

Where label is 1.0 (win), 0.5 (draw), 0.0 (loss) from side-to-move perspective.

8.2 Data Source

Lichess broadcast database (GM-level tournament games, not random Lichess games). 25 months from 2024-01 through 2026-01. Positions filtered: ply 12-100, not after capture, not in check. Two datasets: 60K (pilot) and 1M positions (production).

8.3 Optimization Algorithm

Adam gradient descent with analytical gradients (not scipy). Key settings:

Sigmoid slope K: found via golden-section 1D search (~0.00652)
Learning rate: 0.02
L2 regularization centered at scale=1.0 (pulling toward baseline)
90/10 train/validation split
Best model selected by validation MSE

8.4 Parameters Tuned

Scale factors for 14 scalar params (bishop pair, tempo, doubled/isolated pawns, rook files, shield — each mg/eg), 12 PST row scales, rank-indexed pawn bonuses, and mobility bucket tables. Expanded from 14 to 108 parameters across rounds.

8.5 Results Across Rounds

Round	Games	Elo vs Baseline	Significant?
Round 1 (1M positions)	4000	+52 (±10)	Yes
Round 2 vs Round 1	4000	-1.6 (±10)	No
Round 3 vs Round 1	4000	-5.2 (±10)	No
Round 4 expanded (96 params)	1000	+11 (±20)	No
Round 4b constrained	4000	+0.1 (±10)	No
Round 5 (108 params, material split)	4000	+8 (±11)	Borderline

Key takeaway: Round 1 delivered a clearly significant +52 Elo. All subsequent rounds produced marginal improvements. Further tuning was abandoned.

8.6 Application

apply_texel_params.py reads JSON output and patches eval.c using regex-based text substitution — replaces #define values and array initializers directly in C source.

9. Testing & Tournament Infrastructure

9.1 Test Suites

Tool	Purpose	Key detail
`perft.c`	Move generation correctness	7 standard + 22 edge-case positions
`test_search.c`	Search/eval correctness	8 categories: mate finding, stalemate, draws, incremental eval consistency, 200-halfmove self-play
`test_integration.c`	Public API testing	11 categories covering the full `engine.h` API
`bench.c` (desktop)	Component-level benchmarking	100 positions from 14 standard suites, ns/call for each component
`bench/` (on-calc)	eZ80 cycle profiling	Same 100 positions but with 48MHz hardware timer, per-section cycle breakdown

9.2 Tournament Infrastructure

Desktop tournaments (engine/tournament.py): Compiles UCI engine, plays against Stockfish at configurable Elo levels (1320-2420, step 100). 20 games per level, 5 concurrent matches, 0.1s/move.

Emulator tournaments (chess/emu_tournament.py): Runs the actual eZ80 binary inside the cycle-accurate TI-84 CE emulator against Stockfish. The key innovation:

Reads the EMUUCI.8xp binary
Finds the @@CMDSLOT@@ sentinel (512 bytes)
Patches each position's FEN + settings into the slot
Recomputes .8xp file checksum
Launches the emulator subprocess
Parses "MOVE xxxx" from debug output
Feeds the move to Stockfish via UCI

This measures the engine at its true eZ80 calculation speed (~370 NPS), not desktop speed.

Ablation testing (engine/ablation.py): Plays the full engine against versions compiled with -DNO_<FEATURE> to measure each eval feature's contribution.

9.3 Elo Results

Platform	Config	Elo	Method
eZ80	Easy (900ms, var=10)	~1320	30 games vs SF-1320
eZ80	Medium (3s)	~1442	30 games vs SF-1500
eZ80	Hard (9s)	~1712	30 games vs SF-1700
eZ80	Expert (13.5s)	~1958	30 games vs SF-1900
eZ80	Master (27s)	~2083	100 games vs SF-2100
Desktop	0.01s/move	~2583	100 games vs SF-2600
Desktop	0.1s/move	~2700	100+189 games vs SF-2600/2700

10. Major Bugs & Debugging Stories

10.1 The Queen Hang Bug (PVS Scoring)

Symptom: Engine played Qxa2, hanging the queen to Rxa2.

Root cause: PVS null-window search at root scored terrible moves identically to the best. When move_variance > 0, the engine randomly selected among "equally scored" root moves — but PVS clips all non-PV moves to the same alpha value, making Qxa2 (losing the queen) look as good as the best move.

Investigation: Spanned multiple sessions. First confirmed the bug exists on desktop (not eZ80-specific). Quiescence correctly saw Rxa2, but PVS window clipping masked the true score at every search depth (1-8).

Fix attempts:

Full window at root: catastrophic — -708 Elo (PVS exists for a reason)
Wider PVS window by variance amount: worked. pvs_floor = alpha - variance instead of alpha

Variance sweep: var=0: 0 Elo loss. var=5: 0 Elo loss (sweet spot). var=10: -147 Elo. var=15: catastrophic.

10.2 The 0-30 Tournament Disaster

Symptom: After implementing partial depth reporting, engine went from competitive to 0-30 in tournaments.

Root cause: search_best_root_move was being used to override result.best_move at timeout, but if no root move was scored in the current iteration, the fallback was the first generated move (a2a3 or h7h6), not the PV move.

Fix: Added search_root_move_scored flag. Documented in MEMORY.md as a "Dangerous Pattern."

10.3 Check Extension + TT Ordering Bug (PR #15)

Symptom: Engine played Be7 (losing to mate-in-2) instead of finding the defense.

Root cause: Check extensions increase depth, so the TT caches results at the extended depth. On the next iteration, the TT probe returns the cached result BEFORE the check extension can trigger, locking the engine to depth 1 in check positions.

Discovery: I noticed the bad move and directed Claude to investigate. Claude initially got the chess analysis wrong; I corrected: "there's a mate-in-2 with both moves being check."

10.4 The Timer Failure Cascade

Symptom: Late-endgame tournament positions terminated with emu_error.

Root cause chain: check_time() checked the node limit FIRST with early return, preventing the time check from running. With time_fn = NULL and search_deadline = 0, search never stopped on time. Fell back to 30K node limit, which at ~300 NPS took ~96 seconds, exceeding the emulator timeout.

Fix: Added search_node_deadline as a hard safety fallback (~1 node/ms with 3× headroom). Also made search_stopped volatile (LTO was optimizing away reads).

10.5 The Book That Never Worked

Multiple rounds of debugging:

Missing CHBKRN.8xv (shared randoms file)
Tier mismatch: baked large book but compiled with BOOK=medium
const char * pointer arrays had incorrect behavior on eZ80 24-bit target
After fixing all of the above, still failed on the custom emulator — worked fine on CEmu reference emulator
Root cause: custom emulator's VAT reconstruction didn't properly expose AppVars

11. AI Agent Orchestration & Usage Stats

11.1 Session Data

Metric	Value
Main sessions	19 session directories
Main JSONL files	1 file, 2.2 MB
Subagent files	174 files, ~18 MB
Compact summaries	71 files
Date range	Feb 10 – Mar 27, 2026

11.2 Human-AI Collaboration

I directed every architectural and strategic decision; the AI implemented code per my direction. Key aspects of my leadership:

Architectural decisions: Chose 0x88 board representation (over bitboards — impractical on 24-bit CPU), Polyglot book format, the phased engine development approach (movegen → search/eval → UI integration → tests)
Benchmarking methodology: Designed the empirical approach — insisted on eZ80-only benchmarks ("desktop benchmarks are pointless"), established the 100-position benchmark suite as the single source of truth, required every optimization to be measured before committing
Quality enforcement: Caught the AI committing code without testing ("yes i want to actually test things before making commits"), demanded accurate numbers, rejected over-engineered explanations
Strategic direction: Explored NNUE as a potential approach (my idea, not the AI's), evaluated the tradeoffs, and made the call to shelve it. Designed the difficulty/variance system concept. Directed the Texel tuning pipeline.
Course corrections: Caught hallucinated performance ratios ("100,000× slower" — actual: ~14,000×), corrected chess analysis errors (the AI got a mate-in-2 analysis wrong), stopped the AI from overclaiming ("let's not say chess is solved")

Specific instances where I corrected the AI:

AI committed counter-move heuristic without benchmarking — I enforced the test-first rule
AI overcomplicated explanations of benchmark paradoxes — I cut through the noise
AI hallucinated performance ratios ("100,000× slower") — I caught it, actual ratio was ~14,000×
AI misattributed NNUE exploration as "AI proposed, human rejected" — I corrected: it was my idea to explore, my evaluation to shelve
AI miscalculated emulator tournament timing ("15-30 minutes each") — I corrected: games run at emulator speed, not real-time
AI initially got the check extension bug chess analysis wrong — I corrected: "there's a mate-in-2 with both moves being check"

11.3 Key Methodology

my development loop: implement → benchmark on 100 positions → commit or revert. Every optimization has a measured cy/node or Elo impact. The RESULTS.md file serves as the single source of truth. The AI executes this loop; I reviews the results and decides what ships.

12. Development Timeline

Date	Day	Key Achievement
Feb 10	1	Pong game complete
Feb 11	2	Sudoku game (PR #1)
Feb 12	3	Chess UI only (PR #2)
Feb 13	4	Full chess engine — movegen, search, eval, TT, integration
Feb 15	6	Eval features + ablation. Movegen optimizations. Opening book.
Feb 16	7	Texel tuning (+52 Elo). Integer type optimization. Pawn cache.
Feb 17	8	5 PRs in one day. Sentinel rays, difficulty system, PVS bug fix, tournament system.
Feb 18	9	Emulator tournament system. CHDATA AppVar consolidation.
Feb 19	10	Hand-written eZ80 assembly (-11.8% pick_best). SoA move pool.
Feb 20-25	11-16	Check extension bug fix. Threefold repetition. Final polish.

Total: 111 commits, 15 PRs (14 merged), 16 days. From nothing to ~2083 Elo on a calculator.

13. Key Files Reference

Engine Core

File	Lines	Purpose
`engine/src/search.c`	~800	Negamax, PVS, LMR, null-move, quiescence, iterative deepening
`engine/src/board.c`	~600	0x88 board, piece lists, make/unmake
`engine/src/eval.c`	~500	Texel-tuned HCE, pawn cache, mobility, tapered eval
`engine/src/movegen.c`	~400	Pseudo-legal generation, attack detection, legality fast path
`engine/src/book.c`	~350	Polyglot book, multi-AppVar, binary search, weighted selection
`engine/src/engine.c`	~350	Public API, game status, difficulty configuration
`engine/src/zobrist.c`	~150	Zobrist tables, inline asm XOR, PRNG
`engine/src/tt.c`	~100	4096-entry TT, move packing, always-replace
`engine/src/pick_best.asm`	~60	Hand-written eZ80 assembly for move scoring
`src/main.c`	~1200	GUI: rendering, input, animation, state machine

Testing & Tournaments

File	Purpose
`engine/test/bench.c` (25KB)	100-position desktop benchmark
`bench/src/main.c`	On-calculator cycle profiling benchmark
`bench/RESULTS.md`	Complete optimization history with measurements
`engine/test/perft.c`	29-position move generation validation
`engine/test/test_search.c`	Search correctness (8 categories)
`engine/test/test_integration.c`	Full API integration tests (11 categories)
`engine/tournament.py`	Desktop Stockfish tournaments
`emu_tournament.py`	Emulator-based eZ80 tournaments
`engine/ablation.py`	Eval feature ablation testing

Texel Tuning

File	Purpose
`engine/tuning/texel_tune.py` (32KB)	Adam gradient descent on 1M positions
`engine/tuning/build_texel_dataset.py`	Lichess broadcast PGN → CSV dataset
`engine/tuning/apply_texel_params.py`	JSON results → C source patching

Thorin

CE-Games Chess Engine — Deep Technical Profile ​

Table of Contents ​

1. Project Overview ​

Libraries & Frameworks ​

Engine core (chess/engine/, C targeting eZ80) ​

Desktop/native engine build ​

TI-84 Plus CE runtime libraries (pulled in as .8xv AppVars) ​

Tuning & tournament scripts (engine/tuning/, scripts/, Python 3) ​

Emulator-in-the-loop testing ​

Not used ​

2. Pre-Implementation Research & Planning ​

2.1 Hardware Constraints Research (originated from a DOOM porting investigation) ​

2.2 The 16-Section Architecture Plan ​

2.3 Elo Measurement Methodology ​

2.4 Competitive Landscape Analysis ​

2.5 Chess Engine Architecture Philosophy ​

3. Architecture ​

3.1 Engine/UI Separation ​

3.2 Dual-Target Compilation ​

3.3 Memory Budget (~62KB of 256KB) ​

Chess Engine Concepts — Quick Glossary ​

4. The Chess Engine Core ​

4.1 Board Representation (board.c/h) ​

4.2 Move Generation (movegen.c) ​

4.3 Search (search.c) ​

4.4 Evaluation (eval.c) ​

4.5 Transposition Table (tt.c) ​

4.6 Zobrist Hashing (zobrist.c) ​

4.7 Opening Book (book.c) ​

5. The GUI ​

6. Technical Tradeoffs & eZ80 Constraints ​

6.1 Why 0x88 Over Bitboards ​

6.2 The 24-bit Integer Problem ​

6.3 -Oz vs -O2 ​

6.4 Sentinel-Based Sliding Rays ​

6.5 Stack Limitations ​

6.6 Flash-Resident Data ​

6.7 Strategic / "why do this at all" tradeoffs ​

6.8 Additional architectural tradeoffs worth naming ​

6.9 Code-level tradeoffs visible in the source ​

7. The Optimization Journey ​

7.1 Phase 1: Engine Bootstrap (Feb 13) ​

7.2 Phase 2: Eval Features + Ablation (Feb 15) ​

7.3 Phase 3: Movegen Optimizations (Feb 15) ​

7.4 Phase 4: Texel Tuning (+52 Elo) (Feb 16) ​

7.5 Phase 5: Integer Type Optimization (Feb 16) ​

7.6 Phase 6: Pawn Hash Cache (Feb 16-17) ​

7.7 Phase 7: Cycle-Level Micro-Optimizations (Feb 17) ​

7.8 Optimizations Tried and Rejected ​

7.9 NNUE Exploration (Shelved) ​

7.10 Final Performance Profile (100 positions, 1000 nodes each) ​

8. Texel Tuning Pipeline ​

8.1 Methodology ​

8.2 Data Source ​

8.3 Optimization Algorithm ​

8.4 Parameters Tuned ​

8.5 Results Across Rounds ​

8.6 Application ​

9. Testing & Tournament Infrastructure ​

9.1 Test Suites ​

9.2 Tournament Infrastructure ​

9.3 Elo Results ​

10. Major Bugs & Debugging Stories ​

10.1 The Queen Hang Bug (PVS Scoring) ​

10.2 The 0-30 Tournament Disaster ​

10.3 Check Extension + TT Ordering Bug (PR #15) ​

10.4 The Timer Failure Cascade ​

10.5 The Book That Never Worked ​

11. AI Agent Orchestration & Usage Stats ​

11.1 Session Data ​

11.2 Human-AI Collaboration ​

11.3 Key Methodology ​

12. Development Timeline ​

13. Key Files Reference ​

Engine Core ​

Testing & Tournaments ​

Texel Tuning ​