Training Data
Model
In-sample R² will be near 1 when the training data is thin relative to the feature count; that is textbook overfitting, not evidence of a good model. Fit quality is only trustworthy when N ≫ feature count.
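The effect is easy to reproduce on synthetic data. The sketch below is not the project's code: it uses plain (unregularised) least squares instead of ridge, a pure-noise target, and hypothetical helper names (`solve`, `in_sample_r2`), purely to show how in-sample R² inflates when the row count is close to the feature count.

```python
import random


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def in_sample_r2(n_rows, n_features, seed=0):
    """Fit least squares to random features and a pure-noise target; return in-sample R^2."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_rows)]
    y = [rng.gauss(0, 1) for _ in range(n_rows)]  # no real signal at all
    p = n_features
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    w = solve(xtx, xty)
    pred = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


r2_thin = in_sample_r2(n_rows=8, n_features=8)      # N equal to feature count
r2_large = in_sample_r2(n_rows=2000, n_features=8)  # N much larger than feature count
```

With as many rows as features, the fit interpolates the noise and R² sits at essentially 1; with N ≫ feature count, the same procedure reports an R² close to 0, which is the honest answer for a target with no signal.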
Retrain reads every completion, fits a ridge regression in-process, persists the coefficients to $DATA_DIR/difficulty_model.json, and rescores all stored puzzles. The active model survives restarts.
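A minimal sketch of that retrain cycle, under several assumptions: the feature vectors, the `alpha` penalty, the JSON layout (`coef`, `intercept`), and every function name here are hypothetical, since the real schema is not shown. Only the file name `difficulty_model.json` comes from the description above.

```python
import json
import os
import tempfile


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ridge(rows, targets, alpha=1.0):
    """Closed-form ridge on centred data; the intercept is not penalised."""
    n, p = len(rows), len(rows[0])
    x_mean = [sum(r[j] for r in rows) / n for j in range(p)]
    y_mean = sum(targets) / n
    xc = [[r[j] - x_mean[j] for j in range(p)] for r in rows]
    yc = [t - y_mean for t in targets]
    # (Xc^T Xc + alpha * I) w = Xc^T yc
    gram = [[sum(r[i] * r[j] for r in xc) + (alpha if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * t for r, t in zip(xc, yc)) for i in range(p)]
    coef = solve(gram, rhs)
    intercept = y_mean - sum(c * m for c, m in zip(coef, x_mean))
    return coef, intercept


def save_model(data_dir, coef, intercept):
    """Write to a temp file, then rename, so a crash never leaves a half-written model."""
    path = os.path.join(data_dir, "difficulty_model.json")
    fd, tmp = tempfile.mkstemp(dir=data_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump({"coef": coef, "intercept": intercept}, fh)  # hypothetical layout
    os.replace(tmp, path)
    return path


def load_model(data_dir):
    """Reload the persisted coefficients, e.g. on process start."""
    with open(os.path.join(data_dir, "difficulty_model.json")) as fh:
        model = json.load(fh)
    return model["coef"], model["intercept"]


def rescore(puzzle_features, coef, intercept):
    """Recompute a score for every stored puzzle: {puzzle_id: score}."""
    return {pid: intercept + sum(c * f for c, f in zip(coef, feats))
            for pid, feats in puzzle_features.items()}
```

Reloading from the JSON file at startup is what would let the active model survive a restart; the write-then-rename step is a design choice of this sketch, not something the description confirms.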
Score distribution by requested tier
Each row is a histogram of stored difficulty_score for puzzles requested as that tier. Ideal: tiers separate cleanly with little overlap. Heavy overlap means structural tier boundaries don't match perceived difficulty.
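One way to put a number on "overlap" is histogram intersection between two tiers' rows. The sketch below assumes scores fall in a known range (`lo`, `hi`) and uses made-up tier names; neither is confirmed by the description above.

```python
from collections import defaultdict


def tier_histograms(puzzles, bins=10, lo=0.0, hi=1.0):
    """Bucket (requested_tier, difficulty_score) pairs into one histogram row per tier."""
    width = (hi - lo) / bins
    hist = defaultdict(lambda: [0] * bins)
    for tier, score in puzzles:
        # Clamp out-of-range scores into the first or last bin.
        idx = min(bins - 1, max(0, int((score - lo) / width)))
        hist[tier][idx] += 1
    return dict(hist)


def overlap(row_a, row_b):
    """Histogram intersection of two normalised rows: 0.0 = disjoint, 1.0 = identical shape."""
    total_a, total_b = sum(row_a), sum(row_b)
    return sum(min(a / total_a, b / total_b) for a, b in zip(row_a, row_b))


# Hypothetical data: "medium" lands in the same bin as "easy".
puzzles = [("easy", 0.12), ("easy", 0.17), ("medium", 0.14),
           ("hard", 0.92), ("hard", 0.97)]
rows = tier_histograms(puzzles)
easy_vs_hard = overlap(rows["easy"], rows["hard"])
easy_vs_medium = overlap(rows["easy"], rows["medium"])
```

An intersection near 0 between adjacent tiers matches the "separate cleanly" ideal; a value near 1 flags two tiers whose boundary the score cannot tell apart.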
Predicted score vs actual log time
Each dot is one completion: x = stored difficulty_score, y = ln(active_time in seconds). A good model produces a positive trend; tight bands per size mean the model captures size-dependent timing.
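The "positive trend" can be summarised as a correlation coefficient between the two axes. This is a sketch only: the input shape (pairs of score and seconds) and both function names are assumptions, and Pearson correlation is one reasonable choice of summary, not necessarily what the dashboard computes.

```python
import math


def log_time_points(completions):
    """Map (difficulty_score, active_time_seconds) pairs to (x, y) = (score, ln seconds).

    Non-positive times are skipped because ln is undefined for them.
    """
    return [(score, math.log(secs)) for score, secs in completions if secs > 0]


def pearson(points):
    """Pearson correlation of the x and y coordinates of the plotted dots."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    var_y = sum((y - mean_y) ** 2 for _, y in points)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical completions: solve time doubles with each unit of score.
completions = [(1.0, 30.0), (2.0, 60.0), (3.0, 120.0), (4.0, 0.0)]
points = log_time_points(completions)
r = pearson(points)
```

Running the same calculation separately per puzzle size would give a per-band figure, which is closer to the "tight bands per size" reading of the plot.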
Completions per puzzle
How many times each puzzle has been solved. Most production puzzles sit at zero; a long tail to the right means a few popular puzzles dominate the training signal.
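A sketch of the tally behind this chart, plus a simple concentration figure. The inputs (a list of all stored puzzle ids and a list of puzzle ids, one per completion) and the function names are assumptions for illustration.

```python
from collections import Counter


def completions_per_puzzle(puzzle_ids, completion_puzzle_ids):
    """Count completions for every stored puzzle, including puzzles never solved."""
    counts = Counter(completion_puzzle_ids)
    return {pid: counts.get(pid, 0) for pid in puzzle_ids}


def top_share(per_puzzle, k=1):
    """Fraction of all completions contributed by the k most-solved puzzles."""
    total = sum(per_puzzle.values())
    if total == 0:
        return 0.0
    top = sorted(per_puzzle.values(), reverse=True)[:k]
    return sum(top) / total


# Hypothetical data: four stored puzzles, four completions.
per_puzzle = completions_per_puzzle(["a", "b", "c", "d"], ["a", "a", "a", "b"])
never_solved = sum(1 for n in per_puzzle.values() if n == 0)
share = top_share(per_puzzle, k=1)
```

A high `top_share` for small `k` is the numeric counterpart of the long right tail: the regression is then mostly learning from a handful of puzzles.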
Coverage by size × tier
Number of scored puzzles in each (size, requested-tier) bucket and the median stored score. Sparse cells highlight combos where the generator struggles.
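The coverage table amounts to a group-by over (size, requested tier). A minimal sketch, assuming each scored puzzle is available as a `(size, tier, score)` tuple; the `min_count` threshold for calling a cell sparse is an arbitrary illustrative value.

```python
from collections import defaultdict
from statistics import median


def coverage(puzzles):
    """Return {(size, tier): (count, median_score)} for scored puzzles."""
    buckets = defaultdict(list)
    for size, tier, score in puzzles:
        buckets[(size, tier)].append(score)
    return {key: (len(scores), median(scores)) for key, scores in buckets.items()}


def sparse_cells(table, sizes, tiers, min_count=5):
    """List (size, tier) cells with fewer than min_count puzzles, including empty cells."""
    return [(s, t) for s in sizes for t in tiers
            if table.get((s, t), (0, None))[0] < min_count]


# Hypothetical data: three small easy puzzles and one larger hard one.
puzzles = [(4, "easy", 0.1), (4, "easy", 0.3), (4, "easy", 0.2), (6, "hard", 0.9)]
table = coverage(puzzles)
thin = sparse_cells(table, sizes=[4, 6], tiers=["easy", "hard"], min_count=2)
```

Iterating over the full size and tier lists, rather than only the keys present in `table`, is what surfaces cells the generator never filled at all.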