The Reasoning-Creativity Trade-off: Toward Creativity-Driven Problem Solving
Abstract
State-of-the-art post-training pipelines for reasoning LLMs rely on bootstrapped reasoning loops: they sample many traces, score them, and reinforce the highest-scoring ones, typically by correctness. This can improve accuracy while still collapsing the distribution inside the correct set onto a narrow family of redundant strategies, reducing creative problem-solving. To diagnose this failure mode, we introduce Distributional Creative Reasoning (DCR), a variational framework that casts training as gradient flow on the simplex of reasoning traces. The framework yields three core results. First, a diversity-decay analysis shows that STaR-style rejection fine-tuning and exact mean-field GRPO amplify whichever correct trace already carries more probability mass, while DPO regresses pairwise correct-trace ratios toward the reference ratios. Second, it explains why entropy and KL regularization can slow or tether collapse but do not reward semantically distinct correct strategies for being distinct, and how a creativity kernel supplies the missing relational term. Third, under mild conditions, the resulting dynamics converge to a unique, stable, and diverse equilibrium, yielding practical guidance for kernel and hyperparameter design. DCR thus offers a principled route to training reasoning LLMs that remain both correct and creative.
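The distinction between accuracy and diversity inside the correct set can be made concrete with a toy calculation. The sketch below is illustrative only and is not the DCR framework itself: the four-trace distributions, the correctness labels, and the helper names `accuracy` and `correct_set_entropy` are assumptions introduced here, and Shannon entropy over the renormalized correct traces is used as a simple stand-in for diversity.

```python
import math

def accuracy(p, correct):
    """Total probability mass placed on correct traces."""
    return sum(pi for pi, ok in zip(p, correct) if ok)

def correct_set_entropy(p, correct):
    """Shannon entropy (nats) of p renormalized over the correct traces only.

    Assumption: entropy is used here as a crude diversity proxy; it ignores
    whether the correct traces are semantically distinct.
    """
    mass = accuracy(p, correct)
    if mass == 0.0:
        return 0.0
    h = 0.0
    for pi, ok in zip(p, correct):
        if ok and pi > 0.0:
            q = pi / mass
            h -= q * math.log(q)
    return h

# Hypothetical toy problem: three correct traces and one incorrect trace.
correct = [True, True, True, False]
diverse = [0.30, 0.30, 0.30, 0.10]    # mass spread across correct strategies
collapsed = [0.88, 0.01, 0.01, 0.10]  # same accuracy, one dominant strategy

acc_diverse = accuracy(diverse, correct)
acc_collapsed = accuracy(collapsed, correct)
h_diverse = correct_set_entropy(diverse, correct)
h_collapsed = correct_set_entropy(collapsed, correct)

print(f"accuracy: diverse={acc_diverse:.3f}, collapsed={acc_collapsed:.3f}")
print(f"correct-set entropy: diverse={h_diverse:.3f}, collapsed={h_collapsed:.3f}")
```

Both toy distributions place the same mass on correct traces, so a correctness-only score cannot tell them apart, while the entropy restricted to the correct set differs sharply. Entropy alone still treats near-duplicate traces as distinct, which is the gap a relational term such as the creativity kernel is meant to address.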