In-Context Learning for Discrete Optimal Transport: Can Transformers Sort?
Abstract
The rapid growth of model sizes and training datasets has created strong demand for test-time compute, that is, the ability to perform inference without additional training. At the core of test-time compute is in-context learning, an emerging capability of large language models (LLMs) that enables them to perform statistical inference directly at test time. Recent progress has shed light on the mechanisms behind in-context learning for statistical tasks: language models can implement linear regression and classification by iteratively extracting features at test time. This naturally raises a broader question: can in-context learning extend beyond statistical learning to combinatorial problems and integer programs? We take a first step toward answering this question by analyzing the fundamental task of sorting and the more general problem of discrete optimal transport. We prove that transformers with softmax self-attention layers can solve discrete optimal transport through in-context learning, in the regime where the model parameters remain fixed and only the input length and distribution vary. Consequently, transformers can approximately sort lists of arbitrary length to within a provable approximation factor. This sorting mechanism differs fundamentally from classical discrete sorting algorithms: the transformer implicitly simulates a continuous optimization process through its internal feature-extraction dynamics.
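For orientation, the following is the standard formulation of discrete optimal transport and its entropic relaxation; the notation is ours and the paper's exact construction may differ. The one-dimensional special case makes the connection to sorting explicit, and the Sinkhorn updates illustrate why softmax-style normalizations are a natural fit.

\[
\mathrm{OT}(\mu,\nu) \;=\; \min_{P \in \Pi(a,b)} \langle P, C \rangle,
\qquad
\Pi(a,b) \;=\; \bigl\{\, P \in \mathbb{R}_{\ge 0}^{n \times m} \;:\; P\mathbf{1}_m = a,\; P^{\top}\mathbf{1}_n = b \,\bigr\},
\qquad
C_{ij} = c(x_i, y_j).
\]

When $a = b = \tfrac{1}{n}\mathbf{1}_n$, the points are real numbers, and $c(x,y) = (x-y)^2$, the optimal plan is $\tfrac{1}{n}$ times the permutation matrix matching the $i$-th smallest $x$ to the $i$-th smallest $y$; taking $y_1 < \cdots < y_n$ to be a fixed sorted grid, recovering that permutation sorts the list $x$. The entropic relaxation

\[
P_\varepsilon \;=\; \operatorname*{arg\,min}_{P \in \Pi(a,b)} \; \langle P, C \rangle \;-\; \varepsilon H(P),
\qquad
H(P) \;=\; -\sum_{i,j} P_{ij}\bigl(\log P_{ij} - 1\bigr),
\]

is solved by Sinkhorn scaling, $P_\varepsilon = \operatorname{diag}(u)\, e^{-C/\varepsilon}\, \operatorname{diag}(v)$, whose alternating row and column normalizations of the Gibbs kernel $e^{-C/\varepsilon}$ have the same form as softmax attention weights (a single row normalization of $e^{-C/\varepsilon}$ is exactly a row-wise softmax of $-C/\varepsilon$), and $P_\varepsilon \to P^\star$ as $\varepsilon \to 0$. This standard correspondence is presumably the structural link the abstract alludes to between softmax self-attention and a continuous optimization process.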