Oral
Oral Session 2: Learning Theory & High-Dimensional Statistics
Main Ballroom
Moderator: Masashi Sugiyama
Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix
Tomohiro Hayase ⋅ Benoit Collins ⋅ Ryo Karakida
Self-attention layers have become fundamental building blocks of modern deep neural networks, yet their theoretical understanding remains limited, particularly from the perspective of random matrix theory. In this work, we provide a rigorous analysis of the singular value spectrum of the attention matrix and establish the first Gaussian equivalence result for attention. In a natural regime where the inverse temperature remains of constant order, we show that the singular value distribution of the attention matrix is asymptotically characterized by a tractable linear model. We further demonstrate that the distribution of squared singular values deviates from the Marchenko–Pastur law, which has been believed in previous work. Our proof relies on two key ingredients: precise control of fluctuations in the normalization term and a refined linearization that leverages favorable Taylor expansions of the exponential. This analysis also identifies a threshold for linearization and elucidates why attention, despite not being an entrywise operation, admits a rigorous Gaussian equivalence in this regime.
PAC-Bayesian Bounds on Constrained $f$-Entropic Risk Measures
Hind Atbir ⋅ Farah Cherfaoui ⋅ Guillaume Metzler ⋅ Emilie Morvant ⋅ Paul Viallard
PAC generalization bounds on the risk, when expressed in terms of the expected loss, are often insufficient to capture imbalances between subgroups in the data. To tackle this limitation, we introduce a new family of risk measures, called constrained $f$-entropic risk measures, which enable finer control over distributional shifts and subgroup imbalances via $f$-divergences, and include the Conditional Value at Risk (CVaR), a well-known risk measure. We derive both classical and disintegrated PAC-Bayesian generalization bounds for this family of risks, providing the first disintegrated PAC-Bayesian guarantees beyond standard risks. Building on this theory, we design a self-bounding algorithm minimizing our bounds directly, yielding models with guarantees at the subgroup level. We empirically demonstrate the usefulness of our approach.
The core of generalization theory was developed for independent observations. Some PAC and PAC-Bayes bounds are available for data that exhibit a temporal dependence. However, there are constants in these bounds that depend on properties of the data-generating process: mixing coefficients, mixing time, spectral gap... Such constants are unknown in practice. In this paper, we prove a new PAC-Bayes bound for Markov chains. This bound depends on a quantity called the \textit{pseudo-spectral gap}, $\gamma_{ps}$. The main novelty is that we can provide an empirical bound on $\gamma_{ps}$ when the state space is finite. Thus, we obtain the first fully empirical PAC-Bayes bound for Markov chains. This extends beyond the finite case, although this requires additional assumptions. On simulated experiments, the empirical version of the bound is essentially as tight as the one that depends on $\gamma_{ps}$.
Support Basis: Fast Attention Beyond Bounded Entries
Maryam Aliakbarpour ⋅ Vladimir Braverman ⋅ Junze Yin ⋅ Haochen Zhang
Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks. However, the quadratic complexity of softmax attention remains a central bottleneck that limits their scalability. Alman and Song (NeurIPS 2023a; NeurIPS 2024a) proposed sub-quadratic time algorithms for attention inference and training, respectively, but they rely on the restrictive **bounded-entry assumption**. We show that this assumption rarely holds in practice, which significantly limits their applicability to modern LLMs. In this paper, we introduce **support-basis decomposition**, a new technique for accurate and efficient attention inference and training **without** the bounded-entry assumption. We empirically show that the entries of the query and key matrices exhibit sub-Gaussian behavior. Leveraging this widely observed property, we perform exact computation on sparse components and polynomial approximation on dense components. Without relying on restrictive assumptions, we theoretically show that our algorithm achieves sub-quadratic runtime while matching the approximation error of prior work, and we empirically validate its computational efficiency and downstream task performance. We further generalize our method to a multi-threshold setting that eliminates all distributional assumptions, providing the first theoretical justification for the empirical success of polynomial attention. Moreover, we show that softmax attention can be closely approximated by multiple polynomial attentions with significantly smaller $\ell_p$ error.
Minimax Generalized Cross-Entropy
Kartheek Bondugula ⋅ Santiago Mazuelas ⋅ Aritz Pérez ⋅ Anqi Liu
Loss functions play a central role in supervised classification. Cross-entropy (CE) is widely used, whereas the mean absolute error (MAE) loss can offer robustness but is difficult to optimize. Interpolating between the CE and MAE losses, generalized cross-entropy (GCE) has recently been introduced to provide a trade-off between optimization difficulty and robustness. Existing formulations of GCE result in a non-convex optimization over classification margins that is prone to underfitting, leading to poor performances with complex datasets. In this paper, we propose a minimax formulation of generalized cross-entropy (MGCE) that results in a convex optimization over classification margins. Moreover, we show that MGCEs can provide an upper bound on the classification error. The proposed bilevel convex optimization can be efficiently implemented using stochastic gradient computed via implicit differentiation. Using benchmark datasets, we show that MGCE achieves strong accuracy, faster convergence, and better calibration, especially in the presence of label noise.
Near-Optimal Sample Complexities of Divergence-based S-rectangular Distributionally Robust Reinforcement Learning
Zhenghao Li ⋅ Shengbo Wang ⋅ NIAN SI
Distributionally robust reinforcement learning (DR-RL) has recently gained significant attention as a principled approach that addresses discrepancies between training and testing environments. To balance robustness, conservatism, and computational traceability, the literature has introduced DR-RL models with SA-rectangular and S-rectangular adversaries. While most existing statistical analyses focus on SA-rectangular models, owing to their algorithmic simplicity and the optimality of deterministic policies, S-rectangular models more accurately capture distributional discrepancies in many real-world applications and often yield more effective robust randomized policies. In this paper, we study the empirical value iteration algorithm for divergence-based S-rectangular DR-RL and establish near-optimal sample complexity bounds of $\widetilde{O}(|\mathcal{S}||\mathcal{A}|(1-\gamma)^{-4}\varepsilon^{-2})$, where $\varepsilon$ is the target accuracy, $|\mathcal{S}|$ and $|\mathcal{A}|$ denote the cardinalities of the state and action spaces, and $\gamma$ is the discount factor. To the best of our knowledge, these are the first sample complexity results for divergence-based S-rectangular models that achieve optimal dependence on $|\mathcal{S}|$, $|\mathcal{A}|$, and $\varepsilon$ simultaneously. We further validate this theoretical dependence through numerical experiments on a robust inventory control problem and a theoretical worst-case example, demonstrating the fast learning performance of our proposed algorithm.
Beyond Johnson-Lindenstrauss: Uniform Bounds for Sketched Bilinear Forms
Rohan Deb ⋅ Qiaobo Li ⋅ Mayank Shrivastava ⋅ Arindam Banerjee
Uniform bounds on sketched inner products underpin several important computational and statistical results in machine learning and randomized algorithms, including the Johnson-Lindenstrauss (J-L) lemma, the Restricted Isometry Property (RIP), randomized sketching, etc. However, many modern analyses involve *sketched bilinear forms*, for which existing uniform bounds either do not apply or are not sharp on general sets. In this work, we develop a general framework to analyze such sketched bilinear forms, and derive uniform bounds in terms of geometric complexities of the associated sets. Our approach relies on *generic chaining* and introduces new techniques for handling suprema over pairs of sets. We further extend our results to (i) sketch matrices with conditionally independent entries, e.g., as in CountSketch and SRHT (Subsampled Randomized Hadamard Transform), and (ii) bilinear form involving a sum of $T$ sketch matrices, showing the deviation scales as $\sqrt{T}$. This unified analysis recovers known results such as the J-L lemma as special cases, while extending RIP guarantees. Using our new bounds, we give tighter convergence bounds for sketched federated learning, and develop sketched bandits whose regret depends on the geometric complexity of the action and parameter sets rather than the ambient dimension.
High Effort, Low Gain: Fundamental Limits of Active Learning for Linear Dynamical Systems
Nicolas Chatzikiriakos ⋅ Kevin Jamieson ⋅ Andrea Iannelli
We consider the problem of identifying an unknown linear dynamical system from a finite hypothesis class. In particular, we analyze the effect of the excitation input on the sample complexity of identifying the true system with high probability. To this end, we present sample complexity lower bounds that capture the choice of the selected excitation input. The sample complexity lower bound gives rise to a system-theoretic condition to determine the potential benefit of experiment design. Informed by the analysis of the sample complexity lower bound, we propose a persistency of excitation (PE) condition tailored to the considered setting, which we then use to establish sample complexity upper bounds. Notably, the PE condition is weaker than in the case of an infinite hypothesis class and allows analyzing different excitation inputs modularly. Crucially, the lower and upper bounds share the same dependency on key problem parameters. Finally, we leverage these insights to propose an active learning algorithm that sequentially excites the system optimally with respect to the current estimate, and provide sample complexity guarantees for the presented algorithm. Concluding simulations showcase the effectiveness of the proposed algorithm.