Oral Session 5: Deep Architectures, Transformers & Representation Learning
Main Ballroom
Moderator: Guillaume Rabusseau
Complexity-Aware Deep Symbolic Regression with Robust Risk-Seeking Policy Gradients
Zachary Bastiani ⋅ Mike Kirby ⋅ Jacob Hochhalter ⋅ Shandian Zhe
We propose a novel deep symbolic regression (DSR) approach to enhance the robustness and interpretability of data-driven mathematical expression discovery. Existing DSR methods are built on recurrent neural networks, are guided solely by data fitness, and can encounter tail barriers that zero out the policy gradient, causing inefficient model updates. To address these issues, we first design a decoder-only architecture that performs attention in the frequency domain and introduce a dual-indexed position encoding to conduct layer-wise generation. Second, we propose a Bayesian information criterion (BIC)-based reward function that can automatically adjust the trade-off between expression complexity and data fitness, without the need for explicit manual tuning. Third, we develop a ranking-based weighted policy update method that eliminates the tail barriers and enhances training effectiveness. Extensive benchmarks and systematic experiments demonstrate the advantages of our approach. We have released our implementation at https://github.com/ZakBastiani/CADSR.
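The trade-off that a BIC-based reward balances can be illustrated with the textbook BIC under a Gaussian noise assumption. This is a minimal sketch, not the authors' reward function: the helper name `gaussian_bic`, the way parameters are counted in `k`, and the toy residuals are all assumptions made for illustration.

```python
import math

def gaussian_bic(residuals, k):
    """Textbook BIC for a model with k free parameters and i.i.d. Gaussian errors.

    BIC = n * ln(RSS / n) + k * ln(n), dropping constants shared by all models.
    Lower is better. Hypothetical helper, not taken from the CADSR code base.
    Requires a nonzero residual sum of squares.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + k * math.log(n)

# Toy data (made up): a slightly better fit that costs three extra parameters.
simple_fit = [0.10, -0.12, 0.09, -0.11, 0.10, -0.09, 0.11, -0.10]
complex_fit = [0.09, -0.11, 0.09, -0.10, 0.10, -0.09, 0.10, -0.10]

bic_simple = gaussian_bic(simple_fit, k=2)
bic_complex = gaussian_bic(complex_fit, k=5)
```

In this toy case the simpler expression gets the lower (better) score: the small gain in fit does not pay for the `k * ln(n)` complexity penalty, and no hand-tuned weight between the two terms is involved.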
Representation Learning via Non-Contrastive Mutual Information
Zhaohan Daniel Guo ⋅ Bernardo Avila Pires ⋅ Khimya Khetarpal ⋅ Dale Schuurmans ⋅ Bo Dai
Labeling data is often very time consuming and expensive, leaving most data unlabeled. Self-supervised representation learning methods such as SimCLR (Chen et al., 2020) or BYOL (Grill et al., 2020) have been very successful at learning meaningful latent representations from unlabeled image data, resulting in much more general and transferable representations for downstream tasks. Broadly, self-supervised methods fall into two types: 1) Contrastive methods, such as SimCLR; and 2) Non-Contrastive methods, such as BYOL. Contrastive methods are generally trying to maximize mutual information between related data points, so they need to compare every data point to every other data point, resulting in high variance, and thus requiring large batch sizes to work well. Non-contrastive methods like BYOL have much lower variance as they do not need to make pairwise comparisons, but are much trickier to implement as they have the possibility of collapsing to a constant vector. In this paper, we aim to develop a self-supervised objective that combines the strengths of both types. We start with a particular contrastive method called the Spectral Contrastive Loss (HaoChen et al., 2021; Lu et al., 2024), and we convert it into a more general non-contrastive form; this removes the pairwise comparisons resulting in lower variance, but keeps the mutual information formulation of the contrastive method preventing collapse. We call our new objective the Mutual Information Non-Contrastive (MINC) loss. We test MINC by learning image representations on ImageNet (similar to SimCLR and BYOL) and show that it consistently improves upon the Spectral Contrastive loss baseline.
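For orientation, the Spectral Contrastive Loss baseline is commonly written as -2 E[f(x)^T f(x+)] + E[(f(x)^T f(x'))^2]. The sketch below is a plain batch estimate of that expression, not the MINC objective, whose form the abstract does not give. The function names and the toy embeddings are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def spectral_contrastive_loss(z1, z2):
    """Batch estimate of -2 E[f(x)^T f(x+)] + E[(f(x)^T f(x'))^2].

    z1[i] and z2[i] are embeddings of two augmented views of example i.
    The second term loops over all n * (n - 1) cross pairs, which is the
    pairwise comparison the abstract refers to.
    """
    n = len(z1)
    pos = sum(dot(z1[i], z2[i]) for i in range(n)) / n
    neg = sum(dot(z1[i], z2[j]) ** 2
              for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return -2.0 * pos + neg

# Toy embeddings (made up): a collapsed solution versus an orthogonal one.
collapsed = [[1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0]]

loss_collapsed = spectral_contrastive_loss(collapsed, collapsed)
loss_spread = spectral_contrastive_loss(spread, spread)
```

The squared cross-pair term penalizes the collapsed embeddings, so the spread-out solution scores lower. That term is also where the pairwise cost comes from.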
Why is prompting hard? Understanding prompts on binary sequence predictors
Li Kevin Wenliang ⋅ Anian Ruoss ⋅ Jordi Grau-Moya ⋅ Marcus Hutter ⋅ Tim Genewein
Frontier models can be prompted or conditioned to do many tasks, but finding good prompts is not always easy, nor is understanding why some prompts perform well. We view prompting as finding the best conditioning sequence on a near-optimal sequence predictor. In numerous well-controlled experiments, we show that unintuitive optimal conditioning sequences can be better understood given the pretraining distribution, which is not usually available. Even using exhaustive search, reliably identifying optimal prompts for practical neural predictors can be surprisingly difficult. Popular prompting methods, such as using demonstrations from the targeted task, can be surprisingly suboptimal. Using the same empirical framework, we analyze optimal prompts on frontier models, revealing patterns similar to the binary examples and previous findings. Taken together, this work takes an initial step towards understanding optimal prompts, from a statistical and empirical perspective that complements research on frontier models.
On the Role of Depth in the Expressivity of RNNs
Maude Lizaire ⋅ Michael Rizvi-Martel ⋅ Éric Dupuis ⋅ Guillaume Rabusseau
The benefits of depth in feedforward neural networks are well known: composing multiple layers of linear transformations with nonlinear activations enables complex computations. While similar effects are expected in recurrent neural networks (RNNs), it remains unclear how depth interacts with recurrence to shape expressive power. Here, we formally show that depth increases RNNs’ memory capacity efficiently with respect to parameters, enhancing expressivity both by enabling more complex input transformations and improving the retention of past information. We extend our analysis to 2RNNs, a generalization of RNNs with multiplicative interactions between inputs and hidden states. Unlike RNNs, which remain linear without nonlinear activations, 2RNNs perform polynomial transformations whose maximal degree grows with depth. We further show that multiplicative interactions cannot, in general, be replaced by layerwise nonlinearities. Finally, we validate these insights empirically on synthetic and real-world tasks.
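The contrast between additive and multiplicative recurrences can be seen in a single-layer toy example. This is a sketch under stated assumptions: one layer only, no activations, no bias terms, and made-up weights. It does not reproduce the depth results, and the function names are hypothetical.

```python
def linear_rnn(xs, W, U):
    """h_t = W h_{t-1} + U x_t with h_0 = 0 and no activation."""
    h = [0.0] * len(W)
    for x in xs:
        h = [sum(W[k][i] * h[i] for i in range(len(h))) +
             sum(U[k][j] * x[j] for j in range(len(x)))
             for k in range(len(W))]
    return h

def second_order_rnn(xs, A, h0):
    """h_t[k] = sum_{i,j} A[k][i][j] * h_{t-1}[i] * x_t[j], no activation."""
    h = list(h0)
    for x in xs:
        h = [sum(A[k][i][j] * h[i] * x[j]
                 for i in range(len(h)) for j in range(len(x)))
             for k in range(len(A))]
    return h

# Made-up weights and a length-3 input sequence.
W = [[0.5, 1.0], [0.0, 0.5]]
U = [[1.0, 0.0], [0.0, 1.0]]
A = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
xs = [[1.0, 2.0], [0.5, -1.0], [2.0, 1.0]]
xs_doubled = [[2.0 * v for v in x] for x in xs]

lin, lin2 = linear_rnn(xs, W, U), linear_rnn(xs_doubled, W, U)
bil, bil2 = second_order_rnn(xs, A, [1.0, 1.0]), second_order_rnn(xs_doubled, A, [1.0, 1.0])
```

Doubling every input doubles the linear RNN's final state. It multiplies the second-order state by 2^3 = 8, because that state is a product of terms across the three time steps.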
Certifying Reading Comprehension in Large Language Models
Isha Chaudhary ⋅ Vedaant Jain ⋅ Gagandeep Singh
Large Language Models (LLMs) are increasingly deployed in safety-critical systems that rely heavily on reading comprehension: extracting and reasoning over extensive in-context information. However, existing evaluations of LLMs on reading comprehension are typically over limited test sets containing only a tiny fraction of the vast number of possible prompts. Empirical evaluations on these test sets have questionable reliability and generalizability. We propose a fundamentally different approach: rather than evaluating LLMs with fixed datasets, we introduce the first framework for certifying LLMs based on large probability distributions over realistic reading comprehension prompts. To create these distributions, we use knowledge graphs (KGs) as structured representations of real-world knowledge and define the distributions’ sample spaces with prompts based on directed acyclic subgraphs of the KGs. We also incorporate realistic noise designed to mimic real-world complexity, such as distractor texts and synonyms. Our prompt distributions have i.i.d. samplers represented as probabilistic programs. Our framework generates novel, formal probabilistic quantitative certificates that provide high-confidence, tight bounds on the probability that an LLM correctly answers any prompt drawn from these distributions. We enable formal certification for SOTA LLMs by using an input-output example-driven approach. We apply our framework to certify SOTA LLMs in precision medicine and general question-answering domains. Our results uncover previously unknown vulnerabilities caused by natural prompt noise and establish the first formal performance hierarchies among these models.
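The abstract does not say which statistical bound the certificates use. As a stand-in, the sketch below applies a two-sided Hoeffding bound to i.i.d. pass/fail outcomes, which shows the general shape of a high-confidence interval on the probability of a correct answer. The function name and the counts (850 correct out of 1000 sampled prompts) are hypothetical, and tighter intervals than Hoeffding's exist.

```python
import math

def hoeffding_interval(num_correct, n, delta=0.05):
    """Two-sided Hoeffding interval for a Bernoulli success probability.

    With probability at least 1 - delta over the n i.i.d. samples, the true
    probability lies within eps of the empirical rate, where
    eps = sqrt(ln(2 / delta) / (2 * n)). The interval is clipped to [0, 1].
    """
    p_hat = num_correct / n
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)

# Hypothetical outcome counts, for illustration only.
lower, upper = hoeffding_interval(num_correct=850, n=1000, delta=0.05)
lower_big, upper_big = hoeffding_interval(num_correct=8500, n=10000, delta=0.05)
```

The interval narrows as more prompts are sampled from the distribution, which is why an i.i.d. sampler matters: the guarantee only holds for independent draws.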
In-Context Learning for Discrete Optimal Transport: Can Transformers Sort?
Hadi Daneshmand
The rapid growth of model sizes and training datasets has created a strong demand for test-time compute—the ability to perform inference without additional training. At the core of test-time compute is in-context learning (ICL), an emerging capability of large language models (LLMs) that enables them to perform statistical inference directly at test time. Recent progress has shed light on the mechanisms underlying in-context learning in statistical tasks: language models can implement linear regression and classification by iteratively extracting features at test time. This naturally raises a broader question: Can we analyze ICL beyond statistical learning and extend it to discrete algorithmic tasks relevant to NLP? One of the fundamental tasks in NLP can be formulated as discrete optimal transport: matching tokens, with applications ranging from machine translation to mixture-of-experts routing. We show that transformers with softmax self-attention can solve discrete optimal transport via in-context learning when the model parameters are fixed and only the input length and data distribution vary. One implication of this result is that transformers can approximately sort lists of arbitrary length with a provable approximation guarantee.
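The link between sorting and discrete optimal transport can be checked by brute force on a small list. With a squared-distance cost and an increasing grid of target slots, the minimum-cost one-to-one assignment places the items in sorted order. This sketch only illustrates the problem formulation and is not the transformer construction analyzed in the paper. It runs in factorial time, assumes distinct items, and its function names are hypothetical.

```python
from itertools import permutations

def optimal_assignment(xs, targets):
    """Brute-force discrete OT between equal-size, uniformly weighted point sets.

    Returns the permutation p minimising sum_i (xs[i] - targets[p[i]]) ** 2.
    """
    n = len(xs)
    return min(permutations(range(n)),
               key=lambda p: sum((xs[i] - targets[p[i]]) ** 2 for i in range(n)))

def sort_via_transport(xs):
    # Transport each item to a slot on an increasing grid; the slot is its rank.
    targets = list(range(len(xs)))
    p = optimal_assignment(xs, targets)
    out = [None] * len(xs)
    for i, slot in enumerate(p):
        out[slot] = xs[i]
    return out

items = [3.2, -1.0, 0.5, 7.1, 2.0]
result = sort_via_transport(items)
```

Minimising the squared cost is equivalent to maximising the sum of products of items and slot values, which by the rearrangement inequality is achieved by pairing them in the same order.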
LAMP: Extracting Local Decision Surfaces from Large Language Models
Ryan Chen ⋅ Youngmin Ko ⋅ Catherine Cho ⋅ Zeyu Zhang ⋅ Mauro Giuffrè ⋅ Sunny Chung ⋅ Dennis Shung ⋅ Bradly Stadie
We introduce LAMP (Local Attribution Mapping Probe), a method that shines light onto a black-box language model's decision surface and studies how reliably a model maps its stated reasons to its reported predictions by approximating a decision surface. LAMP treats the model's own self-reported explanations as a coordinate system and fits a locally linear surrogate that links those weights to the model's output. By doing so, it reveals how much the stated factors steer the model's decisions. We apply LAMP to three tasks: sentiment analysis, controversial-topic detection, and safety-prompt auditing. Across these tasks, LAMP reveals that many language models' locally approximated linear decision landscapes overall agree with human judgments on explanation quality and, on a clinical case‑file data set, align with expert assessments. Since LAMP operates without requiring access to model gradients, logits, or internal activations, it serves as a practical and lightweight framework for auditing proprietary language models, and enables assessment of whether a model appears to behave consistently with the explanations it provides.
Beyond Binning: Soft Task Reformulation for Deep Regression
Lawrence Stewart ⋅ Francis Bach ⋅ Quentin Berthet
Whilst neural networks are powerful predictors, it has been observed and theoretically analyzed that training such models by minimizing the square loss can lead to suboptimal results on regression problems, where the targets are real-valued. In this work, we propose a novel method aimed at improving test-time performance of neural networks on regression tasks. Our method is based on casting this task in a different fashion, using a target encoder, and a prediction decoder, inspired by approaches in classification and clustering. We demonstrate our method on a wide range of real-world datasets.