Skip to yearly menu bar Skip to main content


Oral

Oral Session 4: RL, Bandits & Online Decision-Making

Main Ballroom

Moderator: Alberto Maria Metelli

Sun 3 May 2:30 a.m. PDT — 3:30 a.m. PDT
Abstract:
Chat is not available.


Oral
Creator Incentives in Recommender Systems: A Cooperative Game-Theoretic Approach for Stable and Fair Collaboration in Multi-Agent Bandits

Ramakrishnan Krishnamurthy ⋅ Arpit Agarwal ⋅ Lakshmi Subramanian ⋅ Maximilian Nickel

We study a collaborative variant of the stochastic linear bandit problem in a multi-agent setting, motivated by real-world recommender systems where multiple content creators indirectly influence each other’s outcomes. In our formulation, agents interact with a shared environment but may choose to form coalitions, sharing observations to improve collective learning efficiency. We formalize this setup as a transferable utility (TU) cooperative game, where the value of a coalition is defined as the negative sum of cumulative regrets incurred by its members. This framework allows us to examine how algorithmic design and structural assumptions about agents---such as identical vs. heterogeneous action sets---affect collaboration incentives. We show that under some algorithmic conditions, the induced TU game exhibits desirable properties: for identical agents, the game is convex and admits a non-empty core containing the Shapley value, ensuring stable and equitable collaboration. For heterogeneous agents, we demonstrate core non-emptiness and propose a simple, implementable payoff mechanism that satisfies all but one Shapley value axioms. Experimental results on problem instances derived from MovieLens-100k dataset further illustrate how the empirical payout aligns and diverges from ideal cooperative outcome for different settings. Our results offer a principled lens for designing collaborative learning systems that are both effective and incentive-aligned.


Spotlight
Pure Exploration with Infinite Answers

Riccardo Poiani ⋅ Martino Bernasconi ⋅ Andrea Celli

We study pure exploration problems where the set of correct answers is possibly infinite, e.g., the regression of any continuous function of the means of the bandit. We derive an instance-dependent lower bound for these problems. By analyzing it, we discuss why existing methods (i.e., Sticky Track-and-Stop) for finite answer problems fail at being asymptotically optimal in this more general setting. Finally, we present a framework, Sticky-Sequence Track-and-Stop, which generalizes both Track-and-Stop and Sticky Track-and-Stop, and that enjoys asymptotic optimality. Due to its generality, our analysis also highlights special cases where existing methods enjoy optimality.


Spotlight
A Modularized Framework for Piecewise-Stationary Restless Bandits

Kuan-Ta Li ⋅ Chia-Chun Lin ⋅ Ping-Chun Hsieh ⋅ Yu-Chih Huang

We study the piecewise-stationary restless multi-armed bandit (PS-RMAB) problem, where each arm evolves as a Markov chain but \emph{mean rewards may change across unknown segments}. To address the resulting exploration--detection delay trade-off, we propose a modular framework that integrates arbitrary RMAB base algorithms with change detection and a novel diminishing exploration mechanism. This design enables flexible plug-and-play use of existing solvers and detectors, while efficiently adapting to mean changes without prior knowledge of their number. To evaluate performance, we introduce a refined regret notion that measures the \emph{excess regret due to exploration and detection}, benchmarked against an oracle that restarts the base algorithm at the true change points. Under this metric, we prove a regret bound of $\tilde{O}(\sqrt{LMKT})$, where $L$ denotes the maximum mixing time of the Markov chains across all arms and segments, $M$ the number of segments, $K$ the number of arms, and $T$ the horizon. Simulations confirm that our framework achieves regret close to that of the segment oracle and consistently outperforms base solvers that do not incorporate any mechanism to handle environmental hanges.


Spotlight
Towards Blackwell Optimality: Bellman Optimality Is All You Can Get

Victor Boone ⋅ Adrienne Tuynman

Although average gain optimality is a commonly adopted performance measure in Markov Decision Processes (MDPs), it is often too asymptotic. Further incorporating measures of immediate losses leads to the hierarchy of bias optimalities, all the way up to Blackwell optimality. In this paper, we investigate the problem of identifying policies of such optimality orders. To that end, for each order, we construct a learning algorithm with vanishing probability of error. Furthermore, we characterize the class of MDPs for which identification algorithms can stop in finite time. That class corresponds to the MDPs with a unique Bellman optimal policy, and does not depend on the optimality order considered. Lastly, we provide a tractable stopping rule that when coupled to our learning algorithm triggers in finite time whenever it is possible to do so.

In two-player zero-sum games, the learning dynamic based on optimistic Hedge achieves one of the best-known regret upper bounds among strongly-uncoupled learning dynamics. With an appropriately chosen learning rate, the social and individual regrets can be bounded by $O(\log(mn))$ in terms of the numbers of actions $m$ and $n$ of the two players. This study investigates the optimality of the dependence on $m$ and $n$ in the regret of optimistic Hedge. To this end, we begin by refining existing regret analysis and show that, in the strongly-uncoupled setting where the opponent's number of actions is known, both the social and individual regret bounds can be improved to $O(\sqrt{\log m \log n})$. In this analysis, we express the regret upper bound as an optimization problem with respect to the learning rates and the coefficients of certain negative terms, enabling refined analysis of the leading constants. We then show that the existing social regret bound as well as these new social and individual regret upper bounds cannot be further improved for optimistic Hedge by providing algorithm-dependent individual regret lower bounds. Importantly, these social regret upper and lower bounds match exactly including the constant factor in the leading term. Finally, building on these results, we improve the last-iterate convergence rate and the dynamic regret of a learning dynamic based on optimistic Hedge, and complement these bounds with algorithm-dependent dynamic regret lower bounds that match the improved bounds.


Spotlight
Parameter-Free Dynamic Regret for Unconstrained Linear Bandits

Alberto Rumi ⋅ Andrew Jacobsen ⋅ Nicolò Cesa-Bianchi ⋅ Fabio Vitale

We study dynamic regret minimization in unconstrained adversarial linear bandit problems. In this setting, a learner must minimize the cumulative loss relative to an arbitrary sequence of comparators $\boldsymbol{u}_1,\ldots,\boldsymbol{u}_T$ in $\mathbb{R}^d$, but receives only *point-evaluation feedback* on each round. We provide a simple approach to combining the guarantees of several bandit algorithms, allowing us to optimally adapt to the number of switches $S_T = \sum_t\mathbb{I}\\{\boldsymbol{u}\_t \neq \boldsymbol{u}\_{t-1}\\}$ of an arbitrary comparator sequence. In particular, we provide the *first* algorithm for linear bandits achieving the optimal regret guarantee of order $\mathcal{O}\big(\sqrt{d(1+S_T) T}\big)$ up to poly-logarithmic terms *without prior knowledge of $S_T$*, thus resolving a long-standing open problem.


Spotlight
An Information-Geometric Approach to Artificial Curiosity

Alexander Nedergaard ⋅ Pablo Morales

Learning in environments with sparse rewards remains a fundamental challenge in reinforcement learning. Artificial curiosity addresses this limitation through intrinsic rewards to guide exploration, however, the precise formulation of these rewards has remained elusive. Ideally, such rewards should depend on the agent's information about the environment, remaining agnostic to its representation---an invariance central to information geometry. Leveraging this, we show that information monotonicity and invariance under the agent-environment interaction uniquely constrains intrinsic rewards to strictly concave functions of the reciprocal occupancy. Requiring these rewards to yield a principled exploration-exploitation trade-off, via information geodesic interpolation on the occupancy manifold, effectively limits the candidates to those determined by a scalar parameter. Remarkably, special values of this parameter are found to correspond to count-based and maximum entropy exploration. This framework provides important constraints to the engineering of intrinsic reward while integrating foundational exploration methods into a single, cohesive model.


Spotlight
Learning to Bid in Discriminatory Auctions with Budget Constraints

Negin Golrezaei ⋅ Sourav Sahoo

We study repeated bidding in multi-unit discriminatory (pay-as-bid) auctions for a single bidder with per-round utility equal to value minus $\alpha$ times payment, where $\alpha\in[0,1]$ is a cost-of-capital parameter. The bidder aims to maximize cumulative utility over $T$ rounds subject to a total budget $B$. The problem is challenging even without budgets: the action space is exponential in the bidder’s maximum demand $M$, and the valuation vector (context) varies over time. Exploiting a decomposition of utility across units, we develop polynomial-time learning algorithms based on shortest paths in a directed acyclic graph, obtaining sublinear regret under both full-information and bandit feedback. In the bandit setting, the regret is independent of the number of contexts due to complete cross-learning: observing the utility of the chosen action under the realized context reveals its utility for the same action under all counterfactual contexts. With budget constraints, when the average normalized per-round budget $\rho=\frac{B}{MT}<1$, we design a coupled primal-dual algorithm in which the DAG-based procedure uses dual-adjusted edge weights for primal updates, while online gradient descent updates the dual variable, yielding $\rho$-approximate sublinear regret. Finally, we give implementations whose per-round time and space are independent of the number of contexts, enabling scalability to large or even infinite context spaces.