### Oral

## Oral: Bandit & Causality

##### Auditorium 1

**Positivity-free Policy Learning with Observational Data**

Pan Zhao · Antoine Chambaz · julie Josse · Shu Yang

Policy learning utilizing observational data is pivotal across various domains, with the objective of learning the optimal treatment assignment policy while adhering to specific constraints such as fairness, budget, and simplicity. This study introduces a novel positivity-free (stochastic) policy learning framework designed to address the challenges posed by the impracticality of the positivity assumption in real-world scenarios. This framework leverages incremental propensity score policies to adjust propensity score values instead of assigning fixed values to treatments. We characterize these incremental propensity score policies and establish identification conditions, employing semiparametric efficiency theory to propose efficient estimators capable of achieving rapid convergence rates, even when integrated with advanced machine learning algorithms. This paper provides a thorough exploration of the theoretical guarantees associated with policy learning and validates the proposed framework's finite-sample performance through comprehensive numerical experiments, ensuring the identification of causal effects from observational data is both robust and reliable.

**Best-of-Both-Worlds Algorithms for Linear Contextual Bandits**

Yuko Kuroki · Alberto Rumi · Taira Tsuchiya · Fabio Vitale · NicolĂ˛ Cesa-Bianchi

We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms deliver near-optimal regret bounds in both the adversarial and stochastic regimes, without prior knowledge about the environment. In the stochastic regime, we achieve the polylogarithmic rate $\frac{(dK)^2\mathrm{poly}\!\log(dKT)}{\Delta_{\min}}$, where $\Delta_{\min}$ is the minimum suboptimality gap over the $d$-dimensional context space. In the adversarial regime, we obtain either the first-order $\widetilde{\mathcal{O}}(dK\sqrt{L^*})$ bound, or the second-order $\widetilde{\mathcal{O}}(dK\sqrt{\Lambda^*})$ bound, where $L^*$ is the cumulative loss of the best action and $\Lambda^*$ is a notion of the cumulative second moment for the losses incurred by the algorithm. Moreover, we develop an algorithm based on FTRL with Shannon entropy regularizer that does not require the knowledge of the inverse of the covariance matrix, and achieves a polylogarithmic regret in the stochastic regime while obtaining $\widetilde{\mathcal{O}}\big(dK\sqrt{T}\big)$ regret bounds in the adversarial regime.

**Policy Learning for Localized Interventions from Observational Data**

Myrl Marmarelis · Fred Morstatter · Aram Galstyan · Greg Ver Steeg

A largely unaddressed problem in causal inference is that of learning reliable policies in continuous, high-dimensional treatment variables from observational data. Especially in the presence of strong confounding, it can be infeasible to learn the entire heterogeneous response surface from treatment to outcome. It is also not particularly useful, when there are practical constraints on the size of the interventions altering the observational treatments. Since it tends to be easier to learn the outcome for treatments near existing observations, we propose a new framework for evaluating and optimizing the effect of small, tailored, and localized interventions that nudge the observed treatment assignments. Our doubly robust effect estimator plugs into a policy learner that stays within the interventional scope by optimal transport. Consequently, the error of the total policy effect is restricted to prediction errors nearby the observational distribution, rather than the whole response surface.

**Exploration via linearly perturbed loss minimisation**

David Janz · Shuai Liu · Alex Ayoub · Csaba Szepesvari

We introduce \emph{exploration via linear loss perturbations} (EVILL), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative log-likelihood function. We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards. In doing so, we provide a simple and clean explanation of when and why random reward perturbations give rise to good bandit algorithms. We propose data-dependent perturbations not present in previous PHE-type methods that allow EVILL to match the performance of Thompson-sampling-style parameter-perturbation methods, both in theory and in practice. Moreover, we show an example outside generalised linear bandits where PHE leads to inconsistent estimates, and thus linear regret, while EVILL remains performant. Like PHE, EVILL can be implemented in just a few lines of code.