Oral: RL & Optimization

Auditorium 1
Thu 2 May 1:30 a.m. PDT — 2:30 a.m. PDT

Conformal Contextual Robust Optimization

Yash Patel · Sahana Rayan · Ambuj Tewari

Data-driven approaches to predict-then-optimize decision-making problems seek to mitigate the risk of uncertainty region misspecification in safety-critical settings. Current approaches, however, suffer from considering overly conservative uncertainty regions, often resulting in suboptimal decision-making. To this end, we propose Conformal-Predict-Then-Optimize (CPO), a framework for leveraging highly informative, nonconvex conformal prediction regions over high-dimensional spaces based on conditional generative models, which have the desired distribution-free coverage guarantees. Despite guaranteeing robustness, such black-box optimization procedures alone inspire little confidence owing to the lack of explanation of why a particular decision was found to be optimal. We, therefore, augment CPO to additionally provide semantically meaningful visual summaries of the uncertainty regions to give qualitative intuition for the optimal decision. We highlight the CPO framework by demonstrating results on a suite of simulation-based inference benchmark tasks and a vehicle routing task based on probabilistic weather prediction.
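As a rough illustration of the distribution-free coverage guarantee mentioned above, the following is a minimal sketch of generic split conformal calibration. It is not the CPO procedure itself: the function names and the idea of passing in precomputed nonconformity scores are assumptions made for this example, and the score function (for instance, one derived from a conditional generative model) is left to the user.

```python
import numpy as np

def conformal_quantile(calibration_scores, alpha):
    """Return the split-conformal threshold for miscoverage level alpha.

    calibration_scores: nonconformity scores computed on held-out
    calibration pairs (how they are computed is up to the user).
    """
    scores = np.sort(np.asarray(calibration_scores, dtype=float))
    n = len(scores)
    # Finite-sample correction: use rank ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points: only the trivial region is valid.
        return np.inf
    return scores[k - 1]

def in_region(candidate_score, threshold):
    """A candidate outcome lies in the prediction region if its score
    does not exceed the calibrated threshold."""
    return candidate_score <= threshold
```

Under exchangeability of calibration and test points, the region `{y : score(x, y) <= threshold}` contains the true outcome with probability at least `1 - alpha`; a robust optimizer would then take the worst case over that region.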

Near-Optimal Policy Optimization for Correlated Equilibrium in General-Sum Markov Games

Yang Cai · Haipeng Luo · Chen-Yu Wei · Weiqiang Zheng

We study policy optimization algorithms for computing correlated equilibria in multi-player general-sum Markov Games. Previous results achieve an $\tilde{O}(T^{-1/2})$ convergence rate to a correlated equilibrium and an accelerated $\tilde{O}(T^{-3/4})$ convergence rate to the weaker notion of coarse correlated equilibrium. In this paper, we improve both results significantly by providing an uncoupled policy optimization algorithm that attains a near-optimal $\tilde{O}(T^{-1})$ convergence rate for computing a correlated equilibrium. Our algorithm is constructed by combining two main elements: (i) smooth value updates and (ii) the optimistic-follow-the-regularized-leader algorithm with the log barrier regularizer.
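For orientation, a generic optimistic follow-the-regularized-leader step with a log barrier regularizer can be written as below. The notation (step size $\eta$, loss vectors $\ell_s$, optimistic prediction $m_t$, action set $\mathcal{A}$) is assumed for this sketch and is not taken from the paper, whose exact update may differ.

```latex
x_t = \arg\min_{x \in \Delta(\mathcal{A})}
      \left\{ \Big\langle x,\; \sum_{s=1}^{t-1} \ell_s + m_t \Big\rangle
      + \frac{1}{\eta} R(x) \right\},
\qquad
R(x) = -\sum_{a \in \mathcal{A}} \log x_a .
```

A common choice for the prediction is $m_t = \ell_{t-1}$, the most recently observed loss vector.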

Model-based Policy Optimization under Approximate Bayesian Inference

Chaoqi Wang · Yuxin Chen · Kevin Murphy

Model-based reinforcement learning (MBRL) algorithms have considerable potential to improve sample efficiency in online reinforcement learning (RL). Nevertheless, many prevalent MBRL algorithms fail to adequately address the trade-off between exploration and exploitation. Posterior sampling reinforcement learning (PSRL) is a strategy that balances exploration and exploitation, but its theoretical guarantees depend on exact inference. In this paper, we show that adopting the same methodology as in exact PSRL can be suboptimal under approximate inference. Motivated by this analysis, we propose an improved factorization for the posterior distribution of policies by removing the conditional independence between the policy and data given the model. By adopting such a posterior factorization, we further propose a general algorithmic framework for PSRL under approximate inference and a practical instantiation of it. Empirically, our algorithm can surpass baseline methods by a significant margin on both dense-reward and sparse-reward tasks from the DeepMind Control Suite, OpenAI Gym, and Metaworld benchmarks.

Online Learning of Decision Trees with Thompson Sampling

Ayman Chaouki · Jesse Read · Albert Bifet

Decision Trees are prominent prediction models for interpretable Machine Learning. They have been thoroughly researched, mostly in the batch setting with a fixed labelled dataset, leading to popular algorithms such as C4.5, ID3 and CART. Unfortunately, these methods are heuristic in nature: they rely on greedy splits, which offer no guarantees of global optimality and often lead to unnecessarily complex and hard-to-interpret Decision Trees. Recent breakthroughs addressed this suboptimality issue in the batch setting, but no such work has considered the online setting with data arriving in a stream. To this end, we devise a new Monte Carlo Tree Search algorithm, Thompson Sampling Decision Trees (TSDT), able to produce optimal Decision Trees in an online setting. We analyse our algorithm and prove its almost sure convergence to the optimal tree. Furthermore, we conduct extensive experiments to validate our findings empirically. The proposed TSDT outperforms existing algorithms on several benchmarks, all while presenting the practical advantage of being tailored to the online setting.
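To give a feel for the Thompson Sampling principle that TSDT builds on, here is a minimal sketch on a Bernoulli multi-armed bandit. This is not the TSDT algorithm: it involves no tree search and no decision trees, and the function name, the Beta(1, 1) prior, and the simulated arm means are all assumptions made for illustration.

```python
import numpy as np

def thompson_bernoulli(true_means, n_rounds, seed=0):
    """Run Beta-Bernoulli Thompson Sampling and return pull counts per arm."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    successes = np.ones(k)  # Beta(1, 1) prior: one pseudo-success
    failures = np.ones(k)   # and one pseudo-failure per arm
    pulls = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        # Sample a plausible mean for each arm from its posterior.
        sampled_means = rng.beta(successes, failures)
        # Act greedily with respect to the sampled means.
        arm = int(np.argmax(sampled_means))
        # Observe a simulated Bernoulli reward and update the posterior.
        reward = int(rng.random() < true_means[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Calling `thompson_bernoulli([0.1, 0.9], 500)` returns how often each arm was pulled; as the posteriors sharpen, the sampler concentrates its pulls on the arm with the higher mean.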