### Session

## Poster Session 2

**Understanding the wiring evolution in differentiable neural architecture search**

Sirui Xie · Shoukang Hu · Xinjiang Wang · Chunxiao Liu · Jianping Shi · Xunying Liu · Dahua Lin

Controversy exists on whether differentiable neural architecture search methods discover wiring topology effectively. To understand how wiring topology evolves, we study the underlying mechanism of several existing differentiable NAS frameworks. Our investigation is motivated by three observed searching patterns of differentiable NAS: 1) they search by growing instead of pruning; 2) wider networks are more preferred than deeper ones; 3) no edges are selected in bi-level optimization. To anatomize these phenomena, we propose a unified view on searching algorithms of existing frameworks, transferring the global optimization to local cost minimization. Based on this reformulation, we conduct empirical and theoretical analyses, revealing implicit biases in the cost's assignment mechanism and evolution dynamics that cause the observed phenomena. These biases indicate strong discrimination towards certain topologies. To this end, we pose questions that future differentiable methods for neural wiring discovery need to confront, hoping to evoke a discussion and rethinking on how much bias has been enforced implicitly in existing NAS methods.

**Linear Regression Games: Convergence Guarantees to Approximate Out-of-Distribution Solutions**

Kartik Ahuja · Karthikeyan Shanmugam · Amit Dhurandhar

Recently, invariant risk minimization (IRM) (Arjovsky et al. 2019) was proposed as a promising solution to address out-of-distribution (OOD) generalization. In Ahuja et al. (2020), it was shown that solving for the Nash equilibria of a new class of “ensemble-games” is equivalent to solving IRM. In this work, we extend the framework in Ahuja et al. (2020) for linear regressions by projecting the ensemble-game on an $\ell_{\infty}$ ball. We show that such projections help achieve non-trivial out-of-distribution guarantees despite not achieving perfect invariance. For linear models with confounders, we prove that Nash equilibria of these games are closer to the ideal OOD solutions than the standard empirical risk minimization (ERM) and we also provide learning algorithms that provably converge to these Nash Equilibria. Empirical comparisons of the proposed approach with the state-of-the-art show consistent gains in achieving OOD solutions in several settings involving anti-causal variables and confounders.

**Revisiting the Role of Euler Numerical Integration on Acceleration and Stability in Convex Optimization**

Peiyuan Zhang · Antonio Orvieto · Hadi Daneshmand · Thomas Hofmann · Roy S. Smith

Viewing optimization methods as numerical integrators for ordinary differential equations (ODEs) provides a thought-provoking modern framework for studying accelerated first-order optimizers. In this literature, acceleration is often supposed to be linked to the quality of the integrator (accuracy, energy preservation, symplecticity). In this work, we propose a novel ordinary differential equation that questions this connection: both the explicit and the semi-implicit (a.k.a symplectic) Euler discretizations on this ODE lead to an accelerated algorithm for convex programming. Although semi-implicit methods are well-known in numerical analysis to enjoy many desirable features for the integration of physical systems, our findings show that these properties do not necessarily relate to acceleration.

**Fenchel-Young Losses with Skewed Entropies for Class-posterior Probability Estimation**

Han Bao · Masashi Sugiyama

We study class-posterior probability estimation (CPE) for binary responses where one class has much fewer data than the other. For example, events such as species co-occurrence in ecology and wars in political science are often much rarer than non-events. Logistic regression has been widely used for CPE, while it tends to underestimate the probability of rare events. Its main drawback is symmetry of the logit link---symmetric links can be misled by small and imbalanced samples because it is more incentivized to overestimate the majority class with finite samples. Parametric skewed links have been proposed to overcome this limitation, but their estimation usually results in nonconvex optimization unlike the logit link. Such nonconvexity is knotty not only from the computational viewpoint but also in terms of the parameter identifiability. In this paper, we provide a procedure to derive a convex loss for a skewed link based on the recently proposed Fenchel-Young losses. The derived losses are always convex and have a nice property suitable for class imbalance. The simulation shows the practicality of the derived losses.

**γ-ABC: Outlier-Robust Approximate Bayesian Computation Based on a Robust Divergence Estimator**

Masahiro Fujisawa · Takeshi Teshima · Issei Sato · Masashi Sugiyama

Approximate Bayesian computation (ABC) is a likelihood-free inference method that has been employed in various applications. However, ABC can be sensitive to outliers if a data discrepancy measure is chosen inappropriately. In this paper, we propose to use a nearest-neighbor-based γ-divergence estimator as a data discrepancy measure. We show that our estimator possesses a suitable robustness property called the redescending property. In addition, our estimator enjoys various desirable properties such as high flexibility, asymptotic unbiasedness, almost sure convergence, and linear time complexity. Through experiments, we demonstrate that our method achieves significantly higher robustness than existing discrepancy measures.

In this work, we focus on a variant of the generalized linear model (GLM) called corrupted GLM (CGLM) with heavy-tailed features and responses. To robustify the statistical inference on this model, we propose to apply L4-norm shrinkage to the feature vectors in the low-dimensional regime and apply elementwise shrinkage to them in the high-dimensional regime. Under bounded fourth moment assumptions, we show that the maximum likelihood estimator (MLE) based on the shrunk data enjoys nearly the minimax optimal rate with an exponential deviation bound. Our simulations demonstrate that the proposed feature shrinkage significantly enhances the statistical performance in linear regression and logistic regression on heavy-tailed data. Finally, we apply our shrinkage principle to guard against mislabeling and image noise in the human-written digit recognition problem. We add an L4-norm shrinkage layer to the original neural net and reduce the testing misclassification rate by more than 30% relatively in the presence of mislabeling and image noise.

**Learning Individually Fair Classifier with Path-Specific Causal-Effect Constraint**

Yoichi Chikahara · Shinsaku Sakaue · Akinori Fujino · Hisashi Kashima

Machine learning is used to make decisions for individuals in various fields, which require us to achieve good prediction accuracy while ensuring fairness with respect to sensitive features (e.g., race and gender). This problem, however, remains difficult in complex real-world scenarios. To quantify unfairness under such situations, existing methods utilize {\it path-specific causal effects}. However, none of them can ensure fairness for each individual without making impractical functional assumptions about the data. In this paper, we propose a far more practical framework for learning an individually fair classifier. To avoid restrictive functional assumptions, we define the {\it probability of individual unfairness} (PIU) and solve an optimization problem where PIU's upper bound, which can be estimated from data, is controlled to be close to zero. We elucidate why our method can guarantee fairness for each individual. Experimental results show that our method can learn an individually fair classifier at a slight cost of accuracy.

Probabilistic graphical models (PGMs) are effective for capturing the statistical dependencies in stochastic databases. In many domains (e.g., working with multimodal data), one faces multiple information layers that can be modeled by structurally similar PGMs. While learning the structures of PGMs in isolation is well-investigated, the algorithmic design and performance limits of learning from multiple coupled PGMs are investigated far less. This paper considers learning the structural similarities shared by a pair of Ising PGMs. The objective is learning the shared structure with no regard for the structures exclusive to either of the graphs, and significantly different from the existing approaches that focus on entire structure of the graphs. We propose an algorithm for the shared structure learning objective, evaluate its performance empirically, and compare with existing approaches on structure learning of single graphs.

Given a set $\mathcal{C}=\{C_i\}_{i=1}^m$ of square matrices, the matrix blind joint block diagonalization problem (BJBDP) is to find a full column rank matrix $A$ such that $C_i=A\Sigma_iA^{\T}$ for all $i$, where $\Sigma_i$'s are all block diagonal matrices with as many diagonal blocks as possible. The BJBDP plays an important role in independent subspace analysis. This paper considers the identification problem for BJBDP, that is, under what conditions and by what means, we can identify the diagonalizer $A$ and the block diagonal structure of $\Sigma_i$, especially when there is noise in $C_i$'s. In this paper, we propose a ``bi-block diagonalization'' method to solve BJBDP, and establish sufficient conditions for when the method is able to accomplish the task. Numerical simulations validate our theoretical results. To the best of the authors' knowledge, current numerical methods for BJBDP have no theoretical guarantees for the identification of the exact solution, whereas our method does.

**Tight Regret Bounds for Infinite-armed Linear Contextual Bandits**

Yingkai Li · Yining Wang · Xi Chen · Yuan Zhou

Linear contextual bandit is a class of sequential decision-making problems with important applications in recommendation systems, online advertising, healthcare, and other machine learning-related tasks. While there is much prior research, tight regret bounds of linear contextual bandit with infinite action sets remain open. In this paper, we consider the linear contextual bandit problem with (changing) infinite action sets. We prove a regret upper bound on the order of O(\sqrt{d^2T\log T}) \poly(\log\log T) where d is the domain dimension and T is the time horizon. Our upper bound matches the previous lower bound of \Omega(\sqrt{d^2 T\log T}) in [Li et al., 2019] up to iterated logarithmic terms.

**Generalization Bounds for Stochastic Saddle Point Problems**

Junyu Zhang · Mingyi Hong · Mengdi Wang · Shuzhong Zhang

This paper studies the generalization bounds for the empirical saddle point (ESP) solution to stochastic saddle point (SSP) problems. For SSP with Lipschitz continuous and strongly convex-strongly concave objective functions, we establish an {\footnotesize$\cO\left(1/n\right)$} generalization bound by using a probabilistic stability argument. We also provide generalization bounds under a variety of assumptions, including the cases without strong convexity and without bounded domains. We illustrate our results in three examples: batch policy learning in Markov decision process, stochastic composite optimization problem, and mixed strategy Nash equilibrium estimation for stochastic games. In each of these examples, we show that a regularized ESP solution enjoys a near-optimal sample complexity. To the best of our knowledge, this is the first set of results on the generalization theory of ESP.

**Neural Function Modules with Sparse Arguments: A Dynamic Approach to Integrating Information across Layers**

Alex Lamb · Anirudh Goyal · Agnieszka Słowik · Mike Mozer · Philippe Beaudoin · Yoshua Bengio

Feed-forward neural networks consist of a sequence of layers, in which each layer performs some processing on the information from the previous layer. A downside to this approach is that each layer (or module, as multiple modules can operate in parallel) is tasked with processing the entire hidden state, rather than a particular part of the state which is most relevant for that module. Methods which only operate on a small number of input variables are an essential part of most programming languages, and they allow for improved modularity and code re-usability. Our proposed method, Neural Function Modules (NFM), aims to introduce the same structural capability into deep learning. Most of the work in the context of feed-forward networks combining top-down and bottom-up feedback is limited to classification problems. The key contribution of our work is to combine attention, sparsity, top-down and bottom-up feedback, in a flexible algorithm which, as we show, improves the results in standard classification, out-of-domain generalization, generative modeling, and learning representations in the context of reinforcement learning.

**Predictive Power of Nearest Neighbors Algorithm under Random Perturbation**

Yue Xing · Qifan Song · Guang Cheng

This work investigates the predictive performance of the classical $k$ Nearest Neighbors ($k$-NN) algorithm when the testing data are corrupted by random perturbation. The impact of corruption level on the asymptotic regret is carefully characterized and we reveal a phase-transition phenomenon that, when the corruption level of the random perturbation $\omega$ is below a critical order (i.e., small-$\omega$ regime), the asymptotic regret remains the same; when it is beyond that order (i.e., large-$\omega$ regime), the asymptotic regret deteriorates polynomially. More importantly, the regret of $k$-NN classifier heuristically matches the rate of minimax regret for randomly perturbed testing data, thus implies the strong robustness of $k$-NN against random perturbation on testing data. In fact, we show that the classical $k$-NN can achieve no worse predictive performance, compared to the NN classifiers trained via the popular noise-injection strategy. Our numerical experiment also illustrates that combining $k$-NN component with modern learning algorithms will inherit the strong robustness of $k$-NN. As a technical by-product, we prove that under different model assumptions, the pre-processed 1-NN proposed in \cite{xue2017achieving} will at most achieve a sub-optimal rate when the data dimension $d>4$ even if $k$ is chosen optimally in the pre-processing step.

**The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry**

Tomohiro Hayase · Ryo Karakida

The Fisher information matrix (FIM) is fundamental to understanding the trainability of deep neural nets (DNN), since it describes the parameter space's local metric. We investigate the spectral distribution of the conditional FIM, which is the FIM given a single sample, by focusing on fully-connected networks achieving dynamical isometry. Then, while dynamical isometry is known to keep specific backpropagated signals independent of the depth, we find that the parameter space's local metric linearly depends on the depth even under the dynamical isometry. More precisely, we reveal that the conditional FIM's spectrum concentrates around the maximum and the value grows linearly as the depth increases. To examine the spectrum, considering random initialization and the wide limit, we construct an algebraic methodology based on the free probability theory. As a byproduct, we provide an analysis of the solvable spectral distribution in two-hidden-layer cases. Lastly, experimental results verify that the appropriate learning rate for the online training of DNNs is in inverse proportional to depth, which is determined by the conditional FIM's spectrum.

**Provably Efficient Safe Exploration via Primal-Dual Policy Optimization**

Dongsheng Ding · Xiaohan Wei · Zhuoran Yang · Zhaoran Wang · Mihailo Jovanovic

We study the safe reinforcement learning problem using the constrained Markov decision processes in which an agent aims to maximize the expected total reward subject to a safety constraint on the expected total value of a utility function. We focus on an episodic setting with the function approximation where the Markov transition kernels have a linear structure but do not impose any additional assumptions on the sampling model. Designing safe reinforcement learning algorithms with provable computational and statistical efficiency is particularly challenging under this setting because of the need to incorporate both the safety constraint and the function approximation into the fundamental exploitation/exploration tradeoff. To this end, we present an \underline{O}ptimistic \underline{P}rimal-\underline{D}ual Proximal Policy \underline{OP}timization \mbox{(OPDOP)} algorithm where the value function is estimated by combining the least-squares policy evaluation and an additional bonus term for safe exploration. We prove that the proposed algorithm achieves an $\tilde{O}(d H^{2.5}\sqrt{T})$ regret and an $\tilde{O}(d H^{2.5}\sqrt{T})$ constraint violation, where $d$ is the dimension of the feature mapping, $H$ is the horizon of each episode, and $T$ is the total number of steps. These bounds hold when the reward/utility functions are fixed but the feedback after each episode is bandit. Our bounds depend on the capacity of the state-action space only through the dimension of the feature mapping and thus our results hold even when the number of states goes to infinity. To the best of our knowledge, we provide the first provably efficient online policy optimization algorithm for constrained Markov decision processes in the function approximation setting, with safe exploration.

**Mirror Descent View for Neural Network Quantization**

Thalaiyasingam Ajanthan · Kartik Gupta · Philip Torr · RICHARD HARTLEY · Puneet Dokania

Quantizing large Neural Networks (NN) while maintaining the performance is highly desirable for resource-limited devices due to reduced memory and time complexity. It is usually formulated as a constrained optimization problem and optimized via a modified version of gradient descent. In this work, by interpreting the continuous parameters (unconstrained) as the dual of the quantized ones, we introduce a Mirror Descent (MD) framework for NN quantization. Specifically, we provide conditions on the projections (i.e., mapping from continuous to quantized ones) which would enable us to derive valid mirror maps and in turn the respective MD updates. Furthermore, we present a numerically stable implementation of MD that requires storing an additional set of auxiliary variables (unconstrained), and show that it is strikingly analogous to the Straight Through Estimator (STE) based method which is typically viewed as a “trick” to avoid vanishing gradients issue. Our experiments on CIFAR-10/100, TinyImageNet, and ImageNet classification datasets with VGG-16, ResNet-18, and MobileNetV2 architectures show that our MD variants yield state-of-the-art performance.

**Adversarially Robust Estimate and Risk Analysis in Linear Regression**

Yue Xing · Ruizhi Zhang · Guang Cheng

Adversarial robust learning aims to design algorithms that are robust to small adversarial perturbations on input variables. Beyond the existing studies on the predictive performance to adversarial samples, our goal is to understand statistical properties of adversarial robust estimates and analyze adversarial risk in the setup of linear regression models. By discovering the statistical minimax rate of convergence of adversarial robust estimators, we emphasize the importance of incorporating model information, e.g., sparsity, in adversarial robust learning. Further, we reveal an explicit connection of adversarial and standard estimates, and propose a straightforward two-stage adversarial training framework, which facilitates to utilize model structure information to improve adversarial robustness. In theory, the consistency of the adversarial robust estimator is proven and its Bahadur representation is also developed for the statistical inference purpose. The proposed estimator converges in a sharp rate under either low-dimensional or sparse scenario. Moreover, our theory confirms two phenomena in adversarial robust learning: adversarial robustness hurts generalization, and unlabeled data help improve the generalization. In the end, we conduct numerical simulations to verify our theory.

**Bayesian Model Averaging for Causality Estimation and its Approximation based on Gaussian Scale Mixture Distributions**

Shunsuke Horii

In estimation of the causal effect under linear Structural Causal Models (SCMs), it is common practice to first identify the causal structure, estimate the probability distributions, and then calculate the causal effect. However, if the goal is to estimate the causal effect, it is not necessary to fix a single causal structure or probability distributions. In this paper, we first show from a Bayesian perspective that it is Bayes optimal to weight (average) the causal effects estimated under each model rather than estimating a single model. This idea is also known as Bayesian model averaging. Although the Bayesian model averaging is optimal, as the number of candidate models increases, the weighting calculations become computationally hard. We develop an approximation to the Bayes optimal estimator by using Gaussian scale mixture distributions.

**Contrastive learning of strong-mixing continuous-time stochastic processes**

Bingbin Liu · Pradeep Ravikumar · Andrej Risteski

Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data. It has recently emerged as one of the leading learning paradigms in the absence of labels across many different domains (e.g. brain imaging, text, images). However, theoretical understanding of many aspects of training, both statistical and algorithmic, remain fairly elusive.

In this work, we study the setting of time series---more precisely, when we get data from a strong-mixing continuous-time stochastic process. We show that a properly constructed contrastive learning task can be used to the transition kernel for small-to-mid-range intervals in the diffusion case. Moreover, we give sample complexity bounds for solving this task and quantitatively characterize what the value of the contrastive loss implies for distributional closeness of the learned kernel. As a byproduct, we illuminate the appropriate settings for the contrastive distribution, as well as other hyperparameters in this setup.

Entropic regularization of policies in Reinforcement Learning (RL) is a commonly used heuristic to ensure that the learned policy explores the state-space sufficiently before overfitting to a local optimal policy. The primary motivation for using entropy is for exploration and disambiguating optimal policies; however, the theoretical effects are not entirely understood. In this work, we study the more general regularized RL objective and using Fenchel duality; we derive the dual problem which takes the form of an adversarial reward problem. In particular, we find that the optimal policy found by a regularized objective is precisely an optimal policy of a reinforcement learning problem under a worst-case adversarial reward. Our result allows us to reinterpret the popular entropic regularization scheme as a form of robustification. Furthermore, due to the generality of our results, we apply to other existing regularization schemes. Our results thus give insights into the effects of regularization of policies and deepen our understanding of exploration through robust rewards at large.

**A comparative study on sampling with replacement vs Poisson sampling in optimal subsampling**

HaiYing Wang · Jiahui Zou

Faced with massive data, subsampling is a commonly used technique to improve computational efficiency, and using nonuniform subsampling probabilities is an effective approach to improve estimation efficiency. For computational efficiency, subsampling is often implemented with replacement or through Poisson subsampling. However, no rigorous investigation has been performed to study the difference between the two subsampling procedures such as their estimation efficiency and computational convenience. In the context of maximizing a general target function, this paper derives optimal subsampling probabilities for both subsampling with replacement and Poisson subsampling. The optimal subsampling probabilities minimize variance functions of the subsampling estimators. Furthermore, they provide deep insights on the theoretical similarities and differences between subsampling with replacement and Poisson subsampling. Practically implementable algorithms are proposed based on the optimal structural results, which are evaluated by both theoretical and empirical analysis.

**Ridge Regression with Over-parametrized Two-Layer Networks Converge to Ridgelet Spectrum**

Sho Sonoda · Isao Ishikawa · Masahiro Ikeda

Characterization of local minima draws much attention in theoretical studies of deep learning. In this study, we investigate the distribution of parameters in an over-parametrized finite neural network trained by ridge regularized empirical square risk minimization (RERM). We develop a new theory of ridgelet transform, a wavelet-like integral transform that provides a powerful and general framework for the theoretical study of neural networks involving not only the ReLU but general activation functions. We show that the distribution of the parameters converges to a spectrum of the ridgelet transform. This result provides a new insight into the characterization of the local minima of neural networks, and the theoretical background of an inductive bias theory based on lazy regimes. We confirm the visual resemblance between the parameter distribution trained by SGD, and the ridgelet spectrum calculated by numerical integration through numerical experiments with finite models.

**Minimal enumeration of all possible total effects in a Markov equivalence class**

Richard Guo · Emilija Perkovic

In observational studies, when a total causal effect of interest is not identified, the set of all possible effects can be reported instead. This typically occurs when the underlying causal DAG is only known up to a Markov equivalence class, or a refinement thereof due to background knowledge. As such, the class of possible causal DAGs is represented by a maximally oriented partially directed acyclic graph (MPDAG), which contains both directed and undirected edges. We characterize the minimal additional edge orientations required to identify a given total effect. A recursive algorithm is then developed to enumerate subclasses of DAGs, such that the total effect in each subclass is identified as a distinct functional of the observed distribution. This resolves an issue with existing methods, which often report possible total effects with duplicates, namely those that are numerically distinct due to sampling variability but are in fact causally identical.

**DAG-Structured Clustering by Nearest Neighbors**

Nicholas Monath · Manzil Zaheer · Kumar Avinava Dubey · Amr Ahmed · Andrew McCallum

Hierarchical clusterings compactly encode multiple granularities of clusters within a tree structure. Hierarchies, by definition, fail to capture different flat partitions that are not subsumed in one another. In this paper, we advocate for an alternative structure for representing multiple clusterings, a directed acyclic graph (DAG). By allowing nodes to have multiple parents, DAG structures are not only more flexible than trees, but also allow for points to be members of multiple clusters. We describe a scalable algorithm, Llama, which simply merges nearest neighbor substructures to form a DAG structure. Llama discovers structures that are more accurate than state-of-the-art tree-based techniques while remaining scalable to large-scale clustering benchmarks. Additionally, we support the proposed algorithm with theoretical guarantees on separated data, including types of data that cannot be correctly clustered by tree-based algorithms.

**Tensor Networks for Probabilistic Sequence Modeling**

Jacob E Miller · Guillaume Rabusseau · John Terilla

Tensor networks are a powerful modeling framework developed for computational many-body physics, which have only recently been applied within machine learning. In this work we utilize a uniform matrix product state (u-MPS) model for probabilistic modeling of sequence data. We first show that u-MPS enable sequence-level parallelism, with length-n sequences able to be evaluated in depth O(log n). We then introduce a novel generative algorithm giving trained u-MPS the ability to efficiently sample from a wide variety of conditional distributions, each one defined by a regular expression. Special cases of this algorithm correspond to autoregressive and fill-in-the-blank sampling, but more complex regular expressions permit the generation of richly structured data in a manner that has no direct analogue in neural generative models. Experiments on sequence modeling with synthetic and real text data show u-MPS outperforming a variety of baselines and effectively generalizing their predictions in the presence of limited data.

We present a novel off-policy loss function for learning a transition model in model-based reinforcement learning. Notably, our loss is derived from the off-policy policy evaluation objective with an emphasis on correcting distribution shift. Compared to previous model-based techniques, our approach allows for greater robustness under model misspecification or distribution shift induced by learning/evaluating policies that are distinct from the data-generating policy. We provide a theoretical analysis and show empirical improvements over existing model-based off-policy evaluation methods. We provide further analysis showing our loss can be used for off-policy optimization (OPO) and demonstrate its integration with more recent improvements in OPO.

This paper presents the first non-asymptotic result showing a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap. We prove that the optimistic Q-learning studied in [Jin et al. 2018] enjoys a ${\mathcal{O}}\!\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap of the optimal Q-function. This bound matches the information theoretical lower bound in terms of $S,A,T$ up to a $\log\left(SA\right)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.

A general framework of personalized federated multi-armed bandits (PF-MAB) is proposed, which is a new bandit paradigm analogous to the federated learning (FL) framework in supervised learning and enjoys the features of FL with personalization. Under the PF-MAB framework, a mixed bandit learning problem that flexibly balances generalization and personalization is studied. A lower bound analysis for the mixed model is presented. We then propose the Personalized Federated Upper Confidence Bound (PF-UCB) algorithm, where the exploration length is chosen carefully to achieve the desired balance of learning the local model and supplying global information for the mixed learning objective. Theoretical analysis proves that PF-UCB achieves an O(log(T)) regret regardless of the degree of personalization, and has a similar instance dependency as the lower bound. Experiments using both synthetic and real-world datasets corroborate the theoretical analysis and demonstrate the effectiveness of the proposed algorithm.

**Shadow Manifold Hamiltonian Monte Carlo**

Chris van der Heide · Fred Roosta · Liam Hodgkinson · Dirk Kroese

Hamiltonian Monte Carlo and its descendants have found success in machine learning and computational statistics due to their ability to draw samples in high dimensions with greater efficiency than classical MCMC. One of these derivatives, Riemannian manifold Hamiltonian Monte Carlo (RMHMC), better adapts the sampler to the geometry of the target density, allowing for improved performances in sampling problems with complex geometric features. Other approaches have boosted acceptance rates by sampling from an integrator-dependent “shadow density” and compensating for the induced bias via importance sampling. We combine the benefits of RMHMC with those attained by sampling from the shadow density, by deriving the shadow Hamiltonian corresponding to the generalized leapfrog integrator used in RMHMC. This leads to a new algorithm, shadow manifold Hamiltonian Monte Carlo, that shows improved performance over RMHMC, and leaves the target density invariant.

Motivated by the EU's "Right To Be Forgotten" regulation, we initiate a study of statistical data deletion problems where users' data are accessible only for a limited period of time. This setting is formulated as an online supervised learning task with \textit{constant memory limit}. We propose a deletion-aware algorithm \texttt{FIFD-OLS} for the low dimensional case, and witness a catastrophic rank swinging phenomenon due to the data deletion operation, which leads to statistical inefficiency. As a remedy, we propose the \texttt{FIFD-Adaptive Ridge} algorithm with a novel online regularization scheme, that effectively offsets the uncertainty from deletion. In theory, we provide the cumulative regret upper bound for both online forgetting algorithms. In the experiment, we showed \texttt{FIFD-Adaptive Ridge} outperforms the ridge regression algorithm with fixed regularization level, and hopefully sheds some light on more complex statistical models.

**Differentially Private Monotone Submodular Maximization Under Matroid and Knapsack Constraints**

Omid Sadeghi · Maryam Fazel

Numerous tasks in machine learning and artificial intelligence have been modeled as submodular maximization problems. These problems usually involve sensitive data about individuals, and in addition to maximizing the utility, privacy concerns should be considered. In this paper, we study the general framework of non-negative monotone submodular maximization subject to matroid or knapsack constraints in both offline and online settings. For the offline setting, we propose a differentially private $(1-\frac{\kappa}{e})$-approximation algorithm, where $\kappa\in[0,1]$ is the total curvature of the submodular set function, which improves upon prior works in terms of approximation guarantee and query complexity under the same privacy budget. In the online setting, we propose the first differentially private algorithm, and we specify the conditions under which the regret bound scales as $\O(\sqrt{T})$, i.e., privacy could be ensured while maintaining the same regret bound as the optimal regret guarantee in the non-private setting.

In this paper, we establish the ordinary differential equation (ODE) that underlies the training dynamics of Model-Agnostic Meta-Learning (MAML). Our continuous-time limit view of the process eliminates the influence of the manually chosen step size of gradient descent and includes the existing gradient descent training algorithm as a special case that results from a specific discretization. We show that the MAML ODE enjoys a linear convergence rate to an approximate stationary point of the MAML loss function for strongly convex task losses, even when the corresponding MAML loss is non-convex. Moreover, through the analysis of the MAML ODE, we propose a new BI-MAML training algorithm that reduces the computational burden associated with existing MAML training methods, and empirical experiments are performed to showcase the superiority of our proposed methods in the rate of convergence with respect to the vanilla MAML algorithm.

**Hierarchical Inducing Point Gaussian Process for Inter-domian Observations**

Luhuan Wu · Andrew Miller · Lauren Anderson · Geoff Pleiss · David Blei · John Cunningham

We examine the general problem of inter-domain Gaussian Processes (GPs): problems where the GP realization and the noisy observations of that realization lie on different domains. When the mapping between those domains is linear, such as integration or diﬀerentiation, inference is still closed form. However, many of the scaling and approximation techniques that our community has developed do not apply to this setting. In this work, we introduce the hierarchical inducing point GP (HIP-GP), a scalable inter-domain GP inference method that enables us to improve the approximation accuracy by increasing the number of inducing points to the millions. HIP-GP, which relies on inducing points with grid structure and a stationary kernel assumption, is suitable for low-dimensional problems. In developing HIP-GP, we introduce (1) a fast whitening strategy, and (2) a novel preconditioner for conjugate gradients which can be helpful in general GP settings.

**Independent Innovation Analysis for Nonlinear Vector Autoregressive Process**

Hiroshi Morioka · Hermanni Hälvä · Aapo Hyvarinen

The nonlinear vector autoregressive (NVAR) model provides an appealing framework to analyze multivariate time series obtained from a nonlinear dynamical system. However, the innovation (or error), which plays a key role by driving the dynamics, is almost always assumed to be additive. Additivity greatly limits the generality of the model, hindering analysis of general NVAR processes which have nonlinear interactions between the innovations. Here, we propose a new general framework called independent innovation analysis (IIA), which estimates the innovations from completely general NVAR. We assume mutual independence of the innovations as well as their modulation by an auxiliary variable (which is often taken as the time index and simply interpreted as nonstationarity). We show that IIA guarantees the identifiability of the innovations with arbitrary nonlinearities, up to a permutation and component-wise invertible nonlinearities. We also propose three estimation frameworks depending on the type of the auxiliary variable. We thus provide the first rigorous identifiability result for general NVAR, as well as very general tools for learning such models.

Labeling cost is often expensive and is a fundamental limitation of supervised learning. In this paper, we study importance labeling problem, in which we are given many unlabeled data and select a limited number of data to be labeled from the unlabeled data, and then a learning algorithm is executed on the selected one. We propose a new importance labeling scheme that can effectively select an informative subset of unlabeled data in least squares regression in Reproducing Kernel Hilbert Spaces (RKHS). We analyze the generalization error of gradient descent combined with our labeling scheme and show that the proposed algorithm achieves the optimal rate of convergence in much wider settings and especially gives much better generalization ability in a small noise setting than the usual uniform sampling scheme. Numerical experiments verify our theoretical findings.

**GANs with Conditional Independence Graphs: On Subadditivity of Probability Divergences**

Mucong Ding · Constantinos Daskalakis · Soheil Feizi

Generative Adversarial Networks (GANs) are modern methods to learn the underlying distribution of a data set. GANs have been widely used in sample synthesis, de-noising, domain transfer, etc. GANs, however, are designed in a model-free fashion where no additional information about the underlying distribution is available. In many applications, however, practitioners have access to the underlying independence graph of the variables, either as a Bayesian network or a Markov Random Field (MRF). We ask: how can one use this additional information in designing model-based GANs? In this paper, we provide theoretical foundations to answer this question by studying subadditivity properties of probability divergences, which establish upper bounds on the distance between two high-dimensional distributions by the sum of distances between their marginals over (local) neighborhoods of the graphical structure of the Bayes-net or the MRF. We prove that several popular probability divergences satisfy some notion of subadditivity under mild conditions. These results lead to a principled design of a model-based GAN that uses a set of simple discriminators on the neighborhoods of the Bayes-net/MRF, rather than a giant discriminator on the entire network, providing significant statistical and computational benefits. Our experiments on synthetic and real-world datasets demonstrate the benefits of our principled design of model-based GANs.

**Hindsight Expectation Maximization for Goal-conditioned Reinforcement Learning**

Yunhao Tang · Alp Kucukelbir

We propose a graphical model framework for goal-conditioned RL, with an EM algorithm that operates on the lower bound of the RL objective. The E-step provides a natural interpretation of how 'learning in hindsight' techniques, such as HER, to handle extremely sparse goal-conditioned rewards. The M-step reduces policy optimization to supervised learning updates, which greatly stabilizes end-to-end training on high-dimensional inputs such as images. We show that the combined algorithm, hEM significantly outperforms model-free baselines on a wide range of goal-conditioned benchmarks with sparse rewards.

There is an increasing use of algorithms to inform decisions in many settings, from student evaluations, college admissions, to credit scoring. These decisions are made by applying a decision rule to individual's observed features. Given the impacts of these decisions on individuals, decision makers are increasingly required to be transparent on their decision making to offer the ``right to explanation.'' Meanwhile, being transparent also invites potential manipulations, also known as gaming, that the individuals can utilize the knowledge to strategically alter their features in order to receive a more beneficial decision.

In this work, we study the problem of \emph{robust} decision-making under strategic behavior. Prior works often assume that the decision maker has full knowledge of individuals' cost structure for manipulations. We study the robust variant that relaxes this assumption: The decision maker does not have full knowledge but knows only a subset of the individuals' available actions and associated costs. To approach this non-quantifiable uncertainty, we define robustness based on the worst-case guarantee of a decision, over all possible actions (including actions unknown to the decision maker) individuals might take. A decision rule is called \emph{robust optimal} if its worst case performance is (weakly) better than that of all other decision rules. Our main contributions are two-fold. First, we provide a crisp characterization of the above robust optimality: For any decision rules under mild conditions that are robust optimal, there exists a linear decision rule that is equally robust optimal. Second, we explore the computational problem of searching for the robust optimal decision rule and interestingly, we demonstrate the problem is closely related to distributionally robust optimization. We believe our results promotes the use of simple linear decisions with uncertain individual manipulations.

**Minimax Estimation of Laplacian Constrained Precision Matrices**

Jiaxi Ying · José Vinícius de Miranda Cardoso · Daniel Palomar

This paper considers the problem of high-dimensional sparse precision matrix estimation under Laplacian constraints. We prove that the Laplacian constraints bring favorable properties for estimation: the Gaussian maximum likelihood estimator exists and is unique almost surely on the basis of one observation, irrespective of the dimension. We establish the optimal rate of convergence under Frobenius norm by the derivation of the minimax lower and upper bounds. The minimax lower bound is obtained by applying Le Cam-Assouad's method with a novel construction of a subparameter space of multivariate normal distributions. The minimax upper bound is established by designing an adaptive $\ell_1$-norm regularized maximum likelihood estimation method and quantifying the rate of convergence. We prove that the proposed estimator attains the optimal rate of convergence with an overwhelming probability. Numerical experiments demonstrate the effectiveness of the proposed estimator.

**Beyond Marginal Uncertainty: How Accurately can Bayesian Regression Models Estimate Posterior Predictive Correlations?**

Chaoqi Wang · Shengyang Sun · Roger Grosse

While uncertainty estimation is a well-studied topic in deep learning, most such work focuses on marginal uncertainty estimates, i.e. the predictive mean and variance at individual input locations. But it is often more useful to estimate predictive correlations between the function values at different input locations. In this paper, we consider the problem of benchmarking how accurately Bayesian models can estimate predictive correlations. We first consider a downstream task which depends on posterior predictive correlations: transductive active learning (TAL). We find that TAL makes better use of models' uncertainty estimates than ordinary active learning, and recommend this as a benchmark for evaluating Bayesian models. Since TAL is too expensive and indirect to guide development of algorithms, we introduce two metrics which more directly evaluate the predictive correlations and which can be computed efficiently: meta-correlations (i.e. the correlations between the models correlation estimates and the true values), and cross-normalized likelihoods (XLL). We validate these metrics by demonstrating their consistency with TAL performance and obtain insights about the relative performance of current Bayesian neural net and Gaussian process models.

**A Parameter-Free Algorithm for Misspecified Linear Contextual Bandits**

Kei Takemura · Shinji Ito · Daisuke Hatano · Hanna Sumita · Takuro Fukunaga · Naonori Kakimura · Ken-ichi Kawarabayashi

We investigate the misspecified linear contextual bandit (MLCB) problem, which is a generalization of the linear contextual bandit (LCB) problem. The MLCB problem is a decision-making problem in which a learner observes $d$-dimensional feature vectors, called arms, chooses an arm from $K$ arms, and then obtains a reward from the chosen arm in each round. The learner aims to maximize the sum of the rewards over $T$ rounds. In contrast to the LCB problem, the rewards in the MLCB problem may not be represented by a linear function in feature vectors; instead, it is approximated by a linear function with additive approximation parameter $\varepsilon \geq 0$. In this paper, we propose an algorithm that achieves $\tilde{O}(\sqrt{dT\log(K)} + \eps\sqrt{d}T)$ regret, where $\tilde{O}(\cdot)$ ignores polylogarithmic factors in $d$ and $T$. This is the first algorithm that guarantees a high-probability regret bound for the MLCB problem without knowledge of the approximation parameter $\varepsilon$.

**Consistent k-Median: Simpler, Better and Robust**

Xiangyu Guo · Janardhan Kulkarni · Shi Li · Jiayi Xian

In this paper we introduce and study the online consistent k-clustering with outliers problem, generalizing the non-outlier version of the problem studied in Lattanzi-Vassilvitskii [18]. We show that a simple local-search based on-line algorithm can give a bicriteria constant approximation for the problem with O(k^2 log^2(nD)) swaps of medians (recourse) in total, where D is the diameter of the metric. When restricted to the problem without outliers, our algorithm is simpler, deterministic and gives better approximation ratio and recourse, compared to that of Lattanzi-Vassilvitskii [18].

**Optimal query complexity for private sequential learning against eavesdropping**

Jiaming Xu · Kuang Xu · Dana Yang

We study the query complexity of a learner-private sequential learning problem, motivated by the privacy and security concerns due to eavesdropping that arise in practical applications such as pricing and Federated Learning. A learner tries to estimate an unknown scalar value, by sequentially querying an external database and receiving binary responses; meanwhile, a third-party adversary observes the learner's queries but not the responses. The learner's goal is to design a querying strategy with the minimum number of queries (optimal query complexity) so that she can accurately estimate the true value, while the eavesdropping adversary even with the complete knowledge of her querying strategy cannot.

**On Projection Robust Optimal Transport: Sample Complexity and Model Misspecification**

Tianyi Lin · Zeyu Zheng · Elynn Chen · Marco Cuturi · Michael Jordan

Optimal transport (OT) distances are increasingly used as loss functions for statistical inference, notably in the learning of generative models or supervised learning. Yet, the behavior of minimum Wasserstein estimators is poorly understood, notably in high-dimensional regimes or under model misspecification. In this work we adopt the viewpoint of projection robust (PR) OT, which seeks to maximize the OT cost between two measures by choosing a $k$-dimensional subspace onto which they can be projected. Our first contribution is to establish several fundamental statistical properties of PR Wasserstein distances, complementing and improving previous literature that has been restricted to one-dimensional and well-specified cases. Next, we propose the integral PR Wasserstein (IPRW) distance as an alternative to the PRW distance, by averaging rather than optimizing on subspaces. Our complexity bounds can help explain why both PRW and IPRW distances outperform Wasserstein distances empirically in high-dimensional inference tasks. Finally, we consider parametric inference using the PRW distance. We provide an asymptotic guarantee of two types of minimum PRW estimators and formulate a central limit theorem for max-sliced Wasserstein estimator under model misspecification. To enable our analysis on PRW with projection dimension larger than one, we devise a novel combination of variational analysis and statistical theory.

**Learning Temporal Point Processes with Intermittent Observations**

Vinayak Gupta · Srikanta Bedathur · Sourangshu Bhattacharya · Abir De

Marked temporal point processes (MTPP) have emerged as a powerful framework to model the underlying generative mechanism of asynchronous events localized in continuous time. Most existing models and inference methods in MTPP framework consider only the complete observation scenario i.e. the event sequence being modeled is completely observed with no missing events – an ideal setting barely encountered in practice. A recent line of work which considers missing events uses supervised learning techniques which require a missing or observed label for each event. In this work, we provide a novel unsupervised model and inference method for MTPPs in presence of missing events. We first model the generative processes of observed events and missing events using two MTPPs, where the missing events are represented as latent random variables. Then we devise an unsupervised training method that jointly learns both the MTPPs by means of variational inference. Experiments with real datasets show that our modeling and inference frameworks can effectively impute the missing data among the observed events, which in turn enhances its predictive prowess.

We study alternating least-squares (ALS) for canonical correlation analysis (CCA). Recent research shows that the alternating least-squares solver for k-CCA can be directly accelerated with momentum and prominent performance gain has been observed in practice for the resulting simple algorithm. However, despite the simplicity, it is difficult for the accelerated rate to be analyzed in theory in order to explain and match the empirical performance gain. By looking into two neighboring iterations, in this work, we propose an even simpler variant of the faster alternating least-squares solver. Instead of applying momentum to each update for acceleration, the proposed variant only leverages momentum at every other iteration and can converge at a provably faster linear rate of nearly square-root dependence on the singular value gap of the whitened cross-covariance matrix. In addition to the high consistency between theory and practice, experimental studies also show that our variant of the alternating least-squares algorithm as a block CCA solver is even more pass efficient than other variants.

**PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming**

Alexander Lew · Monica Agrawal · David Sontag · Vikash Mansinghka

Data cleaning is naturally framed as probabilistic inference in a generative model of ground-truth data and likely errors, but the diversity of real-world error patterns and the hardness of inference make Bayesian approaches difficult to automate. We present PClean, a probabilistic programming language (PPL) for leveraging dataset-specific knowledge to automate Bayesian cleaning. Compared to general-purpose PPLs, PClean tackles a restricted problem domain, enabling three modeling and inference innovations: (1) a non-parametric model of relational database instances, which users' programs customize; (2) a novel sequential Monte Carlo inference algorithm that exploits the structure of PClean's model class; and (3) a compiler that generates near-optimal SMC proposals and blocked-Gibbs rejuvenation kernels based on the user's model and data. We show empirically that short (< 50-line) PClean programs can: be faster and more accurate than generic PPL inference on data-cleaning benchmarks; match state-of-the-art data-cleaning systems in terms of accuracy and runtime (unlike generic PPL inference in the same runtime); and scale to real-world datasets with millions of records.

**Robust Imitation Learning from Noisy Demonstrations**

Voot Tangkaratt · Nontawat Charoenphakdee · Masashi Sugiyama

Robust learning from noisy demonstrations is a practical but highly challenging problem in imitation learning. In this paper, we first theoretically show that robust imitation learning can be achieved by optimizing a classification risk with a symmetric loss. Based on this theoretical finding, we then propose a new imitation learning method that optimizes the classification risk by effectively combining pseudo-labeling with co-training. Unlike existing methods, our method does not require additional labels or strict assumptions about noise distributions. Experimental results on continuous-control benchmarks show that our method is more robust compared to state-of-the-art methods.

Optimal transport (OT) theory provides powerful tools to compare probability measures. However, OT is limited to nonnegative measures having the same mass, and suffers serious drawbacks about its computation and statistics. This leads to several proposals of regularized variants of OT in the recent literature. In this work, we consider an entropy partial transport (EPT) problem for nonnegative measures on a tree having different masses. The EPT is shown to be equivalent to a standard complete OT problem on a one-node extended tree. We derive its dual formulation, then leverage this to propose a novel regularization for EPT which admits fast computation and negative definiteness. To our knowledge, the proposed regularized EPT is the first approach that yields a closed-form solution among available variants of unbalanced OT for general nonnegative measures. For practical applications without prior knowledge about the tree structure for measures, we propose tree-sliced variants of the regularized EPT, computed by averaging the regularized EPT between these measures using random tree metrics, built adaptively from support data points. Exploiting the negative definiteness of our regularized EPT, we introduce a positive definite kernel, and evaluate it against other baselines on benchmark tasks such as document classification with word embedding and topological data analysis. In addition, we empirically demonstrate that our regularization also provides effective approximations.

We study learning algorithms that seek to minimize the conditional value-at-risk (CVaR), when all the learner knows is that the losses (and gradients) incurred may be heavy-tailed. We begin by studying a general-purpose estimator of CVaR for potentially heavy-tailed random variables, which is easy to implement in practice, and requires nothing more than finite variance and a distribution function that does not change too fast or slow around just the quantile of interest. With this estimator in hand, we then derive a new learning algorithm which robustly chooses among candidates produced by stochastic gradient-driven sub-processes, obtain excess CVaR bounds, and finally complement the theory with a regression application.

Using only samples from a probabilistic model, we predict properties of the model and of future observations. The prediction game continues in an online fashion as the sample size grows with new observations. After each prediction, the predictor incurs a binary (0-1) loss. The probability model underlying a sample is otherwise unknown except that it belongs to a known class of models. The goal is to make finitely many errors (i.e. loss of 1) with probability 1 under the generating model, no matter what it may be in the known model class.

Model classes admitting predictors that make only finitely many errors are eventually almost surely (eas) predictable. When the losses incurred are observable (the supervised case), we completely characterize eas predictable classes. We provide analogous results in the unsupervised case. Our results have a natural interpretation in terms of regularization. In eas-predictable classes, we study if it is possible to have a universal stopping rule that identifies (to any given confidence) when no more errors will be made. Classes admitting such a stopping rule are eas learnable. When samples are generated iid, we provide a complete characterization of eas learnability. We also study cases when samples are not generated iid, but a full characterization remains open at this point.

**Mean-Variance Analysis in Bayesian Optimization under Uncertainty**

Shogo Iwazaki · Yu Inatsu · Ichiro Takeuchi

```
We consider active learning (AL) in an uncertain environment in which trade-off between multiple risk measures need to be considered. As an AL problem in such an uncertain environment, we study Mean-Variance Analysis in Bayesian Optimization (MVA-BO) setting. Mean-variance analysis was developed in the field of financial engineering and has been used to make decisions that take into account the trade-off between the average and variance of investment uncertainty. In this paper, we specifically focus on BO setting with an uncertain component and consider multi-task, multi-objective, and constrained optimization scenarios for the mean-variance trade-off of the uncertain component. When the target blackbox function is modeled by Gaussian Process (GP), we derive the bounds of the two risk measures and propose AL algorithm for each of the above three scenarios based on the risk measure bounds. We show the effectiveness of the proposed AL algorithms through theoretical analysis and numerical experiments.
```

**Sample efficient learning of image-based diagnostic classifiers via probabilistic labels**

Roberto Vega · Pouneh Gorji · Zichen Zhang · Xuebin Qin · Abhilash Rakkunedeth · Jeevesh Kapur · Jacob Jaremko · Russell Greiner

Deep learning approaches often require huge datasets to achieve good generalization. This complicates its use in tasks like image-based medical diagnosis, where the small training datasets are usually insufficient to learn appropriate data representations. For such sensitive tasks it is also important to provide the confidence in the predictions. Here, we propose a way to learn and use probabilistic labels to train accurate and calibrated deep networks from relatively small datasets. We observe gains of up to 22% in the accuracy of models trained with these labels, as compared with traditional approaches, in three classification tasks: diagnosis of hip dysplasia, fatty liver, and glaucoma. The outputs of models trained with probabilistic labels are calibrated, allowing the interpretation of its predictions as proper probabilities. We anticipate this approach will apply to other tasks where few training instances are available and expert knowledge can be encoded as probabilities.

**Beyond Perturbation Stability: LP Recovery Guarantees for MAP Inference on Noisy Stable Instances**

Hunter Lang · Aravind Reddy Talla · David Sontag · Aravindan Vijayaraghavan

Several works have shown that perturbation stable instances of the MAP inference problem can be solved exactly using a natural linear programming (LP) relaxation. However, most of these works give few (or no) guarantees for the LP solutions on instances that do not satisfy the relatively strict perturbation stability definitions. In this work, we go beyond these stability results by showing that the LP approximately recovers the MAP solution of a stable instance even after the instance is corrupted by noise. This "noisy stable" model realistically fits with practical MAP inference problems: we design an algorithm for finding "close" stable instances, and show that several real-world instances from computer vision have nearby instances that are perturbation stable. These results suggest a new theoretical explanation for the excellent performance of this LP relaxation in practice.

**Experimental Design for Regret Minimization in Linear Bandits**

Andrew Wagenmaker · Julian Katz-Samuels · Kevin Jamieson

In this paper we propose a novel experimental design-based algorithm to minimize regret in online stochastic linear and combinatorial bandits. While existing literature tends to focus on optimism-based algorithms--which have been shown to be suboptimal in many cases--our approach carefully plans which action to take by balancing the tradeoff between information gain and reward, overcoming the failures of optimism. In addition, we leverage tools from the theory of suprema of empirical processes to obtain regret guarantees that scale with the Gaussian width of the action set, avoiding wasteful union bounds. We provide state-of-the-art finite time regret guarantees and show that our algorithm can be applied in both the bandit and semi-bandit feedback regime. In the combinatorial semi-bandit setting, we show that our algorithm is computationally efficient and relies only on calls to a linear maximization oracle. In addition, we show that with slight modification our algorithm can be used for pure exploration, obtaining state-of-the-art pure exploration guarantees in the semi-bandit setting. Finally, we provide, to the best of our knowledge, the first example where optimism fails in the semi-bandit regime, and show that in this setting our algorithm succeeds.

The Iterative Hard Thresholding (IHT) algorithm is one of the most popular and promising greedy pursuit methods for high-dimensional statistical estimation under cardinality constraint. The existing analysis of IHT mostly focuses on parameter estimation and sparsity recovery consistency. From the perspective of statistical learning theory, another fundamental question is how well the IHT estimation would perform on unseen samples. The answer to this question is important for understanding the generalization ability of IHT yet has remaind elusive. In this paper, we investigate this problem and develop a novel generalization theory for IHT from the viewpoint of algorithmic stability. Our theory reveals that: 1) under natural conditions on the empirical risk function over $n$ samples of dimension $p$, IHT with sparsity level $k$ enjoys an $\mathcal{\tilde O}(n^{-1/2}\sqrt{k\log(n)\log(p)})$ rate of convergence in sparse excess risk; and 2) a fast rate of order $\mathcal{\tilde O}(n^{-1}k(\log^3(n)+\log(p)))$ can be derived for strongly convex risk function under certain strong-signal conditions. The results have been substantialized to sparse linear regression and logistic regression models along with numerical evidence provided to support our theory.

**Parametric Programming Approach for More Powerful and General Lasso Selective Inference**

Vo Nguyen Le Duy · Ichiro Takeuchi

Selective Inference (SI) has been actively studied in the past few years for conducting inference on the features of linear models that are adaptively selected by feature selection methods such as Lasso. The basic idea of SI is to make inference conditional on the selection event. Unfortunately, the main limitation of the original SI approach for Lasso is that the inference is conducted not only conditional on the selected features but also on their signs---this leads to loss of power because of over-conditioning. Although this limitation can be circumvented by considering the union of such selection events for all possible combinations of signs, this is only feasible when the number of selected features is sufficiently small. To address this computational bottleneck, we propose a parametric programming-based method that can conduct SI without conditioning on signs even when we have thousands of active features. The main idea is to compute the continuum path of Lasso solutions in the direction of a test statistic, and identify the subset of the data space corresponding to the feature selection event by following the solution path. The proposed parametric programming-based method not only avoids the aforementioned computational bottleneck but also improves the performance and practicality of SI for Lasso in various respects. We conduct several experiments to demonstrate the effectiveness and efficiency of our proposed method.

**One-pass Stochastic Gradient Descent in overparametrized two-layer neural networks**

Hanjing Zhu · Jiaming Xu

There has been a recent surge of interest in understanding the convergence of gradient descent (GD) and stochastic gradient descent (SGD) in overparameterized neural networks. Most previous work assumes that the training data is provided a priori in a batch, while less attention has been paid to the important setting where the training data arrives in a stream. In this paper, we study the streaming data setup and show that with overparamterization and random initialization, the prediction error of two-layer neural networks under one-pass SGD converges in expectation. The convergence rate depends on the eigen-decomposition of the integral operator associated with the so-called neural tangent kernel (NTK). A key step of our analysis is to show a random kernel function converges to the NTK with high probability using the VC dimension and McDiarmid's inequality.

**On the Absence of Spurious Local Minima in Nonlinear Low-Rank Matrix Recovery Problems**

Yingjie Bi · Javad Lavaei

The restricted isometry property (RIP) is a well-known condition that guarantees the absence of spurious local minima in low-rank matrix recovery problems with linear measurements. In this paper, we introduce a novel property named bound difference property (BDP) to study low-rank matrix recovery problems with nonlinear measurements. Using RIP and BDP jointly, we propose a new criterion to certify the nonexistence of spurious local minima in the rank-1 case, and prove that it leads to a much stronger theoretical guarantee than the existing bounds on RIP.

**Flow-based Alignment Approaches for Probability Measures in Different Spaces**

Tam Le · Nhat Ho · Makoto Yamada

Gromov-Wasserstein (GW) is a powerful tool to compare probability measures whose supports are in different metric spaces. However, GW suffers from a computational drawback since it requires to solve a complex non-convex quadratic program. In this work, we consider a specific family of cost metrics, namely, tree metrics for supports of each probability measure, to develop efficient and scalable discrepancies between the probability measures. Leveraging a tree structure, we propose to align flows from a root to each support instead of pair-wise tree metrics of supports, i.e., flows from a support to another support, in GW. Consequently, we propose a novel discrepancy, named Flow-based Alignment (FlowAlign), by matching the flows of the probability measures. FlowAlign is computationally fast and scalable for large-scale applications. Further exploring the tree structure, we propose a variant of FlowAlign, named Depth-based Alignment (DepthAlign), by aligning the flows hierarchically along each depth level of the tree structures. Theoretically, we prove that both FlowAlign and DepthAlign are pseudo-metrics. We also derive tree-sliced variants of the proposed discrepancies for applications without prior knowledge about tree structures for probability measures, computed by averaging FlowAlign/DepthAlign using random tree metrics, adaptively sampled from supports of probability measures. Empirically, we test our proposed approaches against other variants of GW baselines on a few benchmark tasks.

**Exponential Convergence Rates of Classification Errors on Learning with SGD and Random Features**

Shingo Yashima · Atsushi Nitanda · Taiji Suzuki

Although kernel methods are widely used in many learning problems, they have poor scalability to large datasets. To address this problem, sketching and stochastic gradient methods are the most commonly used techniques to derive computationally efficient learning algorithms. We consider solving a binary classification problem using random features and stochastic gradient descent, both of which are common and widely used in practical large-scale problems. Although there are plenty of previous works investigating the efficiency of these algorithms in terms of the convergence of the objective loss function, these results suggest that the computational gain comes at expense of the learning accuracy when dealing with general Lipschitz loss functions such as logistic loss. In this study, we analyze the properties of these algorithms in terms of the convergence not of the loss function, but the classification error under the strong low-noise condition, which reflects a realistic property of real-world datasets. We extend previous studies on SGD to a random features setting, examining a novel analysis about the error induced by the approximation of random features in terms of the distance between the generated hypothesis to show that an exponential convergence of the expected classification error is achieved even if random features approximation is applied. We demonstrate that the convergence rate does not depend on the number of features and there is a significant computational benefit in using random features in classification problems under the strong low-noise condition.

**RankDistil: Knowledge Distillation for Ranking**

Sashank Reddi · Rama Kumar Pasumarthi · Aditya Menon · Ankit Singh Rawat · Felix Yu · Seungyeon Kim · Andreas Veit · Sanjiv Kumar

Knowledge distillation is an approach to improve the performance of a student model by using the knowledge of a complex teacher. Despite its success in several deep learning applications, the study of distillation is mostly confined to classification settings. In particular, the use of distillation in top-k ranking settings, where the goal is to rank k most relevant items correctly, remains largely unexplored. In this paper, we study such ranking problems through the lens of distillation. We present a distillation framework for top-k ranking and draw connections with the existing ranking methods. The core idea of this framework is to preserve the ranking at the top by matching the order of items of student and teacher, while penalizing large scores for items ranked low by the teacher. Building on this, we develop a novel distillation approach, RankDistil, specifically catered towards ranking problems with a large number of items to rank, and establish statistical basis for the method. Finally, we conduct experiments which demonstrate that RankDistil yields benefits over commonly used baselines for ranking problems.

**Multi-Fidelity High-Order Gaussian Processes for Physical Simulation**

Zheng Wang · Wei Xing · Robert Kirby · Shandian Zhe

The key task of physical simulation is to solve partial differential equations (PDEs) on discretized domains, which is known to be costly. In particular, high-fidelity solutions are much more expensive than low-fidelity ones. To reduce the cost, we consider novel Gaussian process (GP) models that leverage simulation examples of different fidelities to predict high-dimensional PDE solution outputs. Existing GP methods are either not scalable to high-dimensional outputs or lack effective strategies to integrate multi-fidelity examples. To address these issues, we propose Multi-Fidelity High-Order Gaussian Process (MFHoGP) that can capture complex correlations both between the outputs and between the fidelities to enhance solution estimation, and scale to large numbers of outputs. Based on a novel nonlinear coregionalization model, MFHoGP propagates bases throughout fidelities to fuse information, and places a deep matrix GP prior over the basis weights to capture the (nonlinear) relationships across the fidelities. To improve inference efficiency and quality, we use bases decomposition to largely reduce the model parameters, and layer-wise matrix Gaussian posteriors to capture the posterior dependency and to simplify the computation. Our stochastic variational learning algorithm successfully handles millions of outputs without extra sparse approximations. We show the advantages of our method in several typical applications.

Communication cost and local computation complexity are two main bottlenecks of the distributed statistical learning. In this paper, we consider the distributed M-estimation problem in both regular and sparse case and propose a novel one-round communication efficient algorithm. For regular distributed M-estimator, the asymptotic normality is provided to conduct statistical inference. For sparse distributed M-estimator, we only require solving a quadratic Lasso problem in the master machine using the same local information as the regular distributed M-estimator. Consequently, the computation complexity of the local machine is sufficiently reduced compared with the existing debiased sparse estimator. Under mild conditions, the theoretical results guarantee that our proposed distributed estimators achieve (near)optimal statistical convergence rate. The effectiveness of our proposed algorithm is verified through experiments across different M-estimation problems using both synthetic and real benchmark datasets.

**Communication Efficient Primal-Dual Algorithm for Nonconvex Nonsmooth Distributed Optimization**

Congliang Chen · Jiawei Zhang · Li Shen · Peilin Zhao · Zhiquan Luo

Decentralized optimization problems frequently appear in the large scale machine learning problems. However, few works work on the difficult nonconvex nonsmooth case. In this paper, we propose a decentralized primal-dual algorithm to solve this type of problem in a decentralized manner and the proposed algorithm can achieve an $\mathcal{O}(1/\epsilon^2)$ iteration complexity to attain an $\epsilon-$solution, which is the well-known lower iteration complexity bound for nonconvex optimization. To our knowledge, it is the first algorithm achieving this rate under a nonconvex, nonsmooth decentralized setting. Furthermore, to reduce communication overhead, we also modifying our algorithm by compressing the vectors exchanged between agents. The iteration complexity of the algorithm with compression is still $\mathcal{O}(1/\epsilon^2)$. Besides, we apply the proposed algorithm to solve nonconvex linear regression problem and train deep learning model, both of which demonstrate the efficiency and efficacy of the proposed algorithm.

Reinforcement learning algorithms typically assume rewards to be sampled from light-tailed distributions, such as Gaussian or bounded. However, a wide variety of real-world systems generate rewards that follow heavy-tailed distributions. We consider such scenarios in the setting of undiscounted reinforcement learning. By constructing a lower bound, we show that the difficulty of learning heavy-tailed rewards asymptotically dominates the difficulty of learning transition probabilities. Leveraging techniques from robust mean estimation, we propose Heavy-UCRL2 and Heavy-Q-Learning, and show that they achieve near-optimal regret bounds in this setting. Our algorithms also naturally generalize to deep reinforcement learning applications; we instantiate Heavy-DQN as an example of this. We demonstrate that all of our algorithms outperform baselines on both synthetic MDPs and standard RL benchmarks.

**Regret Minimization for Causal Inference on Large Treatment Space**

Akira Tanimoto · Tomoya Sakai · Takashi Takenouchi · Hisashi Kashima

Predicting which action (treatment) will lead to a better outcome is a central task in decision support systems. To build a prediction model in real situations, learning from observational data with a sampling bias is a critical issue due to the lack of randomized controlled trial (RCT) data. To handle such biased observational data, recent efforts in causal inference and counterfactual machine learning have focused on debiased estimation of the potential outcomes on a binary action space and the difference between them, namely, the individual treatment effect. When it comes to a large action space (e.g., selecting an appropriate combination of medicines for a patient), however, the regression accuracy of the potential outcomes is no longer sufficient in practical terms to achieve a good decision-making performance. This is because a high mean accuracy on the large action space does not guarantee the nonexistence of a single potential outcome misestimation that misleads the whole decision. Our proposed loss minimizes the classification error of whether or not the action is relatively good for the individual target among all feasible actions, which further improves the decision-making performance, as we demonstrate. We also propose a network architecture and a regularizer that extracts a debiased representation not only from the individual feature but also from the biased action for better generalization in large action spaces. Extensive experiments on synthetic and semi-synthetic datasets demonstrate the superiority of our method for large combinatorial action spaces.

**Statistical Guarantees for Transformation Based Models with applications to Implicit Variational Inference**

Sean Plummer · Shuang Zhou · Anirban Bhattacharya · David Dunson · Debdeep Pati

Transformation based methods have been an attractive approach in non-parametric inference for problems such as unconditioned and conditional density estimation due to their unique hierarchical structure that models the data as flexible transformation of a set of common latent variables. More recently, transformation based models have been used in variational inference (VI) to construct flexible implicit families of variational distributions. However, their use in both non-parametric inference and variational inference lacks theoretical justification. In the context of non-linear latent variable models (NL-LVM), we provide theoretical justification for the use of these models in non-parametric inference by showing that the support of the transformation induced prior in the space of densities is sufficiently large in the $L_1$ sense and show that for this class of priors the posterior concentrates at the optimal rate up to a logarithmic factor. Adopting the flexibility demonstrated in the non-parametric setting we use the NL-LVM to construct an implicit family of variational distributions, deemed as GP-IVI. We delineate sufficient conditions under which GP-IVI achieves optimal risk bounds and approximates the true posterior in the sense of the Kullback-Leibler divergence. To the best of our knowledge, this is the first work on providing theoretical guarantees for implicit variational inference.

Modern machine learning and deep learning models are shown to be vulnerable when testing data are slightly perturbed. Theoretical studies of adversarial training algorithms mostly focus on their adversarial training losses or local convergence properties. In contrast, this paper studies the generalization performance of a generic adversarial training algorithm. Specifically, we consider linear regression models and two-layer neural networks (with lazy training) using squared loss under low-dimensional regime and high-dimensional regime. In the former regime, after overcoming the non-smoothness of adversarial training, the adversarial risk of the trained models will converge to the minimal adversarial risk. In the latter regime, we discover that data interpolation prevents the adversarial robust estimator from being consistent (i.e. converge in probability). Therefore, inspired by successes of the least absolute shrinkage and selection operator (LASSO), we incorporate the $\mathcal{L}_1$ penalty in the high dimensional adversarial learning, and show that it leads to consistent adversarial robust estimation. A series of numerical studies are conducted to demonstrate that how the smoothness and $\mathcal{L}_1$ penalization help to improve the adversarial robustness of DNN models.

Recent years have witnessed the success of adaptive (or unified) approaches in estimating symmetric properties of discrete distributions, where the learner first obtains a distribution estimator independent of the target property, and then plugs the estimator into the target property as the final estimator. Several such approaches have been proposed and proved to be adaptively optimal, i.e. they achieve the optimal sample complexity for a large class of properties within a low accuracy, especially for a large estimation error $\varepsilon\gg n^{-1/3}$ where $n$ is the sample size. In this paper, we characterize the high accuracy limitation, or the penalty for adaptation, for general adaptive approaches. Specifically, we obtain the first known adaptation lower bound that under a mild condition, any adaptive approach cannot achieve the optimal sample complexity for every $1$-Lipschitz property within accuracy $\varepsilon \ll n^{-1/3}$. In particular, this result disproves a conjecture in [Acharya et al. 2017] that the profile maximum likelihood (PML) plug-in approach is optimal in property estimation for all ranges of $\varepsilon$, and confirms a conjecture in [Han and Shiragur 2020] that their competitive analysis of the PML is tight.

Non-stationary bandits and clustered bandits lift the restrictive assumptions in contextual bandits and provide solutions to many important real-world scenarios. Though they have been studied independently so far, we point out the essence in solving these two problems overlaps considerably. In this work, we connect these two strands of bandit research under the notion of test of homogeneity, which seamlessly addresses change detection for non-stationary bandit and cluster identification for clustered bandit in a unified solution framework. Rigorous regret analysis and extensive empirical evaluations demonstrate the value of our proposed solution, especially its flexibility in handling various environment assumptions, e.g., a clustered non-stationary environment.

**vqSGD: Vector Quantized Stochastic Gradient Descent**

Venkata Gandikota · Daniel Kane · Raj Kumar Maity · Arya Mazumdar

In this work, we present a family of vector quantization schemes vqSGD (Vector-Quantized Stochastic Gradient Descent) that provide an asymptotic reduction in the communication cost with convergence guarantees in first-order distributed optimization. In the process we derive the following fundamental information theoretic fact: $\Theta(\frac{d}{R^2})$ bits are necessary and sufficient (up to an additive $O(\log d)$ term) to describe an unbiased estimator $\hat{g}(g)$ for any $g$ in the $d$-dimensional unit sphere, under the constraint that $\|\hat{g}(g)\|_2\le R$ almost surely. In particular, we consider a randomized scheme based on the convex hull of a point set, that returns an unbiased estimator of a $d$-dimensional gradient vector with almost surely bounded norm. We provide multiple efficient instances of our scheme, that are near optimal, and require only $o(d)$ bits of communication at the expense of tolerable increase in error. The instances of our quantization scheme are obtained using the properties of binary error-correcting codes and provide a smooth tradeoff between the communication and the estimation error of quantization. Furthermore, we show that vqSGD also offers some automatic privacy guarantees.

**Diagnostic Uncertainty Calibration: Towards Reliable Machine Predictions in Medical Domain**

Takahiro Mimori · Keiko Sasada · Hirotaka Matsui · Issei Sato

We propose an evaluation framework for class probability estimates (CPEs) in the presence of label uncertainty, which is commonly observed as diagnosis disagreement between experts in the medical domain. We also formalize evaluation metrics for higher-order statistics, including inter-rater disagreement, to assess predictions on label uncertainty. Moreover, we propose a novel post-hoc method called alpha-calibration, that equips neural network classifiers with calibrated distributions over CPEs. Using synthetic experiments and a large-scale medical imaging application, we show that our approach significantly enhances the reliability of uncertainty estimates: disagreement probabilities and posterior CPEs.

**Tractable contextual bandits beyond realizability**

Sanath Kumar Krishnamurthy · Vitor Hadad · Susan Athey

Tractable contextual bandit algorithms often rely on the realizability assumption – i.e., that the true expected reward model belongs to a known class, such as linear functions. In this work, we present a tractable bandit algorithm that is not sensitive to the realizability assumption and computationally reduces to solving a constrained regression problem in every epoch. When realizability does not hold, our algorithm ensures the same guarantees on regret achieved by realizability-based algorithms under realizability, up to an additive term that accounts for the misspecification error. This extra term is proportional to T times a function of the mean squared error between the best model in the class and the true model, where T is the total number of time-steps. Our work sheds light on the bias-variance trade-off for tractable contextual bandits. This trade-off is not captured by algorithms that assume realizability, since under this assumption there exists an estimator in the class that attains zero bias.

Smooth game optimization has recently attracted great interest in machine learning as it generalizes the single-objective optimization paradigm. However, game dynamics is more complex due to the interaction between different players and is therefore fundamentally different from minimization, posing new challenges for algorithm design. Notably, it has been shown that negative momentum is preferred due to its ability to reduce oscillation in game dynamics. Nevertheless, existing analysis of negative momentum was restricted to simple bilinear games. In this paper, we extend the analysis to smooth and strongly-convex strongly-concave minimax games by taking the variational inequality formulation. By connecting Polyak’s momentum with Chebyshev polynomials, we show that negative momentum accelerates convergence of game dynamics locally, though with a suboptimal rate. To the best of our knowledge, this is the \emph{first work} that provides an explicit convergence rate for negative momentum in this setting.

**Semi-Supervised Aggregation of Dependent Weak Supervision Sources With Performance Guarantees**

Alessio Mazzetto · Dylan Sam · Andrew Park · Eli Upfal · Stephen Bach

We develop the first theoretical guarantees for learning from weak labelers without the (mostly unrealistic) assumption that the errors of the weak labelers are independent or come from a particular family of distributions. We show a rigorous technique for efficiently selecting small subsets of the labelers so that a majority vote from such subsets has a provably low error rate. We explore several extensions of this method and provide experimental results over a range of labeled data set sizes on 45 image classification tasks. Our performance-guaranteed methods consistently match the best performing alternative, which varies based on problem difficulty. On tasks with accurate weak labelers, our methods are on average 3 percentage points more accurate than the state-of-the-art adversarial method. On tasks with inaccurate weak labelers, our methods are on average 15 percentage points more accurate than the semi-supervised Dawid-Skene model (which assumes independence).

We study the problem of corralling stochastic bandit algorithms, that is combining multiple bandit algorithms designed for a stochastic environment, with the goal of devising a corralling algorithm that performs almost as well as the best base algorithm. We give two general algorithms for this setting, which we show benefit from favorable regret guarantees. We show that the regret of the corralling algorithms is no worse than that of the best algorithm containing the arm with the highest reward, and depends on the gap between the highest reward and other rewards.

Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for largescale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient descent method for distributed machine learning, which can be viewed as the communicationadaptive counterpart of the celebrated Adam method — justifying its name CADA. The key components of CADA are a set of new rules tailored for adaptive stochastic gradients that can be implemented to save communication upload. The new algorithms adaptively reuse the stale Adam gradients, thus saving communication, and still have convergence rates comparable to original Adam. In numerical experiments, CADA achieves impressive empirical performance in terms of total communication round reduction.

Real-world data is laden with outlying values. The challenge for machine learning is that the learner typically has no prior knowledge of whether the feedback it receives (losses, gradients, etc.) will be heavy-tailed or not. In this work, we study a simple, cost-efficient algorithmic strategy that can be leveraged when both losses and gradients can be heavy-tailed. The core technique introduces a simple robust validation sub-routine, which is used to boost the confidence of inexpensive gradient-based sub-processes. Compared with recent robust gradient descent methods from the literature, dimension dependence (both risk bounds and cost) is substantially improved, without relying upon strong convexity or expensive per-step robustification. We also empirically show that the proposed procedure cannot simply be replaced with naive cross-validation.

**Causal Modeling with Stochastic Confounders**

Thanh Vinh Vo · Pengfei Wei · Wicher Bergsma · Tze Yun Leong

This work extends causal inference in temporal models with stochastic confounders. We propose a new approach to variational estimation of causal inference based on a representer theorem with a random input space. We estimate causal effects involving latent confounders that may be interdependent and time-varying from sequential, repeated measurements in an observational study. Our approach extends current work that assumes independent, non-temporal latent confounders with potentially biased estimators. We introduce a simple yet elegant algorithm without parametric specification on model components. Our method avoids the need for expensive and careful parameterization in deploying complex models, such as deep neural networks in existing approaches, for causal inference and analysis. We demonstrate the effectiveness of our approach on various benchmark temporal datasets.

**Dual Principal Component Pursuit for Learning a Union of Hyperplanes: Theory and Algorithms**

Tianyu Ding · Zhihui Zhu · Manolis Tsakiris · Rene Vidal · Daniel Robinson

State-of-the-art subspace clustering methods are based on convex formulations whose theoretical guarantees require the subspaces to be low-dimensional. Dual Principal Component Pursuit (DPCP) is a non-convex method that is specifically designed for learning high-dimensional subspaces, such as hyperplanes. However, existing analyses of DPCP in the multi-hyperplane case lack a precise characterization of the distribution of the data and involve quantities that are difficult to interpret. Moreover, the provable algorithm based on recursive linear programming is not efficient. In this paper, we introduce a new notion of geometric dominance, which explicitly captures the distribution of the data, and derive both geometric and probabilistic conditions under which a global solution to DPCP is a normal vector to a geometrically dominant hyperplane. We then prove that the DPCP problem for a union of hyperplanes satisfies a Riemannian regularity condition, and use this result to show that a scalable Riemannian subgradient method exhibits (local) linear convergence to the normal vector of the geometrically dominant hyperplane. Finally, we show that integrating DPCP into popular subspace clustering schemes, such as K-ensembles, leads to superior or competitive performance over the state-of-the-art in clustering hyperplanes.

**Good Classifiers are Abundant in the Interpolating Regime**

Ryan Theisen · Jason Klusowski · Michael Mahoney

Within the machine learning community, the widely-used uniform convergence framework has been used to answer the question of how complex, over-parameterized models can generalize well to new data. This approach bounds the test error of the \emph{worst-case} model one could have fit to the data, but it has fundamental limitations. Inspired by the statistical mechanics approach to learning, we formally define and develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers from several model classes. We apply our method to compute this distribution for several real and synthetic datasets, with both linear and random feature classification models. We find that test errors tend to concentrate around a small \emph{typical} value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model on the same datasets, indicating that ``bad'' classifiers are extremely rare. We provide theoretical results in a simple setting in which we characterize the full asymptotic distribution of test errors, and we show that these indeed concentrate around a value $\varepsilon^*$, which we also identify exactly. We then formalize a more general conjecture supported by our empirical findings. Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice, and that approaches based on the statistical mechanics of learning may offer a promising~alternative.

**Dominate or Delete: Decentralized Competing Bandits in Serial Dictatorship**

Abishek Sankararaman · Soumya Basu · Karthik Abinav Sankararaman

Online learning in a two-sided matching market, with demand side agents continuously competing to be matched with supply side (arms), abstracts the complex interactions under partial information on matching platforms (e.g. UpWork, TaskRabbit). We study the decentralized serial dictatorship setting, a two-sided matching market where the demand side agents have unknown and heterogeneous valuation over the supply side (arms), while the arms have known uniform preference over the demand side (agents). We design the first decentralized algorithm - UCB with Decentralized Dominant-arm Deletion (UCB-D3), for the agents, that does not require any knowledge of reward gaps or time horizon. UCB-D3 works in phases, where in each phase, agents delete dominated arms -- the arms preferred by higher ranked agents, and play only from the non-dominated arms according to the UCB. At the end of the phase, agents broadcast in a decentralized fashion, their estimated preferred arms through pure exploitation. We prove a new regret lower bound for the decentralized serial dictatorship model, and prove that UCB-D3 achieves order optimal regret guarantee.

We introduce graph gamma process (GGP) linear dynamical systems to model real-valued multivariate time series. GGP generates $S$ latent states that are shared by $K$ different communities, each of which is characterized by its own pattern of activation probabilities imposed on a $S\times S$ directed sparse graph, and allow both $S$ and $K$ to grow without bound. For temporal pattern discovery, the latent representation under the model is used to decompose the time series into a parsimonious set of multivariate sub-sequences generated by formed communities. In each sub-sequence, different data dimensions often share similar temporal patterns but may exhibit distinct magnitudes, and hence allowing the superposition of all sub-sequences to exhibit diverse behaviors at different data dimensions. On both synthetic and real-world time series, the proposed nonparametric Bayesian dynamic models, which are initialized at random, consistently exhibit good predictive performance in comparison to a variety of baseline models, revealing interpretable latent state transition patterns and decomposing the time series into distinctly behaved sub-sequences.

**Bandit algorithms: Letting go of logarithmic regret for statistical robustness**

Kumar Ashutosh · Jayakrishnan Nair · Anmol Kagrecha · Krishna Jagannathan

We study regret minimization in a stochastic multi-armed bandit setting, and establish a fundamental trade-off between the regret suffered under an algorithm, and its statistical robustness. Considering broad classes of underlying arms' distributions, we show that bandit learning algorithms with logarithmic regret are always inconsistent and that consistent learning algorithms always suffer a super-logarithmic regret. This result highlights the inevitable statistical fragility of all `logarithmic regret' bandit algorithms available in the literature - for instance, if a UCB algorithm designed for 1-subGaussian distributions is used in a subGaussian setting with a mismatched variance parameter, the learning performance could be inconsistent. Next, we show a positive result: statistically robust and consistent learning performance is attainable if we allow the regret to be slightly worse than logarithmic. Specifically, we propose three classes of distribution oblivious algorithms that achieve an asymptotic regret that is arbitrarily close to logarithmic.

**Homeomorphic-Invariance of EM: Non-Asymptotic Convergence in KL Divergence for Exponential Families via Mirror Descent**

Frederik Kunstner · Raunak Kumar · Mark Schmidt

Expectation maximization (EM) is the default algorithm for fitting probabilistic models with missing or latent variables, yet we lack a full understanding of its non-asymptotic convergence properties. Previous works show results along the lines of "EM converges at least as fast as gradient descent" by assuming the conditions for the convergence of gradient descent apply to EM. This approach is not only loose, in that it does not capture that EM can make more progress than a gradient step, but the assumptions fail to hold for textbook examples of EM like Gaussian mixtures. In this work we first show that for the common setting of exponential family distributions, viewing EM as a mirror descent algorithm leads to convergence rates in Kullback-Leibler (KL) divergence. Then, we show how the KL divergence is related to first-order stationarity via Bregman divergences. In contrast to previous works, the analysis is invariant to the choice of parametrization and holds with minimal assumptions. We also show applications of these ideas to local linear (and superlinear) convergence rates, generalized EM, and non-exponential family distributions.

**Tracking Regret Bounds for Online Submodular Optimization**

Tatsuya Matsuoka · Shinji Ito · Naoto Ohsaka

In this paper, we propose algorithms for online submodular optimization with tracking regret bounds. Online submodular optimization is a generic framework for sequential decision making used to select subsets. Existing algorithms for online submodular optimization have been shown to achieve small (static) regret, which means that the algorithm's performance is comparable to the performance of a fixed optimal action. Such algorithms, however, may perform poorly in an environment that changes over time. To overcome this problem, we apply a tracking-regret-analysis framework to online submodular optimization, one by which output is assessed through comparison with time-varying optimal subsets. We propose algorithms for submodular minimization, monotone submodular maximization under a size constraint, and unconstrained submodular maximization, and we show tracking regret bounds. In addition, we show that our tracking regret bound for submodular minimization is nearly tight.

**Causal Inference with Selectively Deconfounded Data**

Jingyi Gan · Andrew Li · Zachary Lipton · Sridhar Tayur

Given only data generated by a standard confounding graph with unobserved confounder, the Average Treatment Effect (ATE) is not identifiable. To estimate the ATE, a practitioner must then either (a) collect deconfounded data; (b) run a clinical trial; or (c) elucidate further properties of the causal graph that might render the ATE identifiable. In this paper, we consider the benefit of incorporating a large \emph{confounded} observational dataset (\emph{confounder unobserved}) alongside a small \emph{deconfounded} observational dataset (\emph{confounder revealed}) when estimating the ATE. Our theoretical results {\red suggest} that the inclusion of confounded data can significantly reduce the quantity of deconfounded data required to estimate the ATE to within a desired accuracy level. Moreover, in some cases---say, genetics---we could imagine retrospectively selecting samples to deconfound. We demonstrate that by actively selecting these samples based upon the (already observed) treatment and outcome, we can reduce sample complexity further. Our theoretical and empirical results establish that the worst-case relative performance of our approach (vs. a natural benchmark) is bounded while our best-case gains are unbounded. Finally, we demonstrate the benefits of selective deconfounding using a large real-world dataset related to genetic mutation in cancer.

**Understanding Gradient Clipping In Incremental Gradient Methods**

Jiang Qian · Yuren Wu · Bojin Zhuang · Shaojun Wang · Jing Xiao

We provide a theoretical analysis on how gradient clipping affects the convergence of the incremental gradient methods on minimizing an objective function that is the sum of a large number of component functions. We show that clipping on gradients of component functions leads to bias on the descent direction, which is affected by the clipping threshold, the norms of gradients of component functions, together with the angles between gradients of component functions and the full gradient. We then propose some sufficient conditions under which the increment gradient methods with gradient clipping can be shown to be convergent under the more general relaxed smoothness assumption. We also empirically observe that the angles between gradients of component functions and the full gradient generally decrease as the batchsize increases, which may help to explain why larger batchsizes generally lead to faster convergence in training deep neural networks with gradient clipping.

**Non-Volume Preserving Hamiltonian Monte Carlo and No-U-TurnSamplers**

Hadi Mohasel Afshar · Rafael Oliveira · Sally Cripps

Volume preservation is usually regarded as a necessary property for the leapfrog transition functions that are used in Hamiltonian Monte Carlo (HMC) and No-U-Turn (NUTS) samplers to guarantee convergence to the target distribution. In this work we rigorously prove that with minimal algorithmic modifications, both HMC and NUTS can be combined with transition functions that are not necessarily volume preserving. In light of these results, we propose a non-volume preserving transition function that conserves the Hamiltonian better than the baseline leapfrog mechanism, on piecewise-continuous distributions. The resulting samplers do not require any assumptions on the geometry of the discontinuity boundaries, and our experimental results show a significant improvement upon traditional HMC and NUTS.

Couplings play a central role in the analysis of Markov chain Monte Carlo algorithms and appear increasingly often in the algorithms themselves, e.g. in convergence diagnostics, parallelization, and variance reduction techniques. Existing couplings of the Metropolis-Hastings algorithm handle the proposal and acceptance steps separately and fall short of the upper bound on one-step meeting probabilities given by the coupling inequality. This paper introduces maximal couplings which achieve this bound while retaining the practical advantages of current methods. We consider the properties of these couplings and examine their behavior on a selection of numerical examples.