Moderators : James McInerney · Masashi Sugiyama
Yongchan Kwon · James Zou
Data Shapley has recently been proposed as a principled framework to quantify the contribution of individual datum in machine learning. It can effectively identify helpful or harmful data points for a learning algorithm. In this paper, we propose Beta Shapley, which is a substantial generalization of Data Shapley. Beta Shapley arises naturally by relaxing the efficiency axiom of the Shapley value, which is not critical for machine learning settings. Beta Shapley unifies several popular data valuation methods and includes data Shapley as a special case. Moreover, we prove that Beta Shapley has several desirable statistical properties and propose efficient algorithms to estimate it. We demonstrate that Beta Shapley outperforms state-of-the-art data valuation methods on several downstream ML tasks such as: 1) detecting mislabeled training data; 2) learning with subsamples; and 3) identifying points whose addition or removal have the largest positive or negative impact on the model.
Anant Raj · Pooria Joulani · Andras Gyorgy · Csaba Szepesvari
Gradient temporal difference (GTD) algorithms are provably convergent policy evaluation methods for off-policy reinforcement learning. Despite much progress, proper tuning of the stochastic approximation methods used to solve the resulting saddle point optimization problem requires the knowledge of several (unknown) problem-dependent parameters. In this paper we apply adaptive step-size tuning strategies to greatly reduce this dependence on prior knowledge, and provide algorithms with adaptive convergence guarantees. In addition, we use the underlying refined analysis technique to obtain new O(1/T) rates that do not depend on the strong-convexity parameter of the problem, and also apply to the Markov noise setting, as well as the unbounded i.i.d. noise setting.
Sharan Vaswani · Olivier Bachem · Simone Totaro · Robert Müller · Shivam Garg · Matthieu Geist · Marlos C. Machado · Pablo Samuel Castro · Nicolas Le Roux
Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO, or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple reinforcement learning problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.
Jan Macdonald · Stephan Wäldchen
We give a complete characterisation of families of probability distributions that are invariant under the action of ReLU neural network layers (in the same way that the family of Gaussian distributions is invariant to affine linear transformations). The need for such families arises during the training of Bayesian networks or the analysis of trained neural networks, e.g., in the context of uncertainty quantification (UQ) or explainable artificial intelligence (XAI).We prove that no invariant parametrised family of distributions can exist unless at least one of the following three restrictions holds: First, the network layers have a width of one, which is unreasonable for practical neural networks. Second, the probability measures in the family have finite support, which basically amounts to sampling distributions. Third, the parametrisation of the family is not locally Lipschitz continuous, which excludes all computationally feasible families.Finally, we show that these restrictions are individually necessary. For each of the three cases we can construct an invariant family exploiting exactly one of the restrictions but not the other two.