Replicable Machine Learning: Theory and Algorithms for Stochastic Convex and Non-Convex Optimization
Raman Arora ⋅ Kaibo Zhang
Abstract
Replicable algorithms produce identical outputs with high probability when run on independent samples drawn from the same distribution, providing strong reproducibility guarantees for machine learning pipelines. We study replicability in machine learning in Vapnik's general learning setting, which encompasses stochastic optimization over convex and non-convex loss classes, establishing algorithms with near-optimal sample complexity across these settings. For general Lipschitz losses over a bounded parameter space, we show that the exponential mechanism combined with correlated sampling achieves optimal $O(1/\sqrt{n})$ excess risk with $\rho$-replicability guarantees, but at the cost of exponential runtime. For general Lipschitz losses, the exponential mechanism with correlated sampling achieves optimal $O(1/\sqrt{n})$ excess risk and $\rho$-replicability, but with exponential runtime. For strongly convex losses over a $d$-dimensional parameter space, empirical risk minimization (ERM) paired with randomized rounding achieves $\widetilde{O}(\sqrt{d}/(\rho\sqrt{n}))$ excess risk in polynomial time. For general convex losses, regularized ERM yields excess risk of $\widetilde{O}(n^{-1/4})$. We further extend our techniques to overparameterized neural networks in the Neural Tangent Kernel (NTK) regime. Taken together, our results provide evidence for a fundamental computational-statistical tradeoff in replicable learning, whereby optimal replicability requires exponential time while our polynomial-time algorithms incur a modest but provable statistical penalty.
Successful Page Load