Elevating our Evaluations: Technical and Sociotechnical Standards of Assessment in Machine Learning
Evaluation in machine learning does not always get the attention it deserves. In this talk, I hope to focus our attention on questions of systematic evaluation in machine learning and on the changes we should continue to make as we elevate the standard of evaluation across our field. The breadth of application areas we collaborate on in machine learning calls for a variety of evaluation approaches, and we'll explore this variety by considering applications in generative models, social good, healthcare, and environmental science. Grounded in these applications, we will expand the conceptual aperture through which we think about machine learning evaluations, starting from purely technical evaluations (thinking about likelihoods), moving to mixed methods (with proper scoring rules and expert assessments), and then to sociotechnical assessments (considering fairness, impacts, and participation). My core message is that broad and expansive evaluation remains fundamental, and it is an area in which I hope we will, together as a community, make even greater investments.
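As a concrete illustration of the "mixed methods" stage mentioned above, here is a minimal sketch of two proper scoring rules for binary probabilistic forecasts, the Brier score and the logarithmic score. This is a generic illustration of the concept, not material from the talk itself; the synthetic forecaster names (`calibrated`, `overconfident`) are assumptions for the example.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and binary outcomes.

    The Brier score is a strictly proper scoring rule: in expectation it is
    minimised only by reporting the true outcome probabilities.
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def log_score(probs, outcomes):
    """Negative mean log-likelihood of the outcomes (also strictly proper)."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    p = np.where(outcomes == 1, probs, 1.0 - probs)
    return float(-np.mean(np.log(p)))

# Synthetic check: a calibrated forecaster should score better (lower)
# on average than one that exaggerates its probabilities.
rng = np.random.default_rng(0)
true_p = rng.uniform(0.1, 0.9, size=10_000)
y = rng.binomial(1, true_p)
calibrated = brier_score(true_p, y)
overconfident = brier_score(np.clip(true_p * 1.5, 0.0, 1.0), y)
```

Because both rules are strictly proper, honest reporting of beliefs is the optimal strategy, which is what makes them useful as technical evaluation targets for probabilistic models.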
Deep Gaussian Processes
In this paper we introduce deep Gaussian process (GP) models. Deep GPs are a deep belief network based on Gaussian process mappings. The data is modeled as the output of a multivariate GP. The inputs to that Gaussian process are then governed by another GP. A single layer model is equivalent to a standard GP or the GP latent variable model (GP-LVM). We perform inference in the model by approximate variational marginalization. This results in a strict lower bound on the marginal likelihood of the model which we use for model selection (number of layers and nodes per layer). Deep belief networks are typically applied to relatively large data sets using stochastic gradient descent for optimization. Our fully Bayesian treatment allows for the application of deep models even when data is scarce. Model selection by our variational bound shows that a five layer hierarchy is justified even when modelling a digit data set containing only 150 examples.
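The generative structure described above, each layer's outputs serving as the inputs to the next GP, can be sketched by sampling from a two-layer deep GP prior. This is a minimal illustration of the model composition only, assuming a squared-exponential kernel; it does not implement the paper's variational inference scheme, and the layer sizes are arbitrary choices for the example.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between rows of A and rows of B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def sample_gp_layer(X, out_dim, rng, jitter=1e-6):
    """Draw out_dim independent GP function values at the inputs X."""
    K = rbf_kernel(X, X) + jitter * np.eye(len(X))  # jitter for stability
    L = np.linalg.cholesky(K)
    return L @ rng.standard_normal((len(X), out_dim))

rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 50)[:, None]       # top-layer inputs
H = sample_gp_layer(X, out_dim=2, rng=rng)    # hidden layer: a GP mapping of X
Y = sample_gp_layer(H, out_dim=1, rng=rng)    # observed layer: a GP mapping of H
```

Composing the two GP mappings in this way produces a prior over functions that is richer than a single GP; the modelling and inference challenge the paper addresses is that the intermediate layer `H` is latent and must be marginalized, which the authors do variationally.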