Accelerated Learning on Large-Scale Screens using Generative Library Models
Eli Weinstein ⋅ Andrei Slabodkin ⋅ Mattia Gollub ⋅ Elizabeth Wood
Abstract
Biological machine learning is often bottlenecked by a lack of scaled data. One promising route to relieving data bottlenecks is through high-throughput screens, which can experimentally test the activity of $10^6-10^{12}$ protein sequences in parallel. In this article, we introduce algorithms to optimize high throughput screens for data creation and model training. We focus on the large-scale regime, where dataset sizes are limited by the cost of measurement and sequencing. We show that when active sequences are rare, we maximize information gain if we \textit{only} collect positive examples of active sequences, i.e. $x$ with $y>0$. We can correct for the missing negative examples using a generative model of the library, producing a consistent and efficient estimate of the true $p(y\mid x)$. We demonstrate this approach in simulation and on a large-scale screen of antibodies. Overall, co-design of experiments and inference lets us accelerate learning dramatically.
Successful Page Load