Recent progress in machine learning provides us with many potentially effective tools to learn from datasets of ever-increasing size and make useful predictions. How do we know that these tools can be trusted in critical and high-sensitivity systems? If a learning algorithm predicts the GPA of a prospective college applicant, what guarantees do we have concerning the accuracy of this prediction? How do we know that it is not biased against certain groups of applicants? This talk introduces statistical ideas to ensure that the learned models satisfy some crucial properties, especially reliability and fairness (in the sense that the models need to apply to individuals in an equitable manner). To achieve these important objectives, we shall not “open up the black box” and try to understand its underpinnings. Rather, we discuss broad methodologies that can be wrapped around any black box to produce results that can be trusted and are equitable. We also show how our ideas can inform causal inference; for instance, we will address counterfactual prediction problems, i.e., predicting what the outcome of a treatment would have been given that the patient was actually not treated.
"A.I. is like nuclear energy -- both promising and dangerous" -- Bill Gates, 2019.
Data Science is a pillar of A.I. and has driven most of the recent cutting-edge discoveries in biomedical research. In practice, Data Science has a life cycle (DSLC) that includes problem formulation, data collection, data cleaning, modeling, result interpretation and the drawing of conclusions. Human judgment calls are ubiquitous at every step of this process, e.g., in choosing data cleaning methods, predictive algorithms and data perturbations. Such judgment calls are often responsible for the "dangers" of A.I. To mitigate these dangers as far as possible, we developed a framework based on three core principles: Predictability, Computability and Stability (PCS). Through a workflow and documentation (in R Markdown or Jupyter Notebook) that allows one to manage the whole DSLC, the PCS framework unifies, streamlines and expands on the best practices of machine learning and statistics, bringing us a step closer to veridical Data Science. We will illustrate the PCS framework in the modeling stage through the development of DeepTune images for characterization of neurons in the difficult V4 area of the visual cortex.
The physical sciences are replete with high-fidelity simulators: computational manifestations of causal, mechanistic models. Ironically, while these simulators provide our highest-fidelity physical models, they are not well suited for inferring properties of the model from data. I will formulate the emerging area of simulation-based inference and describe how machine learning and probabilistic programming techniques are being brought to bear on these challenging problems. Finally, I will provide examples of how these techniques can impact particle physics at the Large Hadron Collider, astrophysics, neuroscience, and public health.