Room: Conference Suite


8:00- 9:00 Arrivals

9:00- 9:10 Welcome Magnus Rattray, University of Manchester Data Science Institute

9:10- 9:45 Beyond Urban Analytics: Stochastic Differential Equations provide Insights to Retail Development in Cities Mark Girolami, Imperial College London and The Alan Turing Institute

9:55- 10:30 Overcoming Catastrophic Forgetting in Neural Nets Raia Hadsell, DeepMind

10:40 Coffee

11:00- 11:10 Bayesian Boolean Matrix Factorisation Tammo Rukat, University of Oxford

11:15- 11.25 A Bayesian Model for Drug Response Estimation and Biomarker Testing using Gaussian Processes Frank Dondelinger, Lancaster University

11:30- 12:05 Automating Machine Learning Zoubin Ghahramani, University of Cambridge and Uber AI

12:15 Lunch and posters

13:30- 14:05 Computational Privacy: The privacy bounds of human behavior Yves-Alexandre de Montjoye, Imperial College London

14:15- 14:50 Secure Multi-Party Linear Regression on High-Dimensional Data Borja de Balle Pigem, Amazon Research

15:00- 15:10 Confidentiality and Differential Privacy in the Dissemination of Frequency Tables Natalie Shlomo, Univerity of Manchester

15:15- 15:25 Differentially Private Gaussian Processes Michael Smith, University of Sheffield

15:30 Coffee

16:00- 16:50 The Security and Privacy Challenges Raised by Precision Medicine Jean-Pierre Hubaux, EPFL

17.00- 17:35 Ethical Personal Data: Do Humans and Data Mix? StJohn Deakins, CitizenMe

17:45 Posters and drinks

19:30 Dinner

Abstracts


Beyond Urban Analytics: Stochastic Differential Equations provide Insights to Retail Development in Cities

Mark Girolami

The era of Big Data opens up opportunities to observe, study, manage and optimize the evolution of complex systems. One example of practical importance is urban structure which has given rise to urban analytics that is reliant on observational data. However the dynamics of economic and social systems have long been described by systems of non-linear first-order differential equations. Whilst this mathematical formalism provides a reasonable model of urban structure, it is somewhat restrictive owing to uncertainties arising in the modelling process and the lack of assimilation of observational data. Such shortcomings can be addressed to some degree by developing a statistical representation of urban flow structure, based on systems of stochastic differential equations. Taking urban retail development as a use case a model which is ergodic and whose invariant measure encodes our knowledge of spatio-temporal interactions is considered. Taking urban data for retail development in the city of London we proceed by performing inference and prediction in a Bayesian setting, and explore the resulting probability distributions over retail evolution with a position-specific metropolis-adjusted Langevin algorithm and Pre-Conditioned Crank-Nicholson MCMC. Insights into possible futures based on policy based interventions are discussed. This is joint work with Louis Ellam and Sir Alan Wilson.



Overcoming Catastrophic Forgetting in Neural Nets

Raia Hadsell

Catastrophic forgetting is a long-acknowledged challenge for neural networks. This phenomenon has become increasingly relevant as we move beyond static datasets towards AGI systems that learn online, from sequential experience. I will describe new learning methods that enable stable, data-efficient end-to-end learning while mitigating forgetting, and show how these methods can be used to train agents to navigate complex simulated environments.



Bayesian Boolean Matrix Factorisation

Tammo Rukat

Boolean matrix factorisation aims to decompose a binary data matrix into an approximate Boolean product of two low rank, binary matrices: one containing meaningful patterns, the other quantifying how the observations can be expressed as a combination of these patterns. We introduce the OrMachine, a probabilistic generative model for Boolean matrix factorisation and derive a Metropolised Gibbs sampler that facilitates efficient parallel posterior inference. On real world and simulated data, our method outperforms all currently existing approaches for Boolean matrix factorisation and completion. This is the first method to provide full posterior inference for Boolean Matrix factorisation which is relevant in applications, e.g. for controlling false positive rates in collaborative filtering and, crucially, improves the interpretability of the inferred patterns. The proposed algorithm scales to large datasets as we demonstrate by analysing single cell gene expression data in 1.3 million mouse brain cells across 11 thousand genes on commodity hardware.



A Bayesian Model for Drug Response Estimation and Biomarker Testing using Gaussian Processes

Frank Dondelinger

Large-scale drug response assays on cancer cell lines have in recent years produced a wealth of data that benefits cancer treatment, drug development and biomarker discovery (see e.g. Barretina et al. 2012, Garnett et al. 2012). The raw data from these experiments usually takes the form of a dose-response curve. The default approach for analysing this data is to fit a sigmoidal curve to the measurements, and extract summary values that reflect the drug response, such as the IC50 and area under the curve (AUC). In this work, we present a novel Bayesian approach for modelling dose-response curves using Gaussian processes (GPs). Our model has several advantages over the sigmoid fit: 1) it is non-parametric and allows us to fit a variety of responses; 2) it allows for a hierarchical Bayesian setup with information sharing across different curves; and 3) we automatically obtain a measure of the uncertainty of our curve fits from the variance of the Gaussian process. We extend the model with a Bayesian biomarker testing framework that allows us to test for a difference in the proportion of responsive curves in mutated versus wild type cell lines. We test the model on cell line drug response data from the Cancer Genome Project (Garnett et al. 2012, Iorio et al. 2016). The data consists of 256 anti-cancer drugs assayed on ~1,000 cancer cell lines. While our curve fits are in agreement with sigmoidal fits on the majority of cell lines when comparing summary measures, we demonstrate that the Gaussian process model shows greater robustness to outliers and to unusual response patterns. The Bayesian testing model successfully identifies known biomarkers, and is able to leverage information about the complete dose-response curve, rather than relying on summary measures.



Automating Machine Learning

Zoubin Ghahramani

Machine learning is the field concerned with designing algorithms that allow computers to learn from data. Ironically, machine learning systems are currently hand-built by experts in a slow, laborious and error-prone manner. I will describe three directions of research which aim to truly automate machine learning: the Automatic Statistician; Turing: a new Probabilistic Programming language based on Julia; and the rational allocation of computational resources.



Computational Privacy: The privacy bounds of human behavior

Yves-Alexandre de Montjoye

We're living in an age of big data, a time when most of our movements and actions are collected and stored in real time. Large-scale mobile phone, credit card, or browsing datasets dramatically increase our capacity to measure, understand, and potentially affect the behavior of individuals and collectives. The use of this data, however, raise legitimate privacy concerns. In this talk, I will first show how the mere absence of obvious identifiers such as name or phone number is often not enough to prevent re-identification. I will then discuss how, as the use of this data progress, it will become increasingly important to consider whether sensitive information can be inferred from apparently innocuous data. Finally, I will discuss the impact of metadata on society and some of solutions we are developing to allow behavioral metadata to be used in a privacy-conscientious way.



Secure Multi-Party Linear Regression on High-Dimensional Data

Borja de Balle Pigem

The goal of secure multi-pary computation (MPC) is to facilitate the evaluation of functionalities that depend on the private inputs of several distrusting parties in a privacy preserving manner. I will start my talk by discussing potential applications of secure MPC to machine learning and the relation between MPC and other well-known privacy frameworks like differential privacy. Then I will present our recent work on secure MPC protocols for linear regression on distributed databases. By combining several tools from the MPC literature we obtain scalable solutions that can solve problems with millions of records and hundreds of features in a matter of minutes. Some crucial implementation details will be discussed, including the role of fixed-point arithmetic and a robust conjugate gradient descent solver for private linear systems. An implementation of our protocols based on the Obliv-C framework is available as open source.



Confidentiality and Differential Privacy in the Dissemination of Frequency Tables

Natalie Shlomo

For decades, national statistical agencies and other data custodians have been publishing frequency tables based on census, survey and administrative data. In order to protect the confidentiality of individuals represented in the data, tables based on original data are modified before release. Recently, in response to user demand for more flexible and responsive table publication services, frequency table publication schemes have been augmented with on-line table generating servers such as the US Census Bureau FactFinder and the Australian Bureau of Statistics TableBuilder. These systems allow users to build their own custom tables, and make use of automated perturbation routines to protect confidentiality. Motivated by the growing popularity of table generating servers, in this paper we study confidentiality protection for perturbed frequency tables, including the trade-off with analytical utility. Confidentiality protection is assessed in terms of the differential privacy standard, and this paper can be used as a practical introduction to differential privacy, to calculations related to its application, and to the relationship between confidentiality protection and utility. This work for all authors was partially supported by the Simons Foundation. All authors thank the Isaac Newton Institute for Mathematical Sciences, University of Cambridge, for its support and hospitality during the programme Data Linkage and Anonymisation which was supported by EPSRC grant EP/K032208/1.



Differentially Private Gaussian Processes

Michael Smith

A major challenge for machine learning is increasing the availability of data while respecting the privacy of individuals. Differential privacy is a framework which allows algorithms to have provable privacy guarantees. Gaussian processes are a widely used approach for dealing with uncertainty in functions. We propose a method using Gaussian Processes to provide Differentially Private predictions by bounding the posterior mean function's sensitivity. We then improve this method by considering a subset of test points and crafting the DP noise covariance structure to efficiently protect the training data, while minimising the scale of the added noise. We find that, for the datasets used, this perturbed Gaussian Process method achieves the greatest accuracy, while still providing privacy guarantees, and offers practical Differential Privacy for regression over multiple dimensions. Together these methods provide a starter toolkit for combining differential privacy and Gaussian processes.



The Security and Privacy Challenges Raised by Precision Medicine

Jean-Pierre Hubaux

Precision medicine is around the corner, fueled notably by the immense progress achieved on the front of genome sequencing. This is clearly a desirable evolution, but the security and privacy implications absolutely need to be tackled. In this talk, we will first describe the basics of genomics and the relevance for precision medicine. We will then mention the numerous threats induced by precision medicine in general (including the “quantified self”) and by genomics in particular. Moreover, we will discuss possible IT architectures for genomic (and phenotypic) data generation, processing, and protection, and present the solutions envisioned by the (few) researchers working on the topic. We will address the benefits (and limitations) of using partial homomorphic encryption for the protection of genomic data, and mention the pros and cons of the Paillier and ElGamal schemes. We will also discuss the potential of lattice-based encryption. We will detail the system we are currently developing for the privacy-conscious sharing of data between Swiss hospitals, as well as the investigations made by the Global Alliance for Genomics and Health (GA4GH). Finally, we will address the problem of inference attacks against genome databases and discuss the implications for kinship. The community Web site we have set up on the topic of genome privacy and security can be found at: https://genomeprivacy.org



Ethical Personal Data: Do Humans and Data Mix?

StJohn Deakins

Over the past four years CitizenMe has been experimenting with building an Ethical Human Data platform, working to give Digital Citizens better control over the use of their data, and the value derived from it. In this talk StJohn will share their discoveries, the benefits of Citizen participation, and the implications for the processing of personal data.