Room: Conference Suite


8:00- 9:00 Arrivals

9:00- 9:35 Efficient semi-complete data likelihood approaches (and making the most out of the BUGS/JAGS black-box) Ruth King, University of Edinburgh

9:45- 10:20 Enabling open science and data science via software: scikit-learn Gaël Varoquaux, INRIA

10:30 Coffee

10:50- 11:25 Telling Stories with data Neil Richards, Higher Education Statistics Agency (HESA)

11:35- 12:10 Beyond conscious thought: using data about perception to understand cognition Caroline Jay, University of Manchester

12:20- 12:30 Automatic Generation of Tile Maps Graham McNeill, Oxford Internet Institute, University of Oxford

12:35- 12:45 Understanding culture with Data Science Timothy Cowlishaw, BBC Research and Development

12:50 Lunch

13:50- 14:25 Changepoint challenges: making sense of industrial sensor data Idris Eckley, Lancaster University

14:35- 15:10 Data-Driven and Science-Driven Statistical Methods in Astronomy and Solar Physics David van Dyk, Imperial College London

15:20 Coffee

15:45- 15:55 Scalable and Adaptable Network Resilience Weisi Guo, University of Warwick and The Alan Turing Institute Data Centric Engineering Programme

16:00- 16:35 Cloudbursting: analysing massive weather data in the cloud Niall Robinson, Met Office

16:45 Close

Abstracts


Efficient semi-complete data likelihood approaches (and making the most out of the BUGS/JAGS black-box)

Ruth King

Data science drives both the development of new models that better describe the underlying system and the associated model-fitting tools for efficiently conducting statistical analyses of real data. We will consider the particular issue where the likelihood of the model is analytically intractable. A common technique in this case is Bayesian data augmentation, where the parameter space is expanded via the specification of auxiliary variables such that the “complete data likelihood” of the observed data and auxiliary variables is straightforward to write down. Such techniques can be applied in standard software such as BUGS (Bayesian inference Using Gibbs Sampling) and JAGS (Just Another Gibbs Sampler), which are widely used throughout the scientific community – particularly as the associated Markov chain Monte Carlo (MCMC) algorithm is effectively a hidden black box. However, standard MCMC algorithms can perform very poorly when parameters are highly correlated. We propose a semi-complete data likelihood approach, which can significantly improve the performance of standard, vanilla MCMC algorithms. We demonstrate the idea on applications in statistical ecology, implemented using the software JAGS.
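The data-augmentation idea itself is easy to illustrate in a toy setting. The sketch below is a minimal example of my own, not the semi-complete likelihood method of the talk, and is written in plain Python rather than BUGS/JAGS: a Gibbs sampler for a two-component Gaussian mixture in which the latent component labels are the auxiliary variables that make the complete-data likelihood tractable.

```python
# Minimal illustration of Bayesian data augmentation (not the semi-complete
# approach from the talk): a Gibbs sampler for a two-component Gaussian
# mixture in which latent component labels are the auxiliary variables
# that make the complete-data likelihood tractable.
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: mixture of N(-2, 1) and N(3, 1) with weights 0.4 / 0.6.
y = np.concatenate([rng.normal(-2, 1, 40), rng.normal(3, 1, 60)])
n = len(y)

# Priors: mu_k ~ N(0, 10^2), mixture weight w ~ Beta(1, 1); variances fixed at 1.
mu = np.array([-1.0, 1.0])   # initial component means
w = 0.5                      # initial weight of component 0
n_iter = 2000
draws = np.empty((n_iter, 3))  # store (mu_0, mu_1, w)

for t in range(n_iter):
    # 1. Sample the auxiliary variables z_i | y, mu, w (component labels).
    dens0 = w * np.exp(-0.5 * (y - mu[0]) ** 2)
    dens1 = (1 - w) * np.exp(-0.5 * (y - mu[1]) ** 2)
    p0 = dens0 / (dens0 + dens1)
    z = (rng.uniform(size=n) > p0).astype(int)  # 0 or 1

    # 2. Sample mu_k | y, z from its conjugate normal full conditional.
    for k in (0, 1):
        yk = y[z == k]
        prec = len(yk) + 1 / 100.0          # data precision + prior precision
        mean = yk.sum() / prec
        mu[k] = rng.normal(mean, np.sqrt(1 / prec))

    # 3. Sample w | z from its conjugate Beta full conditional.
    n0 = (z == 0).sum()
    w = rng.beta(1 + n0, 1 + n - n0)

    draws[t] = (mu[0], mu[1], w)

print("Posterior means (mu_0, mu_1, w):", draws[500:].mean(axis=0))
```

With highly correlated parameters and auxiliary variables, samplers of exactly this kind can mix very slowly, which is the problem the semi-complete data likelihood approach is designed to address.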



Enabling open science and data science via software: scikit-learn

Gaël Varoquaux

Data science, with its sophisticated data processing, is having a transformational impact on many facets of science and society. It is driven by a technological revolution based on statistical models and the software implementing them. Outreach, bridging the technical gap beyond the ivory towers of research labs and high-margin tech ventures, is crucial if data-science applications are to reach off the beaten track. Scikit-learn is machine-learning software that strives to reach many users and applications. Via the rich Python data ecosystem it can be embedded in any domain or workflow. It has hundreds of thousands of users across a variety of fields in industry and academia. I will discuss how we built scikit-learn to be easy to use and didactic; how we grew a community of open-source developers with a focus on collaboration; how we ensure quality in a statistical-learning codebase; how we try to distill the most important progress from the rapid pace of academic publishing; and how we are struggling to make the development sustainable.
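As a concrete illustration of the uniform estimator interface the abstract alludes to, here is a minimal scikit-learn example (assuming scikit-learn is installed; the dataset and model are arbitrary choices for demonstration):

```python
# A minimal example of the scikit-learn estimator API (fit/predict/score),
# the kind of uniform interface that makes the library easy to embed in
# many workflows.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipelines chain preprocessing and a model behind the same fit/predict API.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```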



Telling Stories with data

Neil Richards

People don’t remember numbers; they remember stories. If you want to communicate data-driven findings to an audience, then you need to be able to tell a compelling story with your numbers. Numbers might not change the world, but the story they are telling just might. We look at data visualisations, the stories they tell, and the methods used to tell them.



Beyond conscious thought: using data about perception to understand cognition

Caroline Jay

Humans continuously gather data from the environment to form judgements and guide behaviour. While many decisions appear to be made at a conscious level, they are strongly influenced by the perceptual processes used to obtain the relevant information, and by the editing that the brain performs to prevent the conscious mind being overwhelmed by vast amounts of noisy data. Our perceptual behaviour has a profound effect on how we understand the world, but the process by which it occurs is subjectively hard to articulate. This talk discusses how we can use computational methods to monitor and make sense of these complex perceptual processes, providing a window on subconscious cognition, and laying the foundations for technology that could vastly improve our decision-making capabilities.



Automatic Generation of Tile Maps

Graham McNeill

Tile maps (also known as grid maps and tile grid maps) are an important tool in thematic cartography with distinct qualities (and limitations) that distinguish them from better-known techniques such as choropleths, cartograms and symbol maps. Specifically, tile maps display geographic regions as a grid of identical tiles so large regions do not dominate the viewer's attention and small regions are easily seen. Furthermore, complex data such as time series can be shown on each tile in a consistent format, and the grid layout facilitates comparisons across tiles. Whilst a small number of handcrafted tile maps have become popular, the time-consuming process of creating new tile maps limits their wider use. To address this issue, we present an algorithm that generates a tile map of the specified type (e.g., square, hexagon, triangle) from raw shape data. Since the best tile map depends on the specific geography visualized and the task to be performed, the algorithm generates and ranks multiple tile maps and allows the user to choose the most appropriate. The approach is demonstrated on a range of examples and available in a prototype browser-based application.
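For readers unfamiliar with how such a layout might be produced, the sketch below shows one plausible core step, not necessarily the algorithm presented in the talk: regions are assigned to square grid cells by minimising the total displacement between region centroids and candidate tile centres, posed as a linear assignment problem (the coordinates are made up for illustration).

```python
# A sketch of one plausible core step in automatic tile-map generation
# (not necessarily the algorithm from the talk): place each region on a
# square grid by minimising total displacement between region centroids
# and candidate tile centres, solved as a linear assignment problem.
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical input: one (x, y) centroid per geographic region.
centroids = np.array([
    [0.0, 0.0], [1.2, 0.1], [2.3, -0.2],
    [0.1, 1.1], [1.0, 1.3], [2.1, 0.9],
    [0.4, 2.2],
])
n = len(centroids)

# Candidate tile centres: a square grid just large enough to hold every region.
side = int(np.ceil(np.sqrt(n)))
grid = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)

# Normalise centroids into grid coordinates so distances are comparable.
mins, maxs = centroids.min(axis=0), centroids.max(axis=0)
scaled = (centroids - mins) / (maxs - mins) * (side - 1)

# Cost matrix: squared distance from each region to each candidate tile.
cost = ((scaled[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)

rows, cols = linear_sum_assignment(cost)
for region, tile in zip(rows, cols):
    print(f"region {region} -> tile {tuple(grid[tile].astype(int))}")
```

Ranking several candidate layouts (different grid shapes or tile geometries) and letting the user pick, as the abstract describes, would sit on top of a placement step of roughly this kind.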



Understanding culture with Data Science

Timothy Cowlishaw

The Secret Science of Pop (http://www.bbc.co.uk/programmes/b08gk664) is a BBC television programme which aimed to showcase the power of machine learning and data science by using these techniques to understand and predict what makes a song a chart success. A collaboration between Dr Armand Leroi of Imperial College, BBC R&D, and academics from Oxford University and Queen Mary, University of London, the project posed the unique challenge of both carrying out large-scale signal processing and machine learning analysis on a large corpus of cultural artefacts, and making the process and results understandable to a general audience. This talk will provide an overview of the project and the challenges we faced, examine where it succeeded and where it failed, and draw more general lessons about how to present machine learning models, and results derived from them, to a non-specialist audience.



Changepoint challenges: making sense of industrial sensor data

Idris Eckley

The prevalence of high quality sensors within business and industrial systems has resulted in a torrent of data. Typically, these sensors are capable of recording data on several different attributes at very high rates (kHz or GHz). These signals pose many important challenges for data science. Arguably one of the most fundamental of these is the identification of when the statistical properties of the signal have changed. This requires the development of efficient and accurate methods, capable of detecting potentially subtle changes in signal composition. This talk will focus on the challenge of efficiently estimating the locations of changepoints, i.e. abrupt changes, within such signals and, in particular, the benefits of parallelisation together with details of key statistical properties that can be established in this setting.
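As a toy illustration of the underlying idea (far simpler than the multiple-changepoint, parallelised methods the talk addresses), the snippet below locates a single change in mean by minimising the summed within-segment squared error:

```python
# Toy illustration of changepoint detection: find the single most likely
# change in mean of a noisy signal by minimising the summed within-segment
# squared error. The talk's methods handle many changes, subtler structure
# and far larger signals; this only shows the basic cost-based idea.
import numpy as np

rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(1.5, 1.0, 200)])

def segment_cost(x):
    """Sum of squared deviations from the segment mean."""
    return ((x - x.mean()) ** 2).sum()

# Evaluate every admissible split point and keep the cheapest.
costs = [segment_cost(signal[:t]) + segment_cost(signal[t:])
         for t in range(1, len(signal))]
tau = int(np.argmin(costs)) + 1
print("estimated changepoint at index", tau)   # expect a value near 300
```

Applying the same search recursively to each resulting segment gives binary segmentation, one standard route to detecting multiple changepoints.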



Data-Driven and Science-Driven Statistical Methods in Astronomy and Solar Physics

David van Dyk

In recent years, technological advances have dramatically increased the quality and quantity of data available to astronomers. Newly launched or soon-to-be launched space-based telescopes are tailored to data-collection challenges associated with specific scientific goals. These instruments provide massive new surveys resulting in new catalogs containing terabytes of data, high-resolution spectroscopy and imaging across the electromagnetic spectrum, and incredibly detailed movies of dynamic and explosive processes in the solar atmosphere. The spectrum of new instruments is helping scientists make impressive strides in our understanding of the physical universe, but at the same time generating massive data-analytic and data-mining challenges for scientists who study the resulting data. In this talk I will illustrate and discuss the interplay of data science, statistics, data-driven methods, and science-driven methods in the context of several problems in astrophysics.



Scalable and Adaptable Network Resilience

Weisi Guo

Human-built systems are increasingly under the strain of both natural and man-made stressors. Many critical infrastructure systems have network dimensions. Despite their national importance, the complexity of these interdependent and multi-scale networks means we do not fully understand how to invest in and adapt them to different risks and uncertainties. This poster outlines a new study which explores whether we can learn from natural complex systems that have evolved under constant predation and environmental stress. In particular, we draw inspiration from food webs' ability to: (1) maintain a robust coherent structure, and (2) exhibit common community-preservation adaptation mechanisms. The study will also explore the coupling between small-scale failures and large-scale effects to inform the design of real-time routing mechanisms and long-term investment strategies for interdependent critical infrastructures.
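As a purely illustrative sketch of one way structural resilience is often quantified (not the models used in this study), the snippet below tracks the size of the largest connected component of a random network as nodes are removed, either at random or targeted by degree, using the networkx library:

```python
# Toy illustration of structural resilience: track the size of the largest
# connected component of a network as nodes are removed, either at random
# or targeted by degree. Purely illustrative; not the models from the study.
import random
import networkx as nx

random.seed(0)
N = 200
G = nx.erdos_renyi_graph(n=N, p=0.03, seed=0)

def largest_component_fraction(graph):
    if graph.number_of_nodes() == 0:
        return 0.0
    biggest = max(nx.connected_components(graph), key=len)
    return len(biggest) / N

def attack(graph, targeted, steps=5, remove_per_step=20):
    g = graph.copy()
    trace = [largest_component_fraction(g)]
    for _ in range(steps):
        if targeted:
            nodes = sorted(g.degree, key=lambda kv: kv[1], reverse=True)
            victims = [node for node, _ in nodes[:remove_per_step]]
        else:
            victims = random.sample(list(g.nodes), remove_per_step)
        g.remove_nodes_from(victims)
        trace.append(largest_component_fraction(g))
    return trace

print("random failures :", [round(x, 2) for x in attack(G, targeted=False)])
print("targeted attacks:", [round(x, 2) for x in attack(G, targeted=True)])
```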



Cloudbursting: analysing massive weather data in the cloud

Niall Robinson

With the upgrade of the Met Office supercomputer, we now generate hundreds of terabytes of data every day. Our group, the Informatics Lab, has been exploring how we could use cloud computing to let people access all this data and, crucially, find out what they want to know from it.
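As a hedged sketch of the kind of lazy, chunked analysis this implies, the snippet below uses xarray backed by dask to build a reduction over a large gridded file without loading it into memory; the file name and variable are hypothetical placeholders, not actual Met Office products:

```python
# A sketch of lazy, chunked analysis of a large gridded dataset using
# xarray backed by dask. The file path and variable name are hypothetical.
import xarray as xr

# Open the dataset lazily; nothing is loaded into memory yet.
ds = xr.open_dataset("forecast.nc", chunks={"time": 24})

# Build the computation lazily: a mean over time for one (hypothetical) field.
time_mean = ds["air_temperature"].mean(dim="time")

# Only now is data read and reduced, chunk by chunk, possibly on a cluster.
result = time_mean.compute()
print(result)
```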