Sketch-driven Data Analysis

Big Data in Medicine: Exemplars and Opportunities in Data Science

Neil Satra, University of Cambridge Computer Laboratory

It is no secret that research institutes and hospitals are struggling to analyse the vast amounts of data they gather and to convert it into actionable insights. Some organisations have responded by encouraging all researchers to run queries on their data warehouses, hoping that they can combine domain knowledge with empirical data to surface valuable information. However, the tools for querying this data have steep learning curves, making them inaccessible to non-experts. This project is based on a study of the data workflow at Papworth Hospital, where doctors and researchers currently compete for the limited time of a single statistician.

The initial stage of knowledge discovery entails forming rough hypotheses and testing them against the data. Currently, carrying out such Exploratory Data Analysis requires deep knowledge of statistics, statistical packages and programming languages such as R or Python, in which medical students and researchers are not trained. Users also face the ‘Gulf of Execution’ and the ‘Gulf of Evaluation’, since they constantly need to translate between their visual mental model and the textual input model of the tools. This project explores sketch-based methods for specifying and running data analysis queries and tasks. Building on the familiar metaphor of drawing with pen and paper, users may, for example, draw a line through a scatterplot to initiate a linear regression, or draw circles to initiate clustering. They may also use sketches for pre-processing tasks, such as erasing a point to initiate the removal of outliers.
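
As a rough illustration of how a recognised gesture might be mapped to an analysis task, the following minimal Python sketch dispatches on the kind of stroke. It is not the project's implementation: the `Stroke` type, the gesture names and the `interpret` function are hypothetical, stroke recognition itself is assumed to have already happened, and a circle is simply treated as marking one candidate cluster whose centre is the mean of the enclosed points.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Stroke:
    """A hypothetical, already-recognised pen gesture."""
    kind: str            # assumed labels: "line", "circle" or "erase"
    points: List[Point]  # data points the gesture touches or encloses


def fit_line(points: List[Point]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def centroid(points: List[Point]) -> Point:
    """Mean position of a group of points."""
    n = len(points)
    return sum(x for x, _ in points) / n, sum(y for _, y in points) / n


def interpret(stroke: Stroke, data: List[Point]):
    """Translate a gesture into an (analysis name, result) pair."""
    if stroke.kind == "line":
        # A line through the scatterplot requests a regression on all data.
        return "linear_regression", fit_line(data)
    if stroke.kind == "circle":
        # Assumption: the circled points seed one cluster centre.
        return "cluster_seed", centroid(stroke.points)
    if stroke.kind == "erase":
        # Erased points are dropped from the data set as outliers.
        erased = set(stroke.points)
        return "remove_points", [p for p in data if p not in erased]
    raise ValueError(f"unrecognised gesture: {stroke.kind}")
```

For example, `interpret(Stroke("line", []), data)` would return the fitted slope and intercept for `data`, while an `"erase"` stroke over one point would return the data set without that point.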

By lowering the barrier to entry and the cost of running each analysis, the hope is that users will be encouraged to explore their data more. A mixed-initiative system allows the user to leverage their domain knowledge (Hand, 1984) to guide the analysis, with the system anticipating their needs and complementing them with statistical knowledge. This approach combines the advantages of entirely automated systems such as Intelligent Discovery Assistants (Bernstein et al., 2005) and completely manually driven statistical packages like R. What makes such a solution intelligent is that, while leaving the user in control, it translates the user’s high-level directions into individual statistical analysis tasks by taking decisions on its own. For example, it may autonomously decide which test is most appropriate for a certain distribution. Similar mixed-initiative systems (Amant and Cohen, 1998) and expert systems (Aliferis et al., 1993) have been built for data mining and machine learning tasks, but they focus on traditional Graphical User Interfaces. Hence, they suffer from the drawback of requiring the user to translate between their mental model and the tool’s input model. Other input models have been tried in various domains, such as the natural language approach of Hand (1984) and the line-of-questioning approach of Ko and Myers (2004). Meanwhile, sketch- or touch-based interfaces have been successfully applied to domains such as machine learning (Fails and Olsen, 2003). Despite decades of research in Direct Manipulation (Schmieder et al., 2009) and sketch interfaces (Sutherland, 1964), little work has been done in applying them to statistical analysis.
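
To make the idea of the system choosing an analysis on the user's behalf more concrete, here is a small Python sketch of one possible decision rule. It is only an assumption for illustration, not the system described above: it picks Pearson correlation when both variables look roughly symmetric and falls back to rank-based Spearman correlation otherwise. The function names and the skewness cut-off of 1.0 are hypothetical, and a real system would likely use more careful distribution checks.

```python
import math
from typing import List, Sequence, Tuple


def skewness(values: Sequence[float]) -> float:
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def choose_correlation(xs: Sequence[float], ys: Sequence[float],
                       max_skew: float = 1.0) -> Tuple[str, float]:
    """Pick a correlation measure from the shape of the data.

    max_skew is an assumed heuristic threshold, not an established standard.
    """
    if abs(skewness(xs)) <= max_skew and abs(skewness(ys)) <= max_skew:
        return "pearson", pearson(xs, ys)
    # Spearman correlation is Pearson correlation computed on the ranks.
    return "spearman", pearson(ranks(xs), ranks(ys))
```

The point of the sketch is the division of labour: the user only indicates that two variables should be related, and the choice between a parametric and a rank-based measure is made by the system and can be reported back alongside the result.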