Figure 3.1 shows some (fictitious) data on energy consumption for a number of buildings on a university campus. The consumption is broken down by building, quarter and year. The price per megawatt hour is fixed for all buildings within each year, but varies from one year to the next. The table includes both some raw data (the quarterly consumption figures) and some calculated values (the annual totals and costs). Its presentation makes use of a number of visual attributes – shading, grid lines, and the spacing and relative positioning of the text and numbers – to highlight natural groupings and hierarchies within the data.
The figure looks very like a spreadsheet (or a pivot table), and indeed it could easily have been implemented and rendered as such. But the data here have in fact been represented as a lish, and the figure is a screen shot from the prototype lish editor. Although a lish is not the same thing as
Figure 3.1: An example of data represented in lish form: fictitious energy consumption data for some buildings on a university campus.
Figure 3.2: Highlighting the nested structure of the data from the previous figure. Note that the frames are part of the screenshot and not an annotation.
a spreadsheet, it can be used to represent many forms of spreadsheet-like data. As stated in the introduction, it is based not upon grids of cells but upon nested lists of cells. In a lish, these lists may be made subject to a “template” behaviour that allows sets of related lists – the columns of a table, for instance – to be kept in alignment. The nested frames in Figure 3.2 show some of this structure more explicitly: the inner grid reveals that the local “Q2” column is a list of four cells, which resides within a parent list labelled “Consumed”, which in turn is part of a grandparent labelled “Building”, and so on. Those frames can be switched on at will in the editor (they are part of the screenshot, not an annotation) to help the user see what the underlying structure is.
Why might a data analyst choose to represent their data in lish form? There are several potential advantages, all of which stem in some way from the property that the lish expresses more of the structure of the model than a simple grid of cells would do.
One manifestation of this extra structure can be seen in Figure 3.1. The heavy dashed border around the cell at the first intersection of “Venables” and “Consumed” is a cell cursor, which can be moved around just as in a spreadsheet. The bluish-grey shading of this cell identifies it as a template cell. The editor interprets template selections specially: in this instance, it means “give me all the cells relating to ‘Consumed’ and ‘Venables’, in all of
Figure 3.3: The example lish data in long form
the years”. The lighter borders around the three rows of cells answering this description sprang up automatically when the user navigated to the given cursor position. They form an implicit selection, which may be extracted in compact matrix form for calculation or viewing – as has been done in the small table at the bottom of the figure.
The upshot for the user is that both navigation and calculation can be carried out, not by referring to arbitrary cell ranges, but by referring to logically coherent subsets of cells that describe some feature of the model: a column, a table, a set of related rows, or (as we shall see later) some more complex pattern. Let us turn momentarily to the long or “tidy” form of the same data in Figure 3.3: this is the form that databases and many data science tools would use. There is a one-to-one correspondence between the grey template cells of the lish, and many simple filters that we might wish to employ on the long data – for example, all the consumption values relating to Q2, or the entire price column.
These properties of the lish may be applied to good effect when it comes to calculation. The lish supports a formula-based calculation model similar to that found in the spreadsheet, but borrows the idea of “vectorisation” from data science languages like R – with some extensions to accommodate the nesting of vectors within vectors. Formulae are typically, though not necessarily, defined on a template cell and take arguments that are other template cells. Hence a single formula may both operate upon and populate some range of cells, which need not all be adjacent.
The machine knows what the structure is, and this is information it can apply to the service of the user. Take for example the “Total” column. A single formula (located in the first “Total” cell) provides for the whole column, and it is set simply to be the sum of the “Consumed” table. There was no need for the user anywhere to ask explicitly for row sums, since that was deducible from the related structures of the input and output ranges. Similar logic allowed the “Cost” column to be expressed merely as the product of “Price” and “Total”, with the machine safely entrusted to determine which of the two prices belonged to which of the six totals.
Now let us look at what would happen if the user were to create an extra building, by inserting an extra row below any of the cells containing “Venables”. The lish editor would respond by inserting rows, not only here, but also below each of the other two “Venables” cells. Consistency is maintained, and no modification is required to the original formulae either. A related idea is that the result of a single cell formula may be a lish of arbitrary size, which becomes nested within that cell. This provides dynamic allocation, relieving the user of the need to pre-allocate sufficient cells for the expected size of a result. The Venables extract at the bottom of Figure 3.1 is an example of such a formula. In a nutshell, when the user edits a spreadsheet, the cell values are recalculated automatically; when they edit a lish, the structure itself may be recalculated automatically as well.
This small example shows some benefits that might be realised by build- ing a spreadsheet-like calculating machine on top of a data model that isn’t the traditional grid. Of course, these benefits might come at the cost of mak- ing other aspects more difficult, something that I will evaluate in Chapters 7 and 8. But first, how does it work?