The formula model - The Lish: A Data Model for Grid Free Spreadsheets

2.3 Spreadsheets

2.3.4 The formula model

A defining feature of the spreadsheet is that the user may set the value of any cell to be given by a formula, which may refer to the values of other cells. For example, if cell C8 contains the formula “=sum(C2:C7)” then the machine will calculate the sum of the values in the range of cells C2 through to C7 and place the result in cell C8. With the exception of array formulae (see below), a formula affects only the value of the cell in which it resides.

This property of the spreadsheet, that the value of a (non-literal) cell is determined only by that cell’s own formula, has the benefit of transparency as to how any particular cell was calculated. But it also gives rise to a problem: if a whole column of cells is to be operated on in the same way (perhaps, we would like column F to contain twice the value in column E), then all the cells to be calculated must contain their own copy of what is to all intents and purposes the same formula. This contravenes the “Don’t Repeat Yourself ”, or DRY, principle as promulgated by Thomas and Hunt [1999]. It introduces a risk of errors during maintenance – what if only part of a column gets updated, leaving some copies of a superseded formula behind? Indeed, the inferential approaches of the previous section were in large part concerned with identifying such regions so that departures from the expected replication can be detected. An alternative approach is to change the formula model so that the replication is not required.

One way to achieve this is to allow formulae to be defined per column, rather than per cell. In the commercial sphere, Lotus Improv was an early pioneer of this approach; it separated the concepts of data, views and formulae. By taking the formulae outside the cells it incurred some loss of directness, which may be why this style of working never became main- stream. It does however live on in Quantrix, a current commercial product.

Apple’s current spreadsheet product, Numbers, has taken a step in this di- rection with the use of automatic named ranges, so that a formula may refer to a column by its heading rather than to a cell within that column. The formulae themselves remain cell-based. Numbers has also addressed the issue of demarcating separate tables, by adopting a canvas as a super-container on which table grids are placed. The lish allows greater generality than per-column formulae – they may be defined in various repeating patterns including over a column, column group, table body, or a set of equivalent columns in related tables.

Spreadsheets commonly allow the output of a formula to be an array. The array abstraction represents an aggregation of cells and once again can cut down on formula replication, as well as representing return values that are intrinsically multivalued, such as the inverse of a matrix. Array formulae are available, for example, in Excel, LibreOffice and Gnumeric. However they can be limiting in practice, since the dimensions must be fixed at the time the formula is entered, imposing restrictions on future row and column insertions into the sheet. The Table construct in Excel is a more flexible approach that allows a table to be treated as a single, properly demarcated object. Formulae are still replicated, but in a more automated way than would be possible on the basic grid. It is a little paradoxical, though, that the Table has been introduced as a new abstraction, quite separate from the table-like grid that formed the basis of the spreadsheet all along! The pivot table is another multi-cellular abstraction, to which I shall return in section 2.3.6.

Clack and Braine [1997] aim at “updating the spreadsheet computa- tional model, whilst retaining the essence of the spreadsheet user interface”. Their approach is to introduce object-oriented and functional programming features to the spreadsheet – workbooks behave like classes, and worksheets like functions. They address the formula replication problem by allowing whole regions of cells to be the l-values of formulae. The formula syntax itself resembles a functional programming language, for example with ex- pressions like “map sum [...]” to operate upon sequences of ranges in a single formula. Behind the scenes the lish is doing something a little similar (section 6.5), but the syntax is of vector arithmetic, so the user does not see anything looking like a higher order function.

Instead of trying to eliminate formula replication, Hermans and Van Der Storm [2015] manage its risks by copy-paste tracking. When the user

fills an area with an equivalent formula by means of copying and pasting, the machine remembers which cells were affected. Then, if there is a subsequent change to the formula in any of these cells, it is possible to update the others so that they remain consistent.

Other developments to the formula model add support for data of more than two dimensions. The Hypernumbers spreadsheet of Guthrie and Mc- Crory [2011] aims at engineering out many of the risks of web-based collab- orative spreadsheets. Along the way it includes support for 3D formulae in the form of “z-queries” that operate across a “pile” of sheets. These do not appear to be true 3D arrays, however, in the sense that inserting a row on one sheet in the pile would automatically cause a similar insertion in the others. Analytica software [Henrion, 2004] is more of a visual programming language than a direct spreadsheet replacement, but is applicable to many of the same kinds of problems. It uses “intelligent arrays”, which may have any number of dimensions, and which may be manipulated as single objects by formulae. The lish, too, supports arrays with any number of dimensions by virtue of its nested structure.

There are some other developments to the formula model which are not directly concerned with addressing the formula replication and dimensional generalisation problems, but nevertheless I mention here briefly as comple- mentary approaches that could be used alongside the lish.

An important step forward in the capability of the formula model comes from Peyton Jones et al. [2003] who designed a specification for supporting user-defined functions within the grid itself (as opposed to using an exter- nal language such as VBA). A prototype was later implemented by Sestoft [2013]. Clark and Hellerstein [2017] replace the usual formula language with the ability for cells to execute arbitrary Python scripts. Their implemen- tation allows NumPy arrays on the Python side to be bound to tables and columns in the spreadsheet. Bidirectional formulae [Macedo et al., 2014] allow the user to change the result of a formula and have the change back- propagated to a source cell, such that the formula will produce the new result.

Another aspect is the user interface by which formulae are composed. Roast et al. [2018] do not change the calculation model, but have developed a new user interface for interactive construction of formulae. The textual “formula bar” in Excel is replaced with a visual representation that shows the data flow involved in the calculation.

In document The Lish: A Data Model for Grid Free Spreadsheets (Page 31-34)