• No results found

Alternative data structures

2.3 Spreadsheets

2.3.7 Alternative data structures

The simplicity of the grid of cells, at least in its classical form, does have some disadvantages. It provides only a lightly guided imposition of struc- ture. This leaves the user free to superimpose their own, more complex structures, but with little safety net: data that are not “table shaped” can still be represented, but the logical relationships between their elements are typically not captured, nor is their integrity enforced. And modelling data that are three dimensional (or higher) is not straightforward.

Another problematic aspect is the lack of demarcation between regions of the grid that actually represent separate objects. Not only does this risk formula references “leaking” outside the objects they should be accessing, but if two unrelated tables are placed one above the other on the same grid, they will be forced to share common column widths even though there is no logical reason why their columns should align.

Creating a spreadsheet-like system on top of an underlying structure that is not the grid seems to have been only little investigated. An early exception was the work of Burnett et al. [2001]. They implemented a re- search prototype, Forms/3, which introduced a range of new abstractions to replace the traditional cell/grid model. A large part of this work concerned extending a spreadsheet-like style of working to graphical programming and real time I/O, which are outside my scope. But of particular interest here are their representations for tabular data. Cells are placed within dynamic matrices which themselves are placed within forms; the latter behave vi- sually like a canvas, and organisationally like a module or class definition. Formulae may be defined over an entire dynamic matrix or a region of one, and importantly the size of the matrix may be computed at run time, pro- viding dynamic storage allocation in a spreadsheet setting.

More recently there has been something of a resurgence of interest in this area. Miller and Hermans [2016] propose an alternative model that would

allow a user to migrate a normal spreadsheet into a more structured form; their technique can be applied selectively alongside ordinary cells, providing a “gentle slope”. Their model involves capturing extra structure within the sheet margins in the form of semantic axes. These can include rows or columns that nest within some larger category, such as quarters within years, or product type within revenue. Formulae may then be defined over a cell grouping specified in terms of the axes, avoiding per-cell replication of formulae. The axes can also include dynamic nodes which allow the column structure itself to be a live copy taken from elsewhere on the sheet.

Chang and Myers [2016] extend the spreadsheet grid to support hier- archical data, by allowing nested cells. Their GNEISS application allows arbitrarily nested JSON data to be imported without flattening. They extend the formula notation with expressions like “=B1.1” to mean the first nested cell within cell B1, and provide extensive facilities for reshaping and summarising; aggregation functions such as COUNT respect the nested structure.

McCutchen et al. [2016] articulate with great clarity how despite looking like a table, the spreadsheet ironically is not very well suited to operating upon tables! In an alternative model named Object Spreadsheets, they rep- resent each object type in a table, where the rows are instances of that type of object and the columns are its attributes (though the underlying schema is not limited to this choice of representation). Where the non grid-like behaviour comes in is that attributes are allowed to be multivalued, so one “row” may contain nested cells under some of its attributes. Furthermore, an object can own one or more other objects, again represented by nested rows; the table representing the owned object type is a column subset of the table representing the owner. Object Spreadsheets also defines an advanced formula language. Formulae are defined per column and accommodate those cases where a column may have multiple levels of structure nested within it. The language also includes object references, which are analogous to foreign keys. The authors make the important observation that:

“In practical terms, ownership captures the nature of a signifi- cant fraction of entity relationships in real applications; for ex- ample, if an application includes invoices that contain line items, one would expect deleting an invoice to delete the line items.” That is, a hierarchical model with ownership can represent in a single table many (though not all) of the use cases that would require multiple tables in

a relational database. The lish benefits in an identical way from this use of hierarchy.

Baˇzant and Marˇs´alkov´a [2018] describe a prototype (“Nezt”) for spread- sheet-like interactive functional programming based, like the lish, upon nested lists of cells rather than a grid. Its programming capabilities in- clude support for user-defined functions. The lists operated on are Lisp-like lists, as opposed to tables.

Some common threads running through the works reviewed in this sub- section are the benefits of defining formulae over multi-cellular areas, and the need for many use cases to be able to capture hierarchical structures. The problem with introducing hierarchy into a spreadsheet is that at some level, the grid must reappear in order that tables can be represented at all. So (with the exception of Nezt, it seems) they all combine the hierarchy with a grid in some way. In Forms/3 (and also in Hypernumbers, reviewed earlier) the grid is at the lowest level in the structure – a nest of grids. In GNEISS and in Object Spreadsheets, the top level construct is a grid and there is nesting within it – a grid of nests. Object Spreadsheets additionally defines nesting at the marginal level, which then spans the entire table, a property it shares with the semantic axes of Miller and Hermans.

The lish does not explicitly define a grid at any level, but we shall see that by virtue of its templating behaviour it is able to generate both nests of grids and grids of nests, as well as a rich set of repeating cell patterns including the marginal groupings just described, and higher dimensional arrays. On the other hand, the lish has no concept (as yet) of a foreign key, so is limited in its ability to combine multiple relations.

Another issue that the works just cited grapple with, to varying degrees of success, is that the formula language can easily become complicated by the presence of hierarchy. The lish aims to mitigate this by extending vec- torisation to deal with hierarchical data. Formulae in a lish are not bound to be per-column, but may take effect over any of the repeating cell patterns defined by the templates.