Extracting data out of a lish - The Lish: A Data Model for Grid Free Spreadsheets

6.2.1 Locating the cells to extract

We shall often want to extract part of a lish (such as a row or column of a table) for input to a calculation. I will use this running example, which contains a small sample of sales data:

[•, Sales summary,
  [[Region, [Sales, Jan, Feb, Mar], Total],
   [North, [•, 200, 350, 300], •],
   [South, [•, 150, 200, 250], •]]]

Suppose first we want to extract the “North” row. This row is represented as a sublist, so extraction is trivial: we can simply make a copy of the already existing sublist as our extract, namely:

[North, [•, 200, 350, 300], •].

But suppose now we want to extract the “Feb” column. This time, the cells of interest do not form a sublist in the original structure – the column “cuts across” the three sublists that form the rows. So we cannot simply copy out an existing sublist to form our extract, as was done for the “North” row. Fortunately, the user already has an interactive visual method of identifying the three cells of interest, in the form of an implicit selection (subsection 5.4.3). If we adopt the simple solution of creating a new list to hold the extract, we obtain:

[Feb, 350, 200].

Next let us extract the “Sales” cell and all its inheritors – that is, the four columns headed “Sales”, “Jan”, “Feb” and “Mar” respectively. The twelve cells involved once again do not lie neatly in a single sublist. We could try creating a single unstructured new list to contain them, which would yield:

[Sales, Jan, Feb, Mar, •, 200, 350, 300, •, 150, 200, 250].

This has all the required content, but some structural information has been lost. It will be important in the context of vectorised arithmetic that the extracted data are treated as an array of three rows by four columns, not as a single vector of twelve elements. So it would seem appropriate when extracting these cells to place them in a list structure resembling the one they were extracted from – but how much of that original structure do we need to keep? A conservative strategy would be to retain the entire sublist structure of the original lish, but filter out all but the twelve cells of interest. We would obtain:        

[[[[Sales, Jan, Feb, Mar]], [[•, 200, 350, 300]], [[•, 150, 200, 250]]]].

As well as being very inefficient, this strategy obfuscates the true structure of the data extracted. An intermediate form where only “relevant” structure is retained would appear preferable; let us see what that might look like.
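For comparison, the two strategies considered so far can be sketched in Python. This is only an illustration under stated assumptions: the lish is modelled as plain nested Python lists, each cell is identified by its path of indices from the root, and the names `flat_extract` and `conservative_extract` are hypothetical rather than taken from the text.

```python
NULL = "•"  # stand-in for the null cell

# The running Sales example as nested Python lists (an assumed encoding).
sales = [NULL, "Sales summary", [
    ["Region", ["Sales", "Jan", "Feb", "Mar"], "Total"],
    ["North", [NULL, 200, 350, 300], NULL],
    ["South", [NULL, 150, 200, 250], NULL],
]]

# The "Sales" cell and its inheritors: four cells in each of three rows.
selected = {(2, row, 1, col) for row in range(3) for col in range(4)}


def flat_extract(node, selected, path=()):
    """Collect the selected cells into one unstructured list."""
    out = []
    for i, child in enumerate(node):
        child_path = path + (i,)
        if isinstance(child, list):
            out.extend(flat_extract(child, selected, child_path))
        elif child_path in selected:
            out.append(child)
    return out


def conservative_extract(node, selected, path=()):
    """Keep every sublist of the original, dropping only unselected cells."""
    out = []
    for i, child in enumerate(node):
        child_path = path + (i,)
        if isinstance(child, list):
            out.append(conservative_extract(child, selected, child_path))
        elif child_path in selected:
            out.append(child)
    return out


flat = flat_extract(sales, selected)
nested = conservative_extract(sales, selected)
```

Here `flat` is the twelve-element vector, while `nested` carries all four original levels of nesting around the same twelve cells. In this sketch a sublist containing no selected cells would survive as an empty list, which is one source of the inefficiency noted above.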

6.2.2 Retaining relevant structure

When extracting cells from a lish, I will retain structure according to the following criterion. If every element of some sublist in the original lish is either a cell that is part of the extract, or contains at least one such cell somewhere within it, then that sublist is retained in the extract. Otherwise, it is dropped.
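The criterion can be sketched as a single recursive pass. The following Python is a minimal illustration, not the thesis's own implementation: it assumes the lish is encoded as nested Python lists, that the extract is given as a set of index paths from the root, and that a dropped sublist simply passes its extracted material up to its parent. The function name `extract` is hypothetical.

```python
NULL = "•"  # stand-in for the null cell

# The running Sales example as nested Python lists (an assumed encoding).
sales = [NULL, "Sales summary", [
    ["Region", ["Sales", "Jan", "Feb", "Mar"], "Total"],
    ["North", [NULL, 200, 350, 300], NULL],
    ["South", [NULL, 150, 200, 250], NULL],
]]


def extract(lish, selected):
    """Extract the cells at the paths in `selected`, keeping a sublist only
    if every one of its elements is a selected cell or contains one."""

    def walk(node, path):
        pieces = []      # extracted material found under this sublist, in order
        retained = True  # becomes False once any element fails the criterion
        for i, child in enumerate(node):
            child_path = path + (i,)
            if isinstance(child, list):
                child_pieces = walk(child, child_path)
                if not child_pieces:      # sublist with no selected cell inside
                    retained = False
                pieces.extend(child_pieces)
            elif child_path in selected:  # a cell that is part of the extract
                pieces.append(child)
            else:                         # a cell that is not part of the extract
                retained = False
        if retained and pieces:
            return [pieces]  # keep this level of structure
        return pieces        # drop this level; splice contents into the parent

    pieces = walk(lish, ())
    # If exactly one piece survives, it is the extract; otherwise the
    # pieces are held in a new list.
    return pieces[0] if len(pieces) == 1 else pieces


sales_block = extract(sales, {(2, r, 1, c) for r in range(3) for c in range(4)})
feb_column = extract(sales, {(2, 0, 1, 2), (2, 1, 1, 2), (2, 2, 1, 2)})
north_row = extract(sales, {(2, 1, 0), (2, 1, 2)} | {(2, 1, 1, c) for c in range(4)})
```

Under these assumptions, `sales_block` is a list of three four-element rows, `feb_column` is `["Feb", 350, 200]`, and `north_row` reproduces the existing “North” sublist.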

Let us apply this criterion to the “Sales” extract above. First, we consider the root of the original lish. One of its elements is the “Sales summary” cell, which is not part of the extract. So the root is to be dropped from the structure of the extract.

Moving one level in, the main sublist (containing all of the table) is retained. Its three elements are the row sublists beginning “Region”, “North” and “South” respectively. Each of these rows does contain at least one cell that is part of the extract (actually, they each contain four such cells – for example, “North” contains null, 200, 350 and 300).

Moving in a further level, we consider the sublist beginning “Region”, which holds the entire top row of the table. The first element of this sublist is the cell “Region” which is not part of the extract. Therefore this sublist is not retained. Similarly, nor are the sublists for the other two rows.

Finally, we come to the innermost level. Each of the four elements of the sublist beginning “Sales” is a single cell that is part of the extract, so this sublist is retained. Likewise, so are the two sublists immediately below it, each beginning with null.

Having discarded two of the original four levels of nested lists, we are left with the structurally appropriate result of:

[[Sales, Jan, Feb, Mar], [•, 200, 350, 300], [•, 150, 200, 250]].

6.2.3 Extracts as traces

Any data extract (beyond a single cell) obtained by the above procedure is clearly a list, but is it a lish? For the sales extract above, the answer is yes. In general, however, once structure extraneous to the extract has been removed, this is not guaranteed. A counterexample would be an extract of the cell containing 5 and all its inheritors in the lish of subsection 4.5.4. This extract is

[[5, 6], [[7, 8], [9, 10], [11, 12]]]

which is not a lish, because the prior template of the second element is [5, 6], with which this element does not conform.

Fortunately, the vectorised operations to be defined in this chapter do not rely on their operands themselves being lishes. Ordinary lists will do just fine, except for one deficiency. The operations will depend on the archetypes of the structure from which the data were extracted, so some archetype information needs to be captured. For this purpose I make one small modification to the extraction procedure above: instead of putting the results in an ordinary list, I shall put them in a trace (subsection 4.3.1). The archetype of each trace in this context is simply the archetype of the lish from which it was extracted. For example, suppose in the original Sales lish the archetypes of the root, the outermost sublist, the “Region” sublist and the “Sales” sublists were respectively 0x0040, 0x0080, 0x00c0 and 0x0100. The extract based on the “Sales” cell now becomes:

      

((Sales, Jan, Feb, Mar 0x0100),
 (•, 200, 350, 300 0x0100),
 (•, 150, 200, 250 0x0100) 0x0080)

Earlier, I ascribed informally the list [Feb, 350, 200] to the simpler extract consisting only of the February column. If we follow the more formal procedure above, we obtain the trace (Feb, 350, 200 0x0080) for this extract: the 0x0080 sublist (the outermost one, comprising the whole table) is the only sublist in the original lish for which every element is represented by a cell in the extract.
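The trace-producing variant of the procedure can be sketched in the same way. The Python below is illustrative only. The `Sublist` and `Trace` classes are hypothetical stand-ins, since the actual definition of a trace is in subsection 4.3.1 and is not reproduced here. The archetype values are those supposed in the example, with the further assumption that the “North” and “South” rows share the 0x00c0 archetype of the “Region” row.

```python
from dataclasses import dataclass


@dataclass
class Sublist:
    """Hypothetical stand-in for a lish sublist that records its archetype."""
    archetype: int
    items: list


@dataclass
class Trace:
    """Extracted elements plus the archetype of the sublist they came from."""
    items: list
    archetype: int


NULL = "•"
A_ROOT, A_TABLE, A_ROW, A_INNER = 0x0040, 0x0080, 0x00C0, 0x0100

# The Sales lish; giving all three rows the archetype 0x00c0 is an assumption.
sales = Sublist(A_ROOT, [NULL, "Sales summary", Sublist(A_TABLE, [
    Sublist(A_ROW, ["Region", Sublist(A_INNER, ["Sales", "Jan", "Feb", "Mar"]), "Total"]),
    Sublist(A_ROW, ["North", Sublist(A_INNER, [NULL, 200, 350, 300]), NULL]),
    Sublist(A_ROW, ["South", Sublist(A_INNER, [NULL, 150, 200, 250]), NULL]),
])])


def extract_trace(lish, selected):
    """Apply the retention criterion, wrapping each retained sublist in a
    Trace that carries the archetype of the original sublist."""

    def walk(node, path):
        pieces, retained = [], True
        for i, child in enumerate(node.items):
            child_path = path + (i,)
            if isinstance(child, Sublist):
                child_pieces = walk(child, child_path)
                retained = retained and bool(child_pieces)
                pieces.extend(child_pieces)
            elif child_path in selected:
                pieces.append(child)
            else:
                retained = False
        if retained and pieces:
            return [Trace(pieces, node.archetype)]  # retained: record archetype
        return pieces                               # dropped: splice into parent

    pieces = walk(lish, ())
    return pieces[0] if len(pieces) == 1 else pieces


feb = extract_trace(sales, {(2, 0, 1, 2), (2, 1, 1, 2), (2, 2, 1, 2)})
block = extract_trace(sales, {(2, r, 1, c) for r in range(3) for c in range(4)})
```

With these inputs, `feb` is a single trace with archetype 0x0080, and `block` is a trace with archetype 0x0080 whose three elements are traces with archetype 0x0100, matching the two results given above.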
