
The machine learning track


3.1.1 The machine learning track

A machine learning track (MLTrack) T implements the track structure presented in 2.1.4. It is designed as a generalization of the GTrack format, so the track elements (points and segments) are supported in the same way as in the GTrack implementation.

The representation of T uses a file-like notation. It is intended for internal use, but may also be useful for storage and sharing. All occupied positions of T are ordered by start position and written as separate lines. There are two types of lines, namely the «comment» line and the «data» line.

Figure 3.2: An overview of the eight reserved columns in the GTrack format and their associations to the different track types. The overview is a subset of Table 1 from the GTrack specification, with an additional row for the added MLTrack (in bold).

The comment line is optional and may occur at most once in an MLTrack file. If present, it must be the first line of the file and must start with four ’#’ signs ("####"), followed by at least one key-value pair whose key and value are separated by the ’=’ symbol. Multiple key-value pairs are separated by the semicolon ’;’ sign. The comment line notation is, in fact, a generalization of the «bounding region specification line» (3.B) of the GTrack specification. Consequently, any valid GTrack notation will pass as an MLTrack notation, but not necessarily the other way around.
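The rules for the comment line can be illustrated with a small parser. This is a minimal sketch and not the thesis implementation: the function name `parse_comment_line` is hypothetical, values are kept as plain strings, and tolerating whitespace and a trailing semicolon is an assumption that the text does not state.

```python
def parse_comment_line(line):
    """Parse a comment line such as '####genome=hg18; start=0; end=20'.

    Returns a dict mapping keys to values (both as strings).
    """
    prefix = "####"
    if not line.startswith(prefix):
        raise ValueError("a comment line must start with '####'")

    pairs = {}
    for chunk in line[len(prefix):].split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # assumption: a trailing ';' is tolerated
        key, sep, value = chunk.partition("=")
        if not sep or not key.strip():
            raise ValueError("malformed key-value pair: %r" % chunk)
        pairs[key.strip()] = value.strip()

    if not pairs:
        raise ValueError("a comment line needs at least one key-value pair")
    return pairs


header = parse_comment_line("####genome=hg18; start=0; end=20")
```

Applied to the comment line used in the examples of this section, the sketch yields the three keys genome, start and end.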

A data line must contain the four tab-separated columns seqid, start, end and val. The format may thus be seen as an extension of the GTrack format, since the following five GTrack formats may be mapped to the MLTrack: points (P), valued points (VP), segments (S), valued segments (VS) and function (F). Concretely, since valued segments (VS) and functions (F) contain the four columns explicitly, they may be mapped directly by only adding a proper comment line (header). Points (P) and valued points (VP) both lack the end column in the GTrack specification, but it can be derived from the earlier definition that a point always has a length of 1: the end position is 1 more than the start position, and the start column is present. Furthermore, points (P) and segments (S) are unvalued, so their val column may be assigned the null value or, equivalently, omitted.
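The mapping described above can be sketched as a small conversion routine. The function name `to_mltrack_line`, the textual spelling "null" for the null value, and the assumed column order of the incoming GTrack rows (seqid, start, then end and/or value) are illustrative assumptions, not part of the specification.

```python
NULL = "null"  # assumed textual spelling of the null value


def to_mltrack_line(track_type, fields):
    """Map one GTrack data row to a tab-separated MLTrack data line.

    track_type is one of "P", "VP", "S", "VS", "F"; fields holds the
    column values of the row as strings, in the assumed column order.
    """
    if track_type == "P":            # seqid, start
        seqid, start = fields
        end, val = str(int(start) + 1), NULL
    elif track_type == "VP":         # seqid, start, value
        seqid, start, val = fields
        end = str(int(start) + 1)    # a point always has length 1
    elif track_type == "S":          # seqid, start, end
        seqid, start, end = fields
        val = NULL
    elif track_type in ("VS", "F"):  # all four columns are present
        seqid, start, end, val = fields
    else:
        raise ValueError("unsupported track type: %r" % track_type)
    return "\t".join((seqid, start, end, val))


line = to_mltrack_line("VP", ["chr1", "2", "3.141579"])
```

The sketch always writes the null value explicitly. Omitting the val column for unvalued elements would be equally valid according to the text.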

The formats (P, VP, S, VS, F) lay the foundation for learning any abstract relationship. Any explanatory track format could theoretically be related to any response track format. Consequently, a total of 25 (5²) possible abstract relationships exist.
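The count follows from pairing each of the five formats, as explanatory track, with each of the five formats, as response track. A short enumeration makes this concrete; the variable names are illustrative only.

```python
from itertools import product

TRACK_TYPES = ("P", "VP", "S", "VS", "F")

# Every (explanatory, response) combination of the five formats.
relationships = list(product(TRACK_TYPES, repeat=2))

print(len(relationships))  # 25
```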

Figure 3.2 places the MLTrack in the context of the other GTrack formats mentioned. Note that the function format is actually a special case of the valued point type, in which all positions of an MLTrack T are occupied and none of the assigned values is the null value. The similarities or differences between two MLTracks are usually represented as an MLTrack, based on the same reasoning as for GTrack. Note also that the representation of an MLTrack is not guaranteed to be a valid GTrack. In fact, it is valid only in the case of segments (S) or valued segments (VS). Otherwise, it should always be possible to reduce the MLTrack to a valid GTrack of one of the three remaining types (P, VP or F). This is done by rearranging or removing content that is not supported by the GTrack specification. The reduction is possible because the mapping from GTrack to MLTrack is a lossless process, meaning that all prior data may be recovered from the available data.
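To reduce an MLTrack to a GTrack, one first has to decide which of the five types it corresponds to. The following is a rough sketch of such a classification, built only on the properties stated above (points have length 1, and a function is a valued point track with every position occupied and no null values). The name `infer_track_type`, the use of `None` for the null value, the restriction to a single sequence, and the half-open bounding region are all assumptions made for the illustration.

```python
def infer_track_type(rows, region_start, region_end):
    """Guess the GTrack type ("P", "VP", "S", "VS" or "F") of an MLTrack.

    rows is a list of (seqid, start, end, val) tuples on one sequence,
    with val set to None for the null value. The bounding region is
    assumed to be half-open: [region_start, region_end).
    """
    if not rows:
        raise ValueError("cannot infer a type from an empty track")

    only_points = all(end - start == 1 for _, start, end, _ in rows)
    any_value = any(val is not None for _, _, _, val in rows)

    if not only_points:
        return "VS" if any_value else "S"
    if not any_value:
        return "P"

    # A function occupies every position and never holds the null value.
    starts = {start for _, start, _, _ in rows}
    dense = starts == set(range(region_start, region_end))
    no_null = all(val is not None for _, _, _, val in rows)
    return "F" if dense and no_null else "VP"


points = [("chr1", 2, 3, None), ("chr1", 7, 8, None)]
kind = infer_track_type(points, 0, 20)
```

Once the type is known, the reduction itself amounts to dropping the columns that the corresponding GTrack type does not use, for example the end column for P and VP.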

In the rest of the chapter, whenever an (explanatory or response) track is mentioned, the MLTrack is understood unless explicitly stated otherwise.

MLTrack example 1: Unvalued points

Figure 3.3: An example MLTrack of the unvalued points (P) type.

Example 1 as GTrack

    ##gtrack version: 1.0
    ##track type: points
    ###seqid    start
    ####genome=hg18; start=0; end=20
    chr1    2
    chr1    7
    chr1    9
    chr1    17

Example 1 as MLTrack

    ####genome=hg18; start=0; end=20
    chr1    2     3
    chr1    7     8
    chr1    9     10
    chr1    17    18

MLTrack example 2: Valued points

Figure 3.4: Visualization of an example MLTrack of the valued points (VP) type.

Example 2 as GTrack

    ##gtrack version: 1.0
    ##track type: valued points
    ###seqid    start    value
    ####genome=hg18; start=0; end=20
    chr1    2     3.141579
    chr1    6     1.0
    chr1    7     2.718281828
    chr1    13    1.0
    chr1    15    6.0
    chr1    17    1.0

Example 2 as MLTrack

    ####genome=hg18; start=0; end=20
    chr1    2     3     3.141579
    chr1    6     7     1.0
    chr1    7     8     2.718281828
    chr1    13    14    1.0
    chr1    15    16    6.0
    chr1    17    18    1.0

MLTrack example 3: Unvalued segments

Figure 3.5: Visualization of an example MLTrack of the unvalued segments (S) type.

Example 3 as GTrack

    ##gtrack version: 1.0
    ##track type: segments
    ###seqid    start    end
    ####genome=hg18; start=0; end=20
    chr1    1     4
    chr1    8     12
    chr1    15    18

Example 3 as MLTrack

    ####genome=hg18; start=0; end=20
    chr1    1     4
    chr1    8     12
    chr1    15    18    null
