
4 WORD SEGMENTATION METHOD OVERVIEW

4.3 Data pre-processing specifics

Figure 18: CSV-Splitter Graphical User Interface of the suggested software

In any case, this step has to be carried out outside the main program of our method, so that when the program is launched, CSV files of the appropriate size are already available, each one containing the data of different words written by different subjects.

4.3.2 Data file selection

The data file(s) contain data samples that belong to different words written by different subjects. Our method, in contrast, is designed to work with each subject’s data separately. For this reason, the program offers two main options to start the data processing:

i) First time: With the “first time” option, the program reads the initial CSV file(s) and generates one new file per subject, containing all the data samples associated with that subject’s handwriting. To do this, the program reads the content of the original files as a table and uses the Subject variable as the criterion to assign each sample to the file of the subject it belongs to. Subjects are identified by a four-digit code. The first time the program finds a sample of a particular subject, it creates a new CSV file named after the subject’s code. All further samples that belong to the same subject are appended as new lines of the same file. At the end, the program has created as many CSV files as there were subjects participating in the recording experiment.
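The “first time” option can be illustrated with a minimal Python sketch. It is not the program described here (which stores its variables as Matlab files); the function name split_by_subject, the output layout and the exact column header "Subject" are assumptions made for illustration only.

```python
import csv
import os


def split_by_subject(input_paths, output_dir, subject_column="Subject"):
    """Write one CSV file per subject and return the sorted subject codes.

    Assumption: every input file has a header row that includes
    `subject_column`, and all input files share the same columns.
    """
    os.makedirs(output_dir, exist_ok=True)
    seen = set()
    for path in input_paths:
        with open(path, newline="", encoding="utf-8") as src:
            reader = csv.DictReader(src)
            for row in reader:
                code = row[subject_column].strip()
                out_path = os.path.join(output_dir, code + ".csv")
                first_time = code not in seen
                # Create the file on the first sample of a subject,
                # append to it for every later sample of that subject.
                mode = "w" if first_time else "a"
                with open(out_path, mode, newline="", encoding="utf-8") as dst:
                    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
                    if first_time:
                        writer.writeheader()
                    writer.writerow(row)
                seen.add(code)
    return sorted(seen)
```

Reopening the output file for every row keeps the sketch close to the description above; for large recordings, keeping one open file handle per subject would be faster.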

ii) Subject’s Data File: If the initial data has already been rearranged by writing subject, the user has to select the CSV file of one specific subject. The words encoded in this file are processed in the next phases. This selection allows the program to locate the source data file and to obtain the 4-digit code of the subject whose data will be analysed in the following steps.

To sum up, this method is applied to Wacom samples split per subject (each identified by a 4-digit code), and it processes one subject’s handwriting at a time. These pre-processing steps are included in our development.


Figure 19: Scheme of data pre-processing applied to data recorded by WACOM SmartPads and stored in a CSV file

4.3.3 List of words’ characteristics

Besides the samples data file, the CSV data file that contains the list of words dictated to the children during the handwriting experiment session also has to be loaded.

This list serves two main purposes. First, it is used to generate an internal variable (.mat) that contains the relevant information about the characteristics of all the words. Second, it is used to configure the graphical user interface.

This internal variable is an array with one column per word in the list. The information associated with each word is stored in the rows of the corresponding column. The three rows contain:

1) Digital encoding of the word: A character vector with all the characters of the word’s text.

2) Number of letters: The length of the character vector that represents the word. It is stored as an integer in the array.

3) Word type: A classification of the word based on its geometrical characteristics, derived from a similar geometric classification of its letters.

Figure 20: Abstract of the variable that contains word’s characteristics. Each column belongs to one word of the list and the 3 rows define each word’s text, the number of letters and type.

With respect to the word type feature, we have created a classification for the alphabet letters based on their graphical dimensions in cursive writing. The four possible labels are: “up”, “down”, “both” and “small”. Letter classification possibilities can be seen in the Table below:

Type of letter    Letters that belong to this type
Small             a, c, e, i, k, m, n, o, r, s, u, v, w, x
Up                b, d, h, l, t, è, é, â
Down              g, j, p, q, y, z
Both              f


The words are classified into four groups with the same labels, depending on the letters they contain. This information is needed to generate a spatial grid on which the different letters of a word can be approximately located, based on the expected dimensions of each letter. This is further developed in the letter initialization step (Section 4.4.2), where this information is used.
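The following Python sketch illustrates how the three characteristics of each word could be derived from the letter table. The letter groups are copied from the table above. The rule that maps a word to its type is not spelled out in the text, so the one used here is an assumption: a word is “both” if it has ascending and descending parts, “up” or “down” if it has only one of them, and “small” otherwise. The names LETTER_TYPES, word_type and word_characteristics are hypothetical.

```python
# Letter groups copied from the classification table.
LETTER_TYPES = {
    "small": "aceikmnorsuvwx",
    "up": "bdhltèéâ",
    "down": "gjpqyz",
    "both": "f",
}
LETTER_TO_TYPE = {
    ch: label for label, letters in LETTER_TYPES.items() for ch in letters
}


def word_type(word):
    """Classify a word from its letters (assumed combination rule).

    Raises KeyError for a character that is not in the table.
    """
    types = {LETTER_TO_TYPE[ch] for ch in word.lower()}
    has_up = "up" in types or "both" in types
    has_down = "down" in types or "both" in types
    if has_up and has_down:
        return "both"
    if has_up:
        return "up"
    if has_down:
        return "down"
    return "small"


def word_characteristics(words):
    """Return one (text, number of letters, type) entry per word."""
    return [(w, len(w), word_type(w)) for w in words]
```

Each tuple plays the role of one column of the internal variable: the word text, its number of letters and its type. For instance, under the assumed rule, “arbre” is of type “up” because “b” is its only non-small letter.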

4.3.4 Data extraction from files and Image generation

At this point, all the files required for the analysis are loaded. The next step is to extract the information from the samples of the selected subject’s handwriting.

The samples data file is now read from its directory and loaded into program memory as a table. Rather than loading the whole content of the CSV file, only the indispensable part is loaded, i.e., the table columns associated with the X, Y, Write and Group features.

Group is used to associate each sample with a word in the list. Write is used to distinguish, among all the samples of each word, those in which the pen was touching the paper (i.e. the child was writing) from those that correspond to in-air movements of the pen. Finally, X and Y are the 2D coordinates of the pen tip with respect to the tablet plane, and they are used to reproduce the words as digital images.

For each word, a new internal variable (.mat) is created. These variables are three-row arrays with as many columns as there are samples belonging to that word. The first row contains the X-coordinates, the second row contains the Y-coordinates, and the third row is a binary variable equal to 0 or 1, taken from the Write parameter of the original data. When this third variable is equal to 1, the associated sample (i.e. the associated coordinates) corresponds to an instant in which the pen was touching the screen, so the child was actually drawing a letter stroke; when it is equal to 0, the child was moving the pen close to the screen without drawing any contour line.

Figure 21: Abstract of the variable that stores data from a word’s samples. Each column belongs to one WACOM sample and the three rows define X, Y and Writing Condition, respectively
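As an illustration of this extraction step, the Python sketch below reads one subject’s CSV file, keeps only the four relevant columns and builds one three-row structure per word. It is a sketch under assumptions, not the program described here: the column headers "X", "Y", "Write" and "Group", the accepted spellings of a true Write value, and the function name load_word_samples are all hypothetical.

```python
import csv
from collections import defaultdict


def load_word_samples(path):
    """Return {group: [xs, ys, writes]} with samples in recording order.

    Assumptions: the CSV has a header row with columns named X, Y,
    Write and Group; Write is stored as 1/0 or True/False.
    """
    words = defaultdict(lambda: [[], [], []])
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Any other column of the file is simply ignored.
            x_row, y_row, w_row = words[row["Group"]]
            x_row.append(float(row["X"]))
            y_row.append(float(row["Y"]))
            pen_down = row["Write"].strip().lower() in ("1", "true")
            w_row.append(1 if pen_down else 0)
    return dict(words)
```

Each value in the returned dictionary corresponds to one per-word variable: row 0 holds the X-coordinates, row 1 the Y-coordinates and row 2 the binary writing condition.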

This information is essential to reconstruct digital word images that accurately reflect the children’s handwriting. To generate the digital images of the handwritten words, the samples have to be plotted in the same order in which they were recorded. Moreover, we take two main considerations into account in this step. First, to plot the word contour line, we only plot the 2D coordinates of the samples whose associated “write” variable indicates that the pen was actually touching the screen (Write = True). Second, non-writing samples in the middle of a word separate different strokes; hence, when they are encountered, a non-continuous contour is plotted by leaving a gap between the consecutive sample points.
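The two considerations above amount to splitting the ordered samples of a word into pen-down runs and drawing each run as a separate line. A minimal Python sketch of that splitting is given below; the function name split_into_strokes is hypothetical and the actual plotting is left out.

```python
def split_into_strokes(xs, ys, writes):
    """Group consecutive pen-down samples into strokes.

    xs, ys and writes are the three rows of a word variable, in
    recording order. Pen-up samples (write == 0) are dropped and
    close the current stroke, which leaves a gap in the contour.
    """
    strokes = []
    current = []
    for x, y, w in zip(xs, ys, writes):
        if w:
            current.append((x, y))
        elif current:
            strokes.append(current)
            current = []
    if current:
        strokes.append(current)
    return strokes
```

Drawing each returned stroke as its own polyline, without joining the last point of one stroke to the first point of the next, reproduces the non-continuous contour described above.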


The images that correspond to all of the selected subject’s handwritten words are generated and stored in .tif format in a new folder named after the subject’s code, inside the Results folder (also created by the algorithm). All saved images have the same dimensions, 656x875 pixels.

Figure 22: Example of the output of the image generation step applied to the handwritten word “arbre”

Also, data related to each word’s construction are stored as Matlab variables (.mat). These variables are the connection between original data and word images on which the algorithm works from this point on.

4.4 IMAGE PROCESSING

Before introducing the image processing and manipulation section, we note that it is often necessary to refer to the relative position between images or to locate a particular pixel within an image. For this, a reference point has to be fixed. Throughout this work, image positions and coordinates are always given with respect to the upper-left corner of the image. It is important to keep this in mind during the entire reading in order to follow the different stages of the method coherently.

4.4.1 Word selection

From this point on, the analysis steps are executed on a single word image, which is selected by the user. Hence, we also process one word at a time. Every word has an associated number. This number is read by the program and stored as a program variable. It is used to identify the word within the list and to read its associated characteristics from the variable that contains them (List_of_words.mat, in Section 4.3.3).

For each selected word, all the steps needed to reach a complete word segmentation are applied. This document presents the operation of the program step by step, but an option to process all the words in a black-box mode is available in the program GUI.

4.4.2 Estimates initialization

Since the text of the word to be identified in the selected image is known, it is possible to compute an estimated size and position for each of the letters that compose it. All the letters are assigned the same width, assuming a regular distribution of the text in the image. The height assigned to each letter differs depending on the letter’s graphical characteristics, following the classification presented before, in Table 3. The previously defined word classification is therefore used in this new step.
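A rough Python sketch of such an initialization is shown below. Only the equal letter width, the dependence of the height on the letter type and the upper-left reference corner come from the text. The split of the image height into three equal bands (ascender, body, descender) and the names BANDS and initial_letter_boxes are assumptions made purely for illustration; the actual proportions used by the method may differ.

```python
# Letter groups copied from the classification table (Table 3).
LETTER_TYPES = {
    "small": "aceikmnorsuvwx",
    "up": "bdhltèéâ",
    "down": "gjpqyz",
    "both": "f",
}
LETTER_TO_TYPE = {
    ch: label for label, letters in LETTER_TYPES.items() for ch in letters
}

# ASSUMPTION: the image height is split into three equal bands, numbered
# from the top. Each type covers the bands from `first` up to `last`.
BANDS = {
    "small": (1, 2),  # body band only
    "up": (0, 2),     # ascender and body bands
    "down": (1, 3),   # body and descender bands
    "both": (0, 3),   # all three bands
}


def initial_letter_boxes(word, image_width, image_height):
    """Return one (left, top, width, height) box per letter.

    Coordinates are measured from the upper-left corner of the image.
    """
    width = image_width / len(word)
    boxes = []
    for i, ch in enumerate(word.lower()):
        first, last = BANDS[LETTER_TO_TYPE[ch]]
        top = first * image_height / 3
        height = (last - first) * image_height / 3
        boxes.append((i * width, top, width, height))
    return boxes
```

With these assumed bands, a letter such as “b” gets a box that starts at the top of the image and ends at the bottom of the body band, while “a” only occupies the body band.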
