• No results found

2.8 State of the Art in the Area of Keystroke Dynamics Recognition

2.8.1 Fixed-text Keystroke Dynamics

It was 1975 when Spillane [37] initially suggested using the keyboard as a means for collecting keystrokes for identification purposes. Two years later, tests were conducted by Forsen et al. [38] to check if keystroke dynamics could be used as a mechanism to differentiate between typists. The participant’s name was typed by its owner and by other participants. Statistical comparisons proved that the manner that a user types his or her name was distinguishable from the way another user typed the same name.

Three years later, Card et al. [93] suggested a psychological model for explaining the human interaction with computer programs on a keystroke level. It summarizes a human-machine session as:

(2.9)

Where Tacquire is the time needed to assess the task and decide how to solve it. This differs

between users based on the difficulty of the task and the user experience with it, therefore it is not considered a good way to distinguish between users. While, on the other hand, Texecute is

the actual time needed to perform the task. It basically describes a mechanical act, so it can be described, in the case of a typing task, as:

∑ (2.10)

Where Tm is time for mental planning and Tk is the time to perform the key press for

character (i). The sum of this amount for all characters (n) denotes the execute time for a subsection of the text. Moreover, the text subsections are on average between 6 and 8 characters long; this is the number of characters that the brain can handle at once [94]. Therefore, keystrokes less than 6-8 are good candidates to show clear keystroke patterns.

Another related theorem is Fitts’s law [95]. It defines the connection between the target distance, width, and time needed to perform a target acquisition task. It represents the speed/accuracy trade-off in rapid, aimed movement [96]. The movement time is computed using Fitts’s law as follows:

1 (2.11)

Where MT is the movement time, a and b are empirically determined constants, that are device dependent, D is the distance of movement from start to target and W is the width of the target.

Fitts’s Law can be physically interpreted as: big targets at close distance are obtained faster than small targets at long range [97]. This can be applied to the keyboard layout where the modifying keys that are pressed most often, i.e. space, enter and shift, are larger than other modifying keys. They are also closer to the alphanumeric keys which are frequently used [98]. These two characteristics make these keys, i.e. space, enter and shift, easier to hit. Moreover, Fitts’s law aims to reduce the total travel distance between keys. Therefore all alphanumerical keys (frequent keys) are located in the centre of the keyboard. Moreover, frequently connected letters (such as T and H) are located closer to each other compared with less frequently connected letters [98].

Although the width of all alphanumerical keys is equal, the distance between two adjacent keys is smaller than non-adjacent ones. This will cause higher accuracy of movement between adjacent keys compared with non-adjacent ones. This accounts for the non-adjacent key-pairs having more outlier or noisy data compared with adjacent ones. This is because users are more likely to miss when hitting a key that is located farther away. In addition, fitts’s law validates the idea behind grouping keys based on adjacency, i.e. distance between each other. As adjacent keys are closer to each other, they are pressed with higher speed compared with 2nd-adjacent ones. The same applies between 2nd-adjacent keys and 3rd- adjacent ones and so on.

A U.S. patent was granted to Garcia [53] for an identity verification method that involved the use of the user’s name as a keystroke generator. Garcia stated that the timing latencies produced by the user when typing his or her name had been proven to be stable enough to use for identity verification, therefore, the best data that can be used for authenticating a user is collected when the user types his or her own name. In his study, Garcia, asked the users to type their names a number of times. This data was used to create a vector of mean latencies for this user’s profile (electronic signature) after discarding any outlier data, i.e. values larger than 500% of the mean. At log-in time, the user was also required to type his or her name. This data was used to create the keystroke’s latency vector of the test data. The two vectors were compared using the Mahalanobis distance function. The user was denied access to the system if this distance was more than 100; and likewise, the user was granted access if the distance was less than 50. In addition, in the case where the value of the distance was between 50 and 100, the user was required to retype his or her name again. Garcia produced an overall performance of a 0.01% FAR and a 50 % FRR.

Similarly, Bleha et al. [99] used latencies between keystrokes as timing features in their experiment, in which two different classification methods were used; the minimum distance classifier and Bayesian classifier, each of which used a different threshold. Both classifiers were used jointly in the classification process; i.e. the user was rejected only if both classifiers agree that the attempter was an imposter, and vice versa in case of accepted users. This, unfortunately, was the reason for some results being undecided. The authors conducted three recognition systems. First of which was an identification system where 10 users retyped a fixed phrase five times. An indecision error of 1.2% was produced. The second recognition system was a verification system, in which 26 users were asked to retype their names 30 times. An FRR of 8.1% and a FAR of 2.8% was generated. The third system was an overall

recognition system, i.e. combining the two systems. In this part of the study, 32 users’ data, 10 genuine and 22 imposter, were used to achieve an error of 3.1% FRR and 0.5% FAR.

Joyce and Gupta [36] agreed with Garcia [53] in the sense that the timing data extracted from typing a well-known string provides more consistent results, compared with unfamiliar strings. They used a statistical method that employed the absolute distances between the timing vectors of the user’s template and test data. Each of these vectors consisted of the means of di-graph latencies for a fixed-text, i.e. digital signature. This digital signature consisted of the username, password, first name and last name of the user. All outliers, i.e. values as three times standard deviations greater than the mean, were discarded. A suitable threshold was identified for each particular user based on the mean and standard deviation of his or her template data. This threshold was used for deciding if the test data was actually for the user in question. More specifically, the user was authenticated only if the distance between his or her test data and profile data was less than the threshold. Thirty-three users were asked to go through the training session, in which they provided eight digital signatures each. Then, they were allowed to re-enter the system using that same digital signature, five times for testing. In the rest of the testing process, twenty-seven users acted as imposters and targeted the remaining six users by entering the victims’ digital signatures five times. The study produced an FAR of 0.25 % and a FRR of 16.36 % over all the tested data.

Although the results produced by Joyce and Gupta [36] were encouraging, they used a replacement methodology in their tests, in which they asked users to provide training data all in one session. Monrose and Rubin [33] suggested the use of data sets that had been recorded at variable times to improve the dataset created by Joyce and Gupta in [36] and therefore make it more reliable. In addition, their study allowed the participants to download the data collecting program on their personal PC’s rather than restricting them to use a specific PC as done in [39, 53, 36]. This allowed them to run the experiment at times when it was most convenient for them. The collected users' profiles were represented as n-dimensional feature vectors, which were created using the same method developed by Joyce and Gupta in [36]. Three different experiments were conducted on the collected data. The first utilized the Euclidean distance between the vectors of the test data and the profile data. The second used a non-weighted probability scheme. The last experiment adds a weight to the probability measure; this weight was based on the frequency in which the di-graph occurs in the written language. Over a period of 11 months, typing data was collected from 63 users. The rate that users were correctly identified using the weighted probabilistic classifier was approximately

87.18%, which was better than the performance of the Euclidean distance which was 83.22% and the non-weighted scoring approach being around 85.63%. A better performance was produced using a fourth scheme, the Bayesian classifier, which resulted in around 92.14% correct identification.

One of the first research that used Neural Networks for keystroke dynamics authentication was produced by Obaidat and Macchairolo [100]. In this study, they presented a new multilayer neural network system to be used for identifying users based on their typing patterns. Only the down-down time, between two consecutive keystrokes, was used. The experiment made use of three different types of networks, namely: a multilayer feed-forward network which was trained using the back propagation algorithm, a sum-of-products network which was trained with a modified version of the back propagation algorithm, and a new hybrid network that combines the two previous kinds. In the experiment, six participants were requested to enter the same password. This password consisted of a phrase that had thirty letters, but only the first 15 characters were used because using all 30 did not have any effect on the final results. In total, 40 vectors were collected for each user over a period of 6 months. After collecting the raw data, it was divided into odd patterns and even patterns in order to use either part in training and the other for testing. The neural network used for classification consisted of 15 inputs, corresponding to the length of the tested pattern, and 6 outputs, corresponding to the number of users, and 5 hidden units. Each of the three types of neural networks was trained similarly: training data was fed to the network and the weights were adjusted until the total summed squared error produced by the training set was smaller than 0.05. It was proven that the back propagation algorithm provided better accuracy than the sum-of-product algorithm; the accuracies for the two algorithms were 97.5% and 93.7% respectively. The new multilayer hybrid sum-of-products neural network produced accuracy of 96.2%, yet it required less training time than the other two.

Fuzzy logic was first used by Ru and Eloff [101] for keystroke patterns classification. They used flight time to distinguish between the typing patterns of the users. In this research, five fuzzy rules corresponding to the typing speed were used for classification. Thirty users were asked to type a minimum of eight-character passwords 25 times. A mean accuracy of 87% was produced.

Cho and Hwang [102] looked at the keystroke system from a different perspective. They argued that the quality of the user’s patterns is very important in obtaining a high

performance keystroke authentication system and has the same significance as the quality of the classifier. Therefore, their study suggested different ways to improve the typing pattern quality using specific motions performed by the user. The concept of the typing pattern quality has to do with two issues, namely: uniqueness, which emphasises the difference between the users’ patterns, and consistency, which focuses on the similarity of a single user’s patterns. Several techniques were proposed to improve both factors. First, artificial rhythms were used to improve uniqueness including: pausing, musical rhythm, staccato (minimum duration), legato (maximize duration), and slow tempo. Second, timing cues were used to improve consistency including: auditory (ticking sound), visual (hammer hitting a nail video-clip), and audio-visual (combining both) cues. Five users were involved in the two experiments of this study. The chosen password was “password”, which was typed naturally by the user 20 times; the different typing rhythms and cues of this password were also typed by each user 20 times. The uniqueness level was escalated significantly using these rhythms; there was a 200% improvement using short pauses and 500% using slow tempo. The consistency level has also improved using the cues; it improved almost 15 times using the auditory cues. The above strategies have also improved the discriminability of the typing pattern, which is defined as the relation between uniqueness and consistency, to more than twice using only auditory cues.

A major limitation of the scheme used in [102] was applicability to real life situations, since it only gave results from 5 users. Therefore Hwang et al. [103] used data from 25 users to investigate the level of improvement that artificial rhythm and tempo cues can add to the uniqueness and consistency of the typing pattern, in the case of a larger population. The results showed an error of 11% EER using natural rhythms; this was improved to 3% EER using pauses. An even better performance of 1% EER was produced using auditory cues in addition to pauses. This proved the validity of using these rhythms to improve the system performance.

Moreover, Maxion and Killourhy [104] examined using a keystroke dynamics identification system that uses digits only. This system can be adapted in ATMs, mobile phones, building’s electronic security keypad etc. The experiment was performed on 28 subjects, each of which was asked to type-in a 10-digit number using only their right-hand index fingers. The 10-digit passcode was the same for all users and it was carefully chosen to satisfy some conditions such as: memorizing factor and digit’s location on the keypad. The actual typing of the digits was restricted to using only the number-pad section of the keyboard. Participants were

requested to type this 10-digit passcode 50 times, throughout four sessions, which were held over four different days. This adds up to 200 repetitions per person; half of which were used as training data and the rest were used as test data. The features used in the timing vector were: hold time, down-down time and up-down time. The random forest classifier was used to determine the identity of the test data’s owner. An EER of 8.5% was achieved when using this model, which was less than the target that the authors hoped to reach. Therefore, some improvements were added to the original model in order to increase the performance; this included: outlier data handling and incorporating practice time. This improved the overall performance of the system to achieve an EER of 1.00%. Although practice time had improved the system performance rather a lot, it was not a realistic step to include in such a system, as it adds extra load on the users.

Liu et al. [105] used duration and latency as the original features. These original features were then transformed using: Fast Fourier Transform (FFT), Discrete Cosine Transform (DCT) and Gabor wavelets. Moreover, SVMs, Gaussian classifier, nearest neighbour were used in both the single classification model and a fusion of features and classifiers model. In this study, two large-scale databases were collected, each of which were split into two datasets. Database 1 consists of 1902 test samples and 477 training samples from 117 subjects while database 2 consists of 5089 test samples and 478 training samples from 102 subjects. The best accuracy was produced using original features and the Gaussian classifier reaching 3.6846 EER. Even though, this research has introduced many feature selection techniques and classification methods, these schemes did not produce similar results when applied to different datasets.

A summary of some of the keystroke dynamics studies employing fixed-text are listed in Table 2.4.

Table 2.4: A list of fixed-text keystroke dynamics studies.

Study Features Method Subjects Samples Performance

Garcia [53] Latency Mahalanobis distance - - 0.01% FAR, 50 % FRR Bleha et al. [99] Latency Minimum distance classifier, Bayesian classifier 26 780 2.8% FAR, 8.1 % FRR Joyce & Gupta [36] Latency Absolute distances 33 429 16.36 % FRR 0.25% FAR, Monrose & Rubin [33] Latency, duration Euclidean distance, non-weighted probability scheme, weighted

probability scheme

63 - 85.63% accuracy Obaidat & Macchairolo

[100] Latency Neural network 6 240 96.2% accuracy Obaidat & Sadoun [54] Latency, duration Neural network 15 3375 0.00% FAR, 0.00 % FRR

Ru & Eloff [101] Latency Fuzzy logic 30 750 87% accuracy Lin [106] Latency Neural network 90 270 0.00% FAR, 1.1 % FRR Robinson et al. [107] Latency, duration

Nearest neighbour, minimum intra-class distance, nonlinear

classifier, inductive learning classifier

140 2800 10% FAR, 9% FRR Lau et al. [45]

Latency, duration, press- release ordering, typing

speed

Mean, standard deviation 15 150 1.01% FAR, 1.25% FRR Araujo et al. [66] Latency, duration Absolute distance 30 300 1.98% FAR, 1.45

% FRR Bartlow & Cukic [57] Latency, duration, shift

key usage Random forests 53 10000

2.1% FAR, 15.1 % FRR Modi & Elliott [108] Latency, duration Absolute distance 40 6000 0.94% FAR,

94.87% FRR Jiang et al. [55] N-graph duration Hidden Markov model, Gaussian 58 2030 2.54% EER

Teh et al. [72] Latency, duration Gaussian similarity, direction similarity measure 50 500 6.36% EER Jin et al. [109] Latency, duration Fuzzy logic 10 1000 20% EER

Hwang et al. [103] Latency, duration

Gaussian, Parzen window, density estimators, k-nearest neighbor, K-

means, one-class SVMs 25 6300 3% EER Rybnik et al. [56] Latency, duration K-nearest neighbours 250 1100 88.5% accuracy Maxion & Killourhy

[104] Latency, duration Random forest 28 5600 8.5 EER Stefan & Yao [70] Latency PCA/SVMs 20 700 4.2% FAR

Narainsamy &

Soyjaudah [15] Latency, duration Neural network 40 400 1.00% FAR, 8% FRR Giot et al. [52] Latency SVMs 100 1200 15.28% EER Abualgasim & Osman

[110] Latency, duration Statistical weight 26 130

20% FAR, 0.00 % FRR Kekre et al. [111] Latency, duration Relative Entropy, Euclidian distance 33 330 55%EER

Raghu et al. [12] Latency Neural network 21 4347 0.00% FAR, 1.1 % FRR Zhong et al. [112] Latency, duration Nearest neighbour 51 20400 8.7% EER Rezaei & Mirzakochaki

[113] Latency, duration

Linear discriminate classifier, quadratic discriminant classifier, k-

nearest neighbour

24 240 19.20% FAR, 0.81% FRR Gunathilake et al. [65] Duration, frequencies Deviation of duration and frequencies 48 - 0.10% FAR, 0.12% FRR Hansen & Willmore [86] Latency, keystroke intensity Naive Bayes, neural network 14 1317 1.00% FAR, 25% FRR

Liu et al. [105] Latency, duration SVMs, Gaussian classifier, nearest neighbour 50 3154 7.66% EER Bours & Komanpally

[114] Latency, duration

Manhattan distance, scaled Manhattan distance, Euclidean distance, scaled

Euclidean distance

28 5600 13.0% EER

Descriptions of latency, duration and n-graph duration are found in Section 2.6. Definitions of accuracy, FAR, FRR, ERR and keystrokes are found in Section 2.7