
Chapter 2. Predicting Failures of High Tech Innovations-in-Use: Application of

2.4 Research Design and Methodological Foundation

2.4.1 Classification of Devices

Classifying and organizing the data is an important step towards building a robust predictive model. The data in the recall, adverse event, and patient databases are unstructured and noisy, whereas the data in the approval database are comparatively well organized and complete. The approval database also contains fairly complete device classification information.

Therefore, for data points that lack classification information, the classification task is primarily one of linking the recall and adverse event databases to the approval database.

Linking requires a suitable primary key that is common to all the relevant databases, in particular the MAUDE and patient databases. However, no codified key is shared across all these databases; the only common identifier is the combination of device name and manufacturer name.

The problem with the device name and manufacturer name combination is that neither the product name nor the manufacturer name has a standard codified format in any of the databases. We therefore used the generic device names from the approval databases, together with the device names from the other databases, to classify and link the records. Since there is no direct way to perform this classification, we used a modified text mining method.

Apart from classification and linking, much of the key information on several variables, such as severity, device failure type, causes of failure, and device description, is embedded in plain-text paragraphs. Extracting and codifying this information also required text analytics.

Text classification is the task of identifying which class, among a finite number of defined classes, a group of words belongs to (Madsen, Kauchak and Elkan, 2005). It is common to represent a string such as a device name as a collection of words or, in text mining terminology, a “bag-of-words.”
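As an illustration of the bag-of-words representation, the following minimal Python sketch turns a device name into word counts. The tokenization rule (lower-casing and splitting on non-alphanumeric characters) and the example device name are assumptions made for illustration; the source does not specify how names were tokenized.

```python
import re
from collections import Counter


def bag_of_words(device_name: str) -> Counter:
    """Return word counts for a device name (assumed tokenization rule)."""
    # Lower-case the name and keep runs of letters/digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", device_name.lower())
    return Counter(tokens)


# Hypothetical device name, used only to show the representation.
bag = bag_of_words("Catheter, Balloon, Angioplasty Catheter")
print(bag)
```

Word order and punctuation are discarded; only the multiset of words is kept, which is what allows names written in inconsistent formats to be compared.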

The basic algorithm we used is a Bayes classification approach that uses a Dirichlet distribution for the classification task (a modified Latent Dirichlet Allocation, LDA). To train the Naïve Bayes classifier, we used the device names from the approval database to create a frequency table. The rows of the table are the unique words from the device names, and the columns represent the different usage classes of the devices. Each cell holds the number of times a particular word appears in a particular class, so the counts along a row sum to the total number of times that word appears across all device classes.
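The frequency table described above can be sketched in a few lines of Python. The device names, class labels, and whitespace tokenization below are hypothetical placeholders, not data or preprocessing from the approval database.

```python
from collections import defaultdict


def build_frequency_table(labelled_names):
    """Return {word: {class: count}} from (device_name, class) pairs."""
    table = defaultdict(lambda: defaultdict(int))
    for name, cls in labelled_names:
        for word in name.lower().split():
            table[word][cls] += 1  # one cell per (word, class)
    return table


# Hypothetical training pairs standing in for approval-database records.
training = [
    ("balloon catheter", "cardiovascular"),
    ("guide catheter", "cardiovascular"),
    ("urinary catheter", "urology"),
]

table = build_frequency_table(training)
# Row total: how often the word occurs across all classes.
row_total = sum(table["catheter"].values())
print(dict(table["catheter"]), row_total)
```

Each outer key corresponds to a row (a unique word) and each inner key to a column (a usage class).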

Let $n_{ij}$ represent the number of times word $w_i$ appears in class $c_j$. Naïve estimates of some of the … The classification problem is to find the probability that a device belongs to a class $c_j$ given that its name contains a set of words $w = \{w_1, \ldots, w_k\}$. In text classification, the probability distribution of a word being in a specific class out of several possible classes is modeled as a multinomial or a Dirichlet distribution. A Dirichlet distribution has been shown to perform better when the word distribution is sparse. It is also more appropriate when the class distribution is uneven, i.e., when a few classes appear much more often in the sample than the rest. We assume that the probability that a word with frequency $n_{ij}$ belongs to a specific class $c_j$ is distributed according to the Dirichlet process $\mathrm{Dir}(n_{ij} \mid \alpha_j)$. The full conditional probability is then given by the Bayes formula [2.9] below, using a standard Dirichlet distribution for text classification (Madsen et al., 2005; Nigam et al., 2000; Blei et al., 2003). See Appendix A.4 for proof.
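To make the Bayes step concrete, the sketch below computes class posteriors with a multinomial Naïve Bayes model smoothed by a symmetric Dirichlet prior. This is a simplified stand-in, not the modified LDA used in this study; the prior strength `alpha`, the function names, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labelled_names):
    """Collect per-class word counts, class sizes, and the vocabulary."""
    word_counts = defaultdict(Counter)  # class -> word -> count
    class_counts = Counter()            # class -> number of device names
    vocab = set()
    for name, cls in labelled_names:
        words = name.lower().split()
        word_counts[cls].update(words)
        class_counts[cls] += 1
        vocab.update(words)
    return word_counts, class_counts, vocab


def posterior(words, word_counts, class_counts, vocab, alpha=1.0):
    """P(class | words) under a symmetric Dirichlet(alpha) prior (assumed)."""
    n_names = sum(class_counts.values())
    log_scores = {}
    for cls in class_counts:
        total = sum(word_counts[cls].values())
        score = math.log(class_counts[cls] / n_names)  # log prior
        for w in words:
            if w in vocab:  # words never seen in training are skipped
                smoothed = (word_counts[cls][w] + alpha) / (total + alpha * len(vocab))
                score += math.log(smoothed)
        log_scores[cls] = score
    # Normalise in log space to avoid underflow.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


# Hypothetical training pairs.
training = [
    ("balloon catheter", "cardiovascular"),
    ("guide catheter", "cardiovascular"),
    ("urinary catheter", "urology"),
]
model = train(training)
post = posterior(["balloon", "catheter"], *model)
print(post)
```

The Dirichlet prior keeps the estimate finite for words that never occur in a class, which matters when the word-by-class table is sparse.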

The 510(k) and PMA approval databases generated more than 150,000 unique words, many of them generic words such as “and,” “the,” and “of.” To test the algorithm, we split the data into an 80% train-set and a 20% test-set. A first run of the Naïve Bayes classification scheme yielded a correct classification rate of 66%.

Some words are clearly more informative than others, so to improve classification accuracy we needed to screen the words by their information content. We calculated the entropy value for each unique word using the Shannon entropy equation [2.10]:

$$E(w_i) = -\sum_{j} \frac{n_{ij}}{\sum_{j'} n_{ij'}} \log\left(\frac{n_{ij}}{\sum_{j'} n_{ij'}}\right) \qquad [2.10]$$
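A minimal Python sketch of equation [2.10] follows. The natural logarithm is an assumption, since the base is not stated, and the word counts are hypothetical.

```python
import math


def word_entropy(class_counts):
    """Shannon entropy of one word's distribution over device classes."""
    total = sum(class_counts.values())
    entropy = 0.0
    for n in class_counts.values():
        if n > 0:  # 0 * log(0) is taken as 0
            p = n / total
            entropy -= p * math.log(p)
    return entropy


# Hypothetical rows of the word-by-class frequency table.
table = {
    "catheter": {"cardiovascular": 5, "urology": 5},  # spread evenly
    "stent": {"cardiovascular": 10, "urology": 0},    # one class only
}
entropies = {word: word_entropy(counts) for word, counts in table.items()}
print(entropies)
```

A word spread evenly over two classes reaches the maximum entropy for two classes, while a word confined to a single class has zero entropy.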

Words with high information content about class membership have low entropy. We therefore removed words from the train-set as well as the target classification-set using a threshold function, so that all words with an entropy value greater than the entropy threshold parameter are not considered. Words with a very low total frequency count across classes were also removed using a second, frequency threshold; these are mostly words that are very specific to some device or firm. We chose the two threshold values so as to minimize the classification error on the test set, using a grid search algorithm in R (www.cran.r-project.org).

After calibrating the algorithm with the train set and test set and obtaining the optimal threshold parameters, we trained the algorithm on the complete data from the approval database. The model was then run on the target sets of device names from the recall, adverse event, and patient databases. For the adverse event and patient databases, only those data points whose classification was not available in the original database were reclassified. On the 20% hold-out sample (the test-set) we achieved (92.5 ± 1.4)% classification accuracy. The data organization process and the resulting data-sets are presented in Table 2.1. An algorithm for the classification method (Algorithm 1) is presented in Appendix A.1.
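The threshold calibration can be sketched as a small grid search. The study performed this step in R; the Python version below is only an illustration, and the count-voting classifier, the grid values, and the toy data are hypothetical stand-ins for the actual classifier and databases.

```python
import itertools
import math


def entropy(counts):
    """Shannon entropy (natural log) of a word's class distribution."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values() if n > 0)


def keep_words(table, max_entropy, min_count):
    """Words passing both screens: low entropy and sufficient frequency."""
    return {
        word for word, counts in table.items()
        if entropy(counts) <= max_entropy and sum(counts.values()) >= min_count
    }


def classify(name, table, vocab):
    """Toy classifier (assumed): sum class counts of the retained words."""
    scores = {}
    for word in name.split():
        if word in vocab:
            for cls, n in table[word].items():
                scores[cls] = scores.get(cls, 0) + n
    return max(scores, key=scores.get) if scores else None


def grid_search(table, test_set, entropy_grid, count_grid):
    """Return (error, max_entropy, min_count) with the lowest test error."""
    best = None
    for max_entropy, min_count in itertools.product(entropy_grid, count_grid):
        vocab = keep_words(table, max_entropy, min_count)
        wrong = sum(classify(name, table, vocab) != cls for name, cls in test_set)
        error = wrong / len(test_set)
        if best is None or error < best[0]:
            best = (error, max_entropy, min_count)
    return best


# Hypothetical word-by-class counts and held-out names.
table = {
    "the": {"cardio": 9, "ortho": 1},    # generic, relatively high entropy
    "stent": {"cardio": 5, "ortho": 0},
    "screw": {"cardio": 0, "ortho": 5},
    "acme": {"cardio": 1, "ortho": 0},   # rare, firm-specific word
}
test_set = [("the screw", "ortho"), ("the stent", "cardio"), ("acme screw", "ortho")]

best = grid_search(table, test_set, entropy_grid=[0.1, 0.5], count_grid=[1, 2])
print(best)
```

In this toy setting the looser entropy threshold lets the generic word "the" outvote "screw", so the search settles on the stricter threshold. When several parameter pairs tie on error, the first pair visited is kept.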

Table 2.1 Source and Description of Databases for the Empirical Analysis

2.4.2 Variable Generation and Variable Description