IV. Dimensional Reduction Analysis
4.2 Dimensional Reduction Analysis Methods
4.2.6 Dimensionality Assessment
With relevance-ranked features in hand, DRA next involves selecting an appropriate level of dimensionality. Both qualitative and quantitative dimensionality assessment methods are possible. Prior RF-DNA DRA research, e.g. [89, 113, 121], examined qualitative DRA for RF-DNA fingerprint features; however, these were based on subjective assessments, which may not be precise. Herein, quantitative DRA approaches to estimate the intrinsic dimensionality of the data are developed. As noted by Jain et al. [213], an optimal approach to selecting features is to exhaustively examine the classifier results produced from all possible combinations of features. However, this is very computationally intensive (as Jain et al. [213] also note) and is not practical for large datasets such as the ZigBee RF-DNA data, where NFeats = 729. Therefore, quantitative DRA approaches that examine the intrinsic dimensionality of the data are developed and considered.
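To give a sense of why exhaustive search is impractical at NFeats = 729, the following minimal Python sketch counts the candidate feature subsets. The subset size of 50 is only an illustrative choice, and the variable names are hypothetical.

```python
import math

N_FEATS = 729  # number of ZigBee RF-DNA features stated in the text

# Every non-empty subset of features would need its own classifier run.
n_subsets = 2 ** N_FEATS - 1

# Even fixing the subset size (50 is an illustrative choice) leaves an
# enormous number of combinations to evaluate.
n_subsets_of_50 = math.comb(N_FEATS, 50)

# Report only the order of magnitude (number of decimal digits minus one).
print(f"non-empty subsets: about 10^{len(str(n_subsets)) - 1}")
print(f"subsets of size 50: about 10^{len(str(n_subsets_of_50)) - 1}")
```

Either count is far beyond what could be evaluated with one classifier run per subset, which motivates the intrinsic dimensionality estimates considered instead.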
4.2.6.1 Qualitative Dimensionality Assessment
Prior RF-DNA work, e.g. [89, 113, 121], examined qualitative DRA methods for RF-DNA in which subjective operator experience was used to select NDRA. This was done for features ranked by KS-test p-value or GRLVQI relevance values. To determine an appropriate number of ranked
features to retain, Dubendorfer et al. [113] examined various qualitative operating points corresponding to
NDRA = [25, 50, 100, 200, 243]    (4.18)
feature sets. These were evaluated using an MDA/ML classifier, with the conclusion that
NDRA = 50 features (selected using either KS-test p-values or GRLVQI relevance values)
offered sufficient classification performance. However, this quantity or proportion (50/729, or 6.86% of the available features) is not necessarily generalizable to other RF-DNA fingerprint datasets and applications. Additionally, it is not known how to systematically search for these quantities. Therefore, quantitative approaches based on the data itself are of particular interest.
4.2.6.2 Quantitative Dimensionality Assessment
Various quantitative dimensionality selection methods exist based on data covariance and correlation matrix responses [458–461]. Additionally, heuristics exist based on p-value significance and MDA-loadings magnitudes [358]. Of interest is the development of quantitative dimensionality assessment methods for RF-DNA applications using data covariance and correlation matrices, p-values, and MDA loadings.
(a) Heuristic-based Approaches on Discriminant Loadings
Discriminant loading magnitudes can also be used to estimate an appropriate number of features to retain. Various publications, e.g. [462–464], suggest that discriminant loading magnitudes greater than 0.30 indicate that a feature is significant.
Given that these works did not address scaled loadings, the heuristic value of 0.30 was applied to the Unscaled Max scores at SNR = 10 dB and yielded NDRA = 51 as the number of
loadings greater than 0.30 in each composite. Because NDRA = 51 is nearly identical to the
NDRA = 50 determined by [113], this lends credence to the qualitative method of [113], and
thus only NDRA = 50 will be examined further for consistency with prior work.
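The 0.30 heuristic amounts to a simple count over the loading magnitudes. The following is a minimal Python sketch of that count; the function name and the loading values are hypothetical and are not taken from the ZigBee data.

```python
def count_significant_loadings(loadings, threshold=0.30):
    """Count features whose absolute loading exceeds the threshold."""
    return sum(1 for value in loadings if abs(value) > threshold)


# Hypothetical loadings for six features (illustrative values only).
example_loadings = [0.72, -0.41, 0.30, 0.12, -0.05, 0.33]

# The comparison is strict, so a loading of exactly 0.30 is not counted.
n_dra = count_significant_loadings(example_loadings)
```

The absolute value is used because the heuristic refers to loading magnitudes, so strongly negative loadings count as significant too.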
(b) P-value based Approaches
Another approach to DRA assessment involves selecting NDRA from p-value
significance [358]. As described in Section (b), p-values tend to zero for RF-DNA fingerprints, and thus employing a p-value threshold for quantitative DRA could involve retaining a majority of the data. For instance, at SNR = 10 dB, a p-value threshold of 5%, a common statistical significance threshold, would retain NDRA = 674 features using the F-test or NDRA = 512 features using the KS-test.
Table IV-3 further presents the quantity of retained features using the F-test and KS-test at SNR = [0, 10, 18, 30] dB for different statistical significance levels. Significance levels of [0.1%, 1%, 5%, 10%] are employed as commonly used [465], although largely arbitrary [379], statistical thresholds. Comparing Table IV-3 with the results of [121] indicates that p-value DRA assessment heavily overestimates the number of features to retain, since phase (φ) features, NF = 243 herein, are known to offer
performance comparable to the baseline. Therefore, p-value dimensionality assessment does not appear appropriate for ZigBee RF-DNA data and is not considered further.
Table IV-3: Dimensionality Assessment by p-value and Significance Level, Reprinted from [49].
SNR      METHOD                      SIGNIFICANCE LEVEL
                                     0.1%    1%     5%     10%
0 dB     F-TEST P-VALUES             196     264    350    402
         KS-TEST SUMMED P-VALUES     37      74     130    160
10 dB    F-TEST P-VALUES             589     639    674    688
         KS-TEST SUMMED P-VALUES     337     414    512    557
18 dB    F-TEST P-VALUES             706     713    720    722
         KS-TEST SUMMED P-VALUES     666     692    711    716
30 dB    F-TEST P-VALUES             718     725    727    728
         KS-TEST SUMMED P-VALUES     727     729    729    729
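Each entry in Table IV-3 is a count of features whose p-value falls below the given significance level. A minimal Python sketch of that count is given here; the p-values are hypothetical placeholders rather than values from the F-test or KS-test results, and the strict inequality is an assumption.

```python
def count_retained(p_values, alpha):
    """Number of features whose p-value falls below the significance level."""
    return sum(1 for p in p_values if p < alpha)


# Hypothetical p-values for six features (illustrative values only).
example_p = [0.0004, 0.003, 0.02, 0.07, 0.2, 0.6]

# The four significance levels used in Table IV-3.
levels = [0.001, 0.01, 0.05, 0.10]
counts = {alpha: count_retained(example_p, alpha) for alpha in levels}
```

Because p-values shrink as SNR increases, nearly every feature passes any of these thresholds at high SNR, which is the over-retention visible in the table.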
(c) Data Covariance Matrix Approaches
DRA assessments based on the intrinsic dimensionality of the data can also be considered. By considering the eigenvalues of the data covariance (or correlation) matrix, one can estimate the data dimensionality. Given that RF-DNA features have consistent units, the covariance matrix was considered herein with three quantitative DRA assessment methods: Kaiser's Criterion, Maximum Distance Secant Line (MDSL), and Horn's Curve.
(i) Kaiser Criterion
The Kaiser criterion offers a basic estimate of NDRA in which the eigenvalues greater than the average eigenvalue are retained [237, 458, 466]; when correlation matrix eigenvalues are considered, this results in all eigenvalues greater than 1 being retained [467]. Although it can offer reasonable performance, it is also acknowledged as a rather arbitrary method [458]. Because this criterion is frequently generalized to simply selecting the eigenvalues above 1, both the appropriate criterion for covariance eigenvalues (above the mean) and the "above 1" criterion are presented. Kaiser's criterion at SNR = 10 dB suggests retaining NDRA = 191 features.
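The two variants of the Kaiser criterion differ only in the threshold applied to the eigenvalues. A minimal Python sketch, assuming the eigenvalues have already been computed elsewhere; the function name and eigenvalues are hypothetical.

```python
def kaiser_count(eigenvalues, threshold=None):
    """Count eigenvalues above a threshold; the default is the mean eigenvalue."""
    if threshold is None:
        threshold = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for ev in eigenvalues if ev > threshold)


# Hypothetical covariance eigenvalues (illustrative values only).
example_eigs = [5.0, 2.5, 1.2, 0.8, 0.3, 0.2]

above_mean = kaiser_count(example_eigs)      # "above the mean" rule
above_one = kaiser_count(example_eigs, 1.0)  # "above 1" rule
```

For correlation matrix eigenvalues the mean is 1, so the two rules coincide; for covariance eigenvalues they can differ, as in this example.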
(ii) Cattell's Scree Plot
One extension of the Kaiser criterion introduces visual subjectivity in the form of scree plots. Scree plots are two-dimensional plots of data covariance (or correlation) matrix eigenvalues versus rank order, and they provide a visual method of determining the dimensionality of the data [237]. Cattell's Scree Test involves visually examining the scree plot and selecting NDRA above the inflection point, the proverbial
"elbow in the curve" [458]. The difficulty with this method lies in selecting the actual inflection point and thus NDRA.
1. Maximum Distance Secant Line (MDSL)
The MDSL approach, introduced by Johnson et al. [468], aims to remove the subjectivity of Cattell's Scree Test through automation: 1) a line is created between the first and last rank-ordered eigenvalues, and 2) the point with the largest perpendicular distance from this line is found, i.e., the inflection point [468]. Using MDSL at SNR = 10 dB, NDRA = 26 features would be retained.
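The two MDSL steps can be sketched in Python as follows. This is a minimal illustration that works on raw (rank, eigenvalue) coordinates; whether [468] normalizes the axes, and whether NDRA is taken as the elbow rank itself or the rank just before it, are not stated here, so both are left as assumptions. The function name and eigenvalues are hypothetical.

```python
import math


def mdsl_index(eigenvalues):
    """Return the 1-based rank of the eigenvalue farthest from the secant line.

    The eigenvalues must be sorted in descending order.
    """
    n = len(eigenvalues)
    if n < 2:
        raise ValueError("at least two eigenvalues are required")
    # Step 1: secant line between the first and last rank-ordered eigenvalues.
    x1, y1 = 1, eigenvalues[0]
    x2, y2 = n, eigenvalues[-1]
    length = math.hypot(x2 - x1, y2 - y1)
    # Step 2: find the point with the largest perpendicular distance.
    best_rank, best_dist = 1, 0.0
    for rank, value in enumerate(eigenvalues, start=1):
        dist = abs((y2 - y1) * (rank - x1) - (x2 - x1) * (value - y1)) / length
        if dist > best_dist:
            best_rank, best_dist = rank, dist
    return best_rank


# Hypothetical scree curve (illustrative values only).
example_eigs = [10.0, 4.0, 1.5, 1.0, 0.8, 0.6, 0.5]
elbow = mdsl_index(example_eigs)
```

Because the perpendicular distance depends on the relative scale of the two axes, rescaling the eigenvalues or ranks can move the selected elbow.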
(iii) Horn's Curve
Horn's curve is another eigenvalue-based DRA assessment method in which eigenvalues are computed for a random dataset of the same size and rank as the ZigBee fingerprint set under analysis [469]. The method involves plotting the data sample correlation matrix eigenvalues against the Horn's curve eigenvalues [469]. The intrinsic dimensionality of the data is determined by counting the number of data eigenvalues that lie above Horn's curve [469]. Using the Horn's curve algorithm of Bigley [466], at
SNR = 10 dB Horn's curve indicated that NDRA = 157 features should be retained.
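The counting step of Horn's curve can be sketched in Python as follows. This is not the algorithm of Bigley [466]: it assumes the eigenvalues of the data and of the random datasets have already been computed elsewhere, and averaging over several random datasets to form the curve is an assumption of this sketch. All names and values are hypothetical.

```python
def horn_count(data_eigs, random_eig_sets):
    """Count data eigenvalues lying above the mean random-data eigenvalue curve."""
    n_sets = len(random_eig_sets)
    # Reference curve: rank-wise average over the random datasets.
    curve = [sum(column) / n_sets for column in zip(*random_eig_sets)]
    return sum(1 for d, r in zip(data_eigs, curve) if d > r)


# Hypothetical rank-ordered eigenvalues (illustrative values only).
data_eigs = [3.1, 1.6, 0.9, 0.3, 0.1]
random_eig_sets = [
    [1.5, 1.2, 1.0, 0.8, 0.5],
    [1.7, 1.2, 1.0, 0.7, 0.4],
]
n_dra = horn_count(data_eigs, random_eig_sets)
```

With a single random dataset, as the text describes, the reference curve is simply that dataset's rank-ordered eigenvalues.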
4.2.6.3 DRA Assessments and ZigBee RF-DNA Features
As the presented DRA assessments all provided different NDRA values, multiple DRA subsets must be considered. For comparison with qualitative methods, NDRA = [50, 100] subsets are examined for consistency with [113]. Additionally, a lower qualitative DRA assessment of NDRA = 10 is important to examine in order to understand performance when only a very limited subset of features is available, and thus how DRA methods fundamentally interact with classifier performance. The resultant set of dimensionalities is
NDRA = [10, 26, 50, 100, 157, 191],    (4.19)
which considers both quantitative and qualitative methods. Comparison with the full-dimensional NDRA = 729 feature set is also requisite to generate a performance baseline.
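For reference, the provenance of each value in (4.19) can be collected as in the following minimal Python sketch. The dictionary keys are descriptive labels chosen for illustration, not notation from the source.

```python
# Source of each candidate dimensionality, as reported in Section 4.2.6.
candidates = {
    "lower qualitative bound": [10],
    "MDSL": [26],
    "qualitative, per [113]": [50, 100],
    "Horn's curve": [157],
    "Kaiser criterion": [191],
}

# Flatten and sort to recover the set in (4.19).
n_dra_set = sorted(v for values in candidates.values() for v in values)

# Full-dimensional feature set used as the performance baseline.
baseline = 729
```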