THE FORMATION OF A SET OF INFORMATIVE FEATURES BASED ON THE FUNCTIONAL RELATIONSHIPS BETWEEN THE DATA STRUCTURE FIELD OBSERVATIONS

Artemenko M.V., Kalugina N.M., Dobrovolsky I.I.

South-West State University, Kursk,

e-mail: [email protected], [email protected], [email protected]

Methods are considered for forming a set of informative features (a tuple of linguistic variables) for solving diagnostic tasks in medical decision support systems. It is proposed to use the parameters of approximating polynomials, algebraic and logical functions, correlations, and exploratory clustering criteria to form the feature set and to calculate informativeness on the basis of rank sorting. A paradigm is formulated in which a private set of informative indicators is formed for each alternative node of a hierarchical differential-diagnostic decision tree.

Keywords: informativeness, approximating polynomial, differential diagnosis, analytic hierarchy process, tuple of linguistic variables

Modern medical care of the population relies on information and computer technologies that support the various stages of the treatment and diagnostic process [6]. The development of the theoretical basis and software tools of artificial intelligence for classification, pattern recognition, and forecasting has led to the creation of various specialized automated systems of support of acceptance of diagnostic solutions (ASSADS), that is, diagnostic decision support systems, for clinical medicine and for the training of health workers [1, 2, 6, 16].

The design of specialized ASSADS in medicine rests on building an adequate and effective knowledge base from decisive diagnostic rules that are synthesized and tested on clinically confirmed material, each element of which is characterized by a certain set of recorded, monitored, and controlled characteristics of the biological object or process. The problem of forming the set of informative features is important because the efficiency of subsequent diagnosis, with or without the use of ASSADS, depends on how well it is solved.

From a medical point of view, forming the set of informative features carries a semantic load: it amounts to forming a tuple of linguistic variables for the symptoms of a particular disease or condition of the body.

A feature of building ASSADS for clinical medicine is that, in real conditions, only small training and examination (control) samples of observations of the state of the biological object or process are available. The necessary and sufficient conditions that classical evidence-based medicine imposes on the volume of the investigated material are almost unrealizable when analysing open systems (which biological objects are) and when the recorded data are vague and inaccurate under conditions of uncertainty. In addition, the same system of features may be acceptably informative for one recognition task and completely unsuitable for another [13].

The formation of tuples of linguistic variables (sets of informative features) is the subject of many studies, fundamental among which are the works of G.S. Lbov (e.g., [10]). Consider a number of methods of forming a tuple (both previously studied and proposed by the authors) based on the following methodologies: the analytic hierarchy process (ordering is based on the obtained ranks, or weights); regression analysis and self-organizing structural-parametric identification of mathematical models by the group method of data handling (GMDH); and logical functions (identified, for example, by logic algorithms or artificial neural networks [5]).

At the beginning of a study, the feature set is specified in a non-formalized way: either with the help of experts (the Delphi technique or the fuzzy Delphi method [7], recommended for the analysis of biomedical information because of the way it is recorded) or by decree, taking into account the personal experience and knowledge of the researcher and an analysis of the specialized literature.


the above algorithms; fractal analysis applied to tensor data (e.g., in the diagnosis of Parkinson's disease); Grad, which is the same as the AddDel algorithm except that indicators are included in and excluded from the resulting set not one at a time but in groups.

(Note that the features may be directly measured, latent, or integral; as the latter, indicators of systemic organization can be used, whose application is considered in [3, 4].)

These algorithms analyse the characteristics of the data structure, for which it is suggested to use pairwise correlation coefficients and/or distances to the cluster centres. In this case, it is recommended to apply quality criteria [7]: the Dunn index, density indices, the total hypervolume, and the Davies-Bouldin index. That is, for a small sample, these algorithms and quality indicators (at specified values) generate sets of linguistic variables consisting of specific symptoms. The researcher specifies the "freedom of choice" of decision-making: the number of candidate sets from which the most informative ones are retained according to external criteria evaluated on the examination sample.

If exploratory cluster analysis cannot be carried out, a simple and semantically transparent method is proposed: the final set of linguistic variables retains those features that have the lowest correlation with the other retained features and the highest correlation with the discarded ones.
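The correlation-based selection described above can be sketched as a greedy backward elimination. This is only one possible reading of the rule, not the authors' procedure: the function names, the use of the mean absolute Pearson correlation as the elimination score, and the `keep` parameter are all assumptions made for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def select_weakly_correlated(columns, keep):
    """Greedy sketch (assumed reading): repeatedly discard the feature that is
    most strongly correlated, on average, with the features still retained.

    columns: dict mapping a feature name to its list of observed values.
    keep:    number of features to retain (at least 1).
    """
    retained = list(columns)
    discarded = []

    def mean_abs_corr(name):
        others = [o for o in retained if o != name]
        return sum(abs(pearson(columns[name], columns[o])) for o in others) / len(others)

    while len(retained) > max(keep, 1):
        worst = max(retained, key=mean_abs_corr)
        retained.remove(worst)
        discarded.append(worst)
    return retained, discarded


# Hypothetical data: "a" and "b" are nearly duplicates, "c" is unrelated to both.
data = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.1],
    "c": [5, 1, 4, 2, 3],
}
retained, discarded = select_weakly_correlated(data, keep=2)
```

Each discarded feature is strongly correlated with something that was still in the set when it was removed, so the retained features end up weakly correlated with each other.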

To decide whether a feature should be included in the informative set, it is proposed to use the decision-making methodology of T.L. Saaty [14]. A preference matrix W is created whose elements w_i,j express, on a nine-grade scale, how strongly feature i is preferred to feature j: w_i,j = 1 – equal preference, w_i,j = 2 – low degree of preference, w_i,j = 3 – medium preference, w_i,j = 4 – above-average preference, w_i,j = 5 – moderately strong preference, w_i,j = 6 – strong preference, w_i,j = 7 – very strong (obvious) preference, w_i,j = 8 – very, very strong preference, w_i,j = 9 – absolute preference.
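As an illustration of how such a preference matrix is typically used in Saaty's analytic hierarchy process, the sketch below fills the matrix from pairwise grades and extracts feature weights as the principal eigenvector by power iteration. The reciprocal rule w_j,i = 1 / w_i,j is the standard AHP convention and is not stated in the text above; the function names and the example grades are assumptions.

```python
def reciprocal_matrix(n, judgments):
    """Build an n x n pairwise matrix from grades on the 1..9 scale.

    judgments: dict {(i, j): grade} meaning "feature i is preferred to j by grade".
    The mirrored element is set to 1 / grade (standard AHP convention, assumed here).
    """
    w = [[1.0] * n for _ in range(n)]
    for (i, j), grade in judgments.items():
        w[i][j] = float(grade)
        w[j][i] = 1.0 / grade
    return w


def priority_vector(w, iterations=100):
    """Approximate the principal eigenvector of w by power iteration,
    normalised so that the weights sum to 1."""
    n = len(w)
    v = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        v = [x / total for x in nxt]
    return v


# Hypothetical, perfectly consistent judgments: weights in the ratio 4 : 2 : 1.
W = reciprocal_matrix(3, {(0, 1): 2, (0, 2): 4, (1, 2): 2})
weights = priority_vector(W)
```

For a perfectly consistent matrix such as this one, the weights reproduce the underlying ratios exactly; for real expert judgments they are a compromise between conflicting grades.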

Analysis of the matrix makes it possible to group the features into preference clusters by means of the IJ-conversion: row I is permuted with row J in the modified preference matrix so that the elements with the highest values cluster around the main diagonal. The permutation process stops when the sum of the products of each element value of the modified preference matrix W* and the distance of that element from the main diagonal reaches its minimum, according to the formula:

$$\sum_{i=1}^{N}\sum_{j=1}^{N} w^{*}_{i,j}\,\lvert i-j\rvert \to \min, \qquad (1)$$

where w*_i,j are the elements of W* and N is the number of analysed features before selection.
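The IJ-conversion can be sketched as a local search over pairwise swaps. The sketch assumes that the distance of element (i, j) from the main diagonal is |i - j| and that swapping features I and J permutes both the rows and the columns of the matrix; both points, the function names, and the example matrix are assumptions, and a swap-based search may stop in a local rather than a global minimum.

```python
from itertools import combinations


def diagonal_cost(w, order):
    """Sum of element values times their distance |i - j| from the main diagonal,
    for the matrix w with rows and columns reordered according to `order`."""
    n = len(order)
    return sum(w[order[i]][order[j]] * abs(i - j) for i in range(n) for j in range(n))


def ij_conversion(w):
    """Swap pairs of features while a swap lowers the cost.

    Returns the final feature order and its cost. The loop terminates because
    every accepted swap strictly decreases the cost.
    """
    order = list(range(len(w)))
    best = diagonal_cost(w, order)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(order)), 2):
            order[i], order[j] = order[j], order[i]
            cost = diagonal_cost(w, order)
            if cost < best:
                best = cost
                improved = True
            else:
                order[i], order[j] = order[j], order[i]  # undo the swap
    return order, best


# Hypothetical matrix: features 0 and 3 are strongly linked, as are 1 and 2.
W = [
    [0, 0, 0, 9],
    [0, 0, 9, 0],
    [0, 9, 0, 0],
    [9, 0, 0, 0],
]
order, cost = ij_conversion(W)
```

In this example the search moves the linked features next to each other, which halves the cost relative to the original ordering.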

It is proposed to determine the degrees of preference by ordering the features by their ranks of informativeness in descending order. The rank of informativeness of a diagnostic feature for SPDR is proposed to be determined in one of the following ways (or in all of them, with the results combined by known algorithms for decision-making over several alternatives).

Method 1. By the maximum gradient of functional differences (MGR), with or without taking into account a latent integral indicator of the systemic organization of functional states (proposed and validated by the school of A.V. Zavyalov [3]).

Method 2. By analysing the structure and parameters of the approximating Gabor polynomial [15].

Method 3. By analysing the structure of Boolean functions obtained by applying logic algorithms and software or logical artificial neural networks [5].

Method 4. By clustering quality indicators [7].

In the first method, a matrix of pairwise connectivity between variables (for example, the absolute value of the Pearson correlation coefficient) is built for each alternative class; an element is set to zero if the calculated value is below a certain threshold. For each class, the number of links is determined for the features that are candidates for inclusion in the informative tuple of linguistic variables, and the differences (gradients) are calculated, according to which the features i are arranged in descending order of Ks_i. For the ordered set of indicators, the ranks Rn_i are determined by formula (2):

(2)
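A minimal sketch of the first method follows, under stated assumptions: the number of links of a feature is taken as the number of other features whose absolute Pearson correlation with it reaches the threshold, the gradient is taken as the absolute difference of the link counts between the two classes, and the ordinal position after sorting stands in for the rank, since formula (2) is not reproduced here. The function names and the threshold value are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def link_counts(columns, threshold):
    """For each feature, count the other features with |r| >= threshold."""
    n = len(columns)
    counts = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(columns[i], columns[j])) >= threshold:
                counts[i] += 1
                counts[j] += 1
    return counts


def rank_by_gradient(class0, class1, threshold=0.7):
    """Return (gradient, order): the assumed gradient per feature and the feature
    indices sorted by descending gradient (ties keep their original order)."""
    ks0 = link_counts(class0, threshold)
    ks1 = link_counts(class1, threshold)
    gradient = [abs(a - b) for a, b in zip(ks0, ks1)]
    order = sorted(range(len(gradient)), key=lambda i: -gradient[i])
    return gradient, order


# Hypothetical samples: each class is a list of feature columns.
class0 = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 1, 3, 2]]
class1 = [[1, 2, 3, 4], [3, 1, 4, 2], [2, 4, 6, 8]]
gradient, order = rank_by_gradient(class0, class1)
```

A feature whose correlation structure changes between the classes receives a large gradient and moves to the front of the ordering.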

From the vector {Rn}, the preference matrix W is formed; the values of its elements are assigned according to the gradation proposed by T.L. Saaty (presented above), either by a knowledge engineer or automatically, by the formula

(3)

where w_i,i = 9.

The second method of forming the preference matrix of feature informativeness uses the approximating Gabor polynomial, formula (4). As the degree of the polynomial increases, the accuracy with which it approximates the function first increases and then decreases; this makes it possible to use the polynomial in the self-organizing algorithms of the group method of data handling (GMDH) [11, 12]. Note that GMDH can handle small samples and build the Gabor polynomial at interpolation nodes whose number is smaller than the maximum degree of the polynomial.

$$Y(Z) = \sum_{k=1}^{L} A_k \prod_{i=1}^{N} z_i^{\,p_{i,k}}, \qquad (4)$$

where Z = {z₁, z₂, …, z_N} is the set of arguments; Y(Z) is the response function (approximant); L is the number of terms in the polynomial; A_k and p_i,k are the identified model parameters; N is the number of arguments.
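The following sketch evaluates a polynomial of the form suggested by the parameter list above, Y(Z) = sum over k of A_k times the product over i of z_i raised to p_i,k. That product-of-powers form is an assumption inferred from the symbols A_k, p_i,k, L, and N; the function name and the example coefficients are hypothetical.

```python
from math import prod


def gabor_polynomial(z, coeffs, powers):
    """Evaluate sum_k A_k * prod_i z_i ** p_{i,k} (assumed form of the polynomial).

    z:      argument values z_1..z_N
    coeffs: coefficients A_1..A_L
    powers: L exponent lists, each holding p_{1,k}..p_{N,k}
    """
    return sum(
        a * prod(zi ** p for zi, p in zip(z, pk))
        for a, pk in zip(coeffs, powers)
    )


# Hypothetical model with L = 3 terms and N = 2 arguments:
# Y = 1.5 - 2.0 * z1 + 0.5 * z1 * z2**2
y = gabor_polynomial([2, 3], [1.5, -2.0, 0.5], [[0, 0], [1, 0], [1, 2]])
print(y)  # 6.5
```

A term with all exponents equal to zero acts as the constant term, so no separate intercept parameter is needed in this representation.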

It is proposed to determine the information content of an indicator from the set {X} by the following methods.

The first method is based on nonlinear discriminant functions identified for the classes w₁ and w₀ ("ill" and "not ill", "condition 1" and "condition 2"; that is, a binary hierarchical decision tree is assumed). Following the recommendations of [6], the response function for class w₀ is assigned values that lie in the range (–1 ± e) and are uniformly distributed, where e is determined from N₀ and N₁, the volumes of the training samples for classes w₀ and w₁, respectively. The response for class w₁ is formed similarly in the range (1 ± e). The orthogonal GMDH algorithm is then used for the structural-parametric identification of polynomial (4).
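The construction of the target responses can be sketched as follows. The expression that defines e from N₀ and N₁ is not reproduced here, so e is simply passed in as a parameter; the function name and the fixed seed are assumptions made so that the example is reproducible.

```python
import random


def make_responses(n0, n1, e, seed=0):
    """Draw uniformly distributed target responses for the two classes:
    class w0 in the range (-1 - e, -1 + e), class w1 in the range (1 - e, 1 + e)."""
    rng = random.Random(seed)
    y0 = [rng.uniform(-1 - e, -1 + e) for _ in range(n0)]
    y1 = [rng.uniform(1 - e, 1 + e) for _ in range(n1)]
    return y0, y1


# Hypothetical sample volumes and half-width of the range.
y0, y1 = make_responses(n0=4, n1=6, e=0.05)
```

The resulting vectors would then serve as the values of Y(Z) when the polynomial is identified, so that the sign of the fitted response separates the two classes.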

Next, the share of influence of each term of the polynomial is determined in each class:

(5)

where the operator denotes taking the modal value of Z.

Then, for each argument included in the k-th term, the weight of the multiplicand is calculated by the formula

(6)

Finally, the value of the additive-multiplicative effect of indicator x_i on the response function (according to the parameters of the discriminant approximant) is determined for each alternative class by the formula

(7)

A relative "difference" error ε < 0.5 is introduced (0.01 ≤ ε < 0.1 is recommended), and the values of the multiplicative effects are recalculated by formula (8):

(8)

Next, for each class (w₁ and w₀), the features (linguistic variables) are arranged in descending order of the recalculated effect values. As a result, two tuples of features are formed, one for each class. Applying formula (2) to the obtained tuples, with G_i replaced by the corresponding effect values, generates two sets of ranks. From these, the final set of informative features is formed, of a researcher-specified volume NI ≤ N; its elements are taken from the original set {X} by merging the two rank sets sequentially in descending order of rank. When alternative inclusions arise, one of the following is applied: "hand control" (the knowledge and experience of the researcher), the Monte Carlo method, or a reduction of ε followed by a repetition of the ranking procedure.

It is proposed to determine the information content Inf(x_j) of a feature by the formula

(9)

where the arguments are the rank values of x_j in w₀ and w₁, respectively.

The second method of forming the set of informative indicators and calculating Inf(x_j) is based on the preliminary identification of the approximating Gabor polynomial (4) for each indicator from the initial set {X}. In this case, the identification procedure is repeated N times for each of the classes w₀ and w₁, sequentially forming the set {Z} = {X} – x_j and the response Y(Z) = x_j.

As a result, a set of approximants is generated for each of the alternative classes, containing M₀ and M₁ approximants, respectively (M₀ ≤ N, M₁ ≤ N, M₀ ≠ 0, M₁ ≠ 0). Note that approximants whose coefficient of determination is below a researcher-specified threshold do not take part in further analysis. If the selection yields an empty set of approximants, the approximants with the highest coefficients of determination are returned to it one by one. The minimum size of the set of approximants is the "freedom of choice" (the recommended value is 3 to 7).
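The selection of approximants by coefficient of determination can be sketched for a single class as follows. An ordinary linear least-squares fit stands in for the Gabor polynomial identified by GMDH, which is a deliberate simplification; the function names, the `fallback` parameter (playing the role of the "freedom of choice"), and the data are assumptions. The solver expects features that are not exactly collinear.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def r_squared(rows, j):
    """R^2 of a linear fit (with intercept) of feature j on the remaining features."""
    y = [row[j] for row in rows]
    design = [[1.0] + [v for k, v in enumerate(row) if k != j] for row in rows]
    p = len(design[0])
    xtx = [[sum(d[a] * d[b] for d in design) for b in range(p)] for a in range(p)]
    xty = [sum(d[a] * t for d, t in zip(design, y)) for a in range(p)]
    beta = solve(xtx, xty)
    pred = [sum(c * v for c, v in zip(beta, d)) for d in design]
    mean = sum(y) / len(y)
    ss_res = sum((t - q) ** 2 for t, q in zip(y, pred))
    ss_tot = sum((t - mean) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def select_approximants(rows, threshold, fallback=3):
    """Keep the features whose approximant reaches the R^2 threshold; if none does,
    return the `fallback` features with the highest R^2 instead."""
    scores = {j: r_squared(rows, j) for j in range(len(rows[0]))}
    kept = [j for j, s in scores.items() if s >= threshold]
    if not kept:
        kept = sorted(scores, key=scores.get, reverse=True)[:fallback]
    return kept, scores


# Hypothetical class sample: x1 is roughly 2 * x0, x2 is unrelated to both.
rows = [
    [1, 2.1, 5], [2, 3.9, 1], [3, 6.2, 4],
    [4, 7.8, 2], [5, 10.1, 3], [6, 11.9, 6],
]
kept, scores = select_approximants(rows, threshold=0.9)
```

In the scheme described above, the same selection would be run separately for each of the two classes, giving the two sets of approximants.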

Next, a matrix is formed for each alternative class; the numbers of rows are M₀ and M₁, respectively, the number of columns equals the number of indicators in the set {X}, and the element values are calculated by formulas similar to (5)–(8). From the resulting matrices, two vectors are formed (one per class), whose values are calculated by formulas (10):

(10)

For each class (w₁ and w₀), the indicators x_i are sorted in descending order of the values of the corresponding vector. Thus, two tuples of indicators are formed for the alternative classes.

Setting ε, forming the tuples, applying formula (2), forming the set of informative features, and calculating the information content then proceed as in the first method.

In method 3, the linguistic variables take the values "true" (1) or "false" (0). With a certain accuracy (diagnostic performance in medical applications), the approximant of the response is represented by formula (11) (the indices and variables have counterparts in (2)).

(11)

where zb ∈ {ZB} is a logic exception.

form of formula (12), based on logical functions that are analogues of arithmetic operations.

p_k = {0, 1}, (12)

Formulas (5) to (10) and the conclusions that follow from them are then applied.

Method 4 proposes to order the features (linguistic variables), then calculate the ranks, include features in the informative tuple, and calculate informativeness in the same way as in the previously discussed methods, on the basis of the hypervolume H (and/or the density index PD) considered in [7]. An exploratory clustering procedure is carried out, and the change in clustering quality when the analysed feature is excluded from consideration is calculated by the formula

(13)

where the first pair of matrices are the covariance matrices of the classes w₀ and w₁ in the initial set {X}; the second pair are the correlation matrices of the classes w₀ and w₁ in the set {{X} – x_j} (feature x_j excluded); det( ) denotes the determinant of a matrix.

Here, covariance matrices are understood as the matrices calculated by the formulas

(14)

where N₀ and N₁ are the numbers of objects in classes w₀ and w₁, respectively; the first pair of vectors are the coordinate vectors of the i-th object in the respective clusters; the second pair are the coordinate vectors of the centres of classes w₀ and w₁.

Note that the change can take both positive and negative values; a negative value means that after the exclusion the quality of the classification according to the hypervolume H has deteriorated.
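The change in clustering quality on excluding a feature can be sketched as follows. Since the formula itself is not reproduced here, the sketch makes several assumptions: the hypervolume H is taken as the sum over both classes of the square root of the determinant of the class covariance matrix, the covariance is normalised by the number of objects, and the change is H before exclusion minus H after exclusion, so that a negative value means H has grown. All function names are hypothetical.

```python
from math import sqrt


def covariance(rows):
    """Covariance matrix of a class about its centre, normalised by the object count."""
    n, d = len(rows), len(rows[0])
    center = [sum(r[k] for r in rows) / n for k in range(d)]
    return [
        [sum((r[a] - center[a]) * (r[b] - center[b]) for r in rows) / n for b in range(d)]
        for a in range(d)
    ]


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, result = len(m), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result
        result *= m[col][col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= f * m[col][c]
    return result


def hypervolume(class0, class1):
    """Assumed quality measure: sum of sqrt(det(covariance)) over both classes."""
    return sum(sqrt(max(det(covariance(c)), 0.0)) for c in (class0, class1))


def quality_change(class0, class1, j):
    """H with all features minus H with feature j excluded (assumed sign convention)."""
    def drop(rows):
        return [[v for k, v in enumerate(r) if k != j] for r in rows]
    return hypervolume(class0, class1) - hypervolume(drop(class0), drop(class1))


# Hypothetical classes: each row is an object, each column a feature.
class0 = [[0, 0], [2, 0], [0, 2], [2, 2]]
class1 = [[5, 5], [7, 5], [5, 9], [7, 9]]
changes = [quality_change(class0, class1, j) for j in range(2)]
```

Note that hypervolumes computed in spaces of different dimension are compared directly here, which mirrors the one-feature-at-a-time exclusion described above rather than any normalised measure.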

The disadvantage of this method is that the excluded feature is analysed as a single representative rather than together with others in a tuple. A complete enumeration of the different variants would require large computational resources, while forgoing it usually results in a negligible loss of diagnostic quality (or none at all).

In conclusion, we note the following.

1. In the proposed methods, the information content of a feature is determined for each "branching" of the decision tree that assigns the object or process to alternative classes. Thus, instead of the paradigm of determining a single informative tuple of linguistic variables common to the full set of alternative classes (and subsequently synthesizing diagnostic rules), it is proposed to move to the paradigm of determining the information content of the feature basis for each hierarchical differential division.

2. If the corresponding transition is made in formula (2), the binary feature values give way to interval estimates of feature values, membership functions of fuzzy sets, or belief functions in decision theory.

Thus, in the course of the study, new nonparametric methods were developed for forming informative tuples that describe the observable and/or controllable features (linguistic variables) of a biological object (recorded, calculated, and latent, in numeric and logical metrics). Under conditions of semi-structured, imprecise data, these methods support the synthesis of diagnostic decision rules for the knowledge bases of decision support systems in various segments of the automation of the intellectual activity of decision makers, on the basis of modern computer and information technologies.


References

1. Artemenko M.V., Babkov A.S. Classification of methods of forecasting the behavior of systems // Modern problems of science and education. – 2013. – № 6; URL: http://www. science-education.ru/ru/article/view?id=11527 (date accessed: 8.06.2016).

2. Artemenko M.V., Dobrovolsky I.I., Mishustin V.N. Information-analytical support of automated classification on the basis of direct and inverse decision rules on the example of prediction of thromboembolic disease // Modern high technologies. – 2015. – № 12–2. – P. 199–205.

3. Artemenko M.V., Korenevsky N.A., Jelinkova L.A. Diagnostics of the health of the newborn through systemic analysis of pregnant indicators // Bulletin of new medical technologies. – 2003. – T. 10. – № 3. – P. 50–52.

4. Artemenko N.M. Recognition of the state of human lungs in that they produce acoustic noise // Proceedings of Southwest State University. Series: Management, computer engineering, computer science. Medical devices. – 2015. – № 2 (15). – P. 94–98.

5. Barsky A.B. Logical neural networks. – M.: NOU "Intuit", 2016. – 492 p.

6. Vorontsov I.M., Shapovalov V.V., Sherstuk Y.M. Health. Experience in the development and justification of the application of automated systems for monitoring and screening diagnosis of health disorders. – SPb.: OOO "IPK "Costa", 2006. – 432 p.

7. Demidova L.A., Kirakovskii V.V., Pylkin A.N. Decision-making in conditions of uncertainty. – 2nd ed., revised. – M.: Hot line – Telecom, 2015. – 283 p.

8. Zagoruiko N.G., Kutnenko O.A. The division Algorithm for selecting informative subspaces of signs. Institute of Mathematics SB RAS (access point http://pandia.ru/text/78/248/79351.php).

9. Zhvalevsky A.V. The selection of informative features: setting objectives and methods of its solution // Proceedings of SPIIRAS. – 2007. – Vol. 4. – P. 416–426.

10. Lbov G.S., Startseva N.G. Logical decision functions and the question of statistical stability of solutions. – Novosibirsk: publishing house of Institute of mathematics, 1999. – 212 p.

11. Orlov A.A. Principles of the architecture of a software platform for implementing the algorithms of the group method of accounting arguments // Control systems and machines. – 2013. – № 2. – P. 65–71.

12. The multiplicative approximation method of group accounting of arguments // Certificate of official registration of a computer program № 2007611654 of 25.04.2007.

13. Research library: natural science selected publications. – URL: http://sernam.ru.

14. Saaty Thomas L. Decision making with dependence and feedbacks: analytical networks. Transl. from English / ed. by A.V. Andreychikov, O.N. Andreichikova. 4th ed. – M.: LENAND, 2015. – 360 p.

15. Korn G., Korn T. Handbook on mathematics for researchers and engineers. – M.: Science, 2007. – 789 p.