Research on Querying Node Probability Method in Probabilistic XML
Data Based on Possible World
Wang Jianwei 1,3, Hao Zhongxiao 1,2
1: Computer Science and Technology College, Harbin University of Science and Technology,
Harbin 150080, China
email:[email protected]
2: Computer Science and Technology College, Harbin Institute of Technology,
Harbin 150001, China
3: Information and Computer Engineering College, Northeast Forestry University,
Harbin 150040, China
doi:10.4156/jdcta.vol4.issue8.26
Abstract
To address the low efficiency of querying the probability of a single node directly over the set of all ordinary XML data obtained by enumerating the possible world set of the corresponding probabilistic XML data, a method is presented in which the probabilistic XML data of the possible world set is represented by semi-structured information units and modeled as a probabilistic XML data tree. The rules for decomposing a probabilistic XML tree under the possible world model are then discussed, and an algorithm for decomposing the probabilistic XML tree according to the normal form is presented. An algorithm for querying the single node probability is designed over the set of decomposed probabilistic XML sub-trees. The instance analysis shows that the method is effective and correct.
Keywords: Possible World, Semi-structured Information Unit, Probabilistic XML Tree, Decomposing, Node Probability, Query
1. Introduction
Since XML emerged as the de facto standard for data formats on the web [1], probabilistic XML data has been proposed as a special style of XML data since 2001. It can represent the distribution of the data in an XML element in the form of probabilistic data. How to query the probability of an element is therefore a problem that probabilistic XML data management has faced ever since probabilistic XML databases were created [2,3,4].
Because querying the distribution of element values in a probabilistic XML database differs from querying elements in an XML database, researchers have mainly focused on querying element names according to element values, and querying and computing element probabilities has received less attention in probabilistic XML data management [5,6,7]. Probability computation after a simple element query is studied in [5], but the method of classifying the nodes in the probabilistic XML tree, a specific algorithm for querying the node probability and a complexity analysis of querying were not given. Building on [5], the probability computation problem after a simple node query was also studied in [6] on the basis of algorithms for querying ordinary XML trees, but an algorithm for decomposing the probabilistic XML tree was not given. Algebraic operations based on the twig concept have also been studied for querying probabilistic XML data.
The contributions of this work include:
(1). We propose a method of representing probabilistic XML data in the form of semi-structured information units according to the possible world model.
(2). We give the definition of the probabilistic XML data tree and descriptions of the 1NF and 3NF forms of the probabilistic XML tree.
(3). We analyze the decomposing rules, so that a 1NF probabilistic XML tree can be decomposed into a 3NF probabilistic XML tree.
(4). Algorithm PT2PTS is proposed to decompose a 1NF probabilistic XML tree into a 3NF probabilistic XML tree.
(5). Algorithm querynodeprob is proposed to query the probability of a single node in the probabilistic XML data tree.
(6). The instance analysis verifies the efficiency of the two algorithms.
2. Representing Method of Probabilistic XML Data Based on Possible World
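In probabilistic data management, the common model is the possible world model. The certain database instances obtained from a probabilistic database are called possible world instances, and the probabilities of all possible world instances sum to 1 [8,9]. A series of definitions on semi-structured information units are listed as follows [10].
Definition 1: A Semi-Structured Information Unit (SIU) is defined as follows:
If $\varphi$ is an element name and $\psi$ is an element value, then $\langle\varphi\rangle\,\psi\,\langle/\varphi\rangle$ is an SIU.
If $\varphi$ is an element name, $\psi$ is an element value, $\theta$ is an attribute name and $k$ is an attribute value, then $\langle\varphi\ \theta = k\rangle\,\psi\,\langle/\varphi\rangle$ is also an SIU.
If $\varphi$ is an element name and $\sigma_1,\dots,\sigma_n$ are SIUs, then $\langle\varphi\rangle\,\sigma_1,\dots,\sigma_n\,\langle/\varphi\rangle$ is also an SIU.
Definition 2: Semi-Structured Information Unit isomorphism is defined inductively as follows:
If $\langle\varphi\rangle\,\psi\,\langle/\varphi\rangle$ is an SIU, where $\varphi$ is an element name, then $\varphi(\psi)$ is isomorphic with $\langle\varphi\rangle\,\psi\,\langle/\varphi\rangle$.
If $\langle\varphi\ \theta = k\rangle\,\psi\,\langle/\varphi\rangle$ is an SIU, where $\varphi$ is an element name, then $\varphi(\psi, k)$ is isomorphic with $\langle\varphi\ \theta = k\rangle\,\psi\,\langle/\varphi\rangle$.
If $\langle\varphi\rangle\,\sigma_1,\dots,\sigma_n\,\langle/\varphi\rangle$ is an SIU, $\sigma_1'$ is isomorphic with $\sigma_1$, ..., and $\sigma_n'$ is isomorphic with $\sigma_n$, then $\varphi(\sigma_1',\dots,\sigma_n')$ is isomorphic with $\langle\varphi\rangle\,\sigma_1,\dots,\sigma_n\,\langle/\varphi\rangle$.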
Definition 3: Probability_valid Semi-Structured Information Unit. $\langle probability\rangle\,\sigma_1,\dots,\sigma_n\,\langle/probability\rangle$ is a probability_valid Semi-Structured Information Unit iff each $\sigma_i \in \{\sigma_1,\dots,\sigma_n\}$ is of the form $\langle prob\ value = k\rangle\,\psi\,\langle/prob\rangle$, where $k \in [0,1]$ and $\psi$ is an element value.
Because probabilistic XML data is nested semi-structured XML data, when the probability of the pair <elementname, elementvalue> in the possible world set is pi, the representing form is described as follows [11]:
<dist>
  <poss value='p1'>
    <elementname1>elementvalue1</elementname1>
    …
    <elementnamen>elementvaluen</elementnamen>
  </poss>
  ...
  <poss value='pn'>
    <elementname1>elementvalue1</elementname1>
    …
    <elementnamen>elementvaluen</elementnamen>
  </poss>
</dist>
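As a rough illustration of the dist/poss form above, the following Python sketch parses a small fragment with the standard-library xml.etree.ElementTree module and checks that the probabilities of the poss elements under one dist element lie in [0,1] and sum to 1. The fragment, the element name rank and the helper name poss_probabilities are illustrative assumptions, not part of the original method.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical fragment in the <dist>/<poss> form; names and values are illustrative.
FRAGMENT = """
<dist>
  <poss value='0.3'><rank>ass</rank></poss>
  <poss value='0.7'><rank>ins</rank></poss>
</dist>
"""

def poss_probabilities(dist):
    """Return the probability attached to each <poss> child of a <dist> element."""
    return [float(poss.get("value")) for poss in dist.findall("poss")]

dist = ET.fromstring(FRAGMENT)
probs = poss_probabilities(dist)
# Each probability must lie in [0, 1] and all possibilities together must sum to 1.
is_valid = all(0.0 <= p <= 1.0 for p in probs) and math.isclose(sum(probs), 1.0)
print(probs, is_valid)
```

The check uses math.isclose rather than exact equality because the probabilities are stored as decimal strings and converted to floating point.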
Because a semi-structured information unit can be represented as a data tree [12], the related definitions of the probabilistic XML tree are given as follows.
Definition 4: A data tree $T$ is a 5-tuple $T = (N, E, r, label, value)$, where $N$ is a finite set of nodes and $E \subseteq N \times N$ is a tree rooted in $r \in N$; $label: N \to name$ associates a label with each node in $N$; $value$ associates a value in $V$ with the leaves of the tree, where a leaf is a node in $N - \{r\}$ that has no child.
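Definition 5: A Possible Worlds (PW) set $PT$ is a finite set of pairs $(t_i, p_i)$, where each $t_i$ is a data tree, each $p_i$ is a positive real and $\sum_{i=1}^{n} p_i = 1$.
Definition 6: A probabilistic XML data tree is a 6-tuple $PT = (N, E, r, \lambda, V, prob)$. $N = N_{ord} \cup N_{dist} \cup N_{poss}$ is a finite set of nodes, where $N_{ord}$ is the set of ordinary nodes, $N_{dist}$ is the set of distribution nodes and $N_{poss}$ is the set of possibility nodes. $E \subseteq N \times N$ is the set of edges; $r$ is the root of $PT$; $\lambda: N_{ord} \to L$ is the label function of the ordinary nodes, where $L$ is the set of element and attribute names; $\forall n \in N_{dist}$, $\lambda(n) =$ 'dist'; $\forall n \in N_{poss}$, $\lambda(n) =$ 'poss'. $V: N_l \to D$ gives the data of the leaf nodes, where $N_l \subseteq N_{ord} \cup N_{poss}$ is the set of leaf nodes and $D$ is the data set. $prob: N_{ord} \cup N_{poss} \to p$ is the probability value of the given node, $0 \le p \le 1$; the default value is 1 when no probability value is assigned.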
From the above definition of the probabilistic XML data tree, a probability node is represented by a dist node whose children are poss nodes, i.e. the root nodes of the possible world instances.
Let $PT = (N, E, r, \lambda, V, prob)$ be a probabilistic XML data tree. $\forall n \in N - \{r\}$, $parent(n)$ denotes the parent node of $n$; $\forall n \in N - N_l$, $children(n)$ denotes the child nodes of $n$. $\forall n \in N_{dist}$ with $children(n) = \{n_1, \dots, n_k\}$ ($k \ge 1$), $\sum_{i=1}^{k} prob(n_i) = 1$.
Theorem 1. Let $n_k$ be a leaf node in the probabilistic XML data tree $PT$. Then the probability of the path from the root node to the given leaf node is $\Pr(n_1.n_2.\cdots.n_k) = prob(n_1) \times prob(n_2) \times \cdots \times prob(n_k)$, where $n_1$ is the root of $PT$.
Proof: The result follows directly from the method of computing the probability along a probability chain.
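A minimal Python sketch of the product in Theorem 1 is given below. It assumes that a root-to-leaf path is supplied as a plain list of node probabilities and that nodes without an assigned probability contribute the default value 1; the function name path_probability and the example path are hypothetical.

```python
import math

def path_probability(node_probs):
    """Multiply the prob values of the nodes on one root-to-leaf path."""
    return math.prod(node_probs)

# Hypothetical path teacher/dist/poss/rank: only the poss node carries a
# probability (0.3); the other nodes use the default value 1.
example = path_probability([1.0, 1.0, 0.3, 1.0])
print(example)
```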
Theorem 2. The probabilistic XML data tree $PT_1$ is equal to $PT_2$ iff $PWS_{PT_1} = PWS_{PT_2}$.
Proof: According to the principle of the possible world model, $PT_1 = PT_2$ means that the set of all certain instances of $PT_1$ is the same as the set of all certain instances of $PT_2$, i.e. $PWS_{PT_1} = PWS_{PT_2}$. Conversely, $PWS_{PT_1}$ denotes the set of all certain instances of $PT_1$ and $PWS_{PT_2}$ denotes the set of all certain instances of $PT_2$, so if $PWS_{PT_1} = PWS_{PT_2}$, then $PT_1 = PT_2$.
Definition 7: A probabilistic XML data tree $PT$ is in 1NF form iff each poss node has two or more ordinary child nodes $C = \{child(poss)_i\}$ ($i \ge 2$) that satisfy
| | 2 1 C C i prob
Definition 8: A probabilistic XML data tree $PT$ is in 3NF form iff each poss node has one and only one ordinary child node $C = \{child(poss)\}$ that satisfies $0 \le prob_C \le 1$.
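The structural part of the 3NF condition can be sketched on an XML serialization of the tree. The Python sketch below assumes the tree is available as an xml.etree.ElementTree element and only checks that every poss element has exactly one child element; it does not check the probability range, and it is a simplification of the definition rather than a transcription of Algorithm check. The function name is_3nf and the two fragments are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_3nf(root):
    """True if every <poss> element below root has exactly one child element."""
    return all(len(list(poss)) == 1 for poss in root.iter("poss"))

# One ordinary child per poss node: satisfies the structural 3NF condition.
one_child = ET.fromstring("<dist><poss value='0.3'><rank>ass</rank></poss></dist>")
# Two ordinary children under the same poss node: a 1NF-style possibility.
many_children = ET.fromstring(
    "<dist><poss value='0.018'><rank>ass</rank><dept>d1</dept></poss></dist>"
)
print(is_3nf(one_child), is_3nf(many_children))
```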
From the definitions in this section, a 1NF probabilistic XML data tree can be decomposed into a 3NF probabilistic XML data tree [13]. Following the triple operation principle of probabilistic relational databases [14], three decomposing rules are listed as follows:
Rule 1: The uniqueness of the probabilistic XML data tree must be kept, i.e. the probabilistic XML data tree cannot be modified. According to Theorem 2, this is an important rule.
Rule 2: The faster decomposing method is always adopted. In general, the decomposing time depends on the number of nodes in the probabilistic XML data tree.
Rule 3: The probability querying and computation process is simplified by decomposing the probabilistic XML data tree. The advantage shows in the case of a probability chain.
3. Decomposing Probabilistic XML Tree Process From 1NF to 3NF
The decomposing process consists of the following four steps. The first step is to check whether the probabilistic XML data tree is in 3NF form. If it is not, go to the second step; otherwise stop. The second step is to divide the absolute path set of the probabilistic XML data tree into the set of paths with a distribution node and the set of paths without a distribution node. The third step is to obtain the extended absolute path set and the basic absolute path set from the absolute path set with distribution nodes. The fourth step is to merge the absolute paths so that each absolute path has one and only one poss node.
Let the number of children of the root in $PT$ be count, i.e. there are count probabilistic XML data sub-trees in $PT$ [15,16]. The notation $PT$ is therefore also used for a probabilistic XML data sub-tree. Let $PATH_i$ ($1 \le i \le count$) be the absolute path set obtained by maximally parsing the $i$-th probabilistic XML data sub-tree.
The absolute path set $PATH_{PT}$ can be obtained by parsing the probabilistic data tree $PT$.
Algorithm 1 PT2PTS(APATH)
Input: absolute path set APATH of 1NF PT
Output: absolute path set APATH of 3NF PTS
begin
  PTS_s1=∅; c1=0; PTS_s2=∅; c2=0; PTS=∅; A=APATH;
  CALL mark(A);
  if check(F1)=T then end;
  else
    CALL list_EPATH_PATH(F1,F2);
    for i=1 to |PATH| do
      for j=1 to |EPATH| do
        if PATH[i]=EPATH[j] then
          PTS_s1=PTS_s1 ∪ PATH[i]; c1=c1+1;
        else
          c2=c2+1; conpath[c2]=∅;
          for k=1 to |v(EPATH[i].poss)| do
            conpath[c2][k]=EPATH[i]/Nl(v(EPATH[i].poss)[k]);
          PTS_s2=PTS_s2 ∪ conpath[c2];
    PTS=PTS_s1 ∪ PTS_s2;
  return(PTS);
end.
Algorithm 2 mark(A)
Input: absolute path set A
Output: absolute path vector set B, independent of distribution subset F1, dependent of distribution subset F2
begin
  for i=1 to |A|
    for j=1 to length(A(i))
      B(i)(j)=node(A(i)).j;
    B(i)=B(i)[2:length(B(i))];
  F1=∅; F2=∅; B=∅;
  for i=1 to |B|
    for j=1 to length(B(i))
      if B(i)(j)=dist then
        B(i)=B(i)[1:j-1,j+1:last]; B(i)(length(B(i))+1)=1; F1=F1 ∪ {B(i)};
      else
        B(i)(length(B(i))+1)=0; F2=F2 ∪ {B(i)};
  B=F1 ∪ F2;
  return(F1,F2,B);
end.
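One possible reading of Algorithm mark is sketched below in Python. It assumes that each absolute path is a list of node labels starting at the root, drops the root label, removes any dist label and appends a flag (1 if the path contained a dist node, 0 otherwise). The list-of-strings representation and the variable names are assumptions made for illustration; the pseudocode above does not fix a concrete data structure.

```python
def mark(paths):
    """Split absolute paths into those containing a dist node and the rest."""
    with_dist, without_dist = [], []
    for path in paths:
        nodes = path[1:]  # drop the root label, as in B(i)[2:length(B(i))]
        if "dist" in nodes:
            # Remove the dist label and flag the path with 1.
            stripped = [n for n in nodes if n != "dist"]
            with_dist.append(stripped + [1])
        else:
            # No dist node on this path: flag it with 0.
            without_dist.append(nodes + [0])
    return with_dist, without_dist

f1, f2 = mark([
    ["teacher", "no"],
    ["teacher", "dist", "poss", "rank"],
])
print(f1, f2)
```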
Algorithm 3 check(F1)
Input: absolute path set F1
Output: if F1 is 3NF, return T; else return F
begin
  child_poss=∅; is_3NF=T;
  for i=1 to |F1|
    for j=1 to length(F1(i))
      if child(poss) ∉ child_poss then
        child_poss=child_poss ∪ {child(poss)};
      else
        is_3NF=F; goto end;
  return(is_3NF)
end.
Algorithm 4 list_EPATH_PATH(F1)
Input: absolute path set F1
Output: EPATH, PATH
begin
  EPATH=∅; PATH=F2;
  for i=1 to |F1|
    for j=1 to length(F1(i))
      epath=delete(last(F1(i))); EPATH=EPATH ∪ {epath};
      path=delete(epath.dist,epath.poss); PATH=PATH ∪ {path};
  return(EPATH, PATH)
end.
Theorem 3. Algorithm PT2PTS() for decomposing a probabilistic XML tree from 1NF form to 3NF form is correct and complete.
Proof: Definitions 4-8 and Theorems 1-2 provide and justify the methods for computing PT2PTS(APATH) when APATH is the absolute path set of a probabilistic XML data tree in 1NF form. The algorithm follows the methods given by these definitions and the rules. Therefore the algorithm correctly computes PT2PTS(APATH) for any APATH obtained from a probabilistic XML data tree. Thus the algorithm is correct and complete.
4. Algorithm for Querying Node Probability
This section describes how the node probability is queried and computed in the 3NF probabilistic XML data tree. First, the probabilistic XML data sub-tree that contains the given ordinary node is found; then the probability of the corresponding poss node is queried and computed.
Algorithm 5 querynodeprob(PTS, NODE)
Input: absolute path set of 3NF probabilistic XML data tree PTS, an ordinary node NODE
Output: probability of NODE PR
begin
  PR=0;
  for j=1 to |PTS| do
    P[j]=1;
    for k=1 to |PTS[j]| do
      if NODE ∈ PTS[j] then
        call computepathprob(PTS[j]);
        P[j]=P[j]×prob; PR=PR+P[j];
  return(PR);
end.
Algorithm 6 computepathprob(APATH)
Input: absolute path set of 3NF probabilistic XML data sub-tree APATH
Output: absolute path probability prob
begin
  for i=1 to |APATH| do
    prob[i]=1;
    for j=1 to length(APATH[i]) do
      node=APATH[i].node(j);
      if node=poss then prob[i]=prob[i]×value(poss);
  return(prob);
end.
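The two algorithms above can be sketched in Python under a simplifying assumption: every absolute path is a list of (label, value) pairs, where the value of a poss node is its probability and the value of a leaf is its data. The sketch flattens the sub-tree structure of PTS into a single list of paths, so it illustrates the idea rather than reproducing the pseudocode; the names compute_path_prob, query_node_prob and PATHS are hypothetical.

```python
def compute_path_prob(path):
    """Multiply the values of all poss nodes on one absolute path."""
    prob = 1.0
    for label, value in path:
        if label == "poss":
            prob *= value
    return prob

def query_node_prob(paths, node):
    """Sum the path probabilities of every path ending in the queried node."""
    return sum(compute_path_prob(p) for p in paths if p[-1] == node)

# Hypothetical 3NF absolute paths for the teacher with no=0001.
PATHS = [
    [("teacher", None), ("poss", 0.3), ("rank", "ass")],
    [("teacher", None), ("poss", 0.7), ("rank", "ins")],
    [("teacher", None), ("poss", 0.2), ("dept", "d1")],
    [("teacher", None), ("poss", 0.8), ("dept", "d2")],
]
print(query_node_prob(PATHS, ("rank", "ins")))
```

Because each 3NF path contains a single poss node, the product in compute_path_prob reduces to that node's value; the loop is kept so that the same function also works on paths with several poss nodes.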
Theorem 4. Algorithm querynodeprob() for querying and computing the node probability is correct and complete.
Proof: Definitions 4-8 and Theorems 1-2 provide and justify the methods for computing querynodeprob(PTS, NODE) when PTS is the absolute path set of a probabilistic XML data tree in 3NF form. The algorithm follows the methods given by these definitions and theorems. Therefore the algorithm correctly computes querynodeprob(PTS, NODE) for any given NODE in the 3NF probabilistic XML data tree. Thus the algorithm is correct and complete.
5. Instance Analysis
To show that the above method is correct and effective, this section gives an instance analysis.
Example 1. Teacher probabilistic data about rank, department and course are represented in probabilistic relational data tables. The teacher probabilistic relational schema is composed of Table 1, which consists of Table 2, Table 3, Table 4 and Table 5, where no is the key attribute in Table 1. The possible world instances (no=0001) are listed in Table 2. The <rank,ps> attribute in Table 3 represents the rank and its distribution data, the <dept,ps> attribute in Table 4 represents the department and its distribution data, and the <course,ps> attribute in Table 5 represents the teaching course and its distribution data.
The first row of Table 1 can be expanded into a set of possible world instances whose probabilities are obtained by computation. Since the individual attributes are independent of each other, there are 2×2×3=12 possible world instances, and their probability distribution is given by the values of the ps attribute shown in Table 2. For example, the probability of the possible world instance {ass,d1,c1} is 0.3×0.2×0.3=0.018.
Table 1. Probability data on rank, department and course of some teacher
no    name      <rank,ps>              <dept,ps>            <course,ps>
0001  WangMing  {<ass,0.3>,<ins,0.7>}  {<d1,0.2>,<d2,0.8>}  {<c1,0.3>,<c2,0.5>,<c3,0.2>}
Table 2. Possible world instances (no=0001)
no    name      rank  dept  course  ps
0001  WangMing  ass   d1    c1      0.018
0001  WangMing  ass   d1    c2      0.03
0001  WangMing  ass   d1    c3      0.012
0001  WangMing  ins   d1    c1      0.042
0001  WangMing  ins   d1    c2      0.07
0001  WangMing  ins   d1    c3      0.028
0001  WangMing  ass   d2    c1      0.072
0001  WangMing  ass   d2    c2      0.12
0001  WangMing  ass   d2    c3      0.048
0001  WangMing  ins   d2    c1      0.168
0001  WangMing  ins   d2    c2      0.28
0001  WangMing  ins   d2    c3      0.112
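The enumeration that leads from Table 1 to Table 2 can be sketched in Python as follows. The sketch assumes, as the text states, that the three attributes are independent, so each possible world instance has the product of the three attribute probabilities. The dictionary names and the helper possible_worlds are illustrative assumptions.

```python
from itertools import product
import math

# Distributions taken from the first row of Table 1.
RANK = {"ass": 0.3, "ins": 0.7}
DEPT = {"d1": 0.2, "d2": 0.8}
COURSE = {"c1": 0.3, "c2": 0.5, "c3": 0.2}

def possible_worlds(*attributes):
    """Map every combination of attribute values to the product of its probabilities."""
    worlds = {}
    for combo in product(*(attr.items() for attr in attributes)):
        values = tuple(v for v, _ in combo)
        worlds[values] = math.prod(p for _, p in combo)
    return worlds

worlds = possible_worlds(RANK, DEPT, COURSE)
print(len(worlds))                            # 12 instances, as in Table 2
print(round(worlds[("ass", "d1", "c1")], 3))  # 0.018
print(round(sum(worlds.values()), 10))        # 1.0
```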
It is obvious that the rank, dept and course attributes each depend on the ps attribute. According to the 3NF principle, Table 2 can be decomposed into a rank table, a dept table and a course table, which are shown in Table 3, Table 4 and Table 5.
Table 3. Possible world instances of rank (no=0001)
no    name      rank  ps
0001  WangMing  ass   0.3
0001  WangMing  ins   0.7
Table 4. Possible world instances of dept (no=0001)
no    name      dept  ps
0001  WangMing  d1    0.2
0001  WangMing  d2    0.8
Table 5. Possible world instances of course (no=0001)
no    name      course  ps
0001  WangMing  c1      0.3
0001  WangMing  c2      0.5
0001  WangMing  c3      0.2
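One way to read the decomposition of Table 2 into the rank, dept and course tables is as a marginalization: the ps values of all rows sharing the same attribute value are summed. The Python sketch below takes this reading as an assumption (the text only states that the decomposition follows the 3NF principle) and recovers the per-attribute probabilities from the rows of Table 2. The names TABLE2 and marginal are hypothetical.

```python
from collections import defaultdict

# Rows of Table 2 as (rank, dept, course, ps).
TABLE2 = [
    ("ass", "d1", "c1", 0.018), ("ass", "d1", "c2", 0.03), ("ass", "d1", "c3", 0.012),
    ("ins", "d1", "c1", 0.042), ("ins", "d1", "c2", 0.07), ("ins", "d1", "c3", 0.028),
    ("ass", "d2", "c1", 0.072), ("ass", "d2", "c2", 0.12), ("ass", "d2", "c3", 0.048),
    ("ins", "d2", "c1", 0.168), ("ins", "d2", "c2", 0.28), ("ins", "d2", "c3", 0.112),
]

def marginal(rows, column):
    """Sum ps over all rows that share the same value in the given column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[column]] += row[-1]
    # Round away floating-point noise from the repeated additions.
    return {value: round(total, 6) for value, total in totals.items()}

rank_table = marginal(TABLE2, 0)
dept_table = marginal(TABLE2, 1)
course_table = marginal(TABLE2, 2)
print(rank_table, dept_table, course_table)
```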
According to the definition of the probability_valid semi-structured information unit, the probabilistic XML document corresponding to Table 2 is created.
The DTD document is shown as follows.
<!ELEMENT teacher (no, name, dist)>
<!ELEMENT no (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT dist (poss+)>
<!ELEMENT poss (rank, dept, course)>
<!ATTLIST poss value CDATA #REQUIRED>
<!ELEMENT rank (#PCDATA)>
<!ELEMENT dept (#PCDATA)>
<!ELEMENT course (#PCDATA)>
<teacher>
  <no>0001</no>
  <name>Wang Ming</name>
  <dist>
    <poss value='0.018'>
      <rank>ass</rank>
      <dept>d1</dept>
      <course>c1</course>
    </poss>
    <poss value='0.03'> … </poss>
    …
  </dist>
</teacher>
According to the definition of the probabilistic XML data tree, the above XML document can be represented as the probabilistic XML tree in Figure 1, and Figure 1 can be transformed into Figure 2.
Figure 1. Probabilistic XML tree of some teacher data
Figure 2. Probabilistic XML tree transform format of some teacher data
The process of applying Algorithm PT2PTS to obtain the 3NF probabilistic XML tree from the 1NF probabilistic XML tree is introduced next. Identical nodes in the absolute path set without leaves are merged, so the schema graph of the 1NF probabilistic XML data tree is obtained as shown in Figures 3 and 4, where Figure 3 is the schema graph before transformation and Figure 4 is the schema graph after transformation. The 1NF probabilistic XML data sub-tree whose no node value is 0001 can be decomposed as shown in Figure 5, Figure 6 and Figure 7. The three sub-trees in Figure 5, Figure 6 and Figure 7 can then be merged into the form of Figure 8.
Figure 3. 1NF schema graph before transformation of probabilistic XML tree
Figure 4. 1NF schema graph after transformation of probabilistic XML tree
Figure 5. rank sub-tree
Figure 6. dept sub-tree
Figure 7. course sub-tree
Figure 8. merged 3NF probabilistic XML sub-tree
The 3NF probabilistic XML document fragment consistent with Figure 8 is shown as follows.
<teacher>
  <no>0001</no>
  <name>Wang Ming</name>
  <dist>
    <poss value='0.3'><rank>ass</rank></poss>
    <poss value='0.7'><rank>ins</rank></poss>
  </dist>
  <dist>
    <poss value='0.2'><dept>d1</dept></poss>
    <poss value='0.8'><dept>d2</dept></poss>
  </dist>
  <dist>
    <poss value='0.3'><course>c1</course></poss>
    <poss value='0.5'><course>c2</course></poss>
    <poss value='0.2'><course>c3</course></poss>
  </dist>
</teacher>
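The node probabilities in Figure 8 are queried by applying Algorithm querynodeprob. The query results are listed as follows.
$pr(no = \text{'0001'} \wedge rank = \text{'ass'}) = 0.3$
$pr(no = \text{'0001'} \wedge rank = \text{'ins'}) = 0.7$
$pr(no = \text{'0001'} \wedge dept = \text{'d1'}) = 0.2$
$pr(no = \text{'0001'} \wedge dept = \text{'d2'}) = 0.8$
$pr(no = \text{'0001'} \wedge course = \text{'c1'}) = 0.3$
$pr(no = \text{'0001'} \wedge course = \text{'c2'}) = 0.5$
$pr(no = \text{'0001'} \wedge course = \text{'c3'}) = 0.2$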
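These query results can be reproduced directly on a 3NF XML fragment. The Python sketch below uses the standard-library xml.etree.ElementTree module on a shortened copy of the fragment (rank and dept only) and sums the value attribute of every poss element whose single child matches the queried label and value. It works on the XML serialization rather than on absolute path sets, so it is an illustration of the result, not an implementation of Algorithm querynodeprob; the function name node_probability is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened 3NF fragment for the teacher with no=0001 (course omitted).
DOC = """
<teacher>
  <no>0001</no>
  <name>Wang Ming</name>
  <dist>
    <poss value='0.3'><rank>ass</rank></poss>
    <poss value='0.7'><rank>ins</rank></poss>
  </dist>
  <dist>
    <poss value='0.2'><dept>d1</dept></poss>
    <poss value='0.8'><dept>d2</dept></poss>
  </dist>
</teacher>
"""

def node_probability(teacher, label, value):
    """Sum the poss values whose child element is <label>value</label>."""
    total = 0.0
    for poss in teacher.iter("poss"):
        child = poss.find(label)
        if child is not None and child.text == value:
            total += float(poss.get("value"))
    return total

teacher = ET.fromstring(DOC)
print(node_probability(teacher, "rank", "ass"))
print(node_probability(teacher, "dept", "d2"))
```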
6. Conclusion
In this paper, a normative framework for probabilistic XML documents is designed, which has the 1NF and 3NF probabilistic XML data tree forms. The decomposing and query algorithms are given and verified through the instance analysis.
Because decomposing the 1NF probabilistic XML tree produces a large number of probabilistic XML sub-tree elements, how to merge the 3NF probabilistic XML sub-trees quickly and how to design a storage method based on data compression [17] are essential. Because the node probability query algorithm must be correct and effective, a fast algorithm for finding the probabilistic XML data sub-trees that contain a given node is a further research direction.
7. Acknowledgement
This material is based upon work supported by Heilongjiang Nature and Science funds under F200925.
8. References
[1] http://www.w3.org/TR/
[2] WANG Jianwei, HAO Zhongxiao. “Survey of Research on Probabilistic XML Data Management Techniques”. COMPUTER SCIENCE, 2009, 36(11), pp. 14-17 (in Chinese).
[3] Kim, Yong Jae, “A Study on The Global Standards Of The E-Trade Process on The Basis ebXML & Web Services”, JCIT: Journal of Convergence Information Technology, Vol. 4, No. 4, pp. 102-110, 2009.
[4] Panida Songram, “Efficient Mining of Top-K Closed Sequences”, JCIT: Journal of Convergence Information Technology, Vol. 5, No. 5, pp. 170-178, 2010.
[5] Andrew Nierman, H. V. Jagadish. “ProTDB: Probabilistic Data in XML”. Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.
[6] Te Li, Qihong Shao, and Yi Chen. “PEPX: A Query-Friendly Probabilistic XML Database”. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management. CIKM '06, ACM Press, pp. 848-849, 2006.
[7] Benny Kimelfeld and Yehoshua Sagiv. “Matching Twigs in Probabilistic XML”. VLDB '07, September, Vienna, Austria, pp. 27-38, 2007.
[8] Abiteboul S, Kanellakis P, Grahne G. “On the representation and querying of sets of possible worlds”. ACM SIGMOD Record, 16(3), pp. 34-48, 1987.
[9] Green T J, Tannen V. “Models for incomplete and probabilistic information”. IEEE Data Engineering Bulletin, 29(1), pp. 17-24, 2006.
[10] A Hunter and W Liu . “Merging uncertain information with semantic heterogeneity in XML”. Knowledge and Information Systems,9(2), pp. 230-258, 2006.
[11] M. van Keulen, A. de Keijzer, and W. Alink. “A probabilistic xml approach to data integration”. In Proc. ICDE Conf., Tokyo, Japan, pp. 459–470, 2005.
[12] Ander de Keijzer, Maurice van Keulen. “User Feedback in Probabilistic Integration”. DEXA '07, 18th International Workshop on Database and Expert Systems Applications, pp. 377-381, 2007.
[13] Suk Kyoon Lee. “An extended relational database model for uncertain and imprecise information”. In VLDB, pp.211-220, 1992.
[14] Reynold Cheng, Sarvjeet Singh, Sunil Prabhakar, Rahul Shah, Jeffrey Scott Vitter, Yuni Xia. “Efficient join processing over uncertain data”. Proceedings of the 15th ACM international conference on Information and knowledge management, pp. 738-747, 2006.
[15] Zhiyuan Chen, H. V. Jagadish, Flip Korn, Nick Koudas, S. Muthukrishnan, Raymond T. Ng, Divesh Srivastava. “Counting Twig Matches in a Tree”. Proceedings of the 17th International Conference on Data Engineering, pp. 595-604, 2001.
[16] Jianwei Wang and Zhongxiao Hao. “An Algorithm of Estimation Pattern Tree Number in Probabilistic XML Data Tree”,The 2010 International Conference on Electronics and Information Engineering (ICEIE 2010), Kyoto, Japan,V(1), pp. 507-512, 2010.
[17] Christopher League and Kenjone Eng, “Schema-based compression of XML data with relax NG”. J. Comput,Vol. 2, NO.10, pp. 9-17, 2007.