i.e., elapsed time ideally scales linearly whereas the resource allocation is constant, independent of the size of the generated document; (4) deterministic, that is, the output should depend only on the input parameters. First, to be able to reproduce documents independently of the platform, we incorporated our own random number generator rather than relying on the operating system's built-in generators. Together with basic algorithms that can be found in statistics textbooks, xmlgen implements uniform, exponential, and normal distributions of fairly high quality. We assigned to each element in the DTD a plausible distribution of its children and its references, observing consistency among referencing elements; for example, the number of items organized by continents equals the sum of open and closed auctions. Second, to provide for accurate scaling, we scale selected sets, such as the number of items and persons, with the user-defined factor. Moreover, we calibrated the numbers so that scaling factor 1.0 yields a total document size of slightly more than 100 MB (cf. Fig. 3). Finally, implementing the data generator efficiently is a challenge because references are created at various places throughout the document. Since we must abide by the integrity constraint that every reference points to a valid identifier, we could take the straightforward approach of keeping some sort of log to record which identifiers have already been referenced; unfortunately, this is infeasible for large documents. We solved this problem by modifying the random number generation to produce several identical streams of random numbers. That way, we can implement a partitioning of sets, such as the item IDs, that are referenced from both open and closed auctions.
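The idea of replaying identical random streams can be sketched in a few lines of Python. This is only an illustration of the principle, not xmlgen's actual implementation; the seed, ID range, and even/odd partitioning rule are all invented for the example.

```python
import random

def id_stream(seed, n):
    """Regenerate the same sequence of item IDs from a fixed seed.

    Because the stream is a pure function of the seed, any part of the
    generator can replay it later without keeping a log of emitted IDs.
    """
    rng = random.Random(seed)
    return [rng.randrange(1000) for _ in range(n)]

# The site that emits item IDs and the sites that reference them replay
# the identical stream, so every reference hits an existing identifier.
items = id_stream(42, 10)

# Partition the replayed stream without any bookkeeping: even positions
# feed open auctions, odd positions feed closed auctions (no overlap).
replay = id_stream(42, 10)
open_refs = replay[0::2]
closed_refs = replay[1::2]
```

The key property is that regenerating the stream is cheap and uses constant memory, which is what keeps the generator's resource allocation independent of document size.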
In its current version, xmlgen requires less than 2 MB of main memory and produces documents of 100 MB and 1 GB in 33.4 and 335.5 seconds, respectively (450 MHz Pentium III). A more detailed description of the tool and downloads can be found at the project Web page.
In this thesis, we have focused on uniformly representing different solutions to a multitude of data management problems addressed in the context of XML. We have surveyed works on XML data integration, XML data exchange, and answering XML queries using XML views. We have illustrated similarities and differences between the works by introducing a unified framework for comparing the different approaches. This framework incorporates a new query language for XML trees, called extended tree patterns, that successfully captures the expressive power of several query languages that have been used extensively for querying XML trees throughout the literature on XML data management. We have introduced translation algorithms between these query formalisms and our formalism of extended tree patterns. Moreover, based on the extended tree patterns, we have defined schema mapping assertions for XML that can be either LAV or GLAV mapping assertions. The translation algorithms, together with the mapping assertions for XML based on extended tree patterns, have allowed us to make approaches that tackle different problems using different query languages comparable. Furthermore, we have proposed a classification of the existing works into three different classes, inside which we additionally characterize each approach by comparing them along several dimensions. By using our unified framework, we have been able to identify different open problems in XML data management. We have seen that there is still plenty of room for examining different variants of the problems connected with XML data management.
Over the past few years, there has been a tremendous surge of interest in XML as a universal, queryable representation of data. This has in part been boosted by the growth of web and e-commerce applications, in the context of which XML has emerged as the de-facto standard for information interchange [W3C98b]. Today nearly every major vendor of a data management tool, be it Oracle [Ora] or IBM [IBM], has added support for importing, storing, and viewing XML data over their relational engine. XML publishing capabilities, that is, the use of XML for formatting the results of SQL queries, have been added to the latest releases of relational database systems by Oracle [Ora], IBM [IBM], and Microsoft [Mic]. At the same time, due to the maturity of query optimization techniques and the high query performance offered by relational query engines, the use of Relational Database Management Systems (RDBMSs) as a store for XML data has been put forth as a promising direction for XML data management.
Use results of formal studies of the containment of recursive query languages [Benedikt et al., 2011, 2012a] to optimize query answering
from a print-oriented XML data format into RDF. The aim is to link together linguistic data currently residing in silos and to leverage Semantic Web technologies for discovering new information embedded in the data. The initial steps of this transition have been described in , where OUP moved from monolithic, print-oriented XML to a leaner, machine-interpretable XML data format in order to facilitate transformations into RDF.  provides examples of conversion code as well as snippets of XML and RDF dictionary data, and we recommend referring to it to understand the kind of data modelling challenges faced in this transition.
According to (Abiteboul, 1997), relational data, which is typed, unordered and grouped in semantic entities with the same attributes, can be called structured data.
1.3 The semistructured (XML) approach
In contrast to structured relational data, semistructured data (SSD) is data that is ordered, not strictly typed, and where entities of one group can have different attributes (Abiteboul, 1997). SSD is often called self-contained or self-describing since there is no explicit need for a schema, and type information can be omitted or placed directly into the data. Websites and hierarchical data can be described well by SSD; XML is a markup language to specify SSD.
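A tiny example makes the self-describing property concrete. The document below is invented for illustration: two entities of the same group (`person`) carry different attributes, and no external schema is needed to read either of them.

```python
import xml.etree.ElementTree as ET

# Two <person> entries of the same group carry different attribute
# sets; the data describes itself, with no schema required to parse it.
doc = """
<people>
  <person name="Ada" email="ada@example.org"/>
  <person name="Bob" phone="555-0100"/>
</people>
"""

root = ET.fromstring(doc)
attrs = [set(p.attrib) for p in root.findall("person")]
# attrs now shows that the two entities expose different attributes.
```

In a relational setting the same data would force either a common schema with NULLs or two separate tables; here the irregularity is simply part of the data.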
arbitrary sets of nodes) or arbitrary recursive Markov chains (that can represent spaces of unbounded tree height or tree width).
We mentioned various open problems throughout this chapter. Two of these deserve particular emphasis. First, the connection to probabilistic relational models needs better understanding, from both the theoretical viewpoint (e.g., what makes tree-pattern queries over ProTDB tractable when they are encoded into relations?) and the practical viewpoint (e.g., can we build on a system such as Trio (Widom) or MayBMS (Huang, Antova, Koch, and Olteanu) to effectively manage probabilistic XML data?). Second, further effort should be made to realize and demonstrate the ideal of using probabilistic XML databases, or probabilistic databases in general, to answer the data needs of applications (rather than devising per-application solutions). We discussed some of the wide range of candidate applications in the introduction. We believe that the research of recent years, which is largely driven by the notable popularity of probabilistic databases in the database-research community, constitutes significant progress towards this ideal: it has substantially improved our understanding of probabilistic (XML) databases, developed a plethora of algorithmic techniques, and produced prototype implementations thereof.
Now you should have a list showing each ref XML item in the list. Among other things, this happens to be a convenient way to quickly visualize which XML elements are bound in a particular context, and what is available on them. When we first set up the XML bindings, we bound to the OuterXml everywhere to watch as the context of the data changed. Before we head to the next section, go ahead and set the binding back to using XPath:
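As a hedged sketch of what "setting the binding back to XPath" might look like, assuming a WPF setup with an `XmlDataProvider` as in walkthroughs of this kind (the bound element name `title` is illustrative, not taken from the text):

```xml
<!-- Hypothetical sketch: instead of binding to OuterXml, select a
     child element via an XPath expression; "title" is illustrative. -->
<TextBlock Text="{Binding XPath=title}" />
```

The `XPath` path in a WPF binding is evaluated relative to the current XML node of the data context, which is why the displayed value changes as the context moves through the document.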
Universal Description, Discovery and Integration (UDDI), developed by Microsoft, IBM, and Ariba, uses the XML Schema language to formally describe its data structures. UDDI is SOAP based and defines global interaction with the web service information repository. A web service is a self-describing, self-contained, modular unit of application logic that provides business functionality to other applications through an Internet connection. The UDDI specification enables businesses to quickly, easily, and dynamically find each other and interact. It enables a business to describe itself as well as to find and interact with businesses offering desired services. This Internet-facilitated discovery and interaction fosters new e-business partnerships. UDDI also simplifies the integration of disparate systems and allows market expansion, improved efficiency, and reduced cost. Applications can access web services via ubiquitous web protocols and data formats, such as XML, without concern for how a web service is implemented. Web services can be mixed and matched to execute a larger workflow or business transaction. The UDDI Business Registry can be accessed using SOAP, and a service registered in the UDDI Business Registry can be exposed through any type of service interface.
Roughly, we propose the following strategy for data integration. The information in a foreign probabilistic XML document is used to supplement a device's knowledge of the real world. We either find (1) information on previously unknown real-world objects, (2) conflicting information on already known real-world objects, or (3) the same information on already known real-world objects. In the first case, that information is simply added. In the second case, we incorporate existing and new information as distinct possibilities. The last case only confirms what the device already knows. Note that whether two pieces of information correspond to the same real-world object can often not be determined with certainty; we add possibilities to the document accordingly. We assume the existence of a rule engine that determines the probability of two elements referring to the same real-world object.
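The three-case strategy can be sketched as a small merge routine. This is only an illustration under simplifying assumptions: elements are flat dictionaries, `match` stands in for the rule engine, and the "distinct possibilities" of case 2 are modeled by a plain flag rather than real probabilities.

```python
def integrate(known, incoming, match):
    """Merge incoming records into the device's knowledge.

    match(k, obj) stands in for the rule engine deciding whether two
    records refer to the same real-world object.
    """
    for obj in incoming:
        same = [k for k in known if match(k, obj)]
        if not same:
            # Case 1: previously unknown object -> simply add it.
            known.append(obj)
        elif all(k["value"] == obj["value"] for k in same):
            # Case 3: same information -> mere confirmation, no change.
            pass
        else:
            # Case 2: conflict -> keep both as distinct possibilities
            # (a real system would attach probabilities here).
            known.append({**obj, "possibility": True})
    return known
```

A real probabilistic XML integrator would of course attach probabilities from the rule engine instead of a boolean flag, but the control flow mirrors the three cases above.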
APEX is an adaptive index that seeks a trade-off between size and effectiveness (Chang et al. 2002). Instead of indexing all the paths from the root, APEX indexes only the frequently used paths (FUPs) and preserves the structure of the source data in a tree. Since the FUPs are stored in the index, path query processing is quite efficient. APEX is also workload-aware, i.e., it can be updated dynamically according to changes in the query workload; a data mining method is used to extract FUPs from the workload for incremental updates (Agrawal & Srikant 1995). Unfortunately, all these indexing techniques are ill-suited to decision-support queries. Data structures such as the data guide, the 1-index and its variants, and APEX are applicable only to XML data targeted by simple path expressions. In the context of XML data warehouses, however, queries are complex and include several path expressions that compute join operations. Moreover, these indices operate on one XML document only, whereas in XML warehouses data are managed in several XML documents and decision-support queries are performed over these documents.
 have done considerable work in specifying XQuery trigger semantic models and in discussing the XQuery trigger execution model. The key analysis question is the termination of trigger execution: a set of triggers is said to be terminating if, for any initial event and any initial database state, the trigger execution terminates. Analysis of ECA rules in active databases is a well-studied topic, with a number of approaches appearing in the literature, e.g., , mostly in the context of relational databases. A natural question to ask is whether it is possible to reuse analysis techniques developed for triggers in relational databases by translating the set of XML documents and associated triggers into a relational form and then applying previous analysis techniques to these triggers. The problem with this approach is that the transformation may result in a significant loss of information.  propose new languages for defining Event-Condition-Action (ECA) rules on XML, providing reactive functionality on XML repositories.  proposes a specification language for defining active views on top of an XML repository. Our work applies to XML views of underlying heterogeneous data sources.  proposes and validates XBML (XML-based Business Modeling Language), an XML active query language approach to specifying electronic commerce business models. None of these works discusses the trigger termination problem. In XML-based data integration systems, the triggers are considerably more dynamic. Therefore, we need a new mechanism to analyze XQuery triggers in XML-based data integration systems and ensure the termination of trigger execution.
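A classic sufficient condition for termination from the active-database literature can be sketched as follows: build a triggering graph with an edge from trigger t to trigger u whenever t's action can raise u's event, and conclude termination if the graph is acyclic. This is an illustration of that general technique, not of the analysis proposed in this work; trigger names and the `raises` relation are invented.

```python
def terminates(triggers, raises):
    """Sufficient termination test via triggering-graph acyclicity.

    triggers: iterable of trigger names.
    raises:   dict mapping a trigger to the set of triggers whose
              events its action can raise.
    Returns True if the triggering graph has no cycle (guaranteed
    termination); False means a cycle exists and execution MAY diverge.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in triggers}

    def cyclic(t):  # depth-first search for a back edge
        color[t] = GRAY
        for u in raises.get(t, ()):
            if color[u] == GRAY or (color[u] == WHITE and cyclic(u)):
                return True
        color[t] = BLACK
        return False

    return not any(color[t] == WHITE and cyclic(t) for t in triggers)
```

Note the test is conservative: a cyclic triggering graph does not prove non-termination, which is one reason finer-grained analyses are needed for the more dynamic triggers of integration systems.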
First, the system receives a query formulated in terms of the unified global schema. The rewriter component decomposes the query into subqueries addressed to specific data sources. This decomposition is based on the descriptions of the sources given by the global schema and the mapping schema, which play an important role in optimizing the execution plan of the subqueries. Finally, the subqueries are sent to the wrappers of the individual sources, which transform them into queries over the sources. The results of these subqueries are sent back to the mediator system, where the answers are merged and returned to the user. Beyond the possibility of issuing queries, the mediator has no control over the individual sources.
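The mediator flow above can be sketched in a few lines. This is a deliberately simplified illustration: the "subquery" is just a predicate pushed to every source, whereas a real rewriter would produce per-source queries from the schema mappings, and the wrappers would translate them into each source's native language.

```python
def answer(query, sources):
    """Toy mediator: push a predicate to each source, merge the answers.

    query:   a predicate on records (stand-in for the rewritten subqueries).
    sources: dict mapping source name -> list of records (stand-in for
             the wrapped data sources).
    """
    # "Rewriting": evaluate the pushed-down subquery at each source.
    partial = [[r for r in recs if query(r)] for recs in sources.values()]

    # Mediation: merge the partial answers, removing duplicates,
    # and return the combined result to the user.
    merged = []
    for p in partial:
        for r in p:
            if r not in merged:
                merged.append(r)
    return merged
```

Even this toy version shows the key architectural point: the mediator only sees query results flowing back from the wrappers and has no other control over the sources.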
ABSTRACT: Keyword queries consist of terms that users enter to retrieve documents containing all or any of those terms. They are the most familiar and popular method ordinary users employ to search data, and keyword search has emerged as one of the most effective ways for information discovery, especially over HTML documents in the World Wide Web. Keyword queries owe their popularity to their simplicity: users neither have to learn a complex query language nor need any prior knowledge of the structure of the underlying data. Keyword queries are, however, highly ambiguous. A single interpretation of a query will often not satisfy the user, and many interpretations may yield unnecessary results, leading to user dissatisfaction. Diversification is mainly meant to minimize this dissatisfaction: a keyword query should return the major possible interpretations of the keywords in the underlying database, enabling the user to easily select the intended interpretation. In information retrieval (IR), keyword queries retrieve a list of relevant documents that must be analyzed manually one by one, whereas keyword queries over structured data allow a more direct and effective way of diversification. In keyword search over a structured or semi-structured database, if a keyword appears as the value of more than one attribute, each occurrence can be taken as a different interpretation, and each interpretation will yield different results. Keyword query diversification can thus be defined as follows: for a given keyword query over an XML dataset, the user should get a result set of top-k results, where each result is relevant to the given keyword query and the results are maximally different from each other.
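The "relevant yet maximally different" requirement is commonly realized with a greedy, MMR-style selection. The sketch below is a generic illustration of that heuristic, not the algorithm of this paper; the scoring functions and the trade-off parameter are placeholders.

```python
def diversify(candidates, relevance, distance, k, lam=0.5):
    """Greedy top-k diversification (MMR-style heuristic).

    relevance(c): how well candidate c matches the keyword query.
    distance(a, b): how different two results are (0 = identical).
    lam: trade-off between relevance and diversity.
    """
    chosen, pool = [], list(candidates)
    while pool and len(chosen) < k:
        def score(c):
            # Diversity = distance to the closest already-chosen result.
            div = min((distance(c, s) for s in chosen), default=1.0)
            return lam * relevance(c) + (1 - lam) * div
        best = max(pool, key=score)
        chosen.append(best)
        pool.remove(best)
    return chosen
```

For keyword search over XML, the candidates would be the distinct query interpretations (e.g., the different attributes in which a keyword occurs), so the user sees the major interpretations rather than k near-duplicates of the top one.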
Related work
In recent years, significant effort has been devoted to developing high-performance XML database systems and to building tools for data exchange. One major direction of the XML effort is the "relational approach", which uses relational DBMSs to store and query XML data. Documents can be translated into relational tuples using either a "DTD-aware" translation ,  or a "schemaless" translation. The latter translations include the edge  and the node  representations of the data. Indexes can be prebuilt on the data to improve performance in relational query processing; see, e.g., , . Constraints arising in the translation are sometimes dealt with explicitly , . See  for a survey of the relational approach to answering XML queries.
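The schemaless "edge" representation mentioned above can be sketched concretely: every parent-child edge of the document tree becomes one tuple (source node, ordinal, label, target node), so any document fits a single relational table with no DTD required. The column layout below follows the general idea of edge shredding and is illustrative, not the exact schema of the cited work.

```python
import xml.etree.ElementTree as ET

def shred_edges(root):
    """Shred an XML tree into edge tuples (source, ordinal, label, target).

    Node IDs are assigned in document (pre-)order; the resulting list is
    exactly what would be bulk-loaded into an Edge relation.
    """
    edges, counter = [], [0]  # counter[0] = last assigned node ID

    def visit(node):
        nid = counter[0]
        for ord_, child in enumerate(node):
            counter[0] += 1
            edges.append((nid, ord_, child.tag, counter[0]))
            visit(child)

    visit(root)
    return edges

root = ET.fromstring("<a><b/><c><d/></c></a>")
rows = shred_edges(root)
```

A path query then becomes a self-join of the Edge table on `target = source` per step, which is why prebuilt indexes on the label and node columns matter so much for performance in this approach.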