The objective of this paper is to propose a technique that collects search terms from office documents and generates a search index that can effectively identify target office documents by search conditions defined over document types, search terms, and term descriptions. We focus on digital documents edited with Microsoft Office Word 2003 (MS Word, for short), since it is widely used in enterprise and government organizations; however, the proposed technique can be applied to documents edited with Microsoft Office Word 2007 with trivial modifications. Throughout this paper, we use the term office documents to denote digital documents edited with MS Word. As XML has become a de facto standard for describing both web and office documents, we use XML to define the search index of office documents; the search index can therefore be easily imported into existing search engines. The search index is designed to effectively address various document types (such as meeting minutes, sales reports, contracts, and letters). Users can easily build a search query by defining search terms and adding operators such as equals, greater than, less than, in the list, or between to further expand or restrict the search scope. The contributions of this paper can be summarized as follows.
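A search-index entry of the kind described could be serialized as a small XML fragment. The sketch below is illustrative only: the element and attribute names (`document`, `term`, `type`) are assumptions, not the schema defined by the paper.

```python
import xml.etree.ElementTree as ET

def build_index_entry(doc_id, doc_type, terms):
    # One <document> element per office file, carrying its document
    # type and the extracted search terms with their descriptions.
    doc = ET.Element("document", id=doc_id, type=doc_type)
    for name, desc in terms:
        term = ET.SubElement(doc, "term", name=name)
        term.text = desc
    return doc

entry = build_index_entry("d001", "meeting-minutes",
                          [("date", "2010-03-15"), ("attendees", "5")])
xml_str = ET.tostring(entry, encoding="unicode")
```

Because the entry is plain XML, it can be handed to any XML-aware search engine or transformed into that engine's native index format.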
Based on the popularity of open-source search frameworks and their implementations, we chose the tools for building a multilingual search index: the latest version of Nutch with Hadoop as the crawler, and Apache Solr for building the index of crawled documents. Both Nutch and Solr support the customization of modules to adapt them to a desired application. Here, these tools are used to develop monolingual web search engines for 9 Indian languages, viz., Assamese, Bengali, Gujarati, Hindi, Marathi, Odiya, Tamil, and Telugu.
Today, the Internet is the most powerful network and provides space for storing all types of important information. Anyone, whether a person or an organization, who wants any kind of information, data, or news searches the WWW through a search engine. The World Wide Web stores all types of data; a search engine returns a list of links related to the searched item, and the user selects one link at a time. Some of the returned links are related to the search item and some are not. Three types of search engine are in common use today: index-based search engines, directory search engines, and meta search engines. The main aim of this research paper is to find which search engine gives the best results and how many of the returned links are related to the searched item. This paper can help people and organizations that regularly search for data or information, since we evaluate the performance of index-based, directory, and meta search engines.
This paper describes a tool that helps users semantically bundle linguistic trees such as a constituency tree and a dependency tree. We refer to the tool as StruAP (Structure-based Abstract Pattern). Using the proposed tool, the user can easily define relations that are specific to a given business use case and create a search index for the newly defined relations. The search index allows the user to retrieve sentences that include the defined relations. For instance, we can interpret the following sentence as including a spin-off relation between Japanese electronics maker Hitachi and its home appliance and industrial equipment divisions.
Our SMSE scheme supporting fast decryption involves three entities: the data owner, the cloud server, and the user. The data owner has a plaintext database and wants to outsource it to the cloud server. The data owner first encrypts the database and generates a search index, then outsources both to the server. In addition, the data owner allocates different search and decryption privileges to every authorized user. When a user wants to search for documents, she generates a valid search token and submits it to the server. Upon receiving the search token, the server performs a search over the search index and finds the corresponding results. The server then tests each search result to determine whether the user can decrypt it, and finally returns only the results that can be decrypted. In our system, we assume the server is “honest-but-curious”: it will follow our protocol and return the correct search results, but will try to learn as much information as possible from the protocol. Figure 1 shows the system model. Formally, our scheme consists of five algorithms:
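The token-based search step can be illustrated with a minimal searchable-index sketch. This is not the paper's SMSE construction (it omits document encryption, per-user privileges, and the decryptability test); it only shows how an opaque search token, derived with a keyed hash, can be matched against an index without revealing the keyword to the server.

```python
import hmac, hashlib

def token(key, keyword):
    # Deterministic search token: HMAC of the keyword under the
    # owner's secret key; the server never sees the plaintext word.
    return hmac.new(key, keyword.encode(), hashlib.sha256).hexdigest()

def build_index(key, docs):
    # docs: {doc_id: [keywords]}  ->  {token: [doc_ids]}
    index = {}
    for doc_id, words in docs.items():
        for w in words:
            index.setdefault(token(key, w), []).append(doc_id)
    return index

def search(index, tok):
    # The server matches the opaque token against the index.
    return index.get(tok, [])

key = b"owner-secret-key"  # illustrative key, held by the data owner
index = build_index(key, {"d1": ["cloud", "search"], "d2": ["cloud"]})
results = search(index, token(key, "cloud"))
```

Note that even this toy scheme leaks the access pattern (which entries match), which is exactly the kind of leakage an honest-but-curious server can exploit.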
A privacy-preserving multi-keyword text search (MTS) scheme with similarity-based ranking has been introduced. To support multi-keyword search and search-result ranking, the scheme builds the search index based on term frequency and the vector space model, using the cosine similarity measure to obtain higher search-result accuracy. To improve search efficiency, a tree-based index structure and various adaptations of the multi-dimensional (MD) algorithm are proposed, so that the practical search efficiency is much better than that of linear search. To further enhance search privacy, two secure index schemes meet the highest privacy requirements under strong threat models, i.e., the known ciphertext model and the known background model.
In this paper, to support multi-keyword search and search-result ranking, the authors propose to create the search index based on term frequency (TF) and the vector space model with cosine similarity, in order to achieve higher search-result accuracy. To enhance search efficiency, a tree-based index structure is proposed. Furthermore, they propose two secure index schemes to satisfy the privacy requirements under strong threat models, namely the known ciphertext model and the known background model.
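The TF/cosine-similarity ranking that these index schemes build on can be sketched in a few lines. This is generic vector-space ranking over plaintext, shown only to fix ideas; it is not the secure index itself.

```python
import math
from collections import Counter

def tf_vector(text):
    # Raw term-frequency vector for one document or query.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse TF vectors.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = {"d1": "secure cloud search", "d2": "cloud storage pricing"}
q = tf_vector("secure search")
# Rank documents by similarity to the query, highest first.
ranked = sorted(docs, key=lambda d: cosine(q, tf_vector(docs[d])),
                reverse=True)
```

In the secure schemes, the same score is computed over encrypted vectors, so the server can rank results without learning the terms.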
The growing number of applications requires the efficient execution of nearest neighbor queries constrained by the properties of spatial objects. Since keyword search is very popular on the Internet, these applications allow users to give a list of keywords that the spatial objects should contain. Such queries are called spatial keyword queries; each consists of a query area and a set of keywords. The IR2-tree combines the R-tree with signature files, so that each node of the tree carries both spatial and keyword information, and it efficiently answers top-k spatial keyword queries. A signature is added to every node of the tree, and an efficient algorithm answers queries using the tree: an incremental nearest-neighbor algorithm is used for traversal, and if a node's signature does not match the query signature, the whole subtree is pruned. However, the IR2-tree has drawbacks, such as false hits, where an object in the final result is far away from the query, and it is not suitable for handling ranking queries.
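The signature-based pruning idea can be sketched as follows. This toy version ignores the spatial (R-tree) side entirely and uses a simple superimposed bit signature per node; the class and function names are illustrative, and a final check against the real keywords stands in for the verification step that removes false hits.

```python
import zlib

BITS = 64

def signature(words):
    # Hash each keyword to one bit; superimposed coding ORs them.
    sig = 0
    for w in words:
        sig |= 1 << (zlib.crc32(w.encode()) % BITS)
    return sig

class Node:
    def __init__(self, words=None, children=None, doc=None):
        self.children = children or []
        self.doc = doc
        self.words = set(words or [])
        if self.children:
            # Internal node: OR of the children's signatures.
            self.sig = 0
            for c in self.children:
                self.sig |= c.sig
        else:
            self.sig = signature(self.words)

def search(node, query):
    qsig = signature(query)
    # Prune the subtree when not all query bits are present.
    if node.sig & qsig != qsig:
        return []
    if node.doc is not None:
        # Verify against real keywords to discard false hits.
        return [node.doc] if set(query) <= node.words else []
    out = []
    for c in node.children:
        out += search(c, query)
    return out

leaf1 = Node(words=["cafe", "pizza"], doc="p1")
leaf2 = Node(words=["sushi"], doc="p2")
root = Node(children=[leaf1, leaf2])
```

Because several keywords can hash to the same bit, a matching signature does not guarantee a matching object; that is exactly the false-hit problem the paragraph mentions.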
In this paper I introduce the definition of market index leader, defining it as the index which Granger-causes other indices but is not Granger-caused by any other index. I perform a time-series analysis to detect the existence of possible market index leaders in Asian financial markets. Many authors have already studied the interdependence amongst Asian stock markets (Chang et al. 1992, Pan et al. 1999, Manning 2002), and Granger-causality in financial markets has been studied in a few empirical works (Gu & Annala 2005, Herwany & Febrian 2008), but, to the best of my knowledge, an integration of these two fields of research has never been considered. This study investigates Granger-causality under different market conditions in order to detect whether this type of causality always exists or whether it is related to certain conditions. Furthermore, I aim to discover whether market leaders exist and whether they are the same in all the quartiles analysed.
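The Granger-causality underlying this definition is conventionally tested with a bivariate regression of the following standard textbook form (this is not an equation taken from the paper):

```latex
y_t = \alpha + \sum_{i=1}^{p} \beta_i\, y_{t-i} + \sum_{i=1}^{p} \gamma_i\, x_{t-i} + \varepsilon_t
```

Here $x$ is said to Granger-cause $y$ if the null hypothesis $H_0\colon \gamma_1 = \dots = \gamma_p = 0$ is rejected, i.e. if lagged values of $x$ improve the prediction of $y$ beyond $y$'s own lags. A market index leader is then an index for which this test rejects against every other index, while the symmetric test in the opposite direction never rejects.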
The amount of information in today's World Wide Web is constantly increasing, and this growing volume demands efficient and effective index structures. Most indexing techniques directly match terms from the documents against terms from the query. Granting efficient and fast access to the index is a key issue for the performance of web search engines, whose main aim is to provide the most relevant documents to users in the minimum possible time. Indexing is performed on web pages after they have been gathered into a repository by the crawler. The existing search engine architecture shows that the index is built from the terms of the documents: the indexer extracts the context of the documents collected by the crawler using the context repository, a thesaurus, and the ontology repository, and then the documents are indexed.
An efficient method for computing node proximity is one of the most challenging problems for many applications, such as recommendation systems and social networks. For large-scale, mutable datasets and user queries, top-k query processing has gained significant interest. Jaehui Park and Sang-Goo Lee present a novel method to find top-k answers in a node proximity search based on the well-known measure Personalized PageRank (PPR). First, they derive a distribution state transition graph (DSTG) to depict the iterative steps for solving the PPR equation. Second, they propose a weight distribution model of a DSTG to capture the states of intermediate PPR scores and their distribution. Using a DSTG, they selectively follow and compare multiple random paths of different lengths to find the most promising nodes. The limitation of this work is that it cannot be applied directly to XML documents.
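A plain power-iteration version of Personalized PageRank helps fix ideas; the DSTG-based method above works with the states of this iteration rather than running it to convergence for every query. The toy graph, damping value, and function names below are illustrative.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    # adj: {node: [out-neighbors]}; all restart mass goes to `source`.
    nodes = list(adj)
    pr = {n: 0.0 for n in nodes}
    pr[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        nxt[source] = alpha  # teleport back to the source node
        for n in nodes:
            out = adj[n]
            if not out:
                nxt[source] += (1 - alpha) * pr[n]  # dangling mass
                continue
            share = (1 - alpha) * pr[n] / len(out)
            for m in out:
                nxt[m] += share
        pr = nxt
    return pr

def top_k(adj, source, k):
    pr = personalized_pagerank(adj, source)
    return sorted(pr, key=pr.get, reverse=True)[:k]

adj = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
leaders = top_k(adj, "a", 2)  # most proximate nodes to "a"
```

The expense of iterating like this per query is precisely what motivates top-k methods that prune unpromising nodes early.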
In this paper I propose a method that, given a query submitted to a search engine, suggests a list of related queries. Query recommendation is a method to improve web search results. This paper presents a method for mining search engine query logs to obtain fast query recommendation on a large scale. Search engines generally return long lists of ranked pages, so finding the important information related to a particular topic is becoming increasingly difficult; optimized search engines have therefore become one of the most popular solutions available. In this work, an algorithm is applied to recommend queries related to the one submitted by the user. The technology enabling query recommendation is the query log, which contains attributes such as the query name, clicked URL, rank, and time. The similarity based on keywords as well as clicked URLs is calculated, and clusters are obtained by combining both similarities. The related queries are drawn from previously issued queries. The method not only discovers the related queries, but also ranks them according to a relevance criterion. In this paper the rank is updated only for the clicked URL, not for all the related URLs of the page.
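The combined keyword/clicked-URL similarity can be sketched as a weighted sum of two Jaccard overlaps. The weights, the log layout, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def query_similarity(log, q1, q2, w_kw=0.5, w_url=0.5):
    # Combine keyword overlap with clicked-URL overlap; the paper
    # clusters queries using both signals, weights here are a guess.
    kw = jaccard(q1.split(), q2.split())
    url = jaccard(log.get(q1, []), log.get(q2, []))
    return w_kw * kw + w_url * url

def recommend(log, query, k=3):
    # Rank previously issued queries by similarity to the new one.
    past = [q for q in log if q != query]
    return sorted(past, reverse=True,
                  key=lambda q: query_similarity(log, query, q))[:k]

# Toy query log: query -> clicked URLs.
log = {"cheap flights": ["kayak.com"],
       "flights to paris": ["kayak.com"],
       "python tutorial": ["docs.python.org"]}
```

Two queries with no word in common can still be recommended for each other if their users clicked the same URLs, which is the main benefit of combining the two signals.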
We attempted to test the limits of algorithm D/fp by making the values of all indexed attributes follow a 90:10 skewed distribution instead of the uniform distribution. With reasonably large node sizes (greater than 1K bytes), space utilization is very good regardless of the number of indexed attributes. Note that the decline in utilization is due to increased control information, not index term size (see Section 5.3). The size of the index is also very small: for node sizes of 0.5K, 1K, 2K, and 4K, only 5%, 2.5%, 1.4%, and 0.75% of the total number of hB^Π-tree nodes are index nodes, respectively.
The degree of sales channel digitalization describes “the combined share of Internet and mobile purchases in the specific country and retail segment” (Bovensiepen et al., 2015). This value is weighted at 20% of the overall index, so Strategy& considers it less important than consumer behavior. Since we are looking neither at countries nor at specific retail segments, this metric cannot be adopted for our index as it stands. However, comparing the amount of sales via the online and mobile channels with the sales through the offline channel might give us an insight into how digitalized a specific retailer is. It remains to be determined how easily this data is accessible to us.
EAD and BJ are only applied if there are nodes in the first path that satisfy the subpartition condition. Without a cell selector that favours subpartitions, they cannot be expected to be useful in general. Hence, a cell selector like DCS is needed to choose a good cell for individualization. In conauto-2.03, DCS is implemented in the following way. At node (𝐺, 𝜋), for each cell 𝑐 ∈ 𝜅(𝜋), it computes its size 𝑠 = |𝑐| and degree 𝑑 = 𝛿(𝐺, 𝜅(𝜋), 𝑐). For each pair of values (𝑠, 𝑑), one cell is selected as a candidate for individualization. From each such cell, it takes the first vertex V and computes the corresponding refinement 𝑅(𝐺, 𝜋 ↓ V). If it obtains a partition which is a subpartition of 𝜋, it selects that cell (and vertex) for individualization. If no such cell is found, it selects the cell (and vertex) that produces the partition with the largest number of cells. Observe that this function is not isomorphism-invariant (not all the vertices of a cell will always produce compatible colored graphs), and it has a non-negligible cost in both time and the number of additional nodes explored. However, it pays off, because the final search tree is drastically reduced for a great variety of graphs, and other techniques compensate for the overhead introduced.
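The candidate-selection loop can be sketched abstractly. The refinement and the subpartition test are passed in as callbacks, standing in for 𝑅(𝐺, 𝜋 ↓ V) and conauto's subpartition check; this is a simplified reading of the description above, not conauto's actual implementation.

```python
def select_cell(cells, degree, refine, is_subpartition):
    # cells: the cells of the current partition; degree(c) plays the
    # role of d = delta(G, kappa(pi), c); refine(v) returns the
    # partition after individualizing vertex v.
    seen = set()
    best, best_cells = None, -1
    for c in cells:
        key = (len(c), degree(c))
        if key in seen:          # one candidate per (size, degree) pair
            continue
        seen.add(key)
        v = c[0]                 # first vertex of the candidate cell
        p = refine(v)
        if is_subpartition(p):
            return c, v          # subpartition found: select at once
        if len(p) > best_cells:  # otherwise remember the best splitter
            best, best_cells = (c, v), len(p)
    return best

# Toy partition with mocked refinements (purely illustrative data).
cells = [[1, 2], [3, 4], [5]]
refinements = {1: [[1], [2], [3, 4, 5]],
               3: [[3], [1, 2, 4, 5]],
               5: [[5], [1, 2, 3, 4]]}
chosen = select_cell(cells, lambda c: c[0], refinements.get,
                     lambda p: p == refinements[3])
```

The fallback to the cell producing the most cells mirrors the text: when no candidate yields a subpartition, maximal splitting is the next best heuristic.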
The World Wide Web is the main source of information. One of the main reasons for its great expansion is the absence of strict rules for presenting information and the relative simplicity of the underlying technology. HTML (Hypertext Markup Language), the main markup language used to create documents, gives authors enough freedom to present any kind of data with minimal effort, and newer technologies such as Cascading Style Sheets (CSS) allow them to achieve the desired presentation quality. However, the diversity of data presentation forms on the web also has drawbacks: it increases the complexity of accessing, extracting, and effectively using all the information the web contains. Since the majority of search engines work over HTML, good extraction methods are needed for an IRS that responds appropriately to the needs of users.
The original Google crawler was developed at Stanford. Topical crawling was first introduced by Menczer, and focussed crawling by Chakrabarti. A focussed crawler must address two issues: (a) how to know whether a particular web page is relevant to a given topic, and (b) how to follow the links of a single page to retrieve further sets of pages. A search engine using the focussed crawling strategy was proposed based on the assumption that relevant pages contain only relevant links: it searches deeper where it finds relevant pages and stops searching at pages that are not relevant to the topic. However, such crawlers have a drawback: when the pages about a topic are not directly connected, crawling can stop at an early stage. They keep the overall number of web pages downloaded for processing to a minimum while maximizing the percentage of relevant pages. For high performance, the seed pages must be highly relevant; seed pages can also be selected from among the best results retrieved by a web search engine.
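The strategy described, expanding the most relevant pages first and cutting off below a relevance threshold, can be sketched as a best-first crawl over a link graph. The page-fetching and relevance-scoring functions are passed in as callbacks (a real crawler would download pages and score their text against the topic); all names and the toy graph are illustrative.

```python
import heapq

def focused_crawl(seeds, fetch_links, relevance, budget=20,
                  threshold=0.3):
    # Best-first frontier ordered by relevance (max-heap via negation).
    frontier = [(-relevance(u), u) for u in seeds]
    heapq.heapify(frontier)
    visited, collected = set(), []
    while frontier and len(collected) < budget:
        neg, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        if -neg < threshold:
            continue  # stop expanding pages judged off-topic
        collected.append(url)
        for link in fetch_links(url):
            if link not in visited:
                heapq.heappush(frontier, (-relevance(link), link))
    return collected

# Toy link graph and relevance scores.
graph = {"s": ["a", "b"], "a": ["c"], "b": [], "c": []}
rel = {"s": 1.0, "a": 0.9, "b": 0.1, "c": 0.8}
crawled = focused_crawl(["s"], lambda u: graph[u], lambda u: rel[u])
```

Note how page "b" is visited but never expanded: this is exactly the behaviour that makes the crawl cheap, and also the reason it can miss relevant pages reachable only through irrelevant ones.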
Abstract—Over the years, people have become more privacy conscious. Security requires constant effort; one simply cannot rely on some top-notch antivirus or firewall software. The best solution is therefore to encrypt the data before sending it to any third-party storage. In this scenario, where the user chooses to encrypt all data before outsourcing, performing regular operations such as search becomes a very hectic task, since one would have to decrypt all the data before actually starting the search. In this paper, we explore a technique that helps the user perform search over encrypted data. The main attraction of this technique is that even if the user cannot remember the exact keyword of a particular encrypted file, the user can still search with partially correct keywords. We also discuss some string matching algorithms in this paper.
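The "partially correct keyword" behaviour typically rests on approximate string matching such as edit distance. The sketch below isolates that step: the index maps keywords to encrypted-file identifiers, and keywords are kept in the clear purely for illustration (a deployed scheme would store only transformed keywords).

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_search(index, query, max_dist=1):
    # Return every file whose keyword is within max_dist edits.
    hits = set()
    for kw, files in index.items():
        if edit_distance(query, kw) <= max_dist:
            hits.update(files)
    return sorted(hits)

index = {"invoice": ["f1"], "report": ["f2"]}
```

With `max_dist=1`, the misspelled query "invoce" still retrieves the file indexed under "invoice".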
A search engine is designed to search for information and data on the World Wide Web or on FTP servers. The search engine helps in the process of retrieving the information, data, or other important material required by the user. The main goal of this paper is to analyse the efficiency of index and directory search engines and determine the best one. Nowadays, the search engine is one of the most popular Internet applications and is used by everyone, because the WWW provides information, data, and news on every important subject. A search engine returns its results as URLs, which lead to web pages, data, images, video, audio, information, and other types of file. Some of the returned links are related to the searched item and some are not. This research paper evaluates the performance of the index-based web search engine Google and the directory search engine Yahoo on their search results, comparing the retrieval effectiveness of the two. Precision and relative recall were used to evaluate performance, based on tests with general queries. We divided the search keywords into three groups, simple one-word, simple multi-word, and complex multi-word, and took two search keywords from each group.
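The two evaluation measures mentioned are straightforward to compute. The sketch below uses their standard definitions (precision as the fraction of retrieved links that are relevant; relative recall as one engine's relevant results over the pooled relevant results of all compared engines); the numbers are made up for illustration, not taken from the paper.

```python
def precision(relevant_retrieved, retrieved):
    # Fraction of the returned links that are relevant to the query.
    return relevant_retrieved / retrieved if retrieved else 0.0

def relative_recall(relevant_by_engine, engine):
    # One engine's relevant results over the total relevant results
    # found by all engines under comparison.
    total = sum(relevant_by_engine.values())
    return relevant_by_engine[engine] / total if total else 0.0

p = precision(7, 10)  # e.g. 7 of the first 10 returned links relevant
rr = relative_recall({"google": 7, "yahoo": 5}, "google")
```

Relative recall is used here because true recall would require knowing every relevant page on the web for a query, which is infeasible.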
A few years ago, finding information on the web meant searching by means of hyperlinks or via search engines. Today, people talk about different versions of the web, namely Web 2.0, Web 3.0, and the semantic web. Most people are unaware of the differences between the versions; there are some misconceptions about their similarities or differences, the biggest being that the terms Web 2.0 and semantic web mean the same thing (Beal, 2010). If 10 people are asked about Web 2.0, one is likely to get 10 different definitions, because the term was never clearly defined (Spivack, 2012). The web has evolved through different stages, each adding functionality to make life easier. The first stage, the Web or Web 1.0, was about getting on the Internet and connecting information (Davis, 2008); Web 2.0 is a social thing (Beal, 2010), focusing on people collaborating and sharing information online, as on Facebook, launched in 2004, and Twitter, launched in 2006 (Wong, 2011); in terms of technology there are only minor differences between Web 1.0 and Web 2.0 (Fensel, Facca, Simperl, & Toma, 2011); whereas Web 3.0 is about open and structured data.