Continuous Top-k Queries over Real-Time Web Streams

time decay and propose several alternative index structures for implementing event matching in the Event Handler (see Figure 2) that account for the dynamic score of items due to social feedback. Finally, [26] considers the problem of annotating real-time news stories with social posts (tweets). The proposed solution adopts a top-k publish-subscribe approach to accommodate high rates of both pageviews (millions to billions a day) and incoming tweets (more than 100 million a day). Compared to our setting, stories are treated as continuous queries over a stream of tweets, while pageviews are used to dynamically compile the top-k annotations of a viewed page (rather than for scoring the news). The published items (e.g., tweets) do not have a fixed expiration time; instead, time is part of the relevance score, which decays as time passes, and older items retire from the top-k only when new items that score higher arrive. This work adapts, optimizes, and compares popular families of IR algorithms, namely TAAT (term-at-a-time), where document/tweet scores are computed concurrently one term at a time, and DAAT (document-at-a-time), where the score of each document is computed separately, without addressing dynamic scoring aspects of top-k continuous queries beyond information recency.

Top-k publish/subscribe on spatio-textual streams: AP (Adaptive spatial-textual Partition)-Trees [30] aim to efficiently serve spatial-keyword subscriptions in location-aware publish/subscribe applications (e.g., registered users interested in local events). They adaptively group user subscriptions using keyword and spatial partitions, where the recursive partitioning is guided by a cost model. As users update their positions, subscriptions are continuously moving. Furthermore, [29] investigates the real-time top-k filtering problem over streaming information. The goal is to continuously maintain the top-k most relevant geo-textual messages (e.g., geo-tagged tweets) for a large number of spatial-keyword subscriptions simultaneously. Dynamicity in this body of work relates more to the queries than to the content and its relevance score, and, in contrast to our work, user feedback is not considered. [6] presents SOPS (Spatial-Keyword Publish/Subscribe System) for efficiently processing spatial-keyword continuous queries. SOPS supports Boolean Range Continuous (BRC) queries (Boolean keyword expressions over a spatial region) and Temporal Spatial-Keyword Top-k Continuous (TaSK) queries over geo-textual object streams. Finally, [5] proposes a temporal publish/subscribe system considering both spatial and keyword factors. However, the top-k matching semantics in these systems is different from ours (i.e., boolean filtering).
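The TAAT/DAAT distinction can be made concrete with a short sketch. Below is a minimal DAAT-style scorer, assuming a toy in-memory posting format (term -> {doc_id: weight}); it illustrates the evaluation order only and is not the implementation from [26]:

```python
import heapq

def daat_topk(postings, query_terms, k):
    """Document-at-a-time scoring: accumulate each document's full score
    before moving to the next, keeping the k best in a min-heap."""
    # Gather candidate documents from all relevant posting lists.
    candidates = set()
    for term in query_terms:
        candidates.update(postings.get(term, {}))
    heap = []  # min-heap of (score, doc_id)
    for doc in candidates:
        score = sum(postings[t].get(doc, 0.0) for t in query_terms if t in postings)
        heapq.heappush(heap, (score, doc))
        if len(heap) > k:
            heapq.heappop(heap)  # drop the current (k+1)-th best
    return sorted(heap, reverse=True)  # best first

# Toy data: term -> {doc_id: term weight}
postings = {
    "storm": {1: 0.5, 2: 0.125},
    "flood": {1: 0.25, 3: 1.0},
}
```

A TAAT scorer would instead walk one posting list at a time, accumulating partial scores for all documents before moving to the next term.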

Evaluating Continuous Monitoring of Top k Queries Using Sliding Window

The top-k monitoring algorithm comprises three modules: a grid-based index structure, a top-k algorithm module, and a maintenance module. The grid-based index is represented in 2-dimensional space. The grid structure contains cells, and each grid cell contains points. Each point p consists of the attributes <p.id, p.x, p.y, p.t>, where id is a unique identifier, x and y are the coordinate attributes, and t is the arrival time. The grid-based index permits continuous processing of numerous queries and avoids expensive update costs. Grid structures can be broadly classified into regular and irregular grids; the advantage of the regular grid is that insertions and deletions are processed efficiently. It is important to supply an efficient mechanism for removing expiring records. In count-based and time-based sliding windows, tuples are removed in First In First Out (FIFO) order. All records are stored in a single list, and new arrivals are placed at the end of the list. The tuples that drop out of the window are evicted from the front of the list.
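The grid-plus-window design described above can be sketched as follows; the cell width, tuple layout, and time-based eviction rule are illustrative assumptions, not the paper's exact structures:

```python
from collections import deque

CELL = 10.0  # width of a regular grid cell (illustrative)

def cell_of(x, y):
    """Map a point to its regular-grid cell coordinates."""
    return (int(x // CELL), int(y // CELL))

class SlidingWindowGrid:
    """Points stored once in arrival (FIFO) order; the grid maps
    cells to live point ids for spatial lookups."""
    def __init__(self, window):
        self.window = window   # time-based window length
        self.order = deque()   # (t, pid, cell) in FIFO arrival order
        self.cells = {}        # cell -> set of live point ids

    def insert(self, pid, x, y, t):
        c = cell_of(x, y)
        self.order.append((t, pid, c))
        self.cells.setdefault(c, set()).add(pid)
        self.expire(t)

    def expire(self, now):
        # FIFO eviction: the oldest tuples fall out of the window first.
        while self.order and self.order[0][0] <= now - self.window:
            _, pid, c = self.order.popleft()
            self.cells[c].discard(pid)
```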

An Optimal Strategy for Evaluating Continuous Top k Monitoring on Document Streams

CTQDs from various users. Each CTQD specifies a set of keywords, as explicitly given by the issuing user or extracted from their online behaviour [4], [5]. The task of the server is to continuously refresh the top-k results of each query. Previous studies on snapshot top-k queries revealed that, for sparse types of data, it may be more effective to sort the lists of the inverted file by document ID [10], thus enabling "jumps" within the relevant lists, i.e., disregarding contiguous fractions of the lists. This is an interesting fact which, however, is not directly applicable to continuous top-k queries. An application of ID-ordering to document streams would incur costly index maintenance, and it would also require repetitive query reevaluation, as it entails no mechanism to reuse past query results in response to updates.
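The ID-ordered "jumps" mentioned here can be illustrated with a binary-search-based intersection of two ID-sorted posting lists; this is a generic sketch of the idea, not the algorithm of [10]:

```python
from bisect import bisect_left

def intersect_id_ordered(short, long):
    """Intersect two posting lists sorted by document ID.
    For each ID in the shorter list, a binary search 'jumps' over
    contiguous runs of the longer list instead of scanning them."""
    result, lo = [], 0
    for doc in short:
        i = bisect_left(long, doc, lo)
        if i < len(long) and long[i] == doc:
            result.append(doc)
        lo = i  # never revisit the skipped prefix
    return result
```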

Dictionary Compression: An Optimal Answering for Continuous Top k Dominating Queries

The top-k query has, since its introduction, been one of the most popular preference data retrieval schemes [1]. Basically, top-k queries use a function to define a score for each record in the dataset. The records are then sorted according to these scores, and the top k records are returned to the users. The top-k query has found application in many fields such as information retrieval [2], and a comprehensive survey of top-k queries in databases is available in the literature. A data structure called the Pareto-Based Dominant Graph (commonly known as DG) was proposed in consideration of the intrinsic connection between the top-k problem with aggregate monotone functions and the dominance relationship; as a result, DG can be used to reduce the search space in order to answer a top-k query. A threshold-based technique was proposed for processing a top-k query in a highly distributed environment, with the aim of minimizing the amount of data transferred between nodes [3]. The authors of [4] proposed the reverse top-k query, which addresses the problem of determining the weighting vectors for which a given record is on the top-k list of a dataset. Related research can also be found in the skyline query, which provides users with a set of data that is prominent based on the concept of the dominance relationship; extensive research results have since been reported in the literature. In [5], two algorithms, called Bitmap and Index, were proposed to progressively provide skyline points. Other authors proposed a nearest neighbor search-based technique to tackle the problem without having to read the entire dataset. Instead of a centralized system environment, the processing of a skyline query has also been adapted to peer-to-peer systems [6]. As an alternative, [7] proposed the concept of the representative skyline, which returns k skyline points that best describe the tradeoffs among the different dimensions offered by the full skyline.
Furthermore, [8] presented a technique to answer a skyline query using the Z-order curve approach. TKDQ is a relatively new topic in the field of data management. Several algorithms based on the aggregate R-tree were proposed to process the query; in this scheme, a pruning technique was used to reduce the search space needed to obtain the query result [9]. The probabilistic top-k dominating query was later introduced to address uncertain datasets. The introduction of streaming models presents new challenges for the processing of data queries, in which updated query results must be returned continuously [10]. The authors of [11] provided a review of some important preference queries over data streams. The proposed method returns the k records with the highest domination scores.
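The dominance relation underlying both skyline and top-k dominating queries can be sketched directly. The brute-force scorer below is illustrative only (practical algorithms use index-based pruning such as the aggregate R-tree mentioned above); smaller values are assumed to be better:

```python
def dominates(p, q):
    """p dominates q if p is at least as good in every dimension
    and strictly better in at least one (here: smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def topk_dominating(points, k):
    """Score each point by how many distinct others it dominates,
    then return the k points with the highest domination scores."""
    scores = [(sum(dominates(p, q) for q in points if q is not p), p)
              for p in points]
    scores.sort(key=lambda s: -s[0])
    return scores[:k]
```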

Top-k/w publish/subscribe: A publish/subscribe model for continuous top-k processing over data streams

Top-k/w publish/subscribe can be applied in a number of real-world applications: 1) Consider an application which aggregates product advertisements from a number of auction sites. A user may specify properties of an "ideal" product he/she is interested in, while the application informs the user in real time about newly appearing advertisements that are most similar to his/her "ideal" product. Furthermore, the user can specify the number of advertisements he/she is willing to receive per day. 2) A personal information filtering engine continuously checks for updates from various RSS feeds and blogs, and delivers the k most relevant updates to users during the course of a day, at the time when those updates are published. 3) Another possible application area is real-time environmental monitoring. With top-k/w publish/subscribe, environmental scientists can identify and monitor up to k sites with the largest pollution readings over the course of a single day, or identify a predefined number of sensors closest to a particular location measuring the largest pollution levels over time (e.g., top-10 readings per hour). All these example applications require continuous processing of top-k queries over largely distributed data sources generating data objects, possibly at high publication rates.
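A minimal sketch of the top-k/w idea: keep the k best-scoring objects among the w most recent publications. The count-based window and the similarity callback are assumptions for illustration, not the model's full semantics:

```python
import heapq

def topk_over_window(stream, query, k, w, sim):
    """For each arriving object, report the k objects most similar to
    the subscription among the w most recent publications."""
    window = []  # (arrival index, object), newest last
    for i, obj in enumerate(stream):
        window.append((i, obj))
        if len(window) > w:
            window.pop(0)  # the oldest publication expires
        yield heapq.nlargest(k, (sim(query, o) for _, o in window))
```

For instance, with similarity defined as negative distance to an "ideal" value, each step yields the current top-k scores within the window.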

Continuous Top-K Monitoring on Document Streams

ABSTRACT: In recent years, there has been a strong trend of incorporating results from structured data sources into keyword-based web search systems such as Amazon or Google. When presenting ordered data, facets are a powerful tool for navigating, refining, and grouping the results. For a given structured data source, a fundamental problem in supporting faceted search is finding an ordered selection of attributes and values that will populate the facets. This creates two sets of challenges: first, because of the restricted screen real estate, it is vital that the top few facets best match the probable user intent; second, the massive scale of the data available to engines like Amazon or Google demands an automated, unsupervised solution. In this work we propose efficient evaluation techniques for continuous top-k queries over text and feedback streams featuring generalized scoring functions which capture dynamic ranking aspects. As a first contribution, we generalize state-of-the-art continuous top-k query models by introducing a general family of non-homogeneous scoring functions combining query-independent item importance with query-dependent content relevance and continuous score decay reflecting information freshness. Our second contribution consists in the definition and implementation of efficient in-memory data structures for indexing and evaluating this new family of continuous top-k queries. Our experiments show that our solution is scalable and outperforms existing state-of-the-art solutions when restricted to homogeneous functions. Going a step further, in the second part of this thesis we consider the problem of incorporating dynamic feedback signals into the original scoring function and propose a new general real-time query evaluation framework with a family of new algorithms for efficiently processing continuous top-k queries with dynamic feedback scores in a real-time web context.
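The non-homogeneous scoring family described here can be sketched as follows. The field names, the additive combination, and the exponential half-life decay are assumptions for illustration; the thesis's exact function may differ:

```python
import math

def score(item, query, now, half_life=3600.0):
    """Non-homogeneous score: query-independent importance plus
    query-dependent relevance, damped by exponential time decay.
    `item` is a dict with 'published', 'terms', 'importance' (assumed layout);
    `query` is a set of terms."""
    decay = math.exp(-math.log(2) * (now - item["published"]) / half_life)
    relevance = len(query & item["terms"]) / max(len(query), 1)
    return item["importance"] + relevance * decay
```

After one half-life, the query-dependent part of the score is halved while the static importance term is unchanged, which is what makes the function non-homogeneous.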

An Efficient Strategy for Monitoring Top k Queries in Document Streaming

Google Alerts (google.com/alerts) and Yahoo! Alerts (alerts.yahoo.com) attest to the importance of these applications. However, these systems either operate in a semi-offline fashion by delivering periodic updates (e.g., daily) or support only coarse filtering (e.g., based on general topics rather than sets of specific keywords). Another application area for CTQDs is microblog continuous search services. Currently, these services enable the client to query (in an on-demand, one-off manner) for posts that match a set of keywords. CTQDs could extend the functionality of these services by offering continuous monitoring and notifications about new posts that match the keywords. In traditional text search, there are snapshot (i.e., one-off) top-k queries over static document collections. The inverted file is the standard index for organizing documents. It comprises a list for each term in the dictionary; the list for a term holds an entry for each document that contains the term. By ordering the lists by decreasing term frequency, and with appropriate use of thresholding, a snapshot query can be answered by processing only the top parts of the relevant lists. Due to this ordering, we refer to that paradigm as frequency-ordering. This common practice for snapshot queries has been followed by most approaches for continuous top-k search, though adapted to the "standing" nature of the continuous queries and the highly dynamic characteristics of the document stream. In this work, we depart from frequency-ordering and adopt a different paradigm, namely identifier-ordering (ID-ordering). Past studies on snapshot top-k queries revealed that, for sparse kinds of data, it may be more efficient to sort the lists of the inverted file by document ID, thereby enabling "jumps" within the relevant lists, i.e., disregarding contiguous portions of the lists.
This is an interesting fact which, however, is not directly applicable to continuous top-k queries. An application of ID-ordering to document streams would bring about expensive index maintenance, and it would also require repetitive query reevaluation.
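Frequency-ordering with thresholding can be sketched with a TA-style scan: lists sorted by decreasing weight are read from the top, and scanning stops once the sum of the weights at the current scan depth (an upper bound on any unseen document's score) cannot beat the current k-th score. This is a generic illustration of the paradigm, not the paper's algorithm:

```python
import heapq
from collections import defaultdict

def freq_ordered_topk(lists, k):
    """lists: term -> [(weight, doc_id)] sorted by decreasing weight.
    Scan lists level by level from the top; fully score each newly seen
    document by random access; stop early via the threshold test."""
    scores = defaultdict(float)
    depth = 0
    while True:
        threshold, advanced = 0.0, False
        for plist in lists.values():
            if depth < len(plist):
                w, doc = plist[depth]
                if doc not in scores:
                    # Random access: sum this document's weight in every list.
                    scores[doc] = sum(
                        dict((d, wt) for wt, d in pl).get(doc, 0.0)
                        for pl in lists.values())
                threshold += w
                advanced = True
        depth += 1
        top = heapq.nlargest(k, scores.values())
        if not advanced or (len(top) == k and top[-1] >= threshold):
            break  # no unseen document can enter the top-k
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))
```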

Processing count queries over event streams at multiple time granularities

Management and analysis of streaming data has become crucial with applications to the web, sensor data, network traffic data, and the stock market. Data streams consist mostly of numeric data, but what is more interesting are the events derived from the numerical data that need to be monitored. The events obtained from streaming data form event streams. Event streams have similar properties to data streams, i.e., they are seen only once, in a fixed order, as a continuous stream. Events appearing in the event stream have timestamps associated with them at a certain time granularity, such as second, minute, or hour. One type of frequently asked query over event streams is the count query, i.e., the frequency of an event occurrence over time. Count queries can be answered over event streams easily; however, users may ask queries over different time granularities as well. For example, a broker may ask how many times a stock increased in the same time frame, where the time frames specified could be an hour, a day, or both. Such types of queries are challenging, especially in the case of event streams, where only a window of an event stream is available at a certain time instead of the whole stream. In this paper, we propose a technique for predicting the frequencies of event occurrences in event streams at multiple time granularities.
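Rolling counts up from a fine granularity to coarser ones is straightforward when the whole history is available, which is what makes the windowed streaming case interesting. The sketch below shows the exact (non-streaming) roll-up by integer division over timestamps; granularities in seconds are an assumption:

```python
from collections import Counter

def counts_at(timestamps, granularity):
    """Roll event timestamps (in seconds) up to a coarser granularity,
    returning occurrence counts per bucket."""
    return Counter(t // granularity for t in timestamps)

# Toy stream of event timestamps in seconds.
events = [3, 59, 61, 3599, 3601, 7000]
per_minute = counts_at(events, 60)    # minute buckets
per_hour = counts_at(events, 3600)    # hour buckets
```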

Analysis of Emotions in Real-time Twitter Streams

In order to analyze and visualize the tweets based on location information, EmoTwitter uses the Google Maps API, an interface for interacting with the backend web service of Google Maps [11]. Taking a location string as input, EmoTwitter 2.0 queries Google Maps to retrieve the geographical coordinates of the string (presuming that Google Maps is able to interpret the string as a location). Next, EmoTwitter receives the values of latitude and longitude as a response. Using these values and the streaming API, the system retrieves tweets posted near the specified geographical location. Currently the system returns all the tweets inside a square having the specified point at its center and sides of 1.0-mile length.
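The 1.0-mile square around a point can be approximated with simple spherical-Earth arithmetic; the 69-miles-per-degree constant is a common approximation, and this sketch is not EmoTwitter's actual code:

```python
import math

def square_around(lat, lon, side_miles=1.0):
    """Approximate bounding square centered on (lat, lon), returned as
    (min_lat, min_lon, max_lat, max_lon). Uses ~69 miles per degree of
    latitude; longitude degrees shrink by cos(latitude). Adequate for
    small boxes away from the poles."""
    half = side_miles / 2.0
    dlat = half / 69.0
    dlon = half / (69.0 * math.cos(math.radians(lat)))
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)
```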

Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams

We define the semantics of sliding window joins as monotonic queries over append-only relations; therefore, once a result tuple is produced, it remains in the answer set indefinitely. Hence, new results may be streamed directly to the user. In particular, for each newly arrived tuple k, the join is to probe all tuples present in the sliding windows precisely at the time of k's arrival (i.e., those tuples which have not expired at time equal to k's timestamp), and return all (composite) tuples that satisfy every join predicate. Moreover, we impose a time bound τ on the time interval from the arrival of a new tuple k until the time when all join results involving k and all tuples older than k have been streamed to the user. This means that a query must be re-executed at least every τ time units. However, it does not mean that all join results containing k will be available at most τ units after k's arrival; k will remain in the window until it expires, during which time it may join with other (newer) tuples.
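The probe-on-arrival semantics can be sketched as follows; the tuple layout (dicts with a "ts" field) and the time-based expiration test are illustrative assumptions:

```python
def window_join(left_window, right_window, new_tuple, side, predicate, now, w):
    """On arrival of new_tuple on `side` ('L' or 'R'), probe the opposite
    window for tuples unexpired at the arrival timestamp, and emit all
    composite results satisfying the join predicate."""
    own, other = ((left_window, right_window) if side == "L"
                  else (right_window, left_window))
    own.append(new_tuple)  # new_tuple stays until it expires from its window
    live = [t for t in other if t["ts"] > now - w]  # unexpired at arrival time
    return [(new_tuple, t) if side == "L" else (t, new_tuple)
            for t in live if predicate(new_tuple, t)]
```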

Filtering Real-Time Linked Data Streams

C-SPARQL (Continuous SPARQL) is an extension of SPARQL. It enables running SPARQL queries continuously on RDF streams. It achieves this by adding a timestamp to RDF triples when they are added to the stream, creating a new data structure for RDF streams: an ordered list of triple-timestamp pairs. The window clause in this query language makes it possible to filter and compute results on a specific subset of the stream after regular time intervals. It has two parts: a range and a step. The range is the size of the window, and the step is the frequency of execution of the query. So, for example, a range of 10 seconds and a step of 5 seconds means that the query is executed every 5 seconds on triples whose timestamp is not older than 10 seconds. C-SPARQL also adds a FROM STREAM keyword for identifying the streams of RDF triples. An example of a C-SPARQL query is shown in Figure 6. This query finds every organization of which a person working for organisation 11215399 is a member.
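The RANGE/STEP semantics of the 10-second/5-second example can be reproduced over plain (triple, timestamp) pairs; this sketch mimics C-SPARQL's windowing in Python rather than executing actual SPARQL:

```python
def windows(triples, t_range, step, t_end):
    """Evaluate every `step` time units over the triples whose timestamp
    lies within the last `t_range` units (RANGE/STEP-style windowing).
    `triples` is a list of (triple, timestamp) pairs."""
    for now in range(step, t_end + 1, step):
        yield now, [tr for tr, ts in triples if now - t_range < ts <= now]
```

With range 10 and step 5, each evaluation at time t sees exactly the triples timestamped in (t - 10, t].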

It's About Time: Link Streams as Continuous Metadata

emerging MPEG standards effectively promote embedded metadata, and we anticipate that Web developers will instinctively embed URLs in media streams. In many applications there is a case for transporting digital media in a composite form including metadata, just as HTML with embedded links is an effective transport and delivery format. However, when we have multiple streams, described by one or more metadata flows, operating in a real-time scenario, there is a compelling case for handling the metadata separately. Our future work is to realise the distributed metadata machinery using multicast and to evaluate the approach. The existing experimental systems will also be extended to further explore integration with other OHS components, particularly content-based navigation and the application of ontologies. The distributed architecture raises systems infrastructure issues, especially with respect to timely delivery when working with mediadata and metadata streams together.

TOP-K DOMINATING QUERY PROCESSING OVER DISTRIBUTED DATA STREAMS Guidan Chen *, Yongheng Wang

With the rapid development of social networks, the Internet of Things, the mobile Internet, etc., a huge amount of data has been produced, and the era of big data has arrived [1]. Big data is characterized by huge data volumes, high real-time requirements, and low value density, and the value of the data decreases with the passage of time. Therefore, how to process data streams in real time and obtain more valuable information is the key question of streaming big data research. Top-k queries and skyline queries are the most widely used in the streaming data environment.

Distributed processing of continuous spatiotemporal queries over road networks

Abstract This paper presents a distributed algorithm for processing continuous spatiotemporal queries. Distributed query processing that involves multiple servers is motivated by (1) the need for scalability in terms of supporting a large number of moving objects and a large number of queries, and (2) the need for providing real-time answers to users' queries. To rapidly answer queries, incoming data streams are processed in-memory. The load is distributed among a set of regional servers that collaborate to continuously answer the spatiotemporal queries. The algorithm assumes that object movement is constrained by a road network and that the continuous queries are stationary, i.e., the queries' focal points do not move. Continuous queries are evaluated incrementally by monitoring network positions that are part of existing queries' search regions. The performance of the proposed algorithm is measured to analyze its behavior under various workloads and distributed system settings.

SOLE: Scalable On-Line Execution of Continuous Queries on Spatio-temporal Data Streams

The most related work to SOLE in the context of spatio-temporal databases is the SINA framework [39]. SOLE has functionality in common with SINA: both utilize a shared grid structure as a basis for shared execution and incremental evaluation paradigms. However, SOLE distinguishes itself from SINA and other spatio-temporal query processors in the following aspects: (1) SOLE is an in-memory algorithm where all the employed data structures are built in memory, while SINA is a disk-based query processing technique that mainly relies on disk storage to perform its operations. (2) Due to memory size limitations, not all objects are actually stored in SOLE; in SINA, on the other hand, all data objects are physically stored. (3) As some data objects are not stored, SOLE suffers from uncertainty areas in its queries, where part of the query area may not be aware of the existence of some moving objects. Such a scenario cannot happen in SINA, as it is proven to be correct based on knowledge of all stored objects. (4) SOLE is an online algorithm that produces the incremental result upon any change in the location of the query and/or objects. This online feature is in contrast to SINA, which buffers all the updates for the last T time units and processes them as a bulk. The online behavior of SOLE makes it suitable to be encapsulated into a pipelined operator, while the bulk behavior of SINA hinders its applicability for implementation inside real systems. (5) SOLE is equipped with load shedding techniques to cope with intervals of high arrival rates of moving objects and/or queries. The main idea is to drop some data objects from memory to allow for supporting more queries with an approximate answer. Such load shedding cannot be supported in SINA, as it is mainly a disk-based algorithm and does not suffer from limited storage space.

Management of Data Streams for a Real Time Flood Simulation

Applications that generate rapid, continuous, and large volumes of stream data include ATM and credit card operations, financial tickers, web server log records, and readings from sensors used in a variety of applications, such as High Energy Physics experiments, weather sensors, and network sensors. Most of this data is archived in a database at an off-site warehouse, making accesses prohibitively expensive [1]. The ability to make decisions and infer interesting patterns in real time is crucial for several mission-critical tasks.


QoE-Driven Network Management For Real-Time Over-the-top Multimedia Services

The test demonstrates very well the benefit of a QoE-driven management system. While the bulk download adapts to the varying bandwidth of the RTP stream even without the management system, both streams lose packets because TCP congestion control does not throttle down until some packets are lost. While this is not so severe for the bulk transfer, the quality of the video stream decreases significantly. One could think that the RTP stream could simply be promoted to a higher traffic class, which would result in the same outcome. For a simple example, this is true to some extent. However, when several different applications enter the network with different types of constraints, we must identify not only the required share of resources per application but also how much we can reduce the resources without a loss in perceived quality. This results in a greater utilization of the link, or in other words, more applications with the same amount of resources. By using QoE as a trigger for management, excess resources will be distributed only to applications which truly need them to keep the users satisfied.

k-Nearest Neighbors Queries in Time-Dependent Road Networks

Instead of using an A* search to calculate each time-dependent distance from a query q to each point of interest in a set of candidates, we incorporate an A* search directly into an incremental network expansion. Thus our search has multiple and unknown targets. An incremental strategy avoids re-computing costs previously calculated and visits only necessary edges. In our algorithm, the heuristic function adds to each vertex an estimate of its potential to be part of the fastest path that leads to a nearest point of interest. The idea behind our search is to avoid continuing to expand nodes on a path that is fast but far from any point of interest in the network. We are motivated by the fact that a vertex u may be the closest node to q, but this does not imply that we would find the next nearest POI when expanding u. To explain how our method works, we need to introduce some definitions.

Efficient and Scalable Method for Processing Top-k Spatial Boolean Queries

Figure 3 shows the average number of I/Os and elapsed time over 50 k-SB queries of each workload type {S, M, L} for every dataset. IFC shows a performance advantage when query terms are relatively infrequent (S): short posting lists can be quickly evaluated to compute the query result, whereas SKI and RIF spend additional R-tree traversals. When query terms become more frequent (M and L), IFC incurs expensive merges of long posting lists, which is observed in the peaks of Figure 3.a for L queries. RIF performs acceptably for S queries but degrades for M and L queries. This may be due to filtering limitations in the upper R-tree levels: eventually, subtrees known (via the inverted file) to contain the query terms are traversed, but the terms may belong to different objects, i.e., no single object satisfies the B predicate. In the same vein, RIF must wait until objects are retrieved to apply NOT-semantics filters, which can also degrade its performance. In summary, we observed consistently enhanced retrieval performance using the proposed hybrid indexing and query processing methods.

LotterySampling: a novel algorithm for the heavy hitters and the top-k problems on data streams

Regarding the time cost that Valgrind reports: the value is not execution time, but the number of instructions executed in a virtual machine, accounting for cache misses, data reads, etc. To measure the accuracy of the algorithms, it is necessary for the Python program to keep a counter for each element in the stream, in order to compute the heavy hitters and their exact frequencies. This is precisely what all the proposed algorithms intend to avoid, since it requires a lot of memory. Hence, running experiments that measure the accuracy of the algorithms is very time-consuming, because the computer running the tests goes into swap memory. The executions using Valgrind are also very slow due to the virtualization. Some of the tests have therefore been performed on EC2 instances in Amazon Web Services, using hosts with more memory and more resources to address these problems.
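For contrast with the exact per-element counting baseline discussed here, the classic Space-Saving summary (a standard heavy-hitters algorithm, shown as a well-known point of comparison, not LotterySampling itself) keeps only m counters regardless of stream length:

```python
def space_saving(stream, m):
    """Space-Saving summary with m counters. Each tracked element maps to
    (count, error): `count` overestimates the true frequency by at most
    `error`, which is the count inherited from the evicted counter."""
    counters = {}  # element -> (count, overestimation error)
    for x in stream:
        if x in counters:
            c, e = counters[x]
            counters[x] = (c + 1, e)
        elif len(counters) < m:
            counters[x] = (1, 0)
        else:
            # Replace the minimum counter; inherit its count as error bound.
            victim = min(counters, key=lambda k: counters[k][0])
            c, _ = counters.pop(victim)
            counters[x] = (c + 1, c)
    return counters
```

This is why accuracy experiments need the exact counts: the summary's frequencies are only guaranteed within the stored error bounds.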
