Analysis of large time-series data in OpenTSDB

There are several tools specialized for the analysis of time series. R is statistical software with extensive features for analyzing time-series data. In the open-source community, two popular tools are opentsdbr [10] and the StatsD OpenTSDB publisher backend [11]. Both query data from OpenTSDB through its HTTP/JSON API. This API is only useful for small-scale analysis: its non-distributed implementation creates performance bottlenecks for real-world applications, it requires large amounts of memory to store time-series data on the client side, and it is time consuming because all data must be transferred over the network interface. For visual analysis, both systems use third-party R packages to display high-dimensional time-series data. Some of the most common time-series analysis tools are GRETL (GNU Regression, Econometrics and Time-series Library) [12], TimeSearcher [13], Calendar-Based Visualisation [14] and Spiral [15], but they are not specialized for real-world time-series analysis. These tools are not designed for a distributed programming model: they run on a single node, so tasks cannot be distributed, and statistical analysis of massive amounts of data with them can take days.
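For illustration, a query against the HTTP/JSON API described above can be sketched as follows. The host name, port and metric are hypothetical; the URL shape follows OpenTSDB's `/api/query` convention, and every returned data point travels over the network to the client, which is the bottleneck noted above.

```python
import json
import urllib.parse
import urllib.request

def build_query_url(host, start, metric, aggregator="sum", tags=None):
    """Build an OpenTSDB /api/query URL (HTTP/JSON API)."""
    m = f"{aggregator}:{metric}"
    if tags:
        m += "{" + ",".join(f"{k}={v}" for k, v in sorted(tags.items())) + "}"
    qs = urllib.parse.urlencode({"start": start, "m": m})
    return f"http://{host}:4242/api/query?{qs}"

def fetch_series(url):
    """Fetch and decode the JSON response; all data points are
    materialized client-side, hence the memory cost discussed above."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# hypothetical host and metric, for illustration only
url = build_query_url("tsdb.example.com", "1h-ago", "sys.cpu.user",
                      tags={"host": "web01"})
```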

Time Series Analysis


The idea of smooth transition regression models is based on the observation that many economic variables are sluggish and will not move until some state variable exceeds a certain threshold. For example, price arbitrage in markets will only set in once the expected profit of a trade exceeds the transaction cost. This observation has led to the development of models with fixed thresholds that depend on some observable state variable. Smooth transition models allow for the possibility that this transition occurs not all of a sudden at a fixed threshold, but gradually, as one would expect in time series data that have been aggregated across many market participants. A simple example is the smooth-transition AR(1) model:
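As a sketch, one standard logistic formulation of a smooth-transition AR(1) can be simulated as follows; all parameter values are illustrative, and the logistic transition function is an assumption (the excerpt's own equation is not shown).

```python
import numpy as np

def simulate_lstar(n, phi1, phi2, gamma, c, sigma=1.0, seed=0):
    """Simulate a logistic smooth-transition AR(1):
         y_t = phi1*y_{t-1} + phi2*y_{t-1}*G(y_{t-1}) + eps_t,
         G(y) = 1 / (1 + exp(-gamma*(y - c))).
    As gamma grows, G approaches a step function and the model
    collapses to a fixed-threshold (TAR) model; small gamma gives the
    gradual transition described above."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    for t in range(1, n):
        G = 1.0 / (1.0 + np.exp(-gamma * (y[t - 1] - c)))
        y[t] = phi1 * y[t - 1] + phi2 * y[t - 1] * G \
            + sigma * rng.standard_normal()
    return y

# regimes with AR coefficients 0.3 and 0.3 + 0.5 = 0.8 (both stationary)
y = simulate_lstar(500, phi1=0.3, phi2=0.5, gamma=2.0, c=0.0)
```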

Time Series Analysis


where d < 1/2 and d ≠ 0. Such ‘long memory’ models may be estimated by the two-step procedure of Geweke and Porter-Hudak (1983) or by maximum likelihood (Sowell, 1992; Baillie, Bollerslev, and Mikkelsen, 1996). A detailed discussion including extensions to the notion of fractional co-integration is provided by Baillie (1996). Long memory may arise, for example, from infrequent stochastic regime changes (Diebold and Inoue, 2001) or from the aggregation of economic data (Granger, 1980; Chambers, 1998). Perhaps the most successful application of long-memory processes in economics has been work on modeling the volatility of asset prices and powers of asset returns, yielding new insights into the behavior of markets and the pricing of financial risk.
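The Geweke and Porter-Hudak procedure mentioned above can be sketched as a log-periodogram regression: the slope of log spectral ordinates on the transformed frequencies estimates d. The bandwidth choice m = sqrt(n) is a common convention, not taken from this text.

```python
import numpy as np

def gph_estimate(x, m=None):
    """Geweke & Porter-Hudak (1983) log-periodogram estimate of the
    long-memory parameter d: regress log I(lambda_j) on
    -2*log(2*sin(lambda_j/2)) over the first m Fourier frequencies."""
    x = np.asarray(x, float)
    n = len(x)
    if m is None:
        m = int(np.sqrt(n))                 # common bandwidth choice
    freqs = 2 * np.pi * np.arange(1, m + 1) / n
    fft = np.fft.fft(x - x.mean())
    I = (np.abs(fft[1:m + 1]) ** 2) / (2 * np.pi * n)  # periodogram
    X = -2.0 * np.log(2.0 * np.sin(freqs / 2.0))
    slope, _ = np.polyfit(X, np.log(I), 1)
    return slope                            # estimate of d

# white noise has d = 0, so the estimate should be near zero
rng = np.random.default_rng(1)
d_hat = gph_estimate(rng.standard_normal(4096))
```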

time series analysis.


This study provides further evidence that exposure to tobacco smoke is an independent risk factor which increases the risk of IMD. Of the four countries studied, the most complete dataset for IMD, smoking, ILI and household crowding was obtained from Norway. Over a 34-year period, a 5.2–6.9% increase in IMD in children under 5 years of age was observed for every 1% rise in prevalence of smoking in adults aged between 25 and 49 years. Taken together with previous case–control studies showing smoking as a risk factor for contracting IMD, the reduction in smoking prevalence that has occurred in Norway during this period is likely to have made a significant contribution to the concurrent reduction in incidence of IMD. The proportion of IMD cases under 5 years of age in the total population that could be attributed to active smoking in Norway was found to be 11.4%, which is far lower than that estimated in other studies for young children [22, 37]. The lack of demonstrable associations between incidence of IMD and prevalence of smoking, after adjustment for the same confounding variables, in Denmark, Sweden and the Netherlands may in part be ascribed to the limited datasets available. The absence of statistically significant associations is hence difficult to interpret, although unadjusted analysis showed positive associations between IMD in children and older smokers in Sweden and the Netherlands. In contrast, negative associations were found for younger smokers in Sweden. These mixed patterns of associations may indicate that not all biologically relevant confounding factors were accounted for.

Time Series Analysis


This chapter presents an introduction to the branch of statistics known as time series analysis. Often the data we collect in environmental studies is collected sequentially over time; this type of data is known as time series data. For instance, we may monitor wind speed or water temperatures at regularly spaced time intervals (e.g. every hour or once per day). Collecting data sequentially over time induces correlation between measurements, because observations near each other in time tend to be more similar, and hence more strongly correlated, than observations made further apart in time. In data analysis we often assume our observations are independent, but with time series data this assumption is often false, and we would like to account for this temporal correlation in our statistical analysis.
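The temporal correlation described above can be made concrete with a small sketch: a persistent AR(1) series shows large sample autocorrelation at short lags, while independent noise shows essentially none. The simulation values are illustrative.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation: correlation between the series and
    lagged copies of itself, after removing the mean."""
    x = np.asarray(x, float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([1.0] + [np.dot(x[:-k], x[k:]) / denom
                             for k in range(1, nlags + 1)])

rng = np.random.default_rng(0)
e = rng.standard_normal(2000)       # independent noise
ar = np.zeros(2000)                 # persistent AR(1) series
for t in range(1, 2000):
    ar[t] = 0.8 * ar[t - 1] + e[t]

acf_ar = sample_acf(ar, 5)          # decays slowly from ~0.8
acf_noise = sample_acf(e, 5)        # near zero at all lags > 0
```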

time series analysis


This study utilizes quarterly time series data from 2000 to 2018 at the aggregate level, and quarterly data from 2005 to 2018 for the commodity-level trade balance. This period is used because the study focuses on the period after the Asian financial crisis, which shifted Indonesia’s exchange regime to a floating exchange regime. Further information regarding the data is described as follows:


Exploratory analysis of large spatial time series and network interaction data sets: house prices in England and Wales


The mainstream is not without its problems, however. Kreps (ibid.) notes the difficulties of game theory, for example: on what basis is an equilibrium chosen if there are multiple equilibria? And what if players make moves which run counter to theory? Day (1993) points out that the founders of classical economics, including Adam Smith himself, were well aware that not all of human behaviour was rooted in balance and rationality. Atkinson (1969, in Ormerod 2005, p. 21) states that it may take over 100 years for economic growth equilibria to stabilise, meaning that the systems we observe are largely in disequilibrium in any case. And in the latter half of the 20th century, the advent of chaos theory undermined the idea that even the simplest behavioural foundations would necessarily result in an analytically tractable outcome. The sentiment is succinctly expressed by Strogatz (1994): “If you listen to your two favourite songs at the same time, you won’t get double the pleasure!”

Statistical Learning for the Spectral Analysis of Time Series Data


The first is detailed in Chapter 2 and is motivated by estimating the power spectrum of HRV time series in a way that provides insight into the workings of the autonomic nervous system (Malik et al., 1996). Because HRV is a nonstationary time series, it poses a specific challenge in that the frequency characteristics of its power spectrum can vary over time (Priestley, 1965). Furthermore, since time-varying power is estimated as a three-dimensional surface, clinicians often use summarizing measures in their research, such as power within a band of frequencies. Our method aims to provide an alternative that aids the interpretation of these structures by reframing the typical locally stationary Fourier estimate of the time-varying spectrum in a penalized reduced-rank regression setting. This allows the power spectrum to be broken up into multiple unit-rank layers, each formed by multiplying together an “importance” singular value, a left singular “time” vector, and a right singular “frequency” vector. An adaptive sparse fused lasso penalty imposed on these vectors introduces sparsity and smoothness into the estimate. These layers can then be examined individually for patterns, and the singular vectors provide a parsimonious representation of the time- and frequency-varying characteristics of the power spectrum.
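The unit-rank layer decomposition can be illustrated with a plain SVD, which is the unpenalized special case of the approach above (without the sparse fused-lasso penalty). The synthetic time-by-frequency spectrum below is illustrative, not HRV data.

```python
import numpy as np

# A time-varying spectrum estimate is a time x frequency matrix; the SVD
# writes it as a sum of unit-rank layers s_k * u_k v_k^T, where u_k is a
# "time" vector, v_k a "frequency" vector, and s_k an importance weight.
t = np.linspace(0.0, 1.0, 100)[:, None]       # time grid (column)
f = np.linspace(0.0, 0.5, 60)[None, :]        # frequency grid (row)

# low-frequency power that grows over time, plus
# high-frequency power that fades over time: an exactly rank-2 surface
S = np.exp(-((f - 0.1) ** 2) / 0.002) * t
S = S + np.exp(-((f - 0.4) ** 2) / 0.002) * (1.0 - t)

U, s, Vt = np.linalg.svd(S, full_matrices=False)
layers = [s[k] * np.outer(U[:, k], Vt[k]) for k in range(2)]
approx = layers[0] + layers[1]                 # two layers recover S
```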

Time series causality analysis and EEG data analysis on music improvisation


In the intra-brain neural networks, improvisation was found to trigger a more widely distributed network than composed music [216]. The distribution of intra-brain neural information flows expands from the back of the brain (pianist) or the right of the brain (listener) to the entire brain when composed music is changed to improvisation. The frontal (attention and executive control) and central (motor cortex) regions became activated when musicians played the improvisations. This may be because either performing or listening to improvisations demands more widespread functional coordination between large brain regions [216]. Also, the intra-brain causality values were found to be significantly greater in composed music than in improvisation, particularly for the listeners, where the neural information flows separately began and ended in the left frontal and the right frontal regions in composed music, with the directions reversed when composed music is changed to improvisation [216]. Similarly, the differences between the strict mode and the “let-go” mode can also be found in the frontal activities and in the inversion of information flows when the strict mode is changed to the “let-go” mode. These results agree with earlier studies [22] [164] showing that the frontal regions (a more general area that covers the dorsal prefrontal regions), and especially the right frontal region, play an important role in free improvisation of melodies and rhythms, and are the key regions that distinguish brain activity between composed music and improvisation and between the strict mode and the “let-go” mode [22] [164]. Moreover, in the contrasting intra-brain neural networks, the central regions tend to act as transit hubs that transport the neural information flows, while the temporal and parietal regions also behave differently under different experimental conditions [216].
Moreover, the results of the differences between experimental conditions are robust and independent of the significance thresholds (Remark 12.3.1) [216].

Predefined pattern detection in large time series


Predefined pattern detection from time series is an interesting and challenging task. In order to reduce its computational cost and increase effectiveness, a number of time-series representation methods and similarity measures have been proposed. Most of the existing methods focus on full sequence matching, that is, sequences with clearly defined beginnings and endings, where all data points contribute to the match. These methods, however, do not account for temporal and magnitude deformations in the data and prove ineffective in several real-world scenarios where noise and external phenomena introduce diversity in the class of patterns to be matched. In this paper, we present a novel pattern detection method based on the notions of templates, landmarks, constraints and trust regions. We employ the Minimum Description Length (MDL) principle in the time-series preprocessing step, which helps to preserve all the prominent features and prevents the template from overfitting. Templates are provided by common users or domain experts, and represent interesting patterns we want to detect in time series. Instead of using templates to match all the potential subsequences in the time series, we translate both the time series and the templates into landmark sequences, and detect patterns in the landmark sequence of the time series. By defining constraints within the template landmark sequence, we effectively extract all the landmark subsequences from the time-series landmark sequence, and obtain a number of landmark segments (time-series subsequences or instances). We model each landmark segment by scaling the template in both the temporal and magnitude dimensions. To suppress the influence of noise, we introduce the concept of a trust region, which not only helps to achieve an improved instance model, but also helps to catch the accurate boundaries of instances of the given template.
Based on the similarities derived from the instance models, we use a probability density function to calculate a similarity threshold. The threshold can be used to judge whether a landmark segment is a true instance of the given template. To evaluate the effectiveness and efficiency of the proposed method, we apply it to two real-world datasets. The results show that our method is capable of detecting patterns under temporal and magnitude deformations with competitive performance.
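As a toy illustration of the landmark idea, landmarks can be taken as the local extrema of a series. This is a hypothetical simplification; the paper's landmark definition, constraints and trust regions are richer than this sketch.

```python
import numpy as np

def landmarks(x):
    """Return indices of local minima/maxima (slope sign changes), a
    compact shape representation of a series. Flat segments (zero
    slope) are not counted as landmarks in this simplification."""
    x = np.asarray(x, float)
    d = np.sign(np.diff(x))
    # a landmark sits where consecutive slope signs disagree
    turns = np.where(d[:-1] * d[1:] < 0)[0] + 1
    return turns

# peak at index 2, trough at index 5, peak at index 7
x = np.array([0.0, 1.0, 2.0, 1.0, 0.0, -1.0, 0.0, 2.0, 1.0])
lm = landmarks(x)
```

Matching then operates on the (much shorter) landmark sequence instead of every raw subsequence, which is the source of the speedup described above.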

Time Series Analysis


There is a clear seasonal effect present in the series, but the size of the seasonal effects seems to be increasing as the level of the series increases. The number of passengers is clearly increasing with time, with the number travelling in July and August always being roughly 50% greater than the number travelling in January and February. This kind of proportional variability suggests that it would be more appropriate to examine the series on a log scale. Figure 5.18 shows the data plotted in this way. On that scale the series shows a consistent level of seasonal variation across time. It seems appropriate to analyse this time series on the log scale.
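The effect described above can be sketched numerically: for a synthetic series with multiplicative seasonality, the within-year swing grows with the level on the raw scale but is stable on the log scale. The growth rate and seasonal amplitude below are illustrative stand-ins for the passenger data.

```python
import numpy as np

# Multiplicative seasonality: the seasonal swing is proportional to the
# level, so the raw series "fans out" while log(series) does not.
t = np.arange(144)                                 # 12 years, monthly
level = 100.0 * 1.01 ** t                          # ~1% growth per month
season = 1.0 + 0.25 * np.sin(2 * np.pi * t / 12)   # 25% seasonal swing
y = level * season

first_year, last_year = y[:12], y[-12:]
# raw scale: the last year's range is several times the first year's
swing_raw = np.ptp(last_year) / np.ptp(first_year)
# log scale: the ranges match, so the seasonal variation is stable
swing_log = np.ptp(np.log(last_year)) / np.ptp(np.log(first_year))
```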

Large-Scale Assessment of Coastal Aquaculture Ponds with Sentinel-1 Time Series Data

Abstract: We present an earth-observation-based approach to detect aquaculture ponds in coastal areas with dense time series of high-spatial-resolution Sentinel-1 SAR data. Aquaculture is one of the fastest-growing animal food production sectors worldwide, contributes more than half of the total volume of aquatic foods in human consumption, and offers great potential for global food security. The key advantages of SAR instruments for aquaculture mapping are their all-weather, day-and-night imaging capabilities, which apply particularly to cloud-prone coastal regions. The different backscatter responses of the pond components (dikes and enclosed water surface) and aquaculture’s distinct rectangular structure allow for separation of aquaculture areas from other natural water bodies. We analyzed the large volume of free and open Sentinel-1 data to derive and map aquaculture pond objects for four study sites covering major river deltas in China and Vietnam. SAR image data were processed to obtain temporally smoothed time series. Terrain information derived from DEM data and accurate coastline data were utilized to identify and mask potential aquaculture areas. An open source segmentation algorithm supported the extraction of aquaculture ponds based on backscatter intensity, size and shape features. We were able to efficiently map aquaculture ponds in coastal areas with an overall accuracy of 0.83 for the four study sites. The approach presented is easily transferable in time and space, and thus holds the potential for continental and global mapping.
Keywords: aquaculture; SAR; Sentinel-1; time series; image segmentation; remote sensing; ponds; coastal zone; river delta
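The temporal smoothing step can be sketched with a sliding median, which suppresses speckle-like outliers in a pixel's backscatter time series. This is a hypothetical stand-in: the abstract does not specify the actual smoothing method or window, and the values below are illustrative dB backscatter.

```python
import numpy as np

def temporal_median_smooth(stack, k=3):
    """Smooth a per-pixel backscatter time series with a sliding median
    of window 2k+1 (shrunk at the edges); the median is robust to the
    single-date speckle spikes typical of SAR data."""
    x = np.asarray(stack, float)
    n = len(x)
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)
        out[i] = np.median(x[lo:hi])
    return out

# a water pixel (~-18 dB) with one speckle spike: the spike is removed
series = np.array([-18.0, -18.5, -17.9, -5.0, -18.2, -18.1, -17.8])
smoothed = temporal_median_smooth(series, k=2)
```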

Time Series Analysis


The Dickey-Fuller test statistics for the joint hypotheses are computed in the same way as the usual F-test statistics. Reject the null hypothesis if the test statistic is too large. The critical values are not the quantiles of the F-distribution; there are tables with the correct critical values.
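The underlying Dickey-Fuller regression can be sketched with plain NumPy for the constant-only single-coefficient case (the joint F-type version adds restrictions to the same regression). The resulting τ statistic is compared against Dickey-Fuller tables, not Student-t or F quantiles; the simulated series are illustrative.

```python
import numpy as np

def df_tau(y):
    """Dickey-Fuller tau statistic from the regression
       dy_t = alpha + rho * y_{t-1} + eps_t.
    The t-ratio on rho is compared with Dickey-Fuller critical values
    (about -2.86 at 5% for the constant-only case), not with the
    Student-t distribution."""
    y = np.asarray(y, float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, _, _, _ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    s2 = resid @ resid / (len(dy) - 2)
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
walk = np.cumsum(rng.standard_normal(500))   # unit root: tau near zero
stat_walk = df_tau(walk)
stationary = rng.standard_normal(500)        # no unit root: tau very negative
stat_stat = df_tau(stationary)
```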


Similarity Search and Analysis Techniques for Uncertain Time Series Data


The first objective of the experiments is to study the performance of the proposed method with regard to various data parameters and query parameters (Section 4.3). As explained earlier, the exhaustive technique, which calculates all the correlation coefficients in (18), is infeasible since its time complexity is O(N^n), where N is the number of observed values at each timestamp and n is the dimension (length) of the uncertain time series. Thus, to make the similarity search feasible in different settings, similar to [DAL12], we reduced the input data by truncating the dataset to 50 time series of dimension 6 with 3 observed values at each timestamp. For example, given a correlation threshold c, probability threshold p, and SDR r, the exhaustive technique needs over 26.5 million calculations (with 50 time series), and in total over 15.7 billion calculations (with 9 SDRs, 6 correlation thresholds, and 11 probability thresholds; Section 4.3). This shows that even for a small uncertain time series dataset, the exhaustive technique requires an excessive amount of processing time.
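One plausible reading of the quoted operation counts reproduces the 26.5 million and 15.7 billion figures exactly; this counting scheme is an assumption about how the paper arrives at them.

```python
# With N = 3 observed values per timestamp and dimension n = 6, each
# uncertain series has N**n = 729 possible instantiations. Correlating
# a query against one series exhaustively pairs every instantiation of
# one with every instantiation of the other; repeating over 50 series
# and the full parameter grid gives the totals quoted above.
N, n, num_series = 3, 6, 50
per_pair = (N ** n) ** 2                       # 729 * 729 = 531,441
total_one_setting = per_pair * num_series      # ~26.5 million
grid_total = total_one_setting * 9 * 6 * 11    # 9 SDR x 6 c x 11 p
```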

Improved singular spectrum analysis for time series with missing data


Although the selection of window length is an important issue for SSA (Hassani et al., 2012; Hassani and Mehmoudvand, 2013), this paper chooses the same window length (L = 120) as that in Schoellhamer (2001) in order to compare the performance of the proposed method with that of Schoellhamer. Using the synthetic time series we compute the lagged correlation matrix and the variances of each mode. The first four modes contain the periodic components, which account for 72.3 % of the total variance; in particular, the first mode contains 50.2 % of the total variance. In order to evaluate the accuracies of reconstructed PCs from the time series with different percentages of missing data, following the approach of Shen et al. (2014), we compute the relative errors of the first four modes derived by ISSA and SSAM with the following expression:
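Mode variance shares of the kind quoted above can be computed from the SVD of the SSA trajectory matrix. The sinusoid-plus-noise series below is a synthetic illustration, not the paper's data; a clean sinusoid concentrates its variance in a pair of leading modes.

```python
import numpy as np

def ssa_variances(x, L):
    """Basic SSA: embed the series in an L-window trajectory matrix,
    take its SVD, and return each mode's share of total variance
    (eigenvalue share)."""
    x = np.asarray(x, float)
    K = len(x) - L + 1
    X = np.column_stack([x[i:i + L] for i in range(K)])  # trajectory matrix
    s = np.linalg.svd(X, compute_uv=False)
    lam = s ** 2
    return lam / lam.sum()

# noisy sinusoid: the leading sine/cosine pair dominates the variance
t = np.arange(600)
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * t / 50) + 0.1 * rng.standard_normal(600)
shares = ssa_variances(x, L=120)    # same window length as above
```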

Contributions to Functional Data Analysis with Applications to Modeling Time Series and Panel Data

with T → ∞; see Gijbels & Peng (2000) for a consideration of similar estimators. For the estimation of the bounds a(τ) and b(τ) we take advantage of our assumption that a(τ) and b(τ) are smooth, twice continuously differentiable, bounds. This allows us to estimate these bounds consistently as T → ∞ even if n(T) remains bounded. However, the scatter plot in Figure 2.2 suggests a nonstandard assumption on the shape of a(τ) and b(τ), which excludes the usage of classical boundary estimators such as free disposal hull (FDH) estimators or data envelope estimators [see, e.g., Deprins et al. (1984) and Kneip et al. (1998)]. Instead, we use nonparametric local linear regression in order to estimate the bounds a(τ) and b(τ). On the one hand, this allows us to estimate arbitrary smooth boundary functions; on the other hand, it seamlessly fits our unifying nonparametric regression problem in Eq. (2.15). We use the deterministic frontier regression model proposed by Martins-Filho & Yao (2007), which can be formulated for our case as
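A generic local linear smoother of the kind referred to above can be sketched as follows. This is the standard kernel-weighted estimator, not the Martins-Filho & Yao frontier model itself; the Gaussian kernel, bandwidth and test function are illustrative choices.

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear regression at x0: weighted least squares of y on
    (1, x - x0) with Gaussian kernel weights of bandwidth h; the fitted
    intercept is the estimate at x0."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    X = np.column_stack([np.ones_like(x), x - x0])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0]

# recovers a smooth quadratic from noisy samples
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = x ** 2 + 0.02 * rng.standard_normal(200)
est = local_linear(0.5, x, y, h=0.05)   # true value is 0.25
```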

Indexing for Very Large Data Series Collections


Generating workloads for data series indexes. Our fourth contribution is motivated by the fact that, up to this point, very little attention had been paid to how to properly evaluate data series index structures. Most previous work relied solely on randomly selecting data series, with or without added noise, which were then used as queries. A hardness analysis of these queries was always omitted; instead, index performance was measured as the average query answering time across a large number of queries. On the contrary, in the context of relational databases, various benchmark workload generation techniques have been proposed through the years. Such techniques included methods for generating queries with specific properties, carefully designed to stress different parts of the database stack. In this thesis, we argue that apart from creating novel data structures for data series, there is also a need for carefully generating a query workload, such that these structures are stressed at appropriate levels. To solve this problem, in Chapter 6, we define measures that capture the characteristics of queries, and we propose a method for generating workloads with the desired properties, that is, for effectively evaluating and comparing data series summarizations and indexes. In our experimental evaluation, with carefully controlled query workloads, we shed light on key factors affecting the performance of nearest neighbor search in large data series collections.

Missing observation analysis for matrix-variate time series data


This is indeed obtained under the proposed new recursions, and the respective correlations at time points where there are gaps are 0.633 (at t = 24), 0.779 (at t = 43), 0.812 (at t = 75) and 0.809 (at t = 86); the mean of these correlations is 0.792, which is close to the real value of 0.8 under the simulation experiment.


Time Series Outlier Analysis of Tea Price Data


We analyze the tea price data of three regions, including NI and SI, and fit ARIMA models for these data. Time series plots of the three types of data (Figure 1) revealed that the data are not stationary but show an upward trend. To make the series stationary, successive differences are taken to create new series. We then examine the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the differenced series to determine the order of the model, since the most common tools used in identifying model parameters are the ACF and the PACF. First we analyze the data of NI. For the NI region the ACF shows cosine waves and each value is highly significant. The PACF (Figure 3) is significant at lags 1, 5
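The differencing-then-ACF/PACF workflow can be sketched as follows. The synthetic trending series stands in for the tea price data, and the PACF uses the standard Durbin-Levinson recursion.

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelation function up to nlags."""
    x = np.asarray(x, float) - np.mean(x)
    d = x @ x
    return np.array([1.0] + [x[:-k] @ x[k:] / d for k in range(1, nlags + 1)])

def pacf(x, nlags):
    """Partial autocorrelation via the Durbin-Levinson recursion."""
    r = acf(x, nlags)
    phi = np.zeros((nlags + 1, nlags + 1))
    out = [1.0]
    for k in range(1, nlags + 1):
        num = r[k] - sum(phi[k - 1, j] * r[k - j] for j in range(1, k))
        den = 1.0 - sum(phi[k - 1, j] * r[j] for j in range(1, k))
        phi[k, k] = num / den
        for j in range(1, k):
            phi[k, j] = phi[k - 1, j] - phi[k, k] * phi[k - 1, k - j]
        out.append(phi[k, k])
    return np.array(out)

# a trending (nonstationary) series: ACF near 1 at lag 1; after one
# difference it is roughly stationary and ACF/PACF drop to near zero
rng = np.random.default_rng(0)
price = np.cumsum(1.0 + 0.5 * rng.standard_normal(400))  # upward trend
diffed = np.diff(price)
```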

Fractal Analysis of Time-Series Data Sets: Methods and Challenges


Contrasting the trends displayed in Figures 19 and 20 with those displayed in Figures 16 and 17 highlights the inherent challenge in assessing the fractal properties of time-series structures that suffer from limited total length and/or limited resolution/spectral content. Indeed, accommodating the impact of a minimum feature size that is significantly in excess of the trace’s resolution limit generally necessitates restricting a fractal analysis to length scales larger still than even this observed minimum feature size. This in turn often restricts an analysis of scaling properties to a consideration of relatively few orders of magnitude in length. For example, performing a fractal analysis of a 512-point Fourier filtered trace using analysis cutoffs corresponding to 10 data points and 1/5 of the trace length corresponds to an analysis of the scaling behavior over barely more than one order of magnitude in length scale; attempting to increase the accuracy of the measurement by raising the fine-scale cutoff to 20 data points further reduces the scaling range to 0.71 orders of magnitude.
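The scaling-range arithmetic in the example can be reproduced directly: the accessible range is the ratio of the coarse cutoff (1/5 of the trace length) to the fine cutoff, expressed in orders of magnitude.

```python
import math

trace_points = 512
coarse = trace_points / 5            # coarse-scale cutoff: 102.4 points

# fine-scale cutoff of 10 points -> barely more than one decade
decades_10 = math.log10(coarse / 10)

# raising the fine-scale cutoff to 20 points shrinks the range further
decades_20 = math.log10(coarse / 20)
```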
