Collaborative Data Collection & Quality Analysis In Smart Phone Based Wireless Sensor Networks

¹Wilson Thomas, ²E. Madhusudhana Reddy

¹Research Scholar, ²Professor

¹Research & Development Center, Bharathiar University, Coimbatore, India

²Department of CSE, Guru Nanak Institutions Technical Campus, Telangana, India

Abstract:

Collaborative sensing has become a novel approach for smart phone based data collection. Mobile Data Collection (MDC) is the use of mobile phones, tablets, or PDAs for programming or data collection. MDC can be very useful to evaluators who collect quantitative data for an evaluation or abstract data for one. Other important issues addressed in the literature relate to energy, routing, security, and bandwidth. This paper emphasizes the issues of data quality arising from the data aggregation or fusion process in wireless sensor networks. It presents a simple description of data and its essential characteristics, and discusses some of the significant literature on enhancing data quality and on fault-tolerant systems in smart phone based wireless sensor networks. Finally, the paper outlines the research gap identified from the literature review.

Keywords: collaborative sensing; Sensor Data Quality; Smart Phones; qualitative data collection

I. Introduction

In wireless sensor networks (WSNs), many errors occur in the gathered sensor data because of characteristics such as low-cost sensors, limited resources, and link variation [1]. These errors take a variety of forms: data loss or errors caused by hardware, failures due to transmission delays, and sampling jitter [2] due to task conflicts at a node. Hence the data received at the sink can contain any of these types of errors.

To increase the reliability of application monitoring using smart phones, it is important to recognize that the quality of the data can be affected by faults. The data quality issue can be addressed in two ways: by using techniques that reduce faults and improve the reliability of the system, or by enhancing the system with means to continuously assess and characterize the quality of data [3]. In the former case, the system is not aware of the quality of data; hence, if a certain quality is needed, it must be enforced by design, confining the effects of faults a priori. Considering sensors to be the main source of data, errors in sensing measurements are handled by procedures established from a deep understanding of the characteristics of the sensors [4]. Missing readings may be handled by oversampling, and glitches, such as outliers and noise, can be masked by averaging. In the latter case, given that the system can be aware of the quality of data at run-time, it is better suited to environments where the operational conditions are not fully known in advance. In this case, mitigation techniques must be deployed to handle faults and data quality problems at run-time, for instance by exploiting application semantics to determine appropriate data corrections and to regain the needed data quality.

Ensuring the quality of smart phone sensor data is challenging. It depends on the mobility of smart phone users, network coverage, and the availability of participants. Hence the data is unlikely to be completely free from faults. This can have serious implications in certain situations, for example when faulty data on flood or pollution levels is shared.

As another example, WSNs are deployed in data centers for flexible temperature monitoring and energy-efficient control of air-cooling equipment [5,6]. Hence assuring the quality of data collected through smart phones is also critical for its effectiveness. In these examples, the conditions under which the data is collected are difficult to predict and can cause faulty data collection, and the cost of erroneous sensor data can be severe.

II. Related Work

When the quality of data is an important parameter for the application, it can be expressed in different ways. Additional information on possible quality degradation of data can be used to generate a fault model. This helps to identify data quality more accurately and to adopt methods that reduce the effects of possible faults in the received data. Expressing the quality of data therefore requires analysis and modeling of this information into hypotheses and frameworks for data quality representation. This analysis is complicated by the array of terms that different authors use besides quality (the most generic), including the following.

Validity is typically employed when a determined requirement about the quality of data is available, against which it is possible to compare some quality measure and declare whether the data are valid [7,8].

Confidence is an attribute that can be assured by monitoring sensor data, without requiring a quality measurement. It is used when the available data sets can be characterized with a probabilistic approach, using a threshold definition or a model fitting method to obtain continuous or multi-level confidence measures.

Reliability is a typical dependability attribute, expressing the ability of a system to provide the correct service (or the correct data, for that matter) over a period of time. The reliability of data becomes a concern only when there are data transmission faults or data loss.

Trustworthiness is mostly employed in connection with security concerns, namely when it is assumed that data can be altered in a malicious way. In the context of sensor networks, it characterizes the degree to which it is possible to trust that sensor data have not been tampered with and thus have the needed quality.

Authenticity is also used, in particular in a security context, but to express the degree to which it is possible to trust the claimed data origin. This is particularly important when the overall quality of the system or application depends on the correct association of some data with their producer.

III. Data Quality

Data quality is used in different contexts in wireless sensor networks. We have identified four core components of data quality, on which our system is based:

Accuracy: The accuracy of a single sample reflects the numerical difference between the sample and the true value of the measurand. This explicitly includes errors introduced at the sensor level. In-network aggregation can introduce further faulty data.

Consistency: If a single sample or a stream is compliant with a user-defined model, it is consistent.

Timeliness: The timeliness of data denotes whether data is received by a sink or actuator node in time. Unreliable radio communication and network latency can affect the timeliness of data sharing.

Completeness: Completeness reflects whether a node has taken a sufficient number of samples to reconstruct the measurand, or whether a node has successfully received a sufficient fraction of a stream from the network.

Figure 1: Generic view of the WSN-based monitoring system.

Our goal is to describe and enumerate the processes involved in each layer of the scheme in Figure 1.

In a bottom-up perspective, the first layer is the physical environment (it can also be defined as the object or the event to be monitored), which can have a great influence on the measurements.

IV. Network Model and Problem

The wireless sensor network consists of a set of $n$ sensor nodes randomly deployed in a planar area, $S = \{s_1, s_2, \ldots, s_n\}$. Let the total monitoring time be $T$. The nodes are time-synchronized, and the sampling interval is $\Delta t$. At a given time, one node can collect $k$ physical quantities, and the data collected by node $i$ at time $t$ can be represented by the set $X(i,t)$:

$$X(i,t) = \{x_1, x_2, \ldots, x_k\}.$$

Let the sequence of data gathered by node $i$ during $T$ be $X_i$:

$$X_i = [X(i,1), X(i,2), \ldots, X(i,T/\Delta t)].$$

Without loss of generality, suppose that only one physical quantity, say temperature, is measured by the smart phone sensor. The data sequence of node $i$ during $T$ is then denoted as $X_i$:

$$X_i = [val_1, val_2, \ldots, val_{T/\Delta t}].$$

The dataset collected by all the nodes in $S$ is received at the sink node during the monitoring time $T$ and can be represented by a matrix $D$ of size $(T/\Delta t) \times n$,

$$D = [X_1, X_2, \ldots, X_n]^{\mathsf{T}},$$

where the superscript $\mathsf{T}$ denotes transposition.
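As an illustration of this model, the following Python sketch arranges per-node temperature sequences into a dataset with one row per time step and one column per node. The function name `build_dataset`, the use of `None` for a missing (null) sample, and the sample values are assumptions made for the example; they do not come from the paper.

```python
from typing import List, Optional

Sample = Optional[float]  # assumption: None stands for a missing (null) sample


def build_dataset(sequences: List[List[Sample]]) -> List[List[Sample]]:
    """Arrange n per-node sequences X_1..X_n into D with (T/dt) rows and n columns."""
    if not sequences:
        return []
    steps = len(sequences[0])
    if any(len(seq) != steps for seq in sequences):
        raise ValueError("every node needs T/dt slots (use None for gaps)")
    # Row t holds the samples of all nodes at time step t.
    return [[seq[t] for seq in sequences] for t in range(steps)]


# Hypothetical readings for three nodes over four sampling instants.
x1 = [20.1, 20.3, None, 20.6]
x2 = [19.8, 19.9, 20.0, 20.2]
x3 = [20.5, None, 20.7, 20.9]
D = build_dataset([x1, x2, x3])
print(len(D), len(D[0]))  # 4 3
```

Keeping missing samples as explicit `None` entries preserves the fixed $(T/\Delta t) \times n$ shape, which the quality indicators defined later rely on.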

Let $q_v$, $q_c$, $q_t$, and $q_a$ denote the data volume, completeness, time-related, and correctness indicators of dataset $D$, respectively.

The quality assessment and data cleaning of dataset $D$ are done at the sink node. Data cleaning includes patching of missing data, sampling jitter correction, and outlier detection and correction.

We assume that the signal of a physical object detected by a sensor node changes smoothly. For example, the temperature or humidity over one day usually changes continuously and smoothly. This constraint is necessary for sampling jitter elimination and the data cleaning process; it amounts to assuming that the sampling interval is shorter than the time scale on which the physical signal changes.
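One way the smoothness assumption can be exploited when patching missing data is linear interpolation between neighboring samples. The sketch below is only an illustration under that assumption and is not the cleaning method of the paper; the function name `patch_missing` and the use of `None` for a lost sample are hypothetical.

```python
from typing import List, Optional


def patch_missing(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill None gaps by linear interpolation between the nearest known samples.

    Gaps at the start or end are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return list(values)  # nothing to interpolate from
    patched = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            patched[i] = values[right]
        elif right is None:
            patched[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            patched[i] = values[left] + frac * (values[right] - values[left])
    return patched


patched = patch_missing([20.0, None, None, 23.0, None])
print(patched)
```

Interpolation of this kind is only reasonable while the gap is short relative to how quickly the signal changes, which is what the constraint on the sampling interval expresses.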

V. Data Quality Metrics

The data volume describes the size of a dataset, and it can be used to describe the working state of a given sensor node. If a node has less data than other nodes, its data is considered to have been lost. The data volume also reflects the availability of the dataset and the reliability of results derived from it. For example, when a mean is computed on two datasets of different sizes for a given observation object, the one with the smaller data volume is assumed to be less trustworthy.

Definition 1 (Data volume indicator) Assume that the monitoring area has $n$ nodes, the monitoring time duration is $T$, and all nodes collect data with the same time interval $\Delta t$. The data sequence of node $i$ in the monitoring duration $T$ is

$$X_i = [X(i,1), X(i,2), \ldots, X(i,T/\Delta t)].$$

The existence of a sample for node $i$ at time $t$ is defined as:

$$f_v(X(i,t)) = \begin{cases} 1, & X(i,t) \neq \text{null} \\ 0, & X(i,t) = \text{null}. \end{cases} \tag{1}$$

Let $v_i$ be the number of samples for node $i$:

$$v_i = \sum_{t=1}^{T/\Delta t} f_v(X(i,t)). \tag{2}$$

Then, the data volume indicator can be calculated as:

$$q_v = \frac{\Delta t \times \sum_{i=1}^{n} v_i}{n \times T}. \tag{3}$$
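Equations (1) to (3) can be sketched in Python as follows. The sketch assumes that a missing sample is stored as `None` and that the denominator counts the $n$ nodes of the definition; the function names and the sample values are hypothetical.

```python
from typing import List, Optional, Sequence


def f_v(sample: Optional[object]) -> int:
    """Equation (1): 1 if the sample exists, 0 if it is null."""
    return 0 if sample is None else 1


def data_volume_indicator(
    sequences: Sequence[Sequence[Optional[object]]],
    delta_t: float,
    total_time: float,
) -> float:
    """Equations (2) and (3): fraction of expected samples that were received."""
    n = len(sequences)
    # v[i] is the number of samples actually present for node i.
    v: List[int] = [sum(f_v(s) for s in seq) for seq in sequences]
    return delta_t * sum(v) / (n * total_time)


# Three nodes, T = 4 and dt = 1, so four slots per node; two samples are missing.
sequences = [
    [20.1, 20.3, None, 20.6],
    [19.8, 19.9, 20.0, 20.2],
    [20.5, None, 20.7, 20.9],
]
q_v = data_volume_indicator(sequences, delta_t=1.0, total_time=4.0)
print(round(q_v, 4))  # 0.8333
```

With ten of the twelve expected samples present, the indicator evaluates to 10/12.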

Completeness describes the seriousness of data loss problems in the dataset. The completeness indicator is generally measured with the proportion of the raw data volume compared with the required data volume.

Definition 2 (Completeness indicator) Assume that the monitoring area has $n$ nodes, the monitoring time duration is $T$, and all nodes collect data with the same time interval $\Delta t$. The data sequence of node $i$ in the monitoring duration $T$ is $X_i = [X(i,1), X(i,2), \ldots, X(i,T/\Delta t)]$. The completeness of data record $X(i,t)$ is defined as follows:

$$f_c(X(i,t)) = \begin{cases} 1, & X(i,t) \neq \text{null and } x_j \neq \text{null for all } j \\ 0, & \text{otherwise,} \end{cases} \tag{4}$$

where $X(i,t) = \{x_1, x_2, \ldots, x_k\}$.

The completeness metric for dataset $D$ at time $t$ is denoted as $cv_t$, that is:

$$cv_t = \sum_{i=1}^{n} f_c(X(i,t)). \tag{5}$$

Then, the completeness indicator can be calculated as:

$$q_c = \frac{\Delta t \cdot \sum_{t=1}^{T/\Delta t} cv_t}{n \cdot T}.$$
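The completeness computation can be sketched in Python in the same spirit. The sketch assumes that each record $X(i,t)$ is a tuple of $k$ values, that `None` marks either a missing record or a missing component, and that the denominator counts the $n$ nodes; the names and the sample values are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Record = Optional[Tuple[Optional[float], ...]]  # None means the whole record is lost


def f_c(record: Record) -> int:
    """Equation (4): 1 only if the record exists and every component is present."""
    if record is None:
        return 0
    return 1 if all(x is not None for x in record) else 0


def completeness_indicator(
    dataset: Sequence[Sequence[Record]], delta_t: float, total_time: float
) -> float:
    """dataset[t][i] holds X(i, t); returns the completeness indicator q_c."""
    n = len(dataset[0])
    # cv[t] counts the complete records at time step t (equation (5)).
    cv = [sum(f_c(record) for record in row) for row in dataset]
    return delta_t * sum(cv) / (n * total_time)


# Two nodes, three time steps, k = 2 (temperature, humidity).
dataset = [
    [(20.1, 55.0), (19.9, None)],   # node 2 lost its humidity value
    [(20.2, 54.8), (20.0, 56.1)],
    [None, (20.1, 56.0)],           # node 1 lost the whole record
]
q_c = completeness_indicator(dataset, delta_t=1.0, total_time=3.0)
print(round(q_c, 4))  # 0.6667
```

A record with a single missing component counts as incomplete, so the completeness indicator can never exceed the data volume indicator computed on the same data.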

There are two main concerns with the time-related indicator: volatility and timeliness. Volatility is generally used to describe data variation, and it can be measured by the time period during which the data remains valid. Some physical quantities, such as displacement, have high volatility because they change frequently; others, such as temperature and humidity, are the opposite. Timeliness has two meanings. The first is that the data itself shall remain fresh, which can be measured by the difference between the current system time and the time of the data instance. The second is time alignment of multi-sourced data, which requires that data instances originating from the same node have the same interval, or that the data instances of different nodes are generated at the same time. It can be measured by the jitter size.
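The two timeliness measures described above can be illustrated with a short Python sketch. It assumes that each sample carries a timestamp in seconds, that freshness is the age of a sample relative to the current system time, and that jitter is the deviation of each inter-sample interval from the nominal $\Delta t$. These concrete formulas and the function names are assumptions for illustration, not the paper's definition of the time-related indicator.

```python
from typing import List, Sequence


def freshness(now: float, sample_time: float) -> float:
    """Age of a data instance: current system time minus its sampling time."""
    return now - sample_time


def interval_jitter(timestamps: Sequence[float], delta_t: float) -> List[float]:
    """Absolute deviation of each consecutive sampling interval from delta_t."""
    return [abs((b - a) - delta_t) for a, b in zip(timestamps, timestamps[1:])]


# Hypothetical timestamps of one node with a nominal interval of 10 s.
timestamps = [0.0, 10.0, 20.5, 29.5]
jitter = interval_jitter(timestamps, delta_t=10.0)
print(jitter)                     # [0.0, 0.5, 1.0]
print(max(jitter))                # 1.0
print(freshness(35.0, 29.5))      # 5.5
```

A node whose intervals all equal $\Delta t$ yields zero jitter throughout, which corresponds to perfectly aligned samples.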

The correctness indicator describes the closeness of the monitored value to the true value. For data obtained from one sampling of a specific physical quantity (such as temperature), the data is considered correct if the error between the measured value and the real value of the environment is less than a given threshold.

Definition 4 (Correctness indicator) Assume that the monitoring area has $n$ nodes, the monitoring time duration is $T$, and all nodes collect data with the same time interval $\Delta t$. The data sequence of node $i$ in the monitoring duration $T$ is $X_i = [X(i,1), X(i,2), \ldots, X(i,T/\Delta t)]$. The observation value can be expressed as $val = val_{real} + \Delta$, which is a combination of the real value of the environment $val_{real}$ and the error $\Delta$. The correctness of node $i$ at time $t$ is defined as follows:

$$f_a(val_t) = \begin{cases} 1, & \Delta < \xi_c \\ 0, & \Delta > \xi_c, \end{cases} \tag{11}$$

where $\xi_c$ is the error threshold.
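The per-sample correctness test of equation (11) can be sketched in Python as below. Two points are assumptions of the sketch rather than statements of the paper: the error is compared by its magnitude $|\Delta|$, and an error exactly equal to the threshold is treated as incorrect. The function name, the reference values, and the threshold are hypothetical.

```python
from typing import List, Sequence


def f_a(val: float, val_real: float, xi_c: float) -> int:
    """Return 1 if the observation lies within xi_c of the reference value."""
    delta = abs(val - val_real)  # assumption: compare the magnitude of the error
    return 1 if delta < xi_c else 0  # assumption: equality counts as incorrect


def correctness_flags(
    measured: Sequence[float], reference: Sequence[float], xi_c: float
) -> List[int]:
    """Apply f_a to every pair of measured and reference values."""
    return [f_a(m, r, xi_c) for m, r in zip(measured, reference)]


measured = [20.1, 20.4, 25.0, 19.8]    # 25.0 is a gross error
reference = [20.0, 20.0, 20.0, 20.0]   # hypothetical reference values
flags = correctness_flags(measured, reference, xi_c=0.5)
print(flags)  # [1, 1, 0, 1]
```

Because the real value of the environment is rarely available in practice, the reference sequence has to come from some estimate, and the flags are only as trustworthy as that estimate.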

Definition 5 (Data quality evaluation coefficient) Given the dataset $D$ in the time duration $T$, the data quality $Q$ is the weighted combination of the data quantity, correctness, completeness, and time-related indicators:

$$Q = \frac{\sum_{i=1}^{4} w_i \cdot q_i}{\sum_{i=1}^{4} w_i}, \tag{13}$$

where $w_i$ is the weight of each indicator.
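Equation (13) is a weighted mean of the four indicators, which can be sketched in Python as follows. The indicator values and weights in the example are made up for illustration, and the function name is hypothetical.

```python
from typing import Sequence


def quality_coefficient(indicators: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted combination of quality indicators, as in equation (13)."""
    if len(indicators) != len(weights):
        raise ValueError("one weight is needed per indicator")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * q for w, q in zip(weights, indicators)) / total_weight


# Hypothetical indicator values for volume, completeness, time, and correctness.
q = [0.9, 0.8, 0.7, 0.6]
equal = quality_coefficient(q, [1.0, 1.0, 1.0, 1.0])
volume_heavy = quality_coefficient(q, [2.0, 1.0, 1.0, 1.0])
print(round(equal, 4), round(volume_heavy, 4))  # 0.75 0.78
```

Dividing by the sum of the weights keeps $Q$ in the same range as the individual indicators, so the weights only need to express relative importance.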

VI. Performance and Evaluation

The simulation shows the relation between data volume and the other indicators. Data loss, delay, faults, and other errors in the dataset are assumed to be independent and steady, following a binomial distribution. In this paper, two datasets of different volumes are gathered at time Δt and 2Δt, and the other indicators of these two datasets are calculated respectively. The results are as follows.

As we can see in Fig. 2, in the case that the data volume of each node changes from 100 to 200, the metric of the time-dependent indicator decreases, the dataset completeness increases slightly, and the correctness indicator increases. The effect of volume of data on the other parameters is not definite.

When data volume rises, the three indicators either increase or decrease at once. Thus, Theorems 1 to 3 are verified.

Fig 2: The effect of data volume on other indicators

As we can see in Figure 3, when the completeness increases, the time-dependent indicator is almost unchanged, while the correctness indicator may increase or decrease. The variation of the correctness and time-dependent indicators during completeness cleaning is therefore indeterminate. At the same time, the mending of missing data repairs part of the lost data, so, as per Definition 1, the data volume of the nodes will rise. Thus, Theorems 4 to 6 are verified.

Figure 3: The effect of completeness on other indicators

The following group deals with the relationship between correctness and the other indicators. The data cleaning process for abnormal data is carried out twice in succession, so the correctness increases accordingly. After this, we can observe the changes in the other three indicators.

As we can see in Fig. 4, the cleaning process by eliminating the abnormal data will enhance the correctness, but the time-related and completeness indicators remain unchanged.

Figure 4: The effect of correctness on other indicators

Data Cleaning Simulation

In order to verify the performance of the proposed data cleaning strategy, we adopt two different sequential cleaning strategies under the same cleaning cost. The data before cleaning and the cleaned data are each compared with the true values of the environment so that the difference between them can be observed intuitively. Both approaches have the same cleaning cost and perform anomalous data detection and correction, lost data restoration, and removal of sample jitter. Because the real-time true value of the environment is not available, we use the average of 54 nodes in its place.

As we can see from the first plot in Figure 5, there are many errors, such as data loss, gross error, and sample jitter, in dataset D of node 7. The quality metric Q is 65.34%. When D is cleaned with the proposed data cleaning approach, the final dataset D′ is closer to the true value (the second plot in Figure 5). The new quality metric Q is 89.43%. We also carry out the data cleaning strategy with order (4) in Section 4.5 and compare its result with the true value (the last plot in Figure 5). It can be seen that the proposed data cleaning strategy achieves a better cleaning effect on dataset D.

Figure 5: Comparison of data after cleaning with different data cleaning strategies

References

[1] C Batini, M Scannapieco, Data quality: concepts, methodologies and techniques (Springer Publishing Company, 2010). https://doi.org/10.1007/3-540-33173-5%2010.1109%2FICCSE.2012.88

[2] D Ganesan, S Ratnasamy, H Wang, et al., Coping with irregular spatio-temporal sampling in sensor networks. ACM Sigcomm Comput Communication Rev 34(1), 125–130 (2004). https://doi.org/10.1145/972374.972396

[3] Brade, T.; Kaiser, J.; Zug, S. Expressing validity estimates in smart sensor applications. In Proceedings of the 2013 26th International Conference on Architecture of Computing Systems (ARCS), Prague, Czech Republic,19–22 February 2013; pp. 1–8.

[4] Dietrich, A.; Zug, S.; Kaiser, J. Detecting external measurement disturbances based on statistical analysis for smart sensors. In Proceedings of the 2010 IEEE International Symposium on Industrial Electronics (ISIE), Bari, Italy, 4–7 July 2010; pp. 2067–2072

[5] Rodriguez, M.; Ortiz Uriarte, L.; Jia, Y.; Yoshii, K.; Ross, R.; Beckman, P. Wireless sensor network for data-center environmental monitoring. In Proceedings of the 2011 Fifth International Conference on Sensing Technology (ICST), Palmerston North, New Zealand, 28 November–1 December 2011; pp. 533–537.

[6] Scherer, T.; Lombriser, C.; Schott, W.; Truong, H.; Weiss, B. Wireless Sensor Network for Continuous Temperature Monitoring in Air-Cooled Data Centers: Applications and Measurement Results. In Ad-hoc, Mobile, and Wireless Networks; Li, X.Y., Papavassiliou, S., Ruehrup, S., Eds.; Lecture Notes in Computer

[7] Brade, T.; Kaiser, J.; Zug, S. Expressing validity estimates in smart sensor applications. In Proceedings of the 2013 26th International Conference on Architecture of Computing Systems (ARCS), Prague, Czech Republic, 19–22 February 2013; pp. 1–8.

[8] Rodger, J. Toward reducing failure risk in an integrated vehicle health maintenance system: A fuzzy multi-sensor data fusion Kalman filter approach for IVHMS. Expert Syst. Appl. 2012, 39, 9821–9836.

[9] https://jwcn-eurasipjournals.springeropen.com/articles/10.1186/s13638-018-1069-6