

CHAPTER 2 RELATED WORKS

2.6 Network Coding in Collaborative Data Storage

In [37], Dimakis et al. propose a class of distributed erasure codes to solve the data collection query problem. The contribution of the paper is a proof that O(ln(k)) “pre-routed” packets per data node suffice for the raw data to be retrieved with high probability by collecting packets from any k of the n storage nodes. Essentially, Dimakis et al. make the random bipartite graph between data nodes and storage nodes as sparse as possible while still guaranteeing a perfect matching with high probability (w.h.p.), which ensures that the codes are provably decodable. Another interesting point in [37] is the connection between distributed erasure codes and network coding: distributed erasure codes can be viewed as linear network codes between data nodes and storage nodes placed in distinct parts of a random bipartite graph. There are no explicit routes between data nodes and storage nodes, because the packets are randomly generated. In [38], Dimakis et al. study the problem of constructing fountain codes for distributed storage and data collection. The technical contribution of [38] is a degree distribution for fountain codes under which the “pre-routing” degree of each data node is bounded by a constant almost always (a.a.). A randomized algorithm is then proposed to find the random “pre-route” between each encoded packet and the raw data on a given grid topology.
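
The sparse bipartite construction described for [37] can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes coefficients drawn from a small prime field GF(257), a hypothetical constant c = 5 in the per-node degree c·ln(k), and function names (build_code, rank_mod_p, decodable) chosen here purely for illustration. A query to k storage nodes succeeds exactly when the k × k coefficient sub-matrix held by those nodes has full rank.

```python
import math
import random

P = 257  # small prime field size, an assumption made only for illustration


def build_code(k, n, c=5.0, rng=None):
    """Return an n x k coefficient matrix over GF(P).

    Each of the k data nodes sends its packet to ceil(c * ln k) distinct
    storage nodes chosen uniformly at random; each receiving storage node
    multiplies that packet by a random nonzero coefficient.
    """
    rng = rng or random.Random()
    degree = max(1, min(n, math.ceil(c * math.log(k))))
    G = [[0] * k for _ in range(n)]
    for data_node in range(k):
        for storage_node in rng.sample(range(n), degree):
            G[storage_node][data_node] = rng.randrange(1, P)
    return G


def rank_mod_p(rows, p=P):
    """Rank of a matrix over GF(p), computed by Gaussian elimination."""
    m = [row[:] for row in rows]
    rank = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] % p), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], p - 2, p)  # modular inverse via Fermat
        m[rank] = [(x * inv) % p for x in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][col] % p:
                f = m[r][col]
                m[r] = [(a - f * b) % p for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def decodable(G, queried, k):
    """True if the queried storage nodes determine all k data packets."""
    return rank_mod_p([G[i] for i in queried]) == k
```

For example, decodable(build_code(20, 40), random.sample(range(40), 20), 20) checks a single random query; repeating the check over many random queries gives an empirical estimate of the decoding probability for a chosen c.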

[39] provides an LT-code-based network coding scheme to solve the data persistence problem in a large-scale network. Its contribution is to find random routes between data nodes and encoded packets using random walks. Each data node disseminates a constant number of packets into the network, and each packet stops at a storage node with a probability computed from that node's selected RSD (Robust Soliton Distribution) degree. To ensure that the stopping probability closely approaches the desired degree distribution in the network, [39] uses the Metropolis algorithm to construct the transition probabilities for forwarding packets on random walks. Nevertheless, constructing the Metropolis transition matrix requires the maximum node degree, which is hard to obtain in a large-scale network, especially when the topology changes. In [40], Aly et al. design two distributed data dissemination algorithms based on LT codes that eliminate the need for global knowledge of the maximum node degree. In the first algorithm, the network size n and the number of raw symbols k are still assumed to be available at each node, whereas in the second algorithm n and k are estimated from the random-walk data dissemination itself. Later, in [41], Aly et al. develop a Raptor-code-based coding scheme, which reduces the packet demand from logarithmic to constant by inserting a proper pre-code in front of the LT codes. Most recently, a survey [42] summarizes research problems and results on maintaining the reliability of distributed storage systems by partially repairing the encoded packets in poor nodes. This repair demands only a partial recovery of the replaced codes, whereas the distributed network coding schemes mentioned above focus on fully decoding the raw symbols from a sufficient subset of codes. [43] proposes a “packet-centric” approach for distributed data storage, which is also based on a random walk mechanism. Every rateless packet is initialized with a selected code degree. As the rateless packet randomly traverses the network, it collects exactly as many encoding symbols from uniformly distributed sensor data as its code degree and terminates at a random node, which may require different storage space at different nodes.
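
The two ingredients named above for [39], the Robust Soliton Distribution and a Metropolis transition matrix, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the scheme of [39]: the RSD parameters c = 0.1 and delta = 0.5, the function names, and the choice of a target stationary distribution pi supplied by the caller are all assumptions. The sketch does show why the maximum node degree is needed: every off-diagonal transition probability is scaled by 1/d_max.

```python
import math


def robust_soliton(k, c=0.1, delta=0.5):
    """Return mu[0..k], the Robust Soliton Distribution over degrees.

    Assumes parameters for which R > delta, so the spike term is positive.
    """
    R = c * math.log(k / delta) * math.sqrt(k)
    spike = max(1, min(k, int(round(k / R))))
    rho = [0.0] * (k + 1)  # ideal soliton part
    tau = [0.0] * (k + 1)  # robustness correction
    rho[1] = 1.0 / k
    for d in range(2, k + 1):
        rho[d] = 1.0 / (d * (d - 1))
    for d in range(1, spike):
        tau[d] = R / (d * k)
    tau[spike] = R * math.log(R / delta) / k
    beta = sum(rho) + sum(tau)  # normalizing constant
    return [(rho[d] + tau[d]) / beta for d in range(k + 1)]


def metropolis_matrix(adj, pi):
    """Transition probabilities of a random walk with stationary law pi.

    adj: dict mapping each node to the set of its neighbors.
    pi:  dict mapping each node to its target stationary probability.
    The global maximum degree d_max appears in every entry, which is the
    global knowledge that later schemes try to avoid.
    """
    d_max = max(len(nbrs) for nbrs in adj.values())
    P = {i: {} for i in adj}
    for i, nbrs in adj.items():
        for j in nbrs:
            P[i][j] = min(1.0, pi[j] / pi[i]) / d_max
        # Remaining probability mass is a self-loop (the walk stays put).
        P[i][i] = 1.0 - sum(P[i].values())
    return P
```

Because pi[i] * P[i][j] equals min(pi[i], pi[j]) / d_max, which is symmetric in i and j, the walk satisfies detailed balance and pi is its stationary distribution.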

However, in those random-walk-based schemes [38–41, 43], the message cost is significant. Our work reduces the message cost during the codeword construction period to O(1) per node. Each node only needs to perform several rounds of broadcast and, in each period, to encode several randomly heard packets into a single encoded packet. This also implies that the storage space requirement is uniform across the network. The scheme also terminates quickly, which makes it more resilient to network disruptions.

The distributed data storage schemes proposed in [44, 45] maximize network storage capacity by offloading local data to the network when memory overflows. However, they do not consider disruptive network conditions. In [46], the proposed scheme replicates data items on mobile hosts to improve data accessibility in a partitioned network by considering mobile users’ access behavior and read/write patterns. SolarStore [27] maximizes the amount of retrieved data under energy and storage constraints. It dynamically adjusts the degree of data replication according to the available energy and storage.

Recently, applying network coding to preserve data persistence under disruptive node conditions has been a subject of increasing research interest [47–50]. The advantage of network coding is that it increases the diversity and redundancy of data packets and thereby improves data persistence. [35] proposes Growth Codes to preserve data persistence in a disruptive sensor network. Growth Codes increase their code degree on the fly as data is collected at the gateway. The points at which the code degree changes are designed to maximize data persistence when decoding. [51] proposes geometric random linear codes, which encode data hierarchically over geographic regions of different sizes. This makes it possible to recover from a small number of node failures locally. In [52], Albano et al. construct a random linear code for distributed data storage with near-linear message cost. In particular, the paper employs spatial gossip to construct the linear codes. The idea is that each node i chooses another node j with probability proportional to 1/|ij|^3, where |ij| is the Euclidean distance between i and j. It is proved that a data symbol from node i is delivered to a node j with probability 1 − O(1/n) after O(log^3.4 n) iterations. Thus, using spatial gossip bounds the total number of transmissions during code construction by O(n polylog n). The above methods rely on random linear coding, which requires data packets to traverse the network to generate the linear codes. Our ECPC approach only needs to encode data received from localized broadcasts.
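
The partner-selection rule of spatial gossip, in which node i picks node j with probability proportional to the inverse cube of their Euclidean distance, can be sketched as below. The function names, the representation of nodes as 2-D coordinate tuples, and the requirement that all positions be distinct are assumptions made for this illustration; the sketch covers only the selection step, not the code construction of [52].

```python
import math
import random


def gossip_weights(positions, i, exponent=3.0):
    """Selection probabilities for node i over all other nodes.

    positions: list of (x, y) tuples, assumed pairwise distinct.
    Node j is weighted by 1 / dist(i, j) ** exponent, then normalized.
    """
    xi, yi = positions[i]
    weights = {}
    for j, (xj, yj) in enumerate(positions):
        if j == i:
            continue
        dist = math.hypot(xi - xj, yi - yj)
        weights[j] = 1.0 / dist ** exponent
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def pick_partner(positions, i, rng=random):
    """Draw one gossip partner for node i according to gossip_weights."""
    probs = gossip_weights(positions, i)
    nodes = list(probs)
    return rng.choices(nodes, weights=[probs[j] for j in nodes], k=1)[0]
```

With three collinear nodes at (0, 0), (1, 0) and (2, 0), node 0 selects node 1 with probability 8/9 and node 2 with probability 1/9, which shows how strongly the rule favors nearby nodes while still allowing occasional long-range exchanges.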

Tang et al. [53] formalize storage-depletion-induced data redistribution as a minimum cost flow problem and devise a distributed data redistribution algorithm (PoF). [54] considers energy and storage depletion in order to maximize data preservation time. Valero et al. [55] formulate the problem as an optimization problem and use linear programming to find the optimal solution. A distributed algorithm (EDR2) is implemented for in-network storage and later data retrieval. [56] proposes a data redistribution scheme based on probabilistic broadcasting. Its contribution is to derive a proper probability distribution for rebroadcasting packets. Receivers encode data using LT codes and store the result, so that a collector can retrieve the data by visiting a small set of nodes. Hou et al. [57] seek to maximize the minimum remaining energy among the nodes storing data items and present a centralized greedy heuristic that approximates the optimal solution under certain conditions.
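
The way a collector recovers data from LT-coded packets can be illustrated with a generic sketch of XOR encoding and peeling (belief-propagation) decoding. This is the textbook LT mechanism, not the specific scheme of [56]; integer symbols, explicitly supplied index sets (in place of degrees sampled from a degree distribution) and the function names are assumptions made here.

```python
def lt_encode(symbols, indices):
    """Encode one packet as the XOR of the source symbols at `indices`."""
    value = 0
    for i in indices:
        value ^= symbols[i]
    return (frozenset(indices), value)


def peel_decode(packets, k):
    """Recover all k source symbols by peeling, or return None on failure.

    Repeatedly finds a packet that covers exactly one unknown symbol,
    recovers that symbol, and subtracts (XORs) it out of other packets.
    """
    pending = [[set(idx), val] for idx, val in packets]
    recovered = {}
    progress = True
    while progress and len(recovered) < k:
        progress = False
        for pkt in pending:
            # Remove symbols that are already known from this packet.
            for i in [i for i in pkt[0] if i in recovered]:
                pkt[0].discard(i)
                pkt[1] ^= recovered[i]
            if len(pkt[0]) == 1:
                (i,) = pkt[0]
                recovered[i] = pkt[1]
                pkt[0].clear()
                progress = True
    return recovered if len(recovered) == k else None
```

Decoding succeeds only if a degree-one packet is available at every peeling step, which is why the choice of degree distribution matters for how few packets the collector must gather.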

The above optimization algorithms, although they consider energy and storage constraints, ignore the important fact that disruptive conditions may arise in the course of data redistribution. A disruptive network seriously challenges data redistribution. Distinct from the existing works, the contributions of Ravine Stream are threefold. First, Ravine Stream conducts probabilistic broadcasting with adaptive transmission power to overcome disruptive connections during data redistribution, achieving high energy efficiency and low message redundancy. Second, in-situ recoding reduces data redundancy at the symbol level. Third, Ravine Stream generalizes the energy and storage constraints to a nodal utility, which includes failure probability and storage constraint. Based on the nodal utility, data storage decisions can be made in a distributed manner.