Design Overview - Personal Data Management in the Internet of Things

Bolt provides applications with a stream abstraction, where each stream is a collection of records, and each record has a timestamp and one or more tag-value pairs, that is, <timestamp, <tag1, value1>, [<tag2, value2>, ...]>. Streams are uniquely identified by the three-tuple <VHomeID, AppID, StreamID>. Bolt allows retrieval and filtering of a stream’s records using time ranges and tags. We first explain our design assumptions and Bolt’s data guarantees, followed by a description of the key design elements in Section 4.3.2. Highlighting the key design elements enables us to describe the design in detail in Section 4.4.
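The stream abstraction described above can be sketched as follows. This is a minimal in-memory illustration, not Bolt's actual API: the class names, the `append` and `get` methods, and the field names are all hypothetical and chosen only to mirror the record layout and the three-tuple identifier.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class StreamId:
    """The three-tuple that uniquely identifies a stream (field names are assumed)."""
    home_id: str
    app_id: str
    stream_id: str


@dataclass
class Record:
    """One record: a timestamp plus one or more tag-value pairs."""
    timestamp: int
    tags: Dict[str, Any]


@dataclass
class Stream:
    """In-memory stand-in for a stream; records are only ever appended."""
    sid: StreamId
    records: List[Record] = field(default_factory=list)

    def append(self, timestamp: int, **tags: Any) -> None:
        if not tags:
            raise ValueError("a record needs at least one tag-value pair")
        self.records.append(Record(timestamp, dict(tags)))

    def get(self, start: int, end: int, tag: Optional[str] = None) -> List[Record]:
        """Return records with start <= timestamp <= end, optionally filtered by tag."""
        return [r for r in self.records
                if start <= r.timestamp <= end and (tag is None or tag in r.tags)]


s = Stream(StreamId("home-1", "energy-app", "power"))
s.append(100, watts=40)
s.append(160, watts=55, room="kitchen")
s.append(220, temp=21.5)
hits = s.get(100, 200, tag="watts")  # time-range query filtered on one tag
```

Here `hits` holds the two records in the range [100, 200] that carry the `watts` tag; the third record is excluded by both the time range and the tag filter.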

4.3.1 Security Assumptions and Guarantees

Bolt does not trust the cloud storage servers to maintain data confidentiality or integrity. It assumes that the storage infrastructure is capable of performing unauthorized reads or modifications to stream records and can return old data when queried. Building on top of this untrusted storage infrastructure, Bolt provides the following three data security guarantees:

1. Confidentiality: Data in a stream can be read only by an application to which the owner, that is, the writer, grants access. Once the owner revokes access, the reader cannot access data stored after revocation.

2. Tamper Evidence: Readers can detect if data has been tampered with by anyone other than the owner. However, Bolt does not defend against denial-of-service attacks, for example, where a storage server deletes all data or rejects all read requests.

3. Freshness: Readers can detect if the storage server returns stale data, that is, data that is older than a given owner-configurable time window.

4.3.2 Key Techniques

We describe the four main techniques that allow Bolt to meet these design requirements.

Chunking

Bolt stores data records in a log per stream called the DataLog, which enables low-latency append-only writes. Streams have an index into the DataLog to support efficient lookups, filtering on tags, and temporal range and sampling queries. A contiguous sequence of records within a log constitutes a chunk. A chunk is the basic unit of transfer for storage and retrieval. Data writers upload chunks instead of individual records. Bolt compresses chunks before uploading them, which lowers transfer time and storage space required.

Readers also fetch data at the granularity of chunks. Although this may retrieve more records than are needed to answer a given query, the resulting inefficiency is partially mitigated by the fact that applications, such as the ones surveyed in Section 4.2.1, are often interested in multiple successive queries rather than a single query. Fetching chunks instead of individual records reduces the delay for common queries with temporal locality because it avoids additional round trips.

Note that typical sensor data, when packed into chunks, has high compression ratios, which lowers the number of bytes transferred when fetching chunks to serve data reads.

Chunks can be compressed using existing compression techniques such as GZip and delta encoding of timestamps and values of records within a chunk, which will further improve storage and transfer efficiency. These techniques do not have a significant impact on retrieval time because chunks can be uncompressed and decoded, in parallel, using a pipeline.
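To make the combination of delta encoding and GZip concrete, the following is a minimal sketch under stated assumptions: each record carries a single integer value, the chunk is serialized as JSON, and the function names `encode_chunk` and `decode_chunk` are hypothetical. It does not reflect Bolt's actual on-disk chunk format.

```python
import gzip
import json
from typing import List, Tuple

# (timestamp, value); a single integer value per record is an assumption
Record = Tuple[int, int]


def encode_chunk(records: List[Record]) -> bytes:
    """Delta-encode timestamps and values, then gzip the serialized deltas."""
    deltas = []
    prev_ts, prev_val = 0, 0
    for ts, val in records:
        deltas.append([ts - prev_ts, val - prev_val])
        prev_ts, prev_val = ts, val
    return gzip.compress(json.dumps(deltas).encode("utf-8"))


def decode_chunk(blob: bytes) -> List[Record]:
    """Invert encode_chunk: gunzip, then rebuild absolute values from deltas."""
    deltas = json.loads(gzip.decompress(blob).decode("utf-8"))
    records = []
    ts, val = 0, 0
    for d_ts, d_val in deltas:
        ts, val = ts + d_ts, val + d_val
        records.append((ts, val))
    return records


# Synthetic periodic sensor readings: one sample per minute, slowly varying value
readings = [(1_700_000_000 + 60 * i, 200 + (i % 3)) for i in range(500)]
blob = encode_chunk(readings)
raw_size = len(json.dumps(readings).encode("utf-8"))
```

Because periodic sensor data produces small, repetitive deltas, the encoded chunk in this example is much smaller than the raw serialization, and decoding restores the original records exactly.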

Separation of Index and Data

Bolt maintains an index for each stream. The index stores information about the location of different records stored in the stream’s log. When answering queries for data stored remotely, Bolt first fetches and stores the stream index on a local disk. This separation of the index and DataLog enables two key properties.

First, when answering read queries for data stored remotely, the index (fetched and stored locally) can be used to determine the chunks that should be fetched from remote servers. A dedicated computation endpoint, such as a query processing engine hosted in the cloud, is therefore not required, thus reducing storage and retrieval costs. This allows Bolt to use existing storage servers that only provide get and put APIs.

Second, this separation allows Bolt to relax its trust assumptions for storage servers: by encrypting data, it supports untrusted cloud providers without compromising data confidentiality. The data can be encrypted before storing and decrypted after retrieval, and the storage provider does not need to support any data semantics. Using untrusted cloud providers is challenging if the provider is expected to perform index lookups on the data.
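The two properties above can be illustrated with a small sketch in which the client resolves a query against its local index and then issues only plain `get` calls for opaque blobs. The `BlobStore` class, the index entry layout, and `chunks_for_query` are hypothetical names introduced for illustration; Bolt's real index structure is described in Section 4.4.

```python
from typing import Dict, List, Set, Tuple


class BlobStore:
    """Stand-in for a storage server that offers only get and put."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self.gets = 0  # counts round trips to the server

    def put(self, key: str, value: bytes) -> None:
        self._blobs[key] = value

    def get(self, key: str) -> bytes:
        self.gets += 1
        return self._blobs[key]


# Assumed index entry: chunk key -> (first timestamp, last timestamp, tags present)
IndexEntry = Tuple[int, int, Set[str]]


def chunks_for_query(index: Dict[str, IndexEntry],
                     start: int, end: int, tag: str) -> List[str]:
    """Use the local index to select only chunks that can hold matching records."""
    return [key for key, (lo, hi, tags) in index.items()
            if lo <= end and hi >= start and tag in tags]


store = BlobStore()
index: Dict[str, IndexEntry] = {}
layout = [(0, 99, {"temp"}), (100, 199, {"temp", "watts"}), (200, 299, {"watts"})]
for n, (lo, hi, tags) in enumerate(layout):
    key = f"chunk-{n}"
    store.put(key, b"<opaque encrypted chunk>")  # the server never sees plaintext
    index[key] = (lo, hi, tags)

needed = chunks_for_query(index, 150, 250, "watts")
fetched = [store.get(k) for k in needed]
```

The server in this sketch never interprets the stored bytes, so it could hold ciphertext without any change to the query path; all selection logic runs on the client against the local index.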

Segmentation

Since applications only append new data and do not perform random writes, stream DataLogs can grow very large over time. Bolt allows archiving of contiguous portions of a stream, which we call segments, while still allowing efficient querying. The storage location of each segment can be configured independently, enabling streams to be stored across multiple storage providers. Hence, streams may be stored either locally, remotely on untrusted servers, replicated for reliability, or striped across multiple storage providers for cost effectiveness. This configurability allows applications to prioritize their storage requirements of space, latency, cost, and reliability. Bolt currently supports local disk, Windows Azure storage, and Amazon S3 as storage providers.
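As a rough illustration of per-segment storage configuration, the sketch below maps a timestamp to the segment that covers it and reports where that segment is stored. The `Segment` type, the location labels, and the assumption that segments are delimited by their starting timestamp are hypothetical and serve only to show the idea.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    start_ts: int   # timestamp of the first record in the segment (assumed delimiter)
    location: str   # hypothetical label such as "local", "azure", or "s3"


def segment_for(segments: List[Segment], ts: int) -> Segment:
    """Find the segment covering `ts`; `segments` must be sorted by start_ts."""
    starts = [s.start_ts for s in segments]
    i = bisect_right(starts, ts) - 1
    if i < 0:
        raise KeyError(f"no segment covers timestamp {ts}")
    return segments[i]


# Older segments archived to remote providers, the newest kept on local disk
segments = [Segment(0, "s3"), Segment(1000, "azure"), Segment(2000, "local")]
where = segment_for(segments, 1500).location
```

A query for timestamp 1500 resolves to the segment stored under the "azure" label, while recent data stays on the local disk.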

Decentralized Access Control and Signed Hashes

To maintain confidentiality when using untrusted storage servers, Bolt encrypts the stream with a secret key generated by the owner, that is, the writer. Our design supports encryption of both the index and data, but by default we do not encrypt indices for efficiency,² though in this configuration some information may be leaked through data stored in indices.

We use lazy revocation [107] to reduce the computation overhead of cryptographic operations. Lazy revocation only prevents evicted readers from accessing content stored after revocation, because any content stored before revocation may have already been accessed and cached by such readers. Among the well-known key management schemes that support lazy revocation, we use hash-based key regression [86] for its simplicity and efficiency. It enables the owner to share only the most recent key with authorized readers, from which the readers can derive all previous keys to decrypt any content encrypted using those keys.
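The idea behind hash-based key regression can be sketched with a simple hash chain. This is an illustrative sketch, not Bolt's implementation: the use of SHA-256, the `b"key"` prefix for deriving content keys, and all function names are assumptions. The owner builds the chain backwards from a random seed, so that hashing a state yields the previous version's state, while moving forward would require inverting the hash.

```python
import hashlib
import os
from typing import List


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def make_states(max_versions: int, seed: bytes) -> List[bytes]:
    """Owner side: build states so that states[i - 1] == H(states[i])."""
    states = [seed]
    for _ in range(max_versions - 1):
        states.append(H(states[-1]))
    states.reverse()  # states[0] is the oldest version, states[-1] is the seed
    return states


def key_from_state(state: bytes) -> bytes:
    """Derive the content key for a version from its state (domain-separated hash)."""
    return H(b"key" + state)


def unwind(state: bytes, steps: int) -> bytes:
    """Reader side: step back `steps` versions from a known state."""
    for _ in range(steps):
        state = H(state)
    return state


states = make_states(5, os.urandom(32))

# A reader who was given the version-3 state can derive keys for versions 0..3
reader_state = states[3]
old_key = key_from_state(unwind(reader_state, 2))  # key for version 1
```

On revocation, the owner advances to the next state in the chain and shares it only with the remaining readers. An evicted reader holding the version-3 state cannot compute the version-4 state without inverting the hash, yet can still decrypt content written under versions 0 to 3, which matches the semantics of lazy revocation.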

We use a trusted metadata server to store and distribute keys. It runs on a user’s personal VEE. Once an application has opened a stream, all its subsequent reads and writes occur directly between the storage server and the application. This prevents the metadata server from bottlenecking read or write operations. In addition to keys for encrypted data streams, the metadata server also stores additional per-stream metadata, which we describe in detail in Section 4.4.4.

Encryption provides confidentiality of data. However, a remote untrusted server storing part of a stream may modify data records or could return old copies of the data. Therefore, we incorporate data integrity and freshness checks into each stream. To facilitate integrity checks on data, the writer generates a hash of stream contents, which is verified by the readers. To enable freshness checks, similar to SFSRO [87] and SiRiUS [91], we include a freshness time window as a part of a stream’s integrity metadata (denoted MD_int). This time window denotes the time up to which a stream’s data can be deemed fresh; it is based on the periodicity with which writers expect to generate new data, since typical writers periodically append new values to streams. Writers periodically update and sign this time window, which readers use to verify records when they open a stream for reading.

² Index decryption and encryption is a one-time cost paid at the start and end of a session of queries, respectively (called stream open and close), and is proportional to the size of the index.
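The integrity and freshness checks described above can be sketched as follows. Two simplifications are assumed and should be read as such: an HMAC with a demo key stands in for the writer's digital signature (a real deployment would use a public-key signature so that readers cannot forge metadata), and the metadata layout with `hash` and `fresh_until` fields is hypothetical rather than Bolt's actual MD_int format.

```python
import hashlib
import hmac
import json

# Stand-in only: a real writer would sign with a private key, not a shared secret
SIGNING_KEY = b"owner-demo-key"


def _sign(body: dict) -> str:
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def make_md_int(contents: bytes, window_s: int, now: float) -> dict:
    """Writer side: hash the contents and sign the hash with a freshness deadline."""
    body = {"hash": hashlib.sha256(contents).hexdigest(),
            "fresh_until": now + window_s}
    return {"body": body, "sig": _sign(body)}


def verify(contents: bytes, md_int: dict, now: float) -> str:
    """Reader side: return 'ok', 'tampered', or 'stale'."""
    body = md_int["body"]
    if not hmac.compare_digest(_sign(body), md_int["sig"]):
        return "tampered"  # the metadata itself was modified
    if hashlib.sha256(contents).hexdigest() != body["hash"]:
        return "tampered"  # the data does not match the signed hash
    if now > body["fresh_until"]:
        return "stale"     # the server may be replaying an old copy
    return "ok"


md = make_md_int(b"chunk-bytes", window_s=600, now=1000.0)
status_ok = verify(b"chunk-bytes", md, now=1200.0)
status_tampered = verify(b"modified-bytes", md, now=1200.0)
status_stale = verify(b"chunk-bytes", md, now=2000.0)
```

If the writer stops refreshing the signed window, or the server replays an old copy of the metadata, the reader's clock eventually passes `fresh_until` and the data is reported as stale rather than silently accepted.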
