
4.2 Analysing location data and privacy

4.2.2 Data properties

This study aimed to analyse in depth the inferences that systems can make based on people’s location data. We wished to answer questions that would help us understand these inference mechanisms in greater depth. These questions are presented in Table 4.1.

Question

1. How complex is the inference mechanism?

2. Is this information linked to a specific individual?

3. Has the individual given their consent?

4. How good is the quality of this information?

5. Who has access to this information?

6. Who is the source of this information?

Table 4.1: Questions addressed.

Based on these questions we developed a set of properties that aim to provide deeper insight into the data and to highlight the implications of exposing location and contextual information. Furthermore, these properties highlight the richness of the analysed information and the possibilities it offers for profiling. An initial set of properties was defined based on the background literature and later refined using a small set of research papers retrieved from the Proceedings of the Mobile Human Computer Interaction conferences. That is, the initial set of properties was used to analyse data from a test sample of systems, and this initial analysis verified the selected properties.

These properties are useful because they not only show what data can be inferred and aggregated, but also address all the questions included in Table 4.1. Table 4.2 contains all the properties that were used in the analysis, and they are explained in the following subsections.

Data degree   Personally identifiable   User consent   Data quality   Data access   Data source
1st Degree    Directly                  Explicit       Accurate       User          User
2nd Degree    Indirectly                Implicit       Complete       User Friend   System
3rd Degree    Heuristically             -              Timely         3rd Party     User Friend
-             Non Identifiable          -              -              Everyone      3rd Party

Table 4.2: Data Properties.

4.2.2.1 Data degree

Looking at the different systems presented in these conferences, it quickly became evident that some systems made simple inferences based on the users’ location, whereas others used more sophisticated inference mechanisms. For example, Herbst et al. (2008) developed a mobile mixed reality game in which the location of each participant was used to determine how close they were to a point of interest; the inference was therefore rather simple to make. On the other hand, Cranshaw et al. (2010) collected user location data from a social network site in order to develop a model that predicted friendships between people. Evidently, this was a far more complex inference mechanism, requiring an algorithm to infer the new information.

Inspired by the background literature, where the issue of data inferences based on location was raised, the first property used in the analysis looks at how complex an inference is. We therefore called it data degree (Zafeiropoulou et al., 2012). This property addresses Question 1 in Table 4.1. Location and contextual data can thus be classified into different degrees of data based on the complexity of the inference that generated them.

• 1st degree of data. It refers to data that are not inferred through the system but are explicitly provided. For instance, in a location-based application the users explicitly declare their geographical location.

• 2nd degree of data. Data that are implicitly inferred, e.g. the co-location between two users.

• 3rd degree of data. Data that require inferences with more complex heuristics based on 1st and 2nd degree data. This may require the retrieval of data from a range of users.

The concept behind this classification of data can be further explained through the following scenario.

Scenario: Finding Alice

Alice is a regular smartphone user and allows her phone to update her location through a location-based application on a daily basis.

Mary, a friend of Alice, is also a smartphone user and has the same functionality enabled on her own phone.

A third party collects and stores the tracks of users of this specific application. As a consequence, it is aware of the movements of Alice and Mary. The application also identifies and counts the co-locations between users. If the number of co-locations between any two users is significant, it is inferred that these two people are socially related. Alice and Mary are often in the same location. Consequently, it is inferred that these two users are socially connected.

Overall, this scenario demonstrates the potential inference of several contextual elements in practice:

• location

• co-location

• activity

• social tie

• geographical hotspots

The above-mentioned contextual elements can be classified into different degrees of data based on their inference complexity. Location data is explicitly declared (i.e. no inference is required) and consequently belongs to the 1st degree of data. The 2nd degree of data refers to data that are inferred from location data, such as activity and co-location. The inference of co-location information makes use of data from Alice and from another user who is known to Alice (in this case, Mary). The 3rd degree of data makes use of more complex heuristics, such as making inferences by combining Alice’s data with the data from thousands of other users of the application who are unknown to Alice. An example could be the identification of social ties between users based on the number of co-locations between pairs of users. In the above scenario, the social tie between Alice and Mary could be inferred in that way.

Another example could be the identification of geographical hotspots based on the users’ location tracking.
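The three degrees in the scenario can be illustrated with a minimal sketch. It is not the mechanism used by any of the analysed systems: the check-in records, the hourly time slots, the function names and the threshold of three co-locations are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical check-ins (user, place, hour slot). These are 1st degree data:
# each user explicitly declares their location.
checkins = [
    ("alice", "cafe", 9), ("mary", "cafe", 9), ("bob", "cafe", 9),
    ("alice", "gym", 18), ("mary", "gym", 18),
    ("alice", "library", 14), ("mary", "library", 14),
    ("bob", "park", 12),
]

def co_locations(records):
    """2nd degree: count how often each pair of users shares a place and time slot."""
    present = {}
    for user, place, slot in records:
        present.setdefault((place, slot), set()).add(user)
    counts = Counter()
    for users in present.values():
        for pair in combinations(sorted(users), 2):
            counts[pair] += 1
    return counts

def social_ties(counts, threshold=3):
    """3rd degree: heuristic that treats frequent co-location as a social tie.
    The threshold is an assumed value, not one taken from the literature."""
    return {pair for pair, n in counts.items() if n >= threshold}

def hotspots(records, top=1):
    """3rd degree: most visited places across all users."""
    visits = Counter(place for _, place, _ in records)
    return [place for place, _ in visits.most_common(top)]

counts = co_locations(checkins)
ties = social_ties(counts)
popular = hotspots(checkins)
```

With this sample data, Alice and Mary are co-located three times and are therefore the only pair classified as socially tied, while the cafe is the most visited place.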

4.2.2.2 Personally identifiable data

The second property addresses an issue that is of paramount importance when it comes to people’s privacy. This issue refers to the use of personally identifiable information (PII), which includes any piece of data that uniquely identifies a particular person. It addresses Question 2, which was presented in Table 4.1. The location data we discuss in our analysis relates strictly to a device rather than a person. In that sense, location data can potentially be personally identifiable information, especially in cases where it is combined with other pieces of information.

The reason behind the use of this property is that we wished to distinguish the data that were linked to individuals from anonymous data.

• Directly Identifiable Data. An individual is explicitly related to a piece of information. For example, when a user shares their real-time location with a social network application, that location data is considered directly identifiable.

• Indirectly Identifiable Data. It can be easily inferred that an individual is related to a piece of information.

• Heuristically Identifiable Data. It can be heuristically inferred with some probability that an individual is related to a piece of information. For example, a location keyword (i.e. the semantic name used by people to describe a location) can be heuristically identifiable by combining a set of heuristics (e.g. Lin et al. (2010) used machine learning algorithms to associate users with location keywords).

• Non Identifiable Data. A piece of information is not related to any individual. For example, time-stamp information was regarded as non-identifiable information.

4.2.2.3 User consent

A question that was raised at the end of the previous section focused on the extent to which people are aware of the affordances of their data. This question was also included in Table 4.1 (Question 3). With that in mind we developed the property user consent.

This property focuses on whether the individual is asked to provide their consent before their location data is retrieved or published. User consent may be given not only explicitly but also implicitly, in cases where the user is not directly asked to give out their data, but the data are published with their full knowledge and the user does not take any action against it. User consent is only legally required for data that are PII.

4.2.2.4 Data quality

Another property that plays a significant role in location data exposure is the quality of the data (Question 4 in Table 4.1). The better the quality of the data, the more accurate the inferences made upon it will be and, consequently, the greater the threat to privacy.

Data quality was calculated based on a set of three different characteristics that are commonly used in data quality studies (Wang et al., 2008; Wang and Strong, 1996), which are accuracy, completeness and timeliness:

• Accurate Data. The data is precise and objective.

• Complete Data. The data is complete in the sense that no values are missing from it or there is nothing to be added to it.

• Timely Data. The data is current and not out-of-date.

4.2.2.5 Data access

As part of our investigation into the different types of data that are inferred based on location, we also wished to identify who had access to that data (Question 5 in Table 4.1). As shown in Figure 4.1, there are a number of different entities who may have access to the data. In addition, they might have different types of access (read/edit/disseminate). It is assumed that the system always has access to the data.

The sample is adequately described by a hierarchy, as shown in the figure, but a more complex structure may be appropriate for a wider sample; for example, a system might provide access to the data to itself and to third party systems, but not to the user or their contacts.

Figure 4.1: Who has access to data.
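The difference between a hierarchical and a non-hierarchical access structure can be sketched as follows. This is an illustrative sketch only: the ordering of audiences (User, User Friend, 3rd Party, Everyone) is assumed from the order in which they are listed in Table 4.2, and the names `Audience`, `Access`, `hierarchical_grants` and `can` are hypothetical. The system is left out because it is assumed to always have access.

```python
from enum import Flag, IntEnum, auto

class Audience(IntEnum):
    """Entities that may access the data, ordered from narrowest to widest
    (the ordering is an assumption based on Table 4.2)."""
    USER = 1
    USER_FRIEND = 2
    THIRD_PARTY = 3
    EVERYONE = 4

class Access(Flag):
    """Types of access; they can be combined with the | operator."""
    READ = auto()
    EDIT = auto()
    DISSEMINATE = auto()

def hierarchical_grants(widest, access):
    """Hierarchy case: every audience up to and including `widest` gets `access`."""
    return {a: access for a in Audience if a <= widest}

def can(grants, audience, access):
    """True if `audience` holds every access type contained in `access`."""
    held = grants.get(audience, Access(0))
    return (held & access) == access

# Hierarchical: the user and their friends can read the data.
friends_only = hierarchical_grants(Audience.USER_FRIEND, Access.READ)

# Non-hierarchical: third parties can read and disseminate the data,
# but the user and their contacts have no access at all.
third_party_only = {Audience.THIRD_PARTY: Access.READ | Access.DISSEMINATE}
```

A per-audience mapping is used rather than a single "widest audience" value so that the non-hierarchical case described above can also be expressed.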

4.2.2.6 Data source

The final property dealt with the origin of the data and addressed Question 6 in Table 4.1.

There can be a number of different sources of data, such as the user, the system or even friends of the user and 3rd parties.
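To show how the six properties fit together, the following sketch records them for a single piece of data. The class and field names are hypothetical, and the values assigned to the two example items are illustrative assumptions based on the Alice and Mary scenario, not results of the analysis.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional

class Degree(Enum):
    FIRST = 1
    SECOND = 2
    THIRD = 3

class Identifiability(Enum):
    DIRECT = "directly"
    INDIRECT = "indirectly"
    HEURISTIC = "heuristically"
    NONE = "non identifiable"

class Consent(Enum):
    EXPLICIT = "explicit"
    IMPLICIT = "implicit"

class Quality(Enum):
    ACCURATE = "accurate"
    COMPLETE = "complete"
    TIMELY = "timely"

class Party(Enum):
    """Entities that appear as accessors or sources of data in Table 4.2."""
    USER = "user"
    USER_FRIEND = "user friend"
    SYSTEM = "system"
    THIRD_PARTY = "3rd party"
    EVERYONE = "everyone"

@dataclass(frozen=True)
class DataItem:
    name: str
    degree: Degree
    identifiability: Identifiability
    consent: Optional[Consent]      # None when no consent was recorded
    quality: FrozenSet[Quality]
    access: FrozenSet[Party]
    sources: FrozenSet[Party]

    def is_personally_identifiable(self) -> bool:
        return self.identifiability is not Identifiability.NONE

# Illustrative values only (assumptions, not findings).
location = DataItem(
    name="Alice's declared location",
    degree=Degree.FIRST,
    identifiability=Identifiability.DIRECT,
    consent=Consent.EXPLICIT,
    quality=frozenset({Quality.ACCURATE, Quality.TIMELY}),
    access=frozenset({Party.USER, Party.USER_FRIEND}),
    sources=frozenset({Party.USER}),
)
social_tie = DataItem(
    name="Social tie between Alice and Mary",
    degree=Degree.THIRD,
    identifiability=Identifiability.HEURISTIC,
    consent=None,
    quality=frozenset(),
    access=frozenset({Party.THIRD_PARTY}),
    sources=frozenset({Party.SYSTEM}),
)
```

Quality, access and source are modelled as sets because a single piece of data may satisfy several quality characteristics and may be accessed by, or originate from, more than one entity.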
