Facebook event data - Evaluation of Geospatial Features for Forecasting Parking Occupancy Using

Over its Graph API (v.2.8), Facebook provides public access to open infor-

mation created by users of event-related functionalities. Users can freely create

private or public event objects and invite other users. This provides a platform for

community-based interaction related to events. Commercial organizations mainly

use these functionalities for marketing purposes and private users for facilitation of

event information in the form of JSON responses. One method is based on the

previously obtained POI objects. Within Facebook’s graph structure, these are

directly linked to event entities as most of them are conducted at specific physical

locations in contrast to online events. The second data acquisition methodology is

a keyword-based search approach.

For the first collection method, the total population of place identification

numbers is divided into batches of 50 pieces that represent the maximum accepted

by the API within one request. These keys are passed with the API call to obtain

events that correspond to the respective places. One separate parameter deter-

mines the time frame covered by the request. Facebook also allows retrieval of

information related to past events. If more data is available for a focus area than

can be passed in one response, which is limited for internal performance reasons of the

API, the retrieved file contains links to further response pages. The crawler script

calls these exhaustively to obtain all information that fits the specified

parameters. As identification numbers for places frequently change, the script also

provides functionality to obtain updated object keys passed with error responses.

This avoids mistakes in the crawling process. Data collection for all available places

requires about 65,000 API calls, of which 57% are related to further response pages. In

total, about 1.7 million event objects are collected by conducting this procedure.
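
The batching and paging logic described above can be sketched as follows. This is a minimal illustration, not the crawler used in the study: the function names, the injected `fetch` callable, and the example page keys are hypothetical, and the response layout (`data` plus `paging.next`) is assumed to follow the usual Graph API paging convention.

```python
from typing import Callable, Dict, Iterator, List, Optional

BATCH_SIZE = 50  # maximum number of place IDs per request, as stated in the text


def batches(ids: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` place IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def collect_all_pages(fetch: Callable[[str], Dict], first_url: str) -> List[Dict]:
    """Follow `paging.next` links until no further response page is reported.

    `fetch` is a hypothetical callable that takes a URL and returns the
    decoded JSON response as a dict; it stands in for the real HTTP call.
    """
    events: List[Dict] = []
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        events.extend(page.get("data", []))
        url = page.get("paging", {}).get("next")
    return events


# Offline demonstration with a dictionary standing in for the API.
fake_pages = {
    "page-1": {"data": [{"id": "e1"}, {"id": "e2"}], "paging": {"next": "page-2"}},
    "page-2": {"data": [{"id": "e3"}]},
}
events = collect_all_pages(fake_pages.__getitem__, "page-1")
id_batches = list(batches([str(i) for i in range(120)]))
```

Passing the fetch function as an argument keeps the paging loop testable without network access; a real crawler would additionally need authentication, rate limiting, and the handling of changed place keys mentioned above.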

The keyword-based approach uses a different endpoint of the API that is

not location-specific. Thus, primarily keywords are passed that refer to certain

locations. First, a list of 109 cities in Germany is extracted from a publication

of the Organisation for Economic Co-operation and Development (OECD) [158].

These are used as keywords for the event search and lead to 14,700 objects retrieved

in total. Mainly due to duplicate city names on a global level, only about 90% of

retrieved, this approach is neglected. As a second procedure, specific POI names

are used as keywords to collect event objects that have no formal connection

in the graph structure but potentially show a real-world relationship based on

syntactic similarities found. Testing a sample of 2,000 random place names, in total

5,600 event objects are returned while only 1,100 of these have sufficient location

information and are set within Germany. It has to be noted that the keyword-based

search requires user-specific authentication and the response volume is strictly

limited. Thus, for large-scale data acquisition on a nationwide scope, this approach

cannot be realized.

4.3.2 Data characteristics

Users on Facebook have different options to indicate whether they will attend

an event. They can specifically accept the invitation of another user by claiming

attendance. They can also note that they are just interested in the event and

unsure of the actual attendance. Finally, they can also formally decline or avoid

replying to the event invitation. User counts for all of these options are provided

as attributes for each event in the dataset. However, due to policy changes as of

April 2017, the count of users who declined a specific event is no longer available

using the Graph API. By default, response objects with this attribute equal to

zero are returned. Thus, this factor is neglected for further analysis.

Figure 15 provides an overview of the distribution of popularity indices and

data quality. About 517,000 events from the total count of about 1.7 million objects

are significantly unpopular. The number of attending and interested users is equal

to zero for this segment. Thus, about 30% of the entire dataset is irrelevant for

further analysis. A lack of online popularity is assumed to be reflected in real-life

attendance and traffic demand created by the event. Objects without geographic

A smaller number of objects falls under both categories, lacking popularity and

specific location information. Together, both represent about six percent of all

events. For the creation of an event object, a specified starting time is required.

Regarding event end times, a meaningful share of users tend to avoid specifying

them. Roughly 271,000 objects do not contain end time information. For these,

it is not possible to draw conclusions on the observed traffic patterns as it remains

unclear which effects can be assigned to the event. Only about 47% of the collected

data fulfills all formal data quality requirements while about 484,000 objects in

this segment show particularly low popularity indicators. For public events,

fewer than ten positive attendance or interest responses are considered

to be particularly low. Thus, only about 19% of all events are perceived to have

immediate relevance for further analysis.
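
The filtering steps above can be sketched as a small classification routine. This is one possible reading of the described criteria, not the original implementation: the order of the checks and the label names are assumptions, and the field names (`attending_count`, `interested_count`, `end_time`, `place.location`) are assumed to match the event attributes returned by the Graph API.

```python
from typing import Dict

LOW_POPULARITY_THRESHOLD = 10  # fewer than ten positive responses, as in the text


def classify_event(event: Dict) -> str:
    """Assign an event to one of four hypothetical quality segments."""
    positive = event.get("attending_count", 0) + event.get("interested_count", 0)
    if positive == 0:
        return "no_popularity"
    location = (event.get("place") or {}).get("location") or {}
    has_coordinates = "latitude" in location and "longitude" in location
    if not has_coordinates or not event.get("end_time"):
        return "incomplete"
    if positive < LOW_POPULARITY_THRESHOLD:
        return "low_popularity"
    return "relevant"


def relevant_share(total: int, formal_share: float, low_popularity: int) -> float:
    """Share of events that pass the formal checks and are not low in popularity."""
    return (total * formal_share - low_popularity) / total


# Rounded figures from the text: 1.7 million events, 47% formally valid,
# about 484,000 of those with particularly low popularity.
share = relevant_share(1_700_000, 0.47, 484_000)
```

With the rounded figures from the text, `share` evaluates to roughly 0.185, which is consistent with the reported value of about 19%.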

The dataset comprises 40 different event categories. Only about 91,200 events

have assigned information in the respective field, representing a share of about five

percent compared to all objects collected. A strong tendency of the dataset towards

events in the entertainment sector is observed. Work-related events are strongly

underrepresented and only make up about 1.5% of all objects in the dataset. Fig-

ure 16 visualizes the event distribution based on the number of retrieved objects.

Here, the 40 basic event categories in the dataset are assigned to eleven superior

segments represented by homogeneously colored tiles. Subordinate tiles represent

occurrences of the original categories. The segment for miscellaneous events com-

prises categories that may include different thematic trends which cannot be clearly

assigned to other segments without introducing a bias. The explicit assignment of

original to superior categories is visualized in Appendix F2.

Figure 16. Distribution of event occurrence by superior categories

Figure 17 shows the ten most popular subordinate event categories sorted by

the median attending user count. Nightlife-related events represent the most pop-

who have not replied to the event invitation are located in an intermediate range.

Nightlife events alone correspond to 24% of all attending users summarized over

all categories. All segments contain single objects that represent characteristic

attending count outliers. They are indicated by plotted points above the boxplot

whiskers.
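
A ranking such as the one in Figure 17 can be derived by grouping events by category and sorting by the median attending count. The following is a minimal sketch under the assumption that each event is a dictionary with a `category` and an `attending_count` field; the function name and the sample values are illustrative only and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Tuple


def top_categories_by_median(events: List[Dict], n: int = 10) -> List[Tuple[str, float]]:
    """Return the `n` categories with the highest median attending count."""
    counts = defaultdict(list)
    for event in events:
        category = event.get("category")
        if category:  # events without an assigned category are skipped
            counts[category].append(event.get("attending_count", 0))
    ranked = sorted(counts.items(), key=lambda item: median(item[1]), reverse=True)
    return [(category, median(values)) for category, values in ranked[:n]]


# Illustrative sample, not real data.
sample = [
    {"category": "NIGHTLIFE", "attending_count": 100},
    {"category": "NIGHTLIFE", "attending_count": 300},
    {"category": "NIGHTLIFE", "attending_count": 200},
    {"category": "COMEDY", "attending_count": 50},
    {"category": "COMEDY", "attending_count": 70},
    {"category": None, "attending_count": 999},
]
ranking = top_categories_by_median(sample)
```

The median is used rather than the mean because, as noted above, every segment contains outliers with very high attending counts.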

Figure 17. Ten most popular Facebook event categories

7.9% of the collected event objects take place in the future. This is expressed

in Figure 18 by showing the accumulated share of event objects over a period

of several months relative to the data collection time. On average over all event

categories, the accumulation of past events approximates a linear function. This

spans a time period of about 14 months prior to the request time. For future

events, the accumulation reflects a near-asymptotic behavior. Only about two

Figure 18, nightlife and comedy events serve as an example for differences in the

timewise distribution. Generally, while nightlife events are rather planned on short

notice, comedy events are prepared longer in advance.

Comparing the results of different data collections, about 5,400 event objects

show changed popularity values after they had already finished. User interactions

are technically still possible for this timeframe and are conducted to a limited

extent. This stands in opposition to the initial assumption that past events are

no longer of interest. In fact, when past event information is collected at a certain

point in time, the obtained popularity indicators may slightly differ from the

original values at the time of the event. However, as less than one percent of the

filtered event objects is affected, this phenomenon is neglected for further analysis.
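
The comparison of two data collections can be sketched as follows. This is a hypothetical illustration of the idea, not the procedure used in the study: it assumes each collection is a dictionary mapping event IDs to event records, that `end_time` has already been parsed into a `datetime`, and that the listed count fields are the popularity indicators of interest.

```python
from datetime import datetime
from typing import Dict, List

POPULARITY_FIELDS = ("attending_count", "interested_count", "noreply_count")


def changed_after_end(first: Dict[str, Dict], second: Dict[str, Dict],
                      first_collected_at: datetime) -> List[str]:
    """Return IDs of events that had ended before the first collection
    but show different popularity counts in the second collection."""
    changed = []
    for event_id, old in first.items():
        new = second.get(event_id)
        end = old.get("end_time")
        if new is None or end is None or end > first_collected_at:
            continue  # missing in second run, no end time, or not yet finished
        if any(old.get(field) != new.get(field) for field in POPULARITY_FIELDS):
            changed.append(event_id)
    return changed


# Illustrative records, not real data.
collected_at = datetime(2017, 5, 1)
run_1 = {
    "a": {"end_time": datetime(2017, 4, 1), "attending_count": 10},
    "b": {"end_time": datetime(2017, 4, 2), "attending_count": 20},
    "c": {"end_time": datetime(2017, 6, 1), "attending_count": 30},
}
run_2 = {
    "a": {"end_time": datetime(2017, 4, 1), "attending_count": 12},
    "b": {"end_time": datetime(2017, 4, 2), "attending_count": 20},
    "c": {"end_time": datetime(2017, 6, 1), "attending_count": 45},
}
affected = changed_after_end(run_1, run_2, collected_at)
```

In the sample, only event `a` is reported: event `b` is unchanged, and event `c` had not yet finished at the time of the first collection.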

4.4 Reference data

In document Evaluation of Geospatial Features for Forecasting Parking Occupancy Using Social Media Data (Page 70-77)