Parking demand from topic models - Thematic modal split modeling

6.4 Thematic modal split modeling

6.4.4 Parking demand from topic models

It was found in the previous section, that the category scheme supplied by

Facebook or alternatively a summarized derivative cannot be applied as a basis

for parking demand modeling. While a lack of usable training samples is expected

mances can be found with regards to the definition of the used classes. Real-life

events are mixtures of multiple themes that can potentially interact. Binary deci-

sions for or against a certain class label neglect this complex relationship and choose

only one of many potential contents. Taking into account the previously proposed,

summarized event categorization, certain inconsistencies of exclusive categorical

labeling are observed. Music events, for instance, are likely to involve some sort

of entertainment-related activities and potentially consumption of alcohol that is

primarily covered with the category ’food’. Secondly, many category labels are

imprecisely defined, particularly in the ’miscellaneous’ group. Events in the origi-

nal categories ’community’,’meetup’ or ’neighborhood’ can account for a variety of

activities that involve different individual behavior of attendees. Basically, these

classes can contain events that cover arbitrary themes involving multiple persons.

For these reasons, LDA as an unsupervised topic modeling approach is applied to

avoid exclusive class separation and to introduce probabilistic thematic indicators.

Multiple LDA models are generated based on tf-idf features for the textual

data available for 50,000 and 100,000 random samples of the Facebook event

dataset, respectively. Event name, descriptions, place name and place category

are considered. Even though the available event dataset is much larger, batch-

based LDA training requires all samples to be loaded into memory. Given large

sparse matrices with 4,000 tf-idf features, memory limitations of the available hard-

ware determine the computational limits of the problem. Multiple LDA models

are constructed for different numbers of topics to be identified in the dataset. For

both training sets, a test set of 10,000 unseen event objects is transformed into

adequate feature vectors and used to calculate the respective model perplexity.

The obtained scores are shown in Figure 36.

Figure 36. Obtained perplexity of LDA models by number of topics and training set

the most accurate one. Additionally, the perplexity values obtained are found to

be almost independent from the amount of labeled training data used. Figure 37

visualizes the fit of the LDA model with five topics and the test corpus. Each

circle represents one topic and its penetration of the corpus based on the share of

tokens contained. A title is manually added for each topic as this step is not part

of the topic modeling process. In accordance with the features being identified as

relevant, the topic Shows comprises all events that involve performing live artists,

movie showings or similar. Sports objects denote events that focus on rather pas-

sive watching of athletes playing sports games, races or similar. Fitness events, to

the contrary, involve physical activity of the attendees. The topic Shopping relates

primarily to marketing activities initiated by local businesses. The Party topic

comprises clubbing and dancing events. The location of circles visualizes the the-

matic distance among identified topics. The input feature matrix is projected to

eficial for visualization as it emphasizes thematic distances, leading to clear topic

separations. The topic Party is chosen as an example to highlight the relevant

features being identified. As terms are stemmed in a preparatory process, the rele-

vant features represent only partly actual words. For all features, the overall term

frequency in the corpus, as well as the estimated frequency within the exemplary

topic, are calculated.

Figure 37. Topic distribution and most relevant features

Topic implications on modal split In order to highlight the relationship

between parking events and the identified topics, textual data for Facebook events

in the analysis area (Chapter 6.4.1) are transformed into tf-idf features. This

involves about 27,000 event objects while all relevant text related to one object

is treated as a separate document. A vectorizer module with 4,000 considered

features is used. It is trained on the entire dataset to account for representative

topic probabilities are assigned to each event object and the resulting vector is

added to analysis dataset. Additionally, the number of parking events within the

respective square polygons is added while parking events before, during and after

the event are distinguished. The total number of parking events in the respective

timeframes is subsequently separated into single features in accordance with the

topic probabilities. For example, a probability of 30% for one of the topics is

expected to indicate, that this topic is responsible for 30% of the observed parking

events. Figure 38 shows the mean number of parking events within 0.5 km focus

polygons organized by event topic and time. The analysis is conducted for both,

0.5 km and 2.0 km edge length of the focus polygons while identical patterns are

observed. The results using 2.0 km polygons are found in Appendix F6.

Figure 38. Probabilistic parking event distribution per topic (0.5 km focus polygons)

It can be seen that the topics ’Fitness’ and ’Sports’ generally involve a higher

observed timewise event phase. ’Party’ and ’Shopping’ events involve comparably

lower parking demand levels. The observed pattern for the topic ’Shows’ is varied

with regards to the different event timeframes. Respectively, 60 minutes before and

after the respective start and end time of the event are considered. As indicated

by the mean values over all topics, the number of parking events in the pre-event

phase is generally higher than during the proceeding phases. While start times

represent a mandatory field for event objects, most do not contain information for

the end time attribute. Thus, an event duration of one hour is inserted for objects

with missing end time values. As this estimation naturally deviates from the real

end time, lower parking event counts within the tolerated 60-minute-timeframe are

observed.

Moreover, it is assumed that many events also have flexible start and end

times, have no scheduled activities or a strict attendance policy. On the one hand,

attendees of trade fairs, clubbing events or shopping specials, for instance, can

typically join the respective event at arbitrary times during the open timeframe.

In this case, the start and end time supplied determines the maximum possible

attendance time. On the other hand, events often require simultaneous group

activity. For example, workshops or scheduled fitness trainings generally lead to

clearer parking event patterns. The feature extraction workflow is applied to all

event objects in the dataset. Topic probabilities are subsequently used as input

for the main parking demand model.

In document Evaluation of Geospatial Features for Forecasting Parking Occupancy Using Social Media Data (Page 117-122)