6.4 Thematic modal split modeling
6.4.4 Parking demand from topic models
It was found in the previous section, that the category scheme supplied by
Facebook or alternatively a summarized derivative cannot be applied as a basis
for parking demand modeling. While a lack of usable training samples is expected
mances can be found with regards to the definition of the used classes. Real-life
events are mixtures of multiple themes that can potentially interact. Binary deci-
sions for or against a certain class label neglect this complex relationship and choose
only one of many potential contents. Taking into account the previously proposed,
summarized event categorization, certain inconsistencies of exclusive categorical
labeling are observed. Music events, for instance, are likely to involve some sort
of entertainment-related activities and potentially consumption of alcohol that is
primarily covered with the category ’food’. Secondly, many category labels are
imprecisely defined, particularly in the ’miscellaneous’ group. Events in the origi-
nal categories ’community’,’meetup’ or ’neighborhood’ can account for a variety of
activities that involve different individual behavior of attendees. Basically, these
classes can contain events that cover arbitrary themes involving multiple persons.
For these reasons, LDA as an unsupervised topic modeling approach is applied to
avoid exclusive class separation and to introduce probabilistic thematic indicators.
Multiple LDA models are generated based on tf-idf features for the textual
data available for 50,000 and 100,000 random samples of the Facebook event
dataset, respectively. Event name, descriptions, place name and place category
are considered. Even though the available event dataset is much larger, batch-
based LDA training requires all samples to be loaded into memory. Given large
sparse matrices with 4,000 tf-idf features, memory limitations of the available hard-
ware determine the computational limits of the problem. Multiple LDA models
are constructed for different numbers of topics to be identified in the dataset. For
both training sets, a test set of 10,000 unseen event objects is transformed into
adequate feature vectors and used to calculate the respective model perplexity.
The obtained scores are shown in Figure 36.
Figure 36. Obtained perplexity of LDA models by number of topics and training set
the most accurate one. Additionally, the perplexity values obtained are found to
be almost independent from the amount of labeled training data used. Figure 37
visualizes the fit of the LDA model with five topics and the test corpus. Each
circle represents one topic and its penetration of the corpus based on the share of
tokens contained. A title is manually added for each topic as this step is not part
of the topic modeling process. In accordance with the features being identified as
relevant, the topic Shows comprises all events that involve performing live artists,
movie showings or similar. Sports objects denote events that focus on rather pas-
sive watching of athletes playing sports games, races or similar. Fitness events, to
the contrary, involve physical activity of the attendees. The topic Shopping relates
primarily to marketing activities initiated by local businesses. The Party topic
comprises clubbing and dancing events. The location of circles visualizes the the-
matic distance among identified topics. The input feature matrix is projected to
eficial for visualization as it emphasizes thematic distances, leading to clear topic
separations. The topic Party is chosen as an example to highlight the relevant
features being identified. As terms are stemmed in a preparatory process, the rele-
vant features represent only partly actual words. For all features, the overall term
frequency in the corpus, as well as the estimated frequency within the exemplary
topic, are calculated.
Figure 37. Topic distribution and most relevant features
Topic implications on modal split In order to highlight the relationship
between parking events and the identified topics, textual data for Facebook events
in the analysis area (Chapter 6.4.1) are transformed into tf-idf features. This
involves about 27,000 event objects while all relevant text related to one object
is treated as a separate document. A vectorizer module with 4,000 considered
features is used. It is trained on the entire dataset to account for representative
topic probabilities are assigned to each event object and the resulting vector is
added to analysis dataset. Additionally, the number of parking events within the
respective square polygons is added while parking events before, during and after
the event are distinguished. The total number of parking events in the respective
timeframes is subsequently separated into single features in accordance with the
topic probabilities. For example, a probability of 30% for one of the topics is
expected to indicate, that this topic is responsible for 30% of the observed parking
events. Figure 38 shows the mean number of parking events within 0.5 km focus
polygons organized by event topic and time. The analysis is conducted for both,
0.5 km and 2.0 km edge length of the focus polygons while identical patterns are
observed. The results using 2.0 km polygons are found in Appendix F6.
Figure 38. Probabilistic parking event distribution per topic (0.5 km focus poly- gons)
It can be seen that the topics ’Fitness’ and ’Sports’ generally involve a higher
observed timewise event phase. ’Party’ and ’Shopping’ events involve comparably
lower parking demand levels. The observed pattern for the topic ’Shows’ is varied
with regards to the different event timeframes. Respectively, 60 minutes before and
after the respective start and end time of the event are considered. As indicated
by the mean values over all topics, the number of parking events in the pre-event
phase is generally higher than during the proceeding phases. While start times
represent a mandatory field for event objects, most do not contain information for
the end time attribute. Thus, an event duration of one hour is inserted for objects
with missing end time values. As this estimation naturally deviates from the real
end time, lower parking event counts within the tolerated 60-minute-timeframe are
observed.
Moreover, it is assumed that many events also have flexible start and end
times, have no scheduled activities or a strict attendance policy. On the one hand,
attendees of trade fairs, clubbing events or shopping specials, for instance, can
typically join the respective event at arbitrary times during the open timeframe.
In this case, the start and end time supplied determines the maximum possible
attendance time. On the other hand, events often require simultaneous group
activity. For example, workshops or scheduled fitness trainings generally lead to
clearer parking event patterns. The feature extraction workflow is applied to all
event objects in the dataset. Topic probabilities are subsequently used as input
for the main parking demand model.