Data Accumulation - Predicting parking space availability based on heterogeneous data using Mac

6. Implementation

6.1 Data Accumulation

The data accumulation phase involves the collection of data from identified data sources. To retrieve data from their origins, scraping scripts were developed to extract the data available in different formats using Python-based standard frameworks described in Section 5.1. The scraping scripts were designed in such manner, that for every data

source, the available information after retrieval is stored in CSV format. The information data sources and for each data source which python frameworks were used, is described in Table 2.

Table 2. Information on identified data sources from Cork County

Data Source Origin Format Used Framework Weather data Weather station HTML Beautifulsoup4 and

Pandas

Traffic data 5 traffic counters HTML Selenium and Pandas Parking data 8 parking

garages

JSON Pandas Geographical

information

OpenStreetMap GeoJSON Overpass and Pandas

The weather data available from one of the data providers from Ireland was collected using Beautifulsoup4 and Pandas. The scraper for this data sources operates by sending a request to uniform resource locator (URL) where the information is available. Following that, the available information at the designated URL is parsed and converted into CSV format with required formatting. The second data source containing traffic counts requires human-like interaction to reach the intended webpage, where information is available. For this purpose, a combination of Selenium and Pandas is used to retrieve data in CSV. Similarly, the third data source which provides real-time parking occupancy information as it is updated. The data from this source is available through a web API in JSON format, for retrieving the information API URL is requested and received a response is then transformed to CSV format. Moreover, the received JSON response, in this case, is quite nested which required implementation of data normalization procedures.

Finally, the fourth and last data source for predicting parking availability contains information about amenities e.g. markets, parks, government service centres, hospitals, educational institutes, and others. This information is extracted from OpenStreetMap using overpass API and Pandas. Data from this source is extracted by defining a radius of 150 meters for each parking garage, which means that information of all amenities in the proximity of 150 meters is collected. This is used to determine the impact of particular amenity on each parking garage e.g. assuming a parking garage is located near markets and parks, its occupancy rate will be high compared to others especially on weekends. To enable the analytics model for determining this impact, each amenity extracted from OpenStreetMap is associated with defined classes. For example, places like fast food, continental restaurant, and cafes are classified as food service. Similarly, if there are convenience stores, book stores, toy shop, and grocery store they are classified as a market place, the same approach is applied to other types of amenities. The defined amenity classes used for classifying different places are listed below:

• Food service (used for classifying food and eating places)

• Entertainment service (used for classifying pubs, bars, and gaming places) • Transportation service (car rental service, travel agencies, and others) • Educational service (school, colleges, libraries, and others.)

• Customer service (banks, ATMs, and others)

• Recreational place (parks, and outdoor leisure places) • Workplace (offices, business centres, and others)

After associating amenities with each identified class their ratio is calculated with the total amount of classes, to make the predictive analysis more efficient.

The data is stored using a defined approach by Gilman et al. (2018), which allows ingesting retrieved data from all the data sources to HDFS depending upon their acquisition frequencies. The used approach is described in following sub-sections.

6.1.1 Manual Data ingestion

This approach is used for data sources which is updated rarely by the data provider, such data source is inserted to HDFS using web interface Hue. In case, data needs to be formatted and transformed to CSV format, custom scripts are used and then data is ingested to HDFS. In the context of this thesis, only one data source is ingested manually i.e. geographical information from OpenStreetMap, as data on maps are not updated frequently.

6.1.2 Automated Data ingestion

The automated data ingestion approach is used for data sources, where the script is written to load data automatically. Such sources have particular schedule i.e. data is updated by data provider every minute, hourly, daily or monthly. The scraping scripts for data sources with certain schedules are configured with cron scheduler, where the cron scheduler executes the scraping scripts as defined. As soon as the data arrives in the local disk – spooling directory from data source provider, it is picked up by Flume agent configured in the cluster. A sample configuration script is provided below:

spoolAgent.sources = splSrc spoolAgent.channels = splChannel spoolAgent.sinks = sinkToHdfs spoolAgent.sources.splSrc.channels = splChannel spoolAgent.sinks.sinkToHdfs.channel = splChannel spoolAgent.sources.splSrc.type = spooldir spoolAgent.sources.splSrc1.channels = splChannel spoolAgent.sources.splSrc.spoolDir = “” spoolAgent.sinks.sinkToHdfs.type = hdfs spoolAgent.sinks.sinkToHdfs.hdfs.fileType = DataStream spoolAgent.sinks.sinkToHdfs.hdfs.path = “” spoolAgent.channels.splChannel1.type = file spoolAgent.channels.splChannel1.checkpointDir spoolAgent.channels.splChannel1.dataDirs = “”

As it can be seen from the above-provided configuration, the spooling directory is configured, from where the source in Flume agent picks the data as it arrives, which is later passed to the HDFS sink and injected into the defined folder directory in HDFS. In this thesis, the described approach is adopted for three data sources i.e. weather data, traffic count data, and parking count data. The flow of data ingestion using Apache Flume is demonstrated below in Figure 8.

Figure 8. Data accumulation approach using Apache Flume, redrawn from (Gilman et al.,

2018)

In document Predicting parking space availability based on heterogeneous data using Machine Learning techniques (Page 37-40)