5.4 Business Domain

5.4.2 Data Pre-processing

The flights dataset [94] consists of three tables: Flights, Airports, and Airlines. The Flights table has 5,819,079 flight records, each described by 32 attributes. The Airports and Airlines tables hold details about the airports and airlines, respectively.

Table 5.1 shows that the attributes describing the reason for a flight delay, such as AIRLINE DELAY and WEATHER DELAY, are missing in 81.7% of the records. Those attributes are therefore removed from the analysis. Table 5.1 also shows that the CANCELLATION REASON attribute has 98.4% missing values; this is because the attribute is filled in only for canceled flights, which are very few in number. We therefore keep this attribute, and in the next section we explain how it is used in the analysis. Note that all other attributes are more than 98% complete.
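The completeness check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `missing_percentages` and `columns_to_drop`, the 50% cut-off, and the sample rows are hypothetical and are not taken from View-360 or from the dataset.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[str]]


def missing_percentages(rows: List[Row], columns: Sequence[str]) -> Dict[str, float]:
    """Percentage of missing (None or empty string) values per column."""
    total = len(rows)
    result: Dict[str, float] = {}
    for col in columns:
        missing = sum(1 for r in rows if r.get(col) in (None, ""))
        result[col] = 100.0 * missing / total if total else 0.0
    return result


def columns_to_drop(pct: Dict[str, float],
                    threshold: float = 50.0,  # assumed cut-off, not from the source
                    keep: Sequence[str] = ("CANCELLATION REASON",)) -> List[str]:
    """Columns whose missing percentage exceeds the threshold, except those kept on purpose."""
    return [c for c, p in pct.items() if p > threshold and c not in keep]


# Made-up sample rows, for illustration only.
rows = [
    {"AIRLINE": "AA", "WEATHER DELAY": "", "CANCELLATION REASON": ""},
    {"AIRLINE": "DL", "WEATHER DELAY": "12", "CANCELLATION REASON": ""},
    {"AIRLINE": "UA", "WEATHER DELAY": "", "CANCELLATION REASON": "B"},
    {"AIRLINE": "AA", "WEATHER DELAY": "", "CANCELLATION REASON": ""},
]
pct = missing_percentages(rows, ["AIRLINE", "WEATHER DELAY", "CANCELLATION REASON"])
dropped = columns_to_drop(pct)
```

CANCELLATION REASON is listed in `keep` because its missing values carry meaning (the flight was not canceled), so it is retained despite being sparse.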

Moreover, the attributes DAY OF WEEK and MONTH are converted to equivalent text for readability of results.
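A minimal sketch of this conversion is given below. The mapping of DAY OF WEEK is an assumption (1 = Monday through 7 = Sunday); the source does not state which day the value 1 denotes, so the mapping should be checked against the dataset documentation. The function names are hypothetical.

```python
MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)

# Assumption: 1 = Monday ... 7 = Sunday (not confirmed by the source).
DAY_NAMES = (
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
)


def month_to_text(month: int) -> str:
    """Convert a month number (1-12) to its English name."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return MONTH_NAMES[month - 1]


def day_of_week_to_text(day_of_week: int) -> str:
    """Convert a day-of-week number (1-7) to its English name."""
    if not 1 <= day_of_week <= 7:
        raise ValueError(f"day of week out of range: {day_of_week}")
    return DAY_NAMES[day_of_week - 1]
```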

Table 5.2 shows the number of categories for each nominal attribute. The YEAR attribute has only one value because the dataset covers only the year 2015; we therefore remove the YEAR attribute from further processing.
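Counting the categories and spotting single-valued attributes such as YEAR could look as follows. This is a sketch, not the procedure used by the authors; the helper names and the sample records are hypothetical.

```python
from typing import Dict, List, Sequence


def unique_counts(rows: List[dict], columns: Sequence[str]) -> Dict[str, int]:
    """Number of distinct values observed for each column."""
    return {col: len({row[col] for row in rows}) for col in columns}


def constant_columns(counts: Dict[str, int]) -> List[str]:
    """Columns with a single distinct value, which carry no information."""
    return [col for col, n in counts.items() if n == 1]


# Made-up sample records, for illustration only.
sample = [
    {"YEAR": 2015, "MONTH": 1, "AIRLINE": "AA"},
    {"YEAR": 2015, "MONTH": 2, "AIRLINE": "DL"},
    {"YEAR": 2015, "MONTH": 2, "AIRLINE": "AA"},
]
counts = unique_counts(sample, ["YEAR", "MONTH", "AIRLINE"])
to_remove = constant_columns(counts)
```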

In the dataset, the ORIGIN AIRPORT and DESTINATION AIRPORT attributes contain more unique airport codes than the Airports lookup table. Some of these codes are numeric values that do not match any entry in the Airports table. Consequently, the records with numeric airport codes are removed by joining the Flights table with the Airports table. The result is a single table which contains valid airport codes

100 CHAPTER 5. VIEW-360: A PROTOTYPE SYSTEM FOR VIEW RECOMMENDATION

Attribute Name | Type | Description & Values | % missing
YEAR | Numeric | Year of the Flight Trip |
MONTH | Numeric | Month of the Flight Trip | 0%
DAY | Numeric | Day of the Flight Trip | 0%
DAY OF WEEK | Numeric | Day of week of the Flight Trip | 0%
AIRLINE | Nominal | Airline Identifier | 0%
FLIGHT NUMBER | Numeric | Flight Identifier | 0%
TAIL NUMBER | Nominal | Aircraft Identifier | 0.2%
ORIGIN AIRPORT | Nominal | Starting Airport | 0%
DESTINATION AIRPORT | Nominal | Destination Airport | 0%
SCHEDULED DEPARTURE | Numeric | Planned Departure Time | 0%
DEPARTURE TIME | Numeric | WHEELS OFF - TAXI OUT | 1.5%
DEPARTURE DELAY | Numeric | Total Delay on Departure | 1.5%
TAXI OUT | Numeric | The time duration elapsed between departure from the origin airport gate and wheels off | 1.5%
WHEELS OFF | Numeric | The time point that the aircraft's wheels leave the ground | 1.5%
SCHEDULED TIME | Numeric | Planned time amount needed for the flight trip | 0%
ELAPSED TIME | Numeric | AIR TIME + TAXI IN + TAXI OUT | 1.8%
AIR TIME | Numeric | The time duration between wheels off and wheels on time | 1.8%
DISTANCE | Numeric | Distance between two airports | 0%
WHEELS ON | Numeric | The time point that the aircraft's wheels touch the ground | 1.6%
TAXI IN | Numeric | The time duration elapsed between wheels-on and gate arrival at the destination airport | 1.6%
SCHEDULED ARRIVAL | Numeric | Planned arrival time | 0%
ARRIVAL TIME | Numeric | WHEELS ON + TAXI IN | 1.5%
ARRIVAL DELAY | Numeric | ARRIVAL TIME - SCHEDULED ARRIVAL | 1.8%
DIVERTED | Numeric | Aircraft landed at an airport other than the scheduled one | 0%
CANCELLED | Numeric | Flight Cancelled (1 = cancelled) | 0%
CANCELLATION REASON | Nominal | Reason for Cancellation of flight: A - Airline/Carrier; B - Weather; C - National Air System; D - Security | 98.4%
AIR SYSTEM DELAY | Numeric | Delay caused by air system | 81.7%
SECURITY DELAY | Numeric | Delay caused by security | 81.7%
AIRLINE DELAY | Numeric | Delay caused by the airline | 81.7%
LATE AIRCRAFT DELAY | Numeric | Delay caused by aircraft | 81.7%
WEATHER DELAY | Numeric | Delay caused by weather | 81.7%

Table 5.1: Flights Delay Dataset: Attributes Description

and airport names. Additionally, the Flights table is also joined with the Airlines table, so that the final table carries names and locations instead of just codes.
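The effect of the join with the Airports table can be illustrated with a small inner join in plain Python. The sketch assumes that the Airports table is available as a mapping from airport code to airport name; the function name, the added column names, and the sample values are hypothetical.

```python
from typing import Dict, List


def join_with_airports(flights: List[dict], airports: Dict[str, str]) -> List[dict]:
    """Inner join on origin and destination codes.

    Flights whose origin or destination code is absent from the lookup
    (for example a numeric code) are dropped; the rest gain name columns.
    """
    joined = []
    for flight in flights:
        origin_name = airports.get(flight["ORIGIN AIRPORT"])
        destination_name = airports.get(flight["DESTINATION AIRPORT"])
        if origin_name is None or destination_name is None:
            continue  # code not in the Airports table: record is excluded
        joined.append({
            **flight,
            "ORIGIN AIRPORT NAME": origin_name,            # hypothetical column name
            "DESTINATION AIRPORT NAME": destination_name,  # hypothetical column name
        })
    return joined


# Made-up sample data, for illustration only.
airports = {"JFK": "John F. Kennedy International Airport",
            "LAX": "Los Angeles International Airport"}
flights = [
    {"ORIGIN AIRPORT": "JFK", "DESTINATION AIRPORT": "LAX"},
    {"ORIGIN AIRPORT": "10397", "DESTINATION AIRPORT": "LAX"},
]
valid_flights = join_with_airports(flights, airports)
```

Using an inner join means the invalid codes never need to be enumerated: any code missing from the lookup table excludes the record.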

The ARRIVAL DELAY and DEPARTURE DELAY columns contain negative values, which indicate early arrival and early departure. For the analysis, however, any value less than zero is treated as no delay. Secondly, DEPARTURE DELAY is a very important column of the flight delay dataset, and View-360 uses it as a dimension attribute. However, the automatic binning of View-360 considers equal-width bins only. To add more depth to the analysis, a new categorical calculated attribute with fixed bins of unequal width is added for the departure delay values. In particular, if the departure delay is less than 5 minutes, the flight is considered on time; if the departure delay is between 5 and 45 minutes, it is considered a small delay; and all delays of more than 45 minutes are considered large delays. This attribute is named DEPARTURE DELAY C.
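The two rules above (negative delays count as no delay, followed by unequal fixed bins) can be sketched as below. The category labels, the function names, and the handling of missing values are assumptions made for illustration; the source gives only the bin boundaries.

```python
from typing import Optional


def clamp_delay(delay_minutes: float) -> float:
    """Treat early departures or arrivals (negative delay) as no delay."""
    return max(delay_minutes, 0)


def departure_delay_category(delay_minutes: Optional[float]) -> Optional[str]:
    """Bin a departure delay into the three unequal-width categories.

    Boundaries follow the text: < 5 on time, 5 to 45 small, > 45 large.
    Missing delays stay missing (an assumption; the source does not say).
    """
    if delay_minutes is None:
        return None
    delay = clamp_delay(delay_minutes)
    if delay < 5:
        return "On Time"       # label is an assumption
    if delay <= 45:
        return "Small Delay"   # label is an assumption
    return "Large Delay"       # label is an assumption
```

For example, `departure_delay_category(-3)` falls in the on-time bin because the negative value is first clamped to zero, while a delay of exactly 45 minutes still counts as a small delay.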

Attributes Distribution

As a pre-processing step, the frequency distributions of various attributes are plotted, as shown in Figure 5.2. Firstly, these frequency distributions show that the categories in the attributes AIRLINE, MONTH, DAY OF WEEK, and DEPARTURE DELAY are reasonably represented. Secondly, because the aggregate views recommended by View-360 are based on the distance between the probability distributions of the target view and the comparison view, when explaining the insight in a recommended visualization it

Attribute Name | Number of Unique Values
YEAR | 1
MONTH | 12
DAY | 31
DAY OF WEEK | 7
AIRLINE | 14
ORIGIN AIRPORT | 628
DESTINATION AIRPORT | 629
DIVERTED | 2
CANCELLED | 2
CANCELLATION REASON | 5

Table 5.2: Flights Delay Dataset: Nominal attributes

Figure 5.2: Flight Delays Dataset: Attributes Distribution. Panels: (a) Airline, (b) Month, (c) Day of Week, (d) Departure Delay, (e) Cancelled, (f) Cancellation Reason.

is helpful to look at the frequency distribution of the attributes involved in the visualization.
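To make the link between frequency distributions and view distance concrete, the sketch below normalizes category counts into probability distributions and compares two of them. The Euclidean distance is used purely as an example, since this section does not name the distance measure used by View-360; the function names and the sample values are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def probability_distribution(values: Sequence[str], categories: Sequence[str]) -> List[float]:
    """Normalize category frequencies so that they sum to 1."""
    counts = Counter(values)
    total = sum(counts[c] for c in categories)
    return [counts[c] / total if total else 0.0 for c in categories]


def euclidean_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two distributions over the same categories."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


categories = ["On Time", "Small Delay", "Large Delay"]
# Made-up category values for a target view and a comparison view.
target = ["On Time", "On Time", "Small Delay", "Large Delay"]
comparison = ["On Time", "On Time", "Small Delay", "Small Delay"]

p = probability_distribution(target, categories)
q = probability_distribution(comparison, categories)
distance = euclidean_distance(p, q)
```

A category with very few records can dominate such a distance after normalization, which is one reason to inspect the underlying frequency distribution before interpreting a recommended view.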

Moreover, Figures 5.2e and 5.2f show that very few flights are canceled and that most of those cancellations are due to reason ‘B’, which corresponds to weather-related issues. Therefore, canceled flights are not analyzed in View-360, and the analysis focuses on the delayed flights.