4 Reducing uncertainty of the supply to LNS
4.2 Data explanation and sampling
A causal model must explain the data with external variables (see Chapter 3). We therefore need to choose a sample on which to base our forecasting model. We try to include variables that are based on weather data because research suggests that weather variables may explain the catch size (Stergiou, 1997) (Wikstöm, 2015). Since the weather varies by location, we need to select a region on which to base the model.
This section chooses this region and explains the data that we use to build the models. Section 4.2.1 explains the data, Section 4.2.2 presents the statistics of each region in Norway and chooses the region, and Section 4.2.3 chooses the sample.
4.2.1 Data explanation
To forecast the amount of fish caught by a given vessel on a given day, we first need a response variable. Since LNS always has to buy the complete shipment of a fisherman, the forecast model should predict the total catch of fish per vessel, per shipment. The Norwegian Fishermen’s Sales Organization (NFSO) stores the weight of all individual shipments per species in a public database. The data is split out per region of Norway (see Figure 4.2). The database has data available from 01-01-2017 until 31-12-2017. This is important as there is a main fishing season, which starts in January and lasts until July (see Section 1.1). Therefore, we need to include a whole year of data to be able to analyze whether the season influences the catch. The public database stores no information about the quality of the caught fish.
We give the data of the response variable the following notation:
\[ Y_{ijkl} = \text{tonnes of fish caught of species } i \text{, on day } j \text{, with vessel } k \text{, in region } l \tag{4.1} \]
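As an illustration of this notation, the following minimal sketch shows one way shipment records could be aggregated into values indexed by species, day, vessel, and region. The record layout, the vessel identifiers, and all numbers are hypothetical and are not taken from the NFSO database; summing several shipments of the same vessel on the same day is likewise an assumption of the sketch.

```python
from collections import defaultdict
from datetime import date

# Hypothetical shipment records: (species, day, vessel, region, tonnes).
# All values are invented for illustration only.
shipments = [
    ("Cod", date(2017, 2, 1), "vessel_A", "Troms", 3.5),
    ("Cod", date(2017, 2, 1), "vessel_A", "Troms", 1.0),
    ("Haddock", date(2017, 2, 1), "vessel_A", "Troms", 0.5),
    ("Cod", date(2017, 2, 2), "vessel_B", "Troms", 2.0),
]


def build_response(records):
    """Sum the tonnes per (species i, day j, vessel k, region l)."""
    y = defaultdict(float)
    for species, day, vessel, region, tonnes in records:
        y[(species, day, vessel, region)] += tonnes
    return dict(y)


Y = build_response(shipments)
print(Y[("Cod", date(2017, 2, 1), "vessel_A", "Troms")])  # 4.5
```

Keying the dictionary by the full tuple mirrors the four indices of the notation, so a single lookup returns one observation of the response variable.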
Since we use weather variables in the models, we prefer to make the region where the fish is caught as small as possible so that the weather information is as accurate as possible. The database does not store the exact location where the fish is caught; it only records the location where the fish is delivered. Because of limited time, we cannot forecast the catch for all regions. In the next section we select the best region with respect to size and base our models on this region.
4.2.2 Statistics of each region
The fish catch data is divided into five regions in Norway. For each region, the public information about the catch of individual vessels is known. It is only known that the fish is caught in this region, but not at which exact location. To keep the weather information accurate, we want the area where the fish can be caught to be small. However, it is also important to have as much data as possible, since more data gives more potential to create an accurate forecasting model (Witten, 2005).
The possible surface where the fish can be caught is determined by how many kilometers a vessel sails from the coastline into the sea, multiplied by the length of the coastline. Since we do not know the former distance, we use the number of kilometers of coastline of a region as the measure of the size of the catching surface. Hence, we want to maximize the number of kilograms of fish caught per kilometer of coastline. To measure the coastline, we use the coastline of the mainland, excluding the islands. Figure 4.3 shows the total catch per region per kilometer of coastline (Norges råfisklag, 2017) (Statistisk sentralbyrå, 2013). We conclude that the region Troms has the highest catch per kilometer of coastline. Hence, we choose this region to base our forecasting model on.
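The selection criterion can be sketched as a short computation: divide each region's total catch by its mainland coastline length and pick the region with the highest ratio. The function names and the input values below are hypothetical placeholders; they are not the figures behind Figure 4.3.

```python
def catch_per_km(total_catch_kg, coastline_km):
    """Return kilograms caught per kilometer of mainland coastline, per region."""
    return {
        region: total_catch_kg[region] / coastline_km[region]
        for region in total_catch_kg
    }


def densest_region(total_catch_kg, coastline_km):
    """Return the region with the highest catch per kilometer of coastline."""
    density = catch_per_km(total_catch_kg, coastline_km)
    return max(density, key=density.get)


# Placeholder inputs for illustration only (not the thesis data).
example_catch_kg = {"region_A": 1_000_000.0, "region_B": 3_000_000.0}
example_coast_km = {"region_A": 10.0, "region_B": 50.0}

print(catch_per_km(example_catch_kg, example_coast_km))
print(densest_region(example_catch_kg, example_coast_km))  # region_A
```

In this placeholder example the region with the smaller total catch is still selected, because the criterion is catch density rather than total volume.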
Figure 4.3: Catch per kilometer of coastline for the relevant regions
4.2.3 Choice of the sample
Finally, we need a sample from the region Troms on which to base our forecasting model. To do so, we first need to know how many individual data points we need. A general rule of thumb is that one needs a minimum of 10 data points per covariate (predictor) (Harrell, 2001). In total, we have a maximum of 14 different covariates (see Section 4.3.2), which means that we need at least 10 × 14 = 140 usable data points for a region.
For two species (Pollock and Haddock), 58.6% of all the observations are smaller than 1 ton. These small values are the result of bycatch. Bycatch occurs when a fisherman catches a different fish species than the one he intends to catch. Bycatch makes the data unreliable, since we do not want to predict bycatch. After removing these observations, we have fewer than 140 data points left for each of these species. For Cod, only 1.8% of the catch data is smaller than 1 ton (see Figure 4.4). We still have 236 data points for Cod, which is sufficient to use linear regression. Hence, we base the model on the catch of Cod in the Troms region.
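The sample-size check described here can be sketched as follows. The sketch applies the rule of thumb of 10 data points per covariate to the stated maximum of 14 covariates and filters out observations below 1 ton. The function names and the example catch list are hypothetical; only the constants and the count of 236 Cod observations come from the text.

```python
POINTS_PER_COVARIATE = 10        # rule of thumb cited from Harrell (2001)
MAX_COVARIATES = 14              # maximum number of covariates stated in the text
BYCATCH_THRESHOLD_TONNES = 1.0   # observations below this are treated as bycatch


def required_sample_size(n_covariates, points_per_covariate=POINTS_PER_COVARIATE):
    """Minimum number of usable data points under the rule of thumb."""
    return n_covariates * points_per_covariate


def drop_bycatch(catches_tonnes, threshold=BYCATCH_THRESHOLD_TONNES):
    """Keep only observations of at least `threshold` tonnes."""
    return [c for c in catches_tonnes if c >= threshold]


def is_sufficient(n_points, n_covariates=MAX_COVARIATES):
    """Check whether n_points meets the rule-of-thumb minimum."""
    return n_points >= required_sample_size(n_covariates)


print(required_sample_size(MAX_COVARIATES))  # 140
print(is_sufficient(236))                    # True

# Hypothetical catch values, for illustration only.
print(drop_bycatch([0.2, 0.8, 1.0, 5.4, 12.0]))  # [1.0, 5.4, 12.0]
```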
Figure 4.4: Data distribution of the response variable
[Figure 4.4 plot: Histogram of CATCH, with panels for Cod, Haddock, and Pollock; axes: CATCH and Frequency]
[Figure 4.3 plot: Catch per kilometer of coastline, in kilogram per kilometer (axis from 0 to 180000), for Troms, Finnmark, Nordland, Sør-Trøndelag, and Nord-Trøndelag]