AN UNSUPERVISED LEARNING APPROACH TO INVESTIGATE THE EFFECT OF SALIENCY ON THE LEARNING RATE OF OBJECT RECOGNITION

By Ruben Bonneur

June 18th, 2021

Student number: 11328010

Supervisors: Steven Scholte & Lynn Sörensen

Second assessor: Rik Schalbroeck

ABSTRACT

Deep neural networks (DNNs) can be used for object recognition. These networks can also be used to understand how object recognition works in the human brain. A potentially important factor in object recognition is the saliency of the object. This study examines the effect of saliency on object recognition performance by comparing high and low saliency images in terms of the maximum accuracy reached and the time needed to reach it. The unsupervised Hebbian network Deep-hebb is used because it resembles a biological model. The results show that using high saliency images increases the object recognition accuracy and, in some layers of the network, increases the learning rate. The results also indicate that a lack of saliency in images caused the decrease in object recognition accuracy, while the higher saliency level was responsible for the decrease in learning time. These results may help to optimize object recognition DNNs by optimizing the contents of their datasets. The results may also help to understand the function of saliency in the biological systems of object recognition.

INTRODUCTION

One method of modelling the brain is to use a deep neural network (DNN). Computer models before DNNs were handcrafted and thus could only examine processes whose underlying mechanisms were known. DNNs are multi-layered networks that can be trained on data without prior knowledge of the internal network structure and mechanisms. This way, different networks can be created more efficiently while behaving more naturally (Saxe, Nelli, & Summerfield, 2020).

One way of increasing DNN performance is to use higher quality images, so that more image detail is retained. The Open Amsterdam Data Set (OADS) contains more detail than common datasets, which may make it easier for a DNN to learn the global shapes in the data (Blumberg, 2020). This study uses and continues the development of the OADS.

An important application of DNNs in brain modelling is understanding the object recognition function of the brain (Hao, Andolina, Wang, & Zhang, 2020). There are multiple learning methods for object recognition. Supervised learning works by actively evaluating the response to a given stimulus, while unsupervised learning works by developing categories when given multiple stimuli (Love, 2002). The human brain uses both methods when learning to recognize objects. Most DNNs use supervised learning as it is easier to implement. However, an important part of object recognition in the brain relies on unsupervised learning (Kourtzi & DiCarlo, 2006). Unsupervised learning is therefore a closer approximation of the biological mechanism and a preferable learning method for studying the object recognition mechanisms of the human brain.

There are multiple ways to implement unsupervised learning. In particular, we use the network named Deep-hebb, developed by J. Schreuder at the University of Amsterdam. Deep-hebb uses a generalized Hebbian learning algorithm. Evidence of Hebbian learning has been observed in various biological systems (Caporale & Dan, 2008). In Hebbian learning, connections that are used often are strengthened, while unused and rarely used connections weaken and disappear (Hebb, 2012). An unsupervised Hebbian network could therefore be used to model how the human brain learns to recognize objects.

An interesting factor in object recognition is saliency. Saliency is based on human eye fixations. A disproportionately large amount of visual information comes from the fovea (Azzopardi & Cowey, 1993). Changes in the visual fixation point could therefore affect perception, and this change in perception could influence object recognition, as fixated objects may carry more information with less noise. This should make it faster and easier to recognize objects, as less information would need to be processed. Higher saliency could therefore reduce the difficulty of object recognition.

DNNs, however, need a measure of saliency, as eye fixations are not fixed and can differ between individuals and circumstances (Sharpe, 1988). Saliency is therefore often described as an eye fixation probability distribution. The direct method of obtaining a saliency map is eye tracking on test subjects, but this method is rather labor intensive. As an alternative, a DNN can be used to create saliency probability maps. Such networks are trained to predict human fixations in images. Deepgaze II is a saliency network that gives a good prediction of real human eye fixations (Kümmerer et al., 2016). This makes a saliency DNN a good method of obtaining saliency information for DNNs.

Earlier research with DNNs has already found that a saliency-based filter can improve a DNN's performance compared to networks without it (Ren et al., 2014; Ramík et al., 2011). Most results, however, focus on the increased recognition accuracy.

Here, we focus on the learning process of the network and assess whether it is affected by saliency. The expectation is that high salient images will yield a higher object recognition accuracy and will need less time to reach the maximum accuracy. We expect this because supplying the network with salient information may make its input less noisy, leading to a more efficient learning process. An unsupervised Hebbian network is used to test the difference in object recognition learning between low and high salient images. The saliency level is determined using the DNN Deepgaze II. The learning process is assessed by the maximum accuracy and by the time the network needs to reach that maximum.

MATERIALS AND METHODS

THE OADS IMAGE SET

For this research project, a dataset of high quality photographs with natural and diverse compositions was created. These images could then be used to create higher quality data for DNNs or, in the case of the original study, to facilitate the use of a retinal filter. The basis and protocol of the Open Amsterdam Data Set (OADS) were established by Blumberg (2020). The camera used for the entire dataset was a Sony Cyber-shot DSC-RX100 digital camera, which uses a fisheye lens. The photographs were kept in the camera's RAW format at a size of 5496 by 3672 pixels. More similar objects are more difficult for DNNs to differentiate (Humphreys & Forde, 2001). Therefore, to increase the difficulty of the dataset, objects were often paired with similar looking objects. The photographing protocol is an altered version of the original OADS protocol (Blumberg, 2020). The changes were the addition of object categories (front door, balcony door and bollard), the removal of object categories (tree and truck) and some small tweaks to the procedure.

The photographing protocol can be found in appendix A. The tree category was removed due to problems with tagging, while the truck category was removed due to a lack of object instances in the dataset. New object categories were added to compensate for the removed categories. Images that contained only removed object categories were removed from the dataset.

The images were tagged according to the protocol found in appendix B. After the images were tagged, the tagging was checked by another researcher. After tagging, some images were excluded from the dataset upon closer examination because they failed the criteria of one of the protocols. Object categories with fewer than 1000 object instances were cut from the final dataset. Figure 1 shows an example photograph to illustrate what the images looked like and how they were tagged.

CONSTRUCTION OF THE TRAINING SETS

The training sets for the DNN were created from the OADS dataset. This required two steps: creating crops of the specific object categories and dividing these crops into high and low salient conditions. In addition, a control condition consisting of random samples of all objects, independent of saliency, was created to test whether an effect in the experimental conditions stems from the increased amount of saliency, the lack of saliency or both. If an experimental condition differs from the control condition, the saliency level of that condition must have contributed to the effect.

To make these training sets, the original RAW images and the associated tagging files (.json) were loaded. First, the centre of every object in every image was identified. If the size of an object's tag was smaller than 1600 pixels, the object was removed from further processing. For the remaining objects, crops were made around this centre in three sizes: 1024 by 1024, 512 by 512 and 256 by 256 pixels. These sizes were chosen to diversify and increase the size of the training set. Crops were removed if they overlapped with objects from different categories or if the crop was fully or partially masked. The resulting crops were resized to 64 by 64 pixels using the Pillow image package with the ANTIALIAS method and saved as JPEG files. Figure 1 shows how the crops are constructed from the images.

Figure 1: An example of how photographs were converted into crops. On the left are the crops of varying size taken of the marked lamppost. The image also gives an example of the type of images to expect in the OADS database and how these images were tagged.
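The cropping step described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the representation of tags as bounding boxes and the rule that crops extending past the image border are dropped are all assumptions. Only the image size and the three crop sizes come from the text.

```python
from typing import List, Optional, Tuple

# (left, top, right, bottom); right and bottom are exclusive
Box = Tuple[int, int, int, int]

IMAGE_W, IMAGE_H = 5496, 3672   # RAW image size stated in the text
CROP_SIZES = (1024, 512, 256)   # crop edge lengths stated in the text


def crop_box(cx: int, cy: int, size: int) -> Optional[Box]:
    """Square box of edge `size` centred on (cx, cy), or None if it leaves the image."""
    half = size // 2
    left, top = cx - half, cy - half
    right, bottom = left + size, top + size
    # Assumption: a crop that extends past the image border is dropped.
    if left < 0 or top < 0 or right > IMAGE_W or bottom > IMAGE_H:
        return None
    return (left, top, right, bottom)


def overlaps(a: Box, b: Box) -> bool:
    """True if two boxes share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def crops_for_object(cx: int, cy: int, other_category_boxes: List[Box]) -> List[Box]:
    """Up to three crops around an object centre, skipping crops that
    overlap with (hypothetical) bounding boxes of other categories."""
    kept = []
    for size in CROP_SIZES:
        box = crop_box(cx, cy, size)
        if box is None:
            continue
        if any(overlaps(box, other) for other in other_category_boxes):
            continue
        kept.append(box)
    return kept
```

Each kept box could then be passed to Pillow's `Image.crop(box)` and resized with `resize((64, 64), Image.LANCZOS)`. In older Pillow versions `ANTIALIAS` was an alias of `LANCZOS`; the alias was removed in Pillow 10.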

The Deepgaze II network (Kümmerer et al., 2016) with a centerbias equal to 1 was used to determine the saliency of all objects. The images were scaled down by a factor of 4 using the cv2 package, as the full-sized images were too large for the network to handle. The Deepgaze II network then calculated a saliency probability map of the image. Using this saliency map, the peak saliency probability was determined for each object. An object was categorized as high salient if its maximum saliency probability was above the median value and as low salient if it was below the median value. Objects equal to the median value were only used in the random saliency condition. Note that the saliency was determined before objects were removed from the dataset, which prevented a perfect split between the saliency conditions.
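The median split can be illustrated with a short sketch. It assumes that each object is described by a bounding box in full-resolution coordinates and that the saliency map is a plain 2D grid at one quarter of that resolution; the function names and data layout are hypothetical and do not reproduce the study's code or the Deepgaze II interface.

```python
from statistics import median
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), full-resolution pixels


def peak_saliency(saliency_map: Sequence[Sequence[float]], box: Box, scale: int = 4) -> float:
    """Maximum saliency inside `box`, where the map is downscaled by `scale`."""
    left, top, right, bottom = (v // scale for v in box)
    # Make sure very small boxes still cover at least one map pixel.
    right = max(right, left + 1)
    bottom = max(bottom, top + 1)
    return max(saliency_map[y][x] for y in range(top, bottom) for x in range(left, right))


def median_split(peaks: Dict[str, float]) -> Tuple[List[str], List[str], List[str]]:
    """Split object ids into high salient, low salient and exactly-at-median."""
    m = median(peaks.values())
    high = sorted(k for k, v in peaks.items() if v > m)
    low = sorted(k for k, v in peaks.items() if v < m)
    at_median = sorted(k for k, v in peaks.items() if v == m)
    return high, low, at_median
```

Objects in the third list correspond to those that, according to the text, were only used in the random saliency condition.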

THE NETWORK

The deep neural network is an unsupervised neural network named Deep-hebb, which is currently being developed. The network was trained using the generalized Hebbian algorithm (Hebb, 2012). The C++ version 0.1 of the network was used for the experiment. The network is made of modules. Each module consists of 64 neurons, axons and connections between every pair of neurons. Every neuron was limited to a single axon. During every iteration, the most weakly connected axons, as calculated by the generalized Hebbian algorithm, were removed and then reconnected to a new synapse. This process was repeated throughout training and occurred after every batch of 1000 images. The network can process colour images by processing the RGB components of an image separately. The modules are structured into layers, each of which can contain one or more modules. The network consists of 3 layers: the input layer, the V1 layer and the V2 layer. The layers are connected into a feedforward neural network: the image data is first fed into the input layer, which feeds into the V1 layer, which in turn feeds into the V2 layer. The number of modules differs per layer. The input layer has 1 module, the V1 layer has 8 modules and the V2 layer has 4 modules. The Hebbian network was scored by a separate network. Each layer was evaluated by a separate perceptron, independent of the other layers. The scoring network only started after the Hebbian network had been given time to train. This waiting time of the scoring network is referred to as the freeze time. The current scoring network also did not use a separate validation set. The maximum image size the Hebbian network should be able to handle is 64 by 64 pixels.
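The text does not spell out the update rule used inside Deep-hebb. For orientation only, the sketch below shows the generalized Hebbian algorithm in its common textbook form (Sanger's rule), where each weight changes by lr * y_i * (x_j - sum over k <= i of y_k * w_kj). It is an assumption that Deep-hebb follows this form; the function name and list-based layout are illustrative and this is not the C++ implementation.

```python
from typing import List

Matrix = List[List[float]]


def gha_update(weights: Matrix, x: List[float], lr: float) -> Matrix:
    """Return new weights after one generalized Hebbian (Sanger) update.

    weights[i][j] connects input j to output neuron i.
    """
    # Linear outputs y_i = sum_j w_ij * x_j, computed from the old weights.
    y = [sum(w * xj for w, xj in zip(row, x)) for row in weights]
    updated = []
    for i, row in enumerate(weights):
        new_row = []
        for j, w_ij in enumerate(row):
            # Part of input j already explained by output neurons 0..i.
            explained = sum(y[k] * weights[k][j] for k in range(i + 1))
            new_row.append(w_ij + lr * y[i] * (x[j] - explained))
        updated.append(new_row)
    return updated
```

The subtraction term is what distinguishes this rule from plain Hebbian learning: it keeps the weights bounded and pushes successive neurons towards different components of the input.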

The Deep-hebb network is still in development and is therefore also examined itself. The Tiny ImageNet-200 (TIN) dataset (Tiny ImageNet Challenge) is a small subset of ImageNet (Deng et al., 2009). This dataset contains varied images of a multitude of object classes. TIN was chosen as a control for testing the performance of the network because of the size of its images and its frequent usage. TIN was used instead of ImageNet because a dataset of that size is currently infeasible. The OADS was also examined using TIN as the standard. Multiple configurations of TIN object groups were used when testing the network. All configurations included 14 object categories so that they could later be compared to the OADS dataset. As the network has not been extensively tested yet, the configuration was determined during the testing phase.

Deep-hebb was used to test the differences between the saliency conditions. The conditions were tested using 2 measures. The first measure was the maximum object recognition accuracy (MA) that the scoring network was able to obtain during the trial, measured as object recognition percentage. The second measure was the time necessary for the network to obtain its maximum value, referred to as time to max (TTM), measured in the number of batches needed. The data comprised 10 complete runs of the network for each condition. Significant results were then tested for the direction of the saliency effect using a random saliency condition. Deep-hebb itself was examined by running it on the TIN and OADS datasets and evaluating its performance.
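The two measures can be computed from an accuracy curve as sketched below. The sketch assumes that a run is stored as a list of accuracies, one per batch, and that TTM is the 1-based index of the first batch at which the maximum is reached; the text does not state how ties are handled, and the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple


def max_accuracy_and_time(curve: Sequence[float]) -> Tuple[float, int]:
    """Return (MA, TTM) for one run."""
    ma = max(curve)
    # Assumption: ties are resolved by taking the first batch that reaches the maximum.
    ttm = list(curve).index(ma) + 1
    return ma, ttm


def summarize(runs: List[Sequence[float]]) -> Dict[str, Tuple[float, float]]:
    """Mean and sample standard deviation of MA and TTM over several runs."""
    results = [max_accuracy_and_time(curve) for curve in runs]
    mas = [r[0] for r in results]
    ttms = [r[1] for r in results]
    return {
        "MA": (mean(mas), stdev(mas)),
        "TTM": (mean(ttms), stdev(ttms)),
    }
```

With 10 runs per condition, `summarize` would return one mean and standard deviation per measure for each layer's curves.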

If not enough data was available, data augmentation was applied by adding versions rotated by 90, 180 and 270 degrees. If this was not sufficient, object categories could be dropped. The configuration was then based on groups of similar object categories. The configurations were briefly tested, and a configuration was chosen from these results.
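The rotational augmentation can be sketched on a plain grid of pixel values. This is an illustration only; the function names are hypothetical, and with an image library the same result would normally be obtained through its own rotation routine.

```python
from typing import List

Grid = List[List[int]]


def rotate90(image: Grid) -> Grid:
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_augment(image: Grid) -> List[Grid]:
    """Return the 90, 180 and 270 degree rotations of `image`."""
    rotations = []
    current = image
    for _ in range(3):
        current = rotate90(current)
        rotations.append(current)
    return rotations
```

Adding these three versions to the original crop makes the training set four times as large.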

STATISTICS

All data was obtained from a small number of samples (10), so non-parametric tests were used to test significance. The difference between the low and high salient conditions in MA and TTM was tested using a one-sided Wilcoxon-Mann-Whitney test. The control, in which the salient conditions are tested against the random salient condition, was also tested with the Mann-Whitney U test. All tests were performed on all layers of the Deep-hebb network: input, V1 and V2. No statistical analyses were performed on the performance testing of the network due to the large number of confounding factors.
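For readers unfamiliar with the test, the sketch below computes the Mann-Whitney U statistic for a directional comparison between two samples, which is how we read the test described above. The p-value uses a plain normal approximation without tie or continuity correction, which is an assumption of this sketch and is rough for samples of 10; in practice a library routine such as `scipy.stats.mannwhitneyu` with its `alternative` argument would normally be used.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_greater(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """U statistic for x and an approximate one-sided p-value for 'x tends to be larger than y'."""
    n1, n2 = len(x), len(y)
    # U counts the pairs in which x beats y; ties count as half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    # Upper-tail probability of the standard normal distribution.
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return u, p
```

For example, `x` could hold the 10 MA values of the high salient runs and `y` those of the low salient runs for one layer.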

RESULTS

DATA COLLECTION

More than 250 images were excluded from the dataset, but this figure may be considerably larger as exclusions were never explicitly documented. Images were excluded for various reasons, mostly because the original RAW files were missing or because of problems with the image such as large blurs or a lack of focus. A total of 5412 images remained in the final version of the dataset. All object categories had at least 1500 instances, as seen in figure 2. This meant all object categories were of sufficient size. There was, however, a rather large discrepancy between the categories, as the most common category (lamppost) has more than ten times as many tagged objects as the rarest category (carrier bike). This is somewhat unfortunate, as balanced datasets tend to be preferable for optimization (Prati et al., 2009). No major problems were found after a brief visual inspection of the crops.

Regarding the saliency of these crops, a higher proportion of the high salient crops appeared to come from the foreground of the image, while crops in the low salient set tended to show more of the background. More than 79000 crops were made from the OADS dataset. This means that over 2/3 of the crops were removed, as normally 3 crops were created for each object. Fortunately, the rarer objects tended to have a relatively high number of crops per object, as can be seen in figure 2.

There were substantially more crops in the high salient condition than in the low salient condition. This meant that substantially less information was fed into the DNN for the low salient condition. Objects such as SUV and carrier bike were overwhelmingly more common in the high salient condition, while lampposts and balcony doors were substantially more common in the low salient condition. This caused several object categories to fall below 500 crops in the low salient condition, as seen in figure 2.

Figure 2: Number of objects and crops per salient condition for each of the categories in the OADS. The method of making crops can create up to 3 crops per object instance. Not all of the crops are viable, which can result in fewer crops per object. The figure shows the large disparity between the high and low salient crops.

[Figure 2 chart: "Occurrence of object categories in the OADS and the saliency crops". Axis: object occurrences, 0 to 20000. Categories: lamppost, compact car, traffic sign, front door, balcony door, city bike, bollard, traffic light, van, bench, scooter, bin, SUV, carrier bike. Series: objects in dataset, high salient crops, low salient crops.]

Because of the limited amount of data, rotational data augmentation was performed on the crops and its effect was tested. With rotational data augmentation, the DNN object recognition accuracy decreased. Therefore, post-cropping data augmentation was not used. As an alternative, object categories had to be removed. Object categories with fewer than 500 instances were removed from the dataset, which removed the SUV, van, scooter and carrier bike categories. Two other compositions of object categories were tested. The first contained the 2 object groups whose smallest category had the most instances, which were the doors (front door and balcony door) and the traffic objects (traffic light, traffic sign and lamppost). The second contained all remaining complete object groups, adding the sidewalk objects (bench, bin and bollard). These results were then compared against the composition with all object categories. The results of this test can be seen in figure 3. In the end, the 3 object group composition was chosen for further testing, as the 2 object group composition was deemed too small and the composition with all feasible object categories had a tendency to oscillate.

Figure 3: The object recognition learning rates for different compositions of object categories. The graphs display both the high salient (HS) and low salient (LS) conditions. The learning rate is displayed for all layers (input, V1 and V2) of the Deep-hebb network. The total number of iterations depends on the size of the dataset and can therefore differ between the conditions. It can be seen that using fewer object categories often increases performance, except for the feasible object categories condition.

NETWORK CONFIGURATION

After some initial testing of the network across multiple parameters, the final network configuration was chosen. The network configuration for the TIN and OADS comparison was: 20 epochs, 20000 freeze iterations, a Hebbian learning rate of 0.001, a scoring learning rate of 0.01 and a decay of 0.01. The low learning rate and the high decay were necessary to prevent overfitting on the TIN dataset. For the experimental conditions the configuration was changed, as overfitting was not a problem when using the crops of the OADS dataset. The exact configuration was: 30 epochs, 20000 freeze iterations, a Hebbian learning rate of 0.01, a scoring learning rate of 0.1 and a decay of 0.001.
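For side-by-side comparison, the two configurations can be written down as a small data structure. The class and field names are hypothetical and do not correspond to Deep-hebb's actual parameter names; the values are those listed above, reading the training length of the experimental configuration as 30 epochs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeepHebbConfig:
    """Illustrative container for the settings reported in the text."""
    epochs: int
    freeze_iterations: int
    hebbian_learning_rate: float
    scoring_learning_rate: float
    decay: float


# Configuration used for the TIN and OADS comparison.
TIN_OADS_COMPARISON = DeepHebbConfig(
    epochs=20,
    freeze_iterations=20000,
    hebbian_learning_rate=0.001,
    scoring_learning_rate=0.01,
    decay=0.01,
)

# Configuration used for the saliency conditions (training length read as 30 epochs).
SALIENCY_EXPERIMENT = DeepHebbConfig(
    epochs=30,
    freeze_iterations=20000,
    hebbian_learning_rate=0.01,
    scoring_learning_rate=0.1,
    decay=0.001,
)
```

Compared to the first configuration, the experimental one uses learning rates that are ten times higher and a decay that is ten times lower, with the same number of freeze iterations.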

The current version of the Deep-hebb network had trouble with the TIN dataset. Even with the heavily adjusted configuration, runs on this dataset could still crash or overfit. Crashes caused the network to end prematurely, but no specific error was given. The problems with the TIN dataset varied, however, depending on the configuration of object groups in the dataset. The crops of the OADS database did not encounter any of these problems. Instead, the main problem with the OADS crops was that the MA was relatively low. Even when the learning rate was increased and the decay was lowered, a network trained on the OADS crops struggled to obtain an MA above 30%.

THE EFFECT OF SALIENCY

Unlike in the tests with the TIN database, the network did not have any problems while processing the experimental data. The effect of saliency was tested for 2 measures: the average maximum accuracy (AMA) over all training sessions and the average time (in iterations) needed for the network to reach the maximum value of its training session, referred to as the average time to max (ATTM). Both measures were tested for significance in all layers of the network using the Wilcoxon-Mann-Whitney test. The AMA was found to be significantly higher (p<0.01) in the high salient condition than in the low salient condition, as seen in table 1. The ATTM was found to be significantly lower (p<0.05) only in the V1 layer of the network, as seen in table 2. In summary, saliency had a significant effect on the AMA in all layers, while it significantly affected the ATTM in only one layer.

Condition         input           V1              V2
High salient      0.577 ± 0.06    0.472 ± 0.04    0.433 ± 0.07
Low salient       0.547 ± 0.04    0.415 ± 0.03    0.401 ± 0.05
Random salient    0.577 ± 0.04    0.483 ± 0.06    0.461 ± 0.05

Table 1: The average maximum accuracy with standard deviation in object recognition for a given dataset. All conditions had a sample size of 10.

Condition         input      V1         V2
High salient      40 ± 9     25 ± 3     31 ± 13
Low salient       36 ± 10    41 ± 10    31 ± 3
Random salient    NA         41 ± 1     NA

Table 2: The average iterations needed for the network to obtain its maximum accuracy. All conditions had a sample size of 10. The NA scores were not measured, as the control was only tested for the significant conditions.

After testing for significant effects, the direction of the saliency effect was examined by testing the high and low salient conditions against a random salient condition. This was only done where the high and low salient conditions differed significantly. The AMA was significantly higher in the random saliency condition than in the low saliency condition. However, the AMA was not significantly lower in the random saliency condition than in the high saliency condition, as seen in table 1. The AMA of the random condition was even significantly higher (p<0.05) in layer V2. For the ATTM, only the results of the V1 layer were examined. The ATTM was not significantly lower (p>0.05) in the low salient condition than in the random salient condition. However, the ATTM was significantly lower (p<0.05) in the high salient condition than in the random saliency condition, as seen in table 2. In summary, the AMA of the random saliency condition differs significantly from the low salient condition, while the ATTM of the random saliency condition differs significantly from the high saliency condition.

DISCUSSION

CONCLUSION

The use of high salient images over low salient images can positively affect the object recognition performance of an unsupervised deep Hebbian network. The results show that the maximum accuracy is increased for higher salient images. There is also evidence that, in some layers of the network, higher salient images can decrease the time needed to obtain the maximum training accuracy. When the network was trained using random salient images, the object recognition accuracy was higher than when low salient images were used, but there was no difference compared to high salient images. In the time needed to obtain the maximum score, a network trained using random salient images was slower than one trained using high salient images, but there was no difference compared to low salient images. Taken together, this suggests that saliency can affect the learning of an unsupervised network. The results indicate that a lack of saliency in images decreases the object recognition accuracy and that, in certain layers of the network, an increase in saliency increases the speed of obtaining the maximum object recognition accuracy. Thus saliency can influence learning in an unsupervised DNN.

When looking at the performance on the OADS dataset, we found that random samples of categories from the TIN dataset yielded a higher accuracy than the OADS crops. This may indicate that the OADS dataset is more difficult for an unsupervised network to train on than an equally sized set of random object categories from a conventional dataset. Examining the network's performance more closely, multiple observations can be made. First, the network's performance was comparatively low with regard to other unsupervised DNNs (Weber et al., 2000). The network also has a tendency to overfit and fail when the presented data is too easy for it to solve. The freeze iterations, the process during which the actual Hebbian network is trained, had some notable effects on the overall performance. The network performance decreased if the number of freeze iterations was too low (<15000), but it also decreased if the number of freeze iterations was too high (>25000). Around the point where the performance decreased, the time between iterations also increased drastically. Despite these problems, we were still able to obtain results.

INTERPRETATION OF RESULTS

The results showed that high salient images positively affect the object recognition accuracy. We propose that this effect occurs because high salient images contain more information, which helps the network to better distinguish object categories. In contrast, the results for the time needed to obtain the maximum accuracy were not as consistent. This may be due to limitations in the experimental setup, such as the small dataset and the small number of network sessions. Another explanation for the inconsistent effects of saliency on the learning pace is that the DNN in its current form is unable to use the increased data density to speed up the process. This could be because the network processes an equal number of properties for all levels of saliency, while fewer properties may be needed in the high salient conditions. Another important point is the limitations of the test variables themselves, accuracy and learning pace. While the maximum accuracy is easy to measure and the most important variable, it does not describe the entire learning process. A more advanced variable could instead be used to describe the difference in the learning process.

It is interesting that the low saliency crops seem to have contributed to the effect of saliency on the maximum accuracy, while the high saliency crops seem to have contributed to the change in learning pace. This implies that selecting for saliency should not significantly change the DNN accuracy, only the learning rate. Yet multiple factors should be taken into account before drawing this conclusion. For one, the lack of a difference with the control does not mean that the saliency level did not contribute to the overall effect. Furthermore, the random saliency crops may not have been an accurate depiction, as they were based on a single sample.

The difference in performance between the OADS and TIN sets could be explained by the specific set of object categories that were used. The relation between the objects changes the difficulty, as more similar objects are more difficult for a DNN to recognize. The difference is then explained by the existence of multiple groups of similar objects in the OADS dataset, which would increase the difficulty for the network, unlike in the TIN condition where the relation between objects was random. This could also help to explain the large variations between the TIN samples. The limited performance of Deep-hebb may have been caused by the use of images that were too large. The DNN was only sufficiently tested for images of size 32x32, which is 4 times smaller than the images used during the experiment. Another explanation for the low performance of Deep-hebb could be the lack of optimization in certain parts of the network due to time constraints. Finally, the reported overfitting in the TIN conditions was likely caused by the lack of a separate validation set in Deep-hebb. Thus the difference in performance between the OADS and TIN datasets can be caused by several factors, from limitations of the network to differences in the composition of the datasets.

There are three important decisions made during the data processing that could have influenced the results. First, during cropping, data enhancement was applied by using crops at multiple zoom levels for each object. Data enhancement can always influence the outcome of an experiment, but here the zoomed-out crops may look similar to crops of background objects. It could be that background objects are more likely to fall into the low saliency condition. The similarity between crops of background objects and zoomed-out crops may eventually lead to less decisive results. Second, the crops were always centered on the inside of the object. This decreases object variability and may cause problems with objects that have an empty center, such as bicycles. Third, the original crop sizes were kept constant instead of variable. As a result, bigger objects often filled the crop, sometimes even without any contours visible. This would worsen the problems of the center focus. Thus, the cropping procedure should be taken into account when looking at the results.

Finally, the saliency results suggest an existing bias in the OADS dataset, as seen in figure 2. The object categories that were relatively rare often had a high disparity in favor of high salient crops compared to low salient crops. These object categories are substantially more unevenly divided than similar object categories that are more common in the dataset. This may indicate that these objects were photographed in more salient positions. This could have been caused by the need to specifically include these object categories in photographs, as they would otherwise be too rare to include in the dataset. However, the high disparity could also have been caused by the rare objects being naturally more salient. Thus, in the OADS, objects of rare categories seem to be overwhelmingly high salient, which may have been caused by a bias during the photographing stage.

SCIENTIFIC CONTEXT

Previous studies have found similar increases in object recognition accuracy when using saliency in DNNs (Ren et al., 2014; Ramík et al., 2011). One current area of interest in unsupervised learning is real-time object recognition, where speed is an important factor (Abbas et al., 2017). The results of this study indicate that a network could learn faster by selecting salient information. This supports the argument for using saliency in real-time learning, as proposed by earlier research (Ramík et al., 2011), and could help the development of real-time learning DNNs, although more research into the effects of saliency on the learning rate is needed. The study has also highlighted the usefulness of the OADS dataset and the potential of a new unsupervised deep Hebbian network.

Future research could use an updated methodology to confirm the results found in this study. This can be done by changing one or more of the possibly constraining factors, for example by using a larger dataset, using smaller images, using a different network, or updating the variable used to investigate the learning rate. The results of this study also indicate the importance of low-saliency information, whereas most current research focuses only on the effects of high-saliency information on DNN performance. Future research could therefore further investigate the effects of low-saliency information on DNN performance. This could lead to a different way of pre-processing images for DNNs: instead of finding the salient information, removing the low-saliency information may be more efficient.

The results may also help in understanding the biological systems of object recognition, as they show that saliency can improve the quality and learning speed of object recognition. As the DNN is based on a biological learning method, biological systems may behave similarly. This is made more plausible by the fact that both of these factors are potentially very useful in real-world scenarios. This could help explain why eye movements are non-random, but it cannot explain how exactly these eye movements work. Current research is able to explain parts of the complexity of eye movements in humans (Hayhoe & Ballard, 2005). Future research could also focus on the mechanics of saliency that help in object recognition. Finally, this study has presented adequate results using the Deep-hebb network, which is currently in development. In the future, these results can be used to test the effect of updates in later versions of Deep-hebb.
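
The pre-processing idea raised above, removing low-saliency information instead of selecting salient information, could be sketched as follows. This is a minimal illustration and not a method used in this study: it assumes that a saliency map with values normalised to [0, 1] is available from some saliency model, and the function name, the threshold of 0.5 and the fill value are hypothetical choices.

```python
def mask_low_saliency(image, saliency, threshold=0.5, fill=0):
    """Replace pixels whose saliency is below `threshold` with `fill`.

    `image` and `saliency` are equally sized 2D lists (rows of values).
    Assumption: saliency values are normalised to the range [0, 1].
    """
    same_shape = len(image) == len(saliency) and all(
        len(img_row) == len(sal_row) for img_row, sal_row in zip(image, saliency)
    )
    if not same_shape:
        raise ValueError("image and saliency map must have the same shape")
    return [
        [pixel if sal >= threshold else fill for pixel, sal in zip(img_row, sal_row)]
        for img_row, sal_row in zip(image, saliency)
    ]


# Tiny made-up example: a 2x3 grayscale image and its saliency map.
image = [[10, 20, 30],
         [40, 50, 60]]
saliency = [[0.9, 0.2, 0.5],
            [0.1, 0.7, 0.4]]
masked = mask_low_saliency(image, saliency)
```

Only the pixels with a saliency of at least 0.5 keep their value here; whether a hard threshold or a softer weighting works better would itself be a question for future research.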

In conclusion, this study has found a significant effect of saliency on unsupervised DNN performance. A lack of saliency had a clear negative effect on the maximum accuracy in object recognition, while an increase in saliency was also found to positively affect the learning rate. The effects of saliency found here could help with the further optimization of object recognition DNNs in the future.

REFERENCES

Abbas, Q., Ibrahim, M. E., & Jaffar, M. A. (2017). Video scene analysis: an overview and challenges on deep learning algorithms. Multimedia Tools and Applications, 77(16), 20415–20453. https://doi.org/10.1007/s11042-017-5438-7

Azzopardi, P., & Cowey, A. (1993). Preferential representation of the fovea in the primary visual cortex. Nature, 361(6414), 719–721. https://doi.org/10.1038/361719a0

Blumberg, L. (2020). Oads: A New Data Set Of High Resolution Raw Images (thesis).

Caporale, N., & Dan, Y. (2008). Spike Timing–Dependent Plasticity: A Hebbian Learning Rule. Annual Review of Neuroscience, 31(1), 25–46. https://doi.org/10.1146/annurev.neuro.31.060407.125639

Götze, J., & Boye, J. (2016). Learning landmark salience models from users’ route instructions. Journal of Location Based Services, 10(1), 47–63. https://doi.org/10.1080/17489725.2016.1172739

Hao, W., Andolina, I. M., Wang, W., & Zhang, Z. (2020). Biologically inspired visual computing: the state of the art. Frontiers of Computer Science, 15(1), 1-15.

Hayhoe, M., & Ballard, D. (2005). Eye movements in natural behavior. Trends in Cognitive Sciences, 9(4), 188–194. https://doi.org/10.1016/j.tics.2005.02.009

Hebb, D. O. (2012). The organization of behavior: a neuropsychological theory. Routledge, Taylor & Francis Group.

Humphreys, G. W., & Forde, E. M. (2001). Hierarchies, similarity, and interactivity in object recognition: “Category-specific” neuropsychological deficits. Behavioral and Brain Sciences, 24(3), 453–476. https://doi.org/10.1017/s0140525x01004150

Kourtzi, Z., & DiCarlo, J. (2006). Learning and neural plasticity in visual object recognition. Current Opinion in Neurobiology, 16(2), 152–158. https://doi.org/10.1016/j.conb.2006.03.012

Kümmerer, M., Bethge, M., & Wallis, T. S. A. (2016, October 5). DeepGaze II: Reading fixations from deep features trained on object recognition. DeepAI. https://deepai.org/publication/deepgaze-ii-reading-fixations-from-deep-features-trained-on-object-recognition

Love, B. C. (2002). Comparing supervised and unsupervised category learning. Psychonomic Bulletin & Review, 9(4), 829–835. https://doi.org/10.3758/bf03196342

Prati, R. C., Batista, G. E., & Monard, M. C. (2009). Data mining with imbalanced class distributions: concepts and methods. In Indian International Conference on Artificial Intelligence (pp. 359–376).

Ramanathan, S., Katti, H., Sebe, N., Kankanhalli, M., & Chua, T.-S. (2010). An Eye Fixation Database for Saliency Detection in Images. Computer Vision – ECCV 2010, 30–43. https://doi.org/10.1007/978-3-642-15561-1_3

Ramík, D. M., Sabourin, C., & Madani, K. (2011). A Cognitive Approach for Robots’ Vision Using Unsupervised Learning and Visual Saliency. Advances in Computational Intelligence, 81–88. https://doi.org/10.1007/978-3-642-21501-8_11

Ren, Z., Gao, S., Chia, L., & Tsang, I. W. (2014). Region-Based Saliency Detection and Its Application in Object Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 24(5), 769–779. https://doi.org/10.1109/TCSVT.2013.2280096

Saxe, A., Nelli, S., & Summerfield, C. (2020). If deep learning is the answer, what is the question? Nature Reviews Neuroscience, 1–13.

Sharpe, J. A. (1988). Visual fixation stability in older adults. Survey of Ophthalmology, 32(6), 438–439. https://doi.org/10.1016/0039-6257(88)90061-6

Tiny ImageNet Challenge. Kaggle. (2019). https://www.kaggle.com/c/thu-deep-learning.

Weber, M., Welling, M., & Perona, P. (2000). Unsupervised Learning of Models for Recognition (Lecture Notes in Computer Science, Vol. 1842, pp. 18–32). Springer.

Yan, K., Wang, X., Kim, J., & Feng, D. (2020). A New Aggregation of DNN Sparse and Dense Labeling for Saliency Detection. IEEE Transactions on Cybernetics, 1–14. https://doi.org/10.1109/tcyb.2019.2963287

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255).

APPENDIX A

PHOTOGRAPHING PROTOCOL

OBJECTS

The images should contain one or more of the following objects:

bench, bin, traffic sign, traffic light, lamppost, city bike, carrier bike, scooter, compact car, SUV, van, front door, balcony door or bollard.

CAMERA SETTINGS

The camera that is used when taking the images is a DSC-RX100 digital camera.

The following settings are used when taking the images.

● The Aspect Ratio is set to 1:1.

● The Quality of the images is set to RAW or RAW & JPEG.

● The image size is set to L:13M.

● The mode is set to ‘Superior Auto’.

● The zoom is to be set to 28 mm. This means the images are taken fully zoomed out.

GENERAL INSTRUCTIONS

● Images should not be zoomed in.

● The images need to be focused.

● The goal is to take pictures in the most natural way possible. Objects need not be centered in the picture. Instead, images should focus on a scene.

● Multiple objects and categories can be in a single image.

● Objects in the image should be clearly recognizable. Try to avoid photographing a group of objects that are not clearly recognizable, such as groups of bikes.

● The composition and placing of objects should be varied as much as possible.

● Photographs should only be taken outdoors in daylight, not indoors or at night. They can be taken regardless of weather conditions.

● Moving objects should be avoided while photographing as they often come out blurry.

APPENDIX B

LABELLING PROTOCOL

GENERAL GUIDELINES

● Objects will be labelled with the rectangular labelling tool in Supervise.ly.

● If you can recognize a crop, so should the DNN. Objects with something in front of them (like a human or a lamppost) may be labelled, if the object is still clearly recognizable. When in doubt about how to label an object, a ⅔ majority among the researchers on this project should be reached before giving the object a certain (linguistic) label.

● If the object is not occluded, things can be behind the object (like a carrier bike in front of a group of bikes). Difficult objects are good for the network. When labelling, do not take into account how difficult an object might be for the network to recognize.

● All objects in the image belonging to a category should be either labelled or masked. Objects should not be left unlabelled.

● Objects must be labelled as close to the edge of objects as possible. Very small objects in images also need to be labelled. If objects in Supervise.ly are not recognizable, they often can be recognized in the RAW images.

● Pictures must be taken in Amsterdam or Diemen. Objects outside of Amsterdam can still be labelled if the picture is taken in Amsterdam (for example, a picture taken just inside municipality borders that overlooks outside of the municipality).
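
The ⅔ majority rule from the guidelines above can be expressed as a small check. This is a hedged sketch, not a tool used in this project: the function name is hypothetical, and it assumes that "⅔ majority" means at least two thirds of the votes cast and that each researcher casts exactly one vote. What should happen when no consensus is reached is left to the protocol.

```python
from collections import Counter
from fractions import Fraction


def consensus_label(votes, required=Fraction(2, 3)):
    """Return the label backed by at least `required` of the votes, else None.

    `votes` is a list with one proposed label per researcher. Fractions are
    used so that exactly two out of three votes counts as a 2/3 majority.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if Fraction(count, len(votes)) >= required else None


# Hypothetical votes for a single doubtful object.
agreed = consensus_label(["SUV", "SUV", "compact car"])
undecided = consensus_label(["SUV", "van", "compact car"])
```

With three researchers, two matching votes are enough to assign a label, while three different votes yield no label.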

ADDITIONAL LABELING INFORMATION PER OBJECT CATEGORY

BALCONY DOORS

Balcony doors are similar to front doors, but because they are on a higher floor, they do not lead into the street. Please note that balcony doors can be made of glass, so make sure to mask them if they are very reflective.

BENCHES

Benches will be labelled if they are in public space. This category includes benches in, for example, bus stops. Benches can be see-through. Specific sitting areas on ridges are considered benches.

Benches as part of (dining) tables are not considered benches, including picnic benches.

BINS

Bins on the street (public space) will be labelled, unless they are too heavily damaged to be recognizable.

Personal bins and trash containers are excluded from this group, so ‘bins’ are only municipal trash cans.

Containers are also not considered bins.

BOLLARDS

Bollards are considered to be small poles that you can find on the street, such as ‘Amsterdammertjes’.

They are used to control traffic flow in the street. Decorative bollards, such as in gardens, may also be labelled.

CARRIER BIKES

Carrier bikes are bikes that have a space in front of the bike big enough to store large objects and/or children. The bike must be functional and have its container attached to it.

CITY BIKES

In this category we will add all city bikes. Specialized bikes, such as folding bikes or tandem bikes, will not be labelled. If bikes are clustered together, they should be masked instead of labelled. Children’s bikes will be labelled.

COMPACT CARS

Compact cars are all regular cars, excluding SUVs and terrain wagons. Cantas are not compact cars, but mini two-person cars with license plates are. If the type of car is unclear, such as compact SUVs, the car should be masked.

LAMPPOSTS

Only the ‘lamp’ part of the lamppost is labelled (so not the post). If there are multiple lamps on the same lamppost, they can be labelled together if they are right next to each other. If they are separate, they should be labelled separately. Lampposts must be used to illuminate the public roads, thus private lights are not considered lampposts.

FRONT DOORS

Doors leading from the street into a building will be labelled, because these are considered front doors.

This means that doors leading into things like cars or open-area spaces are not labelled as doors. Garage doors will not be labelled. Double-doors as well as single doors that are right next to each other are labelled together to make crops more viable. Sliding doors can also be labelled. Front doors on higher levels like in apartment buildings are masked.

SCOOTERS

Motorcycles must not be labelled, only scooters. The windshield of the scooter should also be included in the label. Scooters that are fully or mostly covered by a cover are not considered scooters, because they are not recognizable.

SUVS

If you cannot tell if it is a regular compact car or an SUV from the image, consider masking the object.

Cars known as ‘Compact SUVs’, for example, will be masked. Terrain wagons are not SUVs, and will be masked instead. You can consider looking up the license plate to determine if the car is an SUV.

TRAFFIC LIGHTS

Only the light will be labelled, not the post. Lights can be labelled together if they are right next to each other. The backs of traffic lights may also be labelled.

TRAFFIC SIGNS

If traffic signs occur together (so right next to each other), they will be labelled together. Signs over the highway should be labelled. The pole is not labelled, only the sign. A traffic sign will also be labelled if only the back is visible in the image. Street names will not be labelled as traffic signs. Bike routes (the red and white signs with directions on them for bikes) will also not be labelled. Stop signs are not considered traffic signs.

VANS

Vans are larger than cars or SUVs. Vans are vehicles designed to transport passengers in the rear seating row(s) or large objects in the trunk. Minivans are included in this category. They may have windows, but this is not necessary.

MASKS

Masks are applied to all objects that would belong to a category but are not clearly separate from another object in the image.

For example, a bike standing in front of a compact car may be masked. If there are lots of the same object in a picture (such as balcony doors), you may mask them to avoid applying a large number of labels. Do not mask objects that do not belong to any category, unless specified otherwise in one of the above categories.

Masks may overlap or be placed inside an object tag.