• No results found

Using Prior Knowledge for Verification and Elimination of Stationary and Variable Objects in Real-time Images

N/A
N/A
Protected

Academic year: 2020

Share "Using Prior Knowledge for Verification and Elimination of Stationary and Variable Objects in Real-time Images"

Copied!
97
0
0

Loading.... (view fulltext now)

Full text

(1)

University of Windsor University of Windsor

Scholarship at UWindsor

Scholarship at UWindsor

Electronic Theses and Dissertations Theses, Dissertations, and Major Papers

9-24-2019

Using Prior Knowledge for Verification and Elimination of

Using Prior Knowledge for Verification and Elimination of

Stationary and Variable Objects in Real-time Images

Stationary and Variable Objects in Real-time Images

Foram Pravinkumar Patel

University of Windsor

Follow this and additional works at: https://scholar.uwindsor.ca/etd

Recommended Citation Recommended Citation

Patel, Foram Pravinkumar, "Using Prior Knowledge for Verification and Elimination of Stationary and Variable Objects in Real-time Images" (2019). Electronic Theses and Dissertations. 7831.

https://scholar.uwindsor.ca/etd/7831

This online database contains the full-text of PhD dissertations and Masters’ theses of University of Windsor students from 1954 forward. These documents are made available for personal study and research purposes only, in accordance with the Canadian Copyright Act and the Creative Commons license—CC BY-NC-ND (Attribution, Non-Commercial, No Derivative Works). Under this license, works must always be attributed to the copyright holder (original author), cannot be used for any commercial purposes, and may not be altered. Any other use would require the permission of the copyright holder. Students may inquire about withdrawing their dissertation and/or thesis from this database. For additional inquiries, please contact the repository administrator via email

(2)

Using Prior Knowledge for Verification and Elimination of Stationary

and Variable Objects in Real-time Images

by

FORAM PRAVINKUMAR PATEL

A THESIS

Submitted to the Faculty of Graduate Studies

Through Computer Science

In Partial Fulfilment of the Requirements for

The Degree of Master of Science at the

University of Windsor

Windsor, Ontario, Canada

2019

(3)

Using Prior Knowledge for Verification and Elimination of Stationary

and Variable Objects in Real-time Images

by

FORAM PRAVINKUMAR PATEL

APPROVED BY:

______________________________________________

M. Hlynka

Department of Mathematics and Statistics

______________________________________________

A. Mukhopadhyay

School of Computer Science

______________________________________________

X. Yuan, Advisor

School of Computer Science

(4)

iii

Declaration of Originality

I hereby certify that I am the sole author of this thesis and that no part of this

thesis has been published or submitted for publication.

I certify that, to the best of my knowledge, my thesis does not infringe upon

anyone’s copyright nor violate any proprietary rights. Any ideas, techniques,

quotations, or any other material from the work of other people included in my

thesis, published or otherwise, are fully acknowledged in accordance with the

standard referencing practices. Furthermore, to the extent that I have included

copyrighted material that surpasses the bounds of fair dealing within the meaning of

the Canada Copyright Act, I certify that I have obtained a written permission from

the copyright owner(s) to include such material(s) in my thesis and have included

copies of such copyright clearances to my appendix.

I declare that this is a true copy of my thesis, including any final revisions,

as approved by my thesis committee and the Graduate Studies office and that this

thesis has not been submitted for a higher degree to any other University or

(5)

iv

Abstract

With the evolving technologies in the autonomous vehicle industry, now it has

become possible for automobile passengers to sit relaxed instead of driving the car.

Technologies like object detection, object identification, and image segmentation

have enabled an autonomous car to identify and detect an object on the road in order

to drive safely. While an autonomous car drives by itself on the road, the types of

objects surrounding the car can be dynamic (e.g., cars and pedestrians), stationary

(e.g., buildings and benches), and variable (e.g., trees) depending on if the location

or shape of an object changes or not. Different from the existing image-based

approaches to detect and recognize objects in the scene, in this research 3D virtual

world is employed to verify and eliminate stationary and variable objects to allow

the autonomous car to focus on dynamic objects that may cause danger to its driving.

This methodology takes advantage of prior knowledge of stationary and variable

objects presented in a virtual city and verifies their existence in a real-time scene by

matching keypoints between the virtual and real objects. In case of a stationary or

variable object that does not exist in the virtual world due to incomplete pre-existing

information, this method uses machine learning for object detection. Verified

objects are then removed from the real-time image with a combined algorithm using

contour detection and class activation map (CAM), which helps to enhance the

(6)

v

Dedication

(7)

vi

Acknowledgements

Throughout this thesis, I have received excellent support and guidance. Firstly, I

would like to express my sincere gratitude to my advisor, Dr. Xiaobu Yuan, for his

continuous support, motivation and immense knowledge. As a mentor, his direction

supported me all the time to learn something new and to complete this research. I

would like to thank my thesis committee members, Dr. Asish Mukhopadhyay and

Dr. Myron Hlynka for their meaningful remarks and recommendations.

I am grateful to my family, whose love, motivation and positive thoughts

strengthened me to reach closer to my ambition. They are the role models and the

source of my inspiration. Last but not the least, I would like to thank all the faculties

and staff of the School of Computer Science and my friends who encouraged me at

(8)

vii

Table of Contents

Declaration of Originality ... ii

Abstract ... iv

Dedication ... v

Acknowledgements ... vi

List of Figures ... x

List of Abbreviations/Symbols ... xiii

List of Tables ... xiv

Chapter 1: Introduction ... 1

1.1 Overview ... 1

1.2 Google’s Self Driving Car ... 2

1.3 Need to aware of surroundings ... 3

1.4 Object Detection ... 4

1.5 Feature Detection and Selection ... 5

1.6 Object Elimination ... 5

Chapter 2: Literature Review ... 7

2.1 Object Detection ... 7

2.1.1 Machine Learning Approach ... 8

2.1.2 Deep Learning Approach ... 11

2.2 Feature Detection and Selection ... 15

2.3 Object Verification ... 18

2.4 Transfer Learning ... 20

2.5 Class Activation Map (CAM) Generation and Usage ... 22

2.6 Object Elimination ... 24

2.6.1 Instance Segmentation ... 24

2.6.2 Diminished Reality ... 27

2.7 Related Work ... 28

2.8 Thesis Statement ... 30

2.8.1 Problem Statement ... 30

(9)

viii

Chapter 3: Proposed Approach ... 32

3.1 Motivation ... 32

3.2 Working of the Overall System ... 32

3.2.1 Working of Individual Modules ... 33

3.3 Proposed Methodology for Static and Variable Object Verification and Object Elimination ... 36

3.3.1 Static and Variable Object Verification ... 36

3.3.1.1 Object detection module ... 37

3.3.1.2 Feature Detection module ... 37

3.3.1.3 Feature Selection module ... 38

3.3.1.4 Object Verification module ... 39

3.3.1.5 Object Verification algorithm ... 39

3.3.2 Static and Variable Object Elimination ... 41

3.3.2.1 Contour Detection Module ... 42

3.3.2.2 Generation of Class Activation Map (CAM) ... 42

3.3.2.3 Object Masking ... 43

3.3.2.4 Object Elimination algorithm ... 43

3.4 Time Complexity of the Proposed Approach ... 44

Chapter 4: Implementation and Experiments ... 45

4.1 Software Information ... 45

4.2 Construction of 3D Virtual World ... 45

4.3 Creation of Repository of Virtual World ... 46

4.4 Experiments and Results of Object Detection ... 47

4.4.1 Experiments and Results of Object Detection for Buildings ... 48

4.4.2 Experiments and Results of Object Detection for Tree ... 48

4.4.3 Experiments and Results of Object Detection for Street light ... 49

4.4.4 Experiments and Results of Object Detection for Bench ... 50

4.5 Experiments and Results of Object Verification ... 51

4.5.1 Centroid Detection of Real-time Static Object ... 51

4.5.2 Feature Detection Using FAST corner Detector of Real-time Object . 52 4.5.3 Calculation of the Distance of Real-time Object ... 52

4.5.4 Feature Selection in Real-time Image ... 52

(10)

ix

4.5.6 Object Verification of Static Object Using Prior Knowledge ... 53

4.5.7 Increasing the Confidence Score Using Neighbour Object ... 54

4.6 Experiments and Results of Object Elimination ... 55

4.6.1 Results of Generation of Class Activation Map (CAM) ... 55

4.6.2 Results of Contour Detection ... 57

4.6.3 Results of Object Elimination Using Combined Approach ... 60

4.7 Results Comparison and Discussion ... 62

4.7.1 Advantages of the Proposed Approach ... 62

4.8 Limitations of the Proposed Approach ... 64

Chapter 5: Conclusion and Future Work ... 66

5.1 Conclusion ... 66

5.2 Future Work ... 67

References/Bibliography ... 68

(11)

x

List of Figures

Figure 1: Google’s Self-Driving Car prototype ... 2

Figure 2: Google’s Autonomous Car driving on the road ... 4

Figure 3: Dynamic Object Detection ... 4

Figure 4: Detected features on a chessboard ... 5

Figure 5: Object Masking using Mask-R-CNN ... 6

Figure 6: Object Detection to identify different objects ... 7

Figure 7: Machine Learning algorithm workflow ... 8

Figure 8: Example of a CSVM network (Image source: Yakoub et al., [19]) ... 9

Figure 9: AdaBoost HOG detector applied on a test image (Image Source: Arunmozhi et al., [24])... 10

Figure 10: Deep Learning Approach Workflow ... 11

Figure 11: The architecture of R-CNN (Image source: Girshick et al., [49]) ... 12

Figure 12: The Architecture of Fast R-CNN (Image source: Girshick et al., [48]) 12 Figure 13: The architecture of Faster R-CNN (Image source: Ren et al., [47]) ... 13

Figure 14: Illustration of YOLO (Image source: Redmon et al., [40]) ... 14

Figure 15: object detection results (Image source: Tian et al., [42]) ... 15

Figure 16: Different types of detected features... 16

Figure 17: Feature Selection workflow... 16

Figure 18: Harris-Stephens Corner Detection (Image source: Haggui et al., [63]) 18 Figure 19: Image matching based on different corner detection methods. a Harris method, b the method in [75] ... 18

Figure 20: Overall system overview of DCNN approach (Image source: Chen et al., [78]) ... 19

Figure 21: Training the entire model Vs Transfer Learning ... 20

Figure 22: Three strategies for fine-tuning ... 21

Figure 23: Generated CAM for different breeds of dog using CNN ... 22

Figure 24: CCAM applied to localize common objects (Image source: Li et al., [108])... 23

(12)

xi

Figure 26: Segmented results: a. initial contours and local region; b. Final evolved

contours; c. segmented result (Image source: Lu et al. [109]) ... 25

Figure 27: Mask R-CNN framework for instance segmentation (Image source: He et al., [115]) ... 26

Figure 28: Concept of Diminished Reality ... 27

Figure 29: Diminished reality techniques: a. diminish, b. Seeing through, c. replacing, d. inpainting (Image source: Mori et al., [12]) ... 28

Figure 30: Overall system architecture ... 33

Figure 31: Architecture of the proposed approach ... 36

Figure 32: Flowchart of object verification component ... 37

Figure 33: Pixel P selected as a corner point ... 38

Figure 34: Flowchart of object elimination component ... 41

Figure 35: Constructed 3D virtual World ... 46

Figure 36: Constructed 3D Objects... 46

Figure 37: Rendered 3D object ... 47

Figure 38: Extracted 3D features on rendered images ... 47

Figure 39: Detected Skyscraper, Street light, Tree in the real-time image ... 48

Figure 40: Detected Skyscraper, Tree in the real-time image ... 48

Figure 41: Detection of Tree in the image ... 49

Figure 42: Detected Trees, Skyscraper in the real-time image ... 49

Figure 43: Street light, Trees detected in the test image ... 50

Figure 44: Street light, Trees detected in the test image ... 50

Figure 45: Bench detected in the test image ... 51

Figure 46: Centroid Calculation... 51

Figure 47: Detected centroid in the real-time image with coordinates ... 51

Figure 48: Detection of corner points using the FAST algorithm ... 52

Figure 49: Detected corner points on real-time image... 52

Figure 50: Calculating the distance between keypoints and centroid ... 52

Figure 51: Selected top 8 keypoints for top-right (1st quadrant) ... 53

Figure 52: Plotted selected keypoints on the real-time image ... 53

(13)

xii

Figure 54: Object verification result of real-time object ... 54

Figure 55: Selected keypoints of the virtual nearest static object (e.g. building) ... 54

Figure 56: Selected keypoints of the real-time nearest static object (e.g. building) ... 55

Figure 57: Verification result of neighbour stationary object (e.g. building) ... 55

Figure 58: Increased confidence score of a verified static object ... 55

Figure 59: Result of CAM Generation of Tree (left: input image, centre: generated heatmap, right: heatmap superimposed on input image) ... 56

Figure 60: Result of CAM Generation of Street light (left: input image, centre: generated heatmap, right: heatmap superimposed on input image) ... 57

Figure 61: Result of CAM Generation of Street light (left: input image, centre: generated heatmap, right: heatmap superimposed on input image) ... 57

Figure 62: Result of Contour Detection of Trees ... 58

Figure 63: Result of Contour Detection of Street light ... 59

Figure 64: Result of Contour Detection of Bench ... 59

Figure 65: Result of Contour Detection of Buildings ... 60

Figure 66: Result of Object Elimination of Trees ... 60

Figure 67: Result of Object Elimination of Street light ... 61

Figure 68: Result of Object Elimination of Bench ... 61

Figure 69: Result of Object Elimination of Buildings ... 61

Figure 70: Result of the proposed algorithm ... 62

Figure 71: Person Detected in the poster using RetinaNet model ... 63

Figure 72: Reflection of car and person on the building detected as a dynamic object ... 63

Figure 73: Detection of dynamic objects after object elimination ... 64

(14)

xiii

List of Abbreviations/Symbols

LiDAR Light Detection and Ranging

RADAR Radio Detection and Ranging

GPS Global Positioning System

3D 3-dimensional

2D 2-dimensional

CNN Convolutional Neural Network

ANN Artificial Neural Network

R-CNN Region Based Convolutional Neural

Network

YOLO You Only Load Once

ROI Region of Interest

FCN Fully Convolutional Network

SIFT Scale Invariant Feature Transformation

ORB Oriented FAST and Rotated BRIEF

BRIEF Binary Robust Independent Elementary

Features

SOA Service Oriented Architecture

SURF Speeded Up Robust Features

FAST Features from Accelerated Segment

Test

CAM Class Activation Map

(15)

xiv

List of Tables

Table 1: Related Work ... 30

Table 2: Time complexity of the proposed algorithm ... 44

(16)

1

Chapter 1: Introduction

1.1 Overview

With the aid of significant technologies and researches, now the vision of

self-driving car on the road has become a possibility to run on the road. It is not a

jaw-dropping concept as research in this direction has been carried out for years.

Seemingly within just a few years, autonomous cars have gone from science fiction

fantasy to road-bound reality [1]. Apart from many automobile giants like Tesla,

GMC, Uber, and Mercedes-Benz, there are many other technology corporations like

Google, Apple, IBM, and Intel that have infused billions of dollars in this research

and development to turn this arduous vision of a fully autonomous car into an

actuality. Nowadays, self-driving cars are being tested on the road, but they are far

away from being feasible [2]. The design of such an advanced machine involves

immense expertise to ensure smooth and safe driving.

While an autonomous car operates on the road, the knowledge of surroundings that

consist of various objects mainly differentiated according to the orientations and

movement should be taken into consideration. When a 3D virtual world is

constructed with representations of static and variable objects in the real world, it

helps the autonomous car to be familiar with the surroundings while driving. Darms

et al. [3] and Hu et al. [4] have described static objects as those that do not move

during the operation of the car. The list includes buildings and other roadside objects

like benches, trees, and light pole. Fu, Kun, et al. have used multiple class activation

map to extract discriminative parts of aircraft of different categories [5]. Verified

objects are masked out in the real-time images.

The new method of this thesis makes use of a constructed virtual environment that

works as prior knowledge to outperform various tasks such as object verification

and elimination. The existence of the real objects is verified by matching feature

(17)

2

using the combined approach of contour detection and class activation map (CAM).

This research aims to use pre-existing information to verify and eliminate stationary

and variable objects from real-time scenes to allow an autonomous car to pay

attention to moving objects like pedestrians and cars in order to drive safely.

1.2 Google’s Self Driving Car

Many auto-giants companies like Tesla, Mercedes, Uber, Google, BMW, and Volvo

have already developed their semi-autonomous cars on the market. Zhao et al.

explained the key technology of a self-driving car [6]. Despite using exclusive

hardware, autonomous vehicles are still far from being fully automatic. Google has

started constructing a self-driving car – waymo almost ten years ago. In 2015,

Google provided "the world's first fully driverless ride on public roads" to a legally

blind friend of principal engineer Nathaniel Fairfield [7]. Figure 1 displays the

Google’s autonomous car. Much tech-savvy hardware is being used to improve

visibility, to measure the distance to other objects, and to fetch geolocation

information. Although some of the hardware are expensive, they are necessary as

these sensors assist in detecting objects for safe driving. The rotating roof-top

LiDAR is considered as the heart for object detection. The LiDAR is used to

measure the distance to other objects to build the 3D map in order to see obstacles.

The bumper-mounted radar is responsible for measuring the distance to vehicles in

front and behind the car. Rear-mounted aerial receives geolocation information from

GPS satellites, and ultrasonic sensors attached to one of the rear wheels monitors

the car’s movement. These devices are necessary for the safe operation of

autonomous cars.

(18)

3

1.3 Need to Aware of Surroundings

Situational awareness is the most essential key to safe driving. Drivers should be

conscious of their location and the surroundings to navigate the car at the desired

location [128]. The same objective has been applied for an autonomous vehicle to

function on the road by recognizing objects and obstacles. When an autonomous car

is running on the road, it must be responsive to the surroundings for safe moving.

Sensor technologies including GPS provide information about the surrounding

environment. These sensors gather data to narrate the change in the position and

orientation of the car. They constantly pass on the information about surroundings

like the position of pedestrians, and other objects near the car to the system in order

to navigate smoothly.

There are mainly three categories of objects a car may come in the contact while

driving:

▪ Stationary objects: Those objects that are static at the same location with the

same pose. (e.g. Buildings, Bench, Street light)

▪ Variable objects: Those objects that stay stable at the same place, but a pose

may vary (e.g. trees)

▪ Dynamic objects: Those object that may change the location and pose (e.g.

human, animal, car)

Among all objects, dynamic objects create more danger to a self-driving car. These

objects need to be detected. Darms et al. [3] and Hu et al. [4] have described dynamic

objects as those potentially move during the observation periods. Figure 2 shows

how an autonomous car observes the surroundings while driving with the aid of

(19)

4

Figure 2: Google’s Autonomous Car driving on the road

1.4 Object Detection

Sensors attached to the car are responsible to estimate distance and steer clear of

collision with obstacles. The nearby objects are identified using various machine

learning algorithms of object detection. Druzhkov et al. [8] published a survey of

deep learning techniques for object detection and image classification. A domain

shift framework based on image-style level and instance-level algorithm based on

Faster- RCNN has been used for object detection [9]. Object classification has been

performed using a fusion of CNN and light detection and ranging (LIDAR) for an

autonomous vehicle [10]. Figure 3 depicts the object detection task where dynamic

objects are located and displayed using bounding boxes with the respective class

label and probability score. The proposed approach detects roadside stationary (e.g.

buildings, street light) and variable (e.g. trees) objects using Faster-RCNN

technique. The model has been trained to detect an instance of an object in the

real-time scene.

(20)

5

1.5 Feature Detection and Selection

Feature plays a significant role in various image-based applications. It can be

expounded as a vector to define an object or its accompaniment. Features are specific

structure in the image such as edges, corners, blobs, and points. Different feature

detection algorithms are Harris Corner detection, Shi-Tomasi Corner detector and

Good Feature Track, SIFT, SURF, FAST algorithm for corner detection, BRIEF,

and ORB. The survey of feature detection algorithm is presented by Li et al. [11].

The paper also presents mathematical models of algorithms. Performance

comparison of various algorithms is presented based on accuracy, speed, scale

invariance, and rotation invariance. Figure 4 shows the detected corner points using

Harris corner detection algorithm. The proposed approach detects corner keypoints

using feature detector and the most relevant feature points are selected using the

mathematical approach.

Figure 4: Detected features on a chessboard

1.6 Object Elimination

A part of a scene can be hidden as they were behind an invisible object using

masking technologies. Object elimination can be implemented for various tasks like

background removal, foreground subtraction, object detection, instance

segmentation. Multiple approaches to perform elimination include contour

detection, Mask-RCNN, diminished reality. A survey of different diminished reality

techniques has been published by Mori et al. [12]. Yang et al. [13] introduced a

contour detection framework using fully convolutional Encoder-Decoder Network

(21)

6

Mask-RCNN algorithm. The proposed method uses a combined approach of contour

detection and class activation map for object elimination.

Figure 5: Object Masking using Mask-R-CNN

This research work is the primary approach for stationary (e.g. buildings, street light,

benches) and variable (e.g. Trees) object verification and object elimination in

real-time images giving the insights for an autonomous car to navigate smoothly on the

road. The proposed approach performs verification task by matching interest points

extracted from a real-world object with a virtual world object. Once the object is

verified successfully, the removal of static and variable objects in the real-time scene

is executed via the fusion method of contour detection and Class Activation Map

(CAM) to achieve more accuracy.

In this thesis, Chapter 2 entails a review of the previous work done on object

detection and object elimination for autonomous vehicles. It also caters to the

criterion techniques for the proposed approach. Chapter 3 describes the proposed

system of use of prior data for object verification and elimination in the real-time

scene in depth. Chapter 4 demonstrates a detailed explanation of the proposed

approach using experimental results and presents comparison with other related

work. Chapter 5 contains the conclusion of the thesis together with possible future

(22)

7

Chapter 2: Literature Review

This chapter provides a survey of the relevant background of recent works in object

detection and object elimination using 2D and 3D images. It also covers use of prior

knowledge to help improving the performance of object verification and object

elimination.

2.1 Object Detection

Object detection is related to computer vision and image processing that locates

instances of objects of a certain class in still images and videos. Object detection has

many applications in computer vision area such as face detection, image retrieval,

and pedestrian recognition. Figure 6 below displays an illustration of object

detection algorithm to identify roadside objects (e.g. car, traffic light, truck) [14].

Zou et al. [15] discussed a survey of object detection in the last 20 years. This survey

covers milestone detectors in history, detection datasets, metrics, fundamental

building blocks of the recognition system, speed up techniques, and the recent state

of the art detection methods. It also reviews some important identification

applications, such as pedestrian detection, face detection, and text detection, etc.,

and makes an in-deep analysis of their challenges as well as technical improvements

in recent years.

Figure 6: Object Detection to identify different objects

Methods of object detection mainly fall into either Machine Learning-based

(23)

8

2.1.1 Machine Learning Approach

In machine learning approaches, it is necessary to first define features using any one

method listed below. Then classification is performed on the detected features using

a technique such as Support Vector Machine (SVM), and Random Forest (RF) [69].

• Viola-Jones Object detection framework based on Haar features • Scale-Invariant Feature Transform (SIFT)

• Histogram of Oriented Gradients (HOG) features

Erickson et al. [16] published a paper that uses machine learning for medical

imaging. They have also reviewed different classification such as Naïve Bayes,

Support Vector Machine, Neural Networks, k-Nearest Neighbours, Deep Learning,

and Decision Tree to select suitable features. Figure 7 below portrays the workflow

of general machine learning algorithms.

Figure 7: Machine Learning algorithm workflow

Lei et al. [17] analyzed the feature selection techniques for object-based

classification of unmanned aerial vehicle imagery. Their work specifically emphases

on assessing the effect of feature dimensionality and training the set size using SVM

and RF classifiers to achieve different feature selection methods, including the filter

method, wrappers, and embedded methods. Bakhshipour et al. [18] illustrated an

algorithm for weed detection based on their pattern by applying support vector

machine and artificial neural network. Their work also involves the use of shape

features such as Fourier descriptors and moment invariant features. The

(24)

9 An approach of Convolutional SVM Network was applied for object detection in

UAV Imagery [19]. The CSVM network is based on several alternating

convolutional and reduction layers ended by a linear SVM classification layer. The

convolutional layers in CSVM rely on a set of linear SVMs as filter banks for feature

map generation. During the learning phase, the weights of the SVM filters are

estimated through a forward supervised learning strategy unlike the backpropagation

algorithm widely used in standard convolutional neural networks (CNNs) [19].

Figure 8 below illustrates the architecture of the Convolutional SVM algorithm.

Figure 8: Example of a CSVM network (Image source: Yakoub et al., [19])

Wei et al. [20] designed an approach for multi-vehicle detection by combining Haar

and HOG features. This algorithm makes full use of HOG characteristics and uses

its good descriptive ability to describe target vehicles whereas Harr features are

utilized to extract the prospect region of interest (ROI). Moreover, the obtained HOG

features from the ROI target area can be selected by applying the cascade structured

AdaBoost classifier features and target area classification. The precise target can be

further detected by a Support Vector Machine (SVM) [20]. Chee et al. [21] proposed

an algorithm for Pedestrian detection through the fusion of image gradient and

magnitude properties extracted using Histogram of Oriented Gradient (HOG) and

Histogram of Magnitude (HOM) features.

An automatic method for the recognition of individual oil palm trees using images

from unmanned aerial vehicles (UAVs) was developed by Wang et al. [22]. First,

using a support vector machine (SVM) classifier, UAVs images are categorized

between vegetation and non-vegetation. Then, a feature descriptor based on the

(25)

10

features for machine learning. Finally, SVM classifier has been trained and

optimized using the HOG features from positive (i.e., oil palm trees) and negative

samples (i.e., objects other than oil palm trees) [22]. An approach for object

detection and classification was published by Rashid et al. [23] that uses a merged

strategy of deep convolutional neural network and SIFT point features. Firstly, an

improved saliency method is implemented, and the point features are obtained.

Then, DCNN features are extracted from two deep CNN models like VGG and

AlexNet. Thereafter, Reyni entropy-controlled method is executed on DCNN

pooling and the SIFT point matrix for robust feature selection. Finally, the selected

robust features are fused in a matrix by a serial approach, that is later fed to ensemble

classifier for recognition [23]. An analysis of three commonly used strategies,

Histogram of Oriented Gradients (HOG), Haar-like features and Local Binary

Pattern (LBP) for object detection is investigated using a public dataset in [26].

Figure 9 below demonstrations the result of AdaBoost HOG detector on a test image.

A novel algorithm on a mobile system that can notify drivers about the possibility

of collision with pedestrians was developed [24]. The partial Haar transform and

HOG are fused for pedestrian detection.

Figure 9: AdaBoost HOG detector applied on a test image (Image Source:

Arunmozhi et al., [24])

Prasanna et al. [25] built an approach for human tracking system using joint

Haar-like and HOG features where Haar characteristics are used for the object’s structure

and the HOG features for the edge. A set of mixed features is developed with these

(26)

11

of robust features. Finally, with the help of SVM classifier, classification is

executed.

2.1.2 Deep Learning Approach

In the Deep Learning Approaches, the algorithms are capable of performing

end-to-end object localization task without defining any features and are based on

Convolutional Neural Network (CNN) [69]. Deep Learning approaches are as

follows:

• Region Proposals (R-CNN, Fast R-CNN, Faster R-CNN) • Single Shot MultiBox Detector (SSD)

• You Only Look Once (YOLO)

Brunetti et al. [27] published a survey on computer vision and object detection

methodologies for pedestrian detection and tracking. Panchpor et al. [29] and

Pouyanfar et al. [31] published a study on various object detection algorithms using

deep learning. Arnold et al. [28] reviewed a survey on 3D object detection methods

that utilize sensors and datasets for autonomous driving. 3D object identification

technique introduces a third dimension that reveals the object’s size and location

information useful for path planning, collision avoidance, and so on [28]. Liu et al.

[30] submitted a survey on advanced techniques for generic object detection.

Sindagiet al. [32] published a survey of the latest approaches of CNN for single

image-based crowd counting.

Figure 10: Deep Learning Approach Workflow

Girshick et al. [49] introduced an approach named R-CNN: Regions with CNN

(27)

12

illustrates the workflow of deep learning approach. The input to this approach is test

image that locates the object using bounding boxes. Figure 11 shows the architecture

of R-CNN.

Figure 11: The architecture of R-CNN (Image source: Girshick et al., [49])

The algorithm uses a selective search to extract 2000 regions from the input image

that are wrapped into a square and fed into CNN to produce 4096-dimensional

feature vector. The CNN works as a feature extractor and the output dense layer

consists of the extracted features that are forwarded to SVM classifier to identify the

presence of an object within the bounding box. The problem with R-CNN is that it

takes 47 seconds to generate an output. Also, it requires much time for training as

2000 regions for each image need to be classified.

Girshick et al. [48] developed an enhanced approach of R-CNN, i.e., Fast R-CNN

algorithm. Fast R-CNN requires less time to generate an output than simple R-CNN.

The architecture of Fast R-CNN is shown in Figure 12.

Figure 12: The Architecture of Fast R-CNN (Image source: Girshick et al., [48])

According to Fast R-CNN algorithm, the input image is directly fed into CNN to

(28)

13

obtained. Then using ROI pooling layer, all the proposed regions are reshaped into

a fixed size to be fed into a fully connected layer. The corresponding class label of

the proposed region and offset value of the bounding box are predicted using

soft-max layer. The Fast R-CNN is faster than R-CNN and the convolution operation is

performed only once per image and feature map is generated from it.

Ren et al. [47] built an improved approach of Fast R-CNN called Faster R-CNN.

Figure 13 depicts the architecture of Faster R-CNN.

In Faster R-CNN, the input image is supplied to CNN to create a convolutional

feature map. Instead of using a selective search algorithm, a separate network is used

to predict the region proposals. The expected region proposals are reshaped into a

fixed size using ROI layer to classify the image within the proposed region and

estimate the offset values for the bounding boxes using soft-max layer.

Figure 13: The architecture of Faster R-CNN (Image source: Ren et al., [47])

Neumann et al. [44] published a fully annotated dataset including tracking

information for pedestrian detection and tracking at night. The scene is recorded

taking advantage of industry-standard camera including different sensors and

weather conditions. Sheng et al. [46] proposed an approach for vehicle area detection

and vehicle brand classification using RCNN, Faster R-CNN, AlexNet, ResNet,

VGGNet, and GoogLeNet.

Redmon et al. [40] designed YOLO (You Only Look Once) algorithm for object

(29)

14 Figure 14: Illustration of YOLO (Image source: Redmon et al., [40])

In YOLO, a single neural network predicts the corresponding class probability and

bounding box. First, the image is split into SxS grid where each grid cell predicts

only single object, containing m bounding box and for each bounding box, the class

probability and offset value are evaluated against a pre-set threshold value to locate

the object.

Verma et al. [33] illustrated an algorithm with a fusion of monocular camera and a

2D Lidar for vehicle detection using YOLO. Possatti et al. [34] proposed to integrate

the power of deep learning-based detection using YOLO with the prior maps utilized

by car platform IARA (Acronym for Intelligent Autonomous Robotic Automobile)

to recognize the relevant traffic lights of predefined routes. Mittal et al. [35]

explained the object detection and classification tasks using YOLO algorithm. Putra

et al. [36] developed an improved approach of YOLO for human and car recognition.

Wang et al. [37], Zhang et al. [38], and Zhang et al. [39] used SSD approach for

object detection. Chowdhury et al. [45] designed Faster R-CNN and SSD approach

for pedestrian intention detection.

Židek et al. [41] built a method for object detection using deep learning technique

trained by 3D virtual models. The CNN model is trained using 2D samples generated

automatically from the 3D virtual models. Loing et al. [43] designed a methodology

for localization without using the single real-time image by utilizing only 3D models

(30)

15

named ParallelEye. Faster R-CNN and DPM networks are trained via fusing

ParallelEye virtual dataset with a time image dataset for object detection in

real-time view. Figure 15 shows the results of object detection using two different

models. Top row images are detected using a model purely trained on the real-time

images whereas bottom row objects are detected using a network, trained using

combined virtual and real-time scenes.

Figure 15: object detection results (Image source: Tian et al., [42])

2.2 Feature Detection and Selection

In computer vision and image processing, feature plays a vital role that can be

described as a prominent characteristic of an image. They are benefited to execute

certain tasks such as feature selection, feature matching, object recognition and so

on. Features represent the specific structure in the image such as edges, centroid,

blobs, and points. Berger et al. [72] explained the feature and the actual feature usage

in the industry.

Features are identified using detectors from the image. Feature detection is the

process of finding image features or keypoints of a given type at each pixel that are

somehow special in the image. Several detectors for features detection are as

follows:

• Harris corner detection

• Shi- Tomasi corner detection algorithm

(31)

16 Figure 16: Different types of detected features

Feature extraction is the method of computing a descriptor from the pixel around

each interest point using feature descriptors such as SURF, HOG, and FREAK.

Feature starts from an initial set of measured data and builds derived values

(features) intended to be informative and non-redundant, facilitating the subsequent

learning and generalization steps, and in some cases leading to better human

interpretations. Feature extraction is related to dimensionality reduction [76].

In computer vision, feature selection is expounded as a process for creating the new

subset of essential features to enhance generalization by reducing overfitting, and

for dimensionality reduction. Common names for feature selection are variable

selection, attribute selection, or variable subset selection. Feature selection differs

to feature extraction by returning the subset of selected features, whereas feature

extraction defines new features from the function of the original features. Figure 17

portrays the workflow of the feature selection technique.

Figure 17: Feature Selection workflow

Feng et al. [73] designed a method for feature detection and matching using Harris

corner detection algorithm. Making use of Harris corner detection algorithm,

features are obtained. Rotation invariant Feature Descriptor (RFID) is used to

(32)

17

detection algorithms proposed in the last four decades. Corner detection algorithms

can be divided into intensity-based, contour-based, and model-based methods.

Intensity-based frameworks are based on measuring local intensity variation of the

image. Contour-based methods identify corners by analyzing the shape of edge

contour. Model-based algorithms extract corners by fitting the local image into a

predefined model [50]. Karim et al. [51] presented a study on the comparison of

feature extraction techniques by combining SURF with FAST and BRISK, followed

by feature matching. Al-Rawabdeh et al. [53] submitted a review on the performance

of FAST-9 and FAST-12 as well as the Harris detector in terms of the repeatability

rate, completeness, and correctness under different threshold values for UAV object

localization. Hore et al. [55]analyzed the performance of SIFT and SURF feature

descriptors in different circumstances such as rotational effect, scaling effect,

illumination effect, and blurring effect to achieve object recognition. Kabir et al.

[64] presented a comparison of four feature detection methods for the modern and

old buildings, including Canny edge detection, Hough line transform, Find

Contours, and Harris Corner. A comparison of four feature detection approaches;

Harris, SURF, FAST, and FREAK, is published by Ghosh et al. [67] for image

mosaicing.

DeTone et al. [52] built a self-supervised approach for training feature detectors and

descriptors to make them suitable for a large number of multi-view geometry

problems. Gao et al. [56] developed a novel method utilizing shadows that

automatically extracts building samples and verifies buildings accurately to enhance

automation and accuracy. Liu et al. [59] presented an automatic methodology for

building area extraction from optical high-resolution imagery using the newly

developed morphological building index (MBI). The new FPGA (Field

Programmable Gate Array) architecture for reuse of sub-image data was introduced

in [54]. In the proposed architecture, a remainder-based approach is firstly designed

for reading the sub-image and a fusion of FAST and BRIEF (Binary Robust

Independent Elementary Features) descriptors is used for corner detection and

(33)

18

algorithm against various image distortions such as rotation, scaling, fisheye, and

motion distortion.

Zhao et al. [61] built a method to estimate the height of the building using both

corner points and roofline. Haggui et al. [63] developed an approach using Harris

corner detector for NUMA manycore recognition. Figure 18 shows the Harris corner

identification on NUMA manycore.

Figure 18: Harris-Stephens Corner Detection (Image source: Haggui et al., [63])

Wu et al. [57] proposed and implemented Deep Validation, a novel framework for

real-world error-inducing corner detection using DNN-based system. A novel

framework for corner detection and tracking for the real-time scene was presented

in [58]. Ghandour et al. [60] designed Building Detection with Shadow Verification

(BDSV) for building localization using the shadow, shape, and color features of

buildings. Hu et al. [62] illustrated a non-interactive approach based on binary

feature classification for building area recognition and building contours extraction

from aerial images.

Infrared image matching using SUSAN corner detection was introduced in [75].

Figure 19 displays the comparative result of the proposed approach in [75] with

Harris Corner Detection.

Figure 19: Image matching based on different corner detection methods. a Harris

(34)

19

2.3 Object Verification

Verification has been implemented primarily for fingerprint, iris, and face matching.

Chen et al. [78] proposed an approach for unconstrained face verification practicing

Deep CNN features, trained using the CASIAWebFace dataset and the performance

was evaluated on both IJB-A and LFW datasets. Figure 20 illustrates the overall

system architecture of the proposed DNN framework for face verification in [78].

Figure 20: Overall system overview of DCNN approach (Image source: Chen et

al., [78])

An approach for video-based unconstrained face verification and recognition was

discussed in [79]. Crosswhite et al. [80] proposed a template adaptation method for

face verification and identification. Their work also proved that it can be applied to

existing state-of-art methods for enriched performance. Zhang et al. [82] evaluated

a method for animal object detection and segmentation from wildlife monitoring

videos captured by motion-triggered cameras, called camera-traps. First, using

multilevel graph cut, animal object region proposals are generated in the

spatiotemporal domain. Then using developed a cross-frame temporal patch

verification method, these region proposals are determined if they are true animals

or background patches.

Hsu et al. [83] developed an architecture for vehicle verification between two

nonoverlapped views using sparse representation. Karami et al. [66] published a

survey of image matching technologies- SIFT, SURF, BRIEF, and ORB. Taira et al.

[84] built a system for indoor object localization that estimates the 6DoF camera

(35)

20 proposed an approach for retina verification based on Structural Similarity (SSIM)

to verify using similarity score. Kavitha et al. [86] designed an approach for a

secured voting system using face, iris, and fingerprint verification. Qin et al. [87]

suggested a deep learning-based segmentation methodology for finger-vein

verification by training CNN to extract the vein patterns from any image regions and

to estimate the probability of pixels to check if they belong to the vein or the

background. They also made use of FCN to recover missing finger vein shapes for

amended performance.

2.4 Transfer Learning

In machine learning, transfer learning is a method that makes use of knowledge

gained while solving one problem to apply it for a different but related problem. For

instance, knowledge obtained to recognize a cat can be applied for dog recognition.

Some widely used pre-trained models are:

• VGG • InceptionV3 • ResNet5

Figure 21 below illustrates the difference between training the entire model and

using transfer learning.

Figure 21: Training the entire model Vs Transfer Learning

The pre-trained models are primarily trained on the large dataset and, for a new

similar dataset, the pre-trained model weights can be used for extracting the features.

(36)

21

classifier of the model has been removed and new classifier has been added to

perform a defined task and later fine-tuned using any one of the following strategies:

1. Train the entire model: Use the architecture of the pre-trained model and

train it according to the dataset from scratch.

2. Train some layers and leave the others frozen: For the small dataset, freeze

more layers to prevent overfitting whereas for a large dataset, train more

layers.

3. Freeze the convolutional base: The convolutional layer is set in its original

form and its output is used for the classification task.

Figure 22 presents these 3 strategies of fine-tuning in a graphical way.

.

Figure 22: Three strategies for fine-tuning

Weiss et al. [97] reviewed the transfer learning methodology and its applications.

Huh et al. [96] proposed an approach where they used pre-trained CNN features on

various subsets of the ImageNet dataset and evaluated transfer performance on a

variety of standard vision tasks. Yuan et al. [91] presented a learning-based

framework for shadow removal using an online learning strategy and fine-tuned with

the automatically identified examples in the new videos. Wang et al. [94] developed

an architecture for ship detection via fusion of single shot multiBox detector (SSD)

and transfer learning. A system for end-to-end airplane detection in remote sensing

images using transfer learning approach is presented in [92]. Kapur et al. [90] used

transfer learning for object detection in real-time video. A deep learning-based

(37)

22 Mohamed et al. [88] presented a work on applications of transfer learning for object

detection. Yabuki et al. [89] and Singh et al. [93] designed a method for object

detection using transfer learning based on CNN with feature extractor technique.

2.5 Class Activation Map (CAM) Generation and Usage

A Class Activation Map (CAM) is a technique for producing discriminative image

regions used by CNN to identify defined class in the input image. CAM allows us

to observe at which image regions CNN is looking and appropriate to a specific

class. Zhou et al [98] proposed a framework of re-using the trained classifier for

getting good localization of distinct class, without having bounding box coordinates

information. Figure 23 presents the generated class activation map for different

breeds of dog using CNN.

Figure 23: Generated CAM for different breeds of dog using CNN

CAM delivers certain assurance that the model has correctly learned the distinctive

features between multiple categories in the form of visualization. Moreover, it

conveys us to see what features are guiding the model’s decision to classify various

objects in the input image that are used by the model to make a prediction.

These means are pursued to generate class activation map:

1. The pre-trained network is used and most of its weights are freezed

2. The model is modified and fine-tuned to generate CAM output

3. Classifier is trained

4. The last convolutional layer is used to create CAM output

(38)

23

To generate CAM, the network architecture is limited to have a global average

pooling layer after the last convolutional layer, followed by the dense layer. To get

this model structure, the network is modified and fine-tuned to get CAM.

In detail, the first building block for this layer is a convolutional layer that produces

an output shape of in terms of batch size, number of filters, width, height. The output

from the GAP layer is treated by the dense layer and softmax layer to assign a weight

to each of the categories that set the importance of each the convolutional layer

output. To generate CAM, output images from the convolutional layer are multiplied

by their assigned weights and added. By superimposing the class activation map on

input image allows us to identify the most essential image regions to the specific

category.

Kwaśniewska et al. [99] demonstrate a method of face detection from low-resolution

thermal images and the most relevant area is highlighted. Tang et al. [101] proposed

a deep discriminative map network for visual tracking. The system utilizes two

neural networks for positioning and size change estimation. Guo et al. [103]

developed a methodology to recognize human attributes without the detection of a

body part and the prior correspondence between body parts and attributes with the

help of CAM network. Li et al. [105] showed a framework for remote sensing image

scene classification using CAM. The attention map is generated using a pre-trained

network as priors for the classification task that are used as an explicit input to

end-to-end training for the first time, aiming to force the network to focus more on the

most appropriate parts. Li et al. [108] used CAM to localize common objects in an

input image. Figure 24 displays the result of produced CAM to locate common

objects.

Figure 24: CCAM applied to localize common objects (Image source: Li et al.,

(39)

24 Charuchinda et al. [106] introduced a technique to build an image classification

network using class activation map (CAM) to identify whether each sub-image

contains the class of interest. The output of the CAM is the filter response where

pixels with high probability are likely to belong to the class of interest. Vasu et al.

[107] made use of class activation map to obtain a view of deep network’s perception

of aerial imagery and identify salient local regions. Moreover, the concept of

transfer-learning is involved to train the model on a similar dataset. Fu et al. [13]

applied a methodology of multi-class activation map for recognition of aircraft in

the remote sensing images.

Pericherla et al. [100] designed an approach to reduce the L2 distance i.e., Euclidean

distance between produced adversarial images and the original images using class

activation map. Selvaraju et al. [102] introduced Grad-CAM model for class

discriminative localization from any CNN-based network without modification and

re-training and applied for image classification and captioning. Kumar et al. [104]

developed a method for visualization and to understand the decisions made by deep

neural networks (DNNs) for a given specific input.

2.6 Object Elimination

Object elimination is the process of masking or deleting an identified or verified

object from a scene. Masking is a technique of hiding a part or a part of an object as

if it were behind an invisible object. Object elimination can be implemented for

various tasks such as foreground extraction, background removal, object detection,

and instance segmentation. Approaches to implement object masking are as follows:

1. Contour detection-based masking

2. Mask-RCNN

3. Diminished Reality

The detail of each masking methodology is described in the following subsections.

2.6.1 Instance Segmentation

Instance segmentation can be defined as the identification of boundaries of the

(40)

25

problem of detecting and delineating each distinct object of interest appearing in the

image. Instance segmentation can be achieved by various technologies like

contour-based and Mask R-CNN. Figure 25 depicts the result of instance segmentation in

the input image.

Figure 25: Example of Instance segmentation

Lu et al. [109] designed a method for lip segmentation. Lip segmentation is the initial

step for the lip-reading system. They proposed an active contour model-based lip

segmentation method that adopts local information. Figure 26 illustrates the result

of lip segmentation using contour detection.

Figure 26: Segmented results: a. initial contours and local region; b. Final

evolved contours; c. segmented result (Image source: Lu et al. [109])

Tesema et al. [110] introduced a methodology for human segmentation in still

images using Deep Contour-Aware Network (DCAN) which is a unified multi-task

deep learning framework combining the complementary object and contour

information simultaneously for better segmentation performance. Griffiths et al.

[111] presented an approach to improve public GIS building footprint labels using

(41)

26 [112] discussed a method for vehicle detection and segmentation in the context of

autonomous driving using fully convolutional network for semantic labeling and

estimating the boundary of each vehicle. CNN provides the area around the contours

that aids to separate the vehicle instance. Hayder et al. [113] introduced a distance

transform-based mask representation that allows prediction of instance

segmentations beyond the limits of initial bounding boxes. Chen et al. [81]

developed a framework for more accurate detection and segmentation of histology

images using deep contour-aware network. Yang et al. [13] suggested a deep

learning approach for contour detection using fully convolutional encoder-decoder

network. Li et al. [114] presented a novel method to borrow contour knowledge for

salient object detection.

Mask R-CNN is widely used for instance segmentation. Mask R-CNN locates each

pixel of the object in the image instead of the bounding boxes. He et al. [115]

introduced a novel framework for instance segmentation: Mask R-CNN. Figure 27

shows the framework of Mask R-CNN.

Figure 27: Mask R-CNN framework for instance segmentation (Image source: He

et al., [115])

When an input image is passed to the network, it gives the object bounding box,

classes, and masks on the detected object.

Mask R-CNN contains two stages:Firstly, it generates region proposals where there

might be an object based on the input image. Secondly, it predicts the class of the

object, refines the bounding box and applies a mask in the pixel level of the object

(42)

27 Novotny et al. [116] developed an approach for segmenting unknown 3D objects in

real-time depth images using Mask R-CNN, trained on synthetic data. Liu et al.

[117] suggested an improved version of Mask R-CNN for instance segmentation by

applying the features from low levels. Yu et al. [118] built a model for fruit detection

where Mask R-CNN adopts Resnet50 backbone network that is merged with the

Feature Pyramid Network (FPN) architecture for feature extraction. For each feature

map, RPN was trained to create region proposals. After generating mask images of

ripe fruits using Mask R-CNN, a visual localization method for strawberry picking

points was performed. Johnson at al. [119] demonstrated that Mask R-CNN allows

highly effective and efficient segmentation of a wide range of microscopy images

under various conditions and different cells.

2.6.2 Diminished Reality

Diminished Reality (DR) is a technique to virtually remove, hide, and see-through

real objects from the real world. Diminished reality is the conceptual reverse of

Augmented Reality. AR allows us to augment, add virtual world as desired whereas

DR allows erasing physical content from the real-world scene. The real-time

application of diminished reality includes furniture shopping, film studio, city

planning, and interior designing. Figure 28 demonstrates the perception of DR by

removing the glasses in an input image.

Figure 28: Concept of Diminished Reality

Mori et al. [12] published a survey on diminished reality techniques that is beneficial

for virtually erase, hide an object from the real-time scene. They provided a concept

(43)

28

executing diminish, seeing through, replace, and inpainting technologies. Figure 29

illustrates the concept of diminished reality methods.

Kawai et al. [120] performed diminished reality using inpainting method for

background geometry removal with fewer constraints than the conventional ones.

Mori et al. [121] designed an approach for capturing and reproducing the real world

as desired using diminished reality.

Figure 29: Diminished reality techniques: a. diminish, b. Seeing through, c.

replacing, d. inpainting (Image source: Mori et al., [12])

Siltanen et al. [122] developed a novel framework for interior designing without

using prior information of textures, via inpainting method. Nakajima et al. [123]

introduced an approach for deleting an object using diminished reality.

2.7 Related Work

The below table 1 highlights correlated work done so far by researchers in the area

closely related to this thesis, including their contributions and scope of

improvements.

Research Paper Contributions Scope of

Improvement

Training and Testing

Object Detectors with

Virtual Images. Tian, Y.,

Presents an artificial way to

construct perceived image

datasets automatically with

precise annotation and

Generated virtual

images can be used

as prior data for

(44)

29

Li, X., Wang, K., & Wang,

F. Y. (2018)

trained DPM and Faster

R-CNN with real-time images

and virtual images for object

detection

without training any

object detectors

A new FPGA architecture

of FAST and BRIEF

algorithm for on-board

corner detection and

matching. Huang, J., Zhou,

G., Zhou, X., & Zhang, R.

(2018)

FAST and BRISK

descriptors are combined for

corner detection and

matching.

Corner matching

between real-time

image and virtual

image using the

proposed algorithm

is not described.

Unconstrained face

verification using deep cnn

features. Chen, J. C., Patel,

V. M., & Chellappa, R.

(2016, March)

Face verification is

performed using deep

convolutional features that is

trained using IARPA

dataset.

Training of the

model is required for

face verification that

increases the time.

Localizing Common

Objects Using Common

Component Activation

Map. Li, W., Jafari, O. H.,

& Rother, C. (2019)

Designed CCAM model to

localize common objects in

an image by treating CAM

as components to discover

common elements.

This approach

doesn’t perform

elimination by using

the most relevant

part generated by

CAM.

Lip segmentation using

localized active contour

model with automatic

initial contour. Lu, Y., &

Zhou, T. (2018)

Lip segmentation is

performed using an active

contour-based model.

Requires deep

learning model to

train to get contours

that is

time-consuming.

Semantic object selection

and detection for

Introduced a model for

diminished reality that

Training of the

(45)

30

diminished reality based

on SLAM with viewpoint

class. Nakajima, Y., Mori,

S., & Saito, H. (2017,

October).

automatically recognizes the

region to be removed,

without generating the 3D

model of the target object ad

by utilizing SLAM,

segmentation, and

recognition framework.

object detection is

mandatory. This

consumes more

computational

power.

Table 1: Related Work

2.8 Thesis Statement

2.8.1 Problem Statement

Literature survey clearly shows the need to detect dynamic objects accurately while

an autonomous car is driving in order to avoid collisions. According to a literature

survey, recent technologies use machine learning approaches for dynamic object

detection and tracking in real-time that results in training the model and requires

more computational sources. In some cases, the model detects an object that is not

going to move (e.g. a person in the poster) as a dynamic object and gives bounding

box as the trained models are image-based. This ends in consuming more power to

process that exceptional object as a dynamic one. A similar concept applies to a car

when a reflection of a car falls on the glass building, and object detection algorithm

treats that reflection as a dynamic object. The methodologies for object elimination

such as diminished reality, mask R-CNN uses machine learning and requires

training of the model to detect the object that needs to be removed.

The proposed approach of object verification and object elimination uses

constructed 3D virtual world as pre-existing knowledge. The proposed algorithm of

object verification and removal makes use of virtual world to verify physical

stationary (e.g. Buildings) and variable (e.g. Trees) in the real-time environment by

matching keypoints of virtual objects with physical objects without any training. The

proposed method of elimination uses a fusion of contour detection and class

(46)

31

removal of stationary and variable objects will improve the accuracy and efficiency

of the dynamic object algorithm.

2.8.2 Thesis Contribution

The major contribution of this thesis can be summarized as follows:

• Constructed 3D virtual model works as prior information for static (e.g.

Building) and variable (e.g. Trees) object verification and elimination in the

real-time scene in order to make dynamic objection algorithm efficient and

accurate.

• The proposed approach of object verification doesn’t require any machine

learning algorithm for training.

• The verified object can be used to geo-locate the self-driving car in a

real-time environment.

• For the objects having pre-existing data, the proposed fusion approach of

contour detection and class activation map (CAM) for object elimination

algorithm can be applied directly without any training.

• In case of objects without having prior knowledge, the model is trained using

transfer learning concept to generate CAM to perform object elimination.

• After applying the object removal algorithm to the real-time image, the

resulting image will be left with dynamic objects making an autonomous car

focus only on those objects that cause more danger. This makes the detection

Figure

Figure 1: Google’s Self-Driving Car prototype
Figure 3: Dynamic Object Detection
Figure 4: Detected features on a chessboard
Figure 9: AdaBoost HOG detector applied on a test image (Image Source:
+7

References

Related documents

The laser plummet built into the product produces a visible red laser beam which emerges from the bottom of the product. The laser product described in this section, is classi-

technology integrated project-based learning to increase critical thinking skills in science for fourth grade students at Firebird Academy. The review of literature is guided by three

With these assumptions, the team members can pinpoint the actual results linked to the team building process and provide data necessary to develop the ROI.. This can be

Objectives: The aim of this retrospective study is to compare surgical margins, reoperation rates and local recurrences after breast conserving surgery (BCS) using radioguided

Panel B of Table 3 shows the differences between full- and part-time fac- ulty members’ levels of satisfaction according to a global indicator, “overall satisfaction with the job.”

At his deposition on May 31, 2013, Angel Cardoso, a former Source One customer service representative, testified that it was Source One’s policy to assign employees based on sex as

– Assess the risk-based OFAC program to evaluate whether it is appropriate for the institution’s OFAC risk, taking into account products, services, customers, transactions,

[25, 26] CDM datasets can be aggregated across multiple data sources (e.g., EHR, lab, and professional billing systems), rather than the single source of an EHR. Furthermore,