Automatic Learning Technology of Railway Based on Deep Learning for Railway Obstacle Avoidance

(1)

2018 International Conference on Applied Mechanics, Mathematics, Modeling and Simulation (AMMMS 2018) ISBN: 978-1-60595-589-6

Automatic Learning Technology of Railway Based on Deep Learning for

Railway Obstacle Avoidance

Jiao FENG, Teng LI

*

, Qian-qian NIU and Bao-hua WANG

Beijing Jiaotong University, Beijing, China

*Corresponding author

Keywords: Railway identification, Deep learning, Mask R-CNN.

Abstract. Railway identification is the key to dividing the dangerous area and safety area of the rail surface. It is also the key technology for the detection of rail surface obstacles. The problem is that the traditional rail detection edge detection effect is not good and the illumination environment is seriously affected. This paper proposes a track recognition detection system based on deep learning. The system can not only identify the rails, but also plan the dangerous areas, apply the convolutional neural network, extract the feature information of the rails in different environments, and use the Mask R-CNN instance segmentation algorithm. The identification of the rails and the division of the dangerous areas are carried out to achieve accurate and rapid detection. The test results show that the system has high anti-interference and accuracy.

Introduction

In December 2017, Chinese first rail transit automatic operation line, Beijing Metro Yanfang Line opened for operation. At present, the trains on the Yanfang line use contact-type obstacle detection devices, that is, the detection devices installed at the front end of the train automatically push the obstacles during operation or the trains generate emergency braking to avoid serious accidents. However, this method does not combine the automatic detection system such as machine vision to detect the obstacles on the track surface. For large obstacles, there is still a certain safety risk in the train operation. Therefore, it is urgent to develop an automatic detection system for track obstacles with both accuracy and real-time. Among them, the identification of the rails and the planning of the safety warning area are the basis of the automatic detection system for the track obstacles. How to adapt to the complex environment can accurately identify the rails, which requires the detection algorithm to have a good adaptability to the actual variable environment, and also can identify the rails well in the noise-contaminated environment. At present, the basic methods used for the identification and extraction of railroad tracks mainly include classical edge detection [1] and mathematical morphology edge detection method [2]. The former uses the grayscale change of the image to detect the edge of the object in the picture. When the edge of the rail and the gray value of the environment are not obvious, there is a large detection error and the requirements for the illumination condition are very high. The latter is a poor test effect when the illumination changes significantly, there is object occlusion or the environment changes. Yanqi Wang et al. [3,4,5] used the improved Canny operator to detect the rails and achieved good results. However, due to background factors and changes in illumination, there are still environmental problems that are not adaptable, and the data they use are straight. Rails, single data. Rui Li et al. [6] combined edge extraction with Hough transform to propose a linear rail identification algorithm. The disadvantage is that it is only applicable to straight rails and cannot recognize corners. Yafan Song et al. [7] can solve the influence of illumination transformation on railway identification by introducing environmental suppression factors, but the recognition effect on the presence of obstructions on the rail surface is not good.

(2)

transformation, deformation and illumination. It effectively overcomes the difficulties brought by the complex background environment, and can adaptively construct feature descriptions driven by training data, with higher flexibility and generalization ability.

R-CNN [8], which was introduced in 2013, combines traditional machine learning and deep learning theory effectively, creating a precedent in the field of deep learning, and proposes a series of classical algorithms such as selective search [9]. Subsequent deep learning networks such as SPP-Net [10], Fast RCNN [11], YOLO [12], SSD [13], Faster R-CNN [14], and Mask R-CNN [15] have appeared. Among them, Mask R-CNN is based on the improvement of Faster R-CNN, which is the combination of Faster R-CNN and FCN (Fully Convolution Network). And Faster R-CNN integrates the above-mentioned target detection optimization network except Mask R-CNN into the same deep learning convolutional neural network framework, which makes Mask R-CNN achieve better measurement results under stricter positioning metrics. To this end, this paper uses the Mask R-CNN target detection algorithm and combines a large number of rail sample data sets to construct a rail target detection system.

Railway Target Detection System Architecture

[image:2.595.148.451.393.494.2]

The railway detection system architecture designed in this paper is shown in Figure. 1. First, select the appropriate training samples from the rail dataset and label the rails in the sample. Then the labeled samples are extracted by CNN, trained in the RPN network until the network converges, and then the convolution parameters obtained by the RPN training are input to the Mask-RCNN for training until the expected effect is achieved. Finally, the parameters of each network are input into the model for the identification of the railways and the division of the dangerous areas.

Figure 1. Rail detection system. Convolutional Neural Network

In the field of computer vision, the typical network structure used for deep learning is the Convolutional Neural Network (CNN). CNN has displacement invariance, scale invariance and other forms of distortion invariance in image target detection. Since CNN's feature detection layer learns through training data, explicit feature design and extraction are avoided when CNN is used, and learning is implicitly learned from training data. CNN uses the convolution layer and the sampling layer to be alternately set, so that the convolution layer extracts the features and then combines them to form a more abstract feature for the description of the picture object. Finally, all parameters are normalized into a one-dimensional array to form a fully connected layer for target feature training or detection.

Mask R-CNN Algorithm

Mask R-CNN is an instance segmentation model that not only determines if there are railways in the image, but also determines the location of the railways in the image and plans for dangerous areas. One of the features of Mask R-CNN is that it can color the pixels in the window that represent the outline of the object. Coloring can help the subway to clear where the dangerous area is on the railway and where it is a safe area to avoid collisions.

(3)

[image:3.595.204.393.108.203.2]

+ window), and the latter is responsible for determining dangerous areas. As shown in Figure. 2, in order to improve efficiency, the region proposal network is selected to extract proposal boxes.

Figure 2. Mask R-CNN structure.

Region Proposal Network (RPN). The RPN network is mainly used to generate region proposals. First, a bunch of Anchor box is generated. After cutting and filtering, it is judged by softmax that the anchors belong to the foreground or the background, that is, the rails or not the rails, and realize a two classification; At the same time, another branching box is revised to form an accurate anchor. The Feature Map extracted by CNN enters the RPN. After a 3×3 convolution, the information in the feature map is further concentrated to realize the identification of the railway.

The RPN takes an arbitrary size picture as input and outputs a batch of rectangular area candidate frames, and each area corresponds to a probability value and position information of a target existence. A 3 × 3 convolution kernel is used on this feature map to convolve with the feature map. This 3 × 3 region is convoluted to obtain a three-dimensional matrix data structure. Because this feature map obtains a one-dimensional vector and 256 feature maps in this 3×3 region, a 256-dimensional feature vector can be obtained. Each anchor corresponds to five dimensions (32, 64, 128, 256, 512) of the input image prediction, and fifteen regions of three aspect ratios (1:1, 1:2, 2:1). That is, each anchor can generate fifteen proposal boxes. It is input to two fully connected layers, the classification layer and the regression layer, for classification and bounding box. Finally, according to the probability of the region proposal, the first 1000 region proposals are selected as the input of the Faster R-CNN for target detection. Schematic diagram of the RPN structure, as shown in Figure. 3, where the proposal box k=15.

Figure 3. RPN structure diagram.

Loss Function of Region Proposal. The existence of the loss function is to test the strengths and weaknesses of the training model. In order to train the RPN, each anchor must correspond to a binary label, that is, whether it is a railway. For the mask branch, the mask branch uses the FCN for each ROI with a 1×28×28 dimension output, that is, a binary mask with a resolution of 28×28, and a 28×28 feature layer is selected. Apply the sigmoid function to each pixel and then calculate the loss function of the mask branch. The cross entropy cost function is selected when calculating the loss function Lmask of the mask:

1

[ ln (1 ) ln(1 )]

x

C y a y a

n

 



   (1)

Where: x represents the sample and n represents the total number of samples.

[image:3.595.219.385.484.591.2]

(4)

the gradient, the faster the parameter adjustment, and the faster the training, for the following reasons: Find the gradient of the parameter w:

'

1 (1 ) 1 (1 )

( ) ( ) ( )

( ) 1 ( ) ( ) 1 ( )

( )

1 1

( ( ) ) ( ( ) )

( )(1 ( ))

j j x x j j j x x

C y y y y

z x

w n z z w n z z

z x

z y x z y

n z z n

 _           _{ } _   _{ } _          



(2) Where: σ’(z)= σ(z)(1-σ(z)); σ(z)-y Indicates the error between the output value and the actual value.

It can be known from the above formula that the original σ’(z) in the gradient of w is eliminated. According to σ(z)-y, we can know that the larger the error, the larger the gradient, the faster the parameter w is adjusted, and the faster the training speed. Similarly, the gradient of b is:

1

( ( ) )

x C

z y

b n 



 





₍₃₎

The entire loss function is:

({p },{t })_i _i _mask

LL L

(4) where: L({pi},{ti}) is the loss function in the faster rcnn, L({pi},{ti}) mainly consists of a

classification loss function and a regression loss function:

1 1

({p },{t })i i cls(p , p )i i i reg(t , t )i i

i i

cls reg

L p L

N N

L 



 



 

(5)

(p , p ) log[p p (1 p )(1 p )]

cls i i i i i i

L         ₍₆₎

(t , t ) (t t )

reg i i i i

L  R  

(7) In the above formula, pi represents the predicted probability of the i-th target, and the value of p*

iis related to the positive and negative samples. If the i-th target is a positive sample, it takes 1; otherwise, it takes 0. ti represents the four parameter coordinates of the predicted candidate frame,

and t* iis the coordinate vector of the candidate frame when the sample is positive：

( ) / ( ) / log( / ) log( / )

x a a y a a w a h a

t x x w t y y h t w w t h h

t x x w t y y h t w w t h h

     

(8) Where x, y, w, and h respectively represent the central coordinate value and the width and height of the candidate frame predicted by the RPN network, and the central coordinate values and the width and height of the candidate frame when xa, ya, wa, and ha are positive samples.

Experimental Test

Experimental Dataset

(5)

In this paper, a total of 3000 pictures are selected as data sets, including straight lines, left and right corners, etc. As shown in Table 1, samples of different types of input pictures are shown in Figure. 4.

Table 1. Different data sets.

Types Total number of samples

Number of training samples

Number of test samples

straight 1000 800 200

left corner 1000 800 200

[image:5.595.74.521.132.261.2]

right corner 1000 800 200

Figure 4. Different training samples. Feature Extraction

Before applying Mask-RCNN for model training, the feature extraction is first. After CNN convolutional neural network, the features required by Mask-RCNN can be obtained, and the background is separated from the rails, so that the background part is less active. Thereby extracting the key feature information of the railroad track, the network structure of the whole process mainly consists of five parts. conv1, conv2 learn some basic information such as the color and edge of the rail track; conv3 learns the more complex feature texture. For example, the material texture features of the rail; conv4 is a more representative feature, such as the characteristics of straight and curved; conv5 is a comprehensive study of the characteristics of the entire railway.

Experimental Results and Analysis

[image:5.595.84.522.630.681.2]

By identifying the above-mentioned model to detect the recognition performance of the rail, and comparing it with the traditional rail detection algorithm, Figure. 5 is the traditional railway extraction result. It can be seen from the figure that there is no extraction and leakage extraction in the railway extraction due to illumination, which makes the risk area division have a large error, and the proposed depth detection based rail detection algorithm can accurately identify the position of the railway. The dangerous area of the planning office, as shown in Figure. 6, can be completed accurately when iterating to 30,000 times, with an accuracy rate of 99.9%. Therefore, the method of this paper can meet the needs of the unmanned subway and ensure the safe operation of the train.

Figure 5. Traditional railway extraction results((a). Original Image;(b). Edge extraction;(c). Railway extraction)

Figure 6. Recognition result.

Conclusion

(6)

It is proved by experiments that the model built by this method is effective. The algorithm is simple and convenient, and only needs to input video frames or corresponding images to quickly and accurately identify the rails, and plan dangerous areas. It can meet the needs of obstacle detection of unmanned urban railway trains and provide guarantee for the safe operation of unmanned urban railway trains.

References

[1]Shihong Shai. Rail identification based on edge detection[J]. Railway Computer Applications, 2009, 18(4): 1-3.

[2]Xia Zhang, Jianwu Dang, Hongfeng Ma. Rail image edge detection method based on mathematical morphology of double structure elements[J]. Railway Computer Applications, 2011, 20(5): 28−31.

[3]Dandan Li, Tao Hou, Shipeng Wei. Rail Edge Detection Method Based on Improved Canny Operator[J]. Television Technology, 2015, 39(8): 55−58.

[4]Yanwei Wang, Peiqi Li. Rail Edge Detection Based on Improved Canny Algorithm[J]. Railway Communication Signals, 2015, 51(2): 75−78.

[5]Fei Peng, Weirong Chen, Bobo Chao, et al. Method of rail edge extraction based on Canny edge detection and aggregation method[J]. Journal of the China Railway Society, 2012, 34(2): 52−57.

[6]Rui Li, Xiaochun Wu. Automatic Identification of Linear Rails Based on Digital Image Processing[J]. Television Technology, 2014, 38(3): 167-169.

[7]Yafan Song, Difu Pan, Wei Han. Rail identification method based on visual saliency in complex environment[J]. Journal of Railway Science and Engineering, 2018, 15(4): 871-879.

[8]Girshick R, Donahue J, Darrell T, et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation[J]. 2013:580-587.

[9]Uijlings, J. R R, Sande V D, et al. Selective Search for Object Recognition[J]. International Journal of Computer Vision, 2013, 104(2):154-171.

[10]He K, Zhang X, Ren S, et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2014, 37(9):1904-16.

[11]Girshick R. Fast R-CNN[C]//IEEE International Conference on Computer Vision. IEEE, 2015:1440-1448.

[12]Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real-Time Object Detection[C]// Computer Vision and Pattern Recognition. IEEE, 2016:779-788.

[13]Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[C]//European Conference on Computer Vision. Springer, Cham, 2016:21-37.

[14]Ren S, He K, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[C]//International Conference on Neural Information Processing Systems. MIT Press, 2015:91-99.