Reducing Overfitting - Models for Pedestrian Trajectory Prediction and Navigation in Dynamic En

Overfitting is the term used to describe models that effectively memorize the training data, and then fail to generalize well to new datasets. Overfitting is an especially common problem in deep neural networks because the networks have a high capacity to learn relationships in the data that may be just noise in the training data rather than true patterns. In this work, four techniques were employed to reduce overfitting and improve the generalization performance of the networks: data augmentation, Dropout, weight regularization, and early-stopping.

5.2.1 Data Augmentation

The first method was to artificially expand the training data through data augmentation. Each frame of the original datasets was flipped in the x-direction, the y-direction, and both directions to produce three additional, equivalent representations of the trajectories. A visualization of the data augmentation strategy is shown in Figure 5.3. Using flipped coordinates enables the network to learn the relative movement of the pedestrians rather than the movement in just one direction.

Neural networks are most successful when they are presented with many unique samples. Many successful image classification networks use tens of thousands of images, and translation networks are often trained with hundreds of thousands of words or sentences. Data augmentation is a simple technique for increasing the number of training samples. The data augmentation in this thesis is relatively naive and only helps the network learn orientation-agnostic relationships. More involved methods, such as jittering the coordinates, could be beneficial, but these are left for future work.

Figure 5.3: Data Augmentation
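The flipping strategy described above can be illustrated with a minimal sketch. It assumes that the trajectories are stored as a NumPy array of shape (samples, frames, 2) holding (x, y) coordinates, and that a flip amounts to negating a coordinate about the origin; the function name `augment_with_flips` and this array layout are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np


def augment_with_flips(trajectories):
    """Return the original trajectories followed by three flipped copies.

    trajectories: array-like of shape (num_samples, num_frames, 2) with
    (x, y) coordinates. Assumption: a flip negates a coordinate about the
    origin, so the data should be expressed in a centered frame.
    """
    traj = np.asarray(trajectories, dtype=float)
    flip_x = traj * np.array([-1.0, 1.0])    # mirror the x coordinate
    flip_y = traj * np.array([1.0, -1.0])    # mirror the y coordinate
    flip_xy = traj * np.array([-1.0, -1.0])  # mirror both coordinates
    # Stack along the sample axis: the training set becomes four times larger.
    return np.concatenate([traj, flip_x, flip_y, flip_xy], axis=0)
```

Because each flip only changes signs, distances between consecutive positions are preserved, which is consistent with the goal of learning relative movement rather than movement in one direction.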

5.2.2 Dropout

Dropout[47] was the second method that was implemented to avoid overfitting. Dropout is a technique that randomly sets some of the activations in layers of the network to 0 during training. By setting some activations to 0, the model is forced to more fully utilize all parameters of the network. An alternative view of Dropout is that it causes the network to learn an ensemble of smaller (reduced) networks. Recently, the concept of Dropout has been used to quantify the uncertainty in a neural network[14]. Dropout has been applied with success to a diverse array of neural network architectures. Dropout is most readily applied to feed-forward networks that include fully-connected layers or convolutional layers. Applying Dropout to recurrent neural networks requires a careful approach. If Dropout is applied to the memory (hidden or cell state) of a recurrent neuron, then the performance of the model may degrade drastically, as information is no longer able to propagate through time properly. Thus it is typical to use Dropout only on the inputs or outputs of recurrent neurons.

The major source of overfitting in the models that were tested was the neighbor representation. During initial experiments, the networks would memorize configurations of neighbors to predict subsequent locations of an agent. To prevent this, Dropout was applied only to the neighbor tensor, with a keep probability of 50%. Therefore, approximately 50% of the values in the neighbor representation were set to zero during each batch of training. After the validation loss stopped decreasing, the Dropout was removed, and the models were trained further. In this way, the networks are forced to rely first on the sequence of previous positions of the agent before learning how neighbors might influence that trajectory.
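A minimal sketch of Dropout applied to a neighbor tensor is given below. The function name `neighbor_dropout`, the NumPy representation, and the rescaling of the kept values by 1/keep_prob (so-called inverted Dropout, a common convention) are assumptions made for illustration; the thesis does not state whether the retained values were rescaled.

```python
import numpy as np


def neighbor_dropout(neighbors, keep_prob=0.5, training=True, rng=None):
    """Randomly zero entries of the neighbor tensor during training.

    Assumption: kept entries are scaled by 1 / keep_prob so that the
    expected value of each entry is unchanged (inverted Dropout).
    """
    if not training or keep_prob >= 1.0:
        # Dropout disabled, e.g. at evaluation time or in the later
        # training phase after the validation loss has plateaued.
        return neighbors
    rng = np.random.default_rng() if rng is None else rng
    # Each entry is kept independently with probability keep_prob.
    mask = rng.random(np.shape(neighbors)) < keep_prob
    return np.asarray(neighbors) * mask / keep_prob
```

Calling the function with `training=False` corresponds to the second training phase described above, in which Dropout is removed and the full neighbor representation is passed to the network.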

5.2.3 Weight Regularization

Weight regularization is a well-known method that is used in linear regression and neural networks. Weight regularization adds an additional penalty (based on the magnitude of the weights) to the loss function. In L1 regularization, the penalty is computed using the absolute value of the weights, and in L2 regularization, the penalty is computed using the squared value of the weights. In both cases, the model then learns a set of weights that simultaneously maximize accuracy and minimize the absolute value or squared value of the weights. L1 regularization tends to produce sparse weight matrices where only some of the weights are significant and the others are close to zero. L2 regularization tends to produce weight matrices where all weights are relatively small, with few or no large weights. Weight regularization can be applied to any of the parameter matrices in a network, but it is typically not applied to the bias vectors.

For the neural networks considered in this thesis, weight regularization was used to avoid NaN (not a number) values. Generally, a NaN value occurs when a weight matrix has extremely large (or extremely small) values that cause an intermediate value to reach positive or negative infinity. NaN values cannot be used as parameters of a probability distribution, so it is essential that the neural network never introduce NaNs. By regularizing the weight matrices, no NaN values were produced. For all trained models, L2 regularization with a scale factor of 0.001 was applied to the weight matrices of the final probability parameters (not the bias vectors).
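The L2 penalty described above can be sketched as follows. The function names and the exact form of the penalty (scale times the sum of squared weights, with no additional factor of 1/2) are assumptions for illustration; frameworks differ on such constants, and the thesis does not specify which convention was used.

```python
import numpy as np


def l2_penalty(weight_matrices, scale=0.001):
    """Sum of squared weights over the given matrices, times scale.

    Bias vectors are deliberately not passed in, mirroring the text.
    """
    return scale * sum(float(np.sum(np.square(w))) for w in weight_matrices)


def regularized_loss(data_loss, weight_matrices, scale=0.001):
    """Total training loss: the data term plus the L2 weight penalty."""
    return data_loss + l2_penalty(weight_matrices, scale)
```

For example, a single matrix with entries 1, 2, 3, and 4 has a sum of squares of 30, so with a scale factor of 0.001 the penalty added to the loss is 0.03.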

5.2.4 Early-Stopping

All models were trained for approximately 100 epochs, although some were trained for fewer epochs if the loss failed to decrease after 400 batches. Additionally, early-stopping was implemented to reduce the chances of overfitting. After each epoch, the loss on a validation set was computed. If the loss on the validation set increased by more than a set threshold from the last evaluation, then training was halted. If a network is allowed to train forever, it may continue to improve its fit on the training data, but it may perform worse on new data. The early-stopping technique is one way of detecting the point at which the network stops learning generalizable rules and instead begins memorizing the randomness inherent in the training data.

While effective, early-stopping requires careful planning in order to ensure that the training is not stopped too soon. For the training of the models in this thesis, a slack of 0.3 was chosen. As long as the network never produced a validation error that was 0.3 worse than the best validation error, it was allowed to continue to train. This slack is especially important at the beginning of training, where there are still large changes to the weights. After the training was halted, the weights associated with the smallest validation error were used in the evaluation on the test data.

Figure 5.4: Autoencoding Architecture for Trajectories
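The early-stopping rule with a slack can be sketched as follows. The function name `early_stopping_point`, the use of a precomputed list of per-epoch validation losses, and the interpretation of the slack as an absolute difference in loss are assumptions for illustration; a real training loop would compute each loss after an epoch and save the weights whenever a new best value is reached.

```python
def early_stopping_point(validation_losses, slack=0.3):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops at the first epoch whose validation loss exceeds the
    best loss seen so far by more than `slack`. The weights from
    `best_epoch` are the ones that would be used for evaluation.
    """
    best_loss = float("inf")
    best_epoch = None
    stop_epoch = None
    for epoch, loss in enumerate(validation_losses):
        stop_epoch = epoch
        if loss < best_loss:
            # New best model: this is where the weights would be saved.
            best_loss, best_epoch = loss, epoch
        elif loss > best_loss + slack:
            # Validation loss is more than `slack` worse than the best.
            break
    return stop_epoch, best_epoch
```

With per-epoch validation losses of 2.0, 1.5, 1.6, 1.4, 1.9, and 1.2, the rise to 1.6 is tolerated because it is within 0.3 of the best value (1.5), while the rise to 1.9 exceeds the best value (1.4) by more than 0.3, so training stops at that epoch and the weights from the epoch with loss 1.4 are kept.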

In document Models for Pedestrian Trajectory Prediction and Navigation in Dynamic Environments (Page 89-93)