In addition to building the distributed architectures of the Flower Counter application, an assessment was needed to determine how well their outcomes satisfied the goal of achieving better accuracy. Although many viable metrics could have been used, the choices were kept simple to avoid ambiguity and repetition. In certain cases, a single metric was not enough to demonstrate the quality of the results; in those cases, several metrics were used together to quantify the practicality of the solution. The metrics used in this project are divided into two categories: Metrics for demonstrating Performance (Scalability) and Metrics for demonstrating Accuracy.
Metrics for demonstrating Performance (Scalability)
In order to assess the scalability of the local and the distributed versions of the Flower Counter application, three different metrics were chosen. The purpose and description of each metric are given in the following bullet points.
• Total Completion Time: The primary objective of scaling a CNN application in a distributed environment is to reduce the long training time. To measure the capability of the distributed solutions, the total elapsed time from the beginning of the application to its end was recorded. The time it takes to finish training the whole model is a good indicator of how fast the distributed version is. To keep track of the total completion time, Python's time.time() function was called at the application's starting and ending points, and the duration was calculated by subtracting the start time from the end time. The time.time() function returns a floating-point number that represents the time in seconds since the epoch, that is, the point where time starts. For Unix, the epoch is January 1, 1970, 00:00:00 (UTC) (footnote 6). For every distributed experiment, the total completion time was measured as the time taken by the slowest worker to finish training.
• Throughput: The scalability of a CNN application has been measured by throughput in much recent literature [3],[6],[21],[28],[29]. Throughput was measured as the number of samples processed per second. For the Flower Counter application, the throughput was calculated in every training step by dividing the number of images processed in that step by the duration of the step. Keeping track of the throughput of every training step helps to detect potential bottlenecks by isolating any specific training step that takes suspiciously long. For each of the performance-related experiments, an average throughput was calculated at the very end of training to show the scalability of the Flower Counter application.
• GPU utilization: The scalability of a distributed model depends mostly on how the resources are utilized during training. Underutilization of resources often leads to poor scalability. In TensorFlow, the compute-intensive parts are preferably scheduled on GPUs (footnote 7), so it is crucial to keep track of GPU resource usage during training. Profiling GPU resources helps to pinpoint performance issues. For example, if the GPU utilization does not approach 80-100% during a long-running training step, the bottleneck may be caused by the input pipeline (footnote 8) or by synchronization among workers. Profiling the GPU also gives insight into the data: if the GPU utilization rises above 80% during long training steps, the images processed in those steps require more computation than other images. Note that GPU utilization does not represent the amount of resources being used; it reports the percentage of time during which one or more GPU kernels are active (footnote 9).
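The completion-time and throughput measurements described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the names timed_training, step_fn, batches, average_throughput, and slowest_worker_time are hypothetical, and taking the arithmetic mean of the per-step throughputs is an assumed way of averaging.

```python
import time


def timed_training(step_fn, batches):
    """Run step_fn on every batch; return (completion_time, per-step throughputs).

    step_fn and batches are hypothetical placeholders for the real
    training step and input pipeline.
    """
    throughputs = []
    start = time.time()                       # recorded at the starting point
    for batch in batches:
        step_start = time.time()
        step_fn(batch)                        # one training step
        step_duration = time.time() - step_start
        # Images processed in this step divided by the step duration.
        # The tiny floor only guards against a zero reading from a coarse clock.
        throughputs.append(len(batch) / max(step_duration, 1e-9))
    completion_time = time.time() - start     # end time minus start time (seconds)
    return completion_time, throughputs


def average_throughput(throughputs):
    """Arithmetic mean of per-step throughputs (assumed averaging method)."""
    return sum(throughputs) / len(throughputs)


def slowest_worker_time(worker_times):
    """Completion time of a distributed run: the slowest worker's time."""
    return max(worker_times)


if __name__ == "__main__":
    # Toy stand-in for training: each "step" just sleeps for 10 ms.
    fake_batches = [[0] * 4 for _ in range(3)]
    elapsed, per_step = timed_training(lambda b: time.sleep(0.01), fake_batches)
    print("completion time (s):", elapsed)
    print("average throughput (images/s):", average_throughput(per_step))
    print("slowest worker (s):", slowest_worker_time([10.0, 12.5, 11.0]))
```

Keeping the per-step values, rather than only a running total, is what makes it possible to single out an unusually slow step afterwards.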
The aforementioned metrics are believed to be sufficient to evaluate the scalability of the local and distributed versions of the Flower Counter application. During the preliminary stages of building the distributed version, GPU memory utilization was also considered as a performance metric. However, to prevent memory fragmentation, TensorFlow by default maps nearly all of the memory of the selected GPU to the running process, so the memory utilization remained constant throughout training. Note that, although this is the default behavior, TensorFlow also offers the option of allocating only a subset of the available GPU memory to each process. This configuration must be set manually before training. Once it is activated, TensorFlow allows the allocation to grow if the running process needs more memory (footnote 10).
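To illustrate why GPU utilization is informative while GPU memory utilization is not, the sketch below parses readings in the CSV format produced by nvidia-smi. It assumes that nvidia-smi is the source of the readings, which the text does not state explicitly; the function names and the sample readings are made up for illustration. The query fields utilization.gpu, memory.used, and memory.total are existing nvidia-smi fields.

```python
import subprocess

# Existing nvidia-smi query fields; using them this way is an assumption.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`.

    Each non-empty line describes one GPU: utilization (%), memory used (MiB),
    memory total (MiB). Assumes all three fields are numeric.
    """
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (float(field) for field in line.split(","))
        stats.append({
            "gpu_util_pct": util,
            "mem_util_pct": 100.0 * used / total,
        })
    return stats


def sample_gpus():
    """Query nvidia-smi once (requires an NVIDIA driver on the machine)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_gpu_stats(out)


if __name__ == "__main__":
    # Made-up readings of a single GPU taken at two different moments.
    readings = ["35, 10900, 11178", "92, 10900, 11178"]
    for reading in readings:
        print(parse_gpu_stats(reading)[0])
```

In the made-up readings, the kernel-activity percentage changes between samples while the memory figure stays the same, which mirrors the behavior described above when TensorFlow reserves the memory up front.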
Metrics for demonstrating Accuracy
Although the primary objective of this project is to construct a scalable version of the Flower Counter application, it is also necessary to examine how accurate the application is. Given an image of a canola field, the application is expected to produce an estimate of the number of flowers present in the image. To measure the accuracy of the prediction, three different metrics were chosen. The description of each metric is given below:
• Mean Absolute Error (MAE): Mean Absolute Error is the average absolute difference between the predicted value and the true value. This simple metric is a good indicator of the average magnitude of the errors, and it has been used alongside other metrics to evaluate the accuracy of CNN-based applications in previous studies [23],[38],[43].
• Root Mean Squared Error (RMSE): Root Mean Squared Error is the square root of the average of the squared differences between the predicted value and the true value. In contrast to MAE, RMSE is affected by the variance of the frequency distribution of error magnitudes: when the variance increases, RMSE also increases. Both MAE and RMSE are negatively oriented: lower values mean better results.
• Relative Error (RE): The Relative Error is the absolute error between the predicted value and the true value, divided by the magnitude of the true value. RE puts an error into perspective when the values being compared span widely differing ranges. For example, an absolute error of three is significant if the true number of flowers is five, but the same error is acceptable if the true number of flowers is more than two hundred.
7 https://www.TensorFlow.org/guide/using_gpu (accessed September 13, 2019)
8 https://www.TensorFlow.org/guide/performance/overview (accessed September 13, 2019)
9 https://stackoverflow.com/questions/40937894/nvidia-smi-volatile-gpu-utilization-explanation/40938696#40938696 (accessed September 13, 2019)
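The three accuracy metrics defined in the list above can be written out as a short Python sketch. The function names and the flower counts are illustrative assumptions rather than values from the experiments, and the relative error is computed per image because the text does not say how it was aggregated.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over all images."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def root_mean_squared_error(predicted, actual):
    """Square root of the average squared difference."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(actual))


def relative_error(predicted, actual):
    """|predicted - actual| / |actual| for one image; assumes actual != 0."""
    return abs(predicted - actual) / abs(actual)


if __name__ == "__main__":
    actual = [100, 100]
    even_errors = [103, 103]      # errors of 3 and 3
    uneven_errors = [101, 105]    # errors of 1 and 5

    # Same MAE, but RMSE grows with the variance of the error magnitudes.
    print(mean_absolute_error(even_errors, actual))        # 3.0
    print(mean_absolute_error(uneven_errors, actual))      # 3.0
    print(root_mean_squared_error(even_errors, actual))    # 3.0
    print(root_mean_squared_error(uneven_errors, actual))  # sqrt(13), about 3.606

    # An absolute error of three, relative to five and to two hundred flowers.
    print(relative_error(8, 5))      # 0.6
    print(relative_error(203, 200))  # 0.015
```

The two prediction lists have the same MAE, so the difference in RMSE comes only from how unevenly the errors are distributed.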
Periodically testing the accuracy on the validation set during training is a common practice in neural network applications. It helps to monitor how well the application is learning on data that the model has not seen before. For the proposed application, however, it was decided to measure the accuracy on the validation set only after the model had finished training, rather than periodically during training. The reason behind this decision is to exclude any computation that is not part of training from the measured completion time. Because the completion time was used as a metric to judge the scalability of the distributed solution, it was ensured that no computation other than training the model took place during the run time of the application. Once the application finished running (after training for the specified number of epochs), the trained model was loaded from the last checkpoint and run on the validation set. The progression of accuracy on the training dataset was nevertheless observed using MAE in each training step. Finally, the inference time of the CNN models was compared with that of the previous Flower Counter, using the same methodology for calculating the Total Completion Time as described in 4.2.