
Grasping the basics of R-CNN, R-FCN and SSD models

Even if you have a clear idea of how a CNN manages to classify an image, it may be less obvious how a neural network can localize multiple objects in an image by defining a bounding box for each of them (a rectangular perimeter enclosing the object itself). The first and easiest solution you might imagine is to slide a window across the image and apply the CNN to each window, but that is too computationally expensive for most real-world applications (if you are powering the vision of a self-driving car, you want it to recognize the obstacle and stop before hitting it).

You can find more about the sliding windows approach for object detection in this blog post by Adrian Rosebrock, which gives an effective example by pairing it with an image pyramid: https://www.pyimagesearch.com/2015/03/23/sliding-windows-for-object-detection-with-python-and-opencv/
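To get a feel for the cost, here is a minimal sketch that only counts how many windows an exhaustive search would hand to the CNN. The window size, stride, pyramid scale factor, and function names are illustrative assumptions, not values taken from the blog post above.

```python
def sliding_windows(width, height, window=64, stride=16):
    """Yield (x0, y0, x1, y1) boxes that tile an image of the given size."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield (x, y, x + window, y + window)


def pyramid_sizes(width, height, scale=1.5, min_size=64):
    """Yield the progressively smaller (width, height) levels of an image pyramid."""
    while width >= min_size and height >= min_size:
        yield (width, height)
        width, height = int(width / scale), int(height / scale)


def count_cnn_calls(width, height):
    """Count the windows over all pyramid levels (one CNN call per window)."""
    return sum(
        sum(1 for _ in sliding_windows(w, h))
        for w, h in pyramid_sizes(width, height)
    )


# Number of CNN evaluations needed for a single 640 x 480 frame.
print(count_cnn_calls(640, 480))
```

Even with these fairly coarse settings, a single small frame requires well over a thousand CNN evaluations, and a smaller stride makes the count grow quickly.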

Though reasonably intuitive, the sliding window approach has quite a few limits, because it is complex and computationally cumbersome (it is exhaustive and has to work at different image scales). A preferred alternative was soon found in region proposal algorithms. Such algorithms use image segmentation (that is, dividing the image into areas based on the main color differences between the areas themselves) in order to create a tentative enumeration of the possible bounding boxes in an image. You can find a detailed explanation of how the algorithm works in this post by Satya Mallick: https://www.learnopencv.com/selective-search-for-object-detection-cpp-python/. The point is that region proposal algorithms suggest a limited number of boxes to be evaluated, far fewer than the number proposed by an exhaustive sliding windows algorithm. That allowed them to be applied in the first R-CNN (region-based convolutional neural network), which worked by:

Finding a few hundred or a few thousand regions of interest in the image, thanks to a region proposal algorithm
Processing each region of interest with a CNN, in order to create features for each area
Using the features to classify each region with a support vector machine, and using a linear regression to compute more precise bounding boxes
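The three steps above can be sketched as a plain loop. This is only a structural illustration: `rcnn_detect` and the four callables it receives are hypothetical placeholders, and the toy stand-ins below replace the real region proposal algorithm, CNN, SVM, and regressor.

```python
def rcnn_detect(image, propose_regions, cnn_features, classify, refine_box):
    """Run the three R-CNN stages on one image (all callables are placeholders)."""
    detections = []
    for box in propose_regions(image):        # step 1: regions of interest
        features = cnn_features(image, box)   # step 2: one CNN call per region
        label, score = classify(features)     # step 3: classifier (an SVM in R-CNN)
        detections.append((refine_box(box, features), label, score))
    return detections


# Toy stand-ins, only meant to make the control flow runnable.
calls = {"cnn": 0}

def toy_proposals(image):
    return [(0, 0, 10, 10), (5, 5, 20, 20), (30, 30, 40, 40)]

def toy_features(image, box):
    calls["cnn"] += 1
    x0, y0, x1, y1 = box
    return [(x1 - x0) * (y1 - y0)]            # box area as a fake feature

def toy_classify(features):
    return ("large", 0.9) if features[0] > 150 else ("small", 0.6)

def toy_refine(box, features):
    return box                                # a real regressor would adjust the box

results = rcnn_detect(None, toy_proposals, toy_features, toy_classify, toy_refine)
print(results)
print("CNN calls:", calls["cnn"])
```

The counter makes the bottleneck visible: the CNN runs once for every proposed region.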

The immediate evolution of R-CNN was Fast R-CNN, which made things even speedier because:

It processed the whole image at once with the CNN, transforming it into a feature map, and applied the region proposals to that transformation. This cut down the CNN processing from a few thousand calls to a single one.
Instead of using an SVM for classification, it used a softmax layer (together with a linear regression for the bounding boxes), thus simply extending the CNN instead of passing the data to a different model.
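In the Fast R-CNN paper, the step that reads a fixed-size set of features for each region out of the shared feature map is called RoI pooling. The following is a minimal single-channel sketch of that idea in plain Python. The function name, the stride of 16, and the 2 x 2 output grid are assumptions made for illustration, and each box is assumed to lie inside the image.

```python
def roi_max_pool(feature_map, roi, stride=16, out_size=2):
    """Max-pool the part of a 2-D feature map under `roi` into a fixed grid.

    roi is (x0, y0, x1, y1) in image pixels; stride is the assumed
    downscaling factor between the image and the feature map.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    # Project the box from image coordinates onto the feature map.
    fx0, fy0 = roi[0] // stride, roi[1] // stride
    fx1 = min(cols, max(fx0 + 1, -(-roi[2] // stride)))  # ceiling division
    fy1 = min(rows, max(fy0 + 1, -(-roi[3] // stride)))
    pooled = []
    for i in range(out_size):
        r0 = fy0 + (i * (fy1 - fy0)) // out_size
        r1 = max(r0 + 1, fy0 + ((i + 1) * (fy1 - fy0)) // out_size)
        row = []
        for j in range(out_size):
            c0 = fx0 + (j * (fx1 - fx0)) // out_size
            c1 = max(c0 + 1, fx0 + ((j + 1) * (fx1 - fx0)) // out_size)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Pretend this 4 x 4 grid is the CNN output for a 64 x 64 image (computed once).
feature_map = [[0, 1, 2, 3],
               [4, 5, 6, 7],
               [8, 9, 10, 11],
               [12, 13, 14, 15]]

# Every proposed region reuses the same feature map.
print(roi_max_pool(feature_map, (0, 0, 64, 64)))    # [[5, 7], [13, 15]]
print(roi_max_pool(feature_map, (32, 32, 64, 64)))  # [[10, 11], [14, 15]]
```

Whatever the size of the proposed box, the pooled output has the same shape, which is what lets one set of layers after the pooling step handle every region.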

In essence, with Fast R-CNN we have again a single classification network, characterized by a special filtering and selecting layer, the region proposal layer, which is based on a non-neural network algorithm. Faster R-CNN changed even that layer, replacing it with a region proposal neural network. That made the model even more complicated, but also more effective and faster than any previous method.

R-FCNs, however, are even faster than Faster R-CNN because they are fully convolutional networks, which don't use any fully connected layers after their convolutional layers. They are end-to-end networks: from the input, through convolutions, to the output. That makes them even faster (they have far fewer weights than a CNN with a fully connected layer at its end). But their speed comes at a price: they are no longer characterized by image invariance (a CNN can figure out the class of an object, no matter how the object is rotated). R-FCN compensates for this weakness with position-sensitive score maps, which are a way to check whether parts of the original image processed by the FCN correspond to parts of the class to be classified. In simple words, they don't compare against classes, but against parts of classes. For instance, they don't classify a dog, but a dog-upper-left part, a dog-lower-right part, and so on. This approach makes it possible to figure out whether there is a dog in a part of the image, no matter how it is oriented. Clearly, this speedier approach comes at the cost of less precision, because position-sensitive score maps cannot make up for all the original CNN characteristics.
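As a rough illustration of how the "parts of classes" idea can turn into a score, the sketch below splits a region into k x k bins, lets each bin read only the score map of its own part, and averages the results. The function names, the toy 4 x 4 maps, and the choice of k are assumptions for this example. A real R-FCN produces such maps for every class with its last convolutional layer.

```python
def position_sensitive_score(score_maps, roi, k=3):
    """Vote a class score for `roi` from k * k position-sensitive score maps.

    score_maps[i * k + j] is the 2-D map for the part in bin row i, column j
    (for instance, map 0 would respond to a "dog upper-left" part).
    roi is (x0, y0, x1, y1) in feature-map cells, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = roi
    bin_scores = []
    for i in range(k):
        r0 = y0 + (i * (y1 - y0)) // k
        r1 = max(r0 + 1, y0 + ((i + 1) * (y1 - y0)) // k)
        for j in range(k):
            c0 = x0 + (j * (x1 - x0)) // k
            c1 = max(c0 + 1, x0 + ((j + 1) * (x1 - x0)) // k)
            part_map = score_maps[i * k + j]   # each bin reads only its own map
            cells = [part_map[r][c]
                     for r in range(r0, r1) for c in range(c0, c1)]
            bin_scores.append(sum(cells) / len(cells))
    return sum(bin_scores) / len(bin_scores)   # the bins vote by averaging


def quadrant_map(i, j, size=4):
    """Toy score map that fires only in quadrant (i, j) of a size x size grid."""
    half = size // 2
    return [[1.0 if (r // half, c // half) == (i, j) else 0.0
             for c in range(size)] for r in range(size)]


# Four toy maps for one class: each object part sits in its own quadrant.
maps = [quadrant_map(i, j) for i in range(2) for j in range(2)]

print(position_sensitive_score(maps, (0, 0, 4, 4), k=2))  # box matching the object
print(position_sensitive_score(maps, (2, 2, 4, 4), k=2))  # box covering one corner
```

The box whose parts line up with the maps gets the full score, while the box covering a single corner scores much lower, because most of its bins find nothing in their own part map.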

Finally, we have SSD (Single Shot Detector). Here the speed is even greater because the network simultaneously predicts the bounding box location and its class as it processes the image. SSD skips the region proposal phase altogether and simply computes a large number of bounding boxes. It then discards highly overlapping boxes, but it still processes the largest number of bounding boxes of all the models we have mentioned so far. Its speed comes from the fact that, as it delimits each bounding box, it also classifies it: by doing everything in one shot, it is the fastest of these models, while its detection performance remains quite comparable.
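Discarding highly overlapping boxes is commonly done with non-maximum suppression based on the intersection over union (IoU) of two boxes. The following is a minimal greedy sketch of that technique, not SSD's actual implementation. The function names and the 0.5 threshold are assumptions, and a real detector would typically apply the procedure separately for each class.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the best-scoring boxes, dropping any that overlap a kept box too much.

    detections is a list of (box, score) pairs.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


detections = [
    ((10, 10, 50, 50), 0.9),
    ((12, 12, 52, 52), 0.8),      # nearly the same box as the first one
    ((100, 100, 140, 140), 0.7),
]
print(non_max_suppression(detections))
```

Here the second box is dropped because it overlaps the higher-scoring first box almost completely, while the third box, which is far away, survives.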

A short article by Joyce Xu can provide you with more details on the detection models we have discussed so far: https://towardsdatascience.com/deep-learning-for-object-detection-a-comprehensive-review-73930816d8d9

To sum up the discussion, when choosing a network you have to consider that you are combining a CNN architecture, with its own classification power and network complexity, and a detection model. It is their combined effect that determines the capability of the network to spot objects, to classify them correctly, and to do all that in a timely fashion.

If you want more references on the speed and precision of the models we have briefly explained, you can consult: Speed/accuracy trade-offs for modern convolutional object detectors. Huang J, Rathod V, Sun C, Zhu M, Korattikara A, Fathi A, Fischer I, Wojna Z, Song Y, Guadarrama S, Murphy K, CVPR 2017: http://openaccess.thecvf.com/content_cvpr_2017/papers/Huang_SpeedAccuracy_Trade-Offs_for_CVPR_2017_paper.pdf. Still, we can only advise you to test them in practice on your own application, evaluating whether they are good enough for the task and whether they execute in a reasonable time. After that, it is a matter of choosing the trade-off that best suits your application.
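For the timing part of such a test, a minimal sketch could look like the following. The `detector` argument is a hypothetical callable that wraps whatever model you are evaluating and takes one image. The helper assumes a non-empty list of images and measures only wall-clock time, not detection quality.

```python
import time


def mean_latency(detector, images, warmup=1):
    """Return the average number of seconds `detector` takes per image."""
    # Warm-up calls keep one-off start-up costs out of the measurement.
    for image in images[:warmup]:
        detector(image)
    start = time.perf_counter()
    for image in images:
        detector(image)
    return (time.perf_counter() - start) / len(images)


# Stand-in detector that returns one fixed (box, label, score) detection.
def dummy_detector(image):
    return [((0, 0, 1, 1), "object", 1.0)]


latency = mean_latency(dummy_detector, [None] * 100)
print(f"{latency * 1000:.4f} ms per image")
```

Running the same helper over each candidate model on your own images and hardware gives figures that can be compared directly with the latency your application can tolerate.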