Feature Pyramid Networks for Object Detection
source: Feature Pyramid Networks for Object Detection (paper link)
Table of contents
- Introduction
- Feature Pyramids structure
- Network details
- Applications
- Experiments
- Conclusion
Introduction
In this paper, the authors exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids at marginal extra cost. Using FPN in a basic Faster R-CNN system, the method achieved the best single-model results on the COCO detection benchmark at the time of publication (2017), surpassing all existing single-model entries.
Feature Pyramids structure
The advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.
In a featurized image pyramid, we run the network on a pyramid of images. Since we use multiple rescaled copies of the image to produce a prediction, inference time increases several-fold. This makes computation slow and the model impractical for real applications.
The single feature map approach uses only single-scale features for faster detection. This is the basic structure of recent CNN detectors and has a short computational time. However, it cannot achieve the most accurate results because the final low-resolution feature map discards much of the fine spatial information carried by the lower-level features.
An alternative is to reuse the pyramidal feature hierarchy computed by a CNN as if it were a featurized image pyramid. However, the prediction for each scale is made independently, and the high-resolution maps are built from low-level features that are semantically weak, which hurts their ability to recognize objects.
The Feature Pyramid Network (FPN) addresses both problems and is fast and accurate. This model leverages the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales. It combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections.
Network details
Bottom-up pathway
The bottom-up pathway is the feedforward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. There are often many layers producing output maps of the same size, and we say these layers are in the same network stage. The output of the last layer of each stage is chosen as the reference set of feature maps, denoted {C2, C3, C4, C5} for a ResNet backbone.
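As a concrete illustration, below is a minimal sketch of the bottom-up pathway assuming a torchvision ResNet-50 backbone (the specific backbone choice and the weights=None argument are assumptions for the example, not details from the paper): the last feature map of each residual stage is kept as {C2, C3, C4, C5}, and the spatial resolution halves from one stage to the next.

```python
# Sketch of the bottom-up pathway, assuming a torchvision ResNet-50 backbone.
# The last feature map of each stage (C2..C5) is kept as the reference set;
# spatial resolution halves from one stage to the next (scaling step of 2).
import torch
import torchvision

backbone = torchvision.models.resnet50(weights=None)

def bottom_up(x):
    x = backbone.conv1(x)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    c2 = backbone.layer1(x)   # stride 4 w.r.t. the input image, 256 channels
    c3 = backbone.layer2(c2)  # stride 8, 512 channels
    c4 = backbone.layer3(c3)  # stride 16, 1024 channels
    c5 = backbone.layer4(c4)  # stride 32, 2048 channels
    return c2, c3, c4, c5

c2, c3, c4, c5 = bottom_up(torch.randn(1, 3, 224, 224))
print([t.shape for t in (c2, c3, c4, c5)])
```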
Top-down pathway
We upsample the spatially coarser feature map by a factor of 2 (using nearest-neighbor upsampling for simplicity). The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce its channel dimensions) by element-wise addition. This process is iterated until the finest resolution map is generated. Finally, a 3×3 convolution is appended on each merged map to reduce the aliasing effect of upsampling, producing the final feature maps {P2, P3, P4, P5}.
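The following is a small PyTorch sketch of this top-down merging, under the same assumptions as above (256 output channels and ResNet-50 stage widths). It is not the authors’ reference implementation, just one way to express the 1×1 lateral convolutions, nearest-neighbor upsampling, element-wise addition, and 3×3 smoothing convolutions described here.

```python
# Sketch of the top-down pathway with lateral connections.
# The 256-channel pyramid width and the ResNet-50 input widths are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions reduce every bottom-up map to the same width.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convolutions smooth each merged map (reduce upsampling aliasing).
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        laterals = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        # Start from the coarsest map and iterate toward the finest resolution.
        merged = [laterals[-1]]
        for lat in reversed(laterals[:-1]):
            up = F.interpolate(merged[0], scale_factor=2, mode="nearest")
            merged.insert(0, lat + up)  # element-wise addition
        p2, p3, p4, p5 = (s(m) for s, m in zip(self.smooth, merged))
        return p2, p3, p4, p5

fpn = TopDownFPN()
p2, p3, p4, p5 = fpn(c2, c3, c4, c5)
```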
Applications
Feature Pyramid Networks for RPN
RPN (Region Proposal Network) is a sliding-window, class-agnostic object detector. In the original RPN design, a small subnetwork is evaluated on dense 3×3 sliding windows on top of a single-scale convolutional feature map, performing object/non-object binary classification and bounding box regression. This is realized by a 3×3 convolutional layer followed by two sibling 1×1 convolutions for classification and regression, which we refer to as a network head.
We attach a head of the same design (3×3 conv and two sibling 1×1 convs) to each level on our feature pyramid. Because the head slides densely over all locations in all pyramid levels, it is not necessary to have multi-scale anchors on a specific level. Instead, we assign anchors of a single scale to each level.
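A hedged sketch of such a shared head is shown below. The 256-channel input width, the 3 aspect ratios per location, and the single anchor scale per level (32² on P2 up to 512² on P6) follow the paper; the single objectness score per anchor (rather than a two-way softmax) is a simplification for the example.

```python
# Sketch of the shared RPN head applied to every pyramid level.
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, num_anchors, 1)      # objectness per anchor
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, 1)  # box regression deltas

    def forward(self, pyramid):
        # The same head (shared parameters) slides over every level; each level
        # only carries anchors of a single scale (e.g. 32^2 on P2 ... 512^2 on P6).
        outputs = []
        for p in pyramid:
            t = torch.relu(self.conv(p))
            outputs.append((self.cls(t), self.reg(t)))
        return outputs

head = RPNHead()
per_level_outputs = head([p2, p3, p4, p5])
```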
Feature Pyramid Networks for Fast R-CNN
Fast R-CNN is a region-based object detector in which Region-of-Interest (RoI) pooling is used to extract features. To use it with our feature pyramid, we need to assign RoIs of different scales to the pyramid levels, so we adapt the assignment strategy of region-based detectors run on image pyramids. Formally, we assign an RoI of width w and height h (on the input image to the network) to the level Pk of our feature pyramid by:
k = ⌊k0 + log2(√(wh) / 224)⌋
Here 224 is the canonical ImageNet pre-training size, and k0 = 4 is the target level onto which an RoI with w × h = 224² is mapped.
Intuitively, the above equation means that if the RoI’s scale becomes smaller (say, 1/2 of 224), it should be mapped into a finer-resolution level (say, k = 3). We attach predictor heads to the RoIs of all levels, and the heads share parameters regardless of level.
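To make the mapping concrete, here is a small sketch of the assignment rule; the clamping to the P2–P5 range used for RoI pooling follows the paper, while the helper name roi_level is an illustrative choice.

```python
# Sketch of the RoI-to-level assignment rule k = floor(k0 + log2(sqrt(w*h) / 224)),
# with k0 = 4 and levels clamped to the P2..P5 range used for RoI pooling.
import math

def roi_level(w, h, k0=4, k_min=2, k_max=5, canonical=224):
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k, k_max))

print(roi_level(224, 224))  # 4: a canonical 224x224 RoI maps to P4
print(roi_level(112, 112))  # 3: half the scale maps to the finer level P3
```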
Experiments
Region Proposal with RPN
We evaluate the COCO-style Average Recall (AR) and AR on small, medium, and large objects (ARs, ARm, and ARl), following the COCO definitions. We report results for 100 and 1000 proposals per image (AR100 and AR1k).
Object Detection with Fast/Faster R-CNN
Next we investigate FPN for region-based (non-sliding window) detectors. We evaluate object detection by the COCO-style Average Precision (AP) and PASCAL-style AP (at a single IoU threshold of 0.5). We also report COCO AP on objects of small, medium, and large sizes.
Conclusion
In this paper, the authors present a clean and simple framework for building feature pyramids inside ConvNets. The method shows significant improvements over several strong baselines and competition winners, and thus provides a practical solution for research on and applications of feature pyramids, without the need to compute image pyramids.
Finally, the study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multi-scale problems using pyramid representations.