:label:sec_rcnn
Besides single shot multibox detection
described in :numref:sec_ssd,
region-based CNNs, or regions with CNN features (R-CNNs),
are among the pioneering approaches to
applying deep learning to object detection
:cite:Girshick.Donahue.Darrell.ea.2014.
In this section, we will introduce
the R-CNN and its series of improvements: the fast R-CNN
:cite:Girshick.2015, the faster R-CNN :cite:Ren.He.Girshick.ea.2015, and the mask R-CNN
:cite:He.Gkioxari.Dollar.ea.2017.
Due to limited space, we will only
focus on the design of these models.
The R-CNN first extracts
many (e.g., 2000) region proposals
from the input image
(anchor boxes can also be considered
as region proposals),
and labels their classes and bounding boxes (e.g., offsets)
:cite:Girshick.Donahue.Darrell.ea.2014.
Then a CNN is used to
perform forward propagation on each region proposal
to extract its features.
Next, features of each region proposal
are used for
predicting the class and bounding box
of this region proposal.
:label:fig_r-cnn
:numref:fig_r-cnn shows the R-CNN model. More concretely, the R-CNN consists of the following four steps:
1. Perform selective search to extract multiple region proposals on the input image :cite:Uijlings.Van-De-Sande.Gevers.ea.2013. These proposed regions are usually selected at multiple scales with different shapes and sizes. Each region proposal will be labeled with a class and a ground-truth bounding box.

Although the R-CNN model uses pretrained CNNs to effectively extract image features, it is slow. Imagine that we select thousands of region proposals from a single input image: this requires thousands of CNN forward propagations to perform object detection. This massive computing load makes it infeasible to widely use R-CNNs in real-world applications.
The main performance bottleneck of
an R-CNN lies in
the independent CNN forward propagation
for each region proposal,
without sharing computation.
Since these regions usually have
overlaps,
independent feature extractions lead to
much repeated computation.
One of the major improvements of
the fast R-CNN over the
R-CNN is that
the CNN forward propagation
is performed only once, on
the entire image :cite:Girshick.2015.
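To get a rough feel for the repeated computation caused by overlapping regions, the following sketch compares the number of pixels touched when every proposal is processed independently with the number touched when each pixel is visited only once. The proposal coordinates and the helper names `area` and `covered_pixels` are hypothetical and chosen only for illustration; counting pixels is a crude proxy for the actual cost of CNN forward propagation.

```python
def area(box):
    """Area of a box given as (x1, y1, x2, y2) with exclusive lower-right."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def covered_pixels(boxes):
    """Set of unit pixels (x, y) covered by at least one box."""
    return {(x, y)
            for x1, y1, x2, y2 in boxes
            for x in range(x1, x2)
            for y in range(y1, y2)}


# Hypothetical region proposals on a 40 x 40 image (illustrative values only)
proposals = [(0, 0, 20, 20), (0, 10, 30, 30), (5, 5, 25, 25)]

# Pixels processed if every proposal is handled on its own
independent = sum(area(b) for b in proposals)
# Pixels processed if overlapping areas are visited only once
shared = len(covered_pixels(proposals))

print(independent, shared)
```

With these three overlapping boxes, the independent count exceeds the shared count, and the gap grows quickly as the number of proposals increases.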
:label:fig_fast_r-cnn
:numref:fig_fast_r-cnn describes the fast R-CNN model. Its major computations are as follows:
The region of interest pooling layer proposed in the fast R-CNN is different from the pooling layer introduced in :numref:sec_pooling.
In the pooling layer,
we indirectly control the output shape
by specifying sizes of
the pooling window, padding, and stride.
In contrast,
we can directly specify the output shape
in the region of interest pooling layer.
For example, let's specify the output height and width for each region as $h_2$ and $w_2$, respectively. For any region of interest window of shape $h \times w$, this window is divided into an $h_2 \times w_2$ grid of subwindows, where the shape of each subwindow is approximately $(h/h_2) \times (w/w_2)$. In practice, the height and width of any subwindow are rounded up, and the largest element is used as the output of the subwindow. Therefore, the region of interest pooling layer can extract features of the same shape even when regions of interest have different shapes.
As an illustrative example,
in :numref:fig_roi,
the upper-left $3\times 3$ region of interest
is selected on a $4 \times 4$ input.
For this region of interest,
we use a $2\times 2$ region of interest pooling layer to obtain
a $2\times 2$ output.
Note that
the four divided subwindows
contain elements
0, 1, 4, and 5 (5 is the maximum);
2 and 6 (6 is the maximum);
8 and 9 (9 is the maximum);
and 10, respectively.
:label:fig_roi
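The subwindow division in this example can be sketched in plain Python. The function `roi_max_pool` and its arguments are hypothetical names introduced here; the sketch follows the rounding-up rule described in the text, whereas library implementations may place subwindow boundaries slightly differently.

```python
import math


def roi_max_pool(x, top, left, h, w, h2, w2):
    """Max-pool the h x w window of x starting at (top, left) into h2 x w2.

    Each subwindow spans ceil(h / h2) rows and ceil(w / w2) columns and is
    clipped at the edge of the region of interest.
    """
    sh, sw = math.ceil(h / h2), math.ceil(w / w2)
    out = []
    for i in range(h2):
        row = []
        for j in range(w2):
            r0, r1 = i * sh, min((i + 1) * sh, h)
            c0, c1 = j * sw, min((j + 1) * sw, w)
            cells = [x[top + r][left + c]
                     for r in range(r0, r1) for c in range(c0, c1)]
            # A subwindow can be empty for some shapes; output 0 in that case
            row.append(max(cells) if cells else 0)
        out.append(row)
    return out


# 4 x 4 input holding the elements 0, 1, ..., 15 in row-major order
X = [[4 * r + c for c in range(4)] for r in range(4)]
# Upper-left 3 x 3 region of interest pooled to a 2 x 2 output
print(roi_max_pool(X, top=0, left=0, h=3, w=3, h2=2, w2=2))
# [[5, 6], [9, 10]]
```

The output shape depends only on `h2` and `w2`, so regions of interest of different sizes yield features of the same shape.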
Below we demonstrate the computation of the region of interest pooling layer. Suppose that the height and width of the CNN-extracted features `X` are both 4, and there is only a single channel.
#@tab mxnet
from mxnet import np, npx
npx.set_np()
X = np.arange(16).reshape(1, 1, 4, 4)
X
#@tab pytorch
import torch
import torchvision
X = torch.arange(16.).reshape(1, 1, 4, 4)
X
Let's further suppose that the height and width of the input image are both 40 pixels and that selective search generates two region proposals on this image. Each region proposal is expressed as five elements: the index of its image in the minibatch (0 here, since `X` holds a single image) followed by the $(x, y)$-coordinates of its upper-left and lower-right corners.
#@tab mxnet
rois = np.array([[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]])
#@tab pytorch
rois = torch.Tensor([[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]])
Because the height and width of `X` are $1/10$ of the height and width of the input image,
the coordinates of the two region proposals
are multiplied by 0.1 according to the specified `spatial_scale` argument.
Then the two regions of interest are marked on `X` as `X[:, :, 0:3, 0:3]` and `X[:, :, 1:4, 0:4]`, respectively.
Finally, in the $2\times 2$ region of interest pooling,
each region of interest is divided
into a grid of subwindows to
further extract features of the same shape $2\times 2$.
#@tab mxnet
npx.roi_pooling(X, rois, pooled_size=(2, 2), spatial_scale=0.1)
#@tab pytorch
torchvision.ops.roi_pool(X, rois, output_size=(2, 2), spatial_scale=0.1)
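The coordinate mapping described above can be checked by hand with a small sketch. The helper `roi_to_feature_window` is a hypothetical name, and rounding to the nearest feature-map cell is an assumption that is consistent with the slices given in the text; the exact rounding inside a particular library may differ.

```python
def roi_to_feature_window(roi, spatial_scale):
    """Map (index, x1, y1, x2, y2) in image pixels to half-open
    (row range, column range) on the feature map."""
    _, x1, y1, x2, y2 = roi
    # Scale to feature-map coordinates and round to the nearest cell
    c0, r0, c1, r1 = (round(v * spatial_scale) for v in (x1, y1, x2, y2))
    # The lower-right corner is inclusive, so add 1 to use it in a slice
    return (r0, r1 + 1), (c0, c1 + 1)


rois = [[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]]
for roi in rois:
    print(roi_to_feature_window(roi, spatial_scale=0.1))
# ((0, 3), (0, 3))
# ((1, 4), (0, 4))
```

The two results correspond to rows 0 to 2 and columns 0 to 2 for the first region, and rows 1 to 3 and columns 0 to 3 for the second.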
To be more accurate in object detection,
the fast R-CNN model
usually has to generate
many region proposals with selective search.
To reduce the number of region proposals
without loss of accuracy,
the faster R-CNN
proposes to replace selective search with a region proposal network :cite:Ren.He.Girshick.ea.2015.
:label:fig_faster_r-cnn
:numref:fig_faster_r-cnn shows the faster R-CNN model. Compared with the fast R-CNN,
the faster R-CNN only changes
the region proposal method
from selective search to a region proposal network.
The rest of the model remains
unchanged.
The region proposal network
works in the following steps:
It is worth noting that, as part of the faster R-CNN model, the region proposal network is jointly trained with the rest of the model. In other words, the objective function of the faster R-CNN includes not only the class and bounding box prediction in object detection, but also the binary class and bounding box prediction of anchor boxes in the region proposal network. As a result of the end-to-end training, the region proposal network learns how to generate high-quality region proposals, so as to stay accurate in object detection with a reduced number of region proposals that are learned from data.
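As a sketch of this joint objective, the total loss can be written as a sum of four terms. The symbols below are our own notation, and equal weighting of the terms is an assumption made for simplicity; practical implementations may weight them differently.

$$L = L_{\textrm{rpn-cls}} + L_{\textrm{rpn-box}} + L_{\textrm{det-cls}} + L_{\textrm{det-box}}$$

Here $L_{\textrm{rpn-cls}}$ and $L_{\textrm{rpn-box}}$ denote the binary class and bounding box prediction losses of anchor boxes in the region proposal network, while $L_{\textrm{det-cls}}$ and $L_{\textrm{det-box}}$ denote the class and bounding box prediction losses in object detection.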
In the training dataset,
if pixel-level positions of objects
are also labeled on images,
the mask R-CNN can effectively leverage
such detailed labels
to further improve the accuracy of object detection :cite:He.Gkioxari.Dollar.ea.2017.
:label:fig_mask_r-cnn
As shown in :numref:fig_mask_r-cnn,
the mask R-CNN
is modified based on the faster R-CNN.
Specifically,
the mask R-CNN replaces the
region of interest pooling layer with the
region of interest (RoI) alignment layer.
This region of interest alignment layer
uses bilinear interpolation
to preserve the spatial information on the feature maps, which is more suitable for pixel-level prediction.
The output of this layer
contains feature maps of the same shape
for all the regions of interest.
They are used
to predict
not only the class and bounding box for each region of interest,
but also the pixel-level position of the object through an additional fully convolutional network.
More details on using a fully convolutional network to predict pixel-level semantics of an image
will be provided
in subsequent sections of this chapter.
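The bilinear interpolation used by the region of interest alignment layer can be illustrated with a minimal sketch. The function `bilinear_sample` is a hypothetical helper that samples a single-channel feature map at a fractional position; it assumes the position lies inside the map, and it omits how a full alignment layer chooses and aggregates sampling points within each subwindow.

```python
import math


def bilinear_sample(feat, row, col):
    """Bilinearly interpolate the 2D list feat at fractional (row, col).

    Assumes 0 <= row <= height - 1 and 0 <= col <= width - 1.
    """
    h, w = len(feat), len(feat[0])
    r0, c0 = math.floor(row), math.floor(col)
    # Clamp the neighbors so that positions on the last row/column work
    r1, c1 = min(r0 + 1, h - 1), min(c0 + 1, w - 1)
    dr, dc = row - r0, col - c0
    # Interpolate along columns on the two neighboring rows, then along rows
    top = (1 - dc) * feat[r0][c0] + dc * feat[r0][c1]
    bottom = (1 - dc) * feat[r1][c0] + dc * feat[r1][c1]
    return (1 - dr) * top + dr * bottom


# 4 x 4 feature map holding 0, 1, ..., 15 in row-major order
X = [[4.0 * r + c for c in range(4)] for r in range(4)]
print(bilinear_sample(X, 0.5, 0.5))  # 2.5, the mean of 0, 1, 4, and 5
```

Because no coordinate is rounded to an integer cell, the sampled value changes smoothly with the position, which is the property that makes this kind of sampling better suited to pixel-level prediction.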
:cite:Redmon.Divvala.Girshick.ea.2016 :cite:Zhao.Zheng.Xu.ea.2019
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab: