2.2.1. Libra R-CNN Network for Leaf Detection
The leaf detection model adopts the Libra R-CNN network, which is composed of feature extraction, region proposal generation and region proposal optimization, as shown in Figure 4. First, feature extraction is performed on the input image; in this phase, the feature maps of different layers are fused and enhanced. Second, a large number of region proposals are generated on the basis of the output feature maps. Finally, the prediction results are obtained by optimizing the region proposals.
a) Feature extraction for leaf detection
The feature extraction module includes basic feature extraction, feature fusion and feature enhancement; its overall structure is shown in Figure 5. Basic feature extraction uses the residual neural network (ResNet) (He et al., 2016), which shows strong feature extraction ability and is widely used in the feature extraction modules of various deep neural networks. Feature maps of four different sizes are built from bottom to top: the forward propagation of the convolutional network yields a pyramid of feature maps, in which the resolutions of two neighboring levels differ by a factor of two.

Feature fusion follows basic feature extraction. In convolutional networks, low-level feature maps carry less semantic information but relatively accurate target locations, while high-level feature maps carry more semantic information but rough target locations. Feature fusion combines low-level and high-level features to fully utilize the features of each level, thereby better capturing image details and contextual information. The feature pyramid network (FPN) (Lin et al., 2017) is adopted for feature fusion. The feature maps {B2, B3, B4, B5} output from the last layer of each stage of ResNet50 (except the first stage) are used to construct the feature pyramid. First, 2× upsampling is performed on B5 to reach the same resolution as B4, while a 1×1 convolution is applied to B4 to match the channel number of the upsampled B5. The two maps are then added element-wise to obtain the new feature map C4. C3 and C2 are obtained in the same way, as shown in Figure 5, so the features are continuously fused.
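To make the fusion step concrete, the following is a minimal sketch in PyTorch (the framework choice and names such as FPNFusion and laterals are ours, not the paper's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNFusion(nn.Module):
    """Minimal top-down FPN fusion: C_l = lateral(B_l) + 2x-upsampled C_{l+1}."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions align the channel numbers of B2..B5.
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )

    def forward(self, feats):
        # feats = [B2, B3, B4, B5], ordered from high to low resolution.
        laterals = [l(f) for l, f in zip(self.laterals, feats)]
        # Top-down path: upsample the coarser map by 2x and add element-wise.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], scale_factor=2, mode="nearest"
            )
        return laterals  # [C2, C3, C4, C5]

# Usage with dummy ResNet50-like feature maps of a 512x512 input:
feats = [torch.randn(1, c, s, s) for c, s in
         zip((256, 512, 1024, 2048), (128, 64, 32, 16))]
c2, c3, c4, c5 = FPNFusion()(feats)
```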
Feature enhancement is applied after feature fusion. The multi-level features are further strengthened using the same deeply integrated balanced semantic features. First, the multi-level features {C2, C3, C4, C5} are rescaled to the same size as C4, using interpolation (for C5) or max-pooling (for C2 and C3) respectively. Denoting the resized features again as {C2, C3, C4, C5}, the average feature is calculated as

C = \frac{1}{4} \sum_{l=2}^{5} C_l.    (1)
Then the average feature is further refined to be more discriminative by embedded Gaussian non-local attention (Wang et al., 2018). That is, for the feature x_i at position i and the feature x_j at any position j in the average feature C, the attention is defined as

f(x_i, x_j) = e^{\theta(x_i)^\top \phi(x_j)},    (2)

where \theta and \phi are 1×1 convolution operators. With the attention as a residual block, the feature r_i at position i in the refined average feature R is written as

r_i = W_z\left( \frac{1}{\sum_{\forall j} f(x_i, x_j)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \right) + x_i,    (3)

where g is also a 1×1 convolution and W_z is a convolution operation with the same channel number as x_i.
The refined average feature R serves as the integrated balanced semantic feature. R is then rescaled to the same resolutions as {C2, C3, C4, C5} and added to them separately to form the new multi-scale features {P2, P3, P4, P5}, which strengthens the features. The features {P2, P3, P4, P5} are used for leaf detection.
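A compact sketch of this balance-and-strengthen procedure, under the same PyTorch assumption; the non-local refinement of Equations (2)-(3) is abstracted into a refine callback so the sketch stays self-contained:

```python
import torch.nn.functional as F

def balanced_feature_pyramid(c2, c3, c4, c5, refine=lambda x: x):
    """Rescale {C2..C5} to C4's size, average, refine, rescale back and add.

    `refine` stands in for the embedded Gaussian non-local block of
    Equations (2)-(3); identity by default for brevity.
    """
    size = c4.shape[-2:]
    # Max-pool the finer maps and interpolate the coarser one to C4's size.
    resized = [F.adaptive_max_pool2d(c2, size),
               F.adaptive_max_pool2d(c3, size),
               c4,
               F.interpolate(c5, size=size, mode="nearest")]
    avg = sum(resized) / 4.0                      # Equation (1)
    refined = refine(avg)                         # Equations (2)-(3)
    # Rescale the refined feature back to each level and add (residual).
    outs = []
    for c in (c2, c3, c4, c5):
        r = F.interpolate(refined, size=c.shape[-2:], mode="nearest")
        outs.append(c + r)
    return outs  # [P2, P3, P4, P5]
```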
b) Region proposal generation
Region proposal generation includes three steps, as shown in Figure 6. First, according to the input image size and the feature maps, the anchor generator produces thousands of anchors. Second, IoU-balanced sampling is executed on the anchors to select training samples. Third, with the selected anchors and the feature maps as input, the region proposal generator generates region proposals.
The anchors are generated based on the feature maps. Specifically, for each feature point of a feature map, three anchors are generated at the corresponding image position. The anchors have the same area and three different aspect ratios of 1:1, 1:2 and 2:1, as shown in Figure 7. We assign the anchor size of the 1:1 aspect ratio on feature map P2 to be 13×13. Since a deeper feature map has a wider receptive field and can better detect large objects, its anchors mapped to the original image are assigned a larger size. According to the scale factors between the feature map sizes, the 1:1 anchor sizes for feature maps P3, P4 and P5 are 25×25, 50×50 and 100×100 respectively.
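The anchor shapes per level can be sketched as follows (NumPy; the shifting of anchors over all feature points is omitted, and the function name is illustrative):

```python
import numpy as np

def generate_anchors(base_size, ratios=(1.0, 0.5, 2.0)):
    """Anchors of equal area and aspect ratios 1:1, 1:2, 2:1 at one point.

    `base_size` is the side of the 1:1 anchor (13, 25, 50, 100 for P2..P5).
    Returns (x1, y1, x2, y2) boxes centered at the origin.
    """
    area = float(base_size) ** 2
    anchors = []
    for r in ratios:             # r = height / width
        w = np.sqrt(area / r)    # keep the area constant across ratios
        h = w * r
        anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

# One anchor set per pyramid level, shifted over every feature point:
for level, base in zip(("P2", "P3", "P4", "P5"), (13, 25, 50, 100)):
    print(level, generate_anchors(base).round(1))
```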
The traditional random sampling method ignores a large number of hard samples, which are precisely the samples that can improve the accuracy of the network. IoU-balanced sampling is therefore adopted to mine hard anchor samples. IoU (Intersection over Union) is the ratio of the overlap area to the union area between the predicted bounding box and the ground truth, defined as

IoU = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})}.    (4)

If an anchor has an IoU value less than 0.3, it is classified as a negative sample; if the IoU value is greater than 0.7, it is classified as a positive sample. Otherwise, the anchor is discarded and does not participate in the loss calculation. IoU-balanced sampling uses stratified sampling to select hard negative samples: the sampling interval is divided equally into K bins according to IoU, and samples are then selected uniformly from each bin.
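A minimal sketch of this stratified negative sampling (NumPy; the bin count K = 3 and the helper name are our assumptions):

```python
import numpy as np

def iou_balanced_negatives(ious, num_samples, k=3, max_iou=0.3, rng=None):
    """Stratified sampling of negative anchors over K equal IoU bins.

    `ious`: per-anchor max IoU with any ground-truth box. Roughly
    num_samples / k anchors are drawn uniformly from each bin, so
    high-IoU (hard) negatives are not drowned out by the many easy,
    near-zero-IoU anchors.
    """
    rng = rng or np.random.default_rng()
    neg_idx = np.flatnonzero(ious < max_iou)
    per_bin = num_samples // k
    chosen = []
    for b in range(k):
        lo, hi = b * max_iou / k, (b + 1) * max_iou / k
        in_bin = neg_idx[(ious[neg_idx] >= lo) & (ious[neg_idx] < hi)]
        take = min(per_bin, len(in_bin))
        if take > 0:
            chosen.append(rng.choice(in_bin, size=take, replace=False))
    return np.concatenate(chosen) if chosen else np.array([], dtype=int)

# Example: 10000 anchors, mostly easy negatives with IoU near 0.
ious = np.abs(np.random.default_rng(0).normal(0.05, 0.08, 10000)).clip(0, 1)
sample = iou_balanced_negatives(ious, num_samples=128)
```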
The region proposal generator produces the region proposals. First, the class (foreground or background) and the regression parameters (position, width and height) are predicted for each anchor. Then proposals are generated from the anchors according to the regression parameters. Finally, the region proposals are obtained by non-maximum suppression (NMS), which keeps only the proposal with the highest prediction probability within each local neighborhood.
c) Region proposal optimization
Region proposal optimization adjusts the position, width and height of the region proposals and predicts the probability scores of each box for all classes. The network structure is shown in Figure 8. First, RoIAlign converts the features of each RoI (region of interest) into a small feature map with a fixed size of 7×7. RoIAlign, proposed in Mask R-CNN, is a feature extraction module for RoIs: it maps the region proposals onto the corresponding feature map to obtain the RoIs and then performs a pooling operation on these regions to produce 7×7 feature matrices, using bilinear interpolation to calculate each element value. Second, these 7×7 feature matrices are flattened and fed into two serial fully connected layers. Two parallel fully connected branches follow, which output the class probabilities and the regression parameters for each proposal separately. Finally, the regression parameters are used to correct the position and size of the proposals. A series of leaf bounding boxes is obtained by selecting the boxes with a high leaf class probability.
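One way to reproduce the RoIAlign step is with torchvision's roi_align operator; the paper does not name its implementation, so the following is only an illustrative usage:

```python
import torch
from torchvision.ops import roi_align

# A single 256-channel feature map of a 512x512 image at stride 8 (P2-like).
features = torch.randn(1, 256, 64, 64)

# Proposals in image coordinates: (batch_index, x1, y1, x2, y2).
proposals = torch.tensor([[0, 32.0, 48.0, 160.0, 200.0],
                          [0, 100.0, 80.0, 260.0, 220.0]])

# Pool each proposal to a fixed 7x7 map; spatial_scale maps image
# coordinates onto the stride-8 feature map, and aligned=True applies
# the half-pixel correction from the RoIAlign paper.
pooled = roi_align(features, proposals, output_size=(7, 7),
                   spatial_scale=1.0 / 8, sampling_ratio=2, aligned=True)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```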
d) Loss function definition for leaf detection
The loss of the leaf detection network consists of two parts, one for the RPN and the other for the region proposal optimization network, which is defined as

L = L_{RPN} + L_{pred}.    (5)
The loss function L_{RPN} for the RPN network is given as

L_{RPN} = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*),    (6)

which includes the classification loss L_{cls} and the regression loss L_{reg} of the anchors. Here p_i is the probability that anchor i is predicted to be positive, p_i^* is 1 if anchor i is positive and 0 otherwise, t_i denotes the four predicted regression parameters of anchor i, and t_i^* denotes the actual regression parameters. The anchor classification loss L_{cls}, defined by binary cross-entropy, is

L_{cls}(p_i, p_i^*) = -\left[ p_i^* \log p_i + (1 - p_i^*) \log(1 - p_i) \right].    (7)
The regression loss L_{reg} is based on the balanced L1 loss. The key idea of the balanced L1 loss is to suppress the regression gradients from outliers (inaccurate samples) so as to balance the involved samples and tasks. The gradient function of the balanced L1 loss L_b is given as

\frac{\partial L_b}{\partial x} = \begin{cases} \alpha \ln(b|x| + 1), & \text{if } |x| < 1 \\ \gamma, & \text{otherwise}, \end{cases}    (8)

where \alpha and b are control factors and \gamma is a constant. To ensure continuity of the gradient at |x| = 1, we set \alpha \ln(b + 1) = \gamma. In our experiments, \alpha is set to 0.5 and \gamma is set to 1.5. The balanced L1 loss is obtained by integrating the gradient formulation in Equation (8) as

L_b(x) = \begin{cases} \frac{\alpha}{b}\,(b|x| + 1) \ln(b|x| + 1) - \alpha |x|, & \text{if } |x| < 1 \\ \gamma |x| + C, & \text{otherwise}, \end{cases}    (9)

where C is a constant.
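A direct implementation of Equations (8)-(9), with b and C derived from the two continuity constraints (a sketch; the sum reduction is our choice):

```python
import torch

def balanced_l1_loss(pred, target, alpha=0.5, gamma=1.5):
    """Balanced L1 loss of Libra R-CNN (Equations (8)-(9)).

    b follows from the gradient-continuity constraint
    alpha * ln(b + 1) = gamma, and C makes the loss itself continuous
    at |x| = 1.
    """
    x = torch.abs(pred - target)
    b = torch.expm1(torch.tensor(gamma / alpha)).item()  # e^(g/a) - 1
    inner = alpha / b * (b * x + 1) * torch.log(b * x + 1) - alpha * x
    # Continuity at |x| = 1: gamma * 1 + C = inner(1).
    c = alpha / b * (b + 1) * torch.log(torch.tensor(b + 1.0)) - alpha - gamma
    outer = gamma * x + c
    return torch.where(x < 1, inner, outer).sum()

# Example on dummy regression parameters (t_i vs t_i*):
pred = torch.tensor([0.1, 0.4, -0.2, 2.0])
target = torch.zeros(4)
print(balanced_l1_loss(pred, target))
```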
The regression loss L_{reg} using the balanced L1 loss is defined as

L_{reg}(t_i, t_i^*) = \sum_{j \in \{x, y, w, h\}} L_b(t_i^j - t_i^{*j}),    (10)

where t_i^j (j = x, y, w, h) is a specific regression parameter of t_i, used to correct the x-coordinate, y-coordinate, width and height of the anchor respectively, and t_i^{*j} is the corresponding regression parameter of t_i^*.

The loss L_{pred} of the leaf bounding box prediction network is defined as

L_{pred} = L'_{cls}(p, u) + [u \geq 1]\, L_{reg}(t^u, v),    (11)

where p is the softmax probability of the proposal over all categories (including background), u is the actual category label, t^u is the vector of predicted regression parameters for the category u of the proposal, v is the vector of actual regression parameters, and the indicator [u \geq 1] is 1 when the proposal is not background and 0 otherwise. The classification loss L'_{cls} of the proposals is defined using the cross-entropy loss for multi-class classification as

L'_{cls}(p, u) = -\log p_u,    (12)

where p_u is the predicted probability of the category u of the proposal. In this study, although L_{cls} and L'_{cls} serve different purposes, one for binary and the other for multi-class classification, the leaf bounding box prediction network predicts only two category probabilities per proposal (background and leaf), so L'_{cls} here is equivalent to L_{cls}. The regression loss of the proposals, L_{reg}(t^u, v), is calculated as in Equation (10).
2.2.2. Target Leaf Localization
During the detection phase we have detected the rectangular bounding boxes of all possible leaves in the image. However, only the leaf located in the central region of the image and with a large size is needed, together with its bounding box. The distance between the center point of the image and that of each rectangular bounding box is used to measure position, defined as

d_i = \| c_i - c \|_2,    (13)

where c_i is the center position of the i-th rectangular bounding box and c is the center position of the image. We normalize the distance as

\hat{d}_i = \frac{d_i - d_{min}}{d_{max} - d_{min}},    (14)

where d_{max} = \max_i d_i and d_{min} = \min_i d_i are the maximum and minimum distances over all d_i.
Similarly, let a_i denote the area of the i-th rectangular bounding box, and a_{max} and a_{min} the maximum and minimum areas over all bounding boxes. The normalization is

\hat{a}_i = \frac{a_i - a_{min}}{a_{max} - a_{min}}.    (15)
The target bounding box is expected to have the largest area and the smallest distance. A Gaussian function is used to balance the roles of the area \hat{a}_i and the position \hat{d}_i: the target bounding box is found by maximizing the product of two Gaussian functions, that is,

i^* = \arg\max_i \exp\left(-\frac{\hat{d}_i^2}{2\sigma_1^2}\right) \exp\left(-\frac{(\hat{a}_i - 1)^2}{2\sigma_2^2}\right),    (16)

where \sigma_1 and \sigma_2 are control parameters. As shown in Figure 3, the leaf detection model detects all leaves, marking them with bounding boxes, and the target leaf localization model then finds the target bounding box.
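Equations (13)-(16) amount to the following scoring procedure (NumPy sketch; the small epsilon guarding against degenerate min-max ranges is our addition):

```python
import numpy as np

def locate_target_leaf(boxes, image_size, sigma1=1.0, sigma2=1.0):
    """Pick the box that is most central and largest (Equations (13)-(16)).

    `boxes` is an (N, 4) array of (x1, y1, x2, y2); the sigma defaults are
    illustrative placeholders, not the paper's tuned values.
    """
    boxes = np.asarray(boxes, dtype=float)
    img_center = np.array(image_size, dtype=float) / 2.0
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    # Min-max normalize distances (Eq. 14) and areas (Eq. 15).
    d = np.linalg.norm(centers - img_center, axis=1)
    d_hat = (d - d.min()) / (d.max() - d.min() + 1e-12)
    a_hat = (areas - areas.min()) / (areas.max() - areas.min() + 1e-12)

    # Product of two Gaussians: peak at smallest distance, largest area.
    score = np.exp(-d_hat**2 / (2 * sigma1**2)) * \
            np.exp(-(a_hat - 1)**2 / (2 * sigma2**2))
    return int(np.argmax(score))

boxes = [[200, 180, 330, 320], [10, 10, 70, 60], [400, 350, 500, 470]]
print(locate_target_leaf(boxes, image_size=(512, 512)))  # -> 0
```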
a) Accuracy-driven parameter setting for target leaf localization
The control parameters \sigma_1 and \sigma_2 are the two key parameters of the target leaf localization module. To find suitable values, we took 20 equally spaced candidate values in the interval (0, 2] for each parameter, yielding 400 (\sigma_1, \sigma_2) combinations in total. Since the target leaf is located by position and area, we regard the center of the bounding box predicted by Libra R-CNN as the leaf position and the area of the bounding box as the leaf area. In addition, we introduce a label marking whether a leaf is the target: the label is 1 if the leaf is the target and 0 otherwise. We tested on 1335 soybean leaf images containing 2,956,800 bounding boxes. The results are shown in Figure 9, where the green dots indicate the combinations for which all estimated target leaves are correct. We calculated the average values of \sigma_1 and \sigma_2 over those green dots and use them as the final parameter values.
b) Vertex offset strategy for target leaf bounding box optimization
In some cases, the bounding box predicted by the leaf detector cannot completely enclose the whole leaf, as shown in Figure 10, which leads to incomplete segmentation results. If the vertices of the bounding box are moved outward by a certain distance so that the box encloses the whole leaf, the segmentation can be improved. The new vertex coordinates after the outward move are calculated as

x' = x + (1 - 2\alpha)\,\delta, \qquad y' = y + (1 - 2\beta)\,\delta,    (17)

where (x, y) are the original coordinates of a vertex, (x', y') are the coordinates after moving, \delta is the moving distance, and \alpha, \beta are factors measuring the relative position of the vertex: \alpha equals 1 if the vertex is on the left side of the bounding box and 0 otherwise, and \beta equals 1 if the vertex is on the upper side of the bounding box and 0 otherwise.
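Equation (17) applied to the four vertices of a box (a plain-Python sketch; the function name is illustrative, and image coordinates with y growing downward are assumed):

```python
def expand_box(x1, y1, x2, y2, delta):
    """Move each bounding-box vertex outward by `delta` pixels (Eq. 17).

    alpha=1 / beta=1 vertices (left / upper) move toward smaller x / y,
    the others toward larger x / y.
    """
    def move(x, y, alpha, beta):
        return x + (1 - 2 * alpha) * delta, y + (1 - 2 * beta) * delta

    top_left = move(x1, y1, alpha=1, beta=1)
    top_right = move(x2, y1, alpha=0, beta=1)
    bottom_left = move(x1, y2, alpha=1, beta=0)
    bottom_right = move(x2, y2, alpha=0, beta=0)
    return top_left, top_right, bottom_left, bottom_right

print(expand_box(100, 120, 260, 300, delta=10))
# ((90, 110), (270, 110), (90, 310), (270, 310))
```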
Bounding box optimization aims to improve leaf segmentation accuracy, and we provide a strategy for setting its parameter guided by segmentation accuracy. Five settings of the moving distance \delta are used to correct the target leaf bounding box: 0, 5, 10, 15 and a random value in [0, 15]. We train five target leaf segmentation networks according to these five settings of \delta, called Fix0_SNet, Fix5_SNet, Fix10_SNet, Fix15_SNet and Ran_SNet respectively. To evaluate segmentation accuracy, the leaf bounding boxes under the five settings are used as test input. The experimental results show that the highest segmentation accuracy is achieved by combining Fix10_SNet (trained with \delta = 10) with test bounding boxes optimized with \delta = 10. Experimental details and analysis are given in Section 3.1.
2.2.3. Target Leaf Segmentation Network
The target leaf segmentation model consists of four stages, as shown in Figure 11: (1) input data processing, where outside and inside information of the target leaf is provided; (2) feature extraction, where the multi-scale features of the target leaf are extracted; (3) feature refinement, where the multi-scale features are upsampled and fused to repair the boundary features of the segmentation region; and (4) mask prediction, where the mask of the target leaf is generated.
a) Input data processing
The input data processing introduces prior guidance for the target leaf segmentation model. It consists of three steps, as shown in Figure 12. (1) A sub-region of the original image is used for target leaf segmentation; it is obtained by cropping along the target leaf bounding box shifted 30 pixels outward. (2) The cropped image is scaled to a standard size (e.g., 512×512), and the vertex coordinates of the target leaf bounding box are adjusted accordingly. (3) Two single-channel Gaussian heat maps are constructed from the target leaf bounding box to provide foreground and background prior guidance. Using the center point coordinates (x_c, y_c) of the bounding box, the channel for foreground prior guidance is defined by the Gaussian heat map

G_{fg}(x, y) = \exp\left(-\frac{(x - x_c)^2 + (y - y_c)^2}{2\sigma^2}\right),    (18)

where \sigma controls the spread of the heat map. Similarly, using the four vertex coordinates \{(x_i, y_i) \mid i \in \{1, 2, 3, 4\}\} of the bounding box, the channel for background prior guidance is

G_{bg}(x, y) = \max_{i \in \{1, 2, 3, 4\}} \exp\left(-\frac{(x - x_i)^2 + (y - y_i)^2}{2\sigma^2}\right).    (19)

The two Gaussian heat maps are concatenated with the scaled image to form a five-channel input for the target leaf segmentation model.
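A sketch of the two prior channels, assuming Equation (19) takes the maximum over the four vertex-centered Gaussians and using an illustrative \sigma:

```python
import numpy as np

def gaussian_heatmaps(box, size=512, sigma=40.0):
    """Foreground/background prior channels (Equations (18)-(19)).

    Foreground: Gaussian centered at the box center. Background: maximum
    of Gaussians centered at the four box vertices (our reading of
    Eq. 19). `sigma` is an illustrative value, not from the paper.
    """
    x1, y1, x2, y2 = box
    ys, xs = np.mgrid[0:size, 0:size].astype(float)

    def gauss(cx, cy):
        return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma**2))

    fg = gauss((x1 + x2) / 2, (y1 + y2) / 2)
    bg = np.max([gauss(x, y) for x, y in
                 [(x1, y1), (x2, y1), (x1, y2), (x2, y2)]], axis=0)
    return fg, bg

fg, bg = gaussian_heatmaps((120, 100, 380, 420))
# Concatenate with the 3-channel RGB image to form the 5-channel input.
```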
b) Feature extraction for target leaf segmentation
The feature extraction network adopts a structure similar to FPN, as shown in Figure 13, including basic feature extraction and semantic information fusion. In the basic feature extraction part, ResNet101 is used to construct a pyramid-structured multi-scale feature map. Unlike the general FPN structure, the deepest feature map output by ResNet101 is processed by the pyramid scene parsing (PSP) module (Zhao et al., 2017) to enrich the feature representation with global contextual information. The structure of the PSP module is shown in Figure 14. First, the input feature map is average-pooled with four pooling windows to produce four pooled feature maps of sizes 1×1, 2×2, 3×3 and 6×6 respectively. Second, these feature maps are processed in sequence by convolution and batch normalization, and finally upsampled to generate multi-scale feature maps of the same size as the original feature map. Finally, the generated multi-scale feature maps and the original feature map are concatenated and followed by convolution and batch normalization to output a feature map with fused semantic information. After enriching the feature representation of the last layer, FPN is used to fuse the feature maps of adjacent stages through the top-down and lateral paths.
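A minimal PSP module in PyTorch consistent with this description (the channel split across branches follows Zhao et al., 2017; the exact widths are our assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PSPModule(nn.Module):
    """Pyramid scene parsing pooling (Zhao et al., 2017), minimal sketch."""

    def __init__(self, in_channels=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        branch_ch = in_channels // len(bins)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),            # 1x1, 2x2, 3x3, 6x6 pooling
                nn.Conv2d(in_channels, branch_ch, 1, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            ) for b in bins
        ])
        # Fuse the concatenated original + pyramid features.
        self.fuse = nn.Sequential(
            nn.Conv2d(in_channels * 2, in_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        size = x.shape[-2:]
        pyramids = [F.interpolate(b(x), size=size, mode="bilinear",
                                  align_corners=False) for b in self.branches]
        return self.fuse(torch.cat([x] + pyramids, dim=1))

out = PSPModule()(torch.randn(1, 2048, 16, 16))  # -> (1, 2048, 16, 16)
```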
c) Feature refinement
The multi-scale feature maps extracted by the feature extraction network lose feature details, which may destroy the boundary of the segmentation region. The feature refinement network aims to repair the lost boundary features by upsampling and fusing the multi-scale feature information. The structure of the feature refinement network is shown in Figure 15. First, residual blocks are used for feature enhancement; the number of residual blocks differs by layer: 3, 2, 1 and 0 for C5, C4, C3 and C2 respectively. Next, upsampling is performed on the enhanced feature maps so that their size equals that of the lowest-level feature map P2. Finally, the feature maps P2, P3, P4 and P5 are concatenated to obtain the refined feature map.
d) Mask prediction
The process of mask prediction is shown in
Figure 16. The refined feature map is first passed through the mask predictor to generate a target mask, and then the mask is mapped back to the original image. The predictor adopts a structure similar to the residual block. Three serial convolution layer operators act on and add the input feature map. The result is followed by a batch normalization operation to generate the mask of target leaf. According to the position and size of target leaf bounding box, the mask is mapped to the original image so as to segment the target leaf from the original image.
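A sketch of such a residual-style predictor (PyTorch; the channel width, kernel sizes and the final 1×1 mask convolution are our assumptions):

```python
import torch
import torch.nn as nn

class MaskPredictor(nn.Module):
    """Residual-style mask head: three serial convs plus a skip connection."""

    def __init__(self, channels=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.bn = nn.BatchNorm2d(channels)
        self.to_mask = nn.Conv2d(channels, 1, 1)  # single-channel leaf mask

    def forward(self, x):
        x = self.bn(self.convs(x) + x)   # residual add, then batch norm
        return torch.sigmoid(self.to_mask(x))

mask = MaskPredictor()(torch.randn(1, 256, 128, 128))  # -> (1, 1, 128, 128)
```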
e) Loss function definition for leaf segmentation
In order to better supervise the model training, we construct the loss not only for the final generated mask but also for the mask predicted from each level of CoarseNet (C2, C3, C4, C5). Therefore, the total loss of the model is the sum of the five losses:

L_{total} = L_{mask} + \sum_{l=2}^{5} L_{C_l}.    (20)

Each loss L is defined using the binary cross-entropy loss as

L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i^* \log y_i + (1 - y_i^*) \log(1 - y_i) \right],    (21)

where y_i is the predicted value of pixel i of the mask and y_i^* (0 or 1) is the ground truth.
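The deeply supervised loss of Equations (20)-(21) can be sketched as follows (PyTorch; resizing the auxiliary masks to the ground-truth resolution is our assumption):

```python
import torch
import torch.nn.functional as F

def total_segmentation_loss(final_mask, level_masks, gt):
    """Deeply supervised BCE loss (Equations (20)-(21)).

    `final_mask` is the predicted target mask and `level_masks` are the
    four auxiliary masks predicted from C2..C5; all are probabilities in
    (0, 1) and are resized here to the ground-truth resolution.
    """
    loss = F.binary_cross_entropy(final_mask, gt)
    for m in level_masks:
        m = F.interpolate(m, size=gt.shape[-2:], mode="bilinear",
                          align_corners=False)
        loss = loss + F.binary_cross_entropy(m, gt)
    return loss

gt = torch.randint(0, 2, (1, 1, 128, 128)).float()
final = torch.rand(1, 1, 128, 128)
aux = [torch.rand(1, 1, s, s) for s in (128, 64, 32, 16)]
print(total_segmentation_loss(final, aux, gt))
```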