
Deep residual networks (ResNet)
One key advantage of deep networks is their great ability to learn different levels of representation from both inputs and feature maps. In classification, segmentation, detection, and a number of other computer vision problems, learning different levels of features generally leads to better performance.
However, you'll find that it's not easy to train deep networks because the gradient vanishes (or explodes) in the shallow layers during backpropagation. Figure 2.2.1 illustrates the problem of vanishing gradients. The network parameters are updated by backpropagation from the output layer to all previous layers. Since backpropagation is based on the chain rule, gradients tend to diminish as they reach the shallow layers. This is due to the multiplication of small numbers, especially when the absolute values of the errors and parameters are small.
The number of multiplication operations is proportional to the depth of the network. It's also worth noting that if the gradient degrades, the parameters will not be updated appropriately.
Hence, the network will fail to improve its performance:

Figure 2.2.1: A common problem in deep networks is that the gradient vanishes as it reaches the shallow layers during backpropagation.
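To get a feel for why the gradient diminishes, consider a minimal numerical sketch (our own illustration, assuming every layer scales the gradient by a constant factor of 0.5):

# repeated multiplication of small numbers during backpropagation
grad = 1.0
for layer in range(30):
    grad *= 0.5    # assumed per-layer gradient scaling factor
print(grad)        # ~9.3e-10: almost no signal reaches the shallow layers

After 30 multiplications, the gradient reaching the shallowest layers is roughly a billionth of its original magnitude, which is why their parameters barely change.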

Figure 2.2.2: A comparison between a block in a typical CNN and a block in ResNet. To prevent degradation in gradients during backpropagation, a shortcut connection is introduced.
To alleviate the degradation of the gradient in deep networks, ResNet introduced the concept of a deep residual learning framework. Let's analyze a block, a small segment of our deep network.
The preceding figure shows a comparison between a typical CNN block and a ResNet residual block. The idea of ResNet is that in order to prevent the gradient from degrading, we'll let the information flow through the shortcut connections to reach the shallow layers.
Next, let's look in more detail at the differences between the two blocks. Figure 2.2.3 compares the CNN block of another commonly used deep network, VGG [3], with the residual block of ResNet. We'll represent the layer feature maps as $x$. The feature maps at layer $l$ are $x_l$. The operations in the CNN layer are Conv2D-Batch Normalization (BN)-ReLU.

Let's suppose we represent this set of operations in the form of $H()$ = Conv2D-Batch Normalization (BN)-ReLU. This then means that:

$x_{l-1} = H(x_{l-2})$ (Equation 2.2.1)

$x_l = H(x_{l-1})$ (Equation 2.2.2)

In other words, the feature maps at layer $l-2$ are transformed to $x_{l-1}$ by $H()$ = Conv2D-Batch Normalization (BN)-ReLU. The same set of operations is applied to transform $x_{l-1}$ to $x_l$. To put this another way, if we have an 18-layer VGG, then there are 18 $H()$ operations before the input image is transformed into the 18th layer feature maps.
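As a concrete illustration, here is a minimal Keras sketch of a single $H()$ operation; the helper name h_block and its default arguments are our own choices for illustration:

from keras.layers import Conv2D, BatchNormalization, Activation

def h_block(x, filters=64, kernel_size=3):
    """One H() = Conv2D-Batch Normalization (BN)-ReLU operation."""
    x = Conv2D(filters, kernel_size, padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    return x

An 18-layer plain network would simply chain 18 such calls, one feeding into the next.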
Generally speaking, we can observe that the layer $l$ output feature maps are directly affected only by the immediately preceding feature maps. Meanwhile, for ResNet:

$x_{l-1} = H(x_{l-2})$ (Equation 2.2.3)

$x_l = \text{ReLU}(F(x_{l-1}) + x_{l-1})$ (Equation 2.2.4)

Figure 2.2.3: Detailed layer operations for a plain CNN block and a residual block

$F(x_{l-1})$ is made of Conv2D-BN, which is also known as the residual mapping. The + sign is element-wise tensor addition between the shortcut connection and the output of $F(x_{l-1})$. The shortcut connection doesn't add extra parameters nor extra computational complexity.
The add operation can be implemented in Keras by the add() merge function. However, both $F(x_{l-1})$ and $x$ should have the same dimensions. If the dimensions are different, for example, when changing the feature map size, we should perform a linear projection on $x$ so as to match the size of $F(x_{l-1})$. In the original paper, the linear projection for the case when the feature map size is halved is done by a Conv2D with a 1 × 1 kernel and strides=2.
Back in Chapter 1, Introducing Advanced Deep Learning with Keras, we discussed that stride > 1 is equivalent to skipping pixels during convolution. For example, if strides=2, we skip every other pixel when we slide the kernel during the convolution process.
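Putting Equations 2.2.3 and 2.2.4 into code, the following is a minimal sketch of a residual block in Keras. The helper name residual_block and its arguments are hypothetical illustrations, not the implementation used later in this chapter:

from keras.layers import Conv2D, BatchNormalization, Activation, add

def residual_block(x, filters, downsample=False):
    """A sketch of one ResNet v1 residual block."""
    strides = 2 if downsample else 1
    # residual mapping F(x): Conv2D-BN-ReLU-Conv2D-BN
    y = Conv2D(filters, 3, strides=strides, padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, 3, padding='same')(y)
    y = BatchNormalization()(y)
    if downsample:
        # linear projection: 1 x 1 kernel and strides=2
        # so that x matches the size of F(x)
        x = Conv2D(filters, 1, strides=strides, padding='same')(x)
    # element-wise addition with the shortcut, then ReLU (Equation 2.2.4)
    x = add([x, y])
    return Activation('relu')(x)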
The preceding Equations 2.2.3 and 2.2.4 both model the ResNet residual block operations. They imply that if the deeper layers can be trained to have fewer errors, then there is no reason why the shallower layers should have higher errors.
Knowing the basic building blocks of ResNet, we're able to design a deep residual network for image classification. This time, however, we're going to tackle a more challenging and advanced dataset.
In our examples, we're going to use CIFAR10, which is one of the datasets the original paper was validated on. Keras provides an API to conveniently access the CIFAR10 dataset, as shown:
from keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
Like MNIST, the CIFAR10 dataset has 10 categories. The dataset is a collection of small (32 × 32) RGB real-world images of airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks, corresponding to the 10 categories. Figure 2.2.4 shows sample images from CIFAR10.
In the dataset, there are 50,000 labeled train images and 10,000 labeled test images for validation:

Figure 2.2.4: Sample images from the CIFAR10 dataset. The full dataset has 50,000 labeled train images and 10,000 labeled test images for validation.
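Before training, the pixel values are typically normalized and the labels one-hot encoded. The following is a minimal sketch of the usual CIFAR10 preprocessing; it is not the exact code of the chapter's script:

from keras.utils import to_categorical

# scale pixel values to the [0, 1] range
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
# one-hot encode the 10 category labels
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
input_shape = x_train.shape[1:]    # (32, 32, 3)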
For the CIFAR10 data, ResNet can be built using different network architectures, as shown in Table 2.2.1. The values of n and the corresponding ResNet architectures are validated in Table 2.2.2. Table 2.2.1 shows that we have three sets of residual blocks; each set has 2n layers, corresponding to n residual blocks. The extra layer at the 32 × 32 size is the first layer, which processes the input image.
The kernel size is 3, except for the transition between two feature maps of different sizes, which implements a linear mapping; for example, a Conv2D with a kernel size of 1 and strides=2. For the sake of consistency with DenseNet, we'll use the term Transition layer when we join two residual blocks of different sizes.
ResNet uses kernel_initializer='he_normal' in order to aid convergence during backpropagation [1]. The last layer is made of AveragePooling2D-Flatten-Dense. It's worth noting at this point that ResNet does not use dropout. It also appears that the add merge operation and the 1 × 1 convolution have a self-regularizing effect. Figure 2.2.4 shows the ResNet model architecture for the CIFAR10 dataset as described in Table 2.2.1.
The following listing shows the partial ResNet implementation in Keras. The code has been contributed to the Keras GitHub repository. From Table 2.2.2, we can also see that by modifying the value of n, we're able to increase the depth of the network. For example, for n = 18, we already have ResNet110, a deep network with 110 layers. To build ResNet20, we use n = 3:
n = 3

# model version
# orig paper: version = 1 (ResNet v1),
# Improved ResNet: version = 2 (ResNet v2)
version = 1

# computed depth from supplied model parameter n
if version == 1:
    depth = n * 6 + 2
elif version == 2:
    depth = n * 9 + 2

…

if version == 2:
    model = resnet_v2(input_shape=input_shape, depth=depth)
else:
    model = resnet_v1(input_shape=input_shape, depth=depth)
The resnet_v1() method is a model builder for ResNet. It uses a utility function, resnet_layer(), to help build the stack of Conv2D-BN-ReLU.

It's referred to as version 1 because, as we will see in the next section, an improved ResNet was proposed, which is called ResNet version 2, or v2. ResNet v2 has an improved residual block design over ResNet v1, resulting in better performance.

Table 2.2.1: ResNet network architecture configuration

Figure 2.2.4: The model architecture of ResNet for the CIFAR10 dataset classification
Table 2.2.2: ResNet architectures validated with CIFAR10
The following listing shows the partial code of resnet-cifar10-2.2.1.py, which is the Keras model implementation of ResNet v1:
from keras.layers import Input, Dense, Activation, Flatten
from keras.layers import AveragePooling2D, add
from keras.models import Model


def resnet_v1(input_shape, depth, num_classes=10):
    if (depth - 2) % 6 != 0:
        raise ValueError('depth should be 6n+2 (eg 20, 32, 44 in [a])')
    # Start model definition.
    num_filters = 16
    num_res_blocks = int((depth - 2) / 6)

    inputs = Input(shape=input_shape)
    x = resnet_layer(inputs=inputs)
    # Instantiate the stack of residual units
    for stack in range(3):
        for res_block in range(num_res_blocks):
            strides = 1
            if stack > 0 and res_block == 0:
                strides = 2    # downsample
            y = resnet_layer(inputs=x,
                             num_filters=num_filters,
                             strides=strides)
            y = resnet_layer(inputs=y,
                             num_filters=num_filters,
                             activation=None)
            if stack > 0 and res_block == 0:
                # linear projection residual shortcut connection
                # to match changed dims
                x = resnet_layer(inputs=x,
                                 num_filters=num_filters,
                                 kernel_size=1,
                                 strides=strides,
                                 activation=None,
                                 batch_normalization=False)
            x = add([x, y])
            x = Activation('relu')(x)
        num_filters *= 2

    # Add classifier on top.
    # v1 does not use BN after last shortcut connection-ReLU
    x = AveragePooling2D(pool_size=8)(x)
    y = Flatten()(x)
    outputs = Dense(num_classes,
                    activation='softmax',
                    kernel_initializer='he_normal')(y)

    # Instantiate model.
    model = Model(inputs=inputs, outputs=outputs)
    return model
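The listing assumes the utility function resnet_layer(), which is defined elsewhere in resnet-cifar10-2.2.1.py. Based on the arguments used above, a sketch of it might look as follows; the default values and the l2 weight decay are assumptions on our part:

from keras.layers import Conv2D, BatchNormalization, Activation
from keras.regularizers import l2

def resnet_layer(inputs,
                 num_filters=16,
                 kernel_size=3,
                 strides=1,
                 activation='relu',
                 batch_normalization=True):
    """Build a Conv2D stack with optional BN and activation."""
    x = Conv2D(num_filters,
               kernel_size=kernel_size,
               strides=strides,
               padding='same',
               kernel_initializer='he_normal',
               kernel_regularizer=l2(1e-4))(inputs)
    if batch_normalization:
        x = BatchNormalization()(x)
    if activation is not None:
        x = Activation(activation)(x)
    return x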
There are some minor differences from the original implementation of ResNet. In particular, we don't use SGD; instead, we use Adam, because ResNet converges more easily with Adam. We'll also use a learning rate (lr) scheduler, lr_schedule(), to schedule the decrease in lr at 80, 120, 160, and 180 epochs from the default of 1e-3. The lr_schedule() function is called after every epoch during training as part of the callbacks variable.
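A sketch of such a scheduler is shown below; only the boundary epochs (80, 120, 160, and 180) and the default of 1e-3 come from the text, while the decay factors are assumptions:

def lr_schedule(epoch):
    """Decrease the learning rate at 80, 120, 160, and 180 epochs."""
    lr = 1e-3
    if epoch > 180:
        lr *= 0.5e-3    # assumed decay factor
    elif epoch > 160:
        lr *= 1e-3
    elif epoch > 120:
        lr *= 1e-2
    elif epoch > 80:
        lr *= 1e-1
    return lr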
The other callback saves a checkpoint every time there is progress in the validation accuracy. When training deep networks, it is good practice to save the model or weight checkpoint, since it takes a substantial amount of time to train deep networks. When you want to use your network, all you need to do is reload the checkpoint, and the trained model is restored. This can be accomplished by calling the Keras load_model() function. The lr_reducer() callback is also included. In case the metric has plateaued before the scheduled reduction, this callback will reduce the learning rate by a certain factor if the validation loss has not improved after patience=5 epochs.
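These callbacks might be wired up as follows; the checkpoint filepath and the reduction factor are hypothetical choices:

import numpy as np
from keras.callbacks import ModelCheckpoint, LearningRateScheduler
from keras.callbacks import ReduceLROnPlateau

# save the model whenever the validation accuracy improves
checkpoint = ModelCheckpoint(filepath='resnet_cifar10.h5',    # hypothetical path
                             monitor='val_acc',
                             save_best_only=True)
# apply lr_schedule() at every epoch
lr_scheduler = LearningRateScheduler(lr_schedule)
# reduce lr if the validation loss plateaus for patience=5 epochs
lr_reducer = ReduceLROnPlateau(factor=np.sqrt(0.1),    # assumed factor
                               patience=5,
                               min_lr=0.5e-6)
callbacks = [checkpoint, lr_scheduler, lr_reducer]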
The callbacks variable is supplied when the model.fit() method is called. Similar to the original paper, the Keras implementation uses data augmentation, ImageDataGenerator(), to provide additional training data as part of the regularization scheme. As the amount of training data increases, generalization improves.
For example, a simple data augmentation is flipping a photo of a dog, as shown in the following figure (horizontal_flip=True). If it is an image of a dog, then the flipped image is still an image of a dog. You can also perform other transformations, such as scaling, rotation, whitening, and so on, and the label will still remain the same:

Figure 2.2.5: A simple data augmentation is flipping the original image
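To show how the pieces fit together, here is a minimal training sketch; the batch size, epoch count, and shift augmentation parameters are assumptions, so the chapter's script may differ:

from keras.optimizers import Adam
from keras.preprocessing.image import ImageDataGenerator

batch_size = 32    # assumed batch size

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=lr_schedule(0)),
              metrics=['accuracy'])

# randomly shift and flip images to generate additional training data
datagen = ImageDataGenerator(width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)
datagen.fit(x_train)

model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                    steps_per_epoch=len(x_train) // batch_size,
                    validation_data=(x_test, y_test),
                    epochs=200,    # assumed number of epochs
                    callbacks=callbacks)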
It's often difficult to exactly duplicate the implementation of the original paper, especially in the optimizer used and in the data augmentation, so there are slight differences in performance between the Keras ResNet implementation in this book and the model in the original paper.