Hands-On Deep Learning with Apache Spark

CNNs

The most common use cases of CNNs have to do with image processing, but they are not restricted to images – other types of input, such as audio or video, can be handled too. A typical use case is image classification: the network is fed with images and classifies them. For example, it outputs lion when you give it a lion picture, tiger when you give it a tiger picture, and so on. The reason this kind of network is used for image classification is that it requires relatively little preprocessing compared to other algorithms in the same space – the network learns the filters that, in traditional algorithms, were hand-engineered.

Being a multilayered neural network, a CNN consists of an input layer and an output layer, as well as multiple hidden layers. The hidden layers can be convolutional, pooling, fully connected, and normalization layers. Convolutional layers apply a convolution operation (https://en.wikipedia.org/wiki/Convolution) to an input before passing the result to the next layer. This operation emulates how an individual physical neuron responds to a visual stimulus. Each convolutional neuron processes only the data for its receptive field (the particular region of the sensory space of an individual sensory neuron in which a change in the environment modifies the firing of that neuron). Pooling layers combine the outputs of clusters of neurons in one layer into a single neuron in the next layer. There are different implementations of pooling: max pooling, which uses the maximum value from each cluster in the prior layer; average pooling, which uses the average value of each cluster in the prior layer; and so on. Fully connected layers, as their name suggests, connect every neuron in one layer to every neuron in another layer.
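To make the two core operations concrete, here is a minimal sketch in plain Scala (the object and method names are hypothetical, not taken from any library). It implements the sliding-window dot product performed by a convolutional layer (as in most deep learning frameworks, the kernel is not flipped) and a max pooling step:

object CnnOpsSketch {
  type Matrix = Array[Array[Double]]

  // Slides a k x k kernel over the input (stride 1, no padding): each output
  // value is the dot product of the kernel with one receptive field.
  def convolve(input: Matrix, kernel: Matrix): Matrix = {
    val k = kernel.length
    Array.tabulate(input.length - k + 1, input(0).length - k + 1) { (r, c) =>
      (for (i <- 0 until k; j <- 0 until k)
        yield input(r + i)(c + j) * kernel(i)(j)).sum
    }
  }

  // Max pooling: each non-overlapping size x size cluster of the input
  // collapses to its maximum (use .sum / (size * size) for average pooling).
  def maxPool(input: Matrix, size: Int): Matrix =
    Array.tabulate(input.length / size, input(0).length / size) { (r, c) =>
      (for (i <- 0 until size; j <- 0 until size)
        yield input(r * size + i)(c * size + j)).max
    }

  def main(args: Array[String]): Unit = {
    val image  = Array.tabulate(5, 5)((r, c) => (r * 5 + c).toDouble) // toy 5 x 5 "image"
    val kernel = Array(Array(1.0, 0.0), Array(0.0, -1.0))             // hand-picked 2 x 2 filter
    val featureMap = convolve(image, kernel)  // 4 x 4 feature map
    val pooled     = maxPool(featureMap, 2)   // 2 x 2 after pooling
    pooled.foreach(row => println(row.mkString(" ")))
  }
}

In a real CNN, of course, the kernel values are not hand-picked: they are the weights the network learns during training.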

CNNs don't parse all the training data at once; they usually start with a sort of input scanner. For example, consider an image of 200 x 200 pixels as input. In this case, the model doesn't have a layer with 40,000 nodes, but a scanning input layer of 20 x 20, which is fed the first 20 x 20 pixels of the original image (usually starting in the upper-left corner). Once we have passed that input (and possibly used it for training), we feed it the next 20 x 20 pixels (this will be explained in more detail in Chapter 5, Convolutional Neural Networks; the process is similar to the movement of a scanner, one pixel to the right). Please note that the image isn't dissected into 20 x 20 blocks; the scanner moves over it. This input data is then fed through one or more convolutional layers. Each node of those layers only has to work with its close neighboring cells – not all of the nodes are connected to each other. The deeper a network becomes, the more its convolutional layers shrink, typically by a factor that divides the size of the input (if we started with a layer of 20, then, most probably, the next one would be a layer of 10 and the following one a layer of 5). Powers of two are commonly used as these factors.
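As a sketch of how such a stack of progressively shrinking layers might be declared with the DeepLearning4j API (assuming the deeplearning4j-core and nd4j dependencies are on the classpath; the kernel sizes, filter counts, and class count here are arbitrary choices for illustration, not this book's own example):

import org.deeplearning4j.nn.conf.NeuralNetConfiguration
import org.deeplearning4j.nn.conf.inputs.InputType
import org.deeplearning4j.nn.conf.layers.{ConvolutionLayer, DenseLayer, OutputLayer, SubsamplingLayer}
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.nd4j.linalg.activations.Activation
import org.nd4j.linalg.lossfunctions.LossFunctions

val conf = new NeuralNetConfiguration.Builder()
  .seed(42)
  .list()
  // Convolutional layer: 20 learned 5 x 5 filters scanning the input one pixel at a time
  .layer(0, new ConvolutionLayer.Builder(5, 5)
    .nIn(1)                  // one input channel (grayscale)
    .stride(1, 1)
    .nOut(20)
    .activation(Activation.RELU)
    .build())
  // Max pooling layer: each 2 x 2 cluster collapses to its maximum,
  // halving each spatial dimension
  .layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
    .kernelSize(2, 2)
    .stride(2, 2)
    .build())
  // Fully connected layer: every neuron connected to every neuron of the next layer
  .layer(2, new DenseLayer.Builder()
    .nOut(500)
    .activation(Activation.RELU)
    .build())
  .layer(3, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
    .nOut(10)                // for example, 10 target classes
    .activation(Activation.SOFTMAX)
    .build())
  .setInputType(InputType.convolutionalFlat(28, 28, 1)) // 28 x 28 grayscale input
  .build()

val model = new MultiLayerNetwork(conf)
model.init()

Note how the spatial dimensions shrink as data flows through the stack: the 5 x 5 convolution turns the 28 x 28 input into 24 x 24 feature maps, and the 2 x 2 max pooling with stride 2 halves those to 12 x 12.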

The following diagram (by Aphex34—own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374) shows the typical architecture of a CNN:

Figure 2.8