Hands-On Neural Networks with Keras

Information theory

Before we dive deeper into various network architectures and some hands-on examples, it would be a pity not to elaborate a little on the pivotal notion of gaining information through processing real-world signals. The science of quantifying the amount of information present in a signal is referred to as information theory. While we don't wish to provide a deep mathematical treatment of the subject, it is useful to know some background on learning from a probabilistic perspective.

Intuitively, learning that an unlikely event has occurred is more informative than learning that an expected event has occurred. If I were to tell you that you can buy food at all supermarkets today, I would not be met with gasps of surprise. Why? Well, I haven't really told you anything beyond your expectations. Conversely, if I told you that you cannot buy food at any supermarket today, perhaps due to some general strike, then you would be surprised. You would be surprised because an unlikely piece of information has been presented (in our case, the word not, inserted into the statement presented previously). Such intuitive knowledge is what we attempt to codify in the field of information theory. Other similar notions include the following:

  • An event with a lower likelihood of occurrence should have higher information content
  • An event with a higher likelihood of occurrence should have lower information content
  • An event with a guaranteed occurrence should have no information content
  • Independent events should have additive information content

Mathematically, we can satisfy all of these conditions by using the simple equation modeling the self-information of an event x, as follows:

$I(x) = -\ln P(x)$

I(x) is expressed in nats, the unit quantifying the amount of information gained by observing an event of probability 1/e. Although the preceding equation is nice and neat, it only allows us to deal with a single outcome; this is not too helpful in modeling the dependent complexities of the real world. What if we wanted to quantify the amount of uncertainty in an entire probability distribution of events? Then we employ another measure, known as Shannon entropy, as shown in the following equation:

$H(x) = \mathbb{E}_{x \sim P}[I(x)] = -\sum_{x} P(x) \ln P(x)$
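
To make these quantities concrete, here is a minimal NumPy sketch (the function names self_information and shannon_entropy are our own, chosen for illustration) that computes both measures in nats and verifies the properties listed earlier:

```python
import numpy as np

def self_information(p):
    """Self-information of an event with probability p, in nats."""
    return -np.log(p)

def shannon_entropy(dist):
    """Shannon entropy of a discrete probability distribution, in nats."""
    dist = np.asarray(dist, dtype=float)
    # By convention, 0 * log(0) is treated as 0, so we drop zero entries
    nonzero = dist[dist > 0]
    return -np.sum(nonzero * np.log(nonzero))

# A guaranteed event carries no information
print(self_information(1.0))   # 0.0

# Less likely events carry more information
print(self_information(0.5))   # ~0.693 nats
print(self_information(0.01))  # ~4.605 nats

# Independent events: information content is additive
p, q = 0.5, 0.01
print(np.isclose(self_information(p * q),
                 self_information(p) + self_information(q)))  # True

# Entropy is highest for a uniform (maximally uncertain) distribution
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.386 nats (ln 4)
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.168 nats
```

If you prefer to measure information in bits rather than nats, simply swap np.log for np.log2; the choice of logarithm base only rescales the result.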