# Le Cun and Backpropagation

A couple of posts ago I wrote about an interview with Yann Le Cun. I subsequently found this interesting note in Smith’s **Neural Networks for Statistical Modeling**:

Backpropagation is an example of multiple invention. David Parker (1982, 1985) and Yann LeCun (1986) working independently of each other and of the Rumelhart group, published similar discoveries. But none of these workers made the first discovery of backpropagation. That honor goes, belatedly, to Paul Werbos, whose 1974 Harvard Ph.D. thesis, *Beyond Regression*, contains the earliest exposition of the techniques involved (Werbos 1974). Werbos’ 1974 discovery had gone unappreciated, but Rumelhart, Hinton, and Williams’ 1986 discovery did not. It kindled a firestorm of interest in Neural Networks.

I don’t know any details of Le Cun’s discovery in 1986, but I’m curious to look it up. Note that it was published in the same year as the Rumelhart paper. Here’s the full reference:

Le Cun, Yann. 1986. Learning Processes in an Asymmetric Threshold Network. In *Disordered Systems and Biological Organization*, ed. E. Bienenstock, F. Fogelman Soulie, and G. Weisbuch. Berlin: Springer.

# Explanation of “multi-layer backpropagation Neural network called a Convolution Neural Network”

In my last post I said:

The technology behind the ATMs was developed by LeCun and others almost 10 years ago, at AT&T Bell Labs… The algorithm they developed goes under the name LeNet, and is a multi-layer backpropagation Neural network called a Convolution Neural Network. I will explain this terminology in my next post.

**ANNs** are mathematical models composed of interconnected nodes. Each node is a very simple processor: it collects input from either outside the system or from other nodes, makes a simple calculation (for example, summing the inputs and comparing the sum to a threshold value) and produces an output.
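To make the "very simple processor" concrete, here is a minimal sketch of one node that sums its inputs and compares the sum to a threshold. The function name `threshold_node` and the example numbers are my own, chosen purely for illustration.

```python
def threshold_node(inputs, threshold):
    """Sum the inputs; output 1 if the sum reaches the threshold, else 0."""
    total = sum(inputs)
    return 1 if total >= threshold else 0


# Hypothetical inputs: the first set sums to about 1.1, the second to about 0.8.
print(threshold_node([0.2, 0.5, 0.4], threshold=1.0))
print(threshold_node([0.2, 0.5, 0.1], threshold=1.0))
```

The node only answers yes or no, which is the simplest case; the nodes used in practice replace this hard comparison with a smoother function.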

ANNs are usually composed of several layers. There are a number of input nodes (imagine one node for each input). Similarly, there is an output layer. Between these two there can be one or more ‘hidden’ layers, so called because they only interact with other nodes and are therefore invisible outside the system. The addition of hidden nodes allows greater complexity in the system. Choosing the number and arrangement of hidden nodes is an important consideration in the design of an ANN. The description of LeNet as a “**multi-layer ANN**” indicates that one or more hidden layers are used.

“**Backpropagation**” networks, meaning networks trained with the backpropagation technique, are by far the most common type of ANN in use today. The development of the backpropagation technique was very significant and was responsible for reviving interest in ANNs. After the initial excitement due to Rosenblatt’s development of the Perceptron (in 1957), which some people (briefly) believed was an algorithmic panacea, researchers hit a brick wall due to the limitations of the perceptron. Minsky and Papert published one of the most important works in the field (the book *Perceptrons*, in 1969) that proved this limitation and drove the proverbial nail into the coffin. Work and interest in ANNs practically vanished. In the 1980s ANNs were revived by the work on backpropagation techniques by Rumelhart and others.

Which doesn’t really answer the question. What is backpropagation and how did it overcome these limitations? The question deserves a dedicated discussion, but imagine the flow of information through a multi-layer neural network such as the one in the picture above. We start at the input layer, where external input enters the system. The input nodes pass this along to the hidden nodes. Each hidden node sums up the input from several input nodes, and compares it to some threshold value. Two refinements are worth noting here. 1) This sum is in fact a weighted sum: we assign a weight to each connection, which determines how significant the sending node’s contribution will be. 2) We then employ a smooth mathematical function such as the logistic function to compare the sum to our ‘threshold’ value. This is done so that we don’t have to work with a hard-limiter function, which would give us a less useful yes/no. The result from our hidden node is passed along to one or more output nodes (unless there is another hidden layer). The output nodes follow the same process and then pass along their output, which leaves the system as the final output. Information has moved *forward* through our network.
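The forward flow described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper names (`logistic`, `layer_forward`, `forward`), the layer sizes, and every weight and threshold value are made up, and "comparing to the threshold" is modelled as subtracting the threshold from the weighted sum before applying the logistic function.

```python
import math


def logistic(x):
    """Smooth replacement for the hard limiter: maps any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, thresholds):
    """Compute one layer's outputs; weights[j][i] links input i to node j."""
    outputs = []
    for node_weights, threshold in zip(weights, thresholds):
        weighted_sum = sum(w * x for w, x in zip(node_weights, inputs))
        outputs.append(logistic(weighted_sum - threshold))
    return outputs


def forward(inputs, hidden_w, hidden_t, output_w, output_t):
    """Pass the inputs through the hidden layer, then the output layer."""
    hidden = layer_forward(inputs, hidden_w, hidden_t)
    return hidden, layer_forward(hidden, output_w, output_t)


# Hypothetical network: 2 inputs, 2 hidden nodes, 1 output node.
hidden_w = [[0.5, -0.4], [0.3, 0.8]]
hidden_t = [0.1, -0.2]
output_w = [[1.2, -0.7]]
output_t = [0.05]

hidden, output = forward([1.0, 0.0], hidden_w, hidden_t, output_w, output_t)
print(hidden, output)
```

Because the logistic function never returns exactly 0 or 1, every node reports a degree of activation rather than a flat yes/no.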

This final output will hopefully be correct, but, during training, it will be wrong or not sufficiently close to the right answer. We can tell the network which of these is the case — this is called ‘supervised learning’ — by calculating the error in each of the output nodes. Now we will go backwards through the system: each of the output nodes must adjust its output, and then pass back information to the hidden layer so that each of those nodes can also adjust its output. The way in which the network adjusts its output is by changing weights and threshold values. The exact method used to decide how much to adjust these values is clearly very important, but the general principle we have employed is the backwards propagation of errors — this is the revolutionary ‘backpropagation’ technique.
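Here is a rough sketch of that backwards pass for a network with one hidden layer and a single output node. It is not the method from any of the papers mentioned above, just plain gradient descent on the squared error, which is one common way to decide "how much to adjust". The XOR training set, the four hidden nodes, the learning rate, the random seed, and all names (`train_step`, `total_error`, `net`) are my own assumptions.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, net):
    """Forward pass: returns the hidden outputs and the single final output."""
    hidden = [logistic(sum(w * xi for w, xi in zip(ws, x)) - t)
              for ws, t in zip(net["hidden_w"], net["hidden_t"])]
    z = sum(w * h for w, h in zip(net["output_w"], hidden)) - net["output_t"]
    return hidden, logistic(z)


def train_step(x, target, net, rate):
    """One supervised update: measure the output error, then push it backwards."""
    hidden, output = forward(x, net)
    # Error signal at the output node (error times the slope of the logistic).
    delta_out = (target - output) * output * (1.0 - output)
    # Each hidden node receives its share of the error through its outgoing weight.
    delta_hidden = [delta_out * w * h * (1.0 - h)
                    for w, h in zip(net["output_w"], hidden)]
    # Adjust the output node's weights and threshold.
    for j, h in enumerate(hidden):
        net["output_w"][j] += rate * delta_out * h
    net["output_t"] -= rate * delta_out
    # Adjust each hidden node's weights and threshold.
    for j, d in enumerate(delta_hidden):
        for i, xi in enumerate(x):
            net["hidden_w"][j][i] += rate * d * xi
        net["hidden_t"][j] -= rate * d


def total_error(samples, net):
    return sum(0.5 * (t - forward(x, net)[1]) ** 2 for x, t in samples)


random.seed(0)  # fixed seed so the run is repeatable
n_hidden = 4    # assumed size, chosen only for this example
net = {
    "hidden_w": [[random.uniform(-1, 1) for _ in range(2)] for _ in range(n_hidden)],
    "hidden_t": [random.uniform(-1, 1) for _ in range(n_hidden)],
    "output_w": [random.uniform(-1, 1) for _ in range(n_hidden)],
    "output_t": random.uniform(-1, 1),
}

# XOR: the classic problem a single perceptron cannot solve.
xor_samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

error_before = total_error(xor_samples, net)
for epoch in range(5000):
    for x, target in xor_samples:
        train_step(x, target, net, rate=0.5)
error_after = total_error(xor_samples, net)

print("error before training:", error_before)
print("error after training: ", error_after)
```

The key line is the one computing `delta_hidden`: the hidden nodes never see the right answer directly, only the output node's error scaled by the weight that connects them to it.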

Which brings us to the final term: “convolution”, which I was completely unfamiliar with. After doing some research, here’s my first attempt at an explanation. Please email me or leave a comment if I have something wrong and I will make the necessary revisions.

A **convolution neural network** is a special architecture (arrangement of layers, nodes, and connections), commonly used in visual and auditory processing, that defines a specific spatial relationship between layers of nodes. Imagine we are trying to recognize objects from a picture, which we subdivide into a coarse grid and then subdivide further into progressively finer grids. We could define node connections in such a way that a single grid unit (pixel) from layer *l* corresponds only to a block of pixels in layer (*l* + 1). We use this limitation to increase efficiency. Since we are now dealing with many layers, we choose to create an interpretation of output several times, not just at the end. The first time we do this we look for coarse information, like edges, then something more refined. CNN architectures are usually characterized by local receptive fields, shared weights, and spatial or temporal subsampling.
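The three characteristics named above can be shown on a toy image. This is a sketch under my own assumptions, not LeNet itself: the 5x5 image, the hand-picked edge kernel, and the function names are invented for illustration, and I subsample by plain averaging. Like most CNN code, `convolve2d` slides the kernel without flipping it, which is strictly speaking a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide one small kernel over the image.

    Each output value looks only at a small block of pixels (a local
    receptive field), and the same kernel weights are reused at every
    position (shared weights).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def subsample(feature_map, size=2):
    """Spatial subsampling: average each non-overlapping size x size block."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[sum(feature_map[r * size + a][c * size + b]
                 for a in range(size) for b in range(size)) / (size * size)
             for c in range(cols)]
            for r in range(rows)]


# Toy 5x5 image with a vertical edge: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-picked kernel that responds where brightness increases from left to right.
edge_kernel = [[-1, 1],
               [-1, 1]]

feature_map = convolve2d(image, edge_kernel)  # 4x4 map, non-zero only at the edge
pooled = subsample(feature_map)               # 2x2 coarser summary

for row in feature_map:
    print(row)
print(pooled)
```

In a real CNN the kernel values are not hand-picked; they are the weights that backpropagation learns.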

[Update: check out a Matlab class for CNN implementation on the Matlab file exchange, by Mihail Sirotenko.]

Put together, LeCun tells us that LeNet is a “multi-layer backpropagation Neural network called a Convolution Neural Network”.

# Yann Le Cun

Yann Le Cun was recently featured on NYAS’s Science and the City podcast. He spoke about visual processing using Artificial Neural Networks (ANNs) and in particular his work on the system that reads the amounts on your cheques at the ATM. The host, Alana Range, notes that her ATM has never gotten her cheque amounts wrong. I myself no longer bother to check.

The technology behind the ATMs was developed by Le Cun and others almost 10 years ago, at AT&T Bell Labs [which, tragically, has been closed down]. The algorithm they developed [now] goes under the name LeNet, and is a multi-layer backpropagation Neural network called a Convolution Neural Network. I will explain this terminology in my next post. [Update: explanation now posted.]

In the object recognition demonstration, Yann Le Cun describes four output interpretations in the LeNet algorithm: 1) edges and contours, 2) motifs, 3) categories, and finally 4) objects. By narrowing down options in several steps LeNet can arrive at the final output (identifying the object) far more rapidly: the demonstration on the podcast processes 4-5 pictures each second, and can recognize five different objects. There are also demonstrations online on Yann Le Cun’s website.

I wondered as I listened to this podcast about the comparisons drawn between mammalian visual processing and Le Cun’s algorithm. There were some very pronounced differences that suggest that our brains utilize a totally different technique, not simply a more complex version of LeNet, as was implied in the interview. These differences were:

- LeNet utilizes supervised learning. Our learning is largely unsupervised.
- We extrapolate, not just interpolate. So while LeNet needs tens of thousands of samples before it starts recognizing an object, babies might see a few toy planes and recognize the one in the sky. In the interview, Le Cun notes that his algorithm, when used for letter recognition, needs to be trained with both printed and handwritten samples and is unable to extrapolate from one to the other.

I think there were some other differences that I do not now recall. I got flashbacks of the TED-talk in which Handspring founder Jeff Hawkins cracked some jokes about the fundamental differences between computer and human visual processing. I’ll try to post about that; it was pretty entertaining.
