Summary: The Forward-Forward Algorithm: Some Preliminary Investigations – arXiv Vanity (www.arxiv-vanity.com)
10,090 words – HTML page
One Line
Building on decades of deep learning research, from Hinton's work in the 1980s to Wu et al. (2018), the Forward-Forward (FF) algorithm achieves 1.4% test error in experiments on the MNIST dataset.
Key Points
- The Forward-Forward (FF) algorithm is a new learning procedure for neural networks that can be used with low-power analog hardware and achieves 1.4% test error on the MNIST dataset.
- FF can be used for contrastive learning, in which input vectors are transformed into representation vectors and then into logits, and it can be used to perform supervised learning by treating real data vectors as positive examples and corrupted data vectors as negative examples.
- FF can be used to train recurrent neural networks that process video input, and is related to other contrastive learning techniques such as Boltzmann Machines and GANs.
- Backpropagation and Boltzmann Machines emerged in the 1980s as promising learning procedures for deep neural networks; however, Boltzmann Machines take too long to approach equilibrium, and there is no evidence in cortex of the detailed symmetry of connections they require.
- FF allows for simultaneous online weight updates in multiple layers and can be implemented with low-power analog hardware.
- Distillation can be used to transfer what has been learned by one piece of hardware to a different piece of hardware; normalization, mutual information maximization, layer-wise training, and generative adversarial nets have all been proposed for deep learning.
Summaries
190 word summary
Many researchers have contributed to the field of deep learning, including Wu et al. (2018), Welling et al. (2003), van den Oord et al. (2018), Srivastava et al. (2014), Scellier and Bengio (2017), Rumelhart et al. (1986), Rosenblatt (1958), Richards and Lillicrap (2019), Ren et al. (2022), Rao and Ballard (1999), Raghu et al. (2020), Pereyra et al. (2017), Osindero et al. (2006) and Krizhevsky and Hinton (2009).
Kohan et al. (2022, 2018) studied signal propagation and error forward-propagation, Kendall et al. (2020) proposed a learning technique for end-to-end analog networks based on equilibrium propagation, and Jabri and Flower (1992) proposed weight perturbation, a learning technique for analog VLSI networks. Hinton has made important contributions to neural information processing systems, such as training products of experts, distilling knowledge in a neural network, and representing part-whole hierarchies in a neural network. Backpropagation and Boltzmann Machines were proposed in the 1980s as promising learning procedures for deep neural networks, while FF is more suitable for real-time learning and for predicting the next character in a sequence. Experiments using the MNIST dataset with layers of 2000 neurons showed FF can achieve 1.4% test error without using regularizers.
384 word summary
FF has advantages over backpropagation, such as being able to learn through a black box or in real time, and can be used when details of the forward pass are unknown. In contrastive learning for supervised tasks, input vectors are transformed into representation vectors and then into logits, and FF can be used to perform this type of learning. Experiments using the MNIST dataset with layers of 2000 neurons showed FF can achieve 1.4% test error without using regularizers. FF was tested on CIFAR-10 and showed slightly worse performance than backpropagation, but it can also be used to predict the next character in a sequence.
In the 1980s, backpropagation and Boltzmann Machines emerged as promising learning procedures for deep neural networks. Boltzmann Machines take too long to approach equilibrium for practical use; contrastive learning can instead be used with other energy functions, such as the Hopfield energy. FF is also related to techniques such as GANs and self-supervised contrastive learning. FF can easily learn with the insertion of a black box between layers and can make big updates to the weights without using a learning rate.
Important contributions to neural information processing systems have been made by Dayan and Hinton (1992), Crick and Mitchison (1983), Chen et al. (2020b), Carandini and Heeger (2013), Bengio et al. (2007), Bachman et al. (2019), Ba et al. (2016b, 2016a), Guerguiev et al. (2017), Buchatskaya et al. (2020), Grathwohl et al. (2019), and Goodfellow et al. (2014). Kohan et al. (2022, 2018) studied signal propagation and error forward-propagation, Kendall et al. (2020) proposed a learning technique for end-to-end analog networks based on equilibrium propagation, and Jabri and Flower (1992) proposed weight perturbation, a learning technique for analog VLSI networks. Hinton (2002, 2014) proposed training products of experts, distilling knowledge in a neural network, and representing part-whole hierarchies in a neural network. Hinton and Sejnowski (1986) explored learning in Boltzmann Machines, and Gutmann and Hyvärinen (2010) proposed noise-contrastive estimation for unnormalized statistical models.
Many researchers have contributed to the field of deep learning, including Wu et al. (2018), Welling et al. (2003), van den Oord et al. (2018), Srivastava et al. (2014), Scellier and Bengio (2017), Rumelhart et al. (1986), Rosenblatt (1958), Richards and Lillicrap (2019), Ren et al. (2022), Rao and Ballard (1999), Raghu et al. (2020), Pereyra et al. (2017), Osindero et al. (2006) and Krizhevsky and Hinton (2009).
624 word summary
Wu et al. (2018), Welling et al. (2003), van den Oord et al. (2018), Srivastava et al. (2014), Scellier and Bengio (2017), Rumelhart et al. (1986), Rosenblatt (1958), Richards and Lillicrap (2019), Ren et al. (2022), Rao and Ballard (1999), Raghu et al. (2020), Pereyra et al. (2017), Osindero et al. (2006) and Krizhevsky and Hinton (2009) have made important contributions to the field of deep learning.
Kohan et al. (2022, 2018) explored signal propagation and error forward-propagation, Kendall et al. (2020) proposed a learning technique for end-to-end analog networks based on equilibrium propagation, and Jabri and Flower (1992) proposed weight perturbation, a learning technique for analog VLSI networks. Hinton (2002, 2014) proposed training products of experts, distilling knowledge in a neural network, and representing part-whole hierarchies in a neural network. Hinton and Sejnowski (1986) explored learning in Boltzmann Machines, and Gutmann and Hyvärinen (2010) proposed noise-contrastive estimation for unnormalized statistical models.
Bachman et al. (2019) and Ba et al. (2016b) explored learning representations by maximizing mutual information and layer normalization, respectively. Ba et al. (2016a) additionally proposed using fast weights to attend to the recent past. Guerguiev et al. (2017) explored deep learning with segregated dendrites, while Buchatskaya et al. (2020) proposed bootstrapping one's own latent. Grathwohl et al. (2019) suggested treating classifiers as energy-based models, and Goodfellow et al. (2014) presented generative adversarial nets.
Dayan and Hinton (1992) made important contributions to neural information processing systems, Crick and Mitchison (1983) proposed a function for dream sleep, Chen et al. (2020b) studied self-supervised models and contrastive learning of visual representations, Carandini and Heeger (2013) described normalization as a canonical neural computation, and Bengio et al. (2007) introduced greedy layer-wise training of deep networks.
FF may be capable of producing a generative model of images or video suitable for unsupervised learning. After a weight update, the change in a neuron's activity is determined by the scalar product of the change in its incoming weights with the input, together with layer normalization and the learning rate. Boltzmann Machine learning avoids explicitly propagating error derivatives by contrasting statistics obtained under two different external boundary conditions. FF can easily learn with the insertion of a black box between layers and can make big updates to the weights without using a learning rate. It also allows for simultaneous online weight updates in multiple layers.
However, Boltzmann Machines take too long to approach equilibrium for practical use and there is no evidence of detailed symmetry of connections in cortex. Contrastive learning can be used with other energy functions, such as Hopfield energy, and is also related to techniques such as Generative Adversarial Networks (GANs) and self-supervised contrastive learning.
In the 1980s, backpropagation and Boltzmann Machines emerged as promising learning procedures for deep neural networks. Boltzmann Machines are networks of stochastic binary neurons with pairwise connections. The Kullback-Leibler divergence between the data and model distributions on the visible neurons of the Boltzmann machine at thermal equilibrium is used to measure the mismatch between them.
A preliminary experiment was conducted using a static MNIST image and two or three intermediate layers of 2000 neurons each. FF was tested on CIFAR-10 and showed slightly worse performance than backpropagation. It can also be used to predict the next character in a sequence. For supervised tasks, contrastive learning transforms input vectors into representation vectors and then into logits. FF can be used to perform this type of learning by using real and corrupted data vectors. Experiments using the MNIST dataset show FF can achieve 1.4% test error without using regularizers, and this error can be further reduced by combining supervised and unsupervised learning.
FF has advantages over backpropagation, such as being able to learn through a black box or in real-time, and can be used when the precise details of the forward pass are unknown. It can also be used with low-power analog hardware, and is an alternative to reinforcement learning which suffers from high variance. Further investigation is needed to determine its usefulness.
1218 word summary
This paper introduces the Forward-Forward (FF) algorithm, a new learning procedure for neural networks that replaces backpropagation's forward and backward passes with two forward passes. FF has advantages over backpropagation, such as being able to learn through a black box or in real time, and can be used when the precise details of the forward pass are unknown. It can also be used with low-power analog hardware, and is an alternative to reinforcement learning, which suffers from high variance. Experiments using the MNIST dataset show FF can achieve 1.4% test error without using regularizers; this can be further reduced by combining supervised and unsupervised learning. Further investigation is needed to determine its usefulness. For supervised tasks, contrastive learning first transforms input vectors into representation vectors and then maps these to logits, which are used in a softmax to determine a probability distribution over labels. FF can be used to perform this type of learning by using real data vectors as positive examples and corrupted data vectors as negative examples.

Hybrid images were created to focus on the longer-range correlations in images that characterize shapes, and a network with four hidden layers of 2000 ReLUs was trained for 100 epochs, resulting in a test error rate of 1.37%. Local receptive fields without weight-sharing were also used, resulting in a test error of 1.16% after 60 epochs. For supervised learning with FF, the label is included in the input and negative data consists of an image with an incorrect label. After training with FF, it is possible to classify a test digit with a single forward pass and a softmax. To augment the training data, images were jittered by up to two pixels in each direction, resulting in 0.64% test error after 500 epochs of training, similar to a convolutional neural net trained with backpropagation.

This paper also aimed to understand how a recurrent neural network (RNN) could be used to process video input, and a preliminary experiment was conducted using a static MNIST image. Two or three intermediate layers of 2000 neurons each were used. The idea of using contextual predictions as a teaching signal has been around for a while, and an example net trained on MNIST achieved 1.31% test error. FF was tested on CIFAR-10 and showed slightly worse performance than backpropagation, but the gap did not increase with more hidden layers. Finally, FF is related to other contrastive learning techniques, such as Boltzmann Machines, and can be used to predict the next character in a sequence.

In the 1980s, backpropagation and Boltzmann Machines emerged as promising learning procedures for deep neural networks. Boltzmann Machines are networks of stochastic binary neurons with pairwise connections. Each neuron updates its state by turning on with a probability equal to the logistic of its total input from active neurons. This results in an equilibrium distribution in which each global configuration has a log probability proportional to its negative energy. The Kullback-Leibler divergence between the data and model distributions on the visible neurons of the Boltzmann machine at thermal equilibrium has a simple derivative for weights deep within the network.
However, Boltzmann Machines take too long to approach equilibrium for practical use and there is no evidence of detailed symmetry of connections in cortex. Contrastive learning can be used with other energy functions, such as the Hopfield energy, and is used very effectively in Grathwohl et al. (2019). Nock and Guillame-Bert (2022) use a similar trick with shared trees. FF can be viewed as a special case of a GAN, where the discriminative and generative models reuse the same hidden representations. SimCLR-like methods are a type of self-supervised contrastive learning that measures agreement between representations of crops from the same image and disagreement between crops from two different images. A problem with stacked contrastive learning is that it restricts the system's capacity, which can be solved by dividing each layer into small blocks and forcing each block to decide between positive and negative cases using the length of its pre-normalized activity vector (Löwe et al., 2019). After a weight update, the change in a neuron's activity is determined by the scalar product of the change in its incoming weights with the input, together with layer normalization and the learning rate. Boltzmann Machine learning avoids explicitly propagating error derivatives by contrasting statistics obtained under two different external boundary conditions. FF can easily learn with the insertion of a black box between layers, allowing for quick adaptation to new data. The perceptron convergence procedure has the property that big updates to the weights can be made without using a learning rate, as long as the positive and negative examples are linearly separable. FF also allows for simultaneous online weight updates in multiple layers.
Computer Science separates software from hardware, which makes it possible to copy programs to multiple computers and to run models in parallel. One implication of deep learning for computer design is that the same programs and weights can be run on different hardware. A more energy-efficient way to multiply an activity vector by a weight matrix is to implement activities as voltages and weights as conductances. However, this does not allow for an efficient implementation of backpropagation, so people have had to resort to A-to-D converters and digital computations. It is still possible to use analog hardware for FF and to implement backpropagation inside the black box.
FF may be capable of producing a generative model of images or video suitable for unsupervised learning, but its scalability to large neural networks is yet to be seen. Distillation works best when the teacher provides highly informative outputs, which may be one of the major functions of language. There is also a more biological way of transferring what has been learned by one piece of hardware to a different piece of hardware through distillation. Goodfellow et al. (2014) presented generative adversarial nets, and Dayan and Hinton (1992) made further contributions to neural information processing systems. Crick and Mitchison (1983) proposed a function for dream sleep, and Chen et al. (2020b) investigated self-supervised models and contrastive learning of visual representations. Carandini and Heeger (2013) and Bengio et al. (2007) discussed normalization as a canonical neural computation and greedy layer-wise training of deep networks.
Bachman et al. (2019) and Ba et al. (2016b) explored learning representations by maximizing mutual information and layer normalization, respectively. Ba et al. (2016a) additionally proposed using fast weights to attend to the recent past. Guerguiev et al. (2017) explored deep learning with segregated dendrites, while Buchatskaya et al. (2020) proposed bootstrapping one's own latent. Grathwohl et al. (2019) suggested treating classifiers as energy-based models, and Goodfellow et al. (2014) presented generative adversarial nets.
Krizhevsky and Hinton (2009) proposed learning layers of features from tiny images, while Kohan et al. (2022, 2018) explored signal propagation and error forward-propagation in deep learning. Kendall et al. (2020) proposed a learning technique for end-to-end analog networks based on equilibrium propagation, and Jabri and Flower (1992) proposed weight perturbation, a learning technique for analog VLSI networks. Hinton (2002, 2014) proposed training products of experts, distilling knowledge in a neural network, and representing part-whole hierarchies in a neural network. Hinton and Sejnowski (1986) explored learning in Boltzmann Machines, and Gutmann and Hyvärinen (2010) proposed noise-contrastive estimation for unnormalized statistical models.
Wu et al. (2018), Welling et al. (2003), van den Oord et al. (2018), Srivastava et al. (2014), Scellier and Bengio (2017), Rumelhart et al. (1986), Rosenblatt (1958), Richards and Lillicrap (2019), Ren et al. (2022), Rao and Ballard (1999), Raghu et al. (2020), Pereyra et al. (2017), Osindero et al. (2006), and Krizhevsky and Hinton (2009) have made important contributions to the field of deep learning.
2977 word summary
Recent publications include Wu et al. (2018) for pattern recognition, Welling et al. (2003) for extreme components analysis, van den Oord et al. (2018) for representation learning, Srivastava et al. (2014) for dropout, Scellier and Bengio (2017) for equilibrium propagation, Rumelhart et al. (1986) for backpropagation, Rosenblatt (1958) for the perceptron, Richards and Lillicrap (2019) for dendritic solutions, Ren et al. (2022) for scaling forward gradient, Rao and Ballard (1999) for predictive coding, Raghu et al. (2020) for teaching with commentaries, Pereyra et al. (2017) for regularizing neural networks, Osindero et al. (2006) for topographic product models, Nock and Guillame-Bert (2022) for generative trees, Löwe et al. (2019) for end-to-end learning, Lillicrap et al. (2020) for backpropagation and the brain, and Lillicrap et al. (2016) for synaptic feedback weights. Krizhevsky and Hinton (2009) proposed learning multiple layers of features from tiny images. Kohan et al. (2022, 2018) explored signal propagation and error forward-propagation as ways of learning and propagating errors in a forward pass. Kendall et al. (2020) proposed a technique for training end-to-end analog networks with equilibrium propagation, while Jabri and Flower (1992) proposed weight perturbation, an architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. Hinton (2002, 2014) proposed training products of experts, distilling the knowledge in a neural network, and representing part-whole hierarchies in a neural network. Hinton and Sejnowski (1986) explored learning in Boltzmann Machines, and Gutmann and Hyvärinen (2010) proposed noise-contrastive estimation as a new estimation principle for unnormalized statistical models.
Guerguiev et al. (2017) explored deep learning with segregated dendrites, while Buchatskaya et al. (2020) proposed bootstrapping one's own latent as a new approach to self-supervised learning. Grathwohl et al. (2019) suggested treating classifiers as energy-based models, and Goodfellow et al. (2014) presented generative adversarial nets. Dayan and Hinton (1992) made further contributions to neural information processing systems, and Crick and Mitchison (1983) proposed a function for dream sleep. Chen et al. (2020b) investigated big self-supervised models and a simple framework for contrastive learning of visual representations. Carandini and Heeger (2013) and Bengio et al. (2007) discuss normalization as a canonical neural computation and greedy layer-wise training of deep networks. Bachman et al. (2019) and Ba et al. (2016b) present learning representations by maximizing mutual information across views, and layer normalization. Lastly, Ba et al. (2016a) explore using fast weights to attend to the recent past.

FF may be capable of producing a generative model of images or video that is suitable for unsupervised learning. However, its scalability to large neural networks is yet to be seen. Distillation works best when the teacher provides highly informative outputs that reveal a lot about the teacher’s internal representations, and this may be one of the major functions of language. There is also a more biological way of transferring what has been learned by one piece of hardware to a different piece of hardware through distillation, which also optimizes for generalization. The separation of software from hardware is a fundamental concept in Computer Science and has numerous advantages, such as being able to copy a program to multiple computers and to run models in parallel. However, if we were to accept that the knowledge contained in programs or weights is mortal and dies with the hardware, we could save energy and costs in fabricating hardware. The research community has been slow to understand the implications of deep learning for computer design, such as allowing the same program or weights to be run on different hardware. A more energy-efficient way to multiply an activity vector by a weight matrix is to implement activities as voltages and weights as conductances, but this does not allow for an efficient implementation of the backpropagation procedure. Therefore, people have had to resort to using A-to-D converters and digital computations. In spite of this, it is still possible to use analog hardware for FF and implement the backpropagation procedure inside the black box.

Unlike backpropagation, FF can easily learn with the insertion of a black box between layers. This black box can be a neural net with a few hidden layers that learn very slowly. This allows the FF learning to adapt quickly to new data, while the slow learning in the black boxes improves the system over a longer timescale. The perceptron convergence procedure (Rosenblatt, 1958) has the property that big updates to the weights can be made without using a learning rate. However, this only works if the positive and negative examples are linearly separable. It is possible to perform simultaneous online weight updates in many different layers with FF because the weight update does not change the layer-normalized output for an input vector.
After a weight update, the change in the activity of neuron j is given by the scalar product Δw_j · x, where Δw_j is the change in the vector of incoming weights of neuron j and x is the input vector. If the update is proportional to the activity y_j of the ReLU before layer normalization (that is, Δw_j = ε y_j x, with ε the learning rate), then the only term in this change that depends on j is y_j, so all the hidden activities change by the same proportion and the layer-normalized output is unchanged. Boltzmann Machine learning avoids explicitly propagating error derivatives by contrasting statistics obtained under two different external boundary conditions. This approach is used in Restricted Boltzmann Machines (RBMs) and stacked autoencoders (Bengio et al., 2007).
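A small numerical check of this scaling argument, assuming a single ReLU layer whose output is normalized by its length and a weight update proportional to y_j x as described above (the names and sizes are illustrative, not the paper's code):

```python
import numpy as np

# Check that an update of the form delta_w_j = eps * y_j * x leaves the
# layer-normalized output unchanged (illustrative sketch, not the paper's code).
rng = np.random.default_rng(0)
x = rng.normal(size=50)                       # input vector
W = rng.normal(size=(20, 50)) * 0.1           # one row of incoming weights per neuron

def relu_layer(W, x):
    y = np.maximum(0.0, W @ x)                # ReLU activities before normalization
    return y, y / (np.linalg.norm(y) + 1e-8)  # pre-norm activities, length-normalized output

y, normed = relu_layer(W, x)
eps = 0.5                                     # deliberately large "learning rate"
W_new = W + eps * np.outer(y, x)              # delta_w_j = eps * y_j * x for every neuron j
y_new, normed_new = relu_layer(W_new, x)

print(np.allclose(normed, normed_new))        # True: normalized output is unchanged
print(y_new[y > 0] / y[y > 0])                # all active units scale by the same factor
```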
A problem with stacked contrastive learning is that it restricts the system's capacity. This can be addressed by dividing each layer into small blocks and forcing each block to decide between positive and negative cases using the length of its pre-normalized activity vector; an example of this approach can be found in Löwe et al. (2019). SimCLR-like methods are a type of self-supervised contrastive learning in which an objective function favors agreement between the representations of two different crops from the same image and disagreement between the representations of crops from two different images. These methods use many layers to extract representations of the crops and train these layers by backpropagation. However, they don't work if the two crops always overlap in the same way. FF uses a different way to measure agreement, which is more flexible and does not require dividing the input into two separate sources. FF can be viewed as a special case of a GAN in which the discriminative and generative models reuse the same hidden representations, eliminating mode collapse and problems arising from one model learning too fast relative to the other. The generative model only needs to learn how to convert the hidden representations into generated data. A similar trick has been used with shared trees (Nock and Guillame-Bert, 2022). Retrospectively, it is obvious that contrastive learning can be used with many other energy functions, such as the Hopfield energy; this approach is used very effectively by Grathwohl et al. (2019). It also eliminates the need for sampling from the equilibrium distribution, which improves efficiency.

The Boltzmann Machine combines two ideas: minimizing free energy on real data and maximizing free energy on negative data generated by the network itself. This mathematical simplicity comes at a high price, however, as the network needs to approach its equilibrium distribution for the learning to be effective, which can take too long for practical use. Additionally, there is no evidence for detailed symmetry of connections in cortex, and the learning procedure fails if many positive updates of the weights are followed by many negative updates.
In the 1980s, two promising learning procedures for deep neural networks emerged: backpropagation and Boltzmann Machines (Hinton and Sejnowski, 1986). Boltzmann Machines are networks of stochastic binary neurons with pairwise connections that have the same weight in both directions. The neurons update their states stochastically, each turning on with a probability equal to the logistic of its total input from active neurons. This results in an equilibrium distribution in which each global configuration has a log probability proportional to its negative energy. The Kullback-Leibler divergence between the data distribution and the model distribution on the visible neurons of the freely running Boltzmann machine at thermal equilibrium has a simple derivative with respect to any weight w_ij in the network. This result gives derivatives for weights deep within the network without ever propagating error derivatives explicitly.
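The following sketch illustrates these dynamics, assuming a small fully connected network of stochastic binary neurons with symmetric weights and no biases (sizes and names are illustrative assumptions, not the paper's code):

```python
import numpy as np

# Illustrative sketch of Boltzmann Machine dynamics: stochastic binary neurons with
# symmetric pairwise weights, updated by Gibbs sampling.
rng = np.random.default_rng(0)
n = 8
W = rng.normal(size=(n, n)) * 0.1
W = (W + W.T) / 2                      # same weight in both directions
np.fill_diagonal(W, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def energy(s):
    return -0.5 * s @ W @ s            # at equilibrium, log p(s) is proportional to -energy(s)

s = rng.integers(0, 2, size=n).astype(float)
for _ in range(1000):                  # approach the equilibrium distribution
    i = rng.integers(n)
    p_on = sigmoid(W[i] @ s)           # probability of turning on = logistic of total input
    s[i] = float(rng.random() < p_on)

# The classic learning rule, i.e. the derivative of the KL divergence w.r.t. w_ij:
#   delta_w_ij is proportional to <s_i s_j>_data - <s_i s_j>_model,
# the correlations with the data clamped on the visible units minus the
# correlations when the network runs freely; no error derivatives are propagated.
```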
The Forward-Forward algorithm (FF) is related to other contrastive learning techniques, such as Boltzmann Machines, and it can be used to predict the next character in a sequence from a window of the previous ten characters. However, alternating between thousands of weight updates on positive data and thousands on negative data only works if the learning rate is very low and the momentum is extremely high. The positive data is processed while awake, and the negative data is created by the network itself and processed during sleep.
Table 1 shows a comparison of backpropagation and FF on CIFAR-10 using non-convolutional nets with local receptive fields of size 11x11 and 2 or 3 hidden layers. After training with FF, a single forward pass through the net is a quick but sub-optimal way to classify an image. It is better to run the net with a particular label as the top-level input and record the average goodness over the middle iterations; this allows the label with the highest goodness to be selected. We tested a net trained with FF on CIFAR-10 (Krizhevsky and Hinton, 2009), which has 50,000 training images that are 32 x 32 with three color channels for each pixel. The images have complicated backgrounds that are highly variable and cannot be modeled well given such limited training data. To evaluate, we used either a single forward pass or 10 iterations with the image and each label, accumulating the goodness for that label over iterations 4 to 6. The test performance of FF is slightly worse than backpropagation, but the gap does not increase with more hidden layers. Backpropagation reduces the training error much more quickly.
The networks contained two or three hidden layers of 3072 ReLUs each, with 11 x 11 receptive fields in the layer below and up to 363 top-down inputs from an 11 x 11 receptive field in the layer above. We used weight-decay to reduce overfitting. The results show that with plenty of hidden units, FF is comparable in performance to backpropagation for images that contain highly variable backgrounds. The idea of using contextual predictions as a teaching signal for local feature extraction in neural networks has been around for a while, but it has been difficult to make it work using spatial context instead of temporal context. To reduce the problem, negative data can be generated by using probabilities from a single forward pass. This makes training more efficient. An example net trained on MNIST achieved 1.31% test error after 60 epochs. A preliminary experiment was conducted to determine if a recurrent neural network (RNN) could be used to process video input. The experiment used a static MNIST image to test the approach. Two or three intermediate layers each with 2000 neurons were used, and an alternating update was designed to avoid biphasic oscillations. Further experiments used synchronous updates with pre-normalized states set to 0.3 of the previous state plus 0.7 of the computed new state.
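A minimal sketch of the damped synchronous update described above, assuming a stack of two hidden layers driven by a static image at the bottom and a label vector at the top (all sizes, initializations, and names are illustrative assumptions, not the paper's code):

```python
import numpy as np

# Damped, synchronous recurrent updates on a static input.
rng = np.random.default_rng(0)

def normalize(v, eps=1e-8):
    return v / (np.linalg.norm(v) + eps)

image = rng.random(784)                 # static MNIST-sized input, fed at every step
label = np.zeros(10); label[3] = 1.0    # label vector used as the top-level input
h1 = np.zeros(2000)                     # two intermediate layers of 2000 neurons
h2 = np.zeros(2000)

W1_up = rng.normal(size=(2000, 784)) * 0.01     # image   -> layer 1
W1_down = rng.normal(size=(2000, 2000)) * 0.01  # layer 2 -> layer 1
W2_up = rng.normal(size=(2000, 2000)) * 0.01    # layer 1 -> layer 2
W2_down = rng.normal(size=(2000, 10)) * 0.01    # label   -> layer 2

for t in range(10):
    new_h1 = np.maximum(0.0, W1_up @ image + W1_down @ normalize(h2))
    new_h2 = np.maximum(0.0, W2_up @ normalize(h1) + W2_down @ label)
    # synchronous, damped update: 0.3 of the previous pre-normalized state
    # plus 0.7 of the newly computed state
    h1 = 0.3 * h1 + 0.7 * new_h1
    h2 = 0.3 * h2 + 0.7 * new_h2
```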
The aim of this paper was understanding, not performance on benchmarks, so the simplest possible architecture was used. The earlier image-classification examples used feed-forward neural networks that were learned one layer at a time, which is a major limitation compared to backpropagation. To overcome this, a recurrent neural network was used to model top-down effects in perception.
To augment the training data, images were jittered by up to two pixels in each direction to get 25 different shifts of each image. After 500 epochs of training with this augmented data, a test error of 0.64% was obtained, similar to a convolutional neural net trained with backpropagation, and interesting receptive fields were observed in the first hidden layer. After training with FF, it is possible to classify a test digit with a single forward pass and a softmax; this is quick but sub-optimal. It is better to run the net with each label in turn as part of the input, accumulate the goodness of all but the first hidden layer, and choose the label with the highest goodness. MNIST images contain a black border, which makes it easy to display what the first hidden layer learns. With 4 hidden layers each containing 2000 ReLUs and full connectivity between layers, a network gets 1.36% test error on MNIST after 60 epochs. Backpropagation takes about 20 epochs to reach similar test performance.
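A minimal sketch of this jitter augmentation, assuming 28x28 images shifted by up to two pixels with zero padding (illustrative, not the paper's code):

```python
import numpy as np

# Produce the 25 shifted versions of an image: 5 vertical shifts x 5 horizontal shifts,
# each by at most two pixels, with vacated pixels filled with zeros.
def jittered_versions(image):
    """image: 2-D array, e.g. a 28x28 MNIST digit."""
    h, w = image.shape
    shifted = []
    for dy in range(-2, 3):
        for dx in range(-2, 3):
            out = np.zeros_like(image)
            ys, yd = max(0, -dy), max(0, dy)
            xs, xd = max(0, -dx), max(0, dx)
            out[yd:h - ys, xd:w - xs] = image[ys:h - yd, xs:w - xd]
            shifted.append(out)
    return shifted   # 25 images per input image

# Example (assuming train_images holds flattened MNIST digits):
# versions = jittered_versions(train_images[0].reshape(28, 28))
```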
For supervised learning with FF, the label is included in the input. This is similar to the error forward-propagation and signal propagation procedures, but FF does not require the hidden representations of the label and the image to remain separate. Positive data consists of an image with the correct label and negative data consists of the same image with an incorrect label. FF should therefore ignore all features of the image that do not correlate with the label. Instead of fully connected layers, local receptive fields (without weight-sharing) can improve performance. An architecture with approximately 2000 hidden units per layer was used: the first layer used a 4x4 grid of locations with a stride of 6, a receptive field of 10x10 pixels, and 128 channels at each location; the second layer used a 3x3 grid with 220 channels at each grid point; and the third layer used a 2x2 grid with 512 channels. After training for 60 epochs, the test error was 1.16%. "Peer normalization" was used to keep the average activity of the hidden units at a sensible target value.
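A minimal sketch of how positive and negative examples might be constructed and how a test digit could be classified by accumulated goodness. Writing the label as a one-hot vector over the first 10 input values is an assumption made for illustration; the summary only says that the label is included in the input.

```python
import numpy as np

# Build positive/negative examples by embedding a (correct or incorrect) label in the
# input, and classify by running the net with each label and picking the highest goodness.
def embed_label(image_flat, label, n_classes=10):
    x = image_flat.copy()
    x[:n_classes] = 0.0
    x[label] = 1.0                           # one-hot label written into the first 10 inputs
    return x

def make_pos_neg(image_flat, true_label, rng, n_classes=10):
    pos = embed_label(image_flat, true_label)
    wrong = rng.choice([c for c in range(n_classes) if c != true_label])
    neg = embed_label(image_flat, wrong)     # same image, incorrect label
    return pos, neg

def goodness(layers, x):
    """Sum of squared ReLU activities, accumulated over all but the first hidden layer."""
    total, h = 0.0, x
    for i, W in enumerate(layers):           # layers: list of weight matrices
        h = np.maximum(0.0, W @ h)
        if i > 0:
            total += np.sum(h ** 2)
        h = h / (np.linalg.norm(h) + 1e-8)   # layer normalization before the next layer
    return total

def classify(layers, image_flat, n_classes=10):
    scores = [goodness(layers, embed_label(image_flat, c)) for c in range(n_classes)]
    return int(np.argmax(scores))            # label with the highest accumulated goodness
```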
To focus on the longer-range correlations in images that characterize shapes, hybrid images were created by adding together one digit image times a mask and a different digit image times the reverse of the mask (as shown in Figure 1 of the paper). Masks were created by starting with a random bit image, repeatedly blurring it with a filter of the form [1/4, 1/2, 1/4] in both the horizontal and vertical directions, and then thresholding the result at 0.5. After training a network with four hidden layers of 2000 ReLUs each for 100 epochs, the test error rate was 1.37%. Contrastive learning can be used for a supervised task by first transforming input vectors into representation vectors and then learning a linear transformation of these vectors into logits, which are used in a softmax to determine a probability distribution over labels. FF can perform this type of learning by using real data vectors as positive examples and corrupted data vectors as negative examples. There are two main questions about FF: does it learn effective multi-layer representations that capture the structure in the data, and where does the negative data come from?
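A minimal sketch of the mask and hybrid-image construction described above, assuming 28x28 digit images; the number of blurring passes is not specified in the summary, so it is left as a parameter (illustrative, not the paper's code):

```python
import numpy as np

# Create a blurred, thresholded random mask and mix two digits with it to form
# a hybrid image used as negative data.
def make_mask(shape, n_blurs, rng):
    m = rng.integers(0, 2, size=shape).astype(float)   # random bit image
    kernel = np.array([0.25, 0.5, 0.25])                # the [1/4, 1/2, 1/4] filter
    for _ in range(n_blurs):
        m = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, m)  # horizontal blur
        m = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, m)  # vertical blur
    return (m > 0.5).astype(float)                       # threshold at 0.5

def make_hybrid(digit_a, digit_b, n_blurs, rng):
    mask = make_mask(digit_a.shape, n_blurs, rng)
    # one digit times the mask plus a different digit times the reverse of the mask
    return digit_a * mask + digit_b * (1.0 - mask)

rng = np.random.default_rng(0)
# hybrid = make_hybrid(img1.reshape(28, 28), img2.reshape(28, 28), n_blurs=4, rng=rng)
```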
A simple unsupervised example of FF was tested on MNIST, achieving 1.4% test error without using complicated regularizers. Sensibly-engineered convolutional neural nets with a few hidden layers typically get about 0.6% test error, and the error can be reduced to around 1.1% using regularizers such as dropout or label smoothing. It can be further reduced by combining supervised and unsupervised learning. The experiments in the paper use the MNIST dataset of handwritten digits, with 50,000 training images and 10,000 validation images. The official test set of 10,000 images is then used to compute the test error rate. MNIST is convenient for testing new learning algorithms.
The aim of this paper is to introduce the FF algorithm and show that it works in small neural networks containing a few million connections. A subsequent paper will investigate how well it scales to large networks.
FF uses a form of layer normalization that does not subtract the mean before dividing by the length of the activity vector. The goodness function for a layer is the sum of the squares of the activities of the rectified linear neurons in that layer, and the aim of learning is to make the goodness well above some threshold value for real data and well below that value for negative data. The Forward-Forward algorithm is a greedy multi-layer learning procedure inspired by Boltzmann machines and Noise Contrastive Estimation. It operates by making two forward passes, one on real data and one on "negative data", with opposite objectives.

FF is slower than backpropagation and does not generalize as well, so it is unlikely to replace it. It may be superior as a model of learning in cortex, or when used with very low-power analog hardware. Its main advantage is that it can be used when the precise details of the forward computation are unknown, and it can learn while pipelining sequential data without storing neural activities or propagating error derivatives. It is an alternative to reinforcement learning, which suffers from high variance.

Backpropagation is a popular method for performing stochastic gradient descent in deep learning, but it has serious limitations. It requires perfect knowledge of the forward pass and cannot be used to learn through a black box or in real time. Additionally, backpropagation through time, as a way of learning sequences, is implausible. Despite efforts to make it viable, there is no convincing evidence that cortex explicitly implements backpropagation. Nevertheless, its success in deep learning has established its effectiveness.

This paper introduces the Forward-Forward algorithm, a new learning procedure for neural networks. It replaces the forward and backward passes of backpropagation with two forward passes, one with positive data and the other with negative data. Each layer has its own objective function: to classify its input as positive or negative. The algorithm could be used in a pipelined fashion to process video without storing activities or propagating derivatives. Further investigation is needed to determine its usefulness.
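To make the procedure concrete, here is a minimal sketch of a single FF layer and greedy layer-wise training, assuming the goodness and normalization described above (sum of squared ReLU activities, a logistic function of goodness minus a threshold, and division of each layer's output by its length). The class name, threshold, learning rate, and exact loss form are illustrative assumptions, not the paper's precise recipe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class FFLayer:
    def __init__(self, n_in, n_out, lr=0.03, threshold=2.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)
        self.lr, self.threshold = lr, threshold

    def forward(self, x):
        y = np.maximum(0.0, self.W @ x)               # ReLU activities
        return y, y / (np.linalg.norm(y) + 1e-8)      # pre-norm activities, length-normalized output

    def update(self, x, positive):
        y, y_norm = self.forward(x)
        goodness = np.sum(y ** 2)                     # sum of squared activities
        p = sigmoid(goodness - self.threshold)        # P(positive) from goodness vs. threshold
        # gradient of the log-likelihood w.r.t. goodness: (1 - p) for positive data, -p for negative
        coef = (1.0 - p) if positive else -p
        self.W += self.lr * coef * 2.0 * np.outer(y, x)   # d(goodness)/dW = 2 * outer(y, x)
        return y_norm                                  # the next layer sees the normalized output

# Two forward passes per example, one on positive data and one on negative data,
# with each layer trained by its own local objective and no backward pass through the stack.
layers = [FFLayer(784, 500, seed=1), FFLayer(500, 500, seed=2)]

def train_on_example(x_pos, x_neg):
    h_pos, h_neg = x_pos, x_neg
    for layer in layers:
        h_pos = layer.update(h_pos, positive=True)
        h_neg = layer.update(h_neg, positive=False)
```

Because each layer has its own objective, the layers can be trained greedily from the normalized outputs of the layers below, which is what allows the simultaneous online weight updates described earlier.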