Summary of Self-Expanding Neural Networks A Natural Gradient Approach

Summary Self-Expanding Neural Networks A Natural Gradient Approach arxiv.org

10,852 words - PDF document - View PDF document

One Line

SENN is a method that solves the problem of determining neural network size by starting small and expanding as necessary during training.

Slides

Slide Presentation (12 slides)

Copy slides outline Copy embed code Download as Word

Self-Expanding Neural Networks: A Natural Gradient Approach

Source: arxiv.org - PDF - 10,852 words - view

Introduction

• Self-Expanding Neural Networks (SENN) address the challenge of choosing the appropriate architecture size for a neural network.

• SENN proposes starting with a small architecture and expanding it as necessary during training.

• Two methods for expanding the network are width expansion and inserting a new layer.

Visual: Illustration of a small neural network expanding during training

Determining Network Capacity

• The addition of neurons or layers in SENN is determined based on a fractional increase in the squared norm of the gradient.

• A new neuron or layer is added if it provides a sufficient increase in the norm.

• The initial value of the norm determines the starting capacity of the network.

Bounded Successive Additions

• The maximum number of successive additions in a neural network is bounded.

• This ensures that the network does not grow indefinitely during training.

• Bounded additions help maintain computational efficiency.

Application in Regression and Classification

• SENN has been successfully applied in regression and classification tasks.

• The trace formula for SENN and the gradient for W are introduced.

• The correlation coefficient of new activations with residual gradients is a key factor in determining when to expand.

Adapting to Dataset Information

• SENN can adapt its size based on the amount of information in a dataset.

• Training SENNs on class-balanced subsets of the MNIST dataset has shown promising results.

• Dataset-specific adaptation improves performance and efficiency.

References

• The excerpt includes references to several papers related to neural networks and their expansion.

• Topics covered include deep convolutional neural networks, backpropagation, optimization methods, activation functions, and more.

Stopping Criterion for Expansions

• The stopping criterion for parameter expansions requires a reduction in loss of at least 12%.

• The maximum possible reduction in loss is 21%.

• The total number of added neurons is bounded by a certain value.

Visualization Experiments

• In visualization experiments, a threshold value of 2 is used.

• Higher thresholds result in longer training times but potentially better performance.

Visual: Comparison of visualizations with different threshold values

Image Classification Experiments

• For image classification experiments, threshold values of 1.007 and 1.03 are used.

• Threshold values impact the trade-off between accuracy and training time.

Visual: Accuracy and training time comparison for different threshold values

Conclusion

• Self-Expanding Neural Networks (SENN) provide a solution to the challenge of determining neural network size.

• SENN's natural gradient approach allows for adaptive expansion during training.

• Bounded additions and dataset-specific adaptation contribute to computational efficiency and improved performance.

Key Takeaways

• Self-Expanding Neural Networks (SENN) address the challenge of choosing the appropriate architecture size for a neural network.

• SENN proposes starting with a small architecture and expanding it as necessary during training.

• The addition of neurons or layers in SENN is determined based on a fractional increase in the squared norm of the gradient.

• The maximum number of successive additions in a neural network is bounded.

• Adaptation to dataset information improves performance and efficiency.

Key Points

Self-Expanding Neural Networks (SENN) address the challenge of choosing the appropriate architecture size for a neural network.
SENN proposes starting with a small architecture and expanding it as necessary during training.
Two methods for expanding the network are width expansion and inserting a new layer.
The addition of neurons or layers in SENN is determined based on a fractional increase in the squared norm of the gradient.
The maximum number of successive additions in a neural network is bounded.

Summaries

25 word summary

Self-Expanding Neural Networks (SENN) address the challenge of determining neural network architecture size by starting with a small architecture and expanding as needed during training.

42 word summary

This paper introduces Self-Expanding Neural Networks (SENN) as a solution to the challenge of determining the appropriate architecture size for a neural network. The authors propose starting with a small architecture and expanding it as needed during training. They present two methods

496 word summary

This paper introduces Self-Expanding Neural Networks (SENN), which address the challenge of choosing the appropriate architecture size for a neural network. Instead of starting with a large architecture, the authors propose starting with a small architecture and expanding it as necessary during

In this document, the authors present Self-Expanding Neural Networks (SENN), which address the problem of adding nodes to neural networks during training. They prove that the number of neurons added simultaneously in SENN is bounded and introduce a computationally efficient

This summary discusses the concept of Self-Expanding Neural Networks (SENN) and how to add more capacity to a neural network without changing the overall function. The authors propose two methods for expanding the network: width expansion and inserting a new layer.

SENN (Self-Expanding Neural Networks) determines when to add more capacity based on a fractional increase in the squared norm of the gradient. A new neuron or layer is added if it provides a sufficient increase in the norm. The initial value of

The summary of the text excerpt is as follows:

The paper discusses self-expanding neural networks (SENN) and proposes a natural gradient approach for training them. The authors argue that a network is considered converged when the changes in loss become small, and

The excerpt discusses Self-Expanding Neural Networks (SENN) and their application in regression and classification tasks. It introduces the trace formula for SENN and the gradient for W. The "correlation coefficient" of new activations with residual gradients is a

The study focuses on self-expanding neural networks (SENNs) and their ability to adapt their size based on the amount of information in a dataset. The researchers trained SENNs on class-balanced subsets of the MNIST dataset and found that the

The excerpted text includes a list of references to various research papers and conference proceedings related to neural networks and machine learning. The references cover topics such as deep convolutional neural networks, backpropagation, optimization methods, activation functions, and the use of

The summary of the text excerpt is as follows:

The excerpt includes references to several papers related to neural networks and their expansion. The authors prove Theorem 1, which states that the maximum number of successive additions in a neural network is bounded. They

The excerpt discusses the residual part of v p not predicted by v c and provides a proof for it. It also presents a block LDU decomposition and decomposes A ?1. The desired result is then obtained by substitution into v T A ?1

The stopping criterion for parameter expansions requires a reduction in loss of at least 12%. The maximum possible reduction in loss is 21%. The total number of added neurons is bounded by a certain value. If the true hessian of the loss is a

In the visualization experiments, a threshold value of 2 is used, while for the image classification experiments, threshold values of 1.007 and 1.03 are used for the whole dataset and variable subset experiments respectively. Higher thresholds result in longer

Raw indexed text (60,368 chars / 10,852 words / 1,033 lines)

Self-Expanding Neural Networks

Rupert Mitchell 1 Martin Mundt 1,2 Kristian Kersting 1,2,3,4

Department of Computer Science, TU Darmstadt, Darmstadt, Germany

Hessian Center for AI (hessian.AI), Darmstadt, Germany

German Research Center for Artificial Intelligence (DFKI), Darmstadt, Germany

Centre for Cognitive Science, TU Darmstadt, Darmstadt, Germany

{rupert.mitchell,martin.mundt,kersting}@cs.tu-darmstadt.de

Abstract

The results of training a neural network are heavily dependent on the architecture

chosen; and even a modification of only the size of the network, however small,

typically involves restarting the training process. In contrast to this, we begin

training with a small architecture, only increase its capacity as necessary for the

problem, and avoid interfering with previous optimization while doing so. We

thereby introduce a natural gradient based approach which intuitively expands

both the width and depth of a neural network when this is likely to substantially

reduce the hypothetical converged training loss. We prove an upper bound on

the “rate” at which neurons are added, and a computationally cheap lower bound

on the expansion score. We illustrate the benefits of such Self-Expanding Neural

Networks in both classification and regression problems, including those where the

appropriate architecture size is substantially uncertain a priori.

Introduction

Correctly tailoring a model’s capacity to an arbitrary task is extremely challenging, especially when

the latter is not yet well studied. This challenge can be side stepped by choosing an architecture which

is so large that a poor solution is nevertheless unlikely to occur [19], e.g. due to the double-descent

phenomenon. However, since it is hard to predict what size would be large enough this will often in

practice entail using a massively overparameterized network [22] [12] [11]. Surely it is possible to

detect that the existing capacity of the network is insufficient and add more neurons when and where

they are needed? In fact, biological neural networks are grown by adding new neurons to the existing

network through the process of neurogenesis. The popular review [9] discusses the relatively recent

discovery that this process is still active in the adult mammalian brain [23], and [13] [5] identify it as

a key ability underpinning lifelong learning. Thus inspired, we propose an analogous process for

adding both neurons and layers to an artificial neural network during training, based on a local notion

of “sufficient capacity” derived from first principles in close relation to the natural gradient [1] [17].

Any method for artificial neurogenesis must answer three questions to avoid the problem of locally

insufficient capacity [6]. It must determine when the current capacity is insufficient and that neuron(s)

must therefore be added. It must identify where these neurons should be introduced. Finally, it must

choose what initialization is appropriate for these neurons. These questions, if they are addressed

at all in the literature, are normally addressed piecemeal or in ad-hoc ways. For example, very few

methods address the question of what [6] [26]. When is answered either by assuming predetermined

schedules [26] [21], or by waiting for the training loss to converge [27] [25], neither of which are

informative about where. “Whenever you parry, hit, spring, ..., you must cut the enemy in the same

movement.” 1 Our metaphorical enemy is not a loss which is momentarily poor, or even one which is

Miyamoto Musashi, The Book of Five Rings (circa 1645)

Preprint. Under review.Table 1: Existing expansion methods’ answer to when, where, what and whether they consider depth.

SENN is the only approach to provide a cohesive answer to all three based on natural gradients.

METHOD

ActiveNAS [7]

DyNC [2]

NeurDL [5]

DENNs [27]

PrNNs [21]

GradMax [6]

SSD [25]

Firefly [26]

SENN (ours)

WHEN

end of training

converged loss

recon error

converged loss

at new task

future work

converged loss

N epochs

natural gradient

WHERE

argmax on presets

preset

recon error

preset then prune

preset

future work

loss reduction

vanilla gradient

natural gradient

WHAT

reinitialize

random

vanilla gradient

loss reduction

natural gradient

DEPTH?

Yes

converging to a poor value: it is a deficiency in our parameterization such that the optimizer cannot

make progress. We argue that by inspecting the degrees of freedom of the optimizer in function space,

one may not only strike faster in answer to when, but answer where and what in the same stroke.

From a mathematical perspective, these degrees of freedom available to the optimizer are given by the

image of the parameter space under the Jacobian, and the derivative with respect to the loss in function

space will not in general lie in this subspace. It is however possible to project this derivative onto

that subspace, and the natural gradient, F −1 g, is exactly the change in parameters which changes

the function according to this projection. In order to measure the size of that projection for a given

parameterization, we introduce the natural expansion score η = g T F −1 g. Specifically, the capacity

of a neural network is locally insufficient when this score is small for the current parameterization.

We therefore add neurons when this substantially increases η, where they will maximally increase

η, and choose what initialization to use for the new parameters according to how it increases η. To

summarize, our contributions are:

1. We introduce the natural expansion score which measures the increase in rate of loss

reduction under natural gradient descent when width or depth is added to a neural network.

2. We show how such additions may be made during training without altering the function

represented by the network. Our neurogenesis inspired Self-Expanding Neural Networks

(SENN) thus avoid interfering with previous optimization or requiring restarts of training.

3. We prove that the number of neurons added simultaneously in SENN is bounded. We further

introduce a computationally efficient approximation as a provable lower bound to increases

in natural expansion score resulting from additions.

4. We demonstrate SENN’s effectiveness for regression and classification.

In the remainder of this paper, we proceed as follows: In section 2 we summarize existing growth

methods, in section 3 we then describe SENN, and in section 4 we illustrate its operation in practice.

Related Methods for Growing Neural Networks

The problem of adding nodes to neural networks during training has been under consideration for

over 30 years (e.g. Dynamic Node Creation (DyNC) [2]), but remains substantially unsolved. There

does not seem to exist a unified answer to when , where and what , as we summarize in table 1. Most

methods cannot add depth and sideline at least one of these questions.

Inspired by neurogenesis like SENN, Draelos et al. [5] examine the case of representational learning

with stacked autoencoders, where they exploit local reconstruction error to determine when and

where to add neurons. Due to their more general setting, DyNC, Progressive NNs (PrNNs) [21]

and Dynamically Expandable NNs (DENNs) [27] use simple training loss convergence or even task

boundaries to answer when , but must then fall back on ad-hoc preset decisions for where . (However,

DENNs use subsequent pruning to mitigate the excess capacity introduced by the preset.) All four

methods freeze old neurons or continue training from their present values, but randomly initialize

new neurons in answer to what . While ActiveNAS [7] can add both width and depth, it does so by

2Figure 1: SENN is able to add width (orange, left side) and depth (green, right side) to a neural

network without changing the overall function. In the diagram, the model’s component functions

are composed vertically, e.g. σ 1 (W 1 x) gives the first set of hidden activations. ⊎ indicates concate-

nation along a hidden dimension, i.e. width addition. W 2 W p −1 indicates matrix multiplication, used

in depth expansion via the insertion of an identity function. (Best viewed in color.)

completely restarting training with a fresh initialization of the whole network after every modification.

It then waits for convergence, and uses preset answers to where similar to the previous methods.

The final cluster of three methods all aim to improve on random initialization as an answer to what.

Splitting Steepest Descent (SSD) [25] and Firefly [26] make small changes to the existing function

and answer what by optimizing the consequent loss reduction. The former answers when by waiting

for convergence and examining the loss, whereas the latter simply adds more capacity every N epochs.

Gradmax [6] is the closest to SENN in spirit, but is based on vanilla rather than natural gradient

descent. More importantly, potential extensions of the method to the where and when questions are

mentioned briefly and their investigation deferred to future work. All three of these latter methods are

only able to avoid redundancy of added neurons with existing neurons to the extent that the network

is already converged. Of these three, only GradMax completely avoids changing the overall function.

In contrast, SENN provides a monolithic answer to all three questions via the natural expansion score.

Self-Expanding Neural Networks

To provide a cohesive answer to when , where , what with Self-Expanding Neural Networks, we

start with the definition of the natural expansion score as the foundation:

Definition 1. The natural expansion score η = g T F −1 g is given by the inner product of the natural

gradient F −1 g with the gradient g.

With this definition we will describe how we add capacity without interfering with the existing

optimized parameters in section 3.1. We then in section 3.2 give an intuitive account of what our

score η measures, and why we use this to decide when to add capacity. Section 3.3 gives a more

mathematically precise account of the meaning of η, and what this says about what initializations

should be used for new capacity. Section 3.4 extends the argument of 3.3 to deciding where new

capacity should be added and whether it should be depth or width, allowing us to put the ingredients

of SENN together and summarize this combination. Finally, sections 3.5 and 3.6 cover the practical

questions of convergence guarantees and computational efficiency respectively.

3.1

How to add: expanding without changing the overall function

In order to explain how to add without changing the overall function, we will consider the illustration

in figure 1. This shows a perceptron with two hidden layers, each with three neurons. The number of

neurons a hidden layer may be increased by introducing a new copy of the activation function σ p

and connecting it to the neurons of the preceding layer with some linear transform W p . As shown

on the left of the figure, we connect the new neuron to the subsequent layer (in this case the output

y) with a linear transform initialized to zero. In doing so, we guarantee that we will not perturb the

function specified by the existing parameters. Although W p will initially receive zeroed gradients

3since the output transform is zero, this latter transform will immediately receive non-zero gradients

and thereby become non-zero. The new neuron may thus be used in future optimization.

In addition to width expansion, we now consider inserting an entirely new layer, as shown on the

right of figure 1. In essence, a particular linear transform, W 2 in the figure, is replaced with a single

layer perceptron. To this end, we assume our nonlinearity σ p to be parameterised, and there to exist a

choice of those parameters such that σ p = I is the identity. If we require the initial linear transform

W p of the inserted perceptron to be invertible (but otherwise arbitrary), then we may choose the

output linear transform of the perceptron to be the matrix product W 2 W p −1 . With these choices

made, the inserted perceptron is equivalent to the linear transform W 2 it replaces, and the overall

parameterized function once again remains unchanged. We thus have the first ingredient of SENN:

SENN Ingredient 1: How to add more capacity without changing the overall function.

We add proposed neurons p to layer i by concatenation along the the ith hidden dimension

(0 ⊎ W i+1 ) ◦ (σ p ⊎ σ i ) ◦ (W p ⊎ W i ) = W i+1 ◦ σ i ◦ W i , and initialize the output weights of p

to zero. We insert a new layer q by replacing some linear transform W i with the composition

(W i W q −1 ) ◦ (σ q = I) ◦ W q , where W q is invertible and σ q is initialized to be the identity.

We must therefore choose a suitable parameterized activation function. Rational activation functions

satisfy the above conditions and were shown to obtain good real world performance [18]. We use the

simplified parameterization σ θ (x) = αx + (β + γx)/(1 + x 2 ), where θ = {α, β, γ} are the three

parameters of σ, and setting θ = {1, 0, 0} results in the identity function, as required. Since this

parameter count is small, we do not share the activation function weights within our layers.

3.2

When to add: deciding whether more capacity is useful

Having decided how to add, perhaps the most natural way to evaluate the utility of making some

change to the parameterization is to ask what immediate effect this has on the total loss. However,

we cannot do this as we have assumed the overall function to remain unaltered. We must therefore

consider additional information such as the gradients of the function. Specifically, one can favor

adding neurons which maximally increase the euclidean norm of the gradients ||g|| 2 . As found in

Evci et al. [6] this norm functions well for selecting which neurons to add when the network is close

to convergence since it is a direct measure of the rate at which gradient descent will decrease the loss.

Unfortunately, comparing the gradient norms ||g|| 22 and ||g ′ || 22 for the current parameterization θ and

some new expanded parameterization θ ′ is insufficient to determine whether or not more capacity

is needed in the first place. This is primarily because it does not account for redundancy in the

parameterization: if there is some neuron a such that the gradients of the linear weights in the next

layer “listening” to it have some large norm ||g a || 2 , then we could introduce an exact copy of this

neuron a ′ for which the corresponding norm would also be ||g a ′ || 2 = ||g a || 2 . Since the squared

euclidean norm is additive across parameters, we could unboundedly increase ||g|| 22 just by adding

very many copies of this one neuron a. 2

In SENN, we avoid this problem with the following simple notion of redundancy. We are using our

parameters to express a point in function space. At some point in optimization we are therefore also

using them to express a small change in function space. There is some direction that our optimizer

“wants” to move in (i.e. the direction in function space which most quickly reduces the loss). We can

define new parameters as being useful in a way which is non-redundant with the old parameters to

the extent that they allow the optimizer to better express the direction in function space it “wants” to

move in. Our natural expansion score η = g T F −1 g captures this combined sense of usefulness and

non-redundancy in a way which will be made more mathematically precise in the next section. This

description of its function is sufficient, however, to justify our answer to when:

SENN Ingredient 2: When to add more capacity. A new neuron or layer will be helpful

and non-redundant if it provides a fractional increase in η = g T F −1 g greater than some

threshold τ . When we find a potential new neuron or layer for which this is true, we add it.

More generally, the same problem would occur when considering a new neuron c whose activations were

some linear combination of those of some existing neurons a and b.

4We defer specific choices for τ to section 3.4, at which point we may draw on the derivation of η.

3.3

What to add: determining the initial value of new neurons

The reader may at this point be expecting us to tackle the question of where additional capacity is

most useful, but this would put the cart before the horse. Additional capacity is useful to the extent

that it can be initialized in a way which is useful, which we now consider.

To simplify mathematical notation in this section, we consider the output y to be concatenated over

the entire training dataset. While the gradient of the loss with respect to the output g y tells us how

the loss changes for arbitrary changes in y, the only changes in y we can actually achieve with some

parameterization Θ are given by Jacobian product J t for some small parameter change t ∈ Θ.

Let P Θ be the orthogonal projection onto this space of directions in output space. The vector P Θ g y

is then the portion of g y which lies in the space of achievable output changes, and its squared norm

||P Θ g y || 22 is a scalar measure of how much of g y this portion is. The vector P Θ g y is the image under

the Jacobian J t of some tangent vector t in the parameter space. By the definition of orthogonal

projection, t minimizes ||J t − g y || 2 , but if there are redundant directions in Θ then there may exist

multiple such t. There is however a unique t ∗ which minimizes ||t|| 2 among those t which minimise

||J t − g y || 2 . The Moore-Penrose inverse, J + , of J is the unique matrix such that t ∗ = J + g y for

arbitrary g y . However, J is a map from parameter space to total output space, which depends on

dataset size N . This dependency can be avoided by working with maps from the parameter space to

itself, such as the following average over the dataset F = N 1 J T J , known as the Fisher information

matrix. The natural gradient is then given by F̂ −1 g, where g = N 1 J T g y is the gradient of the

loss with respect to the parameters averaged over the training set, and existence of F̂ = F + ϵI is

guaranteed by the addition of a small multiple ϵ of the identity. In the limit of small ϵ this is exactly

our t ∗ . 3 We are now able to rewrite the squared norm ||P Θ g y || 22 in the familiar form of definition 1:

||P Θ g y || 22 = t T ∗ J T J t ∗ = g T F −1 J T J F −1 g = N g T F −1 g = N η

(1)

Here, the factor of the dataset size N appears because the average gradient g and our η are normalized

according to the training set size. With this formula, we have now derived η from first principles and

may use it to choose between specific initializations, yielding our third SENN ingredient:

SENN Ingredient 3: What Initialization to Use. If θ ′ ∈ Θ ′ is an initialization of an

expanded parameterization Θ ′ such that the overall function remains unchanged (see section

3.1), then the best such initialization θ ∗ ′ is given by arg max θ ′ (η ′ ). When we add new

neurons or layers, we choose what initialization to use by this method.

3.4

Where to add: completing the algorithm

Much as the euclidean norm ||g|| 2 measures the rate of loss reduction according to vanilla gradient

descent, our η measures the rate of loss reduction according to natural gradient descent. This gives a

uniform way of comparing the effect of new capacity no matter where it is added in the network or

whether it takes the form of new neurons in an existing layer or a new layer. In particular, one may

compare the η values of the best initializations (see section 3.3) for each such variety of addition. 4

SENN Ingredient 4: Where to Add. A choice of whether to add width or depth, and

where in the network the new neuron/layer will be added, specifies a particular extension of

the current parameter space Θ ′ . We make those choices which correspond to the extension

Θ ′∗ = arg max Θ ′ arg max θ ′ (η ′ ) for which the best initialization is possible.

Our newfound knowledge of η as a rate of loss reduction in hand, we return to the question of

specifying the expansion threshold τ , which we deferred from section 3.2 in our previous answer

to when. An increase from the current natural expansion score η c to a new score η p due to some

In fact, an alternative definition of the Moore-Penrose inverse is: J + := lim ϵ→0 (J T J + ϵI) −1 J T

In general one can also adjust for the “size” of each addition in some relevant sense. We found it sufficient

to just penalize adding entire new layers versus single new neurons by some constant factor.

5proposed expansion p corresponds to an increase in the rate of loss reduction by natural gradient

descent. We define this increase to be “sufficient” when it corresponds to a relative increase η p /η c > τ

in loss reduction rate greater than the expansion threshold τ . For example, with the intuitive choice

τ = 2, each addition must at least double the rate of loss reduction. Following the well known

intuition that a network does not practically converge without setting the learning rate to zero, it is

generally considered to have converged once changes in loss become sufficiently small. In analogy

to monitoring plateaus in loss, we further require the increase in loss reduction resulting from new

capacity to surpass an absolute stopping criterion α. While we answer when, where and what

cohesively with η during training, we thus concur with all prior works on terminating training.

Overall, we may now summarize all ingredients of SENN on the basis of the natural expansion score:

SENN: Summary. When we add width or depth we do so without changing the overall

function. We add new capacity when this produces a relative increase in score η p /η c > τ

larger than the expansion threshold τ . We add new capacity where it would most increase η,

and choose what initialization to use in order to maximize η. We ensure the addition process

terminates by additionally comparing each ∆η to the absolute stopping criterion α, and not

adding capacity when η p − η c ≤ α.

3.5

Bounds on convergence of expansion

Consider repeatedly running our addition algorithm for a network with initial expansion score η 0 .

The expansion threshold τ guarantees that η i > τ η i−1 after the i-th addition. Since η = g T F −1 g is

the squared length of the projected gradient in output space ||P Θ g y || 2 , it is non-negative and bounded

above by η ≤ λ = ||g y || 22 . Since η i grows exponentially with i and is bounded above by λ the

maximum number of sequential additions i < N s increases logarithmically with λ. Specifically,

N s < (ln λ − ln η 0 )/ ln τ . This bound becomes large when η 0 is small, but we also know that η 1 > α

from the stopping criterion α.

Theorem 1 (Upper bound on the “rate” of neuron addition). The maximum number of additions

N s from repeatedly running the expansion algorithm is bounded: N s < 1 + (ln λ − ln α)/ ln τ .

ln 10

(Proof in supplementary material.) For example, if τ = 2 and α/λ > 10 −3 then N s < 1+ 3 ln

2 < 11.

Note that exponentially large ratios between α and λ produce only linearly large bounds on N s .

We now consider the number of additions N T made over the course of training with natural gradient

descent. Intuitively, λ is the total possible loss reduction and α is the minimum reduction which

justifies expanding the network. If every time we expand the network it only achieves this minimum

reduction then we must expand a total of roughly N T ≈ λ/α times. If the loss function has constant

curvature equal to the fisher F , then the total loss reduction possible with the current parameters

is given by 12 η and we have N T < λ/α exactly. More generally, we expect that when F is an

underestimate of the true curvature, η will overestimate the usefulness of new neurons causing N T to

be larger, and vice versa for F an overestimate. See supplementary for more in depth discussion.

3.6

Efficiently computing a lower bound on score increase

Recall that the natural expansion score η is given by the inner product of the gradient g with the

natural gradient F −1 g. Since working with the natural gradient can be challenging due to the matrix

inverse F −1 , we will make use of established approximation techniques. Specifically, when we

need the natural gradient for the whole network we will use the iterative conjugate gradient method,

as suggested for the Hessian in Martens [16], performing Fisher-vector multiplication cheaply via

auto-differentiation. When we require the inverse Fisher F l −1 for the linear transform in some

layer l considered in isolation, we approximate F l by the Kronecker product F l ≈ F̃ l = S l ⊗ A l ,

where A l is the second moment of the activations at the input of the linear transform, and S l is

given by the second moment of some distribution of gradients with respect to the output of the linear

transform. The relevant gradient distribution is determined by the choice of metric on the output

space implicit in the exact definition of F one is using, which for us is the euclidean metric. The

advantage of this Kronecker factorization is that F̃ l may be inverted by inverting A l and S l separately:

−1

F̃ l −1 = A −1

l ⊗ S l , which is much cheaper. If ∂W is the gradient with respect to the weights as

6a matrix, then the natural gradient is given by S −1 ∂W A −1 [15]. The natural expansion score η is

given by the inner product of the gradient with the P

natural gradient as vectors, which in this matrix

form becomes the elementwise inner product η = i,j ∂W ij (S −1 ∂W A −1 ) ij , which can also be

expressed as a trace: η = Tr[∂W T S −1 ∂W A −1 ].

The trace formula for η is reminiscent of the definition of the pearson correlation coefficient

T r =

E [xy] /(E [xx] E [yy]). The gradient for W is given by the expectation ∂W = E ag , where

a is the input activation vector, g is the derivative of the loss with respect

to the outputs, and the

expectation is over the dataset. Let the residual gradient g r = g − E ga T A −1 a be the part

of the gradient not predicted by the current activations a. Then if a p is the activation vector of a

set of proposed neurons, and A p is their second moment, then the “correlation coefficient” of the

new activations with the residual gradients is a lower bound ∆η ′ on the improvement ∆η in natural

expansion score (proof in appendix via block LDU decomposition of joint activation covariance):

Theorem 2 (Computationally

in natural expansion score).

−1 cheap

lower bound on increase

′

∆η ′ := Tr[A −1

p E a p g r G l E g r a p ] is a lower bound ∆η ≤ ∆η = η p −η c on the improvement

in natural expansion score due to some proposed addition of neurons p to a layer l.

Intuitively, ∆η ′ is the fraction of variance in residual gradients “explained” by the output of our new

neuron(s). This result holds for adding an arbitrary number of new neurons to an existing layer. If a

layer was inserted while retaining residual connections around it, then the same result would hold if

we treated the activations of the new layer as “new neurons” in the old layer to calculate ∆η ′ . Because

our activation function can represent the identity, we will automatically add these connections if in

fact they are necessary, so we in fact use this same method for evaluating our actual layer insertions.

The bound ∆η ′ can be computed for an arbitrary proposal p of additional neurons using only those

intermediate activations and gradients which it would be necessary to cache in order to calculate

the gradient g and (Kronecker factored approximate) natural gradient F̃ −1 g via backpropagation.

Therefore, if we have an outer optimizer which computes g and F̃ −1 , then we may optimize arbitrarily

many proposals p for arbitrarily many steps with an inner optimizer without incurring any additional

costs related to the evaluation of the existing network. The costs of this inner optimizer instead scale

with the size of the (very small) networks whose addition to the existing network is being considered.

Experiments

We now apply Self-Expanding Neural Networks to regression and classification, to illustrate the

behavior of the natural expansion score and demonstrate SENN’s efficacy.

4.1

Width Addition in Least-Squares Regression

We first show that the evolution over training of the possible improvements ∆η ′ in natural expansion

score due to potential width expansions is meaningful. In order to do so we consider the application

of a single layer SENN to a one dimensional least squares regression task as shown in figure 2, i.e.

SENN with depth addition deliberately disabled. The reason to have only one hidden layer is that

this is effectively least squares regression with basis functions given by the neurons of that layer. We

can therefore plot the normalized score increase ∆η ′ /η c of the best neuron for each basis function

location and length scale. Where ∆η ′ /η c > 1 there exists an acceptable proposal. Accepted/rejected

proposed neurons are shown on this landscape in red/black at key points in training. We see in the

leftmost figure that the best such proposal is accepted because it achieves a large improvement in η,

and it corresponds to a basis function location close to datapoints with visibly large prediction error

which we have been unable to reduce using the existing neurons. The next figure to the right shows

the same landscape after the new neuron is introduced, and it can be seen that the ∆η ′ /η c values for

neurons with similar locations to it have been dramatically reduced since they would be redundant.

The second figure from the right shows the result of optimizing the new expanded parameters until

the point at which the next neuron would be added. It can be seen that the prediction errors in the

region of the previously introduced neuron are now practically invisible, and that the next neuron is

to be introduced in a different region in which errors remain. The rightmost figure shows the function

approximation at the conclusion of training, and it can be seen that the prediction errors are negligible

and proposals with large relative increase in η are not to be found in the region considered. The reader

may note that there are some possible new neurons with small length scales which would surpass the

70.6

10 2

0.8

10 1

0.9

1.0

10 0

10 1

10 0

0.7

10 1

10 2

Location x

Figure 2: A single layer SENN (black, solid) is trained to approximate a target function (red, dashed)

via non-linear least-squares regression on samples (blue, markers). The location of existing neurons

is shown by vertical lines. The lower figures show ∆η ′ /η c as a function of the location and scale of

the nonlinearity introduced by a new neuron. Accepted and rejected proposals are marked in red and

black respectively. From left to right we see the landscape before and immediately after the fourth

neuron is added, before the fifth neuron is added, and at the end of training. SENN adds neurons

where they are relevant in order to achieve a good final fit. (Best viewed in color.)

Figure 3: Classification is performed with SENN on the half moons [20] dataset. The normalized

layer addition score ∆η ′ /η c is shown as a function of optimization steps; the horizontal bar shows the

point above which a layer will be added. The score increases during three phases during which the

SENN has initial zero, one and then two hidden layers. The respective decision boundary is shown

at the beginning and end of these. These layer additions allow SENN to represent more complex

decision boundaries when required for global expressivity. (Best viewed in color.)

expansion threshold which we do not find; we could deliberately try optimizing initializations at this

lengthscale to find these, but this would likely result in overfitting. Overall, SENN thus identifies

regions of locally insufficient capacity in our parameterization, targets these regions precisely with

new added neurons, and uses this selectively added capacity to achieve a good final fit.

4.2

Layer Addition in Classification

We now highlight SENN’s depth expansion in the context of classification. Specifically, we consider

two-dimensional inputs from the half-moons dataset [20]. In figure 3 we plot ∆η ′ /η c for the best

layer addition proposals as a function of overall optimizer steps. Visualizations of the learned

decision boundary at initialization and just before layer additions are shown. We can observe that

∆η ′ /η c increases approximately monotonically during three phases, punctuated by large drops when

layers are added. In the initial phase the network has zero hidden layers (i.e. is linear), and the

simplicity of the decision boundary at the end of this phase reflects this. Since the datapoints are not

linearly separable, the large ∆η ′ /η c value correctly indicates that the introduction of a hidden layer

is necessary in order to further reduce loss. The visible increase in decision boundary complexity

8Figure 4: SENN shows reasonable and reproducible hidden layer growth on MNIST at appealing

any-time validation accuracy without intermittent perturbations (left pair of panels). SENN features

appropriate scaling with respect to data complexity in its chosen network sizes (right pair of panels).

and accuracy over the course of the second phase confirms this. The beginning of the third phase

marks the introduction of a second hidden layer and we wait until ∆η ′ /η c rises again, indicating

an exhaustion of this new capacity, before reexamining the decision boundary. The increase in

boundary complexity is less visible this time, but close inspection reveals that the boundary has

become narrower and more rounded. Conclusively, we have intentionally constructed a scenario

where depth addition is necessary for a good fit to lie in the space of solutions, and seen that SENN

inserts new layers when this is necessary for global expressivity.

4.3

Dynamic Selection of Appropriate Architecture Size in Image Classification

Finally, we examine the ability of self-expanding neural networks to choose an appropriate size when

classifying MNIST [4] images. The leftmost plots of figure 4 show SENN’s total hidden size and

validation accuracy during training on the full dataset as a function of total batches seen. This use of

mini-batching is not strictly necessary for MNIST but we use it to better reflect the realities of training

modern neural networks. Our SENN is initialized with a single hidden layer of size 10, and promptly

adds a second hidden layer, also of size 10. All five seeds considered then proceed to consistently

add width to these layers at a moderate rate until a total hidden size of around 40 is reached, at which

point far fewer productive extensions of the network are found and addition slows dramatically. It

can be seen that this results in respectable validation performance (>97%) by the end of training with

very modest hidden neuron counts (50-60). It is of particular note that our method produces strong

anytime performance: we are able to continually expand size, and even insert layers, during training

without any attendant drops in validation accuracy. Indeed, our method exhibits mostly monotonic

improvement up to stochasticity from batching, a property not shared by methods which rely on

reinitializing a new network, e.g. [7]. This makes SENN a perfect fit to prospective applications in

e.g. active or continual learning, in the spirit of our original neurogenesis inspiration.

Having verified sensible performance of SENN on the full MNIST dataset, we now examine the way

in which they adapt their final converged size to the amount of information in the dataset. To this

end, we take class-balanced subsets of MNIST of varying sizes and train SENNs to convergence. To

maximize clarity in our examination of this relationship, we restrict the SENN to width addition. The

converged hidden sizes are shown together with the standard error across five seeds in the rightmost

plots of figure 4. The first of these shows log width against linear subset size for ease of comparison to

the leftmost panel. It can be seen that the final width tails off rapidly with subset size. The rightmost

plot shows instead linear width against logarithmic subset size, in which we can now distinguish

three regimes. For the smallest subsets, the initial hidden size of 10 is sufficient. For subsets between

10% and 60% of the standard training set, the final hidden size increases logarithmically, but past that

point further increases in subset size do not similarly increase the final network size. We posit that

this is due to substantial redundancy within the MNIST training set, leaving further capacity growth

unnecessary. Thus, SENN does not only provide desirable any time performance, but also tailors its

size suitably to the available data.

Conclusion

We have introduced the natural expansion score η and shown how it may be used to cohesively answer

the three key questions when , where and what of growing neural networks. We have demonstrated its

ability to capture redundancy of new neurons with old and thereby make sensible expansion decisions

across time and tasks. While we have focused on providing a thorough mathematical grounding of

the natural expansion score in this work, we acknowledge that the multilayer perceptrons on which

it was demonstrated differ in scale and complexity from many of the architectures in active use for

deep learning in the modern big data regime. Dually, however, prospects for further development

are promising, as our theoretical results regarding η apply for arbitrary expansions of parameterized

models, and our method of expansion would extend naturally to, for example, convolutional neural

networks or normalizing flows where layers may be initialized invertibly.

Acknowledgments

This work was supported by the project “safeFBDC - Financial Big Data Cluster” (FKZ:

01MK21002K), funded by the German Federal Ministry for Economics Affairs and Energy as

part of the GAIA-x initiative, and the Hessian research priority programme LOEWE within the

project “WhiteBox”. It benefited from the Hessian Ministry of Higher Education, Research, Science

and the Arts (HMWK; projects “The Third Wave of AI” and “The Adaptive Mind”).

10References

[1] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Comput., 10(2):251–276,

1998.

[2] Timur Ash. Dynamic node creation in backpropagation networks. Connection Science, 1(4):365–

375, 1989.

[3] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal

Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao

Zhang. JAX: composable transformations of Python+NumPy programs, 2018.

[4] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE

Signal Processing Magazine, 29(6):141–142, 2012.

[5] Timothy J. Draelos, Nadine E. Miner, Christopher C. Lamb, Jonathan A. Cox, Craig M. Vineyard,

Kristofor D. Carlson, William M. Severa, Conrad D. James, and James B. Aimone. Neurogenesis

deep learning: Extending deep networks to accommodate new classes. International Joint

Conference on Neural Networks (IJCNN), pages 526–533, 2017.

[6] Utku Evci, Bart van Merrienboer, Thomas Unterthiner, Fabian Pedregosa, and Max Vladymyrov.

Gradmax: Growing neural networks using gradient information. In The Tenth International Confer-

ence on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net,

2022.

[7] Yonatan Geifman and Ran El-Yaniv. Deep active learning with a neural architecture search. In

Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox,

and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual

Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019,

Vancouver, BC, Canada, pages 5974–5984, 2019.

[8] Mark Girolami and Ben Calderhead. Riemann Manifold Langevin and Hamiltonian Monte Carlo

Methods. Journal of the Royal Statistical Society Series B: Statistical Methodology, 73(2):123–214,

03 2011.

[9] Charles G. Gross. Neurogenesis in the adult brain: Death of a dogma. Nature Reviews Neuro-

science, 1(1):67–73, 2000.

[10] Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas

Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2023.

[11] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen,

HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. Gpipe: Effi-

cient training of giant neural networks using pipeline parallelism. In Hanna M. Wallach, Hugo

Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors,

Advances in Neural Information Processing Systems 32: Annual Conference on Neural Informa-

tion Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada,

pages 103–112, 2019.

[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep

convolutional neural networks. Commun. ACM, 60(6):84–90, 2017.

[13] Dhireesha Kudithipudi, Mario Aguilar-Simon, Jonathan Babb, Maxim Bazhenov, Douglas

Blackiston, Josh Bongard, Andrew Brna, Suraj Chakravarthi Raja, Nick Cheney, Jeff Clune,

Anurag Daram, Stefano Fusi, Peter Helfer, Leslie Kay, Nicholas Ketz, Zsolt Kira, Soheil Kolouri,

Jeff Krichmar, Sam Kriegman, and Hava Siegelmann. Biological underpinnings for lifelong

learning machines. Nature Machine Intelligence, 4:196–210, 03 2022.

[14] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop.

In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors, Neural Networks:

Tricks of the Trade: Second Edition, pages 9–48. Springer Berlin Heidelberg, Berlin, Heidelberg,

2012.

11[15] James Martens and Roger B. Grosse. Optimizing neural networks with kronecker-factored

approximate curvature. In Francis R. Bach and David M. Blei, editors, Proceedings of the

32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015,

volume 37 of JMLR Workshop and Conference Proceedings, pages 2408–2417. JMLR.org, 2015.

[16] James Martens. Deep learning via hessian-free optimization. In Johannes Fürnkranz and

Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning

(ICML-10), June 21-24, 2010, Haifa, Israel, pages 735–742. Omnipress, 2010.

[17] James Martens. New insights and perspectives on the natural gradient method. J. Mach. Learn.

Res., 21:146:1–146:76, 2020.

[18] Alejandro Molina, Patrick Schramowski, and Kristian Kersting. Padé activation units: End-to-

end learning of flexible activation functions in deep networks. In 8th International Conference on

Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net,

2020.

[19] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever.

Deep double descent: Where bigger models and more data hurt. In 8th International Conference on

Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net,

2020.

[20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,

P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,

M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine

Learning Research, 12:2825–2830, 2011.

[21] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick,

Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. CoRR,

abs/1606.04671, 2016.

[22] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov,

Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In

IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA,

June 7-12, 2015, pages 1–9. IEEE Computer Society, 2015.

[23] Krishna C. Vadodaria and Sebastian Jessberger. Functional neurogenesis in the adult hippocam-

pus: Then and now. Frontiers in Neuroscience, 8(8 MAR):1–3, 2014.

[24] Richard von Mises. Mathematical theory of probability and statistics, chapter VIII.9.3. Aca-

demic Press, New York, 1964.

[25] Lemeng Wu, Dilin Wang, and Qiang Liu. Splitting steepest descent for growing neural

architectures. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc,

Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems

32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December

8-14, 2019, Vancouver, BC, Canada, pages 10655–10665, 2019.

[26] Lemeng Wu, Bo Liu, Peter Stone, and Qiang Liu. Firefly neural architecture descent: a general

approach for growing neural networks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell,

Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing

Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020,

December 6-12, 2020, virtual, 2020.

[27] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with

dynamically expandable networks. In 6th International Conference on Learning Representations,

ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

OpenReview.net, 2018.

12A

A.1

Proofs

Theorem 1: Bounded rate of addition

In this section we prove theorem 1 of the main body. We will assume F ≻ 0 to be positive definite,

with the following straightforward consequence

Lemma 1. The natural expansion score is non-negative η = g T F −1 g ≥ 0.

Proof. If F ≻ 0, then F −1 ≻ 0, and v T F −1 v ≥ 0 for all v.

Considering the effect of the expansion threshold τ we obtain the following bound:

Lemma 2. Let η have initial value η 0 and be bounded above by λ > η. If threshold τ guarantees that

η i > τ η i−1 for the i-th addition, then the maximum number of successive additions N s is bounded

η 0

by N s < ln λ−ln

ln τ

Proof. Due to the threshold τ , η grows at least exponentially: η i > τ i η 0 . But η is bounded:

λ ≥ η i > τ i η 0 . Since ln is monotonic, we may take logarithms: ln λ > i ln τ + ln η 0 . and rearrange

η 0

to get i < ln λ−ln

for all additions i. This true for every i-th addition which is accepted, and so in

ln τ

particular also true for the last N s -th addition.

Considering also the effect of the stopping criterion α we obtain theorem 1:

Theorem 1. If the stopping criterion α guarantees that η i > η i−1 , then the maximum number of

successive additions N s is either 0, or bounded by N s < 1 + ln λ−ln

ln τ

Proof. Either N s = 0, or there is a first addition with natural expansion score η 1 for which η 1 − η 0 >

α. From lemma 1 we then have η 1 > α. We may then substitute α into lemma 2 in place of η 0 to

obtain a bound on further additions, yielding N s < 1 + ln λ−ln

ln τ

This theorem is important because it guarantees that SENN will add a limited number of neurons or

layers before continuing training. Intuitively, this is because it rapidly becomes the case that any new

neuron is either not relevant to rapidly decreasing the loss, or is redundant with some already extant

neuron.

A.2

Theorem 2: Lower bound on increase in natural expansion score

We now prove theorem 2 of the main body, concerning a lower bound on the increase in natural

expansion score η due to the addition of new proposed neuron(s) to a layer. Let the joint h activations

A C

a = [ a a p c ] of the current and proposed neurons have second moment E aa T = A = C pc c A cp

We will assume the Fisher matrix F for the layer to which neurons are to be added to factorize as

F = S ⊗ A, where S ≻ 0 is positive definite. We first derive a convenient form of a known result

discussed in, for example, von Mises [24], related to the joint covariance of multivariate Gaussian

distributions.

v c

Lemma 3. Let Â p = A p − C pc A −1

c C cp be the Schur complement of A c in A. Let also v = [ v p ]

be an arbitrary vector, and R be the linear operator defined by Rv = v p − C pc A −1

c v c , i.e. the

T −1

residual part of v p not predicted by v c . Then, v T A −1 v = v c T A −1

c v c + (Rv) Â p Rv.

Proof. The following may be obtained by performing a block LDU decomposition:

I c

0 A c 0

A c C cp

I c A −1

c C cp

A =

C pc A p

C pc A −1

I p

0 Â p 0

which we may then use to decompose A −1 :

−1

A c

A c C cp

I −A −1

c C cp

A −1 =

= c

C pc A p

I p

Â p

I c

−C pc A −1

(2)

I p

(3)The desired result then follows by substitution into v T A −1 v:

−1

A c

I c

I −A −1

c C cp

v T A −1 v = [v c v p ] c

−C pc A −1

I p

Â −1

I p

v c

v p

(4)

−1

T −1

−1

= v c T A −1

c v c + (v p − C pc A c v c ) Â p (v p − C pc A c v c )

Recall from section 3.6 that η may be expressed as a trace: η = Tr[S −1 E ga T A −1 E ag T ]

where g is the derivative of the loss with respect to the outputs (i.e. layer pre-activations) of the linear

transform.

Corollary 3.1. We can use lemma 3 to write the increase in natural expansion score ∆η as

∆η = Tr[S −1 E ga T A −1 E ag T ] − Tr[S −1 E [ga c ] A −1

]

c E a c g

(5)

−1

= Tr[S E g(Ra) Â p E (Ra)g ]

where we can take R inside the expectations by linearity.

It is computationally convenient for us to be able to have an expression in terms of residual gradients

instead of residual activations, so we note the following:

Lemma 4. E g(Ra) T = E g r a Tp where g r = g − E ga Tc A −1

c a c is the residual gradient.

Proof.

E g(Ra) T = E g(a p − E a p a Tc A −1

c a c )

= E ga Tp − E ga Tc A −1

c E a c a p

= E (g − E [ga c ] A −1

c a c )a p

= E g r a Tp

−1

Finally, we establish the following relationship between A −1

p and Â p :

−1

Lemma 5. Â −1

− A −1

p − A p = (A p − C pc A c C cp )

p ⪰ 0.

Proof. The matrix inverse Â −1

p can be expanded as the following power series

−1

Â −1

p = (A p − C pc A c C cp )

∞

−1

−1 n

A −1

p (C pc A c C cp A p )

(6)

n=0

We observe that this is a sum of positive semi-definite matrices, and truncate the series at n = 0 and

rearrange:

∞

−1

−1 n

Â −1

A −1

(7)

p − A p =

p (C pc A c C cp A p ) ⪰ 0

n=1

We may now prove theorem 2 from section 3.6.

Theorem 2. ∆η ′ is a lower bound on the increase in natural expansion score ∆η due to the addition

of some proposed neurons p:

∆η ≥ ∆η ′ = Tr[S −1 E g r a Tp A −1

(8)

p E a p g r ]

Proof. Substituting lemma 4 into corollary 3.1 we have ∆η = Tr[S −1 E g r a T Â −1

p E ag r ]. The

difference between ∆η and ∆η ′ is given by ∆η − ∆η ′ = Tr[S −1 E g r a T ( Â −1

− A −1

)E ag r T ].

This is the squared norm of E g r a T as a vector according to the Kronecker product S −1 ⊗

−1

( Â −1

p − A p ). The first factor is positive semi-definite by assumption, the second by lemma 5,

and the Kronecker product of positive semi-definite matrices is positive semi-definite. Therefore

∆η − ∆η ′ ≥ 0 and so ∆η ≥ ∆η ′ .

14The significance of this lower bound on ∆η is that g r and S −1 may be computed once, and then

used to optimize very many proposals with different activations a p . That is, performing N steps of

gradient descent to optimize proposed neurons p scales linearly in the evaluation cost of a p and A −1

p .

These linear costs are unaffected by the number of neurons currently in the layer being added to, and

unaffected by the total number of layers in the network.

The consequences of non-Fisher curvature for total neurons added

In section 3.5 we discussed the total number of neurons added during training, and in particular the

extent to which we could provide bounds on this. As noted there, in the case where the Fisher F is

constant over training and exactly equal to the hessian, the dynamics of training are very simple. The

loss L has its global minimum at the point reached by a step of exactly F −1 g, and it can be seen

by integration that the reduction in loss due to such a step is exactly ∆L = 12 g T F −1 g = 12 η. The

stopping criterion α corresponds to the requirement that parameter expansions should enable a further

reduction in loss of at least 12 α. Since η ≤ λ is bounded by λ, the maximum possible reduction

in loss is ∆L max = 2 1 λ. If we pessimistically assume that every parameter expansion enables the

minimal loss reduction of only 12 α, then the total number of added neurons N T is still bounded by

N T < α λ .

The case where the true hessian of the loss H is some constant multiple of the Fisher H = κF

which is itself constant, is almost as simple. The parameters evolve along the same trajectory, only

they move a factor of κ faster than they would if F = H. This also results in a rescaling η = κη B of

natural expansion scores relative to the baseline value η B in the case where F was accurate. While

this has no effect on the behaviour of the expansion threshold τ , the inflated η values mean that the

effective value of α is reduced by a factor of κ and so the total number of added neurons N T is now

only bounded by N T < κ α λ .

We will now try to describe the effect of more general failures of F to represent the true curvature

H. Local expansion behaviour, i.e. without further parameter optimization, is bounded by lemma

2 of appendix A. Assuming the baseline case of H = F , we may substitute λ = 2∆L max . If we

assume small step sizes, the rate of loss reduction L̇ = −η is given by the natural expansion score by

definition, regardless of H. If at all times t during training the rate of reduction of expansion score

− η̇(t) < − η̇ B (t) is lower than the baseline scenario, then η will at all times be greater than expected.

Since the rate of loss reduction L̇ = η is given by η, L will decrease faster than expected and the

remaining maximum possible loss reduction ∆L max will be at all times less than expected. It can

be seen from lemma 2 that discrepancies in these directions relative to baseline will result in fewer

additions being made.

We now only need to establish conditions under which the actual rate of reduction in η is lower

than the expected rate. The rate of change during optimization (indicated by overdot) of the various

components of η can be described as follows:

θ̇ = −F −1 g

(9)

ġ = H θ̇ = −HF

ġ F

−1

g = −g F

η̇ = − ġ F

−1

(10)

−1

(11)

g − g F ġ − g F

= −g T F −1 2H + Ḟ F −1 g

−1

Ḟ F

−1

(12)

(13)

Since in the base case H B = F and Ḟ B = 0, we have that if H + 12 Ḟ ⪯ F then − η̇ ≤ − η̇ B .

Putting the above results together, we have that if at all times during training H + 12 Ḟ ⪯ F , then the

bound on total additions N T < α λ should hold. Incorporating the previous result regarding H = κF ,

it also appears that if at all times H + 12 Ḟ ⪯ κF , then N T < κ α λ . Assuming F positive definite

and the loss surface smooth (i.e. H and Ḟ finite), then there will exist some finite κ for which the

condition holds and so N T will be bounded.

15C

Hyperparameters and implementation details

All experiments were run on a single Nvidia A100 or V100 GPU, using no longer than one day each.

Our implementation uses the JAX [3] autodifferentiation and Flax [10] neural network libraries. The

full source code used to run the experiments is provided in the supplementary material, and will be

made publicly available on publication of this work. In all experiments we optimize our parameters

via natural gradient descent with a learning rate of 0.1 and Tikhonov damping of magnitude 0.1. In

the image classification experiments we use batches of size 1024 and a weight decay of rate 0.001.

We initialize our dense layers with the default initialization of Flax (LeCun Normal) [14], and use a

unit normal initialization for the parameters of our rational functions.

For the visualization experiments we use τ = 2, for the image classification experiments we use

τ = 1.007 and τ = 1.03 for the whole dataset and variable subset experiments respectively. Larger

thresholds τ result in longer training times but more conservative network sizes and higher accuracy

of η estimates due to F being a closer approximation to the curvature near convergence on the

existing parameters. Any extra costs are negligible for the visualization experiments, so we use

the intuitive value of 2, but we choose τ values for the image classification experiments in light of

this natural trade-off. We use α = 0.0025 for all experiments apart from the whole dataset image

classification, for which we use α = 0.25. Here the latter choice compensates for larger noise in ∆η ′

introduced by use of a validation batch, as will be discussed shortly. We adjust the expansion score

increases for layer additions by a constant factor of 2 in the visualization experiments and 60 in the

image classification experiments. These values are selected to be within an order of magnitude of the

actual layer sizes expected in classification of a toy dataset versus images, and so of the number of

new neurons a new layer represents.

We calculate the natural gradient via the conjugate gradient method with a maximum iteration count

of 100 when optimizing the existing parameters. When optimizing the initializations of proposed

neurons or layers we use the Kronecker factored approximation of the Fisher matrix for the relevant

layer based on derivatives of the predictions of the network as in Martens and Grosse [15]. We

compute ∆η ′ based on this F̃ and normalize it with respect to the output gradient magnitudes of the

particular task. When comparing ∆η ′ /η c to τ we use the η c value given by F̃ for the layer in question.

When considering adding layers, we ensure new layers are invertible by adding a regularization term

of 0.01(ln det W ) 2 when optimizing the initialization of their linear transform W , and by setting the

minimal singular values of W to be at least 0.001 times its average singular value before adding the

layer to the network. In our visualization experiments we do not use batching, so we consider adding

depth and width every 30 steps, and add at most one layer per 90 steps. In the image classification

experiments we use batching and so consider adding width and depth every 10 epochs, adding at

most one layer each time.

We use the same scheme for initializing proposed new neurons or layers as for initializing the

starting network. In our whole dataset image classification experiment we then optimize proposal

initializations to maximize ∆η ′ via 300 steps of vanilla gradient descent on a fixed batch of 1024

images. We consider 10000 neuron proposals and 100 layer proposals per location, and use a learning

rate of 0.3, reducing this by a factor of 3 as necessary to maintain monotonic improvement in ∆η ′ for

each proposal. We take the best proposal on this batch of size 1024 for each depth and width addition

location, and reevaluate its ∆η ′ on a fixed validation batch of size 1024 when deciding whether and

where to add. The variable degree of overfitting of the best proposal results in some noise in ∆η ′ at

each location which we compensate for by choosing a relatively large α.

For our other experiments we optimize proposal initializations using 3000 steps of the Metropolis

Adjusted Langevin Algorithm (MALA) [8], using a unit gaussian prior on initializations during these

steps. We use a temperature T of 10 and an initial step size of 0.3, and adjust by a factor of 3 every 10

steps if necessary to maintain an acceptance rate of around 0.6. We consider 100 width proposals and

100 layer proposals for each location, and obtain 100 final MALA samples i for each location width

could be added and each location depth could be added. We then construct a categorical distribution

over each set of 100 samples via softmax( T 1 ∆η i ′ ), and use the corresponding expectation of ∆η ′

when deciding when and where to add capacity and whether it should be depth or width. We draw

initializations for new capacity from this categorical distribution, except in the initial least squares

regression experiment, where we use arg max i ∆η i ′ over the 100 samples i to make figure 2 more

intuitive.