Summary: Conservation Laws for Gradient Flows (arxiv.org)
19,160 words - PDF document
One Line
The article examines the geometric aspects of gradient descent in machine learning, focusing on conservation laws, i.e. functions of the parameters that are preserved during optimization.
Key Points
- Conservation laws are sets of independent quantities that are conserved during the gradient flows of a model (see the numerical sketch after this list).
- A factorization of the cost function E is proposed, which is valid for optimization by gradient descent.
- Conservation laws belonging to a prescribed finite-dimensional space can be found by projecting the equations onto a basis of that space.
- The dimension of the trace of Lie(V) is locally constant and equal to the dimension of V(θ).
- The existence of a symmetric semi-definite matrix that satisfies an ODE is discussed.
- Numerical comparison confirms that there are no more conservation laws than the ones already known for deeper linear networks and ReLU networks.
- The text excerpt contains a list of references and citations related to gradient flows and implicit bias in machine learning.
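As a concrete illustration of the first key point, the following minimal sketch (not taken from the paper's code; the toy data, sizes, and step size are hypothetical) runs gradient descent on a two-layer linear model φ(U, V) = U V^T and checks that the entries of U^T U - V^T V, the balancedness quantities discussed later in this summary, stay essentially constant along the trajectory.

```python
import numpy as np

# Minimal sketch (not the paper's released code): gradient descent on a
# two-layer linear model phi(U, V) = U V^T with a squared loss.  The entries
# of U^T U - V^T V are known conservation laws of the continuous gradient
# flow, so with a small step size they should barely drift.
rng = np.random.default_rng(0)
n, m, r, N = 5, 4, 3, 20
X = rng.standard_normal((m, N))            # hypothetical toy inputs
Y = rng.standard_normal((n, N))            # hypothetical toy targets
U = rng.standard_normal((n, r))
V = rng.standard_normal((m, r))

balance0 = U.T @ U - V.T @ V               # conserved quantities at initialization
lr = 1e-3
for _ in range(5000):
    R = U @ V.T @ X - Y                    # residuals of the linear model
    gU = (R @ X.T @ V) / N                 # dE/dU for E = ||U V^T X - Y||^2 / (2N)
    gV = (X @ R.T @ U) / N                 # dE/dV
    U -= lr * gU
    V -= lr * gV

drift = np.abs((U.T @ U - V.T @ V) - balance0).max()
# Small discretization effect; exactly zero for the continuous-time flow.
print("max drift of U^T U - V^T V entries:", drift)
```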
Summaries
34 word summary
This article explores the geometric properties of gradient descent in machine learning. It introduces "conservation laws" as independent quantities conserved during gradient flows. The paper analyzes the functions preserved during optimization by gradient descent.
44 word summary
This article focuses on understanding the geometric properties of gradient descent dynamics in machine learning models. It introduces the concept of "conservation laws" as sets of independent quantities conserved during gradient flows. The paper analyzes the functions preserved during optimization by gradient descent and characterizes their number via the Lie algebra generated by the associated vector fields.
677 word summary
This article focuses on understanding the geometric properties of gradient descent dynamics in machine learning models. It introduces the concept of "conservation laws", which are sets of independent quantities that are conserved during the gradient flows of a model. The article explains how to find these conserved quantities and how to count them.
The paper aims to analyze the functions that are preserved during optimization by gradient descent. It proposes a factorization of the cost function E through the model mapping φ and the data fidelity f_{X,Y}. The factorization is valid for optimization by gradient descent.
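In symbols, with θ the parameters, φ the model mapping, and f_{X,Y} the data-fidelity term (notation chosen to match the summary's description), the factorization and the associated gradient-flow ODE read:

```latex
E_{X,Y}(\theta) \;=\; f_{X,Y}\bigl(\phi(\theta)\bigr),
\qquad
\dot\theta(t) \;=\; -\nabla E_{X,Y}\bigl(\theta(t)\bigr).
```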
A function h is conserved through a subset V if and only if ∇h(θ) is orthogonal to the linear space V(θ). The set of functions conserved during all flows defined by the ODE corresponds to the functions conserved through a specific finite-dimensional subset of vector fields determined by the model.
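This equivalence is a direct consequence of the chain rule; under the factorized form above,

```latex
\frac{d}{dt}\, h\bigl(\theta(t)\bigr)
 \;=\; \bigl\langle \nabla h(\theta(t)),\, \dot\theta(t) \bigr\rangle
 \;=\; -\bigl\langle \nabla h(\theta(t)),\, \nabla E_{X,Y}(\theta(t)) \bigr\rangle,
```

so h is conserved along every such flow exactly when ∇h(θ) is orthogonal to every gradient direction the model can realize, i.e. to the space these directions span.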
Conservation laws in a prescribed finite-dimensional space can be found by projecting the equations onto a basis of that space. For linear and ReLU cases, known conservation laws are polynomial "balancedness-type conditions." By focusing on the corresponding subspace of polynomials, the search reduces to solving a linear system (see the sketch below).
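A minimal sketch of that reduction for a toy factorization φ(u, v) = u v^T with u, v ∈ R² and candidate polynomials of degree at most two (sizes and helper names are illustrative, not the paper's code): the conservation condition is linear in the polynomial coefficients, so sampling it at random points and over the generating fields yields a linear system whose null space contains the candidate conservation laws.

```python
import numpy as np
from itertools import combinations_with_replacement

# Sketch of the "project onto a basis" idea.  The condition
# <grad h(theta), v(theta)> = 0 is linear in the coefficients of a polynomial h.
n, m, r = 2, 2, 1
D = (n + m) * r
rng = np.random.default_rng(0)

# Monomial basis of degree 1 and 2 (the constant monomial, trivially conserved,
# is left out).
monomials = [(i,) for i in range(D)] + list(combinations_with_replacement(range(D), 2))

def grad_monomial(alpha, theta):
    """Gradient of the monomial prod(theta[i] for i in alpha)."""
    g = np.zeros(D)
    for pos, i in enumerate(alpha):
        rest = alpha[:pos] + alpha[pos + 1:]
        g[i] += np.prod(theta[list(rest)]) if rest else 1.0
    return g

def gradient_fields(theta):
    """Gradients of the outputs of phi(U, V) = U V^T at theta."""
    U = theta[:n * r].reshape(n, r)
    V = theta[n * r:].reshape(m, r)
    fields = []
    for i in range(n):
        for j in range(m):
            gU = np.zeros((n, r)); gV = np.zeros((m, r))
            gU[i] = V[j]; gV[j] = U[i]
            fields.append(np.concatenate([gU.ravel(), gV.ravel()]))
    return fields

# Each sampled (theta, field) pair contributes one linear equation in the
# monomial coefficients of h.
rows = []
for _ in range(200):
    theta = rng.standard_normal(D)
    G = np.stack([grad_monomial(a, theta) for a in monomials])  # (#monomials, D)
    rows += [G @ v for v in gradient_fields(theta)]
A = np.stack(rows)

# Null-space dimension = number of independent degree-<=2 conservation laws;
# for this toy model it should recover the single balancedness law
# ||u||^2 - ||v||^2.
null_dim = A.shape[1] - np.linalg.matrix_rank(A, tol=1e-8)
print("independent degree-<=2 polynomial conservation laws:", null_dim)
```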
The document discusses conservation laws for gradient flows. It states that the dimension of the trace of Lie(V) is locally constant and equal to the dimension of V(θ). The number of conservation laws is characterized by the Lie algebra generated by V and can be computed from the dimension of its trace at θ (see the formula below).
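The counting result referred to here can be paraphrased as follows, with D the number of parameters and Lie(V)(θ) the trace at θ of the Lie algebra generated by the vector fields in V (notation consistent with this summary):

```latex
\#\bigl\{\text{independent conservation laws near } \theta\bigr\}
 \;=\; D \;-\; \dim \operatorname{Lie}(V)(\theta).
```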
Assuming certain conditions are met, the document discusses the existence of a symmetric semi-definite matrix that satisfies an ODE. Analytic examples are provided to illustrate these concepts. The document then explores the conservation laws for linear and ReLU neural networks.
The authors conducted a numerical comparison confirming that there are no more conservation laws than those already known for deeper linear networks and ReLU networks. Their code is open-sourced and available on GitHub. The theory can be applied to any space of displacements.
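The kind of computation behind such a comparison can be sketched as follows (this is not the authors' released GitHub code; the model, sizes, and tolerances are illustrative). For φ(U, V) = U V^T each gradient field is linear in θ, so it can be stored as a constant matrix and the Lie algebra can be closed under commutators exactly; evaluating its trace at a random point and applying the counting formula above gives the number of conservation laws, to be compared with the r(r+1)/2 balancedness quantities.

```python
import numpy as np

# Illustrative sketch (not the authors' GitHub code).  For phi(U, V) = U V^T,
# each gradient field theta -> grad phi_ij(theta) is linear, i.e. of the form
# theta -> H_ij @ theta with H_ij the constant Hessian of phi_ij.  The Lie
# bracket of two linear fields is again linear, with matrix the commutator,
# so Lie(V) can be generated by closing the span of the H_ij under commutators.
n, m, r = 3, 4, 2
D = (n + m) * r

def hessian_phi(i, j):
    """Constant Hessian of phi_ij(U, V) = sum_k U[i, k] * V[j, k]."""
    H = np.zeros((D, D))
    for k in range(r):
        a = i * r + k              # flat index of U[i, k] in theta
        b = n * r + j * r + k      # flat index of V[j, k] in theta
        H[a, b] = H[b, a] = 1.0
    return H

def basis_of_span(mats, tol=1e-10):
    """Orthonormal basis (as D x D matrices) of the span of the given matrices."""
    stacked = np.stack([M.ravel() for M in mats])
    _, s, Vt = np.linalg.svd(stacked, full_matrices=False)
    return [row.reshape(D, D) for row, sv in zip(Vt, s) if sv > tol]

gens = [hessian_phi(i, j) for i in range(n) for j in range(m)]
algebra = basis_of_span(gens)
while True:                        # close under brackets with the generators
    brackets = [A @ B - B @ A for A in algebra for B in gens]
    new = basis_of_span(algebra + brackets)
    if len(new) == len(algebra):
        break
    algebra = new

theta = np.random.default_rng(0).standard_normal(D)
trace_dim = np.linalg.matrix_rank(np.stack([M @ theta for M in algebra]), tol=1e-8)
print("D - dim Lie(V)(theta) =", D - trace_dim,
      "  known balancedness laws r(r+1)/2 =", r * (r + 1) // 2)
```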
This excerpt contains a list of references to various articles and papers related to the topic of gradient flows and implicit bias in machine learning. The references cover a range of topics including functional dependence, algorithmic regularization, optimization geometry, implicit regularization, and nonlinear system control.
The excerpt contains a list of references and citations related to the topic of conservation laws for gradient flows. The references include papers on implicit regularization in deep learning, exact solutions to the nonlinear dynamics of learning, and related topics.
The document discusses conservation laws for gradient flows in the context of linear and ReLU networks. The main goal is to show that two a priori different sets of conservation laws, CL(V[C]) and CL(V[F]), coincide under certain assumptions; Assumption B is introduced for this purpose.
The proof shows that the activation status of neurons is locally constant in a neighborhood, which implies that the conclusion follows from the identity g_θ(x) = C_{θ,x}(φ_ReLU(θ)) for all θ, x.
Lemma B.8
The text excerpt discusses conservation laws for gradient flows. It presents a proof of Theorem 3.3 and invokes the fundamental result of Frobenius, which states that if the dimension of the spanned distribution is constant on a domain, then two conditions are equivalent: involutivity (closure under Lie brackets) and local integrability.
The text discusses conservation laws for gradient flows. It introduces condition (23) of the Frobenius theorem and proves that it holds for each block of coordinates. It also shows that the dimension of V(θ) is n + m - 1.
The text excerpt discusses conservation laws for gradient flows. It presents two cases and shows that the condition of the Frobenius theorem is satisfied. It also provides a proof of Proposition 3.8 and an additional example.
The document discusses conservation laws for gradient flows. It states that, when certain conditions are met, the resulting functions are linearly dependent on those already obtained. It also discusses the number of independent conserved functions depending on the values of n, m, and r.
If (U; V) has full rank, all conserved functions are given by (U, V) ↦ U^T U - V^T V, and no further conserved functions exist. The dimension of Lie(V)(U, V) is then the number of parameters minus the number of independent entries of U^T U - V^T V.
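That these quantities are indeed conserved can be checked directly (a standard computation consistent with the statement above; G denotes the gradient of the outer cost f at W = U V^T, so the gradient flow reads dU/dt = -G V and dV/dt = -G^T U):

```latex
\frac{d}{dt}\bigl(U^\top U - V^\top V\bigr)
 \;=\; \dot U^\top U + U^\top \dot U - \dot V^\top V - V^\top \dot V
 \;=\; -\bigl(V^\top G^\top U + U^\top G V\bigr) + \bigl(U^\top G V + V^\top G^\top U\bigr)
 \;=\; 0.
```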