Summary: Benign Oscillation of Stochastic Gradient Descent (arxiv.org)
41,837 words - PDF document
One Line
The paper investigates how training neural networks with stochastic gradient descent at large learning rates affects their generalization.
Key Points
- The paper investigates the generalization properties of neural networks trained using stochastic gradient descent (SGD) with large learning rates.
- The concept of "benign oscillation" is introduced: the oscillation of the neural network weights induced by large learning rate SGD training, which benefits the generalization of the network.
- Theoretical results show that oscillating SGD training with large learning rates effectively learns weak features in the presence of strong features, leading to better generalization performance.
- Numerical experiments demonstrate that a convolutional neural network (CNN) trained with a large learning rate achieves higher test accuracy compared to a CNN trained with a small learning rate.
- The document provides a detailed analysis of SGD in the presence of noise and multiple training data points, establishing key properties and proving important results.
- The authors emphasize the importance of hyperparameter selection and weight initialization in achieving stable oscillation and accurate learning in SGD.
- The proof of Theorem 4.3 demonstrates that SGD can effectively learn weak signals in a neural network while avoiding overfitting and memorizing noise.
- The analysis of SGD in the presence of multiple training data points reveals the dynamics of learning the strong signal, learning the weak signal, and memorizing noise.
Summaries
18 word summary
This paper explores the generalization properties of neural networks trained with large learning rates and stochastic gradient descent.
90 word summary
This paper examines the generalization properties of neural networks trained with stochastic gradient descent (SGD) and large learning rates. The authors introduce "benign oscillation," which refers to the oscillation of network weights caused by large learning rate SGD training, and show its positive impact on generalization. Experimental results support these findings. The paper also includes theoretical results, lemmas, and proofs characterizing SGD behavior in the presence of noise and multiple training data. The paper contributes to understanding large learning rate training in deep learning and provides insights for optimization analysis.
214 word summary
This paper investigates the generalization properties of neural networks trained using stochastic gradient descent (SGD) with large learning rates. The authors introduce the concept of "benign oscillation," which refers to the oscillation of neural network weights caused by large learning rate SGD training, and demonstrate its beneficial effect on generalization. The theoretical results show that oscillating SGD training with large learning rates effectively learns weak features alongside strong features, leading to better generalization performance. Numerical experiments using convolutional neural networks (CNNs) support these findings, showing that CNNs trained with large learning rates achieve higher test accuracy compared to those trained with small learning rates. The paper also discusses concentration results, lemmas, and proofs characterizing the behavior of SGD in the presence of noise and multiple training data. The proof of the main result and the conditions required for it are presented. Additionally, proofs for lemmas discussing training dynamics and weak signal learning are provided. The proof of Theorem 4.3 demonstrates the effectiveness of SGD in learning weak signals while avoiding overfitting and memorizing noise. Finally, the proof of Proposition D.5 establishes upper and lower bounds on various terms using lemmas B.5 and D.2. Overall, this paper contributes to the understanding of large learning rate training in deep learning and provides insights for optimization analysis.
568 word summary
This paper explores the generalization properties of neural networks trained using stochastic gradient descent (SGD) with large learning rates. The authors introduce the concept of "benign oscillation": the oscillation of the neural network weights caused by large learning rate SGD training, which turns out to be beneficial for generalization. The paper provides an overview of deep learning and the need for a theoretical understanding of its optimization and generalization properties, highlighting the empirical evidence that large learning rates improve generalization while the theoretical understanding of this effect remains limited.
The main theoretical results of the paper show that oscillating SGD training with large learning rates effectively learns weak features in the presence of strong features. On the other hand, SGD with small learning rates only learns strong features and makes little progress in learning weak features. This division in feature learning leads to better generalization performance for neural networks trained with large learning rates.
Numerical experiments are provided to support these findings. The authors train convolutional neural networks (CNNs) using large and small learning rates and show that the CNN trained with a large learning rate achieves higher test accuracy than the one trained with a small learning rate. The CNN trained with a small learning rate fails to generalize to testing data without strong features.
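For readers who want to reproduce this kind of comparison in outline, the sketch below trains the same small CNN twice with plain SGD, changing only the learning rate, and reports the test accuracy of each run. The dataset, architecture, and the two learning-rate values are placeholders chosen for illustration, not the paper's experimental configuration.

```python
# Illustrative comparison of "large" vs "small" learning rate SGD (the dataset,
# architecture, and learning-rate values are placeholders, not the paper's setup).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.fc = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        return self.fc(x.flatten(1))


def run(lr: float, epochs: int = 3) -> float:
    """Train with plain SGD (no momentum) at the given learning rate; return test accuracy."""
    tf = transforms.ToTensor()
    train = datasets.MNIST("data", train=True, download=True, transform=tf)
    test = datasets.MNIST("data", train=False, download=True, transform=tf)
    train_loader = DataLoader(train, batch_size=128, shuffle=True)
    test_loader = DataLoader(test, batch_size=256)

    model = SmallCNN()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()

    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in test_loader:
            correct += (model(x).argmax(1) == y).sum().item()
    return correct / len(test)


if __name__ == "__main__":
    for lr in (0.2, 0.01):  # illustrative "large" vs "small" learning rates
        print(f"lr={lr}: test accuracy = {run(lr):.4f}")
```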
In conclusion, this paper provides a theoretical investigation of large learning rate SGD training for neural networks. The concept of benign oscillation is introduced and shown to lead to better generalization performance by effectively learning weak features. The findings are supported by numerical experiments. This work contributes to the understanding of large learning rate training in deep learning and provides insights for optimization analysis.
The document then discusses the benign oscillation of SGD in more technical detail. It presents lemmas and proofs characterizing the behavior of SGD in the presence of noise and multiple training data points. The first section provides concentration lemmas for the random elements involved in the problem.
The second section focuses on the single training data case and presents a formal statement and proof for the main result, Theorem 3.2. The conditions and assumptions required for Theorem 3.2 are discussed, including conditions on hyperparameters, weight initialization scale, signal strength, and dimension.
The third section presents proofs for the lemmas in Appendix B.1, which discuss basic properties of training dynamics and the two-layer CNN. These lemmas establish properties of neuron subsets, such as fixed signs of inner products and connections between inner products and model outcomes.
The fourth section presents a fundamental reasoning towards weak signal learning. Lemma B.7 provides a quantitative interpretation of the increasing behavior of inner products, formalizing the role of oscillation in learning weak signals.
The fifth section presents the proof of Theorem B.3, which establishes the benign oscillation of SGD. The proof is done by contradiction and involves bounding the stopping time T(v) and establishing lower bounds on inner products.
The proof of Theorem 4.3 is presented next. The main goal is to show that the SGD algorithm can effectively learn the weak signal in a neural network while avoiding overfitting and memorizing noise.
The proof of Proposition D.5 begins by expanding the expression for f_{t_j}(S_{j,k}) and breaking it down into several terms. The proof utilizes Lemma B.5 and Lemma D.2 to establish upper and lower bounds on various terms.
670 word summary
This paper investigates the generalization properties of neural networks (NN) trained using stochastic gradient descent (SGD) algorithm with large learning rates. The authors introduce the concept of “benign oscillation” which refers to the oscillation of the NN weights caused by the large learning rate SGD training that is beneficial for the generalization of the NN. The authors propose a theoretical framework based on the feature learning perspective of deep learning to explain this phenomenon.
The paper starts by providing an overview of deep learning and the need for a theoretical understanding of its optimization and generalization properties. It highlights the empirical evidence that using a large learning rate in NN training improves generalization but the theoretical understanding of this phenomenon is limited. The paper then introduces the problem settings, including the data generation model and the two-layer convolutional neural network (CNN) used for training.
The main theoretical results of the paper are presented next. The authors show that under certain conditions and assumptions, oscillating SGD training with large learning rates leads to effective learning of weak features in the presence of strong features. On the other hand, SGD with small learning rates only learns the strong features and makes little progress in learning the weak features. This division in feature learning leads to better generalization performance for NNs trained with large learning rates.
The authors provide numerical experiments to demonstrate their findings. They train CNNs using large and small learning rates and show that the CNN trained with a large learning rate achieves higher test accuracy than the one trained with a small learning rate. The CNN trained with a small learning rate fails to generalize to testing data that lacks strong features.
In conclusion, this paper provides a theoretical investigation of large learning rate SGD training for NNs. The authors introduce the concept of benign oscillation and show that it leads to better generalization performance by effectively learning weak features. The findings are supported by numerical experiments. This work contributes to the understanding of large learning rate training in deep learning and provides insights for optimization analysis.
The document then discusses the benign oscillation of stochastic gradient descent (SGD) in more technical detail. It presents a series of lemmas and proofs characterizing the behavior of SGD in the presence of noise and multiple training data points.
The first section focuses on concentration results and provides lemmas to characterize the concentration properties of random elements involved in the problem. These lemmas establish bounds on the number of positive and negative labels in the training data and the weight initialization scale.
The second section focuses on the single training data case and presents a formal statement and proof for the main result, Theorem 3.2. The conditions and assumptions required for Theorem 3.2 are discussed, including conditions on hyperparameters, weight initialization scale, signal strength, and dimension.
The third section presents the proofs for the lemmas in Appendix B.1, which discuss basic properties of training dynamics and the two-layer CNN. These lemmas establish properties of neuron subsets, such as fixed signs of inner products and connections between inner products and model outcomes.
The fourth section presents a fundamental reasoning towards weak signal learning. Lemma B.7 provides a quantitative interpretation of the increasing behavior of inner products, formalizing the role of oscillation in learning weak signals.
The fifth section presents the proof of Theorem B.3, which establishes the benign oscillation of SGD. The proof is done by contradiction and involves bounding the stopping time T(v) and establishing lower bounds on inner products.
The proof of Theorem 4.3 is presented next. The main goal is to show that the stochastic gradient descent (SGD) algorithm can effectively learn the weak signal in a neural network, while avoiding overfitting and memorizing noise.
The proof of Proposition D.5 begins by expanding the expression for f_{t_j}(S_{j,k}) and breaking it down into several terms. The proof utilizes Lemma B.5 and Lemma D.2 to establish upper and lower bounds on various terms in the expression.
1867 word summary
This paper investigates the generalization properties of neural networks (NN) trained using stochastic gradient descent (SGD) algorithm with large learning rates. The authors introduce the concept of "benign oscillation" which refers to the oscillation of the NN weights caused by the large learning rate SGD training that is beneficial for the generalization of the NN. The authors propose a theoretical framework based on the feature learning perspective of deep learning to explain this phenomenon.
The paper starts by providing an overview of deep learning and the need for a theoretical understanding of its optimization and generalization properties. It highlights the empirical evidence that using a large learning rate in NN training improves generalization but the theoretical understanding of this phenomenon is limited. The paper then introduces the problem settings, including the data generation model and the two-layer convolutional neural network (CNN) used for training.
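To make this kind of setting concrete, the sketch below builds a simplified patch-based dataset (a feature patch carrying either a strong or a weak signal direction scaled by the label, plus a pure-noise patch) and a two-layer CNN-style score function. All names, dimensions, scales, and the cubic activation are illustrative assumptions rather than the paper's exact definitions.

```python
# Illustrative patch-based data model and two-layer CNN-style score function;
# dimensions, signal strengths, and the cubic activation are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 50, 10, 200            # patch dimension, filters per class, sample size
v_strong = 5.0 * np.eye(d)[0]    # "strong" feature direction (large norm)
v_weak = 1.0 * np.eye(d)[1]      # "weak" feature direction (small norm)


def make_data(n: int):
    """Each example has two patches: a feature patch (strong or weak signal
    vector times the label y) and an independent Gaussian noise patch."""
    y = rng.choice([-1, 1], size=n)
    use_strong = rng.random(n) < 0.8                       # most examples carry the strong feature
    feat = np.where(use_strong[:, None], v_strong, v_weak) * y[:, None]
    noise = rng.normal(scale=0.5, size=(n, d))
    return np.stack([feat, noise], axis=1), y              # shape (n, 2 patches, d)


def f(W_pos: np.ndarray, W_neg: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Two-layer CNN-style score: a cubic activation of filter/patch inner
    products, summed over patches and filters, positive minus negative filters."""
    act = lambda z: z ** 3
    return act(x @ W_pos.T).sum(axis=(1, 2)) - act(x @ W_neg.T).sum(axis=(1, 2))


X, y = make_data(n)
W_pos = rng.normal(scale=0.01, size=(m, d))   # small random initialization
W_neg = rng.normal(scale=0.01, size=(m, d))
print("initial scores (near zero):", f(W_pos, W_neg, X)[:5])
```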
The main theoretical results of the paper are presented next. The authors show that under certain conditions and assumptions, oscillating SGD training with large learning rates leads to effective learning of weak features in the presence of strong features. On the other hand, SGD with small learning rates only learns the strong features and makes little progress in learning the weak features. This division in feature learning leads to better generalization performance for NNs trained with large learning rates.
The authors provide numerical experiments to demonstrate their findings. They train CNNs using large and small learning rates and show that the CNN trained with a large learning rate achieves higher test accuracy than the one trained with a small learning rate. The CNN trained with a small learning rate fails to generalize to testing data that lacks strong features.
In conclusion, this paper provides a theoretical investigation of large learning rate SGD training for NNs. The authors introduce the concept of benign oscillation and show that it leads to better generalization performance by effectively learning weak features. The findings are supported by numerical experiments. This work contributes to the understanding of large learning rate training in deep learning and provides insights for optimization analysis.
The document then discusses the benign oscillation of stochastic gradient descent (SGD) in more technical detail. It presents a series of lemmas and proofs characterizing the behavior of SGD in the presence of noise and multiple training data points.
The first section provides concentration lemmas for the random elements involved in the problem. These lemmas establish bounds on the number of positive and negative labels in the training data and on the weight initialization scale. The proofs of these lemmas are cited from other work.
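As a generic example of the kind of bound involved, if the n training labels are independent and uniform over {+1, -1} (an assumption made here only for illustration, not the paper's exact lemma), a standard Hoeffding inequality controls how far the number of positive labels can deviate from n/2: with probability at least 1 - δ,

```latex
% Generic Hoeffding-type concentration for the label counts (illustrative, not
% the paper's exact statement):
\[
  \Bigl|\,\#\{\, i : y_i = +1 \,\} - \tfrac{n}{2}\,\Bigr|
  \;\le\;
  \sqrt{\tfrac{n}{2}\,\log\tfrac{2}{\delta}} .
\]
```

In particular, the positive and negative labels are balanced up to fluctuations of order √n.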
The second section focuses on the single training data case and presents a formal statement and proof for the main result, Theorem 3.2. The conditions and assumptions required for Theorem 3.2 are discussed, including conditions on hyperparameters, weight initialization scale, signal strength, and dimension. The assumption of oscillations in SGD is also introduced.
The third section presents the proofs for the lemmas in Appendix B.1, which discuss basic properties of training dynamics and the two-layer CNN. These lemmas establish properties of neuron subsets, such as fixed signs of inner products and connections between inner products and model outcomes.
The fourth section presents a fundamental reasoning towards weak signal learning. Lemma B.7 provides a quantitative interpretation of the increasing behavior of inner products, formalizing the role of oscillation in learning weak signals.
The fifth section presents the proof of Theorem B.3, which establishes the benign oscillation of SGD. The proof is done by contradiction and involves bounding the stopping time T(v) and establishing lower bounds on inner products. The proof is organized into several steps, including an analysis of pre-T1 behavior, bounding max inner products, lower bounding inner products, and upper bounding inner products.
Overall, the document provides a detailed analysis of SGD in the presence of noise and multiple training data points, establishing key properties and proving important results. The lemmas and proofs contribute to a better understanding of the behavior of SGD and its ability to learn weak signals.
The paper investigates the behavior of stochastic gradient descent (SGD) in the context of a single training data point. The objective function is rearranged to focus on the difference between the predicted output and the true output. The weights are updated according to specific rules based on the inner products between the weights and the signal patches. The authors analyze the conditions on the hyperparameters and the initialization of the weights. They prove concentration results for the initialization and provide necessary conditions for stable oscillation.
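Written schematically (with illustrative notation, assuming a squared loss and a model given by a sum of activations of filter/patch inner products, which need not match the paper's exact definitions), this residual-based update reads:

```latex
% Schematic single-sample SGD step: r_t is the fitting residual and \sigma an
% activation; the notation is illustrative, not taken from the paper.
\[
  f(W; x) \;=\; \sum_{r}\sum_{p} \sigma\bigl(\langle w_r, x_p\rangle\bigr),
  \qquad
  r_t \;=\; f\bigl(W^{(t)}; x\bigr) - y,
\]
\[
  w_r^{(t+1)}
  \;=\;
  w_r^{(t)} \;-\; \eta\, r_t \sum_{p} \sigma'\!\bigl(\langle w_r^{(t)}, x_p\rangle\bigr)\, x_p .
\]
```

In this form each filter moves along the patch directions by an amount set by the sign and size of the residual r_t, which is why the analysis is organized around the filter/patch inner products and the fitting residual.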
The authors introduce Assumption C.2, which includes conditions on the weight initialization scale, signal strength, and dimension. They then present Theorem C.3, which states that under these conditions, with an appropriate choice of (small) learning rate, the average loss over iterations decreases while the model's learning of the weak signal barely improves over its initialization.
The paper also includes Lemma C.4, which provides a lower bound on the fitting residual based on the maximal inner product between the weights and the signal vectors. The authors divide the analysis into two stages: exponential growth and stabilization. In the exponential growth stage, they track the maximal inner product between the weights and the signal vectors.
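The phrase "exponential growth" can be read through a schematic recursion (stated generically here, not as the paper's lemma): if each SGD step multiplies the maximal signal inner product by a factor of at least 1 + cη for some constant c > 0, then

```latex
% Schematic growth recursion for the exponential-growth stage (illustrative):
\[
  \Lambda_t \;:=\; \max_r\, \bigl\langle w_r^{(t)}, v \bigr\rangle,
  \qquad
  \Lambda_{t+1} \;\ge\; (1 + c\,\eta)\,\Lambda_t
  \;\;\Longrightarrow\;\;
  \Lambda_t \;\ge\; (1 + c\,\eta)^{t}\,\Lambda_0 ,
\]
```

so reaching a constant scale from a small initialization Λ_0 takes on the order of log(1/Λ_0)/log(1 + cη) iterations, after which the dynamics enter the stabilization stage.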
The authors conclude that their findings provide insights into the behavior of SGD in the small learning rate regime for single training data points. They emphasize the importance of hyperparameter selection and initialization in achieving stable oscillation and accurate learning.
Overall, this paper contributes to the understanding of SGD in the context of single training data points and provides necessary conditions for stable oscillation. The results have implications for hyperparameter selection and weight initialization in machine learning algorithms.
The proof of Theorem 4.3 is presented next. The main goal is to show that the stochastic gradient descent (SGD) algorithm can effectively learn the weak signal in a neural network, while avoiding overfitting and memorizing noise. The proof is organized into several steps.
First, the boundedness and sign stability of the SGD algorithm are established. It is shown that the inner products associated with the strong signal are bounded and remain stable throughout the training process. The proof also demonstrates that the inner products associated with the weak signal and noise are negligible compared to the strong signal.
Next, the exponential learning rate of the weak signal is proven. It is shown that the SGD algorithm can effectively learn the weak signal with an exponential rate of convergence. This is achieved by analyzing the dynamics of the inner products and showing that they increase linearly with time.
Finally, the control of noise memorization is addressed. It is proven that the SGD algorithm can effectively avoid memorizing noise and focus on learning the relevant features of the data. This is achieved by bounding the maximum absolute value of the noise inner products over time.
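A numerical counterpart of this step is simply a logging routine: at each SGD iterate one records the largest signal inner product and the largest absolute noise inner product and checks that the latter stays at the order of its initialization. The helper below is a hypothetical sketch with assumed shapes and names, not code from the paper.

```python
# Illustrative logging of the tracked quantities at an SGD iterate: the largest
# signal inner product and the largest absolute noise inner product.
# Shapes and names are assumptions for the sketch, not the paper's code.
import numpy as np


def memorization_stats(W: np.ndarray, signal: np.ndarray, noise: np.ndarray):
    """W: (m, d) filters; signal: (d,) feature vector; noise: (n, d) per-example
    noise patches. Returns (max signal inner product, max |noise inner product|)."""
    max_signal = float((W @ signal).max())
    max_noise = float(np.abs(W @ noise.T).max())
    return max_signal, max_noise


# usage at some iterate t (random placeholders stand in for the actual SGD iterate)
rng = np.random.default_rng(0)
W_t = rng.normal(scale=0.01, size=(10, 50))
v_weak = np.eye(50)[0]
noise_patches = rng.normal(scale=0.5, size=(200, 50))
print(memorization_stats(W_t, v_weak, noise_patches))
```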
Overall, the proof demonstrates that the SGD algorithm can effectively learn the weak signal in a neural network while avoiding overfitting and memorizing noise. The results provide insight into the behavior of SGD and its ability to generalize well to unseen data.
The proof of Proposition D.5 begins by expanding the expression for f_{t_j}(S_{j,k}) and breaking it down into several terms. The proof utilizes Lemma B.5 and Lemma D.2 to establish upper and lower bounds on various terms in the expression. The proof also makes use of the local sign stability assumption to ensure that the inner products grow proportionally. By applying these bounds and using induction, the proposition is proven for all steps t+1.
The proof of Proposition D.4 is also done by induction. The base case is trivial, and it is assumed that the proposition holds at each step 1, ..., t. Using Equation (D.4), the proof shows that the weight update at step t+1 can be bounded by the sum of the weight update at step t and a term involving the data labels and the inner products. By applying Lemma D.2 and Lemma A.3, upper bounds on the weight updates are established, and the final result is obtained by combining these bounds.
The proof of Proposition D.1 utilizes Equation (D.2) to derive an expression for the weight update at step t+1. By analyzing the expression and using the local sign stability assumption, it is shown that the sign of the inner products does not change during the weight update. This result holds for all inner products, and therefore, the proposition is proven.
The proof of Lemma B.5 in the multiple data setting begins by defining a quantity that measures the change in the inner products. By applying Lemma D.2 and Lemma A.3, upper and lower bounds on this quantity are established, which depend on the number of data points and certain constants. By comparing these bounds, it is shown that if the lower bound is sufficiently large, then the upper bound must also be large. This condition ensures that the sign stability holds, which completes the proof.
Overall, these proofs establish important results related to the weight updates and sign stability in stochastic gradient descent. These results provide insights into the behavior of the algorithm and can be used to analyze its convergence properties.
The document discusses the behavior of stochastic gradient descent (SGD) in the context of machine learning. The authors analyze the dynamics of SGD and its ability to learn from training data. They focus on the case of multiple training data points and a small learning rate.
In the first stage of the analysis, the authors track the maximal inner product between the weight vectors and the signal vectors, as well as the inner product between the weight vectors and the noise vectors. They establish lower bounds on the fitting residual and derive upper bounds on the growth of these inner products.
In the second stage, the authors show that before the model learns the weak signal or memorizes any noise vectors, it already fits a proportion of the data points by exploiting the strong signal. They provide bounds on the average loss over iterations and show that the model does not learn the weak signal as well as it learns the strong signal.
In the third stage, the authors analyze the memorization of noise vectors by the model. They show that after fitting the strong signal, the model interpolates the entire dataset by memorizing noise vectors. They provide a reference point for this stage and analyze the evolution dynamics during this interval.
The authors conclude that SGD can effectively learn from multiple training data points with a small learning rate. They highlight the importance of differentiating between strong and weak signals and show that SGD can exploit strong signals to fit data points. They also discuss the memorization of noise vectors and how it leads to interpolation of the entire dataset.
Overall, this analysis provides insights into the behavior of SGD in machine learning and highlights its ability to learn from training data. The authors present detailed mathematical proofs and establish important bounds on various parameters. This research contributes to our understanding of SGD and its applications in training machine learning models.