Summary: Provably Faster Gradient Descent via Long Steps (arxiv.org)
9,636 words - PDF document
One Line
The paper introduces a new analysis technique for gradient descent that uses long step sizes to prove faster convergence rates, and provides a table of step size patterns certified to converge faster.
Key Points
- The work presents a new analysis technique for gradient descent that establishes faster convergence rates.
- The use of long step sizes in gradient descent algorithms can improve convergence rates.
- Nonconstant stepsize policies, including periodic long steps, may increase the objective value in the short term but lead to faster convergence.
- Strong convexity and the weaker growth bound condition enable even faster convergence guarantees for gradient descent.
- The analysis of optimal accelerated and subgradient methods takes into account the decreasing distance to a minimizer.
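The periodic long-step scheme the key points describe can be sketched as follows. This is a minimal illustration, not the paper's certified method; the pattern values below are illustrative placeholders rather than the certified patterns from the paper's table.

```python
import numpy as np

def gradient_descent_long_steps(grad, x0, L, pattern, n_cycles):
    """Gradient descent cycling through a normalized stepsize pattern.

    Each step uses stepsize h / L, with h cycling through `pattern`.
    Long entries (h > 2) can temporarily increase the objective, but a
    well-chosen pattern still contracts over each full cycle.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_cycles):
        for h in pattern:
            x = x - (h / L) * grad(x)
    return x

# Illustrative use on the smooth convex quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 0.1])
L_smooth = 1.0  # smoothness constant = largest eigenvalue of A
grad_f = lambda x: A @ x
x_final = gradient_descent_long_steps(grad_f, [1.0, 1.0], L_smooth,
                                      pattern=[1.5, 2.9], n_cycles=50)
```

On this quadratic, the cycle of a short and a long step contracts every eigencomponent even though the h = 2.9 step alone would overshoot.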
Summaries
36 word summary
This work presents a new analysis technique for gradient descent, using long step sizes, to achieve faster convergence rates. It includes a table of step size patterns that result in faster convergence, proved by combining standard inequalities for smooth convex functions.
47 word summary
This work introduces a new analysis technique for gradient descent that achieves faster convergence rates. It discusses the use of long step sizes in gradient descent algorithms and presents a table showing different step size patterns that result in faster convergence. The analysis combines standard inequalities for smooth convex functions using nonnegative multipliers.
428 word summary
This work presents a new analysis technique for gradient descent that establishes faster convergence rates than previous proofs. The theory allows for nonconstant stepsize policies, including periodic long steps that may increase the objective value in the short term but lead to faster convergence in the long run.
The document discusses the use of long step sizes in gradient descent algorithms to improve convergence rates. It presents a table showing different step size patterns that result in faster convergence, with each pattern proven using a semidefinite programming solution certificate. The analysis of nonconstant stepsizes is carried out via these computer-generated certificates.
The text discusses the use of long steps in gradient descent algorithms to improve their efficiency. The authors present inequalities and equalities that describe the behavior of gradient descent on convex and smooth functions. They introduce nonnegative multipliers and combine these inequalities to bound the decrease in objective value achieved over one pass of the stepsize pattern.
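The inequalities referenced here are the standard interpolation conditions satisfied by any $L$-smooth convex function; the multiplier-combination step can be written schematically as follows (a sketch of the standard argument, not the paper's exact certificate):

```latex
% Interpolation inequality, valid for an L-smooth convex f at points x_i, x_j:
f(x_i) \ge f(x_j) + \langle \nabla f(x_j),\, x_i - x_j \rangle
        + \frac{1}{2L}\,\big\| \nabla f(x_i) - \nabla f(x_j) \big\|^2 .
% A convergence proof selects multipliers \lambda_{ij} \ge 0 and aggregates
%   \sum_{i,j} \lambda_{ij} \cdot (\text{inequality}_{ij})
% so that the combined inequality bounds the final gap f(x_T) - f(x^\ast).
```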
The text excerpt discusses the convergence rate of gradient descent algorithms and the conditions under which faster convergence can be achieved. It introduces the concept of strong convexity and the growth bound condition for optimization functions. The excerpt also mentions the notion of a straightforward stepsize pattern and its associated convergence guarantee.
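For reference, the two conditions mentioned here are standardly written as follows, with $\mu > 0$ and $x^\ast$ a nearest minimizer:

```latex
% \mu-strong convexity:
f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{\mu}{2}\,\|y - x\|^2 .
% Quadratic growth bound (a strictly weaker condition):
f(x) - f(x^\ast) \ge \frac{\mu}{2}\,\|x - x^\ast\|^2 .
```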
The text discusses a method for proving faster gradient descent rates in optimization problems. It introduces a problem formulation (3.1) and then presents a relaxation of this problem to an SDP (3.2). The text explains that under certain conditions, the nonconvex QCQP (3.1) is upper bounded by the SDP (3.2), so a feasible SDP solution certifies the desired descent guarantee.
The text excerpt discusses the concept of straightforward stepsize patterns in gradient descent algorithms. The main result, Theorem 3.1, states that if a stepsize pattern satisfies certain conditions, it is considered straightforward. The proof of this theorem involves checking the feasibility of an explicit certificate for the associated SDP.
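Verifying such a certificate is mechanical: one checks that the multipliers are nonnegative and that an associated symmetric slack matrix is positive semidefinite. A minimal numeric sketch of those two checks (the function name and certificate values below are placeholders, not the paper's actual certificate data):

```python
import numpy as np

def is_valid_certificate(lambdas, S, tol=1e-9):
    """Check the two mechanical feasibility conditions of a certificate:
    nonnegative multipliers and a positive semidefinite slack matrix S."""
    if np.any(np.asarray(lambdas, dtype=float) < -tol):
        return False
    S = np.asarray(S, dtype=float)
    # Symmetrize to guard against tiny asymmetries, then test eigenvalues.
    eigvals = np.linalg.eigvalsh(0.5 * (S + S.T))
    return bool(eigvals.min() >= -tol)

# Placeholder certificate data (illustrative only).
lambdas = [0.5, 1.25, 0.0]
S = np.array([[2.0, -1.0], [-1.0, 2.0]])  # PSD: eigenvalues 1 and 3
print(is_valid_certificate(lambdas, S))  # True
```

The actual certificates in the paper additionally require an aggregation identity to hold exactly, which this sketch omits.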
The text discusses the use of long stepsize patterns in gradient descent optimization algorithms. The authors present a theorem that proves the convergence rate of gradient descent with a specific stepsize pattern. They provide numerical evidence and state conjectures about the longest straightforward stepsize patterns.
The focus of the research is on the decrease of the objective gap in gradient descent methods. The analysis of optimal accelerated and subgradient methods also takes into account the decreasing distance to a minimizer. Finding more refined Lyapunov functions and stepsize patterns is left as a direction for future work.
The summary includes a list of references cited in the document. These references are from various publications related to optimization methods, covering topics such as the design and analysis of first-order methods, branch-and-bound performance estimation programming, and optimized first-order methods for smooth convex minimization.
The excerpt includes a list of references to various scientific papers and publications. It also presents a computer-generated certificate proving a specific rate for a pattern of length 7; the certificate specifies a matrix of values together with calculations verifying its feasibility.