
Summary: Scaling Physics-Informed Neural Networks for High-Dimensional PDEs (arxiv.org)


One Line

This text discusses Stochastic Dimension Gradient Descent (SDGD), a method for scaling Physics-Informed Neural Networks (PINNs) to high-dimensional PDEs by randomly sampling dimension indices and computing gradients only over the sampled terms.

Slides

Slide Presentation (8 slides)


Scaling Physics-Informed Neural Networks for High-Dimensional PDEs

Source: arxiv.org - PDF - 21,648 words

The Curse of Dimensionality


• The curse of dimensionality poses challenges in solving high-dimensional partial differential equations (PDEs) due to exponentially increasing computational costs.

Stochastic Dimension Gradient Descent (SDGD)


• Stochastic Dimension Gradient Descent (SDGD) is proposed as a new method for solving high-dimensional PDEs.

• SDGD utilizes sampling for both forward and backward passes, enabling Physics-Informed Neural Networks (PINNs) to be trained on nontrivial nonlinear PDEs in 100,000 dimensions in just 6 hours.

Low Memory Cost and Unbiased Gradient


• The algorithm for scaling PINNs for high-dimensional PDEs has low memory cost because the backward pass only backpropagates over terms with i ∈ I.

• The gradient used is an unbiased estimate.

Rapid Convergence in High-Dimensional Cases


• Algorithm 2 shows rapid convergence even in extremely high-dimensional cases.

• Instability may occur in Algorithm 2 for 100,000 dimensions due to small batch size and resulting gradient variance.

Closing Slide


• Physics-Informed Neural Networks (PINNs) offer a promising solution for scaling and speeding up the solution of high-dimensional partial differential equations (PDEs).

• SDGD enables training on nontrivial nonlinear PDEs in 100,000 dimensions in just 6 hours.

• Remember the challenges posed by the curse of dimensionality and the importance of efficient algorithms for high-dimensional PDEs.



Scaling Physics-Informed Neural Networks for High-Dimensional PDEs


• PINNs and SDGD offer a solution to the curse of dimensionality in solving high-dimensional PDEs.

• SDGD enables rapid convergence even in extremely high-dimensional cases.

• Efficient algorithms are crucial for tackling the challenges posed by high-dimensional PDEs.

   

Key Points

  • The curse of dimensionality poses challenges in solving high-dimensional partial differential equations (PDEs) due to exponentially increasing computational costs.
  • Stochastic Dimension Gradient Descent (SDGD) is proposed as a new method for solving high-dimensional PDEs.
  • Physics-Informed Neural Networks (PINNs) can be trained on nontrivial nonlinear PDEs in 100,000 dimensions in just 6 hours using a speed-up method utilizing sampling for both forward and backward passes.
  • The algorithm for scaling PINNs for high-dimensional PDEs has low memory cost and uses an unbiased gradient.
  • Algorithm 2 shows rapid convergence even in extremely high-dimensional cases, but instability may occur due to small batch size and resulting gradient variance.

Summaries

28 word summary

This text excerpt discusses the scaling of Physics-Informed Neural Networks (PINNs) for high-dimensional PDEs. The algorithm involves randomly selecting indices and computing gradients with respect to those indices.

37 word summary

The text excerpt discusses the scaling of Physics-Informed Neural Networks (PINNs) for high-dimensional Partial Differential Equations (PDEs). The algorithm for training the PINN involves randomly selecting indices and computing the gradient with respect to those indices. Two extensions provide further speed-up and scale-up.

810 word summary

The curse of dimensionality (CoD) poses challenges in solving high-dimensional partial differential equations (PDEs) due to exponentially increasing computational costs. In this paper, a new method called Stochastic Dimension Gradient Descent (SDGD) is proposed, which decomposes the PDE gradient into dimension-wise pieces and samples a random subset of them at each training iteration.

With a speed-up method utilizing sampling for both forward and backward passes, Physics-Informed Neural Networks (PINNs) can be trained on nontrivial nonlinear PDEs in 100,000 dimensions in just 6 hours of training time. This is achieved on a single GPU, while traditional full-batch training exceeds the GPU memory limit.

Wang et al. propose tensor neural networks for solving high-dimensional Schrödinger equations. Zhang et al. propose using PINNs for solving stochastic differential equations. Zang et al. propose a weak adversarial network for solving PDEs.

Our algorithm for scaling Physics-Informed Neural Networks (PINNs) for high-dimensional PDEs has several advantages. It has low memory cost because the backward pass only backpropagates over terms with i ∈ I. The gradient used is an unbiased estimate of the true full-batch gradient.

The document discusses scaling physics-informed neural networks (PINNs) for high-dimensional partial differential equations (PDEs). The algorithm for training the PINN involves randomly selecting indices and computing the gradient with respect to those indices. Two extensions for further speed-up, gradient accumulation and parallel computing, are also introduced.

The text excerpt discusses a training algorithm for scaling up and speeding up physics-informed neural networks (PINNs) for high-dimensional partial differential equations (PDEs). The algorithm involves sampling both the forward and backward passes. The tradeoff between speed and gradient variance is discussed.

The text excerpt discusses the scaling of Physics-Informed Neural Networks (PINNs) for high-dimensional Partial Differential Equations (PDEs). It introduces Theorem 4.2, which states that the variance of the stochastic gradient decreases as the batch sizes increase.

The text excerpt discusses the scaling of Physics-Informed Neural Networks (PINNs) for high-dimensional Partial Differential Equations (PDEs). It presents a lemma and a remark regarding the boundedness of higher-order derivatives of the network. The SGD trajectory based on these stochastic gradients is shown to converge.

We explore the acceleration achieved by our algorithm and whether Physics-Informed Neural Networks (PINNs) can overcome the curse of dimensionality. We use consistent hyperparameter settings and vary PDE dimensions from 10^2 to 10^5.

Our method achieves faster convergence for high-dimensional scenarios, with Algorithm 2 showing rapid convergence even in extremely high-dimensional cases. However, instability was observed in Algorithm 2 for 100,000 dimensions due to small batch size and resulting gradient variance.

Algorithm 1 and Algorithm 2 are effective in reducing memory cost and accelerating convergence in training Physics-Informed Neural Networks (PINNs). The acceleration provided by Algorithm 2 outweighs the impact of gradient variance in higher dimensional cases.

The document discusses the use of Physics-Informed Neural Networks (PINNs) to approximate the solutions of high-dimensional partial differential equations (PDEs). Two specific PDEs, the Fisher-KPP equation and the Sine-Gordon equation, are considered.

The excerpt discusses the use of physics-informed neural networks (PINNs) for solving high-dimensional partial differential equations (PDEs) with linear-quadratic-Gaussian (LQG) control. It mentions the specific form of the PDEs considered.

The method presented in the document is explored to determine whether it can converge on asymmetric and anisotropic partial differential equations (PDEs). An anisotropic Fokker-Planck equation is designed as an example. The convergence results of the method are presented.

The primary computational bottleneck for TNNs in solving high-dimensional PDEs lies in the first-order derivatives in the loss function. The computational cost scales linearly with the dimension, and TNN incurs larger memory consumption due to non-sharing of network parameters.

This summary highlights the key points of the text excerpt. In Figure 4, the stable convergence of SDGD is demonstrated for the Laplace eigenvalue problem, showing that it can scale up to larger problem sizes. The algorithm is able to handle high-dimensional problems.

Our method effectively captures low-dimensional solutions within high-dimensional equations, leading to faster convergence. We use small-dimension batch sizes and optimize one dimension at a time, triggering a transfer learning effect on other dimensions. The transferability of gradients is validated by theory.

The document discusses the scaling of Physics-Informed Neural Networks (PINNs) for high-dimensional Partial Differential Equations (PDEs). It introduces the full batch gradient and stochastic gradient for PINNs and analyzes their variances. The authors also propose a distributed, multi-GPU training approach.

The article discusses the scaling of physics-informed neural networks (PINNs) for high-dimensional partial differential equations (PDEs). It presents convergence plots and performance results for different algorithms and hyperparameters. The data-parallel approach is compared to the tensor neural network (TNN) baseline.

This excerpt includes a list of references cited in the document "Scaling Physics-Informed Neural Networks for High-Dimensional PDEs." The references include various papers and articles related to the use of deep learning and neural networks in solving high-dimensional partial differential equations.

This summary provides a list of references from various articles and papers related to the application of deep learning and neural networks in solving high-dimensional partial differential equations (PDEs). The references cover topics such as the numerical approximation of semilinear parabolic PDEs.

This text excerpt is a list of references to various research papers on the topic of physics-informed neural networks (PINNs) and their applications in solving high-dimensional partial differential equations (PDEs). The papers mentioned cover a range of related topics.

Raw indexed text (120,236 chars / 21,648 words / 3,342 lines)

Tackling the Curse of Dimensionality with Physics-Informed Neural Networks
Zheyuan Hu *
Khemraj Shukla †
George Em Karniadakis †
Kenji Kawaguchi *
Abstract
The curse-of-dimensionality (CoD) taxes computational resources heavily with exponentially increasing computa-
tional cost as the dimension increases. This poses great challenges in solving high-dimensional PDEs as Richard
Bellman first pointed out over 60 years ago. While there has been some recent success in solving numerically partial
differential equations (PDEs) in high dimensions, such computations are prohibitively expensive, and true scaling
of general nonlinear PDEs to high dimensions has never been achieved. In this paper, we develop a new method of
scaling up physics-informed neural networks (PINNs) to solve arbitrary high-dimensional PDEs. The new method,
called Stochastic Dimension Gradient Descent (SDGD), decomposes a gradient of PDEs into pieces corresponding
to different dimensions and samples randomly a subset of these dimensional pieces in each iteration of training
PINNs. We theoretically prove the convergence guarantee and other desired properties of the proposed method. We
experimentally demonstrate that the proposed method allows us to solve many notoriously hard high-dimensional
PDEs, including the Hamilton-Jacobi-Bellman (HJB) and the Schrödinger equations in thousands of dimensions very
fast on a single GPU using the PINNs mesh-free approach. For instance, we solve nontrivial nonlinear PDEs (one
HJB equation and one Black-Scholes equation) in 100,000 dimensions in 6 hours on a single GPU using SDGD with
PINNs. Since SDGD is a general training methodology of PINNs, SDGD can be applied to any current and future
variants of PINNs to scale them up for arbitrary high-dimensional PDEs.
1 Introduction
The curse-of-dimensionality (CoD) refers to the computational and memory challenges when dealing with high-
dimensional problems that do not exist in low-dimensional settings. The term was first introduced in 1957 by Richard
E. Bellman [7, 18], who was working on dynamic programming, to describe the exponentially increasing costs due to
high dimensions. Later, the concept was adopted in various areas of science, including numerical partial differential
equations, combinatorics, machine learning, probability, stochastic analysis, and data mining. In the context of
numerically solving Partial Differential Equations (PDEs), an increase in the dimensionality of the PDE’s independent
variables tends to lead to a corresponding exponential rise in computational costs needed for obtaining an accurate
solution. This poses a considerable challenge for all existing algorithms.
Physics-Informed Neural Networks (PINNs) are highly practical in solving partial differential equation (PDE)
problems. Compared to other methods, PINNs can solve various PDEs in their general form, enabling mesh-free
solutions and handling complex geometries. Additionally, PINNs leverage the interpolation capability of neural
networks to provide accurate predictions throughout the domain, unlike other methods that can only compute the value
of PDEs at a single point [1, 19, 40]. Due to neural networks’ flexibility and expressiveness, in principle, the set of
functions representable by PINNs have the ability to approximate high-dimensional solutions. However, PINNs may
also fail in tackling CoD due to insufficient memory and slow convergence when dealing with high-dimensional PDEs.
For example, when PDE’s dimension is very high, even just using one collocation point can lead to an insufficient
memory error. Unfortunately, the aforementioned high-dimensional and high-order cases often occur for very useful
PDEs, such as the Hamilton-Jacobi-Bellman (HJB) equation in stochastic optimal control, the Fokker-Planck equation
in stochastic analysis and high-dimensional probability, and the Black-Scholes equation in mathematical finance, etc.
These PDE examples pose a grand challenge to the application of PINNs in effectively tackling real-world large-scale
problems.
To scale up PINNs for arbitrary high-dimensional PDEs, we propose a training method of PINNs by decomposing
and sampling gradients of PDEs, which we name stochastic dimension gradient descent (SDGD). It accelerates the
training of PINNs while significantly reducing the memory cost, thereby realizing the training of any high-dimensional
PDEs with acceleration. Specifically, we reduce the memory consumption in PINNs by decomposing the computational
and memory bottleneck, namely the gradient of the residual loss that contains PDE terms. We propose a novel
decomposition of the gradient of the residual loss into each piece corresponding to each dimension of PDEs. Then, at
* Department of Computer Science, National University of Singapore, Singapore, 119077 ([email protected], [email protected])
† Division of Applied Mathematics, Brown University, Providence, RI 02912, USA (khemraj [email protected], george [email protected])
each iteration of training, we sample a subset of these dimensional pieces to optimize PINNs. Although we use only a
subset per iteration, this sampling process is ensured to be an unbiased estimator of the full gradient of all dimensions,
which is then used to guarantee convergence for all dimensions.
SDGD enables more efficient parallel computations, fully leveraging multi-GPU computing to scale up and expedite
the speed of PINNs. Parallelizing the computation of mini-batches of different dimensional pieces across multiple
devices can expedite convergence. This is akin to traditional SGD over data points, which allows simultaneous
computation on different machines. SDGD also accommodates the use of gradient accumulation on resource-limited
machines to enable larger batch sizes and reduce gradient variance. Our method also enables parallel computations,
fully leveraging multi-GPU computing to scale up and expedite the speed of PINNs. Gradient accumulation involves
accumulating gradients from multiple batches to form a larger batch size for stochastic gradients, reducing the variance
of each stochastic gradient on a single GPU. Due to the refined and general partitioning offered by SDGD, our gradient
accumulation can be executed on devices with limited resources, ideal for edge computing.
We theoretically prove that the stochastic gradients generated by SDGD are unbiased. Based on this, we also
provide convergence guarantees for our method. Additionally, our theoretical analysis showed that, under the same
batch size and memory cost, proper batch sizes for PDE terms and residual points can minimize gradient variance
and accelerate convergence, an outcome not possible solely with conventional SGD over points. SDGD extends and
generalizes traditional SGD over points, providing stable, low-variance stochastic gradients, while enabling smaller
batch sizes and reduced memory usage.
We conduct extensive experiments on several high-dimensional PDEs. We investigate the relationship between
SDGD and SGD over collocation points, where the latter is stable and widely adopted to reduce memory cost previously.
We vary the number of PDE terms and collocation points to study the stability and convergence speed of SDGD and
SGD under the same memory budgets. Experimental results show that our proposed SDGD is as stable as SGD over
points on multiple nonlinear high-dimensional PDEs and can accelerate convergence. In some cases, SDGD over terms
can achieve faster convergence than SGD, under the same memory cost for both methods.
Furthermore, we showcase large-scale PDEs where traditional PINN training directly fails due to an out-of-memory
(OOM) error. Concretely, with the further speed-up method via sampling both forward and backward passes, we can
train PINNs on nontrivial nonlinear PDEs in 100,000 dimensions in 6 hours of training time, while the traditional
training methods exceed the GPU memory limit directly. Combined with parallel computing, the training of the
100,000D nonlinear equation can be reduced to just a few minutes.
After validating the effectiveness of SDGD compared to SGD over collocation points, we compare the algorithms
with other methods. On multiple nonlinear PDEs, our algorithm outperforms other methods in various metrics.
Additionally, our method enables mesh-free training and prediction across the entire domain, while other methods
typically only provide predictions at a single point. Furthermore, we combine our algorithm with adversarial training,
notorious for its slowness in high-dimensional machine learning, and our algorithm significantly accelerates it. We also
demonstrate the generality of our method by applying it to the Schrödinger equation, which has broad connections with
quantum physics and quantum chemistry. Overall, our method provides a paradigm shift for performing large-scale
PINN training.
The rest of this paper is arranged as follows. We present related work in Section 2, our main algorithm for
accelerating and scaling up PINNs in Section 3, theoretical analysis of our algorithms’ convergence in Section 4,
numerical experiments in Section 5, and conclusion in Section 6.
2 Related Work
2.1 Physics-Informed Machine Learning
This paper is based on the concept of Physics-Informed Machine Learning [29]. Specifically, PINNs [42] utilize neural
networks as surrogate solutions for PDEs and optimize the boundary loss and residual loss to approximate the PDEs’
solutions. On the other hand, DeepONet [33] leverages the universal approximation theorem of infinite-dimensional
operators and directly fits the PDE operators in a data-driven manner. It can also incorporate physical information by
using residual loss. Therefore, our algorithms can also be flexibly combined with the training of physics-informed
DeepONet. Since their inception, PINNs have succeeded in solving numerous practical problems in computational
science, e.g., nonlinear PDEs [41], solid mechanics [17], uncertainty quantification [49], inverse water waves problems
[27], and fluid mechanics [8], etc. DeepONets also have shown their great potential in solving stiff chemical mechanics
[14], shape optimization [45], and materials science problems [15].
The success of PINNs has also attracted considerable attention in theoretical analysis. In the study conducted by
Luo et al. [34], the authors delve into the exploration of PINNs’ prior and posterior generalization bounds. Additionally,
they utilize the neural tangent kernel to demonstrate the global convergence of Physics-Informed Neural Networks
(PINNs). Mishra et al. [37] derive an estimate for the generalization error of PINNs, considering both the training
error and the number of training samples. In a different approach, Shin et al. [44] adapt the Schauder approach and
the maximum principle to provide insights into the convergence behavior of the minimizer as the number of training
samples tends towards infinity. They demonstrate that the minimizer converges to the solution in both C 0 and H 1 .
Furthermore, Lu et al. [32] employ the Barron space within the framework of two-layer neural networks to conduct
a prior analysis on PINN with the softplus activation function. This analysis is made possible by drawing parallels
between the softplus and ReLU activation functions. More recently, Hu et al. [23] employ the Rademacher complexity
concept to provide a measure of the generalization error for both PINNs and the extended PINNs variant.
2.2 Machine Learning PDE Solvers in High Dimensions
There have been numerous attempts to tackle high-dimensional PDEs by deep learning to overcome the curse of
dimensionality. In the PINN literature, [46] shows that to learn a high-dimensional HJB equation, the L ∞ loss is
required. The authors of [21] propose to parameterize PINNs by the Gaussian smoothed model and to optimize PINNs
without back-propagation by Stein’s identity, avoiding the vast amount of differentiation in high-dimensional PDE
operators to accelerate the convergence of PINNs. Separable PINN [11] considers a per-axis sampling of residual points
instead of point-wise sampling in high-dimensional spaces, thereby reducing the computational cost of PINN. However,
this method [21, 11] focuses on acceleration on only modest (less than 4) dimensions. Specifically, separable PINNs
[11] are designed mainly for increasing the number of collocation points in 3D PDEs, whose separation of sampling
points becomes intractable in high-dimensional spaces. Separable PINNs [11] also struggle in PDE problems with
non-separable solutions. Han et al. [19, 20] proposed the DeepBSDE solver for high-dimensional PDEs based on the
classical BSDE method and they use deep learning to approximate unknown functions. Extensions of the DeepBSDE
method are presented in [2, 9, 22, 24, 28]. Becker et al. [6] solve high-dimensional optimal stopping problems via
deep learning. The authors of [1] combine splitting methods and deep learning to solve high-dimensional PDEs. Raissi
[40] leverages the relationship between high-dimensional PDEs and stochastic processes whose trajectories provide
supervision for deep learning. Despite the effectiveness of the method in [1, 19, 20, 40], they can only be applied
to a certain restricted class of PDEs. Wang et al. [47, 48] propose tensor neural networks with efficient numerical
integration and separable DeepONet structures for solving high-dimensional Schrödinger equations in quantum physics.
Zhang et al. [51] propose to use PINNs for solving stochastic differential equations by representing their solutions via
spectral dynamically orthogonal and bi-orthogonal methods. Zang et al. [50] propose a weak adversarial network that
solves PDEs by its weak formulation.
In the numerical PDE literature, attempts exist to scale numerical methods to high dimensions, e.g., proper
generalized decomposition (PGD) [10], multi-fidelity information fusion algorithm [39], and ANOVA [35, 52]. Darbon
and Osher [12] propose a fast algorithm for solving high-dimensional Hamilton-Jacobi equations if the Hamiltonian is
convex and positively homogeneous of degree one. The multilevel Picard method [3, 4, 5, 25, 26] is another approach
for approximating solutions of high-dimensional parabolic PDEs, which reformulates the PDE problem as a stochastic
fixed point equation, which is then solved by multilevel and nonlinear Monte-Carlo. Similar to Beck et al. [1] and Han
et al. [19], the multilevel Picard method can only output the solution’s value on one test point and can only be applied
to a certain restricted class of PDEs.
2.3 Stochastic Gradient Descent
This paper accelerates and scales up PINNs based on the idea of stochastic gradient descent (SGD). In particular, we
propose the aforementioned SDGD method. SGD is a common approach in machine learning for handling large-scale
data. In general, SGD is an iterative optimization algorithm commonly used in machine learning for training models. It
updates the model’s parameters by computing the gradients on a small randomly selected subset of training examples,
known as mini-batches. This randomness introduces stochasticity, hence enabling faster convergence and efficient
utilization of large datasets, making SGD a popular choice for training deep learning models.
Among them, the works in [13, 31, 36] are remarkable milestones. Specifically, the condition presented in [31]
stands as a minimum requirement within the general optimization literature, specifically pertaining to the Lipschitz
continuity of the gradient estimator. Their proof can also be extended to establish convergence for our particular case, necessitating the demonstration of an upper bound on the Lipschitz constant of our gradient estimator. On the
other hand, [13, 36] adopt a classical condition that entails a finite upper bound on the variance. In this case, it suffices
to compute an upper bound for the variance term. While [31] assumes a more relaxed condition, the conclusion is
comparably weaker, demonstrating proven convergence but with a notably inferior convergence rate. In the theoretical
analysis presented in section 4 of this paper, we will utilize the aforementioned tools to provide convergence guarantees
for our algorithm, with an emphasis on the effect of the choice of SDGD on stochastic gradient variance and PINNs
convergence.
3 Method
3.1 Physics-Informed Neural Networks (PINNs)
In this paper, we focus on solving the following partial differential equations (PDEs) defined on a domain $\Omega \subset \mathbb{R}^d$:
$$\mathcal{L}u(x) = R(x) \ \text{in } \Omega, \qquad \mathcal{B}u(x) = B(x) \ \text{on } \Gamma, \qquad (1)$$
where L and B are the differential operators for the residual in Ω and for the boundary/initial condition on Γ. PINNs
[42] is a neural network-based PDE solver via minimizing the following boundary and residual loss functions.
$$\mathcal{L}(\theta) = \lambda_b \mathcal{L}_b(\theta) + \lambda_r \mathcal{L}_r(\theta) = \frac{\lambda_b}{n_b}\sum_{i=1}^{n_b}|\mathcal{B}u_\theta(x_{b,i}) - B(x_{b,i})|^2 + \frac{\lambda_r}{n_r}\sum_{i=1}^{n_r}|\mathcal{L}u_\theta(x_{r,i}) - R(x_{r,i})|^2. \qquad (2)$$
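As a concrete illustration of the loss in Eq. (2), here is a minimal PyTorch sketch for a toy one-dimensional problem. This is not the authors' code: the choice of operators (L = d²/dx², B = identity on the boundary), the network size, and the sampled points are illustrative assumptions.

```python
# Minimal sketch of the PINN loss in Eq. (2) for a toy 1D problem,
# assuming L = d^2/dx^2 and B = identity on the boundary; names are illustrative.
import torch

u_theta = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)

def residual_loss(x_r, R_vals, lam_r=1.0):
    # |L u_theta(x_r) - R(x_r)|^2 averaged over the n_r collocation points
    x_r = x_r.requires_grad_(True)
    u = u_theta(x_r)
    du = torch.autograd.grad(u.sum(), x_r, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x_r, create_graph=True)[0]
    return lam_r * ((d2u - R_vals) ** 2).mean()

def boundary_loss(x_b, B_vals, lam_b=1.0):
    # |B u_theta(x_b) - B(x_b)|^2 averaged over the n_b boundary points
    return lam_b * ((u_theta(x_b) - B_vals) ** 2).mean()

x_r = torch.rand(128, 1)                    # residual (collocation) points
x_b = torch.tensor([[0.0], [1.0]])          # boundary points
loss = boundary_loss(x_b, torch.zeros(2, 1)) + residual_loss(x_r, torch.zeros(128, 1))
loss.backward()
```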
3.2 Methodology for High-Dimensional PDEs
We first adopt the simple high-dimensional second-order Poisson’s equation for illustration, then we move to the
widely-used high-dimensional Fokker-Planck equation for the general case.
3.2.1 Introductory Case of the High-Dimensional Poisson’s Equation
We consider a simple high-dimensional second-order Poisson’s equation for illustration:
$$\Delta u(x) = \sum_{i=1}^{d}\frac{d^2}{dx_i^2}u(x) = R(x), \qquad x \in \Omega \subset \mathbb{R}^d, \qquad (3)$$
where u is the solution, and u θ is our PINN model parameterized by θ. The memory and computational cost scale
linearly as d increases. So, for extremely high-dimensional PDEs containing many second-order terms, using one
collocation point can lead to insufficient memory.
However, the memory problem is solvable by inspecting the residual loss on the collocation point x:
$$\ell(\theta) = \frac{1}{2}\left(\sum_{i=1}^{d}\frac{d^2}{dx_i^2}u_\theta(x) - R(x)\right)^2. \qquad (4)$$
The gradient with respect to the model parameters θ for training the PINN is
$$\mathrm{grad}(\theta) := \frac{\partial \ell(\theta)}{\partial \theta} = \left(\sum_{i=1}^{d}\frac{d^2}{dx_i^2}u_\theta(x) - R(x)\right)\left(\sum_{i=1}^{d}\frac{\partial}{\partial \theta}\frac{d^2}{dx_i^2}u_\theta(x)\right). \qquad (5)$$
We are only differentiating with respect to parameters θ on the d PDE terms $\frac{d^2}{dx_i^2}u_\theta(x)$, which is the memory bottleneck in the backward pass since the shape of the gradient is proportional to both the PDE dimension and the parameter count of the PINN. In contrast, the first part $\left(\sum_{i=1}^{d}\frac{d^2}{dx_i^2}u_\theta(x) - R(x)\right)$ is a scalar, which can be precomputed and detached from the GPU since it is not involved in the backpropagation for θ. Since the full gradient grad(θ) is the sum of d independent terms, we can sample several terms for stochastic gradient descent (SGD) using the sampled unbiased gradient estimator. Concretely, our algorithm can be summarized as follows:
1. Choose random indices I ⊂ {1, 2, · · · , d} where |I| is the cardinality of the set I, which is the batch size over PDE terms, where we can set |I| ≪ d to minimize memory cost.
2. For i = 1, 2, · · · , d, compute $\frac{d^2}{dx_i^2}u_\theta(x)$. If i ∈ I, then keep the gradient with respect to θ, else we detach it from GPU to save memory. After detachment, the term will not be involved in the costly backpropagation, and its gradient with respect to θ will not be computed.
3. Compute the unbiased stochastic gradient used to update the model
$$\mathrm{grad}_I(\theta) = \left(\sum_{i=1}^{d}\frac{d^2}{dx_i^2}u_\theta(x) - R(x)\right)\left(\frac{d}{|I|}\sum_{i\in I}\frac{\partial}{\partial \theta}\frac{d^2}{dx_i^2}u_\theta(x)\right). \qquad (6)$$
4. If not converged, go to 1.
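The four steps above can be sketched in PyTorch as follows. This is a hedged illustration rather than the authors' implementation: the network, the dimension d, the batch size |I|, and the single collocation point are assumptions, and the detaching of non-sampled terms is realized by computing their second derivatives without building a differentiable graph.

```python
# Sketch of SDGD (steps 1-4) for the d-dimensional Poisson equation in Eq. (3);
# only the |I| sampled second-derivative terms stay in the autograd graph, so the
# gradient of `loss` with respect to theta reproduces Eq. (6).
import torch

d, batch_I = 100, 8
u_theta = torch.nn.Sequential(torch.nn.Linear(d, 128), torch.nn.Tanh(), torch.nn.Linear(128, 1))
opt = torch.optim.Adam(u_theta.parameters(), lr=1e-3)

def d2u_dxi2(x, i, create_graph):
    # d^2 u_theta / dx_i^2 at the collocation points x (shape [n, d])
    u = u_theta(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0][:, i]
    return torch.autograd.grad(du.sum(), x, create_graph=create_graph)[0][:, i]

x = torch.randn(1, d, requires_grad=True)     # one collocation point
R = torch.zeros(1)                            # right-hand side R(x)

I = torch.randperm(d)[:batch_I].tolist()      # step 1: sample |I| dimensions
terms = [d2u_dxi2(x, i, create_graph=(i in I)) for i in range(d)]   # step 2

scalar = (torch.stack(terms).sum(0) - R).detach()     # precomputed residual, no gradient
sampled = torch.stack([terms[i] for i in I]).sum(0)   # terms kept for the backward pass
loss = ((d / batch_I) * scalar * sampled).sum()       # step 3: unbiased surrogate

opt.zero_grad()
loss.backward()                               # backpropagates only through the sampled terms
opt.step()                                    # step 4: repeat until convergence
```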
Our algorithm enjoys the following good properties and extensions:
• Low memory cost: Since the main cost is from the backward pass, and we are only backpropagating over terms
with i ∈ I, the cost is the same as the corresponding |I|-dimensional PDE.
• Unbiased stochastic gradient: Our gradient is an unbiased estimator of the true full batch gradient, i.e.,
$$\mathbb{E}_I[\mathrm{grad}_I(\theta)] = \mathrm{grad}(\theta), \qquad (7)$$
so that modern SGD accelerators such as Adam [30] can be adopted.
• Accumulate gradient for full batch GD: For the full GD that exceeds the memory, we can select non-overlapping
index sets {I k } k∈K such that ∪ k I k = {1, 2, · · · , d}, then we can combine these minibatch gradients to get the
full gradient,
$$\frac{1}{|K|}\sum_{k}\mathrm{grad}_{I_k}(\theta) = \mathrm{grad}(\theta). \qquad (8)$$
This baseline process is memory-efficient but time-consuming. It is memory efficient since it divides the
entire computationally intensive gradient into the sum of several stochastic gradients whose computations are
memory-efficient. This operation is conducted one by one using the “For loop from i = 1 to d”. But it is rather
time-consuming because the “For loop” cannot be parallelized.
A common practice in PINN is to sample stochastic collocation points to reduce the batch size of points, which is
a common way to reduce memory costs in previous methods and even the entire field of deep learning. Next, we
compare SDGD with SGD over collocation points:
• SGD on collocation points has a minimum batch size of 1 point plus the d-dimensional equation.
• SDGD can be combined with SGD on points so that its minimal batch size is 1 point plus 1D equation.
• Thus, our method can solve high-dimensional PDEs arbitrarily since its minimal computational cost will not
grow as the dimension d increases. In contrast, the minimum cost of SGD on points scales linearly as d grows.
• Empirically, it is interesting to see how the two SGDs affect convergence. In particular, B-point + D-term
with the same B × D quantity has the same computational and memory cost, e.g., 50-point-100-terms and
500-point-10-terms.
3.2.2 General Case
We are basically doing the decomposition:
$$\mathcal{L}u = \sum_{i=1}^{N_L}\mathcal{L}_i u, \qquad (9)$$
where $\mathcal{L}_i$ are the $N_L$ decomposed terms. We have the following examples:
• In the d-dimensional Poisson cases, $N_L = d$ and $\mathcal{L}_i u = \frac{\partial^2}{\partial x_i^2}u(x)$. This observation also highlights the
significance of decomposition methods for high-dimensional Laplacian operators. The computational bottleneck
in many high-dimensional second-order PDEs often lies in the high-dimensional Laplacian operator.
• Consider the d-dimensional Fokker–Planck (FP) equation, which has numerous real-world applications in
optimal control and finance:
$$\frac{\partial u(x,t)}{\partial t} = -\sum_{i=1}^{d}\frac{\partial}{\partial x_i}\left[F_i(x)u(x,t)\right] + \frac{1}{2}\sum_{i,j=1}^{d}\frac{\partial^2}{\partial x_i \partial x_j}\left[D_{ij}(x)u(x,t)\right], \qquad (10)$$
where $x = (x_1, x_2, \ldots, x_d)$ is a d-dimensional vector, u(x, t) is the probability density function of the
stochastic process x(t), F i (x) is the i-th component of the drift coefficient vector F (x), and D ij (x) is the
(i, j)-th component of the diffusion coefficient matrix D(x).
Then, our decomposition over the PDE terms will be
$$\mathcal{L}u = \frac{\partial u(x,t)}{\partial t} + \sum_{i=1}^{d}\frac{\partial}{\partial x_i}\left[F_i(x)u(x,t)\right] - \frac{1}{2}\sum_{i,j=1}^{d}\frac{\partial^2}{\partial x_i \partial x_j}\left[D_{ij}(x)u(x,t)\right] = \sum_{i,j=1}^{d}\left\{\frac{1}{d^2}\frac{\partial u(x,t)}{\partial t} + \frac{1}{d}\frac{\partial}{\partial x_i}\left[F_i(x)u(x,t)\right] - \frac{1}{2}\frac{\partial^2}{\partial x_i \partial x_j}\left[D_{ij}(x)u(x,t)\right]\right\} = \sum_{i,j=1}^{d}\mathcal{L}_{ij}u(x), \qquad (11)$$
where we denoted
$$\mathcal{L}_{ij}u(x) = \frac{1}{d^2}\frac{\partial u(x,t)}{\partial t} + \frac{1}{d}\frac{\partial}{\partial x_i}\left[F_i(x)u(x,t)\right] - \frac{1}{2}\frac{\partial^2}{\partial x_i \partial x_j}\left[D_{ij}(x)u(x,t)\right]. \qquad (12)$$
• The d-dimensional bi-harmonic equation:
$$\Delta^2 u(x) = 0, \qquad x \in \Omega \subset \mathbb{R}^d, \qquad (13)$$
where $\Delta = \sum_{i=1}^{d}\frac{\partial^2}{\partial x_i^2}$ is the Laplacian operator. For this high-dimensional fourth-order PDE, we can do the following decomposition for the PDE operator:
$$\mathcal{L}u = \Delta^2 u(x) = \sum_{i,j=1}^{d}\frac{\partial^4}{\partial x_i^2 \partial x_j^2}u(x) = \sum_{i,j=1}^{d}\mathcal{L}_{ij}u(x), \qquad (14)$$
where
$$\mathcal{L}_{ij}u(x) = \frac{\partial^4}{\partial x_i^2 \partial x_j^2}u(x). \qquad (15)$$
• For other high-dimensional PDEs, their complexity arises from the dimensionality itself. Therefore, it is always
possible to find decomposition methods that can alleviate this complexity.
After introducing these illustrative examples, we go back to the algorithm. The computational and memory bottleneck is the residual loss:
$$\ell(\theta) = \frac{1}{2}\left(\mathcal{L}u(x) - R(x)\right)^2. \qquad (16)$$
The gradient with respect to the model parameters θ for training the PINN is
$$\mathrm{grad}(\theta) = \frac{\partial \ell(\theta)}{\partial \theta} = \left(\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - R(x)\right)\left(\sum_{i=1}^{N_L}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right). \qquad (17)$$
Subsequently, we sample the second factor (the terms entering the backward pass) to reduce memory cost. Concretely, our algorithm can be summarized as follows:
1. Choose random indices I ⊂ {1, 2, · · · , N_L} where |I| is the cardinality of the set I, which is the batch size over PDE terms, where we can set |I| ≪ N_L to minimize memory cost.
2. For i = 1, 2, · · · , N_L, compute $\mathcal{L}_i u_\theta(x)$. If i ∈ I, then keep the gradient with respect to θ, else we detach it from GPU to save memory. After detachment, the term will not be involved in the costly backpropagation, and its gradient with respect to θ will not be computed, which saves GPU memory costs.
3. Compute the unbiased stochastic gradient used to update the model
$$\mathrm{grad}_I(\theta) = \left(\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right). \qquad (18)$$
4. If not converged, go to 1.
The algorithm is summarized in Algorithm 1.
Algorithm 1 Training Algorithm for scaling-up by sampling the gradient in the backward pass.
1: while NOT CONVERGED do
2:   Choose random indices I ⊂ {1, 2, · · · , N_L} where |I| is the cardinality of the set I, which is the batch size over PDE terms, where we can set |I| ≪ N_L to minimize memory cost.
3:   for i ∈ {1, 2, · · · , N_L} do
4:     Compute $\mathcal{L}_i u_\theta(x)$. If i ∈ I, then keep the gradient with respect to θ, else we detach it from GPU to save memory. After detachment, the term will not be involved in the costly backpropagation, and its gradient with respect to θ will not be computed.
5:   end for
6:   Compute the unbiased stochastic gradient used to update the model
     $$\mathrm{grad}_I(\theta) = \left(\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right).$$
7: end while

3.3 Further Scale-up via Gradient Accumulation and Parallel Computing
This subsection introduces two straightforward extensions of our Algorithm 1 for further speed-up and scale-up.
Gradient Accumulation. Since our method involves SDGD, in large-scale problems, we have to reduce the batch
size to fit it within the available GPU memory. However, a very small batch size can result in significant gradient
variance. In such cases, gradient accumulation is a promising idea. Specifically, gradient accumulation involves
sampling different index sets I 1 , I 2 , · · · , I n , computing the corresponding gradients grad I k (θ) for k = 1, 2, · · · , n,
and then averaging them as the final unbiased stochastic gradient for one-step optimization to effectively increase the
batch size. The entire process can be implemented on a single GPU, and we only need to be mindful of the tradeoff
between computation time and gradient variance.
Parallel Computing. Just as parallel computing can be used in machine learning to increase batch size and
accelerate convergence in SGD, our new SDGD also supports parallel computing. Recall that the stochastic gradient
by sampling the PDE terms randomly is given by
$$\mathrm{grad}_I(\theta) = \left(\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right).$$
We can compute the above gradient for different index sets I 1 , I 2 , · · · , I n on various machines simultaneously and
accumulate them to form a larger-batch stochastic gradient.
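A minimal sketch of the gradient-accumulation loop is given below. The helper `sdgd_surrogate_loss` is a placeholder standing in for the unbiased SDGD loss built from a sampled index set (for instance, as in the Poisson sketch above); the model, batch sizes, and number of accumulation steps are illustrative assumptions.

```python
# Sketch of gradient accumulation over n sampled index sets I_1, ..., I_n on a single GPU.
import torch

model = torch.nn.Sequential(torch.nn.Linear(100, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
num_terms, batch_I, n_accum = 100, 8, 4

def sdgd_surrogate_loss(I):
    # Placeholder: in practice this would return the unbiased SDGD surrogate loss
    # for the sampled index set I, as in the Poisson sketch above.
    x = torch.randn(16, 100)
    return model(x).pow(2).mean()

opt.zero_grad()
for _ in range(n_accum):
    I = torch.randperm(num_terms)[:batch_I]          # fresh index set I_k
    (sdgd_surrogate_loss(I) / n_accum).backward()    # gradients accumulate in the .grad buffers
opt.step()                                           # one step with the averaged, larger-batch gradient
```

For multi-GPU parallelism, the same per-index-set gradients would instead be computed on different devices and averaged (for example, via an all-reduce) before the optimizer step.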
3.4 Further Speed-up via Simultaneous Sampling in Forward and Backward Passes
In this section, we discuss how to accelerate further large-scale PINN training based on our proposed memory reduction
methods to make it both fast and memory efficient.
Although the proposed methods can significantly reduce memory cost and use SDGD to obtain some acceleration
through gradient randomness, the speed is still slow for particularly large-scale problems because we need to calculate
each term $\{\mathcal{L}_i u(x)\}_{i=1}^{N_L}$ in the forward pass one by one, and we are only omitting some of these terms in the backward pass
to scale up. For example, in extremely high-dimensional cases, we may need to calculate thousands of second-order
derivatives of PINN, and the calculation speed is clearly unacceptable.
To overcome this bottleneck, we can perform the same unbiased sampling on the forward pass to accelerate it while
ensuring that the entire gradient is unbiased. In other words, we only select some indices for calculation in the forward
pass and select another set of indices for the backward pass, combining them into a very cheap unbiased stochastic
gradient.
Mathematically, consider the full gradient again:
$$\mathrm{grad}(\theta) = \frac{\partial \ell(\theta)}{\partial \theta} = \left(\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - R(x)\right)\left(\sum_{i=1}^{N_L}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right). \qquad (19)$$
We choose two random and independent index sets I, J ⊂ {1, 2, · · · , N_L} for the sampling of the backward and the forward passes, respectively. The corresponding stochastic gradient is:
$$\mathrm{grad}_{I,J}(\theta) = \left(\frac{N_L}{|J|}\sum_{j\in J}\mathcal{L}_j u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right). \qquad (20)$$
Since I, J are independent, the gradient estimator is unbiased:
$$\mathbb{E}_{I,J}[\mathrm{grad}_{I,J}(\theta)] = \mathrm{grad}(\theta). \qquad (21)$$
Concretely, our algorithm can be summarized as follows:
1. Choose random indices I, J ⊂ {1, 2, · · · , N L }, where we can set |I| ≪ N L to minimize memory cost and
|J| ≪ N L to further speed up.
2. For i ∈ I, compute L i u θ (x) and keep the gradient with respect to θ.
3. For j ∈ J, compute L j u θ (x) and detach it to save memory cost.
4. Compute the unbiased stochastic gradient used to update the model
$$\mathrm{grad}_{I,J}(\theta) = \left(\frac{N_L}{|J|}\sum_{j\in J}\mathcal{L}_j u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right). \qquad (22)$$
5. If not converged, go to 1.
The algorithm is summarized in Algorithm 2.
Algorithm 2 Training Algorithm for scaling-up and further speeding-up by sampling both the forward and backward passes.
1: while NOT CONVERGED do
2:   Choose random indices I, J ⊂ {1, 2, · · · , N_L}, where we can set |I| ≪ N_L to minimize memory cost and |J| ≪ N_L to further speed up.
3:   for i ∈ I do
4:     Compute $\mathcal{L}_i u_\theta(x)$ and keep the gradient with respect to θ.
5:   end for
6:   for j ∈ J do
7:     Compute $\mathcal{L}_j u_\theta(x)$ and detach it to save memory cost.
8:   end for
9:   Compute the unbiased stochastic gradient used to update the model
     $$\mathrm{grad}_{I,J}(\theta) = \left(\frac{N_L}{|J|}\sum_{j\in J}\mathcal{L}_j u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right).$$
10: end while
Trading off Speed with Gradient Variance. Obviously, since we have done more sampling, the gradient variance
of this method may be larger, and the convergence may be sub-optimal. However, at the initialization stage of
the problem, an imprecise gradient is sufficient to make the loss drop significantly, which is the tradeoff between
convergence quality and convergence speed. We will demonstrate in experiments that this method is particularly
effective for extremely large-scale PINNs training.
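Below is a hedged PyTorch sketch of Algorithm 2 for the introductory Poisson case: one index set J is sampled for a detached forward estimate of the residual and an independent set I for the backward pass. The dimension, batch sizes, number of collocation points, and network are illustrative assumptions.

```python
# Sketch of Algorithm 2 for the d-dimensional Poisson equation: J gives a detached
# forward estimate of the residual, I gives the terms kept for the backward pass.
import torch

d, bI, bJ = 100, 8, 8
u_theta = torch.nn.Sequential(torch.nn.Linear(d, 128), torch.nn.Tanh(), torch.nn.Linear(128, 1))
opt = torch.optim.Adam(u_theta.parameters(), lr=1e-3)

def d2u_dxi2(x, i, create_graph):
    u = u_theta(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0][:, i]
    return torch.autograd.grad(du.sum(), x, create_graph=create_graph)[0][:, i]

x = torch.randn(4, d, requires_grad=True)    # a few collocation points
R = torch.zeros(4)                           # right-hand side R(x)

I = torch.randperm(d)[:bI].tolist()          # backward-pass terms (graph kept)
J = torch.randperm(d)[:bJ].tolist()          # forward-pass terms (detached)

forward_est = (d / bJ) * sum(d2u_dxi2(x, j, create_graph=False) for j in J).detach()
backward_sum = (d / bI) * sum(d2u_dxi2(x, i, create_graph=True) for i in I)
loss = ((forward_est - R) * backward_sum).mean()   # gradient matches the sampled estimator

opt.zero_grad(); loss.backward(); opt.step()
```

Because I and J are drawn independently, the expectation of this gradient over both index sets recovers the full gradient, at the price of a larger variance than in Algorithm 1.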
3.5 Practical Consideration: Normalization for Stable Gradient
In high-dimensional scenarios, the number of terms in the PDE increases with the dimensionality of the problem,
leading to the possibility of a substantially large scale for the entire PDE residual. This is different from low-dimensional
problems, and necessitates normalizing both the PDE terms and their corresponding residual losses to prevent
the occurrence of gradient explosion. Specifically, our approach is as follows.
For the general PDE we have considered, the full batch gradient and stochastic gradients generated by Algorithm 1 and Algorithm 2 are
$$\mathrm{grad}(\theta) = \left(\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - R(x)\right)\left(\sum_{i=1}^{N_L}\frac{\partial}{\partial\theta}\mathcal{L}_i u_\theta(x)\right),$$
$$\mathrm{grad}_I(\theta) = \left(\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial\theta}\mathcal{L}_i u_\theta(x)\right),$$
$$\mathrm{grad}_{I,J}(\theta) = \left(\frac{N_L}{|J|}\sum_{j\in J}\mathcal{L}_j u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial\theta}\mathcal{L}_i u_\theta(x)\right).$$
Since the full gradient is the sum of N_L subgradients along each dimension where N_L grows with d, the scale of this PDE inherently increases with dimension. If we do not control the learning rate or normalize appropriately, even with regular training, we may encounter the problem of exploding gradients. Compared with adjusting the learning rate, normalizing the loss or gradients is more versatile. We only need to normalize using $\frac{1}{N_L^2}$ for both the forward and backward passes, and the corresponding new gradients are given by:
$$\mathrm{grad}(\theta) = \left(\frac{1}{N_L}\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - \frac{R(x)}{N_L}\right)\left(\frac{1}{N_L}\sum_{i=1}^{N_L}\frac{\partial}{\partial\theta}\mathcal{L}_i u_\theta(x)\right),$$
$$\mathrm{grad}_I(\theta) = \left(\frac{1}{N_L}\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - \frac{R(x)}{N_L}\right)\left(\frac{1}{|I|}\sum_{i\in I}\frac{\partial}{\partial\theta}\mathcal{L}_i u_\theta(x)\right),$$
$$\mathrm{grad}_{I,J}(\theta) = \left(\frac{1}{|J|}\sum_{j\in J}\mathcal{L}_j u_\theta(x) - \frac{R(x)}{N_L}\right)\left(\frac{1}{|I|}\sum_{i\in I}\frac{\partial}{\partial\theta}\mathcal{L}_i u_\theta(x)\right).$$
4 Theory
In this section, we analyze the convergence of our proposed algorithm.
4.1 SDGD is Unbiased
In previous sections, we have shown that the stochastic gradients generated by our proposed SDGD are unbiased:
Theorem 4.1. The stochastic gradients grad I (θ) and grad I,J (θ) in our Algorithms 1 and 2, respectively, parameterized
by index sets I, J, are an unbiased estimator of the full-batch gradient grad(θ) using all PDE terms, i.e., the expected
values of these estimators match that of the full-batch gradient, E I [grad I (θ)] = E I,J [grad I,J (θ)] = grad(θ).
Proof. We prove the theorem in Section 3.
4.2 SDGD Reduces Gradient Variance
In this subsection, we aim to show that SDGD can be regarded as another form of SGD over PDE terms, serving as a
complement to the commonly used SGD over residual points. Specifically, B-point + D-term where B, D ∈ Z + with
the same B × D quantity has the same computational and memory cost, e.g., 50-point-100-terms and 500-point-10-
terms. In particular, we shall demonstrate that properly choosing the batch sizes of residual points B and PDE terms D
under the constant memory cost (B × D) can lead to reduced stochastic gradient variance and accelerated convergence
compared to previous practices that use SGD over points only.
We assume that the total full batch is with $N_r$ residual points $\{x_i\}_{i=1}^{N_r}$ and $N_L$ PDE terms $\mathcal{L} = \sum_{i=1}^{N_L}\mathcal{L}_i$; then the loss function for PINN optimization is
$$\frac{1}{2 N_r N_L^2}\sum_{i=1}^{N_r}\left(\mathcal{L}u_\theta(x_i) - R(x_i)\right)^2, \qquad (23)$$
where we normalize over both the number of residual points and the number of PDE terms, which does not impact the
directions of the gradients. The full batch gradient is:
$$g(\theta) := \frac{1}{N_r N_L^2}\sum_{i=1}^{N_r}\left(\mathcal{L}u_\theta(x_i) - R(x_i)\right)\frac{\partial}{\partial\theta}\mathcal{L}u_\theta(x_i) = \frac{1}{N_r N_L^2}\sum_{i=1}^{N_r}\left(\mathcal{L}u_\theta(x_i) - R(x_i)\right)\left(\sum_{j=1}^{N_L}\frac{\partial}{\partial\theta}\mathcal{L}_j u_\theta(x_i)\right). \qquad (24)$$
If the random index sets J ⊂ {1, 2, · · · , N L } and B ⊂ {1, 2, · · · , N r } are chosen, then the SGD gradient with |B|
residual points and |J| PDE terms using these index sets is
$$g_{B,J}(\theta) = \frac{1}{|B||J|N_L}\sum_{i\in B}\left(\mathcal{L}u_\theta(x_i) - R(x_i)\right)\left(\sum_{j\in J}\frac{\partial}{\partial\theta}\mathcal{L}_j u_\theta(x_i)\right), \qquad (25)$$
where $|\cdot|$ computes the cardinality of a set. It is straightforward to show that $\mathbb{E}_{B,J}[g_{B,J}(\theta)] = g(\theta)$.
Theorem 4.2. For the random index sets (B, J) where J ⊂ {1, 2, · · · , N L } is that for indices of PDE terms and
B ⊂ {1, 2, · · · , N r } is that for indices of residual points, then
$$\mathbb{V}_{B,J}[g_{B,J}(\theta)] = \frac{C_1|J| + C_2|B| + C_3}{|B||J|}, \qquad (26)$$
where V computes the variance of a random variable, C 1 , C 2 , C 3 are constants independent of B, J.
Proof. The theorem is proved in Appendix A.1
Intuitively, the variance of the stochastic gradient tends to decrease as the batch sizes (|B|, |J|) increase, and it
converges to zero for |B|, |J| → ∞.
This verifies that SDGD can be regarded as another form of SGD over PDE terms, serving as a complement
to the commonly used SGD over residual points. |B|-point + |J|-term with the same |B| × |J| quantity has the
same computational and memory cost, e.g., 50-point-100-terms and 500-point-10-terms. According to Theorem
4.2, under the same memory budget, i.e., |B| × |J| is fixed, then there exists a particular choice of batch sizes |B|
and |J| that minimizes the gradient variance, in turn accelerating and stabilizing convergence. This is because the
stochastic gradient variance V B,J [g B,J (θ)] is a function of finite batch sizes |B| and |J|, which therefore can achieve
its minimum value at a certain choice of batch sizes.
Let us take the extreme cases as illustrative examples. The first extreme case is when the PDE terms have no
variance, meaning that the terms $\frac{\partial}{\partial\theta}\mathcal{L}_j u_\theta(x_i)$ are identical for all j after we fix i. In this case, if a memory budget of,
for example, 100 units is given, the optimal choice would be to select 100 points and one PDE term with the minimum
variance. Choosing more PDE terms would not decrease the gradient variance and would be less effective than using
the entire memory budget for sampling points. Conversely, if the points have no variance in terms of the gradient they
induce, the optimal choice would be one point and 100 PDE terms. In practice, the selection of residual points and
PDE terms will inevitably introduce some variance. Therefore, in the case of a fixed memory cost, the choice of batch
size for PDE terms and residual points involves a tradeoff. Increasing the number of PDE terms reduces the variance
contributed by the PDE terms but also decreases the number of residual points, thereby increasing the variance of
the points. Conversely, decreasing the number of PDE terms reduces the variance from the points but increases the
variance from the PDE terms. Thus, there exists an optimal selection strategy to minimize the overall gradient variance
since the choices of batch sizes are finite.
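To make this tradeoff explicit, the following short derivation (our illustration, not part of the original text) computes the optimal split of a fixed memory budget $|B||J| = M$ using Theorem 4.2. Substituting $|J| = M/|B|$,
$$\mathbb{V}_{B,J}[g_{B,J}(\theta)] = \frac{C_1|J| + C_2|B| + C_3}{|B||J|} = \frac{C_1}{|B|} + \frac{C_2}{|J|} + \frac{C_3}{M} = \frac{C_1}{|B|} + \frac{C_2|B|}{M} + \frac{C_3}{M}.$$
Setting the derivative with respect to $|B|$ to zero gives $-C_1/|B|^2 + C_2/M = 0$, so
$$|B|^* = \sqrt{\frac{C_1 M}{C_2}}, \qquad |J|^* = \sqrt{\frac{C_2 M}{C_1}},$$
rounded to feasible integer batch sizes. A larger $C_1$ (variance attributable to the residual points) pushes the budget toward more points, while a larger $C_2$ (variance attributable to the PDE terms) pushes it toward more terms, consistent with the two extreme cases above.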
4.3 Gradient Variance Bound and Convergence of SDGD
To establish the convergence of unbiased stochastic gradient descent, we require either Lipschitz continuity of the
gradients [31] or bounded variances [13, 36], with the latter leading to faster convergence. To prove this property, we
need to make the following steps. Firstly, we define the neural network serving as the surrogate model in the PINN
framework.
Definition 4.1. (Neural Network). A deep neural network (DNN) $u_\theta : x = (x_1, \ldots, x_d) \in \mathbb{R}^d \mapsto u_\theta(x) \in \mathbb{R}$, parameterized by θ of depth L, is the composition of L linear functions with element-wise non-linearity σ, expressed as below:
$$u_\theta(x) = W_L\sigma(W_{L-1}\sigma(\cdots\sigma(W_1 x)\cdots)), \qquad (27)$$
where $x \in \mathbb{R}^d$ is the input, and $W_l \in \mathbb{R}^{m_l \times m_{l-1}}$ is the weight matrix at the l-th layer with $d = m_0$ and $m_L = 1$. The parameter vector θ is the vectorization of the collection of all parameters. We denote h as the maximal width of the neural network, i.e., $h = \max(m_L, \cdots, m_0)$.
For the nonlinear activation function σ, the residual ground truth R(x), we assume the following:
Assumption 4.1. We assume that the activation function is smooth and |σ (n) (x)| ≤ 1 and σ (n) is 1-Lipschitz, e.g., the
sine and cosine activations. We assume that |R(x)| ≤ R for all x.
Remark 1. The sine and cosine functions naturally satisfy the aforementioned conditions. As for the hyperbolic
tangent (tanh) activation function, when the order n of the PDE is determined, there exists a constant C n such that the
activation function C n tanh(x) satisfies the given assumptions. This constant can be absorbed into the weights of the
neural network. This is because both the tanh function and its derivatives up to order n are bounded.
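The following is a minimal PyTorch sketch of the network class in Definition 4.1 with the sine activation permitted by Assumption 4.1. The bias-free linear layers mirror the definition; the concrete depth and widths are illustrative.

```python
# Sketch of the DNN in Definition 4.1: L bias-free linear maps composed with an
# element-wise smooth activation; sine satisfies Assumption 4.1.
import torch

class SineMLP(torch.nn.Module):
    def __init__(self, d, width=128, depth=4):
        super().__init__()
        dims = [d] + [width] * (depth - 1) + [1]          # m_0 = d, ..., m_L = 1
        self.layers = torch.nn.ModuleList(
            torch.nn.Linear(dims[l], dims[l + 1], bias=False) for l in range(depth)
        )

    def forward(self, x):
        for layer in self.layers[:-1]:
            x = torch.sin(layer(x))                       # element-wise non-linearity sigma
        return self.layers[-1](x)                         # final linear map W_L

u_theta = SineMLP(d=100)
print(u_theta(torch.randn(8, 100)).shape)                 # torch.Size([8, 1])
```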
Recall that the full batch gradient is
$$g(\theta) = \frac{1}{N_r N_L^2}\sum_{i=1}^{N_r}\left(\mathcal{L}u_\theta(x_i) - R(x_i)\right)\left(\sum_{j=1}^{N_L}\frac{\partial}{\partial\theta}\mathcal{L}_j u_\theta(x_i)\right). \qquad (28)$$
The stochastic gradient produced by Algorithm 1 is
$$g_{B,J}(\theta) = \frac{1}{|B||J|N_L}\sum_{i\in B}\left(\mathcal{L}u_\theta(x_i) - R(x_i)\right)\left(\sum_{j\in J}\frac{\partial}{\partial\theta}\mathcal{L}_j u_\theta(x_i)\right). \qquad (29)$$
The stochastic gradient produced by Algorithm 2 is
$$g_{B,J,K}(\theta) = \frac{1}{|B||J|N_L}\sum_{i\in B}\left(\frac{N_L}{|K|}\sum_{k\in K}\mathcal{L}_k u_\theta(x_i) - R(x_i)\right)\left(\sum_{j\in J}\frac{\partial}{\partial\theta}\mathcal{L}_j u_\theta(x_i)\right), \qquad (30)$$
where B is the index set sampling over the residual points, while J and K sample the PDE terms in the forward pass
and the backward pass, respectively.
Lemma 4.1. (Bounding the Gradient Variance) Suppose that we consider the neural network function class:
$$\mathcal{NN}^L_M := \left\{x \mapsto W_L\sigma(W_{L-1}\sigma(\cdots\sigma(W_1 x)\cdots)) \;\middle|\; \|W_l\|_2 \le M(l)\right\}. \qquad (31)$$
If the PDE under consideration has at most nth order derivatives, then the gradient variance can be bounded by
$$\mathbb{V}_{B,J}(g_{B,J}(\theta)) \le \frac{2}{|B||J|N_L}\left(n!(L-1)^n M(L)\prod_{l=1}^{L-1}M(l)^n + R\right)^2\cdot\left((n+1)!(L-1)^{n+1}M(L)\prod_{l=1}^{L-1}M(l)^{n+1}\max_{x\in\Omega}\|x\|\right)^2,$$
$$\mathbb{V}_{B,J,K}(g_{B,J,K}(\theta)) \le \frac{2}{|B||J||K|}\left(n!(L-1)^n M(L)\prod_{l=1}^{L-1}M(l)^n + R\right)^2\cdot\left((n+1)!(L-1)^{n+1}M(L)\prod_{l=1}^{L-1}M(l)^{n+1}\max_{x\in\Omega}\|x\|\right)^2. \qquad (32)$$
Proof. The proof of this lemma is provided in Appendix A.2
Remark 2. Intuitively, given the order of the PDE and the number of layers in the neural network, it is observed that
both of them are finite. Consequently, the absolute values of the higher-order derivatives of the network are necessarily
bounded by a quantity that is related to the norms of its weight matrices.
Next, we define the SGD trajectory, and we will demonstrate the convergence of the SGD trajectory based on the
stochastic gradients provided by our Algorithms 1 and 2.
Definition 4.2. (SGD Trajectory) Given step sizes (learning rates) $\{\gamma_n\}_{n=1}^\infty$ and the initialized PINN parameters $\theta^1$, the update rules of SGD using Algorithms 1 and 2 are
$$\theta^{n+1} = \theta^n - \gamma_n g_{B,J}(\theta^n), \qquad (33)$$
$$\theta^{n+1} = \theta^n - \gamma_n g_{B,J,K}(\theta^n), \qquad (34)$$
respectively.
The following theorem demonstrates the convergence rate of SDGD in PINN training.
Theorem 4.3. (Convergence of SDGD under Bounded Gradient Variance) Fix some tolerance level δ > 0, suppose that the SGD trajectory given in equations (33, 34) is bounded, i.e., $\|W_l^n\| \le M(l)$ for all n where the collection of all $\{W_l^n\}_{l=1}^L$ is $\theta^n$, and that the minimizer of the PINN loss is $\theta^*$, and that the SGD step size follows the form $\gamma_n = \gamma/(n+m)^p$ for some p ∈ (1/2, 1] and large enough γ, m > 0, then
1. There exist neighborhoods U and U_1 of $\theta^*$ such that, if $\theta^1 \in U_1$, the event
$$E_\infty = \{\theta^n \in U \text{ for all } n \in \mathbb{N}\} \qquad (35)$$
occurs with probability at least 1 − δ, i.e., $P(E_\infty \mid \theta^1 \in U_1) \ge 1 - \delta$.
2. Conditioned on $E_\infty$, we have
$$\mathbb{E}[\|\theta^n - \theta^*\|^2 \mid E_\infty] \le O(1/n^p). \qquad (36)$$
Remark 3. The convergence rate of stochastic gradient descent is O(1/n p ), where O represents the constant involved,
including the variance of the stochastic gradient. A larger variance of the stochastic gradient leads to slower
convergence, while a smaller variance leads to faster convergence. The selection of the learning rate parameters γ
and m depends on the specific optimization problem itself. For example, in a PINN problem, suitable values of γ and
m ensure that the initial learning rate is around 1e-4, which is similar to the common practice for selecting learning
rates.
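As an illustration of the step-size schedule in Theorem 4.3, the sketch below implements $\gamma_n = \gamma/(n+m)^p$ with a LambdaLR scheduler; the values of γ, m, and p are illustrative, chosen as in Remark 3 so that the initial rate is roughly 1e-4.

```python
# Sketch of the decaying step size gamma_n = gamma / (n + m)^p from Theorem 4.3.
import torch

model = torch.nn.Linear(10, 1)
gamma, m, p = 0.01, 500.0, 0.75                      # p in (1/2, 1]; initial rate ~1e-4
opt = torch.optim.SGD(model.parameters(), lr=1.0)    # base lr of 1 so the lambda sets gamma_n directly
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda n: gamma / (n + m) ** p)

for n in range(1000):
    opt.zero_grad()
    loss = model(torch.randn(32, 10)).pow(2).mean()  # placeholder loss; the SDGD loss in practice
    loss.backward()
    opt.step()
    sched.step()                                     # gamma_n decays as O(1/n^p)
```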
Remark 4. In our convergence theorem, we have not made any unrealistic assumptions; all we require is the
boundedness of the optimization trajectory of the PINN. Empirically, as long as the learning rate and initialization
are properly chosen, PINN always converges, and the values of its weight matrices do not diverge to infinity. In the
literature, many convergence theorems for SGD require various conditions to be satisfied by the loss function of the
optimization problem itself, such as Lipschitz continuity. However, in our case, none of these conditions are required.
5 Experiments
Our main goal is to showcase the following properties and advantages of our method:
1. For Algorithm 1, reducing the residual point batch size and the PDE term batch size yields a comparable
acceleration effect under the same memory budget. In some cases, our SDGD can even outperform the traditional
SGD, demonstrating that SDGD can be an alternative form of SGD applied to PDE terms that is stable and
reliable.
2. The main objective of Algorithm 2 is to scale up and accelerate training for large-scale PINN problems. We
will demonstrate that while Algorithm 2 exhibits poorer convergence on small-scale problems due to increased
gradient variance, its convergence speed and accuracy improve significantly as the scale of the PDE problem
grows, such as in cases with tens of thousands of dimensions. In many of these large-scale instances, Algorithm
2 achieves convergence, whereas Algorithm 1 and full training only reach a few epochs without convergence.
Additionally, full training may lead to out-of-memory (OOM) errors directly.
3. We compare our PINN-based method with other non-PINN mainstream approaches for solving high-dimensional
PDEs [1, 19, 40], especially in terms of accuracy.
All the experiments are done on an NVIDIA A100 GPU with 40GB memory.
5.1 Hamilton-Jacobi-Bellman (HJB) Equation with Linear Solution and Black-Scholes-Barenblatt (BSB) Equation: Scaling-up to 100,000D, and Comparisons with Baselines
This experiment will demonstrate how Algorithms 1 and 2 are superior to the commonly used full-batch gradient
descent on training PINNs, and how our SDGD-PINNs method can surpass strong non-PINNs baselines, on several
nonlinear PDEs.
5.1.1 Experimental Design
We consider the following two nonlinear PDEs:
• Hamilton-Jacobi-Bellman equation with linear solution (HJB-Lin) [46] with the terminal condition given below:
$$\partial_t u(x,t) + \Delta u(x,t) - \frac{1}{d}\sum_{i=1}^{d}|\partial_{x_i}u|^c = -2, \quad x \in \mathbb{R}^d,\ t \in [0, T], \qquad u(x, T) = \sum_{i=1}^{d}x_i, \qquad (37)$$
where c = 1, T = 1 and the dimension d can be arbitrarily chosen. The analytic solution is:
$$u(x, t) = \sum_{i=1}^{d}x_i + T - t. \qquad (38)$$
In this example, we decompose the residual prediction of PINN along each dimension as follows:
$$\partial_t u(x,t) + \Delta u(x,t) - \frac{1}{d}\sum_{i=1}^{d}|\partial_{x_i}u|^c = \frac{1}{d}\sum_{i=1}^{d}\left\{\partial_t u(x,t) + d\,\partial_{x_i}^2 u(x,t) - \frac{1}{d}\sum_{j=1}^{d}|\partial_{x_j}u|^c\right\}, \qquad (39)$$
since the first-order derivative is much cheaper and is not the main bottleneck.
• Black-Scholes-Barenblatt (BSB) equation
$$u_t = -\frac{1}{2}\sigma^2\sum_{i=1}^{d}x_i^2 u_{x_i x_i} + r\left(u - \sum_{i=1}^{d}x_i u_{x_i}\right), \qquad u(x, T) = \|x\|^2, \qquad (40)$$
where T = 1, σ = 0.4, r = 0.05 with the following analytic solution
$$u(x, t) = \exp\left((r + \sigma^2)(T - t)\right)\|x\|^2. \qquad (41)$$
In this example, we decompose the residual prediction of PINN along each dimension as follows:
$$u_t + \frac{1}{2}\sigma^2\sum_{i=1}^{d}x_i^2 u_{x_i x_i} - r\left(u - \sum_{i=1}^{d}x_i u_{x_i}\right) = \frac{1}{d}\sum_{i=1}^{d}\left\{u_t + \frac{1}{2}\sigma^2 d\, x_i^2 u_{x_i x_i} - r\left(u - \sum_{j=1}^{d}x_j u_{x_j}\right)\right\}. \qquad (42)$$
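As a quick sanity check (ours, not part of the original text), the analytic solution (38) indeed satisfies (37) with c = 1: for $u(x,t) = \sum_{i=1}^{d}x_i + T - t$ we have $\partial_t u = -1$, $\Delta u = 0$, and $|\partial_{x_i}u| = 1$ for every i, so
$$\partial_t u + \Delta u - \frac{1}{d}\sum_{i=1}^{d}|\partial_{x_i}u|^c = -1 + 0 - \frac{1}{d}\cdot d = -2,$$
and the terminal condition $u(x, T) = \sum_{i=1}^{d}x_i$ holds by construction.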
Using consistent hyperparameter settings, we explore varying PDE dimensions in the range of $10^2, 10^3, 10^4, 10^5$.
This allows us to assess the acceleration achieved by our algorithm and whether PINNs can overcome the CoD.
Specifically, for the hyperparameters of PINNs, we employ a four-layer PINN with 1024 hidden units, trained using
the Adam optimizer [30] with an initial learning rate of 1e-3, exponentially decaying with a factor of 0.9995. Our
testing datasets consist of 2e4 points, sampled from x ∼ N (0, I) and t ∼ Unif[0, 1]. The residual points are sampled
from the same distribution. The testing criterion is the relative $L^2$ error between the model and the solution given by
$$\frac{\sqrt{\sum_{i=1}^{N_{\text{test}}}|u_{\text{pred}}(X_i, t_i) - u(X_i, t_i)|^2}}{\sqrt{\sum_{i=1}^{N_{\text{test}}}|u(X_i, t_i)|^2}}, \qquad (43)$$
where $u_{\text{pred}}$ is the prediction model and u is the true solution, $N_{\text{test}}$ is the number of test points, and $(X_i, t_i)$ is the i-th test point.
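A direct PyTorch sketch of the relative L² test error (43) is shown below; the prediction function and test data are placeholders that mimic the sampling described above.

```python
# Sketch of the relative L2 test error in Eq. (43).
import torch

def relative_l2_error(u_pred, u_true, X, t):
    # u_pred, u_true: callables mapping (X, t) to a tensor of shape [N_test]
    diff = u_pred(X, t) - u_true(X, t)
    return torch.linalg.norm(diff) / torch.linalg.norm(u_true(X, t))

# Placeholder test set mimicking the paper's sampling: x ~ N(0, I), t ~ Unif[0, 1].
N_test, d = 20000, 100
X, t = torch.randn(N_test, d), torch.rand(N_test)
u_true = lambda X, t: X.sum(dim=1) + 1.0 - t           # HJB-Lin analytic solution with T = 1
u_pred = lambda X, t: X.sum(dim=1) + 0.9 * (1.0 - t)   # stand-in for a trained PINN
print(relative_l2_error(u_pred, u_true, X, t).item())
```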
Due to the large scale of the terminal condition, such as in HJB-Lin, where it follows a Gaussian distribution
with zero mean and variance of d, or in the BSB equation, where it follows a chi-squared distribution with a mean of
d, there is a high probability of sampling points with large values in high-dimensional space. Directly learning the
terminal condition using the boundary loss can result in numerical overflow and optimization difficulties. Therefore,
we modified the PINN structure to satisfy the terminal condition automatically. This allows us to focus solely on
the residual loss.

PDE     | 10^2  | 10^3  | 10^4   | 10^5
HJB-Lin | 34min | 68min | 81min  | 310min
BSB     | 31min | 57min | 118min | 41min

Table 1: This table presents the convergence time required by our SDGD for different PDEs. In the HJB-Lin equation,
as the dimensionality increases from 100 to 100,000, the dimensionality grows by a factor of 1000, while the time only
increases by a factor of ten. This indicates that our method can withstand the curse of dimensionality. In the second
BSB equation, surprisingly, the high-dimensional case converges faster than the low-dimensional case, demonstrating
the so-called blessing of dimensionality and that our method can harness the blessing of dimensionality.
Specifically, for the HJB-Lin and BSB equations with the terminal condition u(x, T) = g(x), we consider the following PINN models:

u_θ^{HJB-Lin}(x, t) = u_θ(x, t)(T − t) + Σ_{i=1}^d x_i,
u_θ^{BSB}(x, t) = ((T − t) u_θ(x, t) + 1) ∥x∥²,      (44)

where u_θ is a vanilla multi-layer perceptron, while u_θ^{HJB-Lin} and u_θ^{BSB} are the prediction models for the HJB-Lin and BSB equations, respectively.
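The hard-constraint outputs of Eq. (44) can be wrapped around any backbone network; a minimal PyTorch sketch under assumed names (backbone, HJBLinAnsatz, BSBAnsatz) is shown below. By construction, u(x, T) matches the terminal condition exactly, so only the residual loss needs to be trained.

```python
import torch
import torch.nn as nn

class HJBLinAnsatz(nn.Module):
    """u(x,t) = backbone(x,t) * (T - t) + sum_i x_i, so u(x,T) is exact by construction."""
    def __init__(self, backbone, T=1.0):
        super().__init__()
        self.backbone, self.T = backbone, T

    def forward(self, x, t):
        return self.backbone(x, t) * (self.T - t) + x.sum(dim=1, keepdim=True)

class BSBAnsatz(nn.Module):
    """u(x,t) = ((T - t) * backbone(x,t) + 1) * ||x||^2, so u(x,T) = ||x||^2 exactly."""
    def __init__(self, backbone, T=1.0):
        super().__init__()
        self.backbone, self.T = backbone, T

    def forward(self, x, t):
        return ((self.T - t) * self.backbone(x, t) + 1.0) * (x ** 2).sum(dim=1, keepdim=True)
```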
For the HJB-Lin equation, the initialization is already sufficiently accurate regarding the L² error of the prediction for u, as the Σ_{i=1}^d x_i scale is much larger than T − t. Therefore, we focus on testing the accuracy of ∂_t u, which is a nontrivial task:

sqrt( Σ_{i=1}^{N_test} |∂_t u_pred(X_i, t_i) − ∂_t u(X_i, t_i)|² ) / sqrt( Σ_{i=1}^{N_test} |∂_t u(X_i, t_i)|² ),      (45)

where u_pred is the prediction model and u is the true solution.
For the baselines [1, 19, 40], it is necessary to specify the initial point of the stochastic process corresponding to the PDE. These methods can only make predictions at this initial point, hence reporting the relative L¹ error at a single point. We will also report the error of PINN at this initial point. We choose the initial points as (X_0, t_0) = (x_1 = 1, x_2 = 1, ..., x_d = 1, t = 0) for the BSB equation and (X_0, t_0) = (x_1 = 0, x_2 = 0, ..., x_d = 0, t = 0) for the HJB-Lin equation, and the corresponding error is

|u_pred(X_0, t_0) − u(X_0, t_0)| / |u(X_0, t_0)|,      (46)

where u_pred is the prediction model and u is the true solution.
5.1.2 Results of Our Method and Full-Batch GD on Training PINN
This subsection presents the results comparing our Algorithms 1 and 2 with full batch gradient descent when training
PINN for the BSB and HJB equations, which are demonstrated in Figures 1 and 2.
Specifically, the captions of Figures 1 and 2 illustrate different algorithms’ speeds and memory costs in the HJB and
BSB equations across various dimensions. Regarding memory cost, full gradient descent encounters OOM problems in
100,000 dimensions even with only one residual point, while SDGD can handle even higher-dimensional problems. In
terms of speed, under comparable memory costs, Algorithm 2 consistently exhibits the highest speed, followed by
Algorithm 1, while full batch GD is the slowest. Notably, in high-dimensional cases, our Algorithm 2 demonstrates
significant acceleration compared to other algorithms. In lower dimensions, the acceleration effect of Algorithm 2 is
less pronounced, and Algorithm 1 suffices.
In the case of the 1000-dimensional BSB equation, Algorithm 2 exhibits a shorter time per epoch compared to Algorithm 1. However, its convergence results are inferior, suggesting that Algorithm 2 gains speed at the cost of larger gradient variance due to the increased random sampling in both forward and backward passes. Nevertheless, as the
dimension of the PDE increases, the advantageous fast convergence of Algorithm 2 becomes more evident. For
instance, on the largest 100,000-dimensional problem, Algorithm 2 achieved convergence with a relative error of less
than 3.477e-5 (HJB) / 4.334e-3 (BSB) in under 4 hours. In contrast, full GD encountered OOM errors, and the speed of
Algorithm 1 was extremely slow, running only a few epochs within the same time frame and still far from convergence.
Meanwhile, in the 100,000-dimensional HJB equation with a relatively small batch size (100 PDE terms and 100 residual points), we observed a significant increase in the relative L² error of Algorithm 2 during training, attributable to the heightened variance of the gradient.
[Figure 1 panels (image omitted): relative L² error of u_t versus wall-clock time for the 100D, 1,000D, 10,000D, and 100,000D HJB-Lin equation. Legend settings — 100D: Algorithm 1 (1000-point-10-term, 2464MB, 0.14s/it), full batch GD (100-point-100-term, 2406MB, 0.22s/it); 1,000D: Algorithm 1 (1000-point-100-term, 9386MB, 1.29s/it), Algorithm 2 (1000-point-100-term, 9386MB, 0.49s/it), full batch GD (100-point-1000-term, 9660MB, 2.15s/it); 10,000D: Algorithm 1 (100-point-100-term, 6920MB, 10.01s/it), Algorithm 2 (100-point-100-term, 6920MB, 0.39s/it), full batch GD (1-point-10000-term, 7272MB, 21.12s/it); 100,000D: Algorithm 1 (100-point-100-term, 36896MB, 391.88s/it), Algorithm 2 (100-point-100-term, 36896MB, 2.15s/it).]
Figure 1: High-dimensional HJB-Lin equation: convergence, memory cost, and speed of different methods under various dimensions with detailed hyperparameter settings. Blue: our Algorithm 1; Red: our Algorithm 2; Green: baseline full batch GD over PDE terms. The x-axis represents the running time, while the y-axis represents the relative L² error of u_t in the HJB equation (a nontrivial test). At initialization, the relative L² error is always nearly 1000%, leading to a nontrivial learning problem for PINNs. In the 100-dimensional case, our Algorithm 1 has a convergence speed similar to the baseline, but the latter converges to a better error. In 100D and 1000D, our method completes the same number of epochs faster, resulting in a shorter total running time. In even higher dimensions, we ensure that the running times of different methods are similar, and our method consistently achieves faster convergence for dimensionality higher than 1000. In extremely high-dimensional scenarios, the baseline full batch gradient descent encounters memory limitations, whereas our Algorithm 2 is capable of rapid convergence. However, in the 100,000-dimensional case, although convergence was achieved, the convergence curve of our Algorithm 2 exhibited some instability, which we attribute to the small batch size and the resulting gradient variance.
[Figure 2 panels (image omitted): relative L² error versus wall-clock time for the 100D, 1,000D, 10,000D, and 100,000D BSB equation. Legend settings — 100D: Algorithm 1 (1000-point-10-term, 2386MB, 0.18s/it), full batch GD (100-point-100-term, 2190MB, 0.32s/it); 1,000D: Algorithm 1 (1000-point-100-term, 11054MB, 1.58s/it), Algorithm 2 (1000-point-100-term, 11054MB, 0.85s/it), full batch GD (100-point-1000-term, 10764MB, 3.35s/it); 10,000D: Algorithm 1 (100-point-1000-term, 30430MB, 20.59s/it), Algorithm 2 (100-point-1000-term, 30430MB, 4.77s/it), full batch GD (10-point-10000-term, 29892MB, 42.51s/it); 100,000D: Algorithm 1 (100-point-100-term, 40054MB, 391.88s/it), Algorithm 2 (100-point-100-term, 40054MB, 1.58s/it).]
Figure 2: High-dimensional BSB equation: convergence, memory cost, and speed of different methods under various
dimensions with detailed hyperparameter settings. Blue: Our Algorithm 1; Red: Our Algorithm 2; Green: Baseline
full batch GD over PDE terms. The x-axis represents the running time, while the y-axis represents the relative L 2
error of the BSB equation. At initialization, the relative L 2 error is always larger than 10%, leading to a nontrivial
learning problem for PINNs. In the 100-dimensional and 1000-dimensional cases, our method demonstrates faster
completion within the same number of epochs, resulting in a shorter total running time. In higher dimensions, we
ensure that the running times of different methods are similar. Our method consistently achieves faster convergence. In
extremely high-dimensional scenarios, the baseline full batch gradient descent encounters memory limitations, whereas
our Algorithm 2 is capable of rapid convergence.
Relative L² error of u (BSB equation):
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         N.A.       N.A.       N.A.       N.A.
DeepBSDE [19]      N.A.       N.A.       N.A.       N.A.
DeepSplitting [1]  N.A.       N.A.       N.A.       N.A.
Ours               6.182E-4   1.284E-3   3.571E-3   4.486E-3

Relative L¹ error on one point (BSB equation):
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         1.378E-5   8.755E-5   4.003E-5   3.559E-5
DeepBSDE [19]      1.106E-5   5.184E-4   3.013E-4   4.519E-4
DeepSplitting [1]  2.048E-3   1.942E-3   1.992E-3   2.109E-3
Ours               6.781E-6   4.915E-5   1.084E-5   1.972E-5

Table 2: Results of various methods on the BSB equation with different dimensions and evaluation metrics.
Relative L² error of u (HJB-Lin equation):
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         N.A.       N.A.       N.A.       N.A.
DeepBSDE [19]      N.A.       N.A.       N.A.       N.A.
DeepSplitting [1]  N.A.       N.A.       N.A.       N.A.
Ours               1.159E-4   3.508E-4   1.216E-5   1.937E-7

Relative L² error of u_t (HJB-Lin equation):
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         N.A.       N.A.       N.A.       N.A.
DeepBSDE [19]      N.A.       N.A.       N.A.       N.A.
DeepSplitting [1]  N.A.       N.A.       N.A.       N.A.
Ours               4.918E-3   3.243E-3   2.655E-3   3.792E-5

Relative L¹ error on one point (HJB-Lin equation):
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         6.327E-5   7.062E-5   4.916E-5   5.218E-5
DeepBSDE [19]      N.A.       N.A.       N.A.       N.A.
DeepSplitting [1]  1.034E-2   Diverge    Diverge    Diverge
Ours               2.711E-5   6.208E-6   2.513E-5   3.614E-5

Table 3: Results of various methods on the HJB-Lin equation with different dimensions and evaluation metrics.
Thus, Algorithm 1 effectively reduces the memory cost of PINNs while keeping the gradient variance small, and it accelerates convergence through gradient randomness. Algorithm 2 further enhances iteration speed and reduces memory requirements, albeit at the expense of increased gradient variance and somewhat compromised convergence results. However, in higher-dimensional cases, the acceleration provided by Algorithm 2 outweighs the impact of gradient variance.
In addition to the stochastic gradient acceleration employed in our method, the symmetry of the PDE plays a
crucial role in achieving convergence. These PDEs exhibit symmetry along each x i axis regardless of dimensionality.
This symmetry reduces gradient variance in SDGD and random sampling during the forward pass in Algorithm 2,
facilitating rapid convergence in our approach.
In conclusion, the experimental results demonstrate that Algorithm 1 and Algorithm 2 exhibit significant advantages
over full batch gradient descent in training PINNs. This is particularly evident in the case of very high dimensions,
where their acceleration and computational efficiency are especially pronounced.
5.1.3 Comparison Between SDGD-PINNs and Strong Baselines
This subsection presents the results comparing SDGD based on PINN and other strong non-PINN baselines [1, 19, 40]
for the BSB and HJB equations.
Tables 2 and 3 showcase the comparative results between our algorithm and the baselines on the BSB and HJB
equations, respectively. Specifically, we present the prediction performance of various models on both the solution field
(u and/or u t ) and individual points, which are introduced in the experimental setup. If predictions are to be made across
the entire domain, only our PINN-based approach can achieve that, while other methods can only make predictions at
a single point. This showcases the advantage of PINN in not requiring a mesh/grid, unlike other methods that rely on
temporal discretization. Even in predicting at a single point, PINN performs exceptionally well and achieves the best
results among various methods. This is because although PINN is trained across the entire domain, its neural network
possesses strong interpolation capabilities, enabling accurate predictions at single points. Other methods perform
poorly on these PDEs because the scale of the entire PDE solution is very large, leading to potential issues such as
numerical overflow or gradient explosion. However, in the case of PINNs, we cleverly employ boundary/terminal condition augmentation to mitigate these issues.
Allen-Cahn equation:
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         1.339E-3   2.184E-3   7.029E-3   5.392E-2
DeepBSDE [19]      4.581E-3   2.516E-3   2.980E-3   2.975E-3
DeepSplitting [1]  3.642E-3   1.563E-3   2.369E-3   2.962E-3
Ours               7.815E-4   3.142E-4   7.042E-4   2.477E-4

Semilinear heat equation:
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         3.872E-3   8.323E-4   7.705E-3   1.426E-2
DeepBSDE [19]      3.024E-3   3.222E-3   3.510E-3   4.279E-3
DeepSplitting [1]  2.823E-3   3.432E-3   3.439E-3   3.525E-3
Ours               1.052E-3   5.263E-4   6.910E-4   1.598E-3

Sine-Gordon equation:
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         2.641E-3   4.283E-3   2.343E-1   5.185E-1
DeepBSDE [19]      3.217E-3   2.752E-3   2.181E-3   2.630E-3
DeepSplitting [1]  3.297E-3   2.674E-3   2.170E-3   2.249E-3
Ours               2.265E-3   2.310E-3   1.674E-3   1.956E-3

Table 4: Results of various methods on the Allen-Cahn, semilinear heat, and sine-Gordon PDEs with different dimensions. The evaluation metric is the relative L¹ error on one test point.
5.2 More Comparisons with Strong Baselines [1, 19, 40]
This subsection presents further results comparing our PINN-based method with other strong non-PINN baselines [1, 19, 40] on several nonlinear PDEs without analytical solutions. Thus, we adopt the other methods' setting and evaluate the model at one test point, where we show that PINNs still outperform the other competitors.
• Allen-Cahn equation:

∂_t u(x, t) = ∆u(x, t) + u(x, t) − u(x, t)³,   (x, t) ∈ R^d × [0, 0.3],      (47)

with the initial condition u(x, 0) = arctan(max_i x_i). We aim to approximate the solution's true value at the test point (x, t) = (0, ..., 0, 0.3).

• Semilinear heat equation:

∂_t u(x, t) = ∆u(x, t) + (1 − u(x, t)²)/(1 + u(x, t)²),   (x, t) ∈ R^d × [0, 0.3],      (48)

with the initial condition u(x, 0) = 5/(10 + 2∥x∥²). We aim to approximate the solution's true value at the test point (x, t) = (0, ..., 0, 0.3).

• Sine-Gordon equation:

∂_t u(x, t) = ∆u(x, t) + sin(u(x, t)),   (x, t) ∈ R^d × [0, 0.3],      (49)

with the initial condition u(x, 0) = 5/(10 + 2∥x∥²). We aim to approximate the solution's true value at the test point (x, t) = (0, ..., 0, 0.3).
The reference values of these PDEs on the test point are computed by the multilevel Picard method [5, 25] with
sufficient theoretical accuracy. The residual points of PINN are chosen along the trajectories of the stochastic processes
that those PDEs correspond to, see the last section for details.
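One plausible way to generate such trajectory-based residual points is sketched below; the √2 scaling of the Brownian motion is our assumption, chosen to be consistent with the Laplacian term, and all names and path/step counts are illustrative.

```python
# Hedged sketch: residual points along simulated trajectories X_s = x0 + sqrt(2) * W_s.
import numpy as np

def residual_points_along_trajectories(x0, T=0.3, n_paths=100, n_steps=30, seed=0):
    rng = np.random.default_rng(seed)
    d = x0.shape[0]
    dt = T / n_steps
    ts = np.linspace(0.0, T, n_steps + 1)                      # (n_steps + 1,)
    # Brownian increments for all paths and steps.
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps, d))
    paths = x0 + np.sqrt(2.0) * np.cumsum(dW, axis=1)          # (n_paths, n_steps, d)
    paths = np.concatenate([np.tile(x0, (n_paths, 1, 1)), paths], axis=1)
    # Flatten (trajectory snapshot, time) pairs into residual points (x, t).
    xs = paths.reshape(-1, d)
    t_col = np.tile(ts, n_paths).reshape(-1, 1)
    return xs, t_col

# Example: residual points for the 100-dimensional Allen-Cahn test point x0 = 0.
xs, ts = residual_points_along_trajectories(np.zeros(100))
```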
The results for these three nonlinear PDEs are shown in Table 4. PINN surpasses other methods in all dimensions
and for all types of PDEs. Whether it is predicting the entire domain or the value at a single point, PINN is capable of
handling both scenarios. This is attributed to the mesh-free nature of PINN and its powerful interpolation capabilities.
5.3 PINN and Adversarial Training (L∞ Loss)
In Wang et al. [46], it was demonstrated that the class of Hamilton-Jacobi-Bellman (HJB) equations [19, 46] could only be effectively solved using adversarial training, which approximates the L∞ loss.
HJB-LQG results for u (each cell reports running time / relative L² error):
Dim       Wang et al. [46]       He et al. [21]         SDGD (Ours)
250       38 hours / 1.182E-2    12 hours / 1.370E-2    210 minutes / 6.147E-3
1,000     >5 days / N.A.         >5 days / N.A.         217 minutes / 5.852E-3
10,000    OOM / N.A.             OOM / N.A.             481 minutes / 4.926E-2
100,000   OOM / N.A.             OOM / N.A.             855 minutes / 4.852E-2

HJB-Lin results for ∂_t u (each cell reports running time / relative L² error):
Dim       Wang et al. [46]       He et al. [21]         SDGD (Ours)
250       38 hours / 1.874E-3    12 hours / 1.769E-3    210 minutes / 4.710E-4
1,000     >5 days / N.A.         >5 days / N.A.         217 minutes / 4.455E-4
10,000    OOM / N.A.             OOM / N.A.             481 minutes / 3.609E-4
100,000   OOM / N.A.             OOM / N.A.             855 minutes / 1.305E-3

Table 5: Relative L² error and running time across different dimensions for the baselines and our method, for two HJB equations requiring adversarial training of PINNs. In all dimensions, our method significantly outperforms the two baselines in terms of accuracy and speed. The baselines, employing full batch gradient descent, experience slower speed and increased memory requirements in higher-dimensional scenarios. In contrast, our approach utilizes Algorithm 2, which employs stochastic gradient descent over PDE terms, resulting in faster computation and relatively lower memory consumption. Additionally, the symmetry of the PDE guarantees the accuracy of our method.
The first one is the HJB equation with linear-quadratic-Gaussian (LQG) control:

∂_t u(x, t) + ∆u(x, t) − ∥∇_x u(x, t)∥² = 0,   x ∈ R^d, t ∈ [0, 1],
u(x, T) = g(x),      (50)

where g(x) is the terminal condition to be chosen, with the analytic solution:

u(x, t) = −log( ∫_{R^d} (2π)^{−d/2} exp(−∥y∥²/2) exp(−g(x − √(2(1 − t)) y)) dy ).      (51)

We choose the cost function g(x) = log((1 + ∥x∥²)/2), following [19, 21, 40, 46], to obtain the HJB-LQG case. It is worth noting that the integral in the solution of HJB-LQG-Log cannot be evaluated analytically, necessitating Monte Carlo integration for approximation. We use the relative L² error of the approximation of u as the evaluation metric for this equation.
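Because Eq. (51) is an expectation over a standard Gaussian, the reference value at a given (x, t) can be estimated by plain Monte Carlo, as in the following sketch (function names and sample counts are ours):

```python
import numpy as np

def hjb_lqg_reference(x, t, g, n_samples=10**6, seed=0):
    """Monte Carlo estimate of Eq. (51): u(x,t) = -log E_y[exp(-g(x - sqrt(2(1-t)) y))], y ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    y = rng.standard_normal((n_samples, d))
    vals = np.exp(-g(x - np.sqrt(2.0 * (1.0 - t)) * y))
    return -np.log(vals.mean())

# Cost function used in the HJB-LQG-Log case.
g_log = lambda z: np.log((1.0 + np.sum(z**2, axis=-1)) / 2.0)
u_ref = hjb_lqg_reference(np.ones(250), t=0.0, g=g_log)
```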
The second HJB equation requiring the L∞ loss and adversarial training is the previous HJB-Lin equation but with a different PDE operator:

∂_t u(x, t) + ∆u(x, t) − (1/d) Σ_{i=1}^d |∂_{x_i} u|^c = −2,   x ∈ R^d, t ∈ [0, T],
u(x, T) = Σ_{i=1}^d x_i,      (52)

where c = 1.75, T = 1, and the dimension d can be chosen arbitrarily. The analytic solution is u(x, t) = Σ_{i=1}^d x_i + T − t. Specifically, Wang et al. [46] demonstrated that as c increases within the range 1–2, the effectiveness of the L² PINN training loss diminishes, and PINN relies more on the L∞ norm loss to achieve a smaller testing error. For this equation, we also evaluate the more challenging L² error of ∂_t u.
In this example of the HJB equation with LQG control, we decompose the residual prediction of PINN along each dimension as follows:

∂_t u(x, t) + ∆u(x, t) − µ∥∇_x u(x, t)∥² = (1/d) Σ_{i=1}^d [ ∂_t u(x, t) + d ∂²u/∂x_i² − µ∥∇_x u(x, t)∥² ].      (53)

For the HJB equation with linear solution, we decompose the residual prediction of PINN along each dimension as in the previous example.
Despite its ability to maintain a low maximal memory cost during training, adversarial training approximating the L∞ loss is widely recognized for its slow and inefficient nature [43], which poses challenges in applying it to high-dimensional HJB equations with LQG control.
[Figure 3 panels (image omitted): relative L² error versus wall-clock time for the Fokker-Planck equation in 100D, 1,000D, and 10,000D, isotropic (first row) and anisotropic (second row). Legend settings — 100D: Algorithm 1 (1000-point-10-term, 2310MB, 0.16s/it; 400-point-25-term, 2310MB, 0.17s/it; 200-point-50-term, 2310MB, 0.23s/it), full batch GD (100-point-100-term, 2384MB, 0.25s/it); 1,000D: Algorithm 1 (100-point-100-term, 2520MB, 1.27s/it), Algorithm 2 (100-point-100-term, 2520MB, 0.41s/it), full batch GD (10-point-1000-term, 2546MB, 2.31s/it); 10,000D: Algorithm 1 (100-point-100-term, 4684MB, 12.76s/it), Algorithm 2 (100-point-100-term, 4684MB, 0.42s/it), full batch GD (1-point-10000-term, 5368MB, 23.37s/it).]
Figure 3: High-dimensional FP equation corresponding to the Brownian motion: convergence, memory cost, and speed
of different methods under various dimensions with detailed hyperparameter settings. Blue: Our Algorithm 1; Red:
Our Algorithm 2; Green: Baseline full batch GD over PDE terms. The x-axis represents the running time, while the
y-axis represents the relative L 2 error of the FP equation corresponding to the Brownian motion. First row: isotropic
problem. Second row: anisotropic problem.
Adversarial training involves optimizing two loss functions: one
for the PINN parameters, minimizing the loss through gradient descent, and another for the residual point coordinates,
maximizing the PINN loss to approximate the L ∞ loss. This adversarial minimax process, known as adversarial
training, is computationally demanding, often requiring multiple rounds of optimization before the resulting residual
points can effectively optimize the PINN parameters.
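The following hedged sketch illustrates one such minimax update, using a PGD-style inner maximization over the residual-point coordinates followed by a descent step on the PINN parameters; residual_loss, the sign-gradient ascent, and the step sizes are our assumptions rather than the exact procedure of [46]. In our setting, residual_loss itself would be built with the SDGD estimator of Algorithm 2.

```python
import torch

def adversarial_step(residual_loss, model, optimizer, x, t, inner_steps=5, inner_lr=0.01):
    # Inner loop: move residual points toward high-loss regions (approximate L-infinity loss).
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(inner_steps):
        loss_pts = residual_loss(model, x_adv, t)          # assumed to return a scalar
        grad_x = torch.autograd.grad(loss_pts, x_adv)[0]
        x_adv = (x_adv + inner_lr * grad_x.sign()).detach().requires_grad_(True)

    # Outer loop: one gradient-descent step on the PINN parameters at the adversarial points.
    optimizer.zero_grad()
    loss = residual_loss(model, x_adv, t)
    loss.backward()
    optimizer.step()
    return loss.item()
```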
The current state-of-the-art works by Wang et al. [46] and He et al. [21] have successfully scaled adversarial training to 250 dimensions. In this study, we integrate our proposed approach with adversarial training to enhance its scalability and efficiency for high-dimensional HJB equations that require adversarial training of PINNs.
We use a four-layer PINN with 1024 hidden units, trained by the Adam optimizer [30] with an initial learning rate of 1e-3 that decays linearly to zero; 2e4 test points are employed. We modified the PINN structure as in equation (44) to make the terminal condition automatically satisfied:

u_θ^{HJB-LQG}(x, t) = u_θ(x, t)(1 − t) + g(x),      (54)

where g(x) is the terminal condition, which is g(x) = log((1 + ∥x∥²)/2) in the HJB-LQG-Log case and g(x) = Σ_{i=1}^d x_i in the HJB-LQG-Lin case. Thus, we only need to focus on the residual loss. We conduct training for a total of 10,000 epochs. In all dimensions, we employ Algorithm 2 with 100 batches of PDE terms in the loss function for gradient descent, and Algorithm 2 with 10 batches of PDE terms in the loss function for adversarial training. For both our method and the baseline methods proposed by Wang et al. [46], we randomly sample 100 residual points per epoch.
The results of adversarial training on the two HJB equations are presented in Table 5. For instance, in the 250D
HJB-LQG case, achieving a relative L 2 error of 1e-2 using [46] requires a training time of 1 day. In contrast, our
approach achieves a superior L 2 error in just 3.5 hours. In the 1000D case, full batch GD’s adversarial training is
exceedingly slow, demanding over 5 days of training time. However, our method effectively mitigates this issue and
attains a relative L 2 error of 5.672E-3 in 217 minutes. Moreover, our approach enables adversarial training even in
higher-dimensional HJB equations, specifically 10,000 and 100,000 dimensions, with resulting relative L 2 errors of
5.026E-2 and 3.838E-2, respectively. For the HJB-Lin case, our method is also consistently better than others. The
success of our method hinges on the accelerated random gradient and the equation’s symmetry.
5.4 Fokker-Planck PDE: Anisotropic Problem
So far, the PDEs we have tested are isotropic and symmetric across dimensions. This choice is motivated by commonly used benchmark PDEs in real-world applications that exhibit these properties [19, 46]. However, it is worth exploring whether our method can converge on asymmetric and anisotropic PDEs. Therefore, we design an anisotropic Fokker-Planck (FP) equation corresponding to the Brownian motion as follows:
∂_t p = −Σ_{i=1}^d µ_i ∂_{x_i} p + (1/2) Σ_{i=1}^d ∂²_{x_i} p,   t ∈ [0, 1], x ∈ B^d,
u(x, 0) = ∥x∥²,
u(x, t) given,   t ∈ [0, 1], x ∈ S^{d−1},      (55)

where B^d and S^{d−1} are the d-dimensional unit ball and sphere, respectively. Its solution is

u(x, t) = Σ_{i=1}^d (x_i − µ_i t)² + dt.      (56)

Here each µ_i is randomly sampled from N(1, 1) for the anisotropic case, and all µ_i = 1 for the isotropic case. Specifically, we will compare the convergence results of our method in the two cases to observe the robustness of our approach to anisotropic problems. This anisotropic case is not only asymmetric in its differential operators but also features an anisotropic solution across different axes.
We use the following network architecture to satisfy the initial condition automatically:

u_θ^{Brownian}(x, t) = u_θ(x, t) t + ∥x∥².      (57)

In this example of the FP equation corresponding to the Brownian motion, we decompose the residual prediction of PINN along each dimension as follows:

∂_t p + Σ_{i=1}^d µ_i ∂_{x_i} p − (1/2) Σ_{i=1}^d ∂²_{x_i} p = (1/d) Σ_{i=1}^d [ ∂_t p + Σ_{j=1}^d µ_j ∂_{x_j} p − (d/2) ∂²_{x_i} p ].      (58)
We employ a four-layer PINN with 1024 hidden units, trained using the Adam optimizer [30] with an initial learning rate of 1e-3, exponentially decaying with a factor of 0.9995. Our testing datasets consist of 2e4 points. For the boundary condition, we sample random boundary points for all equations and use the regular boundary loss. The points are sampled uniformly from the bounded domain.
The results for the FP equation corresponding to the Brownian motion are presented in Figure 3, where the first row
presents the convergence curve under different dimensions in the isotropic problem, while the second row showcases
those for the anisotropic problem.
Regarding the speed of our algorithms compared to full batch gradient descent, in the case of 100 dimensions
(100D), we vary the PDE term batch size and the number of collocation points for Algorithm 1 to match the memory
cost of the full batch gradient descent (GD) baseline, aiming to observe the stability of our algorithm with different
batch sizes. Regardless of the chosen PDE term batch size, our Algorithm 1 consistently converges faster than the
baseline, demonstrating the stability and fast convergence of the stochastic gradient descent (SGD) over the PDE terms
approach. For higher dimensions, we include the results of Algorithm 2, which consistently outperforms Algorithm 1,
while Algorithm 1, in turn, outperforms the baseline. This highlights the acceleration capability of Algorithm 2 in
high-dimensional scenarios.
Regarding our algorithm’s robustness in the anisotropic problem, by comparing the convergence curves of our
method in isotropic and anisotropic cases at the same dimensionality, we observe that our approach maintains fast and
stable convergence even in anisotropic cases. Furthermore, the final convergence results do not deteriorate significantly
in the anisotropic case. Concretely, the relative L 2 errors of the isotropic problem under 100D, 1000D, and 10000D
are (4.708E-3, 2.223E-3, 3.800E-3), respectively, while those for the anisotropic problem are (4.906E-3, 2.623E-3,
3.715E-3). Therefore, our method is applicable to both isotropic and anisotropic cases.
Note that this FP equation is solved on a bounded domain, specifically, the unit ball. Generally, problems on
bounded domains are somewhat easier because PDEs on unbounded domains lack information at infinity.
5.5 Schrödinger Equation
This subsection scales up the Tensor Neural Networks (TNNs) [47, 48] for solving very high-dimensional Schrödinger
equations. Due to its unique characteristics, the Schrödinger equation cannot be addressed by the traditional PINN
framework and requires a specialized TNN network structure and corresponding loss function for high-dimensional
eigenvalue problems. However, our method can still be flexibly integrated into this framework for scaling up
Schrödinger equations, demonstrating the versatility of our approach. In contrast, other methods specifically designed
for high-dimensional equations exhibit much greater limitations.
5.5.1 Tensor Neural Network (TNN) for Solving Schrödinger Equation
In general, the Schrödinger equation can be viewed as a high-dimensional eigenvalue problem:

−∆u(x) + v(x)u(x) = λu(x),   x ∈ Ω,
u(x) = 0,   x ∈ ∂Ω,      (59)

where Ω = Ω_1 × Ω_2 × ··· × Ω_d ⊂ R^d and each Ω_i = (a_i, b_i), i = 1, ..., d, is a bounded interval in R, v(x) is a known potential function, u(x) is the unknown solution, and λ is the unknown eigenvalue of the high-dimensional eigenvalue problem. This problem can be addressed via the variational principle:

λ = min_{Φ(x)} [ ∫_Ω |∇Φ(x)|² dx + ∫_Ω v(x)Φ(x)² dx ] / [ ∫_Ω Φ(x)² dx ].      (60)
The integrals are usually computed via an expensive quadrature rule, e.g., Gaussian quadrature:

∫_Ω Φ(x) dx ≈ Σ_{n∈N} w^{(n)} Φ(x^{(n)}),      (61)

where N is the index set of the quadrature points. The size of N grows exponentially as the dimension increases, incurring the curse of dimensionality. To overcome this issue, the Tensor Neural Network (TNN) adopts a separable function network structure to reduce the computational cost associated with integration:

Φ(x; θ) = Σ_{j=1}^p Π_{i=1}^d φ_j(x_i; θ_i),      (62)

where x_i is the ith dimension of the input point x, φ_j(x_i; θ_i) is the jth output of the sub-wavefunction network parameterized by θ_i, and p is the predefined rank of the TNN model.
The integral of the TNN by the quadrature rule can be calculated efficiently:

∫_Ω Φ(x; θ) dx = Σ_{j=1}^p ∫_Ω Π_{i=1}^d φ_j(x_i; θ_i) dx = Σ_{j=1}^p Π_{i=1}^d ∫_{Ω_i} φ_j(x_i; θ_i) dx_i ≈ Σ_{j=1}^p Π_{i=1}^d Σ_{n_i∈N_i} w^{(n_i)} φ_j(x^{(n_i)}; θ_i),      (63)

where N_i is the index set of the ith dimensional numerical integral. Thus, the numerical integrals can be computed along each axis separately, reducing the exponential cost to a linear one. We refer the readers to [47, 48] for more details.
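A hedged sketch of the separable quadrature in Eq. (63) is given below; it assumes per-dimension 1-D quadrature nodes and weights and a list phi of sub-networks, each mapping x_i to its p rank outputs (all names are illustrative).

```python
import torch

def tnn_integral(phi, nodes, weights):
    """Approximate the integral of a rank-p TNN Phi = sum_j prod_i phi_j(x_i) over Omega.
    phi[i]:   module mapping (n_i, 1) -> (n_i, p) for dimension i
    nodes[i]: 1-D quadrature nodes of shape (n_i, 1); weights[i]: shape (n_i,)
    """
    d = len(phi)
    # per_dim[i][j] = sum_{n_i} w^{(n_i)} * phi_j(x^{(n_i)}); shape (p,)
    per_dim = [(weights[i][:, None] * phi[i](nodes[i])).sum(dim=0) for i in range(d)]
    per_dim = torch.stack(per_dim, dim=0)        # (d, p)
    return per_dim.prod(dim=0).sum()             # sum over rank j of the product over dimensions i
```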
With the cheap quadrature achieved by TNN's separable structure, the loss function to be minimized is

min_θ ℓ(θ) = [ Σ_{n∈N} w^{(n)} |∇Φ(x^{(n)}; θ)|² + Σ_{n∈N} w^{(n)} v(x^{(n)}) Φ(x^{(n)}; θ)² ] / [ Σ_{n∈N} w^{(n)} Φ(x^{(n)}; θ)² ],      (64)

where Φ(x; θ) is the neural network wave function with trainable parameters θ.
The primary computational bottleneck of TNNs for this problem lies in the first-order derivatives in the loss function, while the other terms contain only zero-order terms with lower computational and memory costs:

∂Φ(x; θ)/∂x_i = Σ_{j=1}^p ( ∂φ_j(x_i; θ_i)/∂x_i ) Π_{k=1, k≠i}^d φ_j(x_k; θ_k).      (65)

Since TNN is a separable architecture, the d-dimensional Schrödinger equation requires computing the first-order derivatives and performing quadrature integration for each of the d subnetworks. Consequently, the computational cost scales linearly with the dimension. In comparison to the previous examples of PINNs solving PDEs, TNN incurs larger memory consumption for first-order derivatives because the network parameters are not shared across dimensions, i.e., d independent first-order derivatives of d distinct subnetworks are required, resulting in increased computational requirements.
Mathematically, the computational and memory bottleneck is the gradient with respect to the first-order term of the neural network wave function Φ(x; θ):

grad(θ) = ∂/∂θ Σ_{n∈N} w^{(n)} |∇Φ_θ(x^{(n)})|² = Σ_{n∈N} w^{(n)} ∂/∂θ Σ_{i=1}^d ( ∂Φ_θ(x^{(n)})/∂x_i )².      (66)
Dimension       10^2        10^3        10^4        2·10^4
Full Batch GD   3.882E-8    3.889E-8    3.811E-8    OOM
SDGD            3.772E-8    3.439E-8    4.003E-8    5.139E-8

Table 6: Laplace eigenvalue problem: relative L¹ error of the predicted eigenvalues for full batch gradient descent (GD) and our SDGD. The results of our algorithm are similar to those of full batch GD. In extremely high dimensions, e.g., 2·10^4 dimensions, full batch GD encounters an out-of-memory (OOM) error, while our algorithm reduces the memory cost and still ensures a good convergence result.
By employing the proposed SDGD approach, we can simplify the entire gradient computation:

grad_I(θ) = (d/|I|) Σ_{n∈N} w^{(n)} ∂/∂θ Σ_{i∈I} ( ∂Φ_θ(x^{(n)})/∂x_i )²,      (67)

where I ⊂ {1, 2, ..., d} is a random index set. Therefore, we can sample the entire gradient at the PDE term level to reduce memory and computational costs, and the resulting stochastic gradient is unbiased, i.e., E_I[grad_I(θ)] = grad(θ), ensuring the convergence of our method. Gradient accumulation can also be performed for large-scale problems with our method; see Section 3.3 for details.
It is important to note that our method is more general than the traditional approach of SGD based solely on quadrature points. In particular, in extremely high-dimensional cases, even computing the gradient for a single quadrature point may lead to an out-of-memory (OOM) error. This is because the traditional approach requires a minimum batch size of d PDE terms and 1 quadrature point, where d is the dimensionality of the PDE. However, our method further reduces the computational load per batch to only 1 PDE term and 1 quadrature point.
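A sketch of the dimension-sampled term in Eq. (67) follows (PyTorch; it assumes a callable dPhi_dxi(i) returning ∂Φ_θ/∂x_i at all quadrature points, and the batch size is illustrative). Calling backward() on this estimate backpropagates only through the sampled dimensions, which is what reduces the memory cost.

```python
import torch

def sdgd_first_order_term(dPhi_dxi, weights, d, dim_batch=100):
    """Unbiased estimate of sum_n w_n * |grad Phi(x_n)|^2 using a random subset of dimensions."""
    idx = torch.randperm(d)[:dim_batch].tolist()
    term = 0.0
    for i in idx:
        term = term + (weights * dPhi_dxi(i) ** 2).sum()
    return term * d / dim_batch   # rescale so the expectation equals the full sum over i = 1..d
```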
5.5.2 Experimental Setup
We consider two classical examples of the Schrödinger equation: the Laplace eigenvalue problem and the quantum harmonic oscillator (QHO) problem.
Laplace eigenvalue problem. The potential function in the Schrödinger equation (59) is the zero function v(x) = 0. The domain is chosen as Ω = Π_{i=1}^d Ω_i = [0, 1]^d, where each Ω_i = [0, 1]; the exact smallest eigenvalue and eigenfunction are λ = dπ² and u(x) = Π_{i=1}^d sin(πx_i). Following [47], the quadrature scheme for TNN divides each Ω_i into ten identical subintervals, and sixteen Gauss quadrature points are selected for each subinterval. To train the TNN, the Adam optimizer [30] is used with a learning rate of 1e-3. The TNN has rank p = 1, depth 2, and width 50 for the sub-wavefunctions in each dimension. For this problem, we evaluate the performance of SDGD.
Quantum harmonic oscillator (QHO). The potential function in the Schrödinger equation (59) is v(x) = ∥x∥². The original problem is defined on an infinite domain; therefore, we truncate the domain to [−5, 5]^d. The exact smallest eigenvalue and eigenfunction are λ = d and u(x) = exp(−∥x∥²/2). We use the same model structure and hyperparameters as in the Laplace eigenvalue problem. For this problem, we test the effectiveness of the gradient accumulation method based on our further decomposition of PINN's gradient.
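Gradient accumulation over dimension chunks can be sketched as follows (PyTorch; loss_for_dims is an assumed helper that builds the loss contribution of the listed dimensions, as in Eq. (67), so that summing the accumulated gradients reproduces the full-batch gradient without holding the full computational graph in memory):

```python
import torch

def accumulate_full_gradient(model, optimizer, loss_for_dims, d, chunk=2500):
    optimizer.zero_grad()
    for start in range(0, d, chunk):
        dims = list(range(start, min(start + chunk, d)))
        # Each backward() adds this chunk's contribution into .grad without
        # keeping the graphs of all d dimensions alive at once.
        loss_for_dims(model, dims).backward()
    optimizer.step()
```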
The test metric we report is the relative L¹ error of the minimum eigenvalue, given by |λ_true − λ_approx| / |λ_true|. Due to the diminishing nature of the eigenfunctions in high dimensions, we only report the accuracy of the eigenvalues. It is important to note that the eigenvalues carry physical significance as they represent the ground-state energy.
5.5.3 Experimental Results
On the Laplace eigenvalue problem, we demonstrate the stable convergence of SDGD in Figure 4 and its ability to scale to larger problem sizes in Table 6. Specifically, the convergence results of our SDGD are as satisfactory as those of full batch GD, and our algorithm can scale the problem to high dimensions (2·10^4 dimensions) where full batch GD fails directly due to insufficient memory.
On the QHO problem, our gradient accumulation method achieves training in 2·10^4 dimensions within the limited GPU memory, as shown in Figure 5. Note that gradient accumulation is theoretically equivalent to full batch GD, so we do not compare their results; however, the former saves GPU memory, thus scaling up PDEs to higher dimensions. These results demonstrate the scalability of our method for PINN and PDE problems and highlight its generality, making it applicable to various physics-informed machine-learning problems.
During the early stages of optimization, we observe quick convergence of TNNs. Since the model parameters are initialized randomly, there is much room for improvement, and the gradients of the loss function with respect to the parameters can be relatively large. Larger gradients result in more significant updates to the parameters, leading to faster convergence.
[Figure 4 panels (image omitted): relative L¹ error of the eigenvalue versus epoch for the Laplace problem with SGD over PDE terms; panel settings: Dim=100 Batch=50, Dim=1000 Batch=500, Dim=10000 Batch=5000, Dim=20000 Batch=2500.]
Figure 4: Laplace eigenvalue problem: convergence curve of our algorithm (SDGD) under different dimensions.
[Figure 5 panels (image omitted): relative L¹ error of the eigenvalue versus epoch for the QHO problem with gradient accumulation; panel settings: Dim=100 Batch=50, Dim=1000 Batch=500, Dim=10000 Batch=5000, Dim=20000 Batch=2500.]
Figure 5: Quantum harmonic oscillator (QHO) problem: convergence curve of our algorithm (gradient accumulation)
under different dimensions.
[Figure 6 panels (image omitted): relative L² error versus epoch for the wave equation with high/low effective dimensions; panel settings — 1000D: Algorithm 1 with 100-point-1-term and 100-point-10-term, each for eff = 1 and eff = 10^3; 10000D: Algorithm 1 with 100-point-10-term and 100-point-100-term, each for eff = 1 and eff = 10^4.]
Figure 6: Results for the wave equation. We present the convergence results with different batch sizes in the 1,000- and 10,000-dimensional scenarios. The red lines correspond to cases with a high effective dimension, while the blue lines represent cases with a low effective dimension of only one. We have chosen very small dimension batch sizes, i.e., the PDE terms' batch size: in the 1000-dimensional case, we use a minimum of one dimension, optimizing one dimension at a time; in the 10000-dimensional case, we use only ten dimensions per iteration for gradient descent.
Despite the satisfactory convergence results, these problems demonstrate one shortcoming of our proposed
approach. Our method is primarily designed to address the computational bottleneck arising from high-dimensional
differentiations in PDEs. However, if the problem itself is large-scale and does not stem from differential equations,
our method cannot be directly applied. Specifically, in the case of the Schrödinger equation, due to the separable
nature of its network structure, we employ a separate neural network for each dimension. This results in significant
memory consumption, even without involving differentials. Additionally, in this problem, we cannot sample the points
of integration / residual points because there is an integral term in the denominator. If we sample the integral in
the denominator, the resulting stochastic gradient will be biased, leading to poor convergence. Therefore, for such
problems, designing an improved unbiased stochastic gradient remains an open question.
5.6 On the Effective Dimension of the PDE Solution
In practical applications, many high-dimensional equation solutions may exhibit some form of low-dimensional structure, meaning they have a low effective dimension. In this section, we aim to demonstrate that our method can efficiently capture the low-dimensional solutions of high-dimensional equations. If the equation possesses only a few effective dimensions, the solutions in numerous other dimensions are essentially zero. This may lead to a significant variance in gradients, but we show that the gradients between different dimensions are actually transferable. Thus, optimizing one dimension effectively assists the learning in other dimensions, ensuring that even when our algorithm selects a batch size of 1, rapid convergence is still achievable.
In particular, we consider the high-dimensional wave equation within the unit ball:

∂_tt u = ∆u,   t ∈ [0, 1], x ∈ B^d,
u(x, 0) = Σ_{i=1}^{eff} sinh(x_i),
u_t(x, 0) = 0,
u(x, t) given,   t ∈ [0, 1], x ∈ S^{d−1}.      (68)

Here eff is the effective dimension of the solution. We choose eff = 1 and eff = d as the low-dimensional and high-dimensional solutions, respectively. The boundary condition is obtained via the ground truth solution of the PDE:

u(x, t) = cosh(t) Σ_{i=1}^{eff} sinh(x_i).      (69)
Actually, in the low-dimensional solution case where eff = 1, the solution is anisotropic, which poses additional variance to the gradients of our algorithms. We use the following network architecture to satisfy the initial conditions automatically:

u_θ^{Wave}(x, t) = u_θ(x, t) t² + Σ_{i=1}^{eff} sinh(x_i).      (70)
In this example of the wave equation, we decompose the residual prediction of PINN along each dimension as follows:

∂_tt u − ∆u = (1/d) Σ_{i=1}^d { u_tt − d u_{x_i x_i} }.      (71)
Other experimental details including the choices of points and optimizers are the same as those in the Fokker-Planck
equation.
The results for the wave equation are presented in Figure 6, where we show the convergence results with different batch sizes in the 1,000- and 10,000-dimensional scenarios. The red lines correspond to cases with a high effective dimension, while the blue lines represent cases with a low effective dimension of only one. Firstly, if the equation has a low effective dimension (blue lines), our method surprisingly converges faster. This indicates that our approach can quickly capture the low-dimensional solutions within a high-dimensional equation.
Secondly, it’s worth noting that we have chosen very small-dimension batch sizes, specifically the PDE terms’
batch size. In the 1000-dimensional case, we use a minimum of one dimension, optimizing one dimension at a
time. In the 10000-dimensional case, we use only ten dimensions per iteration for gradient descent. However, our
method still achieves rapid convergence, suggesting the presence of transferability in the gradients. Optimizing one
dimension effectively triggers a transfer learning effect on other dimensions, implying that the gradients between
different dimensions are not entirely independent or orthogonal, but exhibit some form of transfer learning.
The transferability can also be validated by theory. Given the neural network structure

u_θ(x) = W_L σ(W_{L−1} σ(··· σ(W_1 x) ···)),      (72)

where x ∈ R^d is the input and W_l ∈ R^{m_l×m_{l−1}} is the weight matrix at the lth layer with m_0 = d and m_L = 1, the first-order and second-order derivatives are given by

∂u_θ(x)/∂x = W_L Φ_{L−1}(x) W_{L−1} ··· Φ_1(x) W_1 ∈ R^d,      (73)

∂²u_θ(x)/(∂x_i ∂x_j) = Σ_{l=1}^{L−1} (W_L Φ_{L−1}(x) ··· W_{l+1}) diag(Ψ_l(x) W_l ··· Φ_1(x) (W_1)_{:,j}) (W_l ··· Φ_1(x) (W_1)_{:,i}),      (74)

where

Φ_l(x) = Diag[σ′(W_l σ(W_{l−1} σ(··· σ(W_1 x) ···)))] ∈ R^{m_l×m_l},      (75)
Ψ_l(x) = Diag[σ″(W_l σ(W_{l−1} σ(··· σ(W_1 x) ···)))] ∈ R^{m_l×m_l}.      (76)
We observe that in the wave equation, the forms of the second derivatives in different dimensions are mostly shared. Therefore, optimizing the gradient of one dimension also affects other dimensions, which is akin to transfer learning. Moreover, this form depends on the particular dimensions only through the first-layer parameters (notice (W_1)_{:,i} and (W_1)_{:,j}), i.e., it differs solely in the ith and jth columns of the first-layer weight matrix; the parameters from the second layer onwards introduce no dimension-specific dependence.
6 Conclusions
The CoD has been an open problem for many decades but recent progress [1, 19, 40] has been made by several
research groups using deep learning methods for specific classes of PDEs. Herein, we proposed a general method
based on the mesh-free PINNs method by introducing a new type of stochastic dimension gradient descent or SDGD
as a generalization of the largely successful SGD method over subsets of the training set. In particular, we claim the
following advantages of the SDGD-PINNs method over other related methods:
Generality to Arbitrary PDEs. Primarily, certain approaches [1, 19, 40] are restricted to specific forms of
parabolic PDEs. They leverage the connection between parabolic PDEs and stochastic processes, employing Monte
Carlo simulations of stochastic processes to train their models. Consequently, their methods are limited to a subset of
PDEs. In contrast, our approach is based on PINNs, theoretically capable of handling arbitrary PDEs. Notably, we can
address challenging cases such as the wave equation, biharmonic equation, Schrödinger equation, and other similar
examples, while [1, 19, 40] cannot.
Prediction on the Entire Domain. Furthermore, these methods [1, 19, 40] can only predict the values of PDE
solutions at a single test point during each training instance. This limitation arises from the necessity of setting the
starting point of the stochastic process (i.e., the test point for the PDE solution) before conducting Monte Carlo
simulations. Consequently, the computational cost of their methods [1, 19, 40] increases greatly as the number of test
points grows. In contrast, our approach, based on PINNs, allows for predictions across the entire domain in a single
training instance, achieving a “once for all” capability.
Dependency on the Mesh. Moreover, precisely because these methods [1, 19, 40] require Monte Carlo simulations
of stochastic processes, they necessitate temporal discretization. This introduces additional parameters and requires
a sufficiently high discretization precision to obtain accurate solutions. For PDE problems that span long durations,
these methods undoubtedly suffer from the burden of temporal discretization. However, our PINN-based approach can
handle long-time PDEs effortlessly. Since PINNs are mesh-free methods, the training points can be randomly sampled.
The interpolation capability of PINNs ensures their generalization performance on long-time PDE problems.
PINNs can also be adapted for prediction at one point. Although PINNs offer advantages such as being
mesh-free and capable of predicting the entire domain, they still perform remarkably well in the setting where other
methods focus on single-point prediction. Specifically, if our interest lies in the value of the PDE at a single point, we
can exploit the relationship between the PDE and stochastic processes. This single point corresponds to the initial
point of the stochastic process, and the PDE corresponds to a specific stochastic process, such as the heat equation
corresponding to simple Brownian motion. In this case, we only need to use the snapshots obtained from the trajectory
of this stochastic process as the residual points for PINN.
We have proved the necessary theorems showing that SDGD leads to an unbiased estimator and that it converges under mild assumptions, similar to vanilla SGD over collocation points and mini-batches. SDGD can also be used in physics-informed neural operators such as DeepONet and can be extended to general regression and possibly classification problems in very high dimensions.
In summary, the new SDGD-PINN framework is a paradigm shift in the way we solve high-dimensional PDEs as it
can tackle arbitrary nonlinear PDEs of any dimension providing high accuracy at extremely fast speeds, e.g., the BSB
solution in 100,000 dimensions can be obtained in a few minutes even on a single GPU.
A Proof
A.1 Proof of Theorem 4.2
Lemma A.1. Suppose that we have N fixed numbers a_1, a_2, ..., a_N and we choose k random numbers {X_i}_{i=1}^k from them, i.e., each X_i is a random variable with X_i = a_n with probability 1/N for every n ∈ {1, 2, ..., N}, and the X_i are independent. Then the variance of the unbiased estimator (1/k) Σ_{i=1}^k X_i for ā = (1/N) Σ_{n=1}^N a_n is (1/(kN)) Σ_{n=1}^N (a_n − ā)².

Proof. Since the samples X_i are i.i.d.,

V[ (1/k) Σ_{i=1}^k X_i ] = (1/k) V[X_1] = (1/(kN)) Σ_{n=1}^N (a_n − ā)².      (77)
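As a quick numerical sanity check of Lemma A.1 (ours, not part of the proof), the empirical variance of the estimator can be compared with the closed-form expression:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=50)                    # N = 50 fixed numbers
k, trials = 5, 200_000

# Empirical variance of the estimator mean(X_1, ..., X_k) with X_i drawn uniformly from a.
samples = a[rng.integers(0, a.size, size=(trials, k))].mean(axis=1)
empirical = samples.var()

# Closed form from Lemma A.1: (1/(kN)) * sum_n (a_n - a_bar)^2.
closed_form = ((a - a.mean()) ** 2).sum() / (k * a.size)
print(empirical, closed_form)              # the two values should be close
```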
Assume that the total full batch consists of N_r residual points {x_i}_{i=1}^{N_r} and N_L PDE terms with L = Σ_{i=1}^{N_L} L_i. The residual loss of the PINN is

(1/(2 N_r N_L²)) Σ_{i=1}^{N_r} (L u_θ(x_i) − R(x_i))²,      (78)

where we normalize over both the number of residual points and the number of PDE terms. The full batch gradient is:
g(θ) = (1/(N_r N_L²)) Σ_{i=1}^{N_r} (L u_θ(x_i) − R(x_i)) ∂/∂θ L u_θ(x_i)
     = (1/(N_r N_L²)) Σ_{i=1}^{N_r} (L u_θ(x_i) − R(x_i)) [ Σ_{j=1}^{N_L} ∂/∂θ L_j u_θ(x_i) ]
     = (1/(N_r N_L²)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_L} (L u_θ(x_i) − R(x_i)) ∂/∂θ L_j u_θ(x_i)
     = (1/(N_r N_L²)) Σ_{i=1}^{N_r} [ ( Σ_{j=1}^{N_L} L_j u_θ(x_i) ) − R(x_i) ] [ Σ_{j=1}^{N_L} ∂/∂θ L_j u_θ(x_i) ].      (79)
From the viewpoint of Algorithm 1, the full batch gradient is the mean of N_r N_L terms:

g(θ) = (1/(N_r N_L²)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_L} (L u_θ(x_i) − R(x_i)) ∂/∂θ L_j u_θ(x_i) := (1/(N_r N_L²)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_L} g_{i,j}(θ),      (80)

where we denote

g_{i,j}(θ) = (L u_θ(x_i) − R(x_i)) ∂/∂θ L_j u_θ(x_i).      (81)
The stochastic gradient generated by Algorithm 1 is given by

g_{B,J}(θ) = (1/(|B||J| N_L)) Σ_{i∈B} (L u_θ(x_i) − R(x_i)) [ Σ_{j∈J} ∂/∂θ L_j u_θ(x_i) ] = (1/(|B||J| N_L)) Σ_{i∈B} Σ_{j∈J} g_{i,j}(θ).      (82)
We have the variance

V_{B,J}[g_{B,J}(θ)] = E_{B,J}[(g_{B,J}(θ) − g(θ))²] = E_{B,J}[g_{B,J}(θ)²] − g(θ)².      (83)

From a high-level perspective, V_{B,J}[g_{B,J}(θ)] should be a function of |B| and |J|. Thus, under the constraint that |B|·|J| is the same for different SGD schemes, we can choose B and J properly to minimize the variance of SGD and thereby accelerate convergence.

E_{B,J}[g_{B,J}(θ)²] = (1/(|B|²|J|² N_L²)) E_{B,J}[ ( Σ_{i∈B} Σ_{j∈J} (L u_θ(x_i) − R(x_i)) ∂/∂θ L_j u_θ(x_i) )² ]
= (1/(|B|²|J|² N_L²)) E_{B,J}[ Σ_{i,i′∈B} Σ_{j,j′∈J} (L u_θ(x_i) − R(x_i)) ∂/∂θ L_j u_θ(x_i) · (L u_θ(x_{i′}) − R(x_{i′})) ∂/∂θ L_{j′} u_θ(x_{i′}) ].      (84)
The entire expectation can be decomposed into four parts: (1) i = i′, j = j′; (2) i ≠ i′, j = j′; (3) i = i′, j ≠ j′; (4) i ≠ i′, j ≠ j′. For the first part,

(1/(|B|²|J|² N_L²)) E_{B,J}[ Σ_{i∈B} Σ_{j∈J} (L u_θ(x_i) − R(x_i))² ( ∂/∂θ L_j u_θ(x_i) )² ]
= (1/(|B||J| N_L²)) E_{i,j}[ (L u_θ(x_i) − R(x_i))² ( ∂/∂θ L_j u_θ(x_i) )² ]
= (1/(|B||J| N_L²)) E_{i,j}[ g_{i,j}(θ)² ]
= (1/(|B||J| N_r N_L³)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_L} g_{i,j}(θ)²
= C_1 / (|B||J|),      (85)

where we denote C_1 = (1/(N_r N_L³)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_L} g_{i,j}(θ)², which is independent of B and J.
In the second case, since the ith sample and the i′th sample are independent for i ≠ i′, we have

(1/(|B|²|J|² N_L²)) E_{B,J}[ Σ_{i≠i′∈B} Σ_{j∈J} (L u_θ(x_i) − R(x_i)) (L u_θ(x_{i′}) − R(x_{i′})) ∂/∂θ L_j u_θ(x_i) ∂/∂θ L_j u_θ(x_{i′}) ]
= ((|B| − 1)/(|B||J| N_L²)) E_{i,i′,j}[ (L u_θ(x_i) − R(x_i)) (L u_θ(x_{i′}) − R(x_{i′})) ∂/∂θ L_j u_θ(x_i) ∂/∂θ L_j u_θ(x_{i′}) ]
= C_2 · (|B| − 1)/(|B||J|).      (86)
In the third case, by the same reasoning,

(1/(|B|²|J|² N_L²)) E_{B,J}[ Σ_{i∈B} Σ_{j≠j′∈J} (L u_θ(x_i) − R(x_i))² ∂/∂θ L_j u_θ(x_i) ∂/∂θ L_{j′} u_θ(x_i) ]
= ((|J| − 1)/(|B||J| N_L²)) E_{i,j,j′}[ (L u_θ(x_i) − R(x_i))² ∂/∂θ L_j u_θ(x_i) ∂/∂θ L_{j′} u_θ(x_i) ]
= C_3 · (|J| − 1)/(|B||J|).      (87)
In the last case,

(1/(|B|²|J|² N_L²)) E_{B,J}[ Σ_{i≠i′∈B} Σ_{j≠j′∈J} (L u_θ(x_i) − R(x_i)) (L u_θ(x_{i′}) − R(x_{i′})) ∂/∂θ L_j u_θ(x_i) ∂/∂θ L_{j′} u_θ(x_{i′}) ]
= ((|B| − 1)(|J| − 1)/(|B||J| N_L²)) E_{i,i′,j,j′}[ (L u_θ(x_i) − R(x_i)) (L u_θ(x_{i′}) − R(x_{i′})) ∂/∂θ L_j u_θ(x_i) ∂/∂θ L_{j′} u_θ(x_{i′}) ]
= ((|B| − 1)(|J| − 1)/(|B||J| N_L²)) ( E_{i,j}[ (L u_θ(x_i) − R(x_i)) ∂/∂θ L_j u_θ(x_i) ] )²
= ((|B| − 1)(|J| − 1)/(|B||J|)) g(θ)².      (88)
Taking everything together,

V_{B,J}[g_{B,J}(θ)] = E_{B,J}[g_{B,J}(θ)²] − g(θ)²
= [ C_1 + C_2(|B| − 1) + C_3(|J| − 1) + (1 − |B| − |J|) g(θ)² ] / (|B||J|)
= ( C_4 |J| + C_5 |B| + C_6 ) / (|B||J|).      (89)

A.2 Proof of Lemma 4.1
Here, we bound the gradient produced by physics-informed loss functions, which further provides an upper bound for the gradient variance during PINN training using SDGD.
For the stochastic gradient produced by Algorithm 1,

V_{B,J}(g_{B,J}(θ)) ≤ (1/(|B||J| N_r N_L²)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_L} ∥g_{i,j}(θ) − g(θ)∥² ≤ (2/(|B||J| N_L)) max_{i,j} ∥g_{i,j}(θ)∥²,      (90)

where

g_{i,j}(θ) := (L u_θ(x_i) − R(x_i)) ∂/∂θ L_j u_θ(x_i).      (91)
For the stochastic gradient produced by Algorithm 2,

g_{B,J,K}(θ) = (1/(|B||J| N_L)) Σ_{i∈B} ( (N_L/|K|) Σ_{k∈K} L_k u_θ(x_i) − R(x_i) ) [ Σ_{j∈J} ∂/∂θ L_j u_θ(x_i) ]
            = (1/(|B||J||K|)) Σ_{i∈B} Σ_{j∈J} Σ_{k∈K} ( L_k u_θ(x_i) − R(x_i)/N_L ) ∂/∂θ L_j u_θ(x_i),      (92)

V_{B,J,K}(g_{B,J,K}(θ)) ≤ (1/(|B||J||K| N_r N_L²)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_L} Σ_{k=1}^{N_L} (g_{i,j,k}(θ) − g(θ))² ≤ (2/(|B||J||K|)) max_{i,j,k} ∥g_{i,j,k}(θ)∥²,      (93)

where

g_{i,j,k}(θ) := ( L_k u_θ(x_i) − R(x_i)/N_L ) ∂/∂θ L_j u_θ(x_i).      (94)
We focus on bounding the following two quantities:

max_{i,j} |L_j u_θ(x_i)| ≤ n!(L − 1)^n ∥W_L∥ Π_{l=1}^{L−1} ∥W_l∥^n,      (95)

max_{i,j} ∥ ∂/∂θ L_j u_θ(x_i) ∥ ≤ (n + 1)!(L − 1)^{n+1} ∥W_L∥ Π_{l=1}^{L−1} ∥W_l∥^{n+1} max_{x∈Ω} ∥x∥.      (96)
Recall the DNN structure:

u_θ(x) = W_L σ(W_{L−1} σ(··· σ(W_1 x) ···)).

The neural network u_θ(x) has only one part concerning x, and after taking the derivative with respect to x, there is only one term:

∂u_θ(x)/∂x = W_L Φ_{L−1}(x) W_{L−1} ··· Φ_1(x) W_1 ∈ R^d,      (97)

where Φ_l(x) = diag[σ′(W_l σ(W_{l−1} σ(··· σ(W_1 x) ···)))] ∈ R^{m_l×m_l}, and the first-order derivative has an upper bound of Π_{l=1}^L ∥W_l∥.
After taking the second derivative, due to the nested structure of the neural network, there will be L − 1 parts concerning x, induced by the derivative of each Φ_l(x) with respect to x, resulting in L − 1 terms:

∂²u_θ(x)/∂x² = { Σ_{l=1}^{L−1} (W_L Φ_{L−1}(x) ··· W_{l+1}) diag(Ψ_l(x) W_l ··· Φ_1(x) (W_1)_{:,j}) (W_l ··· Φ_1(x) W_1) }_{1≤j≤d},      (98)

where Φ_l(x) = diag[σ′(W_l σ(W_{l−1} σ(··· σ(W_1 x) ···)))] ∈ R^{m_l×m_l} and Ψ_l(x) = diag[σ″(W_l σ(W_{l−1} σ(··· σ(W_1 x) ···)))] ∈ R^{m_l×m_l}. Each of these L − 1 terms has an upper bound of ∥W_L∥ Π_{l=1}^{L−1} ∥W_l∥², and possesses at most 2(L − 1) sub-terms depending on x. Therefore, there are 2(L − 1)² sub-terms within the second-order derivative depending on x.
Using the principle of induction, after taking the nth derivative, there will be n!(L − 1)^n parts, and each term has an upper bound of ∥W_L∥ Π_{l=1}^{L−1} ∥W_l∥^n. If we further take the derivative with respect to W_l, then there will be at most (n + 1)!(L − 1)^{n+1} parts concerning W_l, and each term will be upper bounded by ∥W_L∥ Π_{l=1}^{L−1} ∥W_l∥^{n+1} max_{x∈Ω} ∥x∥.
Consequently,

∥g_{i,j}(θ)∥ ≤ |L u_θ(x_i) − R(x_i)| ∥ ∂/∂θ L_j u_θ(x_i) ∥ ≤ ( max_{i,j} |L_j u_θ(x_i)| + R_max ) max_{i,j} ∥ ∂/∂θ L_j u_θ(x_i) ∥,      (99)

∥g_{i,j,k}(θ)∥ ≤ | L_k u_θ(x_i) − R(x_i)/N_L | ∥ ∂/∂θ L_j u_θ(x_i) ∥ ≤ ( max_{i,j} |L_j u_θ(x_i)| + R_max ) max_{i,j} ∥ ∂/∂θ L_j u_θ(x_i) ∥.      (100)
B Parallel Computing
To speed up the training, we explored two avenues for distributed training: first, the data-parallel approach [16], and second, the 1D tensor-parallel approach [38].
The programming model for the data-parallel approach is based on the single instruction and multiple data (SIMD),
resulting in a weak scaling-driven performance model. In the data-parallel approach, the data is uniformly split into
a number of chunks, equal to the number of GPUs. The neural network (NN) model is initialized with the same
parameters on all the processes. These neural networks are working on different chunks of the data, and therefore,
work on different loss functions. To ensure a consistent neural network model with the same weights and biases across
all the processes and during each epoch or iteration of training, a distributed optimizer is used, which averages out the
gradient of loss values stored at each processor through an “all reduce” operation. Subsequently, the averaged gradient
of the loss function is sent to all the processors in the communicator world through a broadcast operation. Thus, the
parameters of the model are updated simultaneously with the same gradients.
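As a minimal sketch of this pattern (an illustration, not the code used for the experiments in this paper), the loop below uses PyTorch's DistributedDataParallel, assuming a single-node launch via `torchrun --nproc_per_node=2 train.py`; the network, dimensions, and loss are placeholder assumptions.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Each rank holds an identical copy of the network and its own data chunk.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    device = torch.device(f"cuda:{local_rank}")

    d = 1000  # illustrative dimension; the paper's HJB-Lin runs use d = 100,000
    model = torch.nn.Sequential(
        torch.nn.Linear(d, 1024), torch.nn.Tanh(),
        torch.nn.Linear(1024, 1024), torch.nn.Tanh(),
        torch.nn.Linear(1024, 1),
    ).to(device)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)

    for step in range(1000):
        # Each rank samples its own chunk of collocation points (different data,
        # hence a different local loss).
        x = torch.rand(100, d, device=device)
        # Placeholder objective standing in for the SDGD/PINN residual loss.
        loss = (ddp_model(x) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()   # gradients are averaged across ranks here (all-reduce)
        optimizer.step()  # every rank applies the same averaged gradient

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Note that in DDP the reduce-and-broadcast steps described above are fused into a single all-reduce that runs during loss.backward(), so every rank finishes each step with identical parameters.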
To show the performance of the algorithm presented in this paper, we chose the HJB-Lin problem in 100,000
dimensions. We perform all the computational experiments in this section on A100 GPUs with 80 GB memory.
Since the performance of the data-parallel approach is driven by weak scaling, we first need to identify suitable hyperparameters, such as the depth and width of the neural network, the number of collocation points, and the batch size. Table 7 details the performance of the data-parallel approach as these hyperparameters are varied, specifically the dimension batch size and the number of collocation points used per iteration. To verify the accuracy of the parallel code, we solved the HJB-Lin equation for 5,000 iterations; the convergence plot is shown in Figure 7 and uses the hyperparameters DIM=100 and N_f=100, where throughout this parallel-computing section we denote the batch size for dimensions in Algorithm 1 of SDGD by DIM and the number of collocation points used in each iteration by N_f.
For all the cases shown in Table 7, the GPU is not completely saturated, and therefore scaling is affected by memory latency. In Table 8, we show an optimal set of hyperparameters required to saturate the GPUs. With these parameters, we ran the proposed algorithm until full convergence for the HJB-Lin equation in 100,000 dimensions; the wall times are shown in Table 9. We observe very good scaling, but there is still scope for further optimization. Further optimization of the code and scaling across nodes will be considered in future work.
[Figure 7 plot: "Parallel Computing on 1E5-Dimensional HJB-Lin Equation"; curves: "Algorithm 2: 100-point-100-term on 1 GPU" and "Algorithm 2: 100-point-1201-term on multiple GPUs"; x-axis: Epoch (0 to 5000); y-axis: log scale.]
Figure 7: Convergence plot obtained with the data-parallel algorithm for the HJB-Lin equation in 100,000 dimensions. This experiment uses DIM=101 and N_f=100, where DIM denotes the batch size for dimensions in Algorithms 1 and/or 2 of SDGD and N_f the number of collocation points used in each iteration.
Scaling for the data-parallel approach for the HJB-Lin equation in 100,000 dimensions

Hyperparameters        1 GPU (Mem and Compute)   2 GPUs (Mem and Compute)   Time per iteration
DIM=100,  N_f=100      34 GB and 50%             34 GB and 25%              1 GPU = 10 s, 2 GPUs = 2.5 s
DIM=1200, N_f=100      67 GB and 25%             37 GB and 30%              1 GPU = 75 s, 2 GPUs = 25 s
DIM=2000, N_f=100      OOM error                 37 GB and 50%              1 GPU = N/A, 2 GPUs = 38 s

Table 7: Scaling for the data-parallel approach with 3 hidden layers and 1024 neurons in each layer. Experiments are performed on A100 GPUs with 80 GB memory. Here DIM denotes the batch size for dimensions in Algorithms 1 and/or 2 of SDGD and N_f the number of collocation points used in each iteration.
Scaling for the data-parallel approach for the 100,000D HJB-Lin equation with a well-balanced load

Hyperparameters        1 GPU (Mem and Compute)   2 GPUs (Mem and Compute)   Time per iteration
DIM=2000, N_f=1000     OOM error                 80 GB and 100%             1 GPU = N/A, 2 GPUs = 100 s

Table 8: Scaling for the data-parallel approach with optimal hyperparameters, using 9 hidden layers and 1024 neurons in each layer.
Performance: serial vs. data-parallel code for the 100,000D HJB-Lin equation

Hyperparameters        1 GPU      2 GPUs
DIM=100,  N_f=100      6.17 hrs   -
DIM=2000, N_f=1000     -          3.14 hrs

Table 9: Scaling results for the serial and parallel implementations.
Figure 8: Schematic of the tensor-parallel approach. Here, w is a weight matrix that is split along the column axis and deployed to different GPUs.
Next, we present the results of the tensor-parallel algorithm [38]. In the tensor-parallel algorithm, we split the weight matrices of the neural network along the column dimension, as shown in Figure 8, and the forward and backward passes are then performed in a pipelined fashion. Although the model in the present study is not very large, we used 8 hidden layers with 2048 neurons in order to saturate the GPUs and parallelized across 2 GPUs. The tensor-parallel approach did not scale as well, because the pipelining degrades performance. A comparison of the tensor-parallel and data-parallel approaches is given in Table 10, and a minimal illustration of the column-wise weight split is sketched below.
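The sketch below illustrates the column-wise split of Figure 8 for a single linear layer on two devices; it is a simplified illustration under assumed layer sizes and device names, not the Megatron-LM implementation of [38], and it omits the pipelined forward and backward passes.

```python
import torch

class ColumnParallelLinear(torch.nn.Module):
    """Split one weight matrix along its output (column) dimension, with each
    shard living on a different GPU; outputs are gathered back on the first device."""
    def __init__(self, in_features, out_features, devices):
        super().__init__()
        assert out_features % len(devices) == 0
        shard = out_features // len(devices)
        self.devices = devices
        self.shards = torch.nn.ParameterList(
            torch.nn.Parameter(torch.randn(shard, in_features, device=dev) * 0.01)
            for dev in devices
        )

    def forward(self, x):
        # Each GPU computes its slice of the output independently.
        outs = [x.to(dev) @ W.T for dev, W in zip(self.devices, self.shards)]
        return torch.cat([o.to(self.devices[0]) for o in outs], dim=-1)

# Usage (assuming 2 GPUs are available):
# layer = ColumnParallelLinear(2048, 2048, devices=["cuda:0", "cuda:1"])
# y = layer(torch.randn(100, 2048, device="cuda:0"))
```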
Performance: data-parallel vs. tensor-parallel code for the 100,000D HJB equation on 2 GPUs

Hyperparameters        Data parallel       Tensor parallel
DIM=2000, N_f=1000     100 s/iteration     105 s/iteration

Table 10: Scaling results for the data-parallel and tensor-parallel implementations.
References
[1] Christian Beck, Sebastian Becker, Patrick Cheridito, Arnulf Jentzen, and Ariel Neufeld. Deep splitting method
for parabolic pdes. SIAM Journal on Scientific Computing, 43(5):A3135–A3154, 2021.
[2] Christian Beck, Weinan E, and Arnulf Jentzen. Machine learning approximation algorithms for high-dimensional
fully nonlinear partial differential equations and second-order backward stochastic differential equations. Journal
of Nonlinear Science, 29:1563–1619, 2019.
[3] Christian Beck, Lukas Gonon, and Arnulf Jentzen. Overcoming the curse of dimensionality in the nu-
merical approximation of high-dimensional semilinear elliptic partial differential equations. arXiv preprint
arXiv:2003.00596, 2020.
[4] Christian Beck, Fabian Hornung, Martin Hutzenthaler, Arnulf Jentzen, and Thomas Kruse. Overcoming the
curse of dimensionality in the numerical approximation of allen–cahn partial differential equations via truncated
full-history recursive multilevel picard approximations. Journal of Numerical Mathematics, 28(4):197–222,
2020.
[5] Sebastian Becker, Ramon Braunwarth, Martin Hutzenthaler, Arnulf Jentzen, and Philippe von Wurstemberger.
Numerical simulations for full history recursive multilevel picard approximations for systems of high-dimensional
partial differential equations. arXiv preprint arXiv:2005.10206, 2020.
[6] Sebastian Becker, Patrick Cheridito, Arnulf Jentzen, and Timo Welti. Solving high-dimensional optimal stopping
problems using deep learning. European Journal of Applied Mathematics, 32(3):470–514, 2021.
[7] Richard Bellman. Dynamic programming. Princeton University Press, 1957.
[8] Shengze Cai, Zhiping Mao, Zhicheng Wang, Minglang Yin, and George Em Karniadakis. Physics-informed
neural networks (pinns) for fluid mechanics: A review. Acta Mechanica Sinica, 37(12):1727–1738, 2021.
[9] Quentin Chan-Wai-Nam, Joseph Mikael, and Xavier Warin. Machine learning for semi linear pdes. Journal of
scientific computing, 79(3):1667–1712, 2019.
[10] Francisco Chinesta, Pierre Ladeveze, and Elias Cueto. A short review on model order reduction based on proper
generalized decomposition. Archives of Computational Methods in Engineering, 18(4):395, 2011.
[11] Junwoo Cho, Seungtae Nam, Hyunmo Yang, Seok-Bae Yun, Youngjoon Hong, and Eunbyung Park. Sep-
arable pinn: Mitigating the curse of dimensionality in physics-informed neural networks. arXiv preprint
arXiv:2211.08761, 2022.
[12] Jérôme Darbon and Stanley Osher. Algorithms for overcoming the curse of dimensionality for certain hamilton–
jacobi equations arising in control theory and elsewhere. Research in the Mathematical Sciences, 3(1):19,
2016.
[13] Benjamin Fehrman, Benjamin Gess, and Arnulf Jentzen. Convergence rates for the stochastic gradient descent
method for non-convex objective functions. The Journal of Machine Learning Research, 21(1):5354–5401, 2020.
[14] Somdatta Goswami, Ameya D Jagtap, Hessam Babaee, Bryan T Susi, and George Em Karniadakis. Learning
stiff chemical kinetics using extended deep neural operators. arXiv preprint arXiv:2302.12645, 2023.
[15] Somdatta Goswami, Minglang Yin, Yue Yu, and George Em Karniadakis. A physics-informed variational
deeponet for predicting crack path in quasi-brittle materials. Computer Methods in Applied Mechanics and
Engineering, 391:114587, 2022.
[16] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch,
Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint
arXiv:1706.02677, 2017.
[17] Ehsan Haghighat, Maziar Raissi, Adrian Moure, Hector Gomez, and Ruben Juanes. A physics-informed deep
learning framework for inversion and surrogate modeling in solid mechanics. Computer Methods in Applied
Mechanics and Engineering, 379:113741, 2021.
[18] PC Hammer. Adaptive control processes: A guided tour (R. Bellman), 1962.
[19] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep
learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
[20] Jiequn Han, Arnulf Jentzen, et al. Deep learning-based numerical methods for high-dimensional parabolic partial
differential equations and backward stochastic differential equations. Communications in mathematics and
statistics, 5(4):349–380, 2017.
[21] Di He, Shanda Li, Wenlei Shi, Xiaotian Gao, Jia Zhang, Jiang Bian, Liwei Wang, and Tie-Yan Liu. Learning
physics-informed neural networks without stacked back-propagation. In International Conference on Artificial
Intelligence and Statistics, pages 3034–3047. PMLR, 2023.
[22] Pierre Henry-Labordere. Deep primal-dual algorithm for bsdes: Applications of machine learning to cva and im.
Available at SSRN 3071506, 2017.
[23] Zheyuan Hu, Ameya D. Jagtap, George Em Karniadakis, and Kenji Kawaguchi. When do extended physics-
informed neural networks (xpinns) improve generalization? SIAM Journal on Scientific Computing, 44(5):A3158–
A3182, 2022.
[24] Côme Huré, Huyên Pham, and Xavier Warin. Deep backward schemes for high-dimensional nonlinear pdes.
Mathematics of Computation, 89(324):1547–1579, 2020.
[25] Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, Tuan Anh Nguyen, and Philippe von Wurstemberger.
Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential
equations. Proceedings of the Royal Society A, 476(2244):20190630, 2020.
[26] Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, et al. Multilevel picard iterations for solving smooth
semilinear parabolic heat equations. Partial Differential Equations and Applications, 2(6):1–31, 2021.
[27] Ameya D Jagtap, Dimitrios Mitsotakis, and George Em Karniadakis. Deep learning of inverse water waves
problems using multi-fidelity data: Application to serre–green–naghdi equations. Ocean Engineering, 248:110775,
2022.
[28] Shaolin Ji, Shige Peng, Ying Peng, and Xichuan Zhang. Three algorithms for solving high-dimensional fully
coupled fbsdes through deep learning. IEEE Intelligent Systems, 35(3):71–84, 2020.
[29] George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-
informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021.
[30] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[31] Yunwen Lei, Ting Hu, Guiying Li, and Ke Tang. Stochastic gradient descent for nonconvex learning without
bounded gradient assumptions. IEEE transactions on neural networks and learning systems, 31(10):4394–4400,
2019.
[32] Jianfeng Lu, Yulong Lu, and Min Wang. A priori generalization analysis of the deep ritz method for solving high
dimensional elliptic equations. arXiv preprint arXiv:2101.01708, 2021.
[33] Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear
operators via deeponet based on the universal approximation theorem of operators. Nature machine intelligence,
3(3):218–229, 2021.
[34] Tao Luo and H. Yang. Two-layer neural networks for partial differential equations: Optimization and generaliza-
tion theory. ArXiv, abs/2006.15733, 2020.
[35] Kaspar Märtens and Christopher Yau. Neural decomposition: Functional anova with variational autoencoders. In
International Conference on Artificial Intelligence and Statistics, pages 2917–2927. PMLR, 2020.
[36] Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, and Volkan Cevher. On the almost sure convergence of
stochastic gradient descent in non-convex problems. Advances in Neural Information Processing Systems,
33:1117–1128, 2020.
[37] Siddhartha Mishra and Roberto Molinaro. Estimates on the generalization error of physics informed neural
networks (pinns) for approximating pdes. arXiv preprint arXiv:2006.16144, 2020.
[38] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti,
Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language
model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.
[39] Paris Perdikaris, Daniele Venturi, and George Em Karniadakis. Multifidelity information fusion algorithms for
high-dimensional systems and massive data sets. SIAM Journal on Scientific Computing, 38(4):B521–B538,
2016.
[40] Maziar Raissi. Forward-backward stochastic neural networks: Deep learning of high-dimensional partial
differential equations. arXiv preprint arXiv:1804.07010, 2018.
[41] Maziar Raissi and George Em Karniadakis. Hidden physics models: Machine learning of nonlinear partial
differential equations. Journal of Computational Physics, 357:125–141, 2018.
[42] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning
framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of
Computational Physics, 378:686–707, 2019.
[43] Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S
Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! Advances in Neural Information
Processing Systems, 32, 2019.
[44] Yeonjong Shin, Jerome Darbon, and George Em Karniadakis. On the convergence of physics informed neural
networks for linear second-order elliptic and parabolic type pdes. arXiv preprint arXiv:2004.01806, 2020.
[45] Khemraj Shukla, Vivek Oommen, Ahmad Peyvan, Michael Penwarden, Luis Bravo, Anindya Ghoshal, Robert M
Kirby, and George Em Karniadakis. Deep neural operators can serve as accurate surrogates for shape optimization:
a case study for airfoils. arXiv preprint arXiv:2302.00807, 2023.
[46] Chuwei Wang, Shanda Li, Di He, and Liwei Wang. Is $L^2$ physics informed loss always suitable for training
physics informed neural network? In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho,
editors, Advances in Neural Information Processing Systems, 2022.
[47] Yifan Wang, Pengzhan Jin, and Hehu Xie. Tensor neural network and its numerical integration. arXiv preprint
arXiv:2207.02754, 2022.
[48] Yifan Wang, Yangfei Liao, and Hehu Xie. Solving Schrödinger equation using tensor neural network. arXiv
preprint arXiv:2209.12572, 2022.
[49] Yibo Yang and Paris Perdikaris. Adversarial uncertainty quantification in physics-informed neural networks.
Journal of Computational Physics, 394:136–152, 2019.
[50] Yaohua Zang, Gang Bao, Xiaojing Ye, and Haomin Zhou. Weak adversarial networks for high-dimensional
partial differential equations. Journal of Computational Physics, 411:109409, 2020.
[51] Dongkun Zhang, Ling Guo, and George Em Karniadakis. Learning in modal space: Solving time-dependent
stochastic pdes using physics-informed neural networks. SIAM Journal on Scientific Computing, 42(2):A639–
A665, 2020.
[52] Zhongqiang Zhang, Minseok Choi, and George Em Karniadakis. Error estimates for the anova method with
polynomial chaos interpolation: Tensor product functions. SIAM Journal on Scientific Computing, 34(2):A1165–
A1186, 2012.