
Summary: Scaling Physics-Informed Neural Networks for High-Dimensional PDEs (arxiv.org)


One Line

This text discusses Stochastic Dimension Gradient Descent (SDGD), a method for scaling Physics-Informed Neural Networks (PINNs) to high-dimensional PDEs by randomly sampling dimension indices and computing gradients only over the sampled terms.

Slides

Slide Presentation (8 slides)


Scaling Physics-Informed Neural Networks for High-Dimensional PDEs

Source: arxiv.org - PDF - 21,648 words

The Curse of Dimensionality


• The curse of dimensionality poses challenges in solving high-dimensional partial differential equations (PDEs) due to exponentially increasing computational costs.

Stochastic Dimension Gradient Descent (SDGD)


• Stochastic Dimension Gradient Descent (SDGD) is proposed as a new method for solving high-dimensional PDEs.

• SDGD utilizes sampling for both forward and backward passes, enabling Physics-Informed Neural Networks (PINNs) to be trained on nontrivial nonlinear PDEs in 100,000 dimensions in just 6 hours.

Low Memory Cost and Unbiased Gradient


• The algorithm for scaling PINNs for high-dimensional PDEs has low memory cost because the backward pass only backpropagates over terms with i ∈ I.

• The gradient used is an unbiased estimate.

Rapid Convergence in High-Dimensional Cases


• Algorithm 2 shows rapid convergence even in extremely high-dimensional cases.

• Instability may occur in Algorithm 2 for 100,000 dimensions due to small batch size and resulting gradient variance.

Closing Slide


• Physics-Informed Neural Networks (PINNs) offer a promising solution for scaling and speeding up the solution of high-dimensional partial differential equations (PDEs).

• SDGD enables training on nontrivial nonlinear PDEs in 100,000 dimensions in just 6 hours.

• Remember the challenges posed by the curse of dimensionality and the importance of efficient algorithms for high-dimensional PDEs.



Scaling Physics-Informed Neural Networks for High-Dimensional PDEs


• PINNs and SDGD offer a solution to the curse of dimensionality in solving high-dimensional PDEs.

• SDGD enables rapid convergence even in extremely high-dimensional cases.

• Efficient algorithms are crucial for tackling the challenges posed by high-dimensional PDEs.

   

Key Points

  • The curse of dimensionality poses challenges in solving high-dimensional partial differential equations (PDEs) due to exponentially increasing computational costs.
  • Stochastic Dimension Gradient Descent (SDGD) is proposed as a new method for solving high-dimensional PDEs.
  • Physics-Informed Neural Networks (PINNs) can be trained on nontrivial nonlinear PDEs in 100,000 dimensions in just 6 hours using a speed-up method utilizing sampling for both forward and backward passes.
  • The algorithm for scaling PINNs for high-dimensional PDEs has low memory cost and uses an unbiased gradient.
  • Algorithm 2 shows rapid convergence even in extremely high-dimensional cases, but instability may occur due to small batch size and resulting gradient variance.

Summaries

28 word summary

This text excerpt discusses the scaling of Physics-Informed Neural Networks (PINNs) for high-dimensional PDEs. The algorithm involves randomly selecting indices and computing gradients with respect to those indices.

37 word summary

The text excerpt discusses the scaling of Physics-Informed Neural Networks (PINNs) for high-dimensional Partial Differential Equations (PDEs). The algorithm for training the PINN involves randomly selecting indices and computing the gradient with respect to those indices. Two extensions provide further speed-up and scale-up.

810 word summary

The curse of dimensionality (CoD) poses challenges in solving high-dimensional partial differential equations (PDEs) due to exponentially increasing computational costs. In this paper, a new method called Stochastic Dimension Gradient Descent (SDGD) is proposed, which decomposes the PDE gradient into dimension-wise pieces and samples a random subset of them at each training iteration.

With a speed-up method utilizing sampling for both forward and backward passes, Physics-Informed Neural Networks (PINNs) can be trained on nontrivial nonlinear PDEs in 100,000 dimensions in just 6 hours of training time. This is achieved on a single GPU, while traditional full-batch training exceeds the GPU memory limit.

Wang et al. propose tensor neural networks for solving high-dimensional Schrödinger equations. Zhang et al. propose using PINNs for solving stochastic differential equations. Zang et al. propose a weak adversarial network for solving PDEs.

Our algorithm for scaling Physics-Informed Neural Networks (PINNs) for high-dimensional PDEs has several advantages. It has low memory cost because the backward pass only backpropagates over terms with i ∈ I. The gradient used is an unbiased estimate of the true full-batch gradient.

The document discusses scaling physics-informed neural networks (PINNs) for high-dimensional partial differential equations (PDEs). The algorithm for training the PINN involves randomly selecting indices and computing the gradient with respect to those indices. Two extensions for further speed-up, gradient accumulation and parallel computing, are also introduced.

The text excerpt discusses a training algorithm for scaling up and speeding up physics-informed neural networks (PINNs) for high-dimensional partial differential equations (PDEs). The algorithm involves sampling both the forward and backward passes. The tradeoff between speed and gradient variance is discussed.

The text excerpt discusses the scaling of Physics-Informed Neural Networks (PINNs) for high-dimensional Partial Differential Equations (PDEs). It introduces Theorem 4.2, which states that the variance of the stochastic gradient decreases as the batch sizes increase.

The text excerpt discusses the scaling of Physics-Informed Neural Networks (PINNs) for high-dimensional Partial Differential Equations (PDEs). It presents a lemma and a remark regarding the boundedness of higher-order derivatives of the network. The SGD trajectory based on these stochastic gradients is shown to converge.

We explore the acceleration achieved by our algorithm and whether Physics-Informed Neural Networks (PINNs) can overcome the curse of dimensionality. We use consistent hyperparameter settings and vary PDE dimensions from 10^2 to 10^5.

Our method achieves faster convergence for high-dimensional scenarios, with Algorithm 2 showing rapid convergence even in extremely high-dimensional cases. However, instability was observed in Algorithm 2 for 100,000 dimensions due to small batch size and resulting gradient variance.

Algorithm 1 and Algorithm 2 are effective in reducing memory cost and accelerating convergence in training Physics-Informed Neural Networks (PINNs). The acceleration provided by Algorithm 2 outweighs the impact of gradient variance in higher dimensional cases.

The document discusses the use of Physics-Informed Neural Networks (PINNs) to approximate the solutions of high-dimensional partial differential equations (PDEs). Two specific PDEs, the Fisher-KPP equation and the Sine-Gordon equation, are considered.

The excerpt discusses the use of physics-informed neural networks (PINNs) for solving high-dimensional partial differential equations (PDEs) with linear-quadratic-Gaussian (LQG) control. It mentions the specific form of the PDEs considered.

The method presented in the document is explored to determine whether it can converge on asymmetric and anisotropic partial differential equations (PDEs). An anisotropic Fokker-Planck equation is designed as an example. The convergence results of the method are presented.

The primary computational bottleneck for TNNs in solving high-dimensional PDEs lies in the first-order derivatives in the loss function. The computational cost scales linearly with the dimension, and TNN incurs larger memory consumption due to non-sharing of network parameters.

This summary highlights the key points of the text excerpt. In Figure 4, the stable convergence of SDGD is demonstrated for the Laplace eigenvalue problem, showing that it can scale up to larger problem sizes. The algorithm is able to handle high-dimensional problems.

Our method effectively captures low-dimensional solutions within high-dimensional equations, leading to faster convergence. We use small-dimension batch sizes and optimize one dimension at a time, triggering a transfer learning effect on other dimensions. The transferability of gradients is validated by theory.

The document discusses the scaling of Physics-Informed Neural Networks (PINNs) for high-dimensional Partial Differential Equations (PDEs). It introduces the full batch gradient and stochastic gradient for PINNs and analyzes their variances. The authors also propose a distributed, multi-GPU training approach.

The article discusses the scaling of physics-informed neural networks (PINNs) for high-dimensional partial differential equations (PDEs). It presents convergence plots and performance results for different algorithms and hyperparameters. The data-parallel approach is compared to the tensor neural network (TNN) baseline.

This excerpt includes a list of references cited in the document "Scaling Physics-Informed Neural Networks for High-Dimensional PDEs." The references include various papers and articles related to the use of deep learning and neural networks in solving high-dimensional partial differential equations.

This summary provides a list of references from various articles and papers related to the application of deep learning and neural networks in solving high-dimensional partial differential equations (PDEs). The references cover topics such as the numerical approximation of semilinear parabolic PDEs.

This text excerpt is a list of references to various research papers on the topic of physics-informed neural networks (PINNs) and their applications in solving high-dimensional partial differential equations (PDEs). The papers mentioned cover a range of related topics.

Raw indexed text (120,236 chars / 21,648 words / 3,342 lines)

Tackling the Curse of Dimensionality with Physics-Informed Neural Networks
Zheyuan Hu *
Khemraj Shukla †
George Em Karniadakis †
Kenji Kawaguchi *
Abstract
The curse-of-dimensionality (CoD) taxes computational resources heavily with exponentially increasing computa-
tional cost as the dimension increases. This poses great challenges in solving high-dimensional PDEs as Richard
Bellman first pointed out over 60 years ago. While there has been some recent success in solving numerically partial
differential equations (PDEs) in high dimensions, such computations are prohibitively expensive, and true scaling
of general nonlinear PDEs to high dimensions has never been achieved. In this paper, we develop a new method of
scaling up physics-informed neural networks (PINNs) to solve arbitrary high-dimensional PDEs. The new method,
called Stochastic Dimension Gradient Descent (SDGD), decomposes a gradient of PDEs into pieces corresponding
to different dimensions and samples randomly a subset of these dimensional pieces in each iteration of training
PINNs. We theoretically prove the convergence guarantee and other desired properties of the proposed method. We
experimentally demonstrate that the proposed method allows us to solve many notoriously hard high-dimensional
PDEs, including the Hamilton-Jacobi-Bellman (HJB) and the Schrödinger equations in thousands of dimensions very
fast on a single GPU using the PINNs mesh-free approach. For instance, we solve nontrivial nonlinear PDEs (one
HJB equation and one Black-Scholes equation) in 100,000 dimensions in 6 hours on a single GPU using SDGD with
PINNs. Since SDGD is a general training methodology of PINNs, SDGD can be applied to any current and future
variants of PINNs to scale them up for arbitrary high-dimensional PDEs.
1 Introduction
The curse-of-dimensionality (CoD) refers to the computational and memory challenges when dealing with high-
dimensional problems that do not exist in low-dimensional settings. The term was first introduced in 1957 by Richard
E. Bellman [7, 18], who was working on dynamic programming, to describe the exponentially increasing costs due to
high dimensions. Later, the concept was adopted in various areas of science, including numerical partial differential
equations, combinatorics, machine learning, probability, stochastic analysis, and data mining. In the context of
numerically solving Partial Differential Equations (PDEs), an increase in the dimensionality of the PDE’s independent
variables tends to lead to a corresponding exponential rise in computational costs needed for obtaining an accurate
solution. This poses a considerable challenge for all existing algorithms.
Physics-Informed Neural Networks (PINNs) are highly practical in solving partial differential equation (PDE)
problems. Compared to other methods, PINNs can solve various PDEs in their general form, enabling mesh-free
solutions and handling complex geometries. Additionally, PINNs leverage the interpolation capability of neural
networks to provide accurate predictions throughout the domain, unlike other methods that can only compute the value
of PDEs at a single point [1, 19, 40]. Due to neural networks’ flexibility and expressiveness, in principle, the set of
functions representable by PINNs have the ability to approximate high-dimensional solutions. However, PINNs may
also fail in tackling CoD due to insufficient memory and slow convergence when dealing with high-dimensional PDEs.
For example, when PDE’s dimension is very high, even just using one collocation point can lead to an insufficient
memory error. Unfortunately, the aforementioned high-dimensional and high-order cases often occur for very useful
PDEs, such as the Hamilton-Jacobi-Bellman (HJB) equation in stochastic optimal control, the Fokker-Planck equation
in stochastic analysis and high-dimensional probability, and the Black-Scholes equation in mathematical finance, etc.
These PDE examples pose a grand challenge to the application of PINNs in effectively tackling real-world large-scale
problems.
To scale up PINNs for arbitrary high-dimensional PDEs, we propose a training method of PINNs by decomposing
and sampling gradients of PDEs, which we name stochastic dimension gradient descent (SDGD). It accelerates the
training of PINNs while significantly reducing the memory cost, thereby realizing the training of any high-dimensional
PDEs with acceleration. Specifically, we reduce the memory consumption in PINNs by decomposing the computational
and memory bottleneck, namely the gradient of the residual loss that contains PDE terms. We propose a novel
decomposition of the gradient of the residual loss into each piece corresponding to each dimension of PDEs. Then, at
* Department of Computer Science, National University of Singapore, Singapore, 119077 ([email protected], [email protected])
† Division of Applied Mathematics, Brown University, Providence, RI 02912, USA (khemraj [email protected], george [email protected])
each iteration of training, we sample a subset of these dimensional pieces to optimize PINNs. Although we use only a
subset per iteration, this sampling process is ensured to be an unbiased estimator of the full gradient of all dimensions,
which is then used to guarantee convergence for all dimensions.
SDGD enables more efficient parallel computations, fully leveraging multi-GPU computing to scale up and expedite
the speed of PINNs. Parallelizing the computation of mini-batches of different dimensional pieces across multiple
devices can expedite convergence. This is akin to traditional SGD over data points, which allows simultaneous
computation on different machines. SDGD also accommodates the use of gradient accumulation on resource-limited
machines to enable larger batch sizes and reduce gradient variance. Our method also enables parallel computations,
fully leveraging multi-GPU computing to scale up and expedite the speed of PINNs. Gradient accumulation involves
accumulating gradients from multiple batches to form a larger batch size for stochastic gradients, reducing the variance
of each stochastic gradient on a single GPU. Due to the refined and general partitioning offered by SDGD, our gradient
accumulation can be executed on devices with limited resources, ideal for edge computing.
We theoretically prove that the stochastic gradients generated by SDGD are unbiased. Based on this, we also
provide convergence guarantees for our method. Additionally, our theoretical analysis showed that, under the same
batch size and memory cost, proper batch sizes for PDE terms and residual points can minimize gradient variance
and accelerate convergence, an outcome not possible solely with conventional SGD over points. SDGD extends and
generalizes traditional SGD over points, providing stable, low-variance stochastic gradients, while enabling smaller
batch sizes and reduced memory usage.
We conduct extensive experiments on several high-dimensional PDEs. We investigate the relationship between
SDGD and SGD over collocation points, where the latter is stable and widely adopted to reduce memory cost previously.
We vary the number of PDE terms and collocation points to study the stability and convergence speed of SDGD and
SGD under the same memory budgets. Experimental results show that our proposed SDGD is as stable as SGD over
points on multiple nonlinear high-dimensional PDEs and can accelerate convergence. In some cases, SDGD over terms
can achieve faster convergence than SGD, under the same memory cost for both methods.
Furthermore, we showcase large-scale PDEs where traditional PINN training directly fails due to an out-of-memory
(OOM) error. Concretely, with the further speed-up method via sampling both forward and backward passes, we can
train PINNs on nontrivial nonlinear PDEs in 100,000 dimensions in 6 hours of training time, while the traditional
training methods exceed the GPU memory limit directly. Combined with parallel computing, the training of the
100,000D nonlinear equation can be reduced to just a few minutes.
After validating the effectiveness of SDGD compared to SGD over collocation points, we compare the algorithms
with other methods. On multiple nonlinear PDEs, our algorithm outperforms other methods in various metrics.
Additionally, our method enables mesh-free training and prediction across the entire domain, while other methods
typically only provide predictions at a single point. Furthermore, we combine our algorithm with adversarial training,
notorious for its slowness in high-dimensional machine learning, and our algorithm significantly accelerates it. We also
demonstrate the generality of our method by applying it to the Schrödinger equation, which has broad connections with
quantum physics and quantum chemistry. Overall, our method provides a paradigm shift for performing large-scale
PINN training.
The rest of this paper is arranged as follows. We present related work in Section 2, our main algorithm for
accelerating and scaling up PINNs in Section 3, theoretical analysis of our algorithms’ convergence in Section 4,
numerical experiments in Section 5, and conclusion in Section 6.
2 Related Work
2.1 Physics-Informed Machine Learning
This paper is based on the concept of Physics-Informed Machine Learning [29]. Specifically, PINNs [42] utilize neural
networks as surrogate solutions for PDEs and optimize the boundary loss and residual loss to approximate the PDEs’
solutions. On the other hand, DeepONet [33] leverages the universal approximation theorem of infinite-dimensional
operators and directly fits the PDE operators in a data-driven manner. It can also incorporate physical information by
using residual loss. Therefore, our algorithms can also be flexibly combined with the training of physics-informed
DeepONet. Since their inception, PINNs have succeeded in solving numerous practical problems in computational
science, e.g., nonlinear PDEs [41], solid mechanics [17], uncertainty quantification [49], inverse water waves problems
[27], and fluid mechanics [8], etc. DeepONets also have shown their great potential in solving stiff chemical mechanics
[14], shape optimization [45], and materials science problems [15].
The success of PINNs has also attracted considerable attention in theoretical analysis. In the study conducted by
Luo et al. [34], the authors delve into the exploration of PINNs’ prior and posterior generalization bounds. Additionally,
they utilize the neural tangent kernel to demonstrate the global convergence of Physics-Informed Neural Networks
(PINNs). Mishra et al. [37] derive an estimate for the generalization error of PINNs, considering both the training
error and the number of training samples. In a different approach, Shin et al. [44] adapt the Schauder approach and
the maximum principle to provide insights into the convergence behavior of the minimizer as the number of training
samples tends towards infinity. They demonstrate that the minimizer converges to the solution in both C 0 and H 1 .
Furthermore, Lu et al. [32] employ the Barron space within the framework of two-layer neural networks to conduct
a prior analysis on PINN with the softplus activation function. This analysis is made possible by drawing parallels
between the softplus and ReLU activation functions. More recently, Hu et al. [23] employ the Rademacher complexity
concept to provide a measure of the generalization error for both PINNs and the extended PINNs variant.
2.2 Machine Learning PDE Solvers in High Dimensions
There have been numerous attempts to tackle high-dimensional PDEs by deep learning to overcome the curse of
dimensionality. In the PINN literature, [46] shows that to learn a high-dimensional HJB equation, the L ∞ loss is
required. The authors of [21] propose to parameterize PINNs by the Gaussian smoothed model and to optimize PINNs
without back-propagation by Stein’s identity, avoiding the vast amount of differentiation in high-dimensional PDE
operators to accelerate the convergence of PINNs. Separable PINN [11] considers a per-axis sampling of residual points
instead of point-wise sampling in high-dimensional spaces, thereby reducing the computational cost of PINN. However,
this method [21, 11] focuses on acceleration on only modest (less than 4) dimensions. Specifically, separable PINNs
[11] are designed mainly for increasing the number of collocation points in 3D PDEs, whose separation of sampling
points becomes intractable in high-dimensional spaces. Separable PINNs [11] also struggle in PDE problems with
non-separable solutions. Han et al. [19, 20] proposed the DeepBSDE solver for high-dimensional PDEs based on the
classical BSDE method and they use deep learning to approximate unknown functions. Extensions of the DeepBSDE
method are presented in [2, 9, 22, 24, 28]. Becker et al. [6] solve high-dimensional optimal stopping problems via
deep learning. The authors of [1] combine splitting methods and deep learning to solve high-dimensional PDEs. Raissi
[40] leverages the relationship between high-dimensional PDEs and stochastic processes whose trajectories provide
supervision for deep learning. Despite the effectiveness of the method in [1, 19, 20, 40], they can only be applied
to a certain restricted class of PDEs. Wang et al. [47, 48] propose tensor neural networks with efficient numerical
integration and separable DeepONet structures for solving high-dimensional Schrödinger equations in quantum physics.
Zhang et al. [51] propose to use PINNs for solving stochastic differential equations by representing their solutions via
spectral dynamically orthogonal and bi-orthogonal methods. Zang et al. [50] propose a weak adversarial network that
solves PDEs by its weak formulation.
In the numerical PDE literature, attempts exist to scale numerical methods to high dimensions, e.g., proper
generalized decomposition (PGD) [10], multi-fidelity information fusion algorithm [39], and ANOVA [35, 52]. Darbon
and Osher [12] propose a fast algorithm for solving high-dimensional Hamilton-Jacobi equations if the Hamiltonian is
convex and positively homogeneous of degree one. The multilevel Picard method [3, 4, 5, 25, 26] is another approach
for approximating solutions of high-dimensional parabolic PDEs, which reformulates the PDE problem as a stochastic
fixed point equation, which is then solved by multilevel and nonlinear Monte-Carlo. Similar to Beck et al. [1] and Han
et al. [19], the multilevel Picard method can only output the solution’s value on one test point and can only be applied
to a certain restricted class of PDEs.
2.3 Stochastic Gradient Descent
This paper accelerates and scales up PINNs based on the idea of stochastic gradient descent (SGD). In particular, we
propose the aforementioned SDGD method. SGD is a common approach in machine learning for handling large-scale
data. In general, SGD is an iterative optimization algorithm commonly used in machine learning for training models. It
updates the model’s parameters by computing the gradients on a small randomly selected subset of training examples,
known as mini-batches. This randomness introduces stochasticity, hence enabling faster convergence and efficient
utilization of large datasets, making SGD a popular choice for training deep learning models.
Among them, the works in [13, 31, 36] are remarkable milestones. Specifically, the condition presented in [31]
stands as a minimum requirement within the general optimization literature, specifically pertaining to the Lipschitz
continuity of the gradient estimator. Their proof can also be extended to establish convergence for our particular case, necessitating the demonstration of an upper bound on the Lipschitz constant of our gradient estimator. On the
other hand, [13, 36] adopt a classical condition that entails a finite upper bound on the variance. In this case, it suffices
to compute an upper bound for the variance term. While [31] assumes a more relaxed condition, the conclusion is
comparably weaker, demonstrating proven convergence but with a notably inferior convergence rate. In the theoretical
analysis presented in section 4 of this paper, we will utilize the aforementioned tools to provide convergence guarantees
for our algorithm, with an emphasis on the effect of the choice of SDGD on stochastic gradient variance and PINNs
convergence.
3 Method
3.1 Physics-Informed Neural Networks (PINNs)
In this paper, we focus on solving the following partial differential equations (PDEs) defined on a domain $\Omega \subset \mathbb{R}^d$:
$$\mathcal{L}u(x) = R(x) \ \text{in } \Omega, \qquad \mathcal{B}u(x) = B(x) \ \text{on } \Gamma, \qquad (1)$$
where L and B are the differential operators for the residual in Ω and for the boundary/initial condition on Γ. PINNs
[42] is a neural network-based PDE solver via minimizing the following boundary and residual loss functions.
$$\mathcal{L}(\theta) = \lambda_b \mathcal{L}_b(\theta) + \lambda_r \mathcal{L}_r(\theta) = \frac{\lambda_b}{n_b}\sum_{i=1}^{n_b}|\mathcal{B}u_\theta(x_{b,i}) - B(x_{b,i})|^2 + \frac{\lambda_r}{n_r}\sum_{i=1}^{n_r}|\mathcal{L}u_\theta(x_{r,i}) - R(x_{r,i})|^2. \qquad (2)$$
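As a concrete illustration of the loss in Eq. (2), here is a minimal PyTorch sketch for a toy one-dimensional problem. This is not the authors' code: the choice of operators (L = d²/dx², B = identity on the boundary), the network size, and the sampled points are illustrative assumptions.

```python
# Minimal sketch of the PINN loss in Eq. (2) for a toy 1D problem,
# assuming L = d^2/dx^2 and B = identity on the boundary; names are illustrative.
import torch

u_theta = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)

def residual_loss(x_r, R_vals, lam_r=1.0):
    # |L u_theta(x_r) - R(x_r)|^2 averaged over the n_r collocation points
    x_r = x_r.requires_grad_(True)
    u = u_theta(x_r)
    du = torch.autograd.grad(u.sum(), x_r, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x_r, create_graph=True)[0]
    return lam_r * ((d2u - R_vals) ** 2).mean()

def boundary_loss(x_b, B_vals, lam_b=1.0):
    # |B u_theta(x_b) - B(x_b)|^2 averaged over the n_b boundary points
    return lam_b * ((u_theta(x_b) - B_vals) ** 2).mean()

x_r = torch.rand(128, 1)                    # residual (collocation) points
x_b = torch.tensor([[0.0], [1.0]])          # boundary points
loss = boundary_loss(x_b, torch.zeros(2, 1)) + residual_loss(x_r, torch.zeros(128, 1))
loss.backward()
```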
3.2 Methodology for High-Dimensional PDEs
We first adopt the simple high-dimensional second-order Poisson’s equation for illustration, then we move to the
widely-used high-dimensional Fokker-Planck equation for the general case.
3.2.1 Introductory Case of the High-Dimensional Poisson’s Equation
We consider a simple high-dimensional second-order Poisson’s equation for illustration:
$$\Delta u(x) = \sum_{i=1}^{d}\frac{d^2}{dx_i^2}u(x) = R(x), \qquad x \in \Omega \subset \mathbb{R}^d, \qquad (3)$$
where u is the solution, and u θ is our PINN model parameterized by θ. The memory and computational cost scale
linearly as d increases. So, for extremely high-dimensional PDEs containing many second-order terms, using one
collocation point can lead to insufficient memory.
However, the memory problem is solvable by inspecting the residual loss on the collocation point x:
$$\ell(\theta) = \frac{1}{2}\left(\sum_{i=1}^{d}\frac{d^2}{dx_i^2}u_\theta(x) - R(x)\right)^2. \qquad (4)$$
The gradient with respect to the model parameters θ for training the PINN is
$$\mathrm{grad}(\theta) := \frac{\partial \ell(\theta)}{\partial \theta} = \left(\sum_{i=1}^{d}\frac{d^2}{dx_i^2}u_\theta(x) - R(x)\right)\left(\sum_{i=1}^{d}\frac{\partial}{\partial \theta}\frac{d^2}{dx_i^2}u_\theta(x)\right). \qquad (5)$$
We are only differentiating with respect to parameters θ on the d PDE terms $\frac{d^2}{dx_i^2}u_\theta(x)$, which is the memory bottleneck in the backward pass since the shape of the gradient is proportional to both the PDE dimension and the parameter count of the PINN. In contrast, the first part $\left(\sum_{i=1}^{d}\frac{d^2}{dx_i^2}u_\theta(x) - R(x)\right)$ is a scalar, which can be precomputed and detached from the GPU since it is not involved in the backpropagation for θ. Since the full gradient grad(θ) is the sum of d independent terms, we can sample several terms for stochastic gradient descent (SGD) using the sampled unbiased gradient estimator. Concretely, our algorithm can be summarized as follows:
1. Choose random indices I ⊂ {1, 2, · · · , d} where |I| is the cardinality of the set I, which is the batch size over PDE terms, where we can set |I| ≪ d to minimize memory cost.
2. For i = 1, 2, · · · , d, compute $\frac{d^2}{dx_i^2}u_\theta(x)$. If i ∈ I, then keep the gradient with respect to θ, else we detach it from GPU to save memory. After detachment, the term will not be involved in the costly backpropagation, and its gradient with respect to θ will not be computed.
3. Compute the unbiased stochastic gradient used to update the model
$$\mathrm{grad}_I(\theta) = \left(\sum_{i=1}^{d}\frac{d^2}{dx_i^2}u_\theta(x) - R(x)\right)\left(\frac{d}{|I|}\sum_{i\in I}\frac{\partial}{\partial \theta}\frac{d^2}{dx_i^2}u_\theta(x)\right). \qquad (6)$$
4. If not converged, go to 1.
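The four steps above can be sketched in PyTorch as follows. This is a hedged illustration rather than the authors' implementation: the network, the dimension d, the batch size |I|, and the single collocation point are assumptions, and the detaching of non-sampled terms is realized by computing their second derivatives without building a differentiable graph.

```python
# Sketch of SDGD (steps 1-4) for the d-dimensional Poisson equation in Eq. (3);
# only the |I| sampled second-derivative terms stay in the autograd graph, so the
# gradient of `loss` with respect to theta reproduces Eq. (6).
import torch

d, batch_I = 100, 8
u_theta = torch.nn.Sequential(torch.nn.Linear(d, 128), torch.nn.Tanh(), torch.nn.Linear(128, 1))
opt = torch.optim.Adam(u_theta.parameters(), lr=1e-3)

def d2u_dxi2(x, i, create_graph):
    # d^2 u_theta / dx_i^2 at the collocation points x (shape [n, d])
    u = u_theta(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0][:, i]
    return torch.autograd.grad(du.sum(), x, create_graph=create_graph)[0][:, i]

x = torch.randn(1, d, requires_grad=True)     # one collocation point
R = torch.zeros(1)                            # right-hand side R(x)

I = torch.randperm(d)[:batch_I].tolist()      # step 1: sample |I| dimensions
terms = [d2u_dxi2(x, i, create_graph=(i in I)) for i in range(d)]   # step 2

scalar = (torch.stack(terms).sum(0) - R).detach()     # precomputed residual, no gradient
sampled = torch.stack([terms[i] for i in I]).sum(0)   # terms kept for the backward pass
loss = ((d / batch_I) * scalar * sampled).sum()       # step 3: unbiased surrogate

opt.zero_grad()
loss.backward()                               # backpropagates only through the sampled terms
opt.step()                                    # step 4: repeat until convergence
```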
Our algorithm enjoys the following good properties and extensions:
• Low memory cost: Since the main cost is from the backward pass, and we are only backpropagating over terms
with i ∈ I, the cost is the same as the corresponding |I|-dimensional PDE.
• Unbiased stochastic gradient: Our gradient is an unbiased estimator of the true full batch gradient, i.e.,
$$\mathbb{E}_I[\mathrm{grad}_I(\theta)] = \mathrm{grad}(\theta), \qquad (7)$$
so that modern SGD accelerators such as Adam [30] can be adopted.
• Accumulate gradient for full batch GD: For the full GD that exceeds the memory, we can select non-overlapping
index sets {I k } k∈K such that ∪ k I k = {1, 2, · · · , d}, then we can combine these minibatch gradients to get the
full gradient,
$$\frac{1}{|K|}\sum_{k}\mathrm{grad}_{I_k}(\theta) = \mathrm{grad}(\theta). \qquad (8)$$
This baseline process is memory-efficient but time-consuming. It is memory efficient since it divides the
entire computationally intensive gradient into the sum of several stochastic gradients whose computations are
memory-efficient. This operation is conducted one by one using the “For loop from i = 1 to d”. But it is rather
time-consuming because the “For loop” cannot be parallelized.
A common practice in PINN is to sample stochastic collocation points to reduce the batch size of points, which is
a common way to reduce memory costs in previous methods and even the entire field of deep learning. Next, we
compare SDGD with SGD over collocation points:
• SGD on collocation points has a minimum batch size of 1 point plus the d-dimensional equation.
• SDGD can be combined with SGD on points so that its minimal batch size is 1 point plus 1D equation.
• Thus, our method can solve high-dimensional PDEs arbitrarily since its minimal computational cost will not
grow as the dimension d increases. In contrast, the minimum cost of SGD on points scales linearly as d grows.
• Empirically, it is interesting to see how the two SGDs affect convergence. In particular, B-point + D-term
with the same B × D quantity has the same computational and memory cost, e.g., 50-point-100-terms and
500-point-10-terms.
3.2.2 General Case
We are basically doing the decomposition:
$$\mathcal{L}u = \sum_{i=1}^{N_L}\mathcal{L}_i u, \qquad (9)$$
where $\mathcal{L}_i$ are the $N_L$ decomposed terms. We have the following examples:
• In the d-dimensional Poisson cases, $N_L = d$ and $\mathcal{L}_i u = \frac{\partial^2}{\partial x_i^2}u(x)$. This observation also highlights the
significance of decomposition methods for high-dimensional Laplacian operators. The computational bottleneck
in many high-dimensional second-order PDEs often lies in the high-dimensional Laplacian operator.
• Consider the d-dimensional Fokker–Planck (FP) equation, which has numerous real-world applications in
optimal control and finance:
$$\frac{\partial u(x,t)}{\partial t} = -\sum_{i=1}^{d}\frac{\partial}{\partial x_i}\left[F_i(x)u(x,t)\right] + \frac{1}{2}\sum_{i,j=1}^{d}\frac{\partial^2}{\partial x_i \partial x_j}\left[D_{ij}(x)u(x,t)\right], \qquad (10)$$
where $x = (x_1, x_2, \ldots, x_d)$ is a d-dimensional vector, u(x, t) is the probability density function of the
stochastic process x(t), F i (x) is the i-th component of the drift coefficient vector F (x), and D ij (x) is the
(i, j)-th component of the diffusion coefficient matrix D(x).
Then, our decomposition over the PDE terms will be
$$\mathcal{L}u = \frac{\partial u(x,t)}{\partial t} + \sum_{i=1}^{d}\frac{\partial}{\partial x_i}\left[F_i(x)u(x,t)\right] - \frac{1}{2}\sum_{i,j=1}^{d}\frac{\partial^2}{\partial x_i \partial x_j}\left[D_{ij}(x)u(x,t)\right] = \sum_{i,j=1}^{d}\left\{\frac{1}{d^2}\frac{\partial u(x,t)}{\partial t} + \frac{1}{d}\frac{\partial}{\partial x_i}\left[F_i(x)u(x,t)\right] - \frac{1}{2}\frac{\partial^2}{\partial x_i \partial x_j}\left[D_{ij}(x)u(x,t)\right]\right\} = \sum_{i,j=1}^{d}\mathcal{L}_{ij}u(x), \qquad (11)$$
where we denoted
$$\mathcal{L}_{ij}u(x) = \frac{1}{d^2}\frac{\partial u(x,t)}{\partial t} + \frac{1}{d}\frac{\partial}{\partial x_i}\left[F_i(x)u(x,t)\right] - \frac{1}{2}\frac{\partial^2}{\partial x_i \partial x_j}\left[D_{ij}(x)u(x,t)\right]. \qquad (12)$$
• The d-dimensional bi-harmonic equation:
$$\Delta^2 u(x) = 0, \qquad x \in \Omega \subset \mathbb{R}^d, \qquad (13)$$
where $\Delta = \sum_{i=1}^{d}\frac{\partial^2}{\partial x_i^2}$ is the Laplacian operator. For this high-dimensional fourth-order PDE, we can do the following decomposition for the PDE operator:
$$\mathcal{L}u = \Delta^2 u(x) = \sum_{i,j=1}^{d}\frac{\partial^4}{\partial x_i^2 \partial x_j^2}u(x) = \sum_{i,j=1}^{d}\mathcal{L}_{ij}u(x), \qquad (14)$$
where
$$\mathcal{L}_{ij}u(x) = \frac{\partial^4}{\partial x_i^2 \partial x_j^2}u(x). \qquad (15)$$
• For other high-dimensional PDEs, their complexity arises from the dimensionality itself. Therefore, it is always
possible to find decomposition methods that can alleviate this complexity.
After introducing these illustrative examples, we go back to the algorithm. The computational and memory bottleneck is the residual loss:
$$\ell(\theta) = \frac{1}{2}\left(\mathcal{L}u(x) - R(x)\right)^2. \qquad (16)$$
The gradient with respect to the model parameters θ for training the PINN is
$$\mathrm{grad}(\theta) = \frac{\partial \ell(\theta)}{\partial \theta} = \left(\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - R(x)\right)\left(\sum_{i=1}^{N_L}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right). \qquad (17)$$
Subsequently, we sample the second factor (the terms entering the backward pass) to reduce memory cost. Concretely, our algorithm can be summarized as follows:
1. Choose random indices I ⊂ {1, 2, · · · , N_L} where |I| is the cardinality of the set I, which is the batch size over PDE terms, where we can set |I| ≪ N_L to minimize memory cost.
2. For i = 1, 2, · · · , N_L, compute $\mathcal{L}_i u_\theta(x)$. If i ∈ I, then keep the gradient with respect to θ, else we detach it from GPU to save memory. After detachment, the term will not be involved in the costly backpropagation, and its gradient with respect to θ will not be computed, which saves GPU memory costs.
3. Compute the unbiased stochastic gradient used to update the model
$$\mathrm{grad}_I(\theta) = \left(\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right). \qquad (18)$$
4. If not converged, go to 1.
The algorithm is summarized in Algorithm 1.
Algorithm 1 Training Algorithm for scaling-up by sampling the gradient in the backward pass.
1: while NOT CONVERGED do
2:   Choose random indices I ⊂ {1, 2, · · · , N_L} where |I| is the cardinality of the set I, which is the batch size over PDE terms, where we can set |I| ≪ N_L to minimize memory cost.
3:   for i ∈ {1, 2, · · · , N_L} do
4:     Compute $\mathcal{L}_i u_\theta(x)$. If i ∈ I, then keep the gradient with respect to θ, else we detach it from GPU to save memory. After detachment, the term will not be involved in the costly backpropagation, and its gradient with respect to θ will not be computed.
5:   end for
6:   Compute the unbiased stochastic gradient used to update the model
     $$\mathrm{grad}_I(\theta) = \left(\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right).$$
7: end while

3.3 Further Scale-up via Gradient Accumulation and Parallel Computing
This subsection introduces two straightforward extensions of our Algorithm 1 for further speed-up and scale-up.
Gradient Accumulation. Since our method involves SDGD, in large-scale problems, we have to reduce the batch
size to fit it within the available GPU memory. However, a very small batch size can result in significant gradient
variance. In such cases, gradient accumulation is a promising idea. Specifically, gradient accumulation involves
sampling different index sets I 1 , I 2 , · · · , I n , computing the corresponding gradients grad I k (θ) for k = 1, 2, · · · , n,
and then averaging them as the final unbiased stochastic gradient for one-step optimization to effectively increase the
batch size. The entire process can be implemented on a single GPU, and we only need to be mindful of the tradeoff
between computation time and gradient variance.
Parallel Computing. Just as parallel computing can be used in machine learning to increase batch size and
accelerate convergence in SGD, our new SDGD also supports parallel computing. Recall that the stochastic gradient
by sampling the PDE terms randomly is given by
$$\mathrm{grad}_I(\theta) = \left(\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right).$$
We can compute the above gradient for different index sets I 1 , I 2 , · · · , I n on various machines simultaneously and
accumulate them to form a larger-batch stochastic gradient.
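A minimal sketch of the gradient-accumulation loop is given below. The helper `sdgd_surrogate_loss` is a placeholder standing in for the unbiased SDGD loss built from a sampled index set (for instance, as in the Poisson sketch above); the model, batch sizes, and number of accumulation steps are illustrative assumptions.

```python
# Sketch of gradient accumulation over n sampled index sets I_1, ..., I_n on a single GPU.
import torch

model = torch.nn.Sequential(torch.nn.Linear(100, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
num_terms, batch_I, n_accum = 100, 8, 4

def sdgd_surrogate_loss(I):
    # Placeholder: in practice this would return the unbiased SDGD surrogate loss
    # for the sampled index set I, as in the Poisson sketch above.
    x = torch.randn(16, 100)
    return model(x).pow(2).mean()

opt.zero_grad()
for _ in range(n_accum):
    I = torch.randperm(num_terms)[:batch_I]          # fresh index set I_k
    (sdgd_surrogate_loss(I) / n_accum).backward()    # gradients accumulate in the .grad buffers
opt.step()                                           # one step with the averaged, larger-batch gradient
```

For multi-GPU parallelism, the same per-index-set gradients would instead be computed on different devices and averaged (for example, via an all-reduce) before the optimizer step.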
3.4 Further Speed-up via Simultaneous Sampling in Forward and Backward Passes
In this section, we discuss how to accelerate further large-scale PINN training based on our proposed memory reduction
methods to make it both fast and memory efficient.
Although the proposed methods can significantly reduce memory cost and use SDGD to obtain some acceleration
through gradient randomness, the speed is still slow for particularly large-scale problems because we need to calculate
each term $\{\mathcal{L}_i u(x)\}_{i=1}^{N_L}$ in the forward pass one by one, and we are only omitting some of these terms in the backward pass
to scale up. For example, in extremely high-dimensional cases, we may need to calculate thousands of second-order
derivatives of PINN, and the calculation speed is clearly unacceptable.
To overcome this bottleneck, we can perform the same unbiased sampling on the forward pass to accelerate it while
ensuring that the entire gradient is unbiased. In other words, we only select some indices for calculation in the forward
pass and select another set of indices for the backward pass, combining them into a very cheap unbiased stochastic
gradient.
Mathematically, consider the full gradient again:
$$\mathrm{grad}(\theta) = \frac{\partial \ell(\theta)}{\partial \theta} = \left(\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - R(x)\right)\left(\sum_{i=1}^{N_L}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right). \qquad (19)$$
We choose two random and independent index sets I, J ⊂ {1, 2, · · · , N_L} for the sampling of the backward and the forward passes, respectively. The corresponding stochastic gradient is:
$$\mathrm{grad}_{I,J}(\theta) = \left(\frac{N_L}{|J|}\sum_{j\in J}\mathcal{L}_j u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right). \qquad (20)$$
Since I, J are independent, the gradient estimator is unbiased:
$$\mathbb{E}_{I,J}[\mathrm{grad}_{I,J}(\theta)] = \mathrm{grad}(\theta). \qquad (21)$$
Concretely, our algorithm can be summarized as follows:
1. Choose random indices I, J ⊂ {1, 2, · · · , N L }, where we can set |I| ≪ N L to minimize memory cost and
|J| ≪ N L to further speed up.
2. For i ∈ I, compute L i u θ (x) and keep the gradient with respect to θ.
3. For j ∈ J, compute L j u θ (x) and detach it to save memory cost.
4. Compute the unbiased stochastic gradient used to update the model
$$\mathrm{grad}_{I,J}(\theta) = \left(\frac{N_L}{|J|}\sum_{j\in J}\mathcal{L}_j u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right). \qquad (22)$$
5. If not converged, go to 1.
The algorithm is summarized in Algorithm 2.
Algorithm 2 Training Algorithm for scaling-up and further speeding-up by sampling both the forward and backward passes.
1: while NOT CONVERGED do
2:   Choose random indices I, J ⊂ {1, 2, · · · , N_L}, where we can set |I| ≪ N_L to minimize memory cost and |J| ≪ N_L to further speed up.
3:   for i ∈ I do
4:     Compute $\mathcal{L}_i u_\theta(x)$ and keep the gradient with respect to θ.
5:   end for
6:   for j ∈ J do
7:     Compute $\mathcal{L}_j u_\theta(x)$ and detach it to save memory cost.
8:   end for
9:   Compute the unbiased stochastic gradient used to update the model
     $$\mathrm{grad}_{I,J}(\theta) = \left(\frac{N_L}{|J|}\sum_{j\in J}\mathcal{L}_j u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial \theta}\mathcal{L}_i u_\theta(x)\right).$$
10: end while
Trading off Speed with Gradient Variance. Obviously, since we have done more sampling, the gradient variance
of this method may be larger, and the convergence may be sub-optimal. However, at the initialization stage of
the problem, an imprecise gradient is sufficient to make the loss drop significantly, which is the tradeoff between
convergence quality and convergence speed. We will demonstrate in experiments that this method is particularly
effective for extremely large-scale PINNs training.
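Below is a hedged PyTorch sketch of Algorithm 2 for the introductory Poisson case: one index set J is sampled for a detached forward estimate of the residual and an independent set I for the backward pass. The dimension, batch sizes, number of collocation points, and network are illustrative assumptions.

```python
# Sketch of Algorithm 2 for the d-dimensional Poisson equation: J gives a detached
# forward estimate of the residual, I gives the terms kept for the backward pass.
import torch

d, bI, bJ = 100, 8, 8
u_theta = torch.nn.Sequential(torch.nn.Linear(d, 128), torch.nn.Tanh(), torch.nn.Linear(128, 1))
opt = torch.optim.Adam(u_theta.parameters(), lr=1e-3)

def d2u_dxi2(x, i, create_graph):
    u = u_theta(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0][:, i]
    return torch.autograd.grad(du.sum(), x, create_graph=create_graph)[0][:, i]

x = torch.randn(4, d, requires_grad=True)    # a few collocation points
R = torch.zeros(4)                           # right-hand side R(x)

I = torch.randperm(d)[:bI].tolist()          # backward-pass terms (graph kept)
J = torch.randperm(d)[:bJ].tolist()          # forward-pass terms (detached)

forward_est = (d / bJ) * sum(d2u_dxi2(x, j, create_graph=False) for j in J).detach()
backward_sum = (d / bI) * sum(d2u_dxi2(x, i, create_graph=True) for i in I)
loss = ((forward_est - R) * backward_sum).mean()   # gradient matches the sampled estimator

opt.zero_grad(); loss.backward(); opt.step()
```

Because I and J are drawn independently, the expectation of this gradient over both index sets recovers the full gradient, at the price of a larger variance than in Algorithm 1.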
3.5 Practical Consideration: Normalization for Stable Gradient
In high-dimensional scenarios, the number of terms in the PDE increases with the dimensionality of the problem,
leading to the possibility of a substantially large scale for the entire PDE residual. This is different from low-dimensional
problems, and necessitates normalizing both the PDE terms and their corresponding residual losses to prevent
the occurrence of gradient explosion. Specifically, our approach is as follows.
For the general PDE we have considered, the full batch gradient and stochastic gradients generated by Algorithm 1 and Algorithm 2 are
$$\mathrm{grad}(\theta) = \left(\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - R(x)\right)\left(\sum_{i=1}^{N_L}\frac{\partial}{\partial\theta}\mathcal{L}_i u_\theta(x)\right),$$
$$\mathrm{grad}_I(\theta) = \left(\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial\theta}\mathcal{L}_i u_\theta(x)\right),$$
$$\mathrm{grad}_{I,J}(\theta) = \left(\frac{N_L}{|J|}\sum_{j\in J}\mathcal{L}_j u_\theta(x) - R(x)\right)\left(\frac{N_L}{|I|}\sum_{i\in I}\frac{\partial}{\partial\theta}\mathcal{L}_i u_\theta(x)\right).$$
Since the full gradient is the sum of N_L subgradients along each dimension where N_L grows with d, the scale of this PDE inherently increases with dimension. If we do not control the learning rate or normalize appropriately, even with regular training, we may encounter the problem of exploding gradients. Compared with adjusting the learning rate, normalizing the loss or gradients is more versatile. We only need to normalize using $\frac{1}{N_L^2}$ for both the forward and backward passes, and the corresponding new gradients are given by:
$$\mathrm{grad}(\theta) = \left(\frac{1}{N_L}\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - \frac{R(x)}{N_L}\right)\left(\frac{1}{N_L}\sum_{i=1}^{N_L}\frac{\partial}{\partial\theta}\mathcal{L}_i u_\theta(x)\right),$$
$$\mathrm{grad}_I(\theta) = \left(\frac{1}{N_L}\sum_{i=1}^{N_L}\mathcal{L}_i u_\theta(x) - \frac{R(x)}{N_L}\right)\left(\frac{1}{|I|}\sum_{i\in I}\frac{\partial}{\partial\theta}\mathcal{L}_i u_\theta(x)\right),$$
$$\mathrm{grad}_{I,J}(\theta) = \left(\frac{1}{|J|}\sum_{j\in J}\mathcal{L}_j u_\theta(x) - \frac{R(x)}{N_L}\right)\left(\frac{1}{|I|}\sum_{i\in I}\frac{\partial}{\partial\theta}\mathcal{L}_i u_\theta(x)\right).$$
4 Theory
In this section, we analyze the convergence of our proposed algorithm.
4.1 SDGD is Unbiased
In previous sections, we have shown that the stochastic gradients generated by our proposed SDGD are unbiased:
Theorem 4.1. The stochastic gradients grad I (θ) and grad I,J (θ) in our Algorithms 1 and 2, respectively, parameterized
by index sets I, J, are an unbiased estimator of the full-batch gradient grad(θ) using all PDE terms, i.e., the expected
values of these estimators match that of the full-batch gradient, E I [grad I (θ)] = E I,J [grad I,J (θ)] = grad(θ).
Proof. We prove the theorem in Section 3.
4.2 SDGD Reduces Gradient Variance
In this subsection, we aim to show that SDGD can be regarded as another form of SGD over PDE terms, serving as a
complement to the commonly used SGD over residual points. Specifically, B-point + D-term where B, D ∈ Z + with
the same B × D quantity has the same computational and memory cost, e.g., 50-point-100-terms and 500-point-10-
terms. In particular, we shall demonstrate that properly choosing the batch sizes of residual points B and PDE terms D
under the constant memory cost (B × D) can lead to reduced stochastic gradient variance and accelerated convergence
compared to previous practices that use SGD over points only.
We assume that the total full batch is with $N_r$ residual points $\{x_i\}_{i=1}^{N_r}$ and $N_L$ PDE terms $\mathcal{L} = \sum_{i=1}^{N_L}\mathcal{L}_i$; then the loss function for PINN optimization is
$$\frac{1}{2 N_r N_L^2}\sum_{i=1}^{N_r}\left(\mathcal{L}u_\theta(x_i) - R(x_i)\right)^2, \qquad (23)$$
where we normalize over both the number of residual points and the number of PDE terms, which does not impact the
directions of the gradients. The full batch gradient is:
$$g(\theta) := \frac{1}{N_r N_L^2}\sum_{i=1}^{N_r}\left(\mathcal{L}u_\theta(x_i) - R(x_i)\right)\frac{\partial}{\partial\theta}\mathcal{L}u_\theta(x_i) = \frac{1}{N_r N_L^2}\sum_{i=1}^{N_r}\left(\mathcal{L}u_\theta(x_i) - R(x_i)\right)\left(\sum_{j=1}^{N_L}\frac{\partial}{\partial\theta}\mathcal{L}_j u_\theta(x_i)\right). \qquad (24)$$
If the random index sets J ⊂ {1, 2, · · · , N L } and B ⊂ {1, 2, · · · , N r } are chosen, then the SGD gradient with |B|
residual points and |J| PDE terms using these index sets is
$$g_{B,J}(\theta) = \frac{1}{|B||J|N_L}\sum_{i\in B}\left(\mathcal{L}u_\theta(x_i) - R(x_i)\right)\left(\sum_{j\in J}\frac{\partial}{\partial\theta}\mathcal{L}_j u_\theta(x_i)\right), \qquad (25)$$
where $|\cdot|$ computes the cardinality of a set. It is straightforward to show that $\mathbb{E}_{B,J}[g_{B,J}(\theta)] = g(\theta)$.
Theorem 4.2. For the random index sets (B, J) where J ⊂ {1, 2, · · · , N L } is that for indices of PDE terms and
B ⊂ {1, 2, · · · , N r } is that for indices of residual points, then
$$\mathbb{V}_{B,J}[g_{B,J}(\theta)] = \frac{C_1|J| + C_2|B| + C_3}{|B||J|}, \qquad (26)$$
where V computes the variance of a random variable, C 1 , C 2 , C 3 are constants independent of B, J.
Proof. The theorem is proved in Appendix A.1
Intuitively, the variance of the stochastic gradient tends to decrease as the batch sizes (|B|, |J|) increase, and it
converges to zero for |B|, |J| → ∞.
This verifies that SDGD can be regarded as another form of SGD over PDE terms, serving as a complement
to the commonly used SGD over residual points. |B|-point + |J|-term with the same |B| × |J| quantity has the
same computational and memory cost, e.g., 50-point-100-terms and 500-point-10-terms. According to Theorem
4.2, under the same memory budget, i.e., |B| × |J| is fixed, then there exists a particular choice of batch sizes |B|
and |J| that minimizes the gradient variance, in turn accelerating and stabilizing convergence. This is because the
stochastic gradient variance V B,J [g B,J (θ)] is a function of finite batch sizes |B| and |J|, which therefore can achieve
its minimum value at a certain choice of batch sizes.
Let us take the extreme cases as illustrative examples. The first extreme case is when the PDE terms have no
variance, meaning that the terms $\frac{\partial}{\partial\theta}\mathcal{L}_j u_\theta(x_i)$ are identical for all j after we fix i. In this case, if a memory budget of,
for example, 100 units is given, the optimal choice would be to select 100 points and one PDE term with the minimum
variance. Choosing more PDE terms would not decrease the gradient variance and would be less effective than using
the entire memory budget for sampling points. Conversely, if the points have no variance in terms of the gradient they
induce, the optimal choice would be one point and 100 PDE terms. In practice, the selection of residual points and
PDE terms will inevitably introduce some variance. Therefore, in the case of a fixed memory cost, the choice of batch
size for PDE terms and residual points involves a tradeoff. Increasing the number of PDE terms reduces the variance
contributed by the PDE terms but also decreases the number of residual points, thereby increasing the variance of
the points. Conversely, decreasing the number of PDE terms reduces the variance from the points but increases the
variance from the PDE terms. Thus, there exists an optimal selection strategy to minimize the overall gradient variance
since the choices of batch sizes are finite.
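To make this tradeoff explicit, the following short derivation (our illustration, not part of the original text) computes the optimal split of a fixed memory budget $|B||J| = M$ using Theorem 4.2. Substituting $|J| = M/|B|$,
$$\mathbb{V}_{B,J}[g_{B,J}(\theta)] = \frac{C_1|J| + C_2|B| + C_3}{|B||J|} = \frac{C_1}{|B|} + \frac{C_2}{|J|} + \frac{C_3}{M} = \frac{C_1}{|B|} + \frac{C_2|B|}{M} + \frac{C_3}{M}.$$
Setting the derivative with respect to $|B|$ to zero gives $-C_1/|B|^2 + C_2/M = 0$, so
$$|B|^* = \sqrt{\frac{C_1 M}{C_2}}, \qquad |J|^* = \sqrt{\frac{C_2 M}{C_1}},$$
rounded to feasible integer batch sizes. A larger $C_1$ (variance attributable to the residual points) pushes the budget toward more points, while a larger $C_2$ (variance attributable to the PDE terms) pushes it toward more terms, consistent with the two extreme cases above.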
4.3 Gradient Variance Bound and Convergence of SDGD
To establish the convergence of unbiased stochastic gradient descent, we require either Lipschitz continuity of the
gradients [31] or bounded variances [13, 36], with the latter leading to faster convergence. To prove this property, we
need to make the following steps. Firstly, we define the neural network serving as the surrogate model in the PINN
framework.
Definition 4.1. (Neural Network). A deep neural network (DNN) $u_\theta : x = (x_1, \ldots, x_d) \in \mathbb{R}^d \mapsto u_\theta(x) \in \mathbb{R}$, parameterized by θ of depth L, is the composition of L linear functions with element-wise non-linearity σ, expressed as below:
$$u_\theta(x) = W_L\sigma(W_{L-1}\sigma(\cdots\sigma(W_1 x)\cdots)), \qquad (27)$$
where $x \in \mathbb{R}^d$ is the input, and $W_l \in \mathbb{R}^{m_l \times m_{l-1}}$ is the weight matrix at the l-th layer with $d = m_0$ and $m_L = 1$. The parameter vector θ is the vectorization of the collection of all parameters. We denote h as the maximal width of the neural network, i.e., $h = \max(m_L, \cdots, m_0)$.
For the nonlinear activation function σ, the residual ground truth R(x), we assume the following:
Assumption 4.1. We assume that the activation function is smooth and |σ (n) (x)| ≤ 1 and σ (n) is 1-Lipschitz, e.g., the
sine and cosine activations. We assume that |R(x)| ≤ R for all x.
Remark 1. The sine and cosine functions naturally satisfy the aforementioned conditions. As for the hyperbolic
tangent (tanh) activation function, when the order n of the PDE is determined, there exists a constant C n such that the
activation function C n tanh(x) satisfies the given assumptions. This constant can be absorbed into the weights of the
neural network. This is because both the tanh function and its derivatives up to order n are bounded.
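The following is a minimal PyTorch sketch of the network class in Definition 4.1 with the sine activation permitted by Assumption 4.1. The bias-free linear layers mirror the definition; the concrete depth and widths are illustrative.

```python
# Sketch of the DNN in Definition 4.1: L bias-free linear maps composed with an
# element-wise smooth activation; sine satisfies Assumption 4.1.
import torch

class SineMLP(torch.nn.Module):
    def __init__(self, d, width=128, depth=4):
        super().__init__()
        dims = [d] + [width] * (depth - 1) + [1]          # m_0 = d, ..., m_L = 1
        self.layers = torch.nn.ModuleList(
            torch.nn.Linear(dims[l], dims[l + 1], bias=False) for l in range(depth)
        )

    def forward(self, x):
        for layer in self.layers[:-1]:
            x = torch.sin(layer(x))                       # element-wise non-linearity sigma
        return self.layers[-1](x)                         # final linear map W_L

u_theta = SineMLP(d=100)
print(u_theta(torch.randn(8, 100)).shape)                 # torch.Size([8, 1])
```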
Recall that the full batch gradient is
$$g(\theta) = \frac{1}{N_r N_L^2}\sum_{i=1}^{N_r}\left(\mathcal{L}u_\theta(x_i) - R(x_i)\right)\left(\sum_{j=1}^{N_L}\frac{\partial}{\partial\theta}\mathcal{L}_j u_\theta(x_i)\right). \qquad (28)$$
The stochastic gradient produced by Algorithm 1 is
$$g_{B,J}(\theta) = \frac{1}{|B||J|N_L}\sum_{i\in B}\left(\mathcal{L}u_\theta(x_i) - R(x_i)\right)\left(\sum_{j\in J}\frac{\partial}{\partial\theta}\mathcal{L}_j u_\theta(x_i)\right). \qquad (29)$$
The stochastic gradient produced by Algorithm 2 is
$$g_{B,J,K}(\theta) = \frac{1}{|B||J|N_L}\sum_{i\in B}\left(\frac{N_L}{|K|}\sum_{k\in K}\mathcal{L}_k u_\theta(x_i) - R(x_i)\right)\left(\sum_{j\in J}\frac{\partial}{\partial\theta}\mathcal{L}_j u_\theta(x_i)\right), \qquad (30)$$
where B is the index set sampling over the residual points, while J and K sample the PDE terms in the forward pass
and the backward pass, respectively.
Lemma 4.1. (Bounding the Gradient Variance) Suppose that we consider the neural network function class:
$$\mathcal{NN}^L_M := \left\{x \mapsto W_L\sigma(W_{L-1}\sigma(\cdots\sigma(W_1 x)\cdots)) \;\middle|\; \|W_l\|_2 \le M(l)\right\}. \qquad (31)$$
If the PDE under consideration has at most nth order derivatives, then the gradient variance can be bounded by
$$\mathbb{V}_{B,J}(g_{B,J}(\theta)) \le \frac{2}{|B||J|N_L}\left(n!(L-1)^n M(L)\prod_{l=1}^{L-1}M(l)^n + R\right)^2\cdot\left((n+1)!(L-1)^{n+1}M(L)\prod_{l=1}^{L-1}M(l)^{n+1}\max_{x\in\Omega}\|x\|\right)^2,$$
$$\mathbb{V}_{B,J,K}(g_{B,J,K}(\theta)) \le \frac{2}{|B||J||K|}\left(n!(L-1)^n M(L)\prod_{l=1}^{L-1}M(l)^n + R\right)^2\cdot\left((n+1)!(L-1)^{n+1}M(L)\prod_{l=1}^{L-1}M(l)^{n+1}\max_{x\in\Omega}\|x\|\right)^2. \qquad (32)$$
Proof. The proof of this lemma is provided in Appendix A.2
Remark 2. Intuitively, given the order of the PDE and the number of layers in the neural network, it is observed that
both of them are finite. Consequently, the absolute values of the higher-order derivatives of the network are necessarily
bounded by a quantity that is related to the norms of its weight matrices.
Next, we define the SGD trajectory, and we will demonstrate the convergence of the SGD trajectory based on the
stochastic gradients provided by our Algorithms 1 and 2.
Definition 4.2. (SGD Trajectory) Given step sizes (learning rates) $\{\gamma_n\}_{n=1}^\infty$ and the initialized PINN parameters $\theta^1$, the update rules of SGD using Algorithms 1 and 2 are
$$\theta^{n+1} = \theta^n - \gamma_n g_{B,J}(\theta^n), \qquad (33)$$
$$\theta^{n+1} = \theta^n - \gamma_n g_{B,J,K}(\theta^n), \qquad (34)$$
respectively.
The following theorem demonstrates the convergence rate of SDGD in PINN training.
Theorem 4.3. (Convergence of SDGD under Bounded Gradient Variance) Fix some tolerance level δ > 0, suppose that the SGD trajectory given in equations (33, 34) is bounded, i.e., $\|W_l^n\| \le M(l)$ for all n where the collection of all $\{W_l^n\}_{l=1}^L$ is $\theta^n$, and that the minimizer of the PINN loss is $\theta^*$, and that the SGD step size follows the form $\gamma_n = \gamma/(n+m)^p$ for some p ∈ (1/2, 1] and large enough γ, m > 0, then
1. There exist neighborhoods U and U_1 of $\theta^*$ such that, if $\theta^1 \in U_1$, the event
$$E_\infty = \{\theta^n \in U \text{ for all } n \in \mathbb{N}\} \qquad (35)$$
occurs with probability at least 1 − δ, i.e., $P(E_\infty \mid \theta^1 \in U_1) \ge 1 - \delta$.
2. Conditioned on $E_\infty$, we have
$$\mathbb{E}[\|\theta^n - \theta^*\|^2 \mid E_\infty] \le O(1/n^p). \qquad (36)$$
Remark 3. The convergence rate of stochastic gradient descent is O(1/n p ), where O represents the constant involved,
including the variance of the stochastic gradient. A larger variance of the stochastic gradient leads to slower
convergence, while a smaller variance leads to faster convergence. The selection of the learning rate parameters γ
and m depends on the specific optimization problem itself. For example, in a PINN problem, suitable values of γ and
m ensure that the initial learning rate is around 1e-4, which is similar to the common practice for selecting learning
rates.
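As an illustration of the step-size schedule in Theorem 4.3, the sketch below implements $\gamma_n = \gamma/(n+m)^p$ with a LambdaLR scheduler; the values of γ, m, and p are illustrative, chosen as in Remark 3 so that the initial rate is roughly 1e-4.

```python
# Sketch of the decaying step size gamma_n = gamma / (n + m)^p from Theorem 4.3.
import torch

model = torch.nn.Linear(10, 1)
gamma, m, p = 0.01, 500.0, 0.75                      # p in (1/2, 1]; initial rate ~1e-4
opt = torch.optim.SGD(model.parameters(), lr=1.0)    # base lr of 1 so the lambda sets gamma_n directly
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda n: gamma / (n + m) ** p)

for n in range(1000):
    opt.zero_grad()
    loss = model(torch.randn(32, 10)).pow(2).mean()  # placeholder loss; the SDGD loss in practice
    loss.backward()
    opt.step()
    sched.step()                                     # gamma_n decays as O(1/n^p)
```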
Remark 4. In our convergence theorem, we have not made any unrealistic assumptions; all we require is the
boundedness of the optimization trajectory of the PINN. Empirically, as long as the learning rate and initialization
are properly chosen, PINN always converges, and the values of its weight matrices do not diverge to infinity. In the
literature, many convergence theorems for SGD require various conditions to be satisfied by the loss function of the
optimization problem itself, such as Lipschitz continuity. However, in our case, none of these conditions are required.
5 Experiments
Our main goal is to showcase the following properties and advantages of our method:
1. For Algorithm 1, reducing the residual point batch size and the PDE term batch size yields a comparable
acceleration effect under the same memory budget. In some cases, our SDGD can even outperform the traditional
SGD, demonstrating that SDGD can be an alternative form of SGD applied to PDE terms that is stable and
reliable.
2. The main objective of Algorithm 2 is to scale up and accelerate training for large-scale PINN problems. We
will demonstrate that while Algorithm 2 exhibits poorer convergence on small-scale problems due to increased
gradient variance, its convergence speed and accuracy improve significantly as the scale of the PDE problem
grows, such as in cases with tens of thousands of dimensions. In many of these large-scale instances, Algorithm
2 achieves convergence, whereas Algorithm 1 and full training only reach a few epochs without convergence.
Additionally, full training may lead to out-of-memory (OOM) errors directly.
3. We compare our PINN-based method with other non-PINN mainstream approaches for solving high-dimensional
PDEs [1, 19, 40], especially in terms of accuracy.
All the experiments are done on an NVIDIA A100 GPU with 40GB memory.
5.1 Hamilton-Jacobi-Bellman (HJB) Equation with Linear Solution and Black-Scholes-Barenblatt (BSB) Equation: Scaling-up to 100,000D, and Comparisons with Baselines
This experiment will demonstrate how Algorithms 1 and 2 are superior to the commonly used full-batch gradient
descent on training PINNs, and how our SDGD-PINNs method can surpass strong non-PINNs baselines, on several
nonlinear PDEs.
5.1.1 Experimental Design
We consider the following two nonlinear PDEs:
• Hamilton-Jacobi-Bellman equation with linear solution (HJB-Lin) [46] with the terminal condition given below:
$$\partial_t u(x,t) + \Delta u(x,t) - \frac{1}{d}\sum_{i=1}^{d}|\partial_{x_i}u|^c = -2, \quad x \in \mathbb{R}^d,\ t \in [0, T], \qquad u(x, T) = \sum_{i=1}^{d}x_i, \qquad (37)$$
where c = 1, T = 1 and the dimension d can be arbitrarily chosen. The analytic solution is:
$$u(x, t) = \sum_{i=1}^{d}x_i + T - t. \qquad (38)$$
In this example, we decompose the residual prediction of PINN along each dimension as follows:
$$\partial_t u(x,t) + \Delta u(x,t) - \frac{1}{d}\sum_{i=1}^{d}|\partial_{x_i}u|^c = \frac{1}{d}\sum_{i=1}^{d}\left\{\partial_t u(x,t) + d\,\partial_{x_i}^2 u(x,t) - \frac{1}{d}\sum_{j=1}^{d}|\partial_{x_j}u|^c\right\}, \qquad (39)$$
since the first-order derivative is much cheaper and is not the main bottleneck.
• Black-Scholes-Barenblatt (BSB) equation
$$u_t = -\frac{1}{2}\sigma^2\sum_{i=1}^{d}x_i^2 u_{x_i x_i} + r\left(u - \sum_{i=1}^{d}x_i u_{x_i}\right), \qquad u(x, T) = \|x\|^2, \qquad (40)$$
where T = 1, σ = 0.4, r = 0.05 with the following analytic solution
$$u(x, t) = \exp\left((r + \sigma^2)(T - t)\right)\|x\|^2. \qquad (41)$$
In this example, we decompose the residual prediction of PINN along each dimension as follows:
$$u_t + \frac{1}{2}\sigma^2\sum_{i=1}^{d}x_i^2 u_{x_i x_i} - r\left(u - \sum_{i=1}^{d}x_i u_{x_i}\right) = \frac{1}{d}\sum_{i=1}^{d}\left\{u_t + \frac{1}{2}\sigma^2 d\, x_i^2 u_{x_i x_i} - r\left(u - \sum_{j=1}^{d}x_j u_{x_j}\right)\right\}. \qquad (42)$$
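As a quick sanity check (ours, not part of the original text), the analytic solution (38) indeed satisfies (37) with c = 1: for $u(x,t) = \sum_{i=1}^{d}x_i + T - t$ we have $\partial_t u = -1$, $\Delta u = 0$, and $|\partial_{x_i}u| = 1$ for every i, so
$$\partial_t u + \Delta u - \frac{1}{d}\sum_{i=1}^{d}|\partial_{x_i}u|^c = -1 + 0 - \frac{1}{d}\cdot d = -2,$$
and the terminal condition $u(x, T) = \sum_{i=1}^{d}x_i$ holds by construction.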
Using consistent hyperparameter settings, we explore varying PDE dimensions in the range of $10^2, 10^3, 10^4, 10^5$.
This allows us to assess the acceleration achieved by our algorithm and whether PINNs can overcome the CoD.
Specifically, for the hyperparameters of PINNs, we employ a four-layer PINN with 1024 hidden units, trained using
the Adam optimizer [30] with an initial learning rate of 1e-3, exponentially decaying with a factor of 0.9995. Our
testing datasets consist of 2e4 points, sampled from x ∼ N (0, I) and t ∼ Unif[0, 1]. The residual points are sampled
from the same distribution. The testing criterion is the relative $L^2$ error between the model and the solution given by
$$\frac{\sqrt{\sum_{i=1}^{N_{\text{test}}}|u_{\text{pred}}(X_i, t_i) - u(X_i, t_i)|^2}}{\sqrt{\sum_{i=1}^{N_{\text{test}}}|u(X_i, t_i)|^2}}, \qquad (43)$$
where $u_{\text{pred}}$ is the prediction model and u is the true solution, $N_{\text{test}}$ is the number of test points, and $(X_i, t_i)$ is the i-th test point.
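A direct PyTorch sketch of the relative L² test error (43) is shown below; the prediction function and test data are placeholders that mimic the sampling described above.

```python
# Sketch of the relative L2 test error in Eq. (43).
import torch

def relative_l2_error(u_pred, u_true, X, t):
    # u_pred, u_true: callables mapping (X, t) to a tensor of shape [N_test]
    diff = u_pred(X, t) - u_true(X, t)
    return torch.linalg.norm(diff) / torch.linalg.norm(u_true(X, t))

# Placeholder test set mimicking the paper's sampling: x ~ N(0, I), t ~ Unif[0, 1].
N_test, d = 20000, 100
X, t = torch.randn(N_test, d), torch.rand(N_test)
u_true = lambda X, t: X.sum(dim=1) + 1.0 - t           # HJB-Lin analytic solution with T = 1
u_pred = lambda X, t: X.sum(dim=1) + 0.9 * (1.0 - t)   # stand-in for a trained PINN
print(relative_l2_error(u_pred, u_true, X, t).item())
```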
Due to the large scale of the terminal condition, such as in HJB-Lin, where it follows a Gaussian distribution
with zero mean and variance of d, or in the BSB equation, where it follows a chi-squared distribution with a mean of
d, there is a high probability of sampling points with large values in high-dimensional space. Directly learning the
terminal condition using the boundary loss can result in numerical overflow and optimization difficulties. Therefore,
we modified the PINN structure to satisfy the terminal condition automatically. This allows us to focus solely on
the residual loss.

PDE     | 10^2  | 10^3  | 10^4   | 10^5
HJB-Lin | 34min | 68min | 81min  | 310min
BSB     | 31min | 57min | 118min | 41min

Table 1: This table presents the convergence time required by our SDGD for different PDEs. In the HJB-Lin equation,
as the dimensionality increases from 100 to 100,000, the dimensionality grows by a factor of 1000, while the time only
increases by a factor of ten. This indicates that our method can withstand the curse of dimensionality. In the second
BSB equation, surprisingly, the high-dimensional case converges faster than the low-dimensional case, demonstrating
the so-called blessing of dimensionality and that our method can harness the blessing of dimensionality.
Specifically, for the HJB-Lin and BSB equations with the terminal condition u(x, T) = g(x), we consider the following PINN models:

u_θ^{HJB-Lin}(x, t) = u_θ(x, t)(T − t) + Σ_{i=1}^d x_i,
u_θ^{BSB}(x, t) = ((T − t) u_θ(x, t) + 1) ∥x∥²,      (44)

where u_θ is a vanilla multi-layer perceptron, while u_θ^{HJB-Lin} and u_θ^{BSB} are the prediction models for the HJB-Lin and BSB equations, respectively.
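The hard-constraint outputs of Eq. (44) can be wrapped around any backbone network; a minimal PyTorch sketch under assumed names (backbone, HJBLinAnsatz, BSBAnsatz) is shown below. By construction, u(x, T) matches the terminal condition exactly, so only the residual loss needs to be trained.

```python
import torch
import torch.nn as nn

class HJBLinAnsatz(nn.Module):
    """u(x,t) = backbone(x,t) * (T - t) + sum_i x_i, so u(x,T) is exact by construction."""
    def __init__(self, backbone, T=1.0):
        super().__init__()
        self.backbone, self.T = backbone, T

    def forward(self, x, t):
        return self.backbone(x, t) * (self.T - t) + x.sum(dim=1, keepdim=True)

class BSBAnsatz(nn.Module):
    """u(x,t) = ((T - t) * backbone(x,t) + 1) * ||x||^2, so u(x,T) = ||x||^2 exactly."""
    def __init__(self, backbone, T=1.0):
        super().__init__()
        self.backbone, self.T = backbone, T

    def forward(self, x, t):
        return ((self.T - t) * self.backbone(x, t) + 1.0) * (x ** 2).sum(dim=1, keepdim=True)
```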
For the HJB-Lin equation, the initialization is already sufficiently accurate regarding the L² error of the prediction for u, as the Σ_{i=1}^d x_i scale is much larger than T − t. Therefore, we focus on testing the accuracy of ∂_t u, which is a nontrivial task:

sqrt( Σ_{i=1}^{N_test} |∂_t u_pred(X_i, t_i) − ∂_t u(X_i, t_i)|² ) / sqrt( Σ_{i=1}^{N_test} |∂_t u(X_i, t_i)|² ),      (45)

where u_pred is the prediction model and u is the true solution.
For the baselines [1, 19, 40], it is necessary to specify the initial point of the stochastic process corresponding to the PDE. These methods can only make predictions at this initial point, hence reporting the relative L¹ error at a single point. We will also report the error of PINN at this initial point. We choose the initial points as (X_0, t_0) = (x_1 = 1, x_2 = 1, ..., x_d = 1, t = 0) for the BSB equation and (X_0, t_0) = (x_1 = 0, x_2 = 0, ..., x_d = 0, t = 0) for the HJB-Lin equation, and the corresponding error is

|u_pred(X_0, t_0) − u(X_0, t_0)| / |u(X_0, t_0)|,      (46)

where u_pred is the prediction model and u is the true solution.
5.1.2 Results of Our Method and Full-Batch GD on Training PINN
This subsection presents the results comparing our Algorithms 1 and 2 with full batch gradient descent when training
PINN for the BSB and HJB equations, which are demonstrated in Figures 1 and 2.
Specifically, the captions of Figures 1 and 2 illustrate different algorithms’ speeds and memory costs in the HJB and
BSB equations across various dimensions. Regarding memory cost, full gradient descent encounters OOM problems in
100,000 dimensions even with only one residual point, while SDGD can handle even higher-dimensional problems. In
terms of speed, under comparable memory costs, Algorithm 2 consistently exhibits the highest speed, followed by
Algorithm 1, while full batch GD is the slowest. Notably, in high-dimensional cases, our Algorithm 2 demonstrates
significant acceleration compared to other algorithms. In lower dimensions, the acceleration effect of Algorithm 2 is
less pronounced, and Algorithm 1 suffices.
In the case of the 1000-dimensional BSB equation, Algorithm 2 exhibits a shorter time per epoch compared to Algorithm 1. However, its convergence results are inferior, suggesting that Algorithm 2 gains speed at the cost of larger gradient variance due to the increased random sampling in both forward and backward passes. Nevertheless, as the
dimension of the PDE increases, the advantageous fast convergence of Algorithm 2 becomes more evident. For
instance, on the largest 100,000-dimensional problem, Algorithm 2 achieved convergence with a relative error of less
than 3.477e-5 (HJB) / 4.334e-3 (BSB) in under 4 hours. In contrast, full GD encountered OOM errors, and the speed of
Algorithm 1 was extremely slow, running only a few epochs within the same time frame and still far from convergence.
Meanwhile, in the 100,000-dimensional HJB equation with a relatively small batch size (100 PDE terms and 100 residual points), we observed a significant increase in the relative L² error of Algorithm 2 during training, attributable to the heightened variance of the gradient.
[Figure 1 panels (image omitted): relative L² error of u_t versus wall-clock time for the 100D, 1,000D, 10,000D, and 100,000D HJB-Lin equation. Legend settings — 100D: Algorithm 1 (1000-point-10-term, 2464MB, 0.14s/it), full batch GD (100-point-100-term, 2406MB, 0.22s/it); 1,000D: Algorithm 1 (1000-point-100-term, 9386MB, 1.29s/it), Algorithm 2 (1000-point-100-term, 9386MB, 0.49s/it), full batch GD (100-point-1000-term, 9660MB, 2.15s/it); 10,000D: Algorithm 1 (100-point-100-term, 6920MB, 10.01s/it), Algorithm 2 (100-point-100-term, 6920MB, 0.39s/it), full batch GD (1-point-10000-term, 7272MB, 21.12s/it); 100,000D: Algorithm 1 (100-point-100-term, 36896MB, 391.88s/it), Algorithm 2 (100-point-100-term, 36896MB, 2.15s/it).]
Figure 1: High-dimensional HJB-Lin equation: convergence, memory cost, and speed of different methods under various dimensions with detailed hyperparameter settings. Blue: our Algorithm 1; Red: our Algorithm 2; Green: baseline full batch GD over PDE terms. The x-axis represents the running time, while the y-axis represents the relative L² error of u_t in the HJB equation (a nontrivial test). At initialization, the relative L² error is always nearly 1000%, leading to a nontrivial learning problem for PINNs. In the 100-dimensional case, our Algorithm 1 has a convergence speed similar to the baseline, but the latter converges to a better error. In 100D and 1000D, our method completes the same number of epochs faster, resulting in a shorter total running time. In even higher dimensions, we ensure that the running times of different methods are similar, and our method consistently achieves faster convergence for dimensionality higher than 1000. In extremely high-dimensional scenarios, the baseline full batch gradient descent encounters memory limitations, whereas our Algorithm 2 is capable of rapid convergence. However, in the 100,000-dimensional case, although convergence was achieved, the convergence curve of our Algorithm 2 exhibited some instability, which we attribute to the small batch size and the resulting gradient variance.
[Figure 2 panels (image omitted): relative L² error versus wall-clock time for the 100D, 1,000D, 10,000D, and 100,000D BSB equation. Legend settings — 100D: Algorithm 1 (1000-point-10-term, 2386MB, 0.18s/it), full batch GD (100-point-100-term, 2190MB, 0.32s/it); 1,000D: Algorithm 1 (1000-point-100-term, 11054MB, 1.58s/it), Algorithm 2 (1000-point-100-term, 11054MB, 0.85s/it), full batch GD (100-point-1000-term, 10764MB, 3.35s/it); 10,000D: Algorithm 1 (100-point-1000-term, 30430MB, 20.59s/it), Algorithm 2 (100-point-1000-term, 30430MB, 4.77s/it), full batch GD (10-point-10000-term, 29892MB, 42.51s/it); 100,000D: Algorithm 1 (100-point-100-term, 40054MB, 391.88s/it), Algorithm 2 (100-point-100-term, 40054MB, 1.58s/it).]
Figure 2: High-dimensional BSB equation: convergence, memory cost, and speed of different methods under various
dimensions with detailed hyperparameter settings. Blue: Our Algorithm 1; Red: Our Algorithm 2; Green: Baseline
full batch GD over PDE terms. The x-axis represents the running time, while the y-axis represents the relative L 2
error of the BSB equation. At initialization, the relative L 2 error is always larger than 10%, leading to a nontrivial
learning problem for PINNs. In the 100-dimensional and 1000-dimensional cases, our method demonstrates faster
completion within the same number of epochs, resulting in a shorter total running time. In higher dimensions, we
ensure that the running times of different methods are similar. Our method consistently achieves faster convergence. In
extremely high-dimensional scenarios, the baseline full batch gradient descent encounters memory limitations, whereas
our Algorithm 2 is capable of rapid convergence.
Relative L² error of u (BSB equation):
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         N.A.       N.A.       N.A.       N.A.
DeepBSDE [19]      N.A.       N.A.       N.A.       N.A.
DeepSplitting [1]  N.A.       N.A.       N.A.       N.A.
Ours               6.182E-4   1.284E-3   3.571E-3   4.486E-3

Relative L¹ error on one point (BSB equation):
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         1.378E-5   8.755E-5   4.003E-5   3.559E-5
DeepBSDE [19]      1.106E-5   5.184E-4   3.013E-4   4.519E-4
DeepSplitting [1]  2.048E-3   1.942E-3   1.992E-3   2.109E-3
Ours               6.781E-6   4.915E-5   1.084E-5   1.972E-5

Table 2: Results of various methods on the BSB equation with different dimensions and evaluation metrics.
Relative L² error of u (HJB-Lin equation):
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         N.A.       N.A.       N.A.       N.A.
DeepBSDE [19]      N.A.       N.A.       N.A.       N.A.
DeepSplitting [1]  N.A.       N.A.       N.A.       N.A.
Ours               1.159E-4   3.508E-4   1.216E-5   1.937E-7

Relative L² error of u_t (HJB-Lin equation):
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         N.A.       N.A.       N.A.       N.A.
DeepBSDE [19]      N.A.       N.A.       N.A.       N.A.
DeepSplitting [1]  N.A.       N.A.       N.A.       N.A.
Ours               4.918E-3   3.243E-3   2.655E-3   3.792E-5

Relative L¹ error on one point (HJB-Lin equation):
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         6.327E-5   7.062E-5   4.916E-5   5.218E-5
DeepBSDE [19]      N.A.       N.A.       N.A.       N.A.
DeepSplitting [1]  1.034E-2   Diverge    Diverge    Diverge
Ours               2.711E-5   6.208E-6   2.513E-5   3.614E-5

Table 3: Results of various methods on the HJB-Lin equation with different dimensions and evaluation metrics.
Thus, Algorithm 1 effectively reduces the memory cost of PINNs while keeping the gradient variance small, and it accelerates convergence through gradient randomness. Algorithm 2 further enhances iteration speed and reduces memory requirements, albeit at the expense of increased gradient variance and somewhat compromised convergence results. However, in higher-dimensional cases, the acceleration provided by Algorithm 2 outweighs the impact of gradient variance.
In addition to the stochastic gradient acceleration employed in our method, the symmetry of the PDE plays a
crucial role in achieving convergence. These PDEs exhibit symmetry along each x i axis regardless of dimensionality.
This symmetry reduces gradient variance in SDGD and random sampling during the forward pass in Algorithm 2,
facilitating rapid convergence in our approach.
In conclusion, the experimental results demonstrate that Algorithm 1 and Algorithm 2 exhibit significant advantages
over full batch gradient descent in training PINNs. This is particularly evident in the case of very high dimensions,
where their acceleration and computational efficiency are especially pronounced.
5.1.3 Comparison Between SDGD-PINNs and Strong Baselines
This subsection presents the results comparing SDGD based on PINN and other strong non-PINN baselines [1, 19, 40]
for the BSB and HJB equations.
Tables 2 and 3 showcase the comparative results between our algorithm and the baselines on the BSB and HJB
equations, respectively. Specifically, we present the prediction performance of various models on both the solution field
(u and/or u t ) and individual points, which are introduced in the experimental setup. If predictions are to be made across
the entire domain, only our PINN-based approach can achieve that, while other methods can only make predictions at
a single point. This showcases the advantage of PINN in not requiring a mesh/grid, unlike other methods that rely on
temporal discretization. Even in predicting at a single point, PINN performs exceptionally well and achieves the best
results among various methods. This is because although PINN is trained across the entire domain, its neural network
possesses strong interpolation capabilities, enabling accurate predictions at single points. Other methods perform
poorly on these PDEs because the scale of the entire PDE solution is very large, leading to potential issues such as
numerical overflow or gradient explosion. However, in the case of PINNs, we cleverly employ boundary/terminal condition augmentation to mitigate these issues.
Allen-Cahn equation:
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         1.339E-3   2.184E-3   7.029E-3   5.392E-2
DeepBSDE [19]      4.581E-3   2.516E-3   2.980E-3   2.975E-3
DeepSplitting [1]  3.642E-3   1.563E-3   2.369E-3   2.962E-3
Ours               7.815E-4   3.142E-4   7.042E-4   2.477E-4

Semilinear heat equation:
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         3.872E-3   8.323E-4   7.705E-3   1.426E-2
DeepBSDE [19]      3.024E-3   3.222E-3   3.510E-3   4.279E-3
DeepSplitting [1]  2.823E-3   3.432E-3   3.439E-3   3.525E-3
Ours               1.052E-3   5.263E-4   6.910E-4   1.598E-3

Sine-Gordon equation:
Method             Dim=10^2   Dim=10^3   Dim=10^4   Dim=10^5
FBSNN [40]         2.641E-3   4.283E-3   2.343E-1   5.185E-1
DeepBSDE [19]      3.217E-3   2.752E-3   2.181E-3   2.630E-3
DeepSplitting [1]  3.297E-3   2.674E-3   2.170E-3   2.249E-3
Ours               2.265E-3   2.310E-3   1.674E-3   1.956E-3

Table 4: Results of various methods on the Allen-Cahn, semilinear heat, and sine-Gordon PDEs with different dimensions. The evaluation metric is the relative L¹ error on one test point.
5.2 More Comparisons with Strong Baselines [1, 19, 40]
This subsection presents further results comparing our PINN-based method with other strong non-PINN baselines [1, 19, 40] on several nonlinear PDEs without analytical solutions. Thus, we adopt the other methods' setting and evaluate the model at one test point, where we show that PINNs still outperform the other competitors.
• Allen-Cahn equation:

∂_t u(x, t) = ∆u(x, t) + u(x, t) − u(x, t)³,   (x, t) ∈ R^d × [0, 0.3],      (47)

with the initial condition u(x, 0) = arctan(max_i x_i). We aim to approximate the solution's true value at the test point (x, t) = (0, ..., 0, 0.3).

• Semilinear heat equation:

∂_t u(x, t) = ∆u(x, t) + (1 − u(x, t)²)/(1 + u(x, t)²),   (x, t) ∈ R^d × [0, 0.3],      (48)

with the initial condition u(x, 0) = 5/(10 + 2∥x∥²). We aim to approximate the solution's true value at the test point (x, t) = (0, ..., 0, 0.3).

• Sine-Gordon equation:

∂_t u(x, t) = ∆u(x, t) + sin(u(x, t)),   (x, t) ∈ R^d × [0, 0.3],      (49)

with the initial condition u(x, 0) = 5/(10 + 2∥x∥²). We aim to approximate the solution's true value at the test point (x, t) = (0, ..., 0, 0.3).
The reference values of these PDEs on the test point are computed by the multilevel Picard method [5, 25] with
sufficient theoretical accuracy. The residual points of PINN are chosen along the trajectories of the stochastic processes
that those PDEs correspond to, see the last section for details.
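One plausible way to generate such trajectory-based residual points is sketched below; the √2 scaling of the Brownian motion is our assumption, chosen to be consistent with the Laplacian term, and all names and path/step counts are illustrative.

```python
# Hedged sketch: residual points along simulated trajectories X_s = x0 + sqrt(2) * W_s.
import numpy as np

def residual_points_along_trajectories(x0, T=0.3, n_paths=100, n_steps=30, seed=0):
    rng = np.random.default_rng(seed)
    d = x0.shape[0]
    dt = T / n_steps
    ts = np.linspace(0.0, T, n_steps + 1)                      # (n_steps + 1,)
    # Brownian increments for all paths and steps.
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps, d))
    paths = x0 + np.sqrt(2.0) * np.cumsum(dW, axis=1)          # (n_paths, n_steps, d)
    paths = np.concatenate([np.tile(x0, (n_paths, 1, 1)), paths], axis=1)
    # Flatten (trajectory snapshot, time) pairs into residual points (x, t).
    xs = paths.reshape(-1, d)
    t_col = np.tile(ts, n_paths).reshape(-1, 1)
    return xs, t_col

# Example: residual points for the 100-dimensional Allen-Cahn test point x0 = 0.
xs, ts = residual_points_along_trajectories(np.zeros(100))
```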
The results for these three nonlinear PDEs are shown in Table 4. PINN surpasses other methods in all dimensions
and for all types of PDEs. Whether it is predicting the entire domain or the value at a single point, PINN is capable of
handling both scenarios. This is attributed to the mesh-free nature of PINN and its powerful interpolation capabilities.
5.3 PINN and Adversarial Training (L∞ Loss)
In Wang et al. [46], it was demonstrated that the class of Hamilton-Jacobi-Bellman (HJB) equations [19, 46] could only be effectively solved using adversarial training, which approximates the L∞ loss.
HJB-LQG results for u (each cell reports running time / relative L² error):
Dim       Wang et al. [46]       He et al. [21]         SDGD (Ours)
250       38 hours / 1.182E-2    12 hours / 1.370E-2    210 minutes / 6.147E-3
1,000     >5 days / N.A.         >5 days / N.A.         217 minutes / 5.852E-3
10,000    OOM / N.A.             OOM / N.A.             481 minutes / 4.926E-2
100,000   OOM / N.A.             OOM / N.A.             855 minutes / 4.852E-2

HJB-Lin results for ∂_t u (each cell reports running time / relative L² error):
Dim       Wang et al. [46]       He et al. [21]         SDGD (Ours)
250       38 hours / 1.874E-3    12 hours / 1.769E-3    210 minutes / 4.710E-4
1,000     >5 days / N.A.         >5 days / N.A.         217 minutes / 4.455E-4
10,000    OOM / N.A.             OOM / N.A.             481 minutes / 3.609E-4
100,000   OOM / N.A.             OOM / N.A.             855 minutes / 1.305E-3

Table 5: Relative L² error and running time across different dimensions for the baselines and our method, for two HJB equations requiring adversarial training of PINNs. In all dimensions, our method significantly outperforms the two baselines in terms of accuracy and speed. The baselines, employing full batch gradient descent, experience slower speed and increased memory requirements in higher-dimensional scenarios. In contrast, our approach utilizes Algorithm 2, which employs stochastic gradient descent over PDE terms, resulting in faster computation and relatively lower memory consumption. Additionally, the symmetry of the PDE guarantees the accuracy of our method.
The first one is the HJB equation with linear-quadratic-Gaussian (LQG) control:

∂_t u(x, t) + ∆u(x, t) − ∥∇_x u(x, t)∥² = 0,   x ∈ R^d, t ∈ [0, 1],
u(x, T) = g(x),      (50)

where g(x) is the terminal condition to be chosen, with the analytic solution:

u(x, t) = −log( ∫_{R^d} (2π)^{−d/2} exp(−∥y∥²/2) exp(−g(x − √(2(1 − t)) y)) dy ).      (51)

We choose the cost function g(x) = log((1 + ∥x∥²)/2), following [19, 21, 40, 46], to obtain the HJB-LQG case. It is worth noting that the integral in the solution of HJB-LQG-Log cannot be evaluated analytically, necessitating Monte Carlo integration for approximation. We use the relative L² error of the approximation of u as the evaluation metric for this equation.
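Because Eq. (51) is an expectation over a standard Gaussian, the reference value at a given (x, t) can be estimated by plain Monte Carlo, as in the following sketch (function names and sample counts are ours):

```python
import numpy as np

def hjb_lqg_reference(x, t, g, n_samples=10**6, seed=0):
    """Monte Carlo estimate of Eq. (51): u(x,t) = -log E_y[exp(-g(x - sqrt(2(1-t)) y))], y ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    y = rng.standard_normal((n_samples, d))
    vals = np.exp(-g(x - np.sqrt(2.0 * (1.0 - t)) * y))
    return -np.log(vals.mean())

# Cost function used in the HJB-LQG-Log case.
g_log = lambda z: np.log((1.0 + np.sum(z**2, axis=-1)) / 2.0)
u_ref = hjb_lqg_reference(np.ones(250), t=0.0, g=g_log)
```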
The second HJB equation requiring the L∞ loss and adversarial training is the previous HJB-Lin equation but with a different PDE operator:

∂_t u(x, t) + ∆u(x, t) − (1/d) Σ_{i=1}^d |∂_{x_i} u|^c = −2,   x ∈ R^d, t ∈ [0, T],
u(x, T) = Σ_{i=1}^d x_i,      (52)

where c = 1.75, T = 1, and the dimension d can be chosen arbitrarily. The analytic solution is u(x, t) = Σ_{i=1}^d x_i + T − t. Specifically, Wang et al. [46] demonstrated that as c increases within the range 1–2, the effectiveness of the L² PINN training loss diminishes, and PINN relies more on the L∞ norm loss to achieve a smaller testing error. For this equation, we also evaluate the more challenging L² error of ∂_t u.
In this example of the HJB equation with LQG control, we decompose the residual prediction of PINN along each dimension as follows:

∂_t u(x, t) + ∆u(x, t) − µ∥∇_x u(x, t)∥² = (1/d) Σ_{i=1}^d [ ∂_t u(x, t) + d ∂²u/∂x_i² − µ∥∇_x u(x, t)∥² ].      (53)

For the HJB equation with linear solution, we decompose the residual prediction of PINN along each dimension as in the previous example.
Despite its ability to maintain a low maximal memory cost during training, adversarial training approximating the L∞ loss is widely recognized for its slow and inefficient nature [43], which poses challenges in applying it to high-dimensional HJB equations with LQG control.
[Figure 3 panels (image omitted): relative L² error versus wall-clock time for the Fokker-Planck equation in 100D, 1,000D, and 10,000D, isotropic (first row) and anisotropic (second row). Legend settings — 100D: Algorithm 1 (1000-point-10-term, 2310MB, 0.16s/it; 400-point-25-term, 2310MB, 0.17s/it; 200-point-50-term, 2310MB, 0.23s/it), full batch GD (100-point-100-term, 2384MB, 0.25s/it); 1,000D: Algorithm 1 (100-point-100-term, 2520MB, 1.27s/it), Algorithm 2 (100-point-100-term, 2520MB, 0.41s/it), full batch GD (10-point-1000-term, 2546MB, 2.31s/it); 10,000D: Algorithm 1 (100-point-100-term, 4684MB, 12.76s/it), Algorithm 2 (100-point-100-term, 4684MB, 0.42s/it), full batch GD (1-point-10000-term, 5368MB, 23.37s/it).]
Figure 3: High-dimensional FP equation corresponding to the Brownian motion: convergence, memory cost, and speed
of different methods under various dimensions with detailed hyperparameter settings. Blue: Our Algorithm 1; Red:
Our Algorithm 2; Green: Baseline full batch GD over PDE terms. The x-axis represents the running time, while the
y-axis represents the relative L 2 error of the FP equation corresponding to the Brownian motion. First row: isotropic
problem. Second row: anisotropic problem.
Adversarial training involves optimizing two loss functions: one
for the PINN parameters, minimizing the loss through gradient descent, and another for the residual point coordinates,
maximizing the PINN loss to approximate the L ∞ loss. This adversarial minimax process, known as adversarial
training, is computationally demanding, often requiring multiple rounds of optimization before the resulting residual
points can effectively optimize the PINN parameters.
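The following hedged sketch illustrates one such minimax update, using a PGD-style inner maximization over the residual-point coordinates followed by a descent step on the PINN parameters; residual_loss, the sign-gradient ascent, and the step sizes are our assumptions rather than the exact procedure of [46]. In our setting, residual_loss itself would be built with the SDGD estimator of Algorithm 2.

```python
import torch

def adversarial_step(residual_loss, model, optimizer, x, t, inner_steps=5, inner_lr=0.01):
    # Inner loop: move residual points toward high-loss regions (approximate L-infinity loss).
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(inner_steps):
        loss_pts = residual_loss(model, x_adv, t)          # assumed to return a scalar
        grad_x = torch.autograd.grad(loss_pts, x_adv)[0]
        x_adv = (x_adv + inner_lr * grad_x.sign()).detach().requires_grad_(True)

    # Outer loop: one gradient-descent step on the PINN parameters at the adversarial points.
    optimizer.zero_grad()
    loss = residual_loss(model, x_adv, t)
    loss.backward()
    optimizer.step()
    return loss.item()
```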
The current state-of-the-art works by Wang et al. [46] and He et al. [21] have successfully scaled adversarial training to 250 dimensions. In this study, we integrate our proposed approach with adversarial training to enhance its scalability and efficiency for high-dimensional HJB equations that require adversarial training of PINNs.
We use a four-layer PINN with 1024 hidden units, trained by the Adam optimizer [30] with an initial learning rate of 1e-3 that decays linearly to zero; 2e4 test points are employed. We modified the PINN structure as in equation (44) to make the terminal condition automatically satisfied:

u_θ^{HJB-LQG}(x, t) = u_θ(x, t)(1 − t) + g(x),      (54)

where g(x) is the terminal condition, which is g(x) = log((1 + ∥x∥²)/2) in the HJB-LQG-Log case and g(x) = Σ_{i=1}^d x_i in the HJB-LQG-Lin case. Thus, we only need to focus on the residual loss. We conduct training for a total of 10,000 epochs. In all dimensions, we employ Algorithm 2 with 100 batches of PDE terms in the loss function for gradient descent, and Algorithm 2 with 10 batches of PDE terms in the loss function for adversarial training. For both our method and the baseline methods proposed by Wang et al. [46], we randomly sample 100 residual points per epoch.
The results of adversarial training on the two HJB equations are presented in Table 5. For instance, in the 250D
HJB-LQG case, achieving a relative L 2 error of 1e-2 using [46] requires a training time of 1 day. In contrast, our
approach achieves a superior L 2 error in just 3.5 hours. In the 1000D case, full batch GD’s adversarial training is
exceedingly slow, demanding over 5 days of training time. However, our method effectively mitigates this issue and
attains a relative L 2 error of 5.672E-3 in 217 minutes. Moreover, our approach enables adversarial training even in
higher-dimensional HJB equations, specifically 10,000 and 100,000 dimensions, with resulting relative L 2 errors of
5.026E-2 and 3.838E-2, respectively. For the HJB-Lin case, our method is also consistently better than others. The
success of our method hinges on the accelerated random gradient and the equation’s symmetry.
5.4 Fokker-Planck PDE: Anisotropic Problem
So far, the PDEs we have tested are isotropic and symmetric across dimensions. This choice is motivated by commonly used benchmark PDEs in real-world applications that exhibit these properties [19, 46]. However, it is worth exploring whether our method can converge on asymmetric and anisotropic PDEs. Therefore, we design an anisotropic Fokker-Planck (FP) equation corresponding to the Brownian motion as follows:
∂_t p = −Σ_{i=1}^d µ_i ∂_{x_i} p + (1/2) Σ_{i=1}^d ∂²_{x_i} p,   t ∈ [0, 1], x ∈ B^d,
u(x, 0) = ∥x∥²,
u(x, t) given,   t ∈ [0, 1], x ∈ S^{d−1},      (55)

where B^d and S^{d−1} are the d-dimensional unit ball and sphere, respectively. Its solution is

u(x, t) = Σ_{i=1}^d (x_i − µ_i t)² + dt.      (56)

Here each µ_i is randomly sampled from N(1, 1) for the anisotropic case, and all µ_i = 1 for the isotropic case. Specifically, we will compare the convergence results of our method in the two cases to observe the robustness of our approach to anisotropic problems. This anisotropic case is not only asymmetric in its differential operators but also features an anisotropic solution across different axes.
We use the following network architecture to satisfy the initial condition automatically:

u_θ^{Brownian}(x, t) = u_θ(x, t) t + ∥x∥².      (57)

In this example of the FP equation corresponding to the Brownian motion, we decompose the residual prediction of PINN along each dimension as follows:

∂_t p + Σ_{i=1}^d µ_i ∂_{x_i} p − (1/2) Σ_{i=1}^d ∂²_{x_i} p = (1/d) Σ_{i=1}^d [ ∂_t p + Σ_{j=1}^d µ_j ∂_{x_j} p − (d/2) ∂²_{x_i} p ].      (58)
We employ a four-layer PINN with 1024 hidden units, trained using the Adam optimizer [30] with an initial learning rate of 1e-3, exponentially decaying with a factor of 0.9995. Our testing datasets consist of 2e4 points. For the boundary condition, we sample random boundary points for all equations and use the regular boundary loss. The points are sampled uniformly from the bounded domain.
The results for the FP equation corresponding to the Brownian motion are presented in Figure 3, where the first row
presents the convergence curve under different dimensions in the isotropic problem, while the second row showcases
those for the anisotropic problem.
Regarding the speed of our algorithms compared to full batch gradient descent, in the case of 100 dimensions
(100D), we vary the PDE term batch size and the number of collocation points for Algorithm 1 to match the memory
cost of the full batch gradient descent (GD) baseline, aiming to observe the stability of our algorithm with different
batch sizes. Regardless of the chosen PDE term batch size, our Algorithm 1 consistently converges faster than the
baseline, demonstrating the stability and fast convergence of the stochastic gradient descent (SGD) over the PDE terms
approach. For higher dimensions, we include the results of Algorithm 2, which consistently outperforms Algorithm 1,
while Algorithm 1, in turn, outperforms the baseline. This highlights the acceleration capability of Algorithm 2 in
high-dimensional scenarios.
Regarding our algorithm’s robustness in the anisotropic problem, by comparing the convergence curves of our
method in isotropic and anisotropic cases at the same dimensionality, we observe that our approach maintains fast and
stable convergence even in anisotropic cases. Furthermore, the final convergence results do not deteriorate significantly
in the anisotropic case. Concretely, the relative L 2 errors of the isotropic problem under 100D, 1000D, and 10000D
are (4.708E-3, 2.223E-3, 3.800E-3), respectively, while those for the anisotropic problem are (4.906E-3, 2.623E-3,
3.715E-3). Therefore, our method is applicable to both isotropic and anisotropic cases.
Note that this FP equation is solved on a bounded domain, specifically, the unit ball. Generally, problems on
bounded domains are somewhat easier because PDEs on unbounded domains lack information at infinity.
5.5 Schrödinger Equation
This subsection scales up the Tensor Neural Networks (TNNs) [47, 48] for solving very high-dimensional Schrödinger
equations. Due to its unique characteristics, the Schrödinger equation cannot be addressed by the traditional PINN
framework and requires a specialized TNN network structure and corresponding loss function for high-dimensional
eigenvalue problems. However, our method can still be flexibly integrated into this framework for scaling up
Schrödinger equations, demonstrating the versatility of our approach. In contrast, other methods specifically designed
for high-dimensional equations exhibit much greater limitations.
5.5.1 Tensor Neural Network (TNN) for Solving Schrödinger Equation
In general, the Schrödinger equation can be viewed as a high-dimensional eigenvalue problem:

−∆u(x) + v(x)u(x) = λu(x),   x ∈ Ω,
u(x) = 0,   x ∈ ∂Ω,      (59)

where Ω = Ω_1 × Ω_2 × ··· × Ω_d ⊂ R^d and each Ω_i = (a_i, b_i), i = 1, ..., d, is a bounded interval in R, v(x) is a known potential function, u(x) is the unknown solution, and λ is the unknown eigenvalue of the high-dimensional eigenvalue problem. This problem can be addressed via the variational principle:

λ = min_{Φ(x)} [ ∫_Ω |∇Φ(x)|² dx + ∫_Ω v(x)Φ(x)² dx ] / [ ∫_Ω Φ(x)² dx ].      (60)
The integrals are usually computed via an expensive quadrature rule, e.g., Gaussian quadrature:

∫_Ω Φ(x) dx ≈ Σ_{n∈N} w^{(n)} Φ(x^{(n)}),      (61)

where N is the index set of the quadrature points. The size of N grows exponentially as the dimension increases, incurring the curse of dimensionality. To overcome this issue, the Tensor Neural Network (TNN) adopts a separable function network structure to reduce the computational cost associated with integration:

Φ(x; θ) = Σ_{j=1}^p Π_{i=1}^d φ_j(x_i; θ_i),      (62)

where x_i is the ith dimension of the input point x, φ_j(x_i; θ_i) is the jth output of the sub-wavefunction network parameterized by θ_i, and p is the predefined rank of the TNN model.
The integral of the TNN by the quadrature rule can be calculated efficiently:

∫_Ω Φ(x; θ) dx = Σ_{j=1}^p ∫_Ω Π_{i=1}^d φ_j(x_i; θ_i) dx = Σ_{j=1}^p Π_{i=1}^d ∫_{Ω_i} φ_j(x_i; θ_i) dx_i ≈ Σ_{j=1}^p Π_{i=1}^d Σ_{n_i∈N_i} w^{(n_i)} φ_j(x^{(n_i)}; θ_i),      (63)

where N_i is the index set of the ith dimensional numerical integral. Thus, the numerical integrals can be computed along each axis separately, reducing the exponential cost to a linear one. We refer the readers to [47, 48] for more details.
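A hedged sketch of the separable quadrature in Eq. (63) is given below; it assumes per-dimension 1-D quadrature nodes and weights and a list phi of sub-networks, each mapping x_i to its p rank outputs (all names are illustrative).

```python
import torch

def tnn_integral(phi, nodes, weights):
    """Approximate the integral of a rank-p TNN Phi = sum_j prod_i phi_j(x_i) over Omega.
    phi[i]:   module mapping (n_i, 1) -> (n_i, p) for dimension i
    nodes[i]: 1-D quadrature nodes of shape (n_i, 1); weights[i]: shape (n_i,)
    """
    d = len(phi)
    # per_dim[i][j] = sum_{n_i} w^{(n_i)} * phi_j(x^{(n_i)}); shape (p,)
    per_dim = [(weights[i][:, None] * phi[i](nodes[i])).sum(dim=0) for i in range(d)]
    per_dim = torch.stack(per_dim, dim=0)        # (d, p)
    return per_dim.prod(dim=0).sum()             # sum over rank j of the product over dimensions i
```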
With the cheap quadrature achieved by TNN's separable structure, the loss function to be minimized is

min_θ ℓ(θ) = [ Σ_{n∈N} w^{(n)} |∇Φ(x^{(n)}; θ)|² + Σ_{n∈N} w^{(n)} v(x^{(n)}) Φ(x^{(n)}; θ)² ] / [ Σ_{n∈N} w^{(n)} Φ(x^{(n)}; θ)² ],      (64)

where Φ(x; θ) is the neural network wave function with trainable parameters θ.
The primary computational bottleneck of TNNs for this problem lies in the first-order derivatives in the loss function, while the other terms contain only zero-order terms with lower computational and memory costs:

∂Φ(x; θ)/∂x_i = Σ_{j=1}^p ( ∂φ_j(x_i; θ_i)/∂x_i ) Π_{k=1, k≠i}^d φ_j(x_k; θ_k).      (65)

Since TNN is a separable architecture, the d-dimensional Schrödinger equation requires computing the first-order derivatives and performing quadrature integration for each of the d subnetworks. Consequently, the computational cost scales linearly with the dimension. In comparison to the previous examples of PINNs solving PDEs, TNN incurs larger memory consumption for first-order derivatives because the network parameters are not shared across dimensions, i.e., d independent first-order derivatives of d distinct subnetworks are required, resulting in increased computational requirements.
Mathematically, the computational and memory bottleneck is the gradient with respect to the first-order term of the neural network wave function Φ(x; θ):

grad(θ) = ∂/∂θ Σ_{n∈N} w^{(n)} |∇Φ_θ(x^{(n)})|² = Σ_{n∈N} w^{(n)} ∂/∂θ Σ_{i=1}^d ( ∂Φ_θ(x^{(n)})/∂x_i )².      (66)
Dimension       10^2        10^3        10^4        2·10^4
Full Batch GD   3.882E-8    3.889E-8    3.811E-8    OOM
SDGD            3.772E-8    3.439E-8    4.003E-8    5.139E-8

Table 6: Laplace eigenvalue problem: relative L¹ error of the predicted eigenvalues for full batch gradient descent (GD) and our SDGD. The results of our algorithm are similar to those of full batch GD. In extremely high dimensions, e.g., 2·10^4 dimensions, full batch GD encounters an out-of-memory (OOM) error, while our algorithm reduces the memory cost and still ensures a good convergence result.
By employing the proposed SDGD approach, we can simplify the entire gradient computation:

grad_I(θ) = (d/|I|) Σ_{n∈N} w^{(n)} ∂/∂θ Σ_{i∈I} ( ∂Φ_θ(x^{(n)})/∂x_i )²,      (67)

where I ⊂ {1, 2, ..., d} is a random index set. Therefore, we can sample the entire gradient at the PDE term level to reduce memory and computational costs, and the resulting stochastic gradient is unbiased, i.e., E_I[grad_I(θ)] = grad(θ), ensuring the convergence of our method. Gradient accumulation can also be performed for large-scale problems with our method; see Section 3.3 for details.
It is important to note that our method is more general than the traditional approach of SGD based solely on quadrature points. In particular, in extremely high-dimensional cases, even computing the gradient for a single quadrature point may lead to an out-of-memory (OOM) error. This is because the traditional approach requires a minimum batch size of d PDE terms and 1 quadrature point, where d is the dimensionality of the PDE. However, our method further reduces the computational load per batch to only 1 PDE term and 1 quadrature point.
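A sketch of the dimension-sampled term in Eq. (67) follows (PyTorch; it assumes a callable dPhi_dxi(i) returning ∂Φ_θ/∂x_i at all quadrature points, and the batch size is illustrative). Calling backward() on this estimate backpropagates only through the sampled dimensions, which is what reduces the memory cost.

```python
import torch

def sdgd_first_order_term(dPhi_dxi, weights, d, dim_batch=100):
    """Unbiased estimate of sum_n w_n * |grad Phi(x_n)|^2 using a random subset of dimensions."""
    idx = torch.randperm(d)[:dim_batch].tolist()
    term = 0.0
    for i in idx:
        term = term + (weights * dPhi_dxi(i) ** 2).sum()
    return term * d / dim_batch   # rescale so the expectation equals the full sum over i = 1..d
```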
5.5.2 Experimental Setup
We consider two classical examples of the Schrödinger equation: the Laplace eigenvalue problem and the quantum harmonic oscillator (QHO) problem.
Laplace eigenvalue problem. The potential function in the Schrödinger equation (59) is the zero function v(x) = 0. The domain is chosen as Ω = Π_{i=1}^d Ω_i = [0, 1]^d, where each Ω_i = [0, 1]; the exact smallest eigenvalue and eigenfunction are λ = dπ² and u(x) = Π_{i=1}^d sin(πx_i). Following [47], the quadrature scheme for TNN divides each Ω_i into ten identical subintervals, and sixteen Gauss quadrature points are selected for each subinterval. To train the TNN, the Adam optimizer [30] is used with a learning rate of 1e-3. The TNN has rank p = 1, depth 2, and width 50 for the sub-wavefunctions in each dimension. For this problem, we evaluate the performance of SDGD.
Quantum harmonic oscillator (QHO). The potential function in the Schrödinger equation (59) is v(x) = ∥x∥². The original problem is defined on an infinite domain; therefore, we truncate the domain to [−5, 5]^d. The exact smallest eigenvalue and eigenfunction are λ = d and u(x) = exp(−∥x∥²/2). We use the same model structure and hyperparameters as in the Laplace eigenvalue problem. For this problem, we test the effectiveness of the gradient accumulation method based on our further decomposition of PINN's gradient.
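Gradient accumulation over dimension chunks can be sketched as follows (PyTorch; loss_for_dims is an assumed helper that builds the loss contribution of the listed dimensions, as in Eq. (67), so that summing the accumulated gradients reproduces the full-batch gradient without holding the full computational graph in memory):

```python
import torch

def accumulate_full_gradient(model, optimizer, loss_for_dims, d, chunk=2500):
    optimizer.zero_grad()
    for start in range(0, d, chunk):
        dims = list(range(start, min(start + chunk, d)))
        # Each backward() adds this chunk's contribution into .grad without
        # keeping the graphs of all d dimensions alive at once.
        loss_for_dims(model, dims).backward()
    optimizer.step()
```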
The test metric we report is the relative L¹ error of the minimum eigenvalue, given by |λ_true − λ_approx| / |λ_true|. Due to the diminishing nature of the eigenfunctions in high dimensions, we only report the accuracy of the eigenvalues. It is important to note that the eigenvalues carry physical significance as they represent the ground-state energy.
5.5.3 Experimental Results
On the Laplace eigenvalue problem, we demonstrate the stable convergence of SDGD in Figure 4 and its ability to scale to larger problem sizes in Table 6. Specifically, the convergence results of our SDGD are as satisfactory as those of full batch GD, and our algorithm can scale the problem to high dimensions (2·10^4 dimensions) where full batch GD fails directly due to insufficient memory.
On the QHO problem, our gradient accumulation method achieves training in 2·10^4 dimensions within the limited GPU memory, as shown in Figure 5. Note that gradient accumulation is theoretically equivalent to full batch GD, so we do not compare their results; however, the former saves GPU memory, thus scaling up PDEs to higher dimensions. These results demonstrate the scalability of our method for PINN and PDE problems and highlight its generality, making it applicable to various physics-informed machine-learning problems.
During the early stages of optimization, we observe quick convergence of TNNs. Since the model parameters are initialized randomly, there is much room for improvement, and the gradients of the loss function with respect to the parameters can be relatively large. Larger gradients result in more significant updates to the parameters, leading to faster convergence.
[Figure 4 panels (image omitted): relative L¹ error of the eigenvalue versus epoch for the Laplace problem with SGD over PDE terms; panel settings: Dim=100 Batch=50, Dim=1000 Batch=500, Dim=10000 Batch=5000, Dim=20000 Batch=2500.]
Figure 4: Laplace eigenvalue problem: convergence curve of our algorithm (SDGD) under different dimensions.
[Figure 5 panels (image omitted): relative L¹ error of the eigenvalue versus epoch for the QHO problem with gradient accumulation; panel settings: Dim=100 Batch=50, Dim=1000 Batch=500, Dim=10000 Batch=5000, Dim=20000 Batch=2500.]
Figure 5: Quantum harmonic oscillator (QHO) problem: convergence curve of our algorithm (gradient accumulation)
under different dimensions.
[Figure 6 panels (image omitted): relative L² error versus epoch for the wave equation with high/low effective dimensions; panel settings — 1000D: Algorithm 1 with 100-point-1-term and 100-point-10-term, each for eff = 1 and eff = 10^3; 10000D: Algorithm 1 with 100-point-10-term and 100-point-100-term, each for eff = 1 and eff = 10^4.]
Figure 6: Results for the wave equation. We present the convergence results with different batch sizes in the 1,000- and 10,000-dimensional scenarios. The red lines correspond to cases with a high effective dimension, while the blue lines represent cases with a low effective dimension of only one. We have chosen very small dimension batch sizes, i.e., the PDE terms' batch size: in the 1000-dimensional case, we use a minimum of one dimension, optimizing one dimension at a time; in the 10000-dimensional case, we use only ten dimensions per iteration for gradient descent.
Despite the satisfactory convergence results, these problems demonstrate one shortcoming of our proposed
approach. Our method is primarily designed to address the computational bottleneck arising from high-dimensional
differentiations in PDEs. However, if the problem itself is large-scale and does not stem from differential equations,
our method cannot be directly applied. Specifically, in the case of the Schrödinger equation, due to the separable
nature of its network structure, we employ a separate neural network for each dimension. This results in significant
memory consumption, even without involving differentials. Additionally, in this problem, we cannot sample the points
of integration / residual points because there is an integral term in the denominator. If we sample the integral in
the denominator, the resulting stochastic gradient will be biased, leading to poor convergence. Therefore, for such
problems, designing an improved unbiased stochastic gradient remains an open question.
5.6 On the Effective Dimension of the PDE Solution
In practical applications, many high-dimensional equation solutions may exhibit some form of low-dimensional structure, meaning they have a low effective dimension. In this section, we aim to demonstrate that our method can efficiently capture the low-dimensional solutions of high-dimensional equations. If the equation possesses only a few effective dimensions, the solutions in numerous other dimensions are essentially zero. This may lead to a significant variance in gradients, but we show that the gradients between different dimensions are actually transferable. Thus, optimizing one dimension effectively assists the learning in other dimensions, ensuring that even when our algorithm selects a batch size of 1, rapid convergence is still achievable.
In particular, we consider the high-dimensional wave equation within the unit ball:

∂_tt u = ∆u,   t ∈ [0, 1], x ∈ B^d,
u(x, 0) = Σ_{i=1}^{eff} sinh(x_i),
u_t(x, 0) = 0,
u(x, t) given,   t ∈ [0, 1], x ∈ S^{d−1}.      (68)

Here eff is the effective dimension of the solution. We choose eff = 1 and eff = d as the low-dimensional and high-dimensional solutions, respectively. The boundary condition is obtained via the ground truth solution of the PDE:

u(x, t) = cosh(t) Σ_{i=1}^{eff} sinh(x_i).      (69)
Actually, in the low-dimensional solution case where eff = 1, the solution is anisotropic, which poses additional variance to the gradients of our algorithms. We use the following network architecture to satisfy the initial conditions automatically:

u_θ^{Wave}(x, t) = u_θ(x, t) t² + Σ_{i=1}^{eff} sinh(x_i).      (70)
In this example of the wave equation, we decompose the residual prediction of PINN along each dimension as follows:

∂_tt u − ∆u = (1/d) Σ_{i=1}^d { u_tt − d u_{x_i x_i} }.      (71)
Other experimental details including the choices of points and optimizers are the same as those in the Fokker-Planck
equation.
The results for the wave equation are presented in Figure 6, where we show the convergence results with different batch sizes in the 1,000- and 10,000-dimensional scenarios. The red lines correspond to cases with a high effective dimension, while the blue lines represent cases with a low effective dimension of only one. Firstly, if the equation has a low effective dimension (blue lines), our method surprisingly converges faster. This indicates that our approach can quickly capture the low-dimensional solutions within a high-dimensional equation.
Secondly, it’s worth noting that we have chosen very small-dimension batch sizes, specifically the PDE terms’
batch size. In the 1000-dimensional case, we use a minimum of one dimension, optimizing one dimension at a
time. In the 10000-dimensional case, we use only ten dimensions per iteration for gradient descent. However, our
method still achieves rapid convergence, suggesting the presence of transferability in the gradients. Optimizing one
dimension effectively triggers a transfer learning effect on other dimensions, implying that the gradients between
different dimensions are not entirely independent or orthogonal, but exhibit some form of transfer learning.
The transferability can also be validated by theory. Given the neural network structure

u_θ(x) = W_L σ(W_{L−1} σ(··· σ(W_1 x) ···)),      (72)

where x ∈ R^d is the input and W_l ∈ R^{m_l×m_{l−1}} is the weight matrix at the lth layer with m_0 = d and m_L = 1, the first-order and second-order derivatives are given by

∂u_θ(x)/∂x = W_L Φ_{L−1}(x) W_{L−1} ··· Φ_1(x) W_1 ∈ R^d,      (73)

∂²u_θ(x)/(∂x_i ∂x_j) = Σ_{l=1}^{L−1} (W_L Φ_{L−1}(x) ··· W_{l+1}) diag(Ψ_l(x) W_l ··· Φ_1(x) (W_1)_{:,j}) (W_l ··· Φ_1(x) (W_1)_{:,i}),      (74)

where

Φ_l(x) = Diag[σ′(W_l σ(W_{l−1} σ(··· σ(W_1 x) ···)))] ∈ R^{m_l×m_l},      (75)
Ψ_l(x) = Diag[σ″(W_l σ(W_{l−1} σ(··· σ(W_1 x) ···)))] ∈ R^{m_l×m_l}.      (76)
We observe that in the wave equation, the forms of the second derivatives in different dimensions are mostly shared. Therefore, optimizing the gradient of one dimension also affects other dimensions, which is akin to transfer learning. Moreover, this form depends on the particular dimensions only through the first-layer parameters (notice (W_1)_{:,i} and (W_1)_{:,j}), i.e., it differs solely in the ith and jth columns of the first-layer weight matrix; the parameters from the second layer onwards introduce no dimension-specific dependence.
6 Conclusions
The CoD has been an open problem for many decades but recent progress [1, 19, 40] has been made by several
research groups using deep learning methods for specific classes of PDEs. Herein, we proposed a general method
based on the mesh-free PINNs method by introducing a new type of stochastic dimension gradient descent or SDGD
as a generalization of the largely successful SGD method over subsets of the training set. In particular, we claim the
following advantages of the SDGD-PINNs method over other related methods:
Generality to Arbitrary PDEs. Primarily, certain approaches [1, 19, 40] are restricted to specific forms of
parabolic PDEs. They leverage the connection between parabolic PDEs and stochastic processes, employing Monte
Carlo simulations of stochastic processes to train their models. Consequently, their methods are limited to a subset of
PDEs. In contrast, our approach is based on PINNs, theoretically capable of handling arbitrary PDEs. Notably, we can
address challenging cases such as the wave equation, biharmonic equation, Schrödinger equation, and other similar
examples, while [1, 19, 40] cannot.
Prediction on the Entire Domain. Furthermore, these methods [1, 19, 40] can only predict the values of PDE
solutions at a single test point during each training instance. This limitation arises from the necessity of setting the
starting point of the stochastic process (i.e., the test point for the PDE solution) before conducting Monte Carlo
simulations. Consequently, the computational cost of their methods [1, 19, 40] increases greatly as the number of test
points grows. In contrast, our approach, based on PINNs, allows for predictions across the entire domain in a single
training instance, achieving a “once for all” capability.
Dependency on the Mesh. Moreover, precisely because these methods [1, 19, 40] require Monte Carlo simulations
of stochastic processes, they necessitate temporal discretization. This introduces additional parameters and requires
a sufficiently high discretization precision to obtain accurate solutions. For PDE problems that span long durations,
these methods undoubtedly suffer from the burden of temporal discretization. However, our PINN-based approach can
handle long-time PDEs effortlessly. Since PINNs are mesh-free methods, the training points can be randomly sampled.
The interpolation capability of PINNs ensures their generalization performance on long-time PDE problems.
PINNs can also be adapted for prediction at one point. Although PINNs offer advantages such as being
mesh-free and capable of predicting the entire domain, they still perform remarkably well in the setting where other
methods focus on single-point prediction. Specifically, if our interest lies in the value of the PDE at a single point, we
can exploit the relationship between the PDE and stochastic processes. This single point corresponds to the initial
point of the stochastic process, and the PDE corresponds to a specific stochastic process, such as the heat equation
corresponding to simple Brownian motion. In this case, we only need to use the snapshots obtained from the trajectory
of this stochastic process as the residual points for PINN.
We have proved the necessary theorems showing that SDGD leads to an unbiased estimator and that it converges under mild assumptions, similar to vanilla SGD over collocation points and mini-batches. SDGD can also be used in physics-informed neural operators such as DeepONet and can be extended to general regression and possibly classification problems in very high dimensions.
In summary, the new SDGD-PINN framework is a paradigm shift in the way we solve high-dimensional PDEs as it
can tackle arbitrary nonlinear PDEs of any dimension providing high accuracy at extremely fast speeds, e.g., the BSB
solution in 100,000 dimensions can be obtained in a few minutes even on a single GPU.
A Proof
A.1 Proof of Theorem 4.2
Lemma A.1. Suppose that we have N fixed numbers a_1, a_2, ..., a_N and we choose k random numbers {X_i}_{i=1}^k from them, i.e., each X_i is a random variable with X_i = a_n with probability 1/N for every n ∈ {1, 2, ..., N}, and the X_i are independent. Then the variance of the unbiased estimator (1/k) Σ_{i=1}^k X_i for ā = (1/N) Σ_{n=1}^N a_n is (1/(kN)) Σ_{n=1}^N (a_n − ā)².

Proof. Since the samples X_i are i.i.d.,

V[ (1/k) Σ_{i=1}^k X_i ] = (1/k) V[X_1] = (1/(kN)) Σ_{n=1}^N (a_n − ā)².      (77)
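As a quick numerical sanity check of Lemma A.1 (ours, not part of the proof), the empirical variance of the estimator can be compared with the closed-form expression:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=50)                    # N = 50 fixed numbers
k, trials = 5, 200_000

# Empirical variance of the estimator mean(X_1, ..., X_k) with X_i drawn uniformly from a.
samples = a[rng.integers(0, a.size, size=(trials, k))].mean(axis=1)
empirical = samples.var()

# Closed form from Lemma A.1: (1/(kN)) * sum_n (a_n - a_bar)^2.
closed_form = ((a - a.mean()) ** 2).sum() / (k * a.size)
print(empirical, closed_form)              # the two values should be close
```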
Assume that the total full batch consists of N_r residual points {x_i}_{i=1}^{N_r} and N_L PDE terms with L = Σ_{i=1}^{N_L} L_i. The residual loss of the PINN is

(1/(2 N_r N_L²)) Σ_{i=1}^{N_r} (L u_θ(x_i) − R(x_i))²,      (78)

where we normalize over both the number of residual points and the number of PDE terms. The full batch gradient is:
g(θ) = (1/(N_r N_L²)) Σ_{i=1}^{N_r} (L u_θ(x_i) − R(x_i)) ∂/∂θ L u_θ(x_i)
     = (1/(N_r N_L²)) Σ_{i=1}^{N_r} (L u_θ(x_i) − R(x_i)) [ Σ_{j=1}^{N_L} ∂/∂θ L_j u_θ(x_i) ]
     = (1/(N_r N_L²)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_L} (L u_θ(x_i) − R(x_i)) ∂/∂θ L_j u_θ(x_i)
     = (1/(N_r N_L²)) Σ_{i=1}^{N_r} [ ( Σ_{j=1}^{N_L} L_j u_θ(x_i) ) − R(x_i) ] [ Σ_{j=1}^{N_L} ∂/∂θ L_j u_θ(x_i) ].      (79)
From the viewpoint of Algorithm 1, the full batch gradient is the mean of N_r N_L terms:

g(θ) = (1/(N_r N_L²)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_L} (L u_θ(x_i) − R(x_i)) ∂/∂θ L_j u_θ(x_i) := (1/(N_r N_L²)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_L} g_{i,j}(θ),      (80)

where we denote

g_{i,j}(θ) = (L u_θ(x_i) − R(x_i)) ∂/∂θ L_j u_θ(x_i).      (81)
The stochastic gradient generated by Algorithm 1 is given by

g_{B,J}(θ) = (1/(|B||J| N_L)) Σ_{i∈B} (L u_θ(x_i) − R(x_i)) [ Σ_{j∈J} ∂/∂θ L_j u_θ(x_i) ] = (1/(|B||J| N_L)) Σ_{i∈B} Σ_{j∈J} g_{i,j}(θ).      (82)
We have the variance

V_{B,J}[g_{B,J}(θ)] = E_{B,J}[(g_{B,J}(θ) − g(θ))²] = E_{B,J}[g_{B,J}(θ)²] − g(θ)².      (83)

From a high-level perspective, V_{B,J}[g_{B,J}(θ)] should be a function of |B| and |J|. Thus, under the constraint that |B|·|J| is the same for different SGD schemes, we can choose B and J properly to minimize the variance of SGD and thereby accelerate convergence.

E_{B,J}[g_{B,J}(θ)²] = (1/(|B|²|J|² N_L²)) E_{B,J}[ ( Σ_{i∈B} Σ_{j∈J} (L u_θ(x_i) − R(x_i)) ∂/∂θ L_j u_θ(x_i) )² ]
= (1/(|B|²|J|² N_L²)) E_{B,J}[ Σ_{i,i′∈B} Σ_{j,j′∈J} (L u_θ(x_i) − R(x_i)) ∂/∂θ L_j u_θ(x_i) · (L u_θ(x_{i′}) − R(x_{i′})) ∂/∂θ L_{j′} u_θ(x_{i′}) ].      (84)
The entire expectation can be decomposed into four parts: (1) i = i′, j = j′; (2) i ≠ i′, j = j′; (3) i = i′, j ≠ j′; (4) i ≠ i′, j ≠ j′. For the first part,

(1/(|B|²|J|² N_L²)) E_{B,J}[ Σ_{i∈B} Σ_{j∈J} (L u_θ(x_i) − R(x_i))² ( ∂/∂θ L_j u_θ(x_i) )² ]
= (1/(|B||J| N_L²)) E_{i,j}[ (L u_θ(x_i) − R(x_i))² ( ∂/∂θ L_j u_θ(x_i) )² ]
= (1/(|B||J| N_L²)) E_{i,j}[ g_{i,j}(θ)² ]
= (1/(|B||J| N_r N_L³)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_L} g_{i,j}(θ)²
= C_1 / (|B||J|),      (85)

where we denote C_1 = (1/(N_r N_L³)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_L} g_{i,j}(θ)², which is independent of B and J.
In the second case, since the ith sample and the i′th sample are independent for i ≠ i′, we have

(1/(|B|²|J|² N_L²)) E_{B,J}[ Σ_{i≠i′∈B} Σ_{j∈J} (L u_θ(x_i) − R(x_i)) (L u_θ(x_{i′}) − R(x_{i′})) ∂/∂θ L_j u_θ(x_i) ∂/∂θ L_j u_θ(x_{i′}) ]
= ((|B| − 1)/(|B||J| N_L²)) E_{i,i′,j}[ (L u_θ(x_i) − R(x_i)) (L u_θ(x_{i′}) − R(x_{i′})) ∂/∂θ L_j u_θ(x_i) ∂/∂θ L_j u_θ(x_{i′}) ]
= C_2 · (|B| − 1)/(|B||J|).      (86)
In the third case, by the same reasoning,

(1/(|B|²|J|² N_L²)) E_{B,J}[ Σ_{i∈B} Σ_{j≠j′∈J} (L u_θ(x_i) − R(x_i))² ∂/∂θ L_j u_θ(x_i) ∂/∂θ L_{j′} u_θ(x_i) ]
= ((|J| − 1)/(|B||J| N_L²)) E_{i,j,j′}[ (L u_θ(x_i) − R(x_i))² ∂/∂θ L_j u_θ(x_i) ∂/∂θ L_{j′} u_θ(x_i) ]
= C_3 · (|J| − 1)/(|B||J|).      (87)
In the last case,

(1/(|B|²|J|² N_L²)) E_{B,J}[ Σ_{i≠i′∈B} Σ_{j≠j′∈J} (L u_θ(x_i) − R(x_i)) (L u_θ(x_{i′}) − R(x_{i′})) ∂/∂θ L_j u_θ(x_i) ∂/∂θ L_{j′} u_θ(x_{i′}) ]
= ((|B| − 1)(|J| − 1)/(|B||J| N_L²)) E_{i,i′,j,j′}[ (L u_θ(x_i) − R(x_i)) (L u_θ(x_{i′}) − R(x_{i′})) ∂/∂θ L_j u_θ(x_i) ∂/∂θ L_{j′} u_θ(x_{i′}) ]
= ((|B| − 1)(|J| − 1)/(|B||J| N_L²)) ( E_{i,j}[ (L u_θ(x_i) − R(x_i)) ∂/∂θ L_j u_θ(x_i) ] )²
= ((|B| − 1)(|J| − 1)/(|B||J|)) g(θ)².      (88)
Taking everything together,

V_{B,J}[g_{B,J}(θ)] = E_{B,J}[g_{B,J}(θ)²] − g(θ)²
= [ C_1 + C_2(|B| − 1) + C_3(|J| − 1) + (1 − |B| − |J|) g(θ)² ] / (|B||J|)
= ( C_4 |J| + C_5 |B| + C_6 ) / (|B||J|).      (89)

A.2 Proof of Lemma 4.1
Here, we bound the gradient produced by physics-informed loss functions, which further provides an upper bound for the gradient variance during PINN training using SDGD.
For the stochastic gradient produced by Algorithm 1,

V_{B,J}(g_{B,J}(θ)) ≤ (1/(|B||J| N_r N_L²)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_L} ∥g_{i,j}(θ) − g(θ)∥² ≤ (2/(|B||J| N_L)) max_{i,j} ∥g_{i,j}(θ)∥²,      (90)

where

g_{i,j}(θ) := (L u_θ(x_i) − R(x_i)) ∂/∂θ L_j u_θ(x_i).      (91)
For the stochastic gradient produced by Algorithm 2,

g_{B,J,K}(θ) = (1/(|B||J| N_L)) Σ_{i∈B} ( (N_L/|K|) Σ_{k∈K} L_k u_θ(x_i) − R(x_i) ) [ Σ_{j∈J} ∂/∂θ L_j u_θ(x_i) ]
            = (1/(|B||J||K|)) Σ_{i∈B} Σ_{j∈J} Σ_{k∈K} ( L_k u_θ(x_i) − R(x_i)/N_L ) ∂/∂θ L_j u_θ(x_i),      (92)

V_{B,J,K}(g_{B,J,K}(θ)) ≤ (1/(|B||J||K| N_r N_L²)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_L} Σ_{k=1}^{N_L} (g_{i,j,k}(θ) − g(θ))² ≤ (2/(|B||J||K|)) max_{i,j,k} ∥g_{i,j,k}(θ)∥²,      (93)

where

g_{i,j,k}(θ) := ( L_k u_θ(x_i) − R(x_i)/N_L ) ∂/∂θ L_j u_θ(x_i).      (94)
We focus on bounding the following two quantities:

max_{i,j} |L_j u_θ(x_i)| ≤ n!(L − 1)^n ∥W_L∥ Π_{l=1}^{L−1} ∥W_l∥^n,      (95)

max_{i,j} ∥ ∂/∂θ L_j u_θ(x_i) ∥ ≤ (n + 1)!(L − 1)^{n+1} ∥W_L∥ Π_{l=1}^{L−1} ∥W_l∥^{n+1} max_{x∈Ω} ∥x∥.      (96)
Recall the DNN structure:

u_θ(x) = W_L σ(W_{L−1} σ(··· σ(W_1 x) ···)).

The neural network u_θ(x) has only one part concerning x, and after taking the derivative with respect to x, there is only one term:

∂u_θ(x)/∂x = W_L Φ_{L−1}(x) W_{L−1} ··· Φ_1(x) W_1 ∈ R^d,      (97)

where Φ_l(x) = diag[σ′(W_l σ(W_{l−1} σ(··· σ(W_1 x) ···)))] ∈ R^{m_l×m_l}, and the first-order derivative has an upper bound of Π_{l=1}^L ∥W_l∥.
After taking the second derivative, due to the nested structure of the neural network, there will be L − 1 parts concerning x, induced by the derivative of each Φ_l(x) with respect to x, resulting in L − 1 terms:

∂²u_θ(x)/∂x² = { Σ_{l=1}^{L−1} (W_L Φ_{L−1}(x) ··· W_{l+1}) diag(Ψ_l(x) W_l ··· Φ_1(x) (W_1)_{:,j}) (W_l ··· Φ_1(x) W_1) }_{1≤j≤d},      (98)

where Φ_l(x) = diag[σ′(W_l σ(W_{l−1} σ(··· σ(W_1 x) ···)))] ∈ R^{m_l×m_l} and Ψ_l(x) = diag[σ″(W_l σ(W_{l−1} σ(··· σ(W_1 x) ···)))] ∈ R^{m_l×m_l}. Each of these L − 1 terms has an upper bound of ∥W_L∥ Π_{l=1}^{L−1} ∥W_l∥², and possesses at most 2(L − 1) sub-terms depending on x. Therefore, there are 2(L − 1)² sub-terms within the second-order derivative depending on x.
Using the principle of induction, after taking the nth derivative, there will be n!(L − 1)^n parts, and each term has an upper bound of ∥W_L∥ Π_{l=1}^{L−1} ∥W_l∥^n. If we further take the derivative with respect to W_l, then there will be at most (n + 1)!(L − 1)^{n+1} parts concerning W_l, and each term will be upper bounded by ∥W_L∥ Π_{l=1}^{L−1} ∥W_l∥^{n+1} max_{x∈Ω} ∥x∥.
Consequently,

∥g_{i,j}(θ)∥ ≤ |L u_θ(x_i) − R(x_i)| ∥ ∂/∂θ L_j u_θ(x_i) ∥ ≤ ( max_{i,j} |L_j u_θ(x_i)| + R_max ) max_{i,j} ∥ ∂/∂θ L_j u_θ(x_i) ∥,      (99)

∥g_{i,j,k}(θ)∥ ≤ | L_k u_θ(x_i) − R(x_i)/N_L | ∥ ∂/∂θ L_j u_θ(x_i) ∥ ≤ ( max_{i,j} |L_j u_θ(x_i)| + R_max ) max_{i,j} ∥ ∂/∂θ L_j u_θ(x_i) ∥.      (100)
B Parallel Computing
To speed up the training, we explored two avenues for distributed training: first, the data-parallel approach [16], and second, the 1D tensor-parallel approach [38].
The programming model for the data-parallel approach is based on the single instruction and multiple data (SIMD),
resulting in a weak scaling-driven performance model. In the data-parallel approach, the data is uniformly split into
a number of chunks, equal to the number of GPUs. The neural network (NN) model is initialized with the same
parameters on all the processes. These neural networks are working on different chunks of the data, and therefore,
work on different loss functions. To ensure a consistent neural network model with the same weights and biases across
all the processes and during each epoch or iteration of training, a distributed optimizer is used, which averages out the
gradient of loss values stored at each processor through an “all reduce” operation. Subsequently, the averaged gradient
of the loss function is sent to all the processors in the communicator world through a broadcast operation. Thus, the
parameters of the model are updated simultaneously with the same gradients.
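As a minimal sketch of this pattern (an illustration, not the code used for the experiments in this paper), the loop below uses PyTorch's DistributedDataParallel, assuming a single-node launch via `torchrun --nproc_per_node=2 train.py`; the network, dimensions, and loss are placeholder assumptions.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Each rank holds an identical copy of the network and its own data chunk.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    device = torch.device(f"cuda:{local_rank}")

    d = 1000  # illustrative dimension; the paper's HJB-Lin runs use d = 100,000
    model = torch.nn.Sequential(
        torch.nn.Linear(d, 1024), torch.nn.Tanh(),
        torch.nn.Linear(1024, 1024), torch.nn.Tanh(),
        torch.nn.Linear(1024, 1),
    ).to(device)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)

    for step in range(1000):
        # Each rank samples its own chunk of collocation points (different data,
        # hence a different local loss).
        x = torch.rand(100, d, device=device)
        # Placeholder objective standing in for the SDGD/PINN residual loss.
        loss = (ddp_model(x) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()   # gradients are averaged across ranks here (all-reduce)
        optimizer.step()  # every rank applies the same averaged gradient

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Note that in DDP the reduce-and-broadcast steps described above are fused into a single all-reduce that runs during loss.backward(), so every rank finishes each step with identical parameters.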
To show the performance of the algorithm presented in this paper, we chose the HJB-Lin problem in 100,000
dimensions. We perform all the computational experiments in this section on A100 GPUs with 80 GB memory.
Since the performance of the data-parallel approach is driven by weak scaling, we first need to identify suitable hyperparameters, such as the depth and width of the neural network, the number of collocation points, and the batch size. Table 7 details the performance of the data-parallel approach as these hyperparameters are varied, specifically the dimension batch size and the number of collocation points used per iteration. To verify the accuracy of the parallel code, we solved the HJB-Lin equation for 5,000 iterations; the convergence plot is shown in Figure 7 and uses the hyperparameters DIM=100 and N_f=100, where throughout this parallel-computing section we denote the batch size for dimensions in Algorithm 1 of SDGD by DIM and the number of collocation points used in each iteration by N_f.
For all the cases shown in Table 7, the GPU is not completely saturated, and therefore scaling is affected by memory latency. In Table 8, we show an optimal set of hyperparameters required to saturate the GPUs. With these parameters, we ran the proposed algorithm until full convergence for the HJB-Lin equation in 100,000 dimensions; the wall times are shown in Table 9. We observe very good scaling, but there is still scope for further optimization. Further optimization of the code and scaling across nodes will be considered in future work.
[Figure 7 plot: "Parallel Computing on 1E5-Dimensional HJB-Lin Equation"; curves: "Algorithm 2: 100-point-100-term on 1 GPU" and "Algorithm 2: 100-point-1201-term on multiple GPUs"; x-axis: Epoch (0 to 5000); y-axis: log scale.]
Figure 7: Convergence plot obtained with the data-parallel algorithm for the HJB-Lin equation in 100,000 dimensions. This experiment uses DIM=101 and N_f=100, where DIM denotes the batch size for dimensions in Algorithms 1 and/or 2 of SDGD and N_f the number of collocation points used in each iteration.
Scaling for the data-parallel approach for the HJB-Lin equation in 100,000 dimensions

Hyperparameters        1 GPU (Mem and Compute)   2 GPUs (Mem and Compute)   Time per iteration
DIM=100,  N_f=100      34 GB and 50%             34 GB and 25%              1 GPU = 10 s, 2 GPUs = 2.5 s
DIM=1200, N_f=100      67 GB and 25%             37 GB and 30%              1 GPU = 75 s, 2 GPUs = 25 s
DIM=2000, N_f=100      OOM error                 37 GB and 50%              1 GPU = N/A, 2 GPUs = 38 s

Table 7: Scaling for the data-parallel approach with 3 hidden layers and 1024 neurons in each layer. Experiments are performed on A100 GPUs with 80 GB memory. Here DIM denotes the batch size for dimensions in Algorithms 1 and/or 2 of SDGD and N_f the number of collocation points used in each iteration.
Scaling for the data-parallel approach for the 100,000D HJB-Lin equation with a well-balanced load

Hyperparameters        1 GPU (Mem and Compute)   2 GPUs (Mem and Compute)   Time per iteration
DIM=2000, N_f=1000     OOM error                 80 GB and 100%             1 GPU = N/A, 2 GPUs = 100 s

Table 8: Scaling for the data-parallel approach with optimal hyperparameters, using 9 hidden layers and 1024 neurons in each layer.
Performance: serial vs. data-parallel code for the 100,000D HJB-Lin equation

Hyperparameters        1 GPU      2 GPUs
DIM=100,  N_f=100      6.17 hrs   -
DIM=2000, N_f=1000     -          3.14 hrs

Table 9: Scaling results for the serial and parallel implementations.
Figure 8: Schematic of the tensor-parallel approach. Here, w is a weight matrix that is split along the column axis and deployed to different GPUs.
Next, we present the results of the tensor-parallel algorithm [38]. In the tensor-parallel algorithm, we split the weight matrices of the neural network along the column dimension, as shown in Figure 8, and the forward and backward passes are then performed in a pipelined fashion. Although the model in the present study is not very large, we used 8 hidden layers with 2048 neurons in order to saturate the GPUs and parallelized across 2 GPUs. The tensor-parallel approach did not scale as well, because the pipelining degrades performance. A comparison of the tensor-parallel and data-parallel approaches is given in Table 10, and a minimal illustration of the column-wise weight split is sketched below.
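The sketch below illustrates the column-wise split of Figure 8 for a single linear layer on two devices; it is a simplified illustration under assumed layer sizes and device names, not the Megatron-LM implementation of [38], and it omits the pipelined forward and backward passes.

```python
import torch

class ColumnParallelLinear(torch.nn.Module):
    """Split one weight matrix along its output (column) dimension, with each
    shard living on a different GPU; outputs are gathered back on the first device."""
    def __init__(self, in_features, out_features, devices):
        super().__init__()
        assert out_features % len(devices) == 0
        shard = out_features // len(devices)
        self.devices = devices
        self.shards = torch.nn.ParameterList(
            torch.nn.Parameter(torch.randn(shard, in_features, device=dev) * 0.01)
            for dev in devices
        )

    def forward(self, x):
        # Each GPU computes its slice of the output independently.
        outs = [x.to(dev) @ W.T for dev, W in zip(self.devices, self.shards)]
        return torch.cat([o.to(self.devices[0]) for o in outs], dim=-1)

# Usage (assuming 2 GPUs are available):
# layer = ColumnParallelLinear(2048, 2048, devices=["cuda:0", "cuda:1"])
# y = layer(torch.randn(100, 2048, device="cuda:0"))
```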
Performance: data-parallel vs. tensor-parallel code for the 100,000D HJB equation on 2 GPUs

Hyperparameters        Data parallel       Tensor parallel
DIM=2000, N_f=1000     100 s/iteration     105 s/iteration

Table 10: Scaling results for the data-parallel and tensor-parallel implementations.
References
[1] Christian Beck, Sebastian Becker, Patrick Cheridito, Arnulf Jentzen, and Ariel Neufeld. Deep splitting method
for parabolic pdes. SIAM Journal on Scientific Computing, 43(5):A3135–A3154, 2021.
[2] Christian Beck, Weinan E, and Arnulf Jentzen. Machine learning approximation algorithms for high-dimensional
fully nonlinear partial differential equations and second-order backward stochastic differential equations. Journal
of Nonlinear Science, 29:1563–1619, 2019.
[3] Christian Beck, Lukas Gonon, and Arnulf Jentzen. Overcoming the curse of dimensionality in the nu-
merical approximation of high-dimensional semilinear elliptic partial differential equations. arXiv preprint
arXiv:2003.00596, 2020.
[4] Christian Beck, Fabian Hornung, Martin Hutzenthaler, Arnulf Jentzen, and Thomas Kruse. Overcoming the
curse of dimensionality in the numerical approximation of allen–cahn partial differential equations via truncated
full-history recursive multilevel picard approximations. Journal of Numerical Mathematics, 28(4):197–222,
2020.
[5] Sebastian Becker, Ramon Braunwarth, Martin Hutzenthaler, Arnulf Jentzen, and Philippe von Wurstemberger.
Numerical simulations for full history recursive multilevel picard approximations for systems of high-dimensional
partial differential equations. arXiv preprint arXiv:2005.10206, 2020.
[6] Sebastian Becker, Patrick Cheridito, Arnulf Jentzen, and Timo Welti. Solving high-dimensional optimal stopping
problems using deep learning. European Journal of Applied Mathematics, 32(3):470–514, 2021.
[7] Richard Bellman. Dynamic programming. Princeton University Press, 1957.
[8] Shengze Cai, Zhiping Mao, Zhicheng Wang, Minglang Yin, and George Em Karniadakis. Physics-informed
neural networks (pinns) for fluid mechanics: A review. Acta Mechanica Sinica, 37(12):1727–1738, 2021.
[9] Quentin Chan-Wai-Nam, Joseph Mikael, and Xavier Warin. Machine learning for semi linear pdes. Journal of
scientific computing, 79(3):1667–1712, 2019.
[10] Francisco Chinesta, Pierre Ladeveze, and Elias Cueto. A short review on model order reduction based on proper
generalized decomposition. Archives of Computational Methods in Engineering, 18(4):395, 2011.
[11] Junwoo Cho, Seungtae Nam, Hyunmo Yang, Seok-Bae Yun, Youngjoon Hong, and Eunbyung Park. Sep-
arable pinn: Mitigating the curse of dimensionality in physics-informed neural networks. arXiv preprint
arXiv:2211.08761, 2022.
[12] Jérôme Darbon and Stanley Osher. Algorithms for overcoming the curse of dimensionality for certain hamilton–
jacobi equations arising in control theory and elsewhere. Research in the Mathematical Sciences, 3(1):19,
2016.
[13] Benjamin Fehrman, Benjamin Gess, and Arnulf Jentzen. Convergence rates for the stochastic gradient descent
method for non-convex objective functions. The Journal of Machine Learning Research, 21(1):5354–5401, 2020.
[14] Somdatta Goswami, Ameya D Jagtap, Hessam Babaee, Bryan T Susi, and George Em Karniadakis. Learning
stiff chemical kinetics using extended deep neural operators. arXiv preprint arXiv:2302.12645, 2023.
[15] Somdatta Goswami, Minglang Yin, Yue Yu, and George Em Karniadakis. A physics-informed variational
deeponet for predicting crack path in quasi-brittle materials. Computer Methods in Applied Mechanics and
Engineering, 391:114587, 2022.
[16] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch,
Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint
arXiv:1706.02677, 2017.
[17] Ehsan Haghighat, Maziar Raissi, Adrian Moure, Hector Gomez, and Ruben Juanes. A physics-informed deep
learning framework for inversion and surrogate modeling in solid mechanics. Computer Methods in Applied
Mechanics and Engineering, 379:113741, 2021.
[18] PC Hammer. Adaptive control processes: A guided tour (R. Bellman), 1962.
[19] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep
learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
[20] Jiequn Han, Arnulf Jentzen, et al. Deep learning-based numerical methods for high-dimensional parabolic partial
differential equations and backward stochastic differential equations. Communications in mathematics and
statistics, 5(4):349–380, 2017.
[21] Di He, Shanda Li, Wenlei Shi, Xiaotian Gao, Jia Zhang, Jiang Bian, Liwei Wang, and Tie-Yan Liu. Learning
physics-informed neural networks without stacked back-propagation. In International Conference on Artificial
Intelligence and Statistics, pages 3034–3047. PMLR, 2023.
[22] Pierre Henry-Labordere. Deep primal-dual algorithm for bsdes: Applications of machine learning to cva and im.
Available at SSRN 3071506, 2017.
[23] Zheyuan Hu, Ameya D. Jagtap, George Em Karniadakis, and Kenji Kawaguchi. When do extended physics-
informed neural networks (xpinns) improve generalization? SIAM Journal on Scientific Computing, 44(5):A3158–
A3182, 2022.
[24] Côme Huré, Huyên Pham, and Xavier Warin. Deep backward schemes for high-dimensional nonlinear pdes.
Mathematics of Computation, 89(324):1547–1579, 2020.
[25] Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, Tuan Anh Nguyen, and Philippe von Wurstemberger.
Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential
equations. Proceedings of the Royal Society A, 476(2244):20190630, 2020.
[26] Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, et al. Multilevel picard iterations for solving smooth
semilinear parabolic heat equations. Partial Differential Equations and Applications, 2(6):1–31, 2021.
[27] Ameya D Jagtap, Dimitrios Mitsotakis, and George Em Karniadakis. Deep learning of inverse water waves
problems using multi-fidelity data: Application to serre–green–naghdi equations. Ocean Engineering, 248:110775,
2022.
[28] Shaolin Ji, Shige Peng, Ying Peng, and Xichuan Zhang. Three algorithms for solving high-dimensional fully
coupled fbsdes through deep learning. IEEE Intelligent Systems, 35(3):71–84, 2020.
[29] George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-
informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021.
[30] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[31] Yunwen Lei, Ting Hu, Guiying Li, and Ke Tang. Stochastic gradient descent for nonconvex learning without
bounded gradient assumptions. IEEE transactions on neural networks and learning systems, 31(10):4394–4400,
2019.
[32] Jianfeng Lu, Yulong Lu, and Min Wang. A priori generalization analysis of the deep ritz method for solving high
dimensional elliptic equations. arXiv preprint arXiv:2101.01708, 2021.
[33] Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear
operators via deeponet based on the universal approximation theorem of operators. Nature machine intelligence,
3(3):218–229, 2021.
[34] Tao Luo and H. Yang. Two-layer neural networks for partial differential equations: Optimization and generaliza-
tion theory. ArXiv, abs/2006.15733, 2020.
[35] Kaspar Märtens and Christopher Yau. Neural decomposition: Functional anova with variational autoencoders. In
International Conference on Artificial Intelligence and Statistics, pages 2917–2927. PMLR, 2020.
[36] Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, and Volkan Cevher. On the almost sure convergence of
stochastic gradient descent in non-convex problems. Advances in Neural Information Processing Systems,
33:1117–1128, 2020.
[37] Siddhartha Mishra and Roberto Molinaro. Estimates on the generalization error of physics informed neural
networks (pinns) for approximating pdes. arXiv preprint arXiv:2006.16144, 2020.
[38] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti,
Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language
model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.
[39] Paris Perdikaris, Daniele Venturi, and George Em Karniadakis. Multifidelity information fusion algorithms for
high-dimensional systems and massive data sets. SIAM Journal on Scientific Computing, 38(4):B521–B538,
2016.
[40] Maziar Raissi. Forward-backward stochastic neural networks: Deep learning of high-dimensional partial
differential equations. arXiv preprint arXiv:1804.07010, 2018.
[41] Maziar Raissi and George Em Karniadakis. Hidden physics models: Machine learning of nonlinear partial
differential equations. Journal of Computational Physics, 357:125–141, 2018.
[42] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning
framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of
Computational Physics, 378:686–707, 2019.
[43] Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S
Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! Advances in Neural Information
Processing Systems, 32, 2019.
[44] Yeonjong Shin, Jerome Darbon, and George Em Karniadakis. On the convergence of physics informed neural
networks for linear second-order elliptic and parabolic type pdes. arXiv preprint arXiv:2004.01806, 2020.
[45] Khemraj Shukla, Vivek Oommen, Ahmad Peyvan, Michael Penwarden, Luis Bravo, Anindya Ghoshal, Robert M
Kirby, and George Em Karniadakis. Deep neural operators can serve as accurate surrogates for shape optimization:
a case study for airfoils. arXiv preprint arXiv:2302.00807, 2023.
[46] Chuwei Wang, Shanda Li, Di He, and Liwei Wang. Is $L^2$ physics informed loss always suitable for training
physics informed neural network? In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho,
editors, Advances in Neural Information Processing Systems, 2022.
[47] Yifan Wang, Pengzhan Jin, and Hehu Xie. Tensor neural network and its numerical integration. arXiv preprint
arXiv:2207.02754, 2022.
[48] Yifan Wang, Yangfei Liao, and Hehu Xie. Solving Schrödinger equation using tensor neural network. arXiv
preprint arXiv:2209.12572, 2022.
[49] Yibo Yang and Paris Perdikaris. Adversarial uncertainty quantification in physics-informed neural networks.
Journal of Computational Physics, 394:136–152, 2019.
[50] Yaohua Zang, Gang Bao, Xiaojing Ye, and Haomin Zhou. Weak adversarial networks for high-dimensional
partial differential equations. Journal of Computational Physics, 411:109409, 2020.
[51] Dongkun Zhang, Ling Guo, and George Em Karniadakis. Learning in modal space: Solving time-dependent
stochastic pdes using physics-informed neural networks. SIAM Journal on Scientific Computing, 42(2):A639–
A665, 2020.
[52] Zhongqiang Zhang, Minseok Choi, and George Em Karniadakis. Error estimates for the anova method with
polynomial chaos interpolation: Tensor product functions. SIAM Journal on Scientific Computing, 34(2):A1165–
A1186, 2012.