Summary: Deep Learning in Medical Image Registration (arxiv.org)
38,813 words - PDF document
One Line
Deep learning techniques, including transformer-based models, have revolutionized medical image registration by capturing long-range dependencies, estimating uncertainty, and addressing domain shift, often in an unsupervised manner.
Key Points
- Deep learning has revolutionized medical image registration, with advancements in similarity measures, deformation regularizations, and uncertainty estimation
- Learning-based registration methods can be categorized as supervised or unsupervised, with recent focus on unsupervised approaches for greater flexibility
- Fundamental paradigm of learning-based registration involves deep neural networks, spatial transformers, and loss functions tailored for registration tasks
- Network architectures have evolved, with encoder-decoder designs for deformable registration and encoder-only networks for rigid/affine registration
- Estimating registration uncertainty is crucial for downstream applications, and evaluation metrics are essential for assessing the performance of learning-based methods
- Recent advancements in deep learning-based registration include improved similarity measures, spatially-varying deformation regularizers, and novel network architectures like Transformers and diffusion models
- Unsupervised and self-supervised approaches have gained traction, leveraging the inherent structure of the data to learn robust registration models without manual annotations
Summaries
20 word summary
Deep learning transforms medical image registration through unsupervised techniques. Transformer-based models capture long-range dependencies, while uncertainty estimation and domain-shift handling improve reliability.
45 word summary
Deep learning revolutionized medical image registration through unsupervised and self-supervised techniques. Transformer-based architectures model long-range dependencies and large deformations, achieving state-of-the-art performance. Estimating registration uncertainty is crucial, and addressing domain shift improves generalizability. Metamorphic registration accommodates topological changes, and deep learning is transforming spatial normalization, enabling broader applications.
111 word summary
Deep learning has revolutionized medical image registration, with a shift towards unsupervised and self-supervised approaches. Unsupervised techniques like adversarial learning and contrastive learning can align images without labeled data. Self-supervised methods exploit spatial and temporal relationships to learn effective registration models. Transformer-based architectures offer improved modeling of long-range dependencies and large deformations, demonstrating state-of-the-art performance. Estimating registration uncertainty is crucial, with Bayesian deep learning and probabilistic models capturing inherent uncertainties. Addressing domain shift is an important challenge, with methods like SynthMorph and HyperMorph improving generalizability. Metamorphic registration, which accommodates topological changes, has also been explored. Deep learning is transforming spatial normalization, enhancing atlas quality and enabling broader applications beyond the brain.
316 word summary
Deep learning has revolutionized medical image registration, with a shift towards unsupervised and self-supervised approaches. Unsupervised deformable registration techniques, such as adversarial learning and cycle-consistent networks, can align images without labeled training data by leveraging the inherent structure of the data. Contrastive learning has also been explored, where the network learns to align images by maximizing the similarity between corresponding features.
Self-supervised methods, which learn representations from the data itself, have gained traction in medical image registration. These approaches exploit the inherent spatial and temporal relationships within the data, such as predicting motion and appearance statistics, to learn effective registration models without manual annotations.
Transformer-based architectures have emerged as a powerful alternative to convolutional neural networks, offering improved modeling of long-range dependencies and better handling of large deformations. These models have demonstrated state-of-the-art performance in various registration tasks, including affine and deformable registration.
Estimating registration uncertainty is crucial, as it allows for a better understanding of the reliability of the results. Bayesian deep learning and probabilistic models have been explored to capture the inherent uncertainties in the registration process, which can be valuable for downstream applications.
Addressing the domain shift problem, where trained networks struggle to perform well on input images from different distributions, is an important challenge. Researchers have explored methods like SynthMorph and HyperMorph to improve the generalizability of registration networks, and the potential of zero-shot learning techniques, leveraging foundation models, is highlighted as a promising avenue.
The concept of metamorphic registration, which can accommodate topological changes between scans, has also been explored. Recent learning-based metamorphic registration methods have built upon a metamorphic framework, enabling the disentanglement of geometric and appearance changes, and leveraging segmentation networks to guide the registration process.
Deep learning is also playing a transformative role in spatial normalization, enhancing the quality of atlases and enabling their broader application beyond just the brain, with significant implications for various medical imaging applications.
454 word summary
Deep learning has revolutionized medical image registration, with significant advancements in recent years. The field has shifted towards unsupervised and self-supervised approaches, which offer greater flexibility and generalization compared to traditional supervised methods.
Unsupervised deformable registration techniques, such as adversarial learning and cycle-consistent networks, have shown promising results in aligning images without the need for labeled training data. These methods leverage the inherent structure of the data to learn robust registration models. Contrastive learning has also been explored, where the network learns to align images by maximizing the similarity between corresponding features.
Self-supervised methods, which learn representations from the data itself, have gained traction in medical image registration. These approaches exploit the inherent spatial and temporal relationships within the data, such as predicting motion and appearance statistics, to learn effective registration models without manual annotations.
Transformer-based architectures have emerged as a powerful alternative to convolutional neural networks, offering improved modeling of long-range dependencies and better handling of large deformations. These models have demonstrated state-of-the-art performance in various registration tasks, including affine and deformable registration.
Estimating registration uncertainty is crucial, as it allows for a better understanding of the reliability of the results. Bayesian deep learning and probabilistic models have been explored to capture the inherent uncertainties in the registration process. This information can be valuable for downstream applications, such as atlas-based segmentation and multi-atlas-based segmentation, where uncertainty can be leveraged to improve the reliability of the analysis.
Addressing the domain shift problem, where trained networks struggle to perform well on input images from different distributions, is an important challenge. Researchers have explored methods like SynthMorph and HyperMorph to improve the generalizability of registration networks. The potential of zero-shot learning techniques, leveraging foundation models, is also highlighted as a promising avenue for enhancing the accessibility and usefulness of deep learning-based registration algorithms.
The concept of metamorphic registration, which can accommodate topological changes between scans (e.g., the presence of tumors), has also been explored. Recent learning-based metamorphic registration methods have built upon a metamorphic framework, enabling the disentanglement of geometric and appearance changes, and leveraging segmentation networks to guide the registration process.
Deep learning is also playing a transformative role in spatial normalization, enhancing the quality of atlases and enabling their broader application beyond just the brain. This has significant implications for various medical imaging applications, from cancer treatment planning to the creation of patient-specific digital twins.
Overall, the field of deep learning for medical image registration has seen significant advancements, with a focus on unsupervised and self-supervised techniques that can learn effective registration models without the need for extensive manual annotations. These developments hold promise for improving clinical decision-making and patient care, and the survey aims to guide future research in this rapidly evolving field.
1669 word summary
Deep learning has revolutionized the field of medical image registration over the past decade. Initial developments, such as ResNet-based and U-Net-based networks, laid the foundation for deep learning in image registration. Subsequent progress has focused on various aspects, including similarity measures, deformation regularizations, and uncertainty estimation.
Learning-based registration methods can be categorized as supervised or unsupervised. Supervised methods use ground truth transformations during training, while unsupervised methods do not require this extrinsic information. Recent advancements have shifted towards unsupervised methods, which offer greater flexibility in modeling deformation field properties.
The fundamental paradigm of learning-based registration involves deep neural networks, spatial transformers, and loss functions. Supervised methods use loss functions like mean squared error or end-point-error, comparing network outputs to ground truth transformations. Unsupervised methods employ loss functions similar to traditional registration energy functions, incorporating image similarity measures and deformation regularizers.
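The unsupervised paradigm above can be sketched as a loss combining an image similarity term with a weighted deformation regularizer. This is a minimal NumPy illustration on a toy 2-D displacement field; the function names and the choice of MSE plus a diffusion penalty are assumptions, not the survey's specific formulation.

```python
import numpy as np

def mse_similarity(warped, fixed):
    """Mean squared error between the warped moving image and the fixed image."""
    return np.mean((warped - fixed) ** 2)

def diffusion_regularizer(disp):
    """Diffusion (first-order) penalty: mean squared spatial gradient of the
    displacement field. disp has shape (2, H, W) for a 2-D field."""
    gy = np.diff(disp, axis=1)  # finite differences along rows
    gx = np.diff(disp, axis=2)  # finite differences along columns
    return np.mean(gy ** 2) + np.mean(gx ** 2)

def unsupervised_loss(warped, fixed, disp, lam=0.01):
    """L = similarity(warped, fixed) + lam * regularizer(disp)."""
    return mse_similarity(warped, fixed) + lam * diffusion_regularizer(disp)
```

The weight `lam` plays the same role as the regularization hyperparameter in traditional registration energy functions.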
Commonly used similarity measures include mean squared error, normalized cross-correlation, and structural similarity index for mono-modal registration, as well as mutual information and correlation ratio for multi-modal registration. Novel loss functions have also been proposed, leveraging the capabilities of deep learning.
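As a concrete example of one of these measures, normalized cross-correlation is invariant to linear intensity scaling, which is why it is favored for mono-modal registration. A minimal global-NCC sketch (practical implementations typically use a local windowed variant):

```python
import numpy as np

def ncc(im1, im2, eps=1e-8):
    """Global normalized cross-correlation in [-1, 1]; 1 means the two
    images are perfectly linearly related in intensity."""
    a = im1 - im1.mean()
    b = im2 - im2.mean()
    return float((a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + eps))
```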
Network architectures have evolved, with encoder-decoder designs for deformable registration and encoder-only networks for rigid/affine registration. Advancements in modeling diffeomorphic transformations, through approaches like scaling-and-squaring, have enabled learning-based methods to produce invertible and topologically-preserving deformations.
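The scaling-and-squaring approach mentioned above integrates a stationary velocity field into a diffeomorphic displacement by halving it repeatedly and then composing the small field with itself. A 2-D NumPy/SciPy sketch under simplifying assumptions (linear interpolation, nearest-edge boundary handling; function names are illustrative):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_field(u, sample):
    """Sample each channel of displacement field u (2, H, W) at the
    continuous locations given by sample (2, H, W), linearly interpolated."""
    return np.stack([map_coordinates(u[c], sample, order=1, mode='nearest')
                     for c in range(2)])

def scaling_and_squaring(v, steps=6):
    """Integrate a stationary velocity field v (2, H, W): start from
    v / 2**steps, then square (self-compose) the field `steps` times,
    using u <- u + u o (id + u)."""
    H, W = v.shape[1:]
    identity = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing='ij'))
    u = v / (2 ** steps)
    for _ in range(steps):
        u = u + warp_field(u, identity + u)
    return u
```

Because each squaring step composes a small, nearly-identity transformation with itself, the result is invertible in practice, which is the property that makes the deformation topology-preserving.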
Estimating registration uncertainty is an important aspect, as it can provide valuable information for downstream applications. Evaluation metrics, including accuracy and regularity measures, are crucial for assessing the performance of learning-based registration methods.
Learning-based registration has found applications in various medical imaging tasks, such as atlas construction, multi-atlas segmentation, motion estimation, and 2D-3D registration. As the field continues to progress, addressing current challenges and exploring future directions will further enhance the capabilities of deep learning in medical image registration.
Image registration is a crucial task in medical imaging, and deep learning has emerged as a powerful approach for this problem. Recent advancements in deep learning-based image registration have focused on improving the similarity measures, deformation regularizers, and network architectures.
Similarity measures play a crucial role in registration performance. Traditional measures like mutual information (MI) and correlation ratio have limitations, especially for multi-modal applications. Newer approaches like Structural Similarity Criterion (SSC) and Normalized Gradient Fields (NGF) have shown improved performance by considering local structural information and edge-based similarities, respectively.
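The edge-based idea behind Normalized Gradient Fields can be sketched in a few lines: the measure compares gradient *orientations* rather than raw intensities, so it tolerates intensity differences across modalities. This is a simplified illustration with an assumed epsilon-regularization; it is not the survey's exact formulation.

```python
import numpy as np

def ngf_distance(im1, im2, eps=1e-2):
    """Normalized Gradient Fields distance (sketch): penalizes misaligned
    edge orientations. Identical edge structure gives ~0; orthogonal
    edges give 1. eps suppresses the influence of near-flat regions."""
    g1 = np.stack(np.gradient(im1))            # (2, H, W) image gradients
    g2 = np.stack(np.gradient(im2))
    n1 = np.sqrt((g1 ** 2).sum(0) + eps ** 2)  # epsilon-regularized norms
    n2 = np.sqrt((g2 ** 2).sum(0) + eps ** 2)
    dot = (g1 * g2).sum(0)
    return np.mean(1.0 - (dot / (n1 * n2)) ** 2)
```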
Deformation regularizers are essential for ensuring smooth and realistic deformations. Conventional regularizers like diffusion and bending energy have been enhanced with spatially-varying approaches that adapt the regularization strength based on the image content. Techniques like learning a spatially-varying regularizer or using consistency losses have also been explored to implicitly regularize the deformation field.
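A spatially-varying regularizer can be as simple as multiplying the diffusion penalty by a per-location weight map, e.g. relaxing smoothness at sliding-organ interfaces while enforcing it elsewhere. A toy NumPy sketch (the weighting scheme here is an assumption for illustration):

```python
import numpy as np

def spatially_varying_diffusion(disp, weight):
    """Diffusion penalty modulated by a per-location weight map (H, W):
    low weights relax smoothness locally, high weights enforce it.
    disp: displacement field of shape (2, H, W)."""
    gy = np.diff(disp, axis=1) ** 2   # squared gradients along rows
    gx = np.diff(disp, axis=2) ** 2   # squared gradients along columns
    return ((weight[:-1, :] * gy.sum(0)).mean()
            + (weight[:, :-1] * gx.sum(0)).mean())
```

Setting `weight` to a constant recovers the conventional, spatially-uniform diffusion regularizer.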
Beyond traditional convolutional neural networks (ConvNets), recent advancements in registration network architectures have explored the use of adversarial learning, contrastive learning, Transformers, diffusion models, and neural ordinary differential equations (ODEs). Adversarial learning can alleviate the need for explicit similarity measures, while contrastive learning can learn modality-invariant representations. Transformers have shown promise in capturing long-range dependencies, and diffusion models offer a new paradigm for generating continuous deformations. Neural ODEs provide a principled way to model the deformation as a dynamical system, drawing inspiration from optimization-based methods.
These architectural innovations, combined with improvements in similarity measures and deformation regularizers, have led to significant advancements in deep learning-based medical image registration. As the field continues to evolve, further research is expected to yield even more robust and versatile registration techniques, benefiting various clinical applications.
Deep learning has shown promise in medical image registration, demonstrating superior performance compared to traditional methods. One key approach is to formulate image registration as an implicit problem, where a neural network maps spatial coordinates to a deformation field. This provides a more compact and continuous representation, facilitating smooth manipulation of the deformation.
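The implicit formulation can be illustrated with a tiny coordinate network: a function of continuous (x, y) that returns a displacement, so the field can be queried at any sub-voxel location without interpolation. The weights below are random and untrained; this is purely a sketch of the representation, not a working registration model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Untrained two-layer coordinate MLP (illustration only): maps a normalized
# coordinate (x, y) in [0, 1]^2 to a 2-D displacement vector.
W1, b1 = rng.normal(size=(2, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 2)) * 0.01, np.zeros(2)

def implicit_displacement(coords):
    """coords: (N, 2) continuous query points -> (N, 2) displacements.
    The field is a smooth function of coordinates, so nearby queries
    yield nearby displacements (a continuous representation)."""
    h = np.tanh(coords @ W1 + b1)
    return h @ W2 + b2
```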
Recent research has also explored integrating hyperparameters directly into the registration network architecture, allowing for efficient hyperparameter tuning within a single training process. Additionally, methods that permit spatially discontinuous deformations have been proposed, leveraging anatomical label maps to generate region-specific deformation fields.
Correlation layers have been adopted to aid neural networks in identifying explicit correspondences between image features, improving registration accuracy. Progressive and pyramid-based registration techniques have also been shown effective, decomposing the registration process into multiple refinement steps.
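A correlation layer builds a local cost volume: at every position of the fixed feature map it takes dot products with moving-image features at all displacements inside a search window. A minimal NumPy sketch (shapes and normalization are assumptions):

```python
import numpy as np

def correlation_layer(feat_f, feat_m, radius=1):
    """Local cost volume: feat_f, feat_m are (C, H, W) feature maps.
    Returns ((2*radius+1)**2, H, W): one correlation map per candidate
    displacement within the search window."""
    C, H, W = feat_f.shape
    pad = np.pad(feat_m, ((0, 0), (radius, radius), (radius, radius)))
    vols = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = pad[:, dy:dy + H, dx:dx + W]   # moving features at offset
            vols.append((feat_f * shifted).sum(axis=0) / C)
    return np.stack(vols)
```

The channel with the highest response at a position indicates the most likely local displacement, giving the network explicit correspondence evidence rather than leaving matching implicit in the convolutions.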
Estimating uncertainty is crucial in medical image analysis, as it enables evaluating the reliability of registration predictions. Deep learning-based methods can model both aleatoric uncertainty (inherent in the data) and epistemic uncertainty (related to model limitations). Transformation uncertainty and appearance uncertainty are two key measures of epistemic uncertainty in registration.
Evaluating registration performance remains a challenge, particularly for deformable transformations where dense manual correspondences are difficult to obtain. Accuracy measures, such as target registration error and label overlap, are commonly used, along with regularity measures that assess the smoothness of the deformation field. Recent work has also explored machine learning techniques to predict registration errors directly from the input images.
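Regularity is typically checked via the determinant of the Jacobian of the transformation: locations where it is non-positive indicate folding, i.e. a locally non-invertible deformation. A 2-D finite-difference sketch:

```python
import numpy as np

def jacobian_determinant(disp):
    """det of the Jacobian of phi = id + disp for a 2-D displacement field
    disp (2, H, W), via finite differences. det <= 0 marks folding."""
    dy = np.gradient(disp, axis=1)   # derivatives w.r.t. row coordinate
    dx = np.gradient(disp, axis=2)   # derivatives w.r.t. column coordinate
    # J = I + grad(disp); 2x2 determinant per voxel
    return (1 + dy[0]) * (1 + dx[1]) - dy[1] * dx[0]

def folding_ratio(disp):
    """Fraction of locations where the deformation folds (det J <= 0)."""
    return float((jacobian_determinant(disp) <= 0).mean())
```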
Overall, the field of deep learning-based medical image registration has seen significant advancements, with novel techniques addressing key challenges in modeling continuous deformations, handling hyperparameters, and estimating uncertainty. These developments have the potential to enhance the reliability and applicability of registration methods in clinical settings.
Deep learning has shown significant promise in medical image registration, addressing challenges associated with traditional methods. Recent advancements in learning-based registration models have focused on developing more efficient and accurate techniques.
One key aspect is the network architecture, where researchers have explored multi-resolution strategies to capture deformations across different scales. This mimics the benefits of traditional multi-resolution registration algorithms, improving performance and deformation properties. Additionally, there is growing interest in architectures that can better capture spatial correspondences between images, such as Transformers and Siamese networks.
Regarding loss functions, while MSE and NCC remain popular for mono-modal registration, learning-based methods have explored alternatives for multi-modal scenarios. Anatomical loss functions like Dice can serve as modality-independent surrogates, while contrastive and adversarial learning techniques can guide the network to understand similarities and dissimilarities across modalities.
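The anatomical surrogate idea can be made concrete with a soft Dice loss: because it compares segmentation masks rather than intensities, it is insensitive to the imaging modality. A minimal sketch with an assumed squared-denominator variant:

```python
import numpy as np

def soft_dice_loss(seg_warped, seg_fixed, eps=1e-6):
    """Soft Dice loss between warped moving segmentation and fixed
    segmentation (soft masks in [0, 1]). 0 = perfect overlap, 1 = none.
    Intensity-free, hence modality-independent."""
    inter = 2.0 * (seg_warped * seg_fixed).sum()
    denom = (seg_warped ** 2).sum() + (seg_fixed ** 2).sum() + eps
    return 1.0 - inter / denom
```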
The use of spatially-varying regularization, which was a significant focus in traditional registration, has been relatively overlooked in the deep learning era. Incorporating such spatially-adaptive regularization within or through deep learning frameworks is an important future direction.
Another critical aspect is registration uncertainty estimation, which can facilitate the interpretation of registration results and improve the reliability of various medical image analysis tasks. Limitations in ground truth evaluation and computational complexity currently restrict the widespread adoption of uncertainty estimation. Developing improved evaluation methods and efficient computational techniques can help address these challenges.
Potential applications of registration uncertainty include atlas-based segmentation, where uncertainty can be used to generate soft segmentation masks, and multi-atlas-based segmentation, where uncertainty can be leveraged to weight different segmentation results. Exploring these and other applications of registration uncertainty remains an active area of research.
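The multi-atlas weighting idea above can be sketched as an inverse-uncertainty fusion: each atlas-propagated soft segmentation is weighted per voxel by how confident its registration was. The weighting scheme here is an assumption for illustration, not a specific published method.

```python
import numpy as np

def uncertainty_weighted_fusion(label_probs, uncertainties, eps=1e-6):
    """Fuse K atlas-propagated soft segmentations (K, H, W), weighting
    each atlas inversely by its per-voxel registration uncertainty
    (K, H, W). Returns a fused soft mask (H, W)."""
    w = 1.0 / (uncertainties + eps)        # confident atlases get large weight
    w = w / w.sum(axis=0, keepdims=True)   # normalize weights per voxel
    return (w * label_probs).sum(axis=0)
```

With equal uncertainties this reduces to plain label averaging; an atlas with very high uncertainty at a voxel contributes almost nothing there.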
Overall, the field of deep learning-based medical image registration continues to evolve, with promising directions in network architecture, loss functions, regularization, and uncertainty estimation.
Deep learning has emerged as a transformative approach for medical image registration, offering significant advancements over traditional methods. This survey examines the latest technological developments in this rapidly evolving field.
The review covers fundamental aspects of learning-based image registration, including widely-used and novel loss functions, as well as network architectures. It also delves into the estimation of registration uncertainty and appropriate metrics for assessing accuracy and regularity.
One key challenge addressed is the domain shift problem, where trained networks struggle to perform well on input images from different distributions. Researchers have explored methods like SynthMorph and HyperMorph to improve the generalizability of registration networks. The potential of zero-shot learning techniques, leveraging foundation models, is also highlighted as a promising avenue for enhancing the accessibility and usefulness of deep learning-based registration algorithms.
The survey further explores the concept of metamorphic registration, which can accommodate topological changes between scans, such as the presence of tumors. Recent learning-based metamorphic registration methods have built upon a metamorphic framework, enabling the disentanglement of geometric and appearance changes, and leveraging segmentation networks to guide the registration process.
Additionally, the review discusses the importance of spatial normalization, where deep learning is playing a transformative role in enhancing the quality of atlases, enabling their broader application beyond just the brain. This has significant implications for various medical imaging applications, from cancer treatment planning to the creation of patient-specific digital twins.
The comprehensive survey aims to guide future research in this rapidly evolving field, highlighting the latest advancements, potential clinical applications, and the challenges that remain to be addressed.
Deep learning has emerged as a powerful tool for medical image registration, offering significant advancements in accuracy and efficiency compared to traditional methods. This summary highlights key developments in this field, focusing on unsupervised and self-supervised approaches.
Unsupervised deformable registration techniques, such as adversarial learning and cycle-consistent networks, have shown promising results in aligning images without the need for labeled training data. These methods leverage the inherent structure of the data to learn robust registration models. Contrastive learning has also been explored, where the network learns to align images by maximizing the similarity between corresponding features.
Self-supervised methods, which learn representations from the data itself, have gained traction in medical image registration. These approaches exploit the inherent spatial and temporal relationships within the data, such as predicting motion and appearance statistics, to learn effective registration models without manual annotations.
Transformer-based architectures have emerged as a powerful alternative to convolutional neural networks, offering improved modeling of long-range dependencies and better handling of large deformations. These models have demonstrated state-of-the-art performance in various registration tasks, including affine and deformable registration.
Uncertainty quantification is another important aspect, as it allows for a better understanding of the reliability of the registration results. Bayesian deep learning and probabilistic models have been explored to capture the inherent uncertainties in the registration process.
Overall, the field of deep learning for medical image registration has seen significant advancements, with a focus on unsupervised and self-supervised techniques that can learn effective registration models without the need for extensive manual annotations. These developments hold promise for improving clinical decision-making and patient care.