Summary: Bayesian Flow Networks: A Generative Model (arxiv.org)
18,918 words - PDF document
One Line
Bayesian Flow Networks optimize information transmission between a sender and a receiver by combining Bayesian inference with deep learning: Bayesian inference updates the parameters of independent distributions, and a neural network maps them to an interdependent output distribution.
Key Points
- Bayesian Flow Networks (BFNs) are a generative model that modifies parameters through Bayesian inference and uses a neural network to output interdependent distributions.
- BFNs optimize the transmission of information between a sender and receiver by updating input and output distributions using Bayesian inference.
- The reconstruction loss L^r(x) is trained indirectly through optimization of the n-step loss L^n(x); the total loss L(x), the number of nats required to transmit the data, is the sum of the two.
- For discrete data the input distribution is a factorized categorical and the output distribution applies a softmax to the network outputs; for continuous data the output distribution is a Gaussian.
- BFNs have been evaluated on generative benchmarks and have shown promising results in various datasets.
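The sender/receiver transmission in the points above can be sketched for a single continuous variable. Everything here (function names, the choice of a Gaussian input distribution, the fixed accuracy alpha) is an illustrative assumption rather than the paper's code, and the neural-network output distribution is omitted:

```python
import numpy as np

def bfn_transmission_sketch(x, n_steps, alpha, seed=0):
    """Sender/receiver loop for one continuous variable: the sender emits
    noisy observations of x with precision alpha, and the receiver refines
    a Gaussian input distribution N(mu, 1/rho) by conjugate Bayesian updates.
    (Names and the fixed alpha are illustrative; the network is omitted.)"""
    rng = np.random.default_rng(seed)
    mu, rho = 0.0, 1.0                                    # standard-normal input prior
    for _ in range(n_steps):
        y = x + rng.normal(scale=1.0 / np.sqrt(alpha))    # noisy sender sample
        rho_new = rho + alpha                             # precisions add
        mu = (rho * mu + alpha * y) / rho_new             # conjugate mean update
        rho = rho_new
    return mu, rho

mu, rho = bfn_transmission_sketch(x=2.0, n_steps=1000, alpha=0.1)
# after many updates the input mean concentrates near the data x
```

Each update is a standard conjugate Gaussian posterior step, which is why the receiver needs no memory beyond the current (mu, rho).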
Summaries
33 word summary
Bayesian Flow Networks (BFNs) optimize information transmission between sender and receiver by combining Bayesian inference and deep learning. BFNs modify independent distributions' parameters through Bayesian inference and use a neural network for output.
42 word summary
Bayesian Flow Networks (BFNs) are a generative model that combines Bayesian inference and deep learning to optimize information transmission between a sender and a receiver. BFNs modify the parameters of independent distributions through Bayesian inference and use a neural network to output interdependent distributions.
683 word summary
Bayesian Flow Networks (BFNs) are introduced as a new generative model that modifies the parameters of independent distributions through Bayesian inference. These modified parameters are then input to a neural network that outputs a second, interdependent distribution.
Bayesian Flow Networks (BFNs) use Bayesian inference and deep learning to optimize the transmission of information between a sender and a receiver. The input distribution, which models the variables in the data independently, is updated using Bayesian inference. The output distribution is produced by a neural network from the updated input parameters.
The document discusses Bayesian Flow Networks, a generative model that uses input parameters and a neural network to generate output distributions. The output distribution can exploit context information, such as surrounding pixels in an image or related words in a text. The receiver distribution combines the output distribution with the known noise distribution of the sender.
The paper optimizes the reconstruction loss L^r(x) indirectly by training the n-step loss L^n(x). The total loss L(x) is defined as the number of nats required to transmit the data: the sum of the n-step loss L^n(x) and the reconstruction loss L^r(x).
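The per-step cost behind the n-step loss is a KL divergence between the sender distribution (centred on the data) and the receiver distribution (centred on the model's prediction). A minimal numeric sketch with scalar Gaussians; the prediction values are hypothetical placeholders, not model outputs:

```python
import numpy as np

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for scalar Gaussians."""
    return 0.5 * (np.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# n-step loss sketch: sum per-step KLs between the sender (centred on the
# data x) and the receiver (centred on the model's current prediction).
x, alpha = 1.5, 0.2
preds = [0.0, 0.8, 1.3, 1.45]            # hypothetical predictions per step
L_n = sum(kl_gauss(x, 1.0 / alpha, p, 1.0 / alpha) for p in preds)
# with equal variances each term reduces to 0.5 * alpha * (x - p)**2
```

As the predictions approach the data, each KL term, and hence the transmission cost, shrinks toward zero.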
The choice of sigma_1^2 ensures that E|Y_j|^3 is finite for j > 0 and that E|Y_0|^3 < C for some constant C; together with the growth condition relating T_n^3 and S_n^3, this satisfies the Lyapunov condition of the central limit theorem used in the derivation.
The plot shows the distribution of the input mean for different values of the accuracy alpha: it is concentrated around the initial parameters for low alpha and around the input value for high alpha. Identity equations are used to derive the distribution. The accuracy schedule is derived so that the expected entropy of the input distribution decreases at a controlled rate over transmission time.
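The concentration behaviour the plot describes can be reproduced with a one-step conjugate Gaussian update; the N(0, 1) prior and all names below are illustrative assumptions:

```python
import numpy as np

def updated_mean_samples(x, alpha, n=10000, seed=0):
    """One conjugate Bayesian update of a N(0, 1) input mean given a sender
    sample y ~ N(x, 1/alpha); returns n samples of the updated mean.
    (The prior and all names are illustrative assumptions.)"""
    rng = np.random.default_rng(seed)
    y = x + rng.normal(scale=1.0 / np.sqrt(alpha), size=n)
    # (rho0 * mu0 + alpha * y) / (rho0 + alpha) with mu0 = 0, rho0 = 1
    return alpha * y / (1.0 + alpha)

low = updated_mean_samples(x=3.0, alpha=0.01)    # stays near the prior mean 0
high = updated_mean_samples(x=3.0, alpha=100.0)  # concentrates near x = 3
```

Low accuracy leaves the mean near its initial parameters; high accuracy pulls it to the input value, matching the plot's description.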
The excerpt discusses the use of Bayesian Flow Networks as a generative model. It explains that the noise in the model does not affect training and sets an upper limit on the chosen value for precision. The equations for the discrete-time loss and continuous-time loss are then derived.
The output distribution p_O(. | theta; t) models discretized data using a neural network. The network outputs Psi(theta, t) are combined with a Gaussian noise vector epsilon to generate the sample mean mu.
The input distribution for discrete data is a factorized categorical distribution over class indices, with a uniform prior. The output distribution for discrete data is determined by applying the softmax function to the network outputs; for binary data, a single network output per variable is passed through the logistic sigmoid.
The fraction of observations of class k in the count vector c can be used to deduce the value of x if the number of draws m is sufficiently large. As the accuracy alpha shrinks, the sender distribution p(c | x; alpha) becomes closer to uniform. Taking the limit of infinitely many draws yields a continuous Gaussian sender distribution over logits.
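In that infinite-draw limit the sender distribution over logits becomes Gaussian with mean alpha * (K * e_x - 1) and variance alpha * K per component; a sampling sketch (the mean/variance form follows the paper's construction, while the function name and test values are made up):

```python
import numpy as np

def sender_logits(x, K, alpha, rng):
    """Sample sender logits y ~ N( alpha * (K * e_x - 1), alpha * K * I )
    for a single discrete variable with class index x -- the Gaussian limit
    of the count-based sender. (Names and values here are illustrative.)"""
    e_x = np.eye(K)[x]                       # one-hot vector for class x
    mean = alpha * (K * e_x - 1.0)
    return mean + np.sqrt(alpha * K) * rng.normal(size=K)

rng = np.random.default_rng(0)
y = sender_logits(x=2, K=5, alpha=50.0, rng=rng)
# at high accuracy the largest logit identifies the true class
```

At small alpha the signal term shrinks relative to the noise, mirroring the statement that the sender distribution approaches uniform.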
Bayesian Flow Networks are a generative model that can be used for discrete data. The model involves a Bayesian update function that updates the probability distribution based on observed data. The update function is defined as h(theta_{i-1}, y, alpha), which applies Bayes' rule to the categorical parameters theta_{i-1} given the sender sample y at accuracy alpha.
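For categorical parameters this update amounts to multiplying the current probabilities by exp(y) and renormalizing; a minimal sketch (argument order and names are illustrative, and the accuracy alpha is taken as folded into the scale of the sender logits y):

```python
import numpy as np

def h(theta, y):
    """Bayesian update of factorized categorical parameters: multiply the
    current class probabilities by exp(y) and renormalize. (A sketch; in
    the paper the accuracy alpha enters through the scale of y.)"""
    p = np.exp(y) * theta
    return p / p.sum(axis=-1, keepdims=True)

theta = np.full(5, 0.2)                          # uniform prior, K = 5
y = np.array([-1.0, -1.0, 4.0, -1.0, -1.0])      # logits favouring class 2
theta = h(theta, y)                              # mass shifts to class 2
```

Because the update is multiplicative in probability space, repeated updates simply accumulate logits, which is what makes the sequence of observations summarizable by the current parameters alone.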
Bayesian Flow Networks (BFNs) were evaluated on generative benchmarks including CIFAR-10, dynamically binarized MNIST, and text8. The network was trained using the continuous-time loss L^infinity(x) and evaluated with the discrete-time loss L^n(x) for various numbers of steps n.
Direct optimization of the n-step loss would likely lead to reduced loss for low values of n. The input and output distributions for the MNIST dataset show that the network learns to correct for the uniform prior in the input distribution; the output distribution sharpens earlier in transmission than the input distribution.
Continuous loss performed better than discretized loss with 256 bins, while discretized training with 16 bins yielded better sample quality than training with 256 bins. Future work could involve training one Bayesian Flow Network (BFN) to model the lower bits of the data conditioned on the upper bits.
This text excerpt is a list of references to papers and articles related to Bayesian Flow Networks, covering subjects including continuous diffusion for categorical data, asymmetric numeral systems, introduction to probability and statistics, and generating sequences with recurrent neural networks.
Diffusion-LM improves controllable text generation. Decoupled weight decay regularization is discussed. Reflected diffusion models are explored. TESS: text-to-text self-conditioned simplex diffusion is introduced. The Large Text Compression Benchmark by Matt Mahoney is referenced.