Summary: AudioPaLM Large Language Model for Speech (arxiv.org)
15,140-word PDF document
One Line
AudioPaLM is a large language model that can process and generate both speech and text; this summary covers its training data, the models evaluated, its benchmark performance, and the project's contributors.
Key Points
- The training data comprises ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) data, with the number of hours of training audio reported per language.
- Two models are compared, AudioPaLM-2 8B and Whisper 1.5B, with average BLEU scores reported for each language.
- Whisper is evaluated on 82 languages, and AudioPaLM's training includes audio data.
- The individuals who contributed to the project are acknowledged.
- The paper covers a broad range of topics in speech recognition and translation.
- AudioPaLM demonstrates state-of-the-art results on speech translation benchmarks and performs competitively on ASR and S2ST tasks.
- Experiments evaluate the model across different tasks, tokenization schemes, and baselines; a tokenization sketch follows this list.
- AudioPaLM-2 substantially outperforms the original AudioPaLM model.
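The discrete audio tokens referenced throughout the paper are produced by quantizing self-supervised speech embeddings. Below is a minimal sketch of that idea, assuming frame-level embeddings from an encoder such as w2v-BERT or USM (random placeholders here) and an illustrative 64-entry k-means codebook; the real systems use far larger codebooks.

```python
# Minimal sketch of discrete audio tokenization: self-supervised speech
# embeddings are clustered with k-means, and each frame is replaced by the
# id of its nearest centroid. The random embeddings below are placeholders
# standing in for real w2v-BERT / USM features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Placeholder for frame-level embeddings from a pretrained speech encoder;
# shape = (num_frames, embedding_dim).
frame_embeddings = rng.normal(size=(1000, 768))

# Fit a small, illustrative codebook (real codebooks are much larger).
codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(frame_embeddings)

def tokenize_audio(embeddings: np.ndarray) -> np.ndarray:
    """Map each embedding frame to the id of its nearest k-means centroid."""
    return codebook.predict(embeddings)

audio_tokens = tokenize_audio(frame_embeddings)
print(audio_tokens[:20])  # one discrete token per audio frame
```

Each frame of audio thus becomes a single integer id that a language model can consume alongside ordinary text tokens.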
Summary
1,120-word summary
The excerpt covers the training data, performance metrics, and models used in the AudioPaLM project, and acknowledges the individuals who contributed to it.
The training data includes ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) data, with the number of hours of training audio reported for each language.
Two models, AudioPaLM-2 8B and Whisper 1.5B, are compared; an average BLEU score is given for each language to indicate translation quality.
The excerpt also notes that Whisper is evaluated on 82 languages and that AudioPaLM's training includes audio data.
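The per-language BLEU figures mentioned above can be reproduced in spirit with the sacrebleu package. A minimal sketch follows; the hypothesis and reference sentences are invented placeholders, not data from the paper.

```python
# Hedged sketch: computing corpus-level BLEU with sacrebleu, the kind of
# metric reported per language for AudioPaLM-2 8B and Whisper 1.5B.
# The sentences are invented placeholders, not data from the paper.
import sacrebleu

hypotheses = ["the cat sat on the mat", "it is raining today"]  # model outputs
references = [["the cat sat on the mat", "it rains today"]]     # one reference stream

score = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {score.score:.2f}")
```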
The paper acknowledges the individuals who contributed to the project. As a resource, it references a wide range of work on speech recognition and translation, covering topics such as large-scale weak supervision, robust speech recognition, transferable visual models, pre-trained word embeddings, and multilingual datasets. It also cites work on language models for audio generation and compression, direct speech-to-speech translation, and self-supervised speech representation learning, including specific models and frameworks such as HuBERT, Maestro, and WavLM. The references span conferences such as ACL, ICASSP, and NeurIPS, as well as workshops and shared-task papers on machine translation, and they highlight language resources and corpora such as Common Voice.
AudioPaLM itself is a large language model for speech that can process and generate both speech and text. It has been trained on a variety of datasets and evaluated on tasks such as automatic speech recognition (ASR) and speech-to-speech translation (S2ST). The results show that increasing the amount of training data improves performance on these tasks, and that adding speech-to-speech translation tasks to the training mixture enhances the model's ability to generate audio tokens, at the cost of a slight decrease on text-output tasks. Overall, AudioPaLM demonstrates state-of-the-art results on speech translation benchmarks and performs competitively on ASR and S2ST tasks.
Several experiments evaluate the model: training on different task mixtures, varying the tokenization scheme, and comparing against baselines. Training with combined tasks improved performance on AST but slightly reduced performance on ASR. The choice of tokenization scheme had a significant impact, with USM-v2 tokens performing best. Finetuning a pretrained checkpoint improved results over training from scratch, and adding ASR tasks to the training data improved AST performance. In speech-to-speech translation, the model achieved higher audio quality and better voice similarity than the baseline Translatotron 2 system, as measured by both objective and subjective evaluations, and its text translation quality increased significantly for both AST-observed and ASR-observed languages.
AudioPaLM-2 significantly improves on the original AudioPaLM and outperforms Whisper in speech-to-text translation for languages observed in training, although it performs less well in zero-shot translation for languages where it lacks AST data. The number of hours of training data varies across models. Results for the two proposed AST models, AudioPaLM and AudioPaLM-2, are presented in Table 3; BLEU scores are unavailable for certain languages.
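One detail behind "finetuning a pretrained checkpoint" is that the text model's vocabulary must be extended to cover the new audio tokens. A minimal sketch, assuming an illustrative 32,000-entry text vocabulary and a 1,024-token audio codebook (both sizes are placeholders, not figures from the paper): the pretrained text embeddings are kept, and new rows for audio tokens are appended and trained from scratch.

```python
# Sketch of the vocabulary-extension step implied by finetuning a pretrained
# text checkpoint: keep the text token-embedding matrix and append freshly
# initialized rows for the discrete audio tokens. Shapes and names here are
# illustrative, not taken from the paper's code.
import numpy as np

text_vocab_size, d_model = 32_000, 1024
num_audio_tokens = 1024  # size of the audio-token codebook (assumption)

# Pretend this came from the pretrained text checkpoint.
text_embeddings = np.random.normal(scale=0.02, size=(text_vocab_size, d_model))

# New, randomly initialized embeddings for the audio tokens.
audio_embeddings = np.random.normal(scale=0.02, size=(num_audio_tokens, d_model))

# The combined table lets one decoder-only model read and emit both modalities.
combined_embeddings = np.concatenate([text_embeddings, audio_embeddings], axis=0)
assert combined_embeddings.shape == (text_vocab_size + num_audio_tokens, d_model)
```

Because the text rows keep their pretrained values, the model retains its text knowledge while learning the new audio modality.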
The models are evaluated on the FLEURS dataset, which pairs speech utterances with their transcripts in multiple languages, and are trained on datasets including VoxPopuli, CoVoST2, and Conversational EsEn; the training datasets and their respective hours of audio are listed in Table 1. The training setup finetunes with the Adafactor optimizer and applies loss masking on the inputs, and mixtures of datasets are used to improve performance. The models are trained on ASR, AST, and S2ST tasks and evaluated with BLEU, word error rate (WER), and character error rate (CER).
The model can perform transcription, translation, and speech synthesis. Both combined and direct tasks are considered: the model either maps directly from input to output or also outputs intermediate steps. Task tags prefix each example to specify the task and the input and output languages. For ASR, the model takes tokenized audio as input together with such a tag, and it supports ASR, text-to-text machine translation (MT), text-to-speech synthesis (TTS), speech-to-speech translation (S2ST), and automatic speech translation (AST).
Architecturally, AudioPaLM is a decoder-only Transformer that can generate both text and audio tokens; the audio tokens can be converted back to raw audio using different decoding methods. Audio is converted into discrete tokens using pretrained models such as w2v-BERT or the Universal Speech Model (USM), and the model learns a mapping between text and audio tokens, giving it a multimodal representation of both. It can be finetuned on a mixture of speech and text tasks to improve performance.
In sum, AudioPaLM combines text-based and speech-based language models in a unified architecture that generates both speech and text. It leverages the capabilities of pretrained text models and can be initialized with their weights, exhibits zero-shot capabilities, and outperforms existing systems on speech translation tasks. It can transfer voices across languages and preserves paralinguistic information such as speaker identity and intonation. The paper provides experimental results and ablations to evaluate the model's performance.
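Two training details mentioned above, task tags and loss masking on the inputs, compose naturally. A minimal sketch follows, with invented token ids and an invented tag format ("[ASR French]"), not the paper's exact scheme: the tag and input tokens are excluded from the loss so that the model is trained only on the target tokens.

```python
# Sketch of task tags plus input loss masking in a decoder-only setup.
# Token ids and the tag format are illustrative guesses based on the
# summary, not the paper's exact scheme.
from dataclasses import dataclass

@dataclass
class Example:
    tokens: list[int]     # full decoder sequence: tag + input + target
    loss_mask: list[int]  # 1 where the token contributes to the loss

def build_example(tag_ids: list[int], input_ids: list[int], target_ids: list[int]) -> Example:
    tokens = tag_ids + input_ids + target_ids
    # Mask out the task tag and the (audio or text) input; train only on targets.
    loss_mask = [0] * (len(tag_ids) + len(input_ids)) + [1] * len(target_ids)
    return Example(tokens, loss_mask)

# Hypothetical ids: tag tokens for "[ASR French]", audio-token inputs, text targets.
ex = build_example(tag_ids=[5001, 5002], input_ids=[9000, 9001, 9002], target_ids=[17, 42, 7])
print(ex.tokens)     # [5001, 5002, 9000, 9001, 9002, 17, 42, 7]
print(ex.loss_mask)  # [0, 0, 0, 0, 0, 1, 1, 1]
```

The same sequence layout covers every task in the mixture (ASR, MT, TTS, AST, S2ST); only the tag and the modality of the input and target tokens change.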