Summary: Harnessing Instruction-Tuned LLM in End-to-End Speech Recognition (arxiv.org)
One Line
The paper describes a method that integrates an instruction-tuned large language model (LLM) into an end-to-end automatic speech recognition (ASR) system, using a two-stage training approach that leverages the LLM as a zero-shot grammatical error corrector and improves performance, particularly on written-style tasks.
Key Points
- Leveraging the zero-shot power of instruction-tuned large language models (LLMs) to enhance end-to-end automatic speech recognition (ASR) performance
- Integrating an instruction-tuned LLM, specifically Llama2, into a hybrid connectionist temporal classification (CTC) and attention-based ASR architecture
- Using the LLM as a front-end for the decoder network to capitalize on its potential as a zero-shot grammatical error correction model
- Evaluating the proposed model on various English ASR tasks, including LibriSpeech, TED-LIUM2, and CoVoST2, and demonstrating consistent performance improvements over the baseline hybrid CTC-attention model
- Conducting ablation studies to assess the importance of the LLM integration and the prompt design, and comparing the proposed approach with other methods for integrating LLMs into ASR
Summaries
20 word summary
Integrates instruction-tuned LLMs into end-to-end ASR, leveraging LLMs for grammatical error correction. Two-stage approach improves performance, especially on written-style tasks.
47 word summary
This work integrates instruction-tuned LLMs into end-to-end ASR, leveraging LLMs for grammatical error correction. The two-stage approach trains a baseline ASR model, then a new decoder with the LLM as front-end. Experiments show consistent improvements, especially on written-style tasks, though uncommon words in audiobooks required LLM-based rescoring.
126 word summary
This work explores integrating instruction-tuned large language models (LLMs) into end-to-end automatic speech recognition (ASR). The key idea is to leverage the LLM's potential as a zero-shot grammatical error correction model to enhance ASR performance. The proposed approach involves a two-stage training process, where a baseline ASR model is first trained, and then a new decoder network is trained with the LLM as the front-end. The model was evaluated on various English ASR tasks, showing consistent improvements over the baseline, especially on tasks with written-style text. However, the model struggled with uncommon words in audiobooks, which was mitigated by combining it with LLM-based rescoring. Ablation studies confirmed the effectiveness of the LLM integration and the importance of the prompt in maximizing the LLM's zero-shot learning capability.
331 word summary
This work explores the integration of instruction-tuned large language models (LLMs), specifically Llama2, into an end-to-end automatic speech recognition (ASR) framework. The key idea is to leverage the LLM's potential as a zero-shot grammatical error correction model to enhance ASR performance.
The proposed approach involves a two-stage training process. First, a baseline hybrid connectionist temporal classification (CTC) and attention-based end-to-end ASR model is trained. Then, a new decoder network is trained from scratch, using the pre-trained encoder and CTC networks, with the LLM as the front-end. The objective function is defined as the negative log-likelihood of the joint probability of the target sequence and the ASR hypothesis, with the intractable marginalization over hypotheses approximated through sampling.
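One plausible way to write this objective down, with notation assumed for illustration (X for the acoustic input, W for the target transcript, Ŵ for a CTC hypothesis), replaces the intractable sum over hypotheses with a sampled hypothesis:

```latex
% Sketch of the sampling-based objective described above; the notation and the
% single-sample approximation are assumptions, not the paper's exact formula.
\mathcal{L} = -\log \sum_{\hat{W}} p(W, \hat{W} \mid X)
            = -\log \sum_{\hat{W}} p(W \mid \hat{W}, X)\, p(\hat{W} \mid X)
  \;\approx\; -\log p\bigl(W \mid \hat{W}^{*}, X\bigr),
  \qquad \hat{W}^{*} \sim p(\hat{W} \mid X).
```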
During inference, the best ASR hypothesis is obtained using the encoder and CTC decoding, and then joint CTC and attention decoding is performed with the LLM-enhanced decoder.
The proposed model was evaluated on various English ASR tasks, including LibriSpeech (LS), TED-LIUM2 (TED2), and CoVoST2 (CV2), and compared to the baseline hybrid CTC-attention model. The results demonstrate that the integration of the instruction-tuned LLM consistently outperformed the baseline model on tasks other than LS-960, with the most significant gains observed on the CV2 task. This is attributed to the LLM's ability to extract more precise linguistic information from the unnormalized written-style text in CV2.
In the LS-960 task, the proposed model showed limited improvements, which is attributed to the LLM's tendency to struggle with recognizing "uncommon" words, such as character names in audiobooks. This issue was mitigated to some extent by combining the proposed model with LLM-based rescoring.
Ablation studies confirmed the effectiveness of the LLM integration and highlighted the crucial role of the prompt in maximizing the LLM's zero-shot learning capability. The proposed approach was also compared to other methods for integrating LLMs into ASR, demonstrating competitive or superior results.
Overall, this work demonstrates the promising potential of harnessing the versatile linguistic knowledge embedded within instruction-tuned LLMs to advance the field of end-to-end speech recognition.
643 word summary
Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Models in End-to-End Speech Recognition
Introduction
Large language models (LLMs) have demonstrated remarkable capabilities across a diverse range of natural language processing tasks, often achieving impressive performance through few-shot or zero-shot learning. This highlights the potential of LLMs to solve downstream tasks efficiently, leveraging their vast linguistic knowledge.
In this work, we explore the application of instruction-tuned LLMs, specifically Llama2, to enhance end-to-end automatic speech recognition (ASR) performance. We propose a novel integration of the LLM into a hybrid connectionist temporal classification (CTC) and attention-based ASR architecture, where the LLM serves as a front-end for the decoder network.
Methodology
The key idea is to capitalize on the LLM's potential as a zero-shot grammatical error correction model. We first obtain an initial ASR hypothesis from the encoder output via CTC decoding. This hypothesis is then fed into the LLM, along with an explicit instruction to guide the model towards correcting grammatical errors. The decoder network subsequently takes the LLM embeddings as input, incorporating acoustic information from the encoder, to generate the final output sequence.
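As a rough illustration of this wiring, the sketch below prompts a frozen Llama2 with the CTC hypothesis, projects the resulting embeddings into a Transformer decoder, and cross-attends to the acoustic encoder output. It is not the authors' implementation: the model name, prompt wording, layer sizes, and the exact way the LLM embeddings enter the decoder are all assumptions.

```python
# Minimal sketch of the LLM-as-front-end decoder described above (assumptions
# throughout; not the paper's code).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class LLMFrontEndDecoder(nn.Module):
    def __init__(self, llm_name="meta-llama/Llama-2-7b-hf",
                 enc_dim=512, dec_dim=512, vocab_size=5000):
        super().__init__()
        self.tok = AutoTokenizer.from_pretrained(llm_name)
        self.llm = AutoModel.from_pretrained(llm_name)        # frozen front-end
        for p in self.llm.parameters():
            p.requires_grad = False
        self.llm_proj = nn.Linear(self.llm.config.hidden_size, dec_dim)
        self.enc_proj = nn.Linear(enc_dim, dec_dim)
        self.tok_emb = nn.Embedding(vocab_size, dec_dim)
        layer = nn.TransformerDecoderLayer(d_model=dec_dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.out = nn.Linear(dec_dim, vocab_size)

    def forward(self, ctc_hyp: str, enc_out: torch.Tensor,
                prev_tokens: torch.Tensor) -> torch.Tensor:
        # 1) Wrap the CTC hypothesis in a grammatical-error-correction prompt
        #    (the exact instruction wording is an assumption).
        prompt = f"Correct the grammatical errors in this transcript: {ctc_hyp}"
        ids = self.tok(prompt, return_tensors="pt").input_ids
        # 2) Contextual embeddings from the frozen instruction-tuned LLM.
        with torch.no_grad():
            llm_emb = self.llm(input_ids=ids).last_hidden_state   # (1, T_p, H)
        # 3) Use the LLM embeddings as a prefix to the decoder input and
        #    cross-attend to the acoustic encoder output.
        dec_in = torch.cat([self.llm_proj(llm_emb),
                            self.tok_emb(prev_tokens)], dim=1)
        dec_out = self.decoder(tgt=dec_in, memory=self.enc_proj(enc_out))
        return self.out(dec_out[:, -prev_tokens.size(1):])        # token logits
```

A real decoder would also apply a causal mask over the generated tokens; the sketch omits it for brevity.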
The proposed model is trained in two stages: 1) training a baseline hybrid CTC-attention-based end-to-end ASR model, and 2) training a new decoder network from scratch, using the pre-trained encoder and CTC networks, with the LLM as the front-end. The objective function is defined as the negative log-likelihood of the joint probability of the target sequence and the ASR hypothesis, where the intractable marginalization over hypotheses is approximated through sampling.
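A stage-2 training loop consistent with this description might look roughly as follows; `build_hybrid_asr`, `train_loader`, `sample_ctc_hypothesis`, and `shift_right` are hypothetical helpers, and the optimizer settings are assumptions.

```python
# Rough sketch of the two-stage recipe described above (assumed structure,
# not the authors' training code).
import torch
import torch.nn.functional as F

# Stage 1: train a standard hybrid CTC/attention ASR model (not shown).
asr = build_hybrid_asr()  # hypothetical: returns a trained encoder + CTC head

# Stage 2: freeze the pre-trained encoder and CTC head, and train a new
# decoder that consumes LLM embeddings of a sampled CTC hypothesis.
for module in (asr.encoder, asr.ctc):
    for p in module.parameters():
        p.requires_grad = False

decoder = LLMFrontEndDecoder()        # from the sketch above
opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)

for speech, target in train_loader:                  # hypothetical data loader
    enc_out = asr.encoder(speech)
    hyp = sample_ctc_hypothesis(asr.ctc, enc_out)    # hypothetical sampler
    logits = decoder(hyp, enc_out, shift_right(target))
    # Negative log-likelihood of the target given the sampled hypothesis.
    loss = F.cross_entropy(logits.transpose(1, 2), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```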
During inference, the most probable sequence is estimated by first obtaining the best ASR hypothesis using the encoder and CTC decoding, and then performing joint CTC and attention decoding with the LLM-enhanced decoder.
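A simplified, greedy version of this procedure is sketched below; the paper uses joint CTC and attention beam search, and the `asr.ctc_logits` and `asr.ids_to_text` attributes are hypothetical.

```python
# Greedy approximation of the inference procedure described above
# (illustrative only).
import torch


def collapse_ctc(ids, blank=0):
    # Standard CTC best-path collapse: merge repeated labels, drop blanks.
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out


@torch.no_grad()
def decode(asr, llm_decoder, speech, sos_id=1, eos_id=2, max_len=200):
    enc_out = asr.encoder(speech)                        # (1, T, enc_dim)
    # Step 1: initial hypothesis via CTC decoding of the encoder output.
    frame_ids = asr.ctc_logits(enc_out).argmax(-1)[0].tolist()
    hyp = asr.ids_to_text(collapse_ctc(frame_ids))       # hypothetical helper
    # Step 2: generate the final sequence with the LLM-enhanced decoder
    # (greedy here; the paper interpolates CTC scores during beam search).
    tokens = torch.tensor([[sos_id]])
    for _ in range(max_len):
        logits = llm_decoder(hyp, enc_out, tokens)       # (1, L, vocab)
        next_id = logits[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return tokens
```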
Experiments and Results
We evaluated the proposed model on various English ASR tasks, including LibriSpeech (LS), TED-LIUM2 (TED2), and CoVoST2 (CV2), and compared its performance to the baseline hybrid CTC-attention model.
The results demonstrate that the proposed integration of the instruction-tuned LLM consistently outperformed the baseline model on tasks other than LS-960, with the most significant gains observed on the CV2 task. This can be attributed to the use of unnormalized written-style text in CV2, which enabled the LLM to extract more precise linguistic information.
In the LS-960 task, the proposed model showed limited improvements, which we attribute to the LLM's tendency to struggle with recognizing "uncommon" words, such as character names in audiobooks. This issue was mitigated to some extent by combining the proposed model with LLM-based rescoring, which further enhanced the performance across all tasks.
Ablation studies were conducted to assess the importance of the LLM integration and the prompt design. The results confirmed the effectiveness of the LLM in improving ASR performance and highlighted the crucial role of the prompt in maximizing the LLM's zero-shot learning capability.
Additionally, we compared the proposed approach with other methods for integrating LLMs into ASR, such as shallow fusion and rescoring. While these techniques also led to notable performance improvements, the proposed model's direct integration of the LLM into the decoder network demonstrated competitive or superior results.
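For contrast, a toy n-best rescoring pass with a causal LLM could be sketched as follows; the model name and interpolation weight are assumptions rather than the paper's recipe.

```python
# Illustrative n-best LLM rescoring (not the paper's exact setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
lm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").eval()


@torch.no_grad()
def llm_logprob(text: str) -> float:
    # Length-normalized token log-likelihood of the hypothesis under the LLM.
    ids = tok(text, return_tensors="pt").input_ids
    return -lm(ids, labels=ids).loss.item()


def rescore(nbest, lm_weight=0.3):
    # nbest: list of (hypothesis_text, asr_score) pairs from beam search.
    return max(nbest, key=lambda h: (1 - lm_weight) * h[1]
                                    + lm_weight * llm_logprob(h[0]))
```

Shallow fusion, by contrast, interpolates the language model scores at every step of beam search rather than re-ranking a finished n-best list.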
Conclusion and Future Work
In this work, we have presented a novel integration of an instruction-tuned LLM, Llama2, into an end-to-end ASR framework. By leveraging the LLM's zero-shot grammatical error correction capability, we were able to enhance the ASR performance across various tasks, particularly in domains where the linguistic information is crucial.
Future research directions may include exploring the application of the proposed model to other speech-related tasks, such as speech translation, and investigating methods to further address the LLM's limitations in handling uncommon words. Additionally, exploring alternative prompt designs and investigating the potential of few-shot prompting or lightweight fine-tuning of the LLM could lead to further performance improvements.
Overall, this work demonstrates the promising potential of harnessing the versatile linguistic knowledge embedded within instruction-tuned LLMs to advance the field of end-to-end speech recognition.