Summary Aligning Large Language Models for Information Retrieval arxiv.org
9,650 words - PDF document
One Line
The RLCF framework trains large language models to generate more specific, context-aware responses for information retrieval tasks.
Key Points
- Large language models (LLMs) often generate responses that lack specificity in information retrieval (IR) tasks.
- The authors propose an unsupervised alignment framework called Reinforcement Learning from Contrastive Feedback (RLCF) to address this issue.
- RLCF enables LLMs to generate high-quality and context-specific responses that suit the needs of IR tasks.
- The RLCF framework involves constructing contrastive feedback by comparing each document with its similar documents.
- RLCF optimizes LLMs through reinforcement learning using the Proximal Policy Optimization algorithm.
- The experimental results show that RLCF effectively improves the performance of LLMs in an IR context.
- RLCF-optimized LLMs outperform vanilla LLMs in data augmentation and document summarization tasks.
- RLCF optimization aligns the capabilities of LLMs with the context of information retrieval, resulting in more specific summaries and queries for documents.
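The contrastive-feedback idea in the points above can be sketched as a minimal reward computation: given a group of similar documents and a response generated for one of them, score the response by the reciprocal rank of its source document when the group is ranked by similarity to the response. The similarity scores below are a hypothetical stand-in for the paper's retriever, not the actual implementation.

```python
def batched_mrr(response_scores, source_index):
    """Reciprocal rank of the source document when the documents in a
    group are ranked by their similarity score to the generated response.

    response_scores: similarity of the response to each document in the group.
    source_index: position of the document the response was generated for.
    """
    ranked = sorted(range(len(response_scores)),
                    key=lambda i: response_scores[i], reverse=True)
    rank = ranked.index(source_index) + 1  # 1-based rank
    return 1.0 / rank

# A specific response scores highest against its own source document...
print(batched_mrr([0.9, 0.4, 0.3], 0))  # 1.0
# ...while a generic response that matches a sibling better is penalized.
print(batched_mrr([0.5, 0.7, 0.3], 0))  # 0.5
```

A generic response that could describe any document in the group lands its source at a low rank and earns a small reward, which is exactly the behavior RLCF trains away.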
Summaries
17 word summary
The RLCF framework aligns large language models with information retrieval tasks by addressing their lack of specificity.
60 word summary
The Reinforcement Learning from Contrastive Feedback (RLCF) framework aligns large language models (LLMs) with information retrieval (IR) tasks. RLCF addresses the issue of LLMs lacking specificity by enabling them to generate context-specific responses suitable for IR. The framework constructs contrastive feedback from groups of similar documents, scored with the Batched-MRR reward function. Popular applications and the effectiveness of RLCF are demonstrated through experiments.
117 word summary
The Reinforcement Learning from Contrastive Feedback (RLCF) framework is introduced in this paper to align large language models (LLMs) with information retrieval (IR) tasks. RLCF addresses the issue of LLMs lacking specificity in their responses by enabling them to generate high-quality and context-specific responses suitable for IR. The framework involves constructing contrastive feedback by comparing each document with its similar documents using the Batched-MRR reward function. The limitations of LLMs in IR are discussed, and popular applications such as data augmentation and document summarization are presented. The RLCF framework optimizes LLMs through reinforcement learning using the Proximal Policy Optimization algorithm. Experiments demonstrate the effectiveness of RLCF in improving LLM performance in dense retrieval tasks and document summarization.
394 word summary
The paper introduces the Reinforcement Learning from Contrastive Feedback (RLCF) framework, which aligns large language models (LLMs) with information retrieval (IR) tasks. LLMs often lack specificity in their responses, limiting their effectiveness in IR. RLCF addresses this issue by enabling LLMs to generate high-quality and context-specific responses suitable for IR.
RLCF involves constructing contrastive feedback by comparing each document with its similar documents. The authors use the Batched-MRR reward function to teach LLMs to generate responses that capture fine-grained distinctions between documents. Experiments in data augmentation and summarization tasks demonstrate the effectiveness of RLCF in improving LLM performance in IR.
The limitations of LLMs in IR, including hallucination and slow knowledge update, are discussed. Misalignment between LLM capabilities and IR needs is identified as a key problem. Popular applications of LLMs in IR, such as data augmentation and document summarization, are presented.
The training pipeline of LLMs, including pre-training, supervised fine-tuning (SFT), and alignment stages, is discussed. However, the existing training pipeline fails to ensure the capability of LLMs to differentiate fine-grained distinctions in information.
To address this issue, the authors propose the RLCF framework, an unsupervised framework that utilizes contrastive feedback to align LLMs with IR context and capture fine-grained distinctions within documents. The framework includes contrastive data construction, RLCF optimization, and calculation of contrastive feedback.
The RLCF framework optimizes LLMs through reinforcement learning using the Proximal Policy Optimization (PPO) algorithm. The Batched-MRR is considered as the reward score, and the PPO algorithm maximizes this reward score. A penalty term is incorporated in the reward to prevent significant divergence from the vanilla LLM.
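The penalty described above can be illustrated with a toy reward: the Batched-MRR score minus a KL-style term that grows as the policy's token probabilities drift from the vanilla model's. A minimal sketch, assuming per-token log-probabilities are available from both models; the coefficient `beta` and the mean-gap KL approximation are illustrative choices, not values from the paper.

```python
def penalized_reward(batched_mrr, policy_logprobs, ref_logprobs, beta=0.1):
    """Toy RLCF-style reward: task reward minus a KL penalty that keeps
    the policy close to the vanilla (reference) LLM.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the
    generated response under the policy and reference models.
    """
    # Approximate the per-sequence KL as the mean log-probability gap.
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs)) \
        / len(policy_logprobs)
    return batched_mrr - beta * kl

# An unchanged policy incurs no penalty; a drifting one pays for divergence.
print(penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0]))  # 1.0
print(penalized_reward(1.0, [-0.5, -1.0], [-1.5, -2.0]))  # 0.9
```

The penalty keeps PPO from exploiting the reward with degenerate responses that the vanilla LLM would never produce.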
Experiments on various datasets demonstrate the effectiveness of RLCF in improving LLM performance in IR. The main contributions of the study are proposing the RLCF framework, introducing the Batched-MRR metric, and demonstrating the effectiveness of the framework through comprehensive experiments.
Experiments on document summarization tasks show that RLCF optimization significantly improves Rouge-diff scores on both Chinese and English datasets. Experiments on dense retrieval tasks show that RLCF-optimized LLMs consistently outperform vanilla LLMs in terms of evaluation metrics such as MRR@10, Recall@20, Recall@100, and NDCG@10. The effect of data augmentation increases with the number of parameters in LLMs.
In conclusion, the study introduces the RLCF framework, which leverages contrastive feedback to optimize LLMs. The experiments demonstrate the effectiveness of RLCF in improving LLM performance in dense retrieval tasks and document summarization.
566 word summary
The paper introduces the Reinforcement Learning from Contrastive Feedback (RLCF) framework, which aims to align large language models (LLMs) with the context of information retrieval (IR). LLMs have shown impressive capabilities in various tasks, but they often lack specificity in their responses, limiting their effectiveness in IR. RLCF addresses this issue by enabling LLMs to generate high-quality and context-specific responses that are suitable for IR tasks.
The RLCF framework involves constructing contrastive feedback by comparing each document with its similar documents. The authors use a reward function called Batched-MRR to teach LLMs to generate responses that capture the fine-grained information that distinguishes documents from their similar ones. The authors conducted experiments in data augmentation and summarization tasks to demonstrate the effectiveness of RLCF in improving the performance of LLMs in an IR context.
The paper discusses the limitations of LLMs in IR, including hallucination and slow knowledge update, which hinder their reliability as information accessing tools. The misalignment between the capabilities of LLMs and the needs of IR tasks is identified as a key problem. The paper presents examples of popular applications of LLMs in IR, namely data augmentation and document summarization.
The training pipeline of LLMs, which includes pre-training, supervised fine-tuning (SFT), and alignment stages, is discussed. However, the existing training pipeline fails to ensure the capability of LLMs to differentiate fine-grained distinctions in information.
To address this issue, the authors propose the RLCF framework, an unsupervised framework that utilizes contrastive feedback to align LLMs with IR context and capture fine-grained distinctions within documents. The framework includes contrastive data construction, RLCF optimization, and the calculation of contrastive feedback.
The RLCF framework optimizes LLMs through reinforcement learning using the Proximal Policy Optimization (PPO) algorithm. The Batched-MRR is considered as the reward score, and the PPO algorithm maximizes this reward score. A penalty term is also incorporated in the reward to prevent the policy model from producing responses that diverge significantly from the vanilla LLM.
Experiments conducted on various datasets demonstrate the effectiveness of RLCF in improving the performance of LLMs in IR context. The authors summarize their main contributions as proposing the RLCF framework, introducing the Batched-MRR metric, and demonstrating the effectiveness of the framework through comprehensive experiments.
The study evaluates document summarization for vanilla LLMs and RLCF-optimized LLMs on two datasets: LCSTS for Chinese and Gigaword for English. In the data augmentation experiments, RLCF-optimized LLMs consistently outperform vanilla LLMs in terms of NDCG@10, Recall@100, and Batched-MRR. In document summarization, RLCF optimization significantly improves Rouge-diff scores on both the Chinese and English datasets.
The study proposes a novel framework called RLCF that leverages contrastive feedback to optimize large language models. The experiments demonstrate the effectiveness of RLCF in improving the performance of LLMs in dense retrieval tasks and document summarization.
The experiments on dense retrieval tasks involve various datasets such as MS-MARCO, NQ, TriviaQA, and BEIR. The results show that RLCF-optimized LLMs consistently outperform vanilla LLMs in terms of MRR@10, Recall@20, Recall@100, NDCG@10, and other evaluation metrics. The study also analyzes the scaling law of LLMs on data augmentation for dense retrieval and finds that the effect of data augmentation increases with the number of parameters in LLMs.
The experiments on document summarization tasks involve the LCSTS and Gigaword datasets. The results show that RLCF optimization significantly improves the Rouge-diff scores on both datasets, indicating the effectiveness of RLCF in generating more specific and informative summaries.
1031 word summary
The paper discusses the alignment of large language models (LLMs) with the context of information retrieval (IR) through contrastive feedback. LLMs have shown remarkable capabilities in various tasks, but they often generate responses that lack specificity, limiting their effectiveness in IR. To address this issue, the authors propose an unsupervised alignment framework called Reinforcement Learning from Contrastive Feedback (RLCF). RLCF enables LLMs to generate high-quality and context-specific responses that suit the needs of IR tasks.
The RLCF framework involves constructing contrastive feedback by comparing each document with its similar documents. A reward function called Batched-MRR is used to teach LLMs to generate responses that capture the fine-grained information that distinguishes documents from their similar ones. The authors conducted experiments in two typical applications of LLMs in IR, namely data augmentation and summarization, to demonstrate the effectiveness of RLCF. The experimental results show that RLCF can effectively improve the performance of LLMs in an IR context.
The paper highlights the importance of IR in modern society and the potential of LLMs to support or empower IR systems. It discusses the limitations of LLMs in IR, including hallucination and slow knowledge update, which prevent them from serving as reliable information accessing tools. The misalignment between the capabilities of LLMs and the general needs of IR tasks, particularly the capability to differentiate fine-grained distinctions in documents, is identified as a key problem. The paper presents two example cases of popular applications of LLMs in IR: data augmentation and document summarization.
The training pipeline of LLMs, which includes pre-training, supervised fine-tuning (SFT), and alignment stages, is discussed. The pre-training stage equips LLMs with linguistic knowledge from a massive corpus, while the SFT stage focuses on training LLMs to support different types of instructions and prompts with supervised data. The alignment stage aims to align the capabilities of LLMs with environmental feedback. However, the existing training pipeline fails to ensure the capability of LLMs to differentiate fine-grained distinctions in information.
To align the capability of LLMs with IR context, the authors propose the RLCF framework. RLCF is a novel unsupervised framework that utilizes contrastive feedback to align LLMs with IR context and capture fine-grained distinctions within documents without supervision. The framework includes contrastive data construction, RLCF optimization, and the calculation of contrastive feedback. Contrastive feedback is obtained through the comparison of similar documents using a retriever.
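The contrastive data construction step above can be sketched with a stand-in retriever: embed every document and group each one with its nearest neighbors by cosine similarity. The toy 2-D embeddings and the group size below are assumptions for illustration; the paper uses an actual dense retriever over real corpora.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_contrastive_groups(embeddings, group_size=2):
    """For every document, form a group of itself plus its most similar
    neighbours, as measured by the (stand-in) retriever's embeddings."""
    groups = []
    for i, emb in enumerate(embeddings):
        neighbours = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(emb, embeddings[j]),
            reverse=True,
        )[: group_size - 1]
        groups.append([i] + neighbours)
    return groups

# Three toy document embeddings: docs 0 and 1 are near-duplicates.
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(build_contrastive_groups(embs))  # [[0, 1], [1, 0], [2, 1]]
```

Grouping near-duplicates is the point: only within a group of look-alikes does a reward signal emerge for the fine-grained details that tell the documents apart.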
The RLCF framework optimizes LLMs through reinforcement learning, specifically with the Proximal Policy Optimization (PPO) algorithm. The Batched-MRR is considered as the reward score for the entire response, and the PPO algorithm maximizes this reward score. A penalty term is also incorporated in the reward to prevent the policy model from producing responses that diverge significantly from the vanilla LLM.
The authors conducted experiments on BEIR, MS-MARCO, NQ, and TriviaQA datasets to evaluate the effectiveness of RLCF in data augmentation and document summarization tasks. The experimental results demonstrate the effectiveness of RLCF in improving the performance of LLMs in IR context. The authors summarize their main contributions as proposing the RLCF framework, introducing the Batched-MRR metric, and demonstrating the effectiveness of the framework through comprehensive experiments.
Overall, the paper presents a novel framework for aligning LLMs with the context of IR through contrastive feedback. The RLCF framework shows promise in improving the specificity and effectiveness of responses generated by LLMs in IR tasks such as data augmentation and document summarization.
The study focuses on the effectiveness of document summarization for vanilla large language models (LLMs) and LLMs optimized using Reinforcement Learning from Contrastive Feedback (RLCF). The experiments are conducted on two datasets: LCSTS for Chinese and Gigaword for English. LCSTS is a dataset for short text summarization in Chinese, while Gigaword is a large-scale collection of news articles and their summaries. The implementation uses Flan-T5 as the LLM backbone for the English datasets and BELLE-7B-2M for the Chinese dataset.
The experiments also cover data augmentation for dense retrieval tasks such as question answering, entity retrieval, and fact checking. The results show that RLCF-optimized LLMs consistently outperform vanilla LLMs in terms of NDCG@10, Recall@100, and Batched-MRR metrics. In the document summarization tasks, RLCF optimization significantly improves the Rouge-diff scores on both the Chinese and English datasets.
Human evaluation further confirms that summaries generated by RLCF-optimized LLMs are more specific and more effective at distinguishing similar documents than those of vanilla LLMs. The study concludes that RLCF optimization aligns the capabilities of LLMs with the context of information retrieval, resulting in more specific summaries and queries for documents.
The study proposes a novel framework called RLCF that leverages contrastive feedback to optimize large language models. The framework involves constructing groups of similar documents, feeding them into LLMs, obtaining responses, and calculating contrastive feedback using a reward function called Batched-MRR. The contrastive feedback is then used to optimize LLMs using the Proximal Policy Optimization algorithm. The experiments demonstrate the effectiveness of RLCF in improving the performance of LLMs in dense retrieval tasks and document summarization.
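The pipeline described above can be condensed into one training step: generate a response per document in a group, score each response by Batched-MRR within the group, and hand the (response, reward) pairs to the RL optimizer. The `generate`, `score`, and `ppo_update` callables below are hypothetical stand-ins for the LLM, the retriever, and the PPO step respectively, not the paper's implementation.

```python
def rlcf_step(group, generate, score, ppo_update):
    """One RLCF iteration over a group of similar documents (sketch):
    generate a response per document, score each by Batched-MRR within
    the group, then pass (response, reward) pairs to the RL optimizer."""
    responses = [generate(doc) for doc in group]
    rewards = []
    for i, resp in enumerate(responses):
        scores = [score(resp, doc) for doc in group]
        ranked = sorted(range(len(group)), key=lambda j: scores[j], reverse=True)
        rewards.append(1.0 / (ranked.index(i) + 1))  # Batched-MRR reward
    ppo_update(list(zip(responses, rewards)))
    return rewards

# Stand-ins: the "LLM" echoes the document, the "retriever" scores exact match.
docs = ["apple pie recipe", "apple tart recipe", "bike repair guide"]
rewards = rlcf_step(docs,
                    generate=lambda d: d,
                    score=lambda r, d: float(r == d),
                    ppo_update=lambda batch: None)
print(rewards)  # [1.0, 1.0, 1.0] -- each response ranks its source first
```

In the toy run every response perfectly identifies its source, so every reward is 1.0; with a real LLM, generic responses earn fractional rewards and PPO pushes the model toward the distinguishing details.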
The experiments on dense retrieval tasks involve various datasets such as MS-MARCO, NQ, TriviaQA, and BEIR. The results show that RLCF-optimized LLMs consistently outperform vanilla LLMs in terms of MRR@10, Recall@20, Recall@100, NDCG@10, and other evaluation metrics. The study also analyzes the scaling law of LLMs on data augmentation for dense retrieval and finds that the effect of data augmentation increases with the number of parameters in LLMs.
The experiments on document summarization tasks involve LCSTS and Gigaword datasets. The results show that RLCF optimization significantly improves the Rouge-diff scores on both datasets, indicating the effectiveness of RLCF in generating more specific and informative summaries. Human evaluation further confirms the superiority of summaries generated by RLCF-optimized LLMs over vanilla LLMs.
The study concludes by suggesting future directions for research, such as exploring other domains for RLCF optimization and incorporating explicit knowledge in pre-trained language models for passage re-ranking. The references provide additional resources for further reading on related topics.
Overall, the study demonstrates the effectiveness of RLCF optimization in aligning the capabilities of large language models with the context of information retrieval. The experiments on dense retrieval and document summarization tasks show significant improvements in performance when using RLCF-optimized LLMs compared to vanilla LLMs.