Summary Enhancing LLM Completions with DERA arxiv.org
18,498 words - PDF document
One Line
The paper introduces DERA, a dialog-enabled resolving-agent framework in which two GPT-4 agents iteratively refine an initial output; it is applied to three clinical tasks (medical conversation summarization, care plan generation, and medical question answering), and its appendices reproduce the prompts used, including care plan writing instructions and prompts for rewriting multiple-choice questions as open-ended ones.
Key Points
- Large language models (LLMs) can be improved with dialog-enabled resolving agents (DERA), which let models exchange feedback and iteratively refine their output.
- DERA is evaluated on three different types of clinical tasks and outperforms base GPT-4 performance.
- DERA is a chat framework in which two agents, a Researcher and a Decider, converse to produce high-quality output for natural language generation tasks.
- For medical question answering, DERA is tested on questions rewritten into open-ended form, with free-form answers generated and then mapped back to the answer options for scoring.
- For summarization, the agents role-play a dialogue between two people, one writing a summary of a medical encounter and the other verifying it for correctness.
Summaries
205 word summary
This document describes the use of DERA (dialog-enabled resolving agents) to generate care plans in medical conversations, for example for managing urinary tract infections (UTIs), and to improve care plan quality. The DERA dialog produces a final care plan that includes guidance on managing UTIs, medication interactions, referrals, lifestyle changes, and tests. The primary challenge in applying DERA to medical question answering datasets is mapping free-form responses to a discrete set of options. Outputs are evaluated with metrics such as concept verification for summarization. Instructions for writing a care plan are provided, organized into five sections: Medications, Referrals, Tests, Lifestyle, and Supportive Care. The document also provides prompts for extracting and verifying medical concepts from a summary section, used to compute the GPT-F1 metric, along with the default prompts used to enhance LLM completions: prompts for rewriting questions, characterizing skin lesions, and determining the next step in patient management, plus prompts and guidelines for a medical question-answering research task, including question-answering prompts and their output format.
537 word summary
The article presents the default prompts used with the LLM to improve completions, including prompts for rewriting questions, characterizing skin lesions, and determining the appropriate next step in management for a patient. It also provides prompts and guidelines for a medical question-answering research task and discusses the use of DERA to enhance LLM completions, including question-answering prompts and their output format.
Instructions for writing a care plan are provided, which should be organized into five sections: Medications, Referrals, Tests, Lifestyle, and Supportive Care. The document provides prompts for extracting and verifying medical concepts from a summary section used to compute the GPT-F1 metric.
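The five-section care plan structure could be represented with a simple container like the following sketch; the field names mirror the section names from the Care Plan Writing Instructions, while the rendering format and example entries are assumptions.

```python
# Illustrative container for the five care plan sections named in the
# Care Plan Writing Instructions; the plain-text rendering is assumed.
from dataclasses import dataclass, field

@dataclass
class CarePlan:
    medications: list[str] = field(default_factory=list)
    referrals: list[str] = field(default_factory=list)
    tests: list[str] = field(default_factory=list)
    lifestyle: list[str] = field(default_factory=list)
    supportive_care: list[str] = field(default_factory=list)

    def render(self) -> str:
        sections = [
            ("Medications", self.medications),
            ("Referrals", self.referrals),
            ("Tests", self.tests),
            ("Lifestyle", self.lifestyle),
            ("Supportive Care", self.supportive_care),
        ]
        return "\n".join(
            f"{name}:\n" + "\n".join(f"- {item}" for item in items)
            for name, items in sections
        )

# Hypothetical entries drawn from the UTI example discussed in the document.
plan = CarePlan(
    medications=["Prescribe an appropriate antibiotic for the UTI"],
    tests=["Urinalysis", "Urine culture"],
)
```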
The DERA system generates care plans for patients from the dialogue of their discussion with a physician. It aims to improve care plan quality by ensuring plans are congruent with the Care Plan Writing Instructions and grounded in information from the dialogue.
In an example medical encounter, the DERA dialog generates a scratchpad of edits to improve the initial summary, including correcting the pertinent negatives to state that the patient has not been vaccinated against COVID-19 and modifying the pertinent unknowns to remove the patient's uncertainty about their condition. More broadly, DERA (dialog-enabled resolving agents) enhances LLM completions by having agents discuss and iteratively correct a model's initial output; the outputs are evaluated with metrics such as concept verification for summaries, and the appendix question-answering examples show DERA working through medical cases and treatment recommendations.
The article discusses the use of DERA to enhance LLM completions with a focus on GPT-4 models. Various studies related to natural language processing and medical informatics are discussed, including semantic similarity measurement, language models for generating medical conversations, summarization, text generation, and mapping biomedical text. The document also includes examples of chats between Decider and Researcher agents discussing study design and a medical case study.
The primary challenge in using DERA for medical question answering datasets is mapping free-form responses to a discrete set of options. The evaluation metrics involve generating multiple-choice prompts. DERA provides a dialog between Decider and Researcher agents to improve the accuracy of answers.
The article discusses the use of DERA to generate care plans for patients. The final care plan includes guidance on managing UTIs, medication interactions, referrals, lifestyle changes, and tests; it emphasizes the importance of seeking timely medical care and offers reassurance and support for the patient's mental health. The document presents suggestions for managing urinary tract infections (UTIs) in a patient's care plan, including considering previous antibiotic exposure and potential drug interactions; it recommends urinalysis and urine culture tests and education on safe sexual practices, with referrals recommended if symptoms worsen or do not improve. The study introduces DERA as a way to improve the quality of such care plans. Prior prompt-based strategies, such as chaining, prompt the model to think through a problem as an expert would approach it, which improves performance on some tasks; DERA instead uses dialog between agents. DERA is evaluated on three different types of clinical tasks, outperforms base GPT-4 performance, and shows potential for reducing errors in conversation summaries.
1225 word summary
Large language models (LLMs) can be improved with dialog-enabled resolving agents (DERA), which allow models to communicate feedback and iteratively improve output. Prior dialog- and prompt-based strategies, such as chaining, prompt the model to think through a problem as an expert would approach it, leading to improved performance in some tasks. DERA is evaluated on three different types of clinical tasks and outperforms base GPT-4 performance. DERA is a chat framework that uses two agents, a Researcher and a Decider, to create high-quality output for natural language generation tasks. The approach is well-suited for longer-form generation tasks, such as medical conversation summarization, and has potential for improving GPT-4 performance and reducing errors in conversation summaries. The dataset includes 500 medical encounters that cover a wide variety of common presentations in telehealth. DERA can generate doctor-facing suggestions for care plan generation. Physician-expert evaluations show that DERA summaries capture far more clinical information than initial GPT-4 generated summaries. The DERA system generates physician-facing suggestions for care management plans, which can resolve errors like hallucinations and omissions. The study evaluates the effectiveness of DERA on a dataset of 50 medical encounters with prompts specific to care plan generation, and concludes that DERA can enhance LLM completions and improve the quality of care plans.
The Enhancing LLM Completions with DERA document suggests improvements to a patient's care plan for managing urinary tract infections (UTIs). The Medications section recommends considering the patient's previous antibiotic exposure, allergies, and potential drug interactions with BuSpar, LaMICtal, and Zoloft. The Supportive Care section should offer guidance on managing UTI symptoms and on potential medication interactions. The Lifestyle section should include a recommendation to educate the patient on safe sexual practices and the importance of using a barrier method during sex to reduce the risk of UTIs. The Tests section recommends performing a urinalysis and urine culture to confirm the presence of a UTI and identify the appropriate antibiotic for treatment. The Initial Care Plan includes prescribing an appropriate antibiotic, providing education on recognizing the signs and symptoms of recurrent UTIs, and encouraging proper hygiene practices. Referrals are recommended if symptoms worsen or do not improve after a few days of antibiotic treatment.
The article discusses the use of the DERA platform to generate care plans for patients. The final care plan includes guidance on managing UTIs, medication interactions, referrals, lifestyle changes, and tests. The platform emphasizes the importance of seeking timely medical care and offers reassurance and support for patients' mental health. The article discusses the use of DERA to enhance LLM completions by altering questions and generating free-form answers for medical question answering datasets. The primary challenge is mapping free-form responses to a discrete set of options, and the evaluation metrics involve generating multiple-choice prompts. DERA provides a dialog between Decider and Researcher agents to improve the accuracy of answers. In the open-ended setting, DERA shows small to no improvement over GPT-4, while in the multiple-choice setting, GPT-4 outperforms the best previously published approaches. A framework called DERA is introduced for improving the output of large language model-powered question-answering tasks. The authors suggest that additional work in methods that automatically evaluate the output of large language model-powered tools is needed. This document discusses the use of zero-shot clinical information extractors and OpenAI's API to enhance LLM completions, with a focus on GPT-4 models. The chat-based format of DERA allows for increased interpretability and is well-suited for longer generation tasks. Various studies related to natural language processing and medical informatics are discussed, including semantic similarity measurement, language models for generating medical conversations, summarization, text generation, and mapping biomedical text. A list of research papers related to using language models for text generation and summarization is provided. The document also includes examples of chats between Decider and Researcher agents discussing study design and a medical case study. 
The appendix question-answering examples show DERA working through medical cases, with the agents raising questions about each case and weighing treatment recommendations. Throughout, DERA (dialog-enabled resolving agents) enhances LLM completions by having its agents discuss and correct a model's initial output, with metrics such as concept verification used to evaluate summary quality.
A medical encounter is evaluated using the DERA system to generate a scratchpad of edits to improve the initial summary. The summary includes information on the patient's medical history, symptoms, medication, and intent. Pertinent negatives are corrected to state that the patient has not been vaccinated against COVID-19, and pertinent unknowns are modified to remove the patient's uncertainty about their condition. The patient's use of an Albuterol inhaler for wheezing is added to the summary, and the sentence about the last two sinus infections is rephrased.
The DERA system involves a dialogue between two people, with one person writing a summary of a medical encounter and the other verifying it for correctness. The system also includes a prompt for generating corrupted versions of the summary, with varying levels of desired corruption. The summary includes sections for medical history, pertinent positives, pertinent negatives, pertinent unknowns, patient intent, demographics, and social determinants of health. The document then describes how DERA (dialog-enabled resolving agents) is used to generate care plans for patients from the dialogue of their discussion with a physician. The care plan is reviewed section by section, with any corrections added to a scratchpad: a care plan writer (Person A) discusses the plan with another care plan writer (Person B), who verifies its soundness, and the Researcher agent then writes the care plan based on the scratchpad of corrections made during the discussion. The system aims to improve the quality of care plans by ensuring they are congruent with the Care Plan Writing Instructions and grounded in information from the dialogue. The conversation ends with the tag [DONE].
The document provides instructions for writing a care plan, which should be organized into five sections: Medications, Referrals, Tests, Lifestyle, and Supportive Care. The scope of each section is defined, and the next steps in care management that the patient and physician will perform are to be categorized accordingly. Two prompts are provided for extracting and verifying medical concepts from a summary section, used to compute the GPT-F1 metric. Examples are given for each prompt, with medical concepts such as COVID-19, symptoms, diseases, conditions, allergies, and lab tests being extracted or verified.
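The GPT-F1 computation can be sketched as a standard F1 over extracted concepts. This is a hedged reconstruction, not the paper's exact metric: `extract_concepts` and `verify` stand in for the extraction and verification prompts mentioned above, and the precision/recall framing is an assumption about how the score combines them.

```python
# Sketch of a GPT-F1-style metric. `extract_concepts(text) -> list[str]`
# and `verify(concept, text) -> bool` are assumed LLM-backed helpers
# corresponding to the extraction and verification prompts.

def gpt_f1(predicted: str, reference: str, extract_concepts, verify) -> float:
    pred_concepts = extract_concepts(predicted)
    ref_concepts = extract_concepts(reference)
    if not pred_concepts or not ref_concepts:
        return 0.0
    # Precision: predicted concepts the verifier can find in the reference.
    precision = sum(verify(c, reference) for c in pred_concepts) / len(pred_concepts)
    # Recall: reference concepts the verifier can find in the prediction.
    recall = sum(verify(c, predicted) for c in ref_concepts) / len(ref_concepts)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With toy string-based helpers in place of the prompts, the function behaves like an ordinary concept-level F1.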
Finally, the document provides four multiple choice questions that can be rewritten as open-ended questions. The article presents default prompts for prompt generation in the LLM model, which can be modified for specific needs. These prompts are designed to improve completions in the LLM model and include prompts for rewriting questions, characterizing skin lesions, and determining the appropriate next step in management for a patient. The article also provides prompts and guidelines for a medical question-answering research task, involving an expert medical advisor helping a student arrive at the correct answer to a medical question. Additionally, the article discusses the use of DERA to enhance LLM completions, including prompts for question-answering, such as exact matching between generated and gold answers and similarity score between generated and gold answers. The output format includes a dxSimilarityScore and expected vs provided dx pairs.
3119 word summary
The document includes several prompts for question-answering: a prompt for exact matching between generated and gold answers, a similar prompt for scoring the similarity between generated and gold answers, and a prompt for producing the final question-answering answer. The prompts emphasize considering the guidance provided by the expert advisor and thinking the question through, and the output format includes a dxSimilarityScore and expected vs. provided dx pairs. A further set of prompts and guidelines covers a medical question-answering research task in which an expert medical advisor helps a student arrive at the correct answer: the advisor should not give the answer directly but should guide the student by pointing out specific parts of the question that are important, in no more than 3-4 sentences including a quote from the question, stopping only when all important aspects of the question have been pointed out and without repeating earlier responses. Another prompt generates a single-shot answer distribution, and another presents a medical question with options from which to select the best answer phrase. The default prompts used with the LLM can be modified for specific needs and include prompts for rewriting questions, characterizing skin lesions, and determining the appropriate next step in management for a patient.
Prompt 12: Rewrite multiple choice questions as open-ended questions.
Question: What is the most likely cytokine involved in the process of gouty arthritis?
Question: A 71-year-old man presents with complaints of right wrist pain. What characterizes the pattern of his skin lesions?
Question: Several patients at a local US hospital present with chronic secretory diarrhea. What is the most likely common cause of their diarrhea?
Question: A 60-year-old man comes in for an examination prior to a scheduled cholecystectomy. Which of the following is the most appropriate next step in management?
The document goes on to describe how DERA (dialog-enabled resolving agents) is used to generate care plans for patients from the dialogue of their discussion with a physician. The Researcher agent is tasked with writing the care plan based on a scratchpad of corrections made during the discussion with another care plan writer. The system aims to improve the quality of care plans by ensuring that they are congruent with the Care Plan Writing Instructions and grounded in information from the dialogue, and the conversation ends with the tag [DONE]. The care plan is reviewed section by section, and any corrections are added to a scratchpad: a care plan writer (Person A) discusses the care plan with another care plan writer (Person B), who verifies the plan's soundness. The dialogue used for generating the care plan is a medical encounter between a patient and a doctor conducted over chat, and the full care plan writing instructions are presented in the document. These instructions organize the care plan into five sections: Medications, Referrals, Tests, Lifestyle, and Supportive Care, define the scope of each section, and categorize the next steps in care management that the patient and physician will perform. Two prompts are provided for extracting medical concepts from a summary and verifying medical concepts from a summary section, used to compute the GPT-F1 metric, with examples covering concepts such as COVID-19, symptoms, diseases, conditions, allergies, and lab tests.
The DERA summarization experiments involve a process where Person A writes an original summary, Person B edits it, and a Decider agent decides which corrections to incorporate. Person B goes through each section of the summary and points out any discrepancies between the summary and the medical dialogue. Person A adds agreed-upon corrections to a scratchpad, which Person B incorporates into the original summary. The process ends with a final summarization step that includes the scratchpad of corrections. The output tag "[STOP]" is used to indicate when writing is complete. The correction scratchpad and original summary are provided for reference in their respective sections. The discussion between Person A and Person B is also recorded in the summary discussion section. The Decider agent's role is to decide which corrections to include in the final summary and output them in the tag [SCRATCHPAD: <things-to-add-to-the-scratchpad-here>]. The document discusses a system called DERA that enhances LLM completions. The system involves a dialogue between two people, with one person writing a summary of a medical encounter and the other verifying it for correctness. The first person can make corrections to the summary on a scratchpad and the second person can suggest additional edits. The system also includes a prompt for generating corrupted versions of the summary, with varying levels of desired corruption. The summary includes sections for medical history, pertinent positives, pertinent negatives, pertinent unknowns, patient intent, demographics, and social determinants of health. A study on enhancing LLM completions with DERA is evaluated through a quantitative comparison between initial and final summaries. The DERA dialog generates a scratchpad of edits to improve the initial summary. The medical conversation summarization includes information on the patient's medical history, symptoms, medication, and intent. 
Pertinent negatives are corrected to state that the patient has not been vaccinated against COVID-19. Pertinent unknowns are modified to remove the patient's uncertainty about their condition. The patient's use of an Albuterol inhaler for wheezing is added to the summary, and the sentence about the last two sinus infections is rephrased. The patient's demographics are not mentioned in the encounter. In the underlying encounter, a patient seeks care for a sinus infection and experiences symptoms such as facial swelling, swollen eyelids, and a slight fever. They have a history of seasonal allergies, sinus infections, and chickenpox. The patient currently uses ibuprofen and a nasal spray for symptom relief. They have not been vaccinated against COVID-19 but have had it before. The patient typically uses an Albuterol inhaler for asthma due to allergies causing wheezing in the spring and fall. Demographic information is not provided. In this way, DERA (dialog-enabled resolving agents) enhances LLM completions by having agents iteratively review and correct a model's initial output, with metrics such as concept verification used to evaluate summary quality.
The appendix question-answering examples show DERA working through medical cases. In one, a 20-year-old woman presents with symptoms including bruising easily, heavy menstrual bleeding, and diarrhea; her family history includes similar problems with bruising. The Researcher considers the patient's symptoms and diagnostic biopsy results, which show blunting of intestinal villi and flat mucosa with multiple intraepithelial lymphocytes, and concludes that Celiac Disease is the most likely diagnosis. The patient's symptoms are most prominent after breakfast, when she typically consumes cereal, and the Researcher recommends an esophagogastroduodenoscopy. The patient's vital signs are within normal limits and her BMI is 17. The document also provides examples of chats between Decider and Researcher agents discussing the appropriate study design for a research question: the goal is to submit an editorial to a top journal based on a comprehensive literature search and meta-analysis, and the agents, considering the specific goals and resources of the researcher, ultimately recommend a systematic review and meta-analysis, or a meta-analysis alone, as viable options for pooling results from multiple studies. The document further includes a question answering example and several appendices with additional information. A list of research papers related to using language models for text generation and summarization is provided; the papers cover evaluating text generation with BERT, leveraging pretrained models for automatic summarization of doctor-patient conversations, prompting large language models for reasoning and faithful chains of reasoning, and analyzing BERTScore for social bias.
Other topics include using language models for medical question answering, programming for large language models, and the capabilities of GPT-4 on medical challenge problems. This document discusses various studies and approaches related to natural language processing and medical informatics. One study focuses on measuring semantic similarity using open-source software, while another examines the use of large language models for generating medical conversations. Other studies explore the effectiveness of language models for summarization and text generation, as well as the use of pathways in language modeling. Additionally, the document includes information on mapping biomedical text to the UMLS Metathesaurus and the use of data generators for medical dialogue summarization. This paper discusses the use of zero-shot clinical information extractors in enhancing LLM completions using OpenAI's API, mostly using GPT-4 models. The experiments in this paper were performed using a different form than is likely to be applied in a regular clinical setting. Additionally, some datasets that are openly available cannot be directly used with API-based models, further limiting options for medically-focused research. Clinical text interpretation is also potentially problematic and requires further research. The need for further research and auditable models is reiterated. The chat-based format of DERA allows for increased interpretability when auditing text generation and is well-suited for longer-generation tasks. A framework called DERA is introduced for improving the output of large language model-powered question-answering tasks. DERA is able to identify and correct hallucinations and omissions in the resulting text, reducing the number of errors. The framework involves two types of agents, Researcher and Decider, working together to achieve the correct answer in one or two generations. 
In some cases, DERA adds additional details to the one-shot answer, while in others it changes it to a more specific answer. DERA's ability to improve the quality of generated text is important given that large language models, in particular GPT-4, produce text that is fluent but potentially prone to errors. The authors suggest that additional work is needed on methods that automatically evaluate the output of large language model-powered tools. In the open-ended setting, DERA shows small to no improvement over GPT-4, suggesting opportunities for future work; in the multiple-choice setting, GPT-4 outperforms the best previously published approaches. The authors use a variety of metrics to evaluate the similarity between generated and gold answers, including BERTScore and GPT-4 prompts that identify exact matches and similar matches. The passing grade for the MedQA test set is above 60%, and there is a gap between open-ended and multiple-choice performance. The multiple-choice results are slightly higher due to the use of a self-consistency approach. DERA provides a dialog between Decider and Researcher agents to improve the accuracy of answers, and several examples of DERA chats are provided, including one where Von Willebrand Disease is diagnosed as the most likely cause of a patient's symptoms; the Gold Answer and 1-shot answer are also included. The document notes that closely related lexical forms receive separate votes, and variations in the text are not accounted for. The primary challenge in the open-ended setting is mapping free-form responses to a discrete set of options; the evaluation metrics involve generating multiple-choice prompts, and the model is evaluated on open-ended question-answer pairs.
The DERA setup uses a single-shot prompt to generate an initial answer, followed by iterative discussion between the Decider and Researcher prompts to arrive at a final answer. Providing DERA with a distribution of votes instead of a single-shot answer yields stronger results. The MedQA dataset is used for testing, and the prompts used to rewrite each question's final sentence into open-ended form are released. DERA is used to alter questions and generate free-form answers for medical question answering datasets, including MedQA and NEJM. The goal is to test DERA's performance on open-ended questions and evaluate its ability to recall a large set of knowledge across multiple domains of medicine. The article provides qualitative examples of DERA's performance in generating care plans, highlighting potential drug interactions; the findings suggest that DERA has the potential to improve care management practices, but further research is needed to determine its generalizability. For care plans, physician-expert evaluations of patient-physician chats show that DERA generates more effective care plans than the initial GPT-4 generated plans: it asks questions about potentially harmful information and has a high rate of agreement with physician suggestions. The final care plan includes guidance on managing UTIs, medication interactions, referrals, lifestyle changes, and tests; it emphasizes the importance of seeking timely medical care and offers reassurance and support for the patient's mental health. In the suggested improvements to a patient's UTI care plan, the Medications section recommends considering the patient's previous antibiotic exposure, allergies, and potential drug interactions with BuSpar, LaMICtal, and Zoloft.
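The two evaluation steps discussed here, tallying a vote distribution over sampled free-form answers and mapping a free-form answer onto the closest multiple-choice option, can be sketched as follows. `token_overlap` is a trivial stand-in for the GPT-4 similarity prompt, and the exact-string tally deliberately reproduces the noted limitation that closely related lexical forms receive separate votes.

```python
from collections import Counter

# Sketch of the evaluation steps: vote tallying over sampled answers and
# mapping a free-form answer to the closest option. The similarity
# function is a toy stand-in for the paper's GPT-4 similarity prompt.

def vote_distribution(sampled_answers: list[str]) -> Counter:
    # Exact-string tally: lexical variants of the same diagnosis
    # count as separate votes, matching the limitation in the paper.
    return Counter(a.strip() for a in sampled_answers)

def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def map_to_option(free_form: str, options: list[str], similarity=token_overlap) -> str:
    return max(options, key=lambda opt: similarity(free_form, opt))
```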
The Supportive Care section should offer guidance on managing UTI symptoms and on potential medication interactions. The Lifestyle section should include a recommendation to educate the patient on safe sexual practices and the importance of using a barrier method during sex to reduce the risk of UTIs. The Tests section recommends performing a urinalysis and urine culture to confirm the presence of a UTI and identify the appropriate antibiotic for treatment. The Initial Care Plan includes prescribing an appropriate antibiotic, providing education on recognizing the signs and symptoms of recurrent UTIs, and encouraging proper hygiene practices. Referrals are recommended if symptoms worsen or do not improve after a few days of antibiotic treatment. The study evaluates the effectiveness of DERA on care plan generation using a dataset of 50 medical encounters with prompts specific to the task. The generated care plans are evaluated by licensed physicians, who rate them as complete 92% of the time, compared to 64% for the initial care plans, and prefer the final care plan generated by DERA 84% of the time. The study concludes that DERA can enhance LLM completions and improve the quality of care plans. The DERA system generates physician-facing suggestions for care management plans, which can resolve errors like hallucinations and omissions. For summarization, the system's performance is evaluated using the GPT-F1 metric on initial summaries with varying levels of synthetic corruption; the degree of corruption is controlled by a variable that can be set to low, medium, or high. The system's ability to write medical summaries is quantitatively evaluated by generating GPT-4 summaries for 500 encounters, and the DERA-produced summary consistently outperforms the initial summary in terms of GPT-F1 scores.
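The corruption experiment can be sketched as a small harness. `corrupt`, `run_dera`, and the scoring function are assumed LLM-backed helpers standing in for the corruption prompt, the agent dialog, and GPT-F1 respectively; only the loop structure over corruption levels is shown.

```python
from statistics import mean

# Harness sketch for the corruption experiment: inject synthetic errors
# at a chosen level, then score initial vs. DERA-corrected summaries.
# corrupt / run_dera / score are assumed LLM-backed helpers.

CORRUPTION_LEVELS = ("low", "medium", "high")

def evaluate(encounters, corrupt, run_dera, score):
    results = {}
    for level in CORRUPTION_LEVELS:
        initial_scores, dera_scores = [], []
        for dialogue, reference in encounters:
            corrupted = corrupt(reference, level=level)
            initial_scores.append(score(corrupted, reference))
            dera_scores.append(score(run_dera(dialogue, corrupted), reference))
        # Mean metric for the corrupted initial summary vs. the
        # DERA-corrected summary at this corruption level.
        results[level] = (mean(initial_scores), mean(dera_scores))
    return results
```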
The study introduces the use of DERA to correct errors in medical conversation summarization, with synthetic corruption introduced to compare the performance of the initial summary and the DERA-corrected summary. Physician-expert evaluations show that DERA summaries capture far more clinical information than initial GPT-4 generated summaries: physicians rate agreement with "All" of DERA's suggested edits in 63% of encounters, "Most" in 14%, "Some" in 5%, and "None" in 18%, and they prefer the final DERA-produced summary over the initially generated GPT-4 summary. The dataset includes 500 medical encounters that cover a wide variety of common presentations in telehealth. GPT-4 prompts are run with the settings listed in Table 5, and the Decider generates the final summary once a maximum conversation length is reached. The Researcher agent's role is to read the summary and point out any discrepancies to the Decider. The DERA setup requires the model to summarize the chat while placing each piece of information in the appropriate section; the task of medical conversation summarization is to encapsulate a patient-doctor conversation in a short format. DERA can also generate doctor-facing suggestions for care plan generation. DERA is a chat framework that uses two agents to create high-quality output for natural language generation tasks: the Researcher agent reviews information and highlights what is relevant, while the Decider agent makes the final decision on what information to include and generates the output text with GPT-4. By alternating between the agents, DERA can process knowledge sources and produce higher-quality output, an approach well-suited for longer-form generation tasks such as medical conversation summarization.
DERA has potential for improving GPT-4 performance and reducing errors in conversation summaries. The article introduces DERA, a framework that uses dialogue to improve performance on natural language tasks, particularly in the clinical domain. DERA consists of two agents: a Decider and a Researcher. The Decider generates an initial output for the task, and the Researcher helps identify crucial problem components; the two agents work through the problem via conversation. DERA is evaluated on three different types of clinical tasks and outperforms base GPT-4 performance. The paper proposes using agents tasked with resolving or improving anomalies within the output and suggests that this approach can improve performance on natural language tasks. Large language models (LLMs) have become valuable tools for many natural language understanding tasks, particularly in safety-critical applications such as healthcare, and their performance can be further improved through dialog-enabled resolving agents (DERA), a paradigm that allows models to communicate feedback and iteratively improve output. Prior prompt-based strategies, such as chaining, prompt the model to think through a problem as an expert would approach it, leading to improved performance in some tasks. The authors test DERA against three clinically-focused tasks and show significant improvement over the base LLM. They also release an open-ended version of the MedQA dataset, which can be used to evaluate model performance. This work highlights the value of dialog-based strategies for enhancing LLM completions in safety-critical applications.