Summary Enhancing LLM Completions with DERA arxiv.org
18,498 words - PDF document
One Line
The paper introduces DERA, a dialog-enabled resolving-agent framework in which two GPT-4 agents iteratively refine an initial output; it is applied to three clinical tasks (medical conversation summarization, care plan generation, and medical question answering), and its appendices reproduce the prompts used, including care plan writing instructions and prompts for rewriting multiple-choice questions as open-ended ones.
Key Points
- Large language models (LLMs) can be improved with dialog-enabled resolving agents (DERA), which let models exchange feedback and iteratively refine their output.
- DERA is evaluated on three different types of clinical tasks and outperforms base GPT-4 performance.
- DERA is a chat framework in which two agents, a Researcher and a Decider, converse to produce high-quality output for natural language generation tasks.
- For medical question answering, DERA is tested on questions rewritten into open-ended form, with free-form answers generated and then mapped back to the answer options for scoring.
- For summarization, the agents role-play a dialogue between two people, one writing a summary of a medical encounter and the other verifying it for correctness.
Summaries
205 word summary
This document describes the use of DERA (dialog-enabled resolving agents) to generate care plans in medical conversations, for example for managing urinary tract infections (UTIs), and to improve care plan quality. The DERA dialog produces a final care plan that includes guidance on managing UTIs, medication interactions, referrals, lifestyle changes, and tests. The primary challenge in applying DERA to medical question answering datasets is mapping free-form responses to a discrete set of options. Outputs are evaluated with metrics such as concept verification for summarization. Instructions for writing a care plan are provided, organized into five sections: Medications, Referrals, Tests, Lifestyle, and Supportive Care. The document also provides prompts for extracting and verifying medical concepts from a summary section, used to compute the GPT-F1 metric, along with the default prompts used to enhance LLM completions: prompts for rewriting questions, characterizing skin lesions, and determining the next step in patient management, plus prompts and guidelines for a medical question-answering research task, including question-answering prompts and their output format.
537 word summary
The article presents the default prompts used with the LLM to improve completions, including prompts for rewriting questions, characterizing skin lesions, and determining the appropriate next step in management for a patient. It also provides prompts and guidelines for a medical question-answering research task and discusses the use of DERA to enhance LLM completions, including question-answering prompts and their output format.
Instructions for writing a care plan are provided, which should be organized into five sections: Medications, Referrals, Tests, Lifestyle, and Supportive Care. The document provides prompts for extracting and verifying medical concepts from a summary section used to compute the GPT-F1 metric.
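The five-section care plan structure could be represented with a simple container like the following sketch; the field names mirror the section names from the Care Plan Writing Instructions, while the rendering format and example entries are assumptions.

```python
# Illustrative container for the five care plan sections named in the
# Care Plan Writing Instructions; the plain-text rendering is assumed.
from dataclasses import dataclass, field

@dataclass
class CarePlan:
    medications: list[str] = field(default_factory=list)
    referrals: list[str] = field(default_factory=list)
    tests: list[str] = field(default_factory=list)
    lifestyle: list[str] = field(default_factory=list)
    supportive_care: list[str] = field(default_factory=list)

    def render(self) -> str:
        sections = [
            ("Medications", self.medications),
            ("Referrals", self.referrals),
            ("Tests", self.tests),
            ("Lifestyle", self.lifestyle),
            ("Supportive Care", self.supportive_care),
        ]
        return "\n".join(
            f"{name}:\n" + "\n".join(f"- {item}" for item in items)
            for name, items in sections
        )

# Hypothetical entries drawn from the UTI example discussed in the document.
plan = CarePlan(
    medications=["Prescribe an appropriate antibiotic for the UTI"],
    tests=["Urinalysis", "Urine culture"],
)
```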
The DERA system generates care plans for patients from the dialogue of their discussion with a physician. It aims to improve care plan quality by ensuring plans are congruent with the Care Plan Writing Instructions and grounded in information from the dialogue.
In an example medical encounter, the DERA dialog generates a scratchpad of edits to improve the initial summary, including correcting the pertinent negatives to state that the patient has not been vaccinated against COVID-19 and modifying the pertinent unknowns to remove the patient's uncertainty about their condition. More broadly, DERA (dialog-enabled resolving agents) enhances LLM completions by having agents discuss and iteratively correct a model's initial output; the outputs are evaluated with metrics such as concept verification for summaries, and the appendix question-answering examples show DERA working through medical cases and treatment recommendations.
The article discusses the use of DERA to enhance LLM completions with a focus on GPT-4 models. Various studies related to natural language processing and medical informatics are discussed, including semantic similarity measurement, language models for generating medical conversations, summarization, text generation, and mapping biomedical text. The document also includes examples of chats between Decider and Researcher agents discussing study design and a medical case study.
The primary challenge in using DERA for medical question answering datasets is mapping free-form responses to a discrete set of options. The evaluation metrics involve generating multiple-choice prompts. DERA provides a dialog between Decider and Researcher agents to improve the accuracy of answers.
The article discusses the use of DERA to generate care plans for patients. The final care plan includes guidance on managing UTIs, medication interactions, referrals, lifestyle changes, and tests; it emphasizes the importance of seeking timely medical care and offers reassurance and support for the patient's mental health. The document presents suggestions for managing urinary tract infections (UTIs) in a patient's care plan, including considering previous antibiotic exposure and potential drug interactions; it recommends urinalysis and urine culture tests and education on safe sexual practices, with referrals recommended if symptoms worsen or do not improve. The study introduces DERA as a way to improve the quality of such care plans. Prior prompt-based strategies, such as chaining, prompt the model to think through a problem as an expert would approach it, which improves performance on some tasks; DERA instead uses dialog between agents. DERA is evaluated on three different types of clinical tasks, outperforms base GPT-4 performance, and shows potential for reducing errors in conversation summaries.
1225 word summary
Large language models (LLMs) can be improved with dialog-enabled resolving agents (DERA), which allow models to communicate feedback and iteratively improve output. Prior dialog- and prompt-based strategies, such as chaining, prompt the model to think through a problem as an expert would approach it, leading to improved performance in some tasks. DERA is evaluated on three different types of clinical tasks and outperforms base GPT-4 performance. DERA is a chat framework that uses two agents, a Researcher and a Decider, to create high-quality output for natural language generation tasks. The approach is well-suited for longer-form generation tasks, such as medical conversation summarization, and has potential for improving GPT-4 performance and reducing errors in conversation summaries. The dataset includes 500 medical encounters that cover a wide variety of common presentations in telehealth. DERA can generate doctor-facing suggestions for care plan generation. Physician-expert evaluations show that DERA summaries capture far more clinical information than initial GPT-4 generated summaries. The DERA system generates physician-facing suggestions for care management plans, which can resolve errors like hallucinations and omissions. The study evaluates the effectiveness of DERA on a dataset of 50 medical encounters with prompts specific to care plan generation, and concludes that DERA can enhance LLM completions and improve the quality of care plans.
The Enhancing LLM Completions with DERA document suggests improvements to a patient's care plan for managing urinary tract infections (UTIs). The Medications section recommends considering the patient's previous antibiotic exposure, allergies, and potential drug interactions with BuSpar, LaMICtal, and Zoloft. The Supportive Care section should offer guidance on managing UTI symptoms and on potential medication interactions. The Lifestyle section should include a recommendation to educate the patient on safe sexual practices and the importance of using a barrier method during sex to reduce the risk of UTIs. The Tests section recommends performing a urinalysis and urine culture to confirm the presence of a UTI and identify the appropriate antibiotic for treatment. The Initial Care Plan includes prescribing an appropriate antibiotic, providing education on recognizing the signs and symptoms of recurrent UTIs, and encouraging proper hygiene practices. Referrals are recommended if symptoms worsen or do not improve after a few days of antibiotic treatment.
The article discusses the use of the DERA platform to generate care plans for patients. The final care plan includes guidance on managing UTIs, medication interactions, referrals, lifestyle changes, and tests. The platform emphasizes the importance of seeking timely medical care and offers reassurance and support for patients' mental health. The article discusses the use of DERA to enhance LLM completions by altering questions and generating free-form answers for medical question answering datasets. The primary challenge is mapping free-form responses to a discrete set of options, and the evaluation metrics involve generating multiple-choice prompts. DERA provides a dialog between Decider and Researcher agents to improve the accuracy of answers. In the open-ended setting, DERA shows small to no improvement over GPT-4, while in the multiple-choice setting, GPT-4 outperforms the best previously published approaches. A framework called DERA is introduced for improving the output of large language model-powered question-answering tasks. The authors suggest that additional work in methods that automatically evaluate the output of large language model-powered tools is needed. This document discusses the use of zero-shot clinical information extractors and OpenAI's API to enhance LLM completions, with a focus on GPT-4 models. The chat-based format of DERA allows for increased interpretability and is well-suited for longer generation tasks. Various studies related to natural language processing and medical informatics are discussed, including semantic similarity measurement, language models for generating medical conversations, summarization, text generation, and mapping biomedical text. A list of research papers related to using language models for text generation and summarization is provided. The document also includes examples of chats between Decider and Researcher agents discussing study design and a medical case study. 
The appendix question-answering examples show DERA working through medical cases, with the agents raising questions about each case and weighing treatment recommendations. Throughout, DERA (dialog-enabled resolving agents) enhances LLM completions by having its agents discuss and correct a model's initial output, with metrics such as concept verification used to evaluate summary quality.
A medical encounter is evaluated using the DERA system to generate a scratchpad of edits to improve the initial summary. The summary includes information on the patient's medical history, symptoms, medication, and intent. Pertinent negatives are corrected to state that the patient has not been vaccinated against COVID-19, and pertinent unknowns are modified to remove the patient's uncertainty about their condition. The patient's use of an Albuterol inhaler for wheezing is added to the summary, and the sentence about the last two sinus infections is rephrased.
The DERA system involves a dialogue between two people, with one person writing a summary of a medical encounter and the other verifying it for correctness. The system also includes a prompt for generating corrupted versions of the summary, with varying levels of desired corruption. The summary includes sections for medical history, pertinent positives, pertinent negatives, pertinent unknowns, patient intent, demographics, and social determinants of health. The document then describes how DERA (dialog-enabled resolving agents) is used to generate care plans for patients from the dialogue of their discussion with a physician. The care plan is reviewed section by section, with any corrections added to a scratchpad: a care plan writer (Person A) discusses the plan with another care plan writer (Person B), who verifies its soundness, and the Researcher agent then writes the care plan based on the scratchpad of corrections made during the discussion. The system aims to improve the quality of care plans by ensuring they are congruent with the Care Plan Writing Instructions and grounded in information from the dialogue. The conversation ends with the tag [DONE].
The document provides instructions for writing a care plan, which should be organized into five sections: Medications, Referrals, Tests, Lifestyle, and Supportive Care. The scope of each section is defined, and the next steps in care management that the patient and physician will perform are to be categorized accordingly. Two prompts are provided for extracting and verifying medical concepts from a summary section, used to compute the GPT-F1 metric. Examples are given for each prompt, with medical concepts such as COVID-19, symptoms, diseases, conditions, allergies, and lab tests being extracted or verified.
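The GPT-F1 computation can be sketched as a standard F1 over extracted concepts. This is a hedged reconstruction, not the paper's exact metric: `extract_concepts` and `verify` stand in for the extraction and verification prompts mentioned above, and the precision/recall framing is an assumption about how the score combines them.

```python
# Sketch of a GPT-F1-style metric. `extract_concepts(text) -> list[str]`
# and `verify(concept, text) -> bool` are assumed LLM-backed helpers
# corresponding to the extraction and verification prompts.

def gpt_f1(predicted: str, reference: str, extract_concepts, verify) -> float:
    pred_concepts = extract_concepts(predicted)
    ref_concepts = extract_concepts(reference)
    if not pred_concepts or not ref_concepts:
        return 0.0
    # Precision: predicted concepts the verifier can find in the reference.
    precision = sum(verify(c, reference) for c in pred_concepts) / len(pred_concepts)
    # Recall: reference concepts the verifier can find in the prediction.
    recall = sum(verify(c, predicted) for c in ref_concepts) / len(ref_concepts)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With toy string-based helpers in place of the prompts, the function behaves like an ordinary concept-level F1.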
Finally, the document provides four multiple choice questions that can be rewritten as open-ended questions. The article presents default prompts for prompt generation in the LLM model, which can be modified for specific needs. These prompts are designed to improve completions in the LLM model and include prompts for rewriting questions, characterizing skin lesions, and determining the appropriate next step in management for a patient. The article also provides prompts and guidelines for a medical question-answering research task, involving an expert medical advisor helping a student arrive at the correct answer to a medical question. Additionally, the article discusses the use of DERA to enhance LLM completions, including prompts for question-answering, such as exact matching between generated and gold answers and similarity score between generated and gold answers. The output format includes a dxSimilarityScore and expected vs provided dx pairs.
3119 word summary
The document includes several prompts for question-answering: a prompt for exact matching between generated and gold answers, a similar prompt for scoring the similarity between generated and gold answers, and a prompt for producing the final question-answering answer. The prompts emphasize considering the guidance provided by the expert advisor and thinking the question through, and the output format includes a dxSimilarityScore and expected vs. provided dx pairs. A further set of prompts and guidelines covers a medical question-answering research task in which an expert medical advisor helps a student arrive at the correct answer: the advisor should not give the answer directly but should guide the student by pointing out specific parts of the question that are important, in no more than 3-4 sentences including a quote from the question, stopping only when all important aspects of the question have been pointed out and without repeating earlier responses. Another prompt generates a single-shot answer distribution, and another presents a medical question with options from which to select the best answer phrase. The default prompts used with the LLM can be modified for specific needs and include prompts for rewriting questions, characterizing skin lesions, and determining the appropriate next step in management for a patient.
Prompt 12: Rewrite multiple choice questions as open-ended questions.
Question: What is the most likely cytokine involved in the process of gouty arthritis?
Question: A 71-year-old man presents with complaints of right wrist pain. What characterizes the pattern of his skin lesions?
Question: Several patients at a local US hospital present with chronic secretory diarrhea. What is the most likely common cause of their diarrhea?
Question: A 60-year-old man comes in for an examination prior to a scheduled cholecystectomy. Which of the following is the most appropriate next step in management?
The document goes on to describe how DERA (dialog-enabled resolving agents) is used to generate care plans for patients from the dialogue of their discussion with a physician. The Researcher agent is tasked with writing the care plan based on a scratchpad of corrections made during the discussion with another care plan writer. The system aims to improve the quality of care plans by ensuring that they are congruent with the Care Plan Writing Instructions and grounded in information from the dialogue, and the conversation ends with the tag [DONE]. The care plan is reviewed section by section, and any corrections are added to a scratchpad: a care plan writer (Person A) discusses the care plan with another care plan writer (Person B), who verifies the plan's soundness. The dialogue used for generating the care plan is a medical encounter between a patient and a doctor conducted over chat, and the full care plan writing instructions are presented in the document. These instructions organize the care plan into five sections: Medications, Referrals, Tests, Lifestyle, and Supportive Care, define the scope of each section, and categorize the next steps in care management that the patient and physician will perform. Two prompts are provided for extracting medical concepts from a summary and verifying medical concepts from a summary section, used to compute the GPT-F1 metric, with examples covering concepts such as COVID-19, symptoms, diseases, conditions, allergies, and lab tests.
The DERA summarization experiments involve a process where Person A writes an original summary, Person B edits it, and a Decider agent decides which corrections to incorporate. Person B goes through each section of the summary and points out any discrepancies between the summary and the medical dialogue. Person A adds agreed-upon corrections to a scratchpad, which Person B incorporates into the original summary. The process ends with a final summarization step that includes the scratchpad of corrections. The output tag "[STOP]" is used to indicate when writing is complete. The correction scratchpad and original summary are provided for reference in their respective sections. The discussion between Person A and Person B is also recorded in the summary discussion section. The Decider agent's role is to decide which corrections to include in the final summary and output them in the tag [SCRATCHPAD: <things-to-add-to-the-scratchpad-here>]. The document discusses a system called DERA that enhances LLM completions. The system involves a dialogue between two people, with one person writing a summary of a medical encounter and the other verifying it for correctness. The first person can make corrections to the summary on a scratchpad and the second person can suggest additional edits. The system also includes a prompt for generating corrupted versions of the summary, with varying levels of desired corruption. The summary includes sections for medical history, pertinent positives, pertinent negatives, pertinent unknowns, patient intent, demographics, and social determinants of health. A study on enhancing LLM completions with DERA is evaluated through a quantitative comparison between initial and final summaries. The DERA dialog generates a scratchpad of edits to improve the initial summary. The medical conversation summarization includes information on the patient's medical history, symptoms, medication, and intent. 
Pertinent negatives are corrected to state that the patient has not been vaccinated against COVID-19. Pertinent unknowns are modified to remove the patient's uncertainty about their condition. The patient's use of an Albuterol inhaler for wheezing is added to the summary, and the sentence about the last two sinus infections is rephrased. The patient's demographics are not mentioned in the encounter. In the underlying encounter, a patient seeks care for a sinus infection and experiences symptoms such as facial swelling, swollen eyelids, and a slight fever. They have a history of seasonal allergies, sinus infections, and chickenpox. The patient currently uses ibuprofen and a nasal spray for symptom relief. They have not been vaccinated against COVID-19 but have had it before. The patient typically uses an Albuterol inhaler for asthma due to allergies causing wheezing in the spring and fall. Demographic information is not provided. In this way, DERA (dialog-enabled resolving agents) enhances LLM completions by having agents iteratively review and correct a model's initial output, with metrics such as concept verification used to evaluate summary quality.
The appendix question-answering examples show DERA working through medical cases. In one, a 20-year-old woman presents with symptoms including bruising easily, heavy menstrual bleeding, and diarrhea; her family history includes similar problems with bruising. The Researcher considers the patient's symptoms and diagnostic biopsy results, which show blunting of intestinal villi and flat mucosa with multiple intraepithelial lymphocytes, and concludes that Celiac Disease is the most likely diagnosis. The patient's symptoms are most prominent after breakfast, when she typically consumes cereal, and the Researcher recommends an esophagogastroduodenoscopy. The patient's vital signs are within normal limits and her BMI is 17. The document also provides examples of chats between Decider and Researcher agents discussing the appropriate study design for a research question: the goal is to submit an editorial to a top journal based on a comprehensive literature search and meta-analysis, and the agents, considering the specific goals and resources of the researcher, ultimately recommend a systematic review and meta-analysis, or a meta-analysis alone, as viable options for pooling results from multiple studies. The document further includes a question answering example and several appendices with additional information. A list of research papers related to using language models for text generation and summarization is provided; the papers cover evaluating text generation with BERT, leveraging pretrained models for automatic summarization of doctor-patient conversations, prompting large language models for reasoning and faithful chains of reasoning, and analyzing BERTScore for social bias.
Other topics include using language models for medical question answering, programming for large language models, and the capabilities of GPT-4 on medical challenge problems. This document discusses various studies and approaches related to natural language processing and medical informatics. One study focuses on measuring semantic similarity using open-source software, while another examines the use of large language models for generating medical conversations. Other studies explore the effectiveness of language models for summarization and text generation, as well as the use of pathways in language modeling. Additionally, the document includes information on mapping biomedical text to the UMLS Metathesaurus and the use of data generators for medical dialogue summarization. This paper discusses the use of zero-shot clinical information extractors in enhancing LLM completions using OpenAI's API, mostly using GPT-4 models. The experiments in this paper were performed using a different form than is likely to be applied in a regular clinical setting. Additionally, some datasets that are openly available cannot be directly used with API-based models, further limiting options for medically-focused research. Clinical text interpretation is also potentially problematic and requires further research. The need for further research and auditable models is reiterated. The chat-based format of DERA allows for increased interpretability when auditing text generation and is well-suited for longer-generation tasks. A framework called DERA is introduced for improving the output of large language model-powered question-answering tasks. DERA is able to identify and correct hallucinations and omissions in the resulting text, reducing the number of errors. The framework involves two types of agents, Researcher and Decider, working together to achieve the correct answer in one or two generations. 
In some cases, DERA adds additional details to the one-shot answer, while in others it changes it to a more specific answer. DERA's ability to improve the quality of generated text is important given that large language models, in particular GPT-4, produce text that is fluent but potentially prone to errors. The authors suggest that additional work is needed on methods that automatically evaluate the output of large language model-powered tools. In the open-ended setting, DERA shows small to no improvement over GPT-4, suggesting opportunities for future work; in the multiple-choice setting, GPT-4 outperforms the best previously published approaches. The authors use a variety of metrics to evaluate the similarity between generated and gold answers, including BERTScore and GPT-4 prompts that identify exact matches and similar matches. The passing grade for the MedQA test set is above 60%, and there is a gap between open-ended and multiple-choice performance. The multiple-choice results are slightly higher due to the use of a self-consistency approach. DERA provides a dialog between Decider and Researcher agents to improve the accuracy of answers, and several examples of DERA chats are provided, including one where Von Willebrand Disease is diagnosed as the most likely cause of a patient's symptoms; the Gold Answer and 1-shot answer are also included. The document notes that closely related lexical forms receive separate votes, and variations in the text are not accounted for. The primary challenge in the open-ended setting is mapping free-form responses to a discrete set of options; the evaluation metrics involve generating multiple-choice prompts, and the model is evaluated on open-ended question-answer pairs.
The DERA setup uses a single-shot prompt to generate an initial answer, followed by iterative discussion between the Decider and Researcher prompts to arrive at a final answer. Providing DERA with a distribution of votes instead of a single-shot answer yields stronger results. The MedQA dataset is used for testing, and the prompts used to rewrite each question's final sentence into open-ended form are released. DERA is used to alter questions and generate free-form answers for medical question answering datasets, including MedQA and NEJM. The goal is to test DERA's performance on open-ended questions and evaluate its ability to recall a large set of knowledge across multiple domains of medicine. The article provides qualitative examples of DERA's performance in generating care plans, highlighting potential drug interactions; the findings suggest that DERA has the potential to improve care management practices, but further research is needed to determine its generalizability. For care plans, physician-expert evaluations of patient-physician chats show that DERA generates more effective care plans than the initial GPT-4 generated plans: it asks questions about potentially harmful information and has a high rate of agreement with physician suggestions. The final care plan includes guidance on managing UTIs, medication interactions, referrals, lifestyle changes, and tests; it emphasizes the importance of seeking timely medical care and offers reassurance and support for the patient's mental health. In the suggested improvements to a patient's UTI care plan, the Medications section recommends considering the patient's previous antibiotic exposure, allergies, and potential drug interactions with BuSpar, LaMICtal, and Zoloft.
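The two evaluation steps discussed here, tallying a vote distribution over sampled free-form answers and mapping a free-form answer onto the closest multiple-choice option, can be sketched as follows. `token_overlap` is a trivial stand-in for the GPT-4 similarity prompt, and the exact-string tally deliberately reproduces the noted limitation that closely related lexical forms receive separate votes.

```python
from collections import Counter

# Sketch of the evaluation steps: vote tallying over sampled answers and
# mapping a free-form answer to the closest option. The similarity
# function is a toy stand-in for the paper's GPT-4 similarity prompt.

def vote_distribution(sampled_answers: list[str]) -> Counter:
    # Exact-string tally: lexical variants of the same diagnosis
    # count as separate votes, matching the limitation in the paper.
    return Counter(a.strip() for a in sampled_answers)

def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def map_to_option(free_form: str, options: list[str], similarity=token_overlap) -> str:
    return max(options, key=lambda opt: similarity(free_form, opt))
```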
The Supportive Care section should offer guidance on managing UTI symptoms and on potential medication interactions. The Lifestyle section should include a recommendation to educate the patient on safe sexual practices and the importance of using a barrier method during sex to reduce the risk of UTIs. The Tests section recommends performing a urinalysis and urine culture to confirm the presence of a UTI and identify the appropriate antibiotic for treatment. The Initial Care Plan includes prescribing an appropriate antibiotic, providing education on recognizing the signs and symptoms of recurrent UTIs, and encouraging proper hygiene practices. Referrals are recommended if symptoms worsen or do not improve after a few days of antibiotic treatment. The study evaluates the effectiveness of DERA on care plan generation using a dataset of 50 medical encounters with prompts specific to the task. The generated care plans are evaluated by licensed physicians, who rate them as complete 92% of the time, compared to 64% for the initial care plans, and prefer the final care plan generated by DERA 84% of the time. The study concludes that DERA can enhance LLM completions and improve the quality of care plans. The DERA system generates physician-facing suggestions for care management plans, which can resolve errors like hallucinations and omissions. For summarization, the system's performance is evaluated using the GPT-F1 metric on initial summaries with varying levels of synthetic corruption; the degree of corruption is controlled by a variable that can be set to low, medium, or high. The system's ability to write medical summaries is quantitatively evaluated by generating GPT-4 summaries for 500 encounters, and the DERA-produced summary consistently outperforms the initial summary in terms of GPT-F1 scores.
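The corruption experiment can be sketched as a small harness. `corrupt`, `run_dera`, and the scoring function are assumed LLM-backed helpers standing in for the corruption prompt, the agent dialog, and GPT-F1 respectively; only the loop structure over corruption levels is shown.

```python
from statistics import mean

# Harness sketch for the corruption experiment: inject synthetic errors
# at a chosen level, then score initial vs. DERA-corrected summaries.
# corrupt / run_dera / score are assumed LLM-backed helpers.

CORRUPTION_LEVELS = ("low", "medium", "high")

def evaluate(encounters, corrupt, run_dera, score):
    results = {}
    for level in CORRUPTION_LEVELS:
        initial_scores, dera_scores = [], []
        for dialogue, reference in encounters:
            corrupted = corrupt(reference, level=level)
            initial_scores.append(score(corrupted, reference))
            dera_scores.append(score(run_dera(dialogue, corrupted), reference))
        # Mean metric for the corrupted initial summary vs. the
        # DERA-corrected summary at this corruption level.
        results[level] = (mean(initial_scores), mean(dera_scores))
    return results
```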
The study introduces the use of DERA to correct errors in medical conversation summarization, with synthetic corruption introduced to compare the performance of the initial summary and the DERA-corrected summary. Physician-expert evaluations show that DERA summaries capture far more clinical information than initial GPT-4 generated summaries: physicians rate agreement with "All" of DERA's suggested edits in 63% of encounters, "Most" in 14%, "Some" in 5%, and "None" in 18%, and they prefer the final DERA-produced summary over the initially generated GPT-4 summary. The dataset includes 500 medical encounters that cover a wide variety of common presentations in telehealth. GPT-4 prompts are run with the settings listed in Table 5, and the Decider generates the final summary once a maximum conversation length is reached. The Researcher agent's role is to read the summary and point out any discrepancies to the Decider. The DERA setup requires the model to summarize the chat while placing each piece of information in the appropriate section; the task of medical conversation summarization is to encapsulate a patient-doctor conversation in a short format. DERA can also generate doctor-facing suggestions for care plan generation. DERA is a chat framework that uses two agents to create high-quality output for natural language generation tasks: the Researcher agent reviews information and highlights what is relevant, while the Decider agent makes the final decision on what information to include and generates the output text with GPT-4. By alternating between the agents, DERA can process knowledge sources and produce higher-quality output, an approach well-suited for longer-form generation tasks such as medical conversation summarization.
DERA has potential for improving GPT-4 performance and reducing errors in conversation summaries. The article introduces DERA, a framework that uses dialogue to improve performance on natural language tasks, particularly in the clinical domain. DERA consists of two agents: a Decider and a Researcher. The Decider generates an initial output for the task, and the Researcher helps identify crucial problem components; the two agents work through the problem via conversation. DERA is evaluated on three different types of clinical tasks and outperforms base GPT-4 performance. The paper proposes using agents tasked with resolving or improving anomalies within the output and suggests that this approach can improve performance on natural language tasks. Large language models (LLMs) have become valuable tools for many natural language understanding tasks, particularly in safety-critical applications such as healthcare, and their performance can be further improved through dialog-enabled resolving agents (DERA), a paradigm that allows models to communicate feedback and iteratively improve output. Prior prompt-based strategies, such as chaining, prompt the model to think through a problem as an expert would approach it, leading to improved performance in some tasks. The authors test DERA against three clinically-focused tasks and show significant improvement over the base LLM. They also release an open-ended version of the MedQA dataset, which can be used to evaluate model performance. This work highlights the value of dialog-based strategies for enhancing LLM completions in safety-critical applications.