Summary: PMC-LLaMA: Finetuning LLaMA on Medical Papers (arxiv.org)
2,746 words - PDF document
One Line
The PMC-LLaMA model is an open-source language model fine-tuned on biomedical academic papers, achieving high performance on biomedical QA benchmarks and outperforming the original LLaMA model.
Key Points
- The PMC-LLaMA model is a language model fine-tuned for medical tasks by researchers at Shanghai AI Laboratory and Shanghai Jiao Tong University using 4.8 million medical papers.
- PMC-LLaMA outperforms the original LLaMA model and achieves competitive results even under zero-shot evaluation.
- For the downstream QA benchmarks, the model is fine-tuned using the AdamW optimizer with a learning rate of 2e-5 and a batch size of 128 for 3 epochs (a configuration sketch follows this list).
- The datasets used for training and testing include USMLE, MedMCQA, and PubMedQA.
- Large language models often exhibit unsatisfactory performance in medical applications due to a lack of domain-specific knowledge. PMC-LLaMA addresses this issue by injecting medical knowledge and enhancing its capability in the medical domain.
- Future work includes injecting more domain knowledge into pre-trained models and continuously training the PMC-LLaMA model.
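A minimal sketch of the downstream fine-tuning configuration listed above (AdamW, learning rate 2e-5, effective batch size 128, 3 epochs), written with the HuggingFace Trainer. The checkpoint path and the dummy training example are placeholders, not the authors' actual training script.

```python
# Minimal sketch of the reported downstream fine-tuning configuration.
# Checkpoint path and dummy data are placeholders (assumptions), not the authors' code.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_path = "path/to/pmc-llama-7b"                 # placeholder checkpoint location
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token           # LLaMA tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_path)

# Stand-in for the tokenized medical QA training split (USMLE / MedMCQA / PubMedQA).
train_dataset = Dataset.from_dict(tokenizer(["Question: ... Answer: ..."]))

args = TrainingArguments(
    output_dir="pmc-llama-medqa",
    num_train_epochs=3,                             # 3 epochs, as reported
    learning_rate=2e-5,                             # AdamW learning rate
    optim="adamw_torch",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,                  # 4 x 4 x 8 GPUs = effective batch 128
    bf16=True,
    logging_steps=10,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```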
Summaries
277 word summary
PMC-LLaMA is an open-source language model fine-tuned on biomedical academic papers to enhance its capability in the medical domain. The model is trained for 5 epochs on 8 A100 GPUs using the Fully Sharded Data Parallel (FSDP) acceleration strategy and the bf16 data format. The training corpus is drawn from the S2ORC dataset of 81.1M English-language papers, filtered by PubMed Central (PMC) id. PMC-LLaMA achieves high performance on biomedical QA benchmarks, including PubMedQA, MedMCQA, and USMLE, and can efficiently learn medical knowledge from downstream training data. The authors compare the performance of their modified model, PMC-LLaMA, to the original LLaMA and other language models such as ChatGPT and InstructGPT. They show that PMC-LLaMA outperforms LLaMA and achieves competitive results even under zero-shot evaluation. The authors also demonstrate the effectiveness of data-efficient fine-tuning and full fine-tuning on different medical datasets, and conclude that PMC-LLaMA offers better initialization for medical tasks and converges faster than LLaMA. The researchers created PMC-LLaMA by fine-tuning the LLaMA model on 4.8 million medical papers; the resulting model outperforms the original LLaMA and encodes more relevant medical knowledge. However, it has only been trained for five epochs and needs further training. The researchers plan to inject more domain knowledge into pre-trained models, with this work serving as a preliminary investigation based on fine-tuning LLaMA. On medical scenarios such as COPD, robotic cardiac surgery, and pneumonia, PMC-LLaMA's outputs are more accurate and informative than those of the original LLaMA.
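The FSDP/bf16 setup described above could be wired up roughly as in the sketch below, assuming a standard PyTorch launch with one process per GPU (e.g. torchrun); the model path and launch details are assumptions, not the authors' script.

```python
# Minimal sketch: sharding the model with PyTorch FSDP and bf16 mixed precision,
# matching the setup described in the summary (8x A100, bf16). Model path and
# launch details are assumptions, not the authors' code.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision
from transformers import AutoModelForCausalLM


def build_fsdp_model(model_path: str = "path/to/llama-7b") -> FSDP:
    # One process per GPU, typically launched with `torchrun --nproc_per_node=8 ...`.
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = AutoModelForCausalLM.from_pretrained(model_path)

    bf16_policy = MixedPrecision(
        param_dtype=torch.bfloat16,      # bf16 data format, as reported
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    # Parameters, gradients, and optimizer state are sharded across the 8 GPUs.
    return FSDP(model.cuda(), mixed_precision=bf16_policy)
```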
The document also cites several resources related to language models and medical research, including the LLaMA model, the MedMCQA dataset, the GPT-4 model, the PEFT method, the Semantic Scholar Open Research Corpus, the Vicuna chatbot, and other relevant datasets and general resources for medical research.
728 word summary
PMC-LLaMA: Finetuning LLaMA on Medical Papers cites several resources related to language models and medical research. One cited resource is the Tatsu Lab's instruction-following model built on LLaMA (Alpaca); another is work showing that large language models encode clinical knowledge. The references also include the MedMCQA dataset, a large-scale multi-subject multiple-choice dataset for medical-domain question answering, and the GPT-4 model, which shows capabilities for solving medical challenge problems. Other cited resources include PEFT, a library of state-of-the-art parameter-efficient fine-tuning methods, and the Semantic Scholar Open Research Corpus. Vicuna, an open-source chatbot reported to impress GPT-4 with roughly 90% of ChatGPT's quality, is also cited, along with other relevant datasets and general resources for medical research.
The researchers fine-tuned the LLaMA model for medical tasks, creating PMC-LLaMA by training it on 4.8 million medical papers. PMC-LLaMA performs better on medical tasks than the original LLaMA model and includes more relevant medical knowledge. However, the current version has limitations, as it has only been trained for five epochs; in future work, the researchers plan to continue training the model. PMC-LLaMA is more suitable for medical tasks than the foundation LLaMA model. Future work includes injecting more domain knowledge into pre-trained models, with this study serving as a preliminary investigation through fine-tuning LLaMA. The researchers compared the outputs of PMC-LLaMA and the original LLaMA on several medical scenarios, including COPD, robotic cardiac surgery, and pneumonia, and found PMC-LLaMA to be more accurate and informative.
The article discusses the improvements made to LLaMA as a language model for medical papers. The authors compare the performance of their modified model, PMC-LLaMA, to the original LLaMA and other language models such as ChatGPT and InstructGPT. They show that PMC-LLaMA outperforms LLaMA and achieves competitive results even under zero-shot evaluation. The authors also demonstrate the effectiveness of data-efficient fine-tuning and full fine-tuning on different medical datasets, concluding that PMC-LLaMA offers better initialization for medical tasks and converges faster than LLaMA. PMC-LLaMA can efficiently learn medical knowledge from downstream training data.
For the downstream tasks, the model is fine-tuned using the AdamW optimizer with a learning rate of 2e-5 and a batch size of 128 for 3 epochs, and it is evaluated on medical QA benchmarks, with results reported in Table 1. Experiments are conducted on the USMLE dataset, where the data-efficient fine-tuning approach uses the PEFT Low-Rank Adaptation (LoRA) method to reduce computation cost (see the sketch after this summary); both the full fine-tuning and parameter-efficient fine-tuning approaches are evaluated. The datasets used for training and testing include USMLE, MedMCQA, and PubMedQA, and the model achieves good results in all evaluation scenarios.
This document also outlines the fine-tuning procedure and benchmark descriptions for PMC-LLaMA, an open-source language model trained on 4.8 million biomedical academic papers. The model is trained for 5 epochs on 8 A100 GPUs in around 7 days using the Fully Sharded Data Parallel (FSDP) acceleration strategy and the bf16 data format. During fine-tuning, the maximum context length is set to 512 with a batch size of 128, and the model is trained with the AdamW optimizer.
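The PEFT/LoRA data-efficient fine-tuning mentioned above might look roughly like the sketch below; the rank, alpha, target modules, and checkpoint path are illustrative assumptions rather than the paper's exact settings.

```python
# Minimal sketch of the parameter-efficient (LoRA) fine-tuning path via the PEFT
# library. Hyperparameters and the checkpoint path are assumptions for illustration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/pmc-llama-7b")  # placeholder path

lora_config = LoraConfig(
    r=8,                                   # low-rank adapter dimension (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # LLaMA attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the small LoRA adapters are updated
```

Only the adapter weights are trained here, which is what makes this route cheaper than the full fine-tuning the summary also evaluates.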
The dataset used is the S2ORC corpus of 81.1M English-language papers, filtered by PubMed Central (PMC) id (a filtering sketch follows below). The authors believe that a medical-specific foundational language model would be more suitable for specialization in various healthcare sub-tasks, such as medical dialogue or consultation. PMC-LLaMA has demonstrated superior performance on various medical QA datasets, including PubMedQA, MedMCQA, and USMLE. Further fine-tuning involves injecting domain knowledge into the pre-trained LLaMA to steer the foundational language model towards a medical-specific corpus.
Large language models (LLMs), such as GPT and GPT-4, have revolutionized artificial intelligence in various domains, including natural language processing, computer vision, and biomedical applications. However, these models often exhibit unsatisfactory performance in areas that value precision, such as medical applications, due to a lack of domain-specific knowledge. To address this issue, the authors introduce PMC-LLaMA, an open-source language model obtained by fine-tuning an existing LLM on a total of 4.8 million biomedical academic papers. By injecting medical knowledge, the model demonstrates a better understanding of biomedical domain-specific concepts and achieves high performance on biomedical QA benchmarks, including PubMedQA, MedMCQA, and USMLE. Preliminary evaluations are conducted on three biomedical QA datasets. The authors are affiliated with Shanghai AI Laboratory and the Cooperative Medianet Innovation Center at Shanghai Jiao Tong University.
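The PMC-id filtering of S2ORC described above could be done roughly as in the sketch below; the jsonl metadata layout and the "pmc_id" field name are assumptions about the corpus release, not the authors' pipeline.

```python
# Minimal sketch: keep only S2ORC records that carry a PubMed Central (PMC) id.
# The metadata layout (jsonl) and the "pmc_id" field name are assumptions.
import json


def filter_pmc_papers(s2orc_metadata_path: str, output_path: str) -> int:
    """Write records with a PMC id to output_path and return how many were kept."""
    kept = 0
    with open(s2orc_metadata_path) as src, open(output_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            if record.get("pmc_id"):          # paper is linked to PubMed Central
                dst.write(line)
                kept += 1
    return kept
```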