Summary: Evaluating Creativity in Large Language Models (arxiv.org)
22,717 words - PDF document
One Line
Columbia University and Salesforce AI Research created the TTCW to evaluate LLM creativity and found that LLM-generated stories fall well short of professionally written ones.
Key Points
- The Torrance Test of Creative Writing (TTCW) protocol evaluates the creativity of large language models (LLMs) in generating written content.
- LLM-generated stories pass significantly fewer TTCW tests compared to stories written by professionals.
- LLMs are not yet capable of reproducing expert judgments when evaluating creativity in written content.
- The TTCW measures creativity as a product in creative writing, based on fluency, flexibility, originality, and elaboration.
- LLM-generated stories are three to ten times less likely to pass individual TTCW tests compared to expert-written stories.
- The study highlights the limitations of current LLMs in reproducing expert assessments and suggests using the evaluation framework to build interactive writing support tools.
- The study recruited experts to evaluate the stories based on the TTCW, with The New Yorker stories achieving the highest passing rate on all tests.
- LLMs are not yet adept at evaluating creativity as experts do, and the study found no positive correlation between LLM assessments and expert assessments.
Summaries
19 word summary
Columbia University and Salesforce AI Research developed the TTCW to assess LLM creativity, finding limitations in reproducing human creativity.
58 word summary
Columbia University and Salesforce AI Research created the Torrance Test of Creative Writing (TTCW) to assess the creativity of large language models (LLMs). LLM-generated stories perform worse on TTCW tests than those written by professionals, and LLM-based assessments do not correlate with expert judgments, indicating that current models cannot yet reproduce human creativity or expert evaluation. The study offers insights into evaluating creativity in LLMs.
137 word summary
Researchers at Columbia University and Salesforce AI Research have developed the Torrance Test of Creative Writing (TTCW) to evaluate the creativity of large language models (LLMs) in generating written content. LLM-generated stories pass significantly fewer TTCW tests compared to stories written by professionals. The researchers explored the use of LLMs as assessors in automating the TTCW evaluation but found that none of the LLMs positively correlated with expert assessments. The study adapts the TTCT protocol to evaluate creativity as a product in creative writing. LLM-generated stories are three to ten times less likely to pass individual TTCW tests compared to expert-written stories. Overall, LLMs fall short in terms of creativity compared to human writers. The study provides valuable insights into evaluating creativity in large language models and highlights the limitations of current LLMs in reproducing expert assessments.
612 word summary
Researchers at Columbia University and Salesforce AI Research have developed the Torrance Test of Creative Writing (TTCW) to evaluate the creativity of large language models (LLMs) in generating written content. The TTCW consists of 14 binary tests that measure creativity in terms of fluency, flexibility, originality, and elaboration. LLM-generated stories pass significantly fewer TTCW tests compared to stories written by professionals.
The researchers explored the use of LLMs as assessors in automating the TTCW evaluation but found that none of the LLMs positively correlated with expert assessments. This suggests that current LLMs are not yet capable of reproducing expert judgments when evaluating creativity in written content.
The study adapts the TTCT protocol to evaluate creativity as a product in creative writing. The TTCW is designed based on the dimensions of fluency, flexibility, originality, and elaboration. The researchers experimentally validate the TTCW through an assessment of 48 stories using 10 experts in creative writing. They find moderate agreement among experts when administering individual tests and strong agreement when evaluating all tests in aggregate, confirming the validity of the TTCW for evaluating creativity in fictional short stories.
The study compares the performance of human-written stories and LLM-generated stories in passing individual TTCW tests. LLM-generated stories are three to ten times less likely to pass individual TTCW tests compared to expert-written stories, indicating a significant gap in evaluated creativity. Additionally, the researchers investigate the performance of different LLMs in the TTCW evaluation and observe variations in their abilities. However, none of the LLMs demonstrate a positive correlation with expert assessments.
Overall, LLMs fall short in terms of creativity compared to human writers. The study provides valuable insights into evaluating creativity in large language models and highlights the limitations of current LLMs in reproducing expert assessments. The researchers release a large-scale annotation of TTCW assessments to facilitate future research in this domain. They also discuss the potential for using the evaluation framework to build interactive writing support tools and distinguish between AI-generated and human-written stories.
The study titled “Evaluating Creativity in Large Language Models” explores the use of the TTCW to assess the creativity of fictional short stories generated by LLMs. The study recruited experts in creative writing to evaluate the stories based on the TTCW. The experts were not involved in the creation of the tests, demonstrating that the TTCW can be administered by any knowledgeable expert. The study collected a total of 2,016 binary labels and expert-written justifications for these labels.
The results showed that stories published in The New Yorker achieved the highest passing rate on all fourteen tests, with an overall pass rate of 84.7%. LLM-generated stories passed only between a tenth and a third as many TTCW tests as human-written stories. LLMs performed best on the Fluency dimension, while Claude v1.3 achieved the highest performance on average across Fluency, Flexibility, and Elaboration.
The study analyzed the reproducibility of the TTCW evaluations and found moderate agreement among experts on most individual tests. Experts achieved strong agreement on the number of tests a story passes overall. Human-written New Yorker stories were ranked as the most preferred story 89% of the time, while LLM-generated stories ranked as the least preferred roughly two-thirds of the time. Claude v1.3 generated higher-quality short stories than models in the GPT family.
The study also explored using LLMs to administer the TTCW tests but found that LLMs are not yet adept at evaluating creativity as experts do. None of the LLMs produced assessments that correlated positively with expert assessments.
The study concludes by highlighting the limitations of current LLMs in generating high-quality fictional stories and evaluating creativity. The authors hope that their evaluation framework and findings will guide future work.
1122 word summary
Researchers at Columbia University and Salesforce AI Research have developed a protocol called the Torrance Test of Creative Writing (TTCW) to evaluate the creativity of large language models (LLMs) in generating written content. The TTCW consists of 14 binary tests that measure creativity in terms of fluency, flexibility, originality, and elaboration. To assess the effectiveness of LLMs in creative writing, the researchers recruited 10 creative writers to evaluate 48 stories written by professionals and LLMs using the TTCW. The analysis revealed that LLM-generated stories pass significantly fewer TTCW tests compared to stories written by professionals.
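To make the structure of the test concrete, the sketch below (illustrative only, not taken from the paper) organizes a few of the test topics mentioned later in this summary under the four Torrance dimensions and scores a story by the number of binary tests it passes; the test names and groupings are paraphrased placeholders rather than the paper's exact list of 14.

```python
# Illustrative only: a partial, paraphrased view of how binary TTCW tests
# group under the four Torrance dimensions. Test names are placeholders
# drawn from topics mentioned later in this summary, not the paper's
# exact list of 14 tests.
TTCW_DIMENSIONS = {
    "Fluency": ["narrative pacing", "scene vs exposition balance", "narrative ending"],
    "Flexibility": ["emotional flexibility"],
    "Originality": ["originality in theme and content", "originality in form"],
    "Elaboration": ["character development", "rhetorical complexity"],
}


def ttcw_score(verdicts: dict) -> int:
    """A story's aggregate TTCW result: the number of binary tests it passes."""
    return sum(bool(v) for v in verdicts.values())


# Example: a story judged to pass two of the listed tests.
print(ttcw_score({"narrative pacing": True,
                  "narrative ending": True,
                  "originality in form": False}))
```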
The researchers also explored the use of LLMs as assessors in automating the TTCW evaluation. However, they found that none of the LLMs positively correlated with expert assessments. This suggests that current state-of-the-art LLMs are not yet capable of reproducing expert judgments when evaluating creativity in written content.
The study makes several contributions. Firstly, it adapts the TTCT protocol, which measures creativity as a process, to evaluate creativity as a product in creative writing. The TTCW is designed based on the four dimensions of fluency, flexibility, originality, and elaboration. The researchers experimentally validate the TTCW through an assessment of 48 stories using 10 experts in creative writing. They find moderate agreement among experts when administering individual tests and strong agreement when evaluating all tests in aggregate, confirming the validity of the TTCW for evaluating creativity in fictional short stories.
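As a rough illustration of how agreement on such binary labels can be quantified, the sketch below (not the authors' code) computes Fleiss' kappa per test, assuming a hypothetical data layout in which every (story, test) pair received pass/fail verdicts from exactly three experts; the paper's exact agreement statistics may be computed differently.

```python
# Illustrative sketch (not the authors' code): per-test inter-rater agreement
# on binary TTCW labels, assuming each (story, test) pair was judged
# pass (1) / fail (0) by exactly three experts.
from collections import defaultdict

import numpy as np


def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) table of rating counts."""
    n_items = counts.shape[0]
    n_raters = counts.sum(axis=1)[0]                       # same raters per item
    p_cat = counts.sum(axis=0) / (n_items * n_raters)      # category proportions
    p_item = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_exp = p_item.mean(), np.square(p_cat).sum()
    return (p_bar - p_exp) / (1 - p_exp)


# Hypothetical label format: (story_id, test_name, expert_id) -> 0/1 verdict.
labels = {
    ("story_01", "narrative pacing", "expert_A"): 1,
    ("story_01", "narrative pacing", "expert_B"): 1,
    ("story_01", "narrative pacing", "expert_C"): 0,
    ("story_02", "narrative pacing", "expert_A"): 0,
    ("story_02", "narrative pacing", "expert_B"): 0,
    ("story_02", "narrative pacing", "expert_C"): 0,
}

# Build an (items x 2) count table per test: columns are [fail, pass] counts.
per_test = defaultdict(lambda: defaultdict(lambda: [0, 0]))
for (story, test, _expert), verdict in labels.items():
    per_test[test][story][verdict] += 1

for test, stories in per_test.items():
    table = np.array(list(stories.values()))
    print(test, "Fleiss' kappa =", round(fleiss_kappa(table), 3))
```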
The study also compares the performance of human-written stories and LLM-generated stories in passing individual TTCW tests. The analysis shows that LLM-generated stories are three to ten times less likely to pass individual TTCW tests compared to expert-written stories, indicating a significant gap in evaluated creativity. Additionally, the researchers investigate the performance of different LLMs in the TTCW evaluation and observe variations in their abilities. However, none of the LLMs demonstrate a positive correlation with expert assessments.
Overall, the findings suggest that while LLMs have the potential to generate written content, they still fall short in terms of creativity compared to human writers. The study provides valuable insights into evaluating creativity in large language models and highlights the limitations of current LLMs in reproducing expert assessments. The researchers release a large-scale annotation of TTCW assessments to facilitate future research in this domain. They also discuss the potential for using the evaluation framework to build interactive writing support tools and distinguish between AI-generated and human-written stories.
A new study titled "Evaluating Creativity in Large Language Models" explores the use of the Torrance Test of Creative Writing (TTCW) to assess the creativity of fictional short stories generated by large language models (LLMs). The study recruited experts with backgrounds in creative writing to evaluate the stories based on the TTCW. The experts were not involved in the creation of the tests, which demonstrates that the TTCW can be administered by any knowledgeable expert. In total, 10 participants were recruited, including four associated with creative writing departments at leading academic institutions and four professional writers with a Master of Fine Arts in Fiction or Poetry.
The study randomly assigned each story group to three distinct experts to analyze the reproducibility and validity of the TTCW. After each task, experts were given the option to receive another group of stories, and on average, they completed 3.6 tasks over a period of three weeks. In total, the study collected 2,016 binary labels (48 stories × 14 tests × 3 experts per story) along with expert-written justifications for those labels.
The results of the evaluation showed that stories published in The New Yorker achieved the highest passing rate on all fourteen tests, with an overall pass rate of 84.7%. In comparison, LLM-generated stories passed only between a tenth and a third as many TTCW tests as human-written stories. The study also found that LLMs performed best on the Fluency dimension, while Claude v1.3 achieved the highest performance on average across Fluency, Flexibility, and Elaboration.
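These pass rates can be thought of as simple aggregations over the binary expert labels. The sketch below (illustrative, not from the paper) shows one plausible aggregation, a majority-of-three consensus per (story, test) pair followed by a per-source average, using made-up records; the paper's exact aggregation may differ.

```python
# Illustrative sketch (not the authors' code): turning binary expert labels
# into pass rates per story source, assuming a majority-of-three consensus
# for each (story, test) pair.
import pandas as pd

# Hypothetical long-format records: one row per expert judgment.
records = pd.DataFrame([
    {"source": "New Yorker", "story": "ny_01", "test": "narrative pacing", "verdict": 1},
    {"source": "New Yorker", "story": "ny_01", "test": "narrative pacing", "verdict": 1},
    {"source": "New Yorker", "story": "ny_01", "test": "narrative pacing", "verdict": 0},
    {"source": "GPT-4",      "story": "g4_01", "test": "narrative pacing", "verdict": 0},
    {"source": "GPT-4",      "story": "g4_01", "test": "narrative pacing", "verdict": 1},
    {"source": "GPT-4",      "story": "g4_01", "test": "narrative pacing", "verdict": 0},
])

# Majority vote across the three experts for each (story, test) pair.
consensus = (records.groupby(["source", "story", "test"])["verdict"]
                    .mean().ge(0.5).astype(int)
                    .rename("passed").reset_index())

# Fraction of (story, test) pairs passed, per source.
print(consensus.groupby("source")["passed"].mean())
```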
The study further analyzed the reproducibility of the TTCW evaluations and found moderate agreement among experts on most of the individual tests. Experts achieved strong agreement on the number of tests a story passes overall. The study also asked experts to rank the stories in terms of subjective preference and to guess each story's origin. Human-written New Yorker stories were ranked as the most preferred story 89% of the time, while LLM-generated stories ranked as the least preferred roughly two-thirds of the time. The results indicated that Claude v1.3 generated higher-quality short stories than models in the GPT family.
The study also explored the feasibility of using LLMs to administer the TTCW tests. However, the results showed that LLMs are not yet adept at evaluating creativity as experts do. The study found that none of the LLMs produced assessments that correlated positively with expert assessments.
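Since this summary does not specify the exact statistic used for that comparison, the sketch below illustrates two plausible checks, Cohen's kappa on per-test verdicts and Pearson correlation on per-story pass counts, using made-up, aligned verdict arrays.

```python
# Illustrative sketch (not the authors' code): two simple checks of how well
# an LLM assessor tracks the expert consensus, using made-up verdicts aligned
# over the same (story, test) pairs.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# 3 hypothetical stories x 4 tests each, flattened into aligned 0/1 arrays.
expert_verdicts = np.array([1, 1, 0, 1,   1, 0, 1, 1,   0, 0, 1, 0])
llm_verdicts    = np.array([1, 0, 1, 1,   0, 1, 1, 0,   1, 0, 0, 1])

# Chance-corrected agreement on individual test verdicts.
print("Cohen's kappa:", round(cohen_kappa_score(expert_verdicts, llm_verdicts), 3))

# Correlation on the aggregate level: tests passed per story.
expert_counts = expert_verdicts.reshape(-1, 4).sum(axis=1)
llm_counts = llm_verdicts.reshape(-1, 4).sum(axis=1)
r, p = pearsonr(expert_counts, llm_counts)
print(f"Pearson r on per-story pass counts: {r:.2f} (p = {p:.2f})")
```

A kappa or correlation near zero under checks like these would mean the LLM verdicts carry little of the signal in the expert verdicts, which is consistent with the study's finding of no positive correlation.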
The study concludes by highlighting the limitations of current LLMs in generating high-quality fictional stories and evaluating creativity. The authors hope that their evaluation framework and findings will guide future work and innovations in creativity research. They also suggest that LLMs could be used in co-writing interfaces to assist writers during the planning and reviewing phases of the writing process.
The document titled "Evaluating Creativity in Large Language Models" also explores various supporting aspects of evaluating creativity in language models. It cites several related studies and technical reports, including OpenAI's ChatGPT and GPT-4, and references research on creativity assessment, narrative theory, and automated support for creative writing.

The authors discuss the use of rubrics and scoring methods to evaluate creativity in writing, as well as the challenges of ensuring coverage and consistency in evaluation criteria. They provide examples of expert-written short stories from The New Yorker that were used in their evaluation process; the experts evaluated these stories based on a rubric and provided feedback on its effectiveness.

The authors also discuss the use of large language models (LLMs) in generating stories and highlight some of the limitations and challenges of LLM-generated stories. They explore the distinguishability between AI-generated and human-written stories, noting differences in areas such as narrative structure, language proficiency, and character development, and include examples of LLM-generated and expert-written explanations for assessing rhetorical complexity. They further discuss the potential for non-experts to administer the tests, suggesting the use of crowd-sourcing platforms, and provide prompts and instructions for assessing aspects of creative writing such as narrative pacing, scene vs. exposition balance, idiom and metaphor usage, narrative ending, unity of elements, emotional flexibility, originality in theme and content, originality in form, and character development. The document concludes by highlighting the importance of subtext in storytelling and its role in enriching a story's setting. Overall, the paper provides valuable insights into evaluating creativity in large language models and offers suggestions for future research.