Study Suggests Using Caution With ChatGPT for Breast Cancer Information

When given commonly asked questions regarding breast cancer, ChatGPT’s answers were incorrect a quarter of the time and often provided fake references.

ChatGPT's responses to breast cancer questions were often inaccurate, lacked reliable references, and were not at an appropriate reading level.

A study found that ChatGPT 3.5, when asked 20 common breast cancer questions, provided inaccurate answers in 24% of cases and lacked reliable references in 41% of responses, emphasizing the need for caution when using AI for medical information.

ChatGPT is a generative artificial intelligence language model that operates like a chatbot, generating responses to a wide range of questions. The model used in this study, ChatGPT 3.5, was the most widely available free tool at the time researchers performed this analysis.

“Furthermore, whereas each series of prompts started with the statement, ‘I am a patient,’ and requested that the responses should be for the patient, the responses provided were not at an appropriate patient reading level,” study authors wrote. “In fact, none of the responses were at the recommended sixth‐grade reading level, and the lowest grade level was eighth grade.”

Accuracy was rated on a four-point scale ranging from 1 (comprehensive information) to 4 (completely incorrect information). Clinical concordance was rated on a five-point scale, with 1 indicating a response completely similar to what a physician would provide and 5 indicating a response not similar to what a physician would provide. On both scales, lower scores are better. In this study, the overall average accuracy score was 1.88, and the average clinical concordance score was 2.79.

Each response had an average word count of 310 words (ranging from 146 words to 441 words per response) with high concordance.

Readability of the responses was calculated on a scale of 0 to 100 based on the average number of syllables per word and the average number of words per sentence. The average readability score was 37.9, indicating poor readability despite high concordance.
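The article does not name the readability formula, but a 0-to-100 score built from syllables per word and words per sentence matches the widely used Flesch Reading Ease score, where higher numbers mean easier reading. The following is a minimal sketch under that assumption, not the study's actual method. The function names are hypothetical, and the syllable counter is a rough vowel-group heuristic rather than a dictionary lookup.

```python
import re


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease from raw counts; higher means easier to read."""
    if words <= 0 or sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def score_text(text: str) -> float:
    """Score a passage by splitting it into sentences and words with simple regexes."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)
```

Longer sentences and longer words both pull the score down, which is why dense medical explanations tend to score low. Very simple text can score above 100 with this formula, so the 0-to-100 range is a convention rather than a hard limit.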

There was a weak correlation between the ease of readability and better clinical concordance. In addition, accuracy did not correlate with readability.

Responses from ChatGPT contained an average of 1.97 references, ranging from one to four per response. Researchers noted that ChatGPT cited peer-reviewed articles only once and often referred to nonexistent websites (41%).

Of note, the study identified several major question themes asked of ChatGPT including work-up of abnormal breast examination or imaging, surgery, medical term explanation, chemotherapy, immunotherapy, radiation therapy, available resources, supportive care resources, etiology of breast cancer and information about clinical trials.

In terms of accuracy, 36.1% of responses (130) were graded as comprehensive, while 24% (87) were graded as partly correct and partly incorrect. None of the responses were graded as completely incorrect. The most accurate responses were related to chemotherapy, whereas the question with the lowest accuracy score was about lymphedema after axillary surgery.

For clinical concordance, 12.8% of responses (46) were graded as completely similar to answers a clinician would give to the same question (the best score), and 7.8% (28) were graded as not similar at all. The most concordant responses were related to the work-up of an abnormal breast examination or imaging, while the least concordant responses were to the question about immunotherapy.
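The counts and percentages reported for accuracy and concordance can be checked against each other with simple arithmetic. The sketch below assumes a denominator of 360 graded responses. That total is not stated in the article and is back-calculated here from 130 responses being 36.1%, so treat it as an assumption.

```python
# Assumption: total number of graded responses, inferred from 130 / 0.361.
# The article itself does not state this figure.
TOTAL_GRADES = 360

# Counts as quoted in the article.
counts = {
    "accuracy: comprehensive": 130,
    "accuracy: some correct, some incorrect": 87,
    "concordance: completely similar": 46,
    "concordance: not similar at all": 28,
}


def percent(count: int, total: int = TOTAL_GRADES) -> float:
    """Share of the total, as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {label: percent(n) for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share}%")
```

Under this assumed total, all four quoted counts line up with the percentages given in the article once rounding is taken into account.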

The website most frequently referenced in responses from ChatGPT was that of the National Cancer Institute, followed by that of the American Cancer Society. ChatGPT cited peer-reviewed articles in only one instance; both articles cited were landmark publications from 2002.

In July 2023, breast cancer advocates asked ChatGPT 20 questions that patients were likely to ask. Each question was asked three times, and the responses were evaluated for accuracy and clinical concordance.

“With increasing reports of AI hallucination, wherein systems like OpenAI make up information or provide a response that does not seem justified by its training data, assessing patient‐facing medical information is critically important,” study authors wrote.
