A study found that ChatGPT 3.5, when asked 20 common breast cancer questions, provided inaccurate answers in 24% of cases and lacked reliable references in 41% of responses, emphasizing the need for caution when using AI for medical information.
ChatGPT is a generative artificial intelligence language model that operates as a chatbot, generating responses to user questions. The model used in this study — ChatGPT 3.5 — was the most widely available free tool at the time researchers performed this analysis.
“Furthermore, whereas each series of prompts started with the statement, ‘I am a patient,’ and requested that the responses should be for the patient, the responses provided were not at an appropriate patient reading level,” study authors wrote. “In fact, none of the responses were at the recommended sixth-grade reading level, and the lowest grade level was eighth grade.”
Accuracy was rated on a four-point scale ranging from 1 (comprehensive information) to 4 (completely incorrect information). Clinical concordance was rated on a five-point scale, with 1 indicating completely similar responses to a physician and 5 indicating not similar to what a physician would provide. In this study, the overall average accuracy was 1.88, and clinical concordance was 2.79.
Each response had an average word count of 310 words (ranging from 146 to 441 words per response), with high concordance.
Readability of the responses was calculated on a scale of 0 to 100 based on the average number of syllables per word and the average number of words per sentence. The average readability score was 37.9, indicating poor readability despite high concordance.
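This description corresponds to the widely used Flesch Reading Ease formula; the study does not name the formula explicitly, so the following is offered only as a likely reading of the method:

\[
\text{Reading Ease} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)
\]

On this scale, higher scores indicate easier text; a score of 37.9 falls in the range generally regarded as difficult, college-level reading, consistent with the finding that none of the responses reached a sixth-grade reading level.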
There was a weak correlation between easier readability and better clinical concordance. In addition, accuracy did not correlate with readability.
On average, responses from ChatGPT included 1.97 references, ranging from one to four references per response. Researchers noted that ChatGPT cited peer-reviewed articles only once and often (in 41% of responses) referred to nonexistent websites.
Of note, the study identified several major question themes asked of ChatGPT, including the work-up of an abnormal breast examination or imaging, surgery, explanation of medical terms, chemotherapy, immunotherapy, radiation therapy, available resources, supportive care resources, the etiology of breast cancer and information about clinical trials.
In terms of accuracy, 36.1% of responses (130 responses) were graded as comprehensive, while 24% (87 responses) were graded as containing some correct and some incorrect information. None of the responses were graded as completely incorrect. The most accurate responses were related to chemotherapy, whereas the lowest accuracy score was for the question about lymphedema after axillary surgery.
For clinical concordance, 12.8% of responses (46 responses) were graded as completely similar to answers a clinician would provide (the highest score), and 7.8% (28 responses) were graded as not at all similar. The highest concordance score was for the work-up of an abnormal breast examination or imaging, while the lowest was for the question about immunotherapy.
The most frequently referenced website in responses from ChatGPT was the National Cancer Institute, followed by the American Cancer Society. ChatGPT cited peer-reviewed articles on only one occasion, and both cited articles were landmark publications from 2002.
In July 2023, breast cancer advocates asked ChatGPT 20 questions that patients were likely to ask; each question was asked three times, and the responses were evaluated for accuracy and clinical concordance.
“With increasing reports of AI hallucination, wherein systems like OpenAI make up information or provide a response that does not seem justified by its training data, assessing patient‐facing medical information is critically important,” study authors wrote.