ChatGPT is as (in)accurate at diagnosis as ‘Dr Google’

ChatGPT is mediocre at diagnosing medical conditions, getting it right only 49% of the time, according to a new study. The researchers say their findings show that AI shouldn’t be the sole source of medical information and highlight the importance of maintaining the human element in healthcare.

The convenience of online access has meant that some people bypass seeing a medical professional altogether, choosing to google their symptoms instead. While being proactive about one’s health is not a bad thing, ‘Dr Google’ is just not that accurate. A 2020 Australian study of 36 international mobile and web-based symptom checkers found that the correct diagnosis was listed first only 36% of the time.

Surely, AI has improved since 2020. Yes, it definitely has. OpenAI’s ChatGPT has progressed in leaps and bounds – it’s able to pass the US Medical Licensing Exam, after all. But does that make it better than Dr Google in terms of diagnostic accuracy? That’s the question that researchers from Western University in Canada sought to answer in a new study.

Using ChatGPT 3.5 – a large language model (LLM) trained on a massive dataset of over 400 billion words from internet sources including books, articles, and websites – the researchers conducted a qualitative analysis of the medical information the chatbot provided when answering Medscape Case Challenges.

Medscape Case Challenges are complex clinical cases that test a medical professional’s knowledge and diagnostic skills. Medical professionals are required to make a diagnosis or choose an appropriate treatment plan for each case by selecting from four multiple-choice answers. The researchers chose Medscape’s Case Challenges because they’re open-source and freely accessible. To prevent the possibility that ChatGPT had prior knowledge of the cases, only those authored after model 3.5’s August 2021 training-data cutoff were included.

A total of 150 Medscape cases were analyzed. With four multiple-choice responses per case, that meant there were 600 possible answers in total, with only one correct answer per case. The analyzed cases covered a wide range of medical problems, with titles like “Beer, Aspirin Worsen Nasal Issues in a 35-Year-Old With Asthma”, “Gastro Case Challenge: A 33-Year-Old Man Who Can’t Swallow His Own Saliva”, “A 27-Year-Old Woman With Constant Headache Too Tired To Party”, “Pediatric Case Challenge: A 7-Year-Old Boy With a Limp and Obesity Who Fell in the Street”, and “An Accountant Who Loves Aerobics With Hiccups and Incoordination”. Cases with visual assets, like clinical images, medical photography, and graphs, were excluded.

An example of a standardized prompt fed to ChatGPT (Hadi et al.)

To ensure consistency in the input provided to ChatGPT, each case challenge was turned into a single standardized prompt, including a script for the output the chatbot was to provide. All cases were evaluated by at least two independent raters (medical trainees) who were blinded to each other’s assessments. They rated ChatGPT’s responses on diagnostic accuracy, cognitive load (that is, the complexity and clarity of the information provided, from low to high), and the quality of the medical information (including whether it was complete and relevant).
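
For illustration only, here is a rough sketch of how a case of this kind could be turned into a single standardized prompt and submitted programmatically. The figure above showed the study’s actual wording; the prompt text, field names, and use of the OpenAI chat API below are assumptions made for the sketch, not details taken from the paper.

```python
# Illustrative sketch only: wrapping a clinical case and its four answer
# options in one standardized prompt and sending it to GPT-3.5.
# The prompt wording and the use of the OpenAI chat API are assumptions,
# not the study's actual setup.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY to be set in the environment

def build_prompt(case_text: str, options: list[str]) -> str:
    """Combine the case vignette and its answer options into one prompt,
    with a fixed script for how the model should structure its reply."""
    numbered = "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
    return (
        "You are answering a clinical case challenge.\n\n"
        f"Case:\n{case_text}\n\n"
        f"Answer options:\n{numbered}\n\n"
        "Choose the single best answer, state the option number, "
        "and briefly explain your reasoning."
    )

def ask_chatgpt(case_text: str, options: list[str]) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_prompt(case_text, options)}],
        temperature=0,  # keep outputs as repeatable as possible across raters
    )
    return response.choices[0].message.content
```

Keeping the prompt identical across all 150 cases is what allows the raters’ scores to be compared from one case to the next.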

Out of the 150 Medscape cases analyzed, ChatGPT gave the correct answer in 49% of cases. However, when judged across all of the answer options, the chatbot demonstrated an overall accuracy of 74%, a figure that reflects its ability to identify and reject the incorrect multiple-choice options.

“This higher value is due to the ChatGPT’s ability to identify true negatives (incorrect options), which significantly contributes to the overall accuracy, enhancing its utility in eliminating incorrect choices,” the researchers explain. “This difference highlights ChatGPT’s high specificity, indicating its ability to excel at ruling out incorrect diagnoses. However, it needs improvement in precision and sensitivity to reliably identify the correct diagnosis.”
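
To see how the per-option figure (74%) can sit so far above the per-case figure (49%), here is a small worked sketch. The counts below are hypothetical reconstructions chosen only to be roughly consistent with the percentages reported in the article across 150 cases × 4 options = 600 option-level judgments; they are not the study’s raw data.

```python
# Hypothetical counts, not the study's raw data: chosen only to roughly match
# the reported percentages across 600 option-level judgments (150 cases x 4).
tp = 74    # correct diagnoses endorsed (~49% of 150 cases)
fn = 76    # correct diagnoses missed
tn = 370   # incorrect options correctly rejected
fp = 80    # incorrect options wrongly endorsed

total = tp + tn + fp + fn                 # 600 judgments in all
overall_accuracy = (tp + tn) / total      # ~0.74: boosted by the many true negatives
sensitivity = tp / (tp + fn)              # ~0.49: finding the right diagnosis
specificity = tn / (tn + fp)              # ~0.82: ruling out wrong diagnoses

print(f"overall accuracy: {overall_accuracy:.2f}")
print(f"sensitivity:      {sensitivity:.2f}")
print(f"specificity:      {specificity:.2f}")
```

Because three of every four options are incorrect, a model that is good at rejecting distractors can post a respectable overall accuracy even when it picks the right answer in only about half the cases.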

In addition, ChatGPT provided false positives (13%) and false negatives (13%), which has implications for its use as a diagnostic tool. A little over half (52%) of the answers provided were complete and relevant, with 43% incomplete but still relevant. ChatGPT tended to produce answers with a low (51%) to moderate (41%) cognitive load, making them easy to understand for users. However, the researchers point out that this ease of understanding, combined with the potential for incorrect or irrelevant information, could result in “misconceptions and a false sense of comprehension”, particularly if ChatGPT is being used as a medical education tool.

“ChatGPT also struggled to distinguish between diseases with subtly different presentations and the model also occasionally generated incorrect or implausible information, known as AI hallucinations, emphasizing the risk of sole reliance on ChatGPT for medical guidance and the necessity of human expertise in the diagnostic process,” said the researchers.

The researchers say that AI should be used as a tool to enhance, not replace, medicine’s human element

Of course – and the researchers point this out as a limitation of the study – ChatGPT 3.5 is only one AI model, may not be representative of other models, and is bound to get better in future iterations, which could raise its accuracy. Also, the Medscape cases analyzed by ChatGPT focused primarily on differential diagnosis, where medical professionals must distinguish between two or more conditions with similar signs or symptoms.

While future research should assess the accuracy of different AI models using a wider range of case sources, the results of the present study are nonetheless instructive.

“The combination of high relevance with relatively low accuracy advises against relying on ChatGPT for medical counsel, as it can present important information that may be misleading,” the researchers said. “While our results indicate that ChatGPT consistently delivers the same information to different users, demonstrating substantial inter-rater reliability, it also reveals the tool’s shortcomings in providing factually correct medical information, as evident [sic] by its low diagnostic accuracy.”

The study was published in the journal PLOS One.
