All of the papers summarized below are from the journal Radiology.
GPT-4 for Automated Determination of Radiologic Study and Protocol Based on Radiology Request Forms: A Feasibility Study
- Purpose:
- The study aims to assess the feasibility of using GPT-4 to determine the appropriate radiological study and protocol based on radiology request forms (RRFs).
- It explores the potential of GPT-4 in “natural language processing with large potential for clinical decision support applications”.
- Materials:
- The study involved “100 original in- and outpatient RRFs” from various subspecialties. These forms contained patient medical histories and clinical questions requiring radiological examination.
- Prompt:
- Based on the patient’s medical history and the referring physician’s questioning please indicate the appropriate radiological study (modality, body region, contrast agent yes or no, and if yes, which contrast phases) for each patient.
- Study Design & Workflow:
- Each RRF was presented to GPT-4, which was then prompted to recommend the appropriate radiological study for the patient.
- The AI’s task was to process the information on each RRF and determine “the optimal imaging approach” considering the patient’s history and clinical questions.
- GPT-4’s recommendations were compared against a reference standard, which was established based on the decisions of an expert radiologist.
- The radiologist provided the “gold standard” for what the correct imaging approach should be for each case.
- This comparison aimed to evaluate the accuracy of GPT-4 in replicating the decision-making process of a trained radiologist in this specific context.
- Statistical Analysis and Results:
- The analysis measured agreement between GPT-4’s recommendations and the expert radiologist’s reference standard.
- The primary metric was the percentage of cases in which GPT-4’s recommendation matched the expert’s decision (a minimal sketch of this prompt-and-compare workflow follows this summary).
- GPT-4 achieved an overall agreement rate of 84% with the reference standard. This indicates a high level of accuracy in determining the correct imaging studies and protocols.
- However, some errors were noted, such as recommending the wrong modality or omitting a necessary contrast agent.
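A minimal sketch of the prompt-and-compare workflow above, assuming the OpenAI Python client (v1+); the function names and the exact-match scoring are illustrative assumptions, not the authors’ implementation (in the study, agreement with the reference standard was judged by readers, not by string comparison).

```python
from openai import OpenAI  # assumes the OpenAI Python client, v1+

client = OpenAI()

# Prompt quoted in the study (see "Prompt" above).
PROMPT = (
    "Based on the patient's medical history and the referring physician's "
    "questioning please indicate the appropriate radiological study (modality, "
    "body region, contrast agent yes or no, and if yes, which contrast phases) "
    "for each patient."
)

def recommend_study(rrf_text: str, model: str = "gpt-4") -> str:
    """Send one radiology request form (RRF) to the model and return its recommendation."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": rrf_text},
        ],
    )
    return response.choices[0].message.content

def agreement_rate(model_answers: list[str], reference: list[str]) -> float:
    """Share of RRFs where the model's recommendation matches the expert reference.
    Exact matching is a simplification of the readers' judgment used in the study."""
    return sum(m == r for m, r in zip(model_answers, reference)) / len(reference)
```

With 100 RRFs, an agreement_rate of 0.84 would correspond to the 84% reported above.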
Appropriateness of Breast Cancer Prevention and Screening Recommendations Provided by ChatGPT
- Purpose:
- The study evaluated the appropriateness of ChatGPT’s responses to common questions about breast cancer prevention and screening. This assessment was informed by previous research showing ChatGPT’s ability to generate appropriate recommendations for cardiovascular disease prevention.
- Materials:
- The material consisted of 25 questions addressing fundamental concepts related to breast cancer prevention and screening. These questions were informed by the Breast Imaging Reporting and Data System Atlas and clinical experience in tertiary care breast imaging departments.
- Prompt:
- Each of the 25 questions was submitted to ChatGPT three times. The exact wording of the questions is not reproduced here, but they covered essential topics in breast cancer prevention and screening.
- Study Design & Workflow:
- This was a retrospective study approved by the University of Maryland School of Medicine institutional review board.
- After creating the set of questions, the authors submitted each question to ChatGPT three times.
- The responses from ChatGPT were graded by three fellowship-trained breast radiologists. They evaluated the responses as (a) appropriate, (b) inappropriate if any response contained inappropriate information, or (c) unreliable if the responses provided inconsistent content.
- These responses were assessed in two hypothetical contexts: as patient-facing material (e.g., on a hospital website) and as a chatbot responding to patient questions.
- A majority vote among the three radiologists determined the appropriateness of ChatGPT’s recommendations, and these data were summarized using descriptive statistics (a minimal majority-vote sketch follows this summary).
- Statistical Analysis and Results:
- ChatGPT-generated responses were found to be appropriate for 22 of the 25 (88%) questions in both hypothetical contexts.
- One question (4%) was deemed inappropriate, and two questions (8%) were considered unreliable in both contexts.
- The inappropriate and unreliable responses pertained to specific aspects of breast cancer prevention and screening, including scheduling mammography in relation to COVID-19 vaccination and locations for breast cancer screening.
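A minimal sketch of the three-reader majority vote described above, using only the Python standard library; how a three-way split would be resolved is an assumption, since the study’s tie-breaking rule is not given here.

```python
from collections import Counter

def consensus(ratings: tuple[str, str, str]) -> str:
    """Majority vote among the three breast radiologists' grades for one question.

    Each grade is 'appropriate', 'inappropriate', or 'unreliable'. A three-way
    split returns 'no majority' (an assumption; the study's tie rule is not stated here).
    """
    grade, votes = Counter(ratings).most_common(1)[0]
    return grade if votes >= 2 else "no majority"

# Example: two of three readers rate a response appropriate.
print(consensus(("appropriate", "appropriate", "unreliable")))  # -> appropriate
```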
How AI Responds to Common Lung Cancer Questions: ChatGPT versus Google Bard
- Purpose:
- The study aimed to evaluate and compare the accuracy and consistency of responses generated by ChatGPT and Google Bard to questions related to lung cancer prevention, screening, and terminology commonly used in radiology reports. These questions were based on the recommendations of Lung Imaging Reporting and Data System (Lung-RADS) version 2022 from the American College of Radiology and the Fleischner Society.
- Materials and Methods:
- Forty questions were created and presented identically to ChatGPT-3.5, the experimental version of Google Bard, and the Bing and Google search engines; each question was posed three times to every tool, yielding 120 responses per tool.
- The responses were reviewed by two radiologists and scored as correct, partially correct, incorrect, or unanswered.
- Consistency was also evaluated, defined as agreement among the three answers a given tool provided to the same question, regardless of whether the concept conveyed was correct or incorrect (a minimal tallying sketch follows this summary).
- Prompt:
- The exact prompt wording is not reproduced here; each of the 40 questions on lung cancer prevention, screening, and terminology was presented as written to the AI chatbots and search engines.
- Study Design & Workflow:
- The study was structured to compare the performance of ChatGPT-3.5 and Google Bard with conventional search engines (Bing and Google) in answering lung cancer-related questions.
- The responses were then analyzed for accuracy and consistency by radiologists, providing a direct comparison between AI-generated responses and those sourced from search engines.
- Statistical Analysis and Results:
- ChatGPT-3.5 answered all 120 questions, with 70.8% correct, 11.7% partially correct, and 17.5% incorrect answers.
- Google Bard left 19.2% of the questions unanswered; across all 120 questions, 51.7% were answered correctly, 9.2% partially correctly, and 20% incorrectly.
- Bing answered all questions with 61.7% correct, 10.8% partially correct, and 27.5% incorrect answers.
- The Google search engine had 55% correct, 22.5% partially correct, and 22.5% incorrect answers.
- ChatGPT-3.5 was more likely to provide correct or partially correct answers than Google Bard (OR = 1.55, P = .004).
- ChatGPT-3.5 and the Google search engine were more likely to be consistent than Google Bard (OR = 6.65 for ChatGPT-3.5 and 28.83 for the Google search engine, both with P = .002).
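A minimal tallying sketch for the accuracy, consistency, and odds-ratio comparisons above, assuming each tool’s three graded responses per question are available; using identical grade labels as a proxy for consistency and an unadjusted 2×2 odds ratio are simplifications, not the authors’ statistical analysis.

```python
from collections import Counter

Grade = str  # "correct", "partially correct", "incorrect", or "unanswered"

def grade_rates(responses: dict[str, list[Grade]]) -> dict[Grade, float]:
    """Fraction of each grade across all responses (each question was asked three times)."""
    tally = Counter(grade for grades in responses.values() for grade in grades)
    total = sum(tally.values())
    return {grade: count / total for grade, count in tally.items()}

def consistency_rate(responses: dict[str, list[Grade]]) -> float:
    """Share of questions whose three responses agree with one another.

    The study judged agreement on the concept conveyed, regardless of correctness;
    identical grade labels are used here as a stand-in.
    """
    return sum(len(set(grades)) == 1 for grades in responses.values()) / len(responses)

def odds_ratio(success_a: int, fail_a: int, success_b: int, fail_b: int) -> float:
    """Unadjusted odds ratio from a 2x2 table, e.g. correct-or-partially-correct
    vs. not, for tool A compared with tool B."""
    return (success_a * fail_b) / (fail_a * success_b)
```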
A Context-based Chatbot Surpasses Radiologists and Generic ChatGPT in Following the ACR Appropriateness Guidelines
- Purpose:
- The study investigated the potential of an interactive chatbot to support clinical decision-making by providing personalized imaging recommendations based on the American College of Radiology (ACR) appropriateness criteria documents using semantic similarity processing.
- Materials and Methods:
- The study utilized 209 ACR appropriateness criteria documents as a knowledge base and the LlamaIndex framework to connect large language models with external data, along with ChatGPT-3.5-turbo, to create an appropriateness criteria context-aware chatbot (accGPT). Fifty clinical case files were used for performance comparison with general radiologists of varying experience levels and generic ChatGPT versions 3.5 and 4.0.
- Prompting Strategy and Answer Synthesis:
- The chatbots were prompted with the question: “Is imaging typically appropriate for this case? If so, please specify the most suitable imaging modality and whether a contrast agent is required.” The responses from GPT-3.5-turbo and GPT-4 were captured directly, while for accGPT the best-matching data nodes were first retrieved from the index and then used in a multi-step answer-synthesis process (a minimal retrieval-augmented sketch appears at the end of this summary).
- Study Design & Workflow:
- The 50 clinical case files, created based on the ACR Appropriateness Criteria, included a wide range of topics and medical conditions. Six radiologists of different experience levels evaluated the appropriateness of imaging, modality, and contrast agent administration for each case file without consulting guidelines or colleagues. The chatbots underwent six-fold repetition testing on the case files, and their performance was similarly evaluated.
- Statistical Analysis and Results:
- accGPT showed significantly better performance in providing correct recommendations for imaging according to the “usually appropriate” criteria compared to radiologists and GPT-3.5-turbo, and performed better than GPT-4 at a trend level. In terms of recommendations meeting both “usually appropriate” and “may be appropriate” criteria, accGPT again outperformed radiologists and GPT-3.5-turbo, but its performance was not significantly different from GPT-4.
- Regarding consistency, accGPT was correct in all six runs in 74% of cases and at least four times correct in 82% of cases. GPT-3.5-turbo and GPT-4 had lower percentages of consistent correct recommendations.
- Cost-Effectiveness Analysis:
- Radiologists spent an average of around 50 minutes evaluating the case files, resulting in a mean cost of €29.99.
- By comparison, accGPT cost €0.02 and required about 2 minutes, while GPT-4 cost €0.36 (0.20) and needed about 8 minutes.
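A minimal retrieval-augmented sketch in the spirit of accGPT, assuming a recent LlamaIndex release with its OpenAI defaults; the folder path, top-k value, and prompt assembly are illustrative assumptions, not the authors’ pipeline, and import paths differ across LlamaIndex versions.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Hypothetical local folder holding the 209 ACR Appropriateness Criteria documents.
documents = SimpleDirectoryReader("acr_appropriateness_criteria/").load_data()

# Index the documents so the best-matching criteria passages can be retrieved by
# semantic similarity and handed to the LLM for answer synthesis.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)

# Fixed question used in the study (see "Prompting Strategy and Answer Synthesis" above).
QUESTION = (
    "Is imaging typically appropriate for this case? If so, please specify the most "
    "suitable imaging modality and whether a contrast agent is required."
)

def recommend(case_text: str) -> str:
    """Answer the study's fixed question for one clinical case file, grounded in the
    retrieved ACR criteria passages."""
    return str(query_engine.query(f"{case_text}\n\n{QUESTION}"))
```

LlamaIndex delegates the final answer synthesis to an LLM (an OpenAI chat model by default); configuring it to use gpt-3.5-turbo would mirror the accGPT setup described above.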