Introduction
Monitoring tumor changes in lung cancer is critical for assessing treatment response. Radiologists typically evaluate longitudinal CT scan reports and apply RECIST criteria (Response Evaluation Criteria in Solid Tumors) to classify disease status (e.g., progression or regression). This manual process is time-consuming and can be subjective. Recent advances in large language models (LLMs) offer the potential to automate information extraction from clinical text with high accuracy. For example, open-source models like Vicuna (13B) have accurately extracted key findings from radiology reports without additional training. Moreover, early studies show that language models can infer cancer treatment response from reports with nearly 90% accuracy. Building on these findings, we propose a study to develop an LLM-based pipeline that automatically analyzes lung cancer CT reports over time and classifies disease status using an extended RECIST framework.
Objective
Primary Objective: Develop and evaluate an automated method using a latest-generation open-source LLM (running locally, without fine-tuning) to analyze longitudinal lung cancer CT radiology reports. The LLM will extract key lesion information (e.g., lesion size measurements, invasion of adjacent structures, new lymph node involvement) and perform longitudinal comparisons to classify the patient’s disease status according to extended RECIST criteria. The extended RECIST categories include:
- Stable Disease (SD): No significant change in tumor burden.
- Stable with Sub-threshold Progression: Minor increase in lesion size or extent, not meeting RECIST progression threshold.
- Stable with Sub-threshold Regression: Minor decrease in lesion size, not meeting RECIST partial response threshold.
- Regression: Clear tumor size reduction meeting RECIST response criteria (includes partial/complete response).
- Progression: Significant tumor growth or new lesions meeting RECIST progression criteria.
By comparing the LLM’s classifications against expert human assessments, we will determine how well a zero-shot/few-shot prompted LLM can replicate radiologists’ judgment of disease progression on serial reports.
Data and Annotations
Data Source: We will utilize de-identified longitudinal CT radiology reports from lung cancer patients in a clinical database. Each patient has a series of CT scan reports (baseline and multiple follow-ups) documenting tumor findings over time. These free-text reports often describe measured tumor sizes, the presence or absence of new lesions, changes in lymph nodes, and any invasion into adjacent organs. All reports will be in English (or translated if originally in another language), and organized per patient with timepoints labeled (e.g., Report 1: January 2024, Report 2: March 2024, etc.).
Ground Truth Annotations: Expert radiologists or oncologists will have reviewed the same report series for each patient and assigned an extended RECIST category (as defined above) reflecting the disease trajectory between timepoints. For example, for each interval (baseline to first follow-up, etc.), the expert labels whether the disease showed progression, regression, or remained stable (with possible minor changes). These human classifications serve as the reference standard. We anticipate on the order of hundreds of patients (yielding a few thousand reports total) to ensure a robust evaluation. Each case’s final label is one of the five categories: Stable, Stable w/ sub-threshold progression, Stable w/ sub-threshold regression, Regression, or Progression.
Methodology
LLM Model Selection and Setup
We will employ a latest-generation open-source LLM that can be run locally to preserve data privacy. Candidates include models such as Llama 3, Mistral 7B, or Qwen-14B, which are state-of-the-art open-weight models in 2024–2025. The chosen model will have strong general language understanding and reasoning abilities, but no additional fine-tuning on our dataset will be performed. This ensures we test the model’s out-of-the-box capability. The model will be deployed on a secure local server with sufficient GPU resources for inference.
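As an illustration of the deployment setup, the following minimal sketch loads one candidate open-weight model for local inference with the Hugging Face transformers library; the specific checkpoint name and generation settings are assumptions, not final choices.

```python
# Minimal local-inference sketch (no fine-tuning); the checkpoint name below is
# an assumed example of a candidate open-weight model, not a final selection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed candidate checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,  # half precision to fit a single local GPU
    device_map="auto",          # place weights on the available local GPU(s)
)

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Run one deterministic inference pass on the locally hosted model."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```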
We will experiment with prompting strategies:
- Zero-shot prompting: providing the model instructions to extract lesion information and summarize changes, without any examples.
- Few-shot prompting: providing 1–3 example radiology report excerpts and their expected outputs (lesion findings or RECIST classification) to guide the model.
The prompt will be carefully crafted in natural language (e.g., “Extract all tumor measurements and relevant findings from the following reports, then compare the latest to the previous and determine the disease status according to extended RECIST criteria.”). We may include a brief definition of the extended RECIST categories in the prompt so the model knows the classification scheme. No patient-identifying information will be present in the prompts (only clinical content).
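As a concrete illustration, a zero-shot prompt might be assembled from a template like the one sketched below; the exact wording, the embedded category definitions, and the placeholder names (prior_report, current_report) are assumptions to be refined during prompt development.

```python
# Illustrative zero-shot prompt template; wording and placeholders are assumptions.
PROMPT_TEMPLATE = """You are assisting with oncologic response assessment.

Extended RECIST categories:
- Stable Disease: no significant change in tumor burden.
- Stable with sub-threshold progression: minor increase, below the 20% progression threshold.
- Stable with sub-threshold regression: minor decrease, below the 30% response threshold.
- Regression: >=30% decrease in the sum of target lesion diameters, or complete disappearance.
- Progression: >=20% (and >=5 mm) growth of any target lesion, new lesions, or new invasion.

Prior CT report:
{prior_report}

Current CT report:
{current_report}

Task: Extract all tumor measurements and relevant findings from both reports,
compare the current report to the prior one, and return a JSON object listing
each lesion with its prior and current size in mm, any new lesions, lymph node
involvement, or invasion, and the extended RECIST category for this interval."""

def build_prompt(prior_report: str, current_report: str) -> str:
    """Fill the template with the two de-identified report texts."""
    return PROMPT_TEMPLATE.format(prior_report=prior_report, current_report=current_report)
```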
Information Extraction and Longitudinal Comparison
Analysis Pipeline: The study will develop an automated pipeline that processes the reports and infers the RECIST category as follows:
- Lesion Information Extraction: For each CT report, the LLM is prompted to extract key tumor metrics and findings. This includes:
  - Measured sizes of target lesions (e.g., “Lesion in left upper lobe – 2.3 cm” and any previous size mentioned).
  - Qualitative descriptions of change (e.g., “slight increase,” “marked decrease”).
  - Presence of new lesions or lymphadenopathy (e.g., “new enlarged mediastinal lymph node”).
  - Mention of invasion into adjacent structures (e.g., “tumor now invades chest wall”).
The LLM’s output for each report will be a structured summary of these findings (for example, a list of lesions with their current size and prior size, plus notes on new findings). We will utilize the model in a controlled manner – if needed, splitting long reports into sections or using system messages to keep it focused. The output will then be parsed into a structured format (like JSON or a Python data structure) for easier comparison.
- Longitudinal Data Alignment: Using the structured outputs, we will align lesions across timepoints. The pipeline will match lesions from the previous report to those in the current report based on their descriptions or location (the LLM-provided summary can help identify that “lesion in left upper lobe” at baseline corresponds to “lesion in left upper lobe” at follow-up). For each matching lesion, the change in size is calculated as a percentage. Newly appearing lesions or lymph nodes in the latest report are flagged.
- Rule-based RECIST Classification: We will implement logic to assign the extended RECIST category for the interval. The rules will mirror RECIST 1.1 thresholds and the extended definitions:
  - Progression: If any target lesion has grown by ≥20% in diameter (and at least +5 mm absolute growth), or if a new lesion is present (including new lymph node metastasis), or if new invasion of structures is reported, then classify as Progression.
  - Regression: If there is a ≥30% decrease in the sum of diameters of target lesions (or clear wording of significant regression/partial response in the report), and no new lesions, classify as Regression. (If complete disappearance is noted, that is also Regression, effectively a complete response.)
  - Stable Disease: If no criteria for progression or regression are met, the case falls into the stable categories. We further subdivide based on minor changes:
    - If there was a slight increase in tumor burden (but less than the 20% threshold), classify as Stable with sub-threshold progression.
    - If there was a slight decrease (but less than the 30% threshold), classify as Stable with sub-threshold regression.
    - If there is essentially no change (<5% change or qualitatively “no significant change”), classify as plain Stable.
These rules will be encoded in a post-processing script that takes the LLM-extracted measurements as input and outputs the category. This approach ensures consistent application of criteria (the LLM provides the data, and the program makes the decision, reducing variability). We will verify the logic on a few sample cases manually to ensure it aligns with the intended extended RECIST framework; an illustrative sketch of this post-processing step follows the pipeline list below.
- LLM Feedback (optional): In cases where the LLM’s extracted data is ambiguous or incomplete, we may refine the prompt or use a second query to the model for clarification. For instance, if a report says “slight growth of the nodule,” the LLM might not give a percentage – we could prompt: “Does the report indicate if this growth meets criteria for progression?” in a few-shot manner. However, the primary plan is to rely on the deterministic logic once the raw info is extracted.
- Classification Output: For each patient and each interval between scans, the pipeline will output the LLM-derived RECIST category. These outputs will be collected for evaluation against the ground truth labels.
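To make the deterministic post-processing step concrete, here is a minimal sketch of the alignment and classification logic, assuming the parsed LLM output arrives as lists of lesion dictionaries keyed by location plus flags for new lesions and invasion; this schema and the helper names are hypothetical, and the thresholds follow the rules above.

```python
# Sketch of the rule-based extended-RECIST step. The input schema (lists of
# {"location": ..., "size_mm": ...} dictionaries plus boolean flags) is a
# hypothetical structure for the parsed LLM output, not a finalized format.

def percent_change(prior_mm: float, current_mm: float) -> float:
    """Percentage change of a diameter relative to the prior scan."""
    return (current_mm - prior_mm) / prior_mm * 100.0

def classify_interval(
    prior_lesions: list[dict],
    current_lesions: list[dict],
    new_lesion: bool = False,
    new_invasion: bool = False,
) -> str:
    """Assign an extended RECIST category for one interval between two reports."""
    prior_by_loc = {l["location"]: l["size_mm"] for l in prior_lesions}
    curr_by_loc = {l["location"]: l["size_mm"] for l in current_lesions}

    # Progression: any new lesion or invasion, or >=20% and >=5 mm growth of a target lesion.
    if new_lesion or new_invasion:
        return "Progression"
    for loc, curr in curr_by_loc.items():
        prior = prior_by_loc.get(loc)
        if prior and percent_change(prior, curr) >= 20 and (curr - prior) >= 5:
            return "Progression"

    # Compare the sum of diameters of lesions matched across the two reports.
    matched = [loc for loc in curr_by_loc if loc in prior_by_loc]
    prior_sum = sum(prior_by_loc[loc] for loc in matched)
    curr_sum = sum(curr_by_loc[loc] for loc in matched)
    if prior_sum == 0:
        return "Stable Disease"  # nothing measurable to compare
    sum_change = percent_change(prior_sum, curr_sum)

    if sum_change <= -30:
        return "Regression"
    if sum_change >= 5:
        return "Stable with sub-threshold progression"
    if sum_change <= -5:
        return "Stable with sub-threshold regression"
    return "Stable Disease"
```

For example, under these rules a single target lesion growing from 20 mm to 23 mm (+15%) would be labeled Stable with sub-threshold progression, while growth to 25 mm (+25% and +5 mm) would trigger Progression.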
Throughout the methodology, no fine-tuning of the LLM weights is done – the model is used as-is (pretrained on general data). This ensures the method is easily deployable to new data without additional training, aligning with our objective to test zero-shot performance. Any prompt refinements will be uniformly applied and documented, but the model’s internal parameters remain unchanged.
Validation and Evaluation
Once the LLM pipeline has generated classifications for all cases, we will evaluate its performance against the expert-labeled ground truth. The validation framework includes the following:
- Hold-out Test Set: The dataset will be split so that a portion of the patient cases is set aside as a test set not seen during prompt development. We might use, for example, 80% of the cases to iteratively refine prompts and the rule-based logic, and then evaluate on the remaining 20% (or perform a cross-validation). This ensures an unbiased assessment of how the method generalizes to new patients.
- Primary Metric – Accuracy: We will compute the overall classification accuracy of the LLM-based method in assigning the correct extended RECIST category, compared to the human expert labels. This gives a direct measure of how often the automated system exactly matches the human judgment.
- Secondary Metrics: To provide a deeper evaluation, we will calculate the following (a brief computation sketch appears after this list):
  - Precision and Recall for each category (e.g., what percentage of cases the model labeled “Progression” were true progressions, and what percentage of true progressions were caught).
  - F1-score for each category and averaged (harmonic mean of precision and recall) – useful given class imbalance (e.g., full Progression might be less common than Stable).
  - Cohen’s Kappa to measure agreement with human experts beyond chance. This is important because it accounts for the possibility of agreement by random chance; a high kappa (close to 1.0) would indicate the model’s classifications are almost interchangeable with a human’s. Prior studies classifying RECIST outcomes have achieved kappa around 0.76 (substantial agreement), so we will see if the LLM can reach a similar range.
- Confusion Matrix: We will tabulate a confusion matrix showing counts of each predicted category vs. actual category. This will highlight where the model tends to make mistakes (for example, confusing “Stable with sub-threshold progression” vs. “Stable” proper).
- Statistical Analysis: We will report these metrics with 95% confidence intervals. If the sample size allows, we may perform statistical tests (e.g., McNemar’s test for paired categorical outcomes) to see if differences between the LLM and human classifications are significant. We will also compare the performance of zero-shot vs. few-shot prompting if both are tried, to quantify any improvement from including examples.
- Error Analysis: Two radiologists will independently review cases where the LLM’s classification disagrees with the ground truth to determine the cause of error. We will categorize errors into types: e.g., Extraction errors (the LLM missed or misinterpreted a measurement), Logic errors (the post-processing mis-applied a rule), or Report ambiguity (even humans found the report confusing or borderline). This analysis will guide future improvements (for instance, if many errors are due to the LLM misreading numeric values, a potential fix could be to incorporate an OCR or regex step for numbers, or if many errors are borderline cases, perhaps the extended categories themselves might need refining).
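The metrics above map directly onto standard library routines. A minimal sketch, assuming the predictions and expert labels are available as aligned lists of category strings and that scikit-learn is used for the computation:

```python
# Minimal evaluation sketch; y_true and y_pred are assumed to be aligned lists of
# extended RECIST category strings, one entry per scan interval.
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    cohen_kappa_score,
    confusion_matrix,
)

CATEGORIES = [
    "Stable Disease",
    "Stable with sub-threshold progression",
    "Stable with sub-threshold regression",
    "Regression",
    "Progression",
]

def evaluate(y_true: list[str], y_pred: list[str]) -> None:
    """Report accuracy, chance-corrected agreement, per-class metrics, and the confusion matrix."""
    print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")
    print(f"Cohen's kappa: {cohen_kappa_score(y_true, y_pred):.3f}")
    # Per-category precision, recall, and F1, plus macro/weighted averages.
    print(classification_report(y_true, y_pred, labels=CATEGORIES, zero_division=0))
    # Rows correspond to expert labels, columns to pipeline predictions.
    print(confusion_matrix(y_true, y_pred, labels=CATEGORIES))
```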
Expected Outcomes and Significance
We anticipate the proposed LLM-driven approach will accurately classify patient disease status in the majority of cases, demonstrating performance comparable to human experts:
- High Accuracy and Agreement: We expect the model to achieve high overall accuracy (for example, >85%). Given that prior NLP methods for RECIST classification have reached F1 scores ~86% for multi-class tasks and large transformer models achieved ~89% accuracy, our LLM approach without fine-tuning could similarly approach these numbers. A Cohen’s kappa in the range of 0.7–0.8 is anticipated, indicating substantial agreement with radiologist interpretations. This would validate that a general-purpose LLM can understand complex radiology text and apply clinical criteria correctly.
- Strengths in Clear-Cut Cases: We expect particularly strong performance for clear progression (large tumor growth or new lesions) and clear regression cases. These situations are usually described explicitly in reports (e.g., “significant interval increase in size” or “marked decrease in size of all measurable disease”), which the LLM should capture and our rules will classify unequivocally as progression or regression. Precision in identifying true progression should be high – the model will rarely flag progression unless criteria are met, due to the rule-based thresholds.
- Challenges in Borderline Cases: The more subtle categories (stable with sub-threshold changes) might be more challenging. We expect some confusion or variability in how the LLM interprets phrases like “slight increase” or “tiny decrease.” The model might occasionally misclassify a very small growth as progression or vice versa. However, by explicitly defining threshold rules, we hope to minimize this. The confusion matrix and error review will likely show most mistakes happen between Stable vs Stable sub-threshold categories (which even human readers might sometimes label differently). We will document how often the model “upgrades” or “downgrades” a stable case incorrectly.
- Generality without Fine-Tuning: Achieving good performance with zero-shot prompting would underscore the power of latest-generation LLMs in medical text understanding. It would mean that even without task-specific training, the language model’s knowledge (likely gained from wide text pretraining) is sufficient to parse radiology jargon and measurement contexts. This aligns with recent findings that large pre-trained models can handle clinical text extraction tasks with excellent accuracy off-the-shelf. A successful outcome would validate the feasibility of a fully automated, prompt-based solution for radiology report analysis in oncology. This could encourage healthcare institutions to adopt LLM-based tools for other information extraction needs, especially since open-source models can be deployed locally to maintain patient data privacy.
- Potential Need for Refinement: If the accuracy or agreement falls short (say significantly below human performance), the results will still be instructive. We might discover that the model struggles with numeric precision or specific medical terminology. In that case, the study would highlight areas for improvement, such as incorporating a smaller fine-tuning step on radiology text, adding a knowledge graph of lesions, or using an ensemble of LLM and rule-based systems. An expected outcome is a list of common error types (from our error analysis) which future research can address (e.g., by improving prompt clarity or integrating domain-specific modules).
Significance: If successful, this study will produce an automated system that can classify tumor response over time directly from narrative reports, which could streamline oncology workflows. Instead of manually reading through multiple prior reports, clinicians could get an AI-generated summary of how a patient’s disease is trending (stable, improving, or worsening) with evidence of lesion measurements. This could be especially useful in large-scale clinical trials or retrospective research, where thousands of radiology reports need to be assessed for outcomes. By using an open-source LLM, we also demonstrate a cost-effective and privacy-preserving solution that any hospital could implement without reliance on proprietary APIs. In summary, we expect to show that a latest-generation local LLM, guided by logical criteria, can closely replicate expert radiologists in interpreting longitudinal CT reports for lung cancer patients, marking a significant step towards automated longitudinal disease monitoring in medical practice.