AI-powered diagnostic tools are increasingly used in clinical settings. As a clinician, you need to understand how to interpret their performance metrics to make informed decisions about when to trust these tools and how to use them appropriately.
This tutorial uses a chest X-ray AI for pneumonia detection as an example. The same principles apply to any diagnostic AI tool.
Each dot represents a patient's chest X-ray. The AI assigns a confidence score (0-100%) for pneumonia. Drag the purple threshold line to set the cutoff; patients scoring above the threshold (to its right) are flagged as "Pneumonia Detected." Watch how the confusion matrix and metrics change as you move the threshold. Try the three dataset buttons ("Severe Cases," "Subtle Cases," and "Low Prevalence") to see how AI performance differs with obvious vs. subtle findings and with high vs. low disease prevalence.
X-axis: AI confidence score (the model's estimated probability of pneumonia).
With this threshold, the AI correctly classifies 85% of all cases (accuracy) and catches 83% of pneumonia cases (recall). However, 14% of healthy patients receive false alarms (false positive rate). When the AI says "no pneumonia," there's a 92% chance the patient is truly healthy (NPV).
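If you prefer to see the mechanic in code, here is a minimal sketch of how a threshold turns confidence scores into the four confusion-matrix counts. The scores and labels below are made up for illustration; they are not the tutorial's datasets.

```python
import numpy as np

# Hypothetical scores and ground truth, for illustration only
# (1 = pneumonia confirmed on follow-up, 0 = healthy).
scores = np.array([0.92, 0.15, 0.78, 0.40, 0.65, 0.08, 0.55, 0.30])
labels = np.array([1,    0,    1,    1,    0,    0,    1,    0])

threshold = 0.50
flagged = scores >= threshold  # the "Pneumonia Detected" side of the line

tp = int(np.sum(flagged & (labels == 1)))   # correct alerts
fp = int(np.sum(flagged & (labels == 0)))   # false alarms
fn = int(np.sum(~flagged & (labels == 1)))  # missed cases
tn = int(np.sum(~flagged & (labels == 0)))  # correct clearances
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```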
Use the interactive visualization above to explore these clinical scenarios:
You don't want to send anyone with pneumonia home from the ED. Using "Severe Cases", where would you set the threshold?
Set a low threshold (~20-30%).
For screening, you want high recall/sensitivity—catching as many true pneumonia cases as possible. Missing a case (false negative) could mean sending a sick patient home untreated.
The tradeoff: More false positives, meaning more healthy patients get flagged. But in the ED, it's safer to do additional testing on a false alarm than to miss real pneumonia.
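As a rough sketch of how such a cutoff could be chosen in code: scan candidate thresholds on a scored validation set and keep the highest one that still meets a sensitivity floor. The helper name and the 95% target are assumptions for illustration, not part of the tutorial's tool.

```python
import numpy as np

def screening_threshold(scores, labels, min_recall=0.95):
    """Highest cutoff that still catches at least `min_recall` of true cases."""
    best = 0.0
    for t in np.unique(scores):  # candidate cutoffs taken from the data
        caught = np.sum((scores >= t) & (labels == 1))
        if caught / np.sum(labels == 1) >= min_recall:
            best = max(best, float(t))
    return best
```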
You want to avoid unnecessary antibiotics and further workup. Using "Subtle Cases", where would you set the threshold?
Set a high threshold (~70-80%).
For confirmatory testing, you want high precision (PPV)—when the AI says "pneumonia," you want to be confident it's correct before starting treatment.
The tradeoff: Lower recall (sensitivity) means you'll miss some cases. But if you're using this as a confirmatory test, patients who test negative can still be evaluated clinically or with other tests.
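A complementary sketch for the confirmatory setting: pick the lowest cutoff whose precision on validation data meets a floor, so you give up as little recall as possible. Again, the helper and the 85% target are illustrative assumptions.

```python
import numpy as np

def confirmatory_threshold(scores, labels, min_precision=0.85):
    """Lowest cutoff whose precision meets `min_precision` (None if none does)."""
    for t in np.sort(np.unique(scores)):
        flagged = scores >= t
        tp = np.sum(flagged & (labels == 1))
        fp = np.sum(flagged & (labels == 0))
        if tp + fp > 0 and tp / (tp + fp) >= min_precision:
            return float(t)
    return None  # no cutoff reaches the target on this data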
Set the threshold to 50% in "Subtle Cases" and note the precision. Now switch to "Low Prevalence" at the same threshold. What happened to precision, and why?
Precision drops significantly (from ~50% to ~20-30%).
This is the Bayesian insight: even with the same discriminative ability (the same AUC), precision depends on prevalence. When disease is rare, false positives outnumber true positives.
At 50% threshold in "Subtle Cases" (30% prevalence), a positive result might be ~50% likely to be true. In "Low Prevalence" (10% prevalence), that same positive result might only be ~25% likely to be true—because there are so few actual cases to find.
Clinical implication: An AI validated in a high-prevalence ICU population may have poor precision when deployed in a low-prevalence outpatient clinic, even if the underlying discrimination is identical.
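You can verify the prevalence effect directly with Bayes' rule. The operating point below (70% sensitivity and 70% specificity at the 50% cutoff) is an assumption chosen to roughly reproduce the numbers above, not a measured property of the demo datasets.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value (precision) via Bayes' rule."""
    true_pos = sensitivity * prevalence               # P(flagged and sick)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(flagged and healthy)
    return true_pos / (true_pos + false_pos)

print(ppv(0.70, 0.70, 0.30))  # ~0.50 -> "Subtle Cases" (30% prevalence)
print(ppv(0.70, 0.70, 0.10))  # ~0.21 -> "Low Prevalence" (10% prevalence)
```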
When an AI tool makes a prediction, there are four possible outcomes, depending on what the AI says and the patient's actual condition:

| AI says | Reality | Outcome |
|---|---|---|
| Pneumonia detected | Patient has pneumonia | True positive (TP): correct alert — patient receives treatment |
| No pneumonia | Patient has pneumonia | False negative (FN): missed case — delayed treatment |
| Pneumonia detected | Patient is healthy | False positive (FP): false alarm — unnecessary treatment |
| No pneumonia | Patient is healthy | True negative (TN): correct clearance — patient reassured |
These metrics help you evaluate how well the AI performs:
| Metric | Question It Answers | Formula |
|---|---|---|
| Accuracy | Of all cases, how many did the AI classify correctly? | (TP + TN) / Total |
| Recall (Sensitivity, TPR) | Of all patients WITH pneumonia, how many did the AI catch? | TP / (TP + FN) |
| False Positive Rate (FPR) | Of all HEALTHY patients, how many did the AI incorrectly flag? | FP / (FP + TN) |
| Precision (PPV) | If AI says "pneumonia," how likely is it actually pneumonia? | TP / (TP + FP) |
| F1 Score | Overall balance between precision and recall (harmonic mean) | 2·P·R / (P + R) |
| NPV (Negative Predictive Value) | If AI says "no pneumonia," how likely is the patient actually healthy? | TN / (TN + FN) |
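The table translates directly into code. The counts in the example call below are hypothetical, chosen to reproduce the readout quoted earlier (85% accuracy, 83% recall, 14% FPR, 92% NPV).

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """All six metrics from one confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "recall (sensitivity, TPR)": recall,
        "false positive rate": fp / (fp + tn),
        "precision (PPV)": precision,
        "F1": 2 * precision * recall / (precision + recall),
        "NPV": tn / (tn + fn),
    }

# Hypothetical counts consistent with the readout above:
# 85% accuracy, 83% recall, 14% FPR, 92% NPV.
print(diagnostic_metrics(tp=25, fp=10, fn=5, tn=60))
```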
Notice what happens when you move the threshold:
Emergency screening: Use a lower threshold. Missing pneumonia in a sick patient (false negative) is dangerous. Accept more false positives that can be ruled out with further testing.
Confirmatory testing: Use a higher threshold. You want high confidence before initiating aggressive treatment. False positives lead to unnecessary interventions.
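If you want to reproduce this tradeoff outside the visualization, a small synthetic experiment works: sweep the cutoff over scored cases and watch recall fall as precision rises. The data below are randomly generated for illustration and stand in for a real validation set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic validation set: sick patients tend to score higher than healthy ones.
sick = rng.beta(5, 2, size=300)      # scores for patients with pneumonia
healthy = rng.beta(2, 5, size=700)   # scores for healthy patients
scores = np.concatenate([sick, healthy])
labels = np.concatenate([np.ones(300, dtype=int), np.zeros(700, dtype=int)])

for t in (0.2, 0.5, 0.8):
    flagged = scores >= t
    tp = np.sum(flagged & (labels == 1))
    fp = np.sum(flagged & (labels == 0))
    fn = np.sum(~flagged & (labels == 1))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    print(f"threshold={t:.1f}  recall={recall:.2f}  precision={precision:.2f}")
```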