
Explore AI Diagnostic Performance

Vishnu Ravi, MD • Alaa Youssef, PhD • Aydin Zahedivash, MD • Gabriel Tse, MD • Jonathan Chen, MD, PhD

Why This Matters

AI-powered diagnostic tools are increasingly used in clinical settings. As a clinician, you need to understand how to interpret their performance metrics to make informed decisions about when to trust these tools and how to use them appropriately.

This tutorial uses a chest X-ray AI for pneumonia detection as an example. The same principles apply to any diagnostic AI tool.

Each dot represents a patient's chest X-ray. The AI assigns a confidence score (0-100%) for pneumonia. Drag the purple threshold line to set the cutoff. Patients scoring above the threshold (to its right) are flagged as "Pneumonia Detected." Watch how the confusion matrix and metrics change as you move the threshold. Try the three dataset buttons to see how AI performance differs with obvious vs. subtle cases, and high vs. low disease prevalence.

Severe Cases: ICU patients with large lobar pneumonia and high fevers. The AI easily distinguishes these obvious cases from healthy patients. This represents ideal conditions — but an AI validated only on severe cases may struggle with your more subtle outpatients.
[Interactive scatter plot: each dot is a patient's chest X-ray, positioned along the x-axis by AI confidence score (0-100%, the probability of pneumonia). A draggable purple threshold line (default 50%) divides predictions: dots to its left are "AI predicts: No Pneumonia," dots to its right are "AI predicts: Pneumonia." The legend distinguishes healthy patients from patients with pneumonia.]
|                        | AI: Pneumonia      | AI: No Pneumonia  |
|------------------------|--------------------|-------------------|
| Actually Has Pneumonia | True Positive: 25  | False Negative: 5 |
| Actually Healthy       | False Positive: 10 | True Negative: 60 |
[ROC curve: Recall (TPR) vs. False Positive Rate.]

AUC: 0.85. Area Under the Curve: the probability that a randomly chosen patient with pneumonia scores higher than a randomly chosen healthy patient. 1.0 = perfect separation, 0.5 = random guessing.
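To make that definition concrete, here is a minimal Python sketch that estimates AUC directly as the fraction of (pneumonia, healthy) patient pairs the AI ranks correctly. The confidence scores are made-up illustrative values, not the tutorial's dataset.

```python
# Estimate AUC from its definition: the probability that a randomly chosen
# pneumonia case scores higher than a randomly chosen healthy patient.
# These scores are made-up illustrative values, not the tutorial's data.

pneumonia_scores = [0.92, 0.81, 0.77, 0.65, 0.40]  # AI confidence for true cases
healthy_scores = [0.55, 0.35, 0.30, 0.22, 0.10]    # AI confidence for healthy patients

def auc(pos, neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(f"AUC = {auc(pneumonia_scores, healthy_scores):.2f}")  # 0.96 for these scores
```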
Accuracy: 85% = (TP+TN) / Total
The proportion of all predictions that were correct. While intuitive, accuracy can be misleading when disease prevalence is low.

Recall (Sensitivity): 83% = TP / (TP+FN)
The proportion of actual positive cases correctly identified. High recall means fewer missed diagnoses—critical when missing a disease is dangerous.

False Positive Rate: 14% = FP / (FP+TN)
The proportion of healthy patients incorrectly flagged as positive. A high FPR leads to unnecessary follow-up tests, anxiety, and costs.

Precision (PPV): 71% = TP / (TP+FP)
The proportion of positive predictions that were actually correct. Answers: "If the AI flags a patient, how likely are they to actually have the disease?"

NPV: 92% = TN / (TN+FN)
Negative Predictive Value: if the AI says "no pneumonia," how likely is the patient actually healthy? Complements PPV/Precision.

F1 Score: 77% = 2·P·R / (P+R)
The harmonic mean of precision and recall. Useful when you need a single metric that balances catching cases (recall) with avoiding false alarms (precision).

Clinical Interpretation

With this threshold, the AI correctly classifies 85% of all cases (accuracy) and catches 83% of pneumonia cases (recall). However, 14% of healthy patients receive false alarms (false positive rate). When the AI says "no pneumonia," there's a 92% chance the patient is truly healthy (NPV).
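If you want to check these dashboard numbers yourself, a few lines of Python reproduce every metric from the confusion-matrix counts shown above (TP = 25, FN = 5, FP = 10, TN = 60):

```python
# Recompute the dashboard metrics from the confusion matrix shown above.
tp, fn, fp, tn = 25, 5, 10, 60
total = tp + fn + fp + tn

accuracy  = (tp + tn) / total                               # 0.85
recall    = tp / (tp + fn)                                  # 0.833 (sensitivity, TPR)
fpr       = fp / (fp + tn)                                  # 0.143
precision = tp / (tp + fp)                                  # 0.714 (PPV)
npv       = tn / (tn + fn)                                  # 0.923
f1        = 2 * precision * recall / (precision + recall)   # 0.769

for name, value in [("Accuracy", accuracy), ("Recall", recall),
                    ("FPR", fpr), ("Precision", precision),
                    ("NPV", npv), ("F1", f1)]:
    print(f"{name:>9}: {value:.0%}")
```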

Try It Yourself

Use the interactive visualization above to explore these clinical scenarios:

1. Screening

You don't want to send anyone with pneumonia home from the ED. Using "Severe Cases", where would you set the threshold?

Set a low threshold (~20-30%).

For screening, you want high recall/sensitivity—catching as many true pneumonia cases as possible. Missing a case (false negative) could mean sending a sick patient home untreated.

The tradeoff: More false positives, meaning more healthy patients get flagged. But in the ED, it's safer to do additional testing on a false alarm than to miss real pneumonia.
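To see this tradeoff numerically, the sketch below sweeps the threshold across a handful of hypothetical confidence scores (illustrative values, not the tutorial's dataset) and prints recall, FPR, and precision at each cutoff. The same sweep also shows why a high threshold suits the confirmatory scenario that follows.

```python
# Sweep the decision threshold and watch recall, FPR, and precision move.
# The (score, has_pneumonia) pairs are hypothetical, for illustration only.
patients = [(0.95, True), (0.85, True), (0.70, True), (0.45, True), (0.25, True),
            (0.80, False), (0.55, False), (0.35, False), (0.20, False), (0.10, False)]

for threshold in (0.2, 0.5, 0.8):
    tp = sum(1 for s, sick in patients if s >= threshold and sick)
    fp = sum(1 for s, sick in patients if s >= threshold and not sick)
    fn = sum(1 for s, sick in patients if s < threshold and sick)
    tn = sum(1 for s, sick in patients if s < threshold and not sick)
    recall = tp / (tp + fn)        # lower threshold -> higher recall
    fpr = fp / (fp + tn)           # ...but also more false alarms
    precision = tp / (tp + fp)     # higher threshold -> higher precision
    print(f"threshold={threshold:.0%}: recall={recall:.0%}, "
          f"FPR={fpr:.0%}, precision={precision:.0%}")
```

With these made-up scores, dropping the threshold to 20% pushes recall to 100% at the cost of an 80% FPR, while raising it to 80% cuts the FPR to 20% but misses most true cases.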

2. Confirmatory

You want to avoid unnecessary antibiotics and further workup. Using "Subtle Cases", where would you set the threshold?

Set a high threshold (~70-80%).

For confirmatory testing, you want high precision (PPV)—when the AI says "pneumonia," you want to be confident it's correct before starting treatment.

The tradeoff: Lower recall (sensitivity) means you'll miss some cases. But if you're using this as a confirmatory test, patients who test negative can still be evaluated clinically or with other tests.

3. Low Prevalence

Set the threshold to 50% in "Subtle Cases", note the precision. Now switch to "Low Prevalence" at the same threshold. What happened to precision, and why?

Precision drops significantly (from ~50% to ~20-30%).

This is the Bayesian insight: even with the same discriminative performance (similar AUC), precision depends on prevalence. When disease is rare, false positives outnumber true positives.

At 50% threshold in "Subtle Cases" (30% prevalence), a positive result might be ~50% likely to be true. In "Low Prevalence" (10% prevalence), that same positive result might only be ~25% likely to be true—because there are so few actual cases to find.

Clinical implication: An AI validated in a high-prevalence ICU population may have poor precision when deployed in a low-prevalence outpatient clinic, even if the underlying discrimination is identical.
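You can reproduce this prevalence effect with Bayes' rule alone. The sketch below assumes, purely for illustration, a sensitivity and specificity of 70% each (chosen to roughly match the ~50% precision at 30% prevalence described above) and computes PPV as prevalence falls:

```python
# PPV from Bayes' rule: same test characteristics, different prevalence.
# Sensitivity and specificity of 0.70 are illustrative assumptions chosen
# to roughly match the ~50% precision at 30% prevalence described above.
sensitivity = 0.70   # P(AI positive | pneumonia)
specificity = 0.70   # P(AI negative | healthy)

def ppv(prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prevalence in (0.30, 0.10, 0.01):
    print(f"prevalence={prevalence:.0%}: PPV={ppv(prevalence):.0%}")
# prevalence=30%: PPV=50%
# prevalence=10%: PPV=21%
# prevalence=1%:  PPV=2%
```

Notice that sensitivity and specificity stay fixed while PPV collapses: that is exactly the mechanism behind the ICU-to-outpatient warning above.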

The 4 Outcomes: TP, FP, TN, FN Explained

When an AI tool makes a prediction, there are four possible outcomes based on whether the prediction was correct and what the actual condition is:

True Positive (TP)
AI says: Pneumonia detected. Reality: Patient has pneumonia.
Correct alert — patient receives treatment.

False Negative (FN)
AI says: No pneumonia. Reality: Patient has pneumonia.
Missed case — delayed treatment.

False Positive (FP)
AI says: Pneumonia detected. Reality: Patient is healthy.
False alarm — unnecessary treatment.

True Negative (TN)
AI says: No pneumonia. Reality: Patient is healthy.
Correct clearance — patient reassured.
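In code, the four outcomes are simply the cross of prediction and reality; a short sketch makes the mapping explicit:

```python
# The four outcomes are the cross of (AI prediction) x (actual condition).
def outcome(ai_says_pneumonia: bool, has_pneumonia: bool) -> str:
    if ai_says_pneumonia and has_pneumonia:
        return "TP: correct alert"
    if ai_says_pneumonia and not has_pneumonia:
        return "FP: false alarm"
    if not ai_says_pneumonia and has_pneumonia:
        return "FN: missed case"
    return "TN: correct clearance"

print(outcome(ai_says_pneumonia=True, has_pneumonia=False))  # FP: false alarm
```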

Key Metrics: What Each Number Means

These metrics help you evaluate how well the AI performs:

| Metric | Question It Answers | Formula |
|--------|---------------------|---------|
| Accuracy | Of all cases, how many did the AI classify correctly? | (TP + TN) / Total |
| Recall (Sensitivity, TPR) | Of all patients WITH pneumonia, how many did the AI catch? | TP / (TP + FN) |
| False Positive Rate (FPR) | Of all HEALTHY patients, how many did the AI incorrectly flag? | FP / (FP + TN) |
| Precision (PPV) | If AI says "pneumonia," how likely is it actually pneumonia? | TP / (TP + FP) |
| F1 Score | Overall balance between precision and recall (harmonic mean) | 2·P·R / (P + R) |
| NPV (Negative Predictive Value) | If AI says "no pneumonia," how likely is the patient actually healthy? | TN / (TN + FN) |
Threshold Tradeoffs: Why You Can't Have It All

Notice what happens when you move the threshold:

Lower Threshold (←)

  • More cases flagged as positive
  • Higher recall — catch more true pneumonia
  • Higher false positive rate — more false alarms
  • Fewer missed cases, more unnecessary workup

Higher Threshold (→)

  • Fewer cases flagged as positive
  • Lower recall — miss more true pneumonia
  • Lower false positive rate — fewer false alarms
  • More missed cases, less unnecessary workup

Clinical Context Matters

Emergency screening: Use a lower threshold. Missing pneumonia in a sick patient (false negative) is dangerous. Accept more false positives that can be ruled out with further testing.

Confirmatory testing: Use a higher threshold. You want high confidence before initiating aggressive treatment. False positives lead to unnecessary interventions.