
Explore AI Diagnostic Performance

Vishnu Ravi, MD • Alaa Youssef, PhD • Aydin Zahedivash, MD • Gabriel Tse, MD • Jonathan Chen, MD, PhD

Why This Matters

AI-powered diagnostic tools are increasingly used in clinical settings. As a clinician, you need to understand how to interpret their performance metrics to make informed decisions about when to trust these tools and how to use them appropriately.

This tutorial uses a chest X-ray AI for pneumonia detection as an example. The same principles apply to any diagnostic AI tool.

Each dot represents a patient's chest X-ray. The AI assigns a confidence score (0-100%) for pneumonia. Drag the purple threshold line to set the cutoff. Patients scoring above the threshold (to its right) are flagged as "Pneumonia Detected." Watch how the confusion matrix and metrics change as you move the threshold. Try the three dataset buttons to see how AI performance differs with obvious vs. subtle cases, and high vs. low disease prevalence.

Severe Cases: ICU patients with large lobar pneumonia and high fevers. The AI easily distinguishes these obvious cases from healthy patients. This represents ideal conditions — but an AI validated only on severe cases may struggle with your more subtle outpatients.
[Interactive scatter plot: each dot is a patient's chest X-ray, positioned along the x-axis by AI confidence score (0-100%, the probability of pneumonia). A draggable purple threshold line (default 50%) divides predictions: dots to its left are "AI predicts: No Pneumonia," dots to its right are "AI predicts: Pneumonia." The legend distinguishes healthy patients from patients with pneumonia.]
|                        | AI: Pneumonia      | AI: No Pneumonia  |
|------------------------|--------------------|-------------------|
| Actually Has Pneumonia | True Positive: 25  | False Negative: 5 |
| Actually Healthy       | False Positive: 10 | True Negative: 60 |
[ROC curve: Recall (TPR) vs. False Positive Rate.]

AUC: 0.85. Area Under the Curve: the probability that a randomly chosen patient with pneumonia scores higher than a randomly chosen healthy patient. 1.0 = perfect separation, 0.5 = random guessing.
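To make that definition concrete, here is a minimal Python sketch that estimates AUC directly as the fraction of (pneumonia, healthy) patient pairs the AI ranks correctly. The confidence scores are made-up illustrative values, not the tutorial's dataset.

```python
# Estimate AUC from its definition: the probability that a randomly chosen
# pneumonia case scores higher than a randomly chosen healthy patient.
# These scores are made-up illustrative values, not the tutorial's data.

pneumonia_scores = [0.92, 0.81, 0.77, 0.65, 0.40]  # AI confidence for true cases
healthy_scores = [0.55, 0.35, 0.30, 0.22, 0.10]    # AI confidence for healthy patients

def auc(pos, neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(f"AUC = {auc(pneumonia_scores, healthy_scores):.2f}")  # 0.96 for these scores
```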
Accuracy: 85% = (TP+TN) / Total
The proportion of all predictions that were correct. While intuitive, accuracy can be misleading when disease prevalence is low.

Recall (Sensitivity): 83% = TP / (TP+FN)
The proportion of actual positive cases correctly identified. High recall means fewer missed diagnoses—critical when missing a disease is dangerous.

False Positive Rate: 14% = FP / (FP+TN)
The proportion of healthy patients incorrectly flagged as positive. A high FPR leads to unnecessary follow-up tests, anxiety, and costs.

Precision (PPV): 71% = TP / (TP+FP)
The proportion of positive predictions that were actually correct. Answers: "If the AI flags a patient, how likely are they to actually have the disease?"

NPV: 92% = TN / (TN+FN)
Negative Predictive Value: if the AI says "no pneumonia," how likely is the patient actually healthy? Complements PPV/Precision.

F1 Score: 77% = 2·P·R / (P+R)
The harmonic mean of precision and recall. Useful when you need a single metric that balances catching cases (recall) with avoiding false alarms (precision).

Clinical Interpretation

With this threshold, the AI correctly classifies 85% of all cases (accuracy) and catches 83% of pneumonia cases (recall). However, 14% of healthy patients receive false alarms (false positive rate). When the AI says "no pneumonia," there's a 92% chance the patient is truly healthy (NPV).
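If you want to check these dashboard numbers yourself, a few lines of Python reproduce every metric from the confusion-matrix counts shown above (TP = 25, FN = 5, FP = 10, TN = 60):

```python
# Recompute the dashboard metrics from the confusion matrix shown above.
tp, fn, fp, tn = 25, 5, 10, 60
total = tp + fn + fp + tn

accuracy  = (tp + tn) / total                               # 0.85
recall    = tp / (tp + fn)                                  # 0.833 (sensitivity, TPR)
fpr       = fp / (fp + tn)                                  # 0.143
precision = tp / (tp + fp)                                  # 0.714 (PPV)
npv       = tn / (tn + fn)                                  # 0.923
f1        = 2 * precision * recall / (precision + recall)   # 0.769

for name, value in [("Accuracy", accuracy), ("Recall", recall),
                    ("FPR", fpr), ("Precision", precision),
                    ("NPV", npv), ("F1", f1)]:
    print(f"{name:>9}: {value:.0%}")
```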

Try It Yourself

Use the interactive visualization above to explore these clinical scenarios:

1. Screening

You don't want to send anyone with pneumonia home from the ED. Using "Severe Cases", where would you set the threshold?

Set a low threshold (~20-30%).

For screening, you want high recall/sensitivity—catching as many true pneumonia cases as possible. Missing a case (false negative) could mean sending a sick patient home untreated.

The tradeoff: More false positives, meaning more healthy patients get flagged. But in the ED, it's safer to do additional testing on a false alarm than to miss real pneumonia.
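To see this tradeoff numerically, the sketch below sweeps the threshold across a handful of hypothetical confidence scores (illustrative values, not the tutorial's dataset) and prints recall, FPR, and precision at each cutoff. The same sweep also shows why a high threshold suits the confirmatory scenario that follows.

```python
# Sweep the decision threshold and watch recall, FPR, and precision move.
# The (score, has_pneumonia) pairs are hypothetical, for illustration only.
patients = [(0.95, True), (0.85, True), (0.70, True), (0.45, True), (0.25, True),
            (0.80, False), (0.55, False), (0.35, False), (0.20, False), (0.10, False)]

for threshold in (0.2, 0.5, 0.8):
    tp = sum(1 for s, sick in patients if s >= threshold and sick)
    fp = sum(1 for s, sick in patients if s >= threshold and not sick)
    fn = sum(1 for s, sick in patients if s < threshold and sick)
    tn = sum(1 for s, sick in patients if s < threshold and not sick)
    recall = tp / (tp + fn)        # lower threshold -> higher recall
    fpr = fp / (fp + tn)           # ...but also more false alarms
    precision = tp / (tp + fp)     # higher threshold -> higher precision
    print(f"threshold={threshold:.0%}: recall={recall:.0%}, "
          f"FPR={fpr:.0%}, precision={precision:.0%}")
```

With these made-up scores, dropping the threshold to 20% pushes recall to 100% at the cost of an 80% FPR, while raising it to 80% cuts the FPR to 20% but misses most true cases.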

2. Confirmatory

You want to avoid unnecessary antibiotics and further workup. Using "Subtle Cases", where would you set the threshold?

Set a high threshold (~70-80%).

For confirmatory testing, you want high precision (PPV)—when the AI says "pneumonia," you want to be confident it's correct before starting treatment.

The tradeoff: Lower recall (sensitivity) means you'll miss some cases. But if you're using this as a confirmatory test, patients who test negative can still be evaluated clinically or with other tests.

3. Low Prevalence

Set the threshold to 50% in "Subtle Cases", note the precision. Now switch to "Low Prevalence" at the same threshold. What happened to precision, and why?

Precision drops significantly (from ~50% to ~20-30%).

This is the Bayesian insight: even with the same discriminative performance (similar AUC), precision depends on prevalence. When disease is rare, false positives outnumber true positives.

At 50% threshold in "Subtle Cases" (30% prevalence), a positive result might be ~50% likely to be true. In "Low Prevalence" (10% prevalence), that same positive result might only be ~25% likely to be true—because there are so few actual cases to find.

Clinical implication: An AI validated in a high-prevalence ICU population may have poor precision when deployed in a low-prevalence outpatient clinic, even if the underlying discrimination is identical.
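You can reproduce this prevalence effect with Bayes' rule alone. The sketch below assumes, purely for illustration, a sensitivity and specificity of 70% each (chosen to roughly match the ~50% precision at 30% prevalence described above) and computes PPV as prevalence falls:

```python
# PPV from Bayes' rule: same test characteristics, different prevalence.
# Sensitivity and specificity of 0.70 are illustrative assumptions chosen
# to roughly match the ~50% precision at 30% prevalence described above.
sensitivity = 0.70   # P(AI positive | pneumonia)
specificity = 0.70   # P(AI negative | healthy)

def ppv(prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prevalence in (0.30, 0.10, 0.01):
    print(f"prevalence={prevalence:.0%}: PPV={ppv(prevalence):.0%}")
# prevalence=30%: PPV=50%
# prevalence=10%: PPV=21%
# prevalence=1%:  PPV=2%
```

Notice that sensitivity and specificity stay fixed while PPV collapses: that is exactly the mechanism behind the ICU-to-outpatient warning above.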

The 4 Outcomes: TP, FP, TN, FN Explained

When an AI tool makes a prediction, there are four possible outcomes based on whether the prediction was correct and what the actual condition is:

True Positive (TP)
AI says: Pneumonia detected. Reality: Patient has pneumonia.
Correct alert — patient receives treatment.

False Negative (FN)
AI says: No pneumonia. Reality: Patient has pneumonia.
Missed case — delayed treatment.

False Positive (FP)
AI says: Pneumonia detected. Reality: Patient is healthy.
False alarm — unnecessary treatment.

True Negative (TN)
AI says: No pneumonia. Reality: Patient is healthy.
Correct clearance — patient reassured.
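In code, the four outcomes are simply the cross of prediction and reality; a short sketch makes the mapping explicit:

```python
# The four outcomes are the cross of (AI prediction) x (actual condition).
def outcome(ai_says_pneumonia: bool, has_pneumonia: bool) -> str:
    if ai_says_pneumonia and has_pneumonia:
        return "TP: correct alert"
    if ai_says_pneumonia and not has_pneumonia:
        return "FP: false alarm"
    if not ai_says_pneumonia and has_pneumonia:
        return "FN: missed case"
    return "TN: correct clearance"

print(outcome(ai_says_pneumonia=True, has_pneumonia=False))  # FP: false alarm
```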

Key Metrics: What Each Number Means

These metrics help you evaluate how well the AI performs:

| Metric | Question It Answers | Formula |
|--------|---------------------|---------|
| Accuracy | Of all cases, how many did the AI classify correctly? | (TP + TN) / Total |
| Recall (Sensitivity, TPR) | Of all patients WITH pneumonia, how many did the AI catch? | TP / (TP + FN) |
| False Positive Rate (FPR) | Of all HEALTHY patients, how many did the AI incorrectly flag? | FP / (FP + TN) |
| Precision (PPV) | If AI says "pneumonia," how likely is it actually pneumonia? | TP / (TP + FP) |
| F1 Score | Overall balance between precision and recall (harmonic mean) | 2·P·R / (P + R) |
| NPV (Negative Predictive Value) | If AI says "no pneumonia," how likely is the patient actually healthy? | TN / (TN + FN) |
Threshold Tradeoffs: Why You Can't Have It All

Notice what happens when you move the threshold:

Lower Threshold (←)

  • More cases flagged as positive
  • Higher recall — catch more true pneumonia
  • Higher false positive rate — more false alarms
  • Fewer missed cases, more unnecessary workup

Higher Threshold (→)

  • Fewer cases flagged as positive
  • Lower recall — miss more true pneumonia
  • Lower false positive rate — fewer false alarms
  • More missed cases, less unnecessary workup

Clinical Context Matters

Emergency screening: Use a lower threshold. Missing pneumonia in a sick patient (false negative) is dangerous. Accept more false positives that can be ruled out with further testing.

Confirmatory testing: Use a higher threshold. You want high confidence before initiating aggressive treatment. False positives lead to unnecessary interventions.