Zero-Shot Learning with CLIP for Medical Image Classification

Published on December 28, 2025

Medical image classification traditionally requires large labeled datasets—expensive and time-consuming to create. Zero-shot learning offers an alternative: classifying images without task-specific training data. In this post, I'll share findings from comparing OpenAI's CLIP against supervised models for OCT and retinal image classification.

What is Zero-Shot Learning?

Zero-shot learning enables models to recognize classes they've never explicitly seen during training. CLIP (Contrastive Language-Image Pre-training) achieves this by learning a shared embedding space for images and text descriptions.

import torch
import clip
from PIL import Image

# Load CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Define class descriptions for OCT images
class_prompts = [
    "an OCT scan showing choroidal neovascularization",
    "an OCT scan showing diabetic macular edema", 
    "an OCT scan showing drusen deposits",
    "a normal healthy OCT retinal scan"
]

def classify_oct_image(image_path):
    """Zero-shot classification of OCT images."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(class_prompts).to(device)
    
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        
        # Compute similarity
        similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    
    return similarity[0].cpu().numpy()

Experimental Setup

We compared three approaches on the OCT2017 and RFMiD datasets:

Prompt Engineering for Medical Images

CLIP's performance heavily depends on prompt quality. We tested multiple prompt strategies:

# Basic prompts
basic_prompts = ["CNV", "DME", "Drusen", "Normal"]

# Descriptive prompts  
descriptive_prompts = [
    "choroidal neovascularization with fluid accumulation",
    "diabetic macular edema with retinal thickening",
    "drusen deposits in the retinal pigment epithelium",
    "healthy retina with normal foveal contour"
]

# Template-based prompts
template_prompts = [
    f"a retinal OCT scan of {condition}" 
    for condition in ["CNV", "DME", "drusen", "healthy tissue"]
]

Results and Analysis

Model Accuracy F1-Score ROC-AUC MCC
ResNet-50 (Fine-tuned) 94.2% 0.941 0.987 0.922
ViT-B/16 (Fine-tuned) 96.1% 0.960 0.992 0.948
CLIP (Zero-shot, basic) 67.3% 0.654 0.821 0.564
CLIP (Zero-shot, descriptive) 78.9% 0.775 0.889 0.718

Key Findings

  1. Domain gap: CLIP's web-trained knowledge doesn't fully transfer to medical imaging
  2. Prompt sensitivity: Descriptive prompts improved accuracy by 11.6%
  3. Few-shot potential: Even 10 labeled examples significantly boost CLIP's performance

When to Use Zero-Shot vs. Fine-Tuned Models

Use CLIP zero-shot when:

Use fine-tuned models when:

Conclusion

CLIP demonstrates impressive zero-shot generalization but faces domain adaptation challenges in medical imaging. Hybrid approaches—combining CLIP's flexibility with limited fine-tuning—offer a promising direction for scenarios with scarce labeled data.