Zero-Shot Learning with CLIP for Medical Image Classification
Published on December 28, 2025
Medical image classification traditionally requires large labeled datasets—expensive and time-consuming to create. Zero-shot learning offers an alternative: classifying images without task-specific training data. In this post, I'll share findings from comparing OpenAI's CLIP against supervised models for OCT and retinal image classification.
What is Zero-Shot Learning?
Zero-shot learning enables models to recognize classes they've never explicitly seen during training. CLIP (Contrastive Language-Image Pre-training) achieves this by learning a shared embedding space for images and text descriptions.
import torch
import clip
from PIL import Image
# Load CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# Define class descriptions for OCT images
class_prompts = [
    "an OCT scan showing choroidal neovascularization",
    "an OCT scan showing diabetic macular edema",
    "an OCT scan showing drusen deposits",
    "a normal healthy OCT retinal scan"
]
def classify_oct_image(image_path):
    """Zero-shot classification of a single OCT image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(class_prompts).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Normalize so the dot product is a cosine similarity
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scaled cosine similarities -> per-class probabilities
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    return similarity[0].cpu().numpy()
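As a quick sanity check, the function can be called on a single scan; the file path below is a placeholder, and the label order mirrors class_prompts:

probs = classify_oct_image("path/to/oct_scan.jpeg")  # placeholder path
labels = ["CNV", "DME", "Drusen", "Normal"]          # same order as class_prompts
for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
print("Predicted:", labels[probs.argmax()])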
Experimental Setup
We compared three approaches on the OCT2017 and RFMiD datasets:
- CLIP ViT-B/32: Zero-shot with custom medical prompts
- Vision Transformer (ViT-B/16): Fine-tuned on the training set
- ResNet-50: Fine-tuned baseline (a minimal setup sketch follows this list)
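For context, here is a minimal sketch of how the ResNet-50 baseline could be fine-tuned with torchvision. The actual training configuration (augmentations, schedule, epochs) isn't covered in this post, so the hyperparameters below are placeholders; the model is named resnet to avoid clashing with the CLIP model variable, and device is reused from the CLIP snippet above.

import torch
import torch.nn as nn
from torchvision import models

num_classes = 4  # CNV, DME, Drusen, Normal
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.fc = nn.Linear(resnet.fc.in_features, num_classes)  # replace the ImageNet head
resnet = resnet.to(device)

optimizer = torch.optim.AdamW(resnet.parameters(), lr=1e-4)  # placeholder hyperparameters
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader):
    """One pass over a standard (image, label) DataLoader."""
    resnet.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(resnet(images), labels)
        loss.backward()
        optimizer.step()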
Prompt Engineering for Medical Images
CLIP's performance heavily depends on prompt quality. We tested multiple prompt strategies:
# Basic prompts
basic_prompts = ["CNV", "DME", "Drusen", "Normal"]
# Descriptive prompts
descriptive_prompts = [
    "choroidal neovascularization with fluid accumulation",
    "diabetic macular edema with retinal thickening",
    "drusen deposits in the retinal pigment epithelium",
    "healthy retina with normal foveal contour"
]
# Template-based prompts
template_prompts = [
    f"a retinal OCT scan of {condition}"
    for condition in ["CNV", "DME", "drusen", "healthy tissue"]
]
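One way to compare the strategies is to run the same zero-shot pipeline with each prompt list over a small labeled evaluation split and measure accuracy. Below is a minimal sketch that reuses model, preprocess, and device from the earlier snippet; eval_samples is a hypothetical list of (image_path, class_index) pairs.

def zero_shot_accuracy(prompts, samples):
    """Accuracy of zero-shot classification for a given prompt list."""
    text = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        text_features = model.encode_text(text)
        text_features /= text_features.norm(dim=-1, keepdim=True)
    correct = 0
    for path, label in samples:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        with torch.no_grad():
            image_features = model.encode_image(image)
            image_features /= image_features.norm(dim=-1, keepdim=True)
        pred = (image_features @ text_features.T).argmax(dim=-1).item()
        correct += int(pred == label)
    return correct / len(samples)

# Class order must match across prompt lists: CNV, DME, drusen, normal
for name, prompts in [("basic", basic_prompts),
                      ("descriptive", descriptive_prompts),
                      ("template", template_prompts)]:
    print(f"{name}: {zero_shot_accuracy(prompts, eval_samples):.3f}")  # eval_samples: labeled split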
Results and Analysis
| Model | Accuracy | F1-Score | ROC-AUC | MCC |
|---|---|---|---|---|
| ResNet-50 (Fine-tuned) | 94.2% | 0.941 | 0.987 | 0.922 |
| ViT-B/16 (Fine-tuned) | 96.1% | 0.960 | 0.992 | 0.948 |
| CLIP (Zero-shot, basic) | 67.3% | 0.654 | 0.821 | 0.564 |
| CLIP (Zero-shot, descriptive) | 78.9% | 0.775 | 0.889 | 0.718 |
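For reference, these metrics can be computed with scikit-learn. The sketch below assumes y_true (integer labels), y_pred (predicted labels), and y_prob (per-class probabilities) have already been collected; macro-averaged F1 and one-vs-rest ROC-AUC over the four classes are assumptions.

from sklearn.metrics import (accuracy_score, f1_score,
                             roc_auc_score, matthews_corrcoef)

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="macro")               # macro average over the 4 classes (assumed)
roc_auc = roc_auc_score(y_true, y_prob, multi_class="ovr")   # one-vs-rest, needs class probabilities
mcc = matthews_corrcoef(y_true, y_pred)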
Key Findings
- Domain gap: CLIP's web-trained knowledge doesn't fully transfer to medical imaging
- Prompt sensitivity: Descriptive prompts improved accuracy by 11.6 percentage points (67.3% → 78.9%)
- Few-shot potential: Even 10 labeled examples significantly boost CLIP's performance (see the linear-probe sketch below)
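A simple way to use those few labels is a linear probe on frozen CLIP image features. Here is a minimal sketch with scikit-learn; few_shot_paths, few_shot_labels, and test_paths are placeholders for your own splits.

import numpy as np
from sklearn.linear_model import LogisticRegression

def clip_image_features(paths):
    """Encode images with the frozen CLIP image encoder."""
    feats = []
    for path in paths:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        with torch.no_grad():
            f = model.encode_image(image)
            f /= f.norm(dim=-1, keepdim=True)
        feats.append(f.cpu().numpy()[0])
    return np.stack(feats)

# few_shot_paths / few_shot_labels: a handful of labeled examples (placeholders)
X_train = clip_image_features(few_shot_paths)
probe = LogisticRegression(max_iter=1000).fit(X_train, few_shot_labels)
predictions = probe.predict(clip_image_features(test_paths))  # test_paths: held-out images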
When to Use Zero-Shot vs. Fine-Tuned Models
Use CLIP zero-shot when:
- No labeled data available for new conditions
- Rapid prototyping and feasibility testing
- Classes are well-described by natural language
Use fine-tuned models when:
- Sufficient labeled data exists
- High accuracy is critical (clinical deployment)
- Classes have subtle visual differences
Conclusion
CLIP demonstrates impressive zero-shot generalization but faces domain adaptation challenges in medical imaging. Hybrid approaches—combining CLIP's flexibility with limited fine-tuning—offer a promising direction for scenarios with scarce labeled data.