Zero-Shot Learning with CLIP for Medical Image Classification
Published on December 28, 2025
Medical image classification traditionally requires large labeled datasets—expensive and time-consuming to create. Zero-shot learning offers an alternative: classifying images without task-specific training data. In this post, I'll share findings from comparing OpenAI's CLIP against supervised models for OCT and retinal image classification.
What is Zero-Shot Learning?
Zero-shot learning enables models to recognize classes they've never explicitly seen during training. CLIP (Contrastive Language-Image Pre-training) achieves this by learning a shared embedding space for images and text descriptions.
import torch
import clip
from PIL import Image
# Load CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# Define class descriptions for OCT images
class_prompts = [
    "an OCT scan showing choroidal neovascularization",
    "an OCT scan showing diabetic macular edema",
    "an OCT scan showing drusen deposits",
    "a normal healthy OCT retinal scan"
]
def classify_oct_image(image_path):
    """Zero-shot classification of a single OCT image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(class_prompts).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Normalize so the dot product is a cosine similarity
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scaled cosine similarities -> per-class probabilities
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    return similarity[0].cpu().numpy()
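As a quick sanity check, the function can be called on a single scan; the file path below is a placeholder, and the label order mirrors class_prompts:

probs = classify_oct_image("path/to/oct_scan.jpeg")  # placeholder path
labels = ["CNV", "DME", "Drusen", "Normal"]          # same order as class_prompts
for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
print("Predicted:", labels[probs.argmax()])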
Experimental Setup
We compared three approaches on the OCT2017 and RFMiD datasets:
- CLIP ViT-B/32: Zero-shot with custom medical prompts
- Vision Transformer (ViT-B/16): Fine-tuned on the training set
- ResNet-50: Fine-tuned baseline (a minimal setup sketch follows this list)
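For context, here is a minimal sketch of how the ResNet-50 baseline could be fine-tuned with torchvision. The actual training configuration (augmentations, schedule, epochs) isn't covered in this post, so the hyperparameters below are placeholders; the model is named resnet to avoid clashing with the CLIP model variable, and device is reused from the CLIP snippet above.

import torch
import torch.nn as nn
from torchvision import models

num_classes = 4  # CNV, DME, Drusen, Normal
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.fc = nn.Linear(resnet.fc.in_features, num_classes)  # replace the ImageNet head
resnet = resnet.to(device)

optimizer = torch.optim.AdamW(resnet.parameters(), lr=1e-4)  # placeholder hyperparameters
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader):
    """One pass over a standard (image, label) DataLoader."""
    resnet.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(resnet(images), labels)
        loss.backward()
        optimizer.step()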
Prompt Engineering for Medical Images
CLIP's performance heavily depends on prompt quality. We tested multiple prompt strategies:
# Basic prompts
basic_prompts = ["CNV", "DME", "Drusen", "Normal"]
# Descriptive prompts
descriptive_prompts = [
    "choroidal neovascularization with fluid accumulation",
    "diabetic macular edema with retinal thickening",
    "drusen deposits in the retinal pigment epithelium",
    "healthy retina with normal foveal contour"
]
# Template-based prompts
template_prompts = [
    f"a retinal OCT scan of {condition}"
    for condition in ["CNV", "DME", "drusen", "healthy tissue"]
]
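One way to compare the strategies is to run the same zero-shot pipeline with each prompt list over a small labeled evaluation split and measure accuracy. Below is a minimal sketch that reuses model, preprocess, and device from the earlier snippet; eval_samples is a hypothetical list of (image_path, class_index) pairs.

def zero_shot_accuracy(prompts, samples):
    """Accuracy of zero-shot classification for a given prompt list."""
    text = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        text_features = model.encode_text(text)
        text_features /= text_features.norm(dim=-1, keepdim=True)
    correct = 0
    for path, label in samples:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        with torch.no_grad():
            image_features = model.encode_image(image)
            image_features /= image_features.norm(dim=-1, keepdim=True)
        pred = (image_features @ text_features.T).argmax(dim=-1).item()
        correct += int(pred == label)
    return correct / len(samples)

# Class order must match across prompt lists: CNV, DME, drusen, normal
for name, prompts in [("basic", basic_prompts),
                      ("descriptive", descriptive_prompts),
                      ("template", template_prompts)]:
    print(f"{name}: {zero_shot_accuracy(prompts, eval_samples):.3f}")  # eval_samples: labeled split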
Results and Analysis
| Model | Accuracy | F1-Score | ROC-AUC | MCC |
|---|---|---|---|---|
| ResNet-50 (Fine-tuned) | 94.2% | 0.941 | 0.987 | 0.922 |
| ViT-B/16 (Fine-tuned) | 96.1% | 0.960 | 0.992 | 0.948 |
| CLIP (Zero-shot, basic) | 67.3% | 0.654 | 0.821 | 0.564 |
| CLIP (Zero-shot, descriptive) | 78.9% | 0.775 | 0.889 | 0.718 |
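For reference, these metrics can be computed with scikit-learn. The sketch below assumes y_true (integer labels), y_pred (predicted labels), and y_prob (per-class probabilities) have already been collected; macro-averaged F1 and one-vs-rest ROC-AUC over the four classes are assumptions.

from sklearn.metrics import (accuracy_score, f1_score,
                             roc_auc_score, matthews_corrcoef)

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="macro")               # macro average over the 4 classes (assumed)
roc_auc = roc_auc_score(y_true, y_prob, multi_class="ovr")   # one-vs-rest, needs class probabilities
mcc = matthews_corrcoef(y_true, y_pred)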
Key Findings
- Domain gap: CLIP's web-trained knowledge doesn't fully transfer to medical imaging
- Prompt sensitivity: Descriptive prompts improved accuracy by 11.6 percentage points (67.3% → 78.9%)
- Few-shot potential: Even 10 labeled examples significantly boost CLIP's performance (see the linear-probe sketch below)
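A simple way to use those few labels is a linear probe on frozen CLIP image features. Here is a minimal sketch with scikit-learn; few_shot_paths, few_shot_labels, and test_paths are placeholders for your own splits.

import numpy as np
from sklearn.linear_model import LogisticRegression

def clip_image_features(paths):
    """Encode images with the frozen CLIP image encoder."""
    feats = []
    for path in paths:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        with torch.no_grad():
            f = model.encode_image(image)
            f /= f.norm(dim=-1, keepdim=True)
        feats.append(f.cpu().numpy()[0])
    return np.stack(feats)

# few_shot_paths / few_shot_labels: a handful of labeled examples (placeholders)
X_train = clip_image_features(few_shot_paths)
probe = LogisticRegression(max_iter=1000).fit(X_train, few_shot_labels)
predictions = probe.predict(clip_image_features(test_paths))  # test_paths: held-out images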
When to Use Zero-Shot vs. Fine-Tuned Models
Use CLIP zero-shot when:
- No labeled data available for new conditions
- Rapid prototyping and feasibility testing
- Classes are well-described by natural language
Use fine-tuned models when:
- Sufficient labeled data exists
- High accuracy is critical (clinical deployment)
- Classes have subtle visual differences
Conclusion
CLIP demonstrates impressive zero-shot generalization but faces domain adaptation challenges in medical imaging. Hybrid approaches—combining CLIP's flexibility with limited fine-tuning—offer a promising direction for scenarios with scarce labeled data.