Docker for Data Scientists: Containerizing ML Workflows
Published on December 5, 2025
"It works on my machine" is the bane of collaborative data science. Docker solves this by packaging your code, dependencies, and environment into portable containers. In this guide, I'll show you how to containerize ML workflows for reproducibility and seamless deployment.
Why Docker for Data Science?
- Reproducibility: Same environment everywhere
- Isolation: No dependency conflicts
- Portability: Run on any machine with Docker
- Scalability: Easy transition to cloud/Kubernetes
Basic Dockerfile for ML Projects
# Use the official slim Python image (CPU-only; see the GPU section below for CUDA)
FROM python:3.10-slim
# Set working directory
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    libopencv-dev \
    && rm -rf /var/lib/apt/lists/*
# Copy requirements first (for layer caching)
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY src/ ./src/
COPY models/ ./models/
# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV MODEL_PATH=/app/models/best_model.pt
# Expose port for API
EXPOSE 8000
# Default command
CMD ["uvicorn", "src.api:app", "--host", "0.0.0.0", "--port", "8000"]
Optimizing Docker Images
Multi-Stage Builds
# Stage 1: Build
FROM python:3.10 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt
# Stage 2: Runtime
FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
COPY src/ ./src/
CMD ["python", "src/main.py"]
Layer Caching Strategy
# ✗ Bad: Invalidates cache on any code change
COPY . .
RUN pip install -r requirements.txt
# ✓ Good: Dependencies cached separately
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY src/ ./src/
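You can verify the caching is working by rebuilding after a code-only change; with BuildKit, untouched steps are reported as CACHED:
# After editing src/, the pip install layer should show as CACHED
docker build --progress=plain -t my-ml-project:v1 .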
Docker Compose for ML Pipelines
# docker-compose.yml
version: '3.8'

services:
  jupyter:
    build:
      context: .
      dockerfile: Dockerfile.jupyter
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/app/notebooks
      - ./data:/app/data
    environment:
      - JUPYTER_TOKEN=mysecrettoken

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.8.0
    ports:
      - "5000:5000"
    volumes:
      - ./mlruns:/mlruns
    command: mlflow server --host 0.0.0.0 --backend-store-uri sqlite:///mlruns/mlflow.db

  api:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models
    depends_on:
      - mlflow
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000

  postgres:
    image: postgres:14
    environment:
      POSTGRES_DB: ml_db
      POSTGRES_USER: mluser
      POSTGRES_PASSWORD: mlpassword
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:
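Start the whole stack with Compose; containers reach each other by service name, which is why the api service can use http://mlflow:5000:
# Start all services in the background, rebuilding as needed
docker compose up -d --build
# Follow logs for one service
docker compose logs -f api
# Tear down (add -v to also delete the postgres volume)
docker compose down
(On older installs, the standalone docker-compose binary takes the same subcommands.)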
GPU Support with NVIDIA Docker
# Dockerfile for GPU training
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
# Install Python
RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Install PyTorch built against CUDA 11.8
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY . .
CMD ["python3", "train.py"]
Run with GPU access (this requires the NVIDIA Container Toolkit installed on the host):
docker run --gpus all -it my-ml-image python train.py
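Before launching a long training job, it's worth a sanity check that the container actually sees the GPU (assuming the PyTorch install above):
# Should print True if the GPU is visible inside the container
docker run --rm --gpus all my-ml-image python3 -c "import torch; print(torch.cuda.is_available())"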
Development vs Production Images
# Dockerfile.dev
FROM python:3.10
WORKDIR /app
COPY requirements.txt requirements-dev.txt ./
RUN pip install -r requirements.txt -r requirements-dev.txt
# Mount code as volume for hot reload
CMD ["uvicorn", "src.api:app", "--reload", "--host", "0.0.0.0"]
# Dockerfile.prod
FROM python:3.10-slim AS builder
# ... build steps ...
FROM python:3.10-slim
# Minimal runtime image; create an unprivileged user since slim has none
RUN useradd --create-home nonroot
COPY --from=builder /app /app
WORKDIR /app
USER nonroot
# Bind to 0.0.0.0 so the published port is reachable from outside the container
CMD ["gunicorn", "src.api:app", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", "-b", "0.0.0.0:8000"]
Common Docker Commands for DS
# Build image
docker build -t my-ml-project:v1 .
# Run interactive container
docker run -it --rm -v $(pwd)/data:/data my-ml-project:v1 bash
# Run Jupyter notebook
docker run -p 8888:8888 -v $(pwd):/home/jovyan/work jupyter/scipy-notebook
# View logs
docker logs -f container_name
# Clean up
docker system prune -a # Remove unused images/containers
Conclusion
Docker transforms data science from "works on my machine" to "works everywhere." Start with simple Dockerfiles, then graduate to Docker Compose for multi-service ML pipelines. Your future self (and collaborators) will thank you.