Docker for Data Scientists: Containerizing ML Workflows

Published on December 5, 2025

"It works on my machine" is the bane of collaborative data science. Docker solves this by packaging your code, dependencies, and environment into portable containers. In this guide, I'll show you how to containerize ML workflows for reproducibility and seamless deployment.

Why Docker for Data Science?

Data science projects accumulate fragile dependency stacks: specific Python versions, CUDA drivers, system libraries, and dozens of pinned packages. Docker captures all of that in an image, so the same container runs identically on your laptop, a teammate's machine, a CI runner, or a production server. It also isolates projects from one another, which ends the conflicts between package versions across experiments.

Basic Dockerfile for ML Projects

# Use the official slim Python image (CPU-only; see the GPU section below for CUDA)
FROM python:3.10-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    libopencv-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first (for layer caching)
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ ./src/
COPY models/ ./models/

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV MODEL_PATH=/app/models/best_model.pt

# Expose port for API
EXPOSE 8000

# Default command
CMD ["uvicorn", "src.api:app", "--host", "0.0.0.0", "--port", "8000"]

Optimizing Docker Images

Multi-Stage Builds

# Stage 1: Build
FROM python:3.10 AS builder

WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Stage 2: Runtime
FROM python:3.10-slim

WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*

COPY src/ ./src/
CMD ["python", "src/main.py"]

Layer Caching Strategy

# ✗ Bad: Invalidates cache on any code change
COPY . .
RUN pip install -r requirements.txt

# ✓ Good: Dependencies cached separately
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY src/ ./src/
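
With the dependency layer separated, editing code no longer re-runs pip install. You can see this in the build output, assuming the same tag as above:

# Edit something under src/, then rebuild
docker build -t my-ml-project:v2 .
# The requirements.txt COPY and pip install steps report CACHED;
# only the layers from COPY src/ onward are rebuilt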

Docker Compose for ML Pipelines

# docker-compose.yml
version: '3.8'

services:
  jupyter:
    build:
      context: .
      dockerfile: Dockerfile.jupyter
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/app/notebooks
      - ./data:/app/data
    environment:
      - JUPYTER_TOKEN=mysecrettoken
    
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.8.0
    ports:
      - "5000:5000"
    volumes:
      - ./mlruns:/mlruns
    command: mlflow server --host 0.0.0.0 --backend-store-uri sqlite:///mlruns/mlflow.db
  
  api:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models
    depends_on:
      - mlflow
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000

  postgres:
    image: postgres:14
    environment:
      POSTGRES_DB: ml_db
      POSTGRES_USER: mluser
      POSTGRES_PASSWORD: mlpassword
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:
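
Bringing the whole stack up is one command. A typical workflow, assuming the file above is saved as docker-compose.yml in the project root:

docker compose up -d --build   # build images and start all services in the background
docker compose logs -f api     # follow the API service logs
docker compose down            # stop everything; add -v to also remove the postgres volume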

GPU Support with NVIDIA Docker

# Dockerfile for GPU training
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# Install Python
RUN apt-get update && apt-get install -y python3 python3-pip

# Install PyTorch with CUDA
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

COPY requirements.txt .
RUN pip3 install -r requirements.txt

COPY . /app
WORKDIR /app

CMD ["python3", "train.py"]

Run with GPU:

docker run --gpus all -it my-ml-image python3 train.py
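
Before launching a long training run, confirm the container actually sees the GPU. A quick check, assuming the image is tagged my-ml-image as above:

docker run --rm --gpus all my-ml-image python3 -c "import torch; print(torch.cuda.is_available())"
# Should print True; if it prints False, check that the NVIDIA Container Toolkit is installed on the host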

Development vs Production Images

# Dockerfile.dev
FROM python:3.10

WORKDIR /app
COPY requirements.txt requirements-dev.txt ./
RUN pip install -r requirements.txt -r requirements-dev.txt

# Mount code as volume for hot reload
CMD ["uvicorn", "src.api:app", "--reload", "--host", "0.0.0.0"]
# Dockerfile.prod
FROM python:3.10-slim AS builder
# ... build steps ...

FROM python:3.10-slim
# Minimal runtime image
COPY --from=builder /app /app

# Run as an unprivileged user (python:3.10-slim has no nonroot user by default)
RUN useradd --create-home nonroot
USER nonroot

CMD ["gunicorn", "src.api:app", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", "-b", "0.0.0.0:8000"]

Common Docker Commands for DS

# Build image
docker build -t my-ml-project:v1 .

# Run interactive container
docker run -it --rm -v $(pwd)/data:/data my-ml-project bash

# Run Jupyter notebook
docker run -p 8888:8888 -v $(pwd):/home/jovyan/work jupyter/scipy-notebook

# View logs
docker logs -f container_name

# Clean up
docker system prune -a  # Remove unused images/containers

Conclusion

Docker transforms data science from "works on my machine" to "works everywhere." Start with simple Dockerfiles, then graduate to Docker Compose for multi-service ML pipelines. Your future self (and collaborators) will thank you.