Chapter 26: Model Serving Frameworks


Framework Comparison

| Framework   | Models                 | Protocol  | GPU | Best For                  |
|-------------|------------------------|-----------|-----|---------------------------|
| FastAPI     | Any (Python)           | REST      | No  | Custom logic, quick start |
| TF Serving  | TensorFlow             | gRPC/REST | Yes | TF models                 |
| TorchServe  | PyTorch                | REST/gRPC | Yes | PyTorch models            |
| Triton      | TF/PT/ONNX/TRT         | gRPC/REST | Yes | GPU, high throughput      |
| Seldon Core | Any (Docker)           | REST/gRPC | Yes | K8s-native, multi-model   |
| KServe      | Any (InferenceService) | gRPC/REST | Yes | K8s standard              |
| Ray Serve   | Any (Python)           | REST      | Yes | Python-first, composable  |
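
FastAPI's "custom logic, quick start" row is easy to see in code. Below is a minimal sketch, assuming a scikit-learn classifier saved with joblib and the five features used in the examples later in this chapter; the path, field names, and response key are illustrative, not from a real project.

# app.py: minimal FastAPI serving sketch (model path and field
# names are illustrative assumptions)
import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("models/churn.joblib")  # hypothetical artifact

class ChurnRequest(BaseModel):
    age: int
    income: float
    tenure_months: int
    monthly_charge: float
    has_contract: int

@app.post("/predict")
def predict(req: ChurnRequest):
    x = np.array([[req.age, req.income, req.tenure_months,
                   req.monthly_charge, req.has_contract]])
    return {"churn_probability": float(model.predict_proba(x)[0, 1])}

# Run with: uvicorn app:app --port 8000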

TF Serving

# Serve a TF SavedModel (mounted as version 1 of the 'churn' model)
docker run -p 8501:8501 \
  -v "$(pwd)/models/churn:/models/churn/1" \
  -e MODEL_NAME=churn \
  tensorflow/serving

# Predict
curl -X POST http://localhost:8501/v1/models/churn:predict \
  -d '{"instances": [[35, 65000, 12, 75.5, 1]]}'

TorchServe

# Package the model into a .mar archive (written to model_store/)
torch-model-archiver \
  --model-name churn \
  --version 1.0 \
  --model-file model.py \
  --serialized-file model.pt \
  --handler handler.py \
  --export-path model_store

# Serve
torchserve --start --model-store model_store --models churn=churn.mar

# Predict
curl http://localhost:8080/predictions/churn \
  -T input.json
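
The --handler flag above points at custom request handling. Here is a minimal sketch of such a handler.py, assuming JSON bodies shaped like the TF Serving example and a model that emits one logit per row; the shapes and key names are assumptions.

# handler.py: minimal custom TorchServe handler sketch
# (request/response shapes here are illustrative assumptions)
import json
import torch
from ts.torch_handler.base_handler import BaseHandler

class ChurnHandler(BaseHandler):
    def preprocess(self, data):
        # TorchServe passes a list of requests; each has "body" or "data"
        body = data[0].get("body") or data[0].get("data")
        if isinstance(body, (bytes, bytearray)):
            body = json.loads(body)
        return torch.tensor(body["instances"], dtype=torch.float32)

    def postprocess(self, output):
        # TorchServe expects one response element per request
        probs = torch.sigmoid(output).squeeze(-1).tolist()
        return [{"churn_probability": probs}]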

NVIDIA Triton (High Performance)

# Model repository structure
models/
  churn_model/
    config.pbtxt
    1/
      model.onnx   (or model.pt, model.savedmodel)
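
The config.pbtxt describes the model's signature to Triton. A minimal sketch for the ONNX case follows; the tensor names, dims, and dtypes are assumptions and must match the exported model.

# config.pbtxt: minimal sketch (tensor names, dims, and dtypes are
# assumptions; they must match the exported model's signature)
name: "churn_model"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 5 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]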

# Serve
docker run --gpus all -p 8000:8000 -p 8001:8001 \
  -v "$(pwd)/models:/models" \
  nvcr.io/nvidia/tritonserver:23.10-py3 \
  tritonserver --model-repository=/models

# Benchmark
perf_analyzer -m churn_model --concurrency-range 1:16
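
From application code, the tritonclient package wraps Triton's HTTP API. A sketch, assuming the server above is running and the input/output names from the config.pbtxt sketch:

# client.py: query Triton over HTTP (tensor names assume the
# config.pbtxt sketch above)
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

x = np.array([[35, 65000, 12, 75.5, 1]], dtype=np.float32)
inp = httpclient.InferInput("input", list(x.shape), "FP32")
inp.set_data_from_numpy(x)

result = client.infer("churn_model", inputs=[inp])
print(result.as_numpy("output"))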

Next → Chapter 27: Vertex AI Endpoints