Chapter 26: Model Serving Frameworks🔗
Framework Comparison🔗
| Framework | Models | Protocol | GPU | Best For |
|-----------|--------|----------|-----|----------|
| FastAPI | Any Python | REST | No | Custom logic, quick start |
| TF Serving | TensorFlow | gRPC/REST | Yes | TF models |
| TorchServe | PyTorch | REST/gRPC | Yes | PyTorch models |
| Triton | TF/PT/ONNX/TRT | gRPC/REST | Yes | GPU, high-throughput |
| Seldon Core | Any (Docker) | REST/gRPC | Yes | K8s-native, multi-model |
| KServe | Any (InferenceService) | gRPC/REST | Yes | K8s standard |
| Ray Serve | Any Python | REST | Yes | Python-first, composable |
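For the FastAPI row, wrapping a model is only a few lines. A minimal sketch, assuming a pickled scikit-learn churn model saved as `model.pkl` (the filename and the five-feature request schema are illustrative, not from this chapter):

```python
# Minimal FastAPI serving sketch (model.pkl and the request
# schema are assumptions for illustration).
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:  # hypothetical pickled sklearn model
    model = pickle.load(f)

class ChurnRequest(BaseModel):
    instances: list[list[float]]  # e.g. [[35, 65000, 12, 75.5, 1]]

@app.post("/predict")
def predict(req: ChurnRequest):
    preds = model.predict(req.instances)
    return {"predictions": preds.tolist()}
```

Run it with `uvicorn main:app --port 8000` and POST to `/predict`.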
TF Serving🔗
```bash
# Serve a TF SavedModel (the local SavedModel dir is mounted as version 1)
docker run -p 8501:8501 \
  -v "$(pwd)/models/churn:/models/churn/1" \
  -e MODEL_NAME=churn \
  tensorflow/serving

# Predict
curl -X POST http://localhost:8501/v1/models/churn:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[35, 65000, 12, 75.5, 1]]}'
```
TorchServe🔗
```bash
# Package the model into a .mar archive (written into model_store,
# which must already exist)
torch-model-archiver \
  --model-name churn \
  --version 1.0 \
  --model-file model.py \
  --serialized-file model.pt \
  --handler handler.py \
  --export-path model_store

# Serve
torchserve --start --model-store model_store --models churn=churn.mar

# Predict
curl http://localhost:8080/predictions/churn -T input.json
```
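The `handler.py` passed to `torch-model-archiver` is not shown above. A minimal sketch built on TorchServe's `BaseHandler`, assuming `input.json` has the same `{"instances": [[...]]}` shape used in the TF Serving example (that shape is an assumption):

```python
# handler.py -- minimal custom handler sketch (illustrative; the
# expected input.json shape is an assumption, not from this chapter).
import json

import torch
from ts.torch_handler.base_handler import BaseHandler

class ChurnHandler(BaseHandler):
    def preprocess(self, data):
        # TorchServe passes a list of requests; each body may arrive
        # as raw bytes or as an already-parsed dict.
        instances = []
        for row in data:
            body = row.get("data") or row.get("body")
            if isinstance(body, (bytes, bytearray)):
                body = json.loads(body)
            instances.extend(body["instances"])
        return torch.tensor(instances, dtype=torch.float32)

    def postprocess(self, inference_output):
        # Must return one entry per request in the batch.
        return inference_output.tolist()
```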
Triton🔗
```
# Model repository structure
models/
  churn_model/
    config.pbtxt
    1/
      model.onnx   # or model.pt / model.savedmodel
```
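`config.pbtxt` declares the model's name, backend, and tensor interface. A sketch for the ONNX case; the tensor names, dims, and batch size are assumptions about how the model was exported:

```
name: "churn_model"
platform: "onnxruntime_onnx"
max_batch_size: 16
input [
  {
    name: "input"        # assumed tensor name in the exported ONNX graph
    data_type: TYPE_FP32
    dims: [ 5 ]
  }
]
output [
  {
    name: "output"       # assumed output tensor name
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
```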
```bash
# Serve (HTTP on 8000, gRPC on 8001)
docker run --gpus all -p 8000:8000 -p 8001:8001 \
  -v $(pwd)/models:/models \
  nvcr.io/nvidia/tritonserver:23.10-py3 \
  tritonserver --model-repository=/models
```
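Requests then go through the KServe v2 inference protocol on port 8000. A sketch with curl; the tensor name and shape must match whatever `config.pbtxt` declares:

```bash
curl -X POST http://localhost:8000/v2/models/churn_model/infer \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [{
      "name": "input",
      "shape": [1, 5],
      "datatype": "FP32",
      "data": [35, 65000, 12, 75.5, 1]
    }]
  }'
```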
```bash
# Benchmark latency/throughput across concurrency levels
# (perf_analyzer ships in the matching tritonserver *-sdk container)
perf_analyzer -m churn_model --concurrency-range 1:16
```
Next → Chapter 27: Vertex AI Endpoints