Chapter 34: LLMOps – Large Language Model Operations
"LLMOps is MLOps applied to Large Language Models β but the scale, iteration patterns, and risks are fundamentally different."
34.1 What is LLMOps?
LLMOps is the specialized set of practices for deploying, monitoring, and managing Large Language Models (LLMs) in production. It extends MLOps with LLM-specific concerns: prompt management, fine-tuning, RAG pipelines, hallucination monitoring, and cost control.
MLOps vs LLMOps

| MLOps | LLMOps |
|---|---|
| Train from scratch | Fine-tune or use a foundation model |
| Structured features | Unstructured text input |
| Deterministic output | Stochastic, creative output |
| Small models (MBs) | Huge models (GBs to TBs) |
| Standard metrics (accuracy) | LLM-specific evals (BLEU, ROUGE, hallucination rate, toxicity) |
| Model drift → retrain | Prompt drift → re-engineer prompts |
| GPU optional | GPU/TPU required |
| Inference: milliseconds | Inference: 100 ms–10 s |
| Cost: low | Cost: very high (tokens per request) |
34.2 LLM Deployment Patterns
LLM DEPLOYMENT OPTIONS

1. API-as-a-Service (easiest, most expensive per token)
   OpenAI GPT-4 / Anthropic Claude / Google Gemini
   → Just call the API. No infra. Pay per token.

2. Managed Models (GCP / AWS / Azure)
   Vertex AI (Gemini, Claude, Llama), AWS Bedrock, Azure OpenAI Service
   → Managed infra. Pay per token or per hour.

3. Self-Hosted Open Source (complex, cheapest at scale)
   LLaMA-3, Mistral, Falcon, Gemma, served on GKE with vLLM or TGI
   → Full control, high setup cost, cheap at scale.

4. Fine-Tuned Model (specialized performance)
   Take an open-source model and fine-tune it on domain data
   → Deploy like option 3.
34.3 Prompt Engineering & Prompt Management
Prompt engineering is the practice of designing inputs to LLMs to get reliable, high-quality outputs.
Prompt Structure
# A well-structured prompt template
SYSTEM_PROMPT = """You are a customer churn analysis assistant for a telecom company.
Given customer data, identify churn risk factors and suggest retention strategies.
Always respond in JSON format with keys: risk_level, factors, recommendations.
Do not include any information not present in the provided data."""
USER_PROMPT_TEMPLATE = """
Customer Profile:
- Age: {age}
- Tenure: {tenure} months
- Monthly Charges: ${monthly_charges}
- Plan: {plan}
- Recent Support Calls: {support_calls}
Analyze churn risk for this customer.
"""
import json

import anthropic

def get_churn_analysis(customer_data: dict) -> dict:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": USER_PROMPT_TEMPLATE.format(**customer_data),
        }],
    )
    return json.loads(response.content[0].text)
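Calling `json.loads` on the raw response throws whenever the model wraps its JSON in prose or a markdown fence despite the system prompt. A defensive parser is a common safeguard; `parse_llm_json` below is a hypothetical helper, not part of the Anthropic SDK:

```python
import json

def parse_llm_json(text: str) -> dict:
    """Pull the first JSON object out of an LLM response.

    Models sometimes wrap JSON in prose or markdown fences, so slicing
    from the first '{' to the last '}' is more robust than calling
    json.loads on the raw response text.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    return json.loads(text[start:end + 1])
```

In production you would pair this with a retry that re-prompts the model when parsing still fails.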
Prompt Versioning with MLflow
import mlflow

with mlflow.start_run(run_name="prompt-v3"):
    # Log the prompt as an artifact
    with open("prompts/churn_analysis_v3.txt", "w") as f:
        f.write(SYSTEM_PROMPT)
    mlflow.log_artifact("prompts/churn_analysis_v3.txt")

    # Log prompt performance metrics
    mlflow.log_metrics({
        "hallucination_rate": 0.02,
        "json_parse_success_rate": 0.98,
        "avg_response_time_s": 1.2,
        "avg_tokens_used": 450,
    })
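Logging prompts is only half the loop: serving code should load a pinned version rather than hard-code the string, so a rollback becomes a config change. A minimal sketch, assuming prompts are stored as `prompts/<name>_v<version>.txt` files (the naming scheme is illustrative):

```python
from pathlib import Path

PROMPT_DIR = Path("prompts")

def load_prompt(name: str, version: int) -> str:
    """Load a pinned prompt version, e.g. prompts/churn_analysis_v3.txt."""
    return (PROMPT_DIR / f"{name}_v{version}.txt").read_text()
```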
34.4 RAG – Retrieval-Augmented Generation
RAG augments LLM responses with retrieved context from your own knowledge base, reducing hallucinations and keeping information current.
RAG Architecture:

User Query
    │
    ▼
Embedding Model                Knowledge Base (Vector DB)
(convert query to vector) ──▶ ┌──────────────────────────┐
    │                         │ Pinecone / ChromaDB      │
    │                         │ Weaviate / Vertex AI     │
    │ ◀── Top-K similar docs  │ Matching Engine          │
    ▼                         └──────────────────────────┘
Retrieved Context + Query
    │
    ▼
LLM (Claude / GPT / Gemini)
    │
    ▼
Grounded Response
# pip install langchain chromadb anthropic
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import VertexAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import VertexAI

# ── Step 1: Load and chunk documents ──────────────────────────
loader = PyPDFLoader("docs/product_manual.pdf")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# ── Step 2: Embed and store in vector DB ──────────────────────
embeddings = VertexAIEmbeddings(model_name="textembedding-gecko@003")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# ── Step 3: Create RAG chain ──────────────────────────────────
llm = VertexAI(model_name="gemini-1.5-pro", temperature=0.1)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
)

# ── Step 4: Query ─────────────────────────────────────────────
result = qa_chain("How do I cancel my subscription?")
print(result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata['source']} (page {doc.metadata['page']})")
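Under the hood, the retriever is doing nearest-neighbor search over embedding vectors. A toy sketch of the core operation, using 3-dimensional vectors for readability (real embedding models emit hundreds of dimensions; `top_k_indices` is an illustrative name, not a library function):

```python
import numpy as np

def top_k_indices(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k documents most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                    # cosine similarity per document
    return np.argsort(-sims)[:k]    # highest similarity first

# Toy 3-d "embeddings"
docs = np.array([[1.0, 0.0, 0.0],   # doc 0
                 [0.9, 0.1, 0.0],   # doc 1 (similar to doc 0)
                 [0.0, 0.0, 1.0]])  # doc 2 (unrelated)
query = np.array([1.0, 0.05, 0.0])
print(top_k_indices(query, docs, k=2))  # docs 0 and 1 rank above doc 2
```

Vector databases such as Chroma or Matching Engine add approximate-nearest-neighbor indexes on top of this so the search stays fast at millions of documents.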
34.5 Fine-Tuning
Fine-tuning adapts a pre-trained LLM to a specific domain or task using your own data.
BASE MODEL (Llama-3)
    │
    │  + Domain data (e.g., medical records)
    │  + Task examples (e.g., diagnosis from symptoms)
    │
    ▼  Fine-tuning (LoRA / QLoRA / Full)
FINE-TUNED MODEL
    │
    ▼
Deployed & Served
Fine-Tuning Types
| Type | Description | Cost | When |
|---|---|---|---|
| Full fine-tuning | All model weights updated | Very high | Large, specialized task |
| LoRA | Low-rank adapters; few extra trainable params | Low | Common choice, good balance |
| QLoRA | LoRA on a 4-bit quantized base model | Very low | Limited GPU memory |
| PEFT | Umbrella term for parameter-efficient methods (LoRA, prefix tuning, ...) | Low | General purpose |
| Prompt tuning | Only soft prompt tokens tuned | Minimal | Very few examples |
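The cost column for LoRA follows from simple arithmetic: instead of updating a full d×k weight matrix, LoRA trains two low-rank factors B (d×r) and A (r×k). A back-of-envelope sketch with illustrative dimensions:

```python
# Trainable parameters for one projection layer:
# full fine-tuning vs. LoRA (d, k, r are illustrative values).
d, k, r = 4096, 4096, 16

full_params = d * k            # update the whole weight matrix W
lora_params = d * r + r * k    # low-rank factors B (d x r) and A (r x k)

print(f"full:  {full_params:,}")
print(f"lora:  {lora_params:,}")
print(f"ratio: {lora_params / full_params:.4f}")  # under 1% of the full update
```

The same ratio holds per adapted layer, which is why LoRA checkpoints are megabytes while the base model stays frozen.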
Fine-Tuning with Vertex AI
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Fine-tune an open model (Gemma) on Vertex AI via a custom training job
sft_job = aiplatform.CustomJob(
    display_name="gemma-fine-tune-churn-qa",
    worker_pool_specs=[{
        "machine_spec": {
            "machine_type": "a2-highgpu-1g",  # 1x NVIDIA A100
            "accelerator_type": "NVIDIA_TESLA_A100",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-0:latest",
            "command": ["python", "finetune.py"],
            "args": [
                "--base-model=google/gemma-7b",
                "--train-data=gs://bucket/finetune_data/train.jsonl",
                "--output-dir=gs://bucket/fine-tuned-models/",
                "--method=lora",
                "--lora-r=16",
                "--epochs=3",
            ],
        },
    }],
)
sft_job.run(sync=True)
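The job above expects `train.jsonl` in Cloud Storage. The exact schema is whatever your `finetune.py` parses; a common convention (assumed here, not mandated by Vertex AI) is one prompt/completion pair per line:

```python
import json

# Hypothetical schema: one {"prompt", "completion"} pair per line.
examples = [
    {
        "prompt": "Customer: tenure 2 months, 5 support calls, month-to-month plan. Churn risk?",
        "completion": '{"risk_level": "high", "factors": ["short tenure", "frequent support calls"]}',
    },
    {
        "prompt": "Customer: tenure 48 months, 0 support calls, 2-year contract. Churn risk?",
        "completion": '{"risk_level": "low", "factors": ["long tenure", "contract commitment"]}',
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Upload the file with `gsutil cp train.jsonl gs://bucket/finetune_data/` before launching the job.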
34.6 LLM Evaluation & Monitoring
Key LLM Metrics
OUTPUT QUALITY:
  Hallucination Rate    – % of responses with factual errors
  Relevance Score       – Does the response address the question?
  Faithfulness          – Does the response stick to the provided context (RAG)?
  BLEU / ROUGE          – Text similarity to reference answers
  Toxicity Score        – % of harmful/offensive outputs

OPERATIONAL:
  Latency               – Time to first token, time to full response
  Throughput            – Tokens per second
  Token Usage           – Input + output tokens per request (the cost driver)
  Error Rate            – Failed API calls, timeouts

BUSINESS:
  Task Completion Rate  – Did the user achieve their goal?
  User Satisfaction     – CSAT or thumbs up/down
  Escalation Rate       – % of queries needing human fallback
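Several of these metrics can be computed directly from logged responses. For example, the `json_parse_success_rate` logged in the MLflow example earlier is just the fraction of outputs that parse as JSON; a minimal sketch:

```python
import json

def json_parse_success_rate(responses: list) -> float:
    """Fraction of responses that are valid JSON (a cheap output-quality signal)."""
    ok = 0
    for r in responses:
        try:
            json.loads(r)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(responses) if responses else 0.0
```

Computed over a sliding window of production traffic, a drop in this rate is an early warning of prompt drift.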
LLM Evaluation with RAGAS
# pip install ragas
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["How do I cancel?", "What plans do you offer?"],
    "answer": ["You can cancel by calling 1-800-...", "We offer Basic, Standard, Premium"],
    "contexts": [
        ["To cancel, call our helpline at 1-800-XXX..."],
        ["Plans available: Basic $29/mo, Standard $49/mo, Premium $79/mo"],
    ],
    "ground_truth": ["Call 1-800-XXX to cancel.", "Basic, Standard, Premium"],
}
dataset = Dataset.from_dict(eval_data)

# Evaluate
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)
# faithfulness: 0.92       → responses are grounded in context
# answer_relevancy: 0.88
# context_precision: 0.85
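Scores like these earn their keep as a release gate: block a deploy whenever a metric drops below its floor. A sketch (the function name and thresholds are illustrative):

```python
def eval_gate(scores: dict, thresholds: dict) -> None:
    """Raise if any evaluated metric falls below its minimum threshold."""
    failures = {m: s for m, s in scores.items()
                if m in thresholds and s < thresholds[m]}
    if failures:
        raise RuntimeError(f"Eval gate failed: {failures}")

eval_gate(
    scores={"faithfulness": 0.92, "answer_relevancy": 0.88},
    thresholds={"faithfulness": 0.90, "answer_relevancy": 0.80},
)  # passes silently; a failing metric would abort the pipeline
```

Wired into CI/CD, this turns prompt and model changes into gated releases, the same way test suites gate application code.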
34.7 LLMOps Pipeline on GCP
LLMOPS ON GCP

DEVELOPMENT               DEPLOYMENT                MONITORING
┌──────────────┐          ┌──────────────┐          ┌──────────────────────┐
│ Vertex AI    │          │ Cloud Run    │          │ Cloud Monitoring     │
│ Workbench    │ ───────▶ │ (serve LLM   │ ───────▶ │ (token usage,        │
│              │          │  via API)    │          │  latency, errors)    │
│ Prompt       │          │              │          │                      │
│ engineering, │          │ Vertex AI    │          │ Vertex AI Model      │
│ evaluation   │          │ Endpoints    │          │ Monitoring           │
└──────────────┘          │ (managed)    │          │ (response quality)   │
                          └──────────────┘          └──────────────────────┘

VECTOR STORE               RAG PIPELINE
┌───────────────────┐      ┌──────────────────────────────────┐
│ Vertex AI         │      │ Cloud Run (RAG API)              │
│ Matching Engine   │ ◀─── │ Embed → Retrieve → Generate      │
│ (ANN search)      │      └──────────────────────────────────┘
└───────────────────┘
34.8 Cost Optimization for LLMs
COST DRIVERS:
  Input tokens  × price per 1K tokens
  Output tokens × price per 1K tokens (usually priced higher)
  Number of requests per day
  Compute (if self-hosting on GPUs)

OPTIMIZATION STRATEGIES:
  1. Prompt compression – shorter prompts mean fewer tokens
  2. Caching            – cache identical requests (or semantically similar ones)
  3. Smaller models     – use GPT-3.5 where GPT-4 is overkill
  4. Batching           – group asynchronous requests
  5. Streaming          – faster perceived latency
  6. Fine-tuning        – a small fine-tuned model can beat a large general one
  7. Quantization       – 4-bit models are ~4× smaller than FP16 at near-equal quality
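Strategy 2 in its simplest form is an exact-match cache in front of the API client. A sketch; a true semantic cache would key on embedding similarity rather than a hash, and `cached_complete` is a hypothetical wrapper, not a library API:

```python
import hashlib

_cache: dict = {}

def cached_complete(model: str, prompt: str, call_fn) -> str:
    """Exact-match response cache keyed on (model, prompt).

    call_fn(model, prompt) is whatever actually hits the LLM API.
    Identical requests pay for tokens only once.
    """
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_fn(model, prompt)
    return _cache[key]
```

Exact matching only helps for repeated queries (FAQs, canned workflows); semantic caching extends the hit rate to paraphrased requests at the cost of an embedding lookup.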
TYPICAL COSTS (2024):
  GPT-4o:         $5/1M input tokens, $15/1M output tokens
  Claude Sonnet:  $3/1M input, $15/1M output
  Gemini 1.5 Pro: $3.50/1M input, $10.50/1M output
  LLaMA-3 70B:    ~$0.60/1M tokens (self-hosted on A100s)
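These prices plug into straightforward arithmetic for budgeting. A sketch using the 2024 figures above (prices change frequently; treat them as placeholders):

```python
# USD per 1M tokens, (input, output) -- 2024 figures, subject to change
PRICES = {
    "gpt-4o": (5.00, 15.00),
    "claude-sonnet": (3.00, 15.00),
    "gemini-1.5-pro": (3.50, 10.50),
}

def monthly_cost(model: str, in_tokens: int, out_tokens: int, requests_per_day: int) -> float:
    """Estimated 30-day API bill for a fixed per-request token profile."""
    price_in, price_out = PRICES[model]
    per_request = (in_tokens * price_in + out_tokens * price_out) / 1_000_000
    return per_request * requests_per_day * 30

# e.g. 1,000 input + 500 output tokens per request, 10k requests/day
print(f"${monthly_cost('gpt-4o', 1000, 500, 10_000):,.0f}/month")
```

Running the same profile against each model in `PRICES` is a quick way to see whether a cheaper model or shorter prompts move the bill more.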
Next → Chapter 35: Edge ML