Chapter 34: LLMOps β€” Large Language Model Operations

"LLMOps is MLOps applied to Large Language Models β€” but the scale, iteration patterns, and risks are fundamentally different."


34.1 What is LLMOps?

LLMOps is the specialized set of practices for deploying, monitoring, and managing Large Language Models (LLMs) in production. It extends MLOps with LLM-specific concerns: prompt management, fine-tuning, RAG pipelines, hallucination monitoring, and cost control.

MLOps vs LLMOps

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                            MLOPS vs LLMOPS                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚            MLOps               β”‚             LLMOps                   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Train from scratch             β”‚ Fine-tune or use foundation model    β”‚
β”‚ Structured features            β”‚ Unstructured text input              β”‚
β”‚ Deterministic output           β”‚ Stochastic, creative output          β”‚
β”‚ Small models (MBs)             β”‚ Huge models (GBs to TBs)             β”‚
β”‚ Standard metrics (accuracy)    β”‚ LLM-specific evals (BLEU, ROUGE,     β”‚
β”‚                                β”‚ hallucination rate, toxicity)        β”‚
β”‚ Model drift β†’ retrain          β”‚ Prompt drift β†’ re-engineer prompts   β”‚
β”‚ GPU optional                   β”‚ GPU/TPU required                     β”‚
β”‚ Inference: ms                  β”‚ Inference: 100ms–10s                 β”‚
β”‚ Cost: low                      β”‚ Cost: very high (tokens/request)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

34.2 LLM Deployment Patterns

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  LLM DEPLOYMENT OPTIONS                        β”‚
β”‚                                                                β”‚
β”‚  1. API-as-a-Service (easiest, most expensive per token)       β”‚
β”‚     OpenAI GPT-4 / Anthropic Claude / Google Gemini           β”‚
β”‚     β†’ Just call the API. No infra. Pay per token.             β”‚
β”‚                                                                β”‚
β”‚  2. Managed Models (GCP / AWS / Azure)                         β”‚
β”‚     Vertex AI (Gemini, Claude, Llama)                          β”‚
β”‚     AWS Bedrock / Azure OpenAI Service                         β”‚
β”‚     β†’ Managed infra. Pay per token or per hour.               β”‚
β”‚                                                                β”‚
β”‚  3. Self-Hosted Open Source (complex, cheapest at scale)       β”‚
β”‚     LLaMA-3, Mistral, Falcon, Gemma                            β”‚
β”‚     Run on GKE + vLLM or TGI serving                          β”‚
β”‚     β†’ Full control, high setup cost, cheap at scale           β”‚
β”‚                                                                β”‚
β”‚  4. Fine-Tuned Model (specialized performance)                 β”‚
β”‚     Take open-source model β†’ fine-tune on domain data          β”‚
β”‚     Deploy like option 3                                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
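
The "cheapest at scale" trade-off in option 3 comes down to break-even arithmetic: API cost grows linearly with token volume, while a self-hosted GPU is a fixed cost. A minimal sketch (all prices here are illustrative assumptions, not vendor quotes):

```python
# Rough break-even sketch: API pay-per-token vs. self-hosted GPU.
# All numbers below are illustrative assumptions, not vendor quotes.

def api_monthly_cost(tokens_per_month: float, price_per_1m: float) -> float:
    """Pay-per-token API cost: scales linearly with volume."""
    return tokens_per_month / 1_000_000 * price_per_1m

def self_hosted_monthly_cost(gpu_hourly: float, gpus: int = 1) -> float:
    """Fixed GPU cost for 24/7 serving, independent of token volume."""
    return gpu_hourly * gpus * 24 * 30

# Assume a $10/1M-token blended API price vs. one ~$3/hr GPU.
volume = 2_000_000_000  # 2B tokens/month
print(api_monthly_cost(volume, 10.0))   # 20000.0
print(self_hosted_monthly_cost(3.0))    # 2160.0
```

At low volume the fixed GPU cost dominates and the API wins; past the crossover point, self-hosting wins (ignoring the engineering cost of running option 3, which is real).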

34.3 Prompt Engineering & Prompt Management

Prompt engineering is the practice of designing inputs to LLMs to get reliable, high-quality outputs.

Prompt Structure

# A well-structured prompt template
SYSTEM_PROMPT = """You are a customer churn analysis assistant for a telecom company.
Given customer data, identify churn risk factors and suggest retention strategies.
Always respond in JSON format with keys: risk_level, factors, recommendations.
Do not include any information not present in the provided data."""

USER_PROMPT_TEMPLATE = """
Customer Profile:
- Age: {age}
- Tenure: {tenure} months
- Monthly Charges: ${monthly_charges}
- Plan: {plan}
- Recent Support Calls: {support_calls}

Analyze churn risk for this customer.
"""

import anthropic
import json


def get_churn_analysis(customer_data: dict) -> dict:
    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": USER_PROMPT_TEMPLATE.format(**customer_data)
        }]
    )

    return json.loads(response.content[0].text)
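
Because LLM output is stochastic, the `json.loads` call above will occasionally fail even with a strict system prompt. One hedged sketch of a defensive wrapper (here `call_model` is a stand-in for any function that maps a prompt string to a response string, such as a thin wrapper around the client above):

```python
import json

def parse_with_retry(call_model, prompt: str, max_attempts: int = 3) -> dict:
    """Call the model, retrying when the response is not valid JSON.

    `call_model` is any callable mapping a prompt string to a response
    string -- an assumption for illustration, not a library API.
    """
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            # Nudge the model toward valid output on the next attempt.
            prompt = prompt + "\nRespond with valid JSON only."
    raise ValueError(f"No valid JSON after {max_attempts} attempts: {last_error}")
```

Parse failures caught here are also what feeds a metric like `json_parse_success_rate` when you track prompt quality over time.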

Prompt Versioning with MLflow

import mlflow

with mlflow.start_run(run_name="prompt-v3"):
    # Log the prompt as an artifact
    with open("prompts/churn_analysis_v3.txt", "w") as f:
        f.write(SYSTEM_PROMPT)
    mlflow.log_artifact("prompts/churn_analysis_v3.txt")

    # Log prompt performance metrics (measured on an offline eval set)
    mlflow.log_metrics({
        "hallucination_rate": 0.02,
        "json_parse_success_rate": 0.98,
        "avg_response_time_s": 1.2,
        "avg_tokens_used": 450,
    })

34.4 RAG β€” Retrieval-Augmented Generation

RAG augments LLM responses with retrieved context from your own knowledge base, reducing hallucinations and keeping information current.

RAG Architecture:

User Query
    β”‚
    β–Ό
Embedding Model                  Knowledge Base (Vector DB)
(convert query to vector)   ──▢  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                 β”‚  Pinecone / ChromaDB   β”‚
    β”‚                            β”‚  Weaviate / Vertex AI  β”‚
    β”‚ ← Top-K similar docs       β”‚  Matching Engine       β”‚
    β–Ό                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Retrieved Context + Query
    β”‚
    β–Ό
LLM (Claude / GPT / Gemini)
    β”‚
    β–Ό
Grounded Response

# pip install langchain chromadb pypdf google-cloud-aiplatform

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import VertexAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import VertexAI

# ── Step 1: Load and chunk documents ──────────────────────────
loader = PyPDFLoader("docs/product_manual.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# ── Step 2: Embed and store in vector DB ──────────────────────
embeddings = VertexAIEmbeddings(model_name="textembedding-gecko@003")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# ── Step 3: Create RAG chain ─────────────────────────────────
llm = VertexAI(model_name="gemini-1.5-pro", temperature=0.1)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
)

# ── Step 4: Query ─────────────────────────────────────────────
result = qa_chain("How do I cancel my subscription?")
print(result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata['source']} (page {doc.metadata['page']})")
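
Under the hood, the retriever's "Top-K similar docs" step is a nearest-neighbour search over embedding vectors. A minimal NumPy sketch of exact cosine-similarity retrieval, using made-up 4-dimensional vectors in place of real model embeddings:

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k documents most similar to the query."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k]

# Toy 4-d embeddings standing in for real embedding-model output.
docs = np.array([
    [1.0, 0.0, 0.0, 0.0],   # doc 0
    [0.0, 1.0, 0.0, 0.0],   # doc 1
    [0.9, 0.1, 0.0, 0.0],   # doc 2 -- close to doc 0
])
query = np.array([1.0, 0.05, 0.0, 0.0])
print(top_k_cosine(query, docs))  # [0 2]
```

Production vector stores (Pinecone, Chroma, Vertex AI Matching Engine in the diagram above) replace this exact scan with approximate nearest-neighbour (ANN) indexes so retrieval stays fast over millions of chunks.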

34.5 Fine-Tuning

Fine-tuning adapts a pre-trained LLM to a specific domain or task using your own data.

BASE MODEL (Llama-3)
    β”‚
    β”‚ + Domain data (e.g., medical records)
    β”‚ + Task examples (e.g., diagnosis from symptoms)
    β”‚
    β–Ό Fine-tuning (LoRA / QLoRA / Full)
    β”‚
FINE-TUNED MODEL
    β”‚
    β–Ό
Deployed & Served

Fine-Tuning Types

Type              Description                               Cost       When
────────────────  ────────────────────────────────────────  ─────────  ───────────────────────────
Full fine-tuning  All model weights updated                 Very high  Large specialized task
LoRA              Low-rank adaptation β€” few extra params    Low        Common choice, good balance
QLoRA             Quantized LoRA β€” 4-bit base model         Very low   Limited GPU memory
PEFT              Parameter-Efficient Fine-Tuning           Low        General purpose
Prompt tuning     Only tune soft prompt tokens              Minimal    Very few examples
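
The cost ranking in the table follows directly from parameter counts. For a single d x d weight matrix, LoRA trains two low-rank factors (d x r and r x d) instead of all d squared weights; a back-of-envelope sketch:

```python
def lora_trainable_fraction(d: int, r: int) -> float:
    """Fraction of a d x d layer's parameters that LoRA actually trains."""
    full = d * d            # full fine-tuning updates every weight
    lora = 2 * d * r        # LoRA adds only A (d x r) and B (r x d)
    return lora / full

# A 4096-wide projection layer with lora-r=16 (the rank used in the
# Vertex AI job below) trains well under 1% of the layer's weights.
print(f"{lora_trainable_fraction(4096, 16):.2%}")  # 0.78%
```

This is why LoRA fits on a single GPU where full fine-tuning of the same model would not: the optimizer state and gradients only exist for the small factors.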

Fine-Tuning with Vertex AI

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Fine-tune an open model (Gemma) via a Vertex AI custom training job
sft_job = aiplatform.CustomJob(
    display_name="gemini-fine-tune-churn-qa",
    worker_pool_specs=[{
        "machine_spec": {
            "machine_type": "a2-highgpu-1g",   # A100 GPU
            "accelerator_type": "NVIDIA_TESLA_A100",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-0:latest",
            "command": ["python", "finetune.py"],
            "args": [
                "--base-model=google/gemma-7b",
                "--train-data=gs://bucket/finetune_data/train.jsonl",
                "--output-dir=gs://bucket/fine-tuned-models/",
                "--method=lora",
                "--lora-r=16",
                "--epochs=3",
            ],
        },
    }],
)
sft_job.run(sync=True)

34.6 LLM Evaluation & Monitoring

Key LLM Metrics

OUTPUT QUALITY:
  Hallucination Rate    β†’ % of responses with factual errors
  Relevance Score       β†’ Does response address the question?
  Faithfulness          β†’ Does response stick to provided context (RAG)?
  BLEU / ROUGE          β†’ Text similarity to reference answers
  Toxicity Score        β†’ % of harmful/offensive outputs

OPERATIONAL:
  Latency               β†’ Time to first token, time to full response
  Throughput            β†’ Tokens/second
  Token Usage           β†’ Input + output tokens per request (cost driver)
  Error Rate            β†’ Failed API calls, timeouts

BUSINESS:
  Task Completion Rate  β†’ Did user achieve their goal?
  User Satisfaction     β†’ CSAT or thumbs up/down
  Escalation Rate       β†’ % of queries needing human fallback
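
The operational and business metrics above are typically aggregated from per-request logs. A minimal sketch over a hypothetical log schema (the field names are assumptions for illustration, not a standard):

```python
def summarize_llm_logs(logs: list[dict]) -> dict:
    """Aggregate per-request log records into the metrics listed above."""
    n = len(logs)
    return {
        "error_rate": sum(r["error"] for r in logs) / n,
        "escalation_rate": sum(r["escalated"] for r in logs) / n,
        "avg_latency_s": sum(r["latency_s"] for r in logs) / n,
        # Token usage is the main cost driver, so track it per request.
        "avg_tokens": sum(r["input_tokens"] + r["output_tokens"] for r in logs) / n,
    }

logs = [
    {"error": False, "escalated": False, "latency_s": 1.2,
     "input_tokens": 300, "output_tokens": 150},
    {"error": True,  "escalated": True,  "latency_s": 4.0,
     "input_tokens": 500, "output_tokens": 0},
]
print(summarize_llm_logs(logs))
```

In a real deployment these aggregates would be emitted to a dashboard (e.g. Cloud Monitoring, covered below) rather than computed ad hoc.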

LLM Evaluation with RAGAS

# pip install ragas

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["How do I cancel?", "What plans do you offer?"],
    "answer": ["You can cancel by calling 1-800-...", "We offer Basic, Standard, Premium"],
    "contexts": [
        ["To cancel, call our helpline at 1-800-XXX..."],
        ["Plans available: Basic $29/mo, Standard $49/mo, Premium $79/mo"],
    ],
    "ground_truth": ["Call 1-800-XXX to cancel.", "Basic, Standard, Premium"],
}

dataset = Dataset.from_dict(eval_data)

# Evaluate
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)

print(results)
# faithfulness: 0.92  ← responses are grounded in context
# answer_relevancy: 0.88
# context_precision: 0.85

34.7 LLMOps Pipeline on GCP

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    LLMOPS ON GCP                                 β”‚
β”‚                                                                  β”‚
β”‚  DEVELOPMENT         DEPLOYMENT         MONITORING               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚ Vertex AI    β”‚   β”‚ Cloud Run    β”‚   β”‚ Cloud Monitoring     β”‚ β”‚
β”‚  β”‚ Workbench    │──▢│ (serve LLM   │──▢│ (token usage,        β”‚ β”‚
β”‚  β”‚              β”‚   β”‚  via API)    β”‚   β”‚  latency, errors)    β”‚ β”‚
β”‚  β”‚ Prompt       β”‚   β”‚              β”‚   β”‚                      β”‚ β”‚
β”‚  β”‚ engineering  β”‚   β”‚ Vertex AI    β”‚   β”‚ Vertex AI Model      β”‚ β”‚
β”‚  β”‚ Evaluation   β”‚   β”‚ Endpoints    β”‚   β”‚ Monitoring           β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚ (managed)    β”‚   β”‚ (response quality)   β”‚ β”‚
β”‚                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                                                  β”‚
β”‚  VECTOR STORE              RAG PIPELINE                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Vertex AI       β”‚      β”‚  Cloud Run (RAG API)             β”‚  β”‚
β”‚  β”‚ Matching Engine │◀─────│  Embed β†’ Retrieve β†’ Generate     β”‚  β”‚
β”‚  β”‚ (ANN search)    β”‚      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

34.8 Cost Optimization for LLMs

COST DRIVERS:
  Input tokens  Γ— price/1K tokens
  Output tokens Γ— price/1K tokens (usually more expensive)
  # requests/day
  Compute (if self-hosted GPU)

OPTIMIZATION STRATEGIES:
  1. Prompt compression    β†’ Shorter prompts = fewer tokens
  2. Caching              β†’ Cache identical requests (semantic cache)
  3. Smaller models       β†’ Use GPT-3.5 where GPT-4 is overkill
  4. Batching             β†’ Group async requests
  5. Streaming            β†’ Faster perceived latency
  6. Fine-tuning          β†’ Smaller fine-tuned model > large general model
  7. Quantization         β†’ 4-bit models, ~4x smaller than FP16, ~same quality
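
Strategy 2 in its simplest form is an exact-match cache keyed on a hash of the full prompt (a semantic cache would instead match on embedding similarity). A minimal sketch, where `call_model` stands in for any real client call:

```python
import hashlib

_cache: dict[str, str] = {}

def cached_generate(prompt: str, call_model) -> str:
    """Return a cached response for an identical prompt, else call the model.

    Exact-match caching only helps when prompts repeat verbatim --
    a common case for FAQ-style traffic.
    """
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)   # pay for tokens only once
    return _cache[key]

# Demo with a fake model that records how often it is actually called.
calls = []
def fake_model(p):
    calls.append(p)
    return "answer"

cached_generate("How do I cancel?", fake_model)
cached_generate("How do I cancel?", fake_model)
print(len(calls))  # 1 -- the second request was served from cache
```

Every cache hit saves the full input plus output token cost of that request, which is why caching sits near the top of most LLM cost-reduction checklists.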

TYPICAL COSTS (2024):
  GPT-4o:          $5/1M input tokens, $15/1M output tokens
  Claude Sonnet:   $3/1M input, $15/1M output
  Gemini 1.5 Pro:  $3.5/1M input, $10.5/1M output
  LLaMA-3 70B:     ~$0.6/1M (self-hosted A100)
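
Given a price table like the one above, per-request cost is simple arithmetic over the cost drivers listed earlier. A sketch using the GPT-4o prices from the table:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_1m: float, out_price_per_1m: float) -> float:
    """Dollar cost of one request at given per-million-token prices."""
    return (input_tokens * in_price_per_1m
            + output_tokens * out_price_per_1m) / 1_000_000

# 2,000 input + 500 output tokens at the GPT-4o prices listed above.
cost = request_cost(2_000, 500, in_price_per_1m=5.0, out_price_per_1m=15.0)
print(f"${cost:.4f}")   # $0.0175
# At 100k requests/day that is $1,750/day -- why prompt compression
# and output-length limits matter at scale.
```

Note that output tokens cost 3x more here than input tokens, so capping `max_tokens` and asking for terse formats (like the JSON schema in 34.3) directly reduces the bill.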

Next β†’ Chapter 35: Edge ML