Chapter 34: LLMOps β€” Large Language Model Operations

"LLMOps is MLOps applied to Large Language Models β€” but the scale, iteration patterns, and risks are fundamentally different."


34.1 What is LLMOps?

LLMOps is the specialized set of practices for deploying, monitoring, and managing Large Language Models (LLMs) in production. It extends MLOps with LLM-specific concerns: prompt management, fine-tuning, RAG pipelines, hallucination monitoring, and cost control.

MLOps vs LLMOps

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                            MLOPS vs LLMOPS                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚            MLOps               β”‚             LLMOps                   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Train from scratch             β”‚ Fine-tune or use foundation model    β”‚
β”‚ Structured features            β”‚ Unstructured text input              β”‚
β”‚ Deterministic output           β”‚ Stochastic, creative output          β”‚
β”‚ Small models (MBs)             β”‚ Huge models (GBs to TBs)             β”‚
β”‚ Standard metrics (accuracy)    β”‚ LLM-specific evals (BLEU, ROUGE,     β”‚
β”‚                                β”‚ hallucination rate, toxicity)        β”‚
β”‚ Model drift β†’ retrain          β”‚ Prompt drift β†’ re-engineer prompts   β”‚
β”‚ GPU optional                   β”‚ GPU/TPU required                     β”‚
β”‚ Inference: ms                  β”‚ Inference: 100ms–10s                 β”‚
β”‚ Cost: low                      β”‚ Cost: very high (tokens/request)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

34.2 LLM Deployment Patterns

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  LLM DEPLOYMENT OPTIONS                        β”‚
β”‚                                                                β”‚
β”‚  1. API-as-a-Service (easiest, most expensive per token)       β”‚
β”‚     OpenAI GPT-4 / Anthropic Claude / Google Gemini           β”‚
β”‚     β†’ Just call the API. No infra. Pay per token.             β”‚
β”‚                                                                β”‚
β”‚  2. Managed Models (GCP / AWS / Azure)                         β”‚
β”‚     Vertex AI (Gemini, Claude, Llama)                          β”‚
β”‚     AWS Bedrock / Azure OpenAI Service                         β”‚
β”‚     β†’ Managed infra. Pay per token or per hour.               β”‚
β”‚                                                                β”‚
β”‚  3. Self-Hosted Open Source (complex, cheapest at scale)       β”‚
β”‚     LLaMA-3, Mistral, Falcon, Gemma                            β”‚
β”‚     Run on GKE + vLLM or TGI serving                          β”‚
β”‚     β†’ Full control, high setup cost, cheap at scale           β”‚
β”‚                                                                β”‚
β”‚  4. Fine-Tuned Model (specialized performance)                 β”‚
β”‚     Take open-source model β†’ fine-tune on domain data          β”‚
β”‚     Deploy like option 3                                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
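
The "cheapest at scale" trade-off in option 3 comes down to break-even arithmetic: API cost grows linearly with token volume, while a self-hosted GPU is a fixed cost. A minimal sketch (all prices here are illustrative assumptions, not vendor quotes):

```python
# Rough break-even sketch: API pay-per-token vs. self-hosted GPU.
# All numbers below are illustrative assumptions, not vendor quotes.

def api_monthly_cost(tokens_per_month: float, price_per_1m: float) -> float:
    """Pay-per-token API cost: scales linearly with volume."""
    return tokens_per_month / 1_000_000 * price_per_1m

def self_hosted_monthly_cost(gpu_hourly: float, gpus: int = 1) -> float:
    """Fixed GPU cost for 24/7 serving, independent of token volume."""
    return gpu_hourly * gpus * 24 * 30

# Assume a $10/1M-token blended API price vs. one ~$3/hr GPU.
volume = 2_000_000_000  # 2B tokens/month
print(api_monthly_cost(volume, 10.0))   # 20000.0
print(self_hosted_monthly_cost(3.0))    # 2160.0
```

At low volume the fixed GPU cost dominates and the API wins; past the crossover point, self-hosting wins (ignoring the engineering cost of running option 3, which is real).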

34.3 Prompt Engineering & Prompt Management

Prompt engineering is the practice of designing inputs to LLMs to get reliable, high-quality outputs.

Prompt Structure

# A well-structured prompt template
SYSTEM_PROMPT = """You are a customer churn analysis assistant for a telecom company.
Given customer data, identify churn risk factors and suggest retention strategies.
Always respond in JSON format with keys: risk_level, factors, recommendations.
Do not include any information not present in the provided data."""

USER_PROMPT_TEMPLATE = """
Customer Profile:
- Age: {age}
- Tenure: {tenure} months
- Monthly Charges: ${monthly_charges}
- Plan: {plan}
- Recent Support Calls: {support_calls}

Analyze churn risk for this customer.
"""

import anthropic
import json


def get_churn_analysis(customer_data: dict) -> dict:
    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": USER_PROMPT_TEMPLATE.format(**customer_data)
        }]
    )

    return json.loads(response.content[0].text)
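
Because LLM output is stochastic, the `json.loads` call above will occasionally fail even with a strict system prompt. One hedged sketch of a defensive wrapper (here `call_model` is a stand-in for any function that maps a prompt string to a response string, such as a thin wrapper around the client above):

```python
import json

def parse_with_retry(call_model, prompt: str, max_attempts: int = 3) -> dict:
    """Call the model, retrying when the response is not valid JSON.

    `call_model` is any callable mapping a prompt string to a response
    string -- an assumption for illustration, not a library API.
    """
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            # Nudge the model toward valid output on the next attempt.
            prompt = prompt + "\nRespond with valid JSON only."
    raise ValueError(f"No valid JSON after {max_attempts} attempts: {last_error}")
```

Parse failures caught here are also what feeds a metric like `json_parse_success_rate` when you track prompt quality over time.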

Prompt Versioning with MLflow

import mlflow

with mlflow.start_run(run_name="prompt-v3"):
    # Log the prompt as an artifact
    with open("prompts/churn_analysis_v3.txt", "w") as f:
        f.write(SYSTEM_PROMPT)
    mlflow.log_artifact("prompts/churn_analysis_v3.txt")

    # Log prompt performance metrics (measured on an offline eval set)
    mlflow.log_metrics({
        "hallucination_rate": 0.02,
        "json_parse_success_rate": 0.98,
        "avg_response_time_s": 1.2,
        "avg_tokens_used": 450,
    })

34.4 RAG β€” Retrieval-Augmented Generation

RAG augments LLM responses with retrieved context from your own knowledge base, reducing hallucinations and keeping information current.

RAG Architecture:

User Query
    β”‚
    β–Ό
Embedding Model                  Knowledge Base (Vector DB)
(convert query to vector)   ──▢  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                 β”‚  Pinecone / ChromaDB   β”‚
    β”‚                            β”‚  Weaviate / Vertex AI  β”‚
    β”‚ ← Top-K similar docs       β”‚  Matching Engine       β”‚
    β–Ό                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Retrieved Context + Query
    β”‚
    β–Ό
LLM (Claude / GPT / Gemini)
    β”‚
    β–Ό
Grounded Response

# pip install langchain chromadb pypdf google-cloud-aiplatform

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import VertexAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import VertexAI

# ── Step 1: Load and chunk documents ──────────────────────────
loader = PyPDFLoader("docs/product_manual.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# ── Step 2: Embed and store in vector DB ──────────────────────
embeddings = VertexAIEmbeddings(model_name="textembedding-gecko@003")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# ── Step 3: Create RAG chain ─────────────────────────────────
llm = VertexAI(model_name="gemini-1.5-pro", temperature=0.1)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
)

# ── Step 4: Query ─────────────────────────────────────────────
result = qa_chain("How do I cancel my subscription?")
print(result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata['source']} (page {doc.metadata['page']})")
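
Under the hood, the retriever's "Top-K similar docs" step is a nearest-neighbour search over embedding vectors. A minimal NumPy sketch of exact cosine-similarity retrieval, using made-up 4-dimensional vectors in place of real model embeddings:

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k documents most similar to the query."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k]

# Toy 4-d embeddings standing in for real embedding-model output.
docs = np.array([
    [1.0, 0.0, 0.0, 0.0],   # doc 0
    [0.0, 1.0, 0.0, 0.0],   # doc 1
    [0.9, 0.1, 0.0, 0.0],   # doc 2 -- close to doc 0
])
query = np.array([1.0, 0.05, 0.0, 0.0])
print(top_k_cosine(query, docs))  # [0 2]
```

Production vector stores (Pinecone, Chroma, Vertex AI Matching Engine in the diagram above) replace this exact scan with approximate nearest-neighbour (ANN) indexes so retrieval stays fast over millions of chunks.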

34.5 Fine-Tuning

Fine-tuning adapts a pre-trained LLM to a specific domain or task using your own data.

BASE MODEL (Llama-3)
    β”‚
    β”‚ + Domain data (e.g., medical records)
    β”‚ + Task examples (e.g., diagnosis from symptoms)
    β”‚
    β–Ό Fine-tuning (LoRA / QLoRA / Full)
    β”‚
FINE-TUNED MODEL
    β”‚
    β–Ό
Deployed & Served

Fine-Tuning Types

Type              Description                               Cost       When
────────────────  ────────────────────────────────────────  ─────────  ───────────────────────────
Full fine-tuning  All model weights updated                 Very high  Large specialized task
LoRA              Low-rank adaptation β€” few extra params    Low        Common choice, good balance
QLoRA             Quantized LoRA β€” 4-bit base model         Very low   Limited GPU memory
PEFT              Parameter-Efficient Fine-Tuning           Low        General purpose
Prompt tuning     Only tune soft prompt tokens              Minimal    Very few examples
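
The cost ranking in the table follows directly from parameter counts. For a single d x d weight matrix, LoRA trains two low-rank factors (d x r and r x d) instead of all d squared weights; a back-of-envelope sketch:

```python
def lora_trainable_fraction(d: int, r: int) -> float:
    """Fraction of a d x d layer's parameters that LoRA actually trains."""
    full = d * d            # full fine-tuning updates every weight
    lora = 2 * d * r        # LoRA adds only A (d x r) and B (r x d)
    return lora / full

# A 4096-wide projection layer with lora-r=16 (the rank used in the
# Vertex AI job below) trains well under 1% of the layer's weights.
print(f"{lora_trainable_fraction(4096, 16):.2%}")  # 0.78%
```

This is why LoRA fits on a single GPU where full fine-tuning of the same model would not: the optimizer state and gradients only exist for the small factors.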

Fine-Tuning with Vertex AI

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Fine-tune an open model (Gemma) via a Vertex AI custom training job
sft_job = aiplatform.CustomJob(
    display_name="gemini-fine-tune-churn-qa",
    worker_pool_specs=[{
        "machine_spec": {
            "machine_type": "a2-highgpu-1g",   # A100 GPU
            "accelerator_type": "NVIDIA_TESLA_A100",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-0:latest",
            "command": ["python", "finetune.py"],
            "args": [
                "--base-model=google/gemma-7b",
                "--train-data=gs://bucket/finetune_data/train.jsonl",
                "--output-dir=gs://bucket/fine-tuned-models/",
                "--method=lora",
                "--lora-r=16",
                "--epochs=3",
            ],
        },
    }],
)
sft_job.run(sync=True)

34.6 LLM Evaluation & Monitoring

Key LLM Metrics

OUTPUT QUALITY:
  Hallucination Rate    β†’ % of responses with factual errors
  Relevance Score       β†’ Does response address the question?
  Faithfulness          β†’ Does response stick to provided context (RAG)?
  BLEU / ROUGE          β†’ Text similarity to reference answers
  Toxicity Score        β†’ % of harmful/offensive outputs

OPERATIONAL:
  Latency               β†’ Time to first token, time to full response
  Throughput            β†’ Tokens/second
  Token Usage           β†’ Input + output tokens per request (cost driver)
  Error Rate            β†’ Failed API calls, timeouts

BUSINESS:
  Task Completion Rate  β†’ Did user achieve their goal?
  User Satisfaction     β†’ CSAT or thumbs up/down
  Escalation Rate       β†’ % of queries needing human fallback
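
The operational and business metrics above are typically aggregated from per-request logs. A minimal sketch over a hypothetical log schema (the field names are assumptions for illustration, not a standard):

```python
def summarize_llm_logs(logs: list[dict]) -> dict:
    """Aggregate per-request log records into the metrics listed above."""
    n = len(logs)
    return {
        "error_rate": sum(r["error"] for r in logs) / n,
        "escalation_rate": sum(r["escalated"] for r in logs) / n,
        "avg_latency_s": sum(r["latency_s"] for r in logs) / n,
        # Token usage is the main cost driver, so track it per request.
        "avg_tokens": sum(r["input_tokens"] + r["output_tokens"] for r in logs) / n,
    }

logs = [
    {"error": False, "escalated": False, "latency_s": 1.2,
     "input_tokens": 300, "output_tokens": 150},
    {"error": True,  "escalated": True,  "latency_s": 4.0,
     "input_tokens": 500, "output_tokens": 0},
]
print(summarize_llm_logs(logs))
```

In a real deployment these aggregates would be emitted to a dashboard (e.g. Cloud Monitoring, covered below) rather than computed ad hoc.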

LLM Evaluation with RAGAS

# pip install ragas

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["How do I cancel?", "What plans do you offer?"],
    "answer": ["You can cancel by calling 1-800-...", "We offer Basic, Standard, Premium"],
    "contexts": [
        ["To cancel, call our helpline at 1-800-XXX..."],
        ["Plans available: Basic $29/mo, Standard $49/mo, Premium $79/mo"],
    ],
    "ground_truth": ["Call 1-800-XXX to cancel.", "Basic, Standard, Premium"],
}

dataset = Dataset.from_dict(eval_data)

# Evaluate
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)

print(results)
# faithfulness: 0.92  ← responses are grounded in context
# answer_relevancy: 0.88
# context_precision: 0.85

34.7 LLMOps Pipeline on GCP

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    LLMOPS ON GCP                                 β”‚
β”‚                                                                  β”‚
β”‚  DEVELOPMENT         DEPLOYMENT         MONITORING               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚ Vertex AI    β”‚   β”‚ Cloud Run    β”‚   β”‚ Cloud Monitoring     β”‚ β”‚
β”‚  β”‚ Workbench    │──▢│ (serve LLM   │──▢│ (token usage,        β”‚ β”‚
β”‚  β”‚              β”‚   β”‚  via API)    β”‚   β”‚  latency, errors)    β”‚ β”‚
β”‚  β”‚ Prompt       β”‚   β”‚              β”‚   β”‚                      β”‚ β”‚
β”‚  β”‚ engineering  β”‚   β”‚ Vertex AI    β”‚   β”‚ Vertex AI Model      β”‚ β”‚
β”‚  β”‚ Evaluation   β”‚   β”‚ Endpoints    β”‚   β”‚ Monitoring           β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚ (managed)    β”‚   β”‚ (response quality)   β”‚ β”‚
β”‚                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                                                  β”‚
β”‚  VECTOR STORE              RAG PIPELINE                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Vertex AI       β”‚      β”‚  Cloud Run (RAG API)             β”‚  β”‚
β”‚  β”‚ Matching Engine │◀─────│  Embed β†’ Retrieve β†’ Generate     β”‚  β”‚
β”‚  β”‚ (ANN search)    β”‚      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

34.8 Cost Optimization for LLMs

COST DRIVERS:
  Input tokens  Γ— price/1K tokens
  Output tokens Γ— price/1K tokens (usually more expensive)
  # requests/day
  Compute (if self-hosted GPU)

OPTIMIZATION STRATEGIES:
  1. Prompt compression    β†’ Shorter prompts = fewer tokens
  2. Caching              β†’ Cache identical requests (semantic cache)
  3. Smaller models       β†’ Use GPT-3.5 where GPT-4 is overkill
  4. Batching             β†’ Group async requests
  5. Streaming            β†’ Faster perceived latency
  6. Fine-tuning          β†’ Smaller fine-tuned model > large general model
  7. Quantization         β†’ 4-bit models, ~4x smaller than FP16, ~same quality
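
Strategy 2 in its simplest form is an exact-match cache keyed on a hash of the full prompt (a semantic cache would instead match on embedding similarity). A minimal sketch, where `call_model` stands in for any real client call:

```python
import hashlib

_cache: dict[str, str] = {}

def cached_generate(prompt: str, call_model) -> str:
    """Return a cached response for an identical prompt, else call the model.

    Exact-match caching only helps when prompts repeat verbatim --
    a common case for FAQ-style traffic.
    """
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)   # pay for tokens only once
    return _cache[key]

# Demo with a fake model that records how often it is actually called.
calls = []
def fake_model(p):
    calls.append(p)
    return "answer"

cached_generate("How do I cancel?", fake_model)
cached_generate("How do I cancel?", fake_model)
print(len(calls))  # 1 -- the second request was served from cache
```

Every cache hit saves the full input plus output token cost of that request, which is why caching sits near the top of most LLM cost-reduction checklists.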

TYPICAL COSTS (2024):
  GPT-4o:          $5/1M input tokens, $15/1M output tokens
  Claude Sonnet:   $3/1M input, $15/1M output
  Gemini 1.5 Pro:  $3.5/1M input, $10.5/1M output
  LLaMA-3 70B:     ~$0.6/1M (self-hosted A100)
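
Given a price table like the one above, per-request cost is simple arithmetic over the cost drivers listed earlier. A sketch using the GPT-4o prices from the table:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_1m: float, out_price_per_1m: float) -> float:
    """Dollar cost of one request at given per-million-token prices."""
    return (input_tokens * in_price_per_1m
            + output_tokens * out_price_per_1m) / 1_000_000

# 2,000 input + 500 output tokens at the GPT-4o prices listed above.
cost = request_cost(2_000, 500, in_price_per_1m=5.0, out_price_per_1m=15.0)
print(f"${cost:.4f}")   # $0.0175
# At 100k requests/day that is $1,750/day -- why prompt compression
# and output-length limits matter at scale.
```

Note that output tokens cost 3x more here than input tokens, so capping `max_tokens` and asking for terse formats (like the JSON schema in 34.3) directly reduces the bill.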

Next β†’ Chapter 35: Edge ML