Important

This post is an experiment. It was created 100% by Gemini 3.5 Flash (Medium) and Antigravity. I provided an initial prompt that was 1 sentence long and a collection of skills that I’ve been using for a while now. I edited 0 words. I don’t vouch for its accuracy or usefulness. It’s purely an experiment.

I recently spent a Saturday rebuilding the main workbench in my garage in Raleigh. The old one was a basic pine framing-lumber setup I threw together when we first moved here. It was fine for holding a drill press, but the second I tried to do any real hand-tool joinery—like sawing tenons or hogging out waste with a chisel—the whole thing wobbled. Every ounce of energy I put into the tool was absorbed by the flexing legs.

If you want to carve clean joinery, you need a workbench that doesn’t budge. You need mass, rigidity, and stability.

I’m seeing the exact same dynamic in the software world right now with local LLMs. Everyone is obsessed with the chisels—the latest model weights, the newest quantization techniques, the flashiest local frontends. But they are trying to run them on flimsy, temporary foundations. They spin up a massive, expensive VM that gets left running, or they try to cram a 70B parameter model onto a laptop that starts sounding like a jet taking off.

We are focusing on the tools and ignoring the workbench. It is a classic craftsman’s paradox: we want to spend all our time on the finished carving, but we skip building the base that makes the carving possible.

The Nuance of “Local”

When people talk about running “local” LLMs, the prevailing narrative is that you must run it on your own physical hardware under your desk. That narrative gets it backwards.

For a developer, “local” doesn’t have to mean your laptop’s fan is screaming at 6000 RPM. It means control. It means running open-weights models inside a security boundary you control, without sending your data to an external API endpoint.

But if you want to test these models pragmatically, you need a system that scales down to zero when you go to bed, scales up when you’re actively developing, and handles GPU scheduling without you having to manually compile NVIDIA drivers at 2:00 AM.

That is where Google Kubernetes Engine (GKE) Autopilot comes in. It is the heavy oak workbench. You define the work, and the platform handles the structural integrity.

Let’s stand one up.

Setting Up the Workbench

To get started, we need to spin up our cluster. I’ve always preferred using the command line for this. The UI is nice for auditing, but the CLI is where the real craft is documented.

First, we’ll use gcloud to create a GKE Autopilot cluster. Autopilot is perfect here because we don’t want to manage node pools or guess how many GPUs we’ll need next week. We just want Kubernetes. And just as dissecting a Docker container image helps us understand layer stacking, starting with the CLI helps us understand the cluster primitives.

# Set our environment defaults
export PROJECT_ID=$(gcloud config get-value project)
export REGION="us-east4" # Or your nearest region with GPU availability
export CLUSTER_NAME="llm-workbench"

# Build the workbench
gcloud container clusters create-auto $CLUSTER_NAME \
    --region $REGION \
    --project $PROJECT_ID

Once the command finishes building the cluster (usually takes about five minutes), we need to grab the credentials so our local terminal can talk to it.

# Pull down the kubeconfig credentials
gcloud container clusters get-credentials $CLUSTER_NAME \
    --region $REGION \
    --project $PROJECT_ID
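Before moving on, it can be worth confirming that kubectl is really pointed at the new cluster. Here is a minimal sketch; the function name verify_workbench is my own invention, and it only wraps two standard kubectl subcommands.

```shell
# Hypothetical helper (the name is an assumption, not part of any tool).
verify_workbench() {
  # Print the active kubeconfig context; it should name the new cluster.
  kubectl config current-context
  # Print the control plane endpoint, which proves the credentials work.
  kubectl cluster-info
}
```

Paste the function into your shell and run verify_workbench. If the context shown is not the cluster you just created, re-run the get-credentials command above.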

Just like that, our workbench is built and anchored to the floor. Now we need to set up our tools.

Deploying the Chisel: Ollama

Ollama has become the default engine for running open-weights models. It handles downloading and loading models (which are typically distributed already quantized), and it exposes a clean HTTP API that includes OpenAI-compatible endpoints.

To run it on our GKE cluster, we want two things:

  1. Persistent storage so we don’t have to re-download 5GB+ model weights every time a container restarts.
  2. GPU access so we actually get decent tokens-per-second performance without melting the CPU.

Here is the YAML manifest to declare our Ollama environment. We’ll request a single NVIDIA L4 GPU, which has 24 GB of memory and is a cost-effective fit for models in the 8B to 9B range, like Llama 3 8B or Gemma 2 9B.

Create a file named ollama.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: premium-rwo
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
          name: api
        resources:
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: ollama-volume
          mountPath: /root/.ollama
      # nodeSelector is a Pod-level field, so it sits beside "containers", not inside one
      nodeSelector:
        cloud.google.com/compute-class: "Accelerator"
        cloud.google.com/gke-accelerator: "nvidia-l4"
      volumes:
      - name: ollama-volume
        persistentVolumeClaim:
          claimName: ollama-storage
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
  - protocol: TCP
    port: 11434
    targetPort: 11434
  type: ClusterIP

Apply this layout to your cluster:

kubectl apply -f ollama.yaml

GKE Autopilot will see the nvidia-l4 selector and the GPU request. It will automatically provision a node with the GPU attached, configure the drivers, and deploy the pod. No driver installation, no taints to configure manually. It just works.

You can watch the workbench spin up the resources:

kubectl get pods -w
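Watching pods scroll by works, but GPU nodes can take several minutes to appear. A small wrapper that blocks until the rollout finishes is sometimes handier. This is a sketch under assumptions: the function name wait_for_ollama and the 15-minute timeout are mine, while the Deployment name and labels come from the manifest above.

```shell
# Hypothetical helper; the name and the timeout value are assumptions.
wait_for_ollama() {
  # Block until the Deployment reports ready, or give up after 15 minutes.
  kubectl rollout status deployment/ollama --timeout=15m || return 1
  # Show which node the pod landed on.
  kubectl get pods -l app=ollama -o wide
  # List the nodes that carry the L4 accelerator label used in the manifest.
  kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-l4
}
```

If the rollout times out, kubectl describe pod on the pending pod usually shows whether the cluster is still provisioning a GPU node or has hit a quota limit.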

Running the First Cut

Once the pod is running, we want to run a model. Since we used a ClusterIP service (keeping our model server secure inside the cluster), we’ll use kubectl to port-forward the API port to our local laptop.

kubectl port-forward svc/ollama-service 11434:11434

Now, open a second terminal tab. First, we need to tell Ollama to pull a model. Good candidates are gemma2:9b or llama3. We’ll go with gemma2, whose default tag is the 9B variant:

curl http://localhost:11434/api/pull -d '{"name": "gemma2"}'

Because of our PersistentVolumeClaim, this download goes straight to persistent disk. If the pod restarts or is rescheduled, those weights remain intact.
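To check that the weights really are on disk, you can ask Ollama which models it has stored locally through its /api/tags endpoint. The sketch below assumes the port-forward from the previous step is still running. The helper names list_models and has_model and the OLLAMA_URL variable are my own, and the grep is a rough match on the JSON rather than a real parser.

```shell
# Assumed default: the port-forwarded address from the previous step.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Hypothetical helper: print the JSON list of models stored on the volume.
list_models() {
  curl -fsS "$OLLAMA_URL/api/tags"
}

# Hypothetical helper: succeed if a model whose name starts with $1 is stored.
has_model() {
  list_models | grep -Eq "\"name\": *\"$1[:\"]"
}
```

Running has_model gemma2 && echo "weights are on disk" after a pod restart is a quick way to confirm the volume did its job.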

Now, let’s run a test query to verify everything is working.

curl http://localhost:11434/api/generate -d '{
  "model": "gemma2",
  "prompt": "Explain GKE Autopilot in one sentence using a woodworking metaphor.",
  "stream": false
}'

If everything is configured correctly, you’ll get a response within a few seconds, generated on the GPU running inside your cluster. The very first request is slower because Ollama has to load the weights into GPU memory.
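Ollama also serves an OpenAI-style chat endpoint at /v1/chat/completions, so tools written against that API shape can point at the same port-forward. Here is a minimal sketch; the function name chat_openai_style and the OLLAMA_URL variable are assumptions of mine, and the JSON is built by plain string interpolation, so the prompt must not contain double quotes.

```shell
# Assumed default: the port-forwarded address used above.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Hypothetical helper: $1 is the model name, $2 is a prompt without double quotes.
chat_openai_style() {
  curl -fsS "$OLLAMA_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$1\", \"messages\": [{\"role\": \"user\", \"content\": \"$2\"}]}"
}
```

Usage would look like chat_openai_style gemma2 "Name three hand planes." For anything beyond a smoke test, build the JSON with a proper tool such as jq instead of interpolating strings.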

Cleaning Up

Since this is a test setup, you don’t want to leave the cluster running indefinitely, racking up charges. Once you’re done testing your models, you can tear down the cluster with a single command:

# Tear down the workbench: the cluster, its nodes, and its workloads
gcloud container clusters delete $CLUSTER_NAME \
    --region $REGION \
    --project $PROJECT_ID \
    --quiet
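One caveat worth hedging on: persistent disks that were provisioned through a PersistentVolumeClaim are not always removed along with the cluster. A cautious sketch is to delete the Kubernetes resources before the cluster delete above, then look for stragglers afterwards. The function names are mine, and the "pvc-" name prefix in the filter is an assumption about how dynamically provisioned disks are named, so adjust it if your disks look different.

```shell
# Hypothetical helper: run BEFORE deleting the cluster. Removes the Deployment,
# Service, and PVC; with a Delete reclaim policy this should release the disk.
release_ollama() {
  kubectl delete -f "${1:-ollama.yaml}"
}

# Hypothetical helper: run any time. Lists disks whose names start with "pvc-"
# (assumed naming for dynamically provisioned volumes).
list_leftover_disks() {
  gcloud compute disks list --project "$PROJECT_ID" --filter="name~^pvc-"
}
```

If list_leftover_disks still prints a disk after the cluster is gone, gcloud compute disks delete will remove it once you are sure nothing else needs it.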

The Craft is in the System

It is easy to get distracted by the speed at which models are evolving. But the models are just the blades. If you don’t have a reliable, reproducible system to harness them, you’re going to spend more time fighting your environment than building software.

By building our LLM workbench on GKE, we get a consistent development workflow:

  • Cost efficiency: GKE Autopilot bills for the resources your running pods request. If you scale the deployment to 0 when you’re done, you stop paying for the GPU, although the persistent disk (and any cluster management fee) keeps accruing until you delete it.
  • Consistency: The manifest is declarative. I can hand this YAML file to a teammate, and they will get the exact same setup in their project.
  • Separation of Concerns: My laptop runs cool, my IDE stays responsive, and the heavy lifting is handled by the platform.
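The scale-to-zero step in the first bullet is manual, so it helps to make it a one-word habit. Here is a minimal sketch, assuming the Deployment is still named ollama as in the manifest; the function names pause_ollama and resume_ollama are my own.

```shell
# Hypothetical helpers; the names are assumptions, the Deployment name is from ollama.yaml.

# Stop the pod so the GPU node can be released.
pause_ollama() {
  kubectl scale deployment/ollama --replicas=0
}

# Bring the pod back; the weights are still on the persistent volume.
resume_ollama() {
  kubectl scale deployment/ollama --replicas=1
}
```

Run pause_ollama at the end of the day and resume_ollama the next morning. Expect a wait of a few minutes on resume while a GPU node is provisioned again.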

Stop trying to carve joinery on a card table. Build a proper workbench, and let the tools do what they were designed to do.