Case Study · Cloud Run · Cost Optimization

How We Cut Our GCP Bill by 67% in One Afternoon

10 min read

We build Cloud Guardian to find infrastructure cost problems. So when we discovered that our own GCP bill had ballooned to $434/month across 29 projects, the irony was not lost on us. Worse, a single misconfigured Cloud Run service accounted for $278/month of that total, and three separate failures had conspired to prevent our own auto-remediation system from catching it.

This is the full story of what happened, what we found, and how we fixed it in one afternoon. It is also the story of the feature we built afterward to make sure it never happens again.

The Setup: 29 Projects, One Platform

Cloud Guardian manages infrastructure across 29 GCP projects under a single organization. These projects span production workloads, staging environments, internal tools, and customer-facing services. Every project has a service account linked through our connector system, with credentials encrypted using envelope encryption (AES-256-GCM with Cloud KMS key wrapping).

Our scanner runs every six hours via Cloud Scheduler, hitting the /internal/scheduled-scan endpoint. It scans all 29 projects in parallel (concurrency of 5), checking nine resource types: Cloud Run, Compute Engine, Cloud SQL, Cloud Storage, Cloud Functions, GKE, BigQuery, Secret Manager, and Artifact Registry. Each scan produces snapshots, identifies violations against our policy rules, and triggers auto-remediation when configured.
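To make the fan-out concrete, here is a minimal sketch of a concurrency-limited scan loop. The names (`scanAll`, `scanProject`) and the semaphore pattern are our illustration, not the production code; only the shape (all projects in parallel, capped in-flight count) comes from the system described above.

```go
package main

import (
	"fmt"
	"sync"
)

// scanAll fans out one scan per project but caps in-flight scans at
// maxConcurrent, mirroring "29 projects in parallel, concurrency of 5".
// scanProject is a stand-in for the real per-project scan.
func scanAll(projects []string, maxConcurrent int, scanProject func(string) error) map[string]error {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]error, len(projects))
		sem     = make(chan struct{}, maxConcurrent) // concurrency limiter
	)
	for _, p := range projects {
		wg.Add(1)
		go func(project string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it on completion
			err := scanProject(project)
			mu.Lock()
			results[project] = err
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return results
}

func main() {
	projects := []string{"prod-api", "staging", "internal-tools"}
	results := scanAll(projects, 2, func(p string) error {
		fmt.Println("scanning", p)
		return nil
	})
	fmt.Println(len(results), "projects scanned")
}
```

The per-project error map is what makes the "silent failure" in the next section possible: an error that is recorded but never alerted on is indistinguishable from success at a glance.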

At least, that is what it is supposed to do. For weeks, 27 out of 28 unique GCP projects were silently failing credential decryption, and we did not notice.

The Discovery: $278/month on a Single Service

On March 7, 2026, we ran a manual cost audit across all projects. The numbers were alarming. Total spend had climbed to approximately $434/month, far more than expected for what is fundamentally a serverless-first architecture designed to scale to zero.

The biggest line item jumped out immediately: a Cloud Run service called controlplane was costing $278/month. That is 64% of our entire GCP bill on a single service.

The root cause was a single configuration flag: cpu_idle was set to false. This meant CPU was allocated 24/7 regardless of whether the service was processing requests. For a service that handles sporadic API calls and spends 95% of its time idle, this is the most expensive possible configuration. The service was burning through vCPU-seconds around the clock, paying for compute it was not using.
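The arithmetic behind that bill is worth spelling out. The sketch below estimates the monthly cost of an always-allocated service; the per-second rates are illustrative stand-ins, since Cloud Run's published pricing varies by region and tier, so treat the numbers as a model rather than a quote.

```go
package main

import "fmt"

// Illustrative always-allocated Cloud Run rates in USD per second.
// Real pricing varies by region and tier; check the current price sheet.
const (
	cpuPerVCPUSecond = 0.000018
	memPerGiBSecond  = 0.0000020
	secondsPerMonth  = 30 * 24 * 3600 // 2,592,000 seconds in a 30-day month
)

// alwaysOnMonthlyCost estimates the monthly cost of a service with
// cpu_idle = false: CPU and memory are billed for every second an
// instance exists, whether or not it is serving requests.
func alwaysOnMonthlyCost(vCPUs, memGiB, instances float64) float64 {
	perSecond := instances * (vCPUs*cpuPerVCPUSecond + memGiB*memPerGiBSecond)
	return perSecond * secondsPerMonth
}

func main() {
	// One always-on instance with 1 vCPU and 512 MiB of memory.
	fmt.Printf("$%.2f/month\n", alwaysOnMonthlyCost(1, 0.5, 1))
}
```

Even a modest 1 vCPU / 512 MiB instance costs roughly $49/month at these rates when it never throttles to idle; larger CPU allocations and multiple warm instances scale that linearly, which is how a single sporadically used service can reach $278/month.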

Cloud Guardian is specifically designed to detect this exact problem. Our cpu_idle_disabled check runs on every scan cycle. It should have flagged this weeks ago and auto-remediated it. So why did it not?

Three Cascading Failures That Prevented Auto-Remediation

The investigation revealed three independent failures, each of which alone would have prevented the fix. Together, they formed a perfect storm that allowed a $278/month misconfiguration to run unchecked.

Failure 1: KMS Key Name Mismatch

Every project connector stores its GCP service account credentials encrypted with a data encryption key (DEK), which is itself wrapped by a Cloud KMS key. The KMS_KEY_NAME environment variable tells the scanner which KMS key to use for unwrapping.
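To illustrate the data-key half of that scheme, here is a self-contained AES-256-GCM sketch. The Cloud KMS wrap/unwrap step is deliberately omitted (in the real system only the KMS-wrapped DEK is stored), and the function names are ours, not Cloud Guardian's.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// sealWithDEK encrypts plaintext with a fresh 256-bit data encryption key
// (DEK) using AES-256-GCM. In envelope encryption the DEK would then be
// wrapped by Cloud KMS and only the wrapped form persisted; that step is
// left out here for brevity.
func sealWithDEK(plaintext []byte) (dek, nonce, ciphertext []byte, err error) {
	dek = make([]byte, 32) // 32 bytes = AES-256 key
	if _, err = rand.Read(dek); err != nil {
		return nil, nil, nil, err
	}
	block, err := aes.NewCipher(dek)
	if err != nil {
		return nil, nil, nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, nil, nil, err
	}
	nonce = make([]byte, gcm.NonceSize())
	if _, err = rand.Read(nonce); err != nil {
		return nil, nil, nil, err
	}
	ciphertext = gcm.Seal(nil, nonce, plaintext, nil)
	return dek, nonce, ciphertext, nil
}

// openWithDEK reverses sealWithDEK, given the (unwrapped) DEK.
func openWithDEK(dek, nonce, ciphertext []byte) ([]byte, error) {
	block, err := aes.NewCipher(dek)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	dek, nonce, ct, _ := sealWithDEK([]byte(`{"type":"service_account"}`))
	pt, _ := openWithDEK(dek, nonce, ct)
	fmt.Println(string(pt))
}
```

The important property for this incident: if the KMS unwrap fails, the DEK is unavailable and `openWithDEK` can never run, so the credentials are unreadable even though they are stored intact.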

Our Terraform configuration creates a keyring named guardian. But the KMS_KEY_NAME environment variable on the Cloud Run service was set to keyRings/cloud-guardian instead of keyRings/guardian. A simple string mismatch. The result: 27 out of 28 projects failed credential decryption on every scan cycle. The scanner could not authenticate to the GCP APIs, so it could not read resource configurations, so it could not detect violations, so it could not trigger remediation.

One project still worked because it had been onboarded with an older credential format that did not use the KMS key path. Twenty-seven projects were effectively blind.

Failure 2: Empty Remediation Scopes

Even if the scanner had been able to decrypt credentials and detect the cpu_idle violation, auto-remediation would not have fired. Each project in Cloud Guardian has configurable AutoRemediationScopes that control which types of fixes can be applied automatically. The project hosting the controlplane service had empty scopes. No auto-remediation of any kind was authorized.

This is a reasonable safety default. You do not want auto-remediation running on projects that have not explicitly opted in. But it meant that even a working scanner would have logged the violation and done nothing about it.

Failure 3: GitHub PR Mode for Critical Violations

The third failure was a design flaw. When a project has a linked GitHub repository and installation ID, Cloud Guardian's remediation engine defaults to GitHub PR mode: instead of applying fixes directly via the GCP API, it generates a Terraform PR with the configuration change. This is the right approach for most remediations. Infrastructure changes should go through code review.

But for a service bleeding $9.27/day in unnecessary CPU costs, waiting for someone to review and merge a PR is the wrong trade-off. Every day the PR sits unmerged is another $9.27 wasted. For critical cost violations like cpu_idle = false on an idle service, the fix should be applied immediately and the PR created afterward for documentation.

The Cascade

Any one of these three failures would have blocked the fix. Together, they ensured the controlplane service ran with cpu_idle disabled for weeks, accumulating more than $600 in unnecessary charges before manual discovery. The KMS mismatch prevented detection, the empty scopes prevented authorization, and the PR mode would have delayed execution even if the first two had worked.

The Fix: Critical Cost Protection

We fixed the immediate problem first. Correcting the KMS_KEY_NAME environment variable from keyRings/cloud-guardian to keyRings/guardian restored credential decryption for all 28 projects. Then we applied the cpu_idle = true fix directly via the Cloud Run v2 API, using an explicit UpdateMask on the Patch call (without the UpdateMask, the v2 API silently accepts the request but does not apply the change, which is its own fun discovery). We also configured AutoRemediationScopes on all 29 projects.

But fixing the symptoms was not enough. We needed to ensure that critical cost violations could never silently accumulate again. So we built Critical Cost Protection.

The feature works on a simple principle: some cost violations are too expensive to wait for a pull request. Specifically, two violations qualify as critical:

  • cpu_idle = false on any Cloud Run service that does not require always-on CPU. At Cloud Run pricing, this can cost $60-280/month per service depending on CPU allocation and instance count.
  • min_instances > 0 on services that do not need warm instances. Combined with cpu_idle disabled, this is the most expensive Cloud Run configuration possible.

When the scanner detects either of these violations, Critical Cost Protection does three things differently from normal remediation:

  1. Bypasses GitHub PR mode. The fix is applied directly via the Cloud Run API, regardless of whether the project has a linked GitHub repository. The configuration change takes effect in seconds, not days.
  2. Bypasses empty remediation scopes. The cloud_run:optimize scope is always injected for critical cost violations. A project does not need to have AutoRemediationScopes configured for critical protection to work.
  3. Emits a dedicated event. A critical_cost.remediation event is logged for audit and alerting purposes, so the team knows when a critical fix was applied automatically.

The implementation is a single function, isCriticalCostViolation(), that evaluates each violation during the remediation planning phase. When it returns true, the remediation engine uses the forceDirect flag to bypass PR mode, and the scope injection ensures the action is authorized regardless of project configuration.

// Critical cost violations bypass PR mode and scope checks.
// These are too expensive to wait for human review.
func isCriticalCostViolation(v *Violation) bool {
    switch v.CheckID {
    case "cpu_idle_disabled":
        return true  // $60-280/month per service
    case "min_instances_nonzero":
        return v.MinInstances > 0 && !v.RequiresWarmInstances
    default:
        return false
    }
}
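The scope-injection half can be sketched alongside it. This is our illustration, not the production code; the `effectiveScopes` name and signature are hypothetical, and only the `cloud_run:optimize` scope name comes from the system described above.

```go
package main

import "fmt"

// effectiveScopes returns the remediation scopes to authorize for a
// violation: the project's configured AutoRemediationScopes, plus
// cloud_run:optimize injected whenever the violation is critical,
// even if the configured list is empty.
func effectiveScopes(configured []string, critical bool) []string {
	scopes := append([]string(nil), configured...) // copy, don't mutate input
	if critical {
		for _, s := range scopes {
			if s == "cloud_run:optimize" {
				return scopes // already authorized
			}
		}
		scopes = append(scopes, "cloud_run:optimize")
	}
	return scopes
}

func main() {
	fmt.Println(effectiveScopes(nil, true)) // [cloud_run:optimize]
}
```

An empty configured list with a non-critical violation yields no scopes, preserving the opt-in default; criticality is the only path that widens authorization.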

After applying the fix and deploying Critical Cost Protection, we re-scanned all 29 projects to verify. The controlplane service showed cpu_idle = true and the violation was cleared.

Results: $434 to $142/month

The impact was immediate. Within 24 hours of applying fixes across the fleet, our projected monthly cost dropped from $434/month to $142/month, a 67% reduction.

Cost Breakdown After Optimization

controlplane (cpu_idle fix): -$248/month
Artifact Registry cleanup (stale images): -$18/month
Idle Cloud Run services (scale-to-zero): -$14/month
Overprovisioned CPU right-sizing: -$12/month
Total monthly savings: -$292/month

The single biggest win was the controlplane fix, saving $248/month (the remaining $30/month of its original $278 cost covers legitimate request processing and memory allocation). But the other optimizations added up: cleaning stale Artifact Registry images, enabling scale-to-zero on services that did not need warm instances, and right-sizing overprovisioned CPU allocations from 2 vCPUs to 1 vCPU on services averaging fewer than two active instances.

Annualized, this is $3,504 in savings. For a team of our size, that is meaningful. And the fixes took less than four hours from discovery to deployment, including building and shipping Critical Cost Protection.

Lessons Learned

This incident taught us several things that we have since incorporated into Cloud Guardian's design:

1. Monitor your monitoring. The most dangerous failure is a silent one. Our scanner was failing on 27 out of 28 projects for weeks, and we did not notice because the errors were logged but not alerted on. Now, scan cycle completion rates and per-project success/failure counts are tracked as first-class metrics in our Ops Health dashboard. If fewer than 90% of projects scan successfully, an alert fires.
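A minimal version of that health check might look like the following. The function name and shape are ours; the 90% threshold is the one from our dashboard.

```go
package main

import "fmt"

// scanHealthAlert reports the per-cycle success rate across projects and
// whether it falls below the 90% alert threshold.
func scanHealthAlert(succeeded, total int) (rate float64, alert bool) {
	if total == 0 {
		return 0, true // no scans at all is itself an alert condition
	}
	rate = float64(succeeded) / float64(total)
	return rate, rate < 0.90
}

func main() {
	// The incident state: 1 of 28 unique projects decrypting successfully.
	rate, alert := scanHealthAlert(1, 28)
	fmt.Printf("success rate %.0f%%, alert=%v\n", rate*100, alert) // success rate 4%, alert=true
}
```

With this in place, the incident's 1-of-28 state trips the alert on the first six-hour cycle instead of surviving for weeks.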

2. Not all remediations are equal. Sending a Terraform PR is the right default for most infrastructure changes. But when a misconfiguration is costing $9/day and the fix is a single API call with no functional impact, waiting for code review is the wrong trade-off. Critical Cost Protection distinguishes between changes that need review and changes that need speed.

3. Defense in depth applies to cost protection. A single misconfigured environment variable broke credential decryption for nearly every project. If we had been validating KMS key accessibility at startup rather than at scan time, we would have caught it immediately. We now run a preflight check on boot that verifies the KMS key exists and the service account has decrypt permissions.
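A cheap startup check catches malformed key names before the first scan. Note the limits of this sketch: our keyRings/cloud-guardian value was structurally valid, so a shape check alone would not have caught it; only the API-level existence and decrypt probe (omitted here) would. The helper below is an illustrative sketch, not the production preflight.

```go
package main

import (
	"fmt"
	"regexp"
)

// kmsKeyPattern matches a fully qualified Cloud KMS crypto key resource
// name: projects/<p>/locations/<l>/keyRings/<r>/cryptoKeys/<k>.
var kmsKeyPattern = regexp.MustCompile(
	`^projects/[^/]+/locations/[^/]+/keyRings/[^/]+/cryptoKeys/[^/]+$`)

// preflightKMSKeyName validates the shape of KMS_KEY_NAME at boot, before
// any scan runs. The real preflight also calls the KMS API to confirm the
// key exists and the service account holds decrypt permission; that half
// is what catches a wrong keyring name, and it is omitted here.
func preflightKMSKeyName(name string) error {
	if !kmsKeyPattern.MatchString(name) {
		return fmt.Errorf("KMS_KEY_NAME %q is not a valid crypto key resource name", name)
	}
	return nil
}

func main() {
	good := "projects/acme/locations/us/keyRings/guardian/cryptoKeys/connector"
	fmt.Println(preflightKMSKeyName(good)) // <nil>
}
```

Failing fast at boot turns a weeks-long silent outage into a deploy that refuses to start.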

4. The Cloud Run v2 API has sharp edges. Sending a Patch request without an UpdateMask returns a 200 OK but does not apply the change. This is documented behavior, but it is surprising and easy to miss. We burned time debugging why the cpu_idle fix appeared to succeed but had no effect, until we discovered the UpdateMask requirement. Always set an explicit UpdateMask on Cloud Run v2 Patch calls.

5. Eat your own dog food aggressively. We build infrastructure cost protection software, and our own infrastructure had a $278/month cost problem for weeks. The embarrassment was productive. It directly led to Critical Cost Protection, scan health monitoring, and KMS preflight validation, all features that now protect every Cloud Guardian customer.

6. Scope defaults matter. Requiring explicit AutoRemediationScopes is the right safety default for customer projects. But for critical cost violations, the scope should be injected automatically. The cost of a false positive (unnecessarily enabling cpu_idle on a service that can be reverted in seconds) is far lower than the cost of a false negative (letting a $278/month bill accumulate indefinitely).

Timeline Summary

10:00 AM: Manual cost audit reveals $434/month total, $278/month on controlplane
10:30 AM: Root cause identified: cpu_idle disabled, KMS key mismatch found
11:00 AM: KMS key fixed, credentials decrypting for all 28 projects
11:15 AM: cpu_idle fix applied via Cloud Run API (with UpdateMask)
12:00 PM: AutoRemediationScopes configured on all 29 projects
1:00 PM: Critical Cost Protection feature implemented and tested
2:00 PM: Deployed to production, full re-scan confirms $142/month projected cost

Automate Your Own GCP Cost Audit

Cloud Guardian scans your GCP projects every six hours, detects cost violations like cpu_idle misconfigurations and overprovisioned resources, and auto-remediates them. Critical Cost Protection ensures the most expensive problems are fixed in seconds, not days. Connect your first project in under five minutes.

Get Started Free