Architecture

Cloud Guardian is a multi-tenant FinOps platform with a Go backend, Next.js frontend, and MCP server.

System Overview

┌──────────────┐     ┌──────────────────┐     ┌────────────────┐
│   Next.js    │────▶│  Connect-RPC API │────▶│   Firestore    │
│   (Vercel)   │     │  (Cloud Run)     │     │   (default)    │
└──────────────┘     └──────────────────┘     └────────────────┘
                            │                         │
                     ┌──────┴──────┐           ┌──────┴──────┐
                     │  Scanner    │           │  Cloud KMS  │
                     │  (Background│           │  (Envelope  │
                     │   Goroutine)│           │   Encrypt)  │
                     └─────────────┘           └─────────────┘
                            │
                     ┌──────┴──────┐
                     │  GCP APIs   │
                     │ (per-project│
                     │   scan)     │
                     └─────────────┘

Backend Stack

  • Language: Go 1.24
  • RPC Framework: Connect-RPC (HTTP/2 cleartext via h2c)
  • Database: Firestore, using the (default) database
  • Auth: Firebase Authentication (JWT) + static API token + per-user API keys
  • Encryption: Cloud KMS envelope encryption for credentials
  • Deployment: Cloud Run (australia-southeast2)
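
The envelope encryption listed above can be illustrated with a small, self-contained sketch. It assumes the usual envelope pattern: a fresh data encryption key (DEK) encrypts the credential, and a key encryption key (KEK) wraps the DEK. Here a local AES key stands in for the Cloud KMS key, so no KMS call is made; the names `seal`, `open`, `envelope`, `encryptEnvelope`, and `decryptEnvelope` are hypothetical and not taken from the Cloud Guardian codebase.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// seal encrypts plaintext with AES-GCM under key and prepends the nonce.
func seal(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal: it splits off the nonce and decrypts the remainder.
func open(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

// envelope is what would be stored: the wrapped DEK next to the ciphertext.
type envelope struct {
	WrappedDEK []byte
	Ciphertext []byte
}

// encryptEnvelope generates a fresh 256-bit DEK, encrypts the plaintext with
// it, then wraps the DEK with the KEK (assumption: KMS would do the wrapping).
func encryptEnvelope(kek, plaintext []byte) (envelope, error) {
	dek := make([]byte, 32)
	if _, err := rand.Read(dek); err != nil {
		return envelope{}, err
	}
	ct, err := seal(dek, plaintext)
	if err != nil {
		return envelope{}, err
	}
	wrapped, err := seal(kek, dek)
	if err != nil {
		return envelope{}, err
	}
	return envelope{WrappedDEK: wrapped, Ciphertext: ct}, nil
}

// decryptEnvelope unwraps the DEK with the KEK, then decrypts the payload.
func decryptEnvelope(kek []byte, env envelope) ([]byte, error) {
	dek, err := open(kek, env.WrappedDEK)
	if err != nil {
		return nil, err
	}
	return open(dek, env.Ciphertext)
}

func main() {
	kek := make([]byte, 32) // local stand-in for a KMS-held key
	if _, err := rand.Read(kek); err != nil {
		panic(err)
	}
	env, err := encryptEnvelope(kek, []byte("service-account-key"))
	if err != nil {
		panic(err)
	}
	pt, err := decryptEnvelope(kek, env)
	fmt.Println(string(pt), err)
}
```

With a real KMS, only the wrap and unwrap steps would leave the process; the bulk encryption stays local, which keeps each credential read to a single KMS decrypt call.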

Frontend Stack

  • Framework: Next.js 16 with App Router
  • UI: shadcn/ui components with Tailwind CSS
  • Auth: Firebase client SDK with Google Sign-In
  • Deployment: Vercel
  • Charts: Recharts for cost trend visualization

Scanner Architecture

The background scanner runs as a goroutine within the API server:

  1. Build targets — Merges static GCP_PROJECTS with connected connectors
  2. Resolve org IDs — Maps connector IDs to organization IDs via org projects
  3. Prefetch overrides — Loads check overrides per-org to avoid repeated lookups
  4. Resolve credentials — Sequential KMS decrypt calls (not parallelized)
  5. Parallel scan — Scans up to SCAN_CONCURRENCY (default 5) projects concurrently
  6. Per-project timeout — 2-minute context timeout per project
  7. Record results — Persist snapshots, connector scan timestamps, project results
  8. Post-scan pipeline — Alerts → Drift detection → Auto-remediation → Savings verification

Scan Mutex

A sync.Mutex acquired with TryLock() prevents overlapping scan cycles, whether a cycle is started by the background ticker or by the manual TriggerScan RPC. The lock is in-process, so it only protects a single instance; running multiple Cloud Run instances would require distributed locking.
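
A minimal sketch of this guard, assuming a scan that is already running causes the new request to be rejected rather than queued. The `scanner` type, `runCycle` method, and `errScanInProgress` error are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errScanInProgress is returned when another cycle already holds the lock.
var errScanInProgress = errors.New("scan already in progress")

type scanner struct {
	mu sync.Mutex
}

// runCycle runs cycle only if no other cycle is active. TryLock returns
// immediately instead of blocking, so a second caller is turned away.
func (s *scanner) runCycle(cycle func()) error {
	if !s.mu.TryLock() {
		return errScanInProgress
	}
	defer s.mu.Unlock()
	cycle()
	return nil
}

func main() {
	s := &scanner{}
	err := s.runCycle(func() {
		// A nested attempt simulates a manual trigger arriving mid-cycle.
		fmt.Println("nested:", s.runCycle(func() {}))
	})
	fmt.Println("outer:", err)
}
```

`sync.Mutex.TryLock` is available from Go 1.18 onward.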

Auth & RBAC

Auth Chain

Requests are authenticated in order:

  1. Static API token — For CI/CD and server-side API calls
  2. API key (cg_ prefix) — Per-user programmatic tokens with org + role binding
  3. Firebase JWT — Browser-based authentication via Google Sign-In

Role Hierarchy

| Role | Level | Capabilities |
|------|-------|-------------|
| Viewer | 1 | Read-only access to org data |
| Member | 2 | + trigger scans, view details |
| Admin | 3 | + manage connectors, rules, members |
| Owner | 4 | + delete org, manage billing |
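
Because each role adds to the one below it, a permission check can reduce to comparing levels. The sketch below mirrors the level numbers from the table; the `Role` type and `allows` method are hypothetical names, not the platform's API.

```go
package main

import "fmt"

// Role levels match the hierarchy table: higher values include lower ones.
type Role int

const (
	Viewer Role = iota + 1 // 1
	Member                 // 2
	Admin                  // 3
	Owner                  // 4
)

// allows reports whether a caller holding role r meets the required role.
func (r Role) allows(required Role) bool {
	return r >= required
}

func main() {
	fmt.Println(Admin.allows(Member)) // true
	fmt.Println(Member.allows(Admin)) // false
}
```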

Org Scoping

All API calls are scoped to an organization via the X-Org-ID header. The header value is verified against the user's org membership to prevent cross-tenant access.

Remediation Engine

The remediation system supports two modes:

Direct Mode

Executes GCP API calls directly (e.g., updating a Cloud Run service or deleting Artifact Registry images).

GitHub PR Mode

When a project has a linked GitHub repository:

  1. A Gemini-powered fix agent generates Terraform changes
  2. A PR is created via the GitHub App
  3. The PR is tracked until it is merged or closed
  4. A post-merge re-scan verifies the fix

Remediation Flow

Plan → Dedup → Create Action → Execute → Refresh Snapshots → Verify Savings

Data Model

All data lives in the Firestore (default) database. Key collections:

| Collection | Purpose |
|-----------|---------|
| users | User profiles (synced from Firebase Auth) |
| organizations | Multi-tenant org records |
| memberships | User ↔ org role bindings |
| connectors | GCP project connections |
| connector_credentials | Envelope-encrypted SA keys |
| cost_snapshots | Per-resource cost data points |
| remediation_actions | Planned/executed remediation records |
| cost_alerts | Generated cost anomaly alerts |
| scan_cycles | Scan cycle metadata |
| scan_project_results | Per-project scan results |