Architecture
Cloud Guardian is a multi-tenant FinOps platform with a Go backend, Next.js frontend, and MCP server.
System Overview
┌──────────────┐     ┌──────────────────┐     ┌────────────────┐
│   Next.js    │────▶│ Connect-RPC API  │────▶│   Firestore    │
│   (Vercel)   │     │   (Cloud Run)    │     │   (default)    │
└──────────────┘     └──────────────────┘     └────────────────┘
                              │                       │
                       ┌──────┴──────┐         ┌──────┴──────┐
                       │   Scanner   │         │  Cloud KMS  │
                       │  (Background│         │  (Envelope  │
                       │   Goroutine)│         │   Encrypt)  │
                       └─────────────┘         └─────────────┘
                              │
                       ┌──────┴──────┐
                       │  GCP APIs   │
                       │ (per-project│
                       │    scan)    │
                       └─────────────┘
Backend Stack
- Language: Go 1.24
- RPC Framework: Connect-RPC (HTTP/2 cleartext via h2c)
- Database: Firestore (default) database
- Auth: Firebase Authentication (JWT) + static API token + per-user API keys
- Encryption: Cloud KMS envelope encryption for credentials
- Deployment: Cloud Run (australia-southeast2)
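To illustrate the envelope-encryption pattern named in the stack, here is a minimal sketch using only the Go standard library. It assumes a local 32-byte key stands in for the Cloud KMS key-encryption key (KEK); in a real deployment the wrap and unwrap steps would be remote KMS calls. The `Envelope` type and all function names are hypothetical, not taken from the codebase.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and prepends the random nonce to the ciphertext.
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open splits off the nonce written by seal and decrypts the remainder.
func open(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(sealed) < n {
		return nil, errors.New("ciphertext too short")
	}
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}

// Envelope is a hypothetical stored record: the wrapped data-encryption key
// (DEK) plus the data encrypted under that DEK.
type Envelope struct {
	WrappedDEK []byte
	Ciphertext []byte
}

// encryptCredential generates a fresh DEK, encrypts the credential with it,
// then wraps the DEK with the KEK. With Cloud KMS the wrap step would be a
// remote call instead of a local seal.
func encryptCredential(kek, credential []byte) (Envelope, error) {
	dek := make([]byte, 32)
	if _, err := rand.Read(dek); err != nil {
		return Envelope{}, err
	}
	ct, err := seal(dek, credential)
	if err != nil {
		return Envelope{}, err
	}
	wrapped, err := seal(kek, dek)
	if err != nil {
		return Envelope{}, err
	}
	return Envelope{WrappedDEK: wrapped, Ciphertext: ct}, nil
}

// decryptCredential unwraps the DEK with the KEK, then decrypts the data.
func decryptCredential(kek []byte, env Envelope) ([]byte, error) {
	dek, err := open(kek, env.WrappedDEK)
	if err != nil {
		return nil, err
	}
	return open(dek, env.Ciphertext)
}

func main() {
	kek := make([]byte, 32) // all-zero demo key; only acceptable in an example
	env, err := encryptCredential(kek, []byte(`{"type":"service_account"}`))
	if err != nil {
		panic(err)
	}
	plain, err := decryptCredential(kek, env)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(plain))
}
```

The point of the pattern is that only the small DEK is ever sent to the key service, while the credential itself is encrypted locally.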
Frontend Stack
- Framework: Next.js 16 with App Router
- UI: shadcn/ui components with Tailwind CSS
- Auth: Firebase client SDK with Google Sign-In
- Deployment: Vercel
- Charts: Recharts for cost trend visualization
Scanner Architecture
The background scanner runs as a goroutine within the API server:
- Build targets: Merges static GCP_PROJECTS with connected connectors
- Resolve org IDs: Maps connector IDs to organization IDs via org projects
- Prefetch overrides: Loads check overrides per-org to avoid repeated lookups
- Resolve credentials: Sequential KMS decrypt calls (not parallelized)
- Parallel scan: Scans up to SCAN_CONCURRENCY (default 5) projects concurrently
- Per-project timeout: 2-minute context timeout per project
- Record results: Persist snapshots, connector scan timestamps, project results
- Post-scan pipeline: Alerts → Drift detection → Auto-remediation → Savings verification
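The parallel-scan and per-project-timeout steps can be sketched with a buffered channel used as a semaphore and `context.WithTimeout`. This is one common way to bound concurrency in Go, not necessarily how the scanner does it; `scanFunc`, `scanAll`, and `runDemo` are hypothetical names, and the demo uses a 50 ms timeout in place of the 2-minute one so it finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// scanFunc is a hypothetical per-project scan callback.
type scanFunc func(ctx context.Context, projectID string) error

// scanAll runs scan for every project with at most `concurrency` scans in
// flight, giving each project its own timeout. It returns one entry per
// project, holding nil on success or the scan error.
func scanAll(ctx context.Context, projects []string, concurrency int,
	perProject time.Duration, scan scanFunc) map[string]error {
	if concurrency < 1 {
		concurrency = 1
	}
	results := make(map[string]error, len(projects))
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		sem = make(chan struct{}, concurrency)
	)
	for _, p := range projects {
		sem <- struct{}{} // blocks while `concurrency` scans are running
		wg.Add(1)
		go func(projectID string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			pctx, cancel := context.WithTimeout(ctx, perProject)
			defer cancel()
			err := scan(pctx, projectID)
			mu.Lock()
			results[projectID] = err
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return results
}

// runDemo scans three fake projects; "proj-slow" outlives the timeout.
func runDemo() map[string]error {
	scan := func(ctx context.Context, id string) error {
		delay := time.Millisecond
		if id == "proj-slow" {
			delay = 5 * time.Second
		}
		select {
		case <-time.After(delay):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	projects := []string{"proj-a", "proj-b", "proj-slow"}
	return scanAll(context.Background(), projects, 5, 50*time.Millisecond, scan)
}

func main() {
	for id, err := range runDemo() {
		fmt.Printf("%s timed out: %v\n", id, errors.Is(err, context.DeadlineExceeded))
	}
}
```

Because each project gets its own derived context, one slow project is cancelled without affecting the others in the same cycle.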
Scan Mutex
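A sync.Mutex acquired with TryLock() prevents overlapping scan cycles when the background ticker and a manual TriggerScan RPC fire at the same time: whichever caller fails to acquire the lock skips its cycle. The lock is in-process, so it only protects a single instance. Horizontal scaling on Cloud Run would require distributed locking.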
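A minimal sketch of this guard, assuming a hypothetical `Scanner` type and `RunCycle` method (`sync.Mutex.TryLock` itself is available from Go 1.18):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrScanInProgress is a hypothetical error returned when a cycle is skipped.
var ErrScanInProgress = errors.New("scan already in progress")

// Scanner holds the in-process lock shared by the ticker and the RPC handler.
type Scanner struct {
	mu sync.Mutex
}

// RunCycle runs one scan cycle unless another is already in flight, in which
// case it returns immediately instead of waiting for the lock.
func (s *Scanner) RunCycle(cycle func()) error {
	if !s.mu.TryLock() {
		return ErrScanInProgress
	}
	defer s.mu.Unlock()
	cycle()
	return nil
}

func main() {
	s := &Scanner{}
	err := s.RunCycle(func() {
		// A manual trigger arriving mid-cycle is rejected rather than queued.
		fmt.Println("overlapping trigger:", s.RunCycle(func() {}))
	})
	fmt.Println("ticker cycle:", err)
}
```

Returning an error rather than blocking keeps a manual trigger from queueing a second full scan behind the one already running.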
Auth & RBAC
Auth Chain
Requests are authenticated in order:
- Static API token: For CI/CD and server-side API calls
- API key (cg_ prefix): Per-user programmatic tokens with org + role binding
- Firebase JWT: Browser-based authentication via Google Sign-In
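The ordered chain above might look roughly like the following. The `Authenticator` and `Principal` types, the lookup callbacks, and the demo credentials are all hypothetical. The sketch also assumes that a credential with the `cg_` prefix is never retried as a JWT, which the source does not state.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
	"strings"
)

// Principal is a hypothetical result of successful authentication.
type Principal struct {
	Method  string // "static", "api_key", or "firebase"
	Subject string
}

// Authenticator tries each method in the documented order.
type Authenticator struct {
	StaticToken  string
	LookupAPIKey func(key string) (Principal, error) // stand-in for a key store
	VerifyJWT    func(token string) (Principal, error) // stand-in for Firebase
}

// Authenticate checks the static token first, then cg_-prefixed API keys,
// and finally treats anything else as a Firebase JWT.
func (a Authenticator) Authenticate(credential string) (Principal, error) {
	if credential == "" {
		return Principal{}, errors.New("missing credential")
	}
	// Constant-time comparison avoids leaking the token through timing.
	if a.StaticToken != "" &&
		subtle.ConstantTimeCompare([]byte(credential), []byte(a.StaticToken)) == 1 {
		return Principal{Method: "static", Subject: "service"}, nil
	}
	if strings.HasPrefix(credential, "cg_") {
		return a.LookupAPIKey(credential)
	}
	return a.VerifyJWT(credential)
}

// newDemoAuthenticator wires in fake verifiers with fixed accepted values.
func newDemoAuthenticator() Authenticator {
	return Authenticator{
		StaticToken: "ci-token",
		LookupAPIKey: func(key string) (Principal, error) {
			if key != "cg_abc123" {
				return Principal{}, errors.New("unknown API key")
			}
			return Principal{Method: "api_key", Subject: "user-1"}, nil
		},
		VerifyJWT: func(token string) (Principal, error) {
			if token != "valid.jwt.token" {
				return Principal{}, errors.New("invalid JWT")
			}
			return Principal{Method: "firebase", Subject: "user-2"}, nil
		},
	}
}

func main() {
	a := newDemoAuthenticator()
	for _, c := range []string{"ci-token", "cg_abc123", "valid.jwt.token", "garbage"} {
		p, err := a.Authenticate(c)
		fmt.Printf("%-16s method=%q err=%v\n", c, p.Method, err)
	}
}
```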
Role Hierarchy
| Role | Level | Capabilities |
|------|-------|-------------|
| Viewer | 1 | Read-only access to org data |
| Member | 2 | + trigger scans, view details |
| Admin | 3 | + manage connectors, rules, members |
| Owner | 4 | + delete org, manage billing |
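Because each level adds to the one below it, a permission check reduces to an integer comparison. A minimal sketch, with a hypothetical `Role` type and `Allows` method:

```go
package main

import "fmt"

// Role mirrors the numeric levels in the table above.
type Role int

const (
	Viewer Role = iota + 1 // 1
	Member                 // 2
	Admin                  // 3
	Owner                  // 4
)

// Allows reports whether a caller holding role r meets the required level.
// Values outside the Viewer..Owner range are never allowed.
func (r Role) Allows(required Role) bool {
	return r >= Viewer && r <= Owner && r >= required
}

func main() {
	fmt.Println(Admin.Allows(Member))  // true: Admin (3) >= Member (2)
	fmt.Println(Viewer.Allows(Member)) // false: Viewer (1) < Member (2)
}
```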
Org Scoping
All API calls are scoped to an organization via the X-Org-ID header. The header value is verified against the user's org membership to prevent cross-tenant access.
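One way to enforce this is an HTTP middleware that rejects a request before it reaches any handler. The sketch below uses plain `net/http` rather than a Connect-RPC interceptor; `requireOrg`, `membershipLookup`, and `statusFor` are hypothetical names, as are the chosen status codes. The authenticated user is assumed to be resolved by an earlier auth step, represented here by the `userOf` callback.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// membershipLookup is hypothetical: it reports whether userID belongs to orgID.
type membershipLookup func(userID, orgID string) bool

// requireOrg rejects requests whose X-Org-ID header is missing or names an
// organization the authenticated user does not belong to.
func requireOrg(userOf func(*http.Request) string, isMember membershipLookup,
	next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		orgID := r.Header.Get("X-Org-ID")
		if orgID == "" {
			http.Error(w, "missing X-Org-ID", http.StatusBadRequest)
			return
		}
		if !isMember(userOf(r), orgID) {
			http.Error(w, "not a member of this organization", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends a demo request as user "alice" (a member of "org-1" only)
// and returns the resulting HTTP status code.
func statusFor(orgID string) int {
	members := map[string]bool{"alice/org-1": true}
	handler := requireOrg(
		func(*http.Request) string { return "alice" },
		func(userID, org string) bool { return members[userID+"/"+org] },
		http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
			w.WriteHeader(http.StatusOK)
		}),
	)
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	if orgID != "" {
		req.Header.Set("X-Org-ID", orgID)
	}
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("org-1"), statusFor("org-2"), statusFor(""))
}
```

The key property is that the header is treated as a claim to be checked against stored memberships, never as proof of access on its own.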
Remediation Engine
The remediation system supports two modes:
Direct Mode
Executes GCP API calls directly (e.g., update Cloud Run service, delete AR images).
GitHub PR Mode
When a project has a linked GitHub repository:
- Fix agent (Gemini-powered) generates Terraform changes
- Creates a PR via the GitHub App
- PR is tracked until merge/close
- Post-merge re-scan verifies the fix
Remediation Flow
Plan → Dedup → Create Action → Execute → Refresh Snapshots → Verify Savings
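The flow can be read as an ordered pipeline of stages. The sketch below assumes the flow stops at the first failing stage, which the source does not confirm; the `stage` type, `runFlow`, and `demoFlow` are hypothetical, and the stage bodies are placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// stage is one named step of the remediation flow.
type stage struct {
	name string
	run  func() error
}

// runFlow executes stages in order and stops at the first error. It returns
// the names of the stages that completed.
func runFlow(stages []stage) ([]string, error) {
	var completed []string
	for _, s := range stages {
		if err := s.run(); err != nil {
			return completed, fmt.Errorf("%s: %w", s.name, err)
		}
		completed = append(completed, s.name)
	}
	return completed, nil
}

// demoFlow builds the six documented stages with placeholder bodies; the
// stage named by failAt (if any) returns an error.
func demoFlow(failAt string) ([]string, error) {
	names := []string{"Plan", "Dedup", "Create Action", "Execute",
		"Refresh Snapshots", "Verify Savings"}
	stages := make([]stage, 0, len(names))
	for _, n := range names {
		name := n // capture a per-iteration copy for the closure
		stages = append(stages, stage{name: name, run: func() error {
			if name == failAt {
				return errors.New("simulated failure")
			}
			return nil
		}})
	}
	return runFlow(stages)
}

func main() {
	done, err := demoFlow("Execute")
	fmt.Println(done, err)
}
```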
Data Model
All data lives in the Firestore (default) database. Key collections:
| Collection | Purpose |
|-----------|---------|
| users | User profiles (synced from Firebase Auth) |
| organizations | Multi-tenant org records |
| memberships | User ↔ org role bindings |
| connectors | GCP project connections |
| connector_credentials | Envelope-encrypted SA keys |
| cost_snapshots | Per-resource cost data points |
| remediation_actions | Planned/executed remediation records |
| cost_alerts | Generated cost anomaly alerts |
| scan_cycles | Scan cycle metadata |
| scan_project_results | Per-project scan results |