Cloud Infrastructure · Cost & Health Summary
Monthly Cloud Spend
$284K
▲ 12.3% vs last month
Budget: $310K · On track
Infrastructure Savings
$47K
▲ 28% YTD optimization
vs $36.7K same period last yr
Network Egress Cost
$31.4K
▲ 8.7% MoM · Header reduction active
Target: $24K · Initiative active
Overall Reliability
99.94%
▲ SLO compliant
Target: 99.9% · ≈8.8 h/yr allowed downtime
Active Clusters
14
— Stable · GKE
847 nodes · 6,240 pods
Cost Attribution
87%
▲ Tagged & attributed
Target: 95% · Improving
Open Incidents
3
▲ 1 critical · 2 warning
Avg MTTR: 23 minutes
Security Alerts
7
▼ 42% vs last month
2 critical · 3 high · 2 medium
Monthly Cloud Spend Trend
Total infrastructure cost with forecast
Cost by Service
GCP resource distribution
SLO Compliance by Team
Last 30 days availability
Incident Volume
Weekly P1–P3 incidents
Team Cost Attribution
Spend allocation by team
Top Cost Drivers — Last 30 Days
| Service / Resource | Team | Environment | Monthly Cost | vs Budget | Trend | Utilization | Status |
|---|---|---|---|---|---|---|---|
| GKE Cluster — prod-us-east | Platform | Production | $62,840 | +14.2% | 📈 | 78% | Review |
| BigQuery — Analytics DWH | Data | Production | $41,200 | -3.1% | 📉 | 55% | Optimized |
| Cloud SQL — Postgres Prod | Backend | Production | $28,750 | +6.8% | 📈 | 64% | Autoscale |
| Network Egress — CDN | Platform | Production | $31,440 | +23.5% | 🔺 | 92% | Critical |
| GCS Buckets — Data Lake | Data | Mixed | $18,900 | -8.2% | 📉 | 42% | Cold migrated |
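A minimal sketch for reproducing this table from the standard BigQuery billing export, assuming costs carry a `team` resource label; the export table ID below is a placeholder, not this project's real table.

```python
# Sketch: top cost drivers from the standard GCP billing export.
# BILLING_TABLE and the `team` label key are placeholders for this
# project's actual export configuration.
from google.cloud import bigquery

BILLING_TABLE = "my-project.billing.gcp_billing_export_v1_XXXXXX"  # hypothetical

QUERY = f"""
SELECT
  service.description AS service,
  (SELECT value FROM UNNEST(labels) WHERE key = 'team') AS team,
  SUM(cost) AS monthly_cost
FROM `{BILLING_TABLE}`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY service, team
ORDER BY monthly_cost DESC
LIMIT 5
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(f"{row.service:<28} {row.team or 'unattributed':<10} ${row.monthly_cost:,.0f}")
```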
Cost Analysis · FinOps Optimization
Total GCP Spend
$284K
▲ 12.3% MoM
Forecast: $301K next month
Optimization Savings
$47.2K
▲ YTD achieved savings
CUD + Preemptible + Storage
Waste / Idle Resources
$18.6K
▼ 15% vs last quarter
34 resources flagged
CUD Coverage
68%
▲ Committed Use Discounts
Target: 80% · Renew Q2
Cost Breakdown by GCP Product
Last 30 days · All environments
Daily Spend Heatmap
90-day spend with anomalies
Cost by Region
GCP regional distribution
Savings Plan Coverage
CUD vs On-demand vs Spot
Unit Economics
Cost per request / user
Cost Attribution by Team — Detailed View
| Team | Budget | Actual Spend | Variance | % of Total | Biggest Driver | YTD Savings | Attribution |
|---|---|---|---|---|---|---|---|
| Platform | $120K | $118,400 | -$1,600 | 42% | GKE | $22.1K | 95% |
| Data | $72K | $76,800 | +$4,800 | 27% | BigQuery | $14.8K | 82% |
| Backend | $55K | $52,200 | -$2,800 | 18% | Cloud SQL | $7.2K | 91% |
| Frontend | $15K | $14,100 | -$900 | 5% | CDN / Firebase | $2.3K | 98% |
| Unattributed | — | $22,500 | Untagged | 8% | Misc GCE | — | 0% |
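The Attribution column can be derived the same way; a hedged sketch that measures the share of spend carrying a `team` label, again with a placeholder billing-export table ID.

```python
# Sketch: attribution coverage = share of spend that carries a `team`
# label. The table ID is a placeholder for the real billing export.
from google.cloud import bigquery

client = bigquery.Client()
row = next(iter(client.query("""
SELECT
  SUM(cost) AS total,
  SUM(IF((SELECT value FROM UNNEST(labels) WHERE key = 'team') IS NOT NULL,
         cost, 0)) AS attributed
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
""").result()))
print(f"Attribution coverage: {row.attributed / row.total:.0%}")
```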
Kubernetes · GKE Cluster Management
Total Clusters
14
12 Standard · 2 Autopilot
Running Pods
6,240
▲ 99.2% healthy
Node Utilization
71%
Target: 75–85%
Pending Pods
47
▲ Scheduling backlog
HPA Scaling Events
284
▼ 12% vs last week
Cluster CPU & Memory Utilization
7-day rolling average by cluster
Pod Status Distribution
Across all namespaces
Autoscaling Events — Last 7 Days
HPA scale-up vs scale-down events
Node Pool Cost vs Utilization
Bubble = cost · Position = efficiency
Cluster Health Overview
| Cluster | Region | Nodes | Pods | CPU | Memory | Version | Status |
|---|---|---|---|---|---|---|---|
| prod-us-east-1 | us-east4 | 128 | 1,842 | 78% | 81% | 1.29.3 | Healthy |
| prod-eu-west-1 | europe-west1 | 96 | 1,240 | 65% | 70% | 1.29.3 | Healthy |
| staging-us-central | us-central1 | 32 | 487 | 42% | 48% | 1.28.7 | Upgrade needed |
| data-processing-1 | us-east4 | 64 | 892 | 91% | 88% | 1.29.3 | High Load |
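The Pending Pods backlog above is a direct control-plane query; a sketch using the official Kubernetes Python client, run once per cluster against the active kubeconfig context.

```python
# Sketch: count pending pods and group the scheduling backlog by
# namespace with the official Kubernetes Python client.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
backlog = Counter(pod.metadata.namespace for pod in pending.items)

print(f"Pending pods: {len(pending.items)}")
for namespace, count in backlog.most_common(5):
    print(f"  {namespace}: {count}")
```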
Network Egress · Cost Reduction Initiative
Monthly Egress Cost
$31.4K
▲ 23.5% · Critical
Target: $24K by Q2
Unnecessary Headers
847 GB
▲ Overhead identified
Est. savings: $4,200/mo
Total Egress Volume
12.4 TB
▲ 18% MoM growth
Compressed: 9.1 TB
Compression Ratio
1.36×
▲ Brotli enabled
Saving ~2.4 TB/month
Egress Cost by Destination
Internet, inter-region, CDN
Daily Egress Volume (TB)
30-day trend with anomalies
Top Services by Egress
Bytes out per service
Header Overhead Analysis
Unnecessary bytes per endpoint
🚨 Network Egress Alerts
Critical: Egress spike on api-gateway-prod · +340% vs baseline
api-gateway-prod is unexpectedly generating 4.2 TB/day. Root cause: large response payloads with uncompressed JSON; headers contribute a further 680 GB of overhead.
Warning: Cross-region traffic from us-east4 to europe-west1 up 67%
Data pipeline jobs are routing through suboptimal regions. Estimated excess cost: $2,100/month. Consider regional routing optimization.
Info: CDN cache hit rate dropped to 62% (was 81%)
Cache invalidation policy change by Frontend team resulted in more origin requests. Est. additional egress cost: $800/month.
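For the header-removal work flagged above, a rough way to size per-endpoint overhead, as in the Header Overhead Analysis panel; the endpoint URL is a placeholder, and the removable set mirrors the headers named in the egress initiative (X-Debug, X-Request-Id-Verbose).

```python
# Sketch: estimate response-header overhead per endpoint. URLs are
# placeholders; the removable set mirrors the headers named in the
# egress initiative. Byte counts assume HTTP/1.1 framing (HTTP/2 HPACK
# compresses headers, so wire savings there are smaller).
import requests

ENDPOINTS = ["https://api.example.com/v1/users"]  # hypothetical
REMOVABLE = {"x-debug", "x-request-id-verbose"}

for url in ENDPOINTS:
    resp = requests.get(url, timeout=10)
    # "Key: Value\r\n" costs len(key) + len(value) + 4 bytes per response
    total = sum(len(k) + len(v) + 4 for k, v in resp.headers.items())
    waste = sum(len(k) + len(v) + 4 for k, v in resp.headers.items()
                if k.lower() in REMOVABLE)
    print(f"{url}: {total} header bytes/response, {waste} removable")
```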
Storage & Data Warehouse · Tiering Optimization
Total Storage Cost
$18.9K
▼ 8.2% post-migration
GCS + Coldline + Archive
Cold Storage Migrated
48 TB
▲ Saving $4.8K/mo
Infrequently accessed buckets
Unused Datasets (BQ)
127
▲ Not queried in 90+ days
Potential savings: $6.2K/mo
Total Data Volume
412 TB
▲ 8% MoM growth
GCS: 288 TB · BQ: 124 TB
Storage Tier Distribution
Standard → Nearline → Coldline → Archive
BigQuery Usage & Cost
Bytes processed vs cost
GCS Buckets — Storage Optimization Candidates
| Bucket Name | Team | Size | Last Access | Current Tier | Recommended Tier | Monthly Savings | Action |
|---|---|---|---|---|---|---|---|
| analytics-raw-logs-2023 | Data | 18.4 TB | 142 days | Standard | Archive | $1,840/mo | Pending |
| ml-training-datasets-v1 | Data | 12.1 TB | 67 days | Standard | Coldline | $860/mo | Pending |
| backup-postgres-weekly | Platform | 8.6 TB | 180 days | Nearline | Coldline | $430/mo | Approved |
| app-assets-cdn | Frontend | 2.3 TB | 1 day | Standard | Standard | — | Optimal |
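The Recommended Tier transitions can be automated with lifecycle rules; a sketch with google-cloud-storage. Note that GCS `age` conditions count from object creation, so the access-based tiering this table implies would lean on Autoclass or a `days_since_custom_time` condition instead.

```python
# Sketch: add a lifecycle rule demoting objects older than 90 days to
# Coldline. GCS `age` counts days since creation, not last access;
# access-based tiering would use Autoclass or days_since_custom_time.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("ml-training-datasets-v1")  # from the table above

bucket.add_lifecycle_set_storage_class_rule(storage_class="COLDLINE", age=90)
bucket.patch()

print(list(bucket.lifecycle_rules))
```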
Autoscaling · Compute & Database Optimization
Autoscaling Efficiency
84.2%
▲ Properly scaled events
Target: 90% · Improving
Scale Events (7d)
1,847
▼ 12% · More stable
284 HPA · 1,563 GKE Autopilot
Over-provisioned Resources
$14.2K
▲ Wasted spend/month
23 services identified
DB Autoscale Coverage
71%
▲ Cloud SQL + Spanner
Target: 85% · 4 DBs remaining
CPU Utilization Distribution
Compute nodes — 7-day view
Scaling Events Timeline
Scale-up vs scale-down by hour
Database Autoscaling Configuration
| Database | Type | Current Size | Autoscale | Min / Max | Avg CPU | Avg Mem | Recommendation |
|---|---|---|---|---|---|---|---|
| postgres-prod-main | Cloud SQL | db-n2-highmem-16 | Enabled | 8 / 32 vCPU | 72% | 81% | Optimal |
| spanner-analytics | Cloud Spanner | 3 nodes | Enabled | 1 / 10 nodes | 45% | 38% | Scale down |
| redis-prod-sessions | Memorystore | 12 GB | Disabled | — | 87% | 91% | Enable auto |
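A sketch of the threshold heuristic the Recommendation column suggests; the 85%/50% cut-offs are illustrative assumptions, not values taken from this dashboard.

```python
# Sketch: rightsizing heuristic behind a "Recommendation" column.
# Thresholds (85% hot, 50% cold) are illustrative assumptions.
def recommend(avg_cpu: float, avg_mem: float, autoscale: bool) -> str:
    hottest = max(avg_cpu, avg_mem)
    if hottest >= 0.85:
        return "Enable auto" if not autoscale else "Scale up / investigate"
    if hottest <= 0.50:
        return "Scale down"
    return "Optimal"

fleet = [  # (name, avg CPU, avg mem, autoscaling enabled) from the table
    ("postgres-prod-main", 0.72, 0.81, True),
    ("spanner-analytics", 0.45, 0.38, True),
    ("redis-prod-sessions", 0.87, 0.91, False),
]
for name, cpu, mem, auto in fleet:
    print(f"{name:<22} {recommend(cpu, mem, auto)}")
```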
Observability · Metrics · APM · Logs (Datadog)
Avg Request Latency
47ms
▼ p99: 142ms · Good
SLO Target: p99 < 200ms
Requests / Second
84.2K
▲ Peak: 140K RPS
30d avg across all services
Error Rate
0.08%
▼ 5xx errors · Below SLO
SLO: <0.1% · Compliant
Active Monitors
2,847
142 alerting · 8 silenced
Datadog monitors
Request Latency (p50 / p95 / p99)
24-hour view in milliseconds
Error Rate by Service
5xx errors last 7 days
Traffic Heatmap — Requests/Hour by Day
Last 7 days · Each cell = 1 hr · Color = request volume (low → high; incident hours highlighted)
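The Active Monitors card can be reproduced against Datadog's v1 monitor endpoint with plain HTTP; keys come from the environment, and the `env:prod` tag filter is an assumption about this account's tagging.

```python
# Sketch: list currently-alerting Datadog monitors via the v1 API.
# API/app keys come from the environment; the tag filter is an assumption.
import os
import requests

resp = requests.get(
    "https://api.datadoghq.com/api/v1/monitor",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    params={"monitor_tags": "env:prod"},
    timeout=10,
)
resp.raise_for_status()

alerting = [m for m in resp.json() if m["overall_state"] == "Alert"]
print(f"{len(alerting)} monitors alerting")
for monitor in alerting[:10]:
    print(f"  [{monitor['id']}] {monitor['name']}")
```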
Incidents & SLOs · Reliability Engineering
Open Incidents
3
1 P1 · 2 P2
Avg age: 4.2 hours
MTTR (30d avg)
23 min
▼ 18% vs last month
Target: <30 min
SLO Compliance
99.94%
▲ Error budget: 82%
Target: 99.9% · Healthy
On-Call Rotations
8
Americas + Europe
24/7 coverage ensured
Error Budget Burn Rate
Remaining budget by service
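For reference, the arithmetic behind error-budget burn; remaining budget depends on the SLO window, so a spot-check over an arbitrary window need not match the 82% shown above.

```python
# Sketch: error-budget arithmetic. The remaining fraction depends on the
# measurement window, so it need not match the dashboard's 82% figure.
def error_budget_remaining(measured: float, slo: float) -> float:
    """Fraction of the error budget left: 1 - burned/allowed."""
    allowed = 1.0 - slo       # e.g. 0.1% unavailability for a 99.9% SLO
    burned = 1.0 - measured   # observed unavailability in the window
    return max(0.0, 1.0 - burned / allowed)

# 99.94% measured against a 99.9% SLO leaves 40% of the budget
print(f"{error_budget_remaining(0.9994, 0.999):.0%} remaining")
```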
Incident Trend — 12 Weeks
P1/P2/P3 volumes
Active Incidents
P1 · INC-8847 · data-processing cluster high CPU · Ongoing
data-processing-1 cluster sustaining 91% CPU across 64 nodes. Autoscaler not responding. 3 engineers engaged. ETA resolution: 45 min.
P2 · INC-8846 · Network egress spike · Investigation
api-gateway-prod generating 4.2 TB/day vs 1.2 TB baseline. Potential misconfigured route leaking uncompressed responses.
P2 · INC-8845 · Redis memory at 91% · Monitoring
redis-prod-sessions approaching OOM. Autoscaling disabled. Manual intervention required to expand instance size.
Security · IAM · RBAC · Network Policy
Security Findings
7
▼ 42% vs last month
2 critical · 3 high · 2 medium
Overprivileged IAM
34
▲ Service accounts
Owner/Editor roles flagged
mTLS Coverage
94%
▲ Istio mesh · Strict mode
Target: 100% · 6 services remain
Cert Expiry (30d)
3
TLS certs expiring soon
Auto-renew configured
Security Findings Over Time
Critical / High / Medium / Low trend
IAM Policy Distribution
Role assignments by privilege level
Critical Security Findings
| Finding | Category | Resource | Severity | Age | Owner | Status |
|---|---|---|---|---|---|---|
| Service account with owner role | IAM | sa-data-pipeline@proj | Critical | 14d | Data Team | Remediation |
| Public GCS bucket exposed | Storage | ml-training-datasets-v1 | Critical | 3d | Data Team | In Review |
| Istio PeerAuthentication PERMISSIVE | Network | namespace: legacy | High | 21d | Platform | Planned |
| Unused service account keys (8) | IAM | Multiple | Medium | 45d | All Teams | Backlog |
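The Overprivileged IAM count boils down to scanning project bindings for basic roles; a sketch using the Cloud Resource Manager v1 API, with a placeholder project ID.

```python
# Sketch: flag service accounts holding basic Owner/Editor roles at the
# project level. PROJECT_ID is a placeholder.
from googleapiclient import discovery

PROJECT_ID = "my-project"
BROAD_ROLES = {"roles/owner", "roles/editor"}

crm = discovery.build("cloudresourcemanager", "v1")
policy = crm.projects().getIamPolicy(resource=PROJECT_ID, body={}).execute()

for binding in policy.get("bindings", []):
    if binding["role"] in BROAD_ROLES:
        for member in binding.get("members", []):
            if member.startswith("serviceAccount:"):
                print(f"{binding['role']}: {member}")
```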
CI/CD Pipeline · Deployment Health
Deployment Success Rate
97.8%
▲ Last 30 days
Target: 98% · Near goal
Avg Deploy Time
8.4 min
▼ 22% via build caching
Target: <10 min
Deploys / Day
47
▲ 8 active teams
Peak: 84 on Thursdays
Failed Pipelines (7d)
12
▼ 5 test · 7 build
MTTR: 8 minutes avg
Deployment Frequency
Daily deploys by environment
Pipeline Duration Trend
Build time in minutes — 30d
Recent Pipeline Runs
| Pipeline | Team | Branch | Status | Duration | Triggered | Environment |
|---|---|---|---|---|---|---|
| api-gateway deploy | Backend | main | ✓ Success | 6m 42s | 10 min ago | prod |
| data-pipeline build | Data | release/v2.4 | ✗ Failed | 14m 08s | 25 min ago | staging |
| frontend deploy | Frontend | main | ✓ Success | 4m 11s | 1h ago | prod |
| k8s-infra apply | Platform | infra/scale-v2 | ✓ Success | 11m 33s | 2h ago | prod |
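The success-rate and duration cards fall out of simple aggregation over run records; a sketch with a hypothetical record shape that any CI system's API can be mapped onto.

```python
# Sketch: aggregate pipeline runs into the success-rate and average-
# duration figures above. The record shape is hypothetical.
from datetime import timedelta

runs = [
    {"status": "success", "duration": timedelta(minutes=6, seconds=42)},
    {"status": "failed", "duration": timedelta(minutes=14, seconds=8)},
    {"status": "success", "duration": timedelta(minutes=4, seconds=11)},
]

successes = sum(1 for r in runs if r["status"] == "success")
avg = sum((r["duration"] for r in runs), timedelta()) / len(runs)
print(f"success rate {successes / len(runs):.1%}, avg duration {avg}")
```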
FinOps Initiatives · Roadmap & Timeline
Active Initiatives
8
3 On Track · 2 At Risk
Q1–Q2 2025
Projected Savings
$84K
▲ If all complete
Per annum target
At Risk
2
▲ Delayed
Header removal · DB autoscale
Completed (Q1)
4
▲ Cold storage · CUD · Tags
Saved $31.4K so far
Initiative Savings Projection
Actual vs projected savings by initiative
Progress by Category
Completion rate
Network Egress Reduction · 35%
Cold Storage Migration · 82%
DB Autoscaling Rollout · 55%
Cost Attribution (95% tagged) · 87%
CUD Coverage to 80% · 68%
Istio mTLS Strict Mode · 94%
Initiative Timeline
Q1–Q2 2025 engineering initiatives
✅ Cold Storage Migration — Phase 1
Jan 15 – Feb 20, 2025
Migrated 48 TB of infrequently accessed data from Standard to Coldline/Archive. Monthly savings: $4,800. Automated lifecycle policies deployed for 12 buckets.
✅ Committed Use Discount (CUD) Renewal
Feb 1 – Feb 28, 2025
Renewed 3-year CUD for GKE compute workloads. Coverage increased from 52% to 68%. Annual savings: $22,400 vs on-demand pricing.
🔄 Network Egress Optimization — Header Removal
Mar 1 – Apr 15, 2025 · IN PROGRESS (35%)
Identifying and removing unnecessary response headers (X-Debug, X-Request-Id-Verbose, legacy CORS headers). Estimated savings: $4,200/month. Currently in staging validation.
⚠ Database Autoscaling Rollout
Mar 15 – May 30, 2025 · AT RISK (55%)
Enable autoscaling for remaining 4 databases (Redis, Mongo, 2x Cloud SQL staging). Blocked on DBA approval for redis-prod-sessions. Risk: delayed by 2 weeks.
📋 Cost Attribution — 95% Tag Coverage
Apr 1 – May 15, 2025 · PLANNED (0%)
Implement a mandatory labeling policy via OPA/Gatekeeper so every GCP resource is tagged with team, environment, and service labels. Current gap: $22.5K unattributed. A coverage-audit sketch follows below.
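Gatekeeper enforcement itself lives in YAML/Rego; as a pre-rollout audit, a sketch that checks label coverage via Cloud Asset Inventory. The project ID and asset types are assumptions.

```python
# Sketch: audit required-label coverage with Cloud Asset Inventory ahead
# of the Gatekeeper rollout. Project ID and asset types are assumptions.
from google.cloud import asset_v1

REQUIRED = {"team", "environment", "service"}
client = asset_v1.AssetServiceClient()

assets = client.list_assets(request={
    "parent": "projects/my-project",  # placeholder
    "content_type": asset_v1.ContentType.RESOURCE,
    "asset_types": ["compute.googleapis.com/Instance"],
})
for asset in assets:
    data = asset.resource.data
    labels = dict(data["labels"]) if "labels" in data else {}
    missing = REQUIRED - labels.keys()
    if missing:
        print(f"{asset.name}: missing {sorted(missing)}")
```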
📋 BigQuery Optimization — Partition & Cluster
May 1 – Jun 30, 2025 · PLANNED (0%)
Partition 127 inactive datasets and enable clustering on the top 10 most expensive tables. Projected reduction: ~40% of query costs, ≈$16,500/year in savings.
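The partition-and-cluster pattern in one sketch: rewrite an expensive table into a partitioned, clustered destination. All table IDs and column names are placeholders.

```python
# Sketch: materialize a partitioned + clustered copy of a hot table.
# Table IDs, the partition column, and clustering keys are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    destination="my-project.analytics.events_partitioned",
    time_partitioning=bigquery.TimePartitioning(field="event_date"),
    clustering_fields=["team", "service"],
    write_disposition="WRITE_TRUNCATE",
)
client.query(
    "SELECT * FROM `my-project.analytics.events`",
    job_config=job_config,
).result()
```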