Cloud Infrastructure · Cost & Health Summary
Monthly Cloud Spend
$284K
▲ 12.3% vs last month
Budget: $310K · On track
Infrastructure Savings
$47K
▲ 28% YTD optimization
vs $36.7K same period last yr
Network Egress Cost
$31.4K
▲ 8.7% MoM · Header reduction active
Target: $24K · Initiative active
Overall Reliability
99.94%
▲ SLO compliant
Target: 99.9% · ≈8.8 h/yr allowed downtime
Active Clusters
14
— Stable · GKE
847 nodes · 6,240 pods
Cost Attribution
87%
▲ Tagged & attributed
Target: 95% · Improving
Open Incidents
3
▲ 1 critical · 2 warning
Avg MTTR: 23 minutes
Security Alerts
7
▼ 42% vs last month
2 critical · 3 high · 2 medium
Monthly Cloud Spend Trend
Total infrastructure cost with forecast
Cost by Service
GCP resource distribution
SLO Compliance by Team
Last 30 days availability
Incident Volume
Weekly P1–P3 incidents
Team Cost Attribution
Spend allocation by team
Top Cost Drivers — Last 30 Days
| Service / Resource | Team | Environment | Monthly Cost | vs Budget | Trend | Utilization | Status |
|---|---|---|---|---|---|---|---|
| GKE Cluster — prod-us-east | Platform | Production | $62,840 | +14.2% | 📈 | 78% | Review |
| BigQuery — Analytics DWH | Data | Production | $41,200 | -3.1% | 📉 | 55% | Optimized |
| Cloud SQL — Postgres Prod | Backend | Production | $28,750 | +6.8% | 📈 | 64% | Autoscale |
| Network Egress — CDN | Platform | Production | $31,440 | +23.5% | 🔺 | 92% | Critical |
| GCS Buckets — Data Lake | Data | Mixed | $18,900 | -8.2% | 📉 | 42% | Cold migrated |
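A minimal sketch for reproducing this table from the standard BigQuery billing export, assuming costs carry a `team` resource label; the export table ID below is a placeholder, not this project's real table.

```python
# Sketch: top cost drivers from the standard GCP billing export.
# BILLING_TABLE and the `team` label key are placeholders for this
# project's actual export configuration.
from google.cloud import bigquery

BILLING_TABLE = "my-project.billing.gcp_billing_export_v1_XXXXXX"  # hypothetical

QUERY = f"""
SELECT
  service.description AS service,
  (SELECT value FROM UNNEST(labels) WHERE key = 'team') AS team,
  SUM(cost) AS monthly_cost
FROM `{BILLING_TABLE}`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY service, team
ORDER BY monthly_cost DESC
LIMIT 5
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(f"{row.service:<28} {row.team or 'unattributed':<10} ${row.monthly_cost:,.0f}")
```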
Cost Analysis · FinOps Optimization
Total GCP Spend
$284K
▲ 12.3% MoM
Forecast: $301K next month
Optimization Savings
$47.2K
▲ YTD achieved savings
CUD + Preemptible + Storage
Waste / Idle Resources
$18.6K
▼ 15% vs last quarter
34 resources flagged
CUD Coverage
68%
▲ Committed Use Discounts
Target: 80% · Renew Q2
Cost Breakdown by GCP Product
Last 30 days · All environments
Daily Spend Heatmap
90-day spend with anomalies
Cost by Region
GCP regional distribution
Savings Plan Coverage
CUD vs On-demand vs Spot
Unit Economics
Cost per request / user
Cost Attribution by Team — Detailed View
| Team | Budget | Actual Spend | Variance | % of Total | Biggest Driver | YTD Savings | Attribution |
|---|---|---|---|---|---|---|---|
| Platform | $120K | $118,400 | -$1,600 | 42% | GKE | $22.1K | 95% |
| Data | $72K | $76,800 | +$4,800 | 27% | BigQuery | $14.8K | 82% |
| Backend | $55K | $52,200 | -$2,800 | 18% | Cloud SQL | $7.2K | 91% |
| Frontend | $15K | $14,100 | -$900 | 5% | CDN / Firebase | $2.3K | 98% |
| Unattributed | — | $22,500 | Untagged | 8% | Misc GCE | — | 0% |
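The Attribution column can be derived the same way; a hedged sketch that measures the share of spend carrying a `team` label, again with a placeholder billing-export table ID.

```python
# Sketch: attribution coverage = share of spend that carries a `team`
# label. The table ID is a placeholder for the real billing export.
from google.cloud import bigquery

client = bigquery.Client()
row = next(iter(client.query("""
SELECT
  SUM(cost) AS total,
  SUM(IF((SELECT value FROM UNNEST(labels) WHERE key = 'team') IS NOT NULL,
         cost, 0)) AS attributed
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
""").result()))
print(f"Attribution coverage: {row.attributed / row.total:.0%}")
```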
Kubernetes · GKE Cluster Management
Total Clusters
14
12 Standard · 2 Autopilot
Running Pods
6,240
▲ 99.2% healthy
Node Utilization
71%
Target: 75–85%
Pending Pods
47
▲ Scheduling backlog
HPA Scaling Events
284
▼ 12% vs last week
Cluster CPU & Memory Utilization
7-day rolling average by cluster
Pod Status Distribution
Across all namespaces
Autoscaling Events — Last 7 Days
HPA scale-up vs scale-down events
Node Pool Cost vs Utilization
Bubble = cost · Position = efficiency
Cluster Health Overview
| Cluster | Region | Nodes | Pods | CPU | Memory | Version | Status |
|---|---|---|---|---|---|---|---|
| prod-us-east-1 | us-east4 | 128 | 1,842 | 78% | 81% | 1.29.3 | Healthy |
| prod-eu-west-1 | europe-west1 | 96 | 1,240 | 65% | 70% | 1.29.3 | Healthy |
| staging-us-central | us-central1 | 32 | 487 | 42% | 48% | 1.28.7 | Upgrade needed |
| data-processing-1 | us-east4 | 64 | 892 | 91% | 88% | 1.29.3 | High Load |
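The Pending Pods backlog above is a direct control-plane query; a sketch using the official Kubernetes Python client, run once per cluster against the active kubeconfig context.

```python
# Sketch: count pending pods and group the scheduling backlog by
# namespace with the official Kubernetes Python client.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
backlog = Counter(pod.metadata.namespace for pod in pending.items)

print(f"Pending pods: {len(pending.items)}")
for namespace, count in backlog.most_common(5):
    print(f"  {namespace}: {count}")
```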
Network Egress · Cost Reduction Initiative
Monthly Egress Cost
$31.4K
▲ 23.5% · Critical
Target: $24K by Q2
Unnecessary Headers
847 GB
▲ Overhead identified
Est. savings: $4,200/mo
Total Egress Volume
12.4 TB
▲ 18% MoM growth
Compressed: 9.1 TB
Compression Ratio
1.36×
▲ Brotli enabled
Saving ~2.4 TB/month
Egress Cost by Destination
Internet, inter-region, CDN
Daily Egress Volume (TB)
30-day trend with anomalies
Top Services by Egress
Bytes out per service
Header Overhead Analysis
Unnecessary bytes per endpoint
🚨 Network Egress Alerts
Critical: Egress spike on api-gateway-prod · +340% vs baseline
api-gateway-prod is unexpectedly generating 4.2 TB/day. Root cause: large response payloads with uncompressed JSON; headers contribute a further 680 GB of overhead.
Warning: Cross-region traffic from us-east4 to europe-west1 up 67%
Data pipeline jobs are routing through suboptimal regions. Estimated excess cost: $2,100/month. Consider regional routing optimization.
Info: CDN cache hit rate dropped to 62% (was 81%)
Cache invalidation policy change by Frontend team resulted in more origin requests. Est. additional egress cost: $800/month.
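For the header-removal work flagged above, a rough way to size per-endpoint overhead, as in the Header Overhead Analysis panel; the endpoint URL is a placeholder, and the removable set mirrors the headers named in the egress initiative (X-Debug, X-Request-Id-Verbose).

```python
# Sketch: estimate response-header overhead per endpoint. URLs are
# placeholders; the removable set mirrors the headers named in the
# egress initiative. Byte counts assume HTTP/1.1 framing (HTTP/2 HPACK
# compresses headers, so wire savings there are smaller).
import requests

ENDPOINTS = ["https://api.example.com/v1/users"]  # hypothetical
REMOVABLE = {"x-debug", "x-request-id-verbose"}

for url in ENDPOINTS:
    resp = requests.get(url, timeout=10)
    # "Key: Value\r\n" costs len(key) + len(value) + 4 bytes per response
    total = sum(len(k) + len(v) + 4 for k, v in resp.headers.items())
    waste = sum(len(k) + len(v) + 4 for k, v in resp.headers.items()
                if k.lower() in REMOVABLE)
    print(f"{url}: {total} header bytes/response, {waste} removable")
```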
Storage & Data Warehouse · Tiering Optimization
Total Storage Cost
$18.9K
▼ 8.2% post-migration
GCS + Coldline + Archive
Cold Storage Migrated
48 TB
▲ Saving $4.8K/mo
Infrequently accessed buckets
Unused Datasets (BQ)
127
▲ Not queried in 90+ days
Potential savings: $6.2K/mo
Total Data Volume
412 TB
▲ 8% MoM growth
GCS: 288 TB · BQ: 124 TB
Storage Tier Distribution
Standard → Nearline → Coldline → Archive
BigQuery Usage & Cost
Bytes processed vs cost
GCS Buckets — Storage Optimization Candidates
| Bucket Name | Team | Size | Last Access | Current Tier | Recommended Tier | Monthly Savings | Action |
|---|---|---|---|---|---|---|---|
| analytics-raw-logs-2023 | Data | 18.4 TB | 142 days | Standard | Archive | $1,840/mo | Pending |
| ml-training-datasets-v1 | Data | 12.1 TB | 67 days | Standard | Coldline | $860/mo | Pending |
| backup-postgres-weekly | Platform | 8.6 TB | 180 days | Nearline | Coldline | $430/mo | Approved |
| app-assets-cdn | Frontend | 2.3 TB | 1 day | Standard | Standard | — | Optimal |
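The Recommended Tier transitions can be automated with lifecycle rules; a sketch with google-cloud-storage. Note that GCS `age` conditions count from object creation, so the access-based tiering this table implies would lean on Autoclass or a `days_since_custom_time` condition instead.

```python
# Sketch: add a lifecycle rule demoting objects older than 90 days to
# Coldline. GCS `age` counts days since creation, not last access;
# access-based tiering would use Autoclass or days_since_custom_time.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("ml-training-datasets-v1")  # from the table above

bucket.add_lifecycle_set_storage_class_rule(storage_class="COLDLINE", age=90)
bucket.patch()

print(list(bucket.lifecycle_rules))
```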
Autoscaling · Compute & Database Optimization
Autoscaling Efficiency
84.2%
▲ Properly scaled events
Target: 90% · Improving
Scale Events (7d)
1,847
▼ 12% · More stable
284 HPA · 1,563 GKE Autopilot
Over-provisioned Resources
$14.2K
▲ Wasted spend/month
23 services identified
DB Autoscale Coverage
71%
▲ Cloud SQL + Spanner
Target: 85% · 4 DBs remaining
CPU Utilization Distribution
Compute nodes — 7-day view
Scaling Events Timeline
Scale-up vs scale-down by hour
Database Autoscaling Configuration
| Database | Type | Current Size | Autoscale | Min / Max | Avg CPU | Avg Mem | Recommendation |
|---|---|---|---|---|---|---|---|
| postgres-prod-main | Cloud SQL | db-n2-highmem-16 | Enabled | 8 / 32 vCPU | 72% | 81% | Optimal |
| spanner-analytics | Cloud Spanner | 3 nodes | Enabled | 1 / 10 nodes | 45% | 38% | Scale down |
| redis-prod-sessions | Memorystore | 12 GB | Disabled | — | 87% | 91% | Enable auto |
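A sketch of the threshold heuristic the Recommendation column suggests; the 85%/50% cut-offs are illustrative assumptions, not values taken from this dashboard.

```python
# Sketch: rightsizing heuristic behind a "Recommendation" column.
# Thresholds (85% hot, 50% cold) are illustrative assumptions.
def recommend(avg_cpu: float, avg_mem: float, autoscale: bool) -> str:
    hottest = max(avg_cpu, avg_mem)
    if hottest >= 0.85:
        return "Enable auto" if not autoscale else "Scale up / investigate"
    if hottest <= 0.50:
        return "Scale down"
    return "Optimal"

fleet = [  # (name, avg CPU, avg mem, autoscaling enabled) from the table
    ("postgres-prod-main", 0.72, 0.81, True),
    ("spanner-analytics", 0.45, 0.38, True),
    ("redis-prod-sessions", 0.87, 0.91, False),
]
for name, cpu, mem, auto in fleet:
    print(f"{name:<22} {recommend(cpu, mem, auto)}")
```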
Observability · Metrics · APM · Logs (Datadog)
Avg Request Latency
47ms
▼ p99: 142ms · Good
SLO Target: p99 < 200ms
Requests / Second
84.2K
▲ Peak: 140K RPS
30d avg across all services
Error Rate
0.08%
▼ 5xx errors · Below SLO
SLO: <0.1% · Compliant
Active Monitors
2,847
142 alerting · 8 silenced
Datadog monitors
Request Latency (p50 / p95 / p99)
24-hour view in milliseconds
Error Rate by Service
5xx errors last 7 days
Traffic Heatmap — Requests/Hour by Day
Last 7 days · Each cell = 1 hr · Color = request volume (low → high; incident hours highlighted)
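The Active Monitors card can be reproduced against Datadog's v1 monitor endpoint with plain HTTP; keys come from the environment, and the `env:prod` tag filter is an assumption about this account's tagging.

```python
# Sketch: list currently-alerting Datadog monitors via the v1 API.
# API/app keys come from the environment; the tag filter is an assumption.
import os
import requests

resp = requests.get(
    "https://api.datadoghq.com/api/v1/monitor",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    params={"monitor_tags": "env:prod"},
    timeout=10,
)
resp.raise_for_status()

alerting = [m for m in resp.json() if m["overall_state"] == "Alert"]
print(f"{len(alerting)} monitors alerting")
for monitor in alerting[:10]:
    print(f"  [{monitor['id']}] {monitor['name']}")
```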
Incidents & SLOs · Reliability Engineering
Open Incidents
3
1 P1 · 2 P2
Avg age: 4.2 hours
MTTR (30d avg)
23 min
▼ 18% vs last month
Target: <30 min
SLO Compliance
99.94%
▲ Error budget: 82%
Target: 99.9% · Healthy
On-Call Rotations
8
Americas + Europe
24/7 coverage ensured
Error Budget Burn Rate
Remaining budget by service
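For reference, the arithmetic behind error-budget burn; remaining budget depends on the SLO window, so a spot-check over an arbitrary window need not match the 82% shown above.

```python
# Sketch: error-budget arithmetic. The remaining fraction depends on the
# measurement window, so it need not match the dashboard's 82% figure.
def error_budget_remaining(measured: float, slo: float) -> float:
    """Fraction of the error budget left: 1 - burned/allowed."""
    allowed = 1.0 - slo       # e.g. 0.1% unavailability for a 99.9% SLO
    burned = 1.0 - measured   # observed unavailability in the window
    return max(0.0, 1.0 - burned / allowed)

# 99.94% measured against a 99.9% SLO leaves 40% of the budget
print(f"{error_budget_remaining(0.9994, 0.999):.0%} remaining")
```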
Incident Trend — 12 Weeks
P1/P2/P3 volumes
Active Incidents
P1 · INC-8847 · data-processing cluster high CPU · Ongoing
data-processing-1 cluster sustaining 91% CPU across 64 nodes. Autoscaler not responding. 3 engineers engaged. ETA resolution: 45 min.
P2 · INC-8846 · Network egress spike · Investigation
api-gateway-prod generating 4.2 TB/day vs 1.2 TB baseline. Potential misconfigured route leaking uncompressed responses.
P2 · INC-8845 · Redis memory at 91% · Monitoring
redis-prod-sessions approaching OOM. Autoscaling disabled. Manual intervention required to expand instance size.
Security · IAM · RBAC · Network Policy
Security Findings
7
▼ 42% vs last month
2 critical · 3 high · 2 medium
Overprivileged IAM
34
▲ Service accounts
Owner/Editor roles flagged
mTLS Coverage
94%
▲ Istio mesh · Strict mode
Target: 100% · 6 services remain
Cert Expiry (30d)
3
TLS certs expiring soon
Auto-renew configured
Security Findings Over Time
Critical / High / Medium / Low trend
IAM Policy Distribution
Role assignments by privilege level
Critical Security Findings
| Finding | Category | Resource | Severity | Age | Owner | Status |
|---|---|---|---|---|---|---|
| Service account with owner role | IAM | sa-data-pipeline@proj | Critical | 14d | Data Team | Remediation |
| Public GCS bucket exposed | Storage | ml-training-datasets-v1 | Critical | 3d | Data Team | In Review |
| Istio PeerAuthentication PERMISSIVE | Network | namespace: legacy | High | 21d | Platform | Planned |
| Unused service account keys (8) | IAM | Multiple | Medium | 45d | All Teams | Backlog |
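The Overprivileged IAM count boils down to scanning project bindings for basic roles; a sketch using the Cloud Resource Manager v1 API, with a placeholder project ID.

```python
# Sketch: flag service accounts holding basic Owner/Editor roles at the
# project level. PROJECT_ID is a placeholder.
from googleapiclient import discovery

PROJECT_ID = "my-project"
BROAD_ROLES = {"roles/owner", "roles/editor"}

crm = discovery.build("cloudresourcemanager", "v1")
policy = crm.projects().getIamPolicy(resource=PROJECT_ID, body={}).execute()

for binding in policy.get("bindings", []):
    if binding["role"] in BROAD_ROLES:
        for member in binding.get("members", []):
            if member.startswith("serviceAccount:"):
                print(f"{binding['role']}: {member}")
```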
CI/CD Pipeline · Deployment Health
Deployment Success Rate
97.8%
▲ Last 30 days
Target: 98% · Near goal
Avg Deploy Time
8.4 min
▼ 22% via build caching
Target: <10 min
Deploys / Day
47
▲ 8 active teams
Peak: 84 on Thursdays
Failed Pipelines (7d)
12
▼ 5 test · 7 build
MTTR: 8 minutes avg
Deployment Frequency
Daily deploys by environment
Pipeline Duration Trend
Build time in minutes — 30d
Recent Pipeline Runs
| Pipeline | Team | Branch | Status | Duration | Triggered | Environment |
|---|---|---|---|---|---|---|
| api-gateway deploy | Backend | main | ✓ Success | 6m 42s | 10 min ago | prod |
| data-pipeline build | Data | release/v2.4 | ✗ Failed | 14m 08s | 25 min ago | staging |
| frontend deploy | Frontend | main | ✓ Success | 4m 11s | 1h ago | prod |
| k8s-infra apply | Platform | infra/scale-v2 | ✓ Success | 11m 33s | 2h ago | prod |
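The success-rate and duration cards fall out of simple aggregation over run records; a sketch with a hypothetical record shape that any CI system's API can be mapped onto.

```python
# Sketch: aggregate pipeline runs into the success-rate and average-
# duration figures above. The record shape is hypothetical.
from datetime import timedelta

runs = [
    {"status": "success", "duration": timedelta(minutes=6, seconds=42)},
    {"status": "failed", "duration": timedelta(minutes=14, seconds=8)},
    {"status": "success", "duration": timedelta(minutes=4, seconds=11)},
]

successes = sum(1 for r in runs if r["status"] == "success")
avg = sum((r["duration"] for r in runs), timedelta()) / len(runs)
print(f"success rate {successes / len(runs):.1%}, avg duration {avg}")
```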
FinOps Initiatives · Roadmap & Timeline
Active Initiatives
8
3 On Track · 2 At Risk
Q1–Q2 2025
Projected Savings
$84K
▲ If all complete
Per annum target
At Risk
2
▲ Delayed
Header removal · DB autoscale
Completed (Q1)
4
▲ Cold storage · CUD · Tags
Saved $31.4K so far
Initiative Savings Projection
Actual vs projected savings by initiative
Progress by Category
Completion rate
Network Egress Reduction · 35%
Cold Storage Migration · 82%
DB Autoscaling Rollout · 55%
Cost Attribution (95% tagged) · 87%
CUD Coverage to 80% · 68%
Istio mTLS Strict Mode · 94%
Initiative Timeline
Q1–Q2 2025 engineering initiatives
✅ Cold Storage Migration — Phase 1
Jan 15 – Feb 20, 2025
Migrated 48 TB of infrequently accessed data from Standard to Coldline/Archive. Monthly savings: $4,800. Automated lifecycle policies deployed for 12 buckets.
✅ Committed Use Discount (CUD) Renewal
Feb 1 – Feb 28, 2025
Renewed 3-year CUD for GKE compute workloads. Coverage increased from 52% to 68%. Annual savings: $22,400 vs on-demand pricing.
🔄 Network Egress Optimization — Header Removal
Mar 1 – Apr 15, 2025 · IN PROGRESS (35%)
Identifying and removing unnecessary response headers (X-Debug, X-Request-Id-Verbose, legacy CORS headers). Estimated savings: $4,200/month. Currently in staging validation.
⚠ Database Autoscaling Rollout
Mar 15 – May 30, 2025 · AT RISK (55%)
Enable autoscaling for remaining 4 databases (Redis, Mongo, 2x Cloud SQL staging). Blocked on DBA approval for redis-prod-sessions. Risk: delayed by 2 weeks.
📋 Cost Attribution — 95% Tag Coverage
Apr 1 – May 15, 2025 · PLANNED (0%)
Implement a mandatory labeling policy via OPA/Gatekeeper so every GCP resource is tagged with team, environment, and service labels. Current gap: $22.5K unattributed. A coverage-audit sketch follows below.
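Gatekeeper enforcement itself lives in YAML/Rego; as a pre-rollout audit, a sketch that checks label coverage via Cloud Asset Inventory. The project ID and asset types are assumptions.

```python
# Sketch: audit required-label coverage with Cloud Asset Inventory ahead
# of the Gatekeeper rollout. Project ID and asset types are assumptions.
from google.cloud import asset_v1

REQUIRED = {"team", "environment", "service"}
client = asset_v1.AssetServiceClient()

assets = client.list_assets(request={
    "parent": "projects/my-project",  # placeholder
    "content_type": asset_v1.ContentType.RESOURCE,
    "asset_types": ["compute.googleapis.com/Instance"],
})
for asset in assets:
    data = asset.resource.data
    labels = dict(data["labels"]) if "labels" in data else {}
    missing = REQUIRED - labels.keys()
    if missing:
        print(f"{asset.name}: missing {sorted(missing)}")
```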
📋 BigQuery Optimization — Partition & Cluster
May 1 – Jun 30, 2025 · PLANNED (0%)
Partition 127 inactive datasets and enable clustering on the top 10 most expensive tables. Projected reduction: ~40% of query costs, ≈$16,500/year in savings.
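The partition-and-cluster pattern in one sketch: rewrite an expensive table into a partitioned, clustered destination. All table IDs and column names are placeholders.

```python
# Sketch: materialize a partitioned + clustered copy of a hot table.
# Table IDs, the partition column, and clustering keys are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    destination="my-project.analytics.events_partitioned",
    time_partitioning=bigquery.TimePartitioning(field="event_date"),
    clustering_fields=["team", "service"],
    write_disposition="WRITE_TRUNCATE",
)
client.query(
    "SELECT * FROM `my-project.analytics.events`",
    job_config=job_config,
).result()
```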