Open source vs paid monitoring: the real total cost
Prometheus is free. Running it is not. An independent TCO comparison of self-hosted, Grafana Cloud, and Datadog at the same 100-host scale.
TL;DR
Self-hosted Prometheus plus Grafana plus Loki plus Tempo for 100 hosts costs approximately $2K to $8K/mo in infrastructure plus 0.5 to 1 FTE of engineering time. Loaded TCO: $8K to $20K/mo. Datadog at the same scale lists at $5K to $15K/mo; Grafana Cloud at $3K to $9K/mo. Which option wins depends on whether you have the platform engineering capacity to deploy and run the stack yourself.
Three options at 100 hosts
Loaded TCO comparison
Self-hosted (Prometheus, Grafana, Loki, Tempo)
Loaded total: $8K to $20K/mo
Cheapest at the licence level. Real cost is engineering time and operational maturity.
Grafana Cloud (managed open source)
Loaded total: $3K to $9K/mo
The pragmatic middle ground. Open-source data formats, managed operations, no vendor lock at the data layer.
Fully commercial (Datadog)
Loaded total: $5K to $15K/mo
Highest licence cost, lowest operational burden, broadest integration ecosystem.
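The self-hosted range can be reproduced from the infrastructure and FTE figures in the TL;DR. The sketch below is a minimal illustration, not a pricing model: the $12K per FTE-month figure is an assumption, chosen because it is the value that makes the quoted ranges add up, and the function name is hypothetical.

```python
# Loaded monthly TCO = infrastructure spend + engineering time.
# ASSUMPTION: $12K per FTE-month. The article does not state this figure;
# it is the value that reconciles $2K-$8K infra plus 0.5-1 FTE with the
# quoted $8K-$20K loaded range.
FTE_MONTHLY_COST = 12_000


def loaded_tco(infra_usd: float, fte: float, fte_cost: float = FTE_MONTHLY_COST) -> float:
    """Return the monthly loaded cost in USD."""
    return infra_usd + fte * fte_cost


# Low end: $2K infra with 0.5 FTE. High end: $8K infra with 1 FTE.
self_hosted = (loaded_tco(2_000, 0.5), loaded_tco(8_000, 1.0))

# Ranges for the managed options are taken as quoted; the article does not
# itemise engineering time for them, so none is added here.
options = {
    "Self-hosted": self_hosted,
    "Grafana Cloud": (3_000, 9_000),
    "Datadog": (5_000, 15_000),
}

for name, (low, high) in options.items():
    print(f"{name}: ${low:,.0f} to ${high:,.0f}/mo")
```

Changing `FTE_MONTHLY_COST` to your own loaded engineering rate is the quickest way to see how sensitive the self-hosted figure is to staffing cost rather than to infrastructure.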
Self-hosted
What 'free' Prometheus actually requires
Prometheus
Metrics ingest and storage
2 to 4 vCPU, 16 to 32 GB RAM per ingest replica. 100 GB to 1 TB local storage with WAL. Federation or remote-write to long-term backend (Cortex, Mimir, Thanos, VictoriaMetrics).
Grafana
Dashboarding and alerting UI
1 to 2 vCPU, 4 GB RAM, single replica adequate for most teams. Postgres or MySQL for state.
Loki
Log aggregation
Object storage backend (S3, GCS, Azure Blob). 2 to 4 vCPU per ingester, 8 GB RAM. Horizontally scalable.
Tempo
Distributed tracing
Object storage backend. 2 to 4 vCPU per ingester. Designed for cheap trace storage.
Long-term metrics (Mimir/Thanos/VictoriaMetrics)
Long-term retention and query
Object storage plus query/ingest workers. The deciding factor between hobbyist and production-grade Prometheus deployments.
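As a rough sizing aid, the per-replica figures above can be tallied into a compute footprint. This is a sketch under stated assumptions: the replica counts are hypothetical (the article gives none), Tempo RAM is left at zero because no figure is quoted, and the long-term metrics backend and object storage are excluded.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Component:
    name: str
    vcpu: Tuple[int, int]    # (low, high) per replica, from the figures above
    ram_gb: Tuple[int, int]  # (low, high) per replica; (0, 0) where none is quoted
    replicas: int            # ASSUMPTION: replica counts are not given in the article


stack: List[Component] = [
    Component("Prometheus", (2, 4), (16, 32), replicas=2),
    Component("Grafana", (1, 2), (4, 4), replicas=1),
    Component("Loki ingester", (2, 4), (8, 8), replicas=2),
    Component("Tempo ingester", (2, 4), (0, 0), replicas=2),
]


def total(components: List[Component], field: str, bound: int) -> int:
    """Sum one resource across all replicas; bound 0 is the low end, 1 the high end."""
    return sum(getattr(c, field)[bound] * c.replicas for c in components)


vcpu_range = (total(stack, "vcpu", 0), total(stack, "vcpu", 1))
ram_range = (total(stack, "ram_gb", 0), total(stack, "ram_gb", 1))

print(f"vCPU: {vcpu_range[0]} to {vcpu_range[1]}")
print(f"RAM:  {ram_range[0]} to {ram_range[1]} GB (excluding Tempo)")
```

With these assumed replica counts the tally comes to 13 to 26 vCPU and 52 to 84 GB RAM before any long-term storage tier is added.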
The egress trap
Decision matrix
Open source wins when
- Strong platform engineering team with Kubernetes operational maturity.
- K8s-native stack where the team already runs Helm charts and Operators.
- High data volumes where the per-host or per-GB pricing of commercial vendors hits hard.
- Strict data residency or compliance requirements that favour self-managed.
- Long-term horizon. Open source pays back over multi-year deployments, not pilots.
Paid wins when
- Small or generalist engineering team, no platform function.
- Aggressive feature breadth (RUM, synthetics, security, AI ops) needed out of the box.
- Compliance regimes that prefer SaaS audit trails (SOC 2, FedRAMP).
- Short on-call rota where 4am pages from a self-managed Prometheus cluster are unacceptable.
- Pre-product-market-fit teams where engineering hours are scarcer than dollars.
The hybrid path
Most teams settle in the middle
Common hybrid pattern
Migration cost
What it costs to switch from Datadog to self-hosted
A realistic migration timeline for a 100-host Datadog deployment to a self-hosted Prometheus stack runs 8 to 16 weeks of engineering time across multiple roles. Key cost components:
- 2 to 4 weeks: Prometheus / Grafana / Loki / Tempo deployment, parallel to existing Datadog.
- 2 to 4 weeks: dashboard migration and validation. Grafana imports many but not all Datadog dashboards cleanly.
- 2 to 3 weeks: alert translation. Datadog monitors map roughly but not exactly to Prometheus alerting rules.
- 1 to 2 weeks: runbook update, on-call team retraining, paging integration.
- 1 to 3 weeks: cutover, monitoring of the monitoring, decommissioning.
Loaded engineering cost at typical SaaS rates: $30,000 to $80,000 one-off. The saving from switching needs to cover that cost within 12 months for the project to be a clean win.
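The payback test can be made concrete with a few lines of arithmetic. The sketch below uses only the phase durations and the one-off cost range quoted above; the function names and the $4,000/mo saving in the usage example are hypothetical.

```python
# Phase durations in weeks (low, high), taken from the list above.
phases = {
    "deployment": (2, 4),
    "dashboard migration": (2, 4),
    "alert translation": (2, 3),
    "runbooks and retraining": (1, 2),
    "cutover": (1, 3),
}

# Sanity check: the phases should add up to the quoted 8 to 16 weeks.
total_weeks = (
    sum(low for low, _ in phases.values()),
    sum(high for _, high in phases.values()),
)

MIGRATION_COST = (30_000, 80_000)  # one-off USD range quoted above
PAYBACK_MONTHS = 12                # the "clean win" threshold


def required_monthly_saving(cost: float, months: int = PAYBACK_MONTHS) -> float:
    """Monthly saving needed to recover a one-off cost within `months`."""
    return cost / months


def payback_months(cost: float, monthly_saving: float) -> float:
    """Months needed to recover a one-off cost at a given monthly saving."""
    if monthly_saving <= 0:
        return float("inf")  # no saving means the cost is never recovered
    return cost / monthly_saving


for cost in MIGRATION_COST:
    print(f"${cost:,} migration needs ${required_monthly_saving(cost):,.0f}/mo saved")
```

At the low end the migration needs about $2,500/mo in savings to pay back inside a year; at the high end, about $6,667/mo. As a hypothetical example, a team saving $4,000/mo would recover a $30,000 migration in 7.5 months but would need 20 months to recover an $80,000 one.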