01
Filter and sample logs at the source
saves 30 to 50 percent
Logs are typically 50 percent of total observability spend. Drop health-check, framework, and load-balancer noise at the agent (Fluent Bit, Vector, Filebeat). Sample DEBUG and INFO at 10 to 20 percent while keeping all WARN and ERROR. This is the single highest-impact lever in the toolkit.
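The filtering and sampling rule above can be sketched in a few lines of Python. This is an illustration of the decision logic only, not the configuration syntax of Fluent Bit, Vector, or Filebeat. The noise markers, the function name `keep_log`, and the default 10 percent rate are assumptions chosen for the example.

```python
import random

# Hypothetical noise markers; replace with your own health-check and
# load-balancer patterns.
NOISE_MARKERS = ("/healthz", "/ready", "ELB-HealthChecker")
ALWAYS_KEEP = {"WARN", "WARNING", "ERROR", "CRITICAL"}
SAMPLED_LEVELS = {"DEBUG", "INFO"}


def keep_log(level, message, sample_rate=0.10, rng=random):
    """Return True if the log line should be shipped to the backend."""
    level = level.upper()
    # WARN and above are always kept, even if they mention a noisy path.
    if level in ALWAYS_KEEP:
        return True
    # Drop known noise outright.
    if any(marker in message for marker in NOISE_MARKERS):
        return False
    # Keep only a random fraction of DEBUG and INFO lines.
    if level in SAMPLED_LEVELS:
        return rng.random() < sample_rate
    # Unknown levels pass through unchanged.
    return True
```

Checking the severity before the noise filter is a deliberate choice here: a failing health check logged at ERROR still gets through.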
02
Cap custom metric cardinality
saves 20 to 40 percent
Audit the top 10 highest-cardinality metric series. Drop user_id, request_id, IP address from metric labels (keep them in logs and traces). Convert per-URL gauges to bucketed histograms. Use Datadog Metrics Without Limits or equivalent aggregation rules to enforce caps.
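A cardinality audit of this kind can be approximated offline. The sketch below assumes metric samples are available as `(metric_name, labels)` pairs, which is a hypothetical input shape rather than any vendor's export format. The label names in `HIGH_CARDINALITY_LABELS` are also assumptions.

```python
from collections import defaultdict

# Hypothetical label names that tend to explode series counts.
HIGH_CARDINALITY_LABELS = {"user_id", "request_id", "ip"}


def strip_labels(labels):
    """Return a copy of the label dict without high-cardinality keys."""
    return {k: v for k, v in labels.items() if k not in HIGH_CARDINALITY_LABELS}


def top_cardinality(samples, n=10):
    """Rank metrics by number of distinct label combinations (series).

    samples: iterable of (metric_name, labels_dict) pairs.
    Returns up to n (metric_name, series_count) tuples, largest first.
    """
    series = defaultdict(set)
    for name, labels in samples:
        series[name].add(frozenset(labels.items()))
    ranked = sorted(
        ((name, len(combos)) for name, combos in series.items()),
        key=lambda item: (-item[1], item[0]),
    )
    return ranked[:n]
```

Running `top_cardinality` before and after applying `strip_labels` gives a rough estimate of how many series a label cleanup would remove.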
03
Sample APM traces at 5 to 10 percent
saves 15 to 30 percent
Use head-based sampling for high-volume services and tail-based sampling to retain error-relevant traces. 100 percent tracing is rarely necessary. Most teams find that the loss of fidelity at 10 percent is invisible, while the reduction saves a meaningful share of the APM line.
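The difference between the two sampling modes can be sketched as follows. This is a simplified illustration, not the sampler of any particular APM product. The span shape (a dict with `trace_id` and `error` keys) and the function names are assumptions.

```python
import hashlib


def head_sample(trace_id, rate=0.10):
    """Head-based decision: made up front from the trace ID alone.

    Hashing the ID makes the decision deterministic, so every service
    that sees the same trace ID reaches the same verdict.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # in [0, 1)
    return bucket < rate


def tail_keep(spans, rate=0.10):
    """Tail-based decision: made after the trace is complete.

    Any trace containing an errored span is kept; the rest fall back
    to the same hash-based rate as head sampling.
    """
    if not spans:
        return False
    if any(span.get("error") for span in spans):
        return True
    return head_sample(spans[0]["trace_id"], rate)
```

Tail-based sampling needs the whole trace buffered before deciding, which is why it is usually reserved for the traces where errors matter most.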
04
Right-size retention
saves 10 to 20 percent
Default to 15 days for hot data. Push 30 to 90 day historical data to object storage (S3, GCS) and rehydrate on demand. Audit compliance requirements: most regulations require specific log types (auth, audit) for fixed periods, not all logs.
05
Tier hot, warm, and cold storage
saves 30 to 60 percent on log retention
Downsample data as it ages: 1-second resolution for 24 hours, 1-minute for 7 days, 5-minute for 30 days, and hourly aggregates for 13 months. Most operational analysis happens in the 7-day window. Capacity planning needs hourly granularity at most.
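The resolution schedule above can be written as a small lookup table, which also makes the storage effect easy to estimate. In this sketch, 13 months is approximated as 395 days, and the names `TIERS`, `resolution_for_age`, and `points_per_series` are illustrative assumptions.

```python
from datetime import timedelta

# (maximum age, resolution) pairs, youngest tier first.
# 13 months is approximated as 395 days (an assumption for this sketch).
TIERS = [
    (timedelta(hours=24), timedelta(seconds=1)),
    (timedelta(days=7), timedelta(minutes=1)),
    (timedelta(days=30), timedelta(minutes=5)),
    (timedelta(days=395), timedelta(hours=1)),
]


def resolution_for_age(age):
    """Return the resolution to keep for data of the given age, or None if expired."""
    for max_age, resolution in TIERS:
        if age <= max_age:
            return resolution
    return None


def points_per_series():
    """Count the data points retained for one series across all tiers."""
    total = 0
    previous = timedelta(0)
    for max_age, resolution in TIERS:
        total += (max_age - previous) // resolution
        previous = max_age
    return total
```

Under these assumptions one series retains 110,424 points across all tiers, compared with 34,128,000 points if 1-second resolution were kept for the full 395 days.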
06
Negotiate annual commitment
saves 15 to 25 percent off list
Vendors discount 15 to 25 percent for an annual or multi-year commitment with a usage floor. Negotiate exit terms, true-up windows, and the floor before signing. Time renewal negotiations to coincide with quarter-end vendor pressure.
07
Move dev and staging to a free tier
saves 10 to 20 percent
Production observability rarely needs to apply to ephemeral dev environments. Run dev/staging on Grafana Cloud free tier or self-hosted Prometheus. Typically 30 to 40 percent of monitoring spend is non-production environments masquerading as production.
08
Consolidate overlapping vendors
saves 15 to 30 percent
A typical stack is Datadog plus PagerDuty plus Splunk plus Sentry plus a homegrown dashboard. List every paid signal source, then eliminate any signal type covered by two or more platforms. The migration cost is real and is quantified on the hidden costs page.
09
Migrate metrics to open source
saves 60 to 90 percent on metrics line
Self-host Prometheus and Grafana and pay only for the underlying compute. Use Tempo for traces and Loki for logs. Most viable when there is a platform engineering function or strong DevOps culture. Quantified TCO comparison on the open-source-vs-paid page.
10
Use Grafana Cloud as a managed open-source bridge
saves 40 to 70 percent vs Datadog
Best transition point between fully self-hosted Prometheus and a fully commercial platform. Generous free tier, OpenTelemetry-native, no vendor lock-in at the data format level. Ideal for teams that want to leave Datadog without taking on the full operational burden.
11
Adopt OpenTelemetry from day one
Avoids future migration cost
Instrument with OpenTelemetry rather than vendor-specific SDKs. Data flows to any backend that supports OTLP. Future platform switches drop from months to days. This future-proofs against vendor lock-in at the SDK layer.
12
Audit quarterly
Sustains all of the above
Cost growth that outpaces infra growth is the leading indicator of a problem. A quarterly cost review with a single owner catches new cardinality, new log volume, and unintentional retention upgrades before they become invoices.
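The leading indicator described above reduces to comparing two growth rates. A minimal sketch, assuming host count stands in for infrastructure size and using a hypothetical 5-point tolerance before the check flags a problem:

```python
def growth(previous, current):
    """Fractional growth between two periods, e.g. 0.10 for 10 percent."""
    return (current - previous) / previous


def cost_outpaces_infra(prev_cost, cur_cost, prev_hosts, cur_hosts, tolerance=0.05):
    """Return True when observability cost grew faster than infrastructure.

    tolerance is an assumed allowance (in fractional points) so small
    differences between the two rates do not trigger a review.
    """
    return growth(prev_cost, cur_cost) > growth(prev_hosts, cur_hosts) + tolerance
```

For example, a bill that rises from 10,000 to 13,000 while hosts go from 100 to 110 is flagged, because 30 percent cost growth exceeds 10 percent infrastructure growth by more than the tolerance.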