I totally get what you’re saying. I’ve been through this pain at least twice, especially when scaling microservices. What worked for us was trimming alerts down to those tied directly to user impact — e.g., response times, API error ratios, and service availability — rather than every CPU spike. It also helps to use alert tiers: warnings for early signals and criticals only when there’s real user degradation. We rebuilt a lot of our system with guidance from
devsecops consulting services — they have solid insights on designing sustainable observability stacks and aligning them with business goals. One big takeaway was defining “golden signals” (latency, traffic, errors, saturation) and linking alerts to SLOs. Once you focus on those and automate noisy stuff away, your team starts trusting the alerts again.
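
To make the "link alerts to SLOs" part concrete, here's a minimal sketch of burn-rate-based alert tiering, assuming a 99.9% availability SLO over a 30-day window. The thresholds, numbers, and helper names are illustrative placeholders, not any particular vendor's API or exactly what we run:

```python
"""Sketch: tiered alerting on the error-rate golden signal, tied to an SLO.

Assumes a 99.9% availability SLO over a 30-day window. All thresholds and
names here are hypothetical, for illustration only.
"""

from dataclasses import dataclass

# 99.9% SLO -> 0.1% of requests may fail over the window (the error budget).
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # 0.001


@dataclass
class WindowStats:
    """Aggregated request counts for one evaluation window."""
    total_requests: int
    failed_requests: int

    @property
    def error_ratio(self) -> float:
        return self.failed_requests / max(self.total_requests, 1)


def burn_rate(stats: WindowStats) -> float:
    """How fast the error budget is being consumed.

    1.0 means errors arrive exactly at the pace that would exhaust the
    budget at the end of the SLO window; higher means faster.
    """
    return stats.error_ratio / ERROR_BUDGET


def classify_alert(stats: WindowStats) -> str | None:
    """Map burn rate to a tier: warning as an early signal, critical for real user impact.

    The 2x and 10x thresholds are illustrative, not tuned values.
    """
    rate = burn_rate(stats)
    if rate >= 10:   # budget gone in ~3 days at this pace -> page someone
        return "critical"
    if rate >= 2:    # budget gone in ~15 days -> ticket / warning channel
        return "warning"
    return None      # within budget: stay quiet, no noise


if __name__ == "__main__":
    # Example: 1,000,000 requests in the window, 5,000 failures -> 0.5% error ratio.
    window = WindowStats(total_requests=1_000_000, failed_requests=5_000)
    print(burn_rate(window))       # 5.0
    print(classify_alert(window))  # "warning"
```

The same math works for any golden signal you can express as a ratio, which is what keeps the tiers honest: warnings fire early, criticals only when the budget is genuinely at risk.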