As part of the DevOps engineering team here at Wavefront by VMware, my day-to-day activities include setting up numerous alerts that monitor our customers’ AWS clusters to ensure that they’re in a healthy state.
To provide an exceptional quality of experience for our customers, we want to be able to pinpoint problems and fix them quickly when clusters wander into a bad state. One feature, unique to Wavefront, that helps us do this is the Alert Target functionality.
Alert Target is an abstraction that categorizes all the notification mechanisms that Wavefront provides. PagerDuty, email, and webhooks are the three types of Alert Targets through which a user can be notified when alerts fire. Within an Alert Target, a user can specify which alert triggers to listen to.
For example, when an alert “fires” or “resolves”, I want to be notified through PagerDuty because it typically reaches me faster than the other types. However, if an alert is put under “maintenance” or “snoozed”, I am fine with email notifications, as those events aren’t as urgent and can be checked at a later time.
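As a rough sketch, that routing could be expressed as two Alert Target configurations. (The field and trigger names below are illustrative, not the exact Wavefront API schema; consult the Wavefront REST API documentation for the real one.)

```json
[
  {
    "method": "PAGERDUTY",
    "recipient": "<pagerduty-integration-key>",
    "triggers": ["ALERT_OPENED", "ALERT_RESOLVED"]
  },
  {
    "method": "EMAIL",
    "recipient": "oncall@example.com",
    "triggers": ["ALERT_MAINTENANCE", "ALERT_SNOOZED"]
  }
]
```

The idea is simply that urgent state transitions route to the paging system, while informational ones route to email.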
Another parameter that can be set in an Alert Target is the notification message: specifically, the descriptive message related to the alert that fired.
Wavefront provides a powerful analytics platform for a DevOps engineer or developer to query metrics and identify problematic series. Given that the alert condition has already been evaluated by the time I receive a notification from an Alert Target, I want to see the result of that query instead of re-issuing the alert condition on my own. This is especially helpful if I am paged after work hours and just want to resolve the issue ASAP.
The screenshot below shows how to create an Alert Target with specific alert triggers:
In an Alert Target, one can use a Mustache template that extracts data from the alert. An example of an email Mustache template is outlined below:
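(The variable names below, such as `{{name}}` and `{{failingHosts}}`, are illustrative; the exact set of template variables available is documented by Wavefront.)

```mustache
Alert "{{name}}" changed state.

Condition: {{condition}}
Severity:  {{severity}}

Failing sources:
{{#failingHosts}}
  - {{.}}
{{/failingHosts}}

Chart: {{url}}
```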
For more information on the Mustache syntax, view https://mustache.github.io/mustache.5.html.
For example, here I’ve set up an alert that monitors the CPU usage of machines. If the 5-minute moving average exceeds 55% over the given period, I will receive an email notification. Typically, the threshold would be higher, and the alert condition could be further tuned to better detect anomalies, but that’s a separate discussion for another day.
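Such an alert condition can be sketched in the Wavefront query language as follows. (The metric name `cpu.usage.percent` is an assumption here; substitute whatever metric your agents report.)

```
mavg(5m, ts(cpu.usage.percent)) > 55
```

The condition returns the series whose 5-minute moving average is above the 55% threshold, and the alert fires when any series satisfies it for the configured duration.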
Using Alert Target and the default email template, I can quickly narrow down the problematic machines and extract the important information needed to diagnose the problem.
As shown in the email message, app-35 was the problematic machine: it exceeded the provided threshold at 56%. The alert description, which contains details on how to resolve a particular alert, shows the next step for debugging. The email also provides a URL linking to a Wavefront chart that shows the underlying series at the time the alert fired.
With that as a starting point, I can add more queries on the same chart to begin my diagnostics to find the root cause.
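For instance, alongside the original series I might chart related metrics scoped to the failing source. (The metric names below are hypothetical examples of such drill-down queries.)

```
ts(cpu.usage.percent, source=app-35)
ts(mem.used.percent, source=app-35)
ts(disk.io.await, source=app-35)
```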
There are more ways to use Alert Targets than are covered here. For more information, visit the Wavefront documentation.
Wavefront offers a free trial, so you can experiment with Alert Targets on your own.