Monitoring rules for systemd units (monitoring_srv/prometheus/rules.yml) #791

Open
pirat013 opened this issue Nov 16, 2021 · 2 comments
pirat013 commented Nov 16, 2021

The current rules for the systemd units monitor the active state, like this:

  - name: systemd-services-monitoring
    rules:
      - alert: service-down-pacemaker
        expr: node_systemd_unit_state{name="pacemaker.service", state="active"} == 0
        labels:
          severity: page
        annotations:
          summary: Pacemaker service not running

This leads to false positives when an admin stops the systemd units for maintenance work or other tasks.
I suggest changing the monitoring rule from active to failed:

  - name: systemd-services-monitoring
    rules:
      - alert: service-failed-pacemaker
        expr: node_systemd_unit_state{name="pacemaker.service", state="failed"} == 1
        labels:
          severity: page
        annotations:
          summary: Pacemaker service could not start or has crashed.

This would create fewer calls in situations where a systemd unit is stopped for maintenance.
If we go this way, we could also think about shortening the list by using a single generic rule like this:

  - alert: HostSystemdServiceCrashed
    expr: node_systemd_unit_state{state="failed"} == 1
    for: 1m
    labels:
      severity: page
    annotations:
      description: |-
        systemd service crashed
        VALUE = {{ $value }}
        LABELS = {{ $labels }}
      summary: Host systemd service crashed (instance {{ $labels.instance }})
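
If the generic rule turns out to be too broad, it could be limited to a known set of units with a regex matcher on the `name` label. The following is only a sketch: it assumes the rule lives in the existing `systemd-services-monitoring` group, and `corosync.service` is just an example unit name that is not part of the current list in this issue.

```yaml
groups:
  - name: systemd-services-monitoring
    rules:
      - alert: HostSystemdServiceCrashed
        # Only alert on the listed units. corosync.service is an
        # illustrative placeholder, replace it with the real unit names.
        expr: node_systemd_unit_state{state="failed", name=~"pacemaker.service|corosync.service"} == 1
        for: 1m
        labels:
          severity: page
        annotations:
          summary: Host systemd service crashed (instance {{ $labels.instance }}, unit {{ $labels.name }})
```

A file like this can be syntax-checked with `promtool check rules rules.yml` before it is deployed.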
@yeoldegrove

@pirat013 I don't see anything that speaks against this change.
Are you willing to submit a PR?


pirat013 commented Apr 4, 2022

@yeoldegrove sorry, I didn't see your request.
We may also have to consider a combination of "service is enabled" and "service is not started". This would reflect the original idea better than my suggestion. I'll try to figure out this rule and can create a PR, but I can't say when that will happen.
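
A rough sketch of what such a combined rule could look like is shown below. It rests on a loud assumption: `unit_enabled` is a hypothetical metric (1 when the unit is enabled) that would have to come from somewhere else, for example a custom textfile collector script. It is not a metric known to be exported alongside `node_systemd_unit_state`. The alert name and the `for` duration are placeholders as well.

```yaml
groups:
  - name: systemd-services-monitoring
    rules:
      - alert: service-enabled-not-running-pacemaker
        # unit_enabled is HYPOTHETICAL: substitute whatever metric
        # reports the enabled state of a unit in your setup.
        expr: |
          node_systemd_unit_state{name="pacemaker.service", state="active"} == 0
          and on (instance, name)
          unit_enabled{name="pacemaker.service"} == 1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: Pacemaker service is enabled but not running
```

The `and on (instance, name)` clause keeps only those "not active" series for which a matching "enabled" series exists on the same host and unit.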
