Listing alert activations


When an alert is activated by its configured trigger, MLRun saves an activation record that you can later list, filter, group, and aggregate. Alert activation records are stored in a table that is partitioned weekly and supports retention. The default retention period is 14 weeks. To change it, set object_retentions.alert_activations in the MLRun configuration, specifying the value in days.
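Because the default is quoted in weeks while the setting is given in days, a quick conversion can avoid off-by-a-factor mistakes. The following is only an arithmetic sketch: the helper name weeks_to_days and the 4-week value are hypothetical, and nothing here reads or writes the MLRun configuration.

```python
from datetime import timedelta


def weeks_to_days(weeks: int) -> int:
    """Convert a retention period in weeks to the equivalent number of days."""
    return timedelta(weeks=weeks).days


# The documented default of 14 weeks, expressed in days.
default_retention_days = weeks_to_days(14)
# A hypothetical shorter retention period of 4 weeks.
shorter_retention_days = weeks_to_days(4)

print(default_retention_days)  # 98
print(shorter_retention_days)  # 28
```

The resulting number of days is the kind of value you would assign to object_retentions.alert_activations.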

In this section

List alert activations
Filter by entity
List alert activations for a given time range
Group by attributes
Aggregate by

List alert activations

You can list alert activations through any of the following:

MLRun's run DB
A specific project
An alert config object

import mlrun

run_db = mlrun.get_run_db()

activations = run_db.list_alert_activations(
    project=None,
    name=None,
    since=None,
    until=None,
    entity=None,
    severity=None,
    entity_kind=None,
    event_kind=None,
)

The method returns an instance of the AlertActivations class, which includes an activations attribute. This attribute is a list of AlertActivation objects, each containing the following fields:

id: int - activation id 
name: str - alert config name
project: str - project name
severity: AlertSeverity - alert config severity
activation_time: datetime - time when alert was activated
entity_id: str - id of the entity; for a job it is `{job-name}.{job_uid}`, for a model endpoint it is `{model_endpoint_id}.{app_name}.result.{result_name}`
entity_kind: EventEntityKind - entity kind
criteria: AlertCriteria - alert config criteria
event_kind: EventKind - event kind
number_of_events: int - number of events of `event_kind` received from the previous deactivation (or from the beginning of time) until the current deactivation
notifications: list[notification_objects.NotificationState]
reset_time: Optional[datetime] - time when the alert was reset (with the auto reset policy, this is the same as the activation time)
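Since a job's entity_id packs two values into one string, it can be handy to split it back into its parts. The sketch below assumes the `{job-name}.{job_uid}` format described above and that the job uid itself contains no dots. JobEntity and split_job_entity_id are hypothetical helpers, not part of MLRun.

```python
from typing import NamedTuple


class JobEntity(NamedTuple):
    """Hypothetical container for the two parts of a job entity_id."""

    job_name: str
    job_uid: str


def split_job_entity_id(entity_id: str) -> JobEntity:
    # Split on the last dot so that any dots inside the job name are preserved.
    job_name, separator, job_uid = entity_id.rpartition(".")
    if not separator or not job_name or not job_uid:
        raise ValueError(f"not a job entity_id: {entity_id!r}")
    return JobEntity(job_name, job_uid)


parsed = split_job_entity_id("test-func-handler.db80cba0c4be4ee9b86a09cf12a89991")
print(parsed.job_name)  # test-func-handler
print(parsed.job_uid)  # db80cba0c4be4ee9b86a09cf12a89991
```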

activations

AlertActivations(activations=[AlertActivation(id=4, name='job-failure-alert', project='default', severity=low, activation_time=datetime.datetime(2024, 12, 11, 10, 5, 17, 674000, tzinfo=datetime.timezone.utc), entity_id='test-func-handler.db80cba0c4be4ee9b86a09cf12a89991', entity_kind=job, criteria=AlertCriteria(count=1, period=None), event_kind=failed, number_of_events=1, notifications=[NotificationState(kind='webhook', err='', summary=NotificationSummary(failed=0, succeeded=1))], reset_time=datetime.datetime(2024, 12, 11, 10, 5, 17, 674000, tzinfo=datetime.timezone.utc)),
 AlertActivation(id=3, name='job-failure-alert', project='default', severity=low, activation_time=datetime.datetime(2024, 12, 11, 10, 4, 47, 530000, tzinfo=datetime.timezone.utc), entity_id='test-func-handler.db80cba0c4be4ee9b86a09cf12a89991', entity_kind=job, criteria=AlertCriteria(count=1, period=None), event_kind=failed, number_of_events=1, notifications=[NotificationState(kind='webhook', err='', summary=NotificationSummary(failed=0, succeeded=1))], reset_time=datetime.datetime(2024, 12, 11, 10, 4, 47, 530000, tzinfo=datetime.timezone.utc)),
 AlertActivation(id=2, name='job-failure-alert', project='default', severity=low, activation_time=datetime.datetime(2024, 12, 11, 10, 4, 17, 349000, tzinfo=datetime.timezone.utc), entity_id='test-func-handler.58e25426ea154daab2afa4ebfe454c71', entity_kind=job, criteria=AlertCriteria(count=1, period=None), event_kind=failed, number_of_events=1, notifications=[NotificationState(kind='webhook', err='', summary=NotificationSummary(failed=0, succeeded=1))], reset_time=datetime.datetime(2024, 12, 11, 10, 4, 17, 349000, tzinfo=datetime.timezone.utc)),
 AlertActivation(id=1, name='job-failure-alert', project='default', severity=low, activation_time=datetime.datetime(2024, 12, 11, 10, 3, 47, 209000, tzinfo=datetime.timezone.utc), entity_id='test-func-handler.58e25426ea154daab2afa4ebfe454c71', entity_kind=job, criteria=AlertCriteria(count=1, period=None), event_kind=failed, number_of_events=1, notifications=[NotificationState(kind='webhook', err='', summary=NotificationSummary(failed=0, succeeded=1))], reset_time=datetime.datetime(2024, 12, 11, 10, 3, 47, 209000, tzinfo=datetime.timezone.utc))], pagination=None)

This object is iterable:

for activation in activations:
    print(activation.name)

job-failure-alert
job-failure-alert
job-failure-alert
job-failure-alert

project = mlrun.get_or_create_project("default")

> 2024-12-11 11:52:10,062 [info] Project loaded successfully: {"project_name":"default"}

# list alert activations for a specific project
activations = project.list_alert_activations(
    name=None,
    since=None,
    until=None,
    entity=None,
    severity=None,
    entity_kind=None,
    event_kind=None,
)

# list alert activations for a specific alert config
alert_config = run_db.list_alerts_configs()[0]

activations = alert_config.list_activations(
    since=None,
    until=None,
    # Set to True to return only activations that occurred after the alert config
    # was last updated. Takes precedence over "since" if both are passed.
    from_last_update=False,
)

Filter by entity

List activations only for a specific entity, using its entity_id.

For the job entity kind, entity_id is formatted as <job-name>.<job_uid>. This is the only alert activation field that supports wildcard search with an asterisk (*). To enable wildcard search, prefix the value of the entity parameter with a tilde (~). For example, if you know the job name and want to find all activations related to it, pass the entity parameter as follows:

activations = project.list_alert_activations(entity="~test-func-handler.*")
for activation in activations:
    print(activation.entity_id)

test-func-handler.db80cba0c4be4ee9b86a09cf12a89991
test-func-handler.db80cba0c4be4ee9b86a09cf12a89991
test-func-handler.58e25426ea154daab2afa4ebfe454c71
test-func-handler.58e25426ea154daab2afa4ebfe454c71

If the entity parameter is passed as a string without a tilde (~), the search is performed for an exact match with the given string:

activations = project.list_alert_activations(
    entity="test-func-handler.db80cba0c4be4ee9b86a09cf12a89991"
)
len(activations)
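To make the two matching modes concrete, the sketch below imitates them locally on a plain list of entity ids. It is an illustration of the semantics described above, not MLRun's implementation: entity_matches is a hypothetical helper, the actual matching happens on the server side, and Python's fnmatchcase also treats ? and [ ] as special characters, which the service may not.

```python
from fnmatch import fnmatchcase


def entity_matches(entity_filter: str, entity_id: str) -> bool:
    """Local imitation of the entity filter: '~' prefix means wildcard, otherwise exact."""
    if entity_filter.startswith("~"):
        # Drop the tilde and match the rest as a wildcard pattern.
        return fnmatchcase(entity_id, entity_filter[1:])
    # Without a tilde, only an exact string match counts.
    return entity_id == entity_filter


entity_ids = [
    "test-func-handler.db80cba0c4be4ee9b86a09cf12a89991",
    "test-func-handler.58e25426ea154daab2afa4ebfe454c71",
    "other-job.0123456789abcdef0123456789abcdef",  # made-up id for contrast
]

wildcard_hits = [e for e in entity_ids if entity_matches("~test-func-handler.*", e)]
exact_hits = [e for e in entity_ids if entity_matches("test-func-handler.*", e)]

print(len(wildcard_hits))  # 2
print(len(exact_hits))  # 0
```

Without the tilde, the asterisk is treated as a literal character, so the second filter matches nothing in this sample.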

List alert activations for a given time range

To filter by time, pass since and until as datetime.datetime objects:

import datetime

activations = project.list_alert_activations(since=datetime.datetime.now())
len(activations)

activations = project.list_alert_activations(
    since=datetime.datetime(2024, 12, 11, 10, 5, 17, 674000),
    until=datetime.datetime.now(),
)
len(activations)

alert_config = run_db.list_alerts_configs()[0]
activations = alert_config.list_activations()
len(activations)

# update alert config and get activations since last update
alert_config = project.store_alert_config(alert_config)
activations = alert_config.list_activations(from_last_update=True)
len(activations)
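The activation times shown earlier are timezone-aware UTC values, so it can be less ambiguous to build the since and until bounds in UTC as well. The following sketch computes a "last N hours" window. The helper last_hours_window is hypothetical, and it is an assumption that the listing methods accept timezone-aware datetimes in the same way as the naive ones used above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def last_hours_window(
    hours: float, now: Optional[datetime] = None
) -> Tuple[datetime, datetime]:
    """Return a (since, until) pair covering the last `hours` hours, in UTC."""
    until = now if now is not None else datetime.now(timezone.utc)
    since = until - timedelta(hours=hours)
    return since, until


since, until = last_hours_window(24)
print(until - since)  # 1 day, 0:00:00

# Assumed usage, mirroring the calls above:
# activations = project.list_alert_activations(since=since, until=until)
```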

Group by attributes

The group_by method organizes alert activations into a dictionary keyed by the values of one or more attributes, such as project or severity. Each dictionary value is the list of activations that share those attribute values. As the examples below show, grouping by a single attribute uses the attribute value itself as the key, while grouping by several attributes uses a tuple of the values. This is useful for processing activations by category, for example to review alerts by severity level or to compare them across projects.

len(run_db.list_alert_activations())

# group by severity
grouped = run_db.list_alert_activations().group_by("severity")
grouped["high"]

[AlertActivation(id=5, name='job-failure-alert', project='default', severity=high, activation_time=datetime.datetime(2024, 12, 11, 13, 9, 57, 317000, tzinfo=datetime.timezone.utc), entity_id='test-func-handler.ed6094787c8c4817bd0c0a34c76fd004', entity_kind=job, criteria=AlertCriteria(count=1, period=None), event_kind=failed, number_of_events=1, notifications=[NotificationState(kind='webhook', err='', summary=NotificationSummary(failed=0, succeeded=1))], reset_time=datetime.datetime(2024, 12, 11, 13, 9, 57, 317000, tzinfo=datetime.timezone.utc))]

grouped["medium"]

[AlertActivation(id=6, name='job-failure-alert', project='default', severity=medium, activation_time=datetime.datetime(2024, 12, 11, 13, 10, 57, 379000, tzinfo=datetime.timezone.utc), entity_id='test-func-handler.c93108a238bf4e3196780e352de9561a', entity_kind=job, criteria=AlertCriteria(count=1, period=None), event_kind=failed, number_of_events=1, notifications=[NotificationState(kind='webhook', err='', summary=NotificationSummary(failed=0, succeeded=1))], reset_time=datetime.datetime(2024, 12, 11, 13, 10, 57, 379000, tzinfo=datetime.timezone.utc))]

grouped["low"]

[AlertActivation(id=4, name='job-failure-alert', project='default', severity=low, activation_time=datetime.datetime(2024, 12, 11, 10, 5, 17, 674000, tzinfo=datetime.timezone.utc), entity_id='test-func-handler.db80cba0c4be4ee9b86a09cf12a89991', entity_kind=job, criteria=AlertCriteria(count=1, period=None), event_kind=failed, number_of_events=1, notifications=[NotificationState(kind='webhook', err='', summary=NotificationSummary(failed=0, succeeded=1))], reset_time=datetime.datetime(2024, 12, 11, 10, 5, 17, 674000, tzinfo=datetime.timezone.utc)),
 AlertActivation(id=3, name='job-failure-alert', project='default', severity=low, activation_time=datetime.datetime(2024, 12, 11, 10, 4, 47, 530000, tzinfo=datetime.timezone.utc), entity_id='test-func-handler.db80cba0c4be4ee9b86a09cf12a89991', entity_kind=job, criteria=AlertCriteria(count=1, period=None), event_kind=failed, number_of_events=1, notifications=[NotificationState(kind='webhook', err='', summary=NotificationSummary(failed=0, succeeded=1))], reset_time=datetime.datetime(2024, 12, 11, 10, 4, 47, 530000, tzinfo=datetime.timezone.utc)),
 AlertActivation(id=2, name='job-failure-alert', project='default', severity=low, activation_time=datetime.datetime(2024, 12, 11, 10, 4, 17, 349000, tzinfo=datetime.timezone.utc), entity_id='test-func-handler.58e25426ea154daab2afa4ebfe454c71', entity_kind=job, criteria=AlertCriteria(count=1, period=None), event_kind=failed, number_of_events=1, notifications=[NotificationState(kind='webhook', err='', summary=NotificationSummary(failed=0, succeeded=1))], reset_time=datetime.datetime(2024, 12, 11, 10, 4, 17, 349000, tzinfo=datetime.timezone.utc)),
 AlertActivation(id=1, name='job-failure-alert', project='default', severity=low, activation_time=datetime.datetime(2024, 12, 11, 10, 3, 47, 209000, tzinfo=datetime.timezone.utc), entity_id='test-func-handler.58e25426ea154daab2afa4ebfe454c71', entity_kind=job, criteria=AlertCriteria(count=1, period=None), event_kind=failed, number_of_events=1, notifications=[NotificationState(kind='webhook', err='', summary=NotificationSummary(failed=0, succeeded=1))], reset_time=datetime.datetime(2024, 12, 11, 10, 3, 47, 209000, tzinfo=datetime.timezone.utc))]

# group by severity and entity_id
grouped = run_db.list_alert_activations().group_by("severity", "entity_id")
grouped.keys()

dict_keys([(medium, 'test-func-handler.c93108a238bf4e3196780e352de9561a'), (high, 'test-func-handler.ed6094787c8c4817bd0c0a34c76fd004'), (low, 'test-func-handler.db80cba0c4be4ee9b86a09cf12a89991'), (low, 'test-func-handler.58e25426ea154daab2afa4ebfe454c71')])
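The key shapes seen above (a plain value for one attribute, a tuple for several) can be reproduced with a few lines of plain Python. This is a minimal sketch of the grouping behavior, not MLRun's code: FakeActivation is a hypothetical stand-in for AlertActivation with only three fields, and severities are plain strings here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeActivation:
    """Hypothetical stand-in for mlrun's AlertActivation, reduced to three fields."""

    id: int
    severity: str
    entity_id: str


def group_by(activations, *attributes):
    grouped = defaultdict(list)
    for activation in activations:
        values = tuple(getattr(activation, attr) for attr in attributes)
        # One attribute: use the bare value as the key. Several: use the tuple.
        key = values[0] if len(values) == 1 else values
        grouped[key].append(activation)
    return dict(grouped)


sample = [
    FakeActivation(6, "medium", "test-func-handler.c93108a238bf4e3196780e352de9561a"),
    FakeActivation(5, "high", "test-func-handler.ed6094787c8c4817bd0c0a34c76fd004"),
    FakeActivation(4, "low", "test-func-handler.db80cba0c4be4ee9b86a09cf12a89991"),
    FakeActivation(3, "low", "test-func-handler.db80cba0c4be4ee9b86a09cf12a89991"),
    FakeActivation(2, "low", "test-func-handler.58e25426ea154daab2afa4ebfe454c71"),
    FakeActivation(1, "low", "test-func-handler.58e25426ea154daab2afa4ebfe454c71"),
]

by_severity = group_by(sample, "severity")
print([a.id for a in by_severity["low"]])  # [4, 3, 2, 1]

by_severity_and_entity = group_by(sample, "severity", "entity_id")
print(len(by_severity_and_entity))  # 4
```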

Aggregate by

The aggregate_by method groups alert activations by the specified attributes and applies a custom aggregation function to each group. It returns a dictionary whose keys are the attribute values of each group (a tuple such as ("project1", "high") when grouping by several attributes) and whose values are the results of the aggregation function, such as counts or sums over the grouped activations. This is useful for summarizing data, for example counting alerts per project or totaling events per severity.

# use aggregate_by to group alert activations by severity and compute the total number of events for each severity level:
aggregated = run_db.list_alert_activations().aggregate_by(
    ["severity"],
    lambda activations: sum(activation.number_of_events for activation in activations),
)
aggregated

{medium: 1, high: 1, low: 4}
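Any callable that takes a list of activations can serve as the aggregation function. The sketch below imitates the behavior in plain Python to show two such functions side by side: len for counting activations and a sum over number_of_events. The stand-in class FakeActivation, the local aggregate_by function, and the sample data are all hypothetical and are not MLRun's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeActivation:
    """Hypothetical stand-in for mlrun's AlertActivation, reduced to three fields."""

    project: str
    severity: str
    number_of_events: int


def aggregate_by(activations, attributes, aggregation_function):
    groups = defaultdict(list)
    for activation in activations:
        values = tuple(getattr(activation, attr) for attr in attributes)
        # One attribute: bare value as the key. Several: a tuple of values.
        key = values[0] if len(values) == 1 else values
        groups[key].append(activation)
    # Apply the aggregation function to each group's list of activations.
    return {key: aggregation_function(members) for key, members in groups.items()}


# Made-up sample data for illustration only.
sample = [
    FakeActivation("project1", "high", 2),
    FakeActivation("project1", "high", 3),
    FakeActivation("project1", "low", 1),
    FakeActivation("project2", "high", 1),
]

# Count activations per (project, severity) pair.
counts = aggregate_by(sample, ["project", "severity"], len)
print(counts[("project1", "high")])  # 2

# Total number of events per severity.
events = aggregate_by(
    sample, ["severity"], lambda group: sum(a.number_of_events for a in group)
)
print(events)  # {'high': 6, 'low': 1}
```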