Server metrics
MLRun collects anonymized system-size statistics, such as project counts, artifact counts, run activity, and serving endpoints, and exports them to Prometheus via OpenTelemetry.
Metrics description
Every metric carries a system_id attribute (the MLRun installation UUID). Project-scoped metrics additionally carry a project attribute that holds the project name.
| Metric name | Attributes | Meaning |
|---|---|---|
| mlrun_projects | system_id | Current number of projects in the installation |
| mlrun_functions | system_id, project, kind ∈ {job, serving, application, dask, mpijob, spark, nuclio, …} | Current number of functions of a given kind in the project |
| mlrun_workflows | system_id, project | Current number of workflow definitions in the project |
| mlrun_artifacts | system_id, project, kind ∈ {model, dataset, document, llm_prompt, other} | Current number of artifacts of a given kind in the project |
| mlrun_runs | system_id, project, state ∈ {running, completed, failed, aborted} | Current number of runs in the project in each state (snapshot view) |
| mlrun_pipeline_executions | system_id, project, state ∈ {running, completed, failed, aborted} | Current number of pipeline executions in the project in each state |
| mlrun_alert_configurations | system_id, project | Current number of alert configurations in the project |
| mlrun_alert_activations | system_id, project | Current number of active alert activations in the project |
| mlrun_model_endpoints | system_id, project, kind ∈ {realtime, batch} | Current number of registered model endpoints of a given kind. Consolidates the original separate realtime_endpoints / batch_endpoints metrics via the kind attribute. |
| mlrun_model_monitoring_applications | system_id, project | Current number of model-monitoring applications in the project |
Example output
mlrun_projects{system_id="f3a2b1c4d5e6f7a8"} 5
mlrun_artifacts{system_id="f3a2b1c4d5e6f7a8", project="name1", kind="model"} 8
mlrun_artifacts{system_id="f3a2b1c4d5e6f7a8", project="name2", kind="dataset"} 34
mlrun_artifacts{system_id="f3a2b1c4d5e6f7a8", project="name3", kind="other"} 1
mlrun_runs{system_id="f3a2b1c4d5e6f7a8", project="name4", state="completed"} 120
mlrun_runs{system_id="f3a2b1c4d5e6f7a8", project="name5", state="failed"} 3
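Sample lines in this shape can be post-processed with a few lines of Python. The following is a minimal sketch, not part of MLRun: the helper names `parse_sample` and `total_by_label` are hypothetical, and the regular expressions only cover the simple `name{label="value", ...} number` form shown above (no escaped quotes, timestamps, or comment lines).

```python
import re

# One sample line: metric name, optional {labels}, then a numeric value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# One label pair such as project="name1" (escaped quotes are not handled).
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a metric sample: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def total_by_label(lines, metric, label):
    """Sum the values of one metric, grouped by the value of one label."""
    totals = {}
    for line in lines:
        name, labels, value = parse_sample(line)
        if name == metric and label in labels:
            key = labels[label]
            totals[key] = totals.get(key, 0.0) + value
    return totals


# Lines copied from the example output above.
SAMPLE = [
    'mlrun_projects{system_id="f3a2b1c4d5e6f7a8"} 5',
    'mlrun_artifacts{system_id="f3a2b1c4d5e6f7a8", project="name1", kind="model"} 8',
    'mlrun_artifacts{system_id="f3a2b1c4d5e6f7a8", project="name2", kind="dataset"} 34',
]

print(total_by_label(SAMPLE, "mlrun_artifacts", "kind"))
```

In practice the aggregation is usually left to Prometheus itself; a parser like this is mainly useful for quick checks against raw scrape output.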
Example PromQL views
PromQL (Prometheus Query Language) is the language used to select and aggregate time series data in real time. Typical queries look like:
# Total artifacts across the system right now
sum(mlrun_artifacts)
# Top 10 projects by artifact count
topk(10, sum by (project) (mlrun_artifacts))
# Project count trend (sample every hour over the last 7d)
mlrun_projects[7d:1h]
# Net artifact change over the last 24h
delta(sum(mlrun_artifacts)[24h:])
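Queries like these can also be issued programmatically through the Prometheus HTTP API (`/api/v1/query` for instant queries). The sketch below uses only the Python standard library. The server address `http://localhost:9090` is an assumption to replace with the Prometheus instance that scrapes MLRun, and the helper names are hypothetical. `extract_values` handles instant-vector results only, so a range expression such as `mlrun_projects[7d:1h]`, which returns a matrix, would need different handling.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a Prometheus server that scrapes MLRun is reachable at this address.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def extract_values(response):
    """Return (labels, value) pairs from an instant-vector API response."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error')}")
    # Each vector item looks like {"metric": {...}, "value": [timestamp, "number"]}.
    return [
        (item["metric"], float(item["value"][1]))
        for item in response["data"]["result"]
    ]


def run_query(promql, base_url=PROMETHEUS_URL):
    """Send the query to Prometheus and decode the JSON reply."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as reply:
        return extract_values(json.load(reply))
```

For example, `run_query("topk(10, sum by (project) (mlrun_artifacts))")` would return one `(labels, value)` pair per project, with the project name available as `labels["project"]`.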
Configure metrics
OpenTelemetry metrics are configured in config.py. To change the configuration, apply a configmap.yaml to the mlrun service.
Disable/enable OpenTelemetry
Metrics collection is enabled by default. To disable it:
MLRUN_TELEMETRY__ENABLED=false
To enable metrics collection again:
MLRUN_TELEMETRY__ENABLED=true
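The variable name suggests how it maps onto the configuration: the `MLRUN_` prefix marks an MLRun setting, and the double underscore separates nesting levels, so `MLRUN_TELEMETRY__ENABLED` would correspond to a `telemetry.enabled` key. The sketch below illustrates that convention under the assumption that values are JSON-decoded where possible. It is a simplified illustration, not MLRun's actual loading code, and `env_to_config` is a hypothetical helper.

```python
import json

ENV_PREFIX = "MLRUN_"


def env_to_config(env):
    """Turn MLRUN_* variables into a nested dict, splitting keys on '__'."""
    config = {}
    for key, raw in env.items():
        if not key.startswith(ENV_PREFIX):
            continue  # ignore unrelated environment variables
        try:
            value = json.loads(raw)  # "false" -> False, "3" -> 3
        except ValueError:
            value = raw  # keep plain strings as they are
        *parents, leaf = key[len(ENV_PREFIX):].lower().split("__")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


print(env_to_config({"MLRUN_TELEMETRY__ENABLED": "false"}))
# prints {'telemetry': {'enabled': False}}
```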