[watcher] Using Watcher without standard OpenStack telemetry projects
Hey folks!

I am looking at Watcher for a possible integration with an OpenStack cloud where:

- The OpenStack cloud does not have Gnocchi/Ceilometer or the STF from Red Hat.
- I am able to get standard libvirt metrics from a compute.

Before going too deep, if we assume that we can expose metrics using an OpenMetrics/Prometheus format, can Watcher be fed these metrics?

Thanks!
On 08/08/2025 03:40, Laurent Dumont wrote:
Hey folks!
I am looking at Watcher for a possible integration with an OpenStack cloud where:

* The OpenStack cloud does not have Gnocchi/Ceilometer or the STF from Red Hat.
* I am able to get standard libvirt metrics from a compute.

Before going too deep, if we assume that we can expose metrics using an OpenMetrics/Prometheus format, can Watcher be fed these metrics?
tl;dr: yes, in theory this should be possible, but not out of the box.

Yes, the current Prometheus data source for Watcher uses node-exporter metrics for host metrics, so any strategy that only uses node metrics will "just work". For instance metrics, you should be able to use the libvirt exporter on the compute nodes. The only thing you'll need to figure out is how to set the correct labels and/or names so they match the values Watcher expects from Ceilometer.

It's important to note that Watcher doesn't perform any metric collection or storage itself. It only consumes metrics from a metric store such as Gnocchi or Prometheus. The actual monitoring and metric storage are outside of Watcher's scope; it acts as a client of a metrics service, so you can't feed metrics to it directly.

---

### Prometheus and Instance Metrics

Using Prometheus with instance metrics that are not from Ceilometer is technically not a tested or supported configuration today, but it's something I'm open to in the future if there's interest. Originally, when we were designing this, we discussed using the libvirt exporter, but it was not planned to be in our downstream product and the Ceilometer metrics were, so we started with what we knew worked. That doesn't mean we can't enhance the upstream version to be more flexible and support data from other native Prometheus exporters, if folks are interested in that and are willing to help test or develop it.

The libvirt domain's UUID is set to the Nova instance UUID. We did this many years ago to allow collectd, and later Nagios et al., to use their libvirt plugins to generate metrics as part of the OPNFV Barometer project. STF benefited from that, but the libvirt exporter should also be able to extract that field and use it as a label. We also encode things like user and project information in the Nova metadata in the domain XML. Watcher doesn't currently use the project info, just the instance UUID to identify the instance. As long as that label is present on the relevant metrics, and the metric names for CPU, RAM, disk, etc. match, it should work.

In some backends, Watcher supported a file for configuring a mapping between the "standard" or expected metric name and what is actually in the data source. I believe this was only supported in the Grafana data source, but I could be wrong; we haven't added support for this to the Prometheus data source yet.

https://specs.openstack.org/openstack/watcher-specs/specs/train/implemented/file-based-metricmap.html

You can see the expected names today here: https://github.com/openstack/watcher/blob/master/watcher/decision_engine/dat...

```python
METRIC_MAP = dict(host_cpu_usage='node_cpu_seconds_total',
                  host_ram_usage='node_memory_MemAvailable_bytes',
                  host_outlet_temp=None,
                  host_inlet_temp=None,
                  host_airflow=None,
                  host_power=None,
                  instance_cpu_usage='ceilometer_cpu',
                  instance_ram_usage='ceilometer_memory_usage',
                  instance_ram_allocated='instance.memory',
                  instance_l3_cache_usage=None,
                  instance_root_disk_size='instance.disk',
                  )
```

The entries prefixed by "instance." actually come from the instance object in our data model, which in the case of memory and disk size are just the flavor values. So today we only use two metrics from Ceilometer: `ceilometer_cpu` and `ceilometer_memory_usage`.
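Before pointing Watcher at such a setup, it may be worth checking that the metrics are actually exposed under the names in METRIC_MAP and carry an instance-UUID label. Below is a minimal sketch (not part of Watcher) that does that via the Prometheus HTTP API; the endpoint URL is a placeholder, and `resource` is assumed to be the instance-UUID label, which is Watcher's default as noted in the configuration section further down:

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # placeholder endpoint
INSTANCE_UUID_LABEL = "resource"  # Watcher's default label for the Nova instance UUID

# Names Watcher's Prometheus data source expects, per METRIC_MAP above.
HOST_METRICS = ("node_cpu_seconds_total", "node_memory_MemAvailable_bytes")
INSTANCE_METRICS = ("ceilometer_cpu", "ceilometer_memory_usage")


def has_series(selector):
    """Return True if an instant query for the selector returns any series."""
    resp = requests.get(PROMETHEUS_URL + "/api/v1/query",
                        params={"query": selector}, timeout=10)
    resp.raise_for_status()
    return bool(resp.json()["data"]["result"])


for metric in HOST_METRICS:
    print(metric, "found" if has_series(metric) else "MISSING")

for metric in INSTANCE_METRICS:
    selector = '%s{%s!=""}' % (metric, INSTANCE_UUID_LABEL)
    print(metric, "found" if has_series(selector) else "MISSING or unlabelled")
```

If the instance metrics come back missing, Prometheus relabelling or recording rules applied to the libvirt exporter's output are the usual way to republish them under the expected names and labels.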
---

### Data Model and Queries

At the data model level, `instance_cpu_usage` is "cpu usage as float ranging between 0 and 100 representing the total cpu usage as percentage" and `instance_ram_usage` is "ram usage as float in megabytes":

https://github.com/openstack/watcher/blob/master/watcher/decision_engine/dat...

However, that is not how they are stored in Prometheus or generated by Ceilometer, so our queries, which are not configurable, have to do some normalization:

https://github.com/openstack/watcher/blob/master/watcher/decision_engine/dat...

`ceilometer_cpu` is the total cumulative CPU time (in nanoseconds) over the time period we requested. To convert that to a float from 0 to 100, we have to do this:

```python
query_args = (
    "clamp_max((%(agg)s by (%(label)s)"
    "(rate(%(meter)s{%(label)s='%(label_value)s'}[%(period)ss]))"
    "/10e+8) *(100/%(vcpus)s), 100)"
    % {'label': uuid_label_key, 'label_value': instance_label,
       'agg': aggregate, 'meter': meter, 'period': period,
       'vcpus': vcpus}
)
```

The period is 300 seconds:

https://github.com/openstack/watcher/blob/master/watcher/decision_engine/dat...

What we are asking Prometheus to do is take the mean of the CPU time in nanoseconds used over the last 5 minutes, divide it by the number of vCPUs the guest has (since the reported time is the sum across all vCPUs), and then normalize it to the 0-100 range. So if you can export the total CPU time used over a given period, in nanoseconds, for all vCPUs of a guest from the libvirt exporter under the name `ceilometer_cpu`, it would "just work". In reality, we would likely need to make this a little more flexible.

For instance RAM usage, the approach is similar:

https://github.com/openstack/watcher/blob/master/watcher/decision_engine/dat...

We get the mean over the last 5 minutes, but the query is a lot simpler:

```python
elif meter == 'ceilometer_memory_usage':
    query_args = (
        "%s_over_time(%s{%s='%s'}[%ss])" %
        (aggregate, meter, uuid_label_key, instance_label, period)
    )
```

`ceilometer_memory_usage` is already in the correct unit, so we just ask Prometheus to return the mean over the period directly, with no extra processing.
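To make those two queries concrete, here is a small illustrative sketch (not Watcher code) that fills in the same templates with made-up values; the instance UUID, vCPU count, and aggregate below are hypothetical, and `resource` is the default instance-UUID label:

```python
# Illustrative only: render the CPU and RAM query templates quoted above
# with example values to see the PromQL that Watcher would send.
uuid_label_key = "resource"                              # default instance-UUID label
instance_label = "11111111-2222-3333-4444-555555555555"  # hypothetical Nova instance UUID
aggregate = "avg"                                        # take the mean over the window
period = 300                                             # seconds (Watcher's default)
vcpus = 4                                                # from the instance's flavor

cpu_query = (
    "clamp_max((%(agg)s by (%(label)s)"
    "(rate(%(meter)s{%(label)s='%(label_value)s'}[%(period)ss]))"
    "/10e+8) *(100/%(vcpus)s), 100)"
    % {'label': uuid_label_key, 'label_value': instance_label,
       'agg': aggregate, 'meter': 'ceilometer_cpu',
       'period': period, 'vcpus': vcpus}
)
ram_query = (
    "%s_over_time(%s{%s='%s'}[%ss])"
    % (aggregate, 'ceilometer_memory_usage',
       uuid_label_key, instance_label, period)
)

# rate() turns the cumulative nanoseconds of CPU time into ns/s; dividing by
# 10e+8 (i.e. 1e9) gives CPU-seconds per second, and *100/vcpus scales that
# to an average utilisation percentage across the guest's vCPUs, capped at 100.
print(cpu_query)
# ceilometer_memory_usage is already reported in megabytes, so the mean over
# the window needs no further scaling.
print(ram_query)
```

Running this prints the exact PromQL strings, which can be pasted into the Prometheus UI to check that they return sensible values for a test instance.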
---

### Configuration and Future Plans

We do have some level of configurability:

https://github.com/openstack/watcher/blob/master/watcher/conf/prometheus_cli...

If your Prometheus uses a label other than `fqdn` to store the node's FQDN/hostname on the node_exporter metrics, you can change it (it should match whatever Nova reports for `hypervisor_hostname` and `host`). Ceilometer puts the instance UUID in a label called "resource"; you can easily change that to whatever label the libvirt exporter uses.

We may eventually consider adding support for something like https://specs.openstack.org/openstack/watcher-specs/specs/train/implemented/... , taking a default-in-code approach so that you could override the query and metric mappings, but that is not on our short-term plan. If folks want to discuss this more, we can do so on the list, or it could be an interesting PTG topic. A smaller feature might simply be a flag that says "use libvirt exporter metrics" instead of a generic file-based approach, so if folks wanted to work on that, I'm open to reviewing a proposal. At the moment we are focused on wrapping up the work we had in flight for 2025.2, but it's not long until the 2026.1 cycle starts, so it's a good time to get feedback from folks who are interested in Watcher.

So, in summary: I suspect that strategies that do not require metrics, or that only use host metrics, will just work; the ones that use instance metrics will be trickier to make work. If you really want to use Watcher without Ceilometer, in the long run we would need to enhance the Prometheus/Aetos plugin to work with the libvirt exporter metrics natively. That should not be hard to do; it's just not currently a planned feature.

regards,
sean.