On 29/08/20 2:48 am, Zane Bitter wrote:
I think a large part of the issue here is that there are multiple reasons for wanting (small-t) telemetry from OpenStack, and for historical reasons they have all been conflated into one Thing, with the result that sometimes one use case wins out over the others. At least 3 that I can think of are:
1) Monitoring the OpenStack infrastructure by the operator, including feeding into business processes like reporting, capacity planning &c.
2) Billing
3) Monitoring user resources by the user/application, either directly or via other OpenStack services like Heat or Senlin.
For the first, you just want to be able to dump data into a TSDB of the operator's choice. Since all of the reporting requirements are business-specific anyway, it's up to the operator to decide how they want to store the data and how they want to interact with it. It appears that this may have been the theory behind the Gnocchi split.
On the other hand, for the third one you really need something that should be an official OpenStack API with all of the attendant stability guarantees, because it is part of OpenStack's user interface.
The second lands somewhere in between; AIUI CloudKitty is written to support multiple back-ends, with OpenStack Telemetry being the primary one. So it needs a fairly stable API because it's consumed by other OpenStack projects, but it's ultimately operator-facing.
As I have argued before, when we are thinking about road maps we need to treat these as different use cases, and they're different enough that they are probably best served by at least two separate tools.
Mohammed has made a compelling argument in the past that Prometheus is more or less the industry standard for the first use case, and we should just export metrics to that directly in the OpenStack services, rather than going through the Ceilometer collector.
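To make the "export directly" idea concrete, here is a minimal sketch of what a service-side /metrics endpoint would serve, in the Prometheus text exposition format that Prometheus scrapes. The metric names here are hypothetical, not actual OpenStack metrics; a real service would of course maintain these counters itself.

```python
# Sketch: rendering metrics in the Prometheus text exposition format,
# as an OpenStack service would serve them on a /metrics endpoint for
# Prometheus to scrape, instead of routing samples through the
# Ceilometer collector. Metric names below are made up for illustration.

def render_metric(name, help_text, mtype, samples):
    """Render one metric family in Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines)

# A service would aggregate its own counters and expose them like this:
body = "\n".join([
    render_metric("example_api_requests_total",
                  "Total API requests handled.", "counter",
                  [({"method": "GET"}, 42), ({"method": "POST"}, 7)]),
    render_metric("example_instances_running",
                  "Instances currently running.", "gauge",
                  [({}, 13)]),
])
print(body)
```

The appeal of this model for use case 1 is that the operator's TSDB (Prometheus, or anything that speaks its scrape format) pulls data on its own schedule, and the service keeps no history at all.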
I don't know what should be done about the third, but I do know that currently Telemetry is breaking Heat's gate, and people are seriously discussing disabling the Telemetry-related tests, which I assume would mean deprecating the resources. Monasca offers an alternative, but isn't preferred by some distributors and operators because it brings the whole Java ecosystem along for the ride (managing the Python one is already hard enough).
cheers, Zane.
You are totally right about the three use cases, and we need to address this as we move forward with Not-Gnocchi and the rest of Telemetry. Internally we've never used OS-Telemetry for case 1, but we do use it for cases 2 and 3. I do think having a stable API in OpenStack for those last two cases is worth it, and I don't think merging them is too hard.

The way CloudKitty (and our thing, Distil) process the data for billing means we don't need to store months of data in the telemetry system, because we ingest and aggregate into our own systems. The third use case doesn't need much long-term data at a high level of granularity, but does (like billing) need high accuracy closer to 'now'. So again I think those line up well enough to fit into a single system, perhaps with different granularity on specific metrics. Ideally we should try to fix the Telemetry Heat tests, because there are people using Aodh and auto-scaling.

As for case 1, I agree that trying to encourage Prometheus support in OpenStack is a good aim. Sadly, supporting it directly in each service likely won't be easy, but Ceilometer already supports pushing to it, so that's good enough for now: https://github.com/openstack/ceilometer/blob/master/ceilometer/publisher/pro...

We do need a more coherent future plan for Telemetry in OpenStack, but the starting point is stabilizing and consolidating before we try to steer in a new direction.
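For anyone wanting to try the existing Ceilometer route in the meantime, the prometheus publisher is wired up as a sink in pipeline.yaml. This is a rough sketch only; the host/port and meter selection are assumptions for illustration, and the publisher pushes samples out (Pushgateway-style) rather than being scraped, so check the Ceilometer docs for your release.

```yaml
# Sketch of a ceilometer pipeline.yaml sink using the prometheus
# publisher. The endpoint shown is a placeholder, not a default.
sources:
    - name: meter_source
      meters:
          - "*"
      sinks:
          - prometheus_sink
sinks:
    - name: prometheus_sink
      publishers:
          - prometheus://localhost:9091/metrics
```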