[tc][telemetry][gnocchi] The future of Gnocchi in OpenStack
adriant at catalystcloud.nz
Fri Aug 28 22:21:48 UTC 2020
On 29/08/20 2:48 am, Zane Bitter wrote:
> I think a large part of the issue here is that there are multiple
> reasons for wanting (small-t) telemetry from OpenStack, and
> historically because of reasons they have all been conflated into one
> Thing with the result that sometimes one use case wins. At least 3
> that I can think of are:
> 1) Monitoring the OpenStack infrastructure by the operator, including
> feeding into business processes like reporting, capacity planning &c.
> 2) Billing
> 3) Monitoring user resources by the user/application, either directly
> or via other OpenStack services like Heat or Senlin.
> For the first, you just want to be able to dump data into a TSDB of
> the operator's choice. Since all of the reporting requirements are
> business-specific anyway, it's up to the operator to decide how they
> want to store the data and how they want to interact with it. It
> appears that this may have been the theory behind the Gnocchi split.
> On the other hand, for the third one you really need something that
> should be an official OpenStack API with all of the attendant
> stability guarantees, because it is part of OpenStack's user interface.
> The second lands somewhere in between; AIUI CloudKitty is written to
> support multiple back-ends, with OpenStack Telemetry being the primary
> one. So it needs a fairly stable API because it's consumed by other
> OpenStack projects, but it's ultimately operator-facing.
> As I have argued before, when we are thinking about road maps we need
> to think of these as different use cases, and they're different enough
> that they are probably best served by at least two separate tools.
> Mohammed has made a compelling argument in the past that Prometheus is
> more or less the industry standard for the first use case, and we
> should just export metrics to that directly in the OpenStack services,
> rather than going through the Ceilometer collector.
> I don't know what should be done about the third, but I do know that
> currently Telemetry is breaking Heat's gate and people are seriously
> discussing disabling the Telemetry-related tests, which I assume would
> mean deprecating the resources. Monasca offers an alternative, but
> isn't preferred for some distributors and operators because it brings
> the whole Java ecosystem along for the ride (managing the Python one
> is already hard enough).
You are totally right about the three use cases, and we need to address
this as we move forward with Not-Gnocchi and the rest of Telemetry.
Internally we've never used OS-Telemetry for case 1, but we do use it
for cases 2 and 3.
I do think having a stable OpenStack API for those last two cases is
worth it, and I don't think merging them together is too hard. The way
CloudKitty (and our own tool, Distil) processes the data for billing
means we don't need to store months of data in the telemetry system,
because we ingest and aggregate it into our own systems.
The third use case doesn't need much long-term data at a high level of
granularity, but does (like billing) need high accuracy closer to 'now'.
So again I think those line up well enough to fit into a single system,
perhaps with different granularity on specific metrics.
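As a rough illustration of "fine-grained near now, coarser further
back", here is a minimal Python sketch. It is not how Gnocchi,
Ceilometer, CloudKitty or Distil actually implement this; the tier
values and the names POLICY, rollup and apply_policy are made up for
the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tiers: (granularity in seconds, retention in seconds).
POLICY = [
    (60, 3600),     # 1-minute points, kept for the last hour
    (3600, 86400),  # 1-hour points, kept for the last day
]


def rollup(samples, granularity, retention, now):
    """Average (timestamp, value) samples into fixed-width buckets.

    Samples older than `retention` seconds before `now` are dropped.
    Each bucket is keyed by its start timestamp.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        if now - ts > retention:
            continue  # outside this tier's retention window
        buckets[ts - ts % granularity].append(value)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}


def apply_policy(samples, now, policy=POLICY):
    """Return one rolled-up series per tier, keyed by granularity."""
    return {g: rollup(samples, g, r, now) for g, r in policy}
```

With something like this, an alarm or autoscaling consumer would read
the fine-grained tier, while a billing system would ingest the coarser
one and keep its own long-term aggregates.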
Ideally we should try to fix the Telemetry Heat tests, because there
are people using Aodh and auto-scaling.
As for case 1, I agree that encouraging Prometheus support in
OpenStack is a good aim. Sadly, though, supporting it directly in each
service likely won't be easy, but Ceilometer already supports
pushing to it, so that's good enough for now:
We do need a more coherent future plan for Telemetry in OpenStack, but
the starting point is stabilizing and consolidating before we try and
steer in a new direction.