[tc][telemetry][gnocchi] The future of Gnocchi in OpenStack
Hey OpenStackers,
We're currently discussing what to do about OpenStack's reliance on Gnocchi, and at present it looks most likely that we will fork it back under a new name (currently Farfalle, to stick with the pasta theme).
The discussion is mostly happening here: https://review.opendev.org/#/c/744592/
But for those running Gnocchi in prod, this is likely something you will want to know about, and we'd like to hear from you.
A bit of history: Gnocchi started off as a new backend for Ceilometer in OpenStack, and eventually became the de facto API for telemetry samples when that was removed from Ceilometer (as backed by MongoDB). Gnocchi was eventually spun off outside of OpenStack, but still essentially remained our API for telemetry despite not being an official part of OpenStack anymore.
Since then, development around it seems to have stalled, with pull requests left unreviewed, CI broken, and even the domain for the docs lapsing once. The maintainers have essentially said themselves that the project is unmaintained: https://github.com/gnocchixyz/gnocchi/issues/1049
Given that OpenStack telemetry relies on it, we needed to decide what to do. We tried talking to the devs who spun it off outside of OpenStack, but they seem disinclined to interact with the OpenStack community or to move the project back to our infra/governance, despite OpenStack looking like the only consumer of Gnocchi as a project. We want to find a solution, and the feeling is that they don't.
So we've opted to fork it back, and now the discussion is how to approach that fork. The OpenStack community doesn't want to maintain a time series database, but our telemetry API is part of it. We are putting it under a non-OpenStack namespace to start, but we need to decide what the long-term place for it should be. Do we want to make it an official project again? Do we keep it just as an API and drop the time series DB part for another DB? Do we build a new API back into Ceilometer and switch to a different backend like InfluxDB?
We don't know yet, and we want some input from people who use the service so we can hopefully work with OpenStack telemetry as a whole and figure out what the long-term picture is.
If Gnocchi matters to you at all, or you use it, we want to hear from you.
Cheers,
Adrian Turjak
On 28/08/20 8:36 am, Adrian Turjak wrote:
Hey OpenStackers,
We're currently in the process of discussing what to do with OpenStack's reliance on Gnocchi, and at present it is looking like we are most likely to just fork it back under a new name (currently Farfalle to stick with the pasta theme).
The discussion is mostly happening here: https://review.opendev.org/#/c/744592/
But for those running Gnocchi in prod, this is likely something you may want to know about and we'd like to hear from you.
A bit of history: Gnocchi started off as a new backend for Ceilometer in OpenStack, and eventually became the de facto API for telemetry samples when that was removed from Ceilometer (as backed by MongoDB). Gnocchi was eventually spun off outside of OpenStack, but still essentially remained our API for telemetry despite not being an official part of OpenStack anymore.
I think a large part of the issue here is that there are multiple reasons for wanting (small-t) telemetry from OpenStack, and historically because of reasons they have all been conflated into one Thing, with the result that sometimes one use case wins. At least 3 that I can think of are:
1) Monitoring the OpenStack infrastructure by the operator, including feeding into business processes like reporting, capacity planning &c.
2) Billing
3) Monitoring user resources by the user/application, either directly or via other OpenStack services like Heat or Senlin.
For the first, you just want to be able to dump data into a TSDB of the operator's choice. Since all of the reporting requirements are business-specific anyway, it's up to the operator to decide how they want to store the data and how they want to interact with it. It appears that this may have been the theory behind the Gnocchi split.
On the other hand, for the third one you really need an official OpenStack API with all of the attendant stability guarantees, because it is part of OpenStack's user interface.
The second lands somewhere in between; AIUI CloudKitty is written to support multiple back-ends, with OpenStack Telemetry being the primary one. So it needs a fairly stable API because it's consumed by other OpenStack projects, but it's ultimately operator-facing.
As I have argued before, when we are thinking about road maps we need to think of these as different use cases, and they're different enough that they are probably best served by at least two separate tools.
Mohammed has made a compelling argument in the past that Prometheus is more or less the industry standard for the first use case, and we should just export metrics to that directly in the OpenStack services, rather than going through the Ceilometer collector.
I don't know what should be done about the third, but I do know that currently Telemetry is breaking Heat's gate and people are seriously discussing disabling the Telemetry-related tests, which I assume would mean deprecating the resources. Monasca offers an alternative, but isn't preferred by some distributors and operators because it brings the whole Java ecosystem along for the ride (managing the Python one is already hard enough).
cheers,
Zane.
On 29/08/20 2:48 am, Zane Bitter wrote:
I think a large part of the issue here is that there are multiple reasons for wanting (small-t) telemetry from OpenStack, and historically because of reasons they have all been conflated into one Thing with the result that sometimes one use case wins. At least 3 that I can think of are:
1) Monitoring the OpenStack infrastructure by the operator, including feeding into business processes like reporting, capacity planning &c.
2) Billing
3) Monitoring user resources by the user/application, either directly or via other OpenStack services like Heat or Senlin.
For the first, you just want to be able to dump data into a TSDB of the operator's choice. Since all of the reporting requirements are business-specific anyway, it's up to the operator to decide how they want to store the data and how they want to interact with it. It appears that this may have been the theory behind the Gnocchi split.
On the other hand, for the third one you really need something that should be an official OpenStack API with all of the attendant stability guarantees, because it is part of OpenStack's user interface.
The second lands somewhere in between; AIUI CloudKitty is written to support multiple back-ends, with OpenStack Telemetry being the primary one. So it needs a fairly stable API because it's consumed by other OpenStack projects, but it's ultimately operator-facing.
As I have argued before, when we are thinking about road maps we need to think of these as different use cases, and they're different enough that they are probably best served by at least two separate tools.
Mohammed has made a compelling argument in the past that Prometheus is more or less the industry standard for the first use case, and we should just export metrics to that directly in the OpenStack services, rather than going through the Ceilometer collector.
I don't know what should be done about the third, but I do know that currently Telemetry is breaking Heat's gate and people are seriously discussing disabling the Telemetry-related tests, which I assume would mean deprecating the resources. Monasca offers an alternative, but isn't preferred for some distributors and operators because it brings the whole Java ecosystem along for the ride (managing the Python one is already hard enough).
cheers, Zane.
You are totally right about the three use cases, and we need to address this as we move forward with Not-Gnocchi and the rest of Telemetry.
Internally we've never used OS-Telemetry for case 1, but we do use it for cases 2 and 3. I do think having a stable API for OpenStack for those last two cases is worth it, and I don't think merging those together is too hard. The way CloudKitty (and our thing Distil) processes the data for billing means we don't need to store months of data in the telemetry system, because we ingest and aggregate into our own systems. The third use case doesn't need much long-term data at a high level of granularity, but does (like billing) need high accuracy closer to 'now'. So again I think those line up well to fit into a single system, with maybe different granularity on specific metrics.
We should ideally try to fix the telemetry Heat tests, because there are people using Aodh and auto-scaling.
As for case 1, I agree that trying to encourage Prometheus support in OpenStack is a good aim. Sadly, supporting it directly in each service likely won't be too easy, but Ceilometer already supports pushing to it, so that's good enough for now: https://github.com/openstack/ceilometer/blob/master/ceilometer/publisher/pro...
We do need a more coherent future plan for Telemetry in OpenStack, but the starting point is stabilizing and consolidating before we try to steer in a new direction.
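To make the "single system, different granularity per metric" idea concrete, here is a rough sketch of rolling the same raw measures up at two granularities and keeping only a short window of the fine-grained one. This is an illustration only, not how Gnocchi or any fork implements its archive policies; the function names, bucket sizes and retention window are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def rollup(measures, granularity_seconds):
    """Average (unix_timestamp, value) measures into fixed-width buckets.

    Returns a list of (bucket_start, mean_value) sorted by bucket start.
    """
    buckets = defaultdict(list)
    for timestamp, value in measures:
        bucket_start = int(timestamp // granularity_seconds) * granularity_seconds
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def apply_retention(points, now, max_age_seconds):
    """Keep only points whose bucket start is within max_age_seconds of now."""
    return [(ts, value) for ts, value in points if now - ts <= max_age_seconds]


if __name__ == "__main__":
    raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0), (300, 10.0)]
    # Fine-grained series, kept only for a short window close to 'now'
    # (the kind of data alarming and auto-scaling would read).
    recent = apply_retention(rollup(raw, 60), now=330, max_age_seconds=300)
    # Coarse series, cheap enough to keep until billing has ingested it.
    coarse = rollup(raw, 300)
    print(recent)
    print(coarse)
```

The point of the sketch is only that both consumers can read from one store: the short-lived fine series serves near-real-time use, and the coarse series covers the period until an external billing system has aggregated the data itself.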
On Fri, 28 Aug 2020 at 15:40, Adrian Turjak <adriant@catalystcloud.nz> wrote:
But for those running Gnocchi in prod, this is likely something you may want to know about and we'd like to hear from you.
Hello, everyone!
Here at Selectel we use Gnocchi as a backend for Ceilometer: we gather different metrics from virtual machines and provide our customers with graphs in a control panel. In this scenario we rely on Gnocchi's Keystone auth support and the nearly standard mappings for instances, volumes, ports, etc. provided out of the box.
We also use Gnocchi as a secondary target for our home-grown billing system. Billing measures are gathered from different OpenStack and custom APIs, go through the charging engine and are then POSTed to the Gnocchi API in batches. Here again we need to be able to fetch measures with project- and domain-scoped tokens on the customer side in the control panel, so that we can separate scopes for resellers (domain owners) and their clients (project owners).
The third way we consume the Gnocchi API is through OpenStack Watcher, in its strategy for balancing load in our regions. Here we use host metrics as well as virtual machine metrics.
What we like in Gnocchi:
- API is clean and easy to use, and the object model is universal, which lets us use it in different scenarios;
- Fast enough for our use cases;
- Can store metrics for a long period of time with a Ceph backend with no performance penalty, which is useful in the billing case.
What we do not like:
- server-side aggregations do not work as one might think they should work; the API and CLI are very hard to use, and we stopped trying to use them;
- very CPU and disk IO intensive: platforms are hot like hell 24/7 while processing no more than 1k metrics per second;
- sometimes deadlocks happen in the Redis incoming metrics storage, preventing measures for certain metrics from being processed.
What our plans are for the near future:
- try to switch Watcher to the Grafana backend, to be able to use the same Prometheus metrics we rely on for alerting and capacity planning;
- continue using Gnocchi only for VM metrics, switching the billing system to something more reliable in terms of missed points on graphs.
Speaking about VM metrics, it would probably be great to be able to continue using the Gnocchi API for customer-facing features, as it works well with the OpenStack object model, authentication and everything. But Gnocchi's TSDB is not the best on the market. By switching it to Victoria Metrics, which provides a Prometheus API and works amazingly with Grafana, we would be able to gather and store metrics with node/libvirt exporters and Prometheus doing remote writes to Victoria, and consume them via Grafana/AlertManager or the Gnocchi API depending on the scenario.
--
Ivan Romanko
Selectel
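As a rough illustration of the "POSTed to the Gnocchi API in batches" step described above, the sketch below groups charging results by metric and splits them into fixed-size batches. It assumes the request body shape used by Gnocchi's batch measures endpoint (`POST /v1/batch/metrics/measures`, a JSON object mapping metric IDs to lists of timestamp/value pairs); the metric IDs, batch size and function names are hypothetical, and the actual HTTP call and Keystone token handling are left out.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def build_batch_body(charges):
    """Group (metric_id, datetime, value) tuples into a batch request body.

    The returned dict maps each metric ID to a list of
    {"timestamp": ..., "value": ...} measures with ISO 8601 UTC timestamps.
    """
    body = defaultdict(list)
    for metric_id, when, value in charges:
        body[metric_id].append({
            "timestamp": when.astimezone(timezone.utc).isoformat(),
            "value": float(value),
        })
    return dict(body)


def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Hypothetical metric IDs and values, for illustration only.
    charges = [
        ("metric-a", datetime(2020, 8, 28, 15, 40, tzinfo=timezone.utc), 1),
        ("metric-b", datetime(2020, 8, 28, 15, 40, tzinfo=timezone.utc), 2.5),
        ("metric-a", datetime(2020, 8, 28, 15, 45, tzinfo=timezone.utc), 3),
    ]
    for batch in chunks(charges, 2):
        # Each JSON document would be the body of one batch POST.
        print(json.dumps(build_batch_body(batch)))
```

Batching by a fixed number of measures is just one option; a real charging engine might instead flush on a timer or per billing period.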
participants (3)
- Adrian Turjak
- Zane Bitter
- Иван Романько