[all][infra][qa] Retiring Logstash, Elasticsearch, subunit2sql, and Health
Hello everyone,

Xenial has recently reached the end of its life. Our logstash+kibana+elasticsearch and subunit2sql+health data crunching services all run on Xenial. Even without the distro platform EOL concerns these services are growing old and haven't received the care they need to keep running reliably.

Additionally these services represent a large portion of our resource consumption:

* 6 x 16 vcpu + 60GB RAM + 1TB disk Elasticsearch servers
* 20 x 4 vcpu + 4GB RAM logstash-worker servers
* 1 x 2 vcpu + 2GB RAM logstash/kibana central server
* 2 x 8 vcpu + 8GB RAM subunit-worker servers
* 64GB RAM + 500GB disk subunit2sql trove db server
* 1 x 4 vcpu + 4GB RAM health server

To put things in perspective, they account for more than a quarter of our control plane servers, occupying over a third of our block storage and in excess of half the total memory footprint.

The OpenDev/OpenStack Infra team(s) don't seem to have the time available currently to do the major lifting required to bring these services up to date. I would like to propose that we simply turn them off. All of these services operate off of public data that will not be going away (specifically job log content). If others are interested in taking this on they can hook into this data and run their own processing pipelines.

I am sure not everyone will be happy with this proposal. I get it. I came up with the idea for the elasticsearch job log processing way back at the San Diego summit. I spent many many many hours since working to get it up and running and to keep it running. But pragmatism means that my efforts and the team's efforts are better spent elsewhere.

I am happy to hear feedback on this. Thank you for your time.

Clark
On Mon, May 10, 2021 at 10:34, Clark Boylan <cboylan@sapwetik.org> wrote:
Hello everyone,
Hi,
Xenial has recently reached the end of its life. Our logstash+kibana+elasticsearch and subunit2sql+health data crunching services all run on Xenial. Even without the distro platform EOL concerns these services are growing old and haven't received the care they need to keep running reliably.
Additionally these services represent a large portion of our resource consumption:
* 6 x 16 vcpu + 60GB RAM + 1TB disk Elasticsearch servers
* 20 x 4 vcpu + 4GB RAM logstash-worker servers
* 1 x 2 vcpu + 2GB RAM logstash/kibana central server
* 2 x 8 vcpu + 8GB RAM subunit-worker servers
* 64GB RAM + 500GB disk subunit2sql trove db server
* 1 x 4 vcpu + 4GB RAM health server
To put things in perspective, they account for more than a quarter of our control plane servers, occupying over a third of our block storage and in excess of half the total memory footprint.
The OpenDev/OpenStack Infra team(s) don't seem to have the time available currently to do the major lifting required to bring these services up to date. I would like to propose that we simply turn them off. All of these services operate off of public data that will not be going away (specifically job log content). If others are interested in taking this on they can hook into this data and run their own processing pipelines.
I am sure not everyone will be happy with this proposal. I get it. I came up with the idea for the elasticsearch job log processing way back at the San Diego summit. I spent many many many hours since working to get it up and running and to keep it running. But pragmatism means that my efforts and the team's efforts are better spent elsewhere.
I am happy to hear feedback on this. Thank you for your time.
Thank you, and the whole infra team(s), for the effort to keep the infrastructure alive. I'm an active user of the ELK stack in OpenStack. I use it to figure out whether a particular gate failure I see is just a one-time event or a real failure we need to fix. So I'm sad that this tooling will be shut down, as I think I will lose one of the tools that has helped me keep our Gate healthy. But I understand how busy everybody is these days. I'm not an infra person, but if I can help somehow from the Nova perspective then let me know. (E.g. I can review elastic-recheck signatures if that helps)

Cheers,
gibi
Clark
Le mar. 11 mai 2021 à 09:35, Balazs Gibizer <balazs.gibizer@est.tech> a écrit :
On Mon, May 10, 2021 at 10:34, Clark Boylan <cboylan@sapwetik.org> wrote:
Hello everyone,
Hi,
Xenial has recently reached the end of its life. Our logstash+kibana+elasticsearch and subunit2sql+health data crunching services all run on Xenial. Even without the distro platform EOL concerns these services are growing old and haven't received the care they need to keep running reliably.
Additionally these services represent a large portion of our resource consumption:
* 6 x 16 vcpu + 60GB RAM + 1TB disk Elasticsearch servers
* 20 x 4 vcpu + 4GB RAM logstash-worker servers
* 1 x 2 vcpu + 2GB RAM logstash/kibana central server
* 2 x 8 vcpu + 8GB RAM subunit-worker servers
* 64GB RAM + 500GB disk subunit2sql trove db server
* 1 x 4 vcpu + 4GB RAM health server
To put things in perspective, they account for more than a quarter of our control plane servers, occupying over a third of our block storage and in excess of half the total memory footprint.
The OpenDev/OpenStack Infra team(s) don't seem to have the time available currently to do the major lifting required to bring these services up to date. I would like to propose that we simply turn them off. All of these services operate off of public data that will not be going away (specifically job log content). If others are interested in taking this on they can hook into this data and run their own processing pipelines.
I am sure not everyone will be happy with this proposal. I get it. I came up with the idea for the elasticsearch job log processing way back at the San Diego summit. I spent many many many hours since working to get it up and running and to keep it running. But pragmatism means that my efforts and the team's efforts are better spent elsewhere.
I am happy to hear feedback on this. Thank you for your time.
Thank you, and the whole infra team(s), for the effort to keep the infrastructure alive. I'm an active user of the ELK stack in OpenStack. I use it to figure out whether a particular gate failure I see is just a one-time event or a real failure we need to fix. So I'm sad that this tooling will be shut down, as I think I will lose one of the tools that has helped me keep our Gate healthy. But I understand how busy everybody is these days. I'm not an infra person, but if I can help somehow from the Nova perspective then let me know. (E.g. I can review elastic-recheck signatures if that helps)
Well said, gibi. I understand the reasoning behind the ELK sunset, but I'm a bit afraid of not having a way to know how many changes were failing with the same exception as the one I saw. Could we be discussing how we could try to find a workaround for this? Maybe no longer using ELK, but at least still continuing to have the logs for, say, 2 weeks?

-Sylvain

Cheers,
gibi
Clark
On Tue, 2021-05-11 at 09:47 +0200, Sylvain Bauza wrote:
Le mar. 11 mai 2021 à 09:35, Balazs Gibizer <balazs.gibizer@est.tech> a écrit :
On Mon, May 10, 2021 at 10:34, Clark Boylan <cboylan@sapwetik.org> wrote:
Hello everyone,
Hi,
Xenial has recently reached the end of its life. Our logstash+kibana+elasticsearch and subunit2sql+health data crunching services all run on Xenial. Even without the distro platform EOL concerns these services are growing old and haven't received the care they need to keep running reliably.
Additionally these services represent a large portion of our resource consumption:
* 6 x 16 vcpu + 60GB RAM + 1TB disk Elasticsearch servers
* 20 x 4 vcpu + 4GB RAM logstash-worker servers
* 1 x 2 vcpu + 2GB RAM logstash/kibana central server
* 2 x 8 vcpu + 8GB RAM subunit-worker servers
* 64GB RAM + 500GB disk subunit2sql trove db server
* 1 x 4 vcpu + 4GB RAM health server
To put things in perspective, they account for more than a quarter of our control plane servers, occupying over a third of our block storage and in excess of half the total memory footprint.
The OpenDev/OpenStack Infra team(s) don't seem to have the time available currently to do the major lifting required to bring these services up to date. I would like to propose that we simply turn them off. All of these services operate off of public data that will not be going away (specifically job log content). If others are interested in taking this on they can hook into this data and run their own processing pipelines.
I am sure not everyone will be happy with this proposal. I get it. I came up with the idea for the elasticsearch job log processing way back at the San Diego summit. I spent many many many hours since working to get it up and running and to keep it running. But pragmatism means that my efforts and the team's efforts are better spent elsewhere.
I am happy to hear feedback on this. Thank you for your time.
Thank you, and the whole infra team(s), for the effort to keep the infrastructure alive. I'm an active user of the ELK stack in OpenStack. I use it to figure out whether a particular gate failure I see is just a one-time event or a real failure we need to fix. So I'm sad that this tooling will be shut down, as I think I will lose one of the tools that has helped me keep our Gate healthy. But I understand how busy everybody is these days. I'm not an infra person, but if I can help somehow from the Nova perspective then let me know. (E.g. I can review elastic-recheck signatures if that helps)
Well said, gibi. I understand the reasoning behind the ELK sunset, but I'm a bit afraid of not having a way to know how many changes were failing with the same exception as the one I saw.
Could we be discussing how we could try to find a workaround for this? Maybe no longer using ELK, but at least still continuing to have the logs for, say, 2 weeks?

Well, we will continue to have all logs for at least 30 days in the CI results.
Currently all logs get uploaded to swift on the CI providers, and that is what we see when we look at the Zuul results. Separately, they are also streamed to Logstash, where they are ingested and processed so we can query them with Kibana. It's only the ELK portion that is going away, not the CI logs. The indexing of those logs and the easy querying is what would be lost by this change.
-Sylvain
Cheers,
gibi
Clark
Hi,

On Tuesday, 11 May 2021 at 09:29:13 CEST, Balazs Gibizer wrote:
On Mon, May 10, 2021 at 10:34, Clark Boylan <cboylan@sapwetik.org>
wrote:
Hello everyone,
Hi,
Xenial has recently reached the end of its life. Our logstash+kibana+elasticsearch and subunit2sql+health data crunching services all run on Xenial. Even without the distro platform EOL concerns these services are growing old and haven't received the care they need to keep running reliably.
Additionally these services represent a large portion of our resource consumption:
* 6 x 16 vcpu + 60GB RAM + 1TB disk Elasticsearch servers
* 20 x 4 vcpu + 4GB RAM logstash-worker servers
* 1 x 2 vcpu + 2GB RAM logstash/kibana central server
* 2 x 8 vcpu + 8GB RAM subunit-worker servers
* 64GB RAM + 500GB disk subunit2sql trove db server
* 1 x 4 vcpu + 4GB RAM health server
To put things in perspective, they account for more than a quarter of our control plane servers, occupying over a third of our block storage and in excess of half the total memory footprint.
The OpenDev/OpenStack Infra team(s) don't seem to have the time available currently to do the major lifting required to bring these services up to date. I would like to propose that we simply turn them off. All of these services operate off of public data that will not be going away (specifically job log content). If others are interested in taking this on they can hook into this data and run their own processing pipelines.
I am sure not everyone will be happy with this proposal. I get it. I came up with the idea for the elasticsearch job log processing way back at the San Diego summit. I spent many many many hours since working to get it up and running and to keep it running. But pragmatism means that my efforts and the team's efforts are better spent elsewhere.
I am happy to hear feedback on this. Thank you for your time.
Thank you, and the whole infra team(s), for the effort to keep the infrastructure alive. I'm an active user of the ELK stack in OpenStack. I use it to figure out whether a particular gate failure I see is just a one-time event or a real failure we need to fix. So I'm sad that this tooling will be shut down, as I think I will lose one of the tools that has helped me keep our Gate healthy. But I understand how busy everybody is these days. I'm not an infra person, but if I can help somehow from the Nova perspective then let me know. (E.g. I can review elastic-recheck signatures if that helps)
I somehow missed that original email from Clark. But it's similar for the Neutron team. I use logstash pretty often to check how often some issues happen in the CI.
Cheers, gibi
Clark
--
Slawek Kaplonski
Principal Software Engineer
Red Hat
On 2021-05-11 09:47:45 +0200 (+0200), Sylvain Bauza wrote: [...]
Could we be discussing how we could try to find a workaround for this? [...]
That's absolutely worth discussing, but it needs people committed to building and maintaining something. The current implementation was never efficient, and we realized that when we started trying to operate it at scale. It relies on massive quantities of donated infrastructure for which we're trying to be responsible stewards (just the Elasticsearch cluster alone consumes 6x the resources of our Gerrit deployment). We get that it's a useful service, but we need to weigh the relative utility against the cost, not just in server quota but ongoing maintenance.

For a while now we've not had as many people involved in running our infrastructure as we need to maintain the services we built over the years. We've been shouting it from the rooftops, but that doesn't seem to change anything, so all we can do at this point is aggressively sunset noncritical systems in order to hopefully have a small enough remainder that the people we do have can still keep it in good shape. Some of the systems we operate are tightly-coupled and taking them down would have massive ripple effects in other systems which would, counterintuitively, require more people to help untangle. The logstash service, on the other hand, is sufficiently decoupled from our more crucial systems that we can make a large dent in our upgrade and configuration management overhaul backlog by just turning it off.

The workaround to which you allude is actually fairly straightforward. Someone can look at what we had as a proof of concept and build an equivalent system using newer and possibly more appropriate technologies: watch the Gerrit events, fetch logs from swift for anything which gets reported, postprocess and index those, and provide a query interface folks can use to find patterns. None of that requires privileged access to our systems; it's all built on public data. That "someone" needs to come from "somewhere" though.

Upgrading the existing systems at this point is probably at least the same amount of work, given all the moving parts, the need to completely redo the current configuration management for it, the recent license strangeness with Elasticsearch, the fact that Logstash and Kibana are increasingly open-core fighting to keep useful features exclusively for their paying users... the whole stack needs to be reevaluated, and new components and architecture considered.

--
Jeremy Stanley
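To make the shape of that workaround concrete, a minimal Python sketch of the pipeline described above might look like the following. The SSH account, the comment parsing, the job-output.txt path, and the print_indexer backend are all assumptions for illustration; this is not the retired implementation.

```python
# Minimal sketch, assuming an SSH account on Gerrit and a trivial indexer:
# watch Gerrit events, fetch logs for reported failures, postprocess, index.
import json
import subprocess

import requests


def gerrit_events(host="review.opendev.org", port=29418, user="ci-watcher"):
    """Yield events from Gerrit's SSH event stream (one JSON object per line)."""
    cmd = ["ssh", "-p", str(port), f"{user}@{host}", "gerrit", "stream-events"]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        yield json.loads(line)


def print_indexer(build_url, lines):
    """Stand-in for whatever indexing backend gets chosen (ES, OpenSearch, a SaaS...)."""
    print(f"would index {len(lines)} lines from {build_url}")


def process_comment(event, indexer):
    # Zuul reports results as comment-added events; the comment layout and the
    # job-output.txt path assumed here are illustrative only.
    if event.get("type") != "comment-added":
        return
    for line in event.get("comment", "").splitlines():
        if "FAILURE" not in line:
            continue
        url = next((tok for tok in line.split() if tok.startswith("http")), None)
        if not url:
            continue
        resp = requests.get(url.rstrip("/") + "/job-output.txt", timeout=30)
        if resp.ok:
            # Dropping DEBUG lines is the postprocessing step; they made up
            # the bulk of the old cluster's volume.
            keep = [l for l in resp.text.splitlines() if " DEBUG " not in l]
            indexer(url, keep)


if __name__ == "__main__":
    for event in gerrit_events():
        process_comment(event, print_indexer)
```

A real service would need batching, retries, deduplication, and a proper query interface on top of the index, but nothing in it requires privileged access to OpenDev systems; it all runs against public data.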
On Tue, May 11, 2021, at 6:56 AM, Jeremy Stanley wrote:
On 2021-05-11 09:47:45 +0200 (+0200), Sylvain Bauza wrote: [...]
Could we be discussing how we could try to find a workaround for this? [...]
snip. What Fungi said is great. I just wanted to add a bit of detail below.
Upgrading the existing systems at this point is probably at least the same amount of work, given all the moving parts, the need to completely redo the current configuration management for it, the recent license strangeness with Elasticsearch, the fact that Logstash and Kibana are increasingly open-core fighting to keep useful features exclusively for their paying users... the whole stack needs to be reevaluated, and new components and architecture considered.
To add a bit more concrete info to this: the current config management for all of this is Puppet. We no longer have the ability to run Puppet in our infrastructure on systems beyond Ubuntu Xenial. What we have been doing for newer systems is using Ansible (often coupled with docker + docker-compose) to deploy services. This means that all of the config management needs to be redone.

The next problem you'll face is that Elasticsearch itself needs to be upgraded. Historically when we have done this, it has required also upgrading Kibana and Logstash due to compatibility problems. When you upgrade Kibana you have to sort out all of the data access and authorization problems that Elasticsearch presents, because it doesn't provide authentication and authorization (we cannot allow arbitrary writes into the ES cluster, and Kibana assumes it can do this). With Logstash you end up rewriting all of your rules.

Finally, I don't think we have enough room to do rolling replacements of the Elasticsearch cluster members, as they are so large. We have to delete servers to add servers. Typically we would add a server, rotate it in, then delete the old one. In this case the idea would probably be to spin up an entirely new cluster alongside the old one, check that it is functional, then shift the data streaming over to point at it. Unfortunately, that won't be possible.
-- Jeremy Stanley
On Tue, May 11, 2021 at 5:29 PM Clark Boylan <cboylan@sapwetik.org> wrote:
On Tue, May 11, 2021, at 6:56 AM, Jeremy Stanley wrote:
On 2021-05-11 09:47:45 +0200 (+0200), Sylvain Bauza wrote: [...]
Could we be discussing how we could try to find a workaround for this? [...]
snip. What Fungi said is great. I just wanted to add a bit of detail below.
Upgrading the existing systems at this point is probably at least the same amount of work, given all the moving parts, the need to completely redo the current configuration management for it, the recent license strangeness with Elasticsearch, the fact that Logstash and Kibana are increasingly open-core fighting to keep useful features exclusively for their paying users... the whole stack needs to be reevaluated, and new components and architecture considered.
To add a bit more concrete info to this: the current config management for all of this is Puppet. We no longer have the ability to run Puppet in our infrastructure on systems beyond Ubuntu Xenial. What we have been doing for newer systems is using Ansible (often coupled with docker + docker-compose) to deploy services. This means that all of the config management needs to be redone.
The next problem you'll face is that Elasticsearch itself needs to be upgraded. Historically when we have done this, it has required also upgrading Kibana and Logstash due to compatibility problems. When you upgrade Kibana you have to sort out all of the data access and authorization problems that Elasticsearch presents, because it doesn't provide authentication and authorization (we cannot allow arbitrary writes into the ES cluster, and Kibana assumes it can do this). With Logstash you end up rewriting all of your rules.
Finally, I don't think we have enough room to do rolling replacements of the Elasticsearch cluster members, as they are so large. We have to delete servers to add servers. Typically we would add a server, rotate it in, then delete the old one. In this case the idea would probably be to spin up an entirely new cluster alongside the old one, check that it is functional, then shift the data streaming over to point at it. Unfortunately, that won't be possible.
-- Jeremy Stanley
First, thanks to both Jeremy and fungi for explaining why we need to stop providing an ELK environment for our logs. I now understand it better, and honestly I can't really find a way to fix it just by myself. I'm just sad we can't, for the moment, find a way to continue looking at this unless we find "someone" who would help us :-)

Just a note, I then also guess that http://status.openstack.org/elastic-recheck/ will stop working as well, right?

Operators, if you read me and want to make sure that our upstream CI continues to work so we can spot gate issues, please help us! :-)
I just came back from vacation and reading this does worry me a lot. The ES server is a crucial piece used to identify Zuul job failures on OpenStack. LOTS of them have causes external to the job itself: either infra, packaging (os or pip), or unavailability of some services used during the build. Without being able to query specific error messages across multiple jobs we will be in a very bad spot, as we lose the ability to look outside a single project. The TripleO health check project relies on being able to query ER from both opendev and rdo in order to ease identification of problems.

Maybe instead of dropping it we should rethink what it is supposed to index and not index, set some hard limits per job, and scale down the deployment. IMHO, one of the major issues with it is that it tries to index maybe too much without filtering noisy output before indexing. If we can delay making a decision a little bit so we can investigate all available options it would really be great.

It is worth noting that I personally do not have a special love for ES, but I do value a lot what it does. I am also pragmatic, and I would not be very upset to make use of a SaaS service as an alternative, especially as I recognize how costly it is to run and maintain an instance. Maybe we can find a SaaS log processing vendor willing to sponsor OpenStack? In the past I used DataDog for monitoring, but they also offer log processing and they have a program for open source <https://www.datadoghq.com/partner/open-source/>, though I am not sure they would be willing to process that amount of data for us.

Cheers
Sorin Sbarnea
Red Hat

On 10 May 2021 at 18:34:40, Clark Boylan <cboylan@sapwetik.org> wrote:
Hello everyone,
Xenial has recently reached the end of its life. Our logstash+kibana+elasticsearch and subunit2sql+health data crunching services all run on Xenial. Even without the distro platform EOL concerns these services are growing old and haven't received the care they need to keep running reliably.
Additionally these services represent a large portion of our resource consumption:
* 6 x 16 vcpu + 60GB RAM + 1TB disk Elasticsearch servers
* 20 x 4 vcpu + 4GB RAM logstash-worker servers
* 1 x 2 vcpu + 2GB RAM logstash/kibana central server
* 2 x 8 vcpu + 8GB RAM subunit-worker servers
* 64GB RAM + 500GB disk subunit2sql trove db server
* 1 x 4 vcpu + 4GB RAM health server
To put things in perspective, they account for more than a quarter of our control plane servers, occupying over a third of our block storage and in excess of half the total memory footprint.
The OpenDev/OpenStack Infra team(s) don't seem to have the time available currently to do the major lifting required to bring these services up to date. I would like to propose that we simply turn them off. All of these services operate off of public data that will not be going away (specifically job log content). If others are interested in taking this on they can hook into this data and run their own processing pipelines.
I am sure not everyone will be happy with this proposal. I get it. I came up with the idea for the elasticsearch job log processing way back at the San Diego summit. I spent many many many hours since working to get it up and running and to keep it running. But pragmatism means that my efforts and the team's efforts are better spent elsewhere.
I am happy to hear feedback on this. Thank you for your time.
Clark
On 2021-05-12 02:05:57 -0700 (-0700), Sorin Sbarnea wrote: [...]
The TripleO health check project relies on being able to query ER from both opendev and rdo in order to ease identification of problems.
Since you say RDO has a similar setup, could they just expand to start indexing our logs? As previously stated, doing that doesn't require any special access to our infrastructure.
Maybe instead of dropping it we should rethink what it is supposed to index and not index, set some hard limits per job, and scale down the deployment. IMHO, one of the major issues with it is that it tries to index maybe too much without filtering noisy output before indexing.
Reducing how much we index doesn't solve the most pressing problem, which is that we need to upgrade the underlying operating system, therefore replace the current configuration management which won't work on newer platforms, and also almost certainly upgrade versions of the major components in use for it. Nobody has time to do that, at least nobody who has heard our previous cries for help.
If we can delay making a decision a little bit so we can investigate all available options it would really be great.
This thread hasn't set any timeline for stopping the service, not yet anyway.
It is worth noting that I personally do not have a special love for ES, but I do value a lot what it does. I am also pragmatic, and I would not be very upset to make use of a SaaS service as an alternative, especially as I recognize how costly it is to run and maintain an instance. [...]
It's been pointed out that OVH has a similar-sounding service, if someone is interested in experimenting with it: https://www.ovhcloud.com/en-ca/data-platforms/logs/

The case with this, and I think with any SaaS solution, is that there would still need to be a separate ingestion mechanism to identify when new logs are available, postprocess them to remove debug lines, and then feed them to the indexing service at the provider... something our current team doesn't have time to design and manage.

--
Jeremy Stanley
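As a rough sketch of what that separate ingestion mechanism could look like against a hosted log service: the approach below polls the Zuul builds API, drops DEBUG lines, and pushes the rest to a made-up provider endpoint. The tenant, the SaaS URL, the token, and the payload shape are all assumptions for illustration.

```python
# Rough sketch: poll the Zuul builds API for finished failures, strip DEBUG
# lines, and ship the remainder to a hypothetical SaaS ingestion endpoint.
import requests

ZUUL_BUILDS = "https://zuul.opendev.org/api/tenant/openstack/builds?limit=50"
SAAS_ENDPOINT = "https://logs.example.com/ingest"  # placeholder provider URL
SAAS_TOKEN = "changeme"                            # placeholder credential


def ship_recent_failures():
    for build in requests.get(ZUUL_BUILDS, timeout=30).json():
        if build.get("result") != "FAILURE" or not build.get("log_url"):
            continue
        log = requests.get(build["log_url"] + "job-output.txt", timeout=30)
        if not log.ok:
            continue
        # Postprocessing: drop DEBUG lines before paying to index them.
        lines = [l for l in log.text.splitlines() if " DEBUG " not in l]
        requests.post(
            SAAS_ENDPOINT,
            headers={"Authorization": f"Bearer {SAAS_TOKEN}"},
            json={"build": build["uuid"], "job": build["job_name"], "lines": lines},
            timeout=60,
        )


if __name__ == "__main__":
    ship_recent_failures()
```

Someone would still need to run that loop somewhere, deduplicate builds between polls, and keep it working as the APIs evolve, which is exactly the design-and-manage effort described above.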
Hello Folks,

Thank you Jeremy and Clark for sharing the issues that you have. I understand that the main issue is related to a lack of time. An ELK stack requires a lot of resources, but the values that you shared can probably be optimized. Is it possible to share the architecture, i.e. how many servers are used in each Elasticsearch role (master, data servers, etc.)?

My team is managing RDO infra, which contains an ELK stack based on Opendistro for Elasticsearch. We have Ansible playbooks to set up Elasticsearch based on Opendistro on just one node. Almost all of the ELK stack services are located on one server that does not utilize a lot of resources (the retention time is set to 10 days, 90GB of HDD is used, 2GB of RAM for Elasticsearch, 512MB for Logstash). Could you share what retention time is currently set in the cluster that requires 1 TB of disk? Also other statistics, like how many queries are done in Kibana, and how much HDD space is used by the OpenStack project compared to the other projects available in OpenDev?

In the end, I would like to ask if you can share which Elasticsearch version is currently running on your servers, and the -Xmx and -Xms parameters that are set for Logstash, Elasticsearch and Kibana.

Thank you for your time and effort in keeping things running smoothly for OpenDev. We find the OpenDev ELK stack valuable enough to the OpenDev community to take a much larger role in keeping it running. If you can think of any additional links or information that may be helpful to us taking a larger role here, please do not hesitate to share it.

Dan

On Wed, May 12, 2021 at 3:20 PM Jeremy Stanley <fungi@yuggoth.org> wrote:
On 2021-05-12 02:05:57 -0700 (-0700), Sorin Sbarnea wrote: [...]
The TripleO health check project relies on being able to query ER from both opendev and rdo in order to ease identification of problems.
Since you say RDO has a similar setup, could they just expand to start indexing our logs? As previously stated, doing that doesn't require any special access to our infrastructure.
Maybe instead of dropping it we should rethink what it is supposed to index and not index, set some hard limits per job, and scale down the deployment. IMHO, one of the major issues with it is that it tries to index maybe too much without filtering noisy output before indexing.
Reducing how much we index doesn't solve the most pressing problem, which is that we need to upgrade the underlying operating system, therefore replace the current configuration management which won't work on newer platforms, and also almost certainly upgrade versions of the major components in use for it. Nobody has time to do that, at least nobody who has heard our previous cries for help.
If we can delay making a decision a little bit so we can investigate all available options it would really be great.
This thread hasn't set any timeline for stopping the service, not yet anyway.
It is worth noting that I personally do not have a special love for ES, but I do value a lot what it does. I am also pragmatic, and I would not be very upset to make use of a SaaS service as an alternative, especially as I recognize how costly it is to run and maintain an instance. [...]
It's been pointed out that OVH has a similar-sounding service, if someone is interested in experimenting with it:
https://www.ovhcloud.com/en-ca/data-platforms/logs/
The case with this, and I think with any SaaS solution, is that there would still need to be a separate ingestion mechanism to identify when new logs are available, postprocess them to remove debug lines, and then feed them to the indexing service at the provider... something our current team doesn't have time to design and manage. -- Jeremy Stanley
-- Regards, Daniel Pawlik
On Thu, May 13, 2021, at 7:23 AM, Daniel Pawlik wrote:
Hello Folks,
Thank you Jeremy and Clark for sharing the issues that you have. I understand that the main issue is related to a lack of time. An ELK stack requires a lot of resources, but the values that you shared can probably be optimized. Is it possible to share the architecture, i.e. how many servers are used in each Elasticsearch role (master, data servers, etc.)?
All of this information is public. We host high level docs [0] and you can always check the configuration management [1][2][3].
My team is managing RDO infra, which contains an ELK stack based on Opendistro for Elasticsearch. We have Ansible playbooks to set up Elasticsearch based on Opendistro on just one node. Almost all of the ELK stack services are located on one server that does not utilize a lot of resources (the retention time is set to 10 days, 90GB of HDD is used, 2GB of RAM for Elasticsearch, 512MB for Logstash). Could you share what retention time is currently set in the cluster that requires 1 TB of disk? Also other statistics, like how many queries are done in Kibana, and how much HDD space is used by the OpenStack project compared to the other projects available in OpenDev?
We currently have the retention time set to 7 days. At peak we were indexing over a billion documents per day (this is after removing DEBUG logs too) and we run with a single replica. Cacti records [4] disk use by elasticsearch over time. Note that due to our use of a single replica we always want to have some free space to accommodate rebalancing if a cluster member is down.

We don't break this down as openstack vs not openstack at an elasticsearch level, but typical numbers for Zuul test node CPU time show us we are about 95% openstack and 5% not openstack. I don't know what the total number of queries made against Kibana is, but the bulk of querying is likely done by elastic-recheck, which also has a public set of queries [5]. These are run multiple times an hour to keep dashboards up to date.
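For a sense of what those queries look like in practice, each entry in [5] is essentially a Lucene-style query string run against the cluster on a schedule. Below is a minimal sketch using the Python Elasticsearch client; the endpoint, index pattern, field names, and example query string are assumptions for illustration, not the production configuration.

```python
# Minimal sketch of an elastic-recheck style query; all names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch.example.org:9200"])  # assumed endpoint

# One tracked bug is one Lucene query string; this example string is made up.
bug_query = 'message:"Timed out waiting for a reply" AND tags:"screen-n-cpu.txt"'

result = es.search(
    index="logstash-*",  # daily indexes, pattern assumed
    body={
        "query": {"query_string": {"query": bug_query}},
        "size": 0,
        # Count distinct builds that hit the signature rather than raw log lines.
        "aggs": {"per_build": {"terms": {"field": "build_uuid", "size": 100}}},
    },
)
print(result["aggregations"]["per_build"]["buckets"])
```

Multiplied across dozens of tracked signatures and refreshed several times an hour, queries of this shape are what keep the elastic-recheck dashboards current.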
In the end, I would like to ask if you can share which Elasticsearch version is currently running on your servers, and the -Xmx and -Xms parameters that are set for Logstash, Elasticsearch and Kibana.
This info (at least for elasticsearch) is available in [1].
Thank you for your time and effort in keeping things running smoothly for OpenDev. We find the OpenDev ELK stack valuable enough to the OpenDev community to take a much larger role in keeping it running. If you can think of any additional links or information that may be helpful to us taking a larger role here, please do not hesitate to share it.
Dan
[0] https://docs.opendev.org/opendev/system-config/latest/logstash.html
[1] https://opendev.org/opendev/system-config/src/branch/master/modules/openstac...
[2] https://opendev.org/opendev/system-config/src/branch/master/modules/openstac...
[3] https://opendev.org/opendev/system-config/src/branch/master/modules/openstac...
[4] http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=66519&rra_id=3&view_type=&graph_start=1618239228&graph_end=1620917628
[5] https://opendev.org/opendev/elastic-recheck/src/branch/master/queries
I'm creating a sub-thread of this discussion, specifically to highlight the impact of retiring Logstash and Elasticsearch, the functionality we will lose as a result, and to put out the call for resources to help. I will trim Clark's original email to just the critical bits of infrastructure related to these services.
Xenial has recently reached the end of its life. Our logstash+kibana+elasticsearch and subunit2sql+health data crunching services all run on Xenial. Even without the distro platform EOL concerns these services are growing old and haven't received the care they need to keep running reliably.
Additionally these services represent a large portion of our resource consumption:
* 6 x 16 vcpu + 60GB RAM + 1TB disk Elasticsearch servers
* 20 x 4 vcpu + 4GB RAM logstash-worker servers
* 1 x 2 vcpu + 2GB RAM logstash/kibana central server
* 2 x 8 vcpu + 8GB RAM subunit-worker servers
* 64GB RAM + 500GB disk subunit2sql trove db server
<snip>
To put things in perspective, they account for more than a quarter of our control plane servers, occupying over a third of our block storage and in excess of half the total memory footprint.
The OpenDev/OpenStack Infra team(s) don't seem to have the time available currently to do the major lifting required to bring these services up to date. I would like to propose that we simply turn them off. All of these services operate off of public data that will not be going away (specifically job log content). If others are interested in taking this on they can hook into this data and run their own processing pipelines.
Just to clarify for people that aren't familiar with what these services do for us, I want to explain their importance and the impact of not having them in a future where we have to decommission them.

We run a lot of jobs in CI, across a lot of projects and varying configurations. Ideally these would all work all of the time, and never have spurious and non-deterministic failures. However, that's not how the real world works, and in reality many jobs are not consistently well-behaved. Since many of our jobs run tests against many projects to ensure that the whole stack works at any given point, spurious failures in one project's tests can impact developers' ability to land patches in a large number of projects. Indeed, it takes a surprisingly low failure rate to significantly impact the amount of work that can be done across the ecosystem.

Because of this, collecting information from "the firehose" about job failures is critical. It helps us figure out how much impact a given spurious failure is having, and across how wide of a swath of projects. Further, fixing the problem becomes one of determining the actual bug (of course), which can be vastly improved by gathering lots of examples of failures and looking for commonalities. These services (collectively called ELK) digest the logs and data from these test runs and provide a way to mine details when chasing down a failure. There is even a service, built by openstack people, which uses ELK to automate the identification of common failures to help determine which are having the most serious impact in order to focus human debugging attention. It's called elastic-recheck, which you've probably heard of, and is visible here: http://status.openstack.org/elastic-recheck/

Unfortunately, a select few developers actually work on these problems. They're difficult to tackle and often require multiple people across projects to nail down a cause and solution. If you've ever just run "recheck" on your patch a bunch of times until the tests are green, you have felt the pain that spurious job failures bring. Actually fixing those is the only way to make things better, and ignoring them causes them to collect over time. At some point, enough of these types of failures will keep anything from merging.

Because a small number of heroes generally work on these problems, it's possible that they are the only ones that understand the value of these services. I think it's important for everyone to understand how critical ELK and associated services are to chasing these down. Without them, debugging the spurious failures (which are often real bugs, by the way!) will become even more laborious and likely happen less and less.

I'm summarizing this situation in hopes that some of the entities that depend on OpenStack, who are looking for a way to help, and which may have resources (carbon- and silicon-based) that apply here, can step up to help make an impact.

Thanks!

--Dan
On Mon, May 10, 2021, at 10:34 AM, Clark Boylan wrote:
Hello everyone,
Xenial has recently reached the end of its life. Our logstash+kibana+elasticsearch and subunit2sql+health data crunching services all run on Xenial. Even without the distro platform EOL concerns these services are growing old and haven't received the care they need to keep running reliably.
Additionally these services represent a large portion of our resource consumption:
* 6 x 16 vcpu + 60GB RAM + 1TB disk Elasticsearch servers
* 20 x 4 vcpu + 4GB RAM logstash-worker servers
* 1 x 2 vcpu + 2GB RAM logstash/kibana central server
* 2 x 8 vcpu + 8GB RAM subunit-worker servers
* 64GB RAM + 500GB disk subunit2sql trove db server
* 1 x 4 vcpu + 4GB RAM health server
To put things in perspective, they account for more than a quarter of our control plane servers, occupying over a third of our block storage and in excess of half the total memory footprint.
The OpenDev/OpenStack Infra team(s) don't seem to have the time available currently to do the major lifting required to bring these services up to date. I would like to propose that we simply turn them off. All of these services operate off of public data that will not be going away (specifically job log content). If others are interested in taking this on they can hook into this data and run their own processing pipelines.
I am sure not everyone will be happy with this proposal. I get it. I came up with the idea for the elasticsearch job log processing way back at the San Diego summit. I spent many many many hours since working to get it up and running and to keep it running. But pragmatism means that my efforts and the team's efforts are better spent elsewhere.
I am happy to hear feedback on this. Thank you for your time.
Since this thread was started we have heard feedback, and the OpenStack TC has brought this up with the Board to try and find volunteers to help address the hosting, upgrades, and maintenance of these services. We have said we are not in a rush to shut them off (still no rush), but feel that setting a deadline for finding help is important.

At the TC meeting yesterday we decided that we would try to limp the servers along through the Yoga cycle. Rough math says that is going to end April 2022. Getting this addressed sooner is definitely better, as there is always the risk that external events will force us to shut these services down prior to that date. Hopefully, having a concrete date can create some urgency and help us find the aid we need.

If you would like to help, definitely read through this thread as it provides details on what sorts of things need doing. Also feel free to reach out to myself or others on the OpenDev team and we'll do our best to provide direction as necessary.

Clark
participants (10)

- Balazs Gibizer
- Clark Boylan
- Dan Smith
- Daniel Pawlik
- Jeremy Stanley
- Sean Mooney
- Slawek Kaplonski
- Sorin Sbarnea
- Sylvain Bauza
- Sylvain Bauza