I'm creating a sub-thread of this discussion, specifically to highlight the impact of retiring Logstash and Elasticsearch, the functionality we will lose as a result, and to put out the call for resources to help. I will trim Clark's original email to just the critical bits of infrastructure related to these services.
Xenial has recently reached the end of its life. Our logstash+kibana+elasticsearch and subunit2sql+health data crunching services all run on Xenial. Even without the distro platform EOL concerns, these services are growing old and haven't received the care they need to keep running reliably.
Additionally these services represent a large portion of our resource consumption:
* 6 x 16 vcpu + 60GB RAM + 1TB disk Elasticsearch servers
* 20 x 4 vcpu + 4GB RAM logstash-worker servers
* 1 x 2 vcpu + 2GB RAM logstash/kibana central server
* 2 x 8 vcpu + 8GB RAM subunit-worker servers
* 64GB RAM + 500GB disk subunit2sql trove db server
<snip>
To put things in perspective, they account for more than a quarter of our control plane servers, over a third of our block storage, and more than half of our total memory footprint.
The OpenDev/OpenStack Infra team(s) don't currently seem to have the time available to do the major lifting required to bring these services up to date. I would like to propose that we simply turn them off. All of these services operate off of public data that will not be going away (specifically job log content). If others are interested in taking this on, they can hook into this data and run their own processing pipelines.
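To make "hook into this data" a little more concrete, here is a rough sketch of what a minimal external pipeline could look like without ELK: ask the public Zuul builds API for recent failures of a job and scan each build's console log for a suspected failure signature. Treat the details as assumptions rather than a supported interface; the query parameters, the example job name, the signature string, and the job-output.txt filename are all illustrative and may need adjusting.

#!/usr/bin/env python3
# Rough sketch: pull recent failed builds from the public Zuul API and grep
# their console logs for a failure signature. Endpoint parameters, the job
# name, the signature, and the log filename below are assumptions.

import requests

ZUUL_API = "https://zuul.opendev.org/api/tenant/openstack/builds"
SIGNATURE = "Timed out waiting for a reply"  # hypothetical failure signature


def recent_failures(job_name, limit=20):
    """Return recent failed builds for one job from the public Zuul API."""
    params = {"job_name": job_name, "result": "FAILURE", "limit": limit}
    resp = requests.get(ZUUL_API, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()


def matches_signature(build):
    """Fetch a build's console log and look for the signature string."""
    log_url = build.get("log_url")
    if not log_url:
        return False
    # job-output.txt is the usual console log name for Zuul v3 jobs
    log = requests.get(log_url + "job-output.txt", timeout=60)
    return log.ok and SIGNATURE in log.text


if __name__ == "__main__":
    for build in recent_failures("tempest-full-py3"):
        if matches_signature(build):
            print(build["uuid"], build["log_url"])

Anything beyond a one-off script like this would want proper indexing and storage behind it, which is exactly the gap the ELK services fill today.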
Just to clarify for people who aren't familiar with what these services do for us, I want to explain their importance and the impact of not having them in a future where we have to decommission them.

We run a lot of jobs in CI, across a lot of projects and varying configurations. Ideally these would all work all of the time, and never have spurious and non-deterministic failures. However, that's not how the real world works, and in reality many jobs are not consistently well-behaved. Since many of our jobs run tests against many projects to ensure that the whole stack works at any given point, spurious failures in one project's tests can impact developers' ability to land patches in a large number of projects. Indeed, it takes a surprisingly low failure rate to significantly impact the amount of work that can be done across the ecosystem.

Because of this, collecting information from "the firehose" about job failures is critical. It helps us figure out how much impact a given spurious failure is having, and across how wide a swath of projects. Further, fixing the problem becomes a matter of determining the actual bug (of course), which is vastly easier when you can gather lots of examples of failures and look for commonalities.

These services (collectively called ELK) digest the logs and data from these test runs and provide a way to mine details when chasing down a failure. There is even a service, built by OpenStack people, which uses ELK to automate the identification of common failures and help determine which are having the most serious impact, in order to focus human debugging attention. It's called elastic-recheck, which you've probably heard of, and is visible here: http://status.openstack.org/elastic-recheck/

Unfortunately, only a select few developers actually work on these problems. They're difficult to tackle and often require multiple people across projects to nail down a cause and solution. If you've ever just run "recheck" on your patch a bunch of times until the tests are green, you have felt the pain that spurious job failures bring. Actually fixing them is the only way to make things better, and ignoring them causes them to accumulate over time. At some point, enough of these types of failures will keep anything from merging.

Because a small number of heroes generally work on these problems, it's possible that they are the only ones who understand the value of these services. I think it's important for everyone to understand how critical ELK and associated services are to chasing these down. Without them, debugging the spurious failures (which are often real bugs, by the way!) will become even more laborious and likely happen less and less.

I'm summarizing this situation in hopes that some of the entities that depend on OpenStack, are looking for a way to help, and may have resources (carbon- and silicon-based) that apply here can step up and make an impact.

Thanks!

--Dan