I'm creating a sub-thread of this discussion, specifically to highlight the impact of retiring Logstash and Elasticsearch, the functionality we will lose as a result, and to put out the call for resources to help. I will trim Clark's original email to just the critical bits of infrastructure related to these services.
Xenial has recently reached the end of its life. Our logstash+kibana+elasticsearch and subunit2sql+health data crunching services all run on Xenial. Even without the distro platform EOL concerns, these services are growing old and haven't received the care they need to keep running reliably.
Additionally these services represent a large portion of our resource consumption:
* 6 x 16 vcpu + 60GB RAM + 1TB disk Elasticsearch servers
* 20 x 4 vcpu + 4GB RAM logstash-worker servers
* 1 x 2 vcpu + 2GB RAM logstash/kibana central server
* 2 x 8 vcpu + 8GB RAM subunit-worker servers
* 64GB RAM + 500GB disk subunit2sql trove db server
<snip>
To put things in perspective, they account for more than a quarter of our control plane servers, over a third of our block storage, and more than half of our total memory footprint.
The OpenDev/OpenStack Infra team(s) don't currently seem to have the time available to do the major lifting required to bring these services up to date. I would like to propose that we simply turn them off. All of these services operate off of public data that will not be going away (specifically job log content). If others are interested in taking this on, they can hook into this data and run their own processing pipelines.
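To make "hook into this data" a little more concrete, here is a rough sketch of what a minimal external pipeline could look like without ELK: ask the public Zuul builds API for recent failures of a job and scan each build's console log for a suspected failure signature. Treat the details as assumptions rather than a supported interface; the query parameters, the example job name, the signature string, and the job-output.txt filename are all illustrative and may need adjusting.

#!/usr/bin/env python3
# Rough sketch: pull recent failed builds from the public Zuul API and grep
# their console logs for a failure signature. Endpoint parameters, the job
# name, the signature, and the log filename below are assumptions.

import requests

ZUUL_API = "https://zuul.opendev.org/api/tenant/openstack/builds"
SIGNATURE = "Timed out waiting for a reply"  # hypothetical failure signature


def recent_failures(job_name, limit=20):
    """Return recent failed builds for one job from the public Zuul API."""
    params = {"job_name": job_name, "result": "FAILURE", "limit": limit}
    resp = requests.get(ZUUL_API, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()


def matches_signature(build):
    """Fetch a build's console log and look for the signature string."""
    log_url = build.get("log_url")
    if not log_url:
        return False
    # job-output.txt is the usual console log name for Zuul v3 jobs
    log = requests.get(log_url + "job-output.txt", timeout=60)
    return log.ok and SIGNATURE in log.text


if __name__ == "__main__":
    for build in recent_failures("tempest-full-py3"):
        if matches_signature(build):
            print(build["uuid"], build["log_url"])

Anything beyond a one-off script like this would want proper indexing and storage behind it, which is exactly the gap the ELK services fill today.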
Just to clarify for people who aren't familiar with what these services do for us, I want to explain their importance and the impact of not having them in a future where we have to decommission them.

We run a lot of jobs in CI, across a lot of projects and varying configurations. Ideally these would all work all of the time, and never have spurious and non-deterministic failures. However, that's not how the real world works, and in reality many jobs are not consistently well-behaved. Since many of our jobs run tests against many projects to ensure that the whole stack works at any given point, spurious failures in one project's tests can impact developers' ability to land patches in a large number of projects. Indeed, it takes a surprisingly low failure rate to significantly impact the amount of work that can be done across the ecosystem.

Because of this, collecting information from "the firehose" about job failures is critical. It helps us figure out how much impact a given spurious failure is having, and across how wide a swath of projects. Further, fixing the problem becomes a matter of determining the actual bug (of course), which is vastly easier when you can gather lots of examples of failures and look for commonalities.

These services (collectively called ELK) digest the logs and data from these test runs and provide a way to mine details when chasing down a failure. There is even a service, built by OpenStack people, which uses ELK to automate the identification of common failures and help determine which are having the most serious impact, in order to focus human debugging attention. It's called elastic-recheck, which you've probably heard of, and is visible here: http://status.openstack.org/elastic-recheck/

Unfortunately, only a select few developers actually work on these problems. They're difficult to tackle and often require multiple people across projects to nail down a cause and solution. If you've ever just run "recheck" on your patch a bunch of times until the tests are green, you have felt the pain that spurious job failures bring. Actually fixing them is the only way to make things better, and ignoring them causes them to accumulate over time. At some point, enough of these types of failures will keep anything from merging.

Because a small number of heroes generally work on these problems, it's possible that they are the only ones who understand the value of these services. I think it's important for everyone to understand how critical ELK and associated services are to chasing these down. Without them, debugging the spurious failures (which are often real bugs, by the way!) will become even more laborious and likely happen less and less.

I'm summarizing this situation in hopes that some of the entities that depend on OpenStack, are looking for a way to help, and may have resources (carbon- and silicon-based) that apply here can step up and make an impact.

Thanks!

--Dan