[all][infra][qa] Retiring Logstash, Elasticsearch, subunit2sql, and Health

Sorin Sbarnea ssbarnea at redhat.com
Wed May 12 09:05:57 UTC 2021


I just came back from vacation and reading this does worry me a lot. The ES
server is a crucial piece used to identify zuul job failures on openstack.
LOTS of them have causes external to the job itself, either infra,
packaging (os or pip) or unavailability of some services used during build.

Without being able to query specific error messages across multiple jobs we
will be in a very bad spot, as we lose the ability to look outside a single
project.

The TripleO health check project relies on being able to query ER from both
opendev and rdo in order to ease identification of problems.

Maybe instead of dropping it we should rethink what it is supposed to index
and what not, set some hard limits per job and scale down the deployment.
IMHO, one of the major issues with it is that it tries to index too much,
w/o filtering noisy output before indexing.
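
To make the idea a bit more concrete, here is a rough sketch of what
"filter noisy output and set a hard limit per job" could look like before
anything is sent to the indexer. This is only an illustration: the noise
patterns, the limit value and the function name are all hypothetical and
are not taken from the current logstash configuration.

```python
import re
from typing import Iterable, Iterator

# Hypothetical noise patterns; real ones would have to come from reviewing
# actual job logs and deciding what is never worth querying.
NOISE_PATTERNS = [
    re.compile(r"\bDEBUG\b"),  # debug-level chatter
    re.compile(r"^\s*$"),      # blank lines
]

# Hypothetical hard limit on the number of lines indexed per job.
MAX_LINES_PER_JOB = 1000


def select_lines_to_index(
    lines: Iterable[str], max_lines: int = MAX_LINES_PER_JOB
) -> Iterator[str]:
    """Yield only non-noisy lines, stopping once the per-job limit is hit."""
    kept = 0
    for line in lines:
        # Skip anything matching a known noise pattern.
        if any(pattern.search(line) for pattern in NOISE_PATTERNS):
            continue
        # Enforce the hard per-job cap on indexed lines.
        if kept >= max_lines:
            break
        kept += 1
        yield line
```

Something like this would run per job log, and only the lines it yields
would be indexed, which is where the scale-down would come from.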

If we can delay making a decision a little bit so we can investigate all
available options it would really be great.

It is worth noting that I personally do not have a special love for ES but I
do value a lot what it does. I am also pragmatic and I would not be very upset
to make use of a SaaS service as an alternative, especially as I recognize
how costly it is to run and maintain an instance.

Maybe we can find a SaaS log processing vendor willing to sponsor
OpenStack?  In the past I used DataDog for monitoring but they also offer
log processing and they have a program for open-source
<https://www.datadoghq.com/partner/open-source/> but I am not sure they
would be willing to process that amount of data for us.

Cheers
Sorin Sbarnea
Red Hat


On 10 May 2021 at 18:34:40, Clark Boylan <cboylan at sapwetik.org> wrote:

> Hello everyone,
>
> Xenial has recently reached the end of its life. Our
> logstash+kibana+elasticsearch and subunit2sql+health data crunching
> services all run on Xenial. Even without the distro platform EOL concerns
> these services are growing old and haven't received the care they need to
> keep running reliably.
>
> Additionally these services represent a large portion of our resource
> consumption:
>
> * 6 x 16 vcpu + 60GB RAM + 1TB disk Elasticsearch servers
> * 20 x 4 vcpu + 4GB RAM logstash-worker servers
> * 1 x 2 vcpu + 2GB RAM logstash/kibana central server
> * 2 x 8 vcpu + 8GB RAM subunit-worker servers
> * 64GB RAM + 500GB disk subunit2sql trove db server
> * 1 x 4 vcpu + 4GB RAM health server
>
> To put things in perspective, they account for more than a quarter of our
> control plane servers, occupying over a third of our block storage and in
> excess of half the total memory footprint.
>
> The OpenDev/OpenStack Infra team(s) don't seem to have the time available
> currently to do the major lifting required to bring these services up to
> date. I would like to propose that we simply turn them off. All of these
> services operate off of public data that will not be going away
> (specifically job log content). If others are interested in taking this on
> they can hook into this data and run their own processing pipelines.
>
> I am sure not everyone will be happy with this proposal. I get it. I came
> up with the idea for the elasticsearch job log processing way back at the
> San Diego summit. I spent many many many hours since working to get it up
> and running and to keep it running. But pragmatism means that my efforts
> and the team's efforts are better spent elsewhere.
>
> I am happy to hear feedback on this. Thank you for your time.
>
> Clark
>
>
>