[all][infra][qa] Retiring Logstash, Elasticsearch, subunit2sql, and Health

Sylvain Bauza sylvain.bauza at gmail.com
Tue May 11 07:47:45 UTC 2021


On Tue, May 11, 2021 at 09:35, Balazs Gibizer <balazs.gibizer at est.tech>
wrote:

>
>
> On Mon, May 10, 2021 at 10:34, Clark Boylan <cboylan at sapwetik.org>
> wrote:
> > Hello everyone,
>
> Hi,
>
> >
> > Xenial has recently reached the end of its life. Our
> > logstash+kibana+elasticsearch and subunit2sql+health data crunching
> > services all run on Xenial. Even without the distro platform EOL
> > concerns these services are growing old and haven't received the care
> > they need to keep running reliably.
> >
> > Additionally these services represent a large portion of our resource
> > consumption:
> >
> > * 6 x 16 vcpu + 60GB RAM + 1TB disk Elasticsearch servers
> > * 20 x 4 vcpu + 4GB RAM logstash-worker servers
> > * 1 x 2 vcpu + 2GB RAM logstash/kibana central server
> > * 2 x 8 vcpu + 8GB RAM subunit-worker servers
> > * 64GB RAM + 500GB disk subunit2sql trove db server
> > * 1 x 4 vcpu + 4GB RAM health server
> >
> > To put things in perspective, they account for more than a quarter of
> > our control plane servers, occupying over a third of our block
> > storage and in excess of half the total memory footprint.
> >
> > The OpenDev/OpenStack Infra team(s) don't seem to have the time
> > available currently to do the major lifting required to bring these
> > services up to date. I would like to propose that we simply turn them
> > off. All of these services operate off of public data that will not
> > be going away (specifically job log content). If others are
> > interested in taking this on they can hook into this data and run
> > their own processing pipelines.
> >
> > I am sure not everyone will be happy with this proposal. I get it. I
> > came up with the idea for the elasticsearch job log processing way
> > back at the San Diego summit. I spent many many many hours since
> > working to get it up and running and to keep it running. But
> > pragmatism means that my efforts and the team's efforts are better
> > spent elsewhere.
> >
> > I am happy to hear feedback on this. Thank you for your time.
>
> Thank you and the whole infra team(s) for the effort of keeping the
> infrastructure alive. I'm an active user of the ELK stack in OpenStack.
> I use it to figure out whether a particular gate failure I see is just a
> one-time event or a real failure we need to fix. So I'm sad that this
> tooling will be shut down, as I think I will lose one of the tools that
> has helped me keep our gate healthy. But I understand how busy
> everybody is these days. I'm not an infra person, but if I can help
> somehow from the Nova perspective then let me know. (E.g. I can review
> elastic-recheck signatures if that helps.)
>
>
Well said, gibi.
I understand the reasoning behind the ELK sunset, but I'm a bit afraid of
not having a way to know how many changes were failing with the
same exception as one I saw.

Could we discuss how we might find a workaround for this?
Maybe no longer using ELK, but at least still keeping the logs
for, say, 2 weeks?

-Sylvain

> Cheers,
> gibi
>
> >
> > Clark
> >
> >
>
>
>
>


More information about the openstack-discuss mailing list