[all][infra][qa] Retiring Logstash, Elasticsearch, subunit2sql, and Health

Sean Mooney smooney at redhat.com
Tue May 11 08:21:02 UTC 2021


On Tue, 2021-05-11 at 09:47 +0200, Sylvain Bauza wrote:
> On Tue, May 11, 2021 at 09:35, Balazs Gibizer <balazs.gibizer at est.tech>
> wrote:
> 
> > 
> > 
> > On Mon, May 10, 2021 at 10:34, Clark Boylan <cboylan at sapwetik.org>
> > wrote:
> > > Hello everyone,
> > 
> > Hi,
> > 
> > > 
> > > Xenial has recently reached the end of its life. Our
> > > logstash+kibana+elasticsearch and subunit2sql+health data crunching
> > > services all run on Xenial. Even without the distro platform EOL
> > > concerns these services are growing old and haven't received the care
> > > they need to keep running reliably.
> > > 
> > > Additionally these services represent a large portion of our resource
> > > consumption:
> > > 
> > > * 6 x 16 vcpu + 60GB RAM + 1TB disk Elasticsearch servers
> > > * 20 x 4 vcpu + 4GB RAM logstash-worker servers
> > > * 1 x 2 vcpu + 2GB RAM logstash/kibana central server
> > > * 2 x 8 vcpu + 8GB RAM subunit-worker servers
> > > * 64GB RAM + 500GB disk subunit2sql trove db server
> > > * 1 x 4 vcpu + 4GB RAM health server
> > > 
> > > To put things in perspective, they account for more than a quarter of
> > > our control plane servers, occupying over a third of our block
> > > storage and in excess of half the total memory footprint.
> > > 
> > > The OpenDev/OpenStack Infra team(s) don't seem to have the time
> > > available currently to do the major lifting required to bring these
> > > services up to date. I would like to propose that we simply turn them
> > > off. All of these services operate off of public data that will not
> > > be going away (specifically job log content). If others are
> > > interested in taking this on they can hook into this data and run
> > > their own processing pipelines.
> > > 
> > > I am sure not everyone will be happy with this proposal. I get it. I
> > > came up with the idea for the elasticsearch job log processing way
> > > back at the San Diego summit. I spent many many many hours since
> > > working to get it up and running and to keep it running. But
> > > pragmatism means that my efforts and the team's efforts are better
> > > spent elsewhere.
> > > 
> > > I am happy to hear feedback on this. Thank you for your time.
> > 
> > Thank you and the whole infra team(s) for the effort to keep the
> > infrastructure alive. I'm an active user of the ELK stack in OpenStack.
> > I use it to figure out if a particular gate failure I see is just a one
> > time event or a real failure we need to fix. So I'm sad that this
> > tooling will be shut down, as I think I lose one of the tools that
> > helped me keep our Gate healthy. But I understand how busy everybody
> > is these days. I'm not an infra person but if I can help somehow
> > from the Nova perspective then let me know. (E.g. I can review elastic
> > recheck signatures if that helps)
> > 
> > 
> Well said, gibi.
> I understand the reasoning behind the ELK sunset but I'm a bit afraid of
> not having a way to know the number of changes that were failing with the
> same exception as the one I saw.
> 
> Could we discuss how to find a workaround for this?
> Maybe no longer using ELK, but at least still continuing to have the logs
> for, say, 2 weeks?
Well, we will continue to have all logs for at least 30 days in the CI results.

Currently all logs get uploaded to swift on the CI providers, and that is what we see when we look
at the zuul results. Separately, they are also streamed to logstash, where they are ingested and processed so we
can query them with kibana. It's only the ELK portion that is going away, not the CI logs.

The indexing of those logs and the easy querying is what would be lost by this change.
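
As a rough illustration of what "counting changes that hit the same exception"
could look like without the index, here is a minimal Python sketch. It is not
an existing tool: the function names, the build ids and the sample signature
are all hypothetical, and it assumes you have already collected the log URLs
you care about by hand from the zuul results.

```python
import re
import urllib.request
from typing import Dict, Mapping


def fetch_log(url: str, timeout: float = 30.0) -> str:
    """Download one job log as text (the URL is supplied by the caller)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def count_signature_hits(logs: Mapping[str, str], signature: str) -> Dict[str, int]:
    """Map build id -> number of log lines matching the regex `signature`.

    Builds whose log never matches are left out of the result.
    """
    pattern = re.compile(signature)
    hits: Dict[str, int] = {}
    for build_id, text in logs.items():
        matches = sum(1 for line in text.splitlines() if pattern.search(line))
        if matches:
            hits[build_id] = matches
    return hits


if __name__ == "__main__":
    # Hypothetical, hand-written log snippets standing in for downloaded logs.
    sample_logs = {
        "build-a": "INFO ok\nERROR Timeout waiting for vif plugging\n",
        "build-b": "INFO ok\n",
    }
    print(count_signature_hits(sample_logs, r"Timeout waiting for vif plugging"))
```

This only scans logs you explicitly hand it, so it is far from a replacement
for the indexed search; it just shows that the raw data stays usable while the
logs are retained.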

> 
> -Sylvain
> 
> > Cheers,
> > gibi
> > 
> > > 
> > > Clark
> > > 
> > > 
> > 
> > 
> > 
> > 