Hello Folks,

Thank you, Jeremy and Clark, for sharing the issues you are facing. I understand that the main problem is a lack of time.
An ELK stack does require a lot of resources, but the figures you shared can probably be optimized. Would it be possible to share
the architecture, i.e. how many servers you run and which Elasticsearch role (master, data node, etc.) each of them has?
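
For example, assuming the Elasticsearch REST API is reachable on localhost:9200 (adjust the host as needed; the exact
column names vary a little between versions), something like this would show the role and heap size of every node:

  curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,heap.max'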

My team manages the RDO infra, which includes an ELK stack based on Opendistro for Elasticsearch.
We have Ansible playbooks that set up an Opendistro-based Elasticsearch on just one node. Almost all of the ELK
stack services run on a single server that does not use a lot of resources (retention is set to
10 days, about 90 GB of disk is used, 2 GB of RAM for Elasticsearch and 512 MB for Logstash).
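
Purely as an illustration (this is not our actual playbook; the image tags are examples and I am assuming the RAM figures
above are JVM heap sizes), a minimal single-node deployment with roughly that footprint could look like:

  version: "3"
  services:
    elasticsearch:
      image: amazon/opendistro-for-elasticsearch:1.13.2
      environment:
        - discovery.type=single-node
        - "ES_JAVA_OPTS=-Xms2g -Xmx2g"        # 2 GB heap for Elasticsearch
      volumes:
        - es-data:/usr/share/elasticsearch/data
    logstash:
      image: docker.elastic.co/logstash/logstash-oss:7.10.2
      environment:
        - "LS_JAVA_OPTS=-Xms512m -Xmx512m"    # 512 MB heap for Logstash
  volumes:
    es-data: {}
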
Could you share what retention time is currently set in the cluster that makes it require 1 TB of disk? Other statistics would
also help, for example how many queries are run in Kibana and how much disk space is used by the OpenStack project
compared to the other projects hosted in OpenDev.
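
If it helps with gathering those numbers, per-index document counts and on-disk sizes can be pulled with something like
the following (the logstash-* index pattern is only a guess; adjust it to whatever pattern your indices actually use):

  curl -s 'http://localhost:9200/_cat/indices/logstash-*?v&h=index,docs.count,store.size'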

Finally, could you share which Elasticsearch version is currently running on your servers, as well as the -Xmx and -Xms heap
parameters set for Logstash and Elasticsearch (and the memory limit used for Kibana)?
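
For reference, those parameters typically live in the stock package locations shown below (the paths may of course differ
on your deployment; the values are the ones from our single-node setup, assuming the figures above are heap sizes), and the
running version is reported in the version.number field of a plain GET on port 9200:

  # /etc/elasticsearch/jvm.options
  -Xms2g
  -Xmx2g

  # /etc/logstash/jvm.options (or via LS_JAVA_OPTS)
  -Xms512m
  -Xmx512m

  curl -s 'http://localhost:9200/' | grep number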

Thank you for your time and effort in keeping things running smoothly for OpenDev. We find the OpenDev ELK stack
valuable enough to the community that we are willing to take a much larger role in keeping it running.
If you can think of any additional links or information that would help us take on that larger role, please do not
hesitate to share them.

Dan

On Wed, May 12, 2021 at 3:20 PM Jeremy Stanley <fungi@yuggoth.org> wrote:
On 2021-05-12 02:05:57 -0700 (-0700), Sorin Sbarnea wrote:
[...]
> TripleO health check project relies on being able to query ER from
> both opendev and rdo in order to ease identification of problems.

Since you say RDO has a similar setup, could they just expand to
start indexing our logs? As previously stated, doing that doesn't
require any special access to our infrastructure.

> Maybe instead of dropping we should rethink what it is supposed to
> index and not, set some hard limits per job and scale down the
> deployment. IMHO, one of the major issues with it is that it does
> try to index maybe too much w/o filtering noisy output before
> indexing.

Reducing how much we index doesn't solve the most pressing problem,
which is that we need to upgrade the underlying operating system,
therefore replace the current configuration management which
won't work on newer platforms, and also almost certainly upgrade
versions of the major components in use for it. Nobody has time to
do that, at least nobody who has heard our previous cries for help.

> If we can delay making a decision a little bit so we can
> investigate all available options it would really be great.

This thread hasn't set any timeline for stopping the service, not
yet anyway.

> It is worth noting that I personally do not have a special love for ES
> but I do value a lot what it does. I am also pragmatic and I would
> not be very upset to make use of a SaaS service as an alternative,
> especially as I recognize how costly it is to run and maintain an
> instance.
[...]

It's been pointed out that OVH has a similar-sounding service, if
someone is interested in experimenting with it:

https://www.ovhcloud.com/en-ca/data-platforms/logs/

The case with this, and I think with any SaaS solution, is that
there would still need to be a separate ingestion mechanism to
identify when new logs are available, postprocess them to remove
debug lines, and then feed them to the indexing service at the
provider... something our current team doesn't have time to design
and manage.
--
Jeremy Stanley


--
Regards,
Daniel Pawlik