Re: [all][tact-sig][infra][qa] Retiring Logstash, Elasticsearch, subunit2sql, and Health

11 May 2021

On 2021-05-11 09:47:45 +0200 (+0200), Sylvain Bauza wrote:
[...]
...
> Could we be discussing how we could try to find a workaround for
> this?
[...]
That's absolutely worth discussing, but it needs people committed to
building and maintaining something. The current implementation was
never efficient, and we realized that when we started trying to
operate it at scale. It relies on massive quantities of donated
infrastructure for which we're trying to be responsible stewards
(just the Elasticsearch cluster alone consumes 6x the resources of
our Gerrit deployment). We get that it's a useful service, but we
need to weigh the relative utility against the cost, not just in
server quota but also in ongoing maintenance.

For a while now we've not had as many people involved in running our
infrastructure as we need to maintain the services we built over the
years. We've been shouting it from the rooftops, but that doesn't
seem to change anything, so all we can do at this point is
aggressively sunset noncritical systems in order to hopefully have a
small enough remainder that the people we do have can still keep it
in good shape. Some of the systems we operate are tightly-coupled
and taking them down would have massive ripple effects in other
systems which would, counterintuitively, require more people to help
untangle. The logstash service, on the other hand, is sufficiently
decoupled from our more crucial systems that we can make a large
dent in our upgrade and configuration management overhaul backlog by
just turning it off.

The workaround to which you allude is actually fairly
straightforward. Someone can look at what we had as a proof of
concept and build an equivalent system using newer and possibly more
appropriate technologies. Watch the Gerrit events, fetch logs from
swift for anything which gets reported, postprocess and index those,
providing a query interface folks can use to find patterns. None of
that requires privileged access to our systems; it's all built on
public data. That "someone" needs to come from "somewhere" though.
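
A minimal sketch of that watch/fetch/index/query flow, in Python, might
look something like the following. Everything in it is hypothetical and
for illustration only: the names (log_urls_from_event, LogIndex,
run_pipeline), the assumption that log URLs can be pulled out of the
"comment" text of Gerrit "comment-added" events, and the toy in-memory
index standing in for a real search backend. Fetching is passed in as a
callable so the sketch makes no claims about how logs are actually
retrieved from swift.

```python
import json
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set

# Assumption: log locations appear as plain URLs in the review comment.
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[a-z0-9_]+")


def log_urls_from_event(line: str) -> List[str]:
    """Return URLs found in the comment of one JSON event line.

    Only "comment-added" events are considered; anything else is ignored.
    """
    event = json.loads(line)
    if event.get("type") != "comment-added":
        return []
    return URL_RE.findall(event.get("comment", ""))


class LogIndex:
    """Toy inverted index mapping lowercase tokens to log URLs."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, url: str, text: str) -> None:
        # Postprocess: lowercase and split into simple word tokens.
        for token in TOKEN_RE.findall(text.lower()):
            self._postings[token].add(url)

    def search(self, query: str) -> List[str]:
        # Return the URLs of logs containing every token in the query.
        tokens = TOKEN_RE.findall(query.lower())
        if not tokens:
            return []
        hits = set.intersection(
            *(self._postings.get(t, set()) for t in tokens)
        )
        return sorted(hits)


def run_pipeline(
    events: Iterable[str],
    fetch: Callable[[str], str],
    index: LogIndex,
) -> None:
    """Watch events, fetch each reported log, and index its contents."""
    for line in events:
        for url in log_urls_from_event(line):
            index.add(url, fetch(url))
```

In use, events would be whatever iterable yields one JSON event per
line, and fetch whatever function returns the text of a log given its
URL; index.search("connection timed out") would then list the logs
containing all three words. A real replacement would obviously need
persistent storage, retention, and a proper query interface on top.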

Upgrading the existing systems at this point is probably at least
the same amount of work, given all the moving parts, the need to
completely redo the current configuration management for it, the
recent license strangeness with Elasticsearch, the fact that
Logstash and Kibana are increasingly open-core, fighting to keep
useful features exclusively for their paying users... the whole
stack needs to be reevaluated, and new components and architecture
considered.
-- 
Jeremy Stanley