Hello Alex,
thanks for the writeup!
A few comments and questions from me:
Database settings
It seems dangerous to me to set 'max_connections' that high
(the maximum is about 64K if you only connect to one IP).
Every database connection requires resources on the operating
system side at the very least. As far as I know, MariaDB also
allocates additional memory per active thread (e.g. sort_buffer_size,
join_buffer_size). In extreme situations, this can cause MariaDB
to run out of memory due to cgroup resource limits (with 10k
connections, that is ~21GB for threads alone when you use the
MariaDB defaults), or the memory requirement can grow so large
that the OOM killer is activated on the node. That is not a good
thing for a database, especially since the OOM killer terminates
the database with SIGKILL.
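To make the memory estimate easier to check, here is a minimal sketch of the arithmetic. The buffer sizes are what I believe the MariaDB defaults to be (2 MiB sort_buffer_size, 256 KiB join_buffer_size), so treat them as assumptions and verify them on your server. The function names are made up for illustration, and the model is a simplification: it assumes every connection allocates each buffer exactly once at the same time.

```python
# Rough worst-case estimate of per-thread buffer memory in MariaDB.
# The sizes below are ASSUMED defaults; verify them on your server, e.g. with
#   SHOW GLOBAL VARIABLES LIKE '%buffer_size';
ASSUMED_PER_THREAD_BUFFERS = {
    "sort_buffer_size": 2 * 1024 * 1024,  # assumed 2 MiB
    "join_buffer_size": 256 * 1024,       # assumed 256 KiB
}


def estimate_thread_memory(connections, buffers=ASSUMED_PER_THREAD_BUFFERS):
    """Return the bytes needed if every connection allocates each buffer once."""
    return connections * sum(buffers.values())


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 1024 ** 3


if __name__ == "__main__":
    for connections in (1_000, 10_000, 64_000):
        total = estimate_thread_memory(connections)
        print(f"{connections:>6} connections: {to_gib(total):6.1f} GiB")
```

With these assumed values, 10k connections come out at roughly 22 GiB, which is in the same range as the figure above. Other per-thread allocations (thread stack, read buffers) would push the number higher.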
Furthermore, other limitations in the setup (IO hardware limits,
ulimits or other configuration parameters) could mean that many
thousands of connections are active in parallel but are slowed
down or even blocked/starved as a result. At the very least, this
will have a negative impact on response times, but it can also
cause more serious problems, for example the server blocking
entirely. These horror scenarios only occur in extreme situations,
but it is precisely in these situations that such settings are
particularly dangerous, in my opinion.
RabbitMQ
I have also wondered what the connection limit for RabbitMQ is and whether there are potential difficulties here.
As far as I know, the maximum is determined automatically, for example by the ulimit or the Erlang port limit that applies to the process (see also https://www.rabbitmq.com/docs/networking#tuning-for-large-number-of-connections).
What was your initial limit?
In my setup, the limits are already quite high:
$ docker exec -ti rabbitmq \
    /bin/bash -c 'pgrep beam.smp | xargs -I PID grep -H "Max open files" /proc/PID/limits'
/proc/22/limits:Max open files 1048576 1048576 files
$ docker exec -ti rabbitmq \
    rabbitmqctl eval 'erlang:system_info(port_limit).'
65536
With a file limit that high, almost everything above the
maximum possible number of TCP connections remains available
for internal file system access.
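To compare the two limits mechanically, a small sketch like the following could be used. It rests on the assumption that every client connection needs one file descriptor and one Erlang port, so the lower of the two values is a rough upper bound. The real ceiling will be lower, since the broker also needs descriptors for its own files. The function names are hypothetical.

```python
import re


def parse_max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits text."""
    match = re.search(r"Max open files\s+(\d+)\s+(\d+)", limits_text)
    if match is None:
        # Also reached when the limit is reported as "unlimited".
        raise ValueError("no numeric 'Max open files' line found")
    return int(match.group(1)), int(match.group(2))


def connection_ceiling(soft_fd_limit, erlang_port_limit):
    """Rough upper bound on connections.

    ASSUMPTION: each connection consumes one file descriptor and one
    Erlang port, so the smaller limit is the binding one.
    """
    return min(soft_fd_limit, erlang_port_limit)


if __name__ == "__main__":
    # Values taken from the command output quoted above.
    sample = "/proc/22/limits:Max open files 1048576 1048576 files"
    soft, hard = parse_max_open_files(sample)
    print(connection_ceiling(soft, 65536))  # prints 65536
```

In my setup the Erlang port limit would therefore be the binding constraint, not the file descriptor limit.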
What I would like to know: How many connections can it handle in
times of very high load? Do you have monitoring data from the
situation you resolved?
The RabbitMQ documentation mentioned above describes some
interesting approaches. It probably makes sense to discuss this
in more detail in a dedicated mail thread.
Regards
Marc