Hi Payne,

The message that worries me is the "Caught SIGTERM" message close in time to the batch update. This implies that something outside of Octavia initiated a graceful shutdown (it's good that it was graceful; we often see people with misconfigured systemd or k8s setups sending SIGKILL instead).

There is a known issue, which I call the "rabbit hole", caused by Oslo Messaging. Unfortunately, the current design of Oslo Messaging (which Octavia uses for the rabbitmq RPC calls) auto-ACKs messages as it pulls them from the queue. If something goes wrong between the message being pulled off the queue and the flow being loaded into TaskFlow, there is a small chance the RPC message could be lost (a rough sketch of that window is below the quote line). Ideally we wanted to ACK the message only once the TaskFlow flow is launched, but we found we cannot do that with the current Oslo Messaging design. This is outstanding tech debt in Oslo Messaging that has not yet been resolved. If you are interested in working on it, I can give you code pointers to where the issue occurs.

That said, with a graceful shutdown, the controller worker should complete the processing and load the job into TaskFlow before shutting the thread down. For that reason I am not 100% convinced this is the issue, but it is something to investigate further.

Michael

On Thu, Sep 26, 2024 at 8:15 AM Payne Max <yardalgedal@gmail.com> wrote:
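A rough sketch of the auto-ACK window described above, purely illustrative rather than Octavia's actual code (ControllerEndpoint, update_load_balancer and the jobboard object are hypothetical stand-ins):

class ControllerEndpoint(object):
    """Hypothetical RPC endpoint showing the auto-ACK window."""

    def __init__(self, jobboard):
        self.jobboard = jobboard

    def update_load_balancer(self, context, load_balancer_id):
        # By the time oslo.messaging dispatches into this method, the
        # AMQP message has already been ACKed, so rabbitmq has
        # forgotten about it.
        #
        # If the process dies before the next line completes, the
        # request exists in neither rabbitmq nor the jobboard.
        self.jobboard.post('update-load-balancer',
                           details={'load_balancer_id': load_balancer_id})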
Thanks, I don’t have the octavia_jobboard:trash key at all …
I’ve checked the dates and there is no correlation with the 19 Sep date … I have some entries from June and May.
OK, to be honest, we sometimes see the same behaviour – worker jobs getting lost without any trace of it – once or a few times per quarter. The main problem is that when we try to provoke it and rock the boat, we cannot reproduce even a single case ☹ So Octavia is absolutely resistant to intentional rocking.
*From: *Gregory Thiemonge <gthiemonge@redhat.com> *Date: *Thursday, September 26, 2024 at 16:22 *To: *Payne Max <yardalgedal@gmail.com> *Cc: *openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org> *Subject: *Re: [octavia] Lose of some jobs by worker
can you also check the octavia_jobboard:trash hash?
in Zed, I can see only 2 reasons for a job to be removed from the listings hash:
- the job was consumed (completed successfully) https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...
- the job was trashed https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...
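for example, you can dump both hashes with the redis-py client (a minimal sketch; adjust the connection details to your deployment):

import redis

r = redis.Redis(host='localhost', port=6379)
# raw payloads of the jobs the jobboard still knows about
print(r.hgetall('octavia_jobboard:listings'))
# jobs that were trashed instead of consumed
print(r.hgetall('octavia_jobboard:trash'))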
do the timestamps of the .last_modified elements match the date of the SIGTERM?
you can decode the content with msgpackutils:
from oslo_serialization import msgpackutils
msgpackutils.loads(b"\xc7<\x01\x87\xa3day\x0e\xa5month\x05\xa4year\xcd\a\xe8\xa4hour\x14\xa6minute7\xa6second9\xabmicrosecond\xce\x00\x01\xdda")
datetime.datetime(2024, 5, 14, 20, 55, 57, 122209)
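to check all of them at once, something like this should work (a sketch that assumes the redis-py client and the octavia_jobboard:*.last_modified key layout):

import redis
from oslo_serialization import msgpackutils

r = redis.Redis(host='localhost', port=6379)
for key in r.scan_iter('octavia_jobboard:*.last_modified'):
    # each value should be a msgpack-encoded datetime
    print(key, msgpackutils.loads(r.get(key)))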
anyway I'll investigate to see if I can reproduce a similar behavior.
On Thu, Sep 26, 2024 at 3:53 PM Payne Max <yardalgedal@gmail.com> wrote:
Thanks for your answer, Gregory. We’re using octavia-zed (with the latest changes available in the repo for this version).
octavia_jobboard:listings didn’t contain them (it contains only the jobs that are «running now», so these jobs are not running)
there were no suspicious backtraces; from my perspective it looks like «the worker received the request, an initial log message was shown, a restart happened, and nothing related to the request was done after that»
my tentative guess is that the worker received the request but the restart happened before the jobboard started executing the flow, so we lost it, but I’m not sure
*From: *Gregory Thiemonge <gthiemonge@redhat.com> *Date: *Thursday, September 26, 2024 at 15:40 *To: *Payne Max <yardalgedal@gmail.com> *Cc: *openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org> *Subject: *Re: [octavia] Lose of some jobs by worker
Hi,
looking at taskflow code, it seems that those last_modified entries may not always be deleted:
https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...
I think it's something that could be improved, but by itself it doesn't indicate a bug there.
Which octavia release do you use?
I don't see the octavia_jobboard:listings hash in your output.
it is used to keep all the current jobs in taskflow; when a job is posted, an element is added:
https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...
when the conductor is started in octavia (for instance when the worker restarts after a crash/kill), it fetches all the elements of this hash to schedule the jobs.
https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...
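you can also connect to the same jobboard from a script to see what a starting conductor would pick up – a minimal sketch, assuming taskflow's Redis jobboard backend; the connection details and the 'octavia_jobboard' namespace here are placeholders to adjust for your deployment:

from taskflow.jobs import backends as job_backends

conf = {
    'board': 'redis',
    'host': '127.0.0.1',
    'port': 6379,
    'namespace': 'octavia_jobboard',
}
board = job_backends.fetch('octavia-inspect', conf)
board.connect()
# every job still listed here will be claimed and scheduled by a conductor
for job in board.iterjobs(ensure_fresh=True, only_unclaimed=True):
    print(job.name, job.last_modified)
board.close()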
any suspicious backtraces in the octavia worker, healthmanager, housekeeping logs?
Greg
On Thu, Sep 26, 2024 at 3:15 PM Payne Max <yardalgedal@gmail.com> wrote:
Hi, OpenStack community,
I’ve run into a problem where some of our jobs can get lost by a worker; for example, in the screenshot, SIGTERM was caught several seconds after the worker received a job.
After that, there were no new log messages related to this job. Then our client complained that the LB was stuck in PENDING_UPDATE for several days, and we started an investigation.
Our MySQL (persistent storage) is clean, but in our Redis I can see several jobs without a TTL, and I think they are related to the «lost» jobs.
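this is roughly how I’m checking the TTLs (a sketch using the redis-py client; the connection details are for illustration):

import redis

r = redis.Redis(host='localhost', port=6379)
for key in r.scan_iter('octavia_jobboard:*'):
    # ttl() returns -1 for keys that exist but have no expiration
    print(key, r.ttl(key))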
Is this an OK situation? Can it be related to https://github.com/openstack/octavia/blob/master/octavia/common/base_taskflo... ? Let’s discuss it!