RE: [octavia] Lose of some jobs by worker

26 Sep 2024

      Thanks, I don’t have octavia_jobboard:trash key at all …

I’ve checked the dates and there are no correlations with the 19 Sep date … Have some dates from Jun and May.

Ok, to be honest, sometimes we’re receiving the same behaviour – when we’re losing some worker jobs without any notes about that – once/a few times per quarter, but the main problem is that when we’re trying to imitate some behavior, rock the boat – we can not reproduce even a single case ☹ So Octavia is absolutely resistant to the intentional rocking.

От: Gregory Thiemonge <gthiemonge@redhat.com>
Дата: четверг, 26 сентября 2024 г. в 16:22
Кому: Payne Max <yardalgedal@gmail.com>
Копия: openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Тема: Re: [octavia] Lose of some jobs by worker
can you also check the octavia_jobboard:trash hash?
in zed, I can only see 2 reasons for a job to be removed from the listings hash:
- the job was consumed (completed successfully) https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...
- the job was trashed https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...

do the timestamps of the .last_modified elements match the date of the SIGTERM?
you can decode the content with msgpackutils:
...
...
...
from oslo_serialization import msgpackutils
msgpackutils.loads(b"\xc7<\x01\x87\xa3day\x0e\xa5month\x05\xa4year\xcd\a\xe8\xa4hour\x14\xa6minute7\xa6second9\xabmicrosecond\xce\x00\x01\xdda")
datetime.datetime(2024, 5, 14, 20, 55, 57, 122209)
anyway I'll investigate to see if I can reproduce a similar behavior.

On Thu, Sep 26, 2024 at 3:53 PM Payne Max <yardalgedal@gmail.com<mailto:yardalgedal@gmail.com>> wrote:
Thanks for your answer, Gregory, we’re using octavia-zed (with the latest changes available in the repo for this version).

octavia_jobboard:listings didn’t contain them (it contains only «running now» so these jobs are not running)

there were no suspicious backtraces, from my perspective it looks like «the worker received the request, an initial message was shown, restart appears, nothing related to the request was done then»

my potential guess related to the fact that, maybe, the worker received the request but restart appeared before jobboard started executing the flow, so we lost it, but I’m not sure

От: Gregory Thiemonge <gthiemonge@redhat.com<mailto:gthiemonge@redhat.com>>
Дата: четверг, 26 сентября 2024 г. в 15:40
Кому: Payne Max <yardalgedal@gmail.com<mailto:yardalgedal@gmail.com>>
Копия: openstack-discuss@lists.openstack.org<mailto:openstack-discuss@lists.openstack.org> <openstack-discuss@lists.openstack.org<mailto:openstack-discuss@lists.openstack.org>>
Тема: Re: [octavia] Lose of some jobs by worker
Hi,

looking at taskflow code, it seems that those last_modified entries may not always be deleted:
https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...

I think it's something that could be improved but it doesn't indicate a potential bug there.
Which octavia release do you use?

I don't see the octavia_jobboard:listings hash in your output.
it is used to keep all the current jobs in taskflow, when a job is posted, an element is added:
https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...

when the conductor is started in octavia (for instance when the worker restarts after a crash/kill), it fetches all the elements of this hash to schedule the jobs.
https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...

any suspicious backtraces in the octavia worker, healthmanager, housekeeping logs?

Greg

On Thu, Sep 26, 2024 at 3:15 PM Payne Max <yardalgedal@gmail.com<mailto:yardalgedal@gmail.com>> wrote:
Hi, OpenStack community,

I’ve faced a problem when some of our jobs can get lost by a worker, for example from the screenshot, SIGTERM was called in several seconds after receiving a job by a worker.

[cid:ii_1922ea44a7d4cff311]
Then there were no new log messages related to this job. Then our client complained that LB stucked in PENDING_UPDATE for several days and we started investigation.
Our MySQL (persistent storage) is clean, but in our Redis, I can see several jobs without TTL and I think they are related to the «lost» jobs.

[cid:ii_1922ea44a7d5b16b22]
Is it an ok situation? Can it be related to the https://github.com/openstack/octavia/blob/master/octavia/common/base_taskflo... Let’s discuss it!