Thanks for your answer, Gregory. We’re using octavia-zed (with the latest changes available in the repo for this version).
octavia_jobboard:listings didn’t contain them (it contains only the jobs that are running now, so these jobs are not running).
There were no suspicious backtraces. From my perspective it looks like this: the worker received the request, the initial message was logged, the restart happened, and nothing related to the request was done afterwards.
My guess is that the worker received the request but the restart happened before the jobboard started executing the flow, so we lost it, but I’m not sure.
From: Gregory Thiemonge <gthiemonge@redhat.com>
Date: Thursday, 26 September 2024 at 15:40
To: Payne Max <yardalgedal@gmail.com>
Cc: openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: Re: [octavia] Lose of some jobs by worker
Hi,
Looking at the taskflow code, it seems that those last_modified entries may not always be deleted:
I think it's something that could be improved, but it doesn't indicate a potential bug there.
Which octavia release do you use?
I don't see the octavia_jobboard:listings hash in your output.
It is used to keep all the current jobs in taskflow. When a job is posted, an element is added:
When the conductor is started in octavia (for instance when the worker restarts after a crash/kill), it fetches all the elements of this hash to schedule the jobs.
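To check what a restarted conductor would find in that hash, something like the following could be used. It is only a minimal sketch: `list_posted_jobs` is a hypothetical helper, not part of octavia or taskflow, and it assumes a redis-py style client that exposes `hgetall`.

```python
def list_posted_jobs(client, listings_key="octavia_jobboard:listings"):
    """Return the field names stored in the jobboard listings hash.

    `client` is assumed to be a redis-py style object with `hgetall`;
    this helper is illustrative and not part of octavia or taskflow.
    """
    entries = client.hgetall(listings_key)
    # redis-py returns bytes unless decode_responses=True is set
    return sorted(
        key.decode() if isinstance(key, bytes) else key
        for key in entries
    )
```

With a real deployment this would be called as `list_posted_jobs(redis.Redis(host=..., port=...))`. An empty result would mean there is nothing for the conductor to reschedule after a restart.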
Are there any suspicious backtraces in the octavia worker, healthmanager, or housekeeping logs?
Hi, OpenStack community,
I’ve faced a problem where some of our jobs can get lost by a worker. For example, in the screenshot, SIGTERM was sent a few seconds after the worker received a job.
After that there were no new log messages related to this job. Later our client complained that the LB had been stuck in PENDING_UPDATE for several days, and we started investigating.
Our MySQL (persistent storage) is clean, but in our Redis, I can see several jobs without TTL and I think they are related to the «lost» jobs.
Is this an expected situation? Could it be related to
https://github.com/openstack/octavia/blob/master/octavia/common/base_taskflow.py#L209-L211? Let’s discuss it!
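For reference, the Redis keys without a TTL mentioned above can be listed with a sketch like this one. `keys_without_ttl` is a hypothetical helper, the `octavia_jobboard*` pattern is an assumption based on the jobboard namespace, and the client is assumed to be redis-py style (`scan_iter` and `ttl`).

```python
def keys_without_ttl(client, pattern="octavia_jobboard*"):
    """Return the keys matching `pattern` that have no expiry set.

    Redis TTL returns -1 for a key that exists without an expiry and
    -2 for a key that does not exist. The pattern is an assumption.
    """
    found = []
    for key in client.scan_iter(match=pattern):
        if client.ttl(key) == -1:
            found.append(key.decode() if isinstance(key, bytes) else key)
    return sorted(found)
```

Keys that are persistent by design will likely show up here as well, so the output still has to be compared with the jobs that are really expected to exist.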