[octavia] Loss of some jobs by worker
Hi, OpenStack community,

I’ve faced a problem where some of our jobs can get lost by a worker. For example, in the screenshot below, SIGTERM was caught just a few seconds after a worker received a job.

[screenshot: worker log showing SIGTERM caught seconds after the job was received]

After that there were no new log messages related to this job. Our client then complained that an LB had been stuck in PENDING_UPDATE for several days, and we started an investigation.

Our MySQL (persistent storage) is clean, but in our Redis I can see several jobs without a TTL, and I think they are related to the «lost» jobs.

[screenshot: Redis keys for the suspected lost jobs, shown without a TTL]

Is this an OK situation? Can it be related to https://github.com/openstack/octavia/blob/master/octavia/common/base_taskflow.py#L209-L211 ? Let’s discuss it!
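For reference, the check I ran was roughly the following (the key pattern and connection details are specific to our deployment, so treat it as a sketch):

# Rough sketch of the TTL check (adjust the connection details to your setup).
import redis

r = redis.Redis(host="localhost", port=6379)
for key in r.scan_iter("octavia_jobboard*"):
    ttl = r.ttl(key)  # -1 means the key exists but has no TTL set
    if ttl == -1:
        print(key.decode(), "has no TTL")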
Hi,

Looking at the taskflow code, it seems that those last_modified entries may not always be deleted:
https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...
I think it's something that could be improved, but it doesn't indicate a potential bug there.

Which Octavia release do you use?

I don't see the octavia_jobboard:listings hash in your output. It is used to keep all the current jobs in taskflow; when a job is posted, an element is added:
https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...
When the conductor is started in Octavia (for instance when the worker restarts after a crash/kill), it fetches all the elements of this hash to schedule the jobs:
https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...

Any suspicious backtraces in the Octavia worker, health manager, or housekeeping logs?

Greg
Thanks for your answer, Gregory, we’re using octavia-zed (with the latest changes available in the repo for this version).

octavia_jobboard:listings didn’t contain them (it contains only the jobs that are running right now, so these jobs are not running).

There were no suspicious backtraces. From my perspective it looks like «the worker received the request, an initial message was shown, a restart happened, and nothing related to the request was done after that».

My guess is that the worker received the request but the restart happened before the jobboard started executing the flow, so we lost it, but I’m not sure.
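Roughly how I checked the listings hash (again our connection details, so just a sketch):

# Rough sketch of how I inspected the listings hash in our Redis.
import redis

r = redis.Redis(host="localhost", port=6379)
listings = r.hgetall("octavia_jobboard:listings")
print("jobs currently listed:", len(listings))
for job_key in listings:
    print(job_key.decode())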
Can you also check the octavia_jobboard:trash hash?

In zed, I can only see 2 reasons for a job to be removed from the listings hash:
- the job was consumed (completed successfully)
  https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...
- the job was trashed
  https://github.com/openstack/taskflow/blob/master/taskflow/jobs/backends/imp...

Do the timestamps of the .last_modified elements match the date of the SIGTERM? You can decode the content with msgpackutils:
from oslo_serialization import msgpackutils
msgpackutils.loads(b"\xc7<\x01\x87\xa3day\x0e\xa5month\x05\xa4year\xcd\a\xe8\xa4hour\x14\xa6minute7\xa6second9\xabmicrosecond\xce\x00\x01\xdda")
# -> datetime.datetime(2024, 5, 14, 20, 55, 57, 122209)
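If it helps, something along these lines should dump all the leftover .last_modified entries in one go (just a sketch; it assumes the default octavia_jobboard prefix and that the values are msgpack-encoded like the one above):

# Sketch: print every leftover .last_modified entry so the timestamps can be
# compared with the SIGTERM date (assumes the default octavia_jobboard prefix).
import redis
from oslo_serialization import msgpackutils

r = redis.Redis(host="localhost", port=6379)
for key in r.scan_iter("octavia_jobboard*last_modified"):
    print(key.decode(), msgpackutils.loads(r.get(key)))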
Anyway, I'll investigate to see if I can reproduce a similar behavior.
Thanks, I don’t have the octavia_jobboard:trash key at all …

I’ve checked the dates and there is no correlation with the 19 Sep date … there are some dates from Jun and May.

To be honest, we see the same behaviour from time to time – worker jobs getting lost without any trace – once or a few times per quarter. The main problem is that when we deliberately try to provoke it and rock the boat, we cannot reproduce even a single case ☹ So Octavia is absolutely resistant to intentional rocking.
Hi Payne,

The message that worries me is the "Caught SIGTERM" message close in time to the batch update. This implies to me that something outside of Octavia is initiating a graceful shutdown (great that it's graceful; we so often see people with misconfigured systemd or k8s doing SIGKILLs).

There is a known issue I call the "rabbit hole", caused by Oslo Messaging. Unfortunately, the current design of Oslo Messaging (which Octavia uses for the rabbitmq RPC calls) auto-ACKs messages when it pulls them from the queue. If something goes wrong between the message being pulled off the queue and the flow being loaded into TaskFlow, there is a small chance the RPC message could be lost. Ideally we wanted to ACK the message only once the TaskFlow flow is launched, but we found we cannot do that with the current Oslo Messaging design. This is outstanding tech debt in Oslo Messaging that has not yet been resolved. If you are interested in working on it, I can give you code pointers to where the issue is occurring.

That said, with a graceful shutdown, the controller worker should complete the processing and load the job into TaskFlow before shutting the thread down. For this reason I am not 100% convinced this is the issue, but it is something to investigate more.

Michael
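P.S. To make the window concrete, a rough sketch of a worker RPC endpoint (illustrative only, not the actual Octavia code; the method name and details dict are made up):

# Illustrative sketch only -- not the actual Octavia endpoint code.
# The point is the ordering: by the time this method runs, oslo.messaging has
# already ACKed the RPC message, so a crash before the jobboard.post() call
# below loses the request with no trace in MySQL or in the jobboard.

class ControllerEndpoint(object):
    def __init__(self, jobboard):
        self.jobboard = jobboard

    def update_load_balancer(self, context, load_balancer_id, update_dict):
        # <-- RPC message is already ACKed at this point (the "rabbit hole" window)
        details = {"load_balancer_id": load_balancer_id,
                   "update_dict": update_dict}
        # Posting to the jobboard is what makes the work recoverable after a
        # restart; a termination before this line means the job is simply gone.
        self.jobboard.post("update-lb-%s" % load_balancer_id, details=details)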
Thanks for the answer, Michael! That was a planned restart of Octavia done via Kolla, so nothing unpredictable happened :)
> Ideally we wanted to ACK the message once the TaskFlow is launched

Absolutely voting for this way ...

> That said, with a graceful shutdown, the controller worker should complete
> the processing and load the job into Taskflow before shutting the thread
> down. For this reason I am not 100% convinced this is the issue, but it is
> something to investigate more.

Well, right now we don't have a proper logging level configured to be able to confirm or deny that this is exactly the case. We just did not see anything about this particular event/action after the worker received it, and that is why I'm thinking in this direction ...

Currently I see that we have 1 minute for a graceful shutdown, and as per the logs we did not exceed this time while shutting down, but again, maybe our log level is just not showing enough.

Another thing – we had a fairly huge pool (and a fairly huge LB in general): ~50 listeners, 1 pool for each, 40 members for each pool. So one way or another we could spend more than 10-20 seconds preparing the task :)

We will definitely increase the log level and continue debugging on our side; I'll let the community know if we identify something here ...
participants (4)
- Gregory Thiemonge
- Michael Johnson
- Payne Max
- yardalgedal@gmail.com