[openstack-dev] [Neutron] L3 agent rescheduling issue
blak111 at gmail.com
Thu Jun 4 16:28:37 UTC 2015
Is there a way to parallelize the periodic tasks? I wanted to go this route
because I encountered cases where a bunch of routers would get scheduled to
l3 agents and they would all hit the server nearly simultaneously with a
sync routers task.
This could result in thousands of routers and their floating IPs being
retrieved, which would result in tens of thousands of SQL queries. During
this time, the agents would time out and have all their routers
rescheduled, leading to a downward spiral of doom.
I spent a bunch of time optimizing the sync routers calls on the l3 side so
it's hard to trigger this now, but I would be more comfortable if we didn't
depend on sync routers taking less time than the agent down time.
If we can have the heartbeats always running, it should solve both issues.
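
As a rough illustration of the always-running heartbeat idea, here is a minimal sketch in which the state report loop runs independently of a long sync. It uses the standard library's threading module as a stand-in for an eventlet greenthread, and every name in it (Agent, report_interval, full_sync, max_gap) is hypothetical rather than taken from the Neutron code base.

```python
import threading
import time


class Agent:
    """Toy agent; all names are illustrative, not Neutron's."""

    def __init__(self, report_interval):
        self.report_interval = report_interval
        self.heartbeats = []  # monotonic timestamps of sent reports
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._heartbeat_loop, daemon=True)

    def _report_state(self):
        # Stand-in for the RPC state report sent to the server.
        self.heartbeats.append(time.monotonic())

    def _heartbeat_loop(self):
        # Runs on its own thread, so a slow sync does not delay it.
        while not self._stop.is_set():
            self._report_state()
            self._stop.wait(self.report_interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def full_sync(self, duration):
        # Stand-in for a long-running sync routers call.
        time.sleep(duration)


def max_gap(timestamps):
    """Largest interval between consecutive heartbeats."""
    return max(b - a for a, b in zip(timestamps, timestamps[1:]))
```

With a report interval well below the agent down time, the largest gap between reports stays small even while full_sync is blocked. Note that with eventlet the same idea only holds if the sync work yields cooperatively, since greenthreads are not preempted.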
On Jun 4, 2015 8:56 AM, "Salvatore Orlando" <sorlando at nicira.com> wrote:
> One reason for not sending the heartbeat from a separate greenthread could
> be that the agent is already doing it.
> The current proposed patch addresses the issue blindly - that is to say
> before declaring an agent dead let's wait for some more time because it
> could be stuck doing stuff. In that case I would probably make the
> multiplier (currently 2x) configurable.
> The state report probably does not occur because both it and the resync
> procedure are periodic tasks. If I got it right they're both
> executed as eventlet greenthreads but one at a time. Perhaps then adding an
> initial delay to the full sync task might ensure the first thing an agent
> does when it comes up is sending a heartbeat to the server?
> On the other hand, while doing the initial full resync, is the agent able
> to process updates? If not perhaps it makes sense to have it down until it
> finishes synchronisation.
> On 4 June 2015 at 16:16, Kevin Benton <blak111 at gmail.com> wrote:
>> Why don't we put the agent heartbeat into a separate greenthread on the
>> agent so it continues to send updates even when it's busy processing
>> On Jun 4, 2015 2:56 AM, "Anna Kamyshnikova" <akamyshnikova at mirantis.com> wrote:
>>> Hi, neutrons!
>>> Some time ago I discovered a bug in l3 agent rescheduling. When
>>> there are a lot of resources and agent_down_time is not big enough,
>>> neutron-server starts marking l3 agents as dead. The same issue has been
>>> discovered and fixed for DHCP agents. I proposed a change similar to those
>>> that were made for DHCP agents.
>>> There is no unified opinion on this bug and the proposed change, so I want
>>> to ask developers whether it is worth continuing work on this patch or not.
>>>  - https://bugs.launchpad.net/neutron/+bug/1440761
>>>  - https://review.openstack.org/171592
>>> Ann Kamyshnikova
>>> Mirantis, Inc
>>> OpenStack Development Mailing List (not for usage questions)
>>> OpenStack-dev-request at lists.openstack.org?subject:unsubscribe