[openstack-dev] [Neutron] L3 agent rescheduling issue

Eugene Nikanorov enikanorov at mirantis.com
Mon Jun 8 03:10:17 UTC 2015


Salvatore,

By 'fairness' I meant the chances for the state report greenthread to
get control. In the DHCP agent's case each network is processed by a
separate greenthread, so the more greenthreads the agent has, the lower
the chances that the report state greenthread will be able to report in
time.
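
A minimal standalone sketch of the effect (illustrative only, not
Neutron code): many busy worker greenthreads starve a periodic
reporting greenthread, because eventlet only switches at explicit
yields or monkey-patched blocking calls.

    import time

    import eventlet
    eventlet.monkey_patch()

    def report_state():
        while True:
            print("state report at %.1f" % time.time())
            eventlet.sleep(1)  # stands in for report_interval

    def process_network(n):
        # stands in for per-network processing in the DHCP agent
        while True:
            for _ in range(10 ** 6):  # CPU-bound stretch, no yield point
                pass
            eventlet.sleep(0)         # one yield shared with all workers

    eventlet.spawn(report_state)
    for n in range(500):              # one greenthread per network
        eventlet.spawn(process_network, n)
    eventlet.sleep(30)                # let the demo run for a while

With a few hundred workers the reports arrive far less often than once
per second, even though the reporting greenthread itself is cheap.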

Thanks,
Eugene.

On Sun, Jun 7, 2015 at 4:15 AM, Salvatore Orlando <sorlando at nicira.com>
wrote:

> On 5 June 2015 at 01:29, Itsuro ODA <oda at valinux.co.jp> wrote:
>
>> Hi,
>>
>> > After trying to reproduce this, I'm suspecting that the issue is
>> > actually on the server side, which fails to drain the agent report
>> > state queue in time.
>>
>> I have seen this before.
>> The scenario I suspected at that time was as follows (see the config
>> sketch below for the options involved):
>> * a lot of create/update resource API calls were issued
>> * the "rpc_conn_pool_size" pool was exhausted by sending
>>   notifications, which blocked further sending on the RPC side
>> * the "rpc_thread_pool_size" pool was exhausted by RPC replies
>>   waiting on the "rpc_conn_pool_size" pool
>> * receiving state_report messages was blocked because the
>>   "rpc_thread_pool_size" pool was exhausted
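>>
>> For reference, these are the oslo.messaging pool options involved,
>> set in neutron.conf on the server side (the values shown are
>> illustrative, not a tuning recommendation):
>>
>>   [DEFAULT]
>>   # greenthreads available to consume incoming RPC messages
>>   rpc_thread_pool_size = 64
>>   # connections available for outgoing casts and notifications
>>   rpc_conn_pool_size = 30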
>>
>>
> I think this could be a good explanation, couldn't it?
> Kevin proved that the periodic tasks are not mutually exclusive and that
> long processing times for sync_routers are not an issue.
> However, he correctly suspected a server-side involvement, which could
> actually be a lot of requests saturating the RPC pool.
>
> On the other hand, how could we use this theory to explain why this issue
> tends to occur when the agent is restarted?
> Also, Eugene, what do you mean by stating that the issue could be in the
> agent's "fairness"?
>
> Salvatore
>
>
>
>> Thanks
>> Itsuro Oda
>>
>> On Thu, 4 Jun 2015 14:20:33 -0700
>> Kevin Benton <blak111 at gmail.com> wrote:
>>
>> > After trying to reproduce this, I'm suspecting that the issue is
>> > actually on the server side, which fails to drain the agent report
>> > state queue in time.
>> >
>> > I set the report_interval to 1 second on the agent and added a
>> > logging statement, and I see a report every 1 second even when
>> > sync_routers is taking a really long time.
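>> >
>> > For anyone who wants to reproduce this, a sketch of the setup (the
>> > log line is illustrative, not the exact statement I added):
>> >
>> >   [AGENT]
>> >   # neutron.conf on the agent host
>> >   report_interval = 1
>> >
>> > plus, in the agent's state-reporting method, something like:
>> >
>> >   LOG.debug("state report sent at %s", time.time())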
>> >
>> > On Thu, Jun 4, 2015 at 11:52 AM, Carl Baldwin <carl at ecbaldwin.net>
>> > wrote:
>> >
>> > > Ann,
>> > >
>> > > Thanks for bringing this up.  It has been on the shelf for a while
>> > > now.
>> > >
>> > > Carl
>> > >
>> > > On Thu, Jun 4, 2015 at 8:54 AM, Salvatore Orlando <
>> > > sorlando at nicira.com> wrote:
>> > > > One reason for not sending the heartbeat from a separate
>> > > > greenthread could be that the agent is already doing it [1].
>> > > > The currently proposed patch addresses the issue blindly - that
>> > > > is to say, before declaring an agent dead let's wait some more
>> > > > time, because it could be stuck doing stuff. In that case I
>> > > > would probably make the multiplier (currently 2x) configurable,
>> > > > along the lines of the sketch below.
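>> > > >
>> > > > A sketch of what that could look like in neutron.conf on the
>> > > > server (agent_down_time is an existing option; the multiplier
>> > > > option is hypothetical, named here only to illustrate the idea):
>> > > >
>> > > >   [DEFAULT]
>> > > >   # agents are considered dead after this many seconds of silence
>> > > >   agent_down_time = 75
>> > > >   # hypothetical: extra grace factor applied before rescheduling
>> > > >   # routers away from an agent that has stopped reporting
>> > > >   agent_down_reschedule_multiplier = 2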
>> > > >
>> > > > The reason the state report does not occur is probably that
>> > > > both it and the resync procedure are periodic tasks. If I got it
>> > > > right they're both executed as eventlet greenthreads, but one at
>> > > > a time. Perhaps then adding an initial delay to the full sync
>> > > > task might ensure that the first thing an agent does when it
>> > > > comes up is send a heartbeat to the server? See the sketch
>> > > > below.
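>> > > >
>> > > > A rough sketch of that ordering using oslo's looping calls (the
>> > > > agent method names are placeholders and the module path may
>> > > > differ by release):
>> > > >
>> > > >   from oslo_service import loopingcall
>> > > >
>> > > >   # the heartbeat starts immediately
>> > > >   heartbeat = loopingcall.FixedIntervalLoopingCall(
>> > > >       agent._report_state)
>> > > >   heartbeat.start(interval=report_interval)
>> > > >
>> > > >   # the full sync waits one report_interval, so the first thing
>> > > >   # the agent does after a restart is report state
>> > > >   resync = loopingcall.FixedIntervalLoopingCall(
>> > > >       agent.periodic_sync_routers_task)
>> > > >   resync.start(interval=resync_interval,
>> > > >                initial_delay=report_interval)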
>> > > >
>> > > > On the other hand, while doing the initial full resync, is the
>> > > > agent able to process updates? If not, perhaps it makes sense to
>> > > > consider it down until it finishes synchronisation.
>> > >
>> > > Yes, it can!  The agent prioritizes updates from RPC over full resync
>> > > activities.
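>> > >
>> > > Conceptually (this is a simplification, not the agent's actual
>> > > code) the update queue behaves like:
>> > >
>> > >   from queue import PriorityQueue
>> > >
>> > >   PRIORITY_RPC = 0        # lower value is handled first
>> > >   PRIORITY_FULL_SYNC = 1
>> > >
>> > >   updates = PriorityQueue()
>> > >   updates.put((PRIORITY_FULL_SYNC, 'router-from-resync'))
>> > >   updates.put((PRIORITY_RPC, 'router-update-from-rpc'))
>> > >   priority, router = updates.get()  # RPC update comes out first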
>> > >
>> > > I wonder if the agent should check how long it has been since its last
>> > > state report each time it finishes processing an update for a router.
>> > > It normally doesn't take very long (relatively) to process an update
>> > > to a single router.
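>> > >
>> > > A sketch of the idea (the attribute and method names here are made
>> > > up, not the agent's real ones):
>> > >
>> > >   import time
>> > >
>> > >   def _process_router_update(self, update):
>> > >       self._do_process(update)
>> > >       # opportunistic heartbeat: report now if one is overdue
>> > >       if time.time() - self._last_report >= self.report_interval:
>> > >           self._report_state()
>> > >           self._last_report = time.time()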
>> > >
>> > > I still would like to know why the thread to report state is being
>> > > starved.  Anyone have any insight on this?  I thought that with all
>> > > the system calls, the greenthreads would yield often.  There must be
>> > > something I don't understand about it.
>> > >
>> > > Carl
>> > >
>> >
>> >
>> >
>> > --
>> > Kevin Benton
>>
>> --
>> Itsuro ODA <oda at valinux.co.jp>
>>
>