[openstack-dev] [Neutron] L3 agent rescheduling issue
Kevin Benton
blak111 at gmail.com
Mon Jun 8 06:11:44 UTC 2015
Well, a greenthread will only yield when it makes a blocking call, like
writing to a network socket or file. So once the report_state greenthread
starts executing, it won't yield until it makes a call like that.

I looked through the report_state code for the DHCP agent and the only
blocking call it seems to make is the AMQP report_state call/cast itself.
So even with a bunch of other workers, the report_state thread should get a
chance to execute fairly quickly, since most of our workers should yield very
frequently when they make process calls, etc. That's why I assumed that
there must be something actually stopping it from sending the message.
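To make that concrete, here is a minimal, self-contained sketch (illustrative
only, not the agent code) of how an eventlet greenthread only gives up control
when it hits a blocking, monkey-patched call; pure CPU work never yields:

    # requires: pip install eventlet
    import eventlet
    eventlet.monkey_patch()

    def busy_worker():
        # CPU-bound loop: without the explicit sleep(0) below, this
        # greenthread would never yield and would starve everyone else.
        while True:
            sum(range(100000))
            eventlet.sleep(0)   # cooperative yield point

    def report_state():
        # Stands in for the agent's periodic state report; sleep() is a
        # blocking call, so this greenthread yields to the hub every second.
        while True:
            print("state report sent")
            eventlet.sleep(1)

    eventlet.spawn(busy_worker)
    eventlet.spawn(report_state)
    eventlet.sleep(5)           # let both greenthreads run for a while

Commenting out the sleep(0) in busy_worker makes the reports stop, which is
the starvation pattern being discussed here.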
Do you have a way to reproduce the issue with the DHCP agent?
On Sun, Jun 7, 2015 at 9:21 PM, Eugene Nikanorov <enikanorov at mirantis.com>
wrote:
> No, I think the greenthread itself doesn't do anything special; it's just
> that when there are too many greenthreads, the state_report thread can't get
> control for too long, since there is no prioritization of greenthreads.
>
> Eugene.
>
> On Sun, Jun 7, 2015 at 8:24 PM, Kevin Benton <blak111 at gmail.com> wrote:
>
>> I understand now. So the issue is that the report_state greenthread is
>> just blocking and yielding whenever it tries to actually send a message?
>>
>> On Sun, Jun 7, 2015 at 8:10 PM, Eugene Nikanorov <enikanorov at mirantis.com
>> > wrote:
>>
>>> Salvatore,
>>>
>>> By 'fairness' I meant the chances for the state report greenthread to get
>>> control. In the DHCP case, each network is processed by a separate
>>> greenthread, so the more greenthreads the agent has, the lower the chances
>>> that the report state greenthread will be able to report in time.
>>>
>>> Thanks,
>>> Eugene.
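To see the 'fairness' effect Eugene describes in isolation, one can spawn a
single reporter next to N CPU-heavy workers and watch the reporting gap grow
with N (an illustrative toy, not the DHCP agent's actual loop):

    import time
    import eventlet
    eventlet.monkey_patch()

    def worker():
        # Simulates per-network processing that yields only occasionally.
        while True:
            sum(range(200000))
            eventlet.sleep(0)

    def reporter():
        last = time.time()
        while True:
            eventlet.sleep(1)
            now = time.time()
            # With many workers, this gap drifts past the 1s target.
            print("gap since last report: %.2fs" % (now - last))
            last = now

    for _ in range(200):          # try 10 vs. 200 workers and compare
        eventlet.spawn(worker)
    eventlet.spawn(reporter)
    eventlet.sleep(30)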
>>>
>>> On Sun, Jun 7, 2015 at 4:15 AM, Salvatore Orlando <sorlando at nicira.com>
>>> wrote:
>>>
>>>> On 5 June 2015 at 01:29, Itsuro ODA <oda at valinux.co.jp> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> > After trying to reproduce this, I'm suspecting that the issue is
>>>>> > actually on the server side from failing to drain the agent report
>>>>> > state queue in time.
>>>>>
>>>>> I have seen this before.
>>>>> My thinking about the scenario at that time was as follows.
>>>>> * a lot of create/update resource API calls were issued
>>>>> * the "rpc_conn_pool_size" pool was exhausted by sending notifications,
>>>>>   which blocked further sends on the RPC side
>>>>> * the "rpc_thread_pool_size" pool was exhausted by threads waiting on the
>>>>>   "rpc_conn_pool_size" pool to send RPC replies
>>>>> * receiving state_report was blocked because the "rpc_thread_pool_size"
>>>>>   pool was exhausted
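If that scenario is right, the obvious first experiment is to enlarge the two
pools on the server. A hypothetical neutron.conf fragment (the option names
are the oslo messaging ones of that era; the values are made up for
illustration and defaults vary between releases):

    [DEFAULT]
    # Connections available for outgoing casts/notifications.
    rpc_conn_pool_size = 60
    # Green threads consuming and replying to incoming RPC messages
    # (state_report ends up waiting here when this pool is exhausted).
    rpc_thread_pool_size = 128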
>>>>>
>>>>>
>>>> I think this could be a good explanation, couldn't it?
>>>> Kevin proved that the periodic tasks are not mutually exclusive and
>>>> that long processing times for sync_routers are not an issue.
>>>> However, he correctly suspected server-side involvement, which could
>>>> actually be a lot of requests saturating the RPC pool.
>>>>
>>>> On the other hand, how could we use this theory to explain why this
>>>> issue tends to occur when the agent is restarted?
>>>> Also, Eugene, what do you mean by stating that the issue could be in the
>>>> agent's "fairness"?
>>>>
>>>> Salvatore
>>>>
>>>>
>>>>
>>>>> Thanks
>>>>> Itsuro Oda
>>>>>
>>>>> On Thu, 4 Jun 2015 14:20:33 -0700
>>>>> Kevin Benton <blak111 at gmail.com> wrote:
>>>>>
>>>>> > After trying to reproduce this, I'm suspecting that the issue is
>>>>> > actually on the server side from failing to drain the agent report
>>>>> > state queue in time.
>>>>> >
>>>>> > I set the report_interval to 1 second on the agent and added a logging
>>>>> > statement and I see a report every 1 second even when sync_routers is
>>>>> > taking a really long time.
>>>>> >
>>>>> > On Thu, Jun 4, 2015 at 11:52 AM, Carl Baldwin <carl at ecbaldwin.net>
>>>>> > wrote:
>>>>> >
>>>>> > > Ann,
>>>>> > >
>>>>> > > Thanks for bringing this up. It has been on the shelf for a while
>>>>> now.
>>>>> > >
>>>>> > > Carl
>>>>> > >
>>>>> > > On Thu, Jun 4, 2015 at 8:54 AM, Salvatore Orlando
>>>>> > > <sorlando at nicira.com> wrote:
>>>>> > > > One reason for not sending the heartbeat from a separate
>>>>> > > > greenthread could be that the agent is already doing it [1].
>>>>> > > > The current proposed patch addresses the issue blindly - that is to
>>>>> > > > say, before declaring an agent dead let's wait for some more time
>>>>> > > > because it could be stuck doing stuff. In that case I would
>>>>> > > > probably make the multiplier (currently 2x) configurable.
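For reference, the knobs that the 2x multiplier ties together look roughly
like this (a hypothetical config fragment; the values are illustrative and the
multiplier itself is not currently an option):

    # On the agent side (the [AGENT] section read by the L3/DHCP agents):
    [AGENT]
    report_interval = 30     # seconds between state reports

    # On the neutron-server side:
    [DEFAULT]
    agent_down_time = 75     # consider an agent dead if no report arrives
                             # within this window (the "2x" above)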
>>>>> > > >
>>>>> > > > The reason for which state report does not occur is probably that
>>>>> > > > both it and the resync procedure are periodic tasks. If I got it
>>>>> > > > right they're both executed as eventlet greenthreads but one at a
>>>>> > > > time. Perhaps then adding an initial delay to the full sync task
>>>>> > > > might ensure the first thing an agent does when it comes up is
>>>>> > > > sending a heartbeat to the server?
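The initial-delay idea could be prototyped with the looping-call helper
pattern the agents already use for periodic work; the sketch below assumes the
oslo-style FixedIntervalLoopingCall API and is not a patch against the actual
agent wiring:

    from oslo_service import loopingcall

    def full_sync():
        # placeholder for the agent's full resync logic
        pass

    sync_loop = loopingcall.FixedIntervalLoopingCall(full_sync)
    # Delaying the first run gives the state-report greenthread a chance to
    # send the initial heartbeat before the expensive resync kicks in.
    sync_loop.start(interval=600, initial_delay=30)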
>>>>> > > >
>>>>> > > > On the other hand, while doing the initial full resync, is the
>>>>> > > > agent able to process updates? If not perhaps it makes sense to
>>>>> > > > have it down until it finishes synchronisation.
>>>>> > >
>>>>> > > Yes, it can! The agent prioritizes updates from RPC over full
>>>>> > > resync activities.
>>>>> > >
>>>>> > > I wonder if the agent should check how long it has been since its
>>>>> > > last state report each time it finishes processing an update for a
>>>>> > > router. It normally doesn't take very long (relatively) to process
>>>>> > > an update to a single router.
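That check could look something like the following rough sketch (hypothetical
names, not the actual agent code), reporting state whenever we're overdue
after finishing a router update:

    import time

    class RouterUpdateLoop(object):
        # Hypothetical wrapper around the per-router update processing.

        def __init__(self, report_state_func, report_interval=30):
            self._report_state = report_state_func
            self._report_interval = report_interval
            self._last_report = time.time()

        def process_update(self, router_update):
            self._apply_update(router_update)   # the real per-router work
            # After every router, send a state report if we are overdue,
            # instead of relying solely on the periodic report greenthread.
            if time.time() - self._last_report >= self._report_interval:
                self._report_state()
                self._last_report = time.time()

        def _apply_update(self, router_update):
            pass  # placeholder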
>>>>> > >
>>>>> > > I still would like to know why the thread to report state is being
>>>>> > > starved. Anyone have any insight on this? I thought that with all
>>>>> > > the system calls, the greenthreads would yield often. There must be
>>>>> > > something I don't understand about it.
>>>>> > >
>>>>> > > Carl
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Kevin Benton
>>>>>
>>>>> --
>>>>> Itsuro ODA <oda at valinux.co.jp>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Kevin Benton
>>
>>
>>
>
>
>
--
Kevin Benton