[openstack-dev] [Neutron] DHCP Agent Reliability
Isaku Yamahata
isaku.yamahata at gmail.com
Sat Dec 14 04:06:54 UTC 2013
On Fri, Dec 06, 2013 at 04:30:17PM +0900,
Maru Newby <marun at redhat.com> wrote:
>
> On Dec 5, 2013, at 5:21 PM, Isaku Yamahata <isaku.yamahata at gmail.com> wrote:
>
> > On Wed, Dec 04, 2013 at 12:37:19PM +0900,
> > Maru Newby <marun at redhat.com> wrote:
> >
> >> In the current architecture, the Neutron service handles RPC and WSGI with a single process and is prone to being overloaded such that agent heartbeats can be delayed beyond the limit for the agent being declared 'down'. Even if we increased the agent timeout as Yongsheg suggests, there is no guarantee that we can accurately detect whether an agent is 'live' with the current architecture. Given that amqp can ensure eventual delivery - it is a queue - is sending a notification blind such a bad idea? In the best case the agent isn't really down and can process the notification. In the worst case, the agent really is down but will be brought up eventually by a deployment's monitoring solution and process the notification when it returns. What am I missing?
> >>
> >
> > Do you mean overload of the neutron server, not the neutron agent?
> > So even if the agent sends periodic 'live' reports, the reports pile up
> > unprocessed by the server.
> > When the server sends a notification, it wrongly considers the agent dead,
> > not because the agent failed to send live reports due to agent overload.
> > Is this understanding correct?
>
> Your interpretation is likely correct. The demands on the service are going to be much higher by virtue of having to field RPC requests from all the agents to interact with the database on their behalf.
Does this strongly suggest thread starvation, i.e. overly unfair
thread scheduling?
Given that eventlet uses cooperative threading, should we add sleep(0) to
the thread that is hogging the CPU?
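
To illustrate what I mean, here is a minimal sketch of starvation under
cooperative scheduling. It is not Neutron code. It uses asyncio from the
standard library as a stand-in for eventlet so it runs without extra
packages; asyncio.sleep(0) plays the role of eventlet.sleep(0). The names
hog, heartbeat and run are made up for this example.

```python
import asyncio


async def heartbeat(log, beats):
    # Hypothetical stand-in for handling agent 'live' reports:
    # each beat needs a turn on the scheduler.
    for i in range(beats):
        log.append(f"beat{i}")
        await asyncio.sleep(0)  # yield to other tasks


async def hog(log, chunks, cooperative):
    # Hypothetical stand-in for a long-running handler with no natural
    # yield point (no I/O), so it never gives up control by itself.
    for i in range(chunks):
        sum(range(10_000))  # CPU-bound work
        log.append(f"work{i}")
        if cooperative:
            # Explicit yield, analogous to eventlet.sleep(0).
            await asyncio.sleep(0)


async def _run(cooperative):
    log = []
    await asyncio.gather(hog(log, 3, cooperative), heartbeat(log, 3))
    return log


def run(cooperative):
    # Returns the order in which work chunks and heartbeats were processed.
    return asyncio.run(_run(cooperative))


if __name__ == "__main__":
    print("without yield:", run(False))
    print("with yield:   ", run(True))
```

Without the explicit yield, no heartbeat is processed until the hog has
finished all of its work. With it, heartbeats are interleaved with the
work chunks. Whether this is what actually happens in the neutron server
would still need to be confirmed by measurement.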
--
Isaku Yamahata <isaku.yamahata at gmail.com>