[openstack-dev] [Neutron] DHCP Agent Reliability
joe.gordon0 at gmail.com
Wed Dec 4 11:25:23 UTC 2013
On Dec 4, 2013 5:41 AM, "Maru Newby" <marun at redhat.com> wrote:
> On Dec 4, 2013, at 11:57 AM, Clint Byrum <clint at fewbar.com> wrote:
> > Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800:
> >> I've been investigating a bug that is preventing VMs from receiving
IP addresses when a Neutron service is under high load:
> >> https://bugs.launchpad.net/neutron/+bug/1192381
> >> High load causes the DHCP agent's status updates to be delayed,
causing the Neutron service to assume that the agent is down. This results
in the Neutron service not sending notifications of port addition to the
DHCP agent. At present, the notifications are simply dropped. A simple
fix is to send notifications regardless of agent status. Does anybody have
any objections to this stop-gap approach? I'm not clear on the
implications of sending notifications to agents that are down, but I'm
hoping for a simple fix that can be backported to both havana and grizzly
(yes, this bug has been with us that long).
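
The behaviour described above can be sketched in a few lines of Python. This is a simplified illustration, not Neutron's code: the Agent class, AGENT_DOWN_TIME, the "port_added" event name and the skip_if_down flag are all hypothetical names chosen for the example. It shows how a late heartbeat leads to a notification being dropped, and how the proposed stop-gap would send it anyway.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: seconds without a heartbeat before an agent
# is considered down. Not Neutron's actual setting.
AGENT_DOWN_TIME = 5.0


@dataclass
class Agent:
    host: str
    last_heartbeat: float                      # time of last status update
    inbox: list = field(default_factory=list)  # stand-in for the agent's queue


def is_agent_down(agent, now, down_time=AGENT_DOWN_TIME):
    """An agent is presumed down if its last heartbeat is too old."""
    return now - agent.last_heartbeat > down_time


def notify_port_added(agent, port_id, now, skip_if_down=True):
    """Return True if the notification was sent, False if it was dropped.

    skip_if_down=True models the current behaviour (drop when the agent
    looks down); skip_if_down=False models the proposed stop-gap.
    """
    if skip_if_down and is_agent_down(agent, now):
        return False  # dropped without any user-visible error
    agent.inbox.append(("port_added", port_id))
    return True


# The heartbeat was delayed by server load, so the agent merely looks down.
agent = Agent(host="dhcp-1", last_heartbeat=0.0)
dropped = notify_port_added(agent, "p1", now=10.0)
sent = notify_port_added(agent, "p1", now=10.0, skip_if_down=False)
```

With the check enabled, the first call drops the notification even though the agent may be perfectly healthy; with the check disabled, the second call delivers it.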
> >> Fixing this problem for real, though, will likely be more involved.
The proposal to replace the current wsgi framework with Pecan may increase
the Neutron service's scalability, but should we continue to use a 'fire
and forget' approach to notification? Being able to track the success or
failure of a given action outside of the logs would seem pretty important,
and allow for more effective coordination with Nova than is currently
> > Dropping requests without triggering a user-visible error is a pretty
> > serious problem. You didn't mention if you have filed a bug about that.
> > If not, please do or let us know here so we can investigate and file
> > a bug.
> There is a bug linked to in the original message that I am already
working on. The fact that that bug title is 'dhcp agent doesn't configure
ports' rather than 'dhcp notifications are silently dropped' is incidental.
> > It seems to me that they should be put into a queue to be retried.
> > Sending the notifications blindly is almost as bad as dropping them,
> > as you have no idea if the agent is alive or not.
> This is more the kind of discussion I was looking for.
> In the current architecture, the Neutron service handles RPC and WSGI
with a single process and is prone to being overloaded such that agent
heartbeats can be delayed beyond the limit for the agent being declared
'down'. Even if we increased the agent timeout as Yongsheng suggests, there
is no guarantee that we can accurately detect whether an agent is 'live'
with the current architecture. Given that AMQP can ensure eventual
delivery - it is a queue - is sending a notification blind such a bad idea?
In the best case the agent isn't really down and can process the
notification. In the worst case, the agent really is down but will be
brought up eventually by a deployment's monitoring solution and process the
notification when it returns. What am I missing?
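
A toy model of the argument above, assuming the broker holds undelivered messages until a consumer returns. Python's in-process queue.Queue stands in for the AMQP queue here; whether a real broker keeps messages for an absent agent depends on how the queue is declared, which this sketch does not address. The function names are hypothetical.

```python
import queue


def send_blind(agent_queue, notification):
    """Publish without checking whether the agent is believed to be alive."""
    agent_queue.put(notification)


def agent_process_pending(agent_queue):
    """Called when the agent is (re)started: handle everything waiting."""
    handled = []
    while True:
        try:
            handled.append(agent_queue.get_nowait())
        except queue.Empty:
            return handled


agent_queue = queue.Queue()

# The agent is down (or only appears to be); notifications are sent anyway.
send_blind(agent_queue, {"event": "port_added", "port_id": "p1"})
send_blind(agent_queue, {"event": "port_added", "port_id": "p2"})

# Later, monitoring restarts the agent and it drains its queue in order.
recovered = agent_process_pending(agent_queue)
```

In this model nothing is lost while the agent is away, but the sender still learns nothing about whether or when the notifications were processed, which is the gap the tracking discussion is about.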
> Please consider that while a good solution will track notification
delivery and success, we may need 2 solutions:
> 1. A 'good-enough', minimally-invasive stop-gap that can be back-ported
to grizzly and havana.
> 2. A 'best-effort' refactor that maximizes the reliability of the DHCP
> I'm hoping that coming up with a solution to #1 will allow us the
breathing room to work on #2 in this cycle.
I like the two-part approach, but I would phrase it slightly differently:
a short-term solution to help Neutron meet the goal of deprecating
nova-network by icehouse-2, and a more robust long-term solution.
> > _______________________________________________
> > OpenStack-dev mailing list
> > OpenStack-dev at lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev