[openstack-dev] [Neutron] DHCP Agent Reliability

Clint Byrum clint at fewbar.com
Wed Dec 4 03:59:28 UTC 2013


Excerpts from Maru Newby's message of 2013-12-03 19:37:19 -0800:
> 
> On Dec 4, 2013, at 11:57 AM, Clint Byrum <clint at fewbar.com> wrote:
> 
> > Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800:
> >> I've been investigating a bug that is preventing VM's from receiving IP addresses when a Neutron service is under high load:
> >> 
> >> https://bugs.launchpad.net/neutron/+bug/1192381
> >> 
> >> High load causes the DHCP agent's status updates to be delayed, causing the Neutron service to assume that the agent is down.  This results in the Neutron service not sending notifications of port addition to the DHCP agent.  At present, the notifications are simply dropped.  A simple fix is to send notifications regardless of agent status.  Does anybody have any objections to this stop-gap approach?  I'm not clear on the implications of sending notifications to agents that are down, but I'm hoping for a simple fix that can be backported to both havana and grizzly (yes, this bug has been with us that long).
> >> 
> >> Fixing this problem for real, though, will likely be more involved.  The proposal to replace the current wsgi framework with Pecan may increase the Neutron service's scalability, but should we continue to use a 'fire and forget' approach to notification?  Being able to track the success or failure of a given action outside of the logs would seem pretty important, and allow for more effective coordination with Nova than is currently possible.
> >> 
> > 
> > Dropping requests without triggering a user-visible error is a pretty
> > serious problem. You didn't mention if you have filed a bug about that.
> > If not, please do or let us know here so we can investigate and file
> > a bug.
> 
> There is a bug linked to in the original message that I am already working on.  The fact that that bug title is 'dhcp agent doesn't configure ports' rather than 'dhcp notifications are silently dropped' is incidental.
> 

Good point, I suppose that one bug is enough.

> > 
> > It seems to me that they should be put into a queue to be retried.
> > Sending the notifications blindly is almost as bad as dropping them,
> > as you have no idea if the agent is alive or not.
> 
> This is more the kind of discussion I was looking for.  
> 
> In the current architecture, the Neutron service handles RPC and WSGI with a single process and is prone to being overloaded such that agent heartbeats can be delayed beyond the limit for the agent being declared 'down'.  Even if we increased the agent timeout as Yongsheg suggests, there is no guarantee that we can accurately detect whether an agent is 'live' with the current architecture.  Given that amqp can ensure eventual delivery - it is a queue - is sending a notification blind such a bad idea?  In the best case the agent isn't really down and can process the notification.  In the worst case, the agent really is down but will be brought up eventually by a deployment's monitoring solution and process the notification when it returns.  What am I missing? 
>

I have not looked closely into what expectations are built in to the
notification system, so I may have been off base. My understanding was
they were not necessarily guaranteed to be delivered, but if they are,
then this is fine.

> Please consider that while a good solution will track notification delivery and success, we may need 2 solutions:
> 
> 1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to grizzly and havana.
>

I don't know why we'd backport to grizzly. But yes, if we can get a
notable jump in reliability with a clear patch, I'm all for it.

> 2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent.
> 
> I'm hoping that coming up with a solution to #1 will allow us the breathing room to work on #2 in this cycle.
>

Understood, I like the short term plan and think long term having more
CPU available to process more messages is a good thing, most likely in
the form of more worker processes.



More information about the OpenStack-dev mailing list