[openstack-dev] [Neutron] DHCP Agent Reliability

Maru Newby marun at redhat.com
Wed Dec 4 03:37:19 UTC 2013


On Dec 4, 2013, at 11:57 AM, Clint Byrum <clint at fewbar.com> wrote:

> Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800:
>> I've been investigating a bug that is preventing VMs from receiving IP addresses when a Neutron service is under high load:
>> 
>> https://bugs.launchpad.net/neutron/+bug/1192381
>> 
>> High load causes the DHCP agent's status updates to be delayed, causing the Neutron service to assume that the agent is down.  This results in the Neutron service not sending notifications of port addition to the DHCP agent.  At present, the notifications are simply dropped.  A simple fix is to send notifications regardless of agent status.  Does anybody have any objections to this stop-gap approach?  I'm not clear on the implications of sending notifications to agents that are down, but I'm hoping for a simple fix that can be backported to both havana and grizzly (yes, this bug has been with us that long).
>> 
>> Fixing this problem for real, though, will likely be more involved.  The proposal to replace the current wsgi framework with Pecan may increase the Neutron service's scalability, but should we continue to use a 'fire and forget' approach to notification?  Being able to track the success or failure of a given action outside of the logs would seem pretty important, and allow for more effective coordination with Nova than is currently possible.
>> 
> 
> Dropping requests without triggering a user-visible error is a pretty
> serious problem. You didn't mention if you have filed a bug about that.
> If not, please do or let us know here so we can investigate and file
> a bug.

The original message links to a bug that I am already working on.  That the bug's title is 'dhcp agent doesn't configure ports' rather than 'dhcp notifications are silently dropped' is incidental.

> 
> It seems to me that they should be put into a queue to be retried.
> Sending the notifications blindly is almost as bad as dropping them,
> as you have no idea if the agent is alive or not.

This is more the kind of discussion I was looking for.  

In the current architecture, the Neutron service handles RPC and WSGI in a single process and is prone to being overloaded, such that agent heartbeats can be delayed beyond the limit at which the agent is declared 'down'.  Even if we increased the agent timeout as Yongsheng suggests, there is no guarantee that we can accurately detect whether an agent is 'live' with the current architecture.  Given that AMQP can ensure eventual delivery - it is a queue - is sending a notification blind such a bad idea?  In the best case the agent isn't really down and can process the notification.  In the worst case, the agent really is down but will eventually be brought back up by a deployment's monitoring solution and will process the notification when it returns.  What am I missing?
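
To make the failure mode and the stop-gap concrete, here is a minimal sketch in plain Python.  It is not Neutron code: the Agent and Server classes, the AGENT_DOWN_TIME value, and the in-memory deque standing in for an AMQP queue are all assumptions made for illustration.

```python
import collections

AGENT_DOWN_TIME = 5  # hypothetical liveness threshold, in seconds


class Agent:
    """Toy stand-in for a DHCP agent that consumes from its own queue."""

    def __init__(self):
        # Assumption: this deque models a queue that outlives the agent process.
        self.queue = collections.deque()
        self.processed = []
        self.running = True

    def drain(self):
        # Process everything that has accumulated, but only while running.
        while self.running and self.queue:
            self.processed.append(self.queue.popleft())


class Server:
    """Toy stand-in for the service that notifies the agent."""

    def __init__(self, agent, check_liveness):
        self.agent = agent
        self.check_liveness = check_liveness
        self.last_heartbeat = 0
        self.dropped = []

    def agent_is_up(self, now):
        # The agent is considered 'down' once its last heartbeat is too old.
        return now - self.last_heartbeat <= AGENT_DOWN_TIME

    def notify(self, event, now):
        if self.check_liveness and not self.agent_is_up(now):
            self.dropped.append(event)  # modeled behaviour: silently dropped
            return
        self.agent.queue.append(event)  # stop-gap: enqueue regardless


def simulate(check_liveness):
    """The agent is alive, but its heartbeat is processed late by the server."""
    agent = Agent()
    server = Server(agent, check_liveness)
    server.notify("port-1", now=3)  # last heartbeat at t=0 is still fresh
    server.notify("port-2", now=8)  # heartbeat delayed: agent looks down
    agent.drain()
    return agent.processed, server.dropped


def simulate_restart():
    """The agent really is down; notifications wait in the queue."""
    agent = Agent()
    agent.running = False
    server = Server(agent, check_liveness=False)
    server.notify("port-1", now=8)
    agent.drain()
    before = list(agent.processed)
    agent.running = True  # e.g. restarted by the deployment's monitoring
    agent.drain()
    return before, agent.processed
```

Under these assumptions, simulate(check_liveness=True) drops port-2 because the heartbeat arrived late, while simulate(check_liveness=False) delivers both events.  simulate_restart() models the worst case: nothing is processed while the agent is down, and the queued notification is handled once it returns.  Whether a real queue outlives the agent depends on how the queue is declared, which this sketch does not model.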

Please consider that while a good solution will track notification delivery and success, we may need two solutions:

1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to grizzly and havana.

2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent.

I'm hoping that coming up with a solution to #1 will allow us the breathing room to work on #2 in this cycle.


m.


