[openstack-dev] [Neutron] DHCP Agent Reliability

Isaku Yamahata isaku.yamahata at gmail.com
Thu Dec 5 08:21:12 UTC 2013


On Wed, Dec 04, 2013 at 12:37:19PM +0900,
Maru Newby <marun at redhat.com> wrote:

> On Dec 4, 2013, at 11:57 AM, Clint Byrum <clint at fewbar.com> wrote:
> 
> > Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800:
> >> I've been investigating a bug that is preventing VM's from receiving IP addresses when a Neutron service is under high load:
> >> 
> >> https://bugs.launchpad.net/neutron/+bug/1192381
> >> 
> >> High load causes the DHCP agent's status updates to be delayed, causing the Neutron service to assume that the agent is down.  This results in the Neutron service not sending notifications of port addition to the DHCP agent.  At present, the notifications are simply dropped.  A simple fix is to send notifications regardless of agent status.  Does anybody have any objections to this stop-gap approach?  I'm not clear on the implications of sending notifications to agents that are down, but I'm hoping for a simple fix that can be backported to both havana and grizzly (yes, this bug has been with us that long).
> >> 
> >> Fixing this problem for real, though, will likely be more involved.  The proposal to replace the current wsgi framework with Pecan may increase the Neutron service's scalability, but should we continue to use a 'fire and forget' approach to notification?  Being able to track the success or failure of a given action outside of the logs would seem pretty important, and allow for more effective coordination with Nova than is currently possible.
> >> 
> > 
> > Dropping requests without triggering a user-visible error is a pretty
> > serious problem. You didn't mention if you have filed a bug about that.
> > If not, please do or let us know here so we can investigate and file
> > a bug.
> 
> There is a bug linked to in the original message that I am already working on.  The fact that that bug title is 'dhcp agent doesn't configure ports' rather than 'dhcp notifications are silently dropped' is incidental.
> 
> > 
> > It seems to me that they should be put into a queue to be retried.
> > Sending the notifications blindly is almost as bad as dropping them,
> > as you have no idea if the agent is alive or not.
> 
> This is more the kind of discussion I was looking for.  
> 
> In the current architecture, the Neutron service handles RPC and WSGI with a single process and is prone to being overloaded such that agent heartbeats can be delayed beyond the limit for the agent being declared 'down'.  Even if we increased the agent timeout as Yongsheg suggests, there is no guarantee that we can accurately detect whether an agent is 'live' with the current architecture.  Given that amqp can ensure eventual delivery - it is a queue - is sending a notification blind such a bad idea?  In the best case the agent isn't really down and can process the notification.  In the worst case, the agent really is down but will be brought up eventually by a deployment's monitoring solution and process the notification when it returns.  What am I missing? 
> 

Do you mean overload of the neutron server, not the neutron agent?
So even though the agent sends its periodic 'live' reports, the reports pile up
unprocessed at the server, and when the server goes to send a notification
it wrongly considers the agent dead, not because the agent failed to send
live reports due to its own overload.
Is this understanding correct?


> Please consider that while a good solution will track notification delivery and success, we may need 2 solutions:
> 
> 1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to grizzly and havana.

How about twisting DhcpAgent._periodic_resync_helper?
If no notification has been received from the server since the last sleep,
it would call self.sync_state() even if self.needs_resync = False. That way
any inconsistency between agent and server caused by a lost notification
would eventually be repaired.
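
A minimal sketch of that twist (the notification_seen flag is a
hypothetical attribute that the agent's RPC handlers such as
port_create_end would set whenever a notification arrives; the actual
grizzly/havana helper only checks needs_resync):

    import eventlet

    class DhcpAgent(object):
        # Existing attributes elided.  notification_seen is the
        # hypothetical flag described above; it is reset at the end of
        # every cycle so the next cycle can detect silence again.

        def _periodic_resync_helper(self):
            while True:
                eventlet.sleep(self.conf.resync_interval)
                # Resync not only after an explicit error, but also when
                # the server has been silent for a whole interval, in
                # case a notification was dropped.
                if self.needs_resync or not self.notification_seen:
                    self.needs_resync = False
                    self.sync_state()
                self.notification_seen = False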


> 2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent.
> 
> I'm hoping that coming up with a solution to #1 will allow us the breathing room to work on #2 in this cycle.

Loss of notifications is somewhat inevitable, I think
(unless tasks are logged to stable storage shared between server and agent),
and unconditionally sending notifications would cause its own problems.

You mentioned an agent crash. A server crash should also be taken care of
for reliability, and an admin sometimes wants to restart the neutron
server/agents for various reasons.
The agent can crash after receiving a notification but before it starts
processing the actual task. The server can crash after committing changes to
the DB but before sending the notification. In such cases the notification
is lost, so polling to resync will be necessary somewhere.

Some observations on the current resync code and possible optimizations:

- Notification loss isn't considered: self.resync is not always run.
  Some optimization is possible, for example:
  - detect loss by a sequence number (see the sketch after this list)
  - polling can be postponed while notifications keep arriving without loss.

- The periodic resync spawns threads but doesn't wait for their completion,
  so if a resync takes a long time, the next one can start while the
  previous resync is still running.

- Notification processing can be batched.

- Live reports can be reduced by piggybacking them on other RPC calls.
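
To illustrate the sequence-number idea mentioned in the list above, a
rough sketch; the per-notification 'seq' field that the server would
stamp, and the process_notification() call, are assumptions for
illustration, not existing Neutron APIs:

    class NotificationTracker(object):
        """Remember the last sequence number seen from the server."""

        def __init__(self):
            self.last_seq = None

        def lost_since_last(self, seq):
            """Return True if at least one notification was skipped."""
            lost = self.last_seq is not None and seq != self.last_seq + 1
            self.last_seq = seq
            return lost

    def handle_notification(tracker, agent, payload):
        # 'seq' is the hypothetical per-agent counter added by the server.
        if tracker.lost_since_last(payload['seq']):
            # A gap means an incremental update was dropped; fall back to
            # a full resync instead of trusting the partial view.
            agent.needs_resync = True
        else:
            agent.process_notification(payload)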
-- 
Isaku Yamahata <isaku.yamahata at gmail.com>


