[openstack-dev] [Neutron] DHCP Agent Reliability

Maru Newby marun at redhat.com
Fri Dec 6 07:30:17 UTC 2013

On Dec 5, 2013, at 5:21 PM, Isaku Yamahata <isaku.yamahata at gmail.com> wrote:

> On Wed, Dec 04, 2013 at 12:37:19PM +0900,
> Maru Newby <marun at redhat.com> wrote:
>> In the current architecture, the Neutron service handles RPC and WSGI with a single process and is prone to being overloaded such that agent heartbeats can be delayed beyond the limit for the agent being declared 'down'.  Even if we increased the agent timeout as Yongsheg suggests, there is no guarantee that we can accurately detect whether an agent is 'live' with the current architecture.  Given that amqp can ensure eventual delivery - it is a queue - is sending a notification blind such a bad idea?  In the best case the agent isn't really down and can process the notification.  In the worst case, the agent really is down but will be brought up eventually by a deployment's monitoring solution and process the notification when it returns.  What am I missing? 
> Do you mean overload of neutron server? Not neutron agent.
> So even if the agent sends periodic 'live' reports, the reports are piled up
> unprocessed by server.
> When server sends notification, it considers agent dead wrongly.
> Not because agent didn't send live reports due to overload of agent.
> Is this understanding correct?

Your interpretation is likely correct.  The demands on the service are going to be much higher by virtue of having to field RPC requests from all the agents to interact with the database on their behalf.
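
To make the failure mode concrete, here is a rough sketch (not Neutron code; the threshold value and all names are assumptions for illustration only) of how a server that falls behind on processing heartbeats can declare a healthy agent down:

```python
AGENT_DOWN_TIME = 75  # seconds; hypothetical threshold used only for this sketch


def is_agent_down(last_heartbeat, now, agent_down_time=AGENT_DOWN_TIME):
    """Return True if the gap since the given heartbeat exceeds the limit."""
    return (now - last_heartbeat) > agent_down_time


# The agent sent heartbeats at t=0, 30, 60 and 90, but the overloaded server
# has only processed the first one; the rest sit unprocessed in its queue.
sent = [0, 30, 60, 90]
processed = sent[:1]
now = 100

# What the server concludes from the heartbeats it has actually processed.
seen_by_server = is_agent_down(processed[-1], now)
# What it would conclude if it had kept up with the queue.
actual = is_agent_down(sent[-1], now)
```

In this toy run the server's view and the agent's real state disagree, and raising the timeout only moves the point at which that happens.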

>> Please consider that while a good solution will track notification delivery and success, we may need 2 solutions:
>> 1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to grizzly and havana.
> How about twisting DhcpAgent._periodic_resync_helper?
> If no notification is received from server since last sleep,
> it calls self.sync_state() even if self.needs_resync = False. Thus the
> inconsistency between agent and server due to losing notification
> will be fixed.

Unless I'm missing something, wouldn't forcing more and potentially unnecessary resyncs increase the load on the Neutron service and negatively impact reliability?
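
For reference, my reading of the proposal is roughly the following. This is a toy stand-in, not the real DhcpAgent: the `notification_seen` flag and the `periodic_resync_tick` method are hypothetical names, and the sleep between iterations is omitted.

```python
class ResyncSketch:
    """Toy stand-in for the agent; not the real DhcpAgent."""

    def __init__(self):
        self.needs_resync = False
        self.notification_seen = False  # hypothetical flag set by RPC handlers
        self.sync_count = 0

    def on_notification(self):
        # Called whenever any notification arrives from the server.
        self.notification_seen = True

    def sync_state(self):
        # Stand-in for the real resync, which queries the server over RPC.
        self.sync_count += 1

    def periodic_resync_tick(self):
        """One iteration of the periodic loop (the sleep is omitted)."""
        if self.needs_resync or not self.notification_seen:
            self.needs_resync = False
            self.sync_state()
        # Start a fresh observation window for the next interval.
        self.notification_seen = False


agent = ResyncSketch()
agent.periodic_resync_tick()   # idle interval: forced resync
agent.on_notification()
agent.periodic_resync_tick()   # a notification arrived: no resync
agent.periodic_resync_tick()   # idle again: forced resync
```

Every idle interval costs one sync_state() call per agent, which is the extra load I am concerned about.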

>> 2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent.
>> I'm hoping that coming up with a solution to #1 will allow us the breathing room to work on #2 in this cycle.
> Loss of notifications is somewhat inevitable, I think.
> (Or logging tasks to stable storage shared between server and agent)
> And unconditionally sending notifications would cause problems.

Regarding sending notifications unconditionally, what specifically are you worried about?  I can imagine 2 scenarios:

Case 1: Send notification to an agent that is incorrectly reported as down. 
Result:  Agent receives notification and acts on it.

Case 2: Send notification to an agent that is actually down.
Result: Agent comes up eventually (in a production environment this should be a given) and calls sync_state().  We definitely need to make sync_state more reliable, though (I discuss the specifics later in this message).

Notifications could of course be dropped if AMQP queues are not persistent and are lost, but I don't think there needs to be a code-based remedy for this.  An operator is likely to deploy the AMQP service in HA to prevent the queues from being lost, and know to restart everything in the event of catastrophic failure.

That's not to say we don't have work to do, though.  An agent is responsible for communicating resource state changes to the service, but the service neither detects nor reacts when the state of a resource is scheduled to change and fails to do so in a reasonable timeframe.  Thus, as in the bug that prompted this discussion, it is up to the user to detect the failure (a VM without connectivity).  Ideally, Neutron should be tracking resource state changes with sufficient detail and reviewing them periodically to allow timely failure detection and remediation.  However, such a change is unlikely to be a candidate for backport so it will have to wait.
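
As a sketch of what that tracking might look like on the service side (purely illustrative; `PendingChangeTracker` and its methods are hypothetical and nothing like this exists in Neutron today):

```python
import time


class PendingChangeTracker:
    """Hypothetical service-side record of resource changes awaiting an agent."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._pending = {}  # resource_id -> time the change was scheduled

    def scheduled(self, resource_id, now=None):
        # Record that a state change was requested for this resource.
        self._pending[resource_id] = time.time() if now is None else now

    def confirmed(self, resource_id):
        # The agent reported that the change was applied.
        self._pending.pop(resource_id, None)

    def overdue(self, now=None):
        # Resources whose change has not been confirmed within the timeout.
        now = time.time() if now is None else now
        return sorted(rid for rid, ts in self._pending.items()
                      if now - ts > self.timeout)


tracker = PendingChangeTracker(timeout=60)
tracker.scheduled("port-a", now=0)
tracker.scheduled("port-b", now=0)
tracker.confirmed("port-a")          # the agent reported success for port-a only
overdue = tracker.overdue(now=120)   # periodic review finds port-b still pending
```

A periodic task could review the overdue set and trigger remediation, rather than leaving the user to discover a VM without connectivity.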

> You mentioned agent crash. Server crash should also be taken care of
> for reliability. Admin also sometimes wants to restart neutron
> server/agents for some reasons.
> Agent can crash after receiving notifications before start processing
> actual tasks. Server can crash after committing changes to DB before sending
> notifications. In such cases, notification will be lost.
> Polling to resync would be necessary somewhere.

Agreed, we need to consider the cases of both agent and service failure.  

In the case of service failure, thanks to recently merged patches, the dhcp agent will at least force a resync in the event of an error in communicating with the server.  However, there is no guarantee that the agent will communicate with the server during the downtime.  While polling is one possible solution, might it be preferable for the service to simply notify the agents when it starts?  The dhcp agent can already receive an agent_updated RPC message that triggers a resync.  

> - notification loss isn't considered.
>  self.resync is not always run.
>  some optimization is possible, for example
>  - detect loss by sequence number
>  - polling can be postponed when notifications come without loss.

Notification loss due to agent failure is already solved - sync_state() is called on startup.  Notification loss due to server failure could be handled as described above.   I think the larger problem is that calling sync_state() does not affect processing of notifications already in the queue, which could result in stale notifications being processed out-of-order, e.g.

- service sends 'network down' notification
- service goes down after committing 'network up' to db, but before sending notification
- service comes back up
- agent knows (somehow) to resync, setting the network 'up'
- agent processes stale 'network down' notification

Though tracking sequence numbers is one possible fix, what do you think of instead ignoring all notifications generated before a timestamp set at the beginning of sync_state()?  
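
A minimal sketch of the timestamp idea, replaying the scenario above. It assumes each notification carries the time at which the service generated it and that server and agent clocks are comparable; neither is true of the current code, and all names here are hypothetical.

```python
class TimestampFilterSketch:
    """Toy agent that drops notifications older than its last resync."""

    def __init__(self):
        self.sync_started_at = 0.0
        self.networks = {}  # network_id -> admin_state_up

    def sync_state(self, server_state, now):
        # Record the cutoff first, then load the authoritative state.
        self.sync_started_at = now
        self.networks = dict(server_state)

    def handle_notification(self, network_id, admin_state_up, generated_at):
        """Apply the notification unless it predates the last resync."""
        if generated_at < self.sync_started_at:
            return False  # stale: the resync already reflects newer state
        self.networks[network_id] = admin_state_up
        return True


agent = TimestampFilterSketch()
# The service committed 'network up' to the db, and the agent resyncs at t=20.
agent.sync_state({"net-1": True}, now=20.0)
# The stale 'network down' notification generated at t=10 is then dequeued.
applied = agent.handle_notification("net-1", False, generated_at=10.0)
```

Here the stale notification is dropped and the network stays 'up'. The cost compared with sequence numbers is the dependence on clocks.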

> - periodic resync spawns threads, but doesn't wait for their completion.
>  So if resync takes long time, next resync can start even while
>  resync is going on.

sync_state() now waits for the completion of threads thanks to the following patch:


