[openstack-dev] [Neutron] DHCP Agent Reliability

Isaku Yamahata isaku.yamahata at gmail.com
Sat Dec 7 08:53:29 UTC 2013

On Fri, Dec 06, 2013 at 04:30:17PM +0900,
Maru Newby <marun at redhat.com> wrote:

> >> 2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent.
> >> 
> >> I'm hoping that coming up with a solution to #1 will allow us the breathing room to work on #2 in this cycle.
> > 
> > Loss of notifications is somewhat inevitable, I think.
> > (Or logging tasks to stable storage shared between server and agent)
> > And unconditionally sending notifications would cause problems.
> Regarding sending notifications unconditionally, what specifically are you worried about?  I can imagine 2 scenarios:
> Case 1: Send notification to an agent that is incorrectly reported as down. 
> Result:  Agent receives notification and acts on it.
> Case 2: Send notification to an agent that is actually down.
> Result: Agent comes up eventually (in a production environment this should be a given) and calls sync_state().  We definitely need to make sync_state more reliable, though (I discuss the specifics later in this message).
> Notifications could of course be dropped if AMQP queues are not persistent and are lost, but I don't think there needs to be a code-based remedy for this.  An operator is likely to deploy the AMQP service in HA to prevent the queues from being lost, and know to restart everything in the event of catastrophic failure.

Case 3: Hardware failure. The agent on that node is gone,
        and another agent will run on another node.

If the AMQP service is set up so that notifications are never lost, notifications
for the dead agent will pile up and stress the AMQP service. I would say a single
node failure isn't catastrophic.

> That's not to say we don't have work to do, though.  An agent is responsible for communicating resource state changes to the service, but the service neither detects nor reacts when the state of a resource is scheduled to change and fails to do so in a reasonable timeframe.  Thus, as in the bug that prompted this discussion, it is up to the user to detect the failure (a VM without connectivity).  Ideally, Neutron should be tracking resource state changes with sufficient detail and reviewing them periodically to allow timely failure detection and remediation.

You are proposing polling by the Neutron server.
So polling somewhere (in the server, in the agent, or a hybrid of both) is the way
to go in the long term. Do you agree?
The details to discuss would be how to do the polling, how often (or how adaptively)
it should be done, and how its cost can be mitigated by various tricks.
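
To make the "adaptive" part concrete, here is a minimal sketch of one possible
policy: back off while polls find nothing to repair, and return to the shortest
interval as soon as drift is found. AdaptivePoller and its parameters are
hypothetical names for illustration, not anything that exists in Neutron.

```python
class AdaptivePoller:
    """Hypothetical helper: stretch the poll interval while state looks healthy."""

    def __init__(self, min_interval=5.0, max_interval=300.0, factor=2.0):
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.factor = factor
        self.interval = min_interval

    def record(self, drift_found):
        """Update the interval after one poll; return the next delay in seconds."""
        if drift_found:
            # State had drifted from the DB: poll again soon.
            self.interval = self.min_interval
        else:
            # Nothing to repair: poll less often, up to the cap.
            self.interval = min(self.interval * self.factor, self.max_interval)
        return self.interval
```

Whether the poller lives in the server or in the agent, the caller would sleep
for the returned delay before the next poll.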

> However, such a change is unlikely to be a candidate for backport so it will have to wait.

Right, this isn't for backport. I'm talking about the mid- to long-term direction.

> > You mentioned agent crash. Server crash should also be taken care of
> > for reliability. Admins also sometimes want to restart the neutron
> > server/agents for various reasons.
> > An agent can crash after receiving notifications but before it starts processing
> > the actual tasks. The server can crash after committing changes to the DB but
> > before sending notifications. In such cases, notifications will be lost.
> > Polling to resync would be necessary somewhere.
> Agreed, we need to consider the cases of both agent and service failure.  
> In the case of service failure, thanks to recently merged patches, the dhcp agent will at least force a resync in the event of an error in communicating with the server.  However, there is no guarantee that the agent will communicate with the server during the downtime.  While polling is one possible solution, might it be preferable for the service to simply notify the agents when it starts?  The dhcp agent can already receive an agent_updated RPC message that triggers a resync.  

Agreed, notification on server startup is better.

> > - notification loss isn't considered.
> >  self.resync is not always run.
> >  some optimization is possible, for example
> >  - detect loss by sequence number
> >  - polling can be postponed when notifications come without loss.
> Notification loss due to agent failure is already solved - sync_state() is called on startup.  Notification loss due to server failure could be handled as described above.   I think the larger problem is that calling sync_state() does not affect processing of notifications already in the queue, which could result in stale notifications being processed out-of-order, e.g.
> - service sends 'network down' notification
> - service goes down after committing 'network up' to db, but before sending notification
> - service comes back up
> - agent knows (somehow) to resync, setting the network 'up'
> - agent processes stale 'network down' notification
> Though tracking sequence numbers is one possible fix, what do you think of instead ignoring all notifications generated before a timestamp set at the beginning of sync_state()?  

I agree that improvement is necessary in this area, and it is better for the agent to
ignore stale notifications somehow.
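
A minimal sketch of the timestamp idea, assuming each notification carries a
server-side 'timestamp' field and that server and agent clocks are comparable.
Both are assumptions; StaleFilter, begin_sync and should_process are hypothetical
names, not existing agent code.

```python
import time


class StaleFilter:
    """Hypothetical filter: drop notifications created before the last sync began."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._sync_started_at = 0.0

    def begin_sync(self):
        # Call this at the very beginning of sync_state().
        self._sync_started_at = self._clock()

    def should_process(self, notification):
        # 'timestamp' is an assumed field set by the server when it sends.
        return notification["timestamp"] >= self._sync_started_at
```

In the example above, the stale 'network down' notification has a timestamp
earlier than the resync, so the agent drops it instead of applying it.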

Regarding out-of-order notifications, making the agent able to accept them
somehow (by polling, sequence numbers, or otherwise) will open up the
possibility of an active-active Neutron server, which is being discussed
in this thread.
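
For the sequence number variant, a rough sketch is below. It assumes one
monotonically increasing counter per agent, which is itself an open question
with active-active servers. SequenceTracker and its attributes are hypothetical
names.

```python
class SequenceTracker:
    """Hypothetical tracker: detect lost or stale notifications by sequence number."""

    def __init__(self):
        self.last_seen = None
        self.needs_resync = False

    def observe(self, seq):
        """Return True if the notification with this number should be applied."""
        if self.last_seen is None:
            # First notification seen; nothing to compare against yet.
            self.last_seen = seq
            return True
        if seq <= self.last_seen:
            # Duplicate or out-of-order stale notification: ignore it.
            return False
        if seq != self.last_seen + 1:
            # Gap: at least one notification was lost, so schedule a resync.
            self.needs_resync = True
        self.last_seen = seq
        return True
```

While needs_resync stays False, the periodic resync could be postponed, which
is the optimization mentioned earlier.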

> > - periodic resync spawns threads, but doesn't wait for their completion.
> >  So if a resync takes a long time, the next resync can start while
> >  the previous one is still going on.
> sync_state() now waits for the completion of threads thanks to the following patch:
> https://review.openstack.org/#/c/59863/

Wow, great.
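
The general pattern of waiting for all workers before a sync returns can be
sketched with the standard library as below. This is only an illustration and
not the code from the patch; sync_all, configure and max_workers are
hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def sync_all(network_ids, configure, max_workers=4):
    """Configure every network in parallel; return only when all are done."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(configure, net_id) for net_id in network_ids]
        # Block until every worker has finished, so syncs cannot overlap.
        done, _ = wait(futures)
    # Collect failures instead of letting one bad network abort the sync.
    return [f.exception() for f in done if f.exception() is not None]
```
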
Isaku Yamahata <isaku.yamahata at gmail.com>
