[Openstack] HA router fail-over time

Erdősi Péter fazy at niif.hu
Thu Mar 30 02:25:01 UTC 2017


Hi,

On 2017-03-30 2:10, Sterdnot Shaken wrote:
> b) why are they different times?
Just an idea, but this used to happen to us because of CAM table 
expiration on the underlying switches (the timeout can be tuned on 
industrial devices; on our Brocade it is 300 seconds by default).

To solve this problem, the VRRP routers have to broadcast gratuitous 
ARPs with _only_ the virtual MAC and the IP (which more or less forces 
the switch to learn the new state).
Check out this: https://tools.ietf.org/html/rfc3768#section-8.2
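
If you want to check that the GARPs actually make it onto the wire 
during a fail-over, a capture like this on the new master should show 
them (eth1 is just a placeholder for the interface facing your external 
network):

  # watch for gratuitous ARPs for the router's external/gateway IP
  tcpdump -e -n -i eth1 arp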

On the Neutron side, there is a config option in the neutron.conf 
[DEFAULT] section called "send_arp_for_ha". It is an integer: the agent 
sends as many gratuitous ARPs as you have configured.
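
For example (3 happens to be the upstream default, but check it for 
your release):

  [DEFAULT]
  # number of gratuitous ARPs sent when a router becomes master
  send_arp_for_ha = 3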

BTW, we have this corrected, but sometimes it still takes a while for 
Neutron to come up and make things work (we use Mitaka), because, if I 
understand it right, the L3 agent there handles the state changes on a 
single thread. A ha_keepalived_state_change_server_threads option was 
introduced in Newton:

"A new option ha_keepalived_state_change_server_threads has been added 
to configure the number of concurrent threads spawned for keepalived 
server connection requests. Higher values increase the CPU load on the 
agent nodes. The default value is half of the number of CPUs present on 
the node. This allows operators to tune the number of threads to suit 
their environment. With more threads, simultaneous requests for multiple 
HA routers state change can be handled faster."

Source: 
https://docs.openstack.org/releasenotes/neutron/newton.html#upgrade-notes
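
So on Newton or later you could tune it like this in the agent's 
[DEFAULT] section (4 is only an example value):

  [DEFAULT]
  # threads handling keepalived state-change notifications;
  # default is half of the CPUs on the node
  ha_keepalived_state_change_server_threads = 4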

My other (and last) guess is the reboot/boot process of your network 
node... how did you do it? (I mean, graceful, or a hard reset?) It may 
add a few seconds too, since a graceful shutdown stops services in 
dependency order, which can leave a few seconds where some of the 
required services are not running... (just an example from our boot 
process, as I saw after a bit of debugging):
  - Start OVS (without any dynamic data)
  - Start the L3 agent (and it starts keepalived)
  - The new keepalived instance connects to the OVS HA network (and the 
L3 agent starts to push the dynamic config to OVS)
  - Until the HA network is up and running on both network nodes (so 
the keepalived daemons can talk to each other), the two routers are in 
a master-master state on the same subnet
  - When they can talk, one of them goes to backup (and the gratuitous 
ARP comes into play again, but it may just add to the confusion, since 
the few seconds of master-master state already mix things up in the CAM 
table of your switch(es); see the commands after this list)
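
If you want to catch that master-master window, you can watch which 
agent claims the router (the router ID is a placeholder):

  # shows the ha_state (active/standby) per hosting L3 agent
  neutron l3-agent-list-hosting-router <router-id>

  # or check on each network node which one actually holds the VIP
  ip netns exec qrouter-<router-id> ip addr show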

Hope that gives you some clue where to start debugging :)

Regards:
  Peter