[Openstack] HA router fail-over time
Erdősi Péter
fazy at niif.hu
Thu Mar 30 02:25:01 UTC 2017
Hi,
On 2017-03-30 at 2:10, Sterdnot Shaken wrote:
> b) why are they different times?
Just an idea, but this usually happens because of CAM table expiration on
the underlying switches (the timeout can be set on industrial devices; on
our Brocade it is 300 sec by default).
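To make the CAM table effect concrete, here is a rough toy model of a switch MAC table with aging. It is only a sketch: the class name, the MAC address and the timestamps are made up, and only the 300 sec timeout comes from our Brocade default.

```python
class CamTable:
    """Toy model of a switch MAC (CAM) table with entry aging."""

    def __init__(self, aging_seconds=300):
        self.aging_seconds = aging_seconds
        self._entries = {}  # mac -> (port, time the entry was learned)

    def learn(self, mac, port, now):
        # A switch learns from the source MAC of every frame it receives.
        self._entries[mac] = (port, now)

    def lookup(self, mac, now):
        entry = self._entries.get(mac)
        if entry is None:
            return None
        port, learned_at = entry
        if now - learned_at >= self.aging_seconds:
            # Entry aged out: the MAC is unknown and the switch would flood.
            del self._entries[mac]
            return None
        return port


# Hypothetical fail-over: the router MAC moves from switch port 1 to port 2.
ROUTER_MAC = "fa:16:3e:00:00:01"  # made-up address
table = CamTable(aging_seconds=300)
table.learn(ROUTER_MAC, port=1, now=0)
port_before_expiry = table.lookup(ROUTER_MAC, now=120)  # still the old port
port_after_expiry = table.lookup(ROUTER_MAC, now=300)   # entry aged out
table.learn(ROUTER_MAC, port=2, now=301)                # e.g. a gratuitous ARP
port_after_garp = table.lookup(ROUTER_MAC, now=302)
```

Until the entry expires or a frame with that source MAC arrives on the new port, traffic keeps going to the old port, so the fail-over time you measure depends on where in the aging window the fail-over happens.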
To solve this problem, the VRRP routers have to broadcast gratuitous ARPs
with _only_ the virtual MAC and the IP (this forces the switch to
learn the new state).
Check out this: https://tools.ietf.org/html/rfc3768#section-8.2
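As an illustration of what such a gratuitous ARP looks like on the wire, here is a minimal sketch that builds the frame by hand. The function names, the VRID and the 192.0.2.1 address are made up for the example; the 00-00-5E-00-01-{VRID} virtual MAC format is the one from RFC 3768. The target hardware address is zeroed here, which is one common form (some implementations use the broadcast address instead).

```python
import socket
import struct


def vrrp_virtual_mac(vrid):
    """Return the IPv4 VRRP virtual MAC 00-00-5E-00-01-{VRID} as bytes."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be between 1 and 255")
    return bytes([0x00, 0x00, 0x5E, 0x00, 0x01, vrid])


def build_gratuitous_arp(vmac, virtual_ip):
    """Build an Ethernet frame carrying a gratuitous ARP request."""
    ip = socket.inet_aton(virtual_ip)
    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    # The source is the virtual MAC, so the switch relearns its port.
    eth_header = broadcast + vmac + struct.pack("!H", 0x0806)
    # ARP header: Ethernet, IPv4, hlen 6, plen 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += vmac + ip          # sender hardware and protocol address
    arp += b"\x00" * 6 + ip   # target: zero MAC, same IP as the sender
    return eth_header + arp


# Hypothetical values, for illustration only.
frame = build_gratuitous_arp(vrrp_virtual_mac(1), "192.0.2.1")
```

Actually sending the frame needs a raw socket and root privileges, which is left out here.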
On the neutron side, there is a config option in the [DEFAULT] section of
neutron.conf called "send_arp_for_ha". It is an integer that sets how many
gratuitous ARPs are sent.
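For reference, a small sketch of how that fragment could look and how to read it back to check what is configured. The value 3 is only illustrative, and the config text is embedded as a string here instead of reading the real /etc/neutron/neutron.conf.

```python
import configparser

# Illustrative neutron.conf fragment; the value is an example, not advice.
NEUTRON_CONF = """
[DEFAULT]
# Number of gratuitous ARPs to send for HA (example value)
send_arp_for_ha = 3
"""

parser = configparser.ConfigParser()
parser.read_string(NEUTRON_CONF)
garp_count = parser.getint("DEFAULT", "send_arp_for_ha")
print(f"send_arp_for_ha = {garp_count}")
```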
BTW, we have this corrected, but sometimes it still takes a while for
neutron to come up and make things work (we use Mitaka). If I understand
it right, that is because the l3 agent handles the state changes on a
single thread for now. They introduced the
ha_keepalived_state_change_server_threads option in Newton, which is
described as:
"A new option ha_keepalived_state_change_server_threads has been added
to configure the number of concurrent threads spawned for keepalived
server connection requests. Higher values increase the CPU load on the
agent nodes. The default value is half of the number of CPUs present on
the node. This allows operators to tune the number of threads to suit
their environment. With more threads, simultaneous requests for multiple
HA routers state change can be handled faster."
Source:
https://docs.openstack.org/releasenotes/neutron/newton.html#upgrade-notes
My other (and last) guess is the reboot/boot process of your network
node... how did you do it? (I mean gracefully, or a hard reset?) It may
add a few seconds too, since during a graceful shutdown the service stop
dependencies can create a few seconds where some of the required services
are not running... (just an example from our boot process, as I saw after a
bit of debugging:
- Start OVS (without any dynamic data)
- Start L3 agent (and it starts keepalived)
- New keepalived instance connects to the OVS HA network (and the L3 agent
starts to push the dynamic config to OVS)
- Until the HA network is up and running on both network nodes
(so the keepalived daemons can talk to each other), the 2 neutron nodes are
in master-master state on the same subnet
- When they can talk, one of them goes to backup (and the gratuitous
ARP comes in again, but it may just confuse things further, since the few
secs of master-master state already mixed things up in the CAM table of
your switch(es)))
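The boot sequence above can be sketched as a tiny simulation. This is only a toy model under my own assumptions: the node names are made up, node-a is assumed to win the election, and the switch simply remembers the last node it saw announcing the virtual MAC.

```python
def simulate_boot(ha_link_up_at, ticks):
    """Return (tick, masters, cam_owner) tuples for two HA nodes.

    cam_owner is the node the switch last saw announcing the virtual MAC.
    """
    cam_owner = None
    history = []
    for tick in range(ticks):
        if tick < ha_link_up_at:
            # The keepalived daemons cannot hear each other yet,
            # so both nodes claim master.
            masters = ["node-a", "node-b"]
        else:
            # Assumption: node-a wins the election, node-b goes to backup.
            masters = ["node-a"]
        for node in masters:
            # Each master sends gratuitous ARPs; the last one seen wins.
            cam_owner = node
        history.append((tick, tuple(masters), cam_owner))
    return history


history = simulate_boot(ha_link_up_at=3, ticks=5)
dual_master_ticks = sum(1 for _, masters, _ in history if len(masters) == 2)
for tick, masters, cam_owner in history:
    print(tick, masters, cam_owner)
```

During the master-master ticks the CAM entry can point at the node that ends up as backup, and it only settles once the winner announces again.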
Hope that gives you some clue about where to start debugging :)
Regards:
Peter