Hello everyone!

The L3 HA feature based on Keepalived (VRRP) was implemented in Neutron in Juno. Over the following cycles it was improved, and we performed scale testing [1] to find weak spots and tried to fix them. The only alternative to L3 HA with VRRP is router rescheduling performed by the Neutron server, but it is significantly slower and depends on the control plane.

What issues have we experienced with L3 HA VRRP?

1. Bugs in Keepalived (bad versions) [2]
2. Split brain [3]
3. Complex structure (HA networks, HA interfaces), which actually causes races that we were fixing during Liberty, Mitaka and Newton.

None of this is critical, but it makes for a bad experience, and not everyone is ready (or wants) to use the Keepalived approach.

I think we can make things more flexible. For example, we can allow users to use external services like etcd instead of Keepalived to synchronize the current HA state across agents. I've done several experiments and got failover times comparable to L3 HA with VRRP.

Tooz [4] can be used to abstract away from a concrete backend. For example, it would allow us to use ZooKeeper, Redis and other backends to store HA state.

What do I want to propose?

I want to bring up the idea that Neutron should have some general classes for L3 HA that allow using not only Keepalived but also other backends for HA state. At the very least, this will make it easier to try other approaches and compare them with the existing ones.

Does this sound reasonable?

[1] - http://docs.openstack.org/developer/performance-docs/test_results/neutron_features/index.html
[2] - https://bugs.launchpad.net/neutron/+bug/1497272
      https://bugs.launchpad.net/neutron/+bug/1433172
[3] - https://bugs.launchpad.net/neutron/+bug/1375625
[4] - http://docs.openstack.org/developer/tooz/

--
Regards,
Ann Taraday
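
To make the "general classes for L3 HA" idea a bit more concrete, here is a minimal sketch of what a pluggable HA-state backend could look like. Everything in it is an assumption for illustration only: the names HAStateBackend, InMemoryLeaseBackend, try_become_master and resign are hypothetical and are not existing Neutron or Tooz APIs. The in-memory backend is only a single-process stand-in for an external store such as etcd; a real backend would presumably keep the lease in that store (for example through a Tooz driver).

```python
import abc
import time


class HAStateBackend(abc.ABC):
    """Hypothetical interface: decides which agent hosts the active router."""

    @abc.abstractmethod
    def try_become_master(self, router_id, agent_id):
        """Return True if agent_id holds (or just obtained) the master lease."""

    @abc.abstractmethod
    def resign(self, router_id, agent_id):
        """Give up the master lease if agent_id currently holds it."""


class InMemoryLeaseBackend(HAStateBackend):
    """Toy stand-in for an external store; works in a single process only."""

    def __init__(self, ttl=3.0, clock=time.monotonic):
        self._ttl = ttl          # lease lifetime in seconds
        self._clock = clock      # injectable clock, handy for testing
        self._leases = {}        # router_id -> (agent_id, expires_at)

    def try_become_master(self, router_id, agent_id):
        now = self._clock()
        holder = self._leases.get(router_id)
        # Take the lease if it is free or expired; renew it if we own it.
        if holder is None or holder[1] <= now or holder[0] == agent_id:
            self._leases[router_id] = (agent_id, now + self._ttl)
            return True
        return False

    def resign(self, router_id, agent_id):
        holder = self._leases.get(router_id)
        if holder is not None and holder[0] == agent_id:
            del self._leases[router_id]


class FakeClock:
    """Manually advanced clock so lease expiry can be simulated."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds
```

In this sketch each agent would call try_become_master() periodically for every HA router it hosts. The current master renews its lease that way, and a standby takes over once the lease has expired, so the failover time is bounded by the TTL plus the polling interval. A Keepalived-based implementation could sit behind the same interface, which is what would make the approaches comparable.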