[openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

Peter Feiner peter at gridcentric.ca
Tue Dec 10 17:01:34 UTC 2013


On Tue, Dec 10, 2013 at 7:48 AM, Nathani, Sreedhar (APS)
<sreedhar.nathani at hp.com> wrote:
> My setup has 17 L2 agents (16 compute nodes, one Network node). Setting minimize_polling helped reduce the CPU
> utilization of the L2 agents, but it did not help instances get their IP during first boot.
>
> With minimize_polling enabled, fewer instances could get an IP than without the minimize_polling fix.
>
> Once we reach a certain number of ports (in my case 120 ports), during subsequent concurrent instance deployment (30 instances),
> updating the port details in the dnsmasq host is taking long time, which is causing the delay for instances getting an IP address.

To figure out what the next problem is, I recommend that you determine
precisely what "port details in the dnsmasq host [are] taking [a] long
time" to update. Is the DHCPDISCOVER packet from the VM arriving
before the dnsmasq process's hostsfile is updated and dnsmasq is
SIGHUP'd? Is the VM sending the DHCPDISCOVER request before its tap
device is wired to the dnsmasq process (i.e., determine the status of
the chain of bridges at the time the guest sends the DHCPDISCOVER
packet)? Perhaps the DHCPDISCOVER packet is being dropped because the
iptables rules for the VM's port haven't been instantiated when the
DHCPDISCOVER packet is sent. Or perhaps something else, such as the
replies being dropped. These are my only theories at the moment.

Anyhow, once you determine where the DHCP packets are being lost,
you'll have a much better idea of what needs to be fixed.
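For example, to put a number on the hostsfile side of things, you could
run something like the rough sketch below on the network node while the
instances boot. It assumes the DHCP agent uses the default state path,
so the hosts file for a network lives at
/var/lib/neutron/dhcp/<network-id>/host; adjust the path if your
deployment differs. Comparing the time the port's MAC shows up there
against the time the guest sends its first DHCPDISCOVER (visible with
tcpdump on the tap device) tells you which side is lagging.

#!/usr/bin/env python
# Sketch: report how long a port's MAC takes to appear in the dnsmasq
# hosts file maintained by the DHCP agent. The path below assumes the
# default state_path layout; adjust for your deployment.
import sys
import time

def wait_for_mac(hosts_file, mac, timeout=300, interval=0.5):
    # Poll the hosts file until the MAC shows up; return elapsed seconds.
    start = time.time()
    while time.time() - start < timeout:
        try:
            with open(hosts_file) as f:
                if mac.lower() in f.read().lower():
                    return time.time() - start
        except IOError:
            pass  # file may be rewritten while we read; just retry
        time.sleep(interval)
    return None

if __name__ == '__main__':
    if len(sys.argv) != 3:
        sys.exit('usage: %s <network-id> <port-mac>' % sys.argv[0])
    network_id, mac = sys.argv[1], sys.argv[2]
    hosts_file = '/var/lib/neutron/dhcp/%s/host' % network_id
    elapsed = wait_for_mac(hosts_file, mac)
    if elapsed is None:
        print('%s never appeared in %s' % (mac, hosts_file))
    else:
        print('%s appeared in %s after %.1f seconds' % (mac, hosts_file, elapsed))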

One suggestion I have to make your debugging less onerous is to
reconfigure your guest image's networking init script to retry DHCP
requests indefinitely. That way, you'll see the guests' DHCP traffic
when neutron eventually gets everything in order. On CirrOS, add the
following line to the eth0 stanza in /etc/network/interfaces to retry
DHCP requests 100 times every 3 seconds:

udhcpc_opts -t 100 -T 3

> When I deployed only 5 instances concurrently (already had 211 instances active) instead of 30, all the instances were able to get an IP.
> But when I deployed 10 instances concurrently (already had 216 instances active) instead of 30, none of the instances could get an IP.

This is reminiscent of yet another problem I saw at scale. If you're
using the security group rule "VMs in this group can talk to everybody
else in this group", which is one of the defaults in devstack, you get
O(N^2) iptables rules for N VMs running on a particular host. When you
have more VMs running, the openvswitch agent, which is responsible for
instantiating the iptables rules and does so somewhat laboriously with
respect to the number of rules, can take too long to configure ports
before the VMs' DHCP clients time out.
However, considering that you're seeing low CPU utilization by the
openvswitch agent, I don't think you're having this problem; since
you're distributing your VMs across numerous compute hosts, N is quite
small in your case. I only saw problems when N was > 100.
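To put rough numbers on that, assuming the intra-group rule expands to
roughly one iptables rule per (local port, group member) pair, the rule
count grows quadratically:

# Back-of-the-envelope numbers for the O(N^2) growth described above,
# assuming the intra-group rule expands to roughly one iptables rule per
# (local port, group member) pair.
def intra_group_rules(vms_on_host):
    return vms_on_host * vms_on_host

for n in (10, 50, 100, 200):
    print('%4d VMs on one host -> ~%d iptables rules' % (n, intra_group_rules(n)))

So around N = 100 the agent is already managing on the order of ten
thousand rules on a single host.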


