[openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

Nathani, Sreedhar (APS) sreedhar.nathani at hp.com
Tue Dec 10 12:48:13 UTC 2013


Hello Peter,

I have merged the code from the following patches for the minimize_polling setting and enabled minimize_polling in all the L2 agents:
	https://review.openstack.org/45676
	https://review.openstack.org/45677
	https://review.openstack.org/45678
	https://review.openstack.org/57475/

My setup has 17 L2 agents (16 compute nodes and one network node). Setting minimize_polling reduced the CPU
utilization of the L2 agents, but it did not help instances get an IP during first boot.

With minimize_polling enabled, fewer instances got an IP than without the minimize_polling fix.

Once we reach a certain number of ports (120 in my case), during subsequent concurrent instance deployments (30 instances),
updating the port details in the dnsmasq host file takes a long time, which delays the instances getting an IP address.
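
One way to measure that delay is to check which port MACs are still missing from the dnsmasq hosts file. The following is only a minimal sketch: it assumes the comma-separated "mac,hostname,ip" line layout, and the helper names and sample addresses are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def parse_dnsmasq_hosts(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map lower-cased MAC -> (hostname, ip).

    Assumption: each entry looks like "mac,hostname,ip".
    Blank lines, comments and malformed lines are skipped.
    """
    entries: Dict[str, Tuple[str, str]] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(",")
        if len(fields) < 3:
            continue
        mac, hostname, ip = fields[0], fields[1], fields[-1]
        entries[mac.lower()] = (hostname, ip)
    return entries


def missing_ports(expected_macs: Iterable[str], lines: Iterable[str]) -> List[str]:
    """Return the expected MACs that have no entry in the hosts file yet."""
    present = parse_dnsmasq_hosts(lines)
    return sorted(m.lower() for m in expected_macs if m.lower() not in present)


# Hypothetical sample content, not taken from a real deployment.
SAMPLE = [
    "fa:16:3e:00:00:01,host-10-0-0-3.openstacklocal,10.0.0.3",
    "fa:16:3e:00:00:02,host-10-0-0-4.openstacklocal,10.0.0.4",
]

if __name__ == "__main__":
    print(missing_ports(["FA:16:3E:00:00:01", "fa:16:3e:00:00:09"], SAMPLE))
```

In practice the lines would be read from the hosts file used by the network's dnsmasq process; its path depends on the deployment. Polling this check after a port is created gives a rough figure for how long the update took.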

When I deployed only 5 instances concurrently (with 211 instances already active) instead of 30, all the instances were able to get an IP.
But when I deployed 10 instances concurrently (with 216 instances already active) instead of 30, none of the instances were able to get an IP.
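
For anyone reproducing these runs, the concurrency level is simply the instance count requested in a single boot call. Below is a rough sketch of such a request body, assuming the Nova v2 multiple-create fields min_count and max_count. The helper name and the image and flavor values are placeholders, not values from this test setup.

```python
import json


def build_batch_boot_request(name_prefix, image_ref, flavor_ref, count):
    """Build the JSON body for one server-create call that boots `count` instances.

    Assumption: the API accepts min_count/max_count in the "server" object.
    Setting both to the same value asks for exactly `count` instances.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return {
        "server": {
            "name": name_prefix,
            "imageRef": image_ref,
            "flavorRef": flavor_ref,
            "min_count": count,
            "max_count": count,
        }
    }


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    body = build_batch_boot_request("perf-test", "IMAGE_ID", "FLAVOR_ID", 5)
    print(json.dumps(body, indent=2))
```

The body would be sent as a POST to the compute API's servers endpoint with a valid token; that part is left out because it depends on the environment.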

Thanks & Regards,
Sreedhar Nathani

-----Original Message-----
From: Nathani, Sreedhar (APS) 
Sent: Friday, December 06, 2013 12:21 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: RE: [openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

Hello Peter,

Thanks for the info. I will do the tests with your code changes.

What surprises me is that when I ran the tests on Grizzly, up to 210 instances could get an IP during the first boot.
Once we crossed 210 active instances, some instances in the next batch could not get an IP. As the number of active instances grew, more instances could not get an IP.
But once I restarted those instances, they could get an IP address. I ran the tests close to 10 times, and this behavior was consistent every time.

But in Havana, instances are not able to get an IP once we cross 80 instances. Moreover, we need to restart the dnsmasq process for instances to get an IP on the next reboot.

Thanks & Regards,
Sreedhar Nathani


-----Original Message-----
From: Peter Feiner [mailto:peter at gridcentric.ca]
Sent: Thursday, December 05, 2013 10:57 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

On Thu, Dec 5, 2013 at 8:23 AM, Nathani, Sreedhar (APS) <sreedhar.nathani at hp.com> wrote:
> Hello Marun,
>
>
>
> Please find below the details of my setup and the tests I have done so
> far.
>
>
>
> Setup
>
>   - One Physical Box with 16c, 256G memory. 2 VMs created on this Box
> - One for Controller and One for Network Node
>
>   - 16x compute nodes (each has 16c, 256G memory)
>
>   - All the systems are installed with Ubuntu Precise + Havana Bits 
> from Ubuntu Cloud Archive
>
>
>
> Steps to simulate the issue
>
>   1) Concurrently create 30 Instances (m1.small) using REST API with
> mincount=30
>
>   2) sleep for 20min and repeat the step (1)
>
>
>
>
>
> Issue 1
>
> In Havana, once we cross 150 instances (5 batches x 30), during the 6th
> batch some instances go into ERROR state because the network port cannot
> be created, and some instances get duplicate IP addresses.
>
>
>
> Per Maru Newby, this issue might be related to this bug:
>
> https://bugs.launchpad.net/bugs/1192381
>
>
>
> I ran a similar test with Grizzly on the same environment 2 months
> back, where I was able to deploy close to 240 instances without any
> errors.
>
> Initially I saw the same behavior on Grizzly as well, but with these
> tunings, based on this bug
>
> https://bugs.launchpad.net/neutron/+bug/1160442, I never had issues
> (tested more than 10 times):
>
>        sqlalchemy_pool_size = 60
>
>        sqlalchemy_max_overflow = 120
>
>        sqlalchemy_pool_timeout = 2
>
>        agent_down_time = 60
>
>        report_interval = 20
>
>
>
> In Havana, I have set the same tunables but I could never get past
> 150+ instances. Without the tunables I was not able to get past
> 100 instances. We are getting many timeout errors from the DHCP agent
> and neutron clients.
>
>
>
> NOTE: After tuning agent_down_time to 60 and report_interval to
> 20, we no longer get these error messages:
>
>    2013-12-02 11:44:43.421 28201 WARNING 
> neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
>
>    2013-12-02 11:44:43.439 28201 WARNING 
> neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
>
>    2013-12-02 11:44:43.452 28201 WARNING 
> neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
>
>
>
>
>
> In the compute node openvswitch agent logs, we see these errors 
> repeating continuously
>
>
>
> 2013-12-04 06:46:02.081 3546 TRACE
> neutron.plugins.openvswitch.agent.ovs_neutron_agent Timeout: Timeout 
> while waiting on RPC response - topic: "q-plugin", RPC method:
> "security_group_rules_for_devices" info: "<unknown>"
>
> and WARNING neutron.openstack.common.rpc.amqp [-] No calling threads 
> waiting for msg_id
>
>
>
> The DHCP agent logs the errors below:
>
>
>
> 2013-12-02 15:35:19.557 22125 ERROR neutron.agent.dhcp_agent [-] 
> Unable to reload_allocations dhcp.
>
> 2013-12-02 15:35:19.557 22125 TRACE neutron.agent.dhcp_agent Timeout:
> Timeout while waiting on RPC response - topic: "q-plugin", RPC method:
> "get_dhcp_port" info: "<unknown>"
>
>
>
> 2013-12-02 15:35:34.266 22125 ERROR neutron.agent.dhcp_agent [-] 
> Unable to sync network state.
>
> 2013-12-02 15:35:34.266 22125 TRACE neutron.agent.dhcp_agent Timeout:
> Timeout while waiting on RPC response - topic: "q-plugin", RPC method:
> "get_active_networks_info" info: "<unknown>"
>
>
>
>
>
> In Havana, I have merged the code from this patch and set api_workers 
> to 8 (My Controller VM has 8cores/16Hyperthreads)
>
> https://review.openstack.org/#/c/37131/
>
>
>
> After applying this patch and starting 8 neutron-server workers, during
> the batch creation of 240 instances with 30 concurrent requests in
> each batch,
> 238 instances became active and 2 instances went into error.
> Interestingly, the 2 instances that went into error state are from the
> same compute node.
>
>
>
> Unlike earlier, this time the errors are due to 'Too Many Connections'
> to the MySQL database.
>
> 2013-12-04 17:07:59.877 21286 AUDIT nova.compute.manager
> [req-26d64693-d1ef-40f3-8350-659e34d5b1d7
> c4d609870d4447c684858216da2f8041 9b073211dd5c4988993341cc955e200b] [instance:
> c14596fd-13d5-482b-85af-e87077d4ed9b] Terminating instance
>
> 2013-12-04 17:08:00.578 21286 ERROR nova.compute.manager
> [req-26d64693-d1ef-40f3-8350-659e34d5b1d7
> c4d609870d4447c684858216da2f8041 9b073211dd5c4988993341cc955e200b] [instance:
> c14596fd-13d5-482b-85af-e87077d4ed9b] Error: Remote error: 
> OperationalError
> (OperationalError) (1040, 'Too many connections') None None
>
>
>
> We need to backport the patch 'https://review.openstack.org/#/c/37131/'
> to address the Neutron scaling issues in Havana.
>
> Carl is already backporting this patch to Havana in
> https://review.openstack.org/#/c/60082/, which is good.
>
>
>
> Issue 2
>
> Grizzly :
>
> During concurrent instance creation in Grizzly, once we cross 210
> instances, during the subsequent creation of 30 instances some of
> the instances could not get their IP address within the first few
> minutes of the first boot. Instance MAC and IP address details
> were updated in the dnsmasq host file, but with a delay. Instances
> were eventually able to get their IP address.
>
>
>
> If we rebooted the instance using 'nova reboot', the instance would get
> an IP address.
>
> * The amount of delay depends on the number of network ports and is
> in the range of 8 seconds to 2 minutes.
>
>
>
>
>
> Havana :
>
> But in Havana only 81 instances could get an IP during the first
> boot. Ports are created and IP addresses are allocated
> very fast, but it takes quite a lot of time for the port to come UP.
> Once the port is UP, instances are able to send the DHCP request
> and get an IP address.
>
>
>
> During network port create and network port update, there are a lot
> of 'security_group_rules_for_devices' messages. OVS agents on the
> compute nodes are getting timeouts during "security_group_rules_for_devices".
>
>
>
> This issue also exists in Grizzly, but there we observed it
> only after 200+ active instances (200 network ports), whereas in Havana
> we are having these issues with fewer than 100 active ports.
>
>
>
> In Havana, if we reboot the instance it is not able to get an IP
> address even though its network port entry already exists in the
> dnsmasq hosts file. We can't even ping the IP address that we
> were able to ping before the instance reboot. Only after restarting the
> 'neutron-dhcp-agent' service, which restarts 'dnsmasq', and
> rebooting the instance could it get an IP.
>
>
>
> This clearly shows we have a performance regression in neutron/havana
> compared to quantum/grizzly.
>
>
>
> FYI, I attached my notes for one of the instances which could not
> get an IP during first boot to this email.
>
>
>
> I am happy to share the results of my grizzly tests and logs during 
> recent havana tests
>
>
>
> Thanks & Regards,
>
> Sreedhar Nathani
>
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>

Hi Sreedhar,

Many of these problems sound similar to issues I was having when doing performance testing on Havana. In my case, it turned out that neutron's openvswitch agent was thrashing due to its periodic polling of the openvswitch database. As the number of instances on a host increased, the polling took longer; in particular, what seemed to take a long time was just parsing the output of the ovs-vsctl queries. When there were enough instances, the polling took so long that it exceeded the duration of its periodic interval, so the agent's other duties were getting pushed back indefinitely. Since the agent is ultimately responsible for connecting the VM's tap device to the bridge that the DHCP server's tap device is on, a backed-up agent means VMs don't get DHCP leases. Worse, guests' network init scripts are typically configured to give up on DHCP after a couple of minutes. So, if a guest's DHCP request(s) don't make it through in a couple of minutes, the guest is hosed.
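
To make the arithmetic concrete, here is a toy model of that effect. It is only a sketch: the 2000 ms interval, the 200 ms fixed cost and the 50 ms per-port cost are made-up numbers for illustration, not measurements from the agent.

```python
def poll_duration_ms(num_ports, base_ms=200, per_port_ms=50):
    """Hypothetical cost of one polling pass: fixed overhead plus per-port parsing."""
    return base_ms + per_port_ms * num_ports


def spare_time_ms(num_ports, interval_ms=2000, base_ms=200, per_port_ms=50):
    """Time left in one interval for the agent's other duties (never negative)."""
    return max(0, interval_ms - poll_duration_ms(num_ports, base_ms, per_port_ms))


def saturation_ports(interval_ms=2000, base_ms=200, per_port_ms=50):
    """Smallest port count at which a single poll fills the whole interval."""
    if per_port_ms <= 0:
        raise ValueError("per_port_ms must be positive")
    ports = 0
    while poll_duration_ms(ports, base_ms, per_port_ms) < interval_ms:
        ports += 1
    return ports


if __name__ == "__main__":
    for ports in (10, 20, 30, 40):
        print(ports, poll_duration_ms(ports), spare_time_ms(ports))
    print("saturates at", saturation_ports(), "ports")
```

With these assumed costs the spare time per interval shrinks linearly and reaches zero at 36 ports. Past that point every cycle is spent polling, which is the backlog described above.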

The fix was implemented in
https://github.com/openstack/neutron/commit/cb0df591a9508e863ad5d5d71190eca349dc551f.
Now, neutron has a minimize_polling setting that trades ovs-vsctl polling for an event-based approach. Setting minimize_polling=True makes the openvswitch agent much more efficient. In my case, the typical CPU utilization of the openvswitch agent went down from 100% to something more like 20% when I booted 40 instances concurrently.

Unfortunately, https://github.com/openstack/neutron/commit/cb0df591a9508e863ad5d5d71190eca349dc551f
hasn't been backported to stable/havana, so minimize_polling isn't in the current Havana release and won't be in the next Havana release unless somebody does a backport.

Peter
