[Openstack-operators] Neutron crashed hard
jaypipes at gmail.com
Thu Dec 19 05:05:00 UTC 2013
On 12/18/2013 09:33 PM, Joe Topjian wrote:
> I set up an internal OpenStack cloud to give a workshop for around 15
> people. I decided to use Neutron as I'm trying to get more experience
> with it. The cloud consisted of a cloud controller and four compute
> nodes. Very decent Dell hardware, Ubuntu 12.04, Havana 2013.2.0.
> Neutron was configured with the OVS plugin, non-overlapping IPs, and a
> single shared subnet. GRE tunnelling was used between compute nodes.
What version of OVS did you deploy? There's a bad bug in OVS 1.4 that
can result in circular routes in the GRE mesh. We saw it take down an
entire deployment zone, with tenant traffic swamping the bonded NIC
that carried the GRE overlay network.
Upgrading to OVS 1.10 and then 1.11 solved that issue along with some
> Everything was working fine until the 15 people tried launching a CirrOS
> instance at approximately the same time.
> Then Neutron crashed.
> The compute nodes had this in their logs:
> 2013-12-18 09:52:57.707 28514 TRACE nova.compute.manager
> ConnectionFailed: Connection to neutron failed: timed out
> All instances went into an Error state.
> Restarting the Neutron services did no good. Terminating the Error'd
> instances seemed to make the problem worse -- the entire cloud became
> unavailable (meaning, both Horizon and Nova were unusable as they would
> time out waiting for Neutron).
> We moved on to a different cloud to continue on with the workshop. I
> would occasionally issue "neutron net-list" in the original cloud to see
> if I would get a result. It took about an hour.
> What happened?
> I've read about Neutron performance issues -- would this be something
> along those lines?
Tough to tell. It very well could be, or it could be OVS itself.
Check the Neutron L3 agent log, the neutron-plugin-openvswitch-agent
log (on both the L3 router node and the compute workers), and the
neutron-server log for errors. There may be contention on the database
or MQ end.
Are you using a multiplexed neutron server (workers config option > 1)?
> What's the best way to quickly recover from a situation like this?
There isn't one. Search through all your Neutron and ovs-vswitchd logs
for errors.
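As a rough aid for that log search, here is a minimal Python sketch that counts ERROR, TRACE and CRITICAL lines per logger. It assumes the "date time pid LEVEL logger" layout of the nova trace line quoted above. The function names are made up for illustration. The ovs-vswitchd log uses a different layout and would need its own pattern.

```python
import re
from collections import Counter

# Matches the severity field of OpenStack-style log lines such as
# "2013-12-18 09:52:57.707 28514 TRACE nova.compute.manager ..."
# Assumption: date, time, pid, level, logger, separated by single spaces.
LEVEL_RE = re.compile(r"^\S+ \S+ \d+ (ERROR|TRACE|CRITICAL) (\S+)")


def summarize_errors(lines):
    """Count ERROR/TRACE/CRITICAL lines per (level, logger) pair."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def summarize_file(path):
    """Apply summarize_errors to one log file, tolerating bad bytes."""
    with open(path, errors="replace") as handle:
        return summarize_errors(handle)
```

Calling summarize_file() on each Neutron log (the location depends on your packaging; /var/log/neutron/ is only a guess) and printing counts.most_common(10) should show which component started failing first.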
> Since then, I haven't recreated the database, networks, or anything like
> that. Is there a specific log or database table I can look for to see
> more information on how exactly this situation happened?
You could look at your database slow query log (if using MySQL). I
doubt you'll find anything in there, but you may get lucky.
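If the slow query log is enabled (MySQL's slow_query_log and long_query_time settings), a small script can pull out the worst offenders. This is only a sketch, assuming the standard text format where each entry carries a "# Query_time:" comment line followed by the statement. The function name is hypothetical. Server banner lines written after a mysqld restart are not filtered out.

```python
import re

# Each slow log entry has a comment line like:
# "# Query_time: 2.500000  Lock_time: 0.000100 Rows_sent: 1 ..."
QUERY_TIME_RE = re.compile(r"^# Query_time:\s*([0-9.]+)")


def slowest_queries(lines, limit=5):
    """Return up to `limit` (seconds, statement) pairs, slowest first."""
    entries = []
    current_time = None
    statement = []
    for line in lines:
        line = line.rstrip("\n")
        match = QUERY_TIME_RE.match(line)
        if match:
            # A new entry starts: flush the previous one, if any.
            if current_time is not None:
                entries.append((current_time, " ".join(statement)))
            current_time = float(match.group(1))
            statement = []
        elif current_time is not None and line and not line.startswith("#"):
            # Skip the bookkeeping line MySQL writes before each statement.
            if not line.startswith("SET timestamp="):
                statement.append(line.strip())
    if current_time is not None:
        entries.append((current_time, " ".join(statement)))
    return sorted(entries, key=lambda e: e[0], reverse=True)[:limit]
```

Pass it an open file handle for the slow log. Statements that span several lines are joined with single spaces.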
Let us know what you find.