[Openstack-operators] [Essex] compute node hard reboot, can't create domain.

Samuel Winchenbach swinchen at gmail.com
Mon Jul 8 03:25:20 UTC 2013


Holy smokes that is a pain in the butt!

I got it, but I had to (roughly the commands sketched below):
1) create the bridge for each tenant
2) create the tagged VLAN interface for each tenant
3) add the appropriate VLAN interface to each bridge
4) "guess" at the IP that should be assigned to each bridge and assign it.
5) reboot each VM.
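
For the record, the recovery for one tenant network looked roughly like the
following. This is only a sketch: "14" stands in for each tenant's VLAN tag,
eth0 is the trunk interface on this cluster, and 10.0.14.1/24 is a placeholder
for whatever gateway address nova-network has recorded for that network.

  # recreate the bridge for the tenant's VLAN
  brctl addbr br14
  ip link set br14 up

  # recreate the tagged interface on the trunk and enslave it to the bridge
  ip link add link eth0 name vlan14 type vlan id 14
  ip link set vlan14 up
  brctl addif br14 vlan14

  # reassign the (guessed) gateway address to the bridge
  ip addr add 10.0.14.1/24 dev br14

Repeat per tenant VLAN, then hard-reboot the affected VMs.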

This has to have been a bug in Essex.  I see no reason nova-compute shouldn't
recreate all the networking after a hard reboot.

Thank you both for the help... if it weren't bedtime I would go grab a
beer or a bottle of Wild Turkey :P

- Sam


On Sun, Jul 7, 2013 at 11:12 PM, Narayan Desai <narayan.desai at gmail.com> wrote:

> You'll also need to set up the tagged interface (eth2 at 14) and add it to
> the br14 bridge.
>  -nld
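
(A minimal sketch of what that might look like, assuming eth2 is the VLAN trunk
as suggested above; the interface names are illustrative:)

  # create the tagged interface for VLAN 14 on eth2 and add it to br14
  ip link add link eth2 name vlan14 type vlan id 14
  ip link set vlan14 up
  brctl addif br14 vlan14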
>
>
> On Sun, Jul 7, 2013 at 9:32 PM, Samuel Winchenbach <swinchen at gmail.com> wrote:
>
>> Lorin,
>>
>> I am running in VLAN mode (not multihost mode).  Restarting nova-compute
>> does not seem to create the bridge "br14".  I tried creating it manually with
>> "brctl addbr br14".  Doing this allowed me to start the VM, but I cannot
>> ping or SSH to the VM on either the internal or external network.  I
>> probably need to create the vlan14@eth0 interface as well and add it to
>> the bridge?
>>
>> Why might nova-compute not recreate the bridges and interfaces?  I don't
>> see any warnings or errors in the log files (on either node) when starting
>> nova-compute on the failed compute node.
>>
>> ************ FROM THE CONTROLLER NODE ************
>> root@cloudy:~# nova-manage service list
>> 2013-07-07 22:25:40 DEBUG nova.utils [req-f4a55a39-03d8-4bc7-b5c8-f53b1825f934 None None] backend <module 'nova.db.sqlalchemy.api' from '/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.pyc'> from (pid=29059) __get_backend /usr/lib/python2.7/dist-packages/nova/utils.py:658
>> Binary           Host                                 Zone             Status     State Updated_At
>> nova-scheduler   cloudy                               nova             enabled    :-)   2013-07-08 02:25:40
>> nova-compute     cloudy                               nova             enabled    :-)   2013-07-08 02:25:25
>> nova-network     cloudy                               nova             enabled    :-)   2013-07-08 02:25:40
>> nova-compute     compute-01                           nova             enabled    :-)   2013-07-08 02:25:40
>> nova-compute     compute-02                           nova             enabled    XXX   2013-05-21 17:47:13  <--- this is ok, I have it turned off.
>>
>>
>> ************ FROM THE FAILED NODE ************
>> root@compute-01:~# service nova-compute restart
>> nova-compute stop/waiting
>> nova-compute start/running, process 13057
>> root@compute-01:~# sleep 10
>> root@compute-01:~# ip a
>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
>>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>     inet 127.0.0.1/8 scope host lo
>>     inet6 ::1/128 scope host
>>        valid_lft forever preferred_lft forever
>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
>>     link/ether 00:25:90:56:d9:d2 brd ff:ff:ff:ff:ff:ff
>>     inet 10.54.50.30/16 brd 10.54.255.255 scope global eth0   <-- address obtained via DHCP for the external route
>>     inet 10.20.0.2/16 scope global eth0                       <-- OpenStack management/internal network
>>     inet6 fe80::225:90ff:fe56:d9d2/64 scope link
>>        valid_lft forever preferred_lft forever
>> 3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
>>     link/ether 00:25:90:56:d9:d3 brd ff:ff:ff:ff:ff:ff
>> 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
>>     link/ether 00:02:c9:34:e9:90 brd ff:ff:ff:ff:ff:ff
>>     inet 10.57.60.2/16 brd 10.57.255.255 scope global eth2    <-- NFS network (for live migration)
>>     inet6 fe80::202:c9ff:fe34:e990/64 scope link
>>        valid_lft forever preferred_lft forever
>> 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
>>     link/ether 00:02:c9:34:e9:91 brd ff:ff:ff:ff:ff:ff
>> 6: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
>>     link/ether 56:ed:f9:dd:bc:58 brd ff:ff:ff:ff:ff:ff
>>     inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
>>
>>
>> On Sun, Jul 7, 2013 at 10:03 PM, Lorin Hochstein <lorin at nimbisservices.com> wrote:
>>
>>> Hi Samuel:
>>>
>>> It sounds like your VMs are configured to plug into a Linux bridge that
>>> doesn't exist on compute-01 anymore. You could create it manually, although
>>> I would expect it to have been recreated automatically by the relevant nova
>>> service when it came back up.
>>>
>>> You can check if the bridge is there by doing "ip a" and looking for the
>>> "br14" network device.
>>>
>>> Are you running networking in multihost mode? If so, I think restarting
>>> the nova-network service on compute-01 should do it. If you aren't running
>>> in multihost mode, then it should come back by restarting the nova-compute
>>> service on compute-01.
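
(In command form, assuming the Ubuntu/upstart packaging used elsewhere in this
thread; run on compute-01:)

  # multihost: nova-network on the compute node owns the bridges
  service nova-network restart
  # non-multihost: nova-compute should recreate the bridge when it plugs the VMs back in
  service nova-compute restart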
>>>
>>> Otherwise, you'll need to create the bridge manually, and how you do
>>> that will depend on whether you're running flat or VLAN networking. Since it
>>> was called br14, I'm assuming you're running in VLAN mode with VLAN tag 14
>>> associated with this project?
>>>
>>> Lorin
>>>
>>>
>>> On Sun, Jul 7, 2013 at 9:21 PM, Samuel Winchenbach <swinchen at gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have an old Essex cluster that we are getting ready to phase out for
>>>> Grizzly.  Unfortunately, over the weekend one of the compute nodes powered
>>>> off (power supply failure, it looks like).  When I tried a "nova reboot
>>>> <UUID>"
>>>>
>>>> I got:
>>>>
>>>> 2013-07-07 21:17:34 ERROR nova.rpc.amqp [req-d2ea5f46-9dc2-4788-9951-07d985a1f8dc 6986639ba3c84ab5b05fdd2e122101f0 3806a811d2d34542bdfc5d7f31ce7b89] Exception during message handling
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp Traceback (most recent call last):
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/rpc/amqp.py", line 253, in _process_data
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     rval = node_func(context=ctxt, **node_args)
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 114, in wrapped
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     return f(*args, **kw)
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 159, in decorated_function
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     function(self, context, instance_uuid, *args, **kwargs)
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 183, in decorated_function
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     sys.exc_info())
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     self.gen.next()
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 177, in decorated_function
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     return function(self, context, instance_uuid, *args, **kwargs)
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 904, in reboot_instance
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     reboot_type)
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 114, in wrapped
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     return f(*args, **kw)
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py", line 721, in reboot
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     if self._soft_reboot(instance):
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py", line 757, in _soft_reboot
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     dom.create()
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 551, in create
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp libvirtError: Cannot get interface MTU on 'br14': No such device
>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp
>>>>
>>>>
>>>> So I tried starting it manually:
>>>>
>>>> root@compute-01:/etc/libvirt/qemu# virsh create instance-00000035.xml
>>>> error: Failed to create domain from instance-00000035.xml
>>>> error: Cannot get interface MTU on 'br14': No such device
>>>>
>>>>
>>>> Any idea what I might be doing wrong?  All the services show :-) with
>>>> nova-manage
>>>>
>>>>
>>>> Thanks for your help...
>>>>
>>>> Sam
>>>>
>>>> _______________________________________________
>>>> OpenStack-operators mailing list
>>>> OpenStack-operators at lists.openstack.org
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>>
>>>>
>>>
>>>
>>> --
>>> Lorin Hochstein
>>> Lead Architect - Cloud Services
>>> Nimbis Services, Inc.
>>> www.nimbisservices.com
>>>
>>
>>
>> _______________________________________________
>> OpenStack-operators mailing list
>> OpenStack-operators at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>>
>