[Openstack-operators] [Essex] compute node hard reboot, can't create domain.

Narayan Desai narayan.desai at gmail.com
Mon Jul 8 12:41:30 UTC 2013


Yeah, it is kind of a pain. Are you running nova-network on each host? This
is a longer series of steps than we normally need to perform. (Our bridges
don't have IP addresses on them.)
 -nld


On Sun, Jul 7, 2013 at 10:25 PM, Samuel Winchenbach <swinchen at gmail.com> wrote:

> Holy smokes that is a pain in the butt!
>
> I got it, but I had to (roughly as in the sketch below):
> 1) create the bridge for each tenant
> 2) create the tagged interface for each tenant
> 3) add the appropriate vlan interface to each bridge
> 4) "guess" at the IP that should be assigned to each bridge and assign it.
> 5) reboot each VM.
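>
> For one tenant on vlan 14 it boiled down to something like this (the tag,
> parent NIC, and bridge address below are from my cluster, and the IP in
> particular was the "guess"):
>
>     brctl addbr br14                                    # 1) the bridge
>     ip link add link eth0 name vlan14 type vlan id 14   # 2) tagged interface
>     brctl addif br14 vlan14                             # 3) vlan into bridge
>     ip link set vlan14 up && ip link set br14 up
>     ip addr add 10.0.14.1/24 dev br14                   # 4) guessed bridge IP
>     nova reboot <UUID>                                  # 5) per VM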
>
> This has to have been a bug in Essex.  I see no reason nova-compute
> shouldn't recreate all of the networking after a hard reboot.
>
> Thank you both for the help... if it wasn't bedtime I would go grab a
> beer or a bottle of Wild Turkey :P
>
> - Sam
>
>
> On Sun, Jul 7, 2013 at 11:12 PM, Narayan Desai <narayan.desai at gmail.com> wrote:
>
>> You'll also need to set up the tagged interface (eth2@14) and add it to
>> the br14 bridge.
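>>
>> Something like this, assuming eth2 is the trunked NIC on that box:
>>
>>     ip link add link eth2 name vlan14 type vlan id 14
>>     ip link set vlan14 up
>>     brctl addif br14 vlan14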
>>  -nld
>>
>>
>> On Sun, Jul 7, 2013 at 9:32 PM, Samuel Winchenbach <swinchen at gmail.com> wrote:
>>
>>> Lorin,
>>>
>>> I am running in vlan mode (not multihost mode).  Restarting
>>> nova-compute does not seem to create the bridge "br14".  I tried creating
>>> it manually with "brctl addbr br14".  Doing this allowed me to start the VM,
>>> but I cannot ping or ssh to the VM on either the internal or external
>>> network.  I probably need to create the vlan14@eth0 interface as well
>>> and add it to the bridge?
>>>
>>> Why might nova-compute not recreate the bridges and interfaces?  I don't
>>> see any warnings or errors in the log files (on either node) when starting
>>> nova-compute on the failed compute node.
>>>
>>> ************ FROM THE CONTROLLER NODE ************
>>> root@cloudy:~# nova-manage service list
>>> 2013-07-07 22:25:40 DEBUG nova.utils [req-f4a55a39-03d8-4bc7-b5c8-f53b1825f934 None None] backend <module 'nova.db.sqlalchemy.api' from '/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.pyc'> from (pid=29059) __get_backend /usr/lib/python2.7/dist-packages/nova/utils.py:658
>>> Binary           Host         Zone   Status    State   Updated_At
>>> nova-scheduler   cloudy       nova   enabled   :-)     2013-07-08 02:25:40
>>> nova-compute     cloudy       nova   enabled   :-)     2013-07-08 02:25:25
>>> nova-network     cloudy       nova   enabled   :-)     2013-07-08 02:25:40
>>> nova-compute     compute-01   nova   enabled   :-)     2013-07-08 02:25:40
>>> nova-compute     compute-02   nova   enabled   XXX     2013-05-21 17:47:13  <-- this is OK, I have it turned off
>>>
>>>
>>> ************ FROM THE FAILED NODE ************
>>> root@compute-01:~# service nova-compute restart
>>> nova-compute stop/waiting
>>> nova-compute start/running, process 13057
>>> root@compute-01:~# sleep 10
>>> root@compute-01:~# ip a
>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
>>>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>     inet 127.0.0.1/8 scope host lo
>>>     inet6 ::1/128 scope host
>>>        valid_lft forever preferred_lft forever
>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
>>>     link/ether 00:25:90:56:d9:d2 brd ff:ff:ff:ff:ff:ff
>>>     inet 10.54.50.30/16 brd 10.54.255.255 scope global eth0   <-- address obtained via DHCP for the external route
>>>     inet 10.20.0.2/16 scope global eth0                       <-- OpenStack management/internal network
>>>     inet6 fe80::225:90ff:fe56:d9d2/64 scope link
>>>        valid_lft forever preferred_lft forever
>>> 3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
>>>     link/ether 00:25:90:56:d9:d3 brd ff:ff:ff:ff:ff:ff
>>> 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
>>>     link/ether 00:02:c9:34:e9:90 brd ff:ff:ff:ff:ff:ff
>>>     inet 10.57.60.2/16 brd 10.57.255.255 scope global eth2   <-- on the NFS network (for live migration)
>>>     inet6 fe80::202:c9ff:fe34:e990/64 scope link
>>>        valid_lft forever preferred_lft forever
>>> 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
>>>     link/ether 00:02:c9:34:e9:91 brd ff:ff:ff:ff:ff:ff
>>> 6: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
>>>     link/ether 56:ed:f9:dd:bc:58 brd ff:ff:ff:ff:ff:ff
>>>     inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
>>>
>>>
>>> On Sun, Jul 7, 2013 at 10:03 PM, Lorin Hochstein <lorin at nimbisservices.com> wrote:
>>>
>>>> Hi Samuel:
>>>>
>>>> It sounds like your VMs are configured to plug into a Linux bridge that
>>>> doesn't exist on compute-01 anymore. You could create it manually, although
>>>> I would expect it to have been created automatically by the relevant
>>>> nova service when it came back up.
>>>>
>>>> You can check if the bridge is there by doing "ip a" and looking for
>>>> the "br14" network device.
>>>>
>>>> Are you running networking in multihost mode? If so, I think restarting
>>>> the nova-network service on compute-01 should do it. If you aren't running
>>>> in multihost mode, then it should come back by restarting the nova-compute
>>>> service on compute-01.
>>>>
>>>> Otherwise, you'll need to create the bridge manually, and how you do
>>>> that will depend on whether you're running flat or vlan. If it was called
>>>> br14, I'm assuming you're running in vlan mode with vlan tag 14 associated
>>>> with this project?
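>>>>
>>>> If it is vlan mode, recreating it by hand would look roughly like this
>>>> (interface names here are guesses; in vlan mode nova takes the parent
>>>> NIC from vlan_interface in nova.conf):
>>>>
>>>>     # flat networking: the bridge alone is enough
>>>>     brctl addbr br100 && ip link set br100 up
>>>>     # vlan networking: bridge plus a tagged interface attached to it
>>>>     brctl addbr br14
>>>>     ip link add link eth0 name vlan14 type vlan id 14
>>>>     brctl addif br14 vlan14
>>>>     ip link set vlan14 up && ip link set br14 up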
>>>>
>>>> Lorin
>>>>
>>>>
>>>> On Sun, Jul 7, 2013 at 9:21 PM, Samuel Winchenbach <swinchen at gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I have an old Essex cluster that we are getting ready to phase out for
>>>>> Grizzly.  Unfortunately, over the weekend one of the compute nodes powered
>>>>> off (power supply failure, it looks like).  When I tried a "nova reboot
>>>>> <UUID>"
>>>>>
>>>>> I got:
>>>>>
>>>>> 2013-07-07 21:17:34 ERROR nova.rpc.amqp [req-d2ea5f46-9dc2-4788-9951-07d985a1f8dc 6986639ba3c84ab5b05fdd2e122101f0 3806a811d2d34542bdfc5d7f31ce7b89] Exception during message handling
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp Traceback (most recent call last):
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/rpc/amqp.py", line 253, in _process_data
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     rval = node_func(context=ctxt, **node_args)
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 114, in wrapped
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     return f(*args, **kw)
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 159, in decorated_function
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     function(self, context, instance_uuid, *args, **kwargs)
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 183, in decorated_function
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     sys.exc_info())
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     self.gen.next()
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 177, in decorated_function
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     return function(self, context, instance_uuid, *args, **kwargs)
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 904, in reboot_instance
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     reboot_type)
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 114, in wrapped
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     return f(*args, **kw)
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py", line 721, in reboot
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     if self._soft_reboot(instance):
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py", line 757, in _soft_reboot
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     dom.create()
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 551, in create
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp libvirtError: Cannot get interface MTU on 'br14': No such device
>>>>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp
>>>>>
>>>>>
>>>>> So I tried starting it manually:
>>>>>
>>>>> root@compute-01:/etc/libvirt/qemu# virsh create instance-00000035.xml
>>>>> error: Failed to create domain from instance-00000035.xml
>>>>> error: Cannot get interface MTU on 'br14': No such device
>>>>>
>>>>>
>>>>> Any idea what I might be doing wrong?  All the services show :-) in
>>>>> "nova-manage service list".
>>>>>
>>>>>
>>>>> Thanks for your help...
>>>>>
>>>>> Sam
>>>>>
>>>>
>>>>
>>>> --
>>>> Lorin Hochstein
>>>> Lead Architect - Cloud Services
>>>> Nimbis Services, Inc.
>>>> www.nimbisservices.com
>>>>
>>>
>>>