[Openstack-operators] [Essex] compute node hard reboot, can't create domain.
Samuel Winchenbach
swinchen at gmail.com
Mon Jul 8 02:32:31 UTC 2013
Lorin,
I am running in vlan mode (not multihost mode). Restarting nova-compute
does not seem to create the bridge "br14". I tried creating it manually with
"brctl addbr br14". Doing this allowed me to start the VM, but I cannot
ping or ssh to the VM on either the internal or external network. I
probably need to create the vlan14 interface on eth0 as well and add it to
the bridge?
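For reference, this is roughly what I think the manual recreation would look
like, assuming vlan tag 14 on eth0 to match the bridge name "br14" (the
interface name is my guess, not something I pulled out of nova):

  ip link add link eth0 name vlan14 type vlan id 14   # create the tagged interface on eth0
  ip link set vlan14 up
  brctl addbr br14                                    # recreate the bridge the instance expects
  brctl addif br14 vlan14                             # attach the vlan interface to the bridge
  ip link set br14 up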
Why might nova-compute not recreate the bridges and interfaces? I don't
see any warnings or errors in the log files (on either node) when starting
nova-compute on the failed compute node.
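For context, the vlan-mode settings that I believe matter here (flag names
assumed to be the Essex defaults; values paraphrased from memory, not copied
from the node):

  # /etc/nova/nova.conf (excerpt, assumed flag names)
  --network_manager=nova.network.manager.VlanManager
  --vlan_interface=eth0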
************ FROM THE CONTROLLER NODE ************
root@cloudy:~# nova-manage service list
2013-07-07 22:25:40 DEBUG nova.utils [req-f4a55a39-03d8-4bc7-b5c8-f53b1825f934 None None] backend <module 'nova.db.sqlalchemy.api' from '/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.pyc'> from (pid=29059) __get_backend /usr/lib/python2.7/dist-packages/nova/utils.py:658
Binary           Host         Zone   Status    State  Updated_At
nova-scheduler   cloudy       nova   enabled   :-)    2013-07-08 02:25:40
nova-compute     cloudy       nova   enabled   :-)    2013-07-08 02:25:25
nova-network     cloudy       nova   enabled   :-)    2013-07-08 02:25:40
nova-compute     compute-01   nova   enabled   :-)    2013-07-08 02:25:40
nova-compute     compute-02   nova   enabled   XXX    2013-05-21 17:47:13   <--- this is ok, I have it turned off.
************ FROM THE FAILED NODE ************
root@compute-01:~# service nova-compute restart
nova-compute stop/waiting
nova-compute start/running, process 13057
root@compute-01:~# sleep 10
root@compute-01:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:25:90:56:d9:d2 brd ff:ff:ff:ff:ff:ff
    inet 10.54.50.30/16 brd 10.54.255.255 scope global eth0   <--- address obtained via DHCP for external route
    inet 10.20.0.2/16 scope global eth0                       <--- OpenStack management/internal network
    inet6 fe80::225:90ff:fe56:d9d2/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:25:90:56:d9:d3 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:02:c9:34:e9:90 brd ff:ff:ff:ff:ff:ff
    inet 10.57.60.2/16 brd 10.57.255.255 scope global eth2    <--- NFS network (for live migration)
    inet6 fe80::202:c9ff:fe34:e990/64 scope link
       valid_lft forever preferred_lft forever
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:02:c9:34:e9:91 brd ff:ff:ff:ff:ff:ff
6: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether 56:ed:f9:dd:bc:58 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
On Sun, Jul 7, 2013 at 10:03 PM, Lorin Hochstein
<lorin at nimbisservices.com> wrote:
> Hi Samuel:
>
> It sounds like your VMs are configured to plug into a Linux bridge that
> doesn't exist on compute-01 anymore. You could create it manually, although
> I would expect it to have been recreated automatically by the relevant nova
> service when it came back up.
>
> You can check if the bridge is there by doing "ip a" and looking for the
> "br14" network device.
>
> Are you running networking in multihost mode? If so, I think restarting
> the nova-network service on compute-01 should do it. If you aren't running
> in multihost mode, then it should come back by restarting the nova-compute
> service on compute-01.
>
> Otherwise, you'll need to create the bridge manually, and how you do that
> will depend on whether you're running flat or vlan. If it was called br14,
> I'm assuming you're running in vlan mode with vlan tag 14 associated with
> this project?
>
> Lorin
>
>
> On Sun, Jul 7, 2013 at 9:21 PM, Samuel Winchenbach <swinchen at gmail.com> wrote:
>
>> Hi All,
>>
>> I have an old Essex cluster that we are getting ready to phase out in
>> favor of Grizzly. Unfortunately, over the weekend one of the compute nodes
>> powered off (it looks like a power supply failure). When I tried a
>> "nova reboot <UUID>"
>>
>> I got:
>>
>> 2013-07-07 21:17:34 ERROR nova.rpc.amqp [req-d2ea5f46-9dc2-4788-9951-07d985a1f8dc 6986639ba3c84ab5b05fdd2e122101f0 3806a811d2d34542bdfc5d7f31ce7b89] Exception during message handling
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp Traceback (most recent call last):
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/rpc/amqp.py", line 253, in _process_data
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     rval = node_func(context=ctxt, **node_args)
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 114, in wrapped
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     return f(*args, **kw)
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 159, in decorated_function
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     function(self, context, instance_uuid, *args, **kwargs)
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 183, in decorated_function
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     sys.exc_info())
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     self.gen.next()
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 177, in decorated_function
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     return function(self, context, instance_uuid, *args, **kwargs)
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 904, in reboot_instance
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     reboot_type)
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 114, in wrapped
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     return f(*args, **kw)
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py", line 721, in reboot
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     if self._soft_reboot(instance):
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py", line 757, in _soft_reboot
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     dom.create()
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 551, in create
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp     if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp libvirtError: Cannot get interface MTU on 'br14': No such device
>> 2013-07-07 21:17:34 TRACE nova.rpc.amqp
>>
>>
>> So I tried starting it manually:
>>
>> root@compute-01:/etc/libvirt/qemu# virsh create instance-00000035.xml
>> error: Failed to create domain from instance-00000035.xml
>> error: Cannot get interface MTU on 'br14': No such device
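>> (That bridge name comes from the interface stanza in the instance XML,
>> which, if I remember the libvirt format right, looks something like:
>>
>>   <interface type='bridge'>
>>     <source bridge='br14'/>
>>     ...
>>   </interface>
>> )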
>>
>>
>> Any idea what I might be doing wrong? All the services show :-) in
>> "nova-manage service list".
>>
>>
>> Thanks for your help...
>>
>> Sam
>>
>
>
> --
> Lorin Hochstein
> Lead Architect - Cloud Services
> Nimbis Services, Inc.
> www.nimbisservices.com
>