[Openstack] compute nodes down

Jim Okken jim at jokken.com
Fri Dec 29 20:53:14 UTC 2017


I believe this issue turned out to be the shared storage device that
provides shared storage to each compute node.

It had an access issue, and access attempts to one instance's vHD files
hung forever and never timed out.
That explains nova issues on one node. But could it cause the nova
services on all compute nodes to stop after some time? (In a shared-storage
setup, does each node access/query each vHD on the storage periodically?)
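
(Judging by the imagecache lines in the logs below, the answer may well be
yes: the periodic image cache manager on each node appears to check and
touch every _base image on the shared mount, so one wedged file could stall
nova-compute everywhere. A rough way to probe the mount, using our
/mnt/MSA_FC_Vol1 path, just a sketch:

    # probe the shared _base directory; exit status 124 means it timed out
    timeout 10 ls /mnt/MSA_FC_Vol1/nodes/_base/ && echo responsive || echo "hung or slow"

Caveat: on a mount stuck in uninterruptible I/O even this probe can block,
which is itself a useful signal.)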

thanks!

-- Jim

On Tue, Dec 19, 2017 at 3:45 AM, Tobias Urdin <tobias.urdin at crystone.com>
wrote:

> Enable debug in nova.conf and check conductor and compute logs.
>
> Check that your clocks are in sync with NTP, or the alive checks in the
> database may exceed the service_down_time config value.
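>
> For reference, the relevant nova.conf knobs would look roughly like this
> (the values are the defaults, if I remember them right):
>
>     [DEFAULT]
>     debug = True
>     # each nova-compute reports in every report_interval seconds; if the
>     # newest report is older than service_down_time the service shows down
>     report_interval = 10
>     service_down_time = 60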
>
> On 12/19/2017 12:09 AM, Jim Okken wrote:
>
> hi list,
>
> I'm hoping someone can shed some light on an issue I just started seeing
> today.
>
> All my compute nodes started showing as "Down" in the Horizon ->
> Hypervisors -> Compute Nodes tab.
>
>
> root@node-1:~# nova service-list
> +-----+------------------+-------------------+----------+---------+-------+----------------------------+-----------------+
> | Id  | Binary           | Host              | Zone     | Status  | State | Updated_at                 | Disabled Reason |
> +-----+------------------+-------------------+----------+---------+-------+----------------------------+-----------------+
> | 325 | nova-compute     | node-9.mydom.com  | nova     | enabled | down  | 2017-12-18T21:59:38.000000 | -               |
> | 448 | nova-compute     | node-14.mydom.com | nova     | enabled | up    | 2017-12-18T22:41:42.000000 | -               |
> | 451 | nova-compute     | node-17.mydom.com | nova     | enabled | up    | 2017-12-18T22:42:04.000000 | -               |
> | 454 | nova-compute     | node-11.mydom.com | nova     | enabled | up    | 2017-12-18T22:42:02.000000 | -               |
> | 457 | nova-compute     | node-12.mydom.com | nova     | enabled | up    | 2017-12-18T22:42:12.000000 | -               |
> | 472 | nova-compute     | node-16.mydom.com | nova     | enabled | down  | 2017-12-18T00:16:01.000000 | -               |
> | 475 | nova-compute     | node-10.mydom.com | nova     | enabled | down  | 2017-12-18T00:26:09.000000 | -               |
> | 478 | nova-compute     | node-13.mydom.com | nova     | enabled | down  | 2017-12-17T23:54:06.000000 | -               |
> | 481 | nova-compute     | node-15.mydom.com | nova     | enabled | up    | 2017-12-18T22:41:34.000000 | -               |
> | 484 | nova-compute     | node-8.mydom.com  | nova     | enabled | down  | 2017-12-17T23:55:50.000000 | -               |
> +-----+------------------+-------------------+----------+---------+-------+----------------------------+-----------------+
>
>
> If I stop and then start nova-compute on one of the down nodes, the stop
> takes several minutes, and then the start is quick and fine. But after
> about 2 hours the nova-compute service shows as down again.
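>
> (When the stop hangs for minutes like that, a rough check for threads
> stuck in uninterruptible I/O, just a sketch, is:
>
>     # show processes in D state (uninterruptible sleep, usually blocked I/O)
>     ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
>
> Anything from nova-compute sitting in D state would point at a hung disk
> or mount.)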
>
> I am not seeing any ERRORs in the nova logs.
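>
> (For what it's worth, I am just scanning with something along the lines of
>
>     grep -iE 'error|warning' /var/log/nova/*.log
>
> so I may be missing something subtler.)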
>
> I get this for the status of a node that is showing as "UP"
>
>
>
> root@node-14:~# systemctl status nova-compute.service
> ● nova-compute.service - OpenStack Compute
>    Loaded: loaded (/lib/systemd/system/nova-compute.service; enabled;
> vendor preset: enabled)
>    Active: active (running) since Mon 2017-12-18 21:57:10 UTC; 35min ago
>      Docs: man:nova-compute(1)
>   Process: 32193 ExecStartPre=/bin/chown nova:adm /var/log/nova
> (code=exited, status=0/SUCCESS)
>   Process: 32190 ExecStartPre=/bin/chown nova:nova /var/lock/nova
> /var/lib/nova (code=exited, status=0/SUCCESS)
>   Process: 32187 ExecStartPre=/bin/mkdir -p /var/lock/nova /var/log/nova
> /var/lib/nova (code=exited, status=0/SUCCESS)
>  Main PID: 32196 (nova-compute)
>    CGroup: /system.slice/nova-compute.service
>            └─32196 /usr/bin/python /usr/bin/nova-compute
> --config-file=/etc/nova/nova-compute.conf --config-file=/etc/nova/nova.conf
> --log-file=/var/log/nova/nova-compute.log
>
> Dec 18 22:31:47 node-14.mydom.com nova-compute[32196]: 2017-12-18
> 22:31:47.570 32196 DEBUG oslo_messaging._drivers.amqpdriver
> [req-f30b2331-2097-4981-89c8-acea4a81f7f2 - - - - -] CALL msg_id:
> 2877b9707da144f3a91e7b80e2705fb3 exchange 'nova' topic 'conductor' _send
> /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:448
> Dec 18 22:31:47 node-14.mydom.com nova-compute[32196]: 2017-12-18
> 22:31:47.604 32196 DEBUG oslo_messaging._drivers.amqpdriver [-] received
> reply msg_id: 2877b9707da144f3a91e7b80e2705fb3 __call__
> /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:296
> Dec 18 22:31:47 node-14.mydom.com nova-compute[32196]: 2017-12-18
> 22:31:47.605 32196 INFO nova.compute.resource_tracker
> [req-f30b2331-2097-4981-89c8-acea4a81f7f2 - - - - -] Total usable vcpus:
> 40, total allocated vcpus: 0
> Dec 18 22:31:47 node-14.mydom.com nova-compute[32196]: 2017-12-18
> 22:31:47.606 32196 INFO nova.compute.resource_tracker
> [req-f30b2331-2097-4981-89c8-acea4a81f7f2 - - - - -] Final resource view:
> name=node-14.mydom.com phys_ram=128812MB used_ram=512MB phys_disk=6691GB
> used_disk=0GB total_vcpus=40 used_vcpus=0 pci_stats=[]
> Dec 18 22:31:47 node-14.mydom.com nova-compute[32196]: 2017-12-18
> 22:31:47.610 32196 DEBUG oslo_messaging._drivers.amqpdriver
> [req-f30b2331-2097-4981-89c8-acea4a81f7f2 - - - - -] CALL msg_id:
> ad32abe833f4440d86c15b911aa35c43 exchange 'nova' topic 'conductor' _send
> /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:448
> Dec 18 22:31:47 node-14.mydom.com nova-compute[32196]: 2017-12-18
> 22:31:47.632 32196 DEBUG oslo_messaging._drivers.amqpdriver [-] received
> reply msg_id: ad32abe833f4440d86c15b911aa35c43 __call__
> /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:296
> Dec 18 22:31:47 node-14.mydom.com nova-compute[32196]: 2017-12-18
> 22:31:47.633 32196 WARNING nova.scheduler.client.report
> [req-f30b2331-2097-4981-89c8-acea4a81f7f2 - - - - -] Unable to refresh my
> resource provider record
> Dec 18 22:31:47 node-14.mydom.com nova-compute[32196]: 2017-12-18
> 22:31:47.634 32196 INFO nova.compute.resource_tracker
> [req-f30b2331-2097-4981-89c8-acea4a81f7f2 - - - - -] Compute_service
> record updated for node-14.mydom.com:node-14.mydom.com
> Dec 18 22:31:52 node-14.mydom.com nova-compute[32196]: 2017-12-18
> 22:31:52.247 32196 DEBUG oslo_messaging._drivers.amqpdriver
> [req-f30b2331-2097-4981-89c8-acea4a81f7f2 - - - - -] CALL msg_id:
> 4cbf019c36ce41cd89d34e59f6acc55f exchange 'nova' topic 'conductor' _send
> /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:448
> Dec 18 22:31:52 node-14.mydom.com nova-compute[32196]: 2017-12-18
> 22:31:52.265 32196 DEBUG oslo_messaging._drivers.amqpdriver [-] received
> reply msg_id: 4cbf019c36ce41cd89d34e59f6acc55f __call__
> /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:296
> root@node-14:~#
>
> I get this for the status of a node that is showing "DOWN"
>
>
> root@node-9:~# systemctl status nova-compute.service
> ● nova-compute.service - OpenStack Compute
>    Loaded: loaded (/lib/systemd/system/nova-compute.service; enabled;
> vendor preset: enabled)
>    Active: active (running) since Mon 2017-12-18 21:20:30 UTC; 1h 11min ago
>      Docs: man:nova-compute(1)
>   Process: 9488 ExecStartPre=/bin/chown nova:adm /var/log/nova
> (code=exited, status=0/SUCCESS)
>   Process: 9485 ExecStartPre=/bin/chown nova:nova /var/lock/nova
> /var/lib/nova (code=exited, status=0/SUCCESS)
>   Process: 9482 ExecStartPre=/bin/mkdir -p /var/lock/nova /var/log/nova
> /var/lib/nova (code=exited, status=0/SUCCESS)
>  Main PID: 9491 (nova-compute)
>    CGroup: /system.slice/nova-compute.service
>            ├─ 9491 /usr/bin/python /usr/bin/nova-compute
> --config-file=/etc/nova/nova-compute.conf --config-file=/etc/nova/nova.conf
> --log-file=/var/log/nova/nova-compute.log
>            └─20428 /usr/bin/python /usr/bin/nova-compute
> --config-file=/etc/nova/nova-compute.conf --config-file=/etc/nova/nova.conf
> --log-file=/var/log/nova/nova-compute.log
>
> Dec 18 22:00:32 node-9.mydom.com nova-compute[9491]: 2017-12-18
> 22:00:32.065 9491 INFO nova.virt.libvirt.imagecache
> [req-7623143a-2263-448e-8100-b6248501ab42 - - - - -] image
> 2e42dd45-0b7d-468a-bc1e-3ece5bdc1638 at (/mnt/MSA_FC_Vol1/nodes/_base/
> fe073169880f11099449fef24d86f3c1f8bb9763): checking
> Dec 18 22:00:32 node-9.mydom.com nova-compute[9491]: 2017-12-18
> 22:00:32.066 9491 INFO nova.virt.libvirt.imagecache
> [req-7623143a-2263-448e-8100-b6248501ab42 - - - - -] image
> 2e42dd45-0b7d-468a-bc1e-3ece5bdc1638 at (/mnt/MSA_FC_Vol1/nodes/_base/
> fe073169880f11099449fef24d86f3c1f8bb9763): in use: on this node 0 local,
> 4 on other nodes sharing this
> Dec 18 22:00:32 node-9.mydom.com sudo[40588]:     nova : TTY=unknown ;
> PWD=/var/lib/nova ; USER=root ; COMMAND=/usr/bin/nova-rootwrap
> /etc/nova/rootwrap.conf touch -c /mnt/MSA_FC_Vol1/nodes/_base/
> fe073169880f11099449fef24d86f3c1f8bb9763
> Dec 18 22:00:32 node-9.mydom.com sudo[40588]: pam_unix(sudo:session):
> session opened for user root by (uid=0)
> Dec 18 22:00:32 node-9.mydom.com sudo[40588]: pam_unix(sudo:session):
> session closed for user root
> Dec 18 22:00:32 node-9.mydom.com nova-compute[9491]: 2017-12-18
> 22:00:32.148 9491 INFO nova.virt.libvirt.imagecache
> [req-7623143a-2263-448e-8100-b6248501ab42 - - - - -] image
> 01db7edf-ee68-4fbc-bf3e-1ac4bc990488 at (/mnt/MSA_FC_Vol1/nodes/_base/
> a49721a231fdd7b45293b29dd13c34207c9c891b): checking
> Dec 18 22:00:32 node-9.mydom.com nova-compute[9491]: 2017-12-18
> 22:00:32.149 9491 INFO nova.virt.libvirt.imagecache
> [req-7623143a-2263-448e-8100-b6248501ab42 - - - - -] image
> 01db7edf-ee68-4fbc-bf3e-1ac4bc990488 at (/mnt/MSA_FC_Vol1/nodes/_base/
> a49721a231fdd7b45293b29dd13c34207c9c891b): in use: on this node 0 local,
> 2 on other nodes sharing this
> Dec 18 22:00:32 node-9.mydom.com sudo[40591]:     nova : TTY=unknown ;
> PWD=/var/lib/nova ; USER=root ; COMMAND=/usr/bin/nova-rootwrap
> /etc/nova/rootwrap.conf touch -c /mnt/MSA_FC_Vol1/nodes/_base/
> a49721a231fdd7b45293b29dd13c34207c9c891b
> Dec 18 22:00:32 node-9.mydom.com sudo[40591]: pam_unix(sudo:session):
> session opened for user root by (uid=0)
> Dec 18 22:00:32 node-9.mydom.com sudo[40591]: pam_unix(sudo:session):
> session closed for user root
> root@node-9:~#
>
> Does anything in the status messages show what could be wrong? What do the
> "nova : TTY=unknown" messages mean?
>
> thanks!!
>
> -- Jim
>
>
>