How is the recovery coming along, Gökhan? I am curious to hear.
On Fri, Jul 12, 2019 at 3:46 AM Gökhan IŞIK <skylightcoder@gmail.com> wrote:
Awesome, thanks Donny! I followed the steps below and rescued my instance.
1. Find instance id and compute host
root@infra1-utility-container-50bcf920:~# openstack server show 1d2e8a39-97ee-4ce7-a612-1b50f90cc51e -c id -c OS-EXT-SRV-ATTR:hypervisor_hostname
+-------------------------------------+--------------------------------------+
| Field                               | Value                                |
+-------------------------------------+--------------------------------------+
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute06                            |
| id                                  | 1d2e8a39-97ee-4ce7-a612-1b50f90cc51e |
+-------------------------------------+--------------------------------------+
2. Find image and backing image file on compute host
root@compute06:~# qemu-img info -U --backing-chain /var/lib/nova/instances/1d2e8a39-97ee-4ce7-a612-1b50f90cc51e/disk
image: /var/lib/nova/instances/1d2e8a39-97ee-4ce7-a612-1b50f90cc51e/disk
file format: qcow2
virtual size: 160G (171798691840 bytes)
disk size: 32G
cluster_size: 65536
backing file: /var/lib/nova/instances/_base/a1960f539532979a591c5f837ad604eedd9c7323
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

image: /var/lib/nova/instances/_base/a1960f539532979a591c5f837ad604eedd9c7323
file format: raw
virtual size: 160G (171798691840 bytes)
disk size: 18G
3. Copy image and backing image file
root@compute06:~# cp /var/lib/nova/instances/1d2e8a39-97ee-4ce7-a612-1b50f90cc51e/disk master
root@compute06:~# cp /var/lib/nova/instances/_base/a1960f539532979a591c5f837ad604eedd9c7323 new-master
4. Rebase the copied image file (master), which was backed by the original base file, so that it uses the new copy (new-master), then commit the changes in master back into the new base new-master
root@compute06:~# qemu-img rebase -b new-master -U master
root@compute06:~# qemu-img commit master
root@compute06:~# qemu-img info new-master
5. Convert raw image to qcow2
root@compute06:~# qemu-img convert -f raw -O qcow2 new-master new-master.qcow2
6. Time to upload the image to Glance and then launch the instance from it :)
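The upload and relaunch look roughly like this (a sketch only; the image name, instance name, flavor, and network below are placeholders, not the exact values I used):
root@infra1-utility-container-50bcf920:~# openstack image create --disk-format qcow2 --container-format bare --file new-master.qcow2 rescued-image
root@infra1-utility-container-50bcf920:~# openstack server create --image rescued-image --flavor <flavor> --network <network> rescued-instance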
Thanks, Gökhan.
On Fri, Jul 12, 2019 at 00:56 Donny Davis <donny@fortnebula.com> wrote:
Of course you can also just pull the disk images from the VM folders, merge them back with the base file, upload to Glance, and then relaunch the instances.
You can give this method a spin with the lowest risk to your instances:
https://medium.com/@kumar_pravin/qemu-merge-snapshot-and-backing-file-into-s...
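A minimal sketch of that approach, with placeholder paths (the article walks through the details); work on copies so the originals stay untouched:
root@compute06:~# cp /var/lib/nova/instances/<instance-uuid>/disk ./overlay-copy
root@compute06:~# cp /var/lib/nova/instances/_base/<backing-file> ./base-copy
root@compute06:~# qemu-img rebase -u -b ./base-copy ./overlay-copy   # metadata-only repoint of the overlay at the copied base
root@compute06:~# qemu-img commit ./overlay-copy                     # merge the overlay's changes into ./base-copy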
On Thu, Jul 11, 2019 at 4:10 PM Donny Davis <donny@fortnebula.com> wrote:
You surely want to leave locking turned on.
You may want to ask qemu-devel about the locking of an image file and how it works. This isn't really an OpenStack issue; it seems to be a layer below.
Depending on how mission-critical your VMs are, you could probably work around it by passing --force-share to the command OpenStack is trying to run.
I cannot recommend this path, though; the best way is to find out how to remove the lock.
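For example, reading the image metadata without taking the write lock should look something like:
root@compute06:~# qemu-img info --force-share /var/lib/nova/instances/659b5853-d094-4425-85a9-5bcacf88c84e/disk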
On Thu, Jul 11, 2019 at 3:23 PM Gökhan IŞIK <skylightcoder@gmail.com> wrote:
In [1] it says "Image locking is added and enabled by default. Multiple QEMU processes cannot write to the same image as long as the host supports OFD or posix locking, unless options are specified otherwise." Maybe we need to do something on the nova side.
I ran this command and got the same error. The output is at http://paste.openstack.org/show/754311/
If I run qemu-img info instance-0000219b with -U, it doesn't give any errors.
[1] https://wiki.qemu.org/ChangeLog/2.10
On Thu, Jul 11, 2019 at 22:11 Donny Davis <donny@fortnebula.com> wrote:
Well that is interesting. If you look in your libvirt config directory (/etc/libvirt on CentOS) you can get a little more info on what is being used for locking.
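For example, something like this should show whether a lock manager (virtlockd or sanlock) is configured for the qemu driver:
root@compute06:~# grep -i lock /etc/libvirt/qemu.conf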
Maybe strace can shed some light on it. Try something like
strace -ttt -f qemu-img info /var/lib/nova/instances/659b5853-d094-4425-85a9-5bcacf88c84e/disk
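If it is the image locking that fails, the interesting lines should be the fcntl() calls qemu-img makes on the disk file, so filtering for them may help:
root@compute06:~# strace -ttt -f qemu-img info /var/lib/nova/instances/659b5853-d094-4425-85a9-5bcacf88c84e/disk 2>&1 | grep fcntl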
On Thu, Jul 11, 2019 at 2:39 PM Gökhan IŞIK <skylightcoder@gmail.com> wrote:
I ran the virsh list --all command and the output is below:
root@compute06:~# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     instance-000012f9              shut off
 -     instance-000013b6              shut off
 -     instance-000016fb              shut off
 -     instance-0000190a              shut off
 -     instance-00001a8a              shut off
 -     instance-00001e05              shut off
 -     instance-0000202a              shut off
 -     instance-00002135              shut off
 -     instance-00002141              shut off
 -     instance-000021b6              shut off
 -     instance-000021ec              shut off
 -     instance-000023db              shut off
 -     instance-00002ad7              shut off
And also when I try to start the instances with virsh, the output is below:
root@compute06:~# virsh start instance-0000219b
error: Failed to start domain instance-000012f9
error: internal error: process exited while connecting to monitor: 2019-07-11T18:36:34.229534Z qemu-system-x86_64: -chardev pty,id=charserial0,logfile=/dev/fdset/2,logappend=on: char device redirected to /dev/pts/3 (label charserial0)
2019-07-11T18:36:34.243395Z qemu-system-x86_64: -drive file=/var/lib/nova/instances/659b5853-d094-4425-85a9-5bcacf88c84e/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,discard=ignore: Failed to get "write" lock
Is another process using the image?
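To see whether any local process still has that disk file open, something like this should work (though since the instances are on NFS shared storage, the process holding the lock could also be on another compute host):
root@compute06:~# lsof /var/lib/nova/instances/659b5853-d094-4425-85a9-5bcacf88c84e/disk
root@compute06:~# fuser -v /var/lib/nova/instances/659b5853-d094-4425-85a9-5bcacf88c84e/disk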
Thanks, Gökhan
On Thu, Jul 11, 2019 at 21:06 Donny Davis <donny@fortnebula.com> wrote:
> Can you ssh to the hypervisor and run virsh list to make sure your instances are in fact down?
>
> On Thu, Jul 11, 2019 at 3:02 AM Gökhan IŞIK <skylightcoder@gmail.com> wrote:
>
>> Can anyone help me please? I cannot rescue my instances yet :(
>>
>> Thanks,
>> Gökhan
>>
>> On Tue, Jul 9, 2019 at 15:46 Gökhan IŞIK <skylightcoder@gmail.com> wrote:
>>
>>> Hi folks,
>>> Because of a power outage, most of our compute nodes unexpectedly shut down and now I cannot start our instances. The error message is "Failed to get "write" lock another process using the image?". The instances' power status is No State. The full error log is at http://paste.openstack.org/show/754107/. My environment is OpenStack Pike on Ubuntu 16.04 LTS servers and the instances are on NFS shared storage. The nova version is 16.1.6.dev2, the qemu version is 2.10.1 and the libvirt version is 3.6.0. I saw a commit [1], but it doesn't solve this problem.
>>> There are important instances in my environment. How can I rescue my instances? What would you suggest?
>>>
>>> Thanks,
>>> Gökhan
>>>
>>> [1] https://review.opendev.org/#/c/509774/