[Nova] Instances can't be started after compute nodes unexpectedly shut down because of power outage
Donny Davis
donny at fortnebula.com
Fri Jul 12 19:37:05 UTC 2019
How is the recovery coming along, Gökhan? I am curious to hear.
On Fri, Jul 12, 2019 at 3:46 AM Gökhan IŞIK <skylightcoder at gmail.com> wrote:
> Awesome, thanks Donny!
> I followed the steps below and rescued my instance.
>
> 1. Find the instance ID and the compute host:
>
> root at infra1-utility-container-50bcf920:~# openstack server show 1d2e8a39-97ee-4ce7-a612-1b50f90cc51e -c id -c OS-EXT-SRV-ATTR:hypervisor_hostname
> +-------------------------------------+--------------------------------------+
> | Field                               | Value                                |
> +-------------------------------------+--------------------------------------+
> | OS-EXT-SRV-ATTR:hypervisor_hostname | compute06                            |
> | id                                  | 1d2e8a39-97ee-4ce7-a612-1b50f90cc51e |
> +-------------------------------------+--------------------------------------+
>
>
> 2. Find the image and its backing file on the compute host:
>
> root at compute06:~# qemu-img info -U --backing-chain /var/lib/nova/instances/1d2e8a39-97ee-4ce7-a612-1b50f90cc51e/disk
> image: /var/lib/nova/instances/1d2e8a39-97ee-4ce7-a612-1b50f90cc51e/disk
> file format: qcow2
> virtual size: 160G (171798691840 bytes)
> disk size: 32G
> cluster_size: 65536
> backing file: /var/lib/nova/instances/_base/a1960f539532979a591c5f837ad604eedd9c7323
> Format specific information:
> compat: 1.1
> lazy refcounts: false
> refcount bits: 16
> corrupt: false
> image: /var/lib/nova/instances/_base/a1960f539532979a591c5f837ad604eedd9c7323
> file format: raw
> virtual size: 160G (171798691840 bytes)
> disk size: 18G
>
>
>
> 3. Copy the image and its backing file:
>
>
> root at compute06:~# cp /var/lib/nova/instances/1d2e8a39-97ee-4ce7-a612-1b50f90cc51e/disk master
> root at compute06:~# cp /var/lib/nova/instances/_base/a1960f539532979a591c5f837ad604eedd9c7323 new-master
>
>
> 4. Rebase the copied overlay (master) onto the copied backing file
> (new-master), then commit the changes from master back into new-master:
>
> root at compute06:~# qemu-img rebase -b new-master -U master
>
> root at compute06:~# qemu-img commit master
>
> root at compute06:~# qemu-img info new-master
>
>
>
>
> 5. Convert the resulting raw image to qcow2:
>
> root at compute06:~# qemu-img convert -f raw -O qcow2 new-master new-master.qcow2
>
>
> 6. Time to upload the image to Glance and then launch an instance from it :)
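>
> A minimal sketch of that last step (assuming the openstack client can read
> the converted file; the image, flavor and network names here are placeholders):
>
> # placeholder names: rescued-image, rescued-instance, <original-flavor>, <original-network>
> openstack image create --disk-format qcow2 --container-format bare --file new-master.qcow2 rescued-image
> openstack server create --image rescued-image --flavor <original-flavor> --network <original-network> rescued-instance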
>
>
> Thanks,
> Gökhan.
>
> Donny Davis <donny at fortnebula.com> wrote on Fri, 12 Jul 2019 at 00:56:
>
>> Of course you can also always just pull the disk images from the VM
>> folders, merge them back with the base file, upload them to Glance and then
>> relaunch the instances.
>>
>> You can give this method a spin with the lowest risk to your instances:
>>
>>
>> https://medium.com/@kumar_pravin/qemu-merge-snapshot-and-backing-file-into-standalone-disk-c8d3a2b17c0e
>>
>>
>>
>>
>>
>> On Thu, Jul 11, 2019 at 4:10 PM Donny Davis <donny at fortnebula.com> wrote:
>>
>>> You surely want to leave locking turned on.
>>>
>>> You may want to ask qemu-devel about the locking of an image file and how
>>> it works. This isn't really an OpenStack issue; it seems to be a layer below.
>>>
>>> Depending on how mission critical your VMs are, you could probably work
>>> around it by just passing --force-share to the command OpenStack is
>>> trying to run.
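>>>
>>> For example, a sketch against the disk path from your earlier output (this
>>> only skips the lock check for the read-only info call; it does not release
>>> a write lock held by another process):
>>>
>>> qemu-img info --force-share /var/lib/nova/instances/659b5853-d094-4425-85a9-5bcacf88c84e/disk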
>>>
>>> I cannot recommend this path; the best way is to find out how to remove
>>> the lock.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jul 11, 2019 at 3:23 PM Gökhan IŞIK <skylightcoder at gmail.com>
>>> wrote:
>>>
>>>> In [1] it says "Image locking is added and enabled by default.
>>>> Multiple QEMU processes cannot write to the same image as long as the host
>>>> supports OFD or posix locking, unless options are specified otherwise."
>>>> Maybe something needs to be done on the nova side.
>>>>
>>>> I ran this command and got the same error. The output is at
>>>> http://paste.openstack.org/show/754311/
>>>>
>>>> If I run qemu-img info instance-0000219b with -U, it doesn't give any
>>>> errors.
>>>>
>>>> [1] https://wiki.qemu.org/ChangeLog/2.10
>>>>
>>>> Donny Davis <donny at fortnebula.com> wrote on Thu, 11 Jul 2019 at 22:11:
>>>>
>>>>> Well, that is interesting. If you look in your libvirt config directory
>>>>> (/etc/libvirt on CentOS) you can get a little more info on what is being
>>>>> used for locking.
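>>>>>
>>>>> For example (a sketch; file names and defaults vary by distro and libvirt
>>>>> version, and qemu-lockd.conf may only be present when virtlockd is configured):
>>>>>
>>>>> grep -n 'lock_manager' /etc/libvirt/qemu.conf
>>>>> cat /etc/libvirt/qemu-lockd.conf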
>>>>>
>>>>> Maybe strace can shed some light on it. Try something like
>>>>>
>>>>> strace -ttt -f qemu-img info
>>>>> /var/lib/nova/instances/659b5853-d094-4425-85a9-5bcacf88c84e/disk
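>>>>>
>>>>> Another quick check (a sketch; lsof and fuser only see processes on the
>>>>> local host, and since the instance paths are on NFS the lock may be held
>>>>> from another host, so an empty result is not conclusive):
>>>>>
>>>>> lsof /var/lib/nova/instances/659b5853-d094-4425-85a9-5bcacf88c84e/disk
>>>>> fuser -v /var/lib/nova/instances/659b5853-d094-4425-85a9-5bcacf88c84e/disk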
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 11, 2019 at 2:39 PM Gökhan IŞIK <skylightcoder at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I ran the virsh list --all command and the output is below:
>>>>>>
>>>>>> root at compute06:~# virsh list --all
>>>>>>  Id    Name                           State
>>>>>> ----------------------------------------------------
>>>>>>  -     instance-000012f9              shut off
>>>>>>  -     instance-000013b6              shut off
>>>>>>  -     instance-000016fb              shut off
>>>>>>  -     instance-0000190a              shut off
>>>>>>  -     instance-00001a8a              shut off
>>>>>>  -     instance-00001e05              shut off
>>>>>>  -     instance-0000202a              shut off
>>>>>>  -     instance-00002135              shut off
>>>>>>  -     instance-00002141              shut off
>>>>>>  -     instance-000021b6              shut off
>>>>>>  -     instance-000021ec              shut off
>>>>>>  -     instance-000023db              shut off
>>>>>>  -     instance-00002ad7              shut off
>>>>>>
>>>>>> And also, when I try to start an instance with virsh, the output is below:
>>>>>>
>>>>>> root at compute06:~# virsh start instance-0000219b
>>>>>> error: Failed to start domain instance-000012f9
>>>>>> error: internal error: process exited while connecting to monitor:
>>>>>> 2019-07-11T18:36:34.229534Z qemu-system-x86_64: -chardev
>>>>>> pty,id=charserial0,logfile=/dev/fdset/2,logappend=on: char device
>>>>>> redirected to /dev/pts/3 (label charserial0)
>>>>>> 2019-07-11T18:36:34.243395Z qemu-system-x86_64: -drive
>>>>>> file=/var/lib/nova/instances/659b5853-d094-4425-85a9-5bcacf88c84e/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,discard=ignore:
>>>>>> Failed to get "write" lock
>>>>>> Is another process using the image?
>>>>>>
>>>>>> Thanks,
>>>>>> Gökhan
>>>>>>
>>>>>> Donny Davis <donny at fortnebula.com> wrote on Thu, 11 Jul 2019 at 21:06:
>>>>>>
>>>>>>> Can you ssh to the hypervisor and run virsh list to make sure your
>>>>>>> instances are in fact down?
>>>>>>>
>>>>>>> On Thu, Jul 11, 2019 at 3:02 AM Gökhan IŞIK <skylightcoder at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Can anyone help me please? I cannot rescue my instances yet :(
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Gökhan
>>>>>>>>
>>>>>>>> Gökhan IŞIK <skylightcoder at gmail.com> wrote on Tue, 9 Jul 2019 at 15:46:
>>>>>>>>
>>>>>>>>> Hi folks,
>>>>>>>>> Because of a power outage, most of our compute nodes unexpectedly
>>>>>>>>> shut down and now I cannot start our instances. The error message is
>>>>>>>>> "Failed to get "write" lock. Is another process using the image?". The
>>>>>>>>> instances' power status is No State. The full error log is
>>>>>>>>> http://paste.openstack.org/show/754107/. My environment is
>>>>>>>>> OpenStack Pike on Ubuntu 16.04 LTS servers and the instances are on NFS
>>>>>>>>> shared storage. The nova version is 16.1.6.dev2, the qemu version is
>>>>>>>>> 2.10.1 and the libvirt version is 3.6.0. I saw a commit [1], but it
>>>>>>>>> doesn't solve this problem.
>>>>>>>>> There are important instances in my environment. How can I rescue
>>>>>>>>> them? What would you suggest?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Gökhan
>>>>>>>>>
>>>>>>>>> [1] https://review.opendev.org/#/c/509774/
>>>>>>>>>
>>>>>>>>