[Nova] Instances can't be started after compute nodes unexpectedly shut down because of a power outage

Donny Davis donny at fortnebula.com
Wed Jul 17 14:05:36 UTC 2019


Any word yet?

On Fri, Jul 12, 2019 at 3:37 PM Donny Davis <donny at fortnebula.com> wrote:

> How is the recovery coming along, Gökhan? I am curious to hear.
>
> On Fri, Jul 12, 2019 at 3:46 AM Gökhan IŞIK <skylightcoder at gmail.com>
> wrote:
>
>> Awesome, thanks Donny!
>> I followed the steps below and rescued my instance.
>>
>>    1.
>>
>>     Find the instance ID and the compute host
>>
>>    root@infra1-utility-container-50bcf920:~# openstack server show 1d2e8a39-97ee-4ce7-a612-1b50f90cc51e -c id -c OS-EXT-SRV-ATTR:hypervisor_hostname
>>    +-------------------------------------+--------------------------------------+
>>    | Field                               | Value                                |
>>    +-------------------------------------+--------------------------------------+
>>    | OS-EXT-SRV-ATTR:hypervisor_hostname | compute06                            |
>>    | id                                  | 1d2e8a39-97ee-4ce7-a612-1b50f90cc51e |
>>    +-------------------------------------+--------------------------------------+
>>
>>
>>    2.
>>
>>     Find the image and its backing file on the compute host
>>
>>    root@compute06:~# qemu-img info -U --backing-chain /var/lib/nova/instances/1d2e8a39-97ee-4ce7-a612-1b50f90cc51e/disk
>>    image: /var/lib/nova/instances/1d2e8a39-97ee-4ce7-a612-1b50f90cc51e/disk
>>    file format: qcow2
>>    virtual size: 160G (171798691840 bytes)
>>    disk size: 32G
>>    cluster_size: 65536
>>    backing file: /var/lib/nova/instances/_base/a1960f539532979a591c5f837ad604eedd9c7323
>>    Format specific information:
>>        compat: 1.1
>>        lazy refcounts: false
>>        refcount bits: 16
>>        corrupt: false
>>    image: /var/lib/nova/instances/_base/a1960f539532979a591c5f837ad604eedd9c7323
>>    file format: raw
>>    virtual size: 160G (171798691840 bytes)
>>    disk size: 18G
>>
>>
>>
>>    3. Copy the image and the backing image file
>>
>>
>>    root@compute06:~# cp /var/lib/nova/instances/1d2e8a39-97ee-4ce7-a612-1b50f90cc51e/disk master
>>    root@compute06:~# cp /var/lib/nova/instances/_base/a1960f539532979a591c5f837ad604eedd9c7323 new-master
>>
>>
>>    4.
>>
>>     Rebase the copied image (master) so that it uses the new backing
>>    file (new-master), then commit the changes in master back into the
>>    new base new-master
>>
>>    root@compute06:~# qemu-img rebase -b new-master -U master
>>
>>    root@compute06:~# qemu-img commit master
>>
>>    root@compute06:~# qemu-img info new-master
>>
>>
>>
>>
>>    5.
>>
>>     Convert the raw image to qcow2
>>
>>    root@compute06:~# qemu-img convert -f raw -O qcow2 new-master new-master.qcow2
>>
>>
>>    6.  Time to upload the image to Glance and then launch an instance from it :)
>>
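>>    For step 6, a minimal sketch of the Glance upload (the image name
>>    rescued-master is just an example):
>>
>>    root@compute06:~# openstack image create --disk-format qcow2 --container-format bare --file new-master.qcow2 rescued-master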
>>
>> Thanks,
>> Gökhan.
>>
>> On Fri, Jul 12, 2019 at 00:56 Donny Davis <donny at fortnebula.com>
>> wrote:
>>
>>> Of course you can also always just pull the disk images from the vm
>>> folders, merge them back with the base file, upload to glance and then
>>> relaunch the instances.
>>>
>>> You can give this method a spin with the lowest risk to your
>>> instances:
>>>
>>> https://medium.com/@kumar_pravin/qemu-merge-snapshot-and-backing-file-into-standalone-disk-c8d3a2b17c0e
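>>>
>>> A one-command sketch of that merge: qemu-img convert reads through
>>> the backing chain and writes a standalone image (the output name
>>> standalone.qcow2 is just an example):
>>>
>>>     qemu-img convert -O qcow2 /var/lib/nova/instances/<uuid>/disk standalone.qcow2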
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jul 11, 2019 at 4:10 PM Donny Davis <donny at fortnebula.com>
>>> wrote:
>>>
>>>> You surely want to leave locking turned on.
>>>>
>>>> You may want to ask qemu-devel about the locking of an image file and
>>>> how it works. This isn't really an OpenStack issue; it seems to be a
>>>> layer below.
>>>>
>>>> Depending on how mission-critical your VMs are, you could probably
>>>> work around it by just passing --force-share into the command OpenStack
>>>> is trying to run.
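>>>>
>>>> For example, to inspect a disk without taking the lock (--force-share
>>>> is the long form of -U; the path here is just a placeholder):
>>>>
>>>>     qemu-img info --force-share /var/lib/nova/instances/<uuid>/disk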
>>>>
>>>> I cannot recommend this path; the best way is to find out how to
>>>> remove the lock.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jul 11, 2019 at 3:23 PM Gökhan IŞIK <skylightcoder at gmail.com>
>>>> wrote:
>>>>
>>>>> In [1] it says "Image locking is added and enabled by default.
>>>>> Multiple QEMU processes cannot write to the same image as long as the host
>>>>> supports OFD or posix locking, unless options are specified otherwise."
>>>>> Maybe we need to do something on the Nova side.
>>>>>
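>>>>> To see which process is actually holding the disk image open, lsof
>>>>> might help (using the disk path from the virsh error below):
>>>>>
>>>>>     lsof /var/lib/nova/instances/659b5853-d094-4425-85a9-5bcacf88c84e/disk
>>>>>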
>>>>> I ran the strace command you suggested and got the same error. The
>>>>> output is at
>>>>> http://paste.openstack.org/show/754311/
>>>>>
>>>>> If I run qemu-img info instance-0000219b with -U, it doesn't give
>>>>> any errors.
>>>>>
>>>>> [1] https://wiki.qemu.org/ChangeLog/2.10
>>>>>
>>>>> On Thu, Jul 11, 2019 at 22:11 Donny Davis <donny at fortnebula.com>
>>>>> wrote:
>>>>>
>>>>>> Well that is interesting. If you look in your libvirt config
>>>>>> directory (/etc/libvirt on CentOS), you can get a little more info on what
>>>>>> is being used for locking.
>>>>>>
>>>>>> Maybe strace can shed some light on it. Try something like
>>>>>>
>>>>>> strace -ttt -f qemu-img info
>>>>>> /var/lib/nova/instances/659b5853-d094-4425-85a9-5bcacf88c84e/disk
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 11, 2019 at 2:39 PM Gökhan IŞIK <skylightcoder at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I ran the virsh list --all command and the output is below:
>>>>>>>
>>>>>>> root@compute06:~# virsh list --all
>>>>>>>  Id    Name                           State
>>>>>>> ----------------------------------------------------
>>>>>>>  -     instance-000012f9              shut off
>>>>>>>  -     instance-000013b6              shut off
>>>>>>>  -     instance-000016fb              shut off
>>>>>>>  -     instance-0000190a              shut off
>>>>>>>  -     instance-00001a8a              shut off
>>>>>>>  -     instance-00001e05              shut off
>>>>>>>  -     instance-0000202a              shut off
>>>>>>>  -     instance-00002135              shut off
>>>>>>>  -     instance-00002141              shut off
>>>>>>>  -     instance-000021b6              shut off
>>>>>>>  -     instance-000021ec              shut off
>>>>>>>  -     instance-000023db              shut off
>>>>>>>  -     instance-00002ad7              shut off
>>>>>>>
>>>>>>> Also, when I try to start instances with virsh, the output is below:
>>>>>>>
>>>>>>> root@compute06:~# virsh start instance-0000219b
>>>>>>> error: Failed to start domain instance-000012f9
>>>>>>> error: internal error: process exited while connecting to monitor:
>>>>>>>  2019-07-11T18:36:34.229534Z qemu-system-x86_64: -chardev
>>>>>>> pty,id=charserial0,logfile=/dev/fdset/2,logappend=on: char device
>>>>>>> redirected to /dev/pts/3 (label charserial0)
>>>>>>> 2019-07-11T18:36:34.243395Z qemu-system-x86_64: -drive
>>>>>>> file=/var/lib/nova/instances/659b5853-d094-4425-85a9-5bcacf88c84e/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,discard=ignore:
>>>>>>> Failed to get "write" lock
>>>>>>> Is another process using the image?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Gökhan
>>>>>>>
>>>>>>> On Thu, Jul 11, 2019 at 21:06 Donny Davis <donny at fortnebula.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Can you ssh to the hypervisor and run virsh list to make sure your
>>>>>>>> instances are in fact down?
>>>>>>>>
>>>>>>>> On Thu, Jul 11, 2019 at 3:02 AM Gökhan IŞIK <
>>>>>>>> skylightcoder at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Can anyone help me, please? I can't rescue my instances yet :(
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Gökhan
>>>>>>>>>
>>>>>>>>> On Tue, Jul 9, 2019 at 15:46 Gökhan IŞIK <skylightcoder at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi folks,
>>>>>>>>>> Because of a power outage, most of our compute nodes unexpectedly
>>>>>>>>>> shut down and now I cannot start our instances. The error message is
>>>>>>>>>> "Failed to get "write" lock. Is another process using the image?".
>>>>>>>>>> The instances' power status is No State. The full error log is at
>>>>>>>>>> http://paste.openstack.org/show/754107/. My environment is
>>>>>>>>>> OpenStack Pike on Ubuntu 16.04 LTS servers, and the instances are on
>>>>>>>>>> NFS shared storage. The Nova version is 16.1.6.dev2, the qemu version
>>>>>>>>>> is 2.10.1, and the libvirt version is 3.6.0. I saw a commit [1], but
>>>>>>>>>> it doesn't solve this problem.
>>>>>>>>>> There are important instances in my environment. How can I rescue
>>>>>>>>>> them? What would you suggest?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Gökhan
>>>>>>>>>>
>>>>>>>>>> [1] https://review.opendev.org/#/c/509774/
>>>>>>>>>>
>>>>>>>>>