On Thu, 2023-04-06 at 11:19 +0200, Dmitriy Rabotyagov wrote:
I think I just came up with another "use case", or better said, missing functionality. If a VM is stuck in the `unshelving` state, for example due to messaging issues, there's no clean way of recovering the VM from this state.
rebuild would not be correct to use there.
Given you will reset state to active
that is not safe to do. the correct fix would be to reset it to shelved_offloaded, which you currently would have to do in the db.
- you won't be able to execute `stop`, since the VM is not assigned to any compute host (it will fail with "instance not ready"), as it was shelved. So then rebuild could be used, since it would get the VM assigned to some host as a result. Another way around would of course be updating the database, setting the VM back to `shelved_offloaded` and trying to unshelve again, but I hate messing with the DB.
I think this kinda brings me back to Sean's point of having an API call to re-create a VM while keeping its data, as that would cover such corner cases as well.
well, we have talked in the past about allowing reset-state to reset to other states, or allowing evacuate to work. i probably would not allow the recreate api to work in that broken state.
the recreate api was not intended for error recovery. it was intended to fulfil two use cases:
1.) unify rebuild and resize so you can do either or both from a single api call.
2.) update your vm so that it gets the latest flavor extra_specs and image properties applied without data loss.
Tue, 21 Mar 2023 at 15:59, Dan Smith <dms@danplanet.com>:
Basically they have an additional and unusual compute host recovery process, where a compute host after a failure is brought back by the same name. Then they rebuild the servers on the same compute host where the servers were running before. When the server's disk was backed by a volume, so its content was not lost in the compute host failure, they don't want to lose it in the recovery process either. The evacuate operation clearly would be a better fit to do this, but that disallows evacuating to the "same" host. For a long time rebuild just allowed "evacuating to the same host". So they went with it.
Aside from the "should this be possible" question, is rebuild even required in this case?
if your vm is boot from volume, or you are using the ceph image backend for nova, or nova on nfs, then i think all that is required is a hard reboot. there are no port updates/bindings, and hard reboot both plugs the network interface into ovs (or whatever the backend is) on the host and also invokes os-brick to do the same for the volumes. so it's not clear to me why rebuild would be required in a shared storage case.
For the non-volume-backed instances, we need rebuild to re-download the image and create the root disk.
yes, although when you had the hardware failure you could have used evacuate to rebuild the vm on another host. if you could not do that because the vm was pinned to that host, then the existing rebuild command is sufficient. if the failure was a motherboard or similar and the data on disk was not lost, then a hard reboot should also be enough for vms with local storage. rebuild would only be required if the data was lost.
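The reasoning in the replies above can be condensed into a small decision helper. This is only a sketch of the discussion, not anything nova implements: the function names, parameters and returned strings are made up for illustration.

```python
def disk_data_survived(volume_backed: bool, shared_storage: bool,
                       local_disk_intact: bool) -> bool:
    """Return True if the instance's root disk content is still available.

    Hypothetical helper: 'shared_storage' stands for the ceph image backend
    or nova on NFS, as mentioned in the thread.
    """
    return volume_backed or shared_storage or local_disk_intact


def recovery_action(data_survived: bool, pinned_to_host: bool) -> str:
    """Pick a recovery operation after a compute host failure (sketch only)."""
    if data_survived:
        # Data is still there: hard reboot should re-plug ports and volumes.
        return "hard reboot"
    if pinned_to_host:
        # Data is gone and the vm cannot move: rebuild on the same host.
        return "rebuild"
    # Data is gone and the vm may move: evacuate to another host.
    return "evacuate"


if __name__ == "__main__":
    survived = disk_data_survived(volume_backed=True, shared_storage=False,
                                  local_disk_intact=False)
    print(recovery_action(survived, pinned_to_host=False))
```

Under these assumptions a boot-from-volume instance always lands on "hard reboot", which is the point being made: rebuild only enters the picture once the data is actually lost.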
If it's really required for volume-backed instances, I'm guessing there's just some trivial amount of state that isn't in place on recovery that the rebuild "solves". It is indeed a very odd fringe use-case that is an obvious mis-use of the function.
ya, if hard reboot/power on is not enough, i think there is a trivial bug there. we are obviously missing something that should be done. power_on/hard reboot are intended to be able to recreate the vm with its data after the host has been powered off and powered on again. so it is meant to do everything required to be able to start the instance. nova has all the info in its database to do that without needing to call other services like cinder and neutron.
it would be good to know what actually fails if you just do a hard reboot, and to capture that in a bug report.
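A rough sketch of how that information could be collected with the openstack client, assuming the `openstack` CLI is installed and credentials are already loaded in the environment. The helper names are hypothetical, and the server name or UUID is whatever applies to your deployment.

```python
import subprocess


def hard_reboot_debug_commands(server: str) -> list[list[str]]:
    """Build the CLI calls: hard reboot, then dump the server state and events."""
    return [
        ["openstack", "server", "reboot", "--hard", "--wait", server],
        ["openstack", "server", "show", server, "-f", "json"],
        ["openstack", "server", "event", "list", server],
    ]


def run_and_capture(server: str) -> str:
    """Run each command and return the combined output for a bug report."""
    report = []
    for cmd in hard_reboot_debug_commands(server):
        result = subprocess.run(cmd, capture_output=True, text=True)
        report.append("$ " + " ".join(cmd))
        report.append(result.stdout + result.stderr)
    return "\n".join(report)
```

Calling `run_and_capture("my-server")` would return text that can be pasted into the report. The nova-compute log on the affected host is likely still needed as well, since that is usually where the actual traceback ends up.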
At the moment I did not find a prohibition in the documentation to bring back a failed compute host by the same name. If I missed it or this is not recommended for any reason, please let me know.
I'm not sure why this would be specifically documented, but since compute nodes are not fully stateless, your scenario is basically "delete part of the state of the system and expect things to keep working" which I don't think is reasonable (nor something we should need to document).
Your scenario is basically the same as one where your /var/lib/nova is mounted on a disk that doesn't come up after reboot, or on NFS that was unavailable at boot. If nova were to say "meh, a bunch of state disappeared, I must be a rebuilt compute host" then it would potentially destroy (or desynchronize) actual state in other nodes (i.e. the database) for a transient/accidental situation. TBH, we might should even explicitly *block* rebuild on an instance that appears to be missing its on-disk state to avoid users, who don't know the state of the infra, from doing this to try to unblock their instances while ops are doing maintenance.
I will point out that bringing back a compute node under the same name (without cleaning the residue first) is strikingly similar to renaming a compute host, which we do *not* support. As of Antelope, the compute node would detect your scenario as a potential rename and refuse to start, again because of state that has been lost in the system. So just FYI that an actual blocker to your scenario is coming :)
Clearly in many clouds evacuating can fully replace what they do here. I believe they may have chosen this unusual compute host recovery option to have some kind of recovery process for very small deployments, where you don't always have space to evacuate before you rebuild the failed compute host. And this collided with a deployment system which reuses host names.
At this point I'm not sure if this really belongs to the rebuild operation. Could easily be better addressed in evacuate. Or in the deployment system not reusing hostnames.
Evacuate can't work for this case either because it requires the compute node to be down to perform. As you note, bringing it back under a different name would solve that problem. However, neither "evacuate to same host" nor "use rebuild for this recovery procedure" is reasonable, IMHO.
--Dan