[self-healing] live-migrate instance in response to fault signals
Hi dasp,

Follow up on the discussion today at the self-healing BoF. I think you said on the etherpad [1]:

====
Ability to drain (live migrate away) instances automatically in response to any failure/soft-fail/early failure indication (e.g. dropped packets, SMART disk status, issues with RBD connections, repeated build failures, etc.)
Then quarantine, rebuild, self-test compute host (or hold for hardware fix)
Context: generally no clue what is running inside VMs (like public cloud)
====

I just want to follow up to get more info on the context; specifically, which of the following pieces are the main difficulties?
- detecting the failure/soft-fail/early failure indication
- codifying how to respond to each failure scenario
- triggering/executing the desired workflow
- something else

[1] https://etherpad.openstack.org/p/DEN-self-healing-SIG
----- Original Message -----
From: "Eric K" <ekcs.openstack@gmail.com>
To: "openstack-discuss" <openstack-discuss@lists.openstack.org>
Sent: Wednesday, May 1, 2019 4:59:57 PM
Subject: [self-healing] live-migrate instance in response to fault signals
...
I just want to follow up to get more info on the context; specifically, which of the following pieces are the main difficulties?
- detecting the failure/soft-fail/early failure indication
- codifying how to respond to each failure scenario
- triggering/executing the desired workflow
- something else
We currently attempt to do all of the above using less-than-optimal custom scripts (using openstacksdk) and pipelines (running Ansible).

I think there is tremendous value in developing at least one tested way to do all of the above by connecting e.g. Monasca, Mistral and Nova together. Maybe it's already somewhat possible - then it's more of a documentation issue that would benefit operators.

One of the derivative issues is the quality of live migration in Nova. (I don't have production-level experience with Rocky/Stein yet.) When we do lots of live migrations, there is obviously a limit on the number of live migrations happening at the same time (doing more would be counterproductive). These limits could be smarter/more dynamic in some cases. There is no immediate action item here right now though.

I would like to begin with putting together all the pieces that currently work together and go from there - see what's missing.

-Daniel
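For readers following along, the "custom scripts (using openstacksdk)" approach Daniel describes can be sketched roughly as below. This is a minimal illustration, not anyone's production code: the batching helper is pure Python, and the commented-out openstacksdk calls (`openstack.connect`, `conn.compute.live_migrate_server`) are shown only as the rough shape such a drain script would take; the cloud name and host values are hypothetical.

```python
# A minimal sketch of a host-drain script of the kind described above.
# The batching logic caps how many live migrations run at once, since
# doing too many concurrently is counterproductive (as noted in the thread).

def plan_drain(instance_ids, max_concurrent=2):
    """Split the instances on a failing host into sequential batches,
    so no more than max_concurrent live migrations run at a time."""
    batches = []
    for i in range(0, len(instance_ids), max_concurrent):
        batches.append(instance_ids[i:i + max_concurrent])
    return batches

# In a real script, each batch would be dispatched roughly like this
# (assumption: recent openstacksdk; 'mycloud' is a clouds.yaml entry):
#
#   import openstack
#   conn = openstack.connect(cloud='mycloud')
#   for server_id in batch:
#       conn.compute.live_migrate_server(server_id, host=None,
#                                        block_migration='auto')
#   # then poll conn.compute.get_server(server_id) until it leaves MIGRATING

if __name__ == "__main__":
    print(plan_drain(["vm-%d" % n for n in range(5)], max_concurrent=2))
```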
On Thu, May 2, 2019 at 9:02 AM Daniel Speichert <daniel@speichert.pl> wrote:
----- Original Message -----
From: "Eric K" <ekcs.openstack@gmail.com>
To: "openstack-discuss" <openstack-discuss@lists.openstack.org>
Sent: Wednesday, May 1, 2019 4:59:57 PM
Subject: [self-healing] live-migrate instance in response to fault signals
...
I just want to follow up to get more info on the context; specifically, which of the following pieces are the main difficulties?
- detecting the failure/soft-fail/early failure indication
- codifying how to respond to each failure scenario
- triggering/executing the desired workflow
- something else
We currently attempt to do all of the above using less-than-optimal custom scripts (using openstacksdk) and pipelines (running Ansible).
I think there is tremendous value in developing at least one tested way to do all of the above by connecting e.g. Monasca, Mistral and Nova together. Maybe it's already somewhat possible - then it's more of a documentation issue that would benefit operators.
Agreed.
One of the derivative issues is the quality of live migration in Nova. (I don't have production-level experience with Rocky/Stein yet.) When we do lots of live migrations, there is obviously a limit on the number of live migrations happening at the same time (doing more would be counterproductive). These limits could be smarter/more dynamic in some cases. There is no immediate action item here right now though.
Any rough thoughts on which factors would be considered to decide an appropriate dynamic limit? I'm assuming something to do with network traffic?
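One way to make Eric's question concrete: a dynamic limit could be derived from the bandwidth budget of the migration network, since pre-copy live migration is largely network-bound. The heuristic below is purely illustrative (all parameter names and default values are made up for this sketch, not anything Nova implements):

```python
# An illustrative heuristic for a dynamic concurrent-migration limit:
# bound concurrency by the share of migration-network bandwidth we are
# willing to spend, divided by a rough per-migration bandwidth estimate.

def dynamic_migration_limit(link_gbps, migration_share=0.5,
                            per_migration_gbps=2.0, hard_cap=8):
    """Number of simultaneous live migrations the network can absorb.

    link_gbps          -- NIC bandwidth on the migration network (Gbit/s)
    migration_share    -- fraction of the link reserved for migration traffic
    per_migration_gbps -- rough steady-state bandwidth one migration consumes
    hard_cap           -- absolute ceiling regardless of available bandwidth
    """
    budget = link_gbps * migration_share
    return max(1, min(hard_cap, int(budget // per_migration_gbps)))

print(dynamic_migration_limit(10))                        # 10G link -> 2
print(dynamic_migration_limit(25, migration_share=0.4))   # 25G link -> 5
```

In practice other signals would matter too (guest dirty-page rate, storage backend load for block migration), but they would slot into the same budget-style calculation.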
I would like to begin with putting together all the pieces that currently work together and go from there - see what's missing.
I hope to make progress on this too. Mistral workflows (including Ansible playbooks) can be triggered via API. What's needed then is a mechanism to collect (pre-)failure data (Monasca perhaps) and a mechanism that evaluates the data to decide which workflow/playbook to trigger (Monasca does threshold evaluation and raises alarms; Congress can process Monasca alarms and then make a contextual decision to trigger workflows/playbooks).

The pieces starting from Monasca raising an alarm through Congress to Mistral are in place (though they need to be better documented). But I am less clear on the sources of raw data needed and how to collect them in Monasca.
-Daniel
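The decision step in the pipeline Eric describes (alarm in, workflow choice out) can be sketched as a toy dispatcher. To be clear about assumptions: the alarm names and workflow names below are hypothetical examples, and in the thread's actual design this decision logic would live in Congress policy rather than ad-hoc Python.

```python
# A toy sketch of the decision step: map a Monasca alarm notification to a
# Mistral workflow to trigger. Alarm and workflow names are invented for
# illustration; the thread proposes Congress for this role in production.

ALARM_TO_WORKFLOW = {
    "smart_disk_warning":    "quarantine_and_drain_host",
    "rbd_connection_errors": "drain_host",
    "packet_loss_high":      "drain_host",
}

def choose_workflow(alarm):
    """Pick a workflow name for an ALARM-state notification, or None."""
    if alarm.get("state") != "ALARM":
        return None                      # OK/UNDETERMINED: nothing to do
    return ALARM_TO_WORKFLOW.get(alarm.get("name"))

# A real trigger would then POST to Mistral's executions API with the chosen
# workflow name and the failing hostname taken from the alarm's dimensions.

print(choose_workflow({"name": "smart_disk_warning", "state": "ALARM"}))
```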