[self-healing] live-migrate instance in response to fault signals
ekcs.openstack at gmail.com
Fri May 17 22:00:02 UTC 2019
On Thu, May 2, 2019 at 9:02 AM Daniel Speichert <daniel at speichert.pl> wrote:
> ----- Original Message -----
> > From: "Eric K" <ekcs.openstack at gmail.com>
> > To: "openstack-discuss" <openstack-discuss at lists.openstack.org>
> > Sent: Wednesday, May 1, 2019 4:59:57 PM
> > Subject: [self-healing] live-migrate instance in response to fault signals
> > I just want to follow up to get more info on the context;
> > specifically, which of the following pieces are the main difficulties?
> > - detecting the failure/soft-fail/early failure indication
> > - codifying how to respond to each failure scenario
> > - triggering/executing the desired workflow
> > - something else
> >  https://etherpad.openstack.org/p/DEN-self-healing-SIG
> We currently attempt to do all of the above using less-than-optimal custom
> scripts (using openstacksdk) and pipelines (running Ansible).
> I think there is tremendous value in developing at least one tested
> way to do all of the above by connecting e.g. Monasca, Mistral and Nova
> together. Maybe it's currently somewhat possible - then it's more of a
> documentation issue, and fixing it would benefit operators.
> One of the derivative issues is the quality of live-migration in Nova.
> (I don't have production-level experience with Rocky/Stein yet.)
> When we do lots of live migrations, there is obviously a limit on the number
> of live migrations happening at the same time (doing more would be
> counterproductive). These limits could be smarter/more dynamic in some cases.
> There is no immediate action item here right now though.
Any rough thoughts on which factors would be considered to decide an
appropriate dynamic limit? I'm assuming something to do with network
> I would like to begin with putting together all the pieces that currently
> work together and go from there - see what's missing.
I hope to make progress on this too. Mistral workflows (including
Ansible playbooks) can be triggered via API. What's needed then is a
mechanism to collect (pre-)failure data (Monasca perhaps) and a
mechanism that evaluates the data to decide what workflow/playbook to
trigger (Monasca does threshold evaluation and raises alarms, Congress
can process Monasca alarms and then make a contextual decision to trigger
The pieces starting from Monasca raising an alarm to Congress to
Mistral are in place (though they need to be better documented). But I am
less clear on the sources of raw data needed and how to collect them
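
To make the alarm -> decision -> workflow chain above a bit more concrete,
here is a rough Python sketch of just the decision step. It is only an
illustration under stated assumptions, not how Congress or Mistral
actually implement it: the alarm names, the workflow name, and the
simplified alarm dict are all hypothetical, and the request body is
modeled on my understanding of Mistral's v2 executions API
(workflow_name plus a JSON-encoded input), so check it against your
deployment before relying on it. Nothing here talks to a real service.

```python
import json

# Hypothetical mapping from alarm names to workflow names.
# None of these names come from a real deployment.
ALARM_TO_WORKFLOW = {
    "host_disk_smart_warning": "live_migrate_off_host",
    "host_ecc_errors_high": "live_migrate_off_host",
}


def select_workflow(alarm):
    """Return the workflow name for an alarm in ALARM state, else None."""
    if alarm.get("state") != "ALARM":
        return None
    return ALARM_TO_WORKFLOW.get(alarm.get("alarm_name"))


def affected_hostname(alarm):
    """Pull the hostname dimension from the first metric, if present.

    Assumes a simplified alarm shape: a "metrics" list whose entries
    carry a "dimensions" dict with a "hostname" key.
    """
    for metric in alarm.get("metrics", []):
        hostname = metric.get("dimensions", {}).get("hostname")
        if hostname:
            return hostname
    return None


def build_execution_request(alarm):
    """Build a request body for starting a workflow, or None to do nothing.

    The body shape (workflow_name + JSON-encoded input) is an assumption.
    """
    workflow = select_workflow(alarm)
    hostname = affected_hostname(alarm)
    if workflow is None or hostname is None:
        return None
    return {
        "workflow_name": workflow,
        "input": json.dumps({"hostname": hostname}),
    }


example_alarm = {
    "alarm_name": "host_disk_smart_warning",
    "state": "ALARM",
    "metrics": [{"name": "disk.smart", "dimensions": {"hostname": "compute-01"}}],
}
print(build_execution_request(example_alarm))
```

The resulting dict would then be sent to the workflow service by
whatever client you already use. Keeping the mapping in a plain table
makes it easy to review which fault signals lead to which response.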