[self-healing] live-migrate instance in response to fault signals
Hi dasp,

Follow up on the discussion today at the self-healing BoF. I think you said on the etherpad [1]:

====
Ability to drain (live migrate away) instances automatically in response to any failure/soft-fail/early failure indication (e.g. dropped packets, SMART disk status, issues with RBD connections, repeated build failures, etc.)
Then quarantine, rebuild, self-test compute host (or hold for hardware fix)
Context: generally no clue what is running inside VMs (like public cloud)
====

I just want to follow up to get more info on the context; specifically, which of the following pieces are the main difficulties?
- detecting the failure/soft-fail/early failure indication
- codifying how to respond to each failure scenario
- triggering/executing the desired workflow
- something else

[1] https://etherpad.openstack.org/p/DEN-self-healing-SIG
----- Original Message -----
From: "Eric K" <ekcs.openstack@gmail.com>
To: "openstack-discuss" <openstack-discuss@lists.openstack.org>
Sent: Wednesday, May 1, 2019 4:59:57 PM
Subject: [self-healing] live-migrate instance in response to fault signals
...
I just want to follow up to get more info on the context; specifically, which of the following pieces are the main difficulties?
- detecting the failure/soft-fail/early failure indication
- codifying how to respond to each failure scenario
- triggering/executing the desired workflow
- something else
We currently attempt to do all of the above using less-than-optimal custom scripts (using openstacksdk) and pipelines (running Ansible).

I think there is tremendous value in developing at least one tested way to do all of the above by connecting e.g. Monasca, Mistral and Nova together. Maybe it's already somewhat possible - then it's more of a documentation issue that would benefit operators.

One of the derivative issues is the quality of live migration in Nova. (I don't have production-level experience with Rocky/Stein yet.) When we do lots of live migrations, there is obviously a limit on the number of live migrations happening at the same time (doing more would be counterproductive). These limits could be smarter/more dynamic in some cases. There is no immediate action item here right now though.

I would like to begin with putting together all the pieces that currently work together and go from there - see what's missing.

-Daniel
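For readers following along, the "custom scripts (using openstacksdk)" approach Daniel describes can be sketched roughly as below. This is a minimal illustration, not anyone's production code: the batching helper is pure Python, and the commented-out openstacksdk calls (`openstack.connect`, `conn.compute.live_migrate_server`) are shown only as the rough shape such a drain script would take; the cloud name and host values are hypothetical.

```python
# A minimal sketch of a host-drain script of the kind described above.
# The batching logic caps how many live migrations run at once, since
# doing too many concurrently is counterproductive (as noted in the thread).

def plan_drain(instance_ids, max_concurrent=2):
    """Split the instances on a failing host into sequential batches,
    so no more than max_concurrent live migrations run at a time."""
    batches = []
    for i in range(0, len(instance_ids), max_concurrent):
        batches.append(instance_ids[i:i + max_concurrent])
    return batches

# In a real script, each batch would be dispatched roughly like this
# (assumption: recent openstacksdk; 'mycloud' is a clouds.yaml entry):
#
#   import openstack
#   conn = openstack.connect(cloud='mycloud')
#   for server_id in batch:
#       conn.compute.live_migrate_server(server_id, host=None,
#                                        block_migration='auto')
#   # then poll conn.compute.get_server(server_id) until it leaves MIGRATING

if __name__ == "__main__":
    print(plan_drain(["vm-%d" % n for n in range(5)], max_concurrent=2))
```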
On Thu, May 2, 2019 at 9:02 AM Daniel Speichert <daniel@speichert.pl> wrote:
----- Original Message -----
From: "Eric K" <ekcs.openstack@gmail.com>
To: "openstack-discuss" <openstack-discuss@lists.openstack.org>
Sent: Wednesday, May 1, 2019 4:59:57 PM
Subject: [self-healing] live-migrate instance in response to fault signals
...
I just want to follow up to get more info on the context; specifically, which of the following pieces are the main difficulties?
- detecting the failure/soft-fail/early failure indication
- codifying how to respond to each failure scenario
- triggering/executing the desired workflow
- something else
We currently attempt to do all of the above using less-than-optimal custom scripts (using openstacksdk) and pipelines (running Ansible).
I think there is tremendous value in developing at least one tested way to do all of the above by connecting e.g. Monasca, Mistral and Nova together. Maybe it's already somewhat possible - then it's more of a documentation issue that would benefit operators.
Agreed.
One of the derivative issues is the quality of live migration in Nova. (I don't have production-level experience with Rocky/Stein yet.) When we do lots of live migrations, there is obviously a limit on the number of live migrations happening at the same time (doing more would be counterproductive). These limits could be smarter/more dynamic in some cases. There is no immediate action item here right now though.
Any rough thoughts on which factors would be considered to decide an appropriate dynamic limit? I'm assuming something to do with network traffic?
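One way to make Eric's question concrete: a dynamic limit could be derived from the bandwidth budget of the migration network, since pre-copy live migration is largely network-bound. The heuristic below is purely illustrative (all parameter names and default values are made up for this sketch, not anything Nova implements):

```python
# An illustrative heuristic for a dynamic concurrent-migration limit:
# bound concurrency by the share of migration-network bandwidth we are
# willing to spend, divided by a rough per-migration bandwidth estimate.

def dynamic_migration_limit(link_gbps, migration_share=0.5,
                            per_migration_gbps=2.0, hard_cap=8):
    """Number of simultaneous live migrations the network can absorb.

    link_gbps          -- NIC bandwidth on the migration network (Gbit/s)
    migration_share    -- fraction of the link reserved for migration traffic
    per_migration_gbps -- rough steady-state bandwidth one migration consumes
    hard_cap           -- absolute ceiling regardless of available bandwidth
    """
    budget = link_gbps * migration_share
    return max(1, min(hard_cap, int(budget // per_migration_gbps)))

print(dynamic_migration_limit(10))                        # 10G link -> 2
print(dynamic_migration_limit(25, migration_share=0.4))   # 25G link -> 5
```

In practice other signals would matter too (guest dirty-page rate, storage backend load for block migration), but they would slot into the same budget-style calculation.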
I would like to begin with putting together all the pieces that currently work together and go from there - see what's missing.
I hope to make progress on this too. Mistral workflows (including Ansible playbooks) can be triggered via API. What's needed then is a mechanism to collect (pre-)failure data (Monasca perhaps) and a mechanism that evaluates the data to decide which workflow/playbook to trigger (Monasca does threshold evaluation and raises alarms; Congress can process Monasca alarms and then make a contextual decision to trigger workflows/playbooks).

The pieces starting from Monasca raising an alarm through Congress to Mistral are in place (though they need to be better documented). But I am less clear on the sources of raw data needed and how to collect them in Monasca.
-Daniel
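The decision step in the pipeline Eric describes (alarm in, workflow choice out) can be sketched as a toy dispatcher. To be clear about assumptions: the alarm names and workflow names below are hypothetical examples, and in the thread's actual design this decision logic would live in Congress policy rather than ad-hoc Python.

```python
# A toy sketch of the decision step: map a Monasca alarm notification to a
# Mistral workflow to trigger. Alarm and workflow names are invented for
# illustration; the thread proposes Congress for this role in production.

ALARM_TO_WORKFLOW = {
    "smart_disk_warning":    "quarantine_and_drain_host",
    "rbd_connection_errors": "drain_host",
    "packet_loss_high":      "drain_host",
}

def choose_workflow(alarm):
    """Pick a workflow name for an ALARM-state notification, or None."""
    if alarm.get("state") != "ALARM":
        return None                      # OK/UNDETERMINED: nothing to do
    return ALARM_TO_WORKFLOW.get(alarm.get("name"))

# A real trigger would then POST to Mistral's executions API with the chosen
# workflow name and the failing hostname taken from the alarm's dimensions.

print(choose_workflow({"name": "smart_disk_warning", "state": "ALARM"}))
```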