Hi dasp,

Following up on today's discussion at the self-healing BoF. I think you wrote the following on the etherpad [1]:

====
Ability to drain (live migrate away) instances automatically in response to any failure/soft-fail/early failure indication (e.g. dropped packets, SMART disk status, issues with RBD connections, repeated build failures, etc)
Then quarantine, rebuild, self-test compute host (or hold for hardware fix)
Context: generally no clue what is running inside VMs (like public cloud)
====

I would like to get more information on the context. Specifically, which of the following pieces are the main difficulties?

- detecting the failure/soft-fail/early failure indication
- codifying how to respond to each failure scenario
- triggering/executing the desired workflow
- something else

[1] https://etherpad.openstack.org/p/DEN-self-healing-SIG