Thanks Feilong and Sven.
> If so, cluster resize should be able to bring the cluster back. And you can just resize the cluster to the current node number. For that case, magnum should be able to fix the heat stack.
I thought this too. But when I try and run "check stack" under heat it fails. The log for this failure is that the resource is missing, ie one of the nodes is not there (which I know about).
I tried the cluster resize from horizon, to resize the cluster to the valid size / current size (without the additional node failure which is not there) and horizon immediately fails this with a red error in the corner of the web page. There's no log printed within magnum or heat logs at all. And the horizon error is not really helpful with error "Error: Unable to resize given cluster id: 1a8e1ed9-64b3-41b1-ab11-0f01e66da1d7.".
> Are you using the magnum auto healing feature by chance?
The "repair unhealthy nodes" option was chosen for this I believe. But I didnt set up the cluster so I am not sure.
Based on your replies, I discovered how to initiate the cluster resize using the CLI. After issuing the command, the missing node was rebuilt immediately. This then appears like some sort of issue with horizon only.
I wanted to get the resized cluster operating successfully before I replied, but though it re-deployed the missing node, the cluster resize went timed out and failed. Aside from a quick 30 min investigation on this I've not been able to do much more with that and it's been abandoned.
Thanks all the same for your help.