[openstack-dev] [vitrage] error handling

Afek, Ifat (Nokia - IL/Kfar Sava) ifat.afek at nokia.com
Thu Jun 1 14:44:47 UTC 2017


Hi Yujun,

Indeed, during the initialization phase it might be beneficial to make sure the user is aware of configuration problems (although I’m not sure that crashing is the solution). The problem is that the same code is executed both in initialization and later on, so telling the difference is not trivial.

So for now we agree that we need to add a UI for configuration information and datasources status.

Best Regards,
Ifat.

From: "Yujun Zhang (ZTE)" <zhangyujun+zte at gmail.com>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
Date: Tuesday, 30 May 2017 at 11:50
To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
Subject: Re: [openstack-dev] [vitrage] error handling

On Tue, May 30, 2017 at 3:59 PM Afek, Ifat (Nokia - IL/Kfar Sava) <ifat.afek at nokia.com> wrote:
Hi Yujun,

You started an interesting discussion. I think that the distinction between an operational error and a programmer error is correct and we should always keep that in mind.

I agree that having an overall design for error handling in Vitrage is a good idea, but I disagree that until then we had better let it crash.

I think that Vitrage is made out of many pieces that don’t necessarily depend on one another. For example, if one datasource fails, everything else can work as usual – so why crash? Similarly, if one template fails to load, all other templates can still be activated.
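The isolation idea above could be sketched roughly as follows. This is only an illustration, not Vitrage code: `start_all`, the component names and the starter callables are all hypothetical.

```python
import logging

LOG = logging.getLogger(__name__)


def start_all(starters):
    """Start each component independently; return (started, failed).

    `starters` maps a hypothetical component name to a zero-argument
    callable that starts it.
    """
    started, failed = [], {}
    for name, start in starters.items():
        try:
            start()
        except Exception as exc:  # one failing component must not stop the rest
            LOG.exception("Failed to start %s", name)
            failed[name] = exc
        else:
            started.append(name)
    return started, failed


def _broken_zabbix():
    # Stand-in for a misconfigured datasource
    raise ConnectionError("zabbix url is not configured")


started, failed = start_all({
    "nova.host": lambda: None,
    "zabbix": _broken_zabbix,
    "static": lambda: None,
})
```

Here the failure is logged with its traceback and recorded in `failed`, while the other two hypothetical components start normally.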

This usually, if not always, happens during the initialization phase, doesn't it? That phase is supervised by a human, so such errors should be detected during deployment or user acceptance testing. So if something fails, it is better to isolate the problem before continuing to run, e.g. correct the invalid template or data source configuration, or remove the template and disable the data source. Such errors are permanent and won't recover automatically.

Here we need to distinguish the case where a data source is temporarily unavailable, because of a network connection issue or because the data source is not up yet. In this case, I agree we had better start the rest of the components and retry periodically until it recovers.
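A minimal sketch of that distinction, assuming temporary failures are signalled by a dedicated exception type. `TemporaryDatasourceError` and `call_with_retry` are hypothetical names, and the retry here is bounded rather than running until recovery:

```python
import time


class TemporaryDatasourceError(Exception):
    """Hypothetical marker for errors that may go away on their own."""


def call_with_retry(func, attempts=3, delay=0.0, sleep=time.sleep):
    """Call func(); retry only on TemporaryDatasourceError.

    Any other exception is treated as permanent and propagates at once.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TemporaryDatasourceError:
            if attempt == attempts:
                raise  # give up and let the caller decide
            sleep(delay)


calls = []


def flaky_get_all():
    # Stand-in for a datasource that is not up during the first two calls
    calls.append(1)
    if len(calls) < 3:
        raise TemporaryDatasourceError("not up yet")
    return ["host-1"]


result = call_with_retry(flaky_get_all, attempts=5, delay=0.0)
```

A permanent problem such as an invalid template would raise some other exception type and skip the retry loop entirely.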

Another aspect is that the main purpose of Vitrage is to provide insights. In case of a failure in one datasource/template, some of the insights might be missing. But this will not lead to inaccurate behavior or to wrong actions being executed in the system. IMO, we should give the user as much information as possible given that we have only part of the input.

I agree, if enough insights can be provided by the running system. We can improve the handling of permanent errors. Even better would be support for hot loading of components and templates.

What I don't like much is that errors are sometimes handled without enough detail. In that case, a crash with a stack trace is more useful than a user-"friendly" message like "failed to start xxx component" or "invalid configuration file". (I'm not talking about Vitrage; this is quite common in many projects.)
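One way to keep the detail without crashing is to log the traceback together with the friendly message. A small sketch using the standard `logging` module; the logger name, `load_config` and the path are made up for the example:

```python
import io
import logging

# Capture log output in memory so the example is self-contained
stream = io.StringIO()
LOG = logging.getLogger("error_detail_demo")
LOG.addHandler(logging.StreamHandler(stream))
LOG.propagate = False


def load_config(path):
    # Stand-in for a real failure while reading a configuration file
    raise FileNotFoundError(path)


try:
    load_config("/etc/example/example.conf")
except OSError:
    # LOG.exception logs at ERROR level and appends the full traceback
    LOG.exception("invalid configuration file")

output = stream.getvalue()
```

The logged text then contains both the short message and the traceback that points at the failing call.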

My preference is "good error handling" > "no error handling" > "bad error handling", though it is difficult to tell good error handling from bad...

Regarding the use cases that you mentioned:


  1.  invalid configuration file
[Ifat] This should depend on the specific configuration. If keystone is misconfigured, nothing will work of course. But if for example Zabbix is misconfigured, Vitrage should work and show the topology and the non-Zabbix alarms.

Agree. It should be handled differently depending on the kind of error and how critical it is.


  2.  failed to communicate with data source
[Ifat] I think that the error should be logged, and all other datasources should work as usual.

Yes, and it would be good to have a retry mechanism


  3.  malformed data from data source

[Ifat] I think that the error should be logged, and all other datasources should work as usual. This problem means we must modify the code in the datasource itself, but until then Vitrage should work, right?
Yes, I think this can happen when the data source version changes. We should discard the data and indicate the error; the other parts should not be affected.


  4.  failed to execute an action
[Ifat] Again, that’s a problem that requires code changes; but why fail other actions?

What I meant here is a temporary failure, e.g. when you try to mark a host down but cannot reach it because of a network connection issue or for other reasons.


  5.  ...
BTW, it might be a good idea to add API/UI for showing the configuration and the status of the datasources. We all know that errors in the log files are often ignored…

Sure, the errors I mentioned above are what system operators could encounter even with a correct configuration, and they are not related to software bugs. Displaying them in the UI would be very helpful. The log files are more for engineers to analyse the root cause.
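As a rough illustration of what an API/UI could expose, operational errors might be recorded per datasource in a small registry. Everything here (`Status`, `StatusRegistry`, the datasource names and messages) is hypothetical and not an existing Vitrage interface:

```python
import enum


class Status(enum.Enum):
    OK = "ok"
    TEMPORARY_ERROR = "temporary_error"
    PERMANENT_ERROR = "permanent_error"


class StatusRegistry:
    """Keeps the latest reported status of each datasource."""

    def __init__(self):
        self._status = {}

    def report(self, name, status, detail=""):
        self._status[name] = {"status": status.value, "detail": detail}

    def as_dict(self):
        # Return a copy that an API handler could serialize for the UI
        return {name: dict(entry) for name, entry in self._status.items()}


registry = StatusRegistry()
registry.report("nova.host", Status.OK)
registry.report("zabbix", Status.TEMPORARY_ERROR, "connection refused")
snapshot = registry.as_dict()
```

The operator would see the `snapshot` content in the UI, while the full traceback stays in the log files for engineers.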

Best Regards,
Ifat.


From: "Yujun Zhang (ZTE)" <zhangyujun+zte at gmail.com>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
Date: Monday, 29 May 2017 at 16:13
To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
Subject: [openstack-dev] [vitrage] error handling

Prompted by a recent code review, I think the error handling rules are worth a thorough discussion.

I once read an article[1] from Joyent, and I was impressed by its distinction between operational errors and programmer errors. The article is written for Node.js, but the principle also applies to other programming languages.

The basic rules recommended by Joyent are:

  *   Handling operational errors
  *   (Not) handling programmer errors

There is also one rule in the OpenStack style guideline[2] close to this idea.

[H201] Do not write except:, use except Exception: at the very least. When catching an exception you should be as specific so you don’t mistakenly catch unexpected exceptions.
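To illustrate the rule, here is a small sketch. `parse_port` and `parse_port_bad` are hypothetical helpers written for this example:

```python
def parse_port(value):
    """Return the port as an int, or None if value is not a valid number."""
    try:
        return int(value)
    except (TypeError, ValueError):
        # Only the failures expected from int(); anything else propagates
        return None


def parse_port_bad(value):
    """Same intent, but violates H201."""
    try:
        return int(value)
    except:  # noqa: E722 - also swallows KeyboardInterrupt and SystemExit
        return None
```

With the specific version, an unexpected exception (a programmer error) still surfaces with its traceback instead of being silently turned into `None`.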

I do think that until we have well-designed error handling, it is better to let it crash. It is dangerous to hide errors and keep the system running in an undetermined state.

So the question is: what kinds of operational errors are we facing in Vitrage? I can think of something like:

  1.  invalid configuration file
  2.  failed to communicate with data source
  3.  malformed data from data source
  4.  failed to execute an action
  5.  ...
Maybe this could be the first step for the error handling design.

[1]: https://www.joyent.com/node-js/production/design/errors
[2]: https://docs.openstack.org/developer/hacking/

--
Yujun Zhang