<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Tue, May 30, 2017 at 3:59 PM Afek, Ifat (Nokia - IL/Kfar Sava) &lt;<a href="mailto:ifat.afek@nokia.com">ifat.afek@nokia.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="white" lang="EN-US" link="blue" vlink="purple">
<div class="m_6319999045706068691WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri">Hi Yujun,<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri">You started an interesting discussion. I think that the distinction between an operational error and a programmer error is correct and we should always keep that in mind.
<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri">I agree that having an overall design for error handling in Vitrage is a good idea, but I disagree that until then we had better let it crash.
<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri">I think that Vitrage is made out of many pieces that don’t necessarily depend on one another. For example, if one datasource fails, everything else can work as usual – so why crash? Similarly,
if one template fails to load, all other templates can still be activated.</span></p></div></div></blockquote><div><br></div><div>This usually, if not always, happens during the initialization phase, doesn't it? That phase is supervised by a human, so such failures should be detected during deployment or user acceptance testing. If something fails, it is better to isolate the problem before continuing to run, e.g. correct the invalid template or data source configuration, or remove the template and disable the data source. Such errors are permanent and will not recover automatically.</div><div><br></div><div>Here we need to distinguish the case where a data source is temporarily unavailable because of a network connection issue or because the data source is not up yet. In this case, I agree we had better start the remaining components and retry periodically until the data source recovers.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="white" lang="EN-US" link="blue" vlink="purple"><div class="m_6319999045706068691WordSection1"><p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri"> <u></u>
<u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri">Another aspect is that the main purpose of Vitrage is to provide insights. In case of a failure in one datasource/template, some of the insights might be missing. But this will not lead
to inaccurate behavior or to wrong actions being executed in the system. IMO, we should give the user as much information as possible given that we have only part of the input.</span></p></div></div></blockquote><div> </div><div><span style="font-family:Calibri;font-size:11pt">I agree, provided the running system can still deliver enough insights. We can improve the handling of permanent errors. Even better would be to support hot loading of components and templates.</span></div><div><span style="font-family:Calibri;font-size:11pt"><br></span></div><div><span style="font-family:Calibri;font-size:11pt">What I don't like much is when errors are handled without enough detail. In that case, a crash with a stack trace is more useful than a user "friendly" message such as "failed to start xxx component" or "invalid configuration file". (I'm not talking about Vitrage specifically; this is quite common in many projects.)</span></div><div><br></div><div>My preference is "good error handling" &gt; "no error handling" &gt; "bad error handling", though it is difficult to tell good error handling from bad...</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="white" lang="EN-US" link="blue" vlink="purple"><div class="m_6319999045706068691WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri">Regarding the use cases that you mentioned:<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri"><u></u> <u></u></span></p>
<ol style="margin-top:0cm" start="1" type="1">
<li class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri">invalid configuration file<u></u><u></u></span></li></ol>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt;font-family:Calibri">[Ifat] This should depend on the specific configuration. If keystone is misconfigured, nothing will work of course. But if for example Zabbix is misconfigured,
Vitrage should work and show the topology and the non-Zabbix alarms.</span></p></div></div></blockquote><div><br></div><div>Agreed. How an error is handled should depend on what kind of error it is and how critical it is.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="white" lang="EN-US" link="blue" vlink="purple"><div class="m_6319999045706068691WordSection1">
<ol style="margin-top:0cm" start="2" type="1">
<li class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri">failed to communicate with data source<u></u><u></u></span></li></ol>
</div></div><div bgcolor="white" lang="EN-US" link="blue" vlink="purple"><div class="m_6319999045706068691WordSection1"><p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt;font-family:Calibri">[Ifat] I think that the error should be logged, and all other datasources should work as usual.</span></p></div></div></blockquote><div><br></div><div>Yes, and it would be good to have a retry mechanism.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="white" lang="EN-US" link="blue" vlink="purple"><div class="m_6319999045706068691WordSection1">
<ol style="margin-top:0cm" start="3" type="1">
<li class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri">malformed data from data source<u></u><u></u></span></li></ol>
</div></div><div bgcolor="white" lang="EN-US" link="blue" vlink="purple"><div class="m_6319999045706068691WordSection1"><p class="m_6319999045706068691MsoListParagraph"><span style="font-size:11.0pt;font-family:Calibri">[Ifat] I think that the error should be logged, and all other datasources should work as usual. This problem means we must modify the code in the datasource itself, but until then
Vitrage should work, right?</span></p></div></div></blockquote><div>Yes. I think this can happen when the data source version changes; we should discard the malformed data and report the error. The other parts should not be affected.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="white" lang="EN-US" link="blue" vlink="purple"><div class="m_6319999045706068691WordSection1">
<ol style="margin-top:0cm" start="4" type="1">
<li class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri">failed to execute an action<u></u><u></u></span></li></ol>
</div></div><div bgcolor="white" lang="EN-US" link="blue" vlink="purple"><div class="m_6319999045706068691WordSection1"><p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt;font-family:Calibri">[Ifat] Again, that’s a problem that requires code changes; but why fail other actions?</span></p></div></div></blockquote><div><br></div><div>What I meant here is a temporary failure, e.g. when you try to mark a host down but cannot reach it because of a network connection issue or for other reasons.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="white" lang="EN-US" link="blue" vlink="purple"><div class="m_6319999045706068691WordSection1">
<ol style="margin-top:0cm" start="5" type="1">
<li class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri">...</span> </li></ol>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri">BTW, it might be a good idea to add an API/UI for showing the configuration and the status of the datasources. We all know that errors in the log files are often ignored…</span></p></div></div></blockquote><div><br></div><div>Sure. The errors I mentioned above are ones that system operators could encounter even with a correct configuration; they are not related to software bugs. Displaying them in the UI would be very helpful. The log files are more for engineers analysing the root cause.</div><div> <span style="font-family:Calibri;font-size:11pt"> </span></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="white" lang="EN-US" link="blue" vlink="purple"><div class="m_6319999045706068691WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri">Best Regards,<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri">Ifat.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Calibri"><u></u> <u></u></span></p>
<div style="border:none;border-top:solid #b5c4df 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span style="font-family:Calibri;color:black">From: </span>
</b><span style="font-family:Calibri;color:black">"Yujun Zhang (ZTE)" &lt;<a href="mailto:zhangyujun%2Bzte@gmail.com" target="_blank">zhangyujun+zte@gmail.com</a>&gt;<br>
<b>Reply-To: </b>"OpenStack Development Mailing List (not for usage questions)" &lt;<a href="mailto:openstack-dev@lists.openstack.org" target="_blank">openstack-dev@lists.openstack.org</a>&gt;<br>
<b>Date: </b>Monday, 29 May 2017 at 16:13<br>
<b>To: </b>"OpenStack Development Mailing List (not for usage questions)" &lt;<a href="mailto:openstack-dev@lists.openstack.org" target="_blank">openstack-dev@lists.openstack.org</a>&gt;<br>
<b>Subject: </b>[openstack-dev] [vitrage] error handling<u></u><u></u></span></p>
</div></div></div><div bgcolor="white" lang="EN-US" link="blue" vlink="purple"><div class="m_6319999045706068691WordSection1">
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Prompted by a recent code review, I think the error handling rules are worth a thorough discussion.
<u></u><u></u></p>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">I once read an article [1] from Joyent, and its distinction between
<b>operational</b> errors and <b>programmer</b> errors impressed me. The article is written for Node.js, but the principle also applies to other programming languages.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">The basic rules recommended by Joyent are:<u></u><u></u></p>
</div>
<div>
<h3 style="margin-right:0cm;margin-bottom:8.25pt;margin-left:0cm;box-sizing:border-box">
<span style="font-size:21.0pt;font-family:Helvetica;color:#333333;font-weight:normal">Handling operational errors<u></u><u></u></span></h3>
</div>
<div>
<h3 style="margin-right:0cm;margin-bottom:8.25pt;margin-left:0cm;box-sizing:border-box">
<span style="font-size:21.0pt;font-family:Helvetica;color:#333333;font-weight:normal">(Not) handling programmer errors<u></u><u></u></span></h3>
</div>
<div>
<p class="MsoNormal">There is also a rule in the OpenStack style guidelines [2] that is close to this idea.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:Arial;color:#3e4349">[H201] Do not write<span class="m_6319999045706068691inbox-inbox-apple-converted-space"> </span></span><span class="m_6319999045706068691inbox-inbox-pre"><span style="font-size:10.0pt;font-family:"Courier New";color:#3e4349">except:</span></span><span style="font-size:11.0pt;font-family:Arial;color:#3e4349">,
use<span class="m_6319999045706068691inbox-inbox-apple-converted-space"> </span></span><span class="m_6319999045706068691inbox-inbox-pre"><span style="font-size:10.0pt;font-family:"Courier New";color:#3e4349">except</span></span><span class="m_6319999045706068691inbox-inbox-apple-converted-space"><span style="font-size:10.0pt;font-family:"Courier New";color:#3e4349"> </span></span><span class="m_6319999045706068691inbox-inbox-pre"><span style="font-size:10.0pt;font-family:"Courier New";color:#3e4349">Exception:</span></span><span class="m_6319999045706068691inbox-inbox-apple-converted-space"><span style="font-size:11.0pt;font-family:Arial;color:#3e4349"> </span></span><span style="font-size:11.0pt;font-family:Arial;color:#3e4349">at
the very least. When catching an exception you should be as specific so you don’t mistakenly catch unexpected exceptions.</span><u></u><u></u></p>
</div>
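<div>
<p class="MsoNormal">A minimal sketch of what H201 asks for, not taken from the Vitrage code base. The function names are hypothetical, and JSON parsing from the standard library stands in for template loading purely as an assumption for illustration. The first function catches only the error it expects and keeps the details; the second shows the bare <b>except:</b> pattern that the rule warns against.</p>
</div>

```python
import json


def load_template(text):
    """Parse a template, handling only the error we expect and can explain.

    Hypothetical helper: JSON is a stand-in for the real template format.
    """
    try:
        return json.loads(text)
    except ValueError as exc:
        # Operational error: malformed input. Keep the details in the report.
        return {"error": "invalid template: %s" % exc}


def load_template_bare(text):
    """Anti-pattern flagged by H201: a bare except also hides programmer errors."""
    try:
        return json.loads(text)
    except:  # noqa: E722 - shown only as the pattern to avoid
        return {"error": "invalid template"}
```

<div>
<p class="MsoNormal">Passing <b>None</b> by mistake (a programmer error) raises a TypeError from the first function, so the bug is visible, while the second function silently turns it into the same vague "invalid template" result as a genuinely malformed file.</p>
</div>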
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<div>
<p class="MsoNormal">I do think that until we have a well-designed error handling scheme, it is better to let it crash. It is dangerous to hide errors and keep the system running in an undetermined state.<u></u><u></u></p>
</div>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">So the question is: <b>what kinds of operational errors are we facing in Vitrage?</b> I can think of the following:<u></u><u></u></p>
</div>
<div>
<ol start="1" type="1">
<li class="MsoNormal">
invalid configuration file<u></u><u></u></li><li class="MsoNormal">
failed to communicate with data source<u></u><u></u></li><li class="MsoNormal">
malformed data from data source<u></u><u></u></li><li class="MsoNormal">
failed to execute an action<u></u><u></u></li><li class="MsoNormal">
...<u></u><u></u></li></ol>
<div>
<p class="MsoNormal">Maybe this could be the first step for the error handling design.<u></u><u></u></p>
</div>
</div>
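<div>
<p class="MsoNormal">One possible way to turn the list above into code is sketched here in Python. Everything in it is an assumption for illustration: the exception classes, <b>call_with_retry</b> and <b>start_datasources</b> are hypothetical names, not existing Vitrage APIs. The idea is to give operational errors their own exception types, split them into permanent and transient ones, retry only the transient ones, and isolate a failing data source so that the others keep running. Anything that is not an operational error is left uncaught, so a programmer error still crashes with a full stack trace.</p>
</div>

```python
import logging
import time

LOG = logging.getLogger(__name__)


class OperationalError(Exception):
    """Base class for failures that a correct program must still expect."""


class PermanentError(OperationalError):
    """Will not recover on its own, e.g. an invalid configuration file."""


class TransientError(OperationalError):
    """May recover later, e.g. a data source that is not up yet."""


def call_with_retry(func, attempts=3, delay=0.0):
    """Call func, retrying only when it raises TransientError."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError as exc:
            LOG.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # give up and let the caller decide
            time.sleep(delay)


def start_datasources(datasources):
    """Start each data source independently and return a status per name.

    datasources maps a name to a zero-argument start function.
    """
    status = {}
    for name, start in datasources.items():
        try:
            call_with_retry(start)
            status[name] = "ok"
        except OperationalError as exc:
            # Isolate the failure, keep the details, let the others run.
            LOG.exception("data source %s failed", name)
            status[name] = "failed: %s" % exc
    return status
```

<div>
<p class="MsoNormal">The returned status dictionary is the kind of information that could be exposed to operators instead of being left only in the log files.</p>
</div>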
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">[1]: <a href="https://www.joyent.com/node-js/production/design/errors" target="_blank">https://www.joyent.com/node-js/production/design/errors</a><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">[2]: <a href="https://docs.openstack.org/developer/hacking/" target="_blank">https://docs.openstack.org/developer/hacking/</a><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
</div>
<div>
<p class="MsoNormal">-- <u></u><u></u></p>
</div>
<p class="MsoNormal">Yujun Zhang<u></u><u></u></p>
</div></div><div bgcolor="white" lang="EN-US" link="blue" vlink="purple"><div class="m_6319999045706068691WordSection1"></div>
</div>
</blockquote></div></div><div dir="ltr">-- <br></div><div data-smartmail="gmail_signature"><div dir="ltr">Yujun Zhang</div></div>