[Openstack] [Orchestration] Handling error events ... explicit vs. implicit
sandy.walsh at RACKSPACE.COM
Wed Dec 7 17:55:22 UTC 2011
True ... this idea has come up before (and is still being kicked around). My biggest concern is what happens if that scheduler dies? We need a mechanism that can live outside of a single scheduler service.
The more of these long-running processes we leave in a service, the greater the impact when something fails. Shouldn't we let the queue provide the resiliency and not depend on the worker staying alive? Personally I'm not a fan of removing our asynchronous nature.
From: Yun Mao [yunmao at gmail.com]
Sent: Wednesday, December 07, 2011 1:03 PM
To: Sandy Walsh
Cc: openstack at lists.launchpad.net
Subject: Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit
I'm wondering if it is possible to change the scheduler's rpc cast to
rpc call. This way the exceptions should be magically propagated back
to the scheduler, right? Naturally the scheduler can find another node
to retry or decide to give up and report failure. If we need to
provision many instances, we can spawn a few green threads for that.
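
A rough sketch of the call-and-retry idea, to make the proposal concrete. Everything here is an assumption for illustration: `rpc_call` is a stand-in for a blocking RPC call rather than nova's real rpc module, `bad_hosts` only simulates failures, and a standard-library thread pool stands in for green threads.

```python
from concurrent.futures import ThreadPoolExecutor


class ProvisionError(Exception):
    """Raised by the stand-in rpc_call when a host cannot build the instance."""


def rpc_call(host, instance_uuid, bad_hosts):
    # Hypothetical stand-in for a blocking RPC call: it either returns a
    # result or re-raises the remote failure in the caller.
    if host in bad_hosts:
        raise ProvisionError(f"{host} failed to build {instance_uuid}")
    return host


def schedule_with_retry(instance_uuid, candidate_hosts, bad_hosts, max_attempts=3):
    """Try candidate hosts in order until one succeeds or attempts run out."""
    errors = []
    for host in candidate_hosts[:max_attempts]:
        try:
            return rpc_call(host, instance_uuid, bad_hosts)
        except ProvisionError as exc:
            errors.append(str(exc))
    raise ProvisionError(f"giving up on {instance_uuid}: {errors}")


def provision_many(instance_uuids, candidate_hosts, bad_hosts, workers=4):
    # One blocking call per worker thread, so a slow build does not stall the rest.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            uuid: pool.submit(schedule_with_retry, uuid, candidate_hosts, bad_hosts)
            for uuid in instance_uuids
        }
        return {uuid: fut.result() for uuid, fut in futures.items()}
```

Note that all retry state lives in the calling process in this sketch, so it is lost if that process dies mid-request.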
On Wed, Dec 7, 2011 at 10:26 AM, Sandy Walsh <sandy.walsh at rackspace.com> wrote:
> For orchestration (and now the scheduler improvements) we need to know when an operation fails ... and specifically, which resource was involved. In the majority of the cases it's an instance_uuid we're looking for, but it could be a security group id or a reservation id.
> With most of the compute.manager calls the resource id is the third parameter in the call (after self & context), but there are some oddities. And sometimes we need to know the additional parameters (like a migration id related to an instance uuid). So simply enforcing parameter order may be insufficient, and it is impossible to enforce programmatically anyway.
> A little background:
> In nova, exceptions are generally handled in the RPC or middleware layers as a logged event and life goes on. In an attempt to tie this into the notification system, a while ago I added stuff to the wrap_exception decorator. I'm sure you've seen this nightmare scattered around the code:
> @exception.wrap_exception(notifier=notifier, publisher_id=publisher_id())
> What started as a simple decorator now takes parameters and the code has become nasty.
> But it works ... no matter where the exception was generated, the notifier gets:
> * compute.<host_id>
> * <method name>
> * and whatever arguments the method takes.
> So, we know what operation failed and the host it failed on, but someone needs to crack the argument nut to get the goodies. It's a fragile coupling from publisher to receiver.
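
To illustrate the coupling described above, here is a simplified sketch of a parameterized exception decorator. It is not nova's actual `wrap_exception`; the `record` notifier, the payload layout, and the `ComputeManager` class are hypothetical and only mirror the three things listed: publisher id, method name, and the method's arguments.

```python
import functools


def wrap_exception(notifier, publisher_id):
    """Hypothetical, simplified decorator: report any exception, then re-raise."""
    def decorator(func):
        @functools.wraps(func)
        def wrapped(self, context, *args, **kwargs):
            try:
                return func(self, context, *args, **kwargs)
            except Exception as exc:
                # The receiver only gets the raw arguments; it has to know
                # which position holds the resource id for this method.
                payload = {"args": args, "kwargs": kwargs, "error": str(exc)}
                notifier(publisher_id, func.__name__, payload)
                raise
        return wrapped
    return decorator


events = []


def record(publisher_id, event_type, payload):
    # Stand-in notifier that just collects events in a list.
    events.append((publisher_id, event_type, payload))


class ComputeManager:
    @wrap_exception(notifier=record, publisher_id="compute.host1")
    def run_instance(self, context, instance_uuid):
        raise RuntimeError("no valid disk")
```

Whoever consumes `events` has to know that the first positional argument of `run_instance` is the instance uuid, which is the fragile part.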
> One, less fragile, alternative is to put a try/except block inside every top-level nova.compute.manager method and send meaningful exceptions right from the source. More fidelity, but messier code. Although "explicit is better than implicit" keeps ringing in my head.
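
A minimal sketch of the explicit alternative, under the same caveats: `notify`, the event name, and the `_do_resize` helper are made up for illustration and are not taken from nova.

```python
events = []


def notify(publisher_id, event_type, payload):
    # Hypothetical notifier; here it just records the event.
    events.append((publisher_id, event_type, payload))


class ComputeManager:
    host = "host1"

    def _do_resize(self, context, instance_uuid, migration_id):
        # Simulated failure so the error path below is exercised.
        raise RuntimeError("resize failed")

    def resize_instance(self, context, instance_uuid, migration_id):
        try:
            self._do_resize(context, instance_uuid, migration_id)
        except Exception as exc:
            # The method itself knows which arguments identify the resources,
            # so the payload names them instead of shipping raw positional args.
            notify(f"compute.{self.host}", "compute.resize_instance.error",
                   {"instance_uuid": instance_uuid,
                    "migration_id": migration_id,
                    "error": str(exc)})
            raise
```

The receiver can read `instance_uuid` and `migration_id` by name, at the cost of repeating the try/except in every top-level method.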
> Or, we make a general event parser that anyone can use ... but again, the link between the actual method and the parser is fragile. The developers have to remember to update both.