[nova][dev] Improved scheduling error messages

Sean Mooney smooney at redhat.com
Mon Aug 30 12:19:22 UTC 2021

On Fri, 2021-08-27 at 22:37 -0400, Mohammed Naser wrote:
> Hi there,
> I like the idea, but historically, Nova has steered away from giving more
> details on why things failed to schedule in order to prevent leaking
> information about the cloud.
> I agree that it’s one of the more painful errors, but I see the purpose
> behind masking it from the user in an environment where the user is not the
> operator.
yes, it has been given as feedback in the past that for some clouds exposing a detailed error message would be seen as a
moderate security issue/information leak.
> It would be good to hear from other devs, or maybe if this can be an
> admin-level thing.

we do provide some additional info to admins today.
for example, for non-admins, if there is an external error message from say libvirt we will not provide the details
of the traceback, but admins will see the full error.

the main way we provide additional info to admins however is via log messages in the case of scheduling failure.
most of that however happens at debug level. the problem with adding more logging also comes from the fact that it gets
very, very verbose for large clouds.
> Thanks
> Mohammed
> On Wed, Aug 25, 2021 at 9:53 AM Brito, Hugo Nicodemos <
> Hugo.Brito at windriver.com> wrote:
> > Hi,
> > 
> > In a prototype, we have improved Nova's scheduling error messages.
> > This helps both developers and end users better understand the
> > scheduler problems that occur on creation of an instance.
> > 
> > When a scheduler error happens during instance creation via the nova
> > upstream, we get the following message on the Overview tab
> > (Horizon dashboard): "No valid host was found." This doesn't give us
> > enough information about what really happened, so our solution was to
> > add more details on the instance's overview page, e.g.:
> > 
> > **Fault:Message** attribute provides a summary of why each host can not
> > satisfy the instance’s resource requirements,
> > 
this has been tried in the past and is very expensive to implement.
it significantly increases the memory overhead for multi-create or large clouds.
after all, if you are scheduling on 1000 nodes and only 100 of them end up being valid,
we need to store the reason why the request fails for 900 of them.

if that was a multi-create request with 10 vms being created at once then we have to store that 10 times over.
finally we have to compose the 9000 host-removed messages into a single response body and return that to the user.
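
as a rough illustration of the arithmetic above, here is a minimal back-of-the-envelope sketch. the function names and the 200 bytes per reason string are assumptions made up for illustration, they are not nova code or measured values.

```python
# Back-of-the-envelope count of the per-host rejection reasons that would
# have to be kept for one request, using the numbers from the example above.


def rejection_records(total_hosts: int, valid_hosts: int, instances: int) -> int:
    """Number of (instance, host) rejection reasons to store for one request."""
    rejected_hosts = total_hosts - valid_hosts
    return rejected_hosts * instances


def payload_bytes(records: int, bytes_per_record: int) -> int:
    """Rough size of the composed fault message."""
    return records * bytes_per_record


records = rejection_records(total_hosts=1000, valid_hosts=100, instances=10)
print(records)  # 9000

# ASSUMPTION: 200 bytes per reason string is a hypothetical figure.
print(payload_bytes(records, bytes_per_record=200))  # 1800000
```

the count grows with both the number of rejected hosts and the number of instances in the request, which is why multi-create on a large cloud is the bad case.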
> >  e.g. for controller-0, it
> > indicates “No valid host was found. Not enough host cell CPUs to fit
> > instance cell” (where cell is a numa-node or socket).
> > 
> > **Fault:Details** attribute provides even more detail for each
> > individual host, for example it shows that the instance “required” 2
> > CPU cores and shows the “actual” CPU cores available on each “numa”
> > node: “actual:0, numa:1” and “actual:1, numa:0”.

so for each of the 9000 host removal messages we would effectively also need to dump the request spec,
synthesize the requirement that was imposed by each filter, then present the host info for you to validate (the host state object)
and also include that in the fault details.
> > 
> > These details are also present using the OpenStack CLI, in the
> > _fault_ attribute:
> > 
> > - openstack server show <instance>
and then expose the MBs of data in the server show output.
> > 
> > With that in mind, we'd like to know if you are open to consider such
> > a change. We are willing to submit a spec and upstream that
> > implementation.

that really does not sound like something we could accept upstream.
it has been proposed before by windriver and rejected in the past.

if you have a design that will scale well and is configurable, so that we can avoid leaking details about the host infrastructure to the end user while still
providing useful info to the admin, i would be happy to review the nova spec proposal.
the most recent work done on this topic was by chris friesen: https://review.opendev.org/c/openstack/nova-specs/+/390116

we do know this can be a pain point and that it would be good to improve it, but we need something that works from 1 vm and 10 hosts to 100 vms and 1000 hosts,
is still comprehensible, and does not use all your ram in the process.

> > 
> > Regards,
> > - nicodemos
> > 
