[openstack-dev] [gate][neutron][infra] tempest jobs timing out due to general sluggishness of the node?

Ihar Hrachyshka ihrachys at redhat.com
Wed Feb 15 20:29:01 UTC 2017


On Fri, Feb 10, 2017 at 2:48 PM, Clark Boylan <cboylan at sapwetik.org> wrote:
> On Fri, Feb 10, 2017, at 10:54 AM, Ihar Hrachyshka wrote:
>> Oh nice, I haven't seen that. It does give (virtualized) CPU model
>> types. I don't see a clear correlation between models and
>> failures/test times though. We are of course missing some details,
>> like which flags are emulated, but I doubt that would give us a clue.
>
> Yes, this will still be the virtualized CPU. Also the lack of cpu flag
> info is a regression compared to the old method of collecting this data.
> If we think that info could be useful somehow, we should find a way to
> add it back in. (Maybe just add back the cat /proc/cpuinfo step in
> devstack-gate).

To update, I posted a patch that logs /proc/cpuinfo using the new
Ansible data-gathering playbook: https://review.openstack.org/#/c/433949/
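
The gist of it is just to capture the raw file into the archived job
logs; in devstack-gate terms it would boil down to something like this
(a rough sketch only, not the actual content of the review above, and
$LOG_DIR is a stand-in for wherever the job collects its logs):

    # dump full cpuinfo, including the cpu flags we lost with the new
    # fact collection, so it ends up with the rest of the job logs
    cat /proc/cpuinfo > "$LOG_DIR/cpuinfo.txt"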

>
>> It would be interesting to know the overcommit/system load for each
>> hypervisor affected. But I assume we don't have access to that info,
>> right?
>
> Correct; with the exception of infracloud and OSIC (if we ask nicely),
> I don't expect it will be very easy to get this sort of information
> from our clouds.
>
> For infracloud a random sample of a hypervisor shows that it has 24 real
> cores. In the vanilla region we are limited to 126 VM instances with
> 8 vcpus each. We have ~41 hypervisors, which is just over 3 VM
> instances per hypervisor. 24 real cpus / 8 vcpus = 3 VM instances
> without oversubscribing. So we are just barely oversubscribing, if at
> all.

Ack, thanks for checking; so there is effectively no CPU
oversubscription there (3 instances x 8 vcpus = 24 vcpus against 24
physical cores). We will need to find some other hypothesis then.

For the record, Clark and I discussed the idea of adding a synthetic
benchmark at the start of every job (before our software is installed
on the node), to get easily comparable performance numbers between runs
that are guaranteed to be unaffected by the OpenStack installation.
Clark had reservations, though: the test would be synthetic and hence
not representative of real-life load, and we already have the
./stack.sh run time to use as a crude benchmark. Of course, ./stack.sh
depends on lots of externalities, so it's not as precise as a targeted
benchmark would be, but Clark feels the latter would be of limited use.
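
To make the idea concrete, I was thinking of something as dumb as the
following (a sketch only; the exact workloads and sizes are made up and
would obviously need tuning):

    # hypothetical pre-install micro-benchmark; the absolute numbers
    # are meaningless on their own and only make sense when compared
    # across runs

    # CPU: time a fixed chunk of single-core work
    time python -c 'sum(i * i for i in range(10**7))'

    # disk: write 256MB and force it out to disk for raw throughput
    dd if=/dev/zero of=/tmp/ddtest bs=1M count=256 conv=fdatasync
    rm -f /tmp/ddtest

Whether numbers like these would actually correlate with the job
timeouts is exactly what we would need to find out.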

Apart from that, it's not clear where to go next. I doubt the cpuinfo
dump will reveal anything insane in the failing jobs, so other ideas
are welcome.

Ihar


