[openstack-dev] [Neutron][Infra] Post processing of gate hooks on job timeouts

Morales, Victor victor.morales at intel.com
Mon Apr 11 13:39:29 UTC 2016

On 4/11/16, 5:07 AM, "Jakub Libosvar" <jlibosva at redhat.com> wrote:

>Hi,
>
>Recently we hit an issue in Neutron with tests getting stuck [1]. As a
>side effect, we discovered that logs are not collected properly, which
>makes it hard to find the root cause. The logs are missing because we send
>SIGKILL to whatever gate hook is running when we hit the global timeout
>per gate job [2]. This gives the running process no time to perform any
>post-processing. In the post_gate_hook function in Neutron, we collect logs
>from the /tmp directory, compress them and move them to /opt/stack/logs to
>expose them.
>
>I have two solutions in mind, and I'd like to get feedback on them before
>sending patches.
>
>1) In Neutron, we execute tests in post_gate_hook (dunno why). But even
>if we moved test execution into gate_hook, post_gate_hook would not be
>triggered when tests get stuck [3]. So the solution I propose here is to
>terminate gate_hook N minutes before the global timeout and still execute
>post_gate_hook (with a timeout) as a post-processing routine.
>
>2) The second proposal is to let timeout-wrapped commands know they are
>about to be killed. We can send, let's say, SIGTERM instead of SIGKILL
>and, after a certain amount of time, send SIGKILL. Example: we send
>SIGTERM 3 minutes before the global timeout, leaving 'command' those
>3 minutes to handle the SIGTERM signal.
>
> timeout -s 15 -k 3m $((REMAINING_TIME-3))m bash -c "command"
>
>With the 2nd approach we can trap the signal that kills the running test
>suite and collect logs with the same functions we currently have.
>
>
>I would personally go with the second option, but I want to hear if
>anybody has a better idea about post-processing in gate jobs or if there
>is already a tool we can use to collect logs.

I also like the second option. It seems less aggressive and gives us the
opportunity to capture more information before killing processes. Ideally,
timeouts are ultimatums for worst-case scenarios and should never be reached.
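
A rough sketch of how the second option could look from the hook's side,
assuming GNU timeout and bash. LOG_SRC, LOG_DST, run_hook and collect_logs
are made-up names standing in for the real directories and for Neutron's
existing log collection functions, and the durations are plain seconds only
to keep the demo short (a real job would use minute values such as
$((REMAINING_TIME-3))m).

```shell
#!/usr/bin/env bash
# Sketch only: LOG_SRC, LOG_DST, run_hook and collect_logs are hypothetical
# names, not part of devstack-gate or Neutron.

LOG_SRC=$(mktemp -d)   # stands in for the /tmp log directory
LOG_DST=$(mktemp -d)   # stands in for /opt/stack/logs
export LOG_SRC LOG_DST

# run_hook TERM_AFTER KILL_AFTER
# Sends SIGTERM after TERM_AFTER, then SIGKILL KILL_AFTER later if the
# command is still running.
run_hook() {
    timeout -s TERM -k "$2" "$1" bash -c '
        collect_logs() {
            # Compress whatever was written so far and publish it.
            tar -czf "$LOG_DST/hook-logs.tar.gz" -C "$LOG_SRC" .
            exit 143   # 128 + 15, the conventional status for SIGTERM
        }
        trap collect_logs TERM

        echo "partial test output" > "$LOG_SRC/test.log"

        # Simulate a stuck test suite. Running it in the background and
        # waiting lets bash run the trap as soon as the signal arrives.
        sleep 600 &
        wait $!
    '
}

status=0
run_hook 1 5 || status=$?
echo "timeout exit status: $status"
```

With GNU timeout this should report 124 (the command timed out and exited
after SIGTERM); 137 would mean the grace period ran out and SIGKILL was
needed. The background-plus-wait pattern matters because bash defers a trap
until the current foreground command finishes, which a stuck test never does.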

>
>Thanks,
>Kuba
>
>
>[1] https://bugs.launchpad.net/bugs/1567668
>[2]
>https://github.com/openstack-infra/devstack-gate/blob/master/functions.sh#L1151
>[3]
>https://github.com/openstack-infra/devstack-gate/blob/master/devstack-vm-gate-wrap.sh#L581