[openstack-dev] [Neutron][Infra] Post processing of gate hooks on job timeouts

Assaf Muller assaf at redhat.com
Mon Apr 11 15:31:46 UTC 2016


On Mon, Apr 11, 2016 at 9:39 AM, Morales, Victor
<victor.morales at intel.com> wrote:
>
> On 4/11/16, 5:07 AM, "Jakub Libosvar" <jlibosva at redhat.com> wrote:
>
>>Hi,
>>
>>recently we hit an issue in Neutron with tests getting stuck [1]. As a
>>side effect we discovered that logs are not collected properly, which
>>makes it hard to find the root cause. The logs are missing because we send
>>SIGKILL to whatever gate hook is running when we hit the global timeout
>>per gate job [2]. This gives the running process no time to perform any
>>post-processing. In the post_gate_hook function in Neutron, we collect logs
>>from the /tmp directory, compress them and move them to /opt/stack/logs to
>>expose them.
>>
>>I have two solutions in mind that I'd like to get feedback on before
>>sending patches.
>>
>>1) In Neutron, we execute tests in post_gate_hook (dunno why). But even
>>if we moved test execution into gate_hook, the post_gate_hook won't be
>>triggered if tests get stuck [3]. So the solution I propose here is to
>>terminate gate_hook N minutes before the global timeout and still execute
>>post_gate_hook (with a timeout) as a post-processing routine.
>>
>>2) The second proposal is to let timeout-wrapped commands know they are
>>about to be killed. We can send, let's say, SIGTERM instead of SIGKILL and
>>send SIGKILL after a certain amount of time. Example: we send SIGTERM 3
>>minutes before the global timeout, giving 'command' these 3 minutes to
>>handle the SIGTERM signal.
>>
>> timeout -s 15 -k 3m $((REMAINING_TIME-3))m bash -c "command"
>>
>>With the 2nd approach we can trap the signal that kills the running test
>>suite and collect logs with the same functions we currently have.
>>
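The trap-based collection described above could look roughly like the sketch below. It is only an illustration: collect_logs, run_tests_with_log_collection and the archive name test-logs.tar.gz are made-up names, not existing Neutron or devstack-gate functions. The test command runs in the background because bash defers a trap until the foreground command finishes, whereas the wait builtin returns as soon as a trapped signal arrives.

```shell
#!/bin/bash
# Hypothetical sketch; function names and the archive name are assumptions.

# Compress everything under $1 into an archive inside $2.
collect_logs() {
    local src_dir=$1 dest_dir=$2
    mkdir -p "$dest_dir"
    tar -czf "$dest_dir/test-logs.tar.gz" -C "$src_dir" .
}

# Usage: run_tests_with_log_collection SRC_DIR DEST_DIR COMMAND [ARGS...]
run_tests_with_log_collection() {
    local src_dir=$1 dest_dir=$2
    shift 2
    local rc=0 test_pid

    # Run the test command in the background so that 'wait' can be
    # interrupted by SIGTERM.
    "$@" &
    test_pid=$!

    # On SIGTERM, forward the signal to the test command. The pid is
    # expanded now, while the trap is being defined.
    trap "kill -TERM $test_pid 2>/dev/null" TERM

    # Returns the command's exit status, or a value above 128 if a
    # trapped signal interrupted the wait.
    wait "$test_pid" || rc=$?
    trap - TERM

    # Make sure the test command is really gone before archiving.
    wait "$test_pid" 2>/dev/null || true

    # Logs are collected on both the normal and the SIGTERM path.
    collect_logs "$src_dir" "$dest_dir"
    return "$rc"
}
```

A hook would call it as, for example, run_tests_with_log_collection /tmp /opt/stack/logs followed by the test command. The post-processing still has to finish within the grace period given to timeout -k, otherwise SIGKILL cuts it short.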
>>
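For comparison, the first proposal (stop the gate hook early and reserve the rest of the time for post-processing) might be sketched as follows. The function names, the use of seconds instead of minutes, and passing the hooks as command strings are all assumptions made for illustration; this is not how devstack-gate is actually structured.

```shell
#!/bin/bash
# Hypothetical sketch of option 1; names and units are assumptions.

# Run a command with a budget of $1 seconds. timeout(1) exits with 124
# when the budget is exceeded.
run_with_budget() {
    local seconds=$1
    shift
    timeout -s TERM "${seconds}s" "$@"
}

# Usage: run_gate_with_post_processing REMAINING RESERVE GATE_CMD POST_CMD
#   REMAINING: seconds left until the global job timeout
#   RESERVE:   seconds held back for post-processing
run_gate_with_post_processing() {
    local remaining=$1 reserve=$2 gate_cmd=$3 post_cmd=$4
    local rc=0
    local budget=$((remaining - reserve))

    if [ "$budget" -gt 0 ]; then
        run_with_budget "$budget" bash -c "$gate_cmd" || rc=$?
    else
        # timeout treats a duration of 0 as "no timeout", so skip the
        # gate hook entirely when no time is left for it.
        rc=124
    fi

    # Post-processing always runs, bounded by the reserved time.
    run_with_budget "$reserve" bash -c "$post_cmd" || true
    return "$rc"
}
```

The caller gets the gate hook's exit status back (124 if it was cut off), so a stuck test run still fails the job while the logs are collected.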
>>I would personally go with the second option but I want to hear if anybody
>>has a better idea about post-processing in gate jobs or if there is
>>already a tool we can use to collect logs.
>
> I also like the second option; it seems less aggressive and gives us the opportunity to catch
> more information before killing processes.  Ideally, timeouts are ultimatums for worst-case scenarios
> and should never be reached.

Kuba and I discussed this issue at length - I also think the 2nd
approach is reasonable but I'd like to see what more Devstack-oriented
folks think.

>
>>
>>Thanks,
>>Kuba
>>
>>
>>[1] https://bugs.launchpad.net/bugs/1567668
>>[2]
>>https://github.com/openstack-infra/devstack-gate/blob/master/functions.sh#L1151
>>[3]
>>https://github.com/openstack-infra/devstack-gate/blob/master/devstack-vm-gate-wrap.sh#L581
>>
>>__________________________________________________________________________
>>OpenStack Development Mailing List (not for usage questions)
>>Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>>http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
