[openstack-dev] [Neutron][Infra] Post processing of gate hooks on job timeouts

Clark Boylan cboylan at sapwetik.org
Mon Apr 11 16:41:19 UTC 2016

On Mon, Apr 11, 2016, at 03:07 AM, Jakub Libosvar wrote:
> Hi,
> recently we hit an issue in Neutron with tests getting stuck [1]. As a
> side effect we discovered that logs are not collected properly, which
> makes it hard to find the root cause. The logs are missing because we
> send SIGKILL to whatever gate hook is running when we hit the global
> per-job timeout [2]. This gives the running process no time to perform
> any post-processing. In Neutron's post_gate_hook function, we collect
> logs from the /tmp directory, compress them, and move them to
> /opt/stack/logs so they are exposed.
> I have two solutions in mind, and I'd like to get feedback before
> sending patches.
> 1) In Neutron, we execute tests in post_gate_hook (I don't know why).
> But even if we moved test execution into gate_hook and the tests got
> stuck, post_gate_hook would not be triggered [3]. So the solution I
> propose here is to terminate gate_hook N minutes before the global
> timeout and still execute post_gate_hook (with its own timeout) as a
> post-processing routine.
> 2) The second proposal is to let timeout-wrapped commands know they are
> about to be killed. We could send, say, SIGTERM instead of SIGKILL and,
> after a certain amount of time, send SIGKILL. Example: we send SIGTERM
> 3 minutes before the global timeout, giving 'command' those 3 minutes
> to handle the SIGTERM signal:
>  timeout -s TERM -k 3m $((REMAINING_TIME-3))m bash -c "command"
> With the second approach we can trap the signal that kills the running
> test suite and collect logs with the same functions we currently have.
> I would personally go with the second option, but I want to hear if
> anybody has a better idea about post-processing in gate jobs, or if
> there is already a tool we can use to collect logs.
> Thanks,
> Kuba
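The first proposal could be sketched roughly as follows. Note this is an illustration, not the actual devstack-gate code: run_hooks, GATE_HOOK, and POST_GATE_HOOK are made-up names, and the demo uses seconds where a real job would use minutes.

```shell
#!/bin/bash
# Sketch of proposal 1: bound gate_hook by a deadline that leaves a
# safety margin, then always run post_gate_hook under its own timeout.
# GATE_HOOK / POST_GATE_HOOK are illustrative placeholders.

run_hooks() {
    local gate_budget=$1 post_budget=$2

    # gate_hook gets everything except the safety margin; `timeout`
    # exits 124 if the command is still running at the deadline.
    timeout "$gate_budget" bash -c "$GATE_HOOK"
    local rc=$?

    # post_gate_hook always runs, bounded by the margin, even when
    # gate_hook timed out.
    timeout "$post_budget" bash -c "$POST_GATE_HOOK"
    return $rc
}

# Demo with seconds instead of minutes: a "stuck" gate_hook is cut
# off after 1s, yet the post-processing hook still runs.
GATE_HOOK='sleep 60'
POST_GATE_HOOK='echo post_gate_hook ran'
out=$(run_hooks 1 5)
echo "$out"
```

The point of the split budgets is that log collection is no longer hostage to the test run: however badly gate_hook misbehaves, the post-processing step still gets its reserved slice of the job's time.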

Devstack-gate already does a "soft" timeout [0], then proceeds to
cleanup (part of which is collecting logs) [1], and then Jenkins does
the "hard" timeout [2]. Why aren't we collecting the required log files
as part of the existing cleanup?
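A command on the receiving end of such a soft timeout can trap SIGTERM and run the existing log-collection functions before the hard kill arrives. A minimal sketch, with a marker file standing in for real log collection and `sleep 60` simulating a stuck test suite (short durations are used so the demo finishes quickly):

```shell
#!/bin/bash
# Sketch: a wrapped "gate hook" traps SIGTERM so it can do cleanup
# before any follow-up SIGKILL. The marker file stands in for real
# log collection.

MARKER=$(mktemp)

# Inner hook: on SIGTERM, record that cleanup ran, then exit 124
# (the conventional timeout status). Backgrounding the sleep and
# using `wait` lets bash run the trap as soon as the signal lands.
inner='
trap "echo soft-cleanup > '"$MARKER"'; exit 124" TERM
sleep 60 &
wait
'

# Soft timeout after 1s (SIGTERM), escalating to SIGKILL 5s later
# only if the hook has not exited on its own by then.
timeout --signal=TERM --kill-after=5 1 bash -c "$inner"

cat "$MARKER"
```

The `--kill-after` escalation matters: a hook whose SIGTERM handler itself hangs still gets reaped, so the job cannot wedge on its own cleanup.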
