[openstack-dev] [Neutron][Infra] Post processing of gate hooks on job timeouts

Jakub Libosvar jlibosva at redhat.com
Mon Apr 11 10:07:33 UTC 2016


Recently we hit an issue in Neutron with tests getting stuck [1]. As a
side effect, we discovered that logs are not collected properly, which
makes it hard to find the root cause. The logs are missing because we
send SIGKILL to whatever gate hook is running when we hit the global
timeout per gate job [2]. This gives the running process no time to
perform any post-processing. In Neutron's post_gate_hook function, we
collect logs from the /tmp directory, compress them and move them to
/opt/stack/logs to expose them.

I have two solutions in mind that I'd like to get feedback on before
sending patches.

1) In Neutron, we execute tests in post_gate_hook (dunno why). But even
if we moved test execution into gate_hook, post_gate_hook would not be
triggered when the tests get stuck [3]. So the solution I propose here
is to terminate gate_hook N minutes before the global timeout and still
execute post_gate_hook (with its own timeout) as a post-processing
routine.

2) The second proposal is to let timeout-wrapped commands know they are
about to be killed. We can send, let's say, SIGTERM instead of SIGKILL
and, after a certain amount of time, send SIGKILL. Example: we send
SIGTERM 3 minutes before the global timeout, giving 'command' those 3
minutes to handle the SIGTERM signal.

 timeout -s 15 -k 3m $((REMAINING_TIME-3))m bash -c "command"

With the second approach we can trap the signal that kills the running
test suite and collect logs with the same functions we currently have.
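
To illustrate the trap-based handling, here is a minimal sketch. It
assumes GNU timeout and uses seconds instead of minutes so it finishes
quickly. SRC_DIR, DEST_DIR, collect_logs and on_term are hypothetical
stand-ins (for /tmp, /opt/stack/logs and the existing log collection
functions), not the actual Neutron code.

```shell
#!/bin/bash
# Hypothetical stand-ins for /tmp and /opt/stack/logs.
SRC_DIR=$(mktemp -d)
DEST_DIR=$(mktemp -d)
export SRC_DIR DEST_DIR
echo "sample test output" > "$SRC_DIR/testrun.log"

# Send SIGTERM after 1 second; follow up with SIGKILL 5 seconds later
# if the command is still alive.
status=0
timeout -s TERM -k 5 1 bash -c '
    # Hypothetical helper standing in for the existing log collection.
    collect_logs() {
        tar -czf "$DEST_DIR/test-logs.tar.gz" -C "$SRC_DIR" .
    }

    # Runs when SIGTERM arrives: stop the stuck job, then save the logs.
    on_term() {
        kill "$pid" 2>/dev/null
        collect_logs
        exit 143
    }
    trap on_term TERM

    # Stand-in for a stuck test suite. Running it in the background and
    # waiting lets bash run the trap as soon as the signal arrives.
    sleep 30 &
    pid=$!
    wait "$pid"
' || status=$?

echo "timeout exit status: $status"
ls "$DEST_DIR"
```

The grace period passed to -k has to be long enough for the handler to
finish compressing and moving the logs, otherwise SIGKILL cuts it short
just as it does today.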

I would personally go with the second option, but I want to hear if
anybody has a better idea about post-processing in gate jobs or if
there is already a tool we can use to collect logs.
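
For comparison, the first proposal amounts to splitting the remaining
time budget between the two hooks. A rough sketch, assuming GNU timeout
and using seconds instead of minutes: REMAINING_TIME, POST_HOOK_BUDGET
and MARKER are illustrative names, and the sleep and echo commands are
stand-ins for the real gate_hook and post_gate_hook.

```shell
#!/bin/bash
# Illustrative values; the real job budget is measured in minutes.
REMAINING_TIME=3      # time left before the global timeout
POST_HOOK_BUDGET=2    # the "N" reserved for post-processing
MARKER=$(mktemp)      # hypothetical file proving the post hook ran

# Stand-in for a gate_hook whose tests are stuck: it is killed once
# its share of the budget is used up.
gate_status=0
timeout -s KILL "$((REMAINING_TIME - POST_HOOK_BUDGET))" sleep 30 \
    || gate_status=$?

# post_gate_hook still runs, under its own timeout, even though the
# gate hook did not finish.
post_status=0
timeout -s KILL "$POST_HOOK_BUDGET" \
    sh -c 'echo "logs collected" > "$1"' sh "$MARKER" \
    || post_status=$?

echo "gate hook status: $gate_status, post hook status: $post_status"
cat "$MARKER"
```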


[1] https://bugs.launchpad.net/bugs/1567668
