Hi Stackers,

Some time ago I started monitoring something we have called "bare rechecks", which are "recheck" comments posted without any reason given. I also asked various teams to improve on that: to check the failures of the CI jobs and put some kind of reason in the recheck comment. That went pretty well and currently most of the teams are doing a pretty good job there. See the stats from the last 30 days:

+--------------------+---------------+--------------+-------------------+
| Team               | Bare rechecks | All Rechecks | Bare rechecks [%] |
+--------------------+---------------+--------------+-------------------+
| OpenStackSDK       |             8 |            8 |             100.0 |
| OpenStack-Helm     |            41 |           41 |             100.0 |
| cloudkitty         |             1 |            1 |             100.0 |
| monasca            |             6 |            6 |             100.0 |
| skyline            |             2 |            2 |             100.0 |
| trove              |             1 |            1 |             100.0 |
| rally              |             1 |            1 |             100.0 |
| watcher            |             1 |            1 |             100.0 |
| mistral            |             5 |            8 |              62.5 |
| kuryr              |             1 |            2 |              50.0 |
| zaqar              |             1 |            2 |              50.0 |
| horizon            |             6 |           15 |              40.0 |
| swift              |             9 |           25 |              36.0 |
| manila             |             5 |           14 |             35.71 |
| kolla              |            14 |           68 |             20.59 |
| cinder             |            15 |           77 |             19.48 |
| ironic             |            10 |           53 |             18.87 |
| glance             |            11 |           59 |             18.64 |
| requirements       |            13 |           73 |             17.81 |
| Telemetry          |             2 |           12 |             16.67 |
| nova               |            16 |          105 |             15.24 |
| OpenStackAnsible   |             5 |           40 |              12.5 |
| magnum             |             1 |            9 |             11.11 |
| neutron            |            13 |          118 |             11.02 |
| Quality Assurance  |             3 |           30 |              10.0 |
| Release Management |             1 |           12 |              8.33 |
| octavia            |             2 |           34 |              5.88 |
| keystone           |             0 |            1 |               0.0 |
| OpenStack Charms   |             0 |            1 |               0.0 |
| oslo               |             0 |            8 |               0.0 |
| Puppet OpenStack   |             0 |           95 |               0.0 |
| barbican           |             0 |            2 |               0.0 |
| designate          |             0 |            4 |               0.0 |
| heat               |             0 |            7 |               0.0 |
| tacker             |             0 |           63 |               0.0 |
| tripleo            |             0 |            4 |               0.0 |
+--------------------+---------------+--------------+-------------------+

Now I thought it was time for the "next step", so I spent some time looking at the reasons for the rechecks done in the last 30 days. I collected those reasons with the script [1] and here are some of the top reasons found:

+----------------------------------------------------------+-----------------------+
| Recheck comment                                          | Number of occurrences |
+----------------------------------------------------------+-----------------------+
| recheck unrelated failure                                |                    48 |
| recheck irrelevant failure                               |                    45 |
| recheck unrelated test failure                           |                     9 |
| recheck testing                                          |                     9 |
| recheck infra-failure                                    |                     9 |
| recheck tempest-slow-py3                                 |                     8 |
| recheck nova-next                                        |                     8 |
| recheck timeout                                          |                     7 |
| recheck neutron-functional-with-uwsgi                    |                     7 |
| recheck failure not related to this change               |                     7 |
| recheck ovn-octavia-provider-functional-master unrelated |                     7 |
| recheck jobs running too early                           |                     6 |
| recheck dependency updated                               |                     5 |
| recheck random failure                                   |                     5 |
| recheck - bifrost bug                                    |                     5 |
+----------------------------------------------------------+-----------------------+

Those comments are quoted exactly as they were put in Gerrit. Of course there were many more different comments there; I just wanted to highlight some of the most common ones.
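For the curious, below is a minimal Python sketch of how such comments could be split into "bare" rechecks and rechecks with a reason, and how the most common reasons could be counted. This is not the actual script [1]: the regular expression, the classify() function and the sample comments are illustrative assumptions only, and in practice the comments of course come from Gerrit rather than from a hard-coded list.

# Minimal sketch, not the actual rechecks-stats script [1]: one possible
# way to tell bare rechecks apart from rechecks with a reason and to
# count the most common reasons.
import re
from collections import Counter

# A recheck comment is the word "recheck" optionally followed by a reason.
RECHECK_RE = re.compile(r"^\s*recheck\b[ \t]*(?P<reason>.*)$",
                        re.IGNORECASE | re.MULTILINE)

def classify(messages):
    """Return (number of bare rechecks, Counter of recheck reasons)."""
    bare = 0
    reasons = Counter()
    for msg in messages:
        match = RECHECK_RE.search(msg)
        if not match:
            continue  # not a recheck comment at all
        reason = match.group("reason").strip()
        if reason:
            reasons[reason.lower()] += 1
        else:
            bare += 1
    return bare, reasons

# Hypothetical sample data standing in for comments fetched from Gerrit.
comments = [
    "Patch Set 3:\n\nrecheck",
    "recheck unrelated failure",
    "recheck unrelated failure",
    "recheck timeout",
]
bare, reasons = classify(comments)
print("bare rechecks:", bare)
for reason, count in reasons.most_common():
    print(count, "x recheck", reason)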
I also asked ChatGPT to group all of them into some common categories. Below are some of the main categories of reasons it found:

CI/Testing Issues:
- Timeouts in various tests (e.g., tempest tests, tox tests).
- Infrastructure failures impacting CI.
- Kernel panics or guest panics during testing.
- Dependency-related failures (merged or updated).
- Unexpected failures or intermittent issues.

Dependency Management:
- Dependency patches merged or updated.
- Dependency-related failures.

Release/Documentation Updates:
- Release note fixes or updates.
- Content cleanup or removal.
- Documentation-related failures.

Infrastructure/Tooling:
- Infrastructure mirror issues.
- Tooling updates or fixes (e.g., grub2-tools, proliantutils).
- Issues related to mirror servers.

Specific Project/Component Issues:
- Nova-related failures (e.g., nova-ceph-multistore).
- Cinder-related failures (e.g., cinder-plugin-ceph-tempest).
- OpenStack SDK related failures.

Configuration/Setup Problems:
- Problems with deployment or setup (e.g., grub2-tools missing, bifrost bug).
- Issues with configurations (e.g., host patterns).

Network/Connection Problems:
- DNS resolution issues.
- Timeout errors related to network operations.

Post-Failure Analysis:
- Attempts to reproduce failures.
- Investigation into infrastructural problems.

Finally, some conclusions from this analysis. It seems that our most common reason for gate instability is something called an "unrelated/irrelevant failure", together with things like "infra failure". In both cases we can't really do much to improve the situation, as this information is not enough.

So I would like to ask all of you to try to be more precise in your recheck comments and give more detailed reasons why you are rechecking your patch. If, for example, you have to recheck a patch due to an "unrelated failure", please also include the number of the bug which caused the failure, or at least the name of the unrelated test which failed, or something like that. That way, when I do a similar analysis in, say, a month from now, it may be easier to identify the most common bugs causing those "unrelated failures" and to prioritize work on fixing them.

[1] https://github.com/slawqo/rechecks-stats/blob/main/rechecks_stats/rechecks_r...

--
Slawek Kaplonski
Principal Software Engineer
Red Hat