On Fri, 2024-03-22 at 07:02 -0700, Dan Smith wrote:
The irritation I was expressing was because, the previous time, asking people not to recheck without a reason was not implemented as asking people to try and see if they can fix the underlying problem.

Repeating

    * STOP DOING BLIND RECHECKS aka. 'recheck' https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-tes...

in the gate status section of our meeting every week, without actually trying to understand the reasons people were giving for a recheck, was unhelpful and frankly quite annoying.
> That document has lots of good examples and suggestions for exactly what we’re trying to do: get people to examine the failures. The most common thing I hear when I confront people about why they’re rechecking is that they “don’t know how to debug this stuff”. So I think pointing them at a document that gives them pointers and places to start is pretty good.

Yes it does, and I agree that including it is good.
But when I saw it in our team meeting this week it was very triggering, as we had previously agreed as a team to stop doing that:

    * please avoid bare rechecks (bauzas, 16:12:14)
    * ACTION: bauzas to tell about the bare rechecks every week in our meeting (bauzas, 16:15:26)
I'll chat to bauzas about how we can phrase this reminder to make it clear the focus is not about stopping blind rechecks; it is about understanding why a recheck was required and fixing that issue, and providing a reason is the minimal first step in that process.
> It’s unfortunate that you’re triggered by a reminder, but it’s also like the least impactful thing in the meeting. It gets said, and then immediately we move on to the next thing. Could you maybe just read it with the new lens of understanding where it's coming from from now on?

Sure, although the immediately moving on is part of what is triggering. In our gate status section we talk about the active issues that are directly breaking/blocking things; I kind of wish that, instead of just moving on, we had something like a subteam update on the active progress that is hopefully being made on addressing the most common problems.
**This is a very nova-specific viewpoint, so it may not apply to others.** Instead of:

    #topic Gate status
    #link check queue gate status http://status.openstack.org/elastic-recheck/index.html
    #link 3rd party CI status (not so dead) http://ciwatch.mmedvede.net/project?project=nova
    # tell people not to do blind rechecks, with link, and move on

I wish we did something like:

    #topic Gate status (reminder: read before rechecking https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-tes...)
    #link check queue gate status http://status.openstack.org/elastic-recheck/index.html
    #link 3rd party CI status (not so dead) http://ciwatch.mmedvede.net/project?project=nova
    # talk about the most common recheck reason and how we can address it, or the general status of that effort

This really feels like it should be a SIG or pop-up team to improve CI health, and we should treat it in the meeting like the Stable branch status. I'm not against talking about it in the meeting, but I would prefer if we tried to engage in actionable conversations about this instead of just a reminder.

I was hoping to be able to spend more time purely upstream in Dalmatian; unfortunately I will likely have to spend less time instead. That means that more than ever I want to try and focus on the activities that will have the largest impact, such as improving CI health.

The triggering part of this email thread and the team meeting was that it started with the table of rechecks vs bare rechecks, not the analysis of the recheck reasons. If we just take the first sub-category, I have some feedback on how to improve things that we could try to capture in https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-tes...

CI/Testing Issues:

Timeouts in various tests (e.g., tempest tests, tox tests).

This is sometimes just a slow node or an intermittent package or cloning issue, which is generally less actionable, but when this happens I would recommend you note which CI provider the job executed on: if we see the timeout only on ovh-whatever, then we can reach out to the maintainers of that provider to see if it is a problem they can fix. Sometimes this is an issue in the job, where it is git-cloning something that should be listed as a required project and pre-cloned/prepared by Zuul. For example, I know that the neutron jobs were, a number of years ago, cloning the ovs repo and compiling it from source in the CI job. I saw this failure once for networking-ovs-dpdk (now unmaintained), so I reached out to the infra team to add ovs and dpdk https://github.com/openstack/project-config/commit/3f41d6be143a98d2848c42f9a... to the Zuul tenant, and I spoke to the neutron team to let them know that this was now available, but I would not have time to update their jobs to use Zuul to clone the repo. The neutron-functional job still builds ovs from git by default https://github.com/openstack/neutron/blob/master/zuul.d/base.yaml#L41 and if that has ever timed out or failed to clone, it can be entirely eliminated by adding openvswitch/ovs to required-projects and updating OVS_REPO to the Zuul-cloned repo path, which is something like ~/src/github.com/openvswitch/ovs; there is a standard way to look this up in the Zuul inventory. (There is a rough sketch of what I mean just below.)

Infrastructure failures impacting CI.

This is fortunately quite rare; the infra team do a very good job of avoiding downtime. Mostly this only happens when they need to do a gerrit/zuul restart, which they announce ahead of time, or if there is a provider issue, which is out of our control.
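Coming back to the OVS_REPO point for a second, to make it concrete, the change I have in mind looks roughly like the following. This is an untested sketch, not a patch I have written: the job name matches the existing neutron-functional job, but the parent job, the other vars it already sets, and whether the OVS build code accepts a local checkout path instead of a git URL would all need to be checked.

    # Untested sketch for neutron's zuul.d/base.yaml: let Zuul clone OVS
    # instead of the job git-cloning it at run time.
    - job:
        name: neutron-functional
        required-projects:
          # Zuul prepares a cached checkout of this repo on the test node.
          - name: github.com/openvswitch/ovs
        vars:
          # Point the existing OVS_REPO variable at the Zuul-prepared
          # checkout instead of the upstream git URL. zuul.projects is the
          # standard way to look up where Zuul placed a required project;
          # this expands to something like ~/src/github.com/openvswitch/ovs.
          OVS_REPO: "{{ ansible_user_dir }}/{{ zuul.projects['github.com/openvswitch/ovs'].src_dir }}"

The same pattern applies to any job that still git-clones a repo it could declare as a required project: the clone, and the network flakiness that comes with it, moves out of the job and into the workspace Zuul prepares.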
Kernel panics or guest panics during testing.

This, as I noted previously, is partly related to using old guest images with known buggy kernels, and sometimes qemu bugs. This is the thing we have the most difficulty fixing in the nova CI because it is largely not issues in our code, but we have tried experimenting with different images to try to mitigate it. I even went as far as starting to create a replacement guest image based on Alpine: https://review.opendev.org/c/openstack/diskimage-builder/+/755413/4

To be clear, this is not a cirros problem, i.e. it is not that cirros is fundamentally broken; cirros is using the unmodified Ubuntu kernel, so it is a good proxy for real guest workloads. Moving cirros to the latest upstream LTS kernel might help, or even the latest Ubuntu kernel rather than the latest Ubuntu LTS. I'm hoping that when we move to Ubuntu 24.04 some of the qemu bugs go away and that also helps. Again, this is not a slight on Ubuntu; we just are not using the latest release of qemu in our jobs, so I'm hoping that when we get a newer version it is less likely to have bugs. I can dream, right? :) The size of our guests and the number of boots we do in our CI is, I think, why we might hit some of these issues that the qemu/kernel devs may not see in their own testing.

Dependency-related failures (merged or updated).

I think there has actually been a Zuul change in behaviour here that has regressed our usage. What I have observed is that Zuul now seems to abort jobs for later patches in a series if one of the earlier patches fails, and then reports Verified -1. It may also report the failure of a parent patch after the fact; I have not dug into it. That means that we now need to recheck those later patches in the series, whereas before we did not: we could recheck the base and, when it merged, the gate jobs for the later patches could progress. I'm not explaining this very well, but there has been a subtle change in behaviour during Caracal that I do not recall seeing before then, so perhaps we should see if a recent Zuul release has changed something in this regard, and perhaps we can configure the old behaviour, which required fewer rechecks.

Unexpected failures or intermittent issues.

Yeah, so this usually means there is a bug: either a test leaking state or an incorrectly written test case. PSA: if you are writing a tempest test to attach/detach a port or volume from a nova instance, and you are checking neutron or cinder to see if it is done, your test is wrong. You should be checking nova: we release the port/volume before we commit the DB update on our side, so that we only do that if the external service did not have an error. I'm raising this as an example because we know of at least one tempest test case that fails intermittently because of this exact race. I tried to fix that quickly (https://review.opendev.org/c/openstack/tempest/+/905130) but have not had time to actually dig into why that did not work, and if anyone from the tempest team can take it over, please do. It is probably something trivial, like verification not being configured, that you will see and I have missed. (There is a rough sketch of the kind of check I mean at the bottom of this mail.)

Anyway, that was very long, but if this was the type of conversation we were having in the nova team meeting or in a pop-up team setting, I would be happy to engage.
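Finally, to illustrate the attach/detach PSA above, here is a rough sketch of the kind of check I mean. This is not the actual change in the tempest review linked above, and the names are illustrative rather than a drop-in helper; as far as I recall the compute servers client does have a list_volume_attachments call, but treat the details as something to verify.

    # Rough sketch: wait for the *nova* side of a volume detach to complete
    # instead of polling cinder for the volume to go 'available'.
    import time


    def wait_for_nova_volume_detach(servers_client, server_id, volume_id,
                                    timeout=300, interval=5):
        """Poll nova's attachment list until volume_id is gone.

        Nova only removes the attachment record after the call to the
        external service has succeeded, so this is the authoritative signal
        that the detach has finished; cinder can report the volume as free
        slightly earlier, which is the race described above.
        """
        deadline = time.time() + timeout
        while time.time() < deadline:
            attachments = servers_client.list_volume_attachments(
                server_id)['volumeAttachments']
            if not any(a['volumeId'] == volume_id for a in attachments):
                return
            time.sleep(interval)
        raise AssertionError('volume %s still attached to server %s after %ss'
                             % (volume_id, server_id, timeout))

If I remember right, tempest already has a waiter along these lines (wait_for_volume_attachment_remove_from_server), so the real fix for the flaky test may just be using it in the right place rather than adding anything new.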
> --Dan