[openstack-dev] [fedora] Re: upstream f21 devstack test

Attila Fazekas afazekas at redhat.com
Sun Jan 25 17:10:56 UTC 2015


I have tried the old 'vmlinuz-3.17.4-301.fc21.x86_64' kernel in my env;
with this version the volume-attachment-related tests fail, but they fail in
the test case itself, so I do not see the secondary network failures.

In my env with '3.17.8-300.fc21.x86_64' everything passes with nnet, so
I would say the 3.17.4-301.fc21.x86_64 version of the kernel is buggy.

On the gate vm the new kernel (3.17.8-300.fc21.x86_64) was installed before the boot,
but the boot loader config still picks the old one. I tried to switch to the new
kernel on `your vm`, but the machine failed to come back after the reboot; maybe I
misconfigured the extlinux.conf or we have some environment-specific issue.
I lost your on-hold vm. :(
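
(For the record, the switch itself should only require pointing the extlinux
default entry at the new label. A minimal illustrative excerpt; the label
names and paths below are made up, not copied from the gate vm:)

    # /boot/extlinux/extlinux.conf -- illustrative excerpt only
    # was: default fedora-3.17.4
    default fedora-3.17.8

    label fedora-3.17.8
        kernel /vmlinuz-3.17.8-300.fc21.x86_64
        append root=/dev/vda1 ro
        initrd /initramfs-3.17.8-300.fc21.x86_64.img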

It looks like https://bugs.launchpad.net/nova/+bug/1353939 was triggered every time, i.e.
the vm failed to delete, which left wrong iptables rules behind, which in turn caused
several subsequent ssh test failures whenever a test used the same fixed ip as test_rescued_vm_detach_volume.

Tempest could be stricter and fail the test suite in tearDownClass when the vm moves to ERROR state on delete.
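
A minimal sketch of the stricter cleanup I have in mind (illustration only;
the client object and method names are assumptions modeled on tempest's
compute client, not the real tempest code):

    def strict_delete_servers(servers_client, server_ids):
        """Delete servers and fail loudly if any of them lands in ERROR."""
        for server_id in server_ids:
            servers_client.delete_server(server_id)
            try:
                # Wait until the server is really gone.
                servers_client.wait_for_server_termination(server_id)
            except Exception:
                server = servers_client.get_server(server_id)
                if server.get('status') == 'ERROR':
                    # Fail the suite instead of silently leaving the ERROR vm
                    # (and its stale iptables rules) behind.
                    raise AssertionError(
                        'server %s moved to ERROR on delete' % server_id)
                raise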


----- Original message -----
> From: "Attila Fazekas" <afazekas at redhat.com>
> To: "Ian Wienand" <iwienand at redhat.com>
> Cc: "Alvaro Lopez Ortega" <aortega at redhat.com>, "Jeremy Stanley" <fungi at yuggoth.org>, "Sean Dague" <sean at dague.net>,
> "dean Troyer" <dtroyer at gmail.com>, "OpenStack Development Mailing List" <openstack-dev at lists.openstack.org>
> Sent: Thursday, January 22, 2015 18:16:01
> Subject: [fedora] Re: upstream f21 devstack test
> 
> 
> 
> ----- Original message -----
> > From: "Attila Fazekas" <afazekas at redhat.com>
> > To: "Ian Wienand" <iwienand at redhat.com>
> > Cc: "Alvaro Lopez Ortega" <aortega at redhat.com>, "Jeremy Stanley"
> > <fungi at yuggoth.org>, "Sean Dague" <sean at dague.net>,
> > "dean Troyer" <dtroyer at gmail.com>, "OpenStack Development Mailing List (not
> > for usage questions)"
> > <openstack-dev at lists.openstack.org>
> > Sent: Monday, January 19, 2015 18:02:17
> > Subject: Re: upstream f21 devstack test
> > 
> > Per request moving this thread to the openstack-dev list.
> > 
> > I have not been able to reproduce the issue so far, either on the
> > vm you pointed me to or on any of my VMs.
> > 
> > Several things I observed on `your` machine:
> > 1. The installed kernel is newer than the one actually in use (no known
> > related issue)
> 
> strace on libvirt does not terminate properly on Ctrl+C;
> this is probably not the only process-related misbehavior.
> 
> The kernel version and hypervisor type might be relevant to the
> 'Exception during message handling: Failed to terminate process 32495 with
> SIGKILL: Device or resource busy'
> 
> According to the strace the signal was sent and the process was killed,
> but it remains a zombie until the strace itself is killed.
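> 
> (A quick way to confirm from another shell that it is really an unreaped
> zombie; 32495 stands in for whatever pid is stuck:)
> 
>     $ ps -o pid,stat -p 32495
>       PID STAT
>     32495 Z
> 
> A 'Z' in the STAT column means the process is already dead and is only
> waiting for its tracer/parent to wait() on it.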
> 
> 
> > 2. On the first tempest run (logs are collected [0]) lp#1353939 was
> > triggered, but not related
> I was wrong.
> This was related. An exception during instance delete can leave behind
> iptables rules, so the wrong security group rules will be applied.
> 
> In the other jenkins jobs this situation is rare.
> 
> On `your` vm 'tox -eall test_rescued_vm_detach_volume' triggers the issue
> almost always; in other environments I have not been able to reproduce it so far.
> 
> > 3. After trying to reproduce the issue many, many times I hit lp#1411525;
> >    the patch which introduced it has already been reverted.
> > 4. Once I saw 'Returning 400 to user: No nw_info cache associated with
> >    instance', which I haven't seen with nova network for a long time
> >    (once in ~100 runs).
> > 5. I see a lot of annoying iscsi-related logging; it is also not related
> >    to the connection issue.
> >    IMHO tgtadm can be considered a DEPRECATED thing, and we should
> >    switch to lioadm.
> > 
> > So far, no log entry has been found related to the connection issue
> >  which would be worth searching for on logstash.
> > 
> > The nova network log is not sufficient to figure out the actual netfilter
> > state at any given moment.
> > According to the log it should have updated the chains with something, but
> > who knows..
> > 
> > With the ssh connection issues very little can be done as post-mortem
> > analysis.
> > Tempest normally deletes the related resources, so little evidence
> > remains.
> > If the issue is reproducible, in some cases it is enough to alter the test
> > so that it does not destroy the evidence,
> > but very frequently some kind of real debugger is required.
> > 
> > Several suspected things:
> > * The vm was able to acquire an address via dhcp -> successful boot, it has
> > L2 connectivity.
> > * No evidence found of a dead qemu; no special libvirt operation was
> > requested before the failure.
> > * nnet claims it added the floating ip to the br100
> > * L3 issue / security group rules?..
> > 
> > The basic network debug was removed from tempest [1]. I would like to
> > recommend reverting that change
> > in order to have at least an idea whether the interfaces and netfilter
> > were or weren't in good shape [1].
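> > 
> > (For illustration, roughly the kind of state that debug code captured,
> > which can also be collected by hand on the host; this exact command list
> > is mine, not the removed tempest code:)
> > 
> >     $ ip addr show br100                  # is the floating ip on the bridge?
> >     $ ip route                            # L3 paths to the fixed/floating nets
> >     $ sudo iptables-save | grep -i nova   # security group / nat chains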
> > 
> Full tempest runs were required to reproduce the issue, together with
> reverting [1] to see
> what really happened.
> 
> test_rescued_vm_detach_volume + any ssh test can be sufficient to reproduce
> the issue.
> 
> > I also created a vm with firewalld enabled (normally it is not in my
> > devstack setups); the 3 mentioned test cases keep working fine even after
> > running them for hours.
> > However, '/var/log/firewalld' contains COMMAND_FAILED entries as on `your`
> > vm.
> > 
> > I will try to run more full tempest+nnet runs on F21 in my env to get a
> > bigger sample for the success rate.
> > 
> > So far I have reproduced 0 ssh failures,
> > so I will scan the logs [0] again more carefully on `your` machine;
> > maybe I missed something, or maybe those tests interfered with something
> > less obvious.
> > 
> > I'll check the other gate f21 logs (~100 jobs/week) to see
> >  whether anything happened when the issue started and/or whether the issue
> >  still exists.
> > 
> > 
> > So, I have nothing useful at the moment, but I have not given up.
> > 
> > [0]
> > http://logs.openstack.org/87/139287/14/check/check-tempest-dsvm-f21/5f3d210/console.html.gz
> > [1] https://review.openstack.org/#/c/140531/
> > 
> > 
> > PS.:
> > F21's HAProxy is more sensitive to services which stop listening,
> > and the load will not be evenly balanced.
> > For a working F21 neutron job a better listener is required:
> > https://review.openstack.org/#/c/146039/ .
> >  
> > 
> > 
> > ----- Original Message -----
> > > From: "Ian Wienand" <iwienand at redhat.com>
> > > To: "Attila Fazekas" <afazekas at redhat.com>
> > > Cc: "Alvaro Lopez Ortega" <aortega at redhat.com>, "Jeremy Stanley"
> > > <fungi at yuggoth.org>, "Sean Dague" <sean at dague.net>,
> > > "dean Troyer" <dtroyer at gmail.com>
> > > Sent: Friday, January 16, 2015 5:24:38 AM
> > > Subject: upstream f21 devstack test
> > > 
> > > Hi Attila,
> > > 
> > > I don't know if you've seen, but upstream f21 testing is happening for
> > > devstack jobs.  As an experimental job I was getting good runs, but in
> > > the last day and a bit, all runs have started failing.
> > > 
> > > The failing tests are varied; a small sample I pulled:
> > > 
> > > [1]
> > > tempest.thirdparty.boto.test_ec2_instance_run.InstanceRunTest.test_compute_with_volumes
> > > [2]
> > > tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern[compute,image,network]
> > > [3]
> > > tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance[compute,image,network]
> > > 
> > > The common thread is that they can't ssh to the cirros instance that was
> > > started up.
> > > 
> > > So far I cannot replicate this locally.  I know there were some
> > > firewalld/neutron issues, but this is not a neutron job.
> > > 
> > > Unfortunately, I'm about to head out the door on PTO until 2015-01-27.
> > > I don't like the idea of this being broken while I don't have time to
> > > look at it, so I'm hoping you can help out.
> > > 
> > > There is a failing f21 machine on hold at
> > > 
> > >  jenkins at xx.yy.zz.qq
> > Sanitized.
> > > 
> > > I've attached a private key that should let you log in.  This
> > > particular run failed in [4]:
> > > 
> > >  tempest.thirdparty.boto.test_ec2_instance_run.InstanceRunTest.test_compute_with_volumes
> > >  tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario[compute,image,network,volume]
> > > 
> > > Sorry I haven't got very far in debugging this.  Nothing obviously
> > > jumped out at me in the logs, but I only had a brief look.  I'm hoping
> > > that, as the best tempest guy I know, you can find some time to take a
> > > look at this in my absence :)
> > > 
> > > Thanks,
> > > 
> > > -i
> > > 
> > > [1]
> > > http://logs.openstack.org/03/147303/1/check/check-tempest-dsvm-f21/3d0c86d/console.html
> > > [2]
> > > http://logs.openstack.org/09/147209/2/check/check-tempest-dsvm-f21/83444c9/console.html
> > > [3]
> > > http://logs.openstack.org/71/141971/5/check/check-tempest-dsvm-f21/95b1574/console.html
> > > [4] https://jenkins06.openstack.org/job/check-tempest-dsvm-f21/8/console
> > > 
> > 
> 


