Open Stack

Mon Jan 19 17:02:17 UTC 2015

Per request, I am moving this thread to the openstack-dev list.

So far I have not been able to reproduce the issue, either on the
VM you pointed me to or on any of my own VMs.

Several things I observed on `your` machine:
1. The installed kernel is newer than the one actually running (no known related issue).
2. On the first tempest run (logs are collected in [0]) lp#1353939 was triggered, but it is not related.
3. After trying to reproduce the issue many, many times I hit lp#1411525; the patch
   which introduced it is already reverted.
4. Once I saw 'Returning 400 to user: No nw_info cache associated with instance', which I have not
   seen with nova network for a long time (once in 100 runs).
5. I see a lot of annoying iSCSI-related logging. It is also not related to the connection issue;
   IMHO tgtadm can be considered DEPRECATED, and we should switch to lioadm.

So far I have found no log entry related to the connection issue
 that would be worth searching for on logstash.

The nova network log is not sufficient to figure out the actual netfilter state at any given moment.
According to the log it should have updated the chains with something, but who knows...

With ssh connection issues there is very little you can do as post-mortem analysis.
Tempest normally deletes the related resources, so little evidence remains.
If the issue is reproducible, in some cases it is enough to alter the test so that it does not
destroy the evidence, but very frequently some kind of real debugger is required.
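
As a rough illustration of "do not destroy the evidence" (this is not the actual
tempest code; `keep_on_failure` and its parameters are hypothetical names), the
cleanup could be made conditional on the test body succeeding:

```python
import contextlib


@contextlib.contextmanager
def keep_on_failure(cleanup, log=print):
    """Run `cleanup` only if the wrapped block succeeds.

    Hypothetical helper: on failure the resources are left in place so
    they can be inspected, and the original exception is re-raised.
    """
    try:
        yield
    except Exception:
        log("test failed, skipping cleanup so the resources can be inspected")
        raise
    else:
        cleanup()


if __name__ == "__main__":
    deleted = []
    try:
        with keep_on_failure(lambda: deleted.append("server")):
            raise RuntimeError("ssh timeout")  # stand-in for the real failure
    except RuntimeError:
        pass
    print("deleted resources:", deleted)
```

The obvious cost is leaked resources on every failure, so something like this
only makes sense on a held or private machine.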

Several suspected things:
* The VM was able to acquire an address via DHCP -> successful boot, has L2 connectivity.
* No evidence found of a dead qemu; no special libvirt operation was requested before the failure.
* nnet claims it added the floating IP to br100.
* L3 issue / security group rules?

The basic network debug was removed from tempest [1]. I would recommend reverting that change
so that we at least have an idea whether the interfaces and netfilter were in good shape.
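
For illustration only, a minimal sketch of the kind of state dump I mean. This
is not the code that was removed in [1]; the command list and the function names
are my assumptions, and `iptables-save` normally needs root:

```python
import subprocess

# Assumed set of commands; adjust to whatever is useful on the host.
DEBUG_COMMANDS = (
    ("ip", "addr"),
    ("ip", "route"),
    ("iptables-save",),
)


def run_command(cmd):
    """Return the combined stdout/stderr of cmd, or a failure marker."""
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=30,
        )
        return result.stdout
    except (OSError, subprocess.SubprocessError) as exc:
        return "<failed: %s>" % exc


def collect_network_state(commands=DEBUG_COMMANDS, runner=run_command):
    """Map each command line to its captured output."""
    return {" ".join(cmd): runner(cmd) for cmd in commands}


def format_report(state):
    """Render the collected state as one loggable text block."""
    lines = []
    for name, output in state.items():
        lines.append("==== %s ====" % name)
        lines.append(output.rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(collect_network_state()))
```

Calling something like this right when the ssh check fails would at least record
the interfaces and the netfilter rules before the resources are deleted.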

I also created a VM with firewalld enabled (normally it is not in my devstack setups); the 3
mentioned test cases work fine even after running these tests for hours.
However, '/var/log/firewalld' contains COMMAND_FAILURES, as on `your` VM.

I will try to run more full tempest+nnet jobs on F21 in my env to get more samples for the success rate.

So far I have reproduced 0 ssh failures,
so I will scan the logs [0] from `your` machine again more carefully;
maybe I missed something, maybe those tests interfered with something less obvious.

I'll check the other gate f21 logs (~100 jobs/week)
 to see whether anything happened when the issue started and/or whether the issue still exists.

So, I have nothing useful at the moment, but I have not given up.

[0] http://logs.openstack.org/87/139287/14/check/check-tempest-dsvm-f21/5f3d210/console.html.gz
[1] https://review.openstack.org/#/c/140531/

PS.:
F21's HAProxy is more sensitive to services which stop listening,
and the load will not be evenly balanced.
For a working F21 neutron job a better listener is required: https://review.openstack.org/#/c/146039/ .

----- Original Message -----
> From: "Ian Wienand" <iwienand at redhat.com>
> To: "Attila Fazekas" <afazekas at redhat.com>
> Cc: "Alvaro Lopez Ortega" <aortega at redhat.com>, "Jeremy Stanley" <fungi at yuggoth.org>, "Sean Dague" <sean at dague.net>,
> "dean Troyer" <dtroyer at gmail.com>
> Sent: Friday, January 16, 2015 5:24:38 AM
> Subject: upstream f21 devstack test
> 
> Hi Attila,
> 
> I don't know if you've seen, but upstream f21 testing is happening for
> devstack jobs.  As an experimental job I was getting good runs, but in
> the last day and a bit, all runs have started failing.
> 
> The failing tests are varied; a small sample I pulled:
> 
> [1]
> tempest.thirdparty.boto.test_ec2_instance_run.InstanceRunTest.test_compute_with_volumes
> [2]
> tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern[compute,image,network]
> [3]
> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance[compute,image,network]
> 
> The common thread is that they can't ssh to the cirros instance
> started up.
> 
> So far I can not replicate this locally.  I know there were some
> firewalld/neutron issues, but this is not a neutron job.
> 
> Unfortunately, I'm about to head out the door on PTO until 2015-01-27.
> I don't like the idea of this being broken while I don't have time to
> look at it, so I'm hoping you can help out.
> 
> There is a failing f21 machine on hold at
> 
>  jenkins at xx.yy.zz.qq
Sanitized.
> 
> I've attached a private key that should let you log in.  This
> particular run failed in [4]:
> 
>  tempest.thirdparty.boto.test_ec2_instance_run.InstanceRunTest.test_compute_with_volumes
>  tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario[compute,image,network,volume]
> 
> Sorry I haven't got very far in debugging this.  Nothing obviously
> jumped out at me in the logs, but I only had a brief look.  I'm hoping
> as the best tempest guy I know you can find some time to take a look
> at this in my absence :)
> 
> Thanks,
> 
> -i
> 
> [1]
> http://logs.openstack.org/03/147303/1/check/check-tempest-dsvm-f21/3d0c86d/console.html
> [2]
> http://logs.openstack.org/09/147209/2/check/check-tempest-dsvm-f21/83444c9/console.html
> [3]
> http://logs.openstack.org/71/141971/5/check/check-tempest-dsvm-f21/95b1574/console.html
> [4] https://jenkins06.openstack.org/job/check-tempest-dsvm-f21/8/console
> 

[openstack-dev] upstream f21 devstack test