[OpenStack-Infra] tgt restart fails in Cinder startup "start: job failed to start"

Akihiro Motoki motoki at da.jp.nec.com
Mon Mar 17 11:51:27 UTC 2014


Hi,

While setting up our third-party testing for Neutron,
I sometimes ran into similar issues:
sometimes tgtd failed, and sometimes a node ended up with many nbd devices...
In the end I gave up reusing the same machine with stack/unstack/clean.
As Sean and John suggested, it is better to run each test in a fresh env.

Now, after a test completes, we force a shutdown of the VM that ran
the test, restore a snapshot disk image with a fresh Ubuntu 12.04,
and restart the VM. This really decreases the rate of failures with
unknown errors. I do it with libvirt and a simple shell script.
The script is invoked with the name of the slave where the gate job ran,
after making sure the slave is offline.
(The slave name is passed to the script as a build parameter.)
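
For illustration only, a reset script of this kind could look roughly like
the sketch below. It is not the actual script. It assumes that the libvirt
domain has the same name as the slave, that each VM has a single disk file
under IMAGE_DIR, and that restoring means copying a pristine image over that
file. IMAGE_DIR, BASE_IMAGE, run and reset_slave_vm are hypothetical names.

```shell
#!/bin/sh
# Sketch: reset a test VM to a clean disk image with libvirt.
# Assumptions (not from the thread): domain name == slave name, and the
# paths below. Adjust both to the local setup.
IMAGE_DIR="${IMAGE_DIR:-/var/lib/libvirt/images}"
BASE_IMAGE="${BASE_IMAGE:-$IMAGE_DIR/precise-clean.qcow2}"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reset_slave_vm SLAVE_NAME
# The caller is expected to have taken the slave offline already.
reset_slave_vm() {
    if [ -z "${1:-}" ]; then
        echo "usage: reset_slave_vm SLAVE_NAME" >&2
        return 2
    fi
    slave="$1"
    disk="$IMAGE_DIR/$slave.qcow2"

    # Forced power-off; ignore the error if the domain is not running.
    run virsh destroy "$slave" || true
    # Overwrite the used disk with the pristine image.
    run cp "$BASE_IMAGE" "$disk"
    # Boot the VM again from the clean disk.
    run virsh start "$slave"
}
```

With DRY_RUN=1 set, reset_slave_vm "$SLAVE_NAME" only prints the three
commands, which is a cheap way to check the paths before the first real run.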

Thanks,
Akihiro

(2014/03/13 5:58), Dane Leblanc (leblancd) wrote:
> Hi Roey:
>
> Looks like your suggested changes to /etc/sysctl.conf have done the trick… I haven’t seen the problem with tgtd failing to start since I made this change.
>
> Thanks!
>
> Dane
>
> *From:*Sukhdev Kapur [mailto:sukhdevkapur at gmail.com]
> *Sent:* Monday, March 10, 2014 11:19 PM
> *To:* Roey Chen
> *Cc:* Dane Leblanc (leblancd); Sean Dague; John Griffith; openstack-infra at lists.openstack.org
> *Subject:* Re: [OpenStack-Infra] tgt restart fails in Cinder startup "start: job failed to start"
>
> Hi Roey,
>
> Thanks for the tip. I have made the change according to your suggestion and fired off tests for an overnight run. I will let you know in the morning if this fixes the issue.
>
> Thanks
>
> -Sukhdev
>
> On Mon, Mar 10, 2014 at 4:17 PM, Roey Chen <roeyc at mellanox.com> wrote:
>
> Hi,
>
> Hope this could help,
>
> I've encountered this issue myself not too long ago on an Ubuntu 12.04 host,
>
> it didn't happen again after messing with the Kernel Semaphore Limits parameters [1]:
>
> Adding this [2] line to `/etc/sysctl.conf` seems to do the trick.
>
> - Roey
>
> [1] http://paste.openstack.org/show/73086/
>
> [2] http://paste.openstack.org/show/73082/
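
The contents of the two pastes are not reproduced in this thread, so the
exact values stay unknown here. As a hedged sketch, the helper below labels
the four fields of the kernel.sem setting (SEMMSL, SEMMNS, SEMOPM, SEMMNI, in
that order), which makes it easier to compare a host's limits before and
after such a change. describe_sem_limits is a hypothetical name, and the
numbers in the usage comment are sample input only.

```shell
# Print the four fields of kernel.sem with their names.
# Argument: the contents of /proc/sys/kernel/sem, i.e. four
# whitespace-separated integers.
describe_sem_limits() {
    # Intentional word splitting: the four fields become $1..$4.
    set -- $1
    if [ "$#" -ne 4 ]; then
        echo "expected 4 fields, got $#" >&2
        return 1
    fi
    echo "SEMMSL (max semaphores per array): $1"
    echo "SEMMNS (max semaphores system-wide): $2"
    echo "SEMOPM (max operations per semop call): $3"
    echo "SEMMNI (max semaphore arrays): $4"
}

# On the affected host:
#   describe_sem_limits "$(cat /proc/sys/kernel/sem)"
#   ipcs -s -u      # semaphore arrays and semaphores currently in use
# Sample input only:
#   describe_sem_limits "250 32000 32 128"
# A persistent change is a line of the form
#   kernel.sem = SEMMSL SEMMNS SEMOPM SEMMNI
# in /etc/sysctl.conf, loaded with: sudo sysctl -p
```

Comparing the ipcs usage figures with SEMMNI and SEMMNS after a number of
stack/unstack cycles would show whether the host is actually running into
these limits.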
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> *From:*Dane Leblanc (leblancd) [leblancd at cisco.com]
> *Sent:* Monday, March 10, 2014 11:54 PM
> *To:* Sukhdev Kapur; Sean Dague; John Griffith
> *Cc:* openstack-infra at lists.openstack.org
> *Subject:* Re: [OpenStack-Infra] tgt restart fails in Cinder startup "start: job failed to start"
>
> Sean, John:
>
> I’ve had an experience similar to Sukhdev’s… I had tried doing clean.sh on every run, but that didn’t help prevent the tgt problem, and it doesn’t help recover from it.
>
> Sounds like the best option is to reset the VM for each run.
>
> Thanks,
>
> Dane
>
> *From:*Sukhdev Kapur [mailto:sukhdevkapur at gmail.com]
> *Sent:* Monday, March 10, 2014 4:33 PM
> *To:* Sean Dague
> *Cc:* Dane Leblanc (leblancd); openstack-infra at lists.openstack.org
> *Subject:* Re: [OpenStack-Infra] tgt restart fails in Cinder startup "start: job failed to start"
>
> Hi Sean,
>
> In my case, for every run, I do unstack.sh, clean.sh, sudo rm -rf devstack, sudo rm -rf /opt/stack.
>
> Then I go get everything fresh and stack.sh, and a full run of smoke tests
>
> A few iterations of this sequence will get you into this condition. Once in this condition, neither clean.sh nor unstack.sh helps; it fails solidly, 100% of the time.
> If I reboot the VM, everything works just fine for the next 10-20 cycles until it hits the same condition. So I am planning to modify the script to reboot the VM every two hours or so as a workaround.
> But the underlying problem first occurred close to the Icehouse check-ins. I started to notice it a few days before the Icehouse deadline; prior to that I had been running the same sequence
> without any issue (for several weeks), if that helps any...
>
> -Sukhdev
>
> On Mon, Mar 10, 2014 at 1:07 PM, Sean Dague <sean at dague.net> wrote:
>
> So, honestly, running stack.sh / unstack.sh that many times in a row
> really isn't expected to work in my experience. You should at minimum be
> doing ./clean.sh to try to reset the state further.
>
>          -Sean
>
>
> On 03/10/2014 03:00 PM, Dane Leblanc (leblancd) wrote:
>  > In my case, the base OS is 12.04 Precise.
>  >
>  > The problem is intermittent in that it takes maybe 15 to 20 cycles of unstack/stack to get it into the failure mode, but once in the failure mode, it appears that tgt daemon is 100% dead-in-the-water.
>  >
>  > -----Original Message-----
>  > From: Sean Dague [mailto:sean at dague.net]
>  > Sent: Monday, March 10, 2014 1:49 PM
>  > To: Dane Leblanc (leblancd); openstack-infra at lists.openstack.org
>  > Subject: Re: [OpenStack-Infra] tgt restart fails in Cinder startup "start: job failed to start"
>  >
>  > What base OS? A change was made there recently to better handle debian because we believed (possibly incorrectly) that precise actually had working init scripts.
>  >
>  > It would be interesting to understand if this was a 100% failure, or only intermittent, and what base OS it was on.
>  >
>  >       -Sean
>  >
>  > On 03/10/2014 11:37 AM, Dane Leblanc (leblancd) wrote:
>  >> I don't know if anyone can give me some troubleshooting advice with this issue.
>  >>
>  >> I'm seeing an occasional problem whereby after several DevStack unstack.sh/stack.sh cycles, the tgt daemon (tgtd) fails to start during Cinder startup.  Here's a snippet from the stack.sh log:
>  >>
>  >> 2014-03-10 07:09:45.214 | Starting Cinder
>  >> 2014-03-10 07:09:45.215 | + return 0
>  >> 2014-03-10 07:09:45.216 | + sudo rm -f /etc/tgt/conf.d/stack.conf
>  >> 2014-03-10 07:09:45.217 | + _configure_tgt_for_config_d
>  >> 2014-03-10 07:09:45.218 | + [[ ! -d /etc/tgt/stack.d/ ]]
>  >> 2014-03-10 07:09:45.219 | + is_ubuntu
>  >> 2014-03-10 07:09:45.220 | + [[ -z deb ]]
>  >> 2014-03-10 07:09:45.221 | + '[' deb = deb ']'
>  >> 2014-03-10 07:09:45.222 | + sudo service tgt restart
>  >> 2014-03-10 07:09:45.223 | stop: Unknown instance:
>  >> 2014-03-10 07:09:45.619 | start: Job failed to start
>  >> 2014-03-10 07:09:45.621 | + exit_trap
>  >> 2014-03-10 07:09:45.622 | + local r=1
>  >> 2014-03-10 07:09:45.623 | ++ jobs -p
>  >> 2014-03-10 07:09:45.624 | + jobs=
>  >> 2014-03-10 07:09:45.625 | + [[ -n '' ]]
>  >> 2014-03-10 07:09:45.626 | + exit 1
>  >>
>  >> I tried to restart tgt manually, without success:
>  >>
>  >> jenkins at neutronpluginsci:~$ sudo service tgt restart
>  >> stop: Unknown instance:
>  >> start: Job failed to start
>  >> jenkins at neutronpluginsci:~$ sudo tgtd
>  >> librdmacm: couldn't read ABI version.
>  >> librdmacm: assuming: 4
>  >> CMA: unable to get RDMA device list
>  >> (null): iser_ib_init(3263) Failed to initialize RDMA; load kernel modules?
>  >> (null): fcoe_init(214) (null)
>  >> (null): fcoe_create_interface(171) no interface specified.
>  >> jenkins at neutronpluginsci:~$
>  >>
>  >> The config in /etc/tgt is:
>  >>
>  >> jenkins at neutronpluginsci:/etc/tgt$ ls -l
>  >> total 8
>  >> drwxr-xr-x 2 root root 4096 Mar 10 07:03 conf.d
>  >> lrwxrwxrwx 1 root root   30 Mar 10 06:50 stack.d -> /opt/stack/data/cinder/volumes
>  >> -rw-r--r-- 1 root root   58 Mar 10 07:07 targets.conf
>  >> jenkins at neutronpluginsci:/etc/tgt$ cat targets.conf
>  >> include /etc/tgt/conf.d/*.conf
>  >> include /etc/tgt/stack.d/*
>  >> jenkins at neutronpluginsci:/etc/tgt$ ls conf.d
>  >> jenkins at neutronpluginsci:/etc/tgt$ ls /opt/stack/data/cinder/volumes
>  >> jenkins at neutronpluginsci:/etc/tgt$
>  >>
>  >> I don't know if there's any missing Cinder config in my DevStack localrc files. Here's one that I'm using:
>  >>
>  >> MYSQL_PASSWORD=nova
>  >> RABBIT_PASSWORD=nova
>  >> SERVICE_TOKEN=nova
>  >> SERVICE_PASSWORD=nova
>  >> ADMIN_PASSWORD=nova
>  >> ENABLED_SERVICES=g-api,g-reg,key,n-api,n-crt,n-obj,n-cpu,n-cond,cinder,c-sch,c-api,c-vol,n-sch,n-novnc,n-xvnc,n-cauth,horizon,rabbit
>  >> enable_service mysql
>  >> disable_service n-net
>  >> enable_service q-svc
>  >> enable_service q-agt
>  >> enable_service q-l3
>  >> enable_service q-dhcp
>  >> enable_service q-meta
>  >> enable_service q-lbaas
>  >> enable_service neutron
>  >> enable_service tempest
>  >> VOLUME_BACKING_FILE_SIZE=2052M
>  >> Q_PLUGIN=cisco
>  >> declare -a Q_CISCO_PLUGIN_SUBPLUGINS=(openvswitch nexus)
>  >> declare -A Q_CISCO_PLUGIN_SWITCH_INFO=([10.0.100.243]=admin:Cisco12345:22:neutronpluginsci:1/9)
>  >> NCCLIENT_REPO=git://github.com/CiscoSystems/ncclient.git
>  >> PHYSICAL_NETWORK=physnet1
>  >> OVS_PHYSICAL_BRIDGE=br-eth1
>  >> TENANT_VLAN_RANGE=810:819
>  >> ENABLE_TENANT_VLANS=True
>  >> API_RATE_LIMIT=False
>  >> VERBOSE=True
>  >> DEBUG=True
>  >> LOGFILE=/opt/stack/logs/stack.sh.log
>  >> USE_SCREEN=True
>  >> SCREEN_LOGDIR=/opt/stack/logs
>  >>
>  >> Here are links to a log showing another localrc file that I use, and the corresponding stack.sh log:
>  >>
>  >> http://128.107.233.28:8080/job/neutron/1390/artifact/vpnaas_console_log.txt
>  >> http://128.107.233.28:8080/job/neutron/1390/artifact/vpnaas_stack_sh_log.txt
>  >>
>  >> Does anyone have any advice on how to debug this, or recover from this (beyond rebooting the node)? Or am I missing any Cinder config?
>  >>
>  >> Thanks in advance for any help on this!!!
>  >> Dane
>  >>
>  >>
>  >>
>  >> _______________________________________________
>  >> OpenStack-Infra mailing list
>  >> OpenStack-Infra at lists.openstack.org
>  >> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra
>  >>
>  >
>  >
>  > --
>  > Sean Dague
>  > Samsung Research America
>  > sean at dague.net / sean.dague at samsung.com
>  > http://dague.net
>  >
>

