[OpenStack-Infra] timeout, shells and ansible 2.2
Ian Wienand
iwienand at redhat.com
Wed Jan 11 05:53:54 UTC 2017
So I'm trying to test ansible 2.2.1-rc3 on puppetmaster. If as root
you try
# /root/ianw/ansible/bin/activate # ansible 2.2.1-rc3 venv
# /root/ianw/run_all.sh # run_all but with --check to dry-run
You should see the problem. ansible-playbook just stops; when you
take a look at the wait channel, you can see it's all blocked on a
signal.
---
root at puppetmaster:/root/ianw/ansible# ps -aefl | grep ansible
0 S root 4202 1010 0 80 0 - 2927 wait 03:06 pts/15 00:00:00 timeout -k 2m 120m ansible-playbook -vvv
0 T root 4203 4202 0 80 0 - 44381 signal 03:06 pts/15 00:00:02 /root/ianw/ansible/bin/python /root/ianw/
1 T root 4331 4203 0 80 0 - 46533 signal 03:06 pts/15 00:00:00 /root/ianw/ansible/bin/python /root/ianw/
1 T root 4360 4203 0 80 0 - 46596 signal 03:06 pts/15 00:00:00 /root/ianw/ansible/bin/python /root/ianw/
4 T root 4393 4331 0 80 0 - 11005 signal 03:06 pts/15 00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
4 T root 4398 4360 0 80 0 - 11005 signal 03:06 pts/15 00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
1 T root 4400 4203 0 80 0 - 46661 signal 03:06 pts/15 00:00:00 /root/ianw/ansible/bin/python /root/ianw/
4 T root 4410 4400 0 80 0 - 11005 signal 03:06 pts/15 00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
1 T root 4412 4203 0 80 0 - 46661 signal 03:06 pts/15 00:00:00 /root/ianw/ansible/bin/python /root/ianw/
1 T root 4421 4203 0 80 0 - 46725 signal 03:06 pts/15 00:00:00 /root/ianw/ansible/bin/python /root/ianw/
4 T root 4422 4412 0 80 0 - 11005 signal 03:06 pts/15 00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
1 T root 4428 4203 0 80 0 - 46982 signal 03:06 pts/15 00:00:00 /root/ianw/ansible/bin/python /root/ianw/
1 T root 4431 4203 0 80 0 - 46725 signal 03:06 pts/15 00:00:00 /root/ianw/ansible/bin/python /root/ianw/
4 T root 4436 4421 0 80 0 - 11005 signal 03:06 pts/15 00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
1 T root 4438 4203 0 80 0 - 46789 signal 03:06 pts/15 00:00:00 /root/ianw/ansible/bin/python /root/ianw/
4 T root 4446 4438 0 80 0 - 11551 signal 03:06 pts/15 00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
1 T root 4447 4203 0 80 0 - 46792 signal 03:06 pts/15 00:00:00 /root/ianw/ansible/bin/python /root/ianw/
4 T root 4448 4447 0 80 0 - 11109 signal 03:06 pts/15 00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
1 T root 4452 4203 0 80 0 - 46856 signal 03:06 pts/15 00:00:00 /root/ianw/ansible/bin/python /root/ianw/
4 T root 4455 4452 0 80 0 - 11037 signal 03:06 pts/15 00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
4 T root 4461 4431 0 80 0 - 11005 signal 03:06 pts/15 00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
1 T root 4462 4428 0 80 0 - 46725 signal 03:06 pts/15 00:00:00 /root/ianw/ansible/bin/python /root/ianw/
---
stracing any of those gives
---
root at puppetmaster:/root/ianw/ansible# strace -p 4412
Process 4412 attached
--- stopped by SIGTTOU ---
---
I know from experience that SIGTTOU is just bad news, it means crappy
tricky terminal stuff ahead. I started looking through an strace to
see where this comes up
---
(ansible)root at puppetmaster:/tmp# cat /tmp/output.log | grep TTOU | grep kill
21029 kill(21029, SIGTTOU <unfinished ...>
---
Having a look at pid 21029 it's an ssh process, that ansible is
launching via some pipe/fifo related system, but it's going to
---
execve("/usr/bin/ssh" 21029 connect(3, {sa_family=AF_INET6,
sin6_port=htons(22), inet_pton(AF_INET6, "2001:4800:7819:105:be76:4eff:fe04:a5b2", ...)
---
Well guess what; this host obviously doesn't have keys deployed and is
throwing up a password prompt.
---
# ssh 2001:4800:7819:105:be76:4eff:fe04:a5b2
root at 2001:4800:7819:105:be76:4eff:fe04:a5b2's password:
---
(The other thing is -- what is this host and why is shade picking it
up? but we have to handle this. also maybe at the time it was a
host key unknown message, but same same)
It's at this point we can see ssh opening /dev/tty and then, bang, in
comes our SIGTTOU
---
21029 open("/dev/tty", O_RDWR <unfinished ...>
21029 <... open resumed> ) = 4
21029 ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS <unfinished ...>
21029 <... ioctl resumed> , {B38400 opost isig icanon echo ...}) = 0
21029 ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
21029 ioctl(4, SNDCTL_TMR_CONTINUE or SNDRV_TIMER_IOCTL_GPARAMS or TCSETSF <unfinished ...>
21029 <... ioctl resumed> , {B38400 opost isig icanon echo ...}) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
21029 --- SIGTTOU {si_signo=SIGTTOU, si_code=SI_KERNEL} ---
---
Which is all great EXCEPT it seems there is an, AFAICT, unresolved bug
in "timeout" that means it has incorrectly reset the SIGTTOU handlers
to default (?) so everything just completely stops. [1]
I believe this is going to happen when run_all.sh tries to run [2],
although *maybe* under cron without a terminal it will work? But
anyway, we'll hit it if we ever need to run run_all.sh by hand.
The thing is, ansible 2.0.2.0 seems to do all the ssh stuff very
differently so this doesn't appear to happen.
At this point, thoughts welcome,
-i
[1] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=15779
[2] https://git.openstack.org/cgit/openstack-infra/system-config/tree/run_all.sh#n41
More information about the OpenStack-Infra
mailing list