[OpenStack-Infra] timeout, shells and ansible 2.2

Ian Wienand iwienand at redhat.com
Wed Jan 11 05:53:54 UTC 2017


So I'm trying to test ansible 2.2.1-rc3 on puppetmaster.  If as root
you try

# /root/ianw/ansible/bin/activate # ansible 2.2.1-rc3 venv
# /root/ianw/run_all.sh # run_all but with --check to dry-run

You should see the problem.  ansible-playbook just stops; when you
take a look at the wait channel, you can see it's all blocked on a
signal.

---
root at puppetmaster:/root/ianw/ansible# ps -aefl | grep ansible
0 S root      4202  1010  0  80   0 -  2927 wait   03:06 pts/15   00:00:00 timeout -k 2m 120m ansible-playbook -vvv 
0 T root      4203  4202  0  80   0 - 44381 signal 03:06 pts/15   00:00:02 /root/ianw/ansible/bin/python /root/ianw/
1 T root      4331  4203  0  80   0 - 46533 signal 03:06 pts/15   00:00:00 /root/ianw/ansible/bin/python /root/ianw/
1 T root      4360  4203  0  80   0 - 46596 signal 03:06 pts/15   00:00:00 /root/ianw/ansible/bin/python /root/ianw/
4 T root      4393  4331  0  80   0 - 11005 signal 03:06 pts/15   00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
4 T root      4398  4360  0  80   0 - 11005 signal 03:06 pts/15   00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
1 T root      4400  4203  0  80   0 - 46661 signal 03:06 pts/15   00:00:00 /root/ianw/ansible/bin/python /root/ianw/
4 T root      4410  4400  0  80   0 - 11005 signal 03:06 pts/15   00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
1 T root      4412  4203  0  80   0 - 46661 signal 03:06 pts/15   00:00:00 /root/ianw/ansible/bin/python /root/ianw/
1 T root      4421  4203  0  80   0 - 46725 signal 03:06 pts/15   00:00:00 /root/ianw/ansible/bin/python /root/ianw/
4 T root      4422  4412  0  80   0 - 11005 signal 03:06 pts/15   00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
1 T root      4428  4203  0  80   0 - 46982 signal 03:06 pts/15   00:00:00 /root/ianw/ansible/bin/python /root/ianw/
1 T root      4431  4203  0  80   0 - 46725 signal 03:06 pts/15   00:00:00 /root/ianw/ansible/bin/python /root/ianw/
4 T root      4436  4421  0  80   0 - 11005 signal 03:06 pts/15   00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
1 T root      4438  4203  0  80   0 - 46789 signal 03:06 pts/15   00:00:00 /root/ianw/ansible/bin/python /root/ianw/
4 T root      4446  4438  0  80   0 - 11551 signal 03:06 pts/15   00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
1 T root      4447  4203  0  80   0 - 46792 signal 03:06 pts/15   00:00:00 /root/ianw/ansible/bin/python /root/ianw/
4 T root      4448  4447  0  80   0 - 11109 signal 03:06 pts/15   00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
1 T root      4452  4203  0  80   0 - 46856 signal 03:06 pts/15   00:00:00 /root/ianw/ansible/bin/python /root/ianw/
4 T root      4455  4452  0  80   0 - 11037 signal 03:06 pts/15   00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
4 T root      4461  4431  0  80   0 - 11005 signal 03:06 pts/15   00:00:00 ssh -C -o ControlMaster=auto -o ControlPe
1 T root      4462  4428  0  80   0 - 46725 signal 03:06 pts/15   00:00:00 /root/ianw/ansible/bin/python /root/ianw/
---

stracing any of those gives

---
root at puppetmaster:/root/ianw/ansible# strace -p 4412
Process 4412 attached
--- stopped by SIGTTOU ---
---

I know from experience that SIGTTOU is just bad news, it means crappy
tricky terminal stuff ahead.  I started looking through an strace to
see where this comes up

---
(ansible)root at puppetmaster:/tmp# cat /tmp/output.log  | grep TTOU | grep kill
21029 kill(21029, SIGTTOU <unfinished ...>
---

Having a look at pid 21029 it's an ssh process, that ansible is
launching via some pipe/fifo related system, but it's going to

---
execve("/usr/bin/ssh" 21029 connect(3, {sa_family=AF_INET6,
sin6_port=htons(22), inet_pton(AF_INET6, "2001:4800:7819:105:be76:4eff:fe04:a5b2", ...)
---

Well guess what; this host obviously doesn't have keys deployed and is
throwing up a password prompt.

---
# ssh 2001:4800:7819:105:be76:4eff:fe04:a5b2
root at 2001:4800:7819:105:be76:4eff:fe04:a5b2's password:
---

(The other thing is -- what is this host and why is shade picking it
up?  but we have to handle this.  also maybe at the time it was a
host key unknown message, but same same)

It's at this point we can see ssh opening /dev/tty and then, bang, in
comes our SIGTTOU

---
21029 open("/dev/tty", O_RDWR <unfinished ...>
21029 <... open resumed> )              = 4
21029 ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS <unfinished ...>
21029 <... ioctl resumed> , {B38400 opost isig icanon echo ...}) = 0
21029 ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
21029 ioctl(4, SNDCTL_TMR_CONTINUE or SNDRV_TIMER_IOCTL_GPARAMS or TCSETSF <unfinished ...>
21029 <... ioctl resumed> , {B38400 opost isig icanon echo ...}) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
21029 --- SIGTTOU {si_signo=SIGTTOU, si_code=SI_KERNEL} ---
---

Which is all great EXCEPT it seems there is an, AFAICT, unresolved bug
in "timeout" that means it has incorrectly reset the SIGTTOU handlers
to default (?) so everything just completely stops. [1]

I believe this is going to happen when run_all.sh tries to run [2],
although *maybe* under cron without a terminal it will work?  But
anyway, we'll hit it if we ever need to run run_all.sh by hand.

The thing is, ansible 2.0.2.0 seems to do all the ssh stuff very
differently so this doesn't appear to happen.

At this point, thoughts welcome,

-i

[1] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=15779
[2] https://git.openstack.org/cgit/openstack-infra/system-config/tree/run_all.sh#n41



More information about the OpenStack-Infra mailing list