[OpenStack-Infra] Dropped connections from static.o.o

Ian Wienand iwienand at redhat.com
Mon Jan 30 04:06:10 UTC 2017


Hi,

We got a report of CI jobs failing with disconnects when
downloading from tarballs.openstack.org.  The file in question is
a largish (~1.3GiB, per the access log below) container image
archive for kolla-kubernetes [1].

ISTR this is not the first time we've had complaints about this, but
I'm not sure if we ever came up with a solution.
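
As a possible client-side stop-gap (just an untested sketch; the URL
and output name are taken from the failing jobs below), the download
could be resumed with "curl -C -" until it completes, since Apache
serves these static files with Range support:

---
# Retry an interrupted download by resuming at the current offset.
url=http://tarballs.openstack.org/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2
for attempt in 1 2 3 4 5; do
    # -f fails on HTTP errors; -C - resumes from the local file's size
    curl -f -C - -o centos-binary-ceph.tar.bz2 "$url" && break
    sleep 5
done
---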

Below are some of the failed jobs, with the client IP and the start
and failure times of the download.

---
http://logs.openstack.org/98/426598/5/check/gate-kolla-kubernetes-deploy-centos-binary-2-ceph-nv/21e4361/console.html
inet 146.20.105.26
2017-01-30 00:00:08.109056 | + curl ... http://tarballs.openstack.org/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2
2017-01-30 00:04:44.002456 | curl: (18) transfer closed with 88598540 bytes remaining to read

http://logs.openstack.org/98/426598/5/check/gate-kolla-kubernetes-deploy-centos-binary-2-ceph-multi-nv/fe0849b/console.html
inet 146.20.105.198
2017-01-30 00:00:08.471434 | + curl -o ... http://tarballs.openstack.org/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2
2017-01-30 00:04:42.685201 | curl: (18) transfer closed with 542002092 bytes remaining to read

http://logs.openstack.org/98/426598/5/check/gate-kolla-kubernetes-deploy-centos-binary-2-external-ovs-nv/38030a1/console.html
inet6 2001:4800:1ae1:18:f816:3eff:fe9b:f9e2/64 
2017-01-30 00:01:07.306370 | + curl -o ... http://tarballs.openstack.org/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2
2017-01-30 00:03:34.810258 | curl: (18) transfer closed with 222546512 bytes remaining to read
---

At first glance there is not much correlation, and two of the three
requests appear not to have been logged at all.

---
root at static:/var/log/apache2# grep 'centos-binary-ceph.tar.bz2' tarballs.openstack.org_*  | grep '30/Jan'
tarballs.openstack.org_access.log:2001:4800:1ae1:18:f816:3eff:fe9e:4ccf - - [30/Jan/2017:00:00:33 +0000] "GET /kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2 HTTP/1.1" 200 1395496732 "-" "curl/7.29.0"
tarballs.openstack.org_access.log:2001:4800:1ae1:18:f816:3eff:fe67:2ca6 - - [30/Jan/2017:00:00:47 +0000] "GET /kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2 HTTP/1.1" 200 1395496732 "-" "curl/7.29.0"
tarballs.openstack.org_access.log:2001:4800:1ae1:18:f816:3eff:fe9b:f9e2 - - [30/Jan/2017:00:01:07 +0000] "GET /kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2 HTTP/1.1" 200 1395496732 "-" "curl/7.29.0"
---
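
One quick way to confirm that is to count each reporting client's
address (taken from the job logs above) in the access log; a sketch,
assuming the client address is the first field of each entry:

---
for ip in 146.20.105.26 146.20.105.198 \
          2001:4800:1ae1:18:f816:3eff:fe9b:f9e2; do
    echo -n "$ip: "
    grep -c "^$ip " tarballs.openstack.org_access.log
done
---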

However, I went to the general Apache error log around that time and
found the following:

---
[Sun Jan 29 23:59:16.284909 2017] [mpm_event:error] [pid 1967:tid 140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:00:02.334021 2017] [mpm_event:error] [pid 1967:tid 140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:00:04.336258 2017] [mpm_event:error] [pid 1967:tid 140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:01:48.449350 2017] [mpm_event:error] [pid 1967:tid 140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:02:25.490781 2017] [mpm_event:error] [pid 1967:tid 140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:03:10.539081 2017] [mpm_event:error] [pid 1967:tid 140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
...
---

I think this is the smoking gun: the scoreboard exhaustion leads to
the death of the serving processes, which gets logged a little later.
Correlating the two, several of these timestamps line up with the
times the failing jobs reported being disconnected; for example, the
job that failed at 00:04:42.685201 above matches the child segfault
logged at 00:04:42.645053 below.

---
[Mon Jan 30 00:03:31.562290 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 14410 exit signal Segmentation fault (11)
[Mon Jan 30 00:03:35.566735 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 20378 exit signal Segmentation fault (11)
...
[Mon Jan 30 00:04:17.614883 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 4126 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:17.614951 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 4204 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:22.621893 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 4290 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:37.638901 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 4358 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:41.643388 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 4324 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:42.645053 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 20485 exit signal Segmentation fault (11)
---

This ungraceful exit might also explain the missing access-log
entries.

Unfortunately, the prognosis for this issue is not great.  The
original bug [2] suggests it is a systemic issue, and it is discussed
in the documentation [3], which says, in short:

 This mpm showed some scalability bottlenecks in the past leading to
 the following error: "scoreboard is full, not at MaxRequestWorkers".
 ...
 From 2.4.24 onward, mpm-event is smarter and it is able to handle
 graceful terminations in a much better way
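
For reference, a quick way to confirm the running version and MPM
(assuming the Debian/Ubuntu apache2 tooling on the server):

---
# Show the server version and the MPM in use
apache2ctl -V | grep -E 'Server version|Server MPM'
# List the MPM module actually loaded from the config
apache2ctl -M | grep -i mpm
---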

We are not in an immediate position to upgrade Apache, and tuning
does not appear to be a reliable solution.  Considering the workload
of this server, avoiding the problem by switching to the worker MPM
seems the expedient fix; I've proposed that in [4], and a sketch of
the switch follows.
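
On a Debian/Ubuntu host the switch itself is just a matter of
swapping the enabled MPM module (a sketch only; [4] is the actual
proposed change):

---
# Only one MPM may be enabled at a time, so disable event first
sudo a2dismod mpm_event
sudo a2enmod mpm_worker
# Sanity-check the configuration before restarting the server
sudo apache2ctl configtest && sudo service apache2 restart
---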

Thanks,

-i

[1] http://tarballs.openstack.org/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2
[2] https://bz.apache.org/bugzilla/show_bug.cgi?id=53555
[3] https://httpd.apache.org/docs/2.4/mod/event.html (Graceful process termination)
[4] https://review.openstack.org/#/c/426639/


