[openstack-dev] [infra] High si/sys values via top in instances

Bob Hansen hansenb at us.ibm.com
Fri Mar 25 20:10:14 UTC 2016



Looking for some help to figure out what's going on here. I'm in the process
of creating a third-party CI system for our project. Initially I'm trying to
set up 6 manually created Jenkins slaves, built with diskimage-builder and
puppet, to run gate jobs; I'll scale from there and eventually move to
nodepool.

I don't think this is specific to devstack-gate; I suspect it will happen
with any system activity that stresses the instance.

My setup is as follows:

Physical servers (2): Intel, 1 socket, 12 cores, 128 GB RAM.
OpenStack Liberty installed as a 3-node setup per the Liberty installation
guide: 1 controller, 1 compute/network node (96 GB RAM), and a 2nd compute
node (96 GB RAM). The OpenStack controller and compute node guests were
created by hand with libvirt on the respective physical servers.
Using a provider network with linuxbridge.
The backing store for the Jenkins slaves and the OpenStack Liberty nodes is
the local file system.
Jenkins slaves are built with puppet and their images with diskimage-builder,
the standard third-party setup described in the CI documentation. Each
Jenkins slave has 4 vCPUs and 8 GB of RAM.
I have verified KVM acceleration is being used. All VM definitions use
virtio for network and disk, and virtio-pci is installed. All VMs use CPU
mode host-passthrough in the libvirt XML describing them; a quick check is
sketched below.
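
For what it's worth, this is roughly how I spot-check those settings against
a slave's domain XML (run on the compute node; the domain name here is just
an example):

# "jenkins-slave-1" is an illustrative domain name; this confirms the disk
# and NIC are virtio and the CPU mode is host-passthrough in the libvirt XML.
virsh dumpxml jenkins-slave-1 | grep -E "cpu mode|model type='virtio'|bus='virtio'"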

Trying to keep it simple as I learn the ropes...

All systems are using kernel 3.19.0-56-generic #62~14.04.1-Ubuntu SMP on
Ubuntu 14.04.4 LTS (I've seen the same thing on earlier kernels and earlier
14.04 versions).

My issue is as follows:

If I create a single Jenkins slave on a single compute node, the basic setup
time to run devstack-gate (ignoring tempest, though a similar thing happens
there) is roughly 20 minutes, sometimes less. As I scale the number of
Jenkins slaves on the compute node up to 3, the setup time increases
dramatically; the last run I did took nearly an hour. Clearly something is
wrong, as I have not over-committed CPUs or memory on either of the compute
nodes.
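
As a rough sanity check on overcommit, I tally the vCPUs handed to running
guests against the physical cores (a sketch, run on the compute node):

# Sum the vCPU counts of all running guests and compare with the cores
# the compute node actually has.
virsh list --name | xargs -r -n1 virsh dominfo | awk '/^CPU\(s\)/ {total += $2} END {print "vCPUs allocated:", total}'
nproc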

What I'm finding is that the CPUs get overwhelmed as I scale up the Jenkins
slaves: top shows the sys/si percentages eating up the majority of the CPU,
sometimes collectively 70-80% of the CPU time, dropping to what's shown
further below when the system becomes idle.
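
To see where that time actually lands, I watch the per-CPU breakdown inside
a slave while a job runs; something like this (the interval and count are
arbitrary):

# Per-CPU usr/sys/soft/steal breakdown, five 1-second samples,
# run inside a Jenkins slave while devstack-gate is running.
mpstat -P ALL 1 5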

When the systems are idle (after one run), this is a typical view of top:
mongodb is using 9.3% of the CPU, sys is at 9.8%, and si is at 5.2% of the
available CPU (Irix mode off). The compute node and the physical server do
not show this sort of load; they are typically at 1-2% sys and 0% si when
the slaves are idle, though that grows a bit when the slaves are running
the devstack-gate script.

top - 19:39:43 up 1 day, 39 min,  1 user,  load average: 0.65, 1.03, 1.59
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  9.8 sy,  0.0 ni, 77.9 id,  0.3 wa,  0.0 hi,  5.2 si,  6.5 st
KiB Mem:   8175872 total,  2620708 used,  5555164 free,   211212 buffers
KiB Swap:        0 total,        0 used,        0 free.  1665764 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 1402 mongodb   20   0  382064  48232  10912 S  9.3  0.6 162:25.72 mongod
18436 rabbitmq  20   0 2172776  54528   4072 S  4.2  0.7  20:41.26 beam.smp
20059 root      10 -10   20944    420     48 S  2.9  0.0  26:54.20 monitor
20069 root      10 -10   21452    432     48 S  2.6  0.0  25:45.48 monitor
28786 mysql     20   0 2375444 110308  11216 S  2.0  1.3  15:43.30 mysqld
 3731 jenkins   20   0 4113288 114320  21160 S  1.9  1.4  31:01.35 java
    3 root      20   0       0      0      0 S  1.3  0.0  10:29.24 ksoftirqd/0


When the devstack-gate script is running, this is typical. Again, the
compute node showed 0.6 for sy and 0.0 for si when I captured this, and
similarly for the physical server.

top - 19:45:02 up 1 day, 44 min,  1 user,  load average: 14.67, 12.20, 11.20
Tasks: 217 total,   5 running, 212 sleeping,   0 stopped,   0 zombie
%Cpu(s): 18.9 us, 43.5 sy,  0.0 ni,  5.2 id,  0.0 wa,  0.0 hi, 32.0 si,  0.4 st
KiB Mem:   8175872 total,  4970836 used,  3205036 free,   217968 buffers
KiB Swap:        0 total,        0 used,        0 free.  1604240 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  687 jenkins   20   0   78420  21544   3296 R  45.7  0.3   4:17.87 ansible
  676 jenkins   20   0   78556  25556   7116 S  40.1  0.3   4:19.29 ansible
 1368 mongodb   20   0  382064  48508  10896 S  32.2  0.6 207:31.76 mongod
 5060 root      10 -10   20944    420     48 S  14.1  0.0  12:04.99 monitor

Digging deeper with the various perf-related tools (I used vmstat and looked
at /proc/interrupts and mpstat; nothing in the logs), the best clue I can
find is that when idle, mongo is doing this, which is driving up the sy
number. I have yet to figure out what may be driving the si number.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    1.697270        5813       292           select
------ ----------- ----------- --------- --------- ----------------
100.00    1.697270                   292           total

and when a job is running, ansible is doing this:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.72    4.536098        2925      1551           select
  0.28    0.012786           8      1551           poll
------ ----------- ----------- --------- --------- ----------------
100.00    4.548884                  3102           total
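
(For reference, those per-process syscall summaries came from strace -c,
attached to the busy PID for a short window; the PID below is just an
example taken from the top output above.)

# Attach to the running process, follow its threads, and print a syscall
# summary on Ctrl-C. 1368 is the mongod PID from the top output above.
strace -c -f -p 1368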

I'm at a loss on how to figure this out, as this looks like a basic scaling
issue. Suggestions on what to check or look at? Has anyone seen this before?
It appears to be something in the definition of the Jenkins slave, as the
compute node and physical server never seem to be overtaxed. I'm missing
something basic here, or there is a bug somewhere.
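
For the si side, /proc/softirqs shows which softirq class (TIMER, NET_RX,
RCU, ...) is actually climbing; a minimal sketch for comparing a slave
against the compute node:

# Highlight which softirq counters are climbing, refreshing every 2 seconds;
# run inside a slave and on the compute node for comparison.
watch -d -n 2 cat /proc/softirqs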

I moved to the vivid kernel based on information I found describing
something similar with mongo (which wouldn't explain ansible); a fix was
picked up in 3.19.0-45.



Bob H