<div dir="ltr">Hi there,<div><br></div><div>We're trying to figure out why, sometimes, rpc_loop takes over 10 seconds to process an iteration when booting instances. So we deployed devstack on a 8GB, 4vCPU VM and did some profiling on the following command:</div><div> </div><div>nova boot --flavor m1.nano --image cirros-0.3.4-x86_64-uec --nic net-name=private --min-count 8 instance<br></div><div><br></div><div>(network private has port_security_enabled set to False to avoid the overhead of setting up sgs) </div><div><br></div><div>Logs showed that sometimes, the network-vif-plugged event was sent by the server ~12 seconds after the vif was detected by ovsdb monitor. Usually, first and second events come faster while the rest take longer. Further analysis showed that rpc_loop iterations take several seconds to complete so, if the vif is detected while iteration X is running, it won't be processed until iteration X+1.</div><div><br></div><div>As an example, I've attached a simplified sequence diagram [0] to show what happened in a particular iteration of my debug (I have full logs and pstat files for this session for those interested). In this example, iteration 76 is going to process two ports while some of the previous spawned machines are being managed by libvirt and so on... At the beginning of iteration 76, a new vif is detected by ovsdb monitor but it won't be processed until 12 seconds later in iteration 77.</div><div><br></div><div>Profiling files show that aggregated CPU time for neutron workers is 97 seconds, while CPU time for ovs agent is 2.1 seconds. Most of its time is spent waiting for RPC so it looks like there's apparently some room for optimization and multiprocessing here.</div><div><br></div><div>According to dstat.log, CPU is at ~90% and there's ~1GB of free RAM. I can't tell whether the hypervisor was swapping or not since I didn't have access to it.</div><div><br></div><div>----system---- ----total-cpu-usage---- ------memory-usage----- -net/total-><br></div><div><div>     time     |usr sys idl wai hiq siq| used  buff  cach  free| recv  send></div></div><div><div>05-02 14:22:50| 89  11   0   0   0   0|5553M    0  1151M 1119M|1808B 1462B></div><div>05-02 14:22:51| 90  10   0   0   0   0|5567M    0  1151M 1106M|1588B  836B></div><div>05-02 14:22:52| 89  11   1   0   0   0|5581M    0  1151M 1092M|3233B 2346B><br></div><div>05-02 14:22:53| 89  10   0   0   0   0|5598M    0  1151M 1075M|2676B 2038B></div><div>05-02 14:22:54| 90  10   0   0   0   0|5600M    0  1151M 1073M|  20k   14k></div><div>05-02 14:22:55| 90   9   0   0   0   0|5601M    0  1151M 1072M|  22k   16k></div></div><div><br></div><div>Also, while having a look at server profiling, around the 33% of the time was spent building SQL queries [1]. Mike Bayer went through this and suggested having a look at baked queries and also submitted a sketch of his proposal [2]. </div><div><br></div><div>I wanted to share these findings with you (probably most of you knew but I'm quite new to OpenStack so It's been a really nice exercise for me to better understand how things work) and gather your feedback about how things can be improved. Also, I'll be happy to share the results and discuss further if you think it's worth during the PTG next week.<br></div><div><br></div><div>Thanks a lot for reading and apologies for such a long email!</div><div><br></div><div>Cheers,</div><div>Daniel </div><div>IRC: dalvarez</div><div><br></div><div>[0] <a href="http://imgur.com/WQqaiYQ">http://imgur.com/WQqaiYQ</a></div><div>[1] <a href="http://imgur.com/6KrfJUC">http://imgur.com/6KrfJUC</a></div><div>[2] <a href="https://review.openstack.org/430973">https://review.openstack.org/430973</a></div></div>