Open Stack

Wed Mar 4 12:27:01 UTC 2015

Hi all,

We're seeing very poor nodepool performance under load. Nodes sit in the delete state for a long time (sometimes up to 20 minutes) before they are actually removed, and we see something very similar for node creation. That exhausts our resources very quickly, and our throughput slows to the speed of a snail with heavy shopping.

To try and figure out why, I wrote a little log analysis tool, and here are some graphs from the data.

Individual task time taken
https://s3.amazonaws.com/uploads.hipchat.com/8522/961402/4H008OHlrWf4NLm/task-time.png

This shows the time taken in seconds by each nodepool task (e.g. AddFloatingIPTask). Yes, it's slow, but it's consistent: during high load the tasks only get more densely packed; they don't get slower.

Nodepool task queue size
https://s3.amazonaws.com/uploads.hipchat.com/8522/961402/1S0kAiKGMMQCrpb/queue-size.png

This shows the number of individual nodepool tasks (e.g. AddFloatingIPTask) waiting in the queue. Guess when a load of jobs hit us!
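
For illustration only, a queue-size series like this can be derived from timestamped enqueue and dequeue events. The sketch below is not the log analysis tool mentioned above; the function name `queue_depth` and the `(timestamp, delta)` event shape are hypothetical, and it assumes the log can be reduced to one event when a task is queued (+1) and one when it starts running (-1).

```python
from itertools import accumulate


def queue_depth(events):
    """Return (timestamp, depth) pairs for a list of (timestamp, delta) events.

    delta is +1 when a task is queued and -1 when it leaves the queue.
    The event shape is an assumption for this sketch, not nodepool's log format.
    """
    # Sort by timestamp only; the sort is stable, so events sharing a
    # timestamp keep the order in which they were logged.
    ordered = sorted(events, key=lambda event: event[0])
    depths = accumulate(delta for _, delta in ordered)
    return [(ts, depth) for (ts, _), depth in zip(ordered, depths)]


# Made-up events: two tasks queued at t=0, one starts at t=1, and so on.
sample_events = [(0, +1), (0, +1), (1, -1), (2, +1), (3, -1), (4, -1)]
for ts, depth in queue_depth(sample_events):
    print(ts, depth)
```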

Total node deletion time
https://s3.amazonaws.com/uploads.hipchat.com/8522/961402/ixQxq4U4C5icl2K/deletion-time.png

This shows how long nodes spend in the delete state: from the moment a node moves from used to delete until all of its delete tasks have run and the node is removed. Take a look at what happens when there's a lot of stuff in the queue. Ouchy.

Our 'rate' is the default of 1.0. Any ideas or help would be appreciated!
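
To make the suspected effect concrete, here is a rough model rather than nodepool's actual code. It assumes all tasks go through a single FIFO worker that pauses `rate` seconds between consecutive tasks, and that every task takes the same time. The function name `simulate_burst` and the numbers are made up for illustration, not measured from our logs.

```python
def simulate_burst(n_tasks, task_seconds, rate):
    """Finish time, in seconds after a burst arrives, of each queued task.

    Assumed model (not taken from nodepool's source): one worker, FIFO order,
    a constant task duration, and a pause of `rate` seconds between tasks.
    """
    clock = 0.0
    finish_times = []
    for _ in range(n_tasks):
        clock += task_seconds       # the task itself runs at its usual speed
        finish_times.append(clock)
        clock += rate               # enforced gap before the next task starts
    return finish_times


# Illustrative numbers only: 100 tasks queued at once, 2 s each, rate 1.0.
times = simulate_burst(100, 2.0, 1.0)
print(times[0], times[-1])
```

In this model each task still takes the same time, but the last task in the burst only finishes after everything ahead of it has been served, so time spent waiting grows with queue length. That would be consistent with constant task times alongside rising deletion times, if the single-queue assumption holds.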

Thanks,
Mike

[OpenStack-Infra] Poor nodepool performance under load (help needed!)
