Open Stack

Tue Feb 26 03:06:08 UTC 2013

Hi Joe,

On 26/02/2013, at 1:39 PM, Joe Gordon <jogo at cloudscaling.com> wrote:

> 
> 
> On Mon, Feb 25, 2013 at 6:14 PM, Sam Morrison <sorrison at gmail.com> wrote:
> Hi Joe,
> 
> On 26/02/2013, at 11:19 AM, Joe Gordon <jogo at cloudscaling.com> wrote:
> 
>> On Sun, Feb 24, 2013 at 3:31 PM, Sam Morrison <sorrison at gmail.com> wrote:
>> I have been playing with the AggregateInstanceExtraSpecs filter and can't get it to work.
>> 
>> In our staging environment it works fine with 4 compute nodes, which I have split into 2 aggregates.
>> 
>> When I try to do the same in our production environment which has 80 compute nodes (splitting them again into 2 aggregates) it doesn't work.
>> 
>> nova-scheduler becomes very slow. I scheduled an instance and gave up after 5 minutes; it seemed to be taking ages and the host was at 100% CPU. I also saw about 500 unacknowledged messages in rabbit.
>> 
>> 
>> What does the nova-scheduler log say? Where are the unacknowledged rabbitmq messages sent from?
> 
> Logs are below. Note the large time gaps between host selections; selection is pretty much instantaneous without this filter.
> 
> I can't figure out how to inspect an unacknowledged message in rabbit, but my guess is that they are the compute service updates from all the compute nodes. These updates aren't getting through, and I think this is why the scheduling attempts further down are rejected with "is disabled or has not been heard from in a while".
> 
> Do you see anything that could be an issue? The flags we use for the scheduler are also below:
> 
> Thanks for your help,
> Sam
> 
> 
> It looks like the scheduler issues are related to the rabbitmq issues.   "host 'qh2-rcc77' ... is disabled or has not been heard from in a while"
> 
> What does 'nova host-list' say? Are all the clocks synced up?
>  

Yeah, all the clocks are synced up fine. Running nova-manage service list shows :-) for every service, and the updated at time is correct.

We only have one nova-scheduler. It locks up and runs at 100% CPU. While this is happening, nova-scheduler seems to take the compute service updates off the queue, but it doesn't ack them and, going by the logs, doesn't process them. This is why I suspect the hosts are eventually rejected with a "not been heard from in a while" message.
I believe this is only a symptom, though. The real issue is nova-scheduler locking up: it seems to take 30-60 seconds to process each host and determine whether it passes the filters.
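
To illustrate the suspected failure mode, here is a minimal sketch of a timestamp-based liveness check of the kind that could produce a "not been heard from in a while" rejection. It is not nova's actual code: the function name is_up and the 60-second threshold are assumptions chosen for the example (nova's service_down_time flag plays a similar role). The point is only that if the timestamp the scheduler looks at stops advancing, or goes stale while the scheduler is busy, a healthy host starts to look dead.

```python
from datetime import datetime, timedelta

# Assumed threshold for this sketch; the real value is configurable.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def is_up(last_update, now, down_time=SERVICE_DOWN_TIME):
    """Return True if the last service update is recent enough.

    Hypothetical helper: a service counts as alive only when its most
    recent update is no older than down_time.
    """
    return abs(now - last_update) <= down_time


now = datetime(2013, 2, 26, 3, 6, 8)

# An update seen 10 seconds ago: the host is considered alive.
fresh = now - timedelta(seconds=10)

# The newest update the scheduler has seen is 5 minutes old,
# for example because newer updates were never processed.
stale = now - timedelta(seconds=300)

print(is_up(fresh, now))
print(is_up(stale, now))
```

With a check like this, a scheduler that spends 30-60 seconds per host can end up rejecting hosts whose compute services are in fact running normally.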

Does that make sense? Any other ideas on how to debug? 
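
One possible way to narrow this down is to time each filter call and see where the 30-60 seconds per host goes. The sketch below is only an illustration: timed and ExampleFilter are hypothetical names, and the host_passes(host_state, filter_properties) signature is assumed to match the filter being tested. It uses time.perf_counter, which needs Python 3; on Python 2, time.time could be substituted.

```python
import time
from functools import wraps


def timed(func, report=print):
    """Wrap func so that each call reports how long it took."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Report even if the wrapped call raises.
            elapsed = time.perf_counter() - start
            report("%s took %.3fs" % (func.__name__, elapsed))
    return wrapper


class ExampleFilter(object):
    """Hypothetical stand-in for a scheduler filter."""

    def host_passes(self, host_state, filter_properties):
        # A real filter would inspect host_state here.
        return True


example = ExampleFilter()
timed_host_passes = timed(example.host_passes)
timed_host_passes({"host": "qh2-rcc77"}, {})
```

If one filter dominates the reported time, that is the place to look for repeated database or aggregate metadata lookups.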

Cheers,
Sam

[Openstack] AggregateInstanceExtraSpecs very slow?
