[openstack-dev] [TripleO][Tuskar] Icehouse Requirements

Jay Dobies jason.dobies at redhat.com
Mon Dec 9 21:57:52 UTC 2013



> So the question is are we looking at /nodes/ that have a /current
> role/, or are we looking at /roles/ that have some /current nodes/.
>
> My contention is that the role is the interesting thing, and the nodes
> is the incidental thing. That is, as a sysadmin, my hierarchy of
> concerns is something like:
>   A: are all services running
>   B: are any of them in a degraded state where I need to take prompt
> action to prevent a service outage [might mean many things: software
> updates, disk space criticals, a machine failed and we need to scale
> the cluster back up, too much load]
>   C: are there any planned changes I need to make [new software deploy,
> feature request from user, replacing a faulty machine]
>   D: are there long term issues sneaking up on me [capacity planning,
> machine obsolescence]
>
> If we take /nodes/ as the interesting thing, and what they are doing
> right now as the incidental thing, it's much harder to map that onto
> the sysadmin concerns. If we start with /roles/ then we can answer:
>   A: by showing the list of roles and the summary stats (how many
> machines, service status aggregate), role level alerts (e.g. nova-api
> is not responding)
>   B: by showing the list of roles and more detailed stats (overall
> load, response times of services, tickets against services) and a
> list of in-trouble instances in each role - instances with alerts
> against them: low disk, overload, failed service, early-detection
> alerts from hardware
>   C: probably out of our remit for now in the general case, but we need
> to enable some things here like replacing faulty machines
>   D: by looking at trend graphs for roles (not machines), but also by
> looking at the hardware in aggregate - breakdown by age of machines,
> summary data for tickets filed against instances that were deployed to
> a particular machine
>
> C: and D: are (F) category work, but for all but the very last thing,
> it seems clear how to approach this from a roles perspective.
>
> I've tried to approach this using /nodes/ as the starting point, and
> after two terrible drafts I've deleted the section. I'd love it if
> someone could show me how it would work :)
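
Tangent, but to make the role-level summaries in A: and B: above 
concrete to myself, here's a toy sketch of rolling per-instance health 
up to roles (all names and data shapes here are made up, not an actual 
API):

    # Toy sketch: answer "are all services running" per role, not per
    # node. Purely illustrative data shapes.
    from collections import defaultdict

    instances = [
        # (role, node, service_ok, alerts)
        ('overcloud compute', 'node-01', True, []),
        ('overcloud compute', 'node-02', False, ['low disk']),
        ('overcloud control plane', 'node-03', True, []),
    ]

    def role_summary(instances):
        """Aggregate instance health into per-role summary stats."""
        roles = defaultdict(lambda: {'count': 0, 'healthy': 0,
                                     'alerts': []})
        for role, node, ok, alerts in instances:
            s = roles[role]
            s['count'] += 1
            s['healthy'] += 1 if ok else 0
            s['alerts'].extend((node, a) for a in alerts)
        return dict(roles)

    for role, s in sorted(role_summary(instances).items()):
        print('%s: %d/%d healthy, alerts=%r'
              % (role, s['healthy'], s['count'], s['alerts']))

The point being that the node only shows up inside a role's alert 
detail, which matches the hierarchy of concerns above.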
>
>>>>      * Unallocated nodes
>>>>
>>>> This implies an 'allocation' step that we don't have - how about
>>>> 'Idle nodes' or something?
>>>>
>>>> It can be auto-allocation. I don't see a problem with the 'unallocated' term.
>>>
>>> Ok, it's not a biggy. I do think it will frame things poorly and lead
>>> to an expectation about how TripleO works that doesn't match how it
>>> does, but we can change it later if I'm right, and if I'm wrong, well
>>> it won't be the first time :).
>>>
>>
>> I'm interested in the distinction you're making here.  I'd rather get things
>> defined correctly the first time, and it's very possible that I'm missing a fundamental
>> definition here.
>
> So we have:
>   - node - a physical, general-purpose machine capable of running in
> many roles. Some nodes may have a hardware layout that is
> particularly useful for a given role.
>   - role - a specific workload we want to map onto one or more nodes.
> Examples include 'undercloud control plane', 'overcloud control
> plane', 'overcloud storage', 'overcloud compute' etc.
>   - instance - A role deployed on a node - this is where work actually happens.
>   - scheduling - the process of deciding which role is deployed on which node.

This glossary is really handy to make sure we're all speaking the same 
language.
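
To make sure I'm internalizing the terms the same way, here's a tiny 
toy model of them (purely illustrative, not actual TripleO/Tuskar 
code):

    # Illustrative only: the glossary terms as a toy Python model.
    class Node(object):
        """A physical, general-purpose machine."""
        def __init__(self, name):
            self.name = name

    class Role(object):
        """A workload we want mapped onto one or more nodes."""
        def __init__(self, name, count):
            self.name = name    # e.g. 'overcloud compute'
            self.count = count  # desired number of instances (policy)

    class Instance(object):
        """A role deployed on a node - where work actually happens."""
        def __init__(self, role, node):
            self.role, self.node = role, node

    def schedule(role, free_nodes):
        """Scheduling: decide which nodes a role's instances land on."""
        picked, rest = free_nodes[:role.count], free_nodes[role.count:]
        return [Instance(role, n) for n in picked], rest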

> The way TripleO works is that we define a Heat template that lays out
> policy: '5 instances of overcloud control plane, please', '20
> hypervisors', etc. Heat passes that to Nova, which pulls the image
> for the role out of Glance, picks a node, and deploys the image to
> the node.
>
> Note in particular the order: Heat -> Nova -> Scheduler -> Node chosen.
>
> The user action is not 'allocate a Node to the overcloud control
> plane', it is 'size the control plane through Heat'.
>
> So when we talk about 'unallocated Nodes', the implication is that
> users 'allocate Nodes', but they don't: they size roles, and after
> doing all that there may be some Nodes that are - yes - unallocated,

I'm not sure if I should ask this here or at your point above, but what 
about multi-role nodes? Is there any piece in here that says "The 
policy wants 5 instances, but I can fit two of them on this existing 
underutilized node and three of them on unallocated nodes"? Or, since 
it's all at the image level, do you get just what's in the image, with 
the image as the finest level of granularity?

> or have nothing scheduled to them. So... I'm not debating that we
> should have a list of free hardware - we totally should - I'm debating
> how we frame it. 'Available Nodes' or 'Undeployed machines' or
> whatever. I just want to get away from talking about something
> ([manual] allocation) that we don't offer.
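
As a side note, my rough mental model of 'size the control plane 
through Heat' is something like the following (the template shape and 
heatclient call are from memory and all the names are made up; the 
real overcloud templates differ):

    # Sketch only: express role sizes as stack parameters and let
    # Heat -> Nova -> scheduler pick the nodes.
    from heatclient.client import Client  # python-heatclient

    template = {
        'heat_template_version': '2013-05-23',
        'parameters': {
            'compute_count': {'type': 'number', 'default': 20},
        },
        # ... resources that boot compute_count instances of the
        # 'overcloud compute' image via Nova ...
        'resources': {},
    }

    heat = Client('1', endpoint='http://HEAT_ENDPOINT/v1/TENANT',
                  token='AUTH_TOKEN')
    # The admin action is resizing the role, not allocating nodes:
    heat.stacks.create(stack_name='overcloud',
                       template=template,
                       parameters={'compute_count': 20})

If that matches what you meant, then 'unallocated' really is just 
'whatever the scheduler hasn't picked yet'.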

My only concern here is that we're not talking about cloud users, we're 
talking about admins adminning (we'll pretend it's a word, come with me) 
a cloud. To a cloud user, "give me some power so I can do some stuff" is 
a safe use case if I trust the cloud I'm running on. I trust that the 
cloud provider has taken the proper steps to ensure that my CPU isn't in 
New York and my storage in Tokyo.

The admins setting up an overcloud are the ones providing that trust 
to the eventual cloud users. That's where I feel like more visibility 
and control are going to be desired/appreciated.

I admit what I just said isn't at all concrete. It might even be 
flat-out wrong. I was never an admin, but I've worked on systems 
management software long enough to have the opinion that their levels 
of OCD are legendary. I can't shake the feeling that someone is going 
to slap some fancy new jacked-up piece of hardware onto the network 
with a specific purpose in mind for it. But maybe that's antiquated 
thinking on my part.

> -Rob
>


