[openstack-dev] [TripleO][Tuskar] Icehouse Requirements

Robert Collins robertc at robertcollins.net
Mon Dec 9 21:24:59 UTC 2013


On 10 December 2013 09:55, Tzu-Mainn Chen <tzumainn at redhat.com> wrote:
>> >        * created as part of undercloud install process

>> By that note I meant that Nodes are not resources; Resource instances
>> run on Nodes. Nodes are the generic pool of hardware we can deploy
>> things onto.
>
> I don't think "resource nodes" is intended to imply that nodes are resources; rather, it's supposed to
> indicate that it's a node where a resource instance runs.  It's supposed to separate it from "management node"
> and "unallocated node".

So the question is: are we looking at /nodes/ that have a /current
role/, or at /roles/ that have some /current nodes/?

My contention is that the role is the interesting thing, and the nodes
are the incidental thing. That is, as a sysadmin, my hierarchy of
concerns is something like:
 A: are all services running
 B: are any of them in a degraded state where I need to take prompt
action to prevent a service outage [this might mean many things: a
software update, disk-space criticals, a machine failed and we need to
scale the cluster back up, too much load]
 C: are there any planned changes I need to make [new software deploy,
feature request from user, replacing a faulty machine]
 D: are there long-term issues sneaking up on me [capacity planning,
machine obsolescence]

If we take /nodes/ as the interesting thing, and what they are doing
right now as the incidental thing, it's much harder to map that onto
the sysadmin concerns. If we start with /roles/ then we can answer
(toy sketch below):
 A: by showing the list of roles and the summary stats (how many
machines, service status aggregate), plus role-level alerts (e.g.
nova-api is not responding)
 B: by showing the list of roles and more detailed stats (overall
load, response times of services, tickets against services) and a
list of in-trouble instances in each role - instances with alerts
against them (low disk, overload, failed service, early-detection
alerts from hardware)
 C: probably out of our remit for now in the general case, but we need
to enable some things here like replacing faulty machines
 D: by looking at trend graphs for roles (not machines), but also by
looking at the hardware in aggregate - breakdown by age of machines,
summary data for tickets filed against instances that were deployed to
a particular machine

C: and D: are (F)-category work, but for everything except that very
last item it seems clear how to approach this from a roles perspective.
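
To make A: and B: concrete, here's a toy sketch (Python; the data
shapes are mine, not Tuskar's) of the role-first view - both answers
fall out of a simple group-by-role over instances:

    from collections import defaultdict

    # Hypothetical instance records: (role, node, service_ok, alerts).
    instances = [
        ("overcloud control plane", "node-1", True, []),
        ("overcloud control plane", "node-2", False,
         ["nova-api not responding"]),
        ("overcloud compute", "node-7", True, ["low disk"]),
    ]

    by_role = defaultdict(list)
    for role, node, ok, alerts in instances:
        by_role[role].append((node, ok, alerts))

    for role, members in sorted(by_role.items()):
        healthy = sum(1 for _, ok, _ in members if ok)
        # A: summary stats per role
        print("%s: %d/%d healthy" % (role, healthy, len(members)))
        # B: the in-trouble instances within each role
        for node, _, alerts in members:
            for alert in alerts:
                print("  %s: %s" % (node, alert))

Nodes only show up as a detail inside a role; starting from nodes you
would have to invert this structure to answer either question.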

I've tried to approach this using /nodes/ as the starting point, and
after two terrible drafts I've deleted the section. I'd love it if
someone could show me how it would work :)

>> >     * Unallocated nodes
>> >
>> > This implies an 'allocation' step that we don't have - how about
>> > 'Idle nodes' or something.
>> >
>> > It can be auto-allocation. I don't see a problem with the 'unallocated' term.
>>
>> Ok, it's not a biggy. I do think it will frame things poorly and lead
>> to an expectation about how TripleO works that doesn't match how it
>> does, but we can change it later if I'm right, and if I'm wrong, well
>> it won't be the first time :).
>>
>
> I'm interested in what the distinction you're making here is.  I'd rather get things
> defined correctly the first time, and it's very possible that I'm missing a fundamental
> definition here.

So we have:
 - node - a physical, general-purpose machine capable of running in
many roles. Some nodes may have a hardware layout that is particularly
useful for a given role.
 - role - a specific workload we want to map onto one or more nodes.
Examples include 'undercloud control plane', 'overcloud control
plane', 'overcloud storage', 'overcloud compute' etc.
 - instance - a role deployed on a node - this is where work actually happens.
 - scheduling - the process of deciding which role is deployed on which node.
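
In toy Python (the class and field names are mine, not Nova's or
Tuskar's), those relationships look like this:

    class Node(object):
        """A physical machine in the generic pool."""
        def __init__(self, hw_profile):
            self.hw_profile = hw_profile

    class Role(object):
        """A workload we want mapped onto nodes, e.g. 'overcloud compute'."""
        def __init__(self, name, image):
            self.name, self.image = name, image

    class Instance(object):
        """A role deployed on a node - where work actually happens."""
        def __init__(self, role, node):
            self.role, self.node = role, node

    def schedule(role, free_nodes):
        """Scheduling: decide which node a role lands on.
        In TripleO this decision belongs to the Nova scheduler,
        not to the user."""
        return Instance(role, free_nodes.pop())

The point is that Instance is derived - nothing in this model has the
user hand-building the role-to-node mapping.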

The way TripleO works is that we define a Heat template that lays out
policy: '5 instances of overcloud control plane please', '20
hypervisors', etc. Heat passes that to Nova, which pulls the image for
the role out of Glance, picks a node, and deploys the image to the
node.

Note in particular the order: Heat -> Nova -> Scheduler -> Node chosen.
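
A minimal sketch of what that policy input could look like (Python;
the resource and image names are made up, and the real TripleO
templates are generated rather than hand-rolled like this):

    import yaml  # PyYAML, only used to dump the toy template

    def sized_template(control_count, compute_count):
        """Build a toy HOT-style template purely from role sizes."""
        resources = {}
        for i in range(control_count):
            resources["control%d" % i] = {
                "type": "OS::Nova::Server",
                "properties": {"image": "overcloud-control",  # via Glance
                               "flavor": "baremetal"}}
        for i in range(compute_count):
            resources["compute%d" % i] = {
                "type": "OS::Nova::Server",
                "properties": {"image": "overcloud-compute",
                               "flavor": "baremetal"}}
        return yaml.safe_dump({"heat_template_version": "2013-05-23",
                               "resources": resources})

    # '5 instances of overcloud control plane please', '20 hypervisors':
    print(sized_template(5, 20))

There is no node name anywhere in the input - the scheduler owns that
choice.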

The user action is not 'allocate a Node to the overcloud control
plane'; it is 'size the control plane through Heat'.
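
Reusing sized_template from the sketch above, resizing is just new
counts handed back to Heat (via something like 'heat stack-update' in
the real CLI) - still no node names in sight:

    # Grow the (hypothetical) overcloud from 20 to 30 hypervisors by
    # regenerating the policy; which nodes get used stays Nova's call.
    new_template = sized_template(control_count=5, compute_count=30)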

So when we talk about 'unallocated Nodes', the implication is that
users 'allocate Nodes', but they don't: they size roles, and after
doing all that there may be some Nodes that are - yes - unallocated,
or have nothing scheduled to them. So... I'm not debating that we
should have a list of free hardware - we totally should - I'm debating
how we frame it. 'Available Nodes' or 'Undeployed machines' or
whatever. I just want to get away from talking about something
([manual] allocation) that we don't offer.

-Rob

-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud


