[openstack-dev] [TripleO] Generalising racks :- modelling a datacentre

Keith Basil kbasil at redhat.com
Wed Sep 25 16:31:49 UTC 2013


On Sep 25, 2013, at 10:36 AM, Tomas Sedovic wrote:

> On 09/25/2013 05:15 AM, Robert Collins wrote:
>> One of the major things Tuskar does is model a datacenter - which is
>> very useful for error correlation, capacity planning and scheduling.

Tuskar was designed for general infrastructure modeling within the scope of OpenStack.

Yes, Tuskar could be used to model a datacenter, but that was not its original design goal.  That's not to say modeling a datacenter wouldn't be useful; several of the points, concepts and ideas later in the post are very good.

But in terms of an MVP, we were focused on providing an easy approach for cloud operators wishing to deploy OpenStack.  What we're seeing is a use case where deployments are fairly small (small being 2-30 racks of gear).


>> Long term I'd like this to be held somewhere where it is accessible
>> for schedulers and ceilometer etc. E.g. network topology + switch
>> information might be held by neutron where schedulers can rely on it
>> being available, or possibly held by a unified topology db with
>> scheduler glued into that, but updated by neutron / nova / cinder.
>> Obviously this is a) non-trivial and b) not designed yet.
>> 
>> However, the design of Tuskar today needs to accommodate a few things:
>>  - multiple reference architectures for clouds (unless there really is
>> one true design)

>>  - the fact that today we don't have such an integrated vertical scheduler.

+1 to both, but recognizing that these are long-term asks.

>> So the current Tuskar model has three constructs that tie together to
>> model the DC:
>>  - nodes
>>  - resource classes (grouping different types of nodes into service
>> offerings - e.g. nodes that offer swift, or those that offer nova).
>>  - 'racks'
>> 
>> AIUI the initial concept of Rack was to map to a physical rack, but
>> this rapidly got shifted to be 'Logical Rack' rather than physical
>> rack; I think of Rack as really just a special case of a general
>> modelling problem.
> 
> Yeah. Eventually, we settled on Logical Rack meaning a set of nodes on the same L2 network (in a setup where you would group nodes into isolated L2 segments). Which kind of suggests we come up with a better name.
> 
> I agree there's a lot more useful stuff to model than just racks (or just L2 node groups).

Indeed.  We chose the label "rack" because most folks understand it.  When generating a bill of materials for cloud gear, for example, people tend to think in rack elevations.  The "rack" model breaks down a bit when you consider system-on-chip solutions like Moonshot, where a number of chassis can sit inside a single physical rack.  This prompted further refinement of the concept.  As Tomas mentioned, we have shifted to "logical racks" based on L2 binding between nodes.  Better, more fitting names are welcome.
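
To make "logical rack" concrete, here is a rough sketch in plain Python; the field names are illustrative, not Tuskar's actual schema:

from collections import defaultdict

# Hypothetical node records; fields are illustrative only.
nodes = [
    {"uuid": "n1", "physical_rack": "r1", "chassis": "c1", "l2_segment": "l2-a"},
    {"uuid": "n2", "physical_rack": "r1", "chassis": "c2", "l2_segment": "l2-b"},
    {"uuid": "n3", "physical_rack": "r2", "chassis": None, "l2_segment": "l2-a"},
]


def group_into_logical_racks(nodes):
    """Group nodes by L2 segment -- the current working definition
    of a 'logical rack'."""
    racks = defaultdict(list)
    for node in nodes:
        racks[node["l2_segment"]].append(node["uuid"])
    return dict(racks)


# One physical rack (r1) holds two chassis on different L2 segments,
# so it spans two logical racks -- the Moonshot-style case.
print(group_into_logical_racks(nodes))
# {'l2-a': ['n1', 'n3'], 'l2-b': ['n2']}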


>> From a deployment perspective, if you have two disconnected
>> infrastructures, that's two AZs, and two underclouds: so we know that
>> any one undercloud is fully connected (possibly multiple subnets, but
>> one infrastructure). When would we want to subdivide that?
>> 
>> One case is quick fault aggregation: if a physical rack loses power,
>> rather than having 16 NOC folk independently investigating the same 16
>> down hypervisors, one would prefer to identify that the power to the
>> rack has failed (for non-HA powered racks); likewise if a single
>> switch fails (for non-HA network topologies) you want to identify that
>> that switch is down rather than investigating all the cascaded errors
>> independently.
>> 
>> A second case is scheduling: you may want to put nova instances on the
>> same switch as the cinder service delivering their block devices, when
>> possible, or split VMs serving HA tasks apart. (We currently do this
>> with host aggregates, but being able to do it directly would be much
>> nicer).
>> 
>> Lastly, if doing physical operations like power maintenance or moving
>> racks around in a datacentre, being able to identify machines in the
>> same rack can be super useful for planning, downtime announcements, or
>> host evacuation, and being able to find a specific machine in a DC is
>> also important (e.g. what shelf in the rack, what cartridge in a
>> chassis).
> 
> I agree. However, we should take care not to commit ourselves to building a DCIM just yet.
> 
>> 
>> Back to 'Logical Rack' - you can see then that having a single
>> construct to group machines together doesn't really support these use
>> cases in a systematic fashion: Physical rack modelling supports only a
>> subset of the location/performance/failure use cases, and Logical rack
>> doesn't support them at all: we're missing all the rich data we need
>> to aggregate faults rapidly : power, network, air conditioning - and
>> these things cover both single machine/groups of machines/racks/rows
>> of racks scale (consider a networked PDU with 10 hosts on it - that's a
>> fraction of a rack).
>> 
>> So, what I'm suggesting is that we model the failure and performance
>> domains directly, and include location (which is the incremental data
>> racks add once failure and performance domains are modelled) too. We
>> can separately noodle on exactly what failure domain and performance
>> domain modelling looks like - e.g. the scheduler focus group would be
>> a good place to have that discussion.
> 
> Yeah, I think it's pretty clear that the current Tuskar concept where Racks are the first-class objects isn't going to fly. We should switch our focus to the individual nodes and their grouping and metadata.

I disagree here.  We absolutely must have a first-class object that aggregates a class of nodes.  This is a foundational piece that gives us the leverage to manage things at scale.  I think the name "rack" is the issue, not the concept of an aggregated class of nodes.
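
As a strawman for what such a first-class aggregate could look like (a minimal sketch; the class and method names are mine, not anything in the Tuskar API):

class NodeAggregate(object):
    """A first-class grouping of nodes -- whether we end up calling it a
    rack, a resource class, or something else. Purely illustrative."""

    def __init__(self, name, service_role):
        self.name = name                   # e.g. "storage-1"
        self.service_role = service_role   # e.g. "swift", "nova-compute"
        self.node_uuids = set()

    def add_node(self, uuid):
        self.node_uuids.add(uuid)

    def apply(self, operation):
        """Run an operation (image update, reboot, drain, ...) across every
        node in the group -- the manage-at-scale lever."""
        return [operation(uuid) for uuid in sorted(self.node_uuids)]


swift_group = NodeAggregate("storage-1", "swift")
swift_group.add_node("n1")
swift_group.add_node("n3")
print(swift_group.apply(lambda uuid: "rebooting %s" % uuid))
# ['rebooting n1', 'rebooting n3']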

> I'd like to start with something small and simple that we can improve upon, though. How about just going with freeform tags and key/value metadata for the nodes?

The concept of tagging is great and allows flexibility, but we need to have some concrete objects to start with.
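
For example, the tags and key/value metadata proposal might boil down to something like this (a sketch only; the metadata keys are the well-known ones Tomas suggests: rack, l2-network, power, switch), and it composes fine with concrete aggregates layered on top:

# Hypothetical: each node carries freeform tags plus key/value metadata.
nodes = {
    "n1": {"tags": {"ha"},
           "meta": {"rack": "r1", "l2-network": "l2-a",
                    "power": "pdu-45", "switch": "sw-23"}},
    "n2": {"tags": set(),
           "meta": {"rack": "r1", "l2-network": "l2-b",
                    "power": "pdu-45", "switch": "sw-24"}},
}


def find_nodes(nodes, **criteria):
    """Return node ids whose metadata matches every key/value given."""
    return sorted(nid for nid, node in nodes.items()
                  if all(node["meta"].get(k) == v
                         for k, v in criteria.items()))


print(find_nodes(nodes, switch="sw-23"))   # ['n1']
print(find_nodes(nodes, rack="r1"))        # ['n1', 'n2']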

> We can define some well-known tags and keys to begin with (rack, l2-network, power, switch, etc.), it would be easy to iterate and once we settle on the things we need, we can solidify them more.
> 
> In the meantime, we have the API flexible enough to handle whatever architectures we end up supporting and the UI can provide the appropriate views into the data.
> 
> And this would allow people to add their own criteria that we didn't consider.
> 
>> 
>> E.g. for any node I should be able to ask:
>> - what failure domains is this in? [e.g. power-45, switch-23, ac-15,
>> az-3, region-1]
>> - what locality-of-reference features does this have? [e.g. switch-23,
>> az-3, region-1]
>> - where is it [e.g. DC 2, pod 4, enclosure 2, row 5, rack 3, RU 30,
>> cartridge 40].


>> And then we should be able to slice and dice the DC easily by these aspects:
>> - location: what machines are in DC 2, or DC2 pod 4
>> - performance: what machines are all in region-1, or az-3, or switch-23.
>> - failure: what failure domains do machines X and Y have in common?
>> - failure: if we power off switch-23, what machines will be impacted?
>> 
>> So, what do you think?

The concept of blending logical and physical failure domains is brilliant.  The level of infrastructure awareness gained by this would be powerful - almost a Venn diagram of failure exposure, so to speak.  Scheduling would be very interesting, as well.
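
To illustrate the kinds of queries that blending would enable, a rough sketch (the domain names are made up, and this is not a proposed schema):

# Hypothetical mapping of node -> the failure domains it sits in.
failure_domains = {
    "n1": {"power-45", "switch-23", "ac-15", "az-3", "region-1"},
    "n2": {"power-45", "switch-24", "ac-15", "az-3", "region-1"},
    "n3": {"power-46", "switch-23", "ac-16", "az-3", "region-1"},
}


def shared_domains(a, b):
    """The 'Venn diagram' view: failure exposure two nodes have in common."""
    return failure_domains[a] & failure_domains[b]


def impacted_by(domain):
    """Which machines go down if this domain (switch, PDU, ...) fails?"""
    return sorted(n for n, doms in failure_domains.items() if domain in doms)


print(sorted(shared_domains("n1", "n2")))  # ['ac-15', 'az-3', 'power-45', 'region-1']
print(impacted_by("switch-23"))            # ['n1', 'n3']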

But in the short term, we should stay squarely focused on the original Tuskar design goals and solve the most prevalent deployment use cases.

-k

