Open Stack

Tue Mar 15 18:16:17 UTC 2016

On Tue, Mar 15, 2016 at 4:04 AM Roman Prykhodchenko <me at romcheg.me> wrote:

> Fuelers,
>
> I would like to continue the series of "Getting rid of …" emails. This
> time I’d like to talk about statuses of clusters.
>
> The issue with that attribute is that it is not really related to the real
> world and represents nothing. A few months ago I proposed to make
> it more real-world-like [1] by replacing a simple string with an aggregated
> value. However, after task-based deployment was introduced, even that
> approach lost its connection to the real world.
>
> My idea is to get rid of that attribute from a cluster and start working
> with the status of every single node in it. Nevertheless, we only have tasks
> that are executed on nodes now, so we cannot apply the "status" term to
> them. What if we replace that with a sort of boolean value called
> maintenance_mode (or similar) that we will use to tell whether the node is
> operational or not? After that we will be able to use an aggregated
> property for the cluster and check whether there are any nodes that are
> in the process of having tasks performed on them.
>

Yes, we still need an operations attribute. I'm not sure a bool is enough,
but you are quite correct: changing the status of the cluster after
operational == True based on the failure of a specific node is, in
practice, invalid.

At the same time, operational == True does not necessarily mean deployment
succeeded; it's more along the lines of deployment validated. That may mean
further testing has passed, like OSTF, or something more manual, where the
operator wants to do more testing of their own prior to changing the state.

As we venture into the LCM flow, we actually need the status of each
component in addition to the general status of the cluster to determine the
proper course of action on the next operation.
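
To make that concrete, here is a rough sketch of what tracking a
per-component status next to the cluster status could look like. The names
(Status, ClusterState, is_operational) are hypothetical and are not taken
from Fuel or Nailgun; the three states are just one possible split between
"deployment succeeded" and "deployment validated".

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical states, not Fuel's actual status values.
    DEPLOYING = "deploying"
    DEPLOYED = "deployed"        # deployment succeeded, not yet validated
    OPERATIONAL = "operational"  # deployment validated (OSTF or operator)


@dataclass
class ClusterState:
    cluster: Status = Status.DEPLOYING
    # Maps a component name (e.g. "ceph") to its own status.
    components: dict = field(default_factory=dict)

    def is_operational(self, component=None):
        """Check the cluster, or one component if a name is given."""
        if component is None:
            return self.cluster is Status.OPERATIONAL
        status = self.components.get(component, Status.DEPLOYING)
        return status is Status.OPERATIONAL
```

With something like this, the next operation can ask about the cluster and
about the specific component separately.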

For example, nova-compute:
If the cluster is not operational, then we can provision compute nodes and
have them enabled, or active in the scheduler, automatically. However, if the
cluster is operational, a new compute node must be disabled, or otherwise
blocked from the default scheduler, until the node has received validation.
In this case the interpretation of operational is quite simple.
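
As a minimal sketch of that rule (the function name and the returned
strings are made up for illustration, they are not a Fuel or Nova API):

```python
def initial_compute_service_state(cluster_operational):
    """Pick the scheduler state for a newly provisioned compute node.

    Hypothetical helper: a node added to an operational cluster stays
    disabled until it has been validated, otherwise it can go straight in.
    """
    return "disabled" if cluster_operational else "enabled"
```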

For example, ceph:
Here we care less about the status of the cluster (slightly; this example
ignores ceph's impact on nova-compute) and more about the status of the
service. If we deploy ceph-osds when there are fewer than replica-factor
OSD hosts online (3), then we can provision the OSDs similarly to
nova-compute, in that we can bring them all online and active, and data
could be placed on them immediately (more or less). But if the ceph status
is operational, then we have to take a different action: the OSDs have to
be brought in disabled and gradually (probably by the operator) have their
data weight increased, so they don't clog the network with data peering,
which causes the clients many woes.
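
Roughly, and only as a sketch: the helper names, the target weight of 1.0
and the number of steps below are assumptions for illustration, not
anything ceph or Fuel defines.

```python
def initial_osd_weight(ceph_operational, target_weight=1.0):
    """Starting weight for a newly added OSD.

    Hypothetical rule: if ceph is already operational the OSD comes in
    with no data weight; otherwise it can take its full weight at once.
    """
    return 0.0 if ceph_operational else target_weight


def weight_ramp(target_weight=1.0, steps=4):
    """Evenly spaced weights to apply one at a time, ending at the target.

    The step count is an arbitrary assumption; in practice the operator
    would pace this based on how much peering traffic the network can take.
    """
    return [round(target_weight * i / steps, 3) for i in range(1, steps + 1)]
```

For example, weight_ramp(1.0, 4) gives [0.25, 0.5, 0.75, 1.0], and each
step would be applied only once the previous rebalance has settled.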

> Thoughts, ideas?
>
>
> References:
>
> 1. https://blueprints.launchpad.net/fuel/+spec/complex-cluster-status
>
>
> - romcheg
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
-- 

Andrew Woodward

Mirantis

Fuel Community Ambassador

Ceph Community
[openstack-dev] [Fuel] Getting rid of cluster status
