[openstack-dev] More on the topic of DELIMITER, the Quota Management Library proposal

Amrith Kumar amrith at tesora.com
Sun Apr 17 00:51:32 UTC 2016


First, let me thank Vilobh, Nikhil, Duncan and others who have been
working on this quota management library/service issue. Recently there
have been a number of threads [1], [2], [3] on the subject of quotas and
quota management. This email is partially in response to [3] but since
it is too long for a review comment, and since I believe that it is a
discussion more suited for the ML than the review itself, I’m posting
it to the ML. I will reference it in the review.

I think I can say that there is consensus that if a project is set up to
handle quotas in a standardized way, it should be a library and not a
service. To be clear, I am not implying that there is consensus that
such a project should exist, nor that, if such a capability existed,
everyone would be on board to use it; merely that several have raised
legitimate objections to a service, and that the community in general
seems to believe that the service approach is misguided.

If we therefore assume that this will be a Quota Management Library, it
is safe to assume that quotas are going to be managed on a per-project
basis, where participating projects will use this library. I believe
that it stands to reason that any data persistence will have to be in a
location decided by the individual project. That may not be a very
interesting statement but the corollary is, I think, a very significant
statement; it cannot be assumed that the quota management information
for all participating projects is in the same database.

A hypothetical service consuming the Delimiter library provides
requesters with some widgets, and wishes to track the widgets that it
has provisioned both on a per-user basis and on the whole. It should
therefore be multi-tenant, and able to track the widgets on a per-tenant
basis and, if required, impose limits on the number of widgets that a
tenant may consume at a time, over the course of a period of time, and
so on. Such a hypothetical service may also consume resources from other
services that it wishes to track, and impose limits on.

It is also understood, as Jay Pipes points out in [4], that the actual
process of provisioning widgets could be time-consuming, and that it is
ill-advised to hold a database transaction of any kind open for that
duration of time. Ensuring that a user does not exceed some limit on the
number of concurrent widgets that he or she may create therefore
requires some mechanism to track in-flight requests for widgets. I view
these as “intents” that have not yet materialized.

Looking up at this whole infrastructure from the perspective of the
database, I think we should require that the database never be made to
operate in any isolation mode higher than READ-COMMITTED; more about
that later (i.e. requiring that the database run in either serializable
or repeatable-read mode is a show stopper).

In general, therefore, I believe that the hypothetical service
processing requests for widgets would have to handle three kinds of
operations: provision, modify, and destroy. The names are, I believe,
self-explanatory.

Without loss of generality, one can say that all three of them must
validate that the operation does not violate some limit (no more than X
widgets, no fewer than X widgets, rates, and so on). Assuming that the
service provisions resources from other services, it is also conceivable
that limits be imposed on the quantum of those services consumed. In
practice, I can imagine a service like Trove using the Delimiter project
to perform all of these kinds of limit checks; I’m not suggesting that
it does this today, nor that there is an immediate plan to implement all
of them, just that these all seem like good uses of a Quota Management
capability.

        - User may not have more than 25 database instances at a time
        - User may not have more than 4 clusters at a time
        - User may not consume more than 3TB of SSD storage at a time
        - User may not launch more than 10 huge instances at a time
        - User may not launch more than 3 clusters an hour
        - No more than 500 copies of Oracle may be run at a time

While Nova would be the service that limits the number of instances a
user can have at a time, the value of a service being able to limit this
further should not be underestimated.

In turn, should Nova and Cinder also use the same Quota Management
Library, they may each impose limitations like:

        - User may not launch more than 20 huge instances at a time
        - User may not launch more than 3 instances in a minute
        - User may not consume more than 15TB of SSD at a time
        - User may not have more than 30 volumes at a time
        
Again, I’m not implying that either Nova or Cinder should provide these
capabilities.

With this in mind, I believe that the minimal set of operations that
Delimiter should provide are:

        - define_resource(name, max, min, user_max, user_min, …)
        - update_resource_limits(name, user, user_max, user_min, …)
        - reserve_resource(name, user, size, parent_resource, …)
        - provision_resource(resource, id)
        - update_resource(id or resource, newsize)
        - release_resource(id or resource)
        - expire_reservations()
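
To make that more concrete, here is a rough sketch of what the Python
surface of such a library might look like. All class, method, and
parameter names below are illustrative assumptions on my part, not an
existing Delimiter API:

    class QuotaManager(object):
        """Hypothetical facade over a project's quota tables."""

        def define_resource(self, name, max_total, min_total,
                            user_max, user_min, **kwargs):
            """Register a resource type with global and per-user limits."""

        def update_resource_limits(self, name, user, user_max,
                                   user_min, **kwargs):
            """Override the default per-user limits for a single user."""

        def reserve_resource(self, name, user, size,
                             parent_resource=None, **kwargs):
            """Record an in-flight 'intent'; returns a reservation id."""

        def provision_resource(self, reservation_id, external_id):
            """Confirm a reservation, binding it to the real object's id."""

        def update_resource(self, resource_id, new_size):
            """Resize a provisioned resource, re-checking the quotas."""

        def release_resource(self, resource_id):
            """Release a resource or reservation and all of its children."""

        def expire_reservations(self):
            """Discard reservations that were never provisioned in time."""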

Let me illustrate the way I see these things fitting together. A
hypothetical Trove system may be set up as follows:

        - No more than 2000 database instances in total, 300 clusters in
        total 
        - Users may not launch more than 25 database instances, or 4
        clusters 
        - The particular user ‘amrith’ is limited to 2 databases and 1
        cluster 
        - No user may consume more than 20TB of storage at a time 
        - No user may consume more than 10GB of memory at a time

At startup, I believe that the system would make the following sequence
of calls:

        - define_resource(databaseInstance, 2000, 0, 25, 0, …)
        - update_resource_limits(databaseInstance, amrith, 2, 0, …)
        - define_resource(databaseCluster, 300, 0, 4, 0, …)
        - update_resource_limits(databaseCluster, amrith, 1, 0, …)
        - define_resource(storage, -1, 0, 20TB, 0, …)
        - define_resource(memory, -1, 0, 10GB, 0, …)

Assume that the user john comes along and asks for a cluster with 4
nodes, 1TB of storage per node, and 1GB of memory per node. The system
would go through the following sequence:

        - reserve_resource(databaseCluster, john, 1, None)
                o this returns a resourceID (say cluster-resource-id)
                o the cluster instance that it reserves counts against
                the limit of 300 cluster instances in total, as well as
                the 4 clusters that john can provision. If 'amrith' had
                requested it, that would have been counted against the
                limit of 1 cluster for that user.

        - reserve_resource(databaseInstance, john, 1,
        cluster-resource-id)
        - reserve_resource(databaseInstance, john, 1,
        cluster-resource-id)
        - reserve_resource(databaseInstance, john, 1,
        cluster-resource-id)
        - reserve_resource(databaseInstance, john, 1,
        cluster-resource-id)
                o this returns four resource id’s, let’s say
                instance-1-id,  instance-2-id, instance-3-id,
                instance-4-id
                o note that each instance is just that, an instance by
                itself. It is therefore not right to consider this as
                equivalent to a single call to reserve_resource() with a
                size of 4, especially because each instance could later
                be tracked as an individual Nova instance.

        - reserve_resource(storage, john, 1TB, instance-1-id)
        - reserve_resource(storage, john, 1TB, instance-2-id)
        - reserve_resource(storage, john, 1TB, instance-3-id)
        - reserve_resource(storage, john, 1TB, instance-4-id)

                o each of them returns some resourceID, let’s say they
                returned cinder-1-id, cinder-2-id, cinder-3-id,
                cinder-4-id
                o since the storage of 1TB is a unit, it is treated as
                such. In other words, you don't need to invoke
                reserve_resource 10^12 times, once per byte allocated :)

        - reserve_resource(memory, john, 1GB, instance-1-id)
        - reserve_resource(memory, john, 1GB, instance-2-id)
        - reserve_resource(memory, john, 1GB, instance-3-id)
        - reserve_resource(memory, john, 1GB, instance-4-id)
                o each of these returns something, say
                Dg4KBQcODAENBQEGBAcEDA, CgMJAg8FBQ8GDwgLBA8FAg,
                BAQJBwYMDwAIAA0DBAkNAg, AQMLDA4OAgEBCQ0MBAMGCA. I have
                made up arbitrary strings just to highlight that we
                really don't track these anywhere so we don't care about
                them. 

If all this works, then the system knows that john’s request does not
violate any quotas that it can enforce; it can then go ahead and launch
the instances (calling Nova), provision storage, and so on.

The system then goes and creates four Cinder volumes; these are
cinder-1-uuid, cinder-2-uuid, cinder-3-uuid, and cinder-4-uuid.

It can then go and confirm those reservations:

        - provision_resource(cinder-1-id, cinder-1-uuid)
        - provision_resource(cinder-2-id, cinder-2-uuid)
        - provision_resource(cinder-3-id, cinder-3-uuid)
        - provision_resource(cinder-4-id, cinder-4-uuid)

It could then go and launch 4 Nova instances and similarly provision
those resources, and so on. This process could take some minutes;
holding a database transaction open for that long is the issue that Jay
brings up in [4]. We don’t have to do that in this proposed scheme.

Since the resources are all hierarchically linked through the overall
cluster id, when the cluster is setup, it can finally go and provision
that:

        - provision_resource(cluster-resource-id, cluster-uuid)

When Trove is done with some individual resource, it can go and release
it. Note that I’m thinking release_resource() can be invoked with either
the reservation ID or the ID of the underlying object.

        - release_resource(cinder-4-id), and
        - release_resource(cinder-4-uuid)
        
are therefore identical, and indicate that the fourth 1TB volume is now
released. How this will be implemented in Python (kwargs or some other
mechanism) is, I believe, an implementation detail.

Finally, it releases the cluster resource by doing this:

        - release_resource(cluster-resource-id)
        
This would release the cluster and all dependent resources in a single
operation.
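
Internally, I picture the cascading release as nothing more than a walk
of the parent/child links recorded at reserve_resource() time. A toy
illustration of the idea, using in-memory dicts in place of the real
tables (the data structures here are mine, purely for illustration):

    # reservations: id -> {'parent': id or None, 'resource': name,
    #                      'user': name, 'size': amount}
    # usage:        (resource, user) -> amount currently reserved or used
    def release_resource(reservations, usage, resource_id):
        """Release resource_id and, recursively, everything under it."""
        children = [rid for rid, r in reservations.items()
                    if r['parent'] == resource_id]
        for child in children:
            release_resource(reservations, usage, child)
        entry = reservations.pop(resource_id)
        # hand the quantity back to the per-user counter (a real
        # implementation would adjust the global counter as well)
        usage[(entry['resource'], entry['user'])] -= entry['size']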

A user may wish to modify a resource that was provisioned by the
service. Assume that this results in resizing the instances; it is then
a matter of updating that resource.

Assume that the third 1TB volume is being resized to 2TB; it is then
merely a matter of invoking:

        - update_resource(cinder-3-uuid, 2TB)
        
Delimiter can figure out that cinder-3-uuid is a 1TB device, that this
is therefore an increase of 1TB, and verify that the increase is within
the quotas allowed for the user.
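
Under the hood, I imagine that as little more than a delta computation
against whatever size was recorded at reservation time. Continuing the
toy data structures from the release sketch above (illustrative only):

    # limits: (resource, user) -> maximum allowed for that user
    def update_resource(reservations, usage, limits, resource_id, new_size):
        """Resize a tracked resource, charging only the delta to the quota."""
        entry = reservations[resource_id]
        delta = new_size - entry['size']   # +1TB for the 1TB -> 2TB resize
        key = (entry['resource'], entry['user'])
        if usage[key] + delta > limits[key]:
            raise RuntimeError('quota exceeded for %s' % entry['resource'])
        usage[key] += delta
        entry['size'] = new_size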

The thing that I find attractive about this model of maintaining a
hierarchy of reservations is that in the event of an error, the service
need merely call release_resource() on the highest level reservation and
the Delimiter project can walk down the chain and release all the
resources or reservations as appropriate.
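
In a consuming service, I would expect that to look roughly like the
following; the quotas object is the hypothetical interface sketched
earlier, and the function is illustrative rather than anything Trove
does today:

    def create_cluster(quotas, user, node_count, volume_gb, memory_gb):
        """Reserve everything for a cluster, then build it."""
        cluster_rid = quotas.reserve_resource('databaseCluster', user, 1, None)
        try:
            for _ in range(node_count):
                inst_rid = quotas.reserve_resource('databaseInstance',
                                                   user, 1, cluster_rid)
                quotas.reserve_resource('storage', user, volume_gb, inst_rid)
                quotas.reserve_resource('memory', user, memory_gb, inst_rid)
            # ... call Cinder and Nova here, then provision_resource()
            # each piece as the real UUIDs become known ...
        except Exception:
            # one call unwinds every reservation and resource that was
            # created under the cluster
            quotas.release_resource(cluster_rid)
            raise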

Under the covers, I believe that each of these operations should be
atomic and may update multiple database tables, but these will all be
short-lived operations.

For example, reserving an instance resource would increment the number
of instances for the user as well as the number of instances on the
whole, and this would be an atomic operation. 
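
To make that concrete, here is roughly how I picture one such operation
implemented with SQLAlchemy Core against a READ-COMMITTED database. The
table and column names are invented for illustration, and handling of
-1 (unlimited) limits is omitted:

    from sqlalchemy import MetaData, Table, Column, Integer, String, and_

    metadata = MetaData()

    # One row per (resource, user) pair, plus a row with user='*' that
    # carries the global count and the global limit. Rows are assumed to
    # have been seeded by define_resource()/update_resource_limits().
    usage = Table(
        'quota_usage', metadata,
        Column('resource', String(64), primary_key=True),
        Column('user', String(64), primary_key=True),
        Column('in_use', Integer, nullable=False),
        Column('hard_limit', Integer, nullable=False))

    def reserve(engine, resource, user, size):
        """Bump the per-user and global counters in one short transaction.

        Each UPDATE carries its limit check in the WHERE clause, so a
        concurrent reservation can never push in_use past hard_limit,
        even at READ-COMMITTED isolation.
        """
        with engine.begin() as conn:
            for who in (user, '*'):
                guard = and_(usage.c.resource == resource,
                             usage.c.user == who,
                             usage.c.in_use + size <= usage.c.hard_limit)
                stmt = (usage.update().where(guard)
                        .values(in_use=usage.c.in_use + size))
                if conn.execute(stmt).rowcount != 1:
                    # no row satisfied the guard; leaving the block via
                    # the exception rolls back the whole transaction
                    raise RuntimeError('quota exceeded for %s' % resource)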

I have two primary areas of concern about the proposal [3].

        The first is that it makes the implicit assumption that the
        “flat mode” is implemented. That provides value to a consumer
        but I think it leaves a lot for the consumer to do. For example,
        I find it hard to see how the model proposed would handle the
        release of quotas, let alone the case of a nested release of a
        hierarchy of resources.

        The other is the notion that the implementation will begin a
        transaction, perform a query(), make some manipulations, and
        then do a save(). This makes for an interesting transaction
        management challenge, as it would require the underlying
        database to run in an isolation mode of at least repeatable
        read, and maybe even serializable, which would be a performance
        bear on a heavily loaded system. If run in the traditional
        READ-COMMITTED mode, this would silently lead to
        over-subscription and the violation of quota limits: two
        concurrent requests could each read the same usage value, each
        see headroom remaining, and each write back an incremented
        value that together exceed the limit.

I believe that it should be a requirement that the Delimiter library be
able to run against a database that supports, and is configured for,
READ-COMMITTED, and that it should not require anything higher. The
model proposed above can certainly be implemented with a database
running READ-COMMITTED, and I believe that this remains true even with
the caveat that the operations will be performed through SQLAlchemy.

Thanks,

-amrith

[1] http://openstack.markmail.org/thread/tkl2jcyvzgifniux
[2] http://openstack.markmail.org/thread/3cr7hoeqjmgyle2j
[3] https://review.openstack.org/#/c/284454/
[4] http://markmail.org/message/7ixvezcsj3uyiro6


