[openstack-dev] More on the topic of DELIMITER, the Quota Management Library proposal

Jay Pipes jaypipes at gmail.com
Mon Apr 18 18:54:02 UTC 2016


On 04/16/2016 05:51 PM, Amrith Kumar wrote:
> If we therefore assume that this will be a Quota Management Library, it
> is safe to assume  that quotas are going to be managed on a per-project
> basis, where participating projects will use this library. I believe
> that it stands to reason that any data persistence will have to be in a
> location decided by the individual project.

Depends on what you mean by "any data persistence". If you are referring 
to the storage of quota values (per user, per tenant, global, etc) I 
think that should be done by the Keystone service. This data is 
essentially an attribute of the user or the tenant or the service 
endpoint itself (i.e. global defaults). This data also rarely changes 
and logically belongs to the service that manages users, tenants, and 
service endpoints: Keystone.

If you are referring to the storage of resource usage records, yes, each 
service project should own that data (and frankly, I don't see a need to 
persist any quota usage data at all, as I mentioned in a previous reply 
to Attila).

> That may not be a very interesting statement but the corollary is, I
> think, a very significant statement; it cannot be assumed that the
> quota management information for all participating projects is in the
> same database.

It cannot be assumed that this information is even in a database at all...

> A hypothetical service consuming the Delimiter library provides
> requesters with some widgets, and wishes to track the widgets that it
> has provisioned both on a per-user basis, and on the whole. It should
> therefore be multi-tenant and able to track the widgets on a per-tenant
> basis and, if required, impose limits on the number of widgets that a
> tenant may consume at a time, during the course of a period of time, and
> so on.

No, this last part is absolutely not what I think quota management 
should be about.

Rate limiting -- i.e. how many requests a particular user can make of an 
API in a given period of time -- should *not* be handled by OpenStack 
API services, IMHO. It is the responsibility of the deployer to handle 
this using off-the-shelf rate-limiting solutions (open source or 
proprietary).

Quotas should only be about the hard limit of different types of 
resources that a user or group of users can consume at a given time.

> Such a hypothetical service may also consume resources from other
> services that it wishes to track, and impose limits on.

Yes, absolutely agreed.

> It is also understood as Jay Pipes points out in [4] that the actual
> process of provisioning widgets could be time consuming and it is
> ill-advised to hold a database transaction of any kind open for that
> duration of time. Ensuring that a user does not exceed some limit on the
> number of concurrent widgets that he or she may create therefore
> requires some mechanism to track in-flight requests for widgets. I view
> these as “intent” but not yet materialized.

It has nothing to do with the amount of concurrent widgets that a user 
can create. It's just about the total number of some resource that may 
be consumed by that user.

As for an "intent", I don't believe tracking intent is the right way to 
go at all. As I've mentioned before, the major problem in Nova's quota 
system is that there are two tables storing resource usage records: the 
*actual* resource usage tables (the allocations table in the new 
resource-providers modeling and the instance_extra, pci_devices and 
instances table in the legacy modeling) and the *quota usage* tables 
(quota_usages and reservations tables). The quota_usages table does not 
need to exist at all, and neither does the reservations table. Don't do 
intent-based consumption. Instead, just consume (claim) by writing a 
record for the resource class consumed on a provider into the actual 
resource usages table and then "check quotas" by querying the *actual* 
resource usages and comparing the SUM(used) values, grouped by resource 
class, against the appropriate quota limits for the user. The 
introduction of the quota_usages and reservations tables to cache usage 
records is the primary reason for the race problems in the Nova (and 
other) quota system. Every time you introduce a caching system for 
highly volatile data (like usage records), you introduce complexity into 
the write path and needlessly have to track the same thing across 
multiple writes to different tables.
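
To make that concrete, here is a minimal sketch of the check side, 
assuming a hypothetical allocations table keyed by consumer and resource 
class (illustrative names, not the actual Nova schema):

import sqlalchemy as sa

# Hypothetical table/column names, just to illustrate checking against
# *actual* usage rather than a cached quota_usages table.
CHECK_USAGE = sa.text("""
    SELECT resource_class, SUM(used) AS total_used
      FROM allocations
     WHERE consumer_uuid = :user_uuid
     GROUP BY resource_class
""")

class QuotaExceeded(Exception):
    pass

def check_quota(conn, user_uuid, requested, limits):
    """Compare actual usage plus the incoming request against the limits."""
    used = {row.resource_class: row.total_used
            for row in conn.execute(CHECK_USAGE, {"user_uuid": user_uuid})}
    for resource_class, amount in requested.items():
        if used.get(resource_class, 0) + amount > limits[resource_class]:
            raise QuotaExceeded(resource_class)

The claim itself is then just an insert of allocation rows; there is no 
separate usage cache to keep in sync.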

> Looking up at this whole infrastructure from the perspective of the
> database, I think we should require that the database must not be
> required to operate in any isolation mode higher than READ-COMMITTED;
> more about that later (i.e. requiring a database run either serializable
> or repeatable read is a show stopper).

This is an implementation detail that is not relevant to the discussion 
about what the interface of a quota library would look like.

> In general therefore, I believe that the hypothetical service processing
> requests for widgets would have to handle three kinds of operations,
> provision, modify, and destroy. The names are, I believe,
> self-explanatory.

Generally, modification of a resource doesn't come into play. The 
primary exception to this is transferring ownership of some resource.

> Without loss of generality, one can say that all three of them must
> validate that the operation does not violate some limit (no more than X
> widgets, no fewer than X widgets, rates, and so on).

No, only the creation (and very rarely the modification) needs any 
validation that a limit could be violated. Destroying a resource never 
needs to be checked for limit violations.

> Assuming that the service provisions resources from other services,
> it is also conceivable that limits be imposed on the quantum of those
> services consumed. In practice, I can imagine a service like Trove
> using the Delimiter project to perform all of these kinds of limit
> checks; I’m not suggesting that it does this today, nor that there is
> an immediate plan to implement all of them, just that these all seem
> like good uses of a Quota Management capability.
>
>          - User may not have more than 25 database instances at a time
>          - User may not have more than 4 clusters at a time
>          - User may not consume more than 3TB of SSD storage at a time

Only if SSD storage is a distinct resource class from DISK_GB. Right 
now, Nova makes no differentiation w.r.t. SSD or HDD or shared vs. local 
block storage.

>          - User may not launch more than 10 huge instances at a time

What is the point of such a limit?

>          - User may not launch more than 3 clusters an hour

-1. This is rate limiting and should be handled by rate-limiting services.

>          - No more than 500 copies of Oracle may be run at a time

Is "Oracle" a resource class?

> While Nova would be the service that limits the number of instances a
> user can have at a time, the ability for a service to limit this further
> should not be underestimated.
>
> In turn, should Nova and Cinder also use the same Quota Management
> Library, they may each impose limitations like:
>
>          - User may not launch more than 20 huge instances at a time

Not a useful limitation IMHO.

>          - User may not launch more than 3 instances in a minute

-1. This is rate limiting.

>          - User may not consume more than 15TB of SSD at a time
>          - User may not have more than 30 volumes at a time
>
> Again, I’m not implying that either Nova or Cinder should provide these
> capabilities.
>
> With this in mind, I believe that the minimal set of operations that
> Delimiter should provide are:
>
>          - define_resource(name, max, min, user_max, user_min, …)

What would the above do? What service would it be speaking to?

>          - update_resource_limits(name, user, user_max, user_min, …)

This doesn't belong in a quota library. It belongs as a REST API in 
Keystone.

>          - reserve_resource(name, user, size, parent_resource, …)

This doesn't belong in a quota library at all. I think reservations are 
not germane to resource consumption and should be handled by an external 
service at the orchestration layer.

>          - provision_resource(resource, id)

A quota library should not be provisioning anything. A quota library 
should simply provide a consistent interface for *checking* that a 
structured request for some set of resources *can* be provided by the 
service.

>          - update_resource(id or resource, newsize)

Resizing resources is a bad idea, IMHO. Resources are easier to deal 
with when they are considered immutable in size and simple (i.e. not 
complex or nested). I think the problem here is really an improper 
definition of the resource classes.

For example, a "cluster" is not a resource. It is a collection of 
resources of type node. "Resizing" a cluster is a misnomer, because you 
aren't resizing a resource at all. Instead, you are creating or 
destroying resources inside the cluster (i.e. joining or leaving cluster 
nodes).

BTW, this is also why the "resize instance" API in Nova is such a giant 
pain in the ass. It's attempting to "modify" the instance "resource" 
when the instance isn't really the resource at all. The VCPU, RAM_MB, 
DISK_GB, and PCI devices are the actual resources. The instance is a 
convenient way to tie those resources together, and doing a "resize" of 
the instance behind the scenes actually performs a *move* operation, 
which isn't a *change* of the original resources. Rather, it is a 
creation of a new set of resources (of the new amounts) and a deletion 
of the old set of resources.

The "resize" API call adds some nasty confirmation and cancel semantics 
to the calling interface that hint that the underlying implementation of 
the "resize" operation is in actuality not a resize at all, but rather a 
create-new-and-delete-old-resources operation.
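
To put that in sketch form (all names hypothetical), the "resize" is 
really:

# Hypothetical sketch: a "resize" against immutable resources is really
# claim-new-then-delete-old, which is why confirm/revert semantics leak
# into the calling API.
def resize_instance(claim_resources, delete_resources,
                    old_allocations, new_amounts):
    new_allocations = claim_resources(new_amounts)  # create the new set
    # ... move the workload, wait for the user to confirm the resize ...
    delete_resources(old_allocations)               # then drop the old set
    return new_allocations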

>          - release_resource(id or resource)
>          - expire_reservations()

I see no need to have reservations in the quota library at all, as 
mentioned above.

As for your proposed interface and calling structure below, I think a 
much simpler proposal would work better. I'll work on a cross-project 
spec that describes this simpler proposal, but the basics would be:

1) Have Keystone store quota information for defaults (per service 
endpoint), for tenants and for users.

Keystone would have the set of canonical resource class names, and each 
project, upon handling a new resource class, would be responsible for a 
change submitted to Keystone to add the new resource class code.

Straw man REST API:

GET /quotas/resource-classes
200 OK
{
   "resource_classes": {
     "compute.vcpu": {
       "service": "compute",
       "code": "compute.vcpu",
       "description": "A virtual CPU unit"
     },
     "compute.ram_mb": {
       "service": "compute",
       "code": "compute.ram_mb",
       "description": "Memory in megabytes"
     },
     ...
     "volume.disk_gb": {
       "service": "volume",
       "code": "volume.disk_gb",
       "description": "Amount of disk space in gigabytes"
     },
     ...
     "database.count": {
        "service": "database",
        "code": "database.count",
        "description": "Number of database instances"
     }
   }
}

# Get the default limits for new users...
GET /quotas/defaults
200 OK
{
   "quotas": {
     "compute.vcpu": 100,
     "compute.ram_mb": 32768,
     "volume.disk_gb": 1000,
     "database.count": 25
   }
}

# Get a specific user's limits...
GET /quotas/users/{UUID}
200 OK
{
   "quotas": {
     "compute.vcpu": 100,
     "compute.ram_mb": 32768,
     "volume.disk_gb": 1000,
     "database.count": 25
   }
}

# Get a tenant's limits...
GET /quotas/tenants/{UUID}
200 OK
{
   "quotas": {
     "compute.vcpu": 1000,
     "compute.ram_mb": 327680,
     "volume.disk_gb": 10000,
     "database.count": 250
   }
}

2) Have Delimiter communicate with the above proposed new Keystone REST 
API and package up data into an oslo.versioned_objects interface.

Clearly all of the above can be heavily cached both on the server and 
client side since they rarely change but are read often.

The Delimiter library could be used to provide a calling interface for 
service projects to get a user's limits for a set of resource classes:

(please excuse wrongness, typos, and other stuff below; it's just a 
straw man, not production-ready code...)

# file: delimiter/objects/limits.py
import oslo.versioned_objects.base as ovo
import oslo.versioned_objects.fields as ovo_fields


class ResourceLimit(ovo.VersionedObjectBase):
    # 1.0: Initial version
    VERSION = '1.0'

    fields = {
        'resource_class': ovo_fields.StringField(),
        'amount': ovo_fields.IntegerField(),
    }


class ResourceLimitList(ovo.VersionedObjectBase):
    # 1.0: Initial version
    VERSION = '1.0'

    fields = {
        'resources': ovo_fields.ListOfObjectsField('ResourceLimit'),
    }

    @cache_this_heavily
    @remotable_classmethod
    def get_all_by_user(cls, user_uuid):
        """Return a ResourceLimitList that tells the caller what a user's
        absolute limits are for the set of resource classes in the system.
        """
        # Grab a keystone client session object and connect to Keystone
        ks = ksclient.Session(...)
        raw_limits = ks.get_limits_by_user(user_uuid)
        return cls(resources=[ResourceLimit(**d) for d in raw_limits])
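
The @cache_this_heavily decorator above is just a placeholder; a minimal 
stand-in could be an in-process TTL memoizer along these lines (a real 
deployment would more likely use something like oslo.cache):

import functools
import time

def cache_this_heavily(func):
    """Hypothetical placeholder: memoize results for a short TTL, since
    limits rarely change but are read often."""
    ttl_seconds = 300
    _cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        now = time.time()
        hit = _cache.get(args)
        if hit is not None and now - hit[0] < ttl_seconds:
            return hit[1]
        value = func(*args)
        _cache[args] = (now, value)
        return value
    return wrapper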

3) Each service project would be responsible for handling the 
consumption of a set of requested resource amounts in an atomic and 
consistent way. The Delimiter library would return the limits; the 
service would pre-check them before claiming the resources and then 
either post-check after the claim or use a compare-and-update technique 
with a generation/timestamp during claiming to prevent race conditions 
(a sketch of the latter follows the example below).

For instance, in Nova with the new resource providers database schema 
and doing claims in the scheduler (a proposed change), we might do 
something to the effect of:

from delimiter import objects as delim_obj
from delimiter import exceptions as delim_exc
from nova import objects as nova_obj

request = nova_obj.RequestSpec.get_by_uuid(request_uuid)
requested = request.resources
limits = delim_obj.ResourceLimitList.get_all_by_user(user_uuid)
allocations = nova_obj.AllocationList.get_all_by_user(user_uuid)

# Index the limits and the current usage by resource class
limit_by_class = {r.resource_class: r.amount for r in limits.resources}
used_by_class = {a.resource_class: a.used for a in allocations.resources}

# Pre-check for violations
for resource_class, requested_amount in requested.items():
    resource_limit = limit_by_class[resource_class]
    resource_used = used_by_class.get(resource_class, 0)
    if (resource_used + requested_amount) > resource_limit:
        raise delim_exc.QuotaExceeded

# Do claims in scheduler in an atomic, consistent fashion...
claims = scheduler_client.claim_resources(request)

# Post-check for violations
allocations = nova_obj.AllocationList.get_all_by_user(user_uuid)
# allocations now include the claimed resources from the scheduler
used_by_class = {a.resource_class: a.used for a in allocations.resources}

for resource_class in requested:
    resource_limit = limit_by_class[resource_class]
    resource_used = used_by_class.get(resource_class, 0)
    if resource_used > resource_limit:
        # Delete the allocation records for the resources just claimed
        delete_resources(claims)
        raise delim_exc.QuotaExceeded
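
And, for completeness, a minimal sketch of the compare-and-update 
variant mentioned above, assuming a hypothetical generation column on 
the provider (in the spirit of the resource-providers work):

import sqlalchemy as sa

# Bump the provider generation only if nobody else claimed against it
# since we read it; zero rows updated means we lost the race and must
# re-read the usage and retry. (Hypothetical table/column names.)
CLAIM = sa.text("""
    UPDATE resource_providers
       SET generation = generation + 1
     WHERE id = :provider_id
       AND generation = :generation_read_earlier
""")

def claim_with_generation(conn, provider_id, generation_read_earlier,
                          write_allocation_records):
    result = conn.execute(CLAIM, {
        "provider_id": provider_id,
        "generation_read_earlier": generation_read_earlier,
    })
    if result.rowcount == 0:
        return False  # lost the race; caller re-reads usage and retries
    write_allocation_records(conn)
    return True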

4) The only other thing that would need to be done for a first go of the 
Delimiter library is some event listener that can listen for changes to 
the quota limits for a user/tenant/default in Keystone. We'd want the 
services to be able to notify someone if a reduction in quota results in 
an over-quota situation.
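
A sketch of what such a listener could look like, assuming the quota 
changes arrive on the regular notification bus under a made-up event 
type (Keystone emits no such notification today):

import oslo_messaging
from oslo_config import cfg

def recheck_usage_against_limits(user_id):
    """Hypothetical hook: re-run the service's quota pre-check for user_id
    and flag an over-quota condition if usage now exceeds the new limits."""
    pass

class QuotaChangeEndpoint(object):
    def info(self, ctxt, publisher_id, event_type, payload, metadata):
        # 'identity.limit.updated' is a made-up event type for illustration.
        if event_type == 'identity.limit.updated':
            recheck_usage_against_limits(payload.get('user_id'))

transport = oslo_messaging.get_notification_transport(cfg.CONF)
targets = [oslo_messaging.Target(topic='notifications')]
listener = oslo_messaging.get_notification_listener(
    transport, targets, [QuotaChangeEndpoint()])
listener.start()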

Anyway, that's my idea. Keep the Delimiter library small and focused on 
describing the limits only, not on the resource allocations. Have the 
Delimiter library present a versioned object interface so that the 
interaction with the data exposed by the Keystone REST API for quotas 
can evolve naturally and smoothly over time.

Best,
-jay

> Let me illustrate the way I see these things fitting together. A
> hypothetical Trove system may be setup as follows:
>
>          - No more than 2000 database instances in total, 300 clusters in
>          total
>          - Users may not launch more than 25 database instances, or 4
>          clusters
>          - The particular user ‘amrith’ is limited to 2 databases and 1
>          cluster
>          - No user may consume more than 20TB of storage at a time
>          - No user may consume more than 10GB of memory at a time
>
> At startup, I believe that the system would make the following sequence
> of calls:
>
>          - define_resource(databaseInstance, 2000, 0, 25, 0, …)
>          - update_resource_limits(databaseInstance, amrith, 2, 0, …)
>          - define_resource(databaseCluster, 300, 0, 4, 0, …)
>          - update_resource_limits(databaseCluster, amrith, 1, 0, …)
>          - define_resource(storage, -1, 0, 20TB, 0, …)
>          - define_resource(memory, -1, 0, 10GB, 0, …)
>
> Assume that the user john comes along and asks for a cluster with 4
> nodes, 1TB storage per node and each node having 1GB of memory, the
> system would go through the following sequence:
>
>          - reserve_resource(databaseCluster, john, 1, None)
>                  o this returns a resourceID (say cluster-resource-ID)
>                  o the cluster instance that it reserves counts against
>                  the limit of 300 cluster instances in total, as well as
>                  the 4 clusters that john can provision. If 'amrith' had
>                  requested it, that would have been counted against the
>                  limit of 2 clusters for the user.
>
>          - reserve_resource(databaseInstance, john, 1,
>          cluster-resource-id)
>          - reserve_resource(databaseInstance, john, 1,
>          cluster-resource-id)
>          - reserve_resource(databaseInstance, john, 1,
>          cluster-resource-id)
>          - reserve_resource(databaseInstance, john, 1,
>          cluster-resource-id)
>                  o this returns four resource id’s, let’s say
>                  instance-1-id,  instance-2-id, instance-3-id,
>                  instance-4-id
>                  o note that each instance is just that, an instance by
>                  itself. It is therefore not right to consider this as
>                  equivalent to a call to reserve_resource() with a size
>                  of 4, especially because each instance could later be
>                  tracked as an individual Nova instance.
>
>          - reserve_resource(storage, john, 1TB, instance-1-id)
>          - reserve_resource(storage, john, 1TB, instance-2-id)
>          - reserve_resource(storage, john, 1TB, instance-3-id)
>          - reserve_resource(storage, john, 1TB, instance-4-id)
>
>                  o each of them returns some resourceID, let’s say they
>                  returned cinder-1-id, cinder-2-id, cinder-3-id,
>                  cinder-4-id
>                  o since the storage of 1TB is a unit, it is treated as
>                  such. In other words, you don't need to invoke
>                  reserve_resource 10^12 times, once per byte allocated :)
>
>          - reserve_resource(memory, john, 1GB, instance-1-id)
>          - reserve_resource(memory, john, 1GB, instance-2-id)
>          - reserve_resource(memory, john, 1GB, instance-3-id)
>          - reserve_resource(memory, john, 1GB, instance-4-id)
>                  o each of these return something, say
>                  Dg4KBQcODAENBQEGBAcEDA, CgMJAg8FBQ8GDwgLBA8FAg,
>                  BAQJBwYMDwAIAA0DBAkNAg, AQMLDA4OAgEBCQ0MBAMGCA. I have
>                  made up arbitrary strings just to highlight that we
>                  really don't track these anywhere so we don't care about
>                  them.
>
> If all this works, then the system knows that John’s request does not
> violate any quotas that it can enforce, it can then go ahead and launch
> the instances (calling Nova), provision storage, and so on.
>
> The system then goes and creates four Cinder volumes, these are
> cinder-1-uuid, cinder-2-uuid, cinder-3-uuid, cinder-4-uuid.
>
> It can then go and confirm those reservations.
>
>          - provision_resource(cinder-1-id, cinder-1-uuid)
>          - provision_resource(cinder-2-id, cinder-2-uuid)
>          - provision_resource(cinder-3-id, cinder-3-uuid)
>          - provision_resource(cinder-4-id, cinder-4-uuid)
>
> It could then go and launch 4 nova instances and similarly provision
> those resources, and so on. This process could take some minutes and
> holding a database transaction open for this is the issue that Jay
> brings up in [4]. We don’t have to in this proposed scheme.
>
> Since the resources are all hierarchically linked through the overall
> cluster id, when the cluster is setup, it can finally go and provision
> that:
>
> - provision_resource(cluster-resource-id, cluster-uuid)
>
> When Trove is done with some individual resource, it can go and release
> it. Note that I’m thinking this will invoke release_resource with the ID
> of the underlying object OR the resource.
>
>          - release_resource(cinder-4-id), and
>          - release_resource(cinder-4-uuid)
>
> are therefore identical and indicate that the 4th 1TB volume is now
> released. How this will be implemented in Python, kwargs or some other
> mechanism is, I believe, an implementation detail.
>
> Finally, it releases the cluster resource by doing this:
>
>          - release_resource(cluster-resource-id)
>
> This would release the cluster and all dependent resources in a single
> operation.
>
> A user may wish to manage a resource that was provisioned from the
> service. Assume that this results in a resizing of the instances, then
> it is a matter of updating that resource.
>
> Assume that the third 1TB volume is being resized to 2TB, then it is
> merely a matter of invoking:
>
>          - update_resource(cinder-3-uuid, 2TB)
>
> Delimiter can go figure out that cinder-3-uuid is a 1TB device and
> therefore this is an increase of 1TB and verify that this is within the
> quotas allowed for the user.
>
> The thing that I find attractive about this model of maintaining a
> hierarchy of reservations is that in the event of an error, the service
> need merely call release_resource() on the highest level reservation and
> the Delimiter project can walk down the chain and release all the
> resources or reservations as appropriate.
>
> Under the covers I believe that each of these operations should be
> atomic and may update multiple database tables but these will all be
> short lived operations.
>
> For example, reserving an instance resource would increment the number
> of instances for the user as well as the number of instances on the
> whole, and this would be an atomic operation.
>
> I have two primary areas of concern about the proposal [3].
>
>          The first is that it makes the implicit assumption that the
>          “flat mode” is implemented. That provides value to a consumer
>          but I think it leaves a lot for the consumer to do. For example,
>          I find it hard to see how the model proposed would handle the
>          release of quotas, let alone the case of a nested release of a
>          hierarchy of resources.
>
>          The other is the notion that the implementation will begin a
>          transaction, perform a query(), make some manipulations, and
>          then do a save(). This makes for an interesting transaction
>          management challenge as it would require the underlying database
>          to run in an isolation mode of at least repeatable reads and
>          maybe even serializable which would be a performance bear on a
>          heavily loaded system. If run in the traditional read-committed
>          mode, this would silently lead to over subscriptions, and the
>          violation of quota limits.
>
> I believe that it should be a requirement that the Delimiter library
> should be able to run against a database that supports, and is
> configured for READ-COMMITTED, and should not require anything higher.
> The model proposed above can certainly be implemented with a database
> running READ-COMMITTED, and I believe that this is also true with the
> caveat that the operations will be performed through SQLAlchemy.
>
> Thanks,
>
> -amrith
>
> [1] http://openstack.markmail.org/thread/tkl2jcyvzgifniux
> [2] http://openstack.markmail.org/thread/3cr7hoeqjmgyle2j
> [3] https://review.openstack.org/#/c/284454/
> [4] http://markmail.org/message/7ixvezcsj3uyiro6
>
