<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">2017-06-19 22:17 GMT+08:00 Jay Pipes <span dir="ltr"><<a href="mailto:jaypipes@gmail.com" target="_blank">jaypipes@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">On 06/19/2017 09:04 AM, Edward Leafe wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Current flow:<br>
* Scheduler gets a req spec from conductor, containing resource requirements<br>
* Scheduler sends those requirements to placement<br>
* Placement runs a query to determine the root RPs that can satisfy those requirements<br>
</blockquote>
<br></span>
Not root RPs. Non-sharing resource providers, which currently effectively means compute node providers. Nested resource providers aren't yet merged, so there is currently no concept of a hierarchy of providers.<span class="gmail-"><br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Placement returns a list of the UUIDs for those root providers to scheduler<br>
</blockquote>
<br></span>
It returns the provider names and UUIDs, yes.<span class="gmail-"><br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Scheduler uses those UUIDs to create HostState objects for each<br>
</blockquote>
<br></span>
Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing in a list of the provider UUIDs it got back from the placement service. The scheduler then builds a set of HostState objects from the results of ComputeNodeList.get_all_by_uuid().<br>
<br>
The scheduler also keeps a set of AggregateMetadata objects in memory, including the association of aggregate to host (note: this is the compute node's *service*, not the compute node object itself, thus the reason aggregates don't work properly for Ironic nodes).<span class="gmail-"><br>
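The hydration step above can be sketched roughly as follows, with simple dataclasses standing in for Nova's ComputeNode and HostState objects; the real internals differ, and build_host_states() is a hypothetical helper, not Nova code.

```python
# Sketch only: ComputeNode and HostState here are simplified stand-ins.
from dataclasses import dataclass


@dataclass
class ComputeNode:
    uuid: str
    host: str
    vcpus: int
    memory_mb: int


@dataclass
class HostState:
    uuid: str
    host: str
    vcpus: int = 0
    memory_mb: int = 0

    def update_from_compute_node(self, cn):
        # Copy the resource view the filters/weighers will consult.
        self.vcpus = cn.vcpus
        self.memory_mb = cn.memory_mb


def build_host_states(all_nodes, provider_uuids):
    # Analogous to ComputeNodeList.get_all_by_uuid(): keep only the
    # compute nodes whose provider UUIDs placement returned, then
    # hydrate a HostState for each.
    wanted = set(provider_uuids)
    states = []
    for cn in all_nodes:
        if cn.uuid in wanted:
            hs = HostState(uuid=cn.uuid, host=cn.host)
            hs.update_from_compute_node(cn)
            states.append(hs)
    return states
```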
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Scheduler runs those HostState objects through filters to remove those that don't meet requirements not selected for by placement<br>
</blockquote>
<br></span>
Yep.<span class="gmail-"><br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Scheduler runs the remaining HostState objects through weighers to order them in terms of best fit.<br>
</blockquote>
<br></span>
Yep.<span class="gmail-"><br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Scheduler takes the host at the top of that ranked list, and tries to claim the resources in placement. If that fails, there is a race, so that HostState is discarded, and the next is selected. This is repeated until the claim succeeds.<br>
</blockquote>
<br></span>
No, this is not how things work currently. The scheduler does not claim resources. It selects the top (or random host depending on the selection strategy) and sends the launch request to the target compute node. The target compute node then attempts to claim the resources and in doing so writes records to the compute_nodes table in the Nova cell database as well as the Placement API for the compute node resource provider.<span class="gmail-"><br>
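The selection step described above (top host, or a random pick among the best few, as with Nova's host_subset_size option) might look something like this sketch; the function name and shape are illustrative, not Nova's actual code, and the claim itself then happens later on the chosen compute node.

```python
import random


def select_destination(weighed_hosts, host_subset_size=1):
    # weighed_hosts is assumed to be sorted best-first by the weighers.
    # With subset size 1 this always returns the top host; a larger
    # subset trades some placement quality for less scheduler collision.
    if not weighed_hosts:
        return None
    subset = weighed_hosts[:max(1, host_subset_size)]
    return random.choice(subset)
```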
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Scheduler then creates a list of N UUIDs, with the first being the selected host, and the rest being alternates consisting of the next hosts in the ranked list that are in the same cell as the selected host.<br>
</blockquote>
<br></span>
This isn't currently how things work, no. This has been discussed, however.<span class="gmail-"><br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Scheduler returns that list to conductor.<br>
* Conductor determines the cell of the selected host, and sends that list to the target cell.<br>
* Target cell tries to build the instance on the selected host. If it fails, it unclaims the resources for the selected host, and tries to claim the resources for the next host in the list. It then tries to build the instance on the next host in the list of alternates. Only when all alternates fail does the build request fail.<br>
</blockquote>
<br></span>
This isn't currently how things work, no. There has been discussion of having the compute node retry alternatives locally, but nothing more than discussion.<span class="gmail-"><br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Proposed flow:<br>
* Scheduler gets a req spec from conductor, containing resource requirements<br>
* Scheduler sends those requirements to placement<br>
* Placement runs a query to determine the root RPs that can satisfy those requirements<br>
</blockquote>
<br></span>
Yes.<span class="gmail-"><br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Placement then constructs a data structure for each root provider as documented in the spec. [0]<br>
</blockquote>
<br></span>
Yes.<span class="gmail-"><br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Placement returns a number of these data structures as JSON blobs. Due to the size of the data, a page size will have to be determined, and placement will have to either maintain that list of structured data for subsequent requests, or re-run the query and only calculate the data structures for the hosts that fit in the requested page.<br>
</blockquote>
<br></span>
"of these data structures as JSON blobs" is kind of redundant... all our REST APIs return data structures as JSON blobs.<br>
<br>
While we discussed the fact that there may be a lot of entries, we did not say we'd immediately support a paging mechanism.<span class="gmail-"><br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Scheduler continues to request the paged results until it has them all.<br>
</blockquote>
<br></span>
See above. This was discussed briefly as a concern, but not as work to do in the first patches.<span class="gmail-"><br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Scheduler then runs this data through the filters and weighers. No HostState objects are required, as the data structures will contain all the information that scheduler will need.<br>
</blockquote>
<br></span>
No, this isn't correct. The scheduler will have *some* of the information it requires for weighing from the returned data from the GET /allocation_candidates call, but not all of it.<br>
<br>
Again, operators have insisted on keeping the flexibility currently in the Nova scheduler to weigh/sort compute nodes by things like thermal metrics and kinds of data that the Placement API will never be responsible for.<br>
<br>
The scheduler will need to merge information from the "provider_summaries" part of the HTTP response with information it has already in its HostState objects (gotten from ComputeNodeList.get_all_by_uuid() and AggregateMetadataList).<span class="gmail-"><br>
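That merge could be sketched along these lines; the dict field names only loosely follow the allocation-candidates spec and are illustrative, and merge_for_weighing() is a hypothetical helper rather than anything in Nova.

```python
def merge_for_weighing(provider_summaries, host_states_by_uuid):
    # Combine placement's view (inventory/usage per provider) with
    # scheduler-local data placement will never track, such as the
    # thermal metrics mentioned above.
    merged = []
    for uuid, summary in provider_summaries.items():
        host_state = host_states_by_uuid.get(uuid)
        if host_state is None:
            continue  # provider unknown to this scheduler; skip it
        merged.append({
            "uuid": uuid,
            # inventory/usage figures from the placement response
            "resources": summary.get("resources", {}),
            # scheduler-only inputs for the weighers
            "metrics": host_state.get("metrics", {}),
            "aggregates": host_state.get("aggregates", []),
        })
    return merged
```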
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Scheduler then selects the data structure at the top of the ranked list. Inside that structure is a dict of the allocation data that scheduler will need to claim the resources on the selected host. If the claim fails, the next data structure in the list is chosen, and repeated until a claim succeeds.<br>
</blockquote>
<br></span>
Kind of, yes. The scheduler will select a *host* that meets its needs.<br>
<br>
There may be more than one allocation request that includes that host resource provider, because of shared providers and (soon) nested providers. The scheduler will choose one of these allocation requests and attempt a claim of resources by simply PUT /allocations/{instance_uuid} with the serialized body of that allocation request. If 202 returned, cool. If not, repeat for the next allocation request.<span class="gmail-"><br>
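The claim loop described above can be sketched as follows. The HTTP call is injected as a callable so the sketch stays standalone; in Nova it would go through the placement client, and claim_resources() is an illustrative name.

```python
def claim_resources(put, instance_uuid, allocation_requests):
    # Try each candidate allocation request in turn against
    # PUT /allocations/{instance_uuid}; the first 202 wins.
    for alloc_req in allocation_requests:
        status = put('/allocations/%s' % instance_uuid, alloc_req)
        if status == 202:
            return alloc_req  # claim succeeded
        # otherwise another scheduler likely raced us; try the next one
    return None  # all candidates failed; caller reschedules or fails
```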
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Scheduler then creates a list of N of these data structures, with the first being the data for the selected host, and the rest being data structures representing alternates consisting of the next hosts in the ranked list that are in the same cell as the selected host.<br>
</blockquote>
<br></span>
Yes, this is the proposed solution for allowing retries within a cell.</blockquote><div><br></div><div>Would it be possible to use traits to distinguish different cells? Then the retry could be done within the cell by querying placement directly with a trait that indicates the specific cell.</div><div><br></div><div>Those would be custom traits, generated from the cell name.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-"><br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Scheduler returns that list to conductor.<br>
* Conductor determines the cell of the selected host, and sends that list to the target cell.<br>
* Target cell tries to build the instance on the selected host. If it fails, it uses the allocation data in the data structure to unclaim the resources for the selected host, and tries to claim the resources for the next host in the list using its allocation data. It then tries to build the instance on the next host in the list of alternates. Only when all alternates fail does the build request fail.<br></blockquote></span></blockquote><div><br></div><div>On the compute node, will we get rid of the allocation update in the periodic task "update_available_resource"? Otherwise, there will be a race between the claim in the nova-scheduler and that periodic task.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
</blockquote>
<br></span>
I'll let Dan discuss this last part.<br>
<br>
Best,<br>
-jay<div class="gmail-HOEnZb"><div class="gmail-h5"><br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
[0] <a href="https://review.openstack.org/#/c/471927/" rel="noreferrer" target="_blank">https://review.openstack.org/#/c/471927/</a><br>
<br>
<br>
<br>
<br>
<br>
______________________________<wbr>______________________________<wbr>______________<br>
OpenStack Development Mailing List (not for usage questions)<br>
Unsubscribe: <a href="http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe" rel="noreferrer" target="_blank">OpenStack-dev-request@lists.op<wbr>enstack.org?subject:unsubscrib<wbr>e</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" rel="noreferrer" target="_blank">http://lists.openstack.org/cgi<wbr>-bin/mailman/listinfo/openstac<wbr>k-dev</a><br>
<br>
</blockquote>
<br>
</div></div></blockquote></div><br></div></div>