<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
<br>
<div class="moz-cite-prefix">On 15/02/2016 10:48, Cheng, Yingxin
wrote:<br>
</div>
<blockquote
cite="mid:7C13C62E3E32B841A445DD46E39D3CC98DE4B6@shsmsx102.ccr.corp.intel.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
<meta name="Generator" content="Microsoft Word 15 (filtered
medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:SimSun;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:SimSun;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}
@font-face
{font-family:"\@SimSun";
panose-1:2 1 6 0 3 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
color:black;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
pre
{mso-style-priority:99;
mso-style-link:"HTML Preformatted Char";
margin:0in;
margin-bottom:.0001pt;
font-size:10.0pt;
font-family:"Courier New";
color:black;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
color:black;}
span.EmailStyle18
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:windowtext;}
span.HTMLPreformattedChar
{mso-style-name:"HTML Preformatted Char";
mso-style-priority:99;
mso-style-link:"HTML Preformatted";
font-family:"Consolas",serif;
color:black;}
span.EmailStyle21
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
<div class="WordSection1">
<p class="MsoNormal"><a moz-do-not-send="true"
name="_MailEndCompose"><span style="color:#1F497D">Thanks
Sylvain,<o:p></o:p></span></a></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">1. The ideas
below will be extended into a spec ASAP.<o:p></o:p></span></p>
</div>
</blockquote>
<br>
Nice, looking forward to it then :-)<br>
<blockquote
cite="mid:7C13C62E3E32B841A445DD46E39D3CC98DE4B6@shsmsx102.ccr.corp.intel.com"
type="cite">
<div class="WordSection1">
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">2. Thanks for
raising concerns I had not yet thought of; they will be
addressed in the spec soon.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">3. Let me copy
my thoughts from another thread about the integration with
resource-provider:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">The idea is
that “only the compute node knows its own final compute-node
resource view,” or in other words, “the accurate resource
view only exists at the place where it is actually
consumed.” That is, incremental updates can only come from
the actual “consumption” action, no matter where it happens
(e.g. compute node, storage service, network service, etc.).
To borrow the terms from resource-provider: compute nodes
can maintain an accurate version of their
“compute-node-inventory” cache and send incremental updates,
because they actually consume compute resources; likewise, a
storage service can maintain an accurate version of a
“storage-inventory” cache and send incremental updates if it
consumes storage resources. If there are central services in
charge of consuming all the resources, the accurate cache
and updates must come from them.<o:p></o:p></span></p>
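A minimal Python sketch of this consumption-point idea (all class and field names here are illustrative, not taken from the prototype): the service that actually consumes a resource keeps the authoritative inventory cache and emits incremental, sequence-numbered updates only when a consumption succeeds.

```python
class InventoryCache:
    """Authoritative per-service resource view, e.g. a compute node's
    own "compute-node-inventory"."""

    def __init__(self, capacity):
        self.capacity = dict(capacity)     # e.g. {"vcpu": 16, "ram_mb": 32768}
        self.used = {res: 0 for res in capacity}
        self.seq = 0                       # incremental seed for each update

    def consume(self, deltas):
        """Consume resources locally; return the incremental update that
        would be cast to schedulers, or None if the claim fails."""
        if any(self.used[r] + a > self.capacity[r] for r, a in deltas.items()):
            return None                    # reject atomically, send nothing
        for res, amount in deltas.items():
            self.used[res] += amount
        self.seq += 1
        return {"seq": self.seq, "deltas": dict(deltas)}


cache = InventoryCache({"vcpu": 16, "ram_mb": 32768})
update = cache.consume({"vcpu": 2, "ram_mb": 4096})   # accepted, seq 1
failed = cache.consume({"vcpu": 100, "ram_mb": 0})    # exceeds capacity
```

A central service in charge of consuming some resource could maintain exactly the same kind of cache; only the consumption point changes.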
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
</div>
</blockquote>
<br>
That is one of the things I'd like to see in your spec, and how you
could interact with the new model.<br>
Thanks,<br>
-Sylvain<br>
<br>
<br>
<blockquote
cite="mid:7C13C62E3E32B841A445DD46E39D3CC98DE4B6@shsmsx102.ccr.corp.intel.com"
type="cite">
<div class="WordSection1">
<div>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">Regards,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">-Yingxin<o:p></o:p></span></p>
</div>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<div style="border:none;border-left:solid blue 1.5pt;padding:0in
0in 0in 4.0pt">
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><a moz-do-not-send="true"
name="_____replyseparator"></a><b><span
style="color:windowtext">From:</span></b><span
style="color:windowtext"> Sylvain Bauza
[<a class="moz-txt-link-freetext" href="mailto:sbauza@redhat.com">mailto:sbauza@redhat.com</a>]
<br>
<b>Sent:</b> Monday, February 15, 2016 5:28 PM<br>
<b>To:</b> OpenStack Development Mailing List (not for
usage questions)
<a class="moz-txt-link-rfc2396E" href="mailto:openstack-dev@lists.openstack.org"><openstack-dev@lists.openstack.org></a><br>
<b>Subject:</b> Re: [openstack-dev] [nova] A prototype
implementation towards the "shared state scheduler"<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span
style="font-size:12.0pt"><o:p> </o:p></span></p>
<div>
<p class="MsoNormal">On 15/02/2016 06:21, Cheng, Yingxin
wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">Hi,<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">I’ve uploaded a prototype <a
moz-do-not-send="true"
href="https://review.openstack.org/#/c/280047/">
https://review.openstack.org/#/c/280047/</a> to testify
its design goals in accuracy, performance, reliability and
compatibility improvements. It will also be an Austin
Summit Session if elected:
<a moz-do-not-send="true"
href="https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316">https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316</a>
<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">I want to gather opinions about this
idea:<o:p></o:p></p>
<p class="MsoNormal">1. Is it possible for this feature to
be accepted in the Newton release?<o:p></o:p></p>
</blockquote>
<p class="MsoNormal"><span
style="font-size:12.0pt;font-family:"Times New
Roman",serif"><br>
Such a feature requires a spec file to be written: <a
moz-do-not-send="true"
href="http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged">http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged</a><br>
<br>
Ideally, I'd like to see your ideas below written in that
spec file, as that would be the best way to discuss the
design.<br>
<br>
<br>
<br>
<o:p></o:p></span></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">2. Suggestions to improve its design
and compatibility.<o:p></o:p></p>
</blockquote>
<p class="MsoNormal"><span
style="font-size:12.0pt;font-family:"Times New
Roman",serif"><br>
I don't want to go into details here (that's rather the
goal of the spec), but my biggest concerns when reviewing
the spec would be:<br>
- how this can meet the OpenStack mission statement (i.e.
a ubiquitous solution that would be easy to install and
massively scalable)<br>
- how this can be integrated with the existing pieces
(filters, weighers) to provide a clean and simple path for
operators to upgrade<br>
- how this can support rolling upgrades (old
computes sending updates to a new scheduler)<br>
- how we can test it<br>
- can the feature be made optional for operators<br>
<br>
<br>
<br>
<o:p></o:p></span></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">3. Possibilities to integrate with the
resource-provider bp series: I know resource-provider is
the major direction of the Nova scheduler, and there will
be fundamental changes in the future, especially according
to the bp
<a moz-do-not-send="true"
href="https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst">https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst</a>.
However, this prototype proposes a much faster and
compatible way to make scheduling decisions based on
scheduler caches. The in-memory decisions are made at the
same speed as the caching scheduler, but the caches are
kept consistent with compute nodes as quickly as possible
without db refreshes.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
</blockquote>
<p class="MsoNormal"><span
style="font-size:12.0pt;font-family:"Times New
Roman",serif"><br>
That's the key point, thanks for noticing our priorities.
So, you know that our resource modeling is drastically
subject to change in Mitaka and Newton. That is the new
game, so I'd love to see how you plan to interact with
that.<br>
Ideally, I'd appreciate it if Jay Pipes, Chris Dent and you
could share your ideas, since all of you have great ideas
for improving a currently frustrating solution.<br>
<br>
-Sylvain<br>
<br>
<br>
<br>
<o:p></o:p></span></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">Here is the detailed design of the
mentioned prototype:<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">>>----------------------------<o:p></o:p></p>
<p class="MsoNormal">Background:<o:p></o:p></p>
<p class="MsoNormal">The host state cache maintained by the
host manager is the scheduler's resource view during
scheduling decisions. It is updated whenever a request is
received[1], and all compute node records are retrieved
from the db every time. There are several problems with
this update model, proven in experiments[3]:<o:p></o:p></p>
<p class="MsoNormal">1. Performance: The scheduler
performance is largely affected by db access when retrieving
compute node records. The db block time of a single
request is 355ms on average in a deployment of 3 compute
nodes, compared with only 3ms for in-memory
decision-making. Imagine deployments of 1k, or even 10k,
nodes in the future.<o:p></o:p></p>
<p class="MsoNormal">2. Race conditions: This is not only a
parallel-scheduler problem, but also a problem when using
only one scheduler. The detailed analysis of the
one-scheduler problem is in the bug analysis[2]. In
short, there is a gap between when the scheduler makes a
decision in the host state cache and when the compute node
updates its in-db resource record according to that
decision in the resource tracker. Because of it, a recent
scheduler resource consumption in the cache can be lost
and overwritten by compute node data, resulting in cache
inconsistency and unexpected retries. In a one-scheduler
experiment using a 3-node deployment, 7 retries out of 31
concurrent schedule requests were recorded, resulting in
22.6% extra performance overhead.<o:p></o:p></p>
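The "lost and overwritten" gap can be illustrated with a toy sketch (field names are hypothetical, not from Nova): a scheduler consumes resources in its cached host state, but a db refresh arriving before the compute node records the claim silently reverts that consumption.

```python
# The compute node has not yet written its claim to the db.
db_record = {"free_ram_mb": 4096}

# The scheduler's cached host state view of the same node.
host_state = {"free_ram_mb": 4096}

# The scheduler consumes 4096 MB in its cache for a placement decision.
host_state["free_ram_mb"] -= 4096

# A periodic refresh from the still-stale db record overwrites the cache,
# losing the consumption above.
host_state.update(db_record)

# The scheduler now wrongly believes 4096 MB is free again and may place
# another instance on a host that is actually full, forcing a retry.
false_overwrite = host_state["free_ram_mb"] == 4096
```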
<p class="MsoNormal">3. Parallel scheduler support: The
design of the filter scheduler leads to even worse
performance with parallel schedulers. In the same
experiment with 4 schedulers on separate machines, the
average db block time increased to 697ms per request,
and there were 16 retries out of 31 schedule requests,
i.e. 51.6% extra overhead.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Improvements:<o:p></o:p></p>
<p class="MsoNormal">This prototype solves the issues
mentioned above by implementing a new update model for the
scheduler host state cache. Instead of refreshing caches
from the db, every compute node maintains its own accurate
version of the host state cache, updated by the resource
tracker, and sends incremental updates directly to
schedulers. So the scheduler caches are synchronized to
the correct state as soon as possible, with the lowest
overhead. Also, the scheduler sends a resource claim with
its decision to the target compute node. The compute node
can decide immediately whether the resource claim is
successful from its local host state cache and send a
response back ASAP. With all claims tracked from
schedulers to compute nodes, no false overwrites can
happen, and thus the gaps between scheduler caches and
real compute node states are minimized. The benefits are
obvious in recorded experiments[3] compared with the
caching scheduler and the filter scheduler:<o:p></o:p></p>
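A hedged sketch of this claim flow (class names illustrative; the real prototype casts claims asynchronously, which the direct method call below merely stands in for): the scheduler consumes optimistically in its cache, the compute node accepts or rejects against its own authoritative view, and a refused claim is rolled back.

```python
class ComputeNode:
    """The authoritative side: decides claims against its local cache."""

    def __init__(self, free_ram_mb):
        self.free_ram_mb = free_ram_mb

    def handle_claim(self, ram_mb):
        """Decide locally and respond immediately, without touching the db."""
        if ram_mb <= self.free_ram_mb:
            self.free_ram_mb -= ram_mb
            return True
        return False


class SchedulerCache:
    """The scheduler's cached view of one compute node."""

    def __init__(self, node):
        self.node = node
        self.free_ram_mb = node.free_ram_mb

    def claim(self, ram_mb):
        self.free_ram_mb -= ram_mb            # optimistic local consumption
        ok = self.node.handle_claim(ram_mb)   # stands in for an async cast
        if not ok:
            self.free_ram_mb += ram_mb        # roll back a refused claim
        return ok


node = ComputeNode(free_ram_mb=8192)
cache = SchedulerCache(node)
first = cache.claim(6144)    # accepted by the node
second = cache.claim(6144)   # refused: only 2048 MB actually left
```

Because every claim is explicitly confirmed or rolled back, the cached view converges to the node's real state instead of being overwritten by stale db data.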
<p class="MsoNormal">1. There is no db block time during
scheduler decision making; the average decision time per
request is about 3ms in both single- and multiple-scheduler
scenarios, which equals the in-memory decision time of the
filter scheduler and the caching scheduler.<o:p></o:p></p>
<p class="MsoNormal">2. Since the scheduler claims are
tracked and the "false overwrite" is eliminated, there
are 0 retries in the one-scheduler deployment, as shown
in the experiment. Thanks to the quick claim-response
implementation, there are only 2 retries out of 31
requests in the 4-scheduler experiment.<o:p></o:p></p>
<p class="MsoNormal">3. All the filtering and weighing
algorithms are compatible because the data structure of
HostState is unchanged. In fact, this prototype even
supports the filter scheduler running at the same time
(already tested). Other operations that change resources,
such as migration, resizing or shelving, make claims in
the resource tracker directly and update the compute node
host state immediately, without major changes.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Extra features:<o:p></o:p></p>
<p class="MsoNormal">More efforts were made to adjust the
implementation to real-world scenarios, such as network
issues, services unexpectedly going down, overwhelming
message volume, etc.:<o:p></o:p></p>
<p class="MsoNormal">1. The communication between schedulers
and compute nodes consists only of casts; there are no RPC
calls and thus no blocking during scheduling.<o:p></o:p></p>
<p class="MsoNormal">2. All updates from nodes to schedulers
are labelled with an incremental seed, so any message
reordering, loss or duplication due to network issues can
be detected by MessageWindow immediately. An inconsistent
cache can then be detected and refreshed correctly.<o:p></o:p></p>
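The seed-window mechanism can be sketched like this (a simplification; the prototype's actual MessageWindow may differ): each receiver tracks the next expected seed, drops duplicates, and treats a gap as a signal that the cache may be inconsistent and needs a refresh.

```python
class MessageWindow:
    """Detect duplicated, reordered, or lost incremental updates."""

    def __init__(self):
        self.expected = 1   # next seed we expect from this sender

    def accept(self, seq):
        """Return 'ok', 'duplicate', or 'gap' (cache refresh needed)."""
        if seq < self.expected:
            return "duplicate"   # already applied, drop it
        if seq > self.expected:
            return "gap"         # a message was lost or reordered
        self.expected += 1
        return "ok"


win = MessageWindow()
# Seeds 1 and 2 apply cleanly; the repeated 2 is dropped; the jump to 4
# reveals that seed 3 was lost, so the cache must be refreshed.
results = [win.accept(s) for s in (1, 2, 2, 4)]
```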
<p class="MsoNormal">3. Overwhelming messages are
compressed by MessagePipe in its async mode. There is no
need to send all the messages one by one over the MQ; they
can be merged before being sent to schedulers.<o:p></o:p></p>
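Merging can be as simple as summing queued deltas per resource before a single cast goes out (an illustrative sketch, not the prototype's actual MessagePipe code):

```python
def merge_updates(updates):
    """Merge a backlog of incremental updates into one delta dict."""
    merged = {}
    for deltas in updates:
        for res, amount in deltas.items():
            merged[res] = merged.get(res, 0) + amount
    return merged


# Three queued updates: two instances booted, then one deleted.
backlog = [{"vcpu": -2, "ram_mb": -4096},
           {"vcpu": -1, "ram_mb": -2048},
           {"vcpu": 2, "ram_mb": 4096}]
merged = merge_updates(backlog)   # one message instead of three
```

Opposite-signed deltas cancel out, so a busy node can collapse a burst of changes into a single small message.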
<p class="MsoNormal">4. When a new service comes up or
recovers, it sends notifications to all known remotes for
quick cache synchronization, even before the service
record is available in the db. And if a remote service is
unexpectedly down according to service group records, no
more messages will be sent to it. The ComputeFilter is also
removed because of this feature; the scheduler can detect
remote compute nodes by itself.<o:p></o:p></p>
<p class="MsoNormal">5. In fact, claim tracking happens not
only from schedulers to compute nodes, but also from the
compute-node host state to the resource tracker. One
reason is that there is still a gap between when a claim is
acknowledged by the compute-node host state and when the
claim succeeds in the resource tracker. It is necessary to
track those unhandled claims to keep the host state
accurate. The second reason is to separate schedulers from
compute nodes and resource trackers. The scheduler only
exports the limited interfaces `update_from_compute` and
`handle_rt_claim_failure` to the compute service and the
RT, so testing and reuse are easier with clear
boundaries.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">TODOs:<o:p></o:p></p>
<p class="MsoNormal">There are still many features to be
implemented; the most important are unit tests and
incremental updates to PCI and NUMA resources. All of them
are marked inline.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">References:<o:p></o:p></p>
<p class="MsoNormal">[1] <a moz-do-not-send="true"
href="https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104">https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104</a>
<o:p></o:p></p>
<p class="MsoNormal">[2] <a moz-do-not-send="true"
href="https://bugs.launchpad.net/nova/+bug/1341420/comments/24">
https://bugs.launchpad.net/nova/+bug/1341420/comments/24</a>
<o:p></o:p></p>
<p class="MsoNormal">[3] <a moz-do-not-send="true"
href="http://paste.openstack.org/show/486929/">http://paste.openstack.org/show/486929/</a>
<o:p></o:p></p>
<p class="MsoNormal">----------------------------<<<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">The original commit history of this
prototype is located in <a moz-do-not-send="true"
href="https://github.com/cyx1231st/nova/commits/shared-scheduler">
https://github.com/cyx1231st/nova/commits/shared-scheduler</a><o:p></o:p></p>
<p class="MsoNormal">For instructions to install and test
this prototype, please refer to the commit message of
<a moz-do-not-send="true"
href="https://review.openstack.org/#/c/280047/">https://review.openstack.org/#/c/280047/</a>
<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Regards,<o:p></o:p></p>
<p class="MsoNormal">-Yingxin<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:12.0pt;font-family:"Times New
Roman",serif"><br>
<br>
<br>
<o:p></o:p></span></p>
<pre>__________________________________________________________________________<o:p></o:p></pre>
<pre>OpenStack Development Mailing List (not for usage questions)<o:p></o:p></pre>
<pre>Unsubscribe: <a moz-do-not-send="true" href="mailto:OpenStack-dev-request@lists.openstack.org?subject:unsubscribe">OpenStack-dev-request@lists.openstack.org?subject:unsubscribe</a><o:p></o:p></pre>
<pre><a moz-do-not-send="true" href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><o:p></o:p></pre>
</blockquote>
<p class="MsoNormal"><span
style="font-size:12.0pt;font-family:"Times New
Roman",serif"><o:p> </o:p></span></p>
</div>
</div>
<br>
</blockquote>
<br>
</body>
</html>