<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
<br>
<div class="moz-cite-prefix">On 15/02/2016 10:48, Cheng, Yingxin
wrote:<br>
</div>
<blockquote
cite="mid:7C13C62E3E32B841A445DD46E39D3CC98DE4B6@shsmsx102.ccr.corp.intel.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
<meta name="Generator" content="Microsoft Word 15 (filtered
medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:SimSun;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:SimSun;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}
@font-face
{font-family:"\@SimSun";
panose-1:2 1 6 0 3 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
color:black;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
pre
{mso-style-priority:99;
mso-style-link:"HTML Preformatted Char";
margin:0in;
margin-bottom:.0001pt;
font-size:10.0pt;
font-family:"Courier New";
color:black;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
color:black;}
span.EmailStyle18
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:windowtext;}
span.HTMLPreformattedChar
{mso-style-name:"HTML Preformatted Char";
mso-style-priority:99;
mso-style-link:"HTML Preformatted";
font-family:"Consolas",serif;
color:black;}
span.EmailStyle21
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
<div class="WordSection1">
<p class="MsoNormal"><a moz-do-not-send="true"
name="_MailEndCompose"><span style="color:#1F497D">Thanks
Sylvain,<o:p></o:p></span></a></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">1. The ideas
below will be extended into a spec ASAP.<o:p></o:p></span></p>
</div>
</blockquote>
<br>
Nice, looking forward to it then :-)<br>
<blockquote
cite="mid:7C13C62E3E32B841A445DD46E39D3CC98DE4B6@shsmsx102.ccr.corp.intel.com"
type="cite">
<div class="WordSection1">
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">2. Thanks for
raising concerns I had not yet thought of; they will be
addressed in the spec soon.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">3. Let me copy
my thoughts from another thread about the integration with
resource-provider:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">The idea is
that “only the compute node knows its own final compute-node
resource view,” or in other words, “the accurate resource
view only exists at the place where it is actually
consumed.” That is, incremental updates can only come from
the actual “consumption” action, no matter where it happens
(e.g. compute node, storage service, network service, etc.).
To borrow the terms from resource-provider: compute nodes
can maintain an accurate version of their
“compute-node-inventory” cache and send incremental updates,
because they actually consume compute resources; likewise, a
storage service can maintain an accurate version of a
“storage-inventory” cache and send incremental updates if it
consumes storage resources. If there are central services in
charge of consuming all the resources, the accurate cache
and updates must come from them.<o:p></o:p></span></p>
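A minimal Python sketch of this consumption-point idea (all class and field names here are illustrative, not taken from the prototype): the service that actually consumes a resource keeps the authoritative inventory cache and emits incremental, sequence-numbered updates only when a consumption succeeds.

```python
class InventoryCache:
    """Authoritative per-service resource view, e.g. a compute node's
    own "compute-node-inventory"."""

    def __init__(self, capacity):
        self.capacity = dict(capacity)     # e.g. {"vcpu": 16, "ram_mb": 32768}
        self.used = {res: 0 for res in capacity}
        self.seq = 0                       # incremental seed for each update

    def consume(self, deltas):
        """Consume resources locally; return the incremental update that
        would be cast to schedulers, or None if the claim fails."""
        if any(self.used[r] + a > self.capacity[r] for r, a in deltas.items()):
            return None                    # reject atomically, send nothing
        for res, amount in deltas.items():
            self.used[res] += amount
        self.seq += 1
        return {"seq": self.seq, "deltas": dict(deltas)}


cache = InventoryCache({"vcpu": 16, "ram_mb": 32768})
update = cache.consume({"vcpu": 2, "ram_mb": 4096})   # accepted, seq 1
failed = cache.consume({"vcpu": 100, "ram_mb": 0})    # exceeds capacity
```

A central service in charge of consuming some resource could maintain exactly the same kind of cache; only the consumption point changes.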
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
</div>
</blockquote>
<br>
That is one of the things I'd like to see in your spec, and how you
could interact with the new model.<br>
Thanks,<br>
-Sylvain<br>
<br>
<br>
<blockquote
cite="mid:7C13C62E3E32B841A445DD46E39D3CC98DE4B6@shsmsx102.ccr.corp.intel.com"
type="cite">
<div class="WordSection1">
<div>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">Regards,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">-Yingxin<o:p></o:p></span></p>
</div>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<div style="border:none;border-left:solid blue 1.5pt;padding:0in
0in 0in 4.0pt">
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><a moz-do-not-send="true"
name="_____replyseparator"></a><b><span
style="color:windowtext">From:</span></b><span
style="color:windowtext"> Sylvain Bauza
[<a class="moz-txt-link-freetext" href="mailto:sbauza@redhat.com">mailto:sbauza@redhat.com</a>]
<br>
<b>Sent:</b> Monday, February 15, 2016 5:28 PM<br>
<b>To:</b> OpenStack Development Mailing List (not for
usage questions)
<a class="moz-txt-link-rfc2396E" href="mailto:openstack-dev@lists.openstack.org"><openstack-dev@lists.openstack.org></a><br>
<b>Subject:</b> Re: [openstack-dev] [nova] A prototype
implementation towards the "shared state scheduler"<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span
style="font-size:12.0pt"><o:p> </o:p></span></p>
<div>
<p class="MsoNormal">On 15/02/2016 06:21, Cheng, Yingxin
wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">Hi,<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">I’ve uploaded a prototype <a
moz-do-not-send="true"
href="https://review.openstack.org/#/c/280047/">
https://review.openstack.org/#/c/280047/</a> to testify
its design goals in accuracy, performance, reliability and
compatibility improvements. It will also be an Austin
Summit Session if elected:
<a moz-do-not-send="true"
href="https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316">https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316</a>
<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">I want to gather opinions about this
idea:<o:p></o:p></p>
<p class="MsoNormal">1. Is it possible for this feature to
be accepted in the Newton release?<o:p></o:p></p>
</blockquote>
<p class="MsoNormal"><span
style="font-size:12.0pt;font-family:"Times New
Roman",serif"><br>
Such a feature requires a spec file to be written: <a
moz-do-not-send="true"
href="http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged">http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged</a><br>
<br>
Ideally, I'd like to see your ideas below written in that
spec file, as that would be the best way to discuss the
design.<br>
<br>
<br>
<br>
<o:p></o:p></span></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">2. Suggestions to improve its design
and compatibility.<o:p></o:p></p>
</blockquote>
<p class="MsoNormal"><span
style="font-size:12.0pt;font-family:"Times New
Roman",serif"><br>
I don't want to go into details here (that's rather the
goal of the spec), but my biggest concerns when reviewing
the spec would be:<br>
- how this can meet the OpenStack mission statement (i.e.
a ubiquitous solution that would be easy to install and
massively scalable)<br>
- how this can be integrated with the existing pieces
(filters, weighers) to provide a clean and simple path for
operators to upgrade<br>
- how this can support rolling upgrades (old
computes sending updates to a new scheduler)<br>
- how we can test it<br>
- can the feature be made optional for operators<br>
<br>
<br>
<br>
<o:p></o:p></span></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">3. Possibilities to integrate with the
resource-provider bp series: I know resource-provider is
the major direction of the Nova scheduler, and there will
be fundamental changes in the future, especially according
to the bp
<a moz-do-not-send="true"
href="https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst">https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst</a>.
However, this prototype proposes a much faster and
compatible way to make scheduling decisions based on
scheduler caches. The in-memory decisions are made at the
same speed as the caching scheduler, but the caches are
kept consistent with compute nodes as quickly as possible
without db refreshes.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
</blockquote>
<p class="MsoNormal"><span
style="font-size:12.0pt;font-family:"Times New
Roman",serif"><br>
That's the key point, thanks for noticing our priorities.
So, you know that our resource modeling is drastically
subject to change in Mitaka and Newton. That is the new
game, so I'd love to see how you plan to interact with
that.<br>
Ideally, I'd appreciate it if Jay Pipes, Chris Dent and you
could share your ideas, since all of you have great ideas
for improving a currently frustrating solution.<br>
<br>
-Sylvain<br>
<br>
<br>
<br>
<o:p></o:p></span></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">Here is the detailed design of the
mentioned prototype:<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">>>----------------------------<o:p></o:p></p>
<p class="MsoNormal">Background:<o:p></o:p></p>
<p class="MsoNormal">The host state cache maintained by the
host manager is the scheduler's resource view during
scheduling decisions. It is updated whenever a request is
received[1], and all compute node records are retrieved
from the db every time. There are several problems with
this update model, proven in experiments[3]:<o:p></o:p></p>
<p class="MsoNormal">1. Performance: The scheduler
performance is largely affected by db access when retrieving
compute node records. The db block time of a single
request is 355ms on average in a deployment of 3 compute
nodes, compared with only 3ms for in-memory
decision-making. Imagine deployments of 1k, or even 10k,
nodes in the future.<o:p></o:p></p>
<p class="MsoNormal">2. Race conditions: This is not only a
parallel-scheduler problem, but also a problem when using
only one scheduler. The detailed analysis of the
one-scheduler problem is in the bug analysis[2]. In
short, there is a gap between when the scheduler makes a
decision in the host state cache and when the compute node
updates its in-db resource record according to that
decision in the resource tracker. Because of it, a recent
scheduler resource consumption in the cache can be lost
and overwritten by compute node data, resulting in cache
inconsistency and unexpected retries. In a one-scheduler
experiment using a 3-node deployment, 7 retries out of 31
concurrent schedule requests were recorded, resulting in
22.6% extra performance overhead.<o:p></o:p></p>
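The "lost and overwritten" gap can be illustrated with a toy sketch (field names are hypothetical, not from Nova): a scheduler consumes resources in its cached host state, but a db refresh arriving before the compute node records the claim silently reverts that consumption.

```python
# The compute node has not yet written its claim to the db.
db_record = {"free_ram_mb": 4096}

# The scheduler's cached host state view of the same node.
host_state = {"free_ram_mb": 4096}

# The scheduler consumes 4096 MB in its cache for a placement decision.
host_state["free_ram_mb"] -= 4096

# A periodic refresh from the still-stale db record overwrites the cache,
# losing the consumption above.
host_state.update(db_record)

# The scheduler now wrongly believes 4096 MB is free again and may place
# another instance on a host that is actually full, forcing a retry.
false_overwrite = host_state["free_ram_mb"] == 4096
```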
<p class="MsoNormal">3. Parallel scheduler support: The
design of the filter scheduler leads to even worse
performance with parallel schedulers. In the same
experiment with 4 schedulers on separate machines, the
average db block time increased to 697ms per request,
and there were 16 retries out of 31 schedule requests,
i.e. 51.6% extra overhead.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Improvements:<o:p></o:p></p>
<p class="MsoNormal">This prototype solves the issues
mentioned above by implementing a new update model for the
scheduler host state cache. Instead of refreshing caches
from the db, every compute node maintains its own accurate
version of the host state cache, updated by the resource
tracker, and sends incremental updates directly to
schedulers. So the scheduler caches are synchronized to
the correct state as soon as possible, with the lowest
overhead. Also, the scheduler sends a resource claim with
its decision to the target compute node. The compute node
can decide immediately whether the resource claim is
successful from its local host state cache and send a
response back ASAP. With all claims tracked from
schedulers to compute nodes, no false overwrites can
happen, and thus the gaps between scheduler caches and
real compute node states are minimized. The benefits are
obvious in recorded experiments[3] compared with the
caching scheduler and the filter scheduler:<o:p></o:p></p>
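A hedged sketch of this claim flow (class names illustrative; the real prototype casts claims asynchronously, which the direct method call below merely stands in for): the scheduler consumes optimistically in its cache, the compute node accepts or rejects against its own authoritative view, and a refused claim is rolled back.

```python
class ComputeNode:
    """The authoritative side: decides claims against its local cache."""

    def __init__(self, free_ram_mb):
        self.free_ram_mb = free_ram_mb

    def handle_claim(self, ram_mb):
        """Decide locally and respond immediately, without touching the db."""
        if ram_mb <= self.free_ram_mb:
            self.free_ram_mb -= ram_mb
            return True
        return False


class SchedulerCache:
    """The scheduler's cached view of one compute node."""

    def __init__(self, node):
        self.node = node
        self.free_ram_mb = node.free_ram_mb

    def claim(self, ram_mb):
        self.free_ram_mb -= ram_mb            # optimistic local consumption
        ok = self.node.handle_claim(ram_mb)   # stands in for an async cast
        if not ok:
            self.free_ram_mb += ram_mb        # roll back a refused claim
        return ok


node = ComputeNode(free_ram_mb=8192)
cache = SchedulerCache(node)
first = cache.claim(6144)    # accepted by the node
second = cache.claim(6144)   # refused: only 2048 MB actually left
```

Because every claim is explicitly confirmed or rolled back, the cached view converges to the node's real state instead of being overwritten by stale db data.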
<p class="MsoNormal">1. There is no db block time during
scheduler decision making; the average decision time per
request is about 3ms in both single- and multiple-scheduler
scenarios, which equals the in-memory decision time of the
filter scheduler and the caching scheduler.<o:p></o:p></p>
<p class="MsoNormal">2. Since the scheduler claims are
tracked and the "false overwrite" is eliminated, there
are 0 retries in the one-scheduler deployment, as shown
in the experiment. Thanks to the quick claim-response
implementation, there are only 2 retries out of 31
requests in the 4-scheduler experiment.<o:p></o:p></p>
<p class="MsoNormal">3. All the filtering and weighing
algorithms are compatible because the data structure of
HostState is unchanged. In fact, this prototype even
supports the filter scheduler running at the same time
(already tested). Other operations that change resources,
such as migration, resizing or shelving, make claims in
the resource tracker directly and update the compute node
host state immediately, without major changes.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Extra features:<o:p></o:p></p>
<p class="MsoNormal">More efforts were made to adjust the
implementation to real-world scenarios, such as network
issues, services unexpectedly going down, overwhelming
message volume, etc.:<o:p></o:p></p>
<p class="MsoNormal">1. The communication between schedulers
and compute nodes consists only of casts; there are no RPC
calls and thus no blocking during scheduling.<o:p></o:p></p>
<p class="MsoNormal">2. All updates from nodes to schedulers
are labelled with an incremental seed, so any message
reordering, loss or duplication due to network issues can
be detected by MessageWindow immediately. An inconsistent
cache can then be detected and refreshed correctly.<o:p></o:p></p>
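The seed-window mechanism can be sketched like this (a simplification; the prototype's actual MessageWindow may differ): each receiver tracks the next expected seed, drops duplicates, and treats a gap as a signal that the cache may be inconsistent and needs a refresh.

```python
class MessageWindow:
    """Detect duplicated, reordered, or lost incremental updates."""

    def __init__(self):
        self.expected = 1   # next seed we expect from this sender

    def accept(self, seq):
        """Return 'ok', 'duplicate', or 'gap' (cache refresh needed)."""
        if seq < self.expected:
            return "duplicate"   # already applied, drop it
        if seq > self.expected:
            return "gap"         # a message was lost or reordered
        self.expected += 1
        return "ok"


win = MessageWindow()
# Seeds 1 and 2 apply cleanly; the repeated 2 is dropped; the jump to 4
# reveals that seed 3 was lost, so the cache must be refreshed.
results = [win.accept(s) for s in (1, 2, 2, 4)]
```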
<p class="MsoNormal">3. Overwhelming messages are
compressed by MessagePipe in its async mode. There is no
need to send all the messages one by one over the MQ; they
can be merged before being sent to schedulers.<o:p></o:p></p>
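Merging can be as simple as summing queued deltas per resource before a single cast goes out (an illustrative sketch, not the prototype's actual MessagePipe code):

```python
def merge_updates(updates):
    """Merge a backlog of incremental updates into one delta dict."""
    merged = {}
    for deltas in updates:
        for res, amount in deltas.items():
            merged[res] = merged.get(res, 0) + amount
    return merged


# Three queued updates: two instances booted, then one deleted.
backlog = [{"vcpu": -2, "ram_mb": -4096},
           {"vcpu": -1, "ram_mb": -2048},
           {"vcpu": 2, "ram_mb": 4096}]
merged = merge_updates(backlog)   # one message instead of three
```

Opposite-signed deltas cancel out, so a busy node can collapse a burst of changes into a single small message.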
<p class="MsoNormal">4. When a new service comes up or
recovers, it sends notifications to all known remotes for
quick cache synchronization, even before the service
record is available in the db. And if a remote service is
unexpectedly down according to service group records, no
more messages will be sent to it. The ComputeFilter is also
removed because of this feature; the scheduler can detect
remote compute nodes by itself.<o:p></o:p></p>
<p class="MsoNormal">5. In fact, claim tracking happens not
only from schedulers to compute nodes, but also from the
compute-node host state to the resource tracker. One
reason is that there is still a gap between when a claim is
acknowledged by the compute-node host state and when the
claim succeeds in the resource tracker. It is necessary to
track those unhandled claims to keep the host state
accurate. The second reason is to separate schedulers from
compute nodes and resource trackers. The scheduler only
exports the limited interfaces `update_from_compute` and
`handle_rt_claim_failure` to the compute service and the
RT, so testing and reuse are easier with clear
boundaries.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">TODOs:<o:p></o:p></p>
<p class="MsoNormal">There are still many features to be
implemented; the most important are unit tests and
incremental updates to PCI and NUMA resources. All of them
are marked inline.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">References:<o:p></o:p></p>
<p class="MsoNormal">[1] <a moz-do-not-send="true"
href="https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104">https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104</a>
<o:p></o:p></p>
<p class="MsoNormal">[2] <a moz-do-not-send="true"
href="https://bugs.launchpad.net/nova/+bug/1341420/comments/24">
https://bugs.launchpad.net/nova/+bug/1341420/comments/24</a>
<o:p></o:p></p>
<p class="MsoNormal">[3] <a moz-do-not-send="true"
href="http://paste.openstack.org/show/486929/">http://paste.openstack.org/show/486929/</a>
<o:p></o:p></p>
<p class="MsoNormal">----------------------------<<<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">The original commit history of this
prototype is located in <a moz-do-not-send="true"
href="https://github.com/cyx1231st/nova/commits/shared-scheduler">
https://github.com/cyx1231st/nova/commits/shared-scheduler</a><o:p></o:p></p>
<p class="MsoNormal">For instructions to install and test
this prototype, please refer to the commit message of
<a moz-do-not-send="true"
href="https://review.openstack.org/#/c/280047/">https://review.openstack.org/#/c/280047/</a>
<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Regards,<o:p></o:p></p>
<p class="MsoNormal">-Yingxin<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:12.0pt;font-family:"Times New
Roman",serif"><br>
<br>
<br>
<o:p></o:p></span></p>
<pre>__________________________________________________________________________<o:p></o:p></pre>
<pre>OpenStack Development Mailing List (not for usage questions)<o:p></o:p></pre>
<pre>Unsubscribe: <a moz-do-not-send="true" href="mailto:OpenStack-dev-request@lists.openstack.org?subject:unsubscribe">OpenStack-dev-request@lists.openstack.org?subject:unsubscribe</a><o:p></o:p></pre>
<pre><a moz-do-not-send="true" href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><o:p></o:p></pre>
</blockquote>
<p class="MsoNormal"><span
style="font-size:12.0pt;font-family:"Times New
Roman",serif"><o:p> </o:p></span></p>
</div>
</div>
<br>
</blockquote>
<br>
</body>
</html>