Hello,
I would like to introduce you to the tool developed under the working
title "Scheduler Optimiser". It is meant to test the effectiveness of
different Scheduler configurations, both weights and filters, against a
given list of VM orders on a semi-realistic infrastructure.
My company - CloudFerro - has been developing it in-house for the last
few months and foresees publishing the project as FOSS once it reaches
the MVP stage. To make the final result more useful to the community and
to speed up the development (and release), I humbly ask for your expertise:
Are you aware of previous similar efforts? Do you notice any flaws in
the current approach? What, in your opinion, are the more important
aspects of the infrastructure's behaviour, and what can be relatively
safely ignored in terms of its effect on the Scheduler's results/allocation?
Project objectives:
* Use Devstack (or another OpenStack deployer) with a real Scheduler
to replay a list of compute VM orders, either taken from one's real
infrastructure or artificially created.
* Assess the effectiveness of the scheduling by various measures, such
as "How many machines of a given type can still be allocated at the
moment?", using plug-in "success meters". Strictly speaking, the
project does not simulate THE Scheduler but interacts with it.
* Use fake-virt to emulate huge architectures on a relatively tiny
test bench.
* Require as few changes as possible - and ideally none - to Devstack's
code that could not be included in the upstream repository. The
usage should be as simple as: 1. Install Devstack. 2. Configure
Devstack's cluster with its infrastructure information, like flavours
and hosts. 3. Configure the Scheduler for a new test case (a
configuration sketch follows this list). 4. Replay the VM orders.
5. Repeat steps 3 and 4 to find better Scheduler settings.
* Facilitate creating the minimal required setup of the test bench -
not by replacing the standard Devstack scripts, but mainly through
tooling for quickly rebuilding data such as flavours, infrastructure
state, and other factors relevant to the simulation.
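
To illustrate step 3 above: tuning a test case is expected to come down
to editing the scheduler options in nova.conf. The filters and weight
multipliers below are only an illustrative example, not a recommended
configuration:

    [filter_scheduler]
    # Filters decide which hosts are eligible at all.
    enabled_filters = ComputeFilter,ImagePropertiesFilter,AggregateInstanceExtraSpecsFilter
    # Weighers rank the remaining hosts; the multipliers shift the
    # balance, e.g. a negative RAM multiplier packs VMs onto fewer
    # hosts, while a positive one spreads them out.
    ram_weight_multiplier = 1.0
    cpu_weight_multiplier = 1.0
    host_subset_size = 1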
Outside of the scope:
* Running continuous analysis on the production environment, even if
some plug-ins could be extracted for this purpose.
* Retaining information about users and projects when replaying orders.
* (Probably / low priority) replaying actions other than VM
creation/deletion, as they form a minority of operations and ignoring
them should not have a noticeable effect on the comparison experiments.
Current state:
Implemented:
* Recreating flavours from a JSON file exported via the OpenStack CLI.
* Replaying a list of orders in the form of (creation_date,
termination_date, resource_id (optional), flavor_id) with basic
flavour properties like VCPU, RAM, and DISK GB. The orders are
replayed consecutively (see the sketch after this list).
* A plug-in success-rater mechanism which runs rater classes (each
returning a quantified success measure) after every VM add/delete
action and retains both their intermediate history and a "total
success" value - how the latter is defined is implementation-dependent.
The first classes interact with Placement and answer questions like
"How many VMs of flavour x (with basic parameters for now) can fit in
the cluster?" or "How many hosts are empty?" (see the sketch after
this list).
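
To make the two points above more concrete, here is a minimal sketch of
the order record and the rater interface. All names (VMOrder,
SuccessRater, EmptyHostsRater, cluster_state) are illustrative
assumptions, not the project's actual API; the flavour input is simply
the kind of JSON produced by e.g. `openstack flavor list --long -f json`.

    from abc import ABC, abstractmethod
    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class VMOrder:
        # One replayed order; fields mirror the tuple described above.
        creation_date: datetime
        termination_date: datetime
        flavor_id: str
        resource_id: Optional[str] = None  # optional id of the original VM

    class SuccessRater(ABC):
        # Plug-in interface: invoked after every VM add/delete action.
        def __init__(self):
            self.history = []  # one quantified measure per replayed action

        @abstractmethod
        def rate(self, cluster_state) -> float:
            """Return a quantified success measure for the current state."""

        def record(self, cluster_state) -> None:
            self.history.append(self.rate(cluster_state))

        def total_success(self) -> float:
            # How "total success" is defined is implementation-dependent;
            # averaging the intermediate measures is just one option.
            return sum(self.history) / len(self.history) if self.history else 0.0

    class EmptyHostsRater(SuccessRater):
        # Example rater: "How many hosts are empty?", based on per-host
        # usage data gathered from Placement
        # (GET /resource_providers/{uuid}/usages).
        def rate(self, cluster_state) -> float:
            return float(sum(
                1 for usage in cluster_state.host_usages.values()
                if not any(usage.values())))

The replay loop is then essentially: for each order, create or delete
the VM, wait for the action to complete, and call record() on every
registered rater.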
Missing:
* Recreating hosts; note the fake-virt remark in "Risks and Challenges".
* Tools facilitating Scheduler configuration.
* Creating VMs with more parameters like VGPU, traits, and aggregates
(see the flavour example after this list).
* (Lower priority) saving the intermediate state of the cluster (i.e.
its allocations) during the simulation, so that it can be analysed
without rerunning the experiment. Currently, only the quantified
meters are saved.
* Failing gracefully and saving all information in case of resource
depletion: this is close to completion; only handling one exception
type in the upper layers is still needed.
* More success meters.
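
For reference on the VGPU/traits point above, such requirements are
normally expressed as flavour extra specs; the flavour name and trait
below are only examples:

    openstack flavor set \
        --property resources:VGPU=1 \
        --property trait:HW_CPU_X86_AVX2=required \
        gpu.medium

Aggregate constraints, in turn, are matched against host aggregate
metadata, e.g. by the AggregateInstanceExtraSpecsFilter.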
Risks and Challenges:
* Currently, the tool replays actions one by one: it waits for each
creation and deletion to complete before running the success raters
and taking the next order. Thus, the order of actions is important,
but not their absolute time and temporal density. This might skip
some side effects of a realistic execution.
* Similarly to the above, fake-virt provides simple classes that will
not reproduce some behaviours of real-world hypervisors. For instance,
the Scheduler explicitly avoids hosts that have recently failed to
allocate a VM, but fake-virt will most likely not mock such behaviour.
* Fake-virt should reproduce a realistically diverse infrastructure
instead of x copies of the same flavour. This might be the only, but
very important, change to the OpenStack codebase. If successful, it
could benefit other projects and tests as well.
Even though the list of missing features may seem longer than the list
of implemented ones, the most important parts of the program are
already in place, so we hope to finish the MVP development in a
relatively short amount of time. We are going to publish it as FOSS in
either case but, as mentioned, your observations would be very welcome
at this stage. I am also open to answering further questions about the
project.
Kind regards
Dominik Danelski