Nice idea, but my main question is: how do you plan to beat the schedulers implemented currently? I'm doing a little research myself into techniques that try to beat the random resource allocation schedulers. Can you share more about your research and/or implementation idea?

Cheers.
---
Alvaro Soto.

Note: My work hours may not be your work hours. Please do not feel the need to respond during a time that is not convenient for you.
----------------------------------------------------------
Great people talk about ideas,
ordinary people talk about things,
small people talk... about other people.

On Mon, Jul 3, 2023, 2:00 PM Dominik Danelski <ddanelski@cloudferro.com> wrote:
Hello,
I would like to introduce you to a tool developed under the working title "Scheduler Optimiser". It is meant to test the effectiveness of different Scheduler configurations - both weights and filters - on a given list of VM orders and in a semi-realistic infrastructure.
My company - CloudFerro - has been developing it in-house for the last few months and plans to publish the project as FOSS once it reaches the MVP stage. To make the final result more useful to the community and to speed up the development (and release), I humbly ask for your expertise: Are you aware of previous similar efforts? Do you notice any flaws in the current approach? What, in your opinion, are the more important aspects of the infrastructure behaviour, and what can be relatively safely ignored in terms of its effect on Scheduler results/allocation?
Project objectives:
* Use Devstack (or another OpenStack deployer) with a real Scheduler to replay a list of compute VM orders, either real ones from one's infrastructure or artificially created.
* Assess the effectiveness of the scheduling in various terms, like "How many machines of a given type can still be allocated at the moment?", using plug-in "success meters". In a strict sense, the project does not simulate THE Scheduler but interacts with it.
* Use fake-virt to emulate huge architectures on a relatively tiny test bench.
* Keep changes to Devstack's code as small as possible, and ideally make none that could not be included in the upstream repository. The usage should be as simple as:
  1. Install Devstack.
  2. Configure Devstack's cluster with its infrastructure information, like flavours and hosts.
  3. Configure the Scheduler for a new test case.
  4. Replay VM orders (a minimal replay sketch follows after this list).
  5. Repeat steps 3 and 4 to find better Scheduler settings.
* Facilitate creating a minimal required setup of the test bench - not by replacing standard Devstack scripts, but mainly through tooling for quickly rebuilding data like flavours, infrastructure state, and other factors relevant to the simulation.
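For illustration only, the replay step could look roughly like the sketch below when driven through openstacksdk against a Devstack cloud. It is a sketch of the idea rather than the actual implementation; names like replay_orders, orders.csv, IMAGE_ID, and NETWORK_ID are placeholders:

    # Illustrative only: replays (creation_date, termination_date, resource_id,
    # flavor_id) rows one by one against a Devstack cloud via openstacksdk.
    import csv

    import openstack

    IMAGE_ID = "<image-uuid>"      # placeholder, e.g. the default Devstack cirros image
    NETWORK_ID = "<network-uuid>"  # placeholder, taken from the Devstack cluster

    conn = openstack.connect(cloud="devstack")  # entry from clouds.yaml

    def replay_orders(path="orders.csv"):
        with open(path, newline="") as f:
            for order in csv.DictReader(f):
                server = conn.compute.create_server(
                    name=f"replay-{order.get('resource_id') or order['flavor_id']}",
                    flavor_id=order["flavor_id"],
                    image_id=IMAGE_ID,
                    networks=[{"uuid": NETWORK_ID}],
                )
                # Wait for each action to finish before rating success and
                # taking the next order (see "Risks and Challenges").
                conn.compute.wait_for_server(server)
                # ... run the success raters here and delete servers whose
                #     termination_date has been reached in the replayed sequence ...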
Outside of the scope:
* Running continuous analysis on the production environment, even if some plug-ins could be extracted for this purpose.
* Retaining information about users and projects when replaying orders.
* (Probably / low priority) Replaying actions other than VM creation/deletion, as they form a minority of operations and ignoring them should not have a distinct effect on the comparison experiments.
Current state:
Implemented:
* Recreating flavours from a JSON file exported via the OpenStack CLI.
* Replaying a list of orders in the form of (creation_date, termination_date, resource_id (optional), flavor_id) with basic flavour properties like VCPU, RAM, and DISK GB. The orders are replayed consecutively.
* A plug-in success-rater mechanism which runs rater classes (returning a quantified success measure) after each VM add/delete action and retains their intermediate history and "total success" - how the latter is defined is implementation dependent. The first classes interact with Placement, answering questions like "How many VMs of flavour x (with basic parameters for now) can fit in the cluster?" or "How many hosts are empty?". A sketch of such a rater interface follows after this list.
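For illustration, a success rater could look roughly like the sketch below. The class and method names are placeholders rather than the actual code, and while the first real raters query Placement, this simplified example uses the compute hypervisor listing to answer "How many hosts are empty?":

    # Illustrative plug-in interface; names are placeholders, not project code.
    from abc import ABC, abstractmethod

    class SuccessRater(ABC):
        """Runs after every VM add/delete action and keeps its own history."""

        def __init__(self):
            self.history = []

        @abstractmethod
        def rate(self, conn) -> float:
            """Return a quantified success measure for the current cluster state."""

        def record(self, conn) -> float:
            value = self.rate(conn)
            self.history.append(value)  # intermediate history, summarised into "total success"
            return value

    class EmptyHostsRater(SuccessRater):
        """Example meter: how many hypervisors currently host no instances?"""

        def rate(self, conn) -> float:
            return sum(
                1
                for h in conn.compute.hypervisors(details=True)
                if (h.running_vms or 0) == 0
            )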
Missing:
* Recreating hosts - note the fake-virt remark in "Risks and Challenges".
* Tools facilitating Scheduler configuration.
* Creating VMs with more parameters like VGPU, traits, and aggregates.
* (Lower priority) Saving the intermediate state of the cluster during the simulation, i.e. allocations, to analyse it without rerunning the experiment. Currently, only the quantified meters are saved.
* Failing gently and saving all information in case of resource depletion: this is close to completion, handling one exception type in the upper layers is still needed (see the sketch after this list).
* More success meters.
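For the gentle-failure item, the rough idea (sketched here with openstacksdk, not the actual implementation) is to let the single exception raised when a server goes to ERROR - openstack.exceptions.ResourceFailure after a failed wait - bubble up to one upper-layer handler that saves the collected meters before stopping. save_results() and run_experiment() are placeholder names:

    # Sketch of catching resource depletion in the upper replay layer.
    from openstack import exceptions as os_exc

    def run_experiment(raters, orders_path="orders.csv"):
        try:
            replay_orders(orders_path)          # replay loop from the earlier sketch
        except os_exc.ResourceFailure as err:
            # Typically "No valid host was found" once the cluster is exhausted:
            # persist the raters' intermediate histories and stop gracefully.
            save_results(raters)                # placeholder helper
            print(f"Stopping replay, resources depleted: {err}")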
Risks and Challenges:
* Currently, the tool replays actions one by one: it waits for each creation and deletion to complete before running the success raters and taking the next order. Thus, the order of actions matters, but not their absolute time or temporal density. This might skip some side effects of a realistic execution.
* Similarly to the above, fake-virt provides simple classes that will not reproduce some behaviours of real-world hypervisors. A real Scheduler, for instance, avoids hosts that have recently failed to allocate a VM, but fake-virt will most likely not mock such behaviour.
* Fake-virt should reproduce a real, diverse infrastructure instead of x copies of the same flavour. This might be the only, but very important, change to the OpenStack codebase. If successful, it could benefit other projects and tests as well.
Even though the list of missing features may seem longer, the most important parts of the program are already in place, so we hope to finish the MVP development in a relatively short amount of time. We are going to publish it as FOSS in either case, but, as mentioned, your observations would be very much welcome at this stage. I am also open to answering more questions about the project.
Kind regards
Dominik Danelski