[dev][tc] Part 2: Evaluating projects in relation to OpenStack cloud vision
This is a "part 2" or "other half" of evaluating OpenStack projects in relation to the technical vision. See the other threads [1][2] for more information.

In the conversations that led up to the creation of the vision document [3], one of the things we hoped was that the process could help identify ways in which existing projects could evolve to be better at what they do. This was couched in two ideas:

* Helping to make sure that OpenStack continuously improves, in the right direction.

* Helping to make sure that developers were working on projects that leaned more towards interesting and educational than frustrating and embarrassing, where choices about what to do and how to do it were straightforward, easy to share with others, so well-founded in agreed good practice that argument would be rare, and so few that it was easy to decide.

Of course, to have a "right direction" you first have to have a direction, and thus the vision document and the idea of evaluating how aligned a project is with that.

The other half, then, is looking at the projects from a development standpoint and thinking about what aspects of the project are:

* Things (techniques, tools) the project contributors would encourage others to try. Stuff that has worked out well.

* Things—given a clean slate, unlimited time and resources, the benefit of hindsight and without the weight of legacy—the project contributors would encourage others not to repeat.

And documenting those things so they can be carried forward in time some place other than people's heads, and new projects or refactorings of existing projects can start on a good foot.

A couple of examples:

* Whatever we might say about the implementation (in itself and how it is used), the concept of a unified configuration file format, via oslo_config, is probably considered a good choice, and we should keep on doing that.

* On the other hand, given hindsight and improvements in commonly available tools, using a homegrown WSGI (non-)framework (unless you are Swift) plus eventlet may not have been the way to go, yet because it is what's still there in nova, it often gets copied.

It's not clear at this point whether these sorts of things should be documented in projects, or somewhere more central. So perhaps we can just talk about it here in email and figure something out. I'll follow up with some I have for placement, since that's the project I've given the most attention.

[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-January/001417.h...
[2] http://lists.openstack.org/pipermail/openstack-discuss/2019-February/002524....
[3] https://governance.openstack.org/tc/reference/technical-vision.html

--
Chris Dent  ٩◔̯◔۶  https://anticdent.org/  freenode: cdent  tw: @anticdent
On Sun, 10 Feb 2019, Chris Dent wrote:
It's not clear at this point whether these sorts of things should be documented in projects, or somewhere more central. So perhaps we can just talk about it here in email and figure something out. I'll follow up with some I have for placement, since that's the project I've given the most attention.
Conversation on vision reflection for placement [1] is what reminded me that this part 2 is something we should be doing.

I should disclaim that I'm the author of a lot of the architecture of placement so I'm hugely biased. Please call me out where my preferences are clouding reality. Other contributors to placement probably have other ideas. They would be great to hear. However, it's been at least two years since we started, so I think we can extract some useful lessons.

Things that have worked out well (you can probably see a theme):

* Placement is a single purpose service with, until very recently, only the WSGI service as the sole moving part. There are now placement-manage and placement-status commands, but they are rarely used (thankfully). This makes the system easier to reason about than something with multiple agents. Obviously some things need lots of agents. Placement isn't one of them.

* Using gabbi [2] as the framework for functional tests of the API and using them to enable test-driven-development, via those functional tests, has worked out really well. It keeps the focus on that sole moving part: the API.

* No RPC, no messaging, no notifications.

* Very little configuration, reasonable defaults to that config. It's possible to run a working placement service with two config settings, if you are not using keystone. Keystone adds a few more, but not that much.

* String adherence to WSGI norms (that is, any WSGI server can run a placement WSGI app) and avoidance of eventlet, but see below. The combination of this with the small number of moving parts and little configuration makes it super easy to deploy placement [3] in lots of different setups, from tiny to huge, scaling and robustifying those setups as required.

* Declarative URL routing. There's a dict which maps HTTP method:URL pairs to python functions. Clear dispatch is a _huge_ help when debugging. Look one place, as a computer or human, to find where to go.
* microversion-parse [4] has made microversion handling easy.

Things that haven't gone so well (none of these are dire) and would have been nice to do differently had we but known:

* Because of a combination of "we might need it later", "it's a handy tool and constraint" and "that's the way we do things", the interface between the placement URL handlers and the database is mediated through oslo versioned objects. Since there's no RPC, nor inter-version interaction, this is overkill. It also turns out that OVO getters and setters are a moderate factor in performance. Initially we were versioning the versioned objects, which created a lot of cognitive overhead when evolving the system, but we no longer do that, now that we've declared RPC isn't going to happen.

* Despite the strict adherence to being a good WSGI citizen mentioned above, placement is using a custom (very limited) framework for the WSGI application. An initial proof of concept used flask, but it was decided that introducing flask into the nova development environment would be introducing another thing to know when decoding nova. I suspect the expected outcome was that placement would reuse nova's framework, but the truth is I simply couldn't do it. Declarative URL dispatch was a critical feature that has proven worth it. The resulting code is relatively straightforward but it is a unicorn where a boring pony would have been the right thing. Boring ponies are very often the right thing.

I'm sure there are more here, but I've run out of brain.

[1] https://review.openstack.org/#/c/630216/
[2] https://gabbi.readthedocs.io/
[3] https://anticdent.org/placement-from-pypi.html
[4] https://pypi.org/project/microversion_parse/

--
Chris Dent  ٩◔̯◔۶  https://anticdent.org/  freenode: cdent  tw: @anticdent
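[Editor's illustration] The thread doesn't show what microversion handling involves. As a hedged sketch — a hand-rolled parser for an OpenStack-API-Version style header, not the actual microversion-parse API:

```python
# Hand-rolled sketch of microversion header parsing, for illustration
# only. The real microversion-parse library has its own API and handles
# more cases (latest, legacy headers, multiple services in one header).

def parse_microversion(headers, service_type):
    """Return the (major, minor) requested for service_type, or None.

    Expects a header like: OpenStack-API-Version: placement 1.10
    """
    value = headers.get('OpenStack-API-Version')
    if not value:
        return None
    # The header may carry versions for several services, comma separated.
    for part in value.split(','):
        words = part.strip().split()
        if len(words) == 2 and words[0] == service_type:
            major, minor = words[1].split('.')
            return int(major), int(minor)
    return None


headers = {'OpenStack-API-Version': 'placement 1.10'}
print(parse_microversion(headers, 'placement'))  # (1, 10)
```

A handler can then branch on the parsed tuple to decide which behavior the client asked for, defaulting to the minimum version when the header is absent.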
Good thread. Comments inline. On 02/10/2019 04:08 PM, Chris Dent wrote:
On Sun, 10 Feb 2019, Chris Dent wrote: Things have have worked out well (you can probably see a theme):
* Placement is a single purpose service with, until very recently, only the WSGI service as the sole moving part. There are now placement-manage and placement-status commands, but they are rarely used (thankfully). This makes the system easier to reason about than something with multiple agents. Obviously some things need lots of agents. Placement isn't one of them.
Yes.
* Using gabbi [2] as the framework for functional tests of the API and using them to enable test-driven-development, via those functional tests, has worked out really well. It keeps the focus on that sole moving part: The API.
Yes. Bigly. I'd also include here the fact that we didn't care much at all in placement land about unit tests and instead focused almost exclusively on functional test coverage.
* No RPC, no messaging, no notifications.
This is mostly just a historical artifact of wanting placement to be single-purpose; not something that was actively sought after, though :) I think having placement send event notifications would actually be A Good Thing since it turns placement into a better cloud citizen, enabling interested observers to trigger action instead of polling the placement API for information. But I agree with your overall point that the simplicity gained by not having all the cruft of nova's RPC/messaging layer was a boon.
* Very little configuration, reasonable defaults to that config. It's possible to run a working placement service with two config settings, if you are not using keystone. Keystone adds a few more, but not that much.
Yes.
* String adherence to WSGI norms (that is, any WSGI server can run a
Strict adherence I think you meant? :)
placement WSGI app) and avoidance of eventlet, but see below. The combination of this with the small number of moving parts and little configuration makes it super easy to deploy placement [3] in lots of different setups, from tiny to huge, scaling and robustifying those setups as required.
Yes.
* Declarative URL routing. There's a dict which maps HTTP method:URL pairs to python functions. Clear dispatch is a _huge_ help when debugging. Look one place, as a computer or human, to find where to go.
Yes.
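[Editor's illustration] The declarative routing praised above can be sketched in miniature like this — hypothetical handlers and paths, not placement's real route table:

```python
# Minimal sketch of dict-based declarative URL dispatch, in the spirit
# of placement's approach. Handlers and routes here are made up.

def list_resource_providers(environ):
    return '200 OK', '{"resource_providers": []}'

def get_usages(environ):
    return '200 OK', '{"usages": {}}'

# One dict, one place to look: (METHOD, path) -> handler function.
ROUTES = {
    ('GET', '/resource_providers'): list_resource_providers,
    ('GET', '/usages'): get_usages,
}

def dispatch(method, path, environ=None):
    """Look up the handler for a request; 404 on an unknown route."""
    try:
        handler = ROUTES[(method, path)]
    except KeyError:
        return '404 Not Found', '{}'
    return handler(environ or {})

status, body = dispatch('GET', '/usages')
print(status)  # 200 OK
```

The debugging benefit described above falls out of the data structure: to find where any request goes, read the one dict, rather than tracing decorator registration or framework routing tables.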
* microversion-parse [4] has made microversion handling easy.
Yes.

I will note a couple other things that I believe have worked out well:

1) Using generation markers for concurrent update mechanisms

Using a generation marker field for the relevant data models under the covers -- and exposing/expecting that generation via the API -- has enabled us to have a clear concurrency model and a clear mechanism for callers to trigger a re-drive of change operations. The use of generation markers has enabled us over time to reduce our use of caching and to have a single consistent trigger for callers (nova-scheduler, nova-compute) to fetch updated information about providers and consumers. Finally, the use of generation markers means there is nowhere in either the placement API nor its clients that uses any locking semantics *at all*. No mutexes. No semaphores. No "lock this thing" API call. None of that heavyweight old skool concurrency.

2) Separation of quantitative and qualitative things

Unlike the Nova flavor and its extra specs, placement has clear boundaries and expectations regarding what is a *resource* (quantitative thing that is consumed) and what is a *trait* (qualitative thing that describes a capability of the thing providing resources). This simple black-and-white modeling has allowed placement to fulfill scheduling queries and resource claim transactions efficiently. I hope, long term, that we can standardize on placement for tracking quota usage since its underlying data model and schema are perfectly suited for this task.
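[Editor's illustration] The generation-marker pattern described in point 1 is optimistic concurrency: a write succeeds only if the caller presents the generation it last read, otherwise the caller re-fetches and re-drives. A hedged sketch with a hypothetical in-memory store, not placement's actual code:

```python
# Sketch of generation-marker (optimistic) concurrency in the spirit of
# placement's approach. The store and its API are invented for clarity.

class ConflictError(Exception):
    """Raised so the caller can re-fetch current state and retry."""

class ProviderStore:
    def __init__(self):
        self._data = {}  # provider name -> (generation, inventory dict)

    def get(self, name):
        return self._data.get(name, (0, {}))

    def update(self, name, expected_generation, inventory):
        generation, _ = self.get(name)
        # The write succeeds only if the caller saw the latest state;
        # no mutex, no semaphore, no "lock this thing" call.
        if generation != expected_generation:
            raise ConflictError('generation mismatch: re-fetch and retry')
        self._data[name] = (generation + 1, inventory)
        return generation + 1

store = ProviderStore()
gen, _ = store.get('rp1')
store.update('rp1', gen, {'VCPU': 8})      # succeeds, generation -> 1
try:
    store.update('rp1', gen, {'VCPU': 4})  # stale generation: rejected
except ConflictError as exc:
    print(exc)
```

In the real API the generation travels in request and response bodies, so the conflict check happens server-side in one transaction; the client's only recovery path is the re-drive shown here.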
Things that haven't gone so well (none of these are dire) and would have been nice to do differently had we but known:
* Because of a combination of "we might need it later", "it's a handy tool and constraint" and "that's the way we do things" the interface between the placement URL handlers and the database is mediated through oslo versioned objects. Since there's no RPC, nor inter-version interaction, this is overkill. It also turns out that OVO getters and setters are a moderate factor in performance.
Data please.
Initially we were versioning the versioned objects, which created a lot of cognitive overhead when evolving the system, but we no longer do that, now that we've declared RPC isn't going to happen.
I agree with you that ovo is overkill and not needed in placement.
* Despite the strict adherence to being a good WSGI citizen mentioned above, placement is using a custom (very limited) framework for the WSGI application. An initial proof of concept used flask but it was decided that introducing flask into the nova development environment would be introducing another thing to know when decoding nova. I suspect the expected outcome was that placement would reuse nova's framework, but the truth is I simply couldn't do it. Declarative URL dispatch was a critical feature that has proven worth it. The resulting code is relatively straightforward but it is unicorn where a boring pony would have been the right thing. Boring ponies are very often the right thing.
Not sure I agree with this. The simplicity of the placement WSGI (non-)framework is a benefit. We don't need to mess with it. Really, it hasn't been an issue at all.

I'll add one thing that I don't believe we did correctly and that we'll regret over time:

Placement allocations currently have a distinct lack of temporal awareness. An allocation either exists or doesn't exist -- there is no concept of an allocation "end time". What this means is that placement cannot be used for a reservation system. I used to think this was OK, and that reservation systems should be layered on top of the simpler placement data model. I no longer believe this is a good thing, and feel that placement is actually the most appropriate service for modeling a reservation system. If I were to have a "do-over", I would have added the concept of a start and end time to the allocation.

Best,
-jay
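[Editor's illustration] The "do-over" described above — allocations with a time window — would let capacity be checked only against allocations that overlap a requested window. A speculative sketch of that model, not anything placement implements:

```python
# Hedged sketch of temporally-aware allocations: each allocation gets a
# start/end window, and usage is summed only over allocations whose
# windows overlap the window being queried. Speculative design only.
from datetime import datetime

class Allocation:
    def __init__(self, consumer, used, start, end):
        self.consumer = consumer
        self.used = used
        self.start = start
        self.end = end  # None could mean "open-ended", like today

def usage_during(allocations, start, end):
    """Sum usage from allocations overlapping the [start, end) window."""
    total = 0
    for a in allocations:
        overlaps = a.start < end and (a.end is None or a.end > start)
        if overlaps:
            total += a.used
    return total

allocs = [
    Allocation('vm1', 4, datetime(2019, 3, 1), datetime(2019, 3, 10)),
    Allocation('vm2', 2, datetime(2019, 3, 8), None),
]
# Capacity check for a proposed reservation during March 5-7:
print(usage_during(allocs, datetime(2019, 3, 5), datetime(2019, 3, 7)))  # 4
```

A reservation then becomes just an allocation whose window starts in the future; nothing needs to "delete" it at the end time, because expired windows simply stop counting against capacity.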
I'm sure there are more here, but I've run out of brain.
[1] https://review.openstack.org/#/c/630216/ [2] https://gabbi.readthedocs.io/ [3] https://anticdent.org/placement-from-pypi.html [4] https://pypi.org/project/microversion_parse/
On 2/14/19 8:16 AM, Jay Pipes wrote:
This simple black-and-white modeling has allowed placement to fulfill scheduling queries and resource claim transactions efficiently. I hope, long term, that we can standardize on placement for tracking quota usage since its underlying data model and schema are perfectly suited for this task.
Instead of, or in addition to the Keystone unified limits?
On 02/14/2019 09:26 AM, Ben Nemec wrote:
On 2/14/19 8:16 AM, Jay Pipes wrote:
This simple black-and-white modeling has allowed placement to fulfill scheduling queries and resource claim transactions efficiently. I hope, long term, that we can standardize on placement for tracking quota usage since its underlying data model and schema are perfectly suited for this task.
Instead of, or in addition to the Keystone unified limits?
In addition. Keystone unified limits stores the limits. Placement stores the usage counts. Best, -jay
On Thu, Feb 14, 2019, 09:51 Jay Pipes <jaypipes@gmail.com wrote:
On 02/14/2019 09:26 AM, Ben Nemec wrote:
On 2/14/19 8:16 AM, Jay Pipes wrote:
This simple black-and-white modeling has allowed placement to fulfill scheduling queries and resource claim transactions efficiently. I hope, long term, that we can standardize on placement for tracking quota usage since its underlying data model and schema are perfectly suited for this task.
Instead of, or in addition to the Keystone unified limits?
In addition. Keystone unified limits stores the limits. Placement stores the usage counts.
Best, -jay
This was the exact response I was hoping to see. I'm pleased if we start having consistent consumption of quota as well as the unified limit storage. Placement seems very well positioned for providing the functionality.
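[Editor's illustration] The division of labor Jay describes — limits in Keystone unified limits, usage counts in placement — combines into a simple quota check. A sketch with hypothetical data, not either service's real API:

```python
# Sketch of the split described above: a limit registry (the role of
# keystone unified limits) plus a usage counter (the role of placement),
# combined for a quota check. All names and values here are invented.

LIMITS = {'VCPU': 20, 'MEMORY_MB': 65536}   # stored by keystone

USAGES = {'VCPU': 16, 'MEMORY_MB': 40960}   # counted by placement

def quota_check(resource, requested):
    """Allow the request only if usage + requested stays within limit."""
    limit = LIMITS.get(resource, 0)
    used = USAGES.get(resource, 0)
    return used + requested <= limit

print(quota_check('VCPU', 4))  # True: 16 + 4 <= 20
print(quota_check('VCPU', 5))  # False: 16 + 5 > 20
```

The appeal of placement for the usage side is that allocations are already the authoritative usage record, so no separate counting or reconciliation job is needed.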
On Thu, 14 Feb 2019, Jay Pipes wrote:
* No RPC, no messaging, no notifications.
This is mostly just a historical artifact of wanting placement to be single-purpose; not something that was actively sought after, though :)
I certainly sought it and would have fought hard to prevent it if we ever ran into a situation where we had time to do it. These days, given time constraints, these sorts of optional nice-to-haves are easier to avoid because there are fewer people to do them...
I think having placement send event notifications would actually be A Good Thing since it turns placement into a better cloud citizen, enabling interested observers to trigger action instead of polling the placement API for information.
I think some kind of event stream would be interesting, but there are many ways to skin that cat. The current within-openstack standards for such things are pretty heavyweight, better ways are on the scene in the big wide world. By putting it off as long as possible, we can take advantage of that new stuff.
* String adherence to WSGI norms (that is, any WSGI server can run a
Strict adherence I think you meant? :)
My strictness is much better in wsgi than typing.
1) Using generation markers for concurrent update mechanisms
I agree. I'm still conflicted over whether we should have exposed them as ETags or not (mostly from an HTTP-love standpoint), but overall they've made lots of stuff possible and easier.
Finally, the use of generation markers means there is nowhere in either the placement API nor its clients that use any locking semantics *at all*. No mutexes. No semaphores. No "lock this thing" API call. None of that heavyweight old skool concurrency.
Yeah. State handling (lack of) is nice.
2) Separation of quantitative and qualitative things
Yes, very much agree.
* Because of a combination of "we might need it later", "it's a handy tool and constraint" and "that's the way we do things" the interface between the placement URL handlers and the database is mediated through oslo versioned objects. Since there's no RPC, nor inter-version interaction, this is overkill. It also turns out that OVO getters and setters are a moderate factor in performance.
Data please.
When I wrote that bullet I just had some random profiling data from running a profiler during a bunch of requests, which made it clear that some ovo methods (in the getters and setters) were being called a ton (in large part because of the number of objects involved in an allocation candidates response). I didn't copy that down anywhere at the time because I planned to do it more formally.

Since then, I've made this:

https://review.openstack.org/#/c/636631/

That's a stack which removes OVO from placement. While we know the perfload job is not scientific, it does provide a nice guide. An ovo-using patch <http://logs.openstack.org/95/633595/2/check/placement-perfload/267131a/logs/placement-perf.txt.gz> has perfload times of 2.65-ish (seconds). The base of that OVO removal stack (which changes allocation candidates) <http://logs.openstack.org/31/636631/4/check/placement-perfload/a413724/logs/placement-perf.txt> is 2.3-ish. The end of it <http://logs.openstack.org/07/636807/2/check/placement-perfload/fa7d58f/logs/placement-perf.txt> is 1.5-ish.

And there are ways in which the code is much more explicit. There's plenty of cleanup to do, and I'm not wed to us making that change if people aren't keen, but I can see a fair number of reasons above and beyond performance to do it, though performance alone might be enough. Lots more info in the commits and comments in that stack.
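[Editor's illustration] The kind of profiling described above — surfacing hot getters/setters by call count — can be reproduced generically with cProfile. The toy object below stands in for an OVO-style object with attribute indirection; it is not OVO:

```python
# Generic illustration of profiling attribute-access overhead with
# cProfile. Obj mimics OVO-style getter/setter indirection; the
# workload and numbers are invented, not placement's.
import cProfile
import io
import pstats

class Obj:
    def __init__(self):
        self._fields = {}

    def __setattr__(self, name, value):
        if name == '_fields':
            super().__setattr__(name, value)
        else:
            self._fields[name] = value  # setter indirection, OVO-style

    def __getattr__(self, name):
        return self._fields[name]       # getter indirection

def workload():
    results = []
    for i in range(10000):
        o = Obj()
        o.uuid = i          # goes through __setattr__
        results.append(o.uuid)  # goes through __getattr__
    return results

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('ncalls').print_stats(5)
report = out.getvalue()
print('__setattr__' in report)  # True: the indirection shows up hot
```

In a report like this the dunder methods dominate by call count, which is the shape of evidence that motivated the OVO-removal stack.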
* Despite the strict adherence to being a good WSGI citizen mentioned above, placement is using a custom (very limited) framework for the WSGI application. An initial proof of concept used flask but it was decided that introducing flask into the nova development environment would be introducing another thing to know when decoding nova. I suspect the expected outcome was that placement would reuse nova's framework, but the truth is I simply couldn't do it. Declarative URL dispatch was a critical feature that has proven worth it. The resulting code is relatively straightforward but it is unicorn where a boring pony would have been the right thing. Boring ponies are very often the right thing.
Not sure I agree with this. The simplicity of the placement WSGI (non-)framework is a benefit. We don't need to mess with it. Really, it hasn't been an issue at all.
I agree that it is very hands off now, and not worth changing, but as an example for new projects, it is something to think about. It had creation costs in various forms. If there wasn't a me around (many custom non-frameworks under my belt) it would have been harder to create something (and then manage/maintain/educate it). sdague and I nearly came to metaphorical blows over it. If it were just normal to use the boring pony such things wouldn't need to happen.
Placement allocations currently have a distinct lack of temporal awareness. An allocation either exists or doesn't exist -- there is no concept of an allocation "end time". What this means is that placement cannot be used for a reservation system. I used to think this was OK, and that reservation systems should be layered on top of the simpler placement data model.
Yeah, I was thinking about this recently too. Trying to come up with conceptual hacks that would make it possible without drastically changing the existing data model. There's stuff percolating in my brain, potentially as weird as infinite resource classes but maybe not, but nothing has gelled. I hope, at least, that we can get the layered on top stuff working well.
Best, -jay
Thanks very much for chiming in here, I hope other people will too.
I'm sure there are more here, but I've run out of brain.
One thing that came up in the TC discussions [1] related to placement governance [2] was that given there have been bumps in the extraction road, it might be useful to also document the learnings from that. The main one, from my perspective, is:

If there's any inkling that a new service (something with what might be described as a public interface) is ever going to be eventually extracted, start it outside the parent project from the outset, but make sure the people involved overlap.

[1] http://eavesdrop.openstack.org/irclogs/%23openstack-tc/%23openstack-tc.2019-...
[2] https://review.openstack.org/#/c/636416/

--
Chris Dent  ٩◔̯◔۶  https://anticdent.org/  freenode: cdent  tw: @anticdent
On 02/14/2019 09:47 AM, Chris Dent wrote:
* Because of a combination of "we might need it later", "it's a handy tool and constraint" and "that's the way we do things" the interface between the placement URL handlers and the database is mediated through oslo versioned objects. Since there's no RPC, nor inter-version interaction, this is overkill. It also turns out that OVO getters and setters are a moderate factor in performance.
Data please.
When I wrote that bullet I just had some random profiling data from running a profiler during a bunch of requests, which made it clear that some ovo methods (in the getters and setters) were being called a ton (in large part because of the number of objects involved in an allocation candidates response). I didn't copy that down anywhere at the time because I planned to do it more formally.
Since then, I've made this:
https://review.openstack.org/#/c/636631/
That's a stack which removes OVO from placement. While we know the perfload job is not scientific, it does provide a nice guide. An ovo-using patch <http://logs.openstack.org/95/633595/2/check/placement-perfload/267131a/logs/placement-perf.txt.gz>
has perfload times of 2.65-ish (seconds).
The base of that OVO removal stack (which changes allocation candidates) <http://logs.openstack.org/31/636631/4/check/placement-perfload/a413724/logs/placement-perf.txt>
is 2.3-ish.
The end of it <http://logs.openstack.org/07/636807/2/check/placement-perfload/fa7d58f/logs/placement-perf.txt>
is 1.5-ish.
And there are ways in which the code is much more explicit. There's plenty of cleanup to do, and I'm not wed to us making that change if people aren't keen, but I can see a fair number of reasons above and beyond performance to do it, though performance alone might be enough. Lots more info in the commits and comments in that stack.
bueno. :) I'll review that series over the next couple days. Great work, Chris. -jay
On Feb 14, 2019, at 8:16 AM, Jay Pipes <jaypipes@gmail.com> wrote:
Placement allocations currently have a distinct lack of temporal awareness. An allocation either exists or doesn't exist -- there is no concept of an allocation "end time". What this means is that placement cannot be used for a reservation system. I used to think this was OK, and that reservation systems should be layered on top of the simpler placement data model.
I no longer believe this is a good thing, and feel that placement is actually the most appropriate service for modeling a reservation system. If I were to have a "do-over", I would have added the concept of a start and end time to the allocation.
I’m not clear on how you are envisioning this working. Will Placement somehow delete an allocation at this end time? IMO this sort of functionality should really be done by a system external to Placement. But perhaps you are thinking of something completely different, and I’m just a little thick? -- Ed Leafe
On 02/14/2019 10:08 AM, Ed Leafe wrote:
On Feb 14, 2019, at 8:16 AM, Jay Pipes <jaypipes@gmail.com> wrote:
Placement allocations currently have a distinct lack of temporal awareness. An allocation either exists or doesn't exist -- there is no concept of an allocation "end time". What this means is that placement cannot be used for a reservation system. I used to think this was OK, and that reservation systems should be layered on top of the simpler placement data model.
I no longer believe this is a good thing, and feel that placement is actually the most appropriate service for modeling a reservation system. If I were to have a "do-over", I would have added the concept of a start and end time to the allocation.
I’m not clear on how you are envisioning this working. Will Placement somehow delete an allocation at this end time? IMO this sort of functionality should really be done by a system external to Placement. But perhaps you are thinking of something completely different, and I’m just a little thick?
I'm not actually proposing this functionality be added to placement at this time. Just remarking that had I to do things over again, I would have modeled an end time in the allocation concept. The end times are not yet upon us, fortunately. Best, -jay
On 2/10/19 2:33 PM, Chris Dent wrote:
This a "part 2" or "other half" of evaluating OpenStack projects in relation to the technical vision. See the other threads [1][2] for more information.
In the conversations that led up to the creation of the vision document [3] one of the things we hoped was that the process could help identify ways in which existing projects could evolve to be better at what they do. This was couched in two ideas:
* Helping to make sure that OpenStack continuously improves, in the right direction. * Helping to make sure that developers were working on projects that leaned more towards interesting and educational than frustrating and embarrassing, where choices about what to do and how to do it were straightforward, easy to share with others, so well-founded in agreed good practice that argument would be rare, and so few that it was easy to decide.
Of course, to have a "right direction" you first have to have a direction, and thus the vision document and the idea of evaluating how aligned a project is with that.
The other half, then, is looking at the projects from a development standpoint and thinking about what aspects of the project are:
* Things (techniques, tools) the project contributors would encourage others to try. Stuff that has worked out well.
Oslo documents some things that I think would fall under this category in http://specs.openstack.org/openstack/oslo-specs/#team-policies

The incubator one should probably get removed since it's no longer applicable, but otherwise I feel like we mostly still follow those policies and find them to be reasonable best practices. Some are very Oslo-specific and not useful to anyone else, of course, but others could be applied more broadly.

There's also http://specs.openstack.org/openstack/openstack-specs/specs/eventlet-best-pra... although in the spirit of your next point I would be more +1 on the "don't use Eventlet" option for new projects. It might be nice to have a document that discusses preferred Eventlet alternatives for new projects. I know there are a few Eventlet-free projects out there that could probably provide feedback on their method.
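[Editor's illustration] The eventlet-free alternative the thread keeps circling back to is plain WSGI adherence: a bare callable any compliant server can host. A minimal hedged sketch, exercised here with a synthetic environ rather than a real server:

```python
# Minimal sketch of an eventlet-free, WSGI-norm-abiding application.
# Any compliant WSGI server (wsgiref, mod_wsgi, uWSGI, gunicorn with
# sync workers, ...) can host it unchanged. Purely illustrative.
from wsgiref.util import setup_testing_defaults

def application(environ, start_response):
    """A plain WSGI callable: no framework, no eventlet."""
    body = b'{"status": "ok"}'
    start_response('200 OK', [
        ('Content-Type', 'application/json'),
        ('Content-Length', str(len(body))),
    ])
    return [body]

# Exercise the app without a server, using a synthetic environ.
environ = {}
setup_testing_defaults(environ)
captured = {}

def start_response(status, headers):
    captured['status'] = status

result = b''.join(application(environ, start_response))
print(captured['status'], result)
```

Because the contract is just "callable taking environ and start_response", deployment choices (which server, how many workers, TLS termination) stay entirely outside the application, which is the deployability benefit claimed for placement earlier in the thread.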
* Things—given a clean slate, unlimited time and resources, the benefit of hindsight and without the weight of legacy—the project contributors would encourage others to not repeat.
And documenting those things so they can be carried forward in time some place other than people's heads, and new projects or refactorings of existing projects can start on a good foot.
A couple of examples:
* Whatever we might say about the implementation (in itself and how it is used), the concept of a unified configuration file format, via oslo_config, is probably considered a good choice, and we should keep on doing that.
I'm a _little_ biased, but +1. Things like your env var driver or the drivers for moving secrets out of plaintext would be next to impossible if everyone were using a different configuration method.
* On the other hand, given hindsight and improvements in commonly available tools, using a homegrown WSGI (non-)framework (unless you are Swift) plus eventlet may not have been the way to go, yet because it is what's still there in nova, it often gets copied.
And as I noted above, +1 to this too.
It's not clear at this point whether these sorts of things should be documented in projects, or somewhere more central. So perhaps we can just talk about it here in email and figure something out. I'll follow up with some I have for placement, since that's the project I've given the most attention.
[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-January/001417.h...
[2] http://lists.openstack.org/pipermail/openstack-discuss/2019-February/002524....
[3] https://governance.openstack.org/tc/reference/technical-vision.html
participants (5)

- Ben Nemec
- Chris Dent
- Ed Leafe
- Jay Pipes
- Morgan Fainberg