[openstack-dev] [tc] [all] OpenStack moving both too fast and too slow at the same time

Joshua Harlow harlowja at fastmail.com
Tue May 9 17:59:41 UTC 2017


Matt Riedemann wrote:
> On 5/8/2017 1:10 PM, Octave J. Orgeron wrote:
>> I do agree that scalability and high-availability are definitely issues
>> for OpenStack when you dig deeper into the sub-components. There is a
>> lot of re-inventing of the wheel when you look at how distributed
>> services are implemented inside of OpenStack and deficiencies. For some
>> services you have a scheduler that can scale-out, but the conductor or
>> worker process doesn't. A good example is cinder, where cinder-volume
>> doesn't scale-out in a distributed manner and doesn't have a good
>> mechanism for recovering when an instance fails. All across the services
>> you see different methods for coordinating requests and tasks such as
>> rabbitmq, redis, memcached, tooz, mysql, etc. So for an operator, you
>> have to sift through those choices and configure the prerequisite
>> infrastructure. This is a good example of a problem that should be
>> solved with a single architecturally sound solution that all services
>> can standardize on.
>
> There was an architecture workgroup specifically designed to understand
> past architectural decisions in OpenStack, and what the differences are
> in the projects, and how to address some of those issues, but from lack
> of participation the group dissolved shortly after the Barcelona summit.
> This is, again, another example of if you want to make these kinds of
> massive changes, it's going to take massive involvement and leadership.
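
Octave's list of coordination backends above (rabbitmq, redis, memcached, tooz, mysql) is exactly the fragmentation at issue, and the hash-ring approach used elsewhere in OpenStack (e.g. ironic's conductor distribution) is one standard answer to the cinder-volume scale-out gap he mentions. A stdlib-only sketch of the idea, purely illustrative and not cinder's actual implementation:

```python
# Sketch of hash-ring work distribution that would let a service like
# cinder-volume run active/active: each volume maps deterministically
# to one live worker, and the mapping only shifts for volumes owned by
# a worker that fails. Illustrative only -- not cinder's real code.
import bisect
import hashlib

class HashRing:
    def __init__(self, workers, replicas=100):
        # Place `replicas` virtual points per worker on the ring so
        # load stays roughly even when a worker joins or leaves.
        self._ring = sorted(
            (self._hash(f"{w}-{i}"), w)
            for w in workers for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, volume_id):
        # First ring point clockwise from the volume's hash.
        idx = bisect.bisect(self._keys, self._hash(volume_id)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["cinder-volume-1", "cinder-volume-2", "cinder-volume-3"])
print(ring.owner("volume-abc123"))  # deterministic: same worker every call
```

The worker names and volume id are made up; the point is that a single, shared primitive like this (which tooz was meant to provide as a library) beats every project wiring up its own coordination stack.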

I agree with 'massive changes, it's going to take massive involvement 
and leadership,' though I am not sure how such changes and involvement 
actually happen; especially nowadays, when the companies with that kind 
of leadership are moving on to something else (k8s, mesos, or other...)

So knowing that, what are the options to actually make some kind of 
change occur? IMHO it must be driven by PTLs (yes, I know they are 
always busy; too bad, so sad, lol). I'd like all the PTLs to get 
together and restart the arch-wg, and make it a *requirement* that PTLs 
actually show up (and participate) in that group/meeting, vs it just 
being a bunch of senior(ish) folks, such as myself, that showed up. 
Then if PTLs do not show up, that lack of participation in the wider 
OpenStack vision should be made known the next time they run for PTL, 
and should potentially get them voted out of being a PTL in the future.

>>
>> The problem in a lot of those cases comes down to development being
>> detached from the actual use cases customers and operators are going to
>> use in the real world. Having a distributed control plane with multiple
>> instances of the api, scheduler, coordinator, and other processes is
>> typically not testable without a larger hardware setup. When you get to
>> large scale deployments, you need an active/active setup for the control
>> plane. It's definitely not something you could develop for or test
>> against on a single laptop with devstack. Especially, if you want to use
>> more than a handful of the OpenStack services.

I've heard *crazy* things about the actual use cases customers and 
operators end up with because of the scaling limits projects have (e.g. 
if nova tops out at 300 compute nodes, customer ABC will stand up X 
clouds of 300 nodes each just to reach Y total compute nodes).
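
The cloud-count arithmetic is simple but worth making explicit; a toy sketch (the 300-node figure is the anecdote's, not an official nova limit):

```python
import math

def clouds_needed(target_nodes, per_cloud_limit=300):
    """Number of separate clouds an operator ends up running when a
    single cloud tops out at per_cloud_limit compute nodes.
    (300 is the anecdotal limit from this thread, not a documented
    nova maximum.)"""
    return math.ceil(target_nodes / per_cloud_limit)

print(clouds_needed(1500))  # 5 clouds of 300 nodes each
```

Every one of those extra clouds carries its own control plane, which is the operational cost the anecdote is really about.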

IMHO I'm not even sure I would want to target those use cases, because 
they feel messed up to begin with (and it seems bad/dumb? to go down 
the rabbit hole of targeting use cases that only exist as band-aids 
over the initial scaling problems that created them).

>
> I think we can all agree with this. Developers don't have a lab with
> 1000 nodes lying around to hack on. There was OSIC but that's gone. I've
> been requesting help in Nova from companies to do scale testing and help
> us out with knowing what the major issues are, and report those back in
> a form so we can work on those issues. People will report there are
> issues, but not do the profiling, or at least not report the results of
> profiling, upstream to help us out. So again, this is really up to
> companies that have the resources to do this kind of scale testing and
> report back and help fix the issues upstream in the community. That
> doesn't require OpenStack 2.0.
>
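
Matt's ask -- profile and report results, not just "it's slow" -- can start as small as wrapping the suspect path with the stdlib profiler. A generic sketch (the `schedule_instances` function is a made-up stand-in, not a nova API):

```python
# Minimal profiling harness an operator could run against a slow code
# path and attach the output to an upstream bug report. The profiled
# function below is a hypothetical stand-in for a real hot path.
import cProfile
import io
import pstats

def schedule_instances(n):
    # Stand-in for whatever is slow at scale, e.g. a scheduling pass
    # over n compute nodes.
    return sorted(range(n), key=lambda h: (h * 2654435761) % n)

profiler = cProfile.Profile()
profiler.enable()
schedule_instances(100_000)
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())  # the kind of detail worth reporting upstream
```

Even this much -- top five functions by cumulative time -- is more actionable upstream than "our deployment falls over at scale."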

So how do we close that gap? The only way I really know is by having 
people who can see the problems from the get-go, instead of having to 
discover them at some later point (when things fall over and customer 
ABC starts running Y clouds just to reach the number of compute nodes 
they want). Now maybe the skill level in OpenStack (especially with 
regard to distributed systems) is just too low, and the only real way 
to gather data is by having companies do scale testing (i.e. 
architecting things to work only after they are deployed); if so, 
that's sad...

-Josh


