[openstack-dev] Which program for Rally

Matthew Treinish mtreinish at kortar.org
Wed Aug 13 19:57:21 UTC 2014


On Tue, Aug 12, 2014 at 01:45:17AM +0400, Boris Pavlovic wrote:
> Hi stackers,
> 
> 
> I would like to put some more details on current situation.
> 
> >
> > The issue is with what Rally is in
> > its current form. Its scope is too large and monolithic, and it duplicates
> > much of the functionality we either already have or need in current QA or
> > Infra projects. But nothing in Rally is designed to be used outside of it.
> > I actually feel pretty strongly that in its current form Rally should
> > *not* be a part of any OpenStack program
> 
> 
> Rally is not just a bunch of scripts like tempest; it's more like Nova,
> Cinder, and other projects that work out of the box and resolve operator and
> developer use cases in one click.
> 
> This architectural design is the main key to Rally's success, and why we've
> seen such large adoption and community growth.
> 
> > So I'm opposed to this option. It feels to me like this is only on the table
> > because the Rally team has not done a great job of communicating or
> > working with
> > anyone else except for when it comes to either push using Rally, or this
> > conversation about adopting Rally.
> 
> 
> Actually the Rally team has already done a bunch of useful work, including
> cross-project and infra stuff.
> 
> Keystone, Glance, Cinder, Neutron and Heat are running rally performance
> jobs that can be used for performance testing, benchmarking, and regression
> testing (already today). These jobs support in-tree plugins for all
> components (scenarios, load generators, benchmark contexts), and projects can
> use Rally fully without any interaction with the Rally team. More about these
> jobs:
> https://docs.google.com/a/mirantis.com/document/d/1s93IBuyx24dM3SmPcboBp7N47RQedT8u4AJPgOHp9-A/
> So I really don't see anything like this in tempest (even in the foreseeable
> future)
> 

So this is actually the communication problem I mentioned before. Singling out
individual projects and getting them to add a rally job is not "cross project"
communication. (this is part of what I meant by "push using Rally") There was no
larger discussion on the ML or a topic in the project meeting about adding these
jobs. There was no discussion about the value vs risk of adding new jobs to the
gate. Also, this is why less than half of the integrated projects have these
jobs. Having asymmetry like this between gating workloads on projects helps no
one.

> 
> I would like to mention work on OSprofiler (cross service/project profiler)
> https://github.com/stackforge/osprofiler (that was done by Rally team)
> https://review.openstack.org/#/c/105096/
> (btw Glance already accepted it https://review.openstack.org/#/c/105635/ )

So I don't think we're actually talking about osprofiler here; this discussion
is about Rally itself. Although, personally, I feel the communication issues I
mentioned before are still present around osprofiler. From everything I've seen
of osprofiler adoption, it has been the same divide-and-conquer strategy when
talking to people, instead of a combined discussion about the project and
library adoption upfront.

That being said, the reason I think osprofiler has been more accepted, and its
adoption into oslo is not nearly as contentious, is that it's an independent
library that has value on its own. You don't need to pull in a monolithic
stack to use it, which is a design point more consistent with the rest of
OpenStack.

> 
> 
> > My primary concern is the timing for doing all of this work. We're
> > approaching
> > J-3 and honestly this feels like it would take the better part of an entire
> > cycle to analyze, plan, and then implement. Starting an analysis of how to
> > do
> > all of the work at this point I feel would just distract everyone from
> > completing our dev goals for the cycle. Probably the Rally team, if they
> > want
> > to move forward here, should start the analysis of this structural split
> > and we
> > can all pick this up together post-juno
> 
> 
> 
> Matt, Sean - seriously, community is about convincing people, not about
> forcing people to do things against their will. You are making huge
> architectural decisions without deep knowledge of what Rally is, what its
> use cases, road map, goals, and audience are.
> 
> So the QA program should convince the Rally team (at least me) to make such
> changes. The key to convincing me is to explain how this will help OpenStack
> perform better.

If community, per your definition, is about convincing people, then there needs
to be a 2-way discussion. This is an especially key point considering the
feedback on this thread is basically the same feedback you've been getting since
you first announced Rally on the ML. [1] (and from even before that I think, but
it's hard to remember all the details from that far back) I'm afraid that
without a shared willingness to explore what we're suggesting, rather than
dismissing it out of preconceived notions, I fail to see the point in moving
forward. The fact that this feedback has been ignored is why this discussion has
come up at all.

> 
> Currently the Rally team sees a lot of issues with this decision:
> 
> 1) It breaks already existing performance jobs (Heat, Glance, Cinder,
> Neutron, Keystone)

So firstly, I want to say I find these jobs troubling. Not just because, given
the nature of the gate (second-level virt on public clouds), the variability
between jobs can be staggering. I can't imagine what value there is in running
synthetic benchmarks in this environment; it would only reliably catch the most
egregious of regressions. Also, from what I can tell, none of these jobs
actually compares the timing data to previous results; each just generates the
data and makes a pretty graph. The burden appears to be on the user to figure
out what it means, which really isn't that useful. How have these jobs actually
helped? IMO the real value of performance testing in the gate is to capture the
longer-term trends in the data, which is something these jobs are not doing.
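(To make concrete what I mean by comparing against previous results: a trivial,
purely illustrative sketch, not something rally or any gate job does today,
would be to keep per-scenario timings from prior runs and flag a run that
drifts well outside the baseline:)

```python
import statistics

def flag_regression(baseline_times, new_time, threshold=3.0):
    """Return True if new_time is more than `threshold` standard
    deviations above the mean of the baseline timings."""
    mean = statistics.mean(baseline_times)
    stdev = statistics.stdev(baseline_times)
    return new_time > mean + threshold * stdev

# e.g. previous boot-server timings hovering around 10s
baseline = [9.8, 10.1, 10.4, 9.9, 10.2, 10.0]
print(flag_regression(baseline, 10.3))   # within normal variance
print(flag_regression(baseline, 14.0))   # well outside it
```

Obviously real analysis would have to account for the gate's variance, but even
something this crude requires storing historical results, which the current
jobs don't do.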

Considering all the issues we've had with reliability in the gate, going off
and running something unproven at the level we run things in the gate is just a
recipe for disaster. Combine that with the fact that doing this means our test
load in the gate is managed by 2 separate teams, and the result is troubling if
we want a reliable gate.

That being said, I don't fault anyone who approved this stuff; without a wider
cross-project discussion, it's hard to know the implications of adding the job.
Everyone wants to make OpenStack better, and in an isolated discussion who
wouldn't want "performance tests" run on each patch? I do think the jobs have
better-looking output than a normal gate run; the graphs are very nice looking.

> 
> 2) It breaks functional testing of Rally (which is already done in the gate)
> 
> 3) It makes the Rally team dependent on Tempest throughput, and what I've
> heard multiple times from the QA team is that performance work is a very low
> priority and that the major goal is to keep the gates working. So it will
> slow down the performance team's work.

I think you're misinterpreting what you've been told. When gate failure rates
are high and things are on fire, random graphs from rally showing performance
data aren't useful. Having a working gate is always going to be a priority, and
it should be for everyone, not just the QA and Infra teams.

From a strictly QA-program development-priority standpoint, this work would
actually have a greater priority than new tests (which are the bread and butter
of tempest contributions), mostly because feature enhancements in tempest tend
to get more attention anyway. In fact, right now improvements related to
instrumentation and result collection are a high-priority item (without anyone
necessarily working on them at the moment). For example, fixing the bugs in
testr related to worker allocation reporting, and getting accurate test timing
data (including setUpClass) into the subunit streams, would be great places to
start.

> 
> 4) It brings up a ton of questions about what should be in Rally and what
> should be in Tempest, which at the moment are quite resolved.
> https://docs.google.com/a/pavlovic.ru/document/d/137zbrz0KJd6uZwoZEu4BkdKiR_Diobantu0GduS7HnA/edit#heading=h.9ephr9df0new

I feel this is the underlying trouble with the discussion here. This entire
proposal is based on your personal bias about what the role of tempest and the
QA program is. Your arbitrary distinction between "performance tests" and
"functional tests" is very confusing, especially considering that in most cases
they are nearly identical tests. I agree that you've correctly identified
shortcomings in tempest; for the most part they're actually work items we've
been discussing and/or working on fixing. But I can't think of a reason why
fixing these issues should be the domain of another project (let alone another
program).

I think a possible result of this thread is that we may need to tweak the
mission statement for the QA program to make it a bit more clear for people who
aren't working with us on a day to day basis.

> 
> 5) It breaks up an existing OpenStack team that is working 100% on
> performance, regression, and SLA topics. Sorry, but there is no such team in
> tempest. This directory is not actively developed:
> https://github.com/openstack/tempest/commits/master/tempest/stress

I think the resistance you're seeing is because much of this work has been
duplicated instead of going toward expanding the scope and capabilities of
tempest. We want these sorts of things, and the lack of movement does not imply
otherwise. Asking to merge something built as an island after the fact is not a
replacement for this.

I also think you're missing that the framework allows any existing tempest test
case to be used as a stress test, which means we're improving the stress tests
by working on other tempest tests. That's why using tempest as a load generator
makes so much sense.
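(For reference, a stress job is just a small JSON description pointing at an
existing test method; the sketch below is from memory, so the exact field names
may differ from what's currently in the tree.)

```json
[{
    "action": "tempest.stress.actions.unit_test.UnitTest",
    "threads": 8,
    "use_admin": false,
    "use_isolated_tenants": true,
    "kwargs": {
        "test_method": "tempest.api.volume.test_volumes_actions.VolumesActionsTest.test_attach_detach_volume_to_instance",
        "class_setup_per": "process"
    }
}]
```

Any improvement to the wrapped test case automatically improves the stress run
built on top of it.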

> 
> 
> Matt, Sean, David - what are the real goals of merging Rally into Tempest?
> I see huge harm for OpenStack and the companies that are using Rally, and I
> don't actually see any benefits.
> What I've heard so far is something like "this decision will make tempest
> better"...
> But do you care more about Tempest than about OpenStack?
> 

So I think Sean's last reply on this thread [2] sums up the argument here quite
well, but I'll attempt to summarize the point even more concisely. Really just
re-read his post, because any further simplification will abstract out too many
details.

The thing I think you're missing here is that pushing Rally as another separate
workflow from what we already have in the gate hurts everyone. Besides
introducing another debug workflow that every developer will have to
familiarize themselves with, it adds additional strain in the form of extra
cross-project communication, by adding yet another thing to coordinate on.
Additionally, there are a lot of things in tempest that we've slowly evolved
over time as we've been running it continually in the gate. These are things I
can already see haven't been considered in running rally at this scale (like
autodiscovery of features for testing) and will have to be rediscovered as
problems crop up.

I also take offense at the implication here; I think you'll be hard pressed to
find a group of people who collectively are more dedicated and care more about
OpenStack than those who work in the QA and Infra space. (not to imply that
anybody else cares less)

In summary, I'm still -2 on accepting Rally in its current form. I would love
to see Rally-like functionality, but in a form that actually works with the
ecosystem we've built and evolved over time. I really want to see us actually
use all the performance data we can collect from the gating jobs we're already
running. I especially would like to see collaboration and integration between
the QA program and the Rally team. But introducing a second independent project
into the mix just adds more overhead and creates more things to manage.

-Matt Treinish

[1] http://lists.openstack.org/pipermail/openstack-dev/2013-October/017004.html
[2] http://lists.openstack.org/pipermail/openstack-dev/2014-August/042269.html

> 
> 
> On Tue, Aug 12, 2014 at 12:37 AM, David Kranz <dkranz at redhat.com> wrote:
> 
> >  On 08/11/2014 04:21 PM, Matthew Treinish wrote:
> >
> > I apologize for the delay in my response to this thread, between travelling
> > and having a stuck 'a' key on my laptop this is the earliest I could
> > respond.
> > I opted for a separate branch on this thread to summarize my views and I'll
> > respond inline later on some of the previous discussion.
> >
> > On Wed, Aug 06, 2014 at 12:30:35PM +0200, Thierry Carrez wrote:
> > > Hi everyone,
> > >
> > > At the TC meeting yesterday we discussed Rally program request and
> > > incubation request. We quickly dismissed the incubation request, as
> > > Rally appears to be able to live happily on top of OpenStack and would
> > > benefit from having a release cycle decoupled from the OpenStack
> > > "integrated release".
> > >
> > > That leaves the question of the program. OpenStack programs are created
> > > by the Technical Committee, to bless existing efforts and teams that are
> > > considered *essential* to the production of the "OpenStack" integrated
> > > release and the completion of the OpenStack project mission. There are 3
> > > ways to look at Rally and official programs at this point:
> > >
> > > 1. Rally as an essential QA tool
> > > Performance testing (and especially performance regression testing) is
> > > an essential QA function, and a feature that Rally provides. If the QA
> > > team is happy to use Rally to fill that function, then Rally can
> > > obviously be adopted by the (already-existing) QA program. That said,
> > > that would put Rally under the authority of the QA PTL, and that raises
> > > a few questions due to the current architecture of Rally, which is more
> > > product-oriented. There needs to be further discussion between the QA
> > > core team and the Rally team to see how that could work and if that
> > > option would be acceptable for both sides.
> >
> > So ideally this is where Rally would belong; the scope of what Rally is
> > attempting to do is definitely inside the scope of the QA program. I don't
> > see any reason why that isn't the case. The issue is with what Rally is in
> > its current form. Its scope is too large and monolithic, and it duplicates
> > much of the functionality we either already have or need in current QA or
> > Infra projects. But nothing in Rally is designed to be used outside of it.
> > I actually feel pretty strongly that in its current form Rally should
> > *not* be a part of any OpenStack program.
> >
> > All of the points Sean was making in the other branch on this thread
> > (which I'll probably respond to later) are huge concerns I share about
> > Rally. He basically summarized most of my views on the topic, so I'll try
> > not to rewrite everything. But the fact that all of this duplicate
> > functionality was implemented in a completely separate, Rally-specific
> > manner, and can't really be used unless all of Rally is used, is a large
> > concern. I think the path forward here is to have both QA and Rally work
> > together on getting common functionality that is re-usable and shareable.
> > Additionally, I have some concerns over the methodology that Rally uses
> > for its performance measurement. But I'll table that discussion because I
> > think it would partially derail this one.
> >
> > So one open question is, long-term, where this would leave Rally if we
> > want to bring it in under the QA program (after splitting up the
> > functionality to be more conducive to all our existing tools and
> > projects). The one thing Rally does here which we don't have an analogous
> > solution for is, for lack of a better term, the post-processing layer: the
> > part that performs the analysis on the collected data and generates the
> > graphs. That is something we'll eventually need, and something we can work
> > on turning rally into as we migrate everything to actually work together.
> >
> > There are probably also other parts of Rally which don't fit into an
> > existing QA program project (or the QA program in general), and in those
> > cases we probably should split them off as smaller projects to implement
> > those bits. For example, the SLA stuff Rally has should probably be a
> > separate entity as well, but I'm unsure if that fits under the QA program.
> >
> > My primary concern is the timing for doing all of this work. We're
> > approaching
> > J-3 and honestly this feels like it would take the better part of an entire
> > cycle to analyze, plan, and then implement. Starting an analysis of how to
> > do
> > all of the work at this point I feel would just distract everyone from
> > completing our dev goals for the cycle. Probably the Rally team, if they
> > want
> > to move forward here, should start the analysis of this structural split
> > and we
> > can all pick this up together post-juno.
> >
> > >
> > > 2. Rally as an essential operator tool
> > > Regular benchmarking of OpenStack deployments is a best practice for
> > > cloud operators, and a feature that Rally provides. With a bit of a
> > > stretch, we could consider that benchmarking is essential to the
> > > completion of the OpenStack project mission. That program could one day
> > > evolve to include more such "operations best practices" tools. In
> > > addition to the slight stretch already mentioned, one concern here is
> > > that we still want to have performance testing in QA (which is clearly
> > > essential to the production of "OpenStack"). Letting Rally primarily be
> > > an operational tool might make that outcome more difficult.
> > >
> >
> > So I'm opposed to this option. It feels to me like this is only on the
> > table
> > because the Rally team has not done a great job of communicating or
> > working with
> > anyone else except for when it comes to either push using Rally, or this
> > conversation about adopting Rally.
> >
> > That being said, looking at a separate "operator tool" program for Rally
> > doesn't
> > make much sense to me. There is nothing in Rally that is more or less
> > operator
> > tooling specific compared to Tempest or some of the infra tooling. I fail
> > to see
> > what in Rally warrants a separate program. To be a bit sardonic, my
> > question is
> > if Tempest had a REST API [1][2] then should we move it to the proposed
> > operators program too? The other thing, which came out of the summit, was
> > that
> > tempest is often used by operators in a loop to get a heartbeat on their
> > cloud.
> >
> > My point is that just because a tool is part of the QA program doesn't
> > mean it's not useful for operators. I think that's something that seems to
> > have been lost during this discussion (or just brushed over). Sure, our
> > first priority is going to be making things work in a dev environment and
> > the gate, but that doesn't necessarily preclude using things against a
> > production environment. For tempest at least, that's something we actually
> > explicitly support. [3]
> >
> > +1
> > We were a little slow out of the gate (so to speak) on this but are making
> > progress by eliminating all devstack-specific stuff from tempest
> > configuration, adding support for non-admin parallel tempest with multiple
> > users, and in general getting rid of discovered roadblocks to real use. As
> > has been pointed out before, many folks use tempest against real clouds,
> > including many members of the tempest core team. IMO this should be
> > considered an equal priority with gating a dev environment. The biggest
> > problem with that goal is that tempest gate jobs do not run in most of the
> > vast number of actual configurations that most real clouds can use and so
> > it is hard to keep it working with all configurations. But we should
> > support these cases as best we can.
> >
> >  -David
> >
> >  Maybe, one day there will be a need for a program like this, but I'm
> > just not
> > seeing it here with Rally.
> >
> > > 3. Let Rally be a product on top of OpenStack
> > > The last option is to not have Rally in any program, and not consider it
> > > *essential* to the production of the "OpenStack" integrated release or
> > > the completion of the OpenStack project mission. Rally can happily exist
> > > as an operator tool on top of OpenStack. It is built as a monolithic
> > > product: that approach works very well for external complementary
> > > solutions... Also, being more integrated in OpenStack or part of the
> > > OpenStack programs might come at a cost (slicing some functionality out
> > > of rally to make it more a framework and less a product) that might not
> > > be what its authors want.
> >
> > Honestly, if the Rally team wants the project to remain in its current
> > form and scope then I agree that it belongs outside of OpenStack. It
> > definitely feels like a product to me, and there is nothing stopping them
> > from continuing to operate as they do now on top of OpenStack. I'm sorry,
> > but the fact that the docs in the rally tree have a section for user
> > testimonials [4] I feel speaks a lot about the intent of the project.
> >
> > >
> > > Let's explore each option to see which ones are viable, and the pros and
> > > cons of each.
> > >
> >
> > I apologize if any of this is somewhat incoherent, I'm still a bit
> > jet-lagged
> > so I'm not sure that I'm making much sense.
> >