Open Stack

Fri Sep 19 20:42:00 UTC 2014

On Fri, Sep 19, 2014 at 11:13 PM, Sean Dague <sean at dague.net> wrote:
> I've spent the better part of the last 2 weeks in the Nova bug tracker
> to try to turn it into something that doesn't cause people to run away
> screaming. I don't remember exactly where we started at open bug count 2
> weeks ago (it was north of 1400, with > 200 bugs in new, but it might
> have been north of 1600), but as of this email we're at < 1000 open bugs
> (I'm counting Fix Committed as closed, even though LP does not), and ~0
> new bugs (depending on the time of the day).
>
> == Philosophy in Triaging ==
>
> I'm going to lay out the philosophy of triaging I've had, because this
> may also set the tone going forward.
>
> A bug tracker is a tool to help us make a better release. It does not
> exist for its own good, it exists to help. Which means when evaluating
> what stays in and what leaves we need to evaluate if any particular
> artifact will help us make a better release. But also more importantly
> realize that there is a cost for carrying every artifact in the tracker.
> Resolving duplicates gets non-linearly harder as the number of artifacts
> goes up. Triaging gets non-linearly harder as the number of artifacts goes up.
>
> With this I was being somewhat pragmatic about closing bugs. An old bug
> that is just a stacktrace is typically not useful. An old bug that is a
> vague sentence that we should refactor a particular module (with no
> specifics on the details) is not useful. A bug reported against a very
> old version of OpenStack where the code has changed a lot in the
> relevant area, and there aren't responses from the author, is not
> useful. Not useful bugs just add debt, and we should get rid of them.
> That raises the chance that a random bug pulled off the tracker is something
> you could actually look at fixing, instead of mostly just stalling out.
>
> So I closed a lot of stuff as Invalid / Opinion that fell into those camps.
>
> == Keeping New Bugs at close to 0 ==
>
> After driving the bugs in the New state down to zero last week, I found
> it's actually pretty easy to keep it at 0.
>
> We get 10 - 20 new bugs a day in Nova (during a weekday). Of those ~20%
> aren't actually a bug, and can be closed immediately. ~30% look like a
> bug, but don't have anywhere near enough information in them, and
> flipping them to incomplete with questions quickly means we have a real
> chance of getting the right info. ~10% are fixable in < 30 minutes worth
> of work. And the rest are real bugs that seem to have enough to dive
> into, and can be triaged into Confirmed, given a priority, and tagged
> appropriately for the area.

On the bugs which would take less than 30 minutes, is that because
they're not bugs, or are they just trivial? It would be cool to be
adding the low-hanging-fruit tag to those bugs if you're not, because
we should just fix them.
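
As a rough worked example of the triage breakdown quoted above: the three percentages come from Sean's email, and the "confirmed" share is simply the remainder (about 40%), which is derived here rather than stated in the email. The function and key names are made up for illustration.

```python
# Approximate shares (in percent) quoted in the email.
SHARES_PCT = {
    "not_a_bug": 20,    # closed immediately
    "incomplete": 30,   # flipped to Incomplete with questions
    "quick_fix": 10,    # fixable in under 30 minutes
}
# The rest are real bugs triaged into Confirmed (derived remainder).
SHARES_PCT["confirmed"] = 100 - sum(SHARES_PCT.values())


def daily_breakdown(new_bugs_per_day):
    """Expected number of bugs per triage outcome for a given daily intake."""
    return {outcome: new_bugs_per_day * pct / 100
            for outcome, pct in SHARES_PCT.items()}


# The email gives 10 - 20 new bugs per weekday.
for intake in (10, 20):
    print(intake, daily_breakdown(intake))
```

On those numbers, only around 4 to 8 bugs a day would need full triage into Confirmed, and 1 to 2 a day would be candidates for a quick fix.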

> But, more importantly, this means we can filter bug quality on the way
> in. And we can also encourage bug reporters that are giving us good
> stuff, or even easy stuff, as we respond quickly.
>
> Recommendation #1: we adopt a 0 new bugs policy to keep this from
> getting away from us in the future.

Agreed, this was a goal we used to have back in the day and I'd like
to bring it back.

> == Our worst bug reporters are often core reviewers ==
>
> I'm going to pick on Dan Prince here, mostly because I have a recent
> concrete example, however in triaging the bug queue much of the core
> team is to blame (including myself).
>
> https://bugs.launchpad.net/nova/+bug/1368773 is a terrible bug. Also, it
> was set to Incomplete and got no response. I'm almost 100% sure it's a dupe of
> the multiprocess bug we've been tracking down but it's so terse that you
> can't get to the bottom of it.
>
> There were a ton of 2012 nova bugs that were basically "post it notes".
> Oh, "we should refactor this function". Full stop. While those are fine
> for personal tracking, their value goes to zero probably 3 months after
> they are filed, especially if the reporter stops working on the issue at
> hand. Nova has plenty of "wouldn't it be great if we... " ideas. I'm not
> convinced using bugs for those is useful unless we go and close them out
> aggressively if they stall.
>
> Also, if Nova core can't file a good bug, it's hard to set the example
> for others in our community.
>
> Recommendation #2: hey, Nova core, let's be better about filing the kinds
> of bugs we want to see! mkay!
>
> Recommendation #3: Let's create a tag for "personal work items" or
> something for this class of TODOs people are leaving themselves, which
> makes them a ton easier to cull later when they stall and no one else has
> enough context to pick them up.

I think we also get a lot of bugs filed almost immediately before a
fix. Sort of like a tracking mechanism for micro-features. Do we want
to continue doing that, or do we want to just let smallish things land
without a bug?

> == Tags ==
>
> The aggressive tagging that Tracy brought into the project has been
> awesome. It definitely helps slice out into better functional areas.
> Here is the top of our current official tag list (and bug count):
>
> 95 compute
> 83 libvirt
> 74 api
> 68 vmware
> 67 network
> 41 db
> 40 testing
> 40 volumes
> 36 ec2
> 35 icehouse-backport-potential
> 32 low-hanging-fruit
> 31 xenserver
> 25 ironic
> 23 hyper-v
> 16 cells
> 14 scheduler
> 12 baremetal
> 9 ceph
> 9 security
> 8 oslo
> ...
>
> So, good stuff. However I think we probably want to take a further step
> and attempt to get champions for tags. So that tag owners would ensure
> their bug list looks sane, and actually spend some time fixing them.
> It's pretty clear, for instance, that the ec2 bugs are just piling up,
> with very few fixes coming in. Cells seems like it's in the same camp (a
> bunch of recent bugs have been cells related, it looks like a lot more
> deployments are trying it).
>
> Probably the most important thing in tag owners would be cleaning up the
> bugs in the tag. Realizing that 2 bugs were actually the same bug.
> Cleaning up descriptions / titles / etc so that people can move forward
> on them.
>
> Recommendation #4: create tag champions

Tracy has already tried to do this IIRC, but I agree we should chase
down people to "own" each of these tags. Those people are probably
best positioned to help with working through possible solutions -- for
example I might be able to tell a bug is vmware related, and I might
be able to tell it's relatively minor, but that doesn't mean I know
what the right fix is in all cases. The vmware bug tag owner probably
does though.

> == Soft Spots ==
>
> After looking at probably close to 1000 bugs in 2 weeks I have a
> particular impression of soft spots that we have.
>
> Quotas are kind of a mess. It's not clear that we're even eventually
> consistent. There are a lot of bugs about creating servers, deleting
> servers, and leaking quota in the process. I know Jay and Sylvan are
> diving hard on the resource tracker right now, I think this should be a
> Kilo focus area because it creates terrible confusion and bugs for people.
>
> EC2 has definitely regressed, especially after block device mapping
> changes, to the point that it's not clear it's functional outside of the
> most basic server create commands. The EC2 code is largely unchanged
> since 2012, and only lightly tested, we need to decide if this is
> important or not, and either fix it or delete it. There have been many
> past hands going up that said they would help, and then they never do
> (you know who you are).
>
> The VM State machine model is .... Well it's at least suboptimal, but
> it's also clear that it's massively leaky, and the way we handle it
> internally means we end up in inconsistent wedges all the time. I expect
> the complexity here causes a ton of bugs. We need some refactoring to
> make things a ton more clear about what's supposed to be happening, and
> how to rollback when they go wrong. I think the Tasks work was headed
> down that path, but that seems stalled now.
>
> Cross interaction with Neutron and Cinder remains racy. We are pretty
> optimistic on when resources will be available. Even the event interface
> with Neutron hasn't fully addressed this. I think a really great Design
> Summit session would be Nova + Neutron + Cinder to figure out a shared
> architecture to address this. I'd expect this to be at least a double
> session.

Wanna put that on our ideas etherpad please?

https://etherpad.openstack.org/p/kilo-nova-summit-topics

> Recommendation #5 - 8: we should get on those things :)
>
> == Triaging Inconsistencies ==
>
> I found some inconsistencies in how people were triaging bugs, and the
> state inconsistencies probably contribute to making the bugs seem
> confusing: https://wiki.openstack.org/wiki/BugTriage provides some
> guidance.
>
> Importantly:
>
> Incomplete is an Open state. For bugzilla folks this is NEEDSINFO. I saw
> a bunch of 'closing' comments but a move to Incomplete.
>
> Triaged should be used if the solution to fix the bug is in the bug
> itself. Triaged is Confirmed + Solution in enough detail to fix it.
>
> Incomplete bugs should not have assignees or milestones, otherwise they
> won't time out.
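
The state rules quoted above could be checked mechanically. The following is only a minimal sketch under stated assumptions: the `Bug` class and its fields (`assignee`, `milestone`, `has_solution`) are hypothetical stand-ins, not the Launchpad API, and deciding whether a bug "contains a solution" would still need a human.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bug:
    """Hypothetical, simplified view of a bug; not a Launchpad API object."""
    status: str
    assignee: Optional[str] = None
    milestone: Optional[str] = None
    has_solution: bool = False  # assumption: set by a human triager


def triage_warnings(bug: Bug) -> List[str]:
    """Return warnings for states that contradict the quoted triage rules."""
    warnings = []
    if bug.status == "Incomplete":
        # Incomplete is an open state and only times out when unowned.
        if bug.assignee is not None:
            warnings.append("Incomplete bug has an assignee; it won't time out")
        if bug.milestone is not None:
            warnings.append("Incomplete bug has a milestone; it won't time out")
    if bug.status == "Triaged" and not bug.has_solution:
        # Triaged means Confirmed plus a solution described in the bug.
        warnings.append("Triaged bug has no solution in it; use Confirmed")
    return warnings


print(triage_warnings(Bug(status="Incomplete", assignee="someone")))
```

A clean bug returns an empty list, so this could run over a whole tag's bug list and report only the inconsistent ones.
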
>
> == General Cleanup Rules ==
>
> Here are some general cleanup rules that I was using:
>
> If an Incomplete bug has no response after 30 days it's fair game to
> close (Invalid, Opinion, Won't Fix).
>
> If a bug is In Progress with no patch posted after 30 days, it is not In
> Progress. Remove assignee, move back to last state (probably confirmed).
> Move to Opinion if it's really a "post it note".
>
> If a bug is In Progress but the patches were abandoned, it's no longer
> In Progress. Remove assignee, move back to last state (probably
> confirmed). Move to Opinion if it's really a "post it note".
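
The three cleanup rules quoted above can be sketched as one function. This is an illustration, not a tool Sean describes using: the `Bug` fields, `suggest_cleanup`, and the way "no response" is measured (`last_response`) are all assumptions, and "after 30 days" is read as strictly more than 30 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

STALE_AFTER = timedelta(days=30)


@dataclass
class Bug:
    """Hypothetical, simplified bug record; not a Launchpad API object."""
    status: str
    last_response: date              # assumption: last reporter/assignee activity
    has_patch: bool = False          # a patch was posted for review
    patches_abandoned: bool = False  # all posted patches were abandoned
    is_post_it_note: bool = False    # a personal TODO with no real detail
    previous_status: str = "Confirmed"


def suggest_cleanup(bug: Bug, today: date) -> Optional[str]:
    """Return a suggested cleanup action, or None if the bug looks fine."""
    stale = today - bug.last_response > STALE_AFTER
    if bug.status == "Incomplete" and stale:
        return "close as Invalid, Opinion, or Won't Fix"
    if bug.status == "In Progress":
        stalled = (not bug.has_patch and stale) or bug.patches_abandoned
        if stalled:
            target = "Opinion" if bug.is_post_it_note else bug.previous_status
            return f"remove assignee, move to {target}"
    return None


today = date(2014, 9, 19)
print(suggest_cleanup(Bug("Incomplete", date(2014, 8, 1)), today))
print(suggest_cleanup(Bug("In Progress", date(2014, 9, 10), has_patch=True), today))
```

The function only suggests an action; choosing between Invalid, Opinion, and Won't Fix is left to whoever is triaging.
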
>
> == Rescuing Stalled Fixes ==
>
> Over the course of this I found a bunch of the In Progress bugs were
> real issues, with real fixes, that had stalled out for one of a number
> of reasons. Often it had a -1 'needs unit tests' on it, and it's sort of
> clear the author didn't really know how to do that for this patch. Other
> times the author's first language was not English, and the patch commit
> message was confusing enough that no one understood what it was fixing.
> (One of these bugs I restored, rewrote the commit message, and then it
> sailed through the process.)
>
> Recommendation #9: if you are going to -1 for unit tests, please go the
> extra step of saying 'I think you should write a test that does X, Y, Z'.
>
> Recommendation #10: We need to find a better balance in rewriting commit
> messages. Maybe we should just make it socially acceptable to rewrite
> the commit message as part of review.

I sort of thought we already had. Certainly we've been talking about
it being ok for a while now. Do we need a formal written policy or
something?

> ....
>
> I'm sure there are other thoughts, but my brain is running out of steam.
> These were the things that popped to the top of my head. It's definitely
> been really interesting to spend this much time with the tracker to
> build a bigger picture of this feedback channel we have from our users.
> Hopefully other folks found some of this handy.

Thanks heaps for doing this work.

Michael

-- 
Rackspace Australia


[openstack-dev] [nova] 2 weeks in the bug tracker
