[openstack-dev] [nova] 2 weeks in the bug tracker

Dan Prince dprince at redhat.com
Tue Sep 23 17:10:54 UTC 2014


On Fri, 2014-09-19 at 09:13 -0400, Sean Dague wrote:
> I've spent the better part of the last 2 weeks in the Nova bug tracker
> to try to turn it into something that doesn't cause people to run away
> screaming. I don't remember exactly what our open bug count was 2
> weeks ago (it was north of 1400, with > 200 bugs in New, but it might
> have been north of 1600), but as of this email we're at < 1000 open bugs
> (I'm counting Fix Committed as closed, even though LP does not), and ~0
> new bugs (depending on the time of the day).
> 
> == Philosophy in Triaging ==
> 
> I'm going to lay out the philosophy of triaging I've had, because this
> may also set the tone going forward.
> 
> A bug tracker is a tool to help us make a better release. It does not
> exist for its own good, it exists to help. Which means when evaluating
> what stays in and what leaves we need to evaluate if any particular
> artifact will help us make a better release. But also more importantly
> realize that there is a cost for carrying every artifact in the tracker.
> Resolving duplicates gets non-linearly harder as the number of artifacts
> goes up. Triaging gets non-linearly harder as the number of artifacts goes up.
> 
> With this I was being somewhat pragmatic about closing bugs. An old bug
> that is just a stacktrace is typically not useful. An old bug that is a
> vague sentence that we should refactor a particular module (with no
> specifics on the details) is not useful. A bug reported against a very
> old version of OpenStack where the code has changed a lot in the
> relevant area, and there aren't responses from the author, is not
> useful. Not useful bugs just add debt, and we should get rid of them.
> That raises the chance that a random bug pulled off the tracker is something
> you could actually look at fixing, instead of mostly just stalling out.
> 
> So I closed a lot of stuff as Invalid / Opinion that fell into those camps.
> 
> == Keeping New Bugs at close to 0 ==
> 
> After driving the bugs in the New state down to zero last week, I found
> it's actually pretty easy to keep it at 0.
> 
> We get 10 - 20 new bugs a day in Nova (during a weekday). Of those ~20%
> aren't actually a bug, and can be closed immediately. ~30% look like a
> bug, but don't have anywhere near enough information in them, and
> flipping them to Incomplete with questions quickly means we have a real
> chance of getting the right info. ~10% are fixable in < 30 minutes worth
> of work. And the rest are real bugs that seem to have enough to dive
> into, and can be triaged into Confirmed, given a priority, and tagged
> for the appropriate area.
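
To put rough numbers on the split quoted above, here is a small sketch of the
arithmetic. The ~40% share for "real, triageable bugs" is not stated in the
message; it is inferred as the remainder after the ~20% / ~30% / ~10% figures.
The function and key names are made up for illustration.

```python
# Approximate shares (in percent) of new Nova bugs per outcome, as quoted above.
SHARES_PCT = {
    "not_a_bug": 20,   # can be closed immediately
    "incomplete": 30,  # flip to Incomplete and ask questions
    "quick_fix": 10,   # fixable in under 30 minutes
}
# Assumption: everything left over is a real bug that can go to Confirmed.
SHARES_PCT["confirmed"] = 100 - sum(SHARES_PCT.values())


def expected_daily_counts(new_bugs_per_day):
    """Return the expected number of bugs per outcome for one day's intake."""
    return {
        outcome: new_bugs_per_day * pct / 100
        for outcome, pct in SHARES_PCT.items()
    }


if __name__ == "__main__":
    # The message quotes 10 - 20 new bugs per weekday.
    for intake in (10, 20):
        print(intake, expected_daily_counts(intake))
```

On these assumptions, even a 20-bug day leaves only about 8 bugs that need
full triage, which fits the claim that holding New at 0 is manageable.
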
> 
> But, more importantly, this means we can filter bug quality on the way
> in. And we can also encourage bug reporters that are giving us good
> stuff, or even easy stuff, as we respond quickly.
> 
> Recommendation #1: we adopt a 0 new bugs policy to keep this from
> getting away from us in the future.
> 
> == Our worst bug reporters are often core reviewers ==
> 
> I'm going to pick on Dan Prince here, mostly because I have a recent
> concrete example; however, after triaging the bug queue it's clear that
> much of the core team is to blame (including myself).
> 
> https://bugs.launchpad.net/nova/+bug/1368773 is a terrible bug. Also, it
> was set to Incomplete and got no response. I'm almost 100% sure it's a dupe of
> the multiprocess bug we've been tracking down but it's so terse that you
> can't get to the bottom of it.


This bug was filed as a result of a cryptic (to me at the time) gate
unit test failure that occurred in this review:

https://review.openstack.org/#/c/120099/

I mistakenly grabbed the last timeout error instead of looking at the
original timeout. Within 30 minutes or so of my post Matt Riedemann had
correctly classified it as https://bugs.launchpad.net/nova/+bug/1357578

I've added some extra data and marked it as a dup.

Dan


> 
> There were a ton of 2012 nova bugs that were basically "post it notes".
> Oh, "we should refactor this function". Full stop. While those are fine
> for personal tracking, their value goes to zero probably 3 months after
> they are filed, especially if the reporter stops working on the issue at
> hand. Nova has plenty of "wouldn't it be great if we... " ideas. I'm not
> convinced using bugs for those is useful unless we go and close them out
> aggressively if they stall.
> 
> Also, if Nova core can't file a good bug, it's hard to set the example
> for others in our community.
> 
> Recommendation #2: hey, Nova core, let's be better about filing the kinds
> of bugs we want to see! mkay!
> 
> Recommendation #3: Let's create a tag for "personal work items" or
> something for this class of TODOs people are leaving themselves, which
> would make them a ton easier to cull later when they stall and no one
> else has enough context to pick them up.
> 
> == Tags ==
> 
> The aggressive tagging that Tracy brought into the project has been
> awesome. It definitely helps slice out into better functional areas.
> Here is the top of our current official tag list (and bug count):
> 
> 95 compute
> 83 libvirt
> 74 api
> 68 vmware
> 67 network
> 41 db
> 40 testing
> 40 volumes
> 36 ec2
> 35 icehouse-backport-potential
> 32 low-hanging-fruit
> 31 xenserver
> 25 ironic
> 23 hyper-v
> 16 cells
> 14 scheduler
> 12 baremetal
> 9 ceph
> 9 security
> 8 oslo
> ...
> 
> So, good stuff. However I think we probably want to take a further step
> and attempt to get champions for tags. So that tag owners would ensure
> their bug list looks sane, and actually spend some time fixing them.
> It's pretty clear, for instance, that the ec2 bugs are just piling up,
> and very few fixes are coming in. Cells seems like it's in the same camp (a
> bunch of recent bugs have been cells related, it looks like a lot more
> deployments are trying it).
> 
> Probably the most important job for tag owners would be cleaning up the
> bugs in the tag. Realizing that 2 bugs were actually the same bug.
> Cleaning up descriptions / titles / etc so that people can move forward
> on them.
> 
> Recommendation #4: create tag champions
> 
> == Soft Spots ==
> 
> After looking at probably close to 1000 bugs in 2 weeks I have a
> particular impression of soft spots that we have.
> 
> Quotas are kind of a mess. It's not clear that we're even eventually
> consistent. There are a lot of bugs about creating servers, deleting
> servers, and leaking quota in the process. I know Jay and Sylvan are
> diving hard on the resource tracker right now, I think this should be a
> Kilo focus area because it creates terrible confusion and bugs for people.
> 
> EC2 has definitely regressed, especially after block device mapping
> changes, to the point that it's not clear it's functional outside of the
> most basic server create commands. The EC2 code is largely unchanged
> since 2012, and only lightly tested, we need to decide if this is
> important or not, and either fix it or delete it. There have been many
> past hands going up that said they would help, and then they never do
> (you know who you are).
> 
> The VM State machine model is .... Well it's at least suboptimal, but
> it's also clear that it's massively leaky, and the way we handle it
> internally means we end up in inconsistent wedges all the time. I expect
> the complexity here causes a ton of bugs. We need some refactoring to
> make things a ton more clear about what's supposed to be happening, and
> how to roll back when they go wrong. I think the Tasks work was headed
> down that path, but that seems stalled now.
> 
> Cross interaction with Neutron and Cinder remains racy. We are pretty
> optimistic on when resources will be available. Even the event interface
> with Neutron hasn't fully addressed this. I think a really great Design
> Summit session would be Nova + Neutron + Cinder to figure out a shared
> architecture to address this. I'd expect this to be at least a double
> session.
> 
> Recommendation #5 - 8: we should get on those things :)
> 
> == Triaging Inconsistencies ==
> 
> I found some inconsistencies in how people were triaging bugs, and the
> state inconsistencies probably make the bugs seem more confusing.
> https://wiki.openstack.org/wiki/BugTriage provides some guidance.
> 
> Importantly:
> 
> Incomplete is an Open state. For bugzilla folks this is NEEDSINFO. I saw
> a bunch of 'closing' comments but a move to Incomplete.
> 
> Triaged should be used if the solution to fix the bug is in the bug
> itself. Triaged is Confirmed + Solution in enough detail to fix it.
> 
> Incomplete bugs should not have assignees or milestones, otherwise they
> won't time out.
> 
> == General Cleanup Rules ==
> 
> Here are some general cleanup rules that I was using:
> 
> If an Incomplete bug has no response after 30 days it's fair game to
> close (Invalid, Opinion, Won't Fix).
> 
> If a bug is In Progress with no patch posted after 30 days, it is not In
> Progress. Remove assignee, move back to last state (probably confirmed).
> Move to Opinion if it's really a "post it note".
> 
> If a bug is In Progress but the patches were abandoned, it's no longer
> In Progress. Remove assignee, move back to last state (probably
> confirmed). Move to Opinion if it's really a "post it note".
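
The three cleanup rules quoted above could be written down as a small decision
function. This is only a sketch to make the rules concrete, not anything that
exists in Nova or Launchpad: the `Bug` record, its field names, and the
returned action strings are all hypothetical, and "after 30 days" is read as
"more than 30 days since the last activity".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

STALE_AFTER = timedelta(days=30)  # threshold taken from the quoted rules


@dataclass
class Bug:
    """Hypothetical bug record; not the Launchpad data model."""
    status: str                      # e.g. "Incomplete", "In Progress"
    last_activity: datetime          # last reporter response or patch activity
    has_patch: bool = False          # a patch was posted for review
    patches_abandoned: bool = False  # every posted patch was abandoned
    is_post_it_note: bool = False    # really a personal TODO, not a bug report


def cleanup_action(bug: Bug, now: datetime) -> Optional[str]:
    """Return the suggested cleanup action, or None to leave the bug alone."""
    stale = now - bug.last_activity > STALE_AFTER

    # Rule 1: Incomplete with no response after 30 days is fair game to close.
    if bug.status == "Incomplete" and stale:
        return "close (Invalid, Opinion, or Won't Fix)"

    if bug.status == "In Progress":
        # Rule 2: no patch posted after 30 days means it is not In Progress.
        stalled_without_patch = not bug.has_patch and stale
        # Rule 3: abandoned patches also mean it is no longer In Progress.
        if stalled_without_patch or bug.patches_abandoned:
            if bug.is_post_it_note:
                return "move to Opinion"
            return "remove assignee, move back to last state"

    return None


if __name__ == "__main__":
    now = datetime(2014, 9, 23)
    old = now - timedelta(days=45)
    print(cleanup_action(Bug("Incomplete", old), now))
    print(cleanup_action(Bug("In Progress", old, is_post_it_note=True), now))
```

Whether to close as Invalid, Opinion, or Won't Fix is still a human call; the
sketch only flags which bugs are candidates.
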
> 
> == Rescuing Stalled Fixes ==
> 
> Over the course of this I found a bunch of the In Progress bugs were
> real issues, with real fixes, that had stalled out for one of a number
> of reasons. Often it had a -1 'needs unit tests' on it, and it's sort of
> clear the author didn't really know how to do that for this patch. Other
> times the author's first language was not English, and the patch commit
> message was confusing enough that no one understood what it was fixing.
> (One of these bugs I restored, rewrote the commit message, and then it
> sailed through the process.)
> 
> Recommendation #9: if you are going to -1 for unit tests, please go the
> extra step of saying 'I think you should write a test that does X, Y, Z'.
> 
> Recommendation #10: We need to find a better balance in rewriting commit
> messages. Maybe we should just make it socially acceptable to rewrite
> the commit message as part of review.
> 
> ....
> 
> I'm sure there are other thoughts, but my brain is running out of steam.
> These were the things that popped to the top of my head. It's definitely
> been really interesting to spend this much time with the tracker to
> build a bigger picture of this feedback channel we have from our users.
> Hopefully other folks found some of this handy.
> 
> 	-Sean
> 




