[openstack-dev] [nova] 2 weeks in the bug tracker

Matt Riedemann mriedem at linux.vnet.ibm.com
Mon Sep 22 00:37:59 UTC 2014



On 9/19/2014 8:13 AM, Sean Dague wrote:
> I've spent the better part of the last 2 weeks in the Nova bug tracker
> to try to turn it into something that doesn't cause people to run away
> screaming. I don't remember exactly what our open bug count was 2
> weeks ago (it was north of 1400, with > 200 bugs in New, but it might
> have been north of 1600), but as of this email we're at < 1000 open bugs
> (I'm counting Fix Committed as closed, even though LP does not), and ~0
> new bugs (depending on the time of the day).
>
> == Philosophy in Triaging ==
>
> I'm going to lay out the philosophy of triaging I've had, because this
> may also set the tone going forward.
>
> A bug tracker is a tool to help us make a better release. It does not
> exist for its own good, it exists to help. Which means when evaluating
> what stays in and what leaves we need to evaluate if any particular
> artifact will help us make a better release. But also more importantly
> realize that there is a cost for carrying every artifact in the tracker.
> Resolving duplicates gets non-linearly harder as the number of artifacts
> goes up. Triaging gets non-linearly harder as the number of artifacts goes up.
>
> With this I was being somewhat pragmatic about closing bugs. An old bug
> that is just a stacktrace is typically not useful. An old bug that is a
> vague sentence that we should refactor a particular module (with no
> specifics on the details) is not useful. A bug reported against a very
> old version of OpenStack where the code has changed a lot in the
> relevant area, and there aren't responses from the author, is not
> useful. Not useful bugs just add debt, and we should get rid of them.
> That raises the chance that a random bug pulled off the tracker is something
> that you could actually look at fixing, instead of mostly just stalling out.
>
> So I closed a lot of stuff as Invalid / Opinion that fell into those camps.
>
> == Keeping New Bugs at close to 0 ==
>
> After driving the bugs in the New state down to zero last week, I found
> it's actually pretty easy to keep it at 0.
>
> We get 10 - 20 new bugs a day in Nova (during a weekday). Of those ~20%
> aren't actually a bug, and can be closed immediately. ~30% look like a
> bug, but don't have anywhere near enough information in them, and
> flipping them to incomplete with questions quickly means we have a real
> chance of getting the right info. ~10% are fixable in < 30 minutes worth
> of work. And the rest are real bugs that seem to have enough detail to
> dive into, and can be triaged into Confirmed, given a priority, and
> tagged for the appropriate area.
>
> But, more importantly, this means we can filter bug quality on the way
> in. And we can also encourage bug reporters that are giving us good
> stuff, or even easy stuff, as we respond quickly.
>
> Recommendation #1: we adopt a 0 new bugs policy to keep this from
> getting away from us in the future.
>
> == Our worst bug reporters are often core reviewers ==
>
> I'm going to pick on Dan Prince here, mostly because I have a recent
> concrete example, however in triaging the bug queue much of the core
> team is to blame (including myself).
>
> https://bugs.launchpad.net/nova/+bug/1368773 is a terrible bug. Also, it
> was set to Incomplete and got no response. I'm almost 100% sure it's a dupe of
> the multiprocess bug we've been tracking down but it's so terse that you
> can't get to the bottom of it.
>
> There were a ton of 2012 nova bugs that were basically "post it notes".
> Oh, "we should refactor this function". Full stop. While those are fine
> for personal tracking, their value goes to zero probably 3 months after
> they are filed, especially if the reporter stops working on the issue at
> hand. Nova has plenty of "wouldn't it be great if we... " ideas. I'm not
> convinced using bugs for those is useful unless we go and close them out
> aggressively if they stall.
>
> Also, if Nova core can't file a good bug, it's hard to set the example
> for others in our community.
>
> Recommendation #2: hey, Nova core, let's be better about filing the kinds
> of bugs we want to see! mkay!
>
> Recommendation #3: Let's create a tag for "personal work items" or
> something for this class of TODOs people are leaving themselves, which
> makes them a ton easier to cull later when they stall and no one else has
> enough context to pick them up.
>
> == Tags ==
>
> The aggressive tagging that Tracy brought into the project has been
> awesome. It definitely helps slice out into better functional areas.
> Here is the top of our current official tag list (and bug count):
>
> 95 compute
> 83 libvirt
> 74 api
> 68 vmware
> 67 network
> 41 db
> 40 testing
> 40 volumes
> 36 ec2
> 35 icehouse-backport-potential
> 32 low-hanging-fruit
> 31 xenserver
> 25 ironic
> 23 hyper-v
> 16 cells
> 14 scheduler
> 12 baremetal
> 9 ceph
> 9 security
> 8 oslo
> ...
>
> So, good stuff. However I think we probably want to take a further step
> and attempt to get champions for tags. So that tag owners would ensure
> their bug list looks sane, and actually spend some time fixing them.
> It's pretty clear, for instance, that the ec2 bugs are just piling up,
> with very few fixes coming in. Cells seems like it's in the same camp (a
> bunch of recent bugs have been cells related, it looks like a lot more
> deployments are trying it).
>
> Probably the most important job for tag owners would be cleaning up the
> bugs in the tag. Realizing that 2 bugs were actually the same bug.
> Cleaning up descriptions / titles / etc so that people can move forward
> on them.
>
> Recommendation #4: create tag champions
>
> == Soft Spots ==
>
> After looking at probably close to 1000 bugs in 2 weeks I have a
> particular impression of soft spots that we have.
>
> Quotas are kind of a mess. It's not clear that we're even eventually
> consistent. There are a lot of bugs about creating servers, deleting
> servers, and leaking quota in the process. I know Jay and Sylvan are
> diving hard on the resource tracker right now. I think this should be a
> Kilo focus area because it creates terrible confusion and bugs for people.
>
> EC2 has definitely regressed, especially after block device mapping
> changes, to the point that it's not clear it's functional outside of the
> most basic server create commands. The EC2 code is largely unchanged
> since 2012 and only lightly tested. We need to decide if this is
> important or not, and either fix it or delete it. There have been many
> past hands going up that said they would help, and then they never do
> (you know who you are).
>
> The VM State machine model is .... Well it's at least suboptimal, but
> it's also clear that it's massively leaky, and the way we handle it
> internally means we end up in inconsistent wedges all the time. I expect
> the complexity here causes a ton of bugs. We need some refactoring to
> make things a ton more clear about what's supposed to be happening, and
> how to rollback when they go wrong. I think the Tasks work was headed
> down that path, but that seems stalled now.
>
> Cross interaction with Neutron and Cinder remains racy. We are pretty
> optimistic on when resources will be available. Even the event interface
> with Neutron hasn't fully addressed this. I think a really great Design
> Summit session would be Nova + Neutron + Cinder to figure out a shared
> architecture to address this. I'd expect this to be at least a double
> session.
>
> Recommendation #5 - 8: we should get on those things :)
>
> == Triaging Inconsistencies ==
>
> I found some inconsistencies in how people were triaging bugs, and the
> state inconsistencies probably contribute to the bugs seeming
> confusing. https://wiki.openstack.org/wiki/BugTriage provides some
> guidance.
>
> Importantly:
>
> Incomplete is an Open state. For bugzilla folks this is NEEDSINFO. I saw
> a bunch of 'closing' comments paired with a move to Incomplete, which
> does not close the bug.
>
> Triaged should be used if the solution to fix the bug is in the bug
> itself. Triaged is Confirmed + Solution in enough detail to fix it.
>
> Incomplete bugs should not have assignees or milestones, otherwise they
> won't time out.
>
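
The Incomplete rule above lends itself to a quick automated check. A
minimal sketch in Python, assuming bugs have already been exported into
plain records; the Bug class and its field names are hypothetical and
are not the Launchpad API:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bug:
    """Hypothetical exported bug record; not a Launchpad API object."""
    id: int
    status: str
    assignee: Optional[str] = None
    milestone: Optional[str] = None


def triage_problems(bug: Bug) -> List[str]:
    """Return the reasons an Incomplete bug will not time out."""
    problems = []
    if bug.status == "Incomplete":
        if bug.assignee is not None:
            problems.append("Incomplete bug has an assignee")
        if bug.milestone is not None:
            problems.append("Incomplete bug has a milestone")
    return problems
```

An empty list means the bug's state is consistent with the rule; any
other status is ignored by this check.
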
> == General Cleanup Rules ==
>
> Here are some general cleanup rules that I was using:
>
> If an Incomplete bug has no response after 30 days it's fair game to
> close (Invalid, Opinion, Won't Fix).
>
> If a bug is In Progress with no patch posted after 30 days, it is not In
> Progress. Remove assignee, move back to last state (probably confirmed).
> Move to Opinion if it's really a "post it note".
>
> If a bug is In Progress but the patches were abandoned, it's no longer
> In Progress. Remove assignee, move back to last state (probably
> confirmed). Move to Opinion if it's really a "post it note".
>
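
The three cleanup rules above could be scripted roughly as follows. This
is only a sketch: the BugActivity record, its fields, and the action
names are assumptions made for illustration, and choosing between
Invalid, Opinion, and Won't Fix (or spotting a "post it note") is still
left to a human.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

STALE_AFTER = timedelta(days=30)  # the 30-day threshold from the rules above


@dataclass
class BugActivity:
    """Hypothetical summary of a bug's state; not a Launchpad API object."""
    status: str
    last_activity: datetime  # last response, or when the status was set
    has_patch: bool = False
    patch_abandoned: bool = False


def cleanup_action(bug: BugActivity, now: datetime) -> Optional[str]:
    """Suggest a cleanup action for a bug, or None to leave it alone."""
    stale = now - bug.last_activity > STALE_AFTER
    if bug.status == "Incomplete" and stale:
        return "close"  # as Invalid, Opinion, or Won't Fix
    if bug.status == "In Progress":
        if bug.has_patch and bug.patch_abandoned:
            return "unassign"  # move back to the previous state
        if not bug.has_patch and stale:
            return "unassign"
    return None
```

For example, an Incomplete bug with no activity for 31 days yields
"close", while an In Progress bug with a live patch yields None.
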
> == Rescuing Stalled Fixes ==
>
> Over the course of this I found a bunch of the In Progress bugs were
> real issues, with real fixes, that had stalled out for one of a number
> of reasons. Often it had a -1 'needs unit tests' on it, and it's sort of
> clear the author didn't really know how to do that for this patch. Other
> times the author's first language was not English, and the patch commit
> message was confusing enough that no one understood what it was fixing.
> (One of these bugs I restored, rewrote the commit message, and then it
> sailed through the process.)
>
> Recommendation #9: if you are going to -1 for unit tests, please go the
> extra step of saying 'I think you should write a test that does X, Y, Z'.
>
> Recommendation #10: We need to find a better balance in rewriting commit
> messages. Maybe we should just make it socially acceptable to rewrite
> the commit message as part of review.

When I'm essentially +2 on a change but for a small issue like typos in 
the commit message, the need for a note in the code or a test (or change 
to a test), I've been doing those myself lately and then will give the 
+2.  If the change already has a +2 and I'd be +W but for said things, 
I'm more inclined lately to approve and then push a dependent patch on 
top of it with the changes to keep things from stalling.

This might be a change in my workflow just because we're late in the
release and want good bug fixes getting into the release candidates, or
it could be because of the weekly tirade about how the project is going
down the toilet and we don't get enough things reviewed/approved. I'm
not sure, but my point is that I agree with making it socially
acceptable to rewrite the commit message as part of the review.

>
> ....
>
> I'm sure there are other thoughts, but my brain is running out of steam.
> These were the things that popped to the top of my head. It's definitely
> been really interesting to spend this much time with the tracker to
> build a bigger picture of this feedback channel we have from our users.
> Hopefully other folks found some of this handy.
>
> 	-Sean
>

Agree with everything else said here.  It's also helpful that you're 
directly pinging people in IRC for action on things, e.g. "what's up 
with this bug (that you opened)?" or pointing out things that are ready 
for approval (I've been doing this more lately in IRC on what I consider 
trivial reviews that I've already +2'ed).

-- 

Thanks,

Matt Riedemann



