[openstack-dev] [nova] 2 weeks in the bug tracker

Jay S. Bryant jsbryant at electronicjungle.net
Mon Sep 22 14:51:16 UTC 2014


On 09/21/2014 07:37 PM, Matt Riedemann wrote:
>
>
> On 9/19/2014 8:13 AM, Sean Dague wrote:
>> I've spent the better part of the last 2 weeks in the Nova bug tracker
>> to try to turn it into something that doesn't cause people to run away
>> screaming. I don't remember exactly what the open bug count was 2 weeks
>> ago (it was north of 1400, with > 200 bugs in New, but it might have
>> been north of 1600), but as of this email we're at < 1000 open bugs
>> (I'm counting Fix Committed as closed, even though LP does not), and ~0
>> new bugs (depending on the time of the day).
>>
>> == Philosophy in Triaging ==
>>
>> I'm going to lay out the philosophy of triaging I've had, because this
>> may also set the tone going forward.
>>
>> A bug tracker is a tool to help us make a better release. It does not
>> exist for its own sake; it exists to help. So when deciding what stays
>> in and what leaves, we need to ask whether any particular artifact will
>> help us make a better release. More importantly, we need to realize
>> that there is a cost for carrying every artifact in the tracker.
>> Resolving duplicates gets nonlinearly harder as the number of artifacts
>> goes up. Triaging gets nonlinearly harder as well.
>>
>> With this I was being somewhat pragmatic about closing bugs. An old bug
>> that is just a stacktrace is typically not useful. An old bug that is a
>> vague sentence that we should refactor a particular module (with no
>> specifics on the details) is not useful. A bug reported against a very
>> old version of OpenStack where the code has changed a lot in the
>> relevant area, and there aren't responses from the author, is not
>> useful. Bugs that aren't useful just add debt, and we should get rid of
>> them. That raises the chance that a random bug pulled off the tracker is
>> something you could actually look at fixing, instead of mostly just
>> stalling out.
>>
>> So I closed a lot of stuff as Invalid / Opinion that fell into those
>> camps.
>>
>> == Keeping New Bugs at close to 0 ==
>>
>> After driving the bugs in the New state down to zero last week, I found
>> it's actually pretty easy to keep it at 0.
>>
>> We get 10 - 20 new bugs a day in Nova (during a weekday). Of those ~20%
>> aren't actually a bug, and can be closed immediately. ~30% look like a
>> bug, but don't have anywhere near enough information in them, and
>> flipping them to Incomplete with questions quickly means we have a real
>> chance of getting the right info. ~10% are fixable in < 30 minutes worth
>> of work. And the rest are real bugs that seem to have enough in them to
>> dive into, and can be triaged: set to Confirmed, given a priority, and
>> tagged for the appropriate area.
>>
>> But, more importantly, this means we can filter bug quality on the way
>> in. And we can also encourage bug reporters that are giving us good
>> stuff, or even easy stuff, as we respond quickly.
>>
>> Recommendation #1: we adopt a 0 new bugs policy to keep this from
>> getting away from us in the future.
>>
>> == Our worst bug reporters are often core reviewers ==
>>
>> I'm going to pick on Dan Prince here, mostly because I have a recent
>> concrete example, however in triaging the bug queue much of the core
>> team is to blame (including myself).
>>
>> https://bugs.launchpad.net/nova/+bug/1368773 is a terrible bug. Also, it
>> was set to Incomplete and got no response. I'm almost 100% sure it's a
>> dupe of the multiprocess bug we've been tracking down, but it's so terse
>> that you can't get to the bottom of it.
>>
>> There were a ton of 2012 nova bugs that were basically "post it notes".
>> Oh, "we should refactor this function". Full stop. While those are fine
>> for personal tracking, their value goes to zero probably 3 months after
>> they are filed, especially if the reporter stops working on the issue at
>> hand. Nova has plenty of "wouldn't it be great if we... " ideas. I'm not
>> convinced using bugs for those is useful unless we go and close them out
>> aggressively if they stall.
>>
>> Also, if Nova core can't file a good bug, it's hard to set the example
>> for others in our community.
>>
>> Recommendation #2: hey, Nova core, let's be better about filing the
>> kinds of bugs we want to see! mkay!
>>
>> Recommendation #3: Let's create a tag for "personal work items" or
>> something for this class of TODOs people are leaving themselves, which
>> would make them a ton easier to cull later when they stall and no one
>> else has enough context to pick them up.
>>
>> == Tags ==
>>
>> The aggressive tagging that Tracy brought into the project has been
>> awesome. It definitely helps slice out into better functional areas.
>> Here is the top of our current official tag list (and bug count):
>>
>> 95 compute
>> 83 libvirt
>> 74 api
>> 68 vmware
>> 67 network
>> 41 db
>> 40 testing
>> 40 volumes
>> 36 ec2
>> 35 icehouse-backport-potential
>> 32 low-hanging-fruit
>> 31 xenserver
>> 25 ironic
>> 23 hyper-v
>> 16 cells
>> 14 scheduler
>> 12 baremetal
>> 9 ceph
>> 9 security
>> 8 oslo
>> ...
>>
>> So, good stuff. However I think we probably want to take a further step
>> and attempt to get champions for tags. So that tag owners would ensure
>> their bug list looks sane, and actually spend some time fixing them.
>> It's pretty clear, for instance, that the ec2 bugs are just piling up,
>> with very few fixes coming in. Cells seems like it's in the same camp (a
>> bunch of recent bugs have been cells related; it looks like a lot more
>> deployments are trying it).
>>
>> Probably the most important job for tag owners would be cleaning up the
>> bugs in the tag: realizing that 2 bugs were actually the same bug, and
>> cleaning up descriptions / titles / etc so that people can move forward
>> on them.
>>
>> Recommendation #4: create tag champions
>>
>> == Soft Spots ==
>>
>> After looking at probably close to 1000 bugs in 2 weeks I have a
>> particular impression of soft spots that we have.
>>
>> Quotas are kind of a mess. It's not clear that we're even eventually
>> consistent. There are a lot of bugs about creating servers, deleting
>> servers, and leaking quota in the process. I know Jay and Sylvan are
>> diving hard on the resource tracker right now. I think this should be a
>> Kilo focus area because it creates terrible confusion and bugs for
>> people.
>>
>> EC2 has definitely regressed, especially after block device mapping
>> changes, to the point that it's not clear it's functional outside of the
>> most basic server create commands. The EC2 code is largely unchanged
>> since 2012, and only lightly tested. We need to decide if this is
>> important or not, and either fix it or delete it. Many hands have gone
>> up in the past offering to help, and then they never do (you know who
>> you are).
>>
>> The VM State machine model is .... Well it's at least suboptimal, but
>> it's also clear that it's massively leaky, and the way we handle it
>> internally means we end up in inconsistent wedges all the time. I expect
>> the complexity here causes a ton of bugs. We need some refactoring to
>> make things a ton more clear about what's supposed to be happening, and
>> how to roll back when they go wrong. I think the Tasks work was headed
>> down that path, but that seems stalled now.
>>
>> Cross interaction with Neutron and Cinder remains racy. We are pretty
>> optimistic on when resources will be available. Even the event interface
>> with Neutron hasn't fully addressed this. I think a really great Design
>> Summit session would be Nova + Neutron + Cinder to figure out a shared
>> architecture to address this. I'd expect this to be at least a double
>> session.
>>
>> Recommendation #5 - 8: we should get on those things :)
>>
>> == Triaging Inconsistencies ==
>>
>> I found some inconsistencies in how people were triaging bugs, and the
>> state inconsistencies probably make the bugs seem more confusing.
>> https://wiki.openstack.org/wiki/BugTriage provides some guidance.
>>
>> Importantly:
>>
>> Incomplete is an Open state. For bugzilla folks this is NEEDSINFO. I saw
>> a bunch of 'closing' comments paired with a move to Incomplete, which
>> leaves the bug open.
>>
>> Triaged should be used if the solution to fix the bug is in the bug
>> itself. Triaged is Confirmed + Solution in enough detail to fix it.
>>
>> Incomplete bugs should not have assignees or milestones, otherwise they
>> won't time out.
>>
>> == General Cleanup Rules ==
>>
>> Here are some general cleanup rules that I was using:
>>
>> If an Incomplete bug has no response after 30 days it's fair game to
>> close (Invalid, Opinion, Won't Fix).
>>
>> If a bug is In Progress with no patch posted after 30 days, it is not In
>> Progress. Remove assignee, move back to last state (probably confirmed).
>> Move to Opinion if it's really a "post it note".
>>
>> If a bug is In Progress but the patches were abandoned, it's no longer
>> In Progress. Remove assignee, move back to last state (probably
>> confirmed). Move to Opinion if it's really a "post it note".
>>
>> == Rescuing Stalled Fixes ==
>>
>> Over the course of this I found a bunch of the In Progress bugs were
>> real issues, with real fixes, that had stalled out for one of a number
>> of reasons. Often it had a -1 'needs unit tests' on it, and it's sort of
>> clear the author didn't really know how to do that for this patch. Other
>> times the author's first language was not English, and the patch commit
>> message was confusing enough that no one understood what it was fixing.
>> (One of these bugs I restored, rewrote the commit message, and then it
>> sailed through the process.)
>>
>> Recommendation #9: if you are going to -1 for unit tests, please take
>> the extra step of saying 'I think you should write a test that does X,
>> Y, Z'.
>>
>> Recommendation #10: We need to find a better balance in rewriting commit
>> messages. Maybe we should just make it socially acceptable to rewrite
>> the commit message as part of review.
>
> When I'm essentially +2 on a change but for a small issue like typos 
> in the commit message, the need for a note in the code or a test (or 
> change to a test), I've been doing those myself lately and then will 
> give the +2.  If the change already has a +2 and I'd be +W but for 
> said things, I'm more inclined lately to approve and then push a 
> dependent patch on top of it with the changes to keep things from 
> stalling.
>
> This might be a change in my workflow just because we're late in the
> release and want good bug fixes getting into the release candidates, or
> it could be because of the weekly tirade about how the project is going
> down the toilet and we don't get enough things reviewed/approved. I'm
> not sure, but my point is that I agree with making it socially
> acceptable to rewrite the commit message as part of the review.
Matt,

This is consistent with what I have been doing for Cinder as well. I 
know there are some people who prefer I not touch the commit messages 
and I respect those requests, but otherwise I make changes to keep the 
process moving.
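
For reference, the "General Cleanup Rules" quoted above could be written
down roughly as follows. This is only a sketch: the Bug record, its field
names, and the cleanup_action helper are hypothetical and are not part of
Launchpad or any Nova tooling, and treating "after 30 days" as "30 days or
more" is an assumption.

```python
from dataclasses import dataclass

STALE_DAYS = 30  # threshold from the quoted cleanup rules


@dataclass
class Bug:
    """Hypothetical bug record; not the Launchpad data model."""
    status: str                    # e.g. "Incomplete", "In Progress"
    days_since_activity: int       # days since the last response or update
    has_patch: bool = False        # a patch was posted for review
    patch_abandoned: bool = False  # the posted patches were abandoned
    is_post_it_note: bool = False  # a personal TODO with no real detail


def cleanup_action(bug: Bug) -> str:
    """Return the cleanup step the quoted rules suggest for this bug."""
    # Assumption: "after 30 days" is read as 30 days or more.
    stale = bug.days_since_activity >= STALE_DAYS

    if bug.status == "Incomplete" and stale:
        return "close (Invalid, Opinion, or Won't Fix)"

    if bug.status == "In Progress":
        stalled_without_patch = stale and not bug.has_patch
        if stalled_without_patch or bug.patch_abandoned:
            if bug.is_post_it_note:
                return "remove assignee, move to Opinion"
            return "remove assignee, move back to last state (probably Confirmed)"

    return "leave as is"


# Example: an In Progress bug whose only patch was abandoned.
print(cleanup_action(Bug("In Progress", 5, has_patch=True, patch_abandoned=True)))
```

The order of the checks mirrors the order of the rules in the quoted text;
anything the rules do not mention is left alone.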

Jay
>
>>
>> ....
>>
>> I'm sure there are other thoughts, but my brain is running out of steam.
>> These were the things that popped to the top of my head. It's definitely
>> been really interesting to spend this much time with the tracker to
>> build a bigger picture of this feedback channel we have from our users.
>> Hopefully other folks found some of this handy.
>>
>>     -Sean
>>
>
> Agree with everything else said here.  It's also helpful that you're 
> directly pinging people in IRC for action on things, e.g. "what's up 
> with this bug (that you opened)?" or pointing out things that are 
> ready for approval (I've been doing this more lately in IRC on what I 
> consider trivial reviews that I've already +2'ed).
>
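
As a rough worked example of the intake numbers in the "Keeping New Bugs
at close to 0" section: with ~20% not a bug, ~30% needing more info, and
~10% quick fixes, "the rest" comes to about 40%. The bucket names and the
daily_breakdown helper below are made up for illustration, and the
percentages are the approximate figures from the quoted text.

```python
# Approximate shares (in percent) from the quoted triage section.
SHARES_PCT = {
    "not a bug, close immediately": 20,
    "needs info, set to Incomplete": 30,
    "fixable in under 30 minutes": 10,
}
# "The rest" are real bugs that can be triaged to Confirmed.
SHARES_PCT["real bug, triage to Confirmed"] = 100 - sum(SHARES_PCT.values())


def daily_breakdown(new_bugs: int) -> dict:
    """Expected number of bugs per bucket for one day's intake."""
    return {bucket: new_bugs * pct / 100 for bucket, pct in SHARES_PCT.items()}


# The quoted text gives 10 - 20 new bugs per weekday.
for count in (10, 20):
    print(count, daily_breakdown(count))
```

At 10 - 20 new bugs a day that works out to roughly 2 - 4 closed
immediately, 3 - 6 set to Incomplete, 1 - 2 quick fixes, and 4 - 8 real
bugs to confirm, prioritize, and tag.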



