[openstack-dev] [nova] 2 weeks in the bug tracker

Sean Dague sean at dague.net
Fri Sep 19 15:18:21 UTC 2014


On 09/19/2014 09:58 AM, Sylvain Bauza wrote:
<snip>
>> == Keeping New Bugs at close to 0 ==
>>
>> After driving the bugs in the New state down to zero last week, I found
>> it's actually pretty easy to keep it at 0.
>>
>> We get 10 - 20 new bugs a day in Nova (during a weekday). Of those ~20%
>> aren't actually a bug, and can be closed immediately. ~30% look like a
>> bug, but don't have anywhere near enough information in them, and
>> flipping them to incomplete with questions quickly means we have a real
>> chance of getting the right info. ~10% are fixable in < 30 minutes worth
>> of work. And the rest are real bugs, that seem to have enough to dive
>> into it, and can be triaged into Confirmed, set a priority, and add the
>> appropriate tags for the area.
>>
>> But, more importantly, this means we can filter bug quality on the way
>> in. And we can also encourage bug reporters that are giving us good
>> stuff, or even easy stuff, as we respond quickly.
>>
>> Recommendation #1: we adopt a 0 new bugs policy to keep this from
>> getting away from us in the future.
> 
> Agreed, provided we can review all the new bugs each week.

So I actually don't think this works if it's a weekly thing. Keeping new
bugs at 0 really has to be daily because the response to bug reports
sets up the expected cadence with the reporter. If you flip back new
bugs in < 6 or 8 hrs, there is a decent chance they are still on the
same work shift, and the context is still in their head (or the
situation may even still exist).

Once you pass 24hrs the chance of that goes way down. And,
realistically, I've found that when I open the bug tracker in the
morning and there are 5 bugs, that's totally doable over the first cup
of coffee. Poking the bug tracker a couple more times during the day is
all that's needed to keep it there.

>> == Our worst bug reporters are often core reviewers ==
>>
>> I'm going to pick on Dan Prince here, mostly because I have a recent
>> concrete example, however in triaging the bug queue much of the core
>> team is to blame (including myself).
>>
>> https://bugs.launchpad.net/nova/+bug/1368773 is a terrible bug. Also, it
>> was set to Incomplete and got no response. I'm almost 100% sure it's a
>> dupe of the multiprocess bug we've been tracking down but it's so terse
>> that you can't get to the bottom of it.
>>
>> There were a ton of 2012 nova bugs that were basically "post it notes".
>> Oh, "we should refactor this function". Full stop. While those are fine
>> for personal tracking, their value goes to zero probably 3 months after
>> they are filed, especially if the reporter stops working on the issue at
>> hand. Nova has plenty of "wouldn't it be great if we... " ideas. I'm not
>> convinced using bugs for those is useful unless we go and close them out
>> aggressively if they stall.
>>
>> Also, if Nova core can't file a good bug, it's hard to set the example
>> for others in our community.
>>
>> Recommendation #2: hey, Nova core, let's be better about filing the kinds
>> of bugs we want to see! mkay!
>>
>> Recommendation #3: Let's create a tag for "personal work items" or
>> something for this class of TODOs people are leaving themselves, which
>> makes them a ton easier to cull later when they stall and no one else
>> has enough context to pick them up.
> 
> I would propose to set their importance as "Wishlist" then. I would
> leave the tags for setting which components are impacted.

Maybe. I honestly don't think core team members should file wishlist
bugs at all. That really means feature and means a spec. Or it means
just do it (for refactoring).

>> == Tags ==
>>
>> The aggressive tagging that Tracy brought into the project has been
>> awesome. It definitely helps slice out into better functional areas.
>> Here is the top of our current official tag list (and bug count):
>>
>> 95 compute
>> 83 libvirt
>> 74 api
>> 68 vmware
>> 67 network
>> 41 db
>> 40 testing
>> 40 volumes
>> 36 ec2
>> 35 icehouse-backport-potential
>> 32 low-hanging-fruit
>> 31 xenserver
>> 25 ironic
>> 23 hyper-v
>> 16 cells
>> 14 scheduler
>> 12 baremetal
>> 9 ceph
>> 9 security
>> 8 oslo
>> ...
>>
>> So, good stuff. However I think we probably want to take a further step
>> and attempt to get champions for tags. So that tag owners would ensure
>> their bug list looks sane, and actually spend some time fixing them.
>> It's pretty clear, for instance, that the ec2 bugs are just piling up,
>> with very few fixes coming in. Cells seems like it's in the same camp (a
>> bunch of recent bugs have been cells related, it looks like a lot more
>> deployments are trying it).
>>
>> Probably the most important thing for tag owners would be cleaning up the
>> bugs in the tag. Realizing that 2 bugs were actually the same bug.
>> Cleaning up descriptions / titles / etc so that people can move forward
>> on them.
>>
>> Recommendation #4: create tag champions
> 
> +1. That said, some bugs can have more than one tag (for example,
> compute/conductor/scheduler), so it would mean the champions would have
> to discuss between them.
> I can volunteer for looking at the "scheduler" tag.

Sure. But more communication on issues is only a good thing. :)

> 
>> == Soft Spots ==
>>
>> After looking at probably close to 1000 bugs in 2 weeks I have a
>> particular impression of soft spots that we have.
>>
>> Quotas are kind of a mess. It's not clear that we're even eventually
>> consistent. There are a lot of bugs about creating servers, deleting
>> servers, and leaking quota in the process. I know Jay and Sylvain are
>> diving hard on the resource tracker right now. I think this should be a
>> Kilo focus area because it creates terrible confusion and bugs for
>> people.
>>
>> EC2 has definitely regressed, especially after block device mapping
>> changes, to the point that it's not clear it's functional outside of the
>> most basic server create commands. The EC2 code is largely unchanged
>> since 2012, and only lightly tested, we need to decide if this is
>> important or not, and either fix it or delete it. There have been many
>> past hands going up that said they would help, and then they never do
>> (you know who you are).
>>
>> The VM State machine model is .... Well it's at least suboptimal, but
>> it's also clear that it's massively leaky, and the way we handle it
>> internally means we end up in inconsistent wedges all the time. I expect
>> the complexity here causes a ton of bugs. We need some refactoring to
>> make things a ton more clear about what's supposed to be happening, and
>> how to rollback when they go wrong. I think the Tasks work was headed
>> down that path, but that seems stalled now.
>>
>> Cross interaction with Neutron and Cinder remains racy. We are pretty
>> optimistic on when resources will be available. Even the event interface
>> with Neutron hasn't fully addressed this. I think a really great Design
>> Summit session would be Nova + Neutron + Cinder to figure out a shared
>> architecture to address this. I'd expect this to be at least a double
>> session.
>>
>> Recommendation #5 - 8: we should get on those things :)
> 
> IMHO, these concerns are related to the technical debt we have, and how
> we can reduce it.

Agreed. Except for maybe the last one, as that's something we've just
never built the right infrastructure around now that you often need 3
components to bring up a VM.

>> == Triaging Inconsistencies ==
>>
>> I found some inconsistencies in how people were triaging bugs, and the
>> state inconsistencies probably contribute to making the bugs seem
>> confusing: https://wiki.openstack.org/wiki/BugTriage provides some
>> guidance.
>>
>> Importantly:
>>
>> Incomplete is an Open state. For bugzilla folks this is NEEDSINFO. I saw
>> a bunch of 'closing' comments but a move to Incomplete.
>>
>> Triaged should be used if the solution to fix the bug is in the bug
>> itself. Triaged is Confirmed + Solution in enough detail to fix it.
>>
>> Incomplete bugs should not have assignees or milestones, otherwise they
>> won't time out.
> 
> Thanks for clarifying it.
> 
>>
>> == General Cleanup Rules ==
>>
>> Here are some general cleanup rules that I was using:
>>
>> If an Incomplete bug has no response after 30 days it's fair game to
>> close (Invalid, Opinion, Won't Fix).
>>
>> If a bug is In Progress with no patch posted after 30 days, it is not In
>> Progress. Remove assignee, move back to last state (probably confirmed).
>> Move to Opinion if it's really a "post it note".
>>
>> If a bug is In Progress but the patches were abandoned, it's no longer
>> In Progress. Remove assignee, move back to last state (probably
>> confirmed). Move to Opinion if it's really a "post it note".
>>
>> == Rescuing Stalled Fixes ==
>>
>> Over the course of this I found a bunch of the In Progress bugs were
>> real issues, with real fixes, that had stalled out for one of a number
>> of reasons. Often it had a -1 'needs unit tests' on it, and it's sort of
>> clear the author didn't really know how to do that for this patch. Other
>> times the author's first language was not English, and the patch commit
>> message was confusing enough that no one understood what it was fixing.
>> (One of these bugs I restored, rewrote the commit message, and then it
>> sailed through the process.)
>>
>> Recommendation #9: if you are going to -1 for unit tests, please go the
>> extra step of saying 'I think you should write a test that does X, Y, Z'.
> +1
> 
>> Recommendation #10: We need to find a better balance in rewriting commit
>> messages. Maybe we should just make it socially acceptable to rewrite
>> the commit message as part of review.
> Well, there are good examples of commit messages here
> https://wiki.openstack.org/wiki/GitCommitMessages
> We can at least ask the reviewers to point to this wiki page each time they
> -1 a commit msg.
> 
> FWIW, it's pretty easy to modify the commit msgs if you're reviewing
> them by using the Gerrit interface, so that sounds a bit cumbersome to
> leave a -1 for only a bad message. Instead, it sounds better to provide
> a new commit msg on your own.

Well, I think the reason for -1ing on commit messages is to train people
into doing it right. I'm not convinced that's always happening. And more
importantly, exceptionally confusing commit messages mean things get lost.

	-Sean

-- 
Sean Dague
http://dague.net


