[openstack-dev] [nova] 2 weeks in the bug tracker
Sean Dague
sean at dague.net
Fri Sep 19 13:13:29 UTC 2014
I've spent the better part of the last 2 weeks in the Nova bug tracker
to try to turn it into something that doesn't cause people to run away
screaming. I don't remember exactly what the open bug count was 2
weeks ago (it was north of 1400, with > 200 bugs in New, but it might
have been north of 1600), but as of this email we're at < 1000 open bugs
(I'm counting Fix Committed as closed, even though LP does not), and ~0
new bugs (depending on the time of the day).
== Philosophy in Triaging ==
I'm going to lay out the philosophy of triaging I've had, because this
may also set the tone going forward.
A bug tracker is a tool to help us make a better release. It does not
exist for its own good; it exists to help. Which means when evaluating
what stays in and what leaves we need to evaluate if any particular
artifact will help us make a better release. But also more importantly
realize that there is a cost for carrying every artifact in the tracker.
Resolving duplicates gets non-linearly harder as the number of artifacts
goes up. Triaging gets non-linearly harder as the number of artifacts goes up.
With this I was being somewhat pragmatic about closing bugs. An old bug
that is just a stacktrace is typically not useful. An old bug that is a
vague sentence that we should refactor a particular module (with no
specifics on the details) is not useful. A bug reported against a very
old version of OpenStack where the code has changed a lot in the
relevant area, and there aren't responses from the author, is not
useful. Not useful bugs just add debt, and we should get rid of them.
Doing so makes a random bug pulled off the tracker more likely to be
something that you could actually look at fixing, instead of mostly just
stalling out.
So I closed a lot of stuff as Invalid / Opinion that fell into those camps.
== Keeping New Bugs at close to 0 ==
After driving the bugs in the New state down to zero last week, I found
it's actually pretty easy to keep it at 0.
We get 10 - 20 new bugs a day in Nova (during a weekday). Of those ~20%
aren't actually a bug, and can be closed immediately. ~30% look like a
bug, but don't have anywhere near enough information in them, and
flipping them to incomplete with questions quickly means we have a real
chance of getting the right info. ~10% are fixable in < 30 minutes worth
of work. And the rest are real bugs that seem to have enough to dive
into, and can be triaged into Confirmed, given a priority, and tagged
appropriately for the area.
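As a rough illustration of the arithmetic above, here is a minimal Python sketch that splits a day's incoming bugs using the approximate percentages quoted. The function name and bucket labels are made up for this example, and the "confirmed" share is simply whatever is left after the three quoted shares (roughly 40%).

```python
# Approximate shares quoted in the text; labels are hypothetical.
TRIAGE_SHARES = {
    "not_a_bug": 0.20,   # can be closed immediately
    "incomplete": 0.30,  # needs more information from the reporter
    "quick_fix": 0.10,   # fixable in under 30 minutes of work
}


def expected_triage_split(new_bugs_per_day):
    """Return the expected number of bugs per bucket for one day."""
    split = {
        name: share * new_bugs_per_day
        for name, share in TRIAGE_SHARES.items()
    }
    # Everything else is assumed to be a real bug that can go to Confirmed.
    split["confirmed"] = new_bugs_per_day - sum(split.values())
    return split


if __name__ == "__main__":
    # The text gives a range of 10 - 20 new bugs per weekday.
    for daily in (10, 20):
        print(daily, expected_triage_split(daily))
```

On these rough numbers, only about half of each day's incoming reports need more than a quick look, which is part of why holding New at zero is manageable.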
But, more importantly, this means we can filter bug quality on the way
in. And we can also encourage bug reporters that are giving us good
stuff, or even easy stuff, as we respond quickly.
Recommendation #1: we adopt a 0 new bugs policy to keep this from
getting away from us in the future.
== Our worst bug reporters are often core reviewers ==
I'm going to pick on Dan Prince here, mostly because I have a recent
concrete example, however in triaging the bug queue much of the core
team is to blame (including myself).
https://bugs.launchpad.net/nova/+bug/1368773 is a terrible bug. Also, it
was set to Incomplete and got no response. I'm almost 100% sure it's a dupe of
the multiprocess bug we've been tracking down but it's so terse that you
can't get to the bottom of it.
There were a ton of 2012 nova bugs that were basically "post it notes".
Oh, "we should refactor this function". Full stop. While those are fine
for personal tracking, their value goes to zero probably 3 months after
they are filed, especially if the reporter stops working on the issue at
hand. Nova has plenty of "wouldn't it be great if we... " ideas. I'm not
convinced using bugs for those is useful unless we go and close them out
aggressively if they stall.
Also, if Nova core can't file a good bug, it's hard to set the example
for others in our community.
Recommendation #2: hey, Nova core, let's be better about filing the kinds
of bugs we want to see! mkay!
Recommendation #3: Let's create a tag for "personal work items" or
something for this class of TODOs people are leaving themselves, which
would make them a ton easier to cull later when they stall and no one
else has enough context to pick them up.
== Tags ==
The aggressive tagging that Tracy brought into the project has been
awesome. It definitely helps slice out into better functional areas.
Here is the top of our current official tag list (and bug count):
95 compute
83 libvirt
74 api
68 vmware
67 network
41 db
40 testing
40 volumes
36 ec2
35 icehouse-backport-potential
32 low-hanging-fruit
31 xenserver
25 ironic
23 hyper-v
16 cells
14 scheduler
12 baremetal
9 ceph
9 security
8 oslo
...
So, good stuff. However I think we probably want to take a further step
and attempt to get champions for tags. So that tag owners would ensure
their bug list looks sane, and actually spend some time fixing them.
It's pretty clear, for instance, that the ec2 bugs are just piling up,
with very few fixes coming in. Cells seems like it's in the same camp (a
bunch of recent bugs have been cells related, it looks like a lot more
deployments are trying it).
Probably the most important job for tag owners would be cleaning up the
bugs in the tag: realizing that 2 bugs were actually the same bug, and
cleaning up descriptions / titles / etc so that people can move forward
on them.
Recommendation #4: create tag champions
== Soft Spots ==
After looking at probably close to 1000 bugs in 2 weeks I have a
particular impression of soft spots that we have.
Quotas are kind of a mess. It's not clear that we're even eventually
consistent. There are a lot of bugs about creating servers, deleting
servers, and leaking quota in the process. I know Jay and Sylvan are
diving hard on the resource tracker right now, I think this should be a
Kilo focus area because it creates terrible confusion and bugs for people.
EC2 has definitely regressed, especially after block device mapping
changes, to the point that it's not clear it's functional outside of the
most basic server create commands. The EC2 code is largely unchanged
since 2012, and only lightly tested, we need to decide if this is
important or not, and either fix it or delete it. There have been many
past hands going up that said they would help, and then they never do
(you know who you are).
The VM State machine model is .... Well it's at least suboptimal, but
it's also clear that it's massively leaky, and the way we handle it
internally means we end up in inconsistent wedges all the time. I expect
the complexity here causes a ton of bugs. We need some refactoring to
make things a ton more clear about what's supposed to be happening, and
how to rollback when they go wrong. I think the Tasks work was headed
down that path, but that seems stalled now.
Cross interaction with Neutron and Cinder remains racy. We are pretty
optimistic on when resources will be available. Even the event interface
with Neutron hasn't fully addressed this. I think a really great Design
Summit session would be Nova + Neutron + Cinder to figure out a shared
architecture to address this. I'd expect this to be at least a double
session.
Recommendation #5 - 8: we should get on those things :)
== Triaging Inconsistencies ==
I found some inconsistencies in how people were triaging bugs, and the
state inconsistencies probably make the bugs seem more confusing.
https://wiki.openstack.org/wiki/BugTriage provides some guidance.
Importantly:
Incomplete is an Open state. For bugzilla folks this is NEEDSINFO. I saw
a bunch of 'closing' comments that came with a move to Incomplete.
Triaged should be used if the solution to fix the bug is in the bug
itself. Triaged is Confirmed + a solution with enough detail to fix it.
Incomplete bugs should not have assignees or milestones, otherwise they
won't time out.
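The three points above can be sketched as a small consistency check. This is only an illustration: BugRecord is a hypothetical record type rather than a Launchpad API object, the has_solution flag stands in for a human judgment about the bug text, and the set of closed statuses is an assumption that follows Launchpad's view (so Fix Committed still counts as open here).

```python
from dataclasses import dataclass
from typing import List, Optional

# Statuses treated as closed in this sketch (an assumption).
# Incomplete is deliberately absent: it is an Open state.
CLOSED_STATUSES = {"Invalid", "Opinion", "Won't Fix", "Expired", "Fix Released"}


@dataclass
class BugRecord:
    """Hypothetical minimal bug record, for illustration only."""
    status: str
    assignee: Optional[str] = None
    milestone: Optional[str] = None
    has_solution: bool = False  # the fix is described in the bug itself


def is_open(bug: BugRecord) -> bool:
    """Incomplete counts as open, so a 'closing' comment does not fit it."""
    return bug.status not in CLOSED_STATUSES


def triage_problems(bug: BugRecord) -> List[str]:
    """Return the triage inconsistencies found on one bug."""
    problems = []
    if bug.status == "Incomplete" and (bug.assignee or bug.milestone):
        problems.append(
            "Incomplete with an assignee or milestone: it won't time out")
    if bug.status == "Triaged" and not bug.has_solution:
        problems.append(
            "Triaged without a solution in the bug: Confirmed fits better")
    return problems


if __name__ == "__main__":
    bug = BugRecord(status="Incomplete", assignee="someone")
    print(is_open(bug), triage_problems(bug))
```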
== General Cleanup Rules ==
Here are some general cleanup rules that I was using:
If an Incomplete bug has no response after 30 days it's fair game to
close (Invalid, Opinion, Won't Fix).
If a bug is In Progress with no patch posted after 30 days, it is not In
Progress. Remove assignee, move back to last state (probably confirmed).
Move to Opinion if it's really a "post it note".
If a bug is In Progress but the patches were abandoned, it's no longer
In Progress. Remove assignee, move back to last state (probably
confirmed). Move to Opinion if it's really a "post it note".
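The cleanup rules above could be written down roughly as follows. This is a sketch under stated assumptions, not a tool that was actually used: StalledBug and its fields are hypothetical, days_idle stands for days since the last response (Incomplete) or since the bug went In Progress, "after 30 days" is read as 30 days or more, and whether something is a "post it note" remains a human call passed in as a flag.

```python
from dataclasses import dataclass
from typing import Optional

STALE_DAYS = 30  # threshold taken from the rules above


@dataclass
class StalledBug:
    """Hypothetical bug summary, for illustration only."""
    status: str
    days_idle: int                 # assumed meaning: days without progress
    has_patch: bool = False        # a patch has been posted for review
    patch_abandoned: bool = False  # the posted patches were abandoned
    is_post_it_note: bool = False  # a vague personal TODO, judged by a human


def cleanup_action(bug: StalledBug) -> Optional[str]:
    """Return the suggested cleanup step, or None if the bug is left alone."""
    stale = bug.days_idle >= STALE_DAYS
    if bug.status == "Incomplete" and stale:
        return "close (Invalid, Opinion, or Won't Fix)"
    if bug.status == "In Progress":
        no_patch_and_stale = stale and not bug.has_patch
        if no_patch_and_stale or bug.patch_abandoned:
            if bug.is_post_it_note:
                return "move to Opinion"
            return "remove assignee, move back to last state (probably Confirmed)"
    return None


if __name__ == "__main__":
    print(cleanup_action(StalledBug("Incomplete", days_idle=45)))
    print(cleanup_action(StalledBug("In Progress", days_idle=45)))
```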
== Rescuing Stalled Fixes ==
Over the course of this I found a bunch of the In Progress bugs were
real issues, with real fixes, that had stalled out for one of a number
of reasons. Often it had a -1 'needs unit tests' on it, and it's sort of
clear the author didn't really know how to do that for this patch. Other
times the author's first language was not English, and the patch commit
message was confusing enough that no one understood what it was fixing.
(One of these bugs I restored, rewrote the commit message, and then it
sailed through the process.)
Recommendation #9: if you are going to -1 for unit tests, please go the
extra step of saying 'I think you should write a test that does X, Y, Z'.
Recommendation #10: We need to find a better balance in rewriting commit
messages. Maybe we should just make it socially acceptable to rewrite
the commit message as part of review.
....
I'm sure there are other thoughts, but my brain is running out of steam.
These were the things that popped to the top of my head. It's definitely
been really interesting to spend this much time with the tracker to
build a bigger picture of this feedback channel we have from our users.
Hopefully other folks found some of this handy.
-Sean
--
Sean Dague
http://dague.net