[openstack-dev] Grizzly's out - let the numbers begin... [maybe now off-topic]
Jesus M. Gonzalez-Barahona
jgb at bitergia.com
Fri Apr 5 22:05:29 UTC 2013
What an interesting thread. Let me just make a couple of comments.
[Sorry for the long post: just skip it if you're not interested in project metrics.]
- Yes, any measure of company activity and contribution can be
cheated. But sets of measures are more difficult to cheat (or at least,
at some point cheating a good enough set of metrics requires too much
effort to be worth the trouble). That's why we're trying to provide more
metrics in the analysis. Commits is probably the most "classical" one,
but others are also relevant.
- In any case, we're not trying to imply that one company is behind
another just because it has 20 commits more or fewer. The
idea is that if a company is high in most or all of the metrics, it is very
likely an important contributor. If company A is at about one
third of company B in most or all metrics, it is likely that company A
is contributing significantly less than company B. So in the end,
we can enrich discussions with data.
- Company participation is just one of the many kinds of quantitative
analysis that can be done. To be honest, until I saw the level of
interest it raises, I never thought it was that important. But now I
guess I understand the reason: having some numbers, even if only
approximate, is much better than having no numbers at all. It is clear
that a company may try to cheat the analytics, but if there are no
numbers, cheating is even easier: everything becomes a matter of opinion [but
of course, this is just my very personal impression].
- We've done this company analysis on our own, because we felt it was
interesting, and we had the tools and the data. Of course we had the
support of some members of the OpenStack community, including the people
who do the git-dm analysis, because we used their affiliation
data in part (thanks!). One of the nice things (in my opinion) about free / open
source software developed in the open is this transparency, or maybe
accountability, where anyone can come, look at the data, and draw
their own conclusions. In addition, our tools are free software, our
methodology is public (if you need any detail, just ask), and the datasets
are public (thanks to OpenStack, which makes everything available). Any
criticism is welcome: it will only help to improve the results for the next round.
- We're working with the OpenStack Foundation on other kinds of
analysis. We will be talking about them in Portland. Those have more to
do with how the project is performing as a whole: which metrics could be
identified to help track performance, bottlenecks, areas that
need work, or the general evolution of the project. I hope those
analytics are more appealing to those of you who don't find the company
analysis helpful for the project.
In any case, as I said, we will be in Portland, and will be more than
happy to discuss all these issues with any of you who may be interested,
either in the session for this or at any other time. If there is
interest and logistics allow for it, we could even have some kind of BoF
on "measuring OpenStack".
[BTW, probably it is obvious from the above, but I'm one of the guys
behind Bitergia's report]
On Fri, 2013-04-05 at 14:12 -0700, Matt Joyce wrote:
> Made a quick attempt to guess the gender of devs, to see whether there's
> a gender bias. Came up with:
> females : 25
> males : 262
> unknowncount : 240
> the gender import came from
> which uses US Census Data ( obviously going to have a bias there
> probably why I have 240 unknowns... )
> If someone knows of a better dataset to test against, I'd love to hear it.
> Alternatively.... git push your name and gender to the gender.py =D
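The name-based tally described above can be sketched roughly like this. The frequency table here is a tiny made-up stand-in; the real script reportedly used US Census name data, which is exactly why non-US names tend to land in "unknown":

```python
# Hypothetical sketch of name-based gender tallying. The lookup table is
# invented for illustration; a real run would load a large name dataset.
NAME_TABLE = {
    "james": "male", "john": "male", "robert": "male",
    "mary": "female", "linda": "female",
}

def guess_gender(full_name):
    # Only the first name is looked up; anything unseen is "unknown".
    first = full_name.split()[0].lower()
    return NAME_TABLE.get(first, "unknown")

def tally(names):
    counts = {"male": 0, "female": 0, "unknown": 0}
    for name in names:
        counts[guess_gender(name)] += 1
    return counts

devs = ["John Smith", "Mary Jones", "Yuki Tanaka"]
print(tally(devs))  # {'male': 1, 'female': 1, 'unknown': 1}
```

A census-derived table explains the large "unknown" bucket: names common outside the US simply never appear in it.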
> On Fri, Apr 5, 2013 at 12:18 PM, Daniel Izquierdo
> <dizquierdo at bitergia.com> wrote:
> Hi Eric,
> On 04/05/2013 08:31 PM, Eric Windisch wrote:
> On Friday, April 5, 2013 at 14:17, Stefano Maffulli wrote:
> Let me pull in the authors of the study, as
> they may be able to shed
> some light on the inconsistencies you found.
> Eric, Joshua: can you please send Daniel and
> Jesus more details so they
> can look into them?
> I made a note on the blog. The response to others
> indicates that their results are based on two
> different methodologies (git-dm and their own dataset
> analysis), this would likely be the source of
> differences in numbers. I haven't noticed variations
> anywhere except author counts, but I haven't looked
> very hard, either.
> The methodology we have used to match developers and
> affiliations is based on information partially obtained from
> the OpenStack gitdm project, but also compared to our own
> dataset (that we already had from previous releases). Sorry if
> I didn't explain myself consistently in the blog.
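The affiliation matching described above can be sketched as a simple email-domain lookup. This is only illustrative, in the spirit of the gitdm approach mentioned; real gitdm configuration files have their own syntax, and the mapping below is invented:

```python
# Hypothetical domain-to-company mapping; real affiliation data (e.g. from
# the OpenStack gitdm project) also handles per-person overrides and
# employer changes over time, which this sketch ignores.
DOMAIN_MAP = {
    "rackspace.com": "Rackspace",
    "hp.com": "HP",
    "redhat.com": "Red Hat",
}

def affiliation(email):
    # Take everything after the last '@' and look it up.
    domain = email.rsplit("@", 1)[-1]
    return DOMAIN_MAP.get(domain, "Unknown")

print(affiliation("dev@redhat.com"))   # Red Hat
print(affiliation("someone@gmail.com"))  # Unknown
```

Generic mail providers (gmail.com, etc.) are one reason domain lookups alone are not enough, and why a second dataset is useful for cross-checking.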
> The bug here is related to how we were calculating data for the
> spreadsheets and for the company-by-company analysis. The result is
> that the company-by-company analysis had a bug, and we were counting
> more developers and commits than expected (for instance, we were
> counting as two different people a developer who at some point
> used two different email addresses).
> So, the data in the tables (bottom part of the main page) is
> the correct one. The data for the source code management
> system in the left part of each of the companies is the part
> affected by the bug.
> In addition, the number of commits for Rackspace will be a bit
> higher in the next round. Another developer told us that he
> moved from one company to Rackspace at some point, so you will
> see that number increase a bit.
> I guess it could also be differences or errors in
> employee->company mappings? Perhaps instead, one
> methodology includes those that report bugs, while the
> other only accounts for git? I'm not sure.
> Regarding this point, the data about the bug tracking system
> and mailing lists is based only on activity from developers.
> This means that people who have not committed a change to the
> source code are not counted as part of the activity of
> companies in Launchpad and the mailing lists. In any case, as
> an example, we're covering around 60% of the activity in the
> mailing lists, because people who at some point submitted
> changes to Git are that active.
> Our purpose with this is to show only activity from developers
> and their affiliations across the three data sources (git,
> tickets and mailing lists). This is just one option. From our
> point of view this analysis was pretty interesting, but
> perhaps for others it is not good enough.
> Other things, like dividing commits by authors, seem to
> be the wrong methodology; a median would be
> more appropriate and harder to game.
> This is a good point. As you mention, it is probably fairer
> to use such a metric. At some point we would like to show some
> boxplots and other metrics to better understand the
> distribution of the datasets, but we had to choose some. In
> any case, we will take this into account for the next reports
> for sure. Thanks!
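The mean-versus-median point above is easy to demonstrate: one developer pushing a flood of trivial commits drags the mean up sharply, while the median barely moves. The commit counts below are invented for illustration:

```python
from statistics import mean, median

# Commits per author for a small (made-up) team.
commits_per_author = [3, 5, 4, 6, 2]
print(mean(commits_per_author), median(commits_per_author))  # 4 4

# One author games the metric with 200 trivial commits:
gamed = commits_per_author + [200]
print(round(mean(gamed), 2), median(gamed))  # 36.67 4.5
```

The mean jumps roughly ninefold while the median shifts only from 4 to 4.5, which is why per-author medians (or full distributions such as boxplots) are harder to game.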
> Probably a good approach would be to build a common view with
> all of the people interested in this type of analysis. That
> way we could reach an agreement about how to visualize data,
> the necessary and interesting metrics, a common methodology to
> measure things, and the projects involved. This analysis is just one
> possibility, but there are surely more.
> In any case, please let us know any other concerns you may
> have; any feedback from the community is more than welcome.
> Thanks a lot for all your comments.
> Daniel Izquierdo.
> Eric Windisch
Bitergia: http://bitergia.com http://blog.bitergia.com