[openstack-dev] [Metrics] Improving the data about contributor/affiliation/time

Sean Dague sean at dague.net
Fri Oct 18 12:33:52 UTC 2013


On 10/17/2013 05:34 PM, Stefano Maffulli wrote:
> hello folks
>
> first of all: congratulations to all developers, testers, users,
> translators, tech writers for the new release: Havana is out of the gate
> with impressive numbers.
>
> Speaking of numbers, a lot of you have noticed mistakes in the reported
> numbers, from misspelling of names to missing/wrong company
> affiliations. With my apologies for the mistakes comes an explanation of
> where I see things fail and a suggestion on how to fix this for the future.
>
> Currently there are three places where statistics about the project are
> released:
>
>   - OpenStack Activity Board http://activity.openstack.org/
>   - gitdm http://git.openstack.org/cgit/openstack-infra/gitdm/
>   - Stackalytics http://git.openstack.org/cgit/stackforge/stackalytics/
>
> Activity Board is actually made of two pieces: the Dash and Insights.
> Insights pulls straight from the OpenStack Foundation Members db
> http://www.openstack.org/community/members/, so what you see in personal
> pages like
>
> http://activity.openstack.org/data/plugins/zfacts/view.action?instance=Person,person3986c85a-b9af-4686-8c7b-45525f62e396
>
> is exactly what is written on Robert's personal profile
> http://www.openstack.org/community/members/profile/3619 (these
> confluence pages are updated daily).
>
> The data about companies on the Dash are the result of semi-automatic
> processing and cleanup of the data from OpenStack Foundation Members db.
> The cleanup is necessary because a) one can't always rely on people
> spelling correctly the name of their company b) the Profile pages lack
> the UI to properly track the history of affiliation [1]. Here is what
> the Dash looks like for Canonical:
>
> http://activity.openstack.org/dash/releases/company.html?company=Canonical
>
> gitdm and Stackalytics take their developer/company/time tuples from
> files maintained by developers themselves compensated by heuristics to
> 'guess' affiliations from things like email addresses in the commit logs.
>
> Four sources of data for this reporting is bad and not sustainable.
>
> Since it seems commonly accepted that all developers need to be members
> of the Foundation, and that Foundation members need to state their
> affiliation when they join and keep such data current when it changes, I
> think the Foundation is in a good place to provide the authoritative
> data for all projects to use.

I'm not sure it is well understood that all developers have to join the 
foundation. We don't make that a requirement for someone slinging a 
patch. It would be nice to know what percentage of ATCs actually are 
foundation members at the moment (presumably that number is easy to 
generate?)

The thing is, the Foundation data currently seems to be the least 
accurate of all the data sets. Also, the lack of affiliation over time 
is really a problem for this project, especially if one of the driving 
factors for so much interest in statistics comes from organizations 
wanting to ensure contributions by their employees get counted. A 
significant percentage of top contributors to OpenStack have not 
remained at a single employer for the duration of their contributions to 
OpenStack, and I expect that to be the norm as the project ages.

Also, both gitdm and stackalytics have active open developer communities 
(and they are open source all the way down, with no need for non-open 
components to run), so again, I'm not sure why defaulting to the least 
open platform makes any sense.

Member affiliation in the Foundation database can also only be fixed by 
the individual. In the other tools, people in the know can fix it. That 
means we get a Wikipedia effect in making the data more accurate, as 
you can fix any issue you see, not just your own.

If the foundation member database were its own thing, had a REST API to 
bulk fetch, supported temporal associations, and let others propose 
updates to people's affiliation, then it would be an option. But right 
now it seems very far from being useful, and is probably the least, not 
most, accurate version of the world.
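
To make "temporal associations" concrete, here is a minimal sketch of 
what I mean, assuming a hypothetical record format: each contributor 
maps to a list of (start date, company) pairs sorted by date. The 
names, dates, table layout, and the affiliation_on() helper are all 
made up for illustration and are not taken from gitdm, Stackalytics, 
or the Members DB.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical data: contributor email -> list of (start_date, company),
# sorted by start_date. All values here are invented for illustration.
AFFILIATIONS = {
    "dev@example.org": [
        (date(2011, 1, 1), "Company A"),
        (date(2012, 6, 15), "Company B"),
    ],
}


def affiliation_on(email, day, table=AFFILIATIONS, default="Unknown"):
    """Return the company a contributor was affiliated with on `day`.

    Picks the most recent entry whose start date is on or before `day`.
    Falls back to `default` for unknown contributors or for dates that
    precede the first recorded affiliation.
    """
    history = table.get(email, [])
    starts = [start for start, _company in history]
    # Index of the last entry with start <= day, or -1 if there is none.
    idx = bisect_right(starts, day) - 1
    if idx < 0:
        return default
    return history[idx][1]
```

With something like this, a commit dated 2012-01-01 from 
dev@example.org would be credited to "Company A", and one dated 
2012-06-15 or later to "Company B", which is the kind of lookup a 
single current-employer field can't answer.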

> We can make things easier by making the personal profile pages more
> useful so people login more often and improve quality of data. Fixing
> the known shortcomings mentioned above is one step. Furthermore, we're
> working to develop an OpenID provider based on the Members DB that will
> be used across all our web properties (from gerrit to the upcoming
> groups.openstack.org, etc) so those profiles will be used for more than
> just for the initial signup to be a member [2].
>
> Since nobody can rely on user input we will still have to 'cleanup' the
> data as it comes in from the Members DB in order to create a 'Master
> Data Record' that we can export for all to consume. Here things get a
> bit fuzzy because currently the Members DB has an API that is not
> designed to be securely consumed publicly[3].
>
> What I think we can do is to have a periodic job pulling the full list
> of members and their stated affiliation, and run on that an
> automatic/manual cleanup/sanitizing job that creates files/tables ready
> to be consumed by all projects.
>
> What do you think? I'm interested in gathering more ideas and laying down a
> plan to fix this issue.
>
> thanks,
> stef
>
>
> [1]  To improve problem A the system suggests proper spelling when you
> start typing. For problem B there is a fix coming to the site.
>
> [2] I'll send more details about this project soon
> https://blueprints.launchpad.net/openstack-ci/+spec/sso-openid-provider
>
> [3] The Members DB is tightly connected to the web site openstack.org.
> There is an effort to move the whole site under openstack-infra/ so this
> pain point will be removed soon, hopefully.
>
> PS did you look at the numbers?
> http://www.openstack.org/software/havana/
> http://blog.bitergia.com/2013/10/17/the-openstack-havana-release/
>


-- 
Sean Dague
http://dague.net


