[OpenStack-Infra] Fun (important!) project: optimize Gerrit's nova git repo

Zaro zaro0508 at gmail.com
Mon May 16 21:28:04 UTC 2016


Hello, my previous testing was based entirely on the nova repo
and jgit (without gerrit), so I believe there was still some concern
about whether we should allow gerrit to do the gc, since that might do
something bad to our repos.  So, going a step further with testing, I
have set up the provided snapshot of the nova repo[1] on
review-dev.o.o[2] and basically re-ran my tests with `gerrit gc`.

Additional info:

* I ran a manual `gerrit gc` on the 'nova-gc' repo.  From the source
code[3] it looks like Gerrit is just running a `jgit gc` on it.  I
have confirmed that the repo seems to work perfectly fine after
running it through a gerrit gc.  I tested clone, push, fetch, and
various things through the UI (reviews, comments, abandon, etc.) on
the repo after the gc.

* I cloned locally on review-dev.o.o to test the performance and the
result is pretty much the same as in my previous testing.  I cloned
each repo 4 times:

nova-no-gc: 3m44.551s, 2m55.797s, 2m51.078s, 2m57.749s
nova-gc: 0m28.824s, 0m28.960s, 0m29.359s, 0m31.943s

* I also tested fetch and push again and the results were similar to
the previous test as well.  These operations were pretty quick when
run locally, but in general both were faster with the 'nova-gc' repo.

* Looking at the repo files, the objects were significantly pruned,
which saves disk space.  I actually expected the refs to be completely
cleaned up; however, `gerrit gc` doesn't clean out all of the loose
refs but only reduces them (468M -> 86M).  It does create the
packed-refs file, which is probably the thing that improves the
performance of the clone operation.

before gerrit gc:

gerrit2@review-dev:~$ du -hsx review_site/git/kdtest/nova-no-gc.git/* | sort -rh | head -10
6.4G review_site/git/kdtest/nova-no-gc.git/objects
468M review_site/git/kdtest/nova-no-gc.git/refs
6.2M review_site/git/kdtest/nova-no-gc.git/info
2.2M review_site/git/kdtest/nova-no-gc.git/logs
4.0K review_site/git/kdtest/nova-no-gc.git/hooks
4.0K review_site/git/kdtest/nova-no-gc.git/HEAD
4.0K review_site/git/kdtest/nova-no-gc.git/description
4.0K review_site/git/kdtest/nova-no-gc.git/config
4.0K review_site/git/kdtest/nova-no-gc.git/branches

after gerrit gc:

gerrit2@review-dev:~$ du -hsx review_site/git/kdtest/nova-gc.git/* | sort -rh | head -10
475M review_site/git/kdtest/nova-gc.git/objects
86M review_site/git/kdtest/nova-gc.git/refs
6.2M review_site/git/kdtest/nova-gc.git/packed-refs
6.1M review_site/git/kdtest/nova-gc.git/info
2.2M review_site/git/kdtest/nova-gc.git/logs
4.0K review_site/git/kdtest/nova-gc.git/hooks
4.0K review_site/git/kdtest/nova-gc.git/HEAD
4.0K review_site/git/kdtest/nova-gc.git/config
4.0K review_site/git/kdtest/nova-gc.git/branches
0 review_site/git/kdtest/nova-gc.git/description
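
To see how many refs are still loose after a gc, as opposed to living
in packed-refs, something like the sketch below could be used.  It is
only an illustration: the helper names are made up, the demo repo is a
fabricated layout (not nova), and it assumes a bare repo that keeps
refs under `refs/` and in a `packed-refs` file.

```python
import os
import tempfile


def count_loose_refs(git_dir):
    """Count refs stored as one file each under <git_dir>/refs."""
    total = 0
    for _root, _dirs, files in os.walk(os.path.join(git_dir, "refs")):
        total += len(files)
    return total


def count_packed_refs(git_dir):
    """Count entries in <git_dir>/packed-refs (0 if the file is absent)."""
    path = os.path.join(git_dir, "packed-refs")
    if not os.path.exists(path):
        return 0
    count = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, the '#' header and '^' peeled-tag lines.
            if line and line[0] not in "#^":
                count += 1
    return count


def make_demo_repo():
    """Build a tiny fake bare-repo layout (hypothetical data, not nova)."""
    git_dir = tempfile.mkdtemp(suffix=".git")
    sha = "0" * 40
    # One loose ref, laid out like a Gerrit change ref.
    change_dir = os.path.join(git_dir, "refs", "changes", "01", "1")
    os.makedirs(change_dir)
    with open(os.path.join(change_dir, "1"), "w") as handle:
        handle.write(sha + "\n")
    # Two packed refs plus a header line and a peeled-tag line.
    with open(os.path.join(git_dir, "packed-refs"), "w") as handle:
        handle.write("# pack-refs with: peeled fully-peeled\n")
        handle.write(sha + " refs/heads/master\n")
        handle.write(sha + " refs/tags/v1\n")
        handle.write("^" + sha + "\n")
    return git_dir


if __name__ == "__main__":
    demo = make_demo_repo()
    print("loose refs:", count_loose_refs(demo))
    print("packed refs:", count_packed_refs(demo))
```

Pointing the two helpers at the nova-no-gc.git and nova-gc.git
directories should show how the refs are split in each.  Note that a
ref can exist both loose and packed, in which case the loose file
wins, so the two counts may overlap.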


[1] http://tarballs.openstack.org/ci/nova.git.tar.bz2
[2] https://review-dev.openstack.org/#/admin/projects/?filter=kdtest
[3] http://git.openstack.org/cgit/openstack-infra/gerrit/tree/gerrit-server/src/main/java/com/google/gerrit/server/git/GarbageCollection.java?h=openstack/2.11.4#n83




On Fri, Mar 25, 2016 at 5:47 PM, Zaro <zaro0508 at gmail.com> wrote:
> So I've been researching this and I've found that there is a
> significant performance improvement after running git gc on this nova
> repo.  Below are my results.
>
> File sizes of repo as-is:
> ~/nova.git.orig$ du -hsx * | sort -r | head -10
> 6.4G objects
> 6.1M info
> 4.0K config
> 4.0K HEAD
> 382M refs
> 2.1M logs
>  0B hooks
>  0B description
>  0B branches
>
> Note that the repo as-is has already been through a 'git repack -afd'.
>
>
> File sizes after running 'jgit gc':
> ~/nova.git.test$ du -hsx * | sort -r | head -10
> 6.1M packed-refs
> 6.1M info
> 420M objects
> 4.0K config
> 4.0K HEAD
> 2.1M logs
>  0B refs
>  0B hooks
>  0B description
>  0B branches
>
> The result is that the gc cleans up the objects (6.4G -> 420M) and
> moves the loose refs from the 'refs' dir to a 'packed-refs' file
> (382M -> 6.1M).
>
> Note that I'm using jgit because that's what Gerrit would use to do
> the 'gc'.  The jgit version is 4.0.1.201506240215-r which is the one
> that's packaged with our current version of Gerrit
> (2.11.4-11-ga14450f) on review.o.o
>
>
> Here I've tested the performance of the git clone, fetch and push
> before and after running 'jgit gc':
>
> `git clone`
> ------------
> before:
> real  3m30.163s
> user 0m2.020s
> sys   3m15.087s
>
> after:
> real  0m0.925s
> user 0m0.406s
> sys   0m0.621s
>
>
> `git fetch origin stable/liberty`
> ---------------------------------
> before:
> real  0m4.271s
> user 0m0.701s
> sys   0m2.949s
>
> after:
> real  0m0.686s
> user 0m0.348s
> sys   0m0.307s
>
>
> `git push origin HEAD:refs/for/master`
> --------------------------------------
> before:
> real  0m36.454s
> user 0m5.346s
> sys   0m27.598s
>
> after:
> real  0m16.588s
> user 0m11.731s
> sys   0m3.218s
>
> Note: I pushed the exact same change for both scenarios.
>
>
> Conclusion:
> The results indicate that it would be very advantageous to run 'git gc'
> for both file size reduction and improved performance. Below are
> additional resources that I've found on the internet that seem to
> back up my results.
>
>
>
> references:
>
> This says that one-file-per-ref format both wastes storage and hurts
> performance:  https://git-scm.com/docs/git-pack-refs
>
> This outlines some of the benefits and drawbacks of packed-refs file:
> https://www.mail-archive.com/git%40vger.kernel.org/msg65722.html
>
> Info on speeding up clones/fetches with pack bitmaps:
> https://www.mail-archive.com/git%40vger.kernel.org/msg65571.html
>
> On Fri, Jan 8, 2016 at 12:13 PM, James E. Blair <corvus at inaugust.com> wrote:
>> Hi,
>>
>> With the new version of Gerrit offering built-in "git gc" capability, we
>> looked at the current state of our git repo maintenance.  We run "git
>> repack -afd" weekly in an attempt to produce the smallest packfiles
>> possible, but it does not prune loose objects, which seems to be the
>> main thing "git gc" does that we are missing.
>>
>> Some (relatively) quick experimentation suggests that various
>> combinations of "git gc", "git repack", "git prune", "git prune-packed"
>> all have effects on the overall repo size, the number of pack files, and
>> the number of loose objects.
>>
>> However, we don't just want to find the thing that makes the smallest
>> repo size (that's easy: "git prune; git gc" -- 394M for nova; one
>> packfile with all objects and one packed-refs file with all refs)
>> because this repo is used as the basis of all of our mirrors and is
>> accessed over several protocols.  It's not immediately clear what the
>> right optimization is for our situation -- we don't necessarily want to
>> trade on-disk size for reduced network performance.  Even the packing of
>> refs isn't entirely straightforward -- while we haven't needed to for
>> some time, we have, in the past, removed refs.
>>
>> We're looking for a volunteer to really dig into this problem and
>> thoroughly evaluate the implications of different ways of optimizing the
>> repo.  If you're interested, you can download a snapshot of the full
>> nova repository from Gerrit (it is a point-in-time snapshot and will not
>> be updated) at this URL:
>>
>>   http://tarballs.openstack.org/ci/nova.git.tar.bz2
>>
>> Please follow up this message if you are interested and with any
>> findings.
>>
>> Thanks,
>>
>> Jim
>>
>> _______________________________________________
>> OpenStack-Infra mailing list
>> OpenStack-Infra at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


