[openstack-dev] [git-upstream] How to view only commits applied since last import

Darragh Bailey - mailing lists dbailey at hpe.com
Fri Nov 18 12:47:13 UTC 2016


Hi Paul


This might be a bit long, noticed a few more items to be added to the
documentation for git-upstream. Good thing documentation is the major
item for the next release ;-)

After an initial attempt at writing one response, I've decided to break
it up a bit into a couple of emails, because there a few topics in here
and one of them is pretty lengthy and the others follow somewhat
independent issues.

For this one I'll stick to just the 2nd and 4th paragraphs since they
are related.

And apologies in advance for anyone that didn't need to see such a dive
into git log behaviour here. :-)


On 17/11/16 12:47, Paul Bourke wrote:
> Hi Darragh / git-upstream community,
> 
> I've been looking at a way to easily view a log of what commits made
> since the last upstream import when managing a branch with git-upstream.
> Right now this can be hard to do - something like 'git log
> upstream/master..HEAD' shows a lot of duplicate commits reasons I don't
> understand well enough to explain.


I'll go into that in a little more detail with a separate response, to
help anyone else reading or in the future searching. Definitely should
have been documented before now... O_o


> Darragh had suggested using the --dense option will help here, so 'git
> log --dense upstream/master..HEAD -- .' seems to generally give the
> right result. I've put up a patch to add this as a git-upstream command
> in the form of "carrying" at https://review.openstack.org/#/c/381669/.
> However, it seems there are in fact cases where the above command will
> show incorrect commits, and I'm struggling a bit to fully grok this.

Unfortunately explanation is going to be a bit lengthy, and there are
pieces of git here I'm not fully sure I understand what it's doing, and
plan to ask the git-users google group for a bit more info. Possibly may
require going to the git mailing list to get a complete picture
(git at vger.kernel.org).


Until recently, I didn't fully grok the git-log(1) manpage,
https://www.kernel.org/pub/software/scm/git/docs/git-log.html,
specifically the section about "History Simplification", naturally I
didn't realise that until I understood what it was saying better.

Mainly to do with how TREESAME and !TREESAME apply to simplifying the
history walked and displayed.


TREESAME means that the resulting git tree is identical and you can have
multiple commits referencing the same TREE SHA1 in git. It's one of the
ways it avoids duplicating the same information, identical directories
result in the same SHA1, so each commit to a git tree only adds blobs
(for files) and tree (for directory) objects for what is different.
!TREESAME basically means there was something different between the two
directories or trees in the repo, and as git allows for path limiting,
this can be applied to any subdirectory in a git repository as well.


In the review I was referencing a particular scenario that we've come
across locally (see the url above):

1) Dev uploads a patch 'A' for review against branch tracking latest
from stable/mitaka (presumably also sent it upstream)
2) Nightly sync jobs run and bring down all the patches that landed in
the stable branch upstream over the last 24 hours
2.1) Git upstream is able to replay the current local changes without
conflict, dropping those that landed upstream automatically
2.2) Git-upstream merges the result into the tracking branch, and it now
becomes the new head (assuming it passes a gate check locally)
3) Following day or so, reviewer approves patch 'A' to land and it
merges without conflict.

See following for the actual test scenario, it's a bit more complex than
what I've discribed and the letters used differ:

https://github.com/openstack/git-upstream/blob/a52f01f89401e79db95ad9fee9195de90e5a71f2/git_upstream/tests/searchers/scenarios/changes_upload_prior_to_import.yaml


If git-upstream didn't do a special type of merge, and just produced a
new branch each time, we'd have to re-target patch 'A' to the new
branch, alternatively we could also have triggered a rebase of any
outstanding changes to avoid the particular git graph this produces.
Both of these seemed like unnecessary confusion and make-work for
developers/reviewers just trying to focus on landing changes.


The result is a git graph like the following (hope this displays correctly):

               A----
              /     \
-----M-------X---N---O  local/mitaka
    /           /
   /       X'---        (rebase/cherry-pick of X onto G)
  /       /
-E---F---G              upstream stable/mitaka branch


M = the merge commit created by git-upstream for the previous sync
X = local change not yet upsteam
X' = replay of the local change X onto the latest stable/mitaka
N = most recent merge commit created by git-upstream
A is the commit created locally by a developer to fix a issue affecting
local deployments (and may need to be reworked for upstream)
O = merge commit created in Gerrit by approval of 'A'

When git-upstream created N, it made the tree of N TREESAME to X', which
includes commits F & G, while ignoring anything that could have come
from M & X, which still retaining the history.

This allows existing patches in Gerrit to be merged without needing
rebases, and developers get to continue to use "git-pull --rebase"
locally without needing to worry about the target branch.

However as you've spotted it does complicate the view of git-log a bit.

Normally if 'A' was not involved:

  ---M-------X---N      local/mitaka
    /           /
   /       X'---        (rebase of X onto G)
  /       /
-E---F---G              upstreams stable/mitaka branch


And the command being run is:

 git log --dense stable/mitaka..local/mitaka

This would look at N and check whether the tree of each parent was
TREESAME (identical) or !TREESAME and only cares about commits that are
not TREESAME to any parent.

However what is not clear is that the behaviour between walking merge
commits and normal commits is not the same:

The only reference to this behaviour is listed with the default mode:
"If the commit was a merge, and it was TREESAME to one parent, follow
only that parent"

What this means is that looking at the parents of N, X is ignored, and
X' is followed because X' is TREESAME to N, or alternatively because X
contributes nothing to the contents of N it's irrelevant. With '--dense'
added N is removed from the final display since it's the same as X' and
thus adds nothing to the final tree.

The important thing to spot is that the man page describes two different
things, one is what history is *walked* to find commits, and how
TREESAME/!TREESAME as applied to merge commits is different there to
what is *displayed* as contributing to the tree.


In this case the only difference between the two branches is X'. And is
exactly what you've seen.


However when A comes into the picture.

O: neither parent is TREESAME so both get walked
N: X is TREESAME and so is excluded, X' is walked
X': is not TREESAME to G and so is included
G: is stable/mitaka, stop walking and ignore G in output
A: is not TREESAME to X, continue  <----- this is where the problem is
X: is not TREESAME to M, continue
M: is TREESAME to E so ignore the other parent
E: is on stable/mitaka, stop walking and ignore E in output

In this case you can see that X and X' will both be included in the
output, because git only concerns itself with the TREESAME/!TREESAME
when applied to directly related commits. Exclusion of X on the basis of
being TREESAME to N isn't used as a condition for deciding whether to
walk either it or any of it's parents when you encounter another path to
the same part of the graph.


Besides talking to the git maintainers and suggesting that if going to
exclude uninteresting commits based on TREESAME & !TREESAME, then the
graph including A represents a case that should be considered, I've not
seen any way to solve this simply. (I do plan to mention this to them,
in fact I'll be using some of this email to help explain it).


To solve, need to identify commit X programmatically and exclude both it
and any parent from the output by telling git to ignore everything
reachable from X.

Since "git log" also allows you to specify revisions to be excluded,
it's possible to get the right information using the following:

 git log --dense stable/mitaka..local/mitaka --not X

All that remains is to be able to find X to pass it to the

Git-upstream can find the previous import merges, it has to, in order to
solve this to determine which commits are changes to be carried on each
import. And in git terms X == N^1, or in non git notation the first
parent of the most recent import.

This means that for the log output, can use the existing git-upstream
code to determine N and then call:

 git log --dense stable/mitaka..local/mitaka --not N^1

Since git-upstream is opinionated about the parent order of import merge
commits, this will work consistently.


> My current understanding is we have a branch, that consists of a mixture
> of upstream commits from previous imports, and custom commits. We want
> to show a list of commits added since the last import. However, if those
> commits also contain commits from another non upstream branch, we want
> to exclude those? This makes sense with the example of say a packaging
> branch, but what if commits came from say a feature branch? Does it also
> make sense to ignore those?


Response in separate email, cuz frankly this one is long enough
already... and I've the final paragraph to go yet!


> Secondly, can you recap exactly how we find the most recent import
> commit? How does excluding the parent of this commit combined with
> --dense give the correct result? From your comment in Gerrit you
> identify it as the commit with the subject "[I] Merging E1 into E", but
> I can't see exactly how you spot this. Locally in the same scenario,
> taking the parent of the commit marked that subject and plugging it into
> the command is not showing the same graph as you pasted.


Think two questions are being asked;
1) how did I know which commit was the previous import
2) how does git-upstream identify such commits

Short answer to the first part; because the test scenario was written to
have "I" represent a previous import merge commit.


The test case referenced in the review is simulating a particular
history layout to exercise git-upstream replaying changes when one of
them is based on a commit from before the most recent import merge
commit. It was looking that that problem that revealed some of the
behaviour of git log's history simplification.

The test scenario is described by a yaml list in
git_upstream/tests/searchers/scenarios/changes_upload_prior_to_import.yaml
see:

https://github.com/openstack/git-upstream/blob/a52f01f89401e79db95ad9fee9195de90e5a71f2/git_upstream/tests/searchers/scenarios/changes_upload_prior_to_import.yaml


This is combined with some code in git_upstream/tests/base.py see the
following for the code involved in taking the list and generated a git
repository from it:

https://github.com/openstack/git-upstream/blob/a52f01f89401e79db95ad9fee9195de90e5a71f2/git_upstream/tests/base.py#L196-L319
https://github.com/openstack/git-upstream/blob/a52f01f89401e79db95ad9fee9195de90e5a71f2/git_upstream/tests/base.py#L49-L92


Within the tree definition, there is an entry "[I, [E, =E1]]", which means:

1) Create a merge commit "I" with parents "E" and "E1", merged in that order
2) The "=" symbol means that the resulting tree contents of the merge
commit "I" should be exactly the same as "E1" instead of a combination
of "E" and "E1"

The code in git_upstream/tests/base.py has an algorithm that understands
how to workout the root commits of such graphs (no parents), and will
create the commits only after any required parents exist. I won't go
into detail on it here, you can look at the referenced urls above to see
what it does.


So this answers, when I was discussing the scenario how I knew that "[I]
Merging E1 into E" identified the previous import.

I noticed that I forgot to give a directly link to that file in the
review, or even include the ascii representation when discussing, oops!



However, I think you're also possibly asking, how does git-upstream
identify that "I" was the previous import?

The answer to that lies in the idea of TREESAME and !TREESAME. If
git-upstream encounters a merge commit, and one of the parents is the
exact same tree SHA1 (so exactly the same contents and file/directory
metadata) as the merge commit itself, then it assumes that the parent
commit that is TREESAME was a previous import, and the commit that had a
different contents that were discarded was the previous history.

There's a little bit more to it than that, since it has to cover some
edge cases seen as well.


The code for this is at:

https://github.com/openstack/git-upstream/blob/a52f01f89401e79db95ad9fee9195de90e5a71f2/git_upstream/lib/searchers.py#L69-L249

Specifically the code that identifies the previous import is:

https://github.com/openstack/git-upstream/blob/a52f01f89401e79db95ad9fee9195de90e5a71f2/git_upstream/lib/searchers.py#L153-L179


Taking another look at it, I think moving that second block out to a
separate method would make it easier to complete the 'carrying' command
by allowing you to simply call it to do the heavy work of identifying
the previous import merge commit.


> Thanks in advance for anything that might help cut through some of the
> confusion.
> 
> Cheers,
> -Paul

>

Sorry that ended up being a bit wordy, hope it helps explain things a
bit better, and hope you can help be get some of those points into the
docs for a useful reference in the future :D



--
Regards,
Darragh Bailey
IRC: electrofelix
"Nothing is foolproof to a sufficiently talented fool" - Unknown

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 473 bytes
Desc: OpenPGP digital signature
URL: <http://lists.openstack.org/pipermail/openstack-dev/attachments/20161118/c1fad00e/attachment.pgp>


More information about the OpenStack-dev mailing list