Monday 18 November 2013

Understanding the nuances of Git's squash and commit merge commands can save you a lot of grief

I support a fairly large project maintained in Git. At any given time there will be half a dozen feature branches going on at once with their own schedules for merging back into our development branch.

A developer recently showed me a very disturbing problem that they had encountered. They were attempting to merge from the development branch into their feature and changes that had only been made on their branch were being backed out without obvious root cause. There was no reason that any other branch should have worked on those files and Git wasn't exactly forthcoming with answers either.

Even worse, when I did finally find the cause, I discovered that it was caused by the standard behaviour of a commit merges and didn't happen when doing the equivalent squash merge.

Lets talk about merging

Git supports two kinds of merging called squash and commit. Each kind has its uses and a camp of supporters who argue how it should be applied to feature development branches.

So let's explain these two kinds of merges and the reasons people would favour one or the other.

Squash merge

A squash merge compares two branches (the checked out branch and the incoming branch), and applies the difference to the working directory and commits it. If you were to look at the git log after a squash merge you would see a big commit on the checked out branch containing all incoming differences. You would not see any join between the checked out branch or the incoming branch.

So essentially a squash is just a regular commit.

A squash is useful for landing a feature branch because it does not disrupt the mainline's history with merge points. The feature lands as a single commit and if necessary it can be reverted with a single commit.

The downside / counterargument is that by munging all the commits together as a single commit, that developers lose the history of the feature branch. The feature branch would have to be preserved in perpetuity if someone wished to reference the reason a particular line was changed in context.

Commit merge

A commit merge explicitly joins two branches together with a merge point. Various strategies exist to construct this point but the default recursive strategy essentially wends its way down each branch to order the commits in time and figure out which files (if any) are in conflict.

Once conflicts are resolved, the merge point then becomes part of the history of the checked out branch. If any files were in conflict and fixed by the developer then those files are committed as blobs and the merge point contains a reference to them.

The advantage of a commit merge is that if you were to look at the log, or the blame annotation, you will see the history from both branches. You could even delete the feature branch once it is merged without losing the merge point.
 
The downside / counter argument is that feature branches pollute the history with a lot of noise. There are extra branches to visualize and a branches may contain a lot of junk comments, particularly early in the feature life cycle when people are moving things around, less rigorous with their commit comments and so on.

Reverting a commit merge is also more difficult. Command line git has a special revert syntax to produce a revert commit that nullifies the changes brought in by one parent.


But they produce the same result don't they?

You would think so but you would be WRONG. They usually produce the same results but not always. It was such a scenario which prompted me to write this blog.

Here is the scenario:
  1. A new branch f1 is created from master
  2. A new branch f2 is created from master
  3. f2 commits a change A
  4. f1 accidentally commit merges from f2 instead of master (by using the wrong tag or hash). We shall call this botched merge A+
  5. f1 developer realises his mistake and pushes a revert commit which we shall call A-. So now A- cancels A+ right?
  6. Time passes and f1 and f2 receive more commits and lead their separate lives
  7. f1 commit merges to master with no issue using a commit merge
  8. f2 commit merges from master and suddenly commit A in their branch has disappeared!
If f1 had chosen a squash in 7) it would have gotten the correct result but they did not. If f1 had created a new branch from the previous commit and continued using that, it would have worked. But by polluting their branch with a merge and a revert they caused grief for f2 down the road.



Here is a command line script to simulate the issue:
rm -rf test; mkdir test; cd test
git initecho master1 > t.txt
git add t.txt
git commit -a -m "Master branch"

# f1 changes the file
git checkout -b f1
echo f1 > t.txt
git commit -a -m "f1 branch"

# f2 changes the file
git checkout master
git checkout -b f2
echo f2 > t.txt
git commit -a -m "f2 branch"

# Simulate an accidental merge
git checkout f1
git merge --no-ff f2
git checkout --theirs t.txt
git commit -a --no-edit
# Now revert the botched merge
git revert -m 1 --no-edit HEAD

git checkout master
# CHOOSE THE MERGE, squash or commit
git merge --no-ff f1

git checkout f2
git merge --no-ff master
echo The contents of the file is:
less t.txt
Run this script once, inspect t.txt and you will see it contains "f1". If you change the line below "CHOOSE THE MERGE" to this i.e.. "git merge --squash f1; git commit -a -m Merge" the script will properly detect a merge conflict and warn the user. It would then be apparent there was an issue and give the developer a chance to fix it.

What is going on?

The botched merge is what is going on. The developers of f1 screwed up when they merged from f2 by mistake and then corrected the error. You would think a history which contains A+ and A- should cancel itself out. And if f1 had been squash merged when it went into master it would have.

But git doesn't work off deltas. It works off an index file. When A+ was reverted to A-, it instructed Git to update the branch index to use older versions of those files.
When f1 was commit merged into master, the index of master was also reflected to use the older version of those files.

And finally when f2 next merged from master, its index was also "infected" with older versions of those files and the changes that were originally on the branch just seemed to disappear.

The same scenario could be reproduced with a cherry pick. e.g. f1 cherry picks a commit from master (or another branch) and then decides to revert it. That's fine until it commit merges to master and now master gets the revert too!

Squash merge is simply safer

I hope I have demonstrated a very nasty side effect of using commit merges. If your commit merge contains reverts of inadvertent merges or cherry picks then you can easily spread a bad commit to other branches.

In my opinion squash is definitely a preferred option when landing a feature. It is obvious, atomic, simple and easy to revert. It doesn't contain any nasty "surprises" like the scenario above and it cuts out a lot of noise when you are visualising your branches.

A commit merge is still useful when merging from the master to the feature branch since it provides an obvious merge point and is likely to be periodic and increasing as the feature nears completion. Just don't do it going back the other way.

But what if I really want a commit merge for landing a feature?

Then I can think of several suggestions but they all involve a bit of process to ensure that any mistakes in the branch's history are neutralized by the time of a merge and don't rear up to hurt another branch.

All of these suggestions assume you are working in a team with a branch which is pushed to a server.

If you botch a merge, create a new branch


If you've pushed up something bad then don't try and patch up the damage. Create a new branch immediately prior to the bad commit and tell people to move to the clean branch instead. Yes this is a bit of a pain, but the lesson here is don't push bad stuff.

If necessary other developers can cherry pick any commits they have made since on the old branch. The old branch can be junked when everyone has moved over.

If you have pushed the change to a server, DO NOT attempt to reset the HEAD of the branch.

Rebase the feature branch periodically


Don't use just one branch for your feature, use a series of them and get your developers used to the workflow.

Create an initial branch called featureX/0, and then periodically rebase it as your development progresses, e.g. featureX/1, featureX/2 etc. Uniquely naming your rebased branches is essential if you push them to a server.

Rebasing means you can interactively strip out empty operations. So if you screwed up with a merge, just interactive rebase to a featureX/1 and carry on. Or squash up the changes in featureX/0 (A+ and A- would cancel each other out). Then you have a nice clean, linear string of commits. Squashing is a useful way to eliminate some of the noise that might occur during feature development where things get moved around, reverted, and committed with less than useful comments.

Finally when you are ready to land the feature, you can merge featureX/n (where n is the last rebase iteration) to develop as a no-fastforward merge. Even if there is a dangerous commit left in the middle of your log which you later reverted, it does not matter because you brought everything in as a stack.

Conclusion

Squash and commit merges usually result in the same changes but not always. If you encounter weird merge issues with files changing for no obvious reason then it is probably due to a merge or a cherry pick which was reverted somewhere in the past. Prevention is the best cure, so favour squash merges or rebased commit merges when landing features.

No comments:

Post a Comment