All open source projects encourage people to submit patches, but in order to review them you want them to be orthogonal, that is, nice and clean, carrying just the changes relating to one bugfix or enhancement.
The old way of doing things is that I, being someone with commit access to the upstream project, take the bits that I am willing to accept and commit them. But if I commit a patch for someone, it looks like I did the work. No. We don’t have to live with that inequity any longer. One of the hallmarks of the modern distributed revision control systems is that the submitter‘s name shows up in the revision history and they get full credit and ownership of their contribution, so I am very adamant that contributors make their own commits. They they bundle up the revision or revisions that make up the contribution they want to make, and send it along.
So in addition to being clean and isolated we also want a contribution to actually be a revision carrying the metadata that makes up a commit — but not depend on unrelated revisions.
It’s this last criteria that is tricky. All of the people working on our projects are proceeding merrily on their way creating numerous or infrequent revisions as is their taste. The problem is that an individual branch history is, in essence, linear. From a sequence A,B,C,D,E someone who wants to contribute can’t just easily pull C out and send C to the upstream project because C depends on A,B.
This is one definition of the cherry picking problem, that is, pulling out only the change set you want and not everything else.
The only tool that ever got this right was Darcs, and it has a magnificent user interface for doing so. Unfortunately Darcs doesn’t scale — add too many files and you’ll hit the red giant bug which does tend to put a crimp into most development styles. This is not the only definition of cherry picking, mind you, but in general the “manually pick one revision out” thing has been tricky. Since the developers of the version control tool of choice know and highly respect Darcs, I took an off hand statement along the lines of “no, no Darcs-like UI to compose changesets yet” to mean that the cherry picking of revisions wasn’t there either. It turns out that this wasn’t quite correct.
So how to create orthogonal patches?
This has had me stuck for a while. I’ve been trying to figure out how to help new participants create clean orthogonal patch bundles, but getting those contributions up to scratch has been somewhat trying.
The obvious quick answer to creating orthogonal patches was:
“Individual micro branches for everything”
which at first makes sense; after all, in the 3rd generation distributed version control tools, branches are cheap. So every time you want to work on a new feature, just create a new branch:
$ bzr clone mainline fix-the-broken-juicer
$ cd fix-the-broken-juicer/
then do the necessary work, make a commit, and finishing by creating a patch for submission:
$ bzr commit
$ bzr bundle ../mainline > fix-the-broken-juicer.patch
resulting in a nice clean orthogonal bundle containing just the revisions needed for that fix which you can happily email off to the maintainer.
But this isn’t really practical, right? The whole point is that we get on with stuff. Nobody thinks in so compartmentalized a way. By the time you even think of fixing this bug you’ve already added two new features, refactored a bunch of stuff, and improved the build system. You’ve made lots of commits already, and you depend on that build system fix to even get the project to compile! But you don’t want to send all that stuff in, not yet anyway. You’re still working on some of it. And even if it was ready to go, it’d be a monster patch highly likely to have something wrong with it resulting it its rejection.
So how to submit code?
On the maintainer’s side, I don’t want to lower our standards, but if I reject their contributions too many times they’ll lose their enthusiasm and leave the project. Sure, lots of give and take, usually resulting in me having to clean up their patches in the end. Initially I was ok with that; it’s a reasonable effort to helping someone get the idea of what was required (and is itself collaboration). But there is often a time lapse between them creating a bit of code and their submitting it, my reviewing it, and then getting back to them with comments, and in the intervening time they’ve no doubt raced ahead adding more revisions, making it all the harder for them to rewind and try to keep everything straight in perfect little micro branches.
The solution turns out to be:
“You can cherry pick after all”
One new contributor, Nat Pryce, wrote in saying:
… the only way I can think of to avoid this is to manually copy the
files from my branch into a mainline checkout, commit, generate a
bundle, and then discard the mainline checkout again.
I agreed with him that this sounded horrible, and was thinking that this wouldn’t
have the desired effect of “transporting the revisions” we want to
It would seem, however, that we’re all missing the point.
The object of the exercise is to get code into the repository. What the
revision graph looks like is irrelevant. That part I more or less knew
already. But the more important parts are:
it doesn’t matter which revision gets a bit of code into mainline so long as it has the contributor’s name on it; and
if multiple revisions have the same code fragment, THAT’S OK.
When the two branches containing these various revisions eventually merge, they
will be MERGED and (oops, smack myself in the head) they will not
conflict because (duh) they’re they same code!
In other words, where we end up is that we need to create revisions X
and perhaps Z that convey code to the project maintainer out of a stream
of revisions A,B,C,D,E,F.
If X = B [in the java
equals() sense], and if Z = D..E, it’s no problem,
because when the day comes that X and Z are merged to some branch which
happens to be the public mainline, and you later pull from that public
branch, then X and B will “contend” and merge without conflict, and
likewise Z will merge without conflict. It’s all the same text. Mission accomplished.
Aside: it turns out there is a good reason to merge X and Z now into your local working branch (ie, what will be) A,B,C,D,E,F,X,Z,G,H,I,… rather than waiting for it to turn up via the mainline branch revision stream: you’ll be giving the merge resolution algorithms a hand. Beyond the text not conflicting you’ll now already have the revision locally which will save you from pulling it over the wire later, which is actually back to what we started with in terms of trying to ship revisions around. On the other hand if you had already diverged and there was a conflict, then you’ll already have had to resolve it locally, which is good. And if upstream changed something in that code, say X,X’ and Z,Z’,Z” the merge delta to be computed when you go to merge them in is small. If the result is a conflict, well that’s ok — indeed, resolving those cases are ultimately what it’s all about, but in general they won’t conflict and we’re done.
Support from Bazaar
So back to or quandary, that manually copying files in
create branch, manually copy file, commit, create bundle & send
would be ugly.
It turns out that Bazaar supports this with a slight twist.
You are still going to have to create a [momentary] branch, make a
commit, and bundle that off. But using the capabilities built into bzr’s
merge command, we can quickly create branches containing just what we
want to submit.
First create a pristine branch of a recent upstream:
$ bzr clone mainline plumbing-fix
$ cd plumbing-fix
Now either a) select a few specific revisions, perhaps:
$ bzr merge -r 292..294 ../working
$ bzr diff
or more interestingly, b) select specific files or directories, for
$ bzr merge ../working/src/bindings/org/gnome/gtk/Plumbing.java
$ bzr merge --force ../working/src/bindings/some/other/file.txt
$ bzr diff
No matter that you might have made 14 commits to get there. When you
create this branch with its single additional revision, you can make a
proper commit message. And if there’s stuff from pending work that
shouldn’t be in there, you can clean it up before you commit. But either way, using the merge command means that you do a merge, and any irreconcilable collisions that you’ve introduced relative to the branch you’re merging into will be dealt with as such.
$ bzr commit
$ bzr bundle ../mainline > john-plumbing-fix.patch
Email the patch, and you’re done.
Optionally, you can of course:
$ cd ..
$ rm -r plumbing-fix/
although you may want to keep the branch around in case any further changes need to be made; should that be necessary you can then just commit the fix-ups in the existing branch. Or you can create a new patch from scratch. Or you can always just reapply your bundle and recreate the branch. Lots of choices.
Eventually your patch will be merged to mainline, and when subsequently
the revision that carried it is pulled down from mainline to your local
copy of it and you are merging mainline into your primary working
branch, there won’t be a conflict because that code was in your working
branch all along (unless you’ve further changed something, of course; but that’s why it is advantageous to merge your contributed branch|patch back into your working branch(es) early; then its not just the code text that was once
common, but now there are recent revision(s) for common ancestors, and the
merge engine will slurp it right up).
This dramatically improves matters, I think. Now I don’t have to feel
so bad about rejecting patches that aren’t quite there yet. When I do
so I know I’m not inflicting as much pain as I was worried I was.
Contributors meanwhile can create all the private revisions they ever wanted, and don’t have to forecast which are going to be publicly visible and which
aren’t. So they can leave off worrying about a proper commit messages, etc,
until they actually go to create and submit an orthogonal patch.
But best of all, it restores the flexibility for people to do whatever they want. “No one can tell you no”; you might not get your change into upstream, but no one can keep it out of your code. And that’s what software freedom is all about.
Thanks to Robert Collins for having reviewed and commented on this essay.
bzr merge --force filename one-at-a time thing is a bit cumbersome. I had a word with some of the Bazaar hackers about it, and they’re going to see what they can do. Don’t forget there’s also the
bzr merge -r A..B form.
- Actually, the whole process is still a bit cumbersome. Someday, perhaps, we’ll have a UI to automate much of the changeset selection and bundle creation. Until then, though, this works, and it accomplishes the aim, which is facilitating contribution. That’s important to keep in mind.
- Most of this discussion is characteristic of the 2nd and 3rd generation distributed revision control tools as a class, although obviously command syntax will differ. The key to all this holding up at scale, however, is ultimately the quality of the merge algorithms in the tool you’re using and its ability to deal with corner cases.
- At the end of it all, though, if you can create a new branch and just do the work there in an isolated fashion resulting in an orthogonal patch, that is easier.
:) It depends on the nature of the project you’re working on, the size of the change you’re contemplating contributing, and they way you work. In our case, the insane refactoring power of an IDE like Eclipse tends to mean that we get a lot of unrelated changes in flight at the same time, thus exacerbating the original question of creating orthogonal patches.
- As of Bazaar 0.90, there will be a
bzr send -r command which, when supplied with a revision, will create an orthogonal patch in a single step!