Sense Amplifier: Why Mercurial’s Revision Storage Model Is Superior To Git’s

Git and Mercurial are both DVCSs inspired by Monotone. They have a lot in common but also differ in many ways: Mercurial has named branches, Git has the index, and so on. One major difference sits in the implementation layer and is barely visible to end users: the way revisions are stored.

Git’s model is based on four object types: blob, tree, commit and tag. A commit object is basically the full state of your project tree at commit time: it references a tree object, which in turn references the blob objects (and subtrees) that contain the corresponding versions of your files. It also references its parent commits. This means that to build the log of a particular file, git must scan through the commits of the project to find the ones where the blob referenced for that file changes. Basically, git doesn’t store per-file history at all and has to compute it. Linus is pretty religious about this and considers per-file history a bug: http://permalink.gmane.org/gmane.comp.version-control.git/39358.
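To make the cost concrete, here is a minimal sketch of that walk in Python. It is a toy model, not git’s actual code: the `Commit` class, the flat `tree` dictionary (path to blob id) and the `file_log` function are all made up for illustration, and history is assumed to be linear (one parent per commit).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Commit:
    """Toy commit: a snapshot of the whole tree plus a parent pointer."""
    id: str
    parent: Optional[str]
    tree: Dict[str, str]  # path -> blob id (flattened; real git nests trees)


def file_log(commits: Dict[str, Commit], head: str, path: str) -> List[str]:
    """Walk every commit from head to root and keep those where the
    blob recorded for `path` differs from the parent's blob."""
    touched = []
    current = head
    while current is not None:
        commit = commits[current]
        parent_tree = commits[commit.parent].tree if commit.parent else {}
        if commit.tree.get(path) != parent_tree.get(path):
            touched.append(commit.id)
        current = commit.parent
    return touched


# Hypothetical three-commit history.
commits = {
    "c1": Commit("c1", None, {"a.py": "blob1", "b.py": "blob2"}),
    "c2": Commit("c2", "c1", {"a.py": "blob1", "b.py": "blob3"}),
    "c3": Commit("c3", "c2", {"a.py": "blob4", "b.py": "blob3"}),
}
print(file_log(commits, "c3", "a.py"))
```

Note that the loop visits every commit, including `c2`, which never touched `a.py`.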

Meanwhile, Mercurial stores the history of each file separately in a so-called revlog, so reading all changes to a particular file takes one continuous read of that file’s revlog. However, this architecture makes building the log for the entire project more expensive than in git.
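The per-file layout can be sketched the same way. Again this is only a toy model under my own assumptions, not Mercurial’s real revlog format: `ToyStore`, `commit` and `file_log` are hypothetical names, and each "revlog" is just an append-only Python list.

```python
from typing import Dict, List, Tuple


class ToyStore:
    """Toy per-file storage: one append-only log per path."""

    def __init__(self) -> None:
        # path -> list of (changeset id, file content), oldest first
        self.filelogs: Dict[str, List[Tuple[str, str]]] = {}

    def commit(self, changeset_id: str, changes: Dict[str, str]) -> None:
        """Append a new entry only to the logs of the files that changed."""
        for path, content in changes.items():
            self.filelogs.setdefault(path, []).append((changeset_id, content))

    def file_log(self, path: str) -> List[str]:
        """One sequential read of a single log; other files are never touched."""
        return [changeset_id for changeset_id, _ in self.filelogs.get(path, [])]


store = ToyStore()
store.commit("c1", {"a.py": "v1", "b.py": "v1"})
store.commit("c2", {"b.py": "v2"})
store.commit("c3", {"a.py": "v2"})
print(store.file_log("a.py"))
```

Here the log of `a.py` is read without ever looking at `c2`, which is the difference from the commit walk described earlier.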

Still, I argue that Mercurial’s approach is much better for modern software projects with lots of refactoring, where the code you’re interested in can travel freely across the repository. At first glance this might seem to make git’s model the better choice, but I don’t think so.

Imagine that you don’t know beforehand where the code you’re looking for lives (e.g. the function was here in the previous commit, but now it’s gone). In this case you’ll have to scan your repository for that code every time you build a log for it. A better approach would be to build an index: store a versioned file which tells you in which file particular classes and functions reside. This kind of information is kept by any modern IDE, so you only need to store it and make sure it fully reflects the committed state of the repository.

If you have such a file under version control, Mercurial’s seeming weakness becomes a strength: on the first pass you read the revlog of the index file to determine in which file your code is located, and a second read, of the found file’s revlog, gets you all the details. Bingo!
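A rough sketch of that two-pass lookup, in Python. Everything here is an assumption made for illustration: the index file name `symbols.json`, its JSON format (symbol name to path), the integer revision numbers, and the `symbol_history` function. The per-file logs are plain lists standing in for revlogs.

```python
import json
from typing import Dict, List, Tuple

# path -> list of (revision number, content), oldest first
filelogs: Dict[str, List[Tuple[int, str]]] = {
    "symbols.json": [
        (1, json.dumps({"parse": "util.py"})),
        (3, json.dumps({"parse": "parser.py"})),  # parse() moved in rev 3
    ],
    "util.py": [(1, "def parse(): ..."), (2, "def parse(s): ..."), (3, "")],
    "parser.py": [(3, "def parse(s): ..."), (4, "def parse(s, strict): ...")],
}


def symbol_history(filelogs, index_path: str, symbol: str) -> List[Tuple[int, str]]:
    # Pass 1: read only the index file's log to learn where the symbol
    # lived, and from which revision on.
    homes: List[Tuple[int, str]] = []
    for rev, content in filelogs[index_path]:
        path = json.loads(content).get(symbol)
        if path is not None and (not homes or homes[-1][1] != path):
            homes.append((rev, path))

    # Pass 2: read the log of each file that held the symbol, limited to
    # the revisions during which the index says the symbol was there.
    history: List[Tuple[int, str]] = []
    for i, (start, path) in enumerate(homes):
        end = homes[i + 1][0] if i + 1 < len(homes) else float("inf")
        history += [(rev, path) for rev, _ in filelogs[path] if start <= rev < end]
    return history


print(symbol_history(filelogs, "symbols.json", "parse"))
```

The sketch reads three logs in total (the index, `util.py` and `parser.py`) no matter how many other files or revisions the repository holds.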

Tuesday, March 8, 2011
