Tuesday, March 8, 2011

Why Mercurial’s Revision Storage Model Is Superior To Git’s

Git and Mercurial are both DVCSs inspired by Monotone. They have a lot in common, but they also differ in many ways: Mercurial has named branches, Git has the index, and so on. There is, however, one big difference at the implementation layer which is barely visible to end users but still matters a lot: the way revisions are stored.

Git’s model is based on four objects: blob, tree, commit and tag. A commit object is basically a full snapshot of your project tree at commit time: it references a tree object whose entries point to the blob objects containing the corresponding versions of your files, and it also references its parent commits. This means that to build the log of a particular file, git must scan through all commits in the project to find the ones where the blob referenced for that file changes. Basically, git doesn’t have per-file history at all and has to calculate it. Linus is pretty religious about this and considers per-file history a bug: http://permalink.gmane.org/gmane.comp.version-control.git/39358.

Meanwhile, Mercurial stores the history of each file separately in a so-called revlog, so reading all changes to a particular file takes a single continuous read of that revlog. The flip side is that this architecture makes building the log of the entire project more expensive than it is in git.
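
To make the difference concrete, here is a toy sketch in Python. The data structures are invented purely for illustration and are not the real git or Mercurial formats:

# Toy sketch: why per-file history is a scan in git's model
# but a single read in Mercurial's.

# Git-like model: every commit snapshots the whole tree (path -> blob id).
commits = [
    {"id": "c1", "parent": None, "tree": {"a.py": "blob1", "b.py": "blob7"}},
    {"id": "c2", "parent": "c1", "tree": {"a.py": "blob2", "b.py": "blob7"}},
    {"id": "c3", "parent": "c2", "tree": {"a.py": "blob2", "b.py": "blob8"}},
]
by_id = {c["id"]: c for c in commits}

def git_style_file_log(path):
    """Walk every commit and keep those where the blob behind `path` changed."""
    log = []
    for c in commits:
        parent = by_id.get(c["parent"])
        before = parent["tree"].get(path) if parent else None
        if c["tree"].get(path) != before:
            log.append(c["id"])
    return log

# Mercurial-like model: one revlog per file already holds that file's history.
revlogs = {"a.py": ["c1", "c2"], "b.py": ["c1", "c3"]}

def hg_style_file_log(path):
    """A single sequential read of the file's revlog is the whole answer."""
    return revlogs[path]

print(git_style_file_log("a.py"))  # ['c1', 'c2'], found by scanning every commit
print(hg_style_file_log("a.py"))   # ['c1', 'c2'], straight from the revlog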

Still, I argue that Mercurial’s approach is much better for modern software projects with lots of refactorings, where the code you’re interested in can travel freely across the repository. At first glance it might seem that this makes git’s model a better choice, but I don’t think so.

Imagine that you don’t know beforehand where the code you’re looking for lives (e.g. the function was there in the previous commit, but now it isn’t). In this case you have to scan your repository for that code every time you build a log for it. A better approach would be to build an index: store a versioned file which tells you in which file particular classes and functions reside. Any modern IDE already keeps this kind of information, so you only need to store it and make sure it fully reflects the committed state of the repository.

If you have such a file under version control, Mercurial’s seeming weakness becomes a strength: the first pass reads the revlog of the index file to determine which file your code lives in, and a second read of that file’s revlog gets you all the details. Bingo!
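
Sketched in Python, the lookup boils down to two cheap sequential reads. The index format here (a plain symbol-to-file map) is made up for illustration, since Mercurial itself ships no such index:

# Two-pass lookup over a versioned symbol index.
symbol_index_history = [           # revisions of the committed index file, oldest first
    {"parse_args": "cli.py"},
    {"parse_args": "options.py"},  # the function moved during a refactoring
]

file_revlogs = {
    "cli.py": ["r1", "r2"],
    "options.py": ["r3", "r5"],
}

def symbol_log(symbol):
    # Pass 1: one read of the index history tells us every file the symbol has lived in.
    homes = []
    for snapshot in symbol_index_history:
        path = snapshot.get(symbol)
        if path and path not in homes:
            homes.append(path)
    # Pass 2: one read per discovered file collects the relevant revisions.
    return {path: file_revlogs[path] for path in homes}

print(symbol_log("parse_args"))
# {'cli.py': ['r1', 'r2'], 'options.py': ['r3', 'r5']}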

Monday, January 31, 2011

Promising and committing

Commit messages suck. If you don’t trust me, run “svn/git/hg log” in your command line and see for yourself. I’m personally tired of seeing endless “Fixes”, “Tests”, etc. in the TeamCity build history. Nonetheless, when Mercurial asks me what the hell I was doing for the last hour before it agrees to share my changes with the team, I stare at the text field for a couple of seconds and, honestly, sometimes end up with yet another “Fixes” poo.

Commit messages should generally follow the same guidelines as status reports at your daily scrum: they must be short, clear and (sometimes) inspiring. This is often not the case because of the huge stream of bugs, change requests, code smells, etc. pumping into your brain at an incredible pace.

Experienced developers know that microcommits work best: a small change gets a meaningful description more easily. The problem is that controlling yourself is hard: when you start hacking on your code you normally have a very broad ‘make this stuff even better’ vision in your head, and only half an hour or more later does the system ask you to put a name tag on your actions.

Wouldn’t it be better if you could write your commit message before making the change? You open up your favorite IDE and see a nice Twitter-like ‘What are you doing?’ text field. You browse your issue tracker, take a look at the scrum board, reflect for a minute and write ‘Implement masterpiece wizard in Photoshop’. This text then sits at the top of your code editor, always reminding you of what you’re doing. Whenever you want to switch to something else, you finish this thing first, commit, and then change the message to start the new thing.
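
Until an IDE grows such a text field, the idea can be approximated with a small helper script. The sketch below is hypothetical (the .pending-message file and the ‘intent’ command are made up); only hg commit --logfile is a real Mercurial option:

#!/usr/bin/env python3
"""Command-line sketch of the message-first workflow described above."""
import subprocess
import sys
from pathlib import Path

PENDING = Path(".pending-message")  # hypothetical scratch file for the declared message

def start(message):
    """Declare what you are about to do before touching the code."""
    PENDING.write_text(message + "\n")
    print("Working on: " + message)

def commit():
    """Commit using the message that was declared up front."""
    if not PENDING.exists():
        sys.exit("No pending message; declare one first with: intent start <message>")
    subprocess.run(["hg", "commit", "--logfile", str(PENDING)], check=True)
    PENDING.unlink()

if __name__ == "__main__":
    if len(sys.argv) >= 3 and sys.argv[1] == "start":
        start(" ".join(sys.argv[2:]))
    elif len(sys.argv) == 2 and sys.argv[1] == "commit":
        commit()
    else:
        sys.exit("usage: intent start <message> | intent commit")

You would run ‘intent start …’ before touching the code, keep the message in sight, and finish with ‘intent commit’; starting something new is just another ‘intent start’.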

Thursday, January 27, 2011

LOD in Scrum Product Backlogs

We have recently started using Scrum in our teams, and one of the things that immediately changed was the way we store features. We used to keep them along with issues in the bug tracking software (YouTrack in our case), but this doesn’t work well with Scrum, because we need to order features arbitrarily (three priorities to sort by are just not enough).

So we switched to Checkvist (I should dedicate a separate post to this awesome piece of software), which is basically a very usable outlining tool.

At first we struggled with the lack of structure in a simple hierarchy of text labels: how to plan sprints and versions, how to group issues by user scenarios, and so on.

Over the course of several weeks I came up with the following system, which you might find helpful. It is based on the LOD (Level-Of-Detail) approach used in computer games to determine how much detail to put into rendering an object depending on the distance to it: the monster you’re standing next to is rendered very beautifully, with hi-res textures, bump mapping and lots of vertices in the mesh, while the trees and the church on the horizon are rendered with little detail and smaller textures so you don’t waste CPU cycles (if you’re killed by the monster, you won’t reach the church, so why spend resources on it now?).

The backlog is divided into five sections: INCOMING, SCENARIOS, DEFINED, PLANNED FOR SPRINT and DONE.

All incoming requests are put at the end of the INCOMING section. These are very basic requests from team members and end-users. If the request was submitted through the issue tracker, I attach a link to the request. I use tags in this area to mark requests with areas of interest: #performance, #overview, #code-intelligence, etc.

The next section, SCENARIOS, is where I create end user scenarios to implement: combinations of improvements which enable users to do more cool stuff in a specific area. In these stories I group requests from the INCOMING section and add the missing pieces. Examples of stories would be “Support AppEngine” or “Taggable work items”.

The next section, DEFINED, is where I get very specific about how to implement a particular story: all items in the DEFINED section have good HOWTO-DEMO descriptions. This is the section team members should look at quite often to check whether something is missing or should be done differently. This is where ideas turn into actions. It is also where we take items from for planning, so items in this section are ordered by priority.

The remaining two sections, PLANNED FOR SPRINT and DONE, are easy: after the planning meeting, all items planned for the upcoming sprint are moved from DEFINED to PLANNED FOR SPRINT. Items completed during the sprint go to DONE, and whatever is left unfinished comes back to DEFINED.

Pretty simple and easy to understand. So far it has worked very well for us.

Thursday, April 22, 2010

On Salary Ranges

I have recently taken part in several conversations about how salary ranges are used in various companies. I’m not speaking here about large corporations which need to synchronize ladders across offices and departments, but about micro- and mini-ISVs.

Many of the problems I heard about come from a simple mistake: taking these ranges too seriously. Far too often managers forget that every developer in a small team is an individual, and bind his career path to a predetermined grid.

In reality these ranges should not constrain the manager, but help him when he is not sure about the pay. Whenever you know exactly what compensation a certain employee deserves, you shouldn’t try hard to fit it into the grid.

Sunday, March 14, 2010

The Real Value of Productivity Tools

If you look around in 2010, you will see how well people are working together. Modern tools let them collaborate and create amazing things. Wikipedia and Flickr are what come to mind first, but there's a lot more. Many contemporary thinkers point out that the social tools we now have, thanks (mostly) to the Net, help us care more than ever. Take Wikipedia: despite a number of failures, in general it shows an unbelievable ability to remain a source of (mostly) accurate information, resist vandalism and stay up to date faster than anything else.

The power behind this phenomenon is the collective care of people plus an easy-to-use tool. If someone sees an inaccurate or broken page, it takes seconds to edit the text or restore a previous version. The easier the process, the less motivation one needs in order to act.

This applies perfectly to software development as well. While some think that quality results are the product of centralized organization and a thought-out architecture, a lot depends on the attention and care individual developers pay to the code. An architect might not have enough time to know the entire codebase, but when an engaged developer does his job, he sees unused code, complicated methods and overly architected subsystems. It hurts, and the easier it is to fix things, the better the chances that he cleans up the mess.

Productivity tools are the ones which remove the hurdles to making code improvements. Refactorings, code analysis, quick-fixes and code cleanup are some of the many features offered by tools like ReSharper and IntelliJ IDEA. Managers often consider them a luxury or toys, but in reality these are the tools that help a team show its care. You should never underestimate the effect. If someone conducted a study showing how code quality changes over time in teams using and not using productivity/refactoring tools, I'm sure the results would prove my point.

Tuesday, March 4, 2008

Performance issue in Dictionary of enum types

An interesting performance issue: if you have a System.Collections.Generic.Dictionary whose key type is an enum, every lookup boxes the key due to the specifics of JIT optimizations. If you use such constructs in performance-critical parts of your code, be sure to replace the default IEqualityComparer with a custom one (implement IEqualityComparer<TKey> for your enum type and pass it to the Dictionary constructor).

Coroutines and Mocking

A year ago I played with Ruby and, while investigating its possibilities, created a simple yet interesting mocking framework. I don't use mocking frameworks a lot in my test code, but one thing that hurts me every time I see samples of mock objects in use is the strange notation for describing the expected behavior.

It usually goes like this (this example uses FlexMock, a Ruby mock objects framework):
  
mock.should_receive(:average).with(12).once

mock.should_receive(:average).with(Integer).
  at_least.twice.at_most.times(10).
  and_return { rand }

You can see that one has to use special methods (which in this case resemble plain English, but nonetheless) to describe what is expected to happen to the mock object in the course of test execution. The problem I see is that it's not easy to quickly understand what's going on, because the expectation code doesn't really look like the normal code one would write in order to meet these expectations.

What I always wanted to have in my imaginary mocking framework is the ability to simply write down the essence of what I want to happen:


amount = 100
atm.transfer(from, to, amount) do
  from.withdraw(amount)
  to.deposit(amount)
end


Here I invoke the transfer method on the ATM object and expect it to first withdraw the specified amount from account from and then deposit the same amount to account to.

So I tried to do that and found a very simple way to implement this kind of framework using coroutines in Ruby. With the simple skeleton I managed to write, you can actually write code like this:

require 'mock.rb'

class ATM
  def transfer( from, to, amount )
    t = from.withdraw(amount)
    to.deposit(amount, amount)
  end
end

from = MockObject.new
to = MockObject.new
atm = MockHost.new(ATM.new)

amount = 100
atm.transfer(from, to, amount) do
  from.withdraw(amount) { "Transaction cookie" }
  to.deposit(amount, "Transaction cookie")
end

This sample test will fail because of a (silly) bug in the transfer method, which passes an invalid value as the transaction identifier to the deposit method.

You may think that this requires a huge and complex backend, but it doesn't: the skeleton for MockObject/MockHost is only about 80 lines. Of course it lacks many features that fully-fledged mocking frameworks have, but I think it can easily be extended.


require 'continuation' # on Ruby 1.9+ callcc lives in this standard library

class MockHost
  # Shares the currently active expectation block (or continuation)
  # between the host and the mock objects.
  class << self
    def block
      @@block
    end

    def block=(block)
      @@block = block
    end
  end

  def initialize( realObject )
    @realObject = realObject
    @executing = false
  end

  def method_missing(method_name, *args, &block)
    was_executing = @executing
    begin
      m = @realObject.method(method_name)
      raise NoMethodError if not m
      # Remember the expectation block so the mock objects can resume it.
      @@block = block
      if was_executing
        m.call(*args, &block)
      else
        @executing = true
        m.call(*args)
      end
    ensure
      @executing = false if not was_executing
    end
  end
end

class MockObject
  def method_missing(method_name, *args, &block)
    if @method_name
      # A call is already pending, so this invocation comes from the
      # expectation block and describes what should have happened.
      call_from_test(method_name, *args, &block)
    else
      # Otherwise the call comes from the code under test.
      call_from_main_block(method_name, *args)
    end
  end

  def call_from_test(method_name, *args, &block)
    # Compare the call written in the test block with the actual call
    # recorded by call_from_main_block.
    p "Method #{method_name} does not match expected method #{@method_name}" if method_name != @method_name
    p "Arguments count #{args.length} does not match expected count #{@args.length}" if args.length != @args.length

    0.upto(args.length - 1) do |i|
      real_arg = @args[i]
      expected_arg = args[i]

      p "Argument #{i} doesn't match. Expected #{expected_arg} but was #{real_arg}" if real_arg != expected_arg
    end

    # The block attached to the expectation provides the mocked return value.
    @retvalue = block.call if block

    # Suspend the expectation block here and jump back into the code under test.
    callcc do |continue_test_block|
      MockHost.block = continue_test_block
      @continue_main_block.call
    end
  end

  def call_from_main_block(method_name, *args)
    # Record the actual call, then resume the expectation block so it can
    # check the call and supply a return value.
    @method_name = method_name
    @args = args
    begin
      callcc do |continue_main_block|
        @continue_main_block = continue_main_block # instance variables cannot be block parameters on Ruby 1.9+
        MockHost.block.call if MockHost.block
      end

      @retvalue
    ensure
      @method_name = nil
      @args = nil
    end
  end
end