10 Reasons to Use Git for Research

Posted: 30 November 2008 by Jason Adams in Uncategorized

Git is a version control system that has been gaining in popularity recently.  If you have heard of or used Subversion or CVS, you are familiar with the basic principle of keeping track of changes made by multiple users to a set of documents (source code, text files, etc.).  One of the chief benefits of version control in software is that you can roll back if the code becomes corrupted.  It’s easy to see which changes were made where, and broken code can be fixed more easily than if you had no version control and had to reconstruct the working code from scratch.  Unlike Subversion and CVS, Git is a distributed version control system:  each user has their own copy of the entire repository and history.  Branching and merging are much easier, and it’s extremely simple to get started.  Plus, having used all three, I find Git the most fun.

Academic settings impose different constraints on code base management.  The goal is usually less about code quality and more about exploring possibilities.  Academic code is often quite shitty, hacked together by some grad student(s), with dozens of false starts and changes in requirements.  Trying to recreate previous experiments is often very difficult unless the grad student made provisions for such rollbacks.  And if they have, it’s probably done in a way that seemed logical to the grad student at the time but is a nightmare for someone new to the project.  There are ways to avoid this by placing more of an emphasis on software engineering, but sometimes projects are so small or short-lived that it doesn’t seem feasible to bother with that at first.  And if you don’t even have a clear picture of where you are heading, it might not even be possible (though you are probably doomed to many problems in that case).

To help combat these issues, I will contend that every academic software project must use version control.  Git makes that easy and here’s why.

1.  Creating the first repository is a no-brainer.

To create a new repository you simply type:

git init

It’s so easy, you can use it for anything.  To clone someone else’s repository, just type:

git clone git://location.of.origin.repository

Cloning is very similar to checking out in Subversion and CVS, except that you can now work completely independently if you desire.  And you can tunnel it through ssh (substitute ssh:// for git:// above), if you’re worried about security.
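As a minimal sketch of what typically comes right after git init, the following creates a repository in a throwaway directory and records a first commit.  The file name experiment.py, the commit message, and the name and email address are placeholders made up for illustration.

```shell
# Work in a throwaway directory so the sketch does not touch real files.
repo="$(mktemp -d)"
cd "$repo"

# Turn the directory into a Git repository.
git init

# Identify yourself for this repository only (placeholder values).
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# Record a first snapshot of a hypothetical experiment script.
echo 'print("baseline")' > experiment.py
git add experiment.py
git commit -m "Add baseline experiment script"

# Show the history so far: one line per commit.
git log --oneline
```

Everything here happens on the local machine; no server has to exist yet.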

2.  You can still use it while off the grid.

In Subversion, creating the initial repository means needing some central place where all of the code goes.  If you are collaborating with several people, chances are this repository is not on your own machine, so if you cannot access the network, you cannot access the repository.  With Git, you store the entire repository and history on your own machine, so even when you are off the network you can take advantage of all the features of version control.

3.  Branch your experiments.

Often the need arises to try out different approaches in academic coding.  Branching in Git is ridiculously simple:

git checkout -b new-branch-name

You can easily switch between multiple branches, merge branches, or discard them.  One approach might be to keep the main architecture stuff in your master branch (the original) and use branches for different parameters in experiments.  This will let you easily and logically separate functionality so that running an old experiment is just a matter of checking out the branch that pertained to it.  Update:  Thanks to Dustin Sallings for the shorter version of checking out a new branch.
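One way this branch-per-experiment idea might look in practice is sketched below.  The file params.txt, the branch name exp-low-learning-rate, and the parameter values are hypothetical; only the git commands themselves are standard.

```shell
# Throwaway repository with placeholder identity.
repo="$(mktemp -d)"
cd "$repo"
git init
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# Baseline parameters live on the original branch.
echo "learning_rate=0.1" > params.txt
git add params.txt
git commit -m "Baseline parameters"

# Remember the original branch name (master or main, depending on your setup).
base="$(git symbolic-ref --short HEAD)"

# Start an experiment on its own branch and change a parameter there.
git checkout -b exp-low-learning-rate
echo "learning_rate=0.01" > params.txt
git commit -am "Try a lower learning rate"

# Switching back restores the baseline file untouched.
git checkout "$base"
cat params.txt

# Re-running the old experiment is just a matter of checking out its branch.
git checkout exp-low-learning-rate
cat params.txt

# If the experiment pays off, merge it into the original branch.
# (To discard it instead: git branch -D exp-low-learning-rate)
git checkout "$base"
git merge exp-low-learning-rate
```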

4.  Version control your paper.

Why use a shared folder or email to edit your paper?  You can easily create a Git repository to collaborate and merge changes.  You can quickly see who contributed what to a paper.  Dario Taraborelli wrote about this a few months ago, though his point was that you would need your collaborators to be familiar with a version control system and they usually aren’t.  I am arguing that they should be.  On a side note, another VCS, Bazaar, is listed as an alternative in the comments to Dario’s post.
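To make the "who contributed what" point concrete, here is a small sketch with two made-up authors editing a hypothetical paper.tex.  In a real collaboration each author would commit from their own clone; the --author flag is used here only to simulate the second person inside one throwaway repository.

```shell
# Throwaway repository; the first author's identity is a placeholder.
repo="$(mktemp -d)"
cd "$repo"
git init
git config user.name "First Author"
git config user.email "first@example.org"

# The first author drafts the introduction.
printf 'Introduction goes here.\n' > paper.tex
git add paper.tex
git commit -m "Draft introduction"

# A collaborator adds a section; --author records who wrote the change.
printf 'Related work goes here.\n' >> paper.tex
git commit -am "Add related work" --author="Second Author <second@example.org>"

# Number of commits per author.
git shortlog -sn HEAD

# Line-by-line attribution for the paper.
git blame paper.tex
```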

5.  Convert into an open source project.

SourceForge has been around for a while, but the UI is absolute garbage.  There is an even better solution out there:  GitHub.  GitHub is free for open source projects and offers some great visualizations for helping you track the life of your open source project.  Of course, there is Google Code, which is quite nice and easy to use, but it doesn’t support Git, just Subversion.  Another drawback to using Google Code is that you have a lifetime max of 10 open source projects.  No such limit with GitHub.  Moving your Git repository to GitHub is also simple:  create an empty repository there and push your project to it.
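Publishing an existing repository to a hosting site usually comes down to adding a remote and pushing.  The sketch below assumes nothing about any particular host:  a local bare repository stands in for the empty repository you would create on the site, and with a real host you would pass the URL it gives you to git remote add instead.  The README file and identity are placeholders.

```shell
work="$(mktemp -d)"
hosted="$(mktemp -d)"

# Stand-in for the empty repository created on the hosting site.
git init --bare "$hosted"

# An existing local project with one commit (placeholder identity).
cd "$work"
git init
git config user.name "Example Researcher"
git config user.email "researcher@example.org"
echo "Research code" > README
git add README
git commit -m "Initial commit"

# Point the local repository at the hosted one.
git remote add origin "$hosted"

# Publish the current branch and remember where it was pushed.
git push -u origin "$(git symbolic-ref --short HEAD)"
```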

Why does this even matter?  Check out Ted Pedersen’s “Empiricism is not a matter of faith” in the September issue of Computational Linguistics.  He contends that you should create academic software with the goal of releasing it.  This ensures the survivability of your project, increases the impact of your work, and allows reproducibility of your results.  Git makes that easier, n’est-ce pas?

6.  Keep track of your grad students.

Suspect your grad students are slacking?  Check the commit logs!  And now I prepare for hate mail from grad students.  However, I think that if I had had this form of accountability, it would have made me more productive.  Of course, you don’t need Git for this; any version control system would do.  But of all the systems I’ve used, Git’s presentation of changes is the most user-friendly.

7.  Version control helps you write the paper.

When it comes time to write the paper, the version control logs can be used to provide a roadmap of what you have done.  Even though you probably have kept good notes, version control keeps a calendar of events that can add useful perspective (or fill in gaps when your notes are inadequate).
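A rough sketch of pulling such a calendar out of the history follows.  The commit messages and dates are invented, commit_on is a hypothetical helper that backdates empty commits so a few lines can stand in for months of work, and the log format string is just one possible choice.

```shell
# Throwaway repository with placeholder identity.
repo="$(mktemp -d)"
cd "$repo"
git init
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# Hypothetical helper: commit_on DATE MESSAGE
# Backdates an empty commit so the sketch needs no real files.
commit_on() {
  GIT_AUTHOR_DATE="${1}T12:00:00" GIT_COMMITTER_DATE="${1}T12:00:00" \
    git commit --allow-empty -m "$2"
}

commit_on 2008-09-01 "Implement baseline"
commit_on 2008-10-15 "Add second feature set"
commit_on 2008-11-20 "Run final evaluation"

# Oldest first: one line per commit with its date and subject.
git log --reverse --date=short --pretty=format:'%ad  %s'
```

Adding --author or --since to the git log call narrows the roadmap to one person or one period.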

8.  Git is faster and leaner than other version control systems.

Because you have the complete repository on your own system, most operations are much faster in Git.  The Git project reports an order-of-magnitude speedup for some operations, and says its packed storage format uses less space in most circumstances as well.  Git has been reported to be almost three times more space-efficient than Bazaar, the other distributed version control system mentioned above.  Git also offers an easy binary search through history (git bisect) for locating the commit that introduced a bug.
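The binary search is done with git bisect.  Below is a minimal sketch on a made-up history of eight commits in which the fifth one introduces a "bug", represented here by the hypothetical file status.txt no longer containing the word ok.  The test command passed to git bisect run is an assumption for this toy setup; in practice it would be whatever script tells you the code is broken.

```shell
# Throwaway repository with placeholder identity.
repo="$(mktemp -d)"
cd "$repo"
git init
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# Build eight commits; from step 5 onward status.txt says "broken".
for i in 1 2 3 4 5 6 7 8; do
  if [ "$i" -ge 5 ]; then echo "broken" > status.txt; else echo "ok" > status.txt; fi
  echo "step $i" > step.txt
  git add status.txt step.txt
  git commit -q -m "Step $i"
done

# Mark the search range: the newest commit is bad, the very first one is good.
first="$(git rev-list --max-parents=0 HEAD)"
git bisect start HEAD "$first"

# Let Git drive the search: exit status 0 means good, 1 means bad.
git bisect run grep -q ok status.txt

# While bisecting, refs/bisect/bad names the first bad commit found.
culprit="$(git log -1 --pretty=format:%s refs/bisect/bad)"
echo "First bad commit: $culprit"

# Return to the branch you started from.
git bisect reset
```

With eight commits the search needs only a few checkouts rather than testing every revision in turn.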

9.  Version control makes it easier to bring new team members up to speed.

Speaking from experience, having a record of commits (and well-documented commits) makes it easier to come up to speed on an existing project.  This applies not only to academic coding but to any coding endeavor.  Good documentation doesn’t hurt either.

10.  Save yourself some headaches.

I think you’ll minimize headaches if you use Git.  If not Git, at least use some version control system.  A lot of the things I listed above are covered by most version control systems, but Git combines regular advantages of version control in a way that is very friendly to non-linear coding situations.  Git also makes it a cinch to move your code into an open source project that can have a significant impact on your career as a researcher.  And Git is so easy to use, you have to ask yourself, why not?


Comments
  1. DrNI@CLB says:

    Actually I have some longish post about software and science in the drafts section of my blog and I can’t find the time to finish it. In general, I agree with what you write about grad students and programming. However, there is no need to pick on them alone. I have seen too much bad code written by teaching staff.
    Especially since I took this course in Software Architecture at our computer science department, I’m regularly getting sick when I have to dive into other people’s code.

  2. Jon Elsas says:

    nice post & very good points. I’ve been trying to dump code into a local SVN repository for a year or so, and have recently tried out Git. seems great, but not quite worked into my flow yet.

  3. jweathers777 says:

    Great points. It really does seem an especially great fit with projects that are rapidly evolving and needing to try various experimental branches.

    I’m enjoying using Git at home for my Ruby based chess/shogi project. Already, I’ve used its easy branching to create a new branch when I reached a crossroads point where I suspected I needed to alter my approach forward, but didn’t want to absolutely commit to it by abandoning my old approach.

    I’ve definitely noticed the speed improvements over other tools and cannot emphasize enough how much easier it is to start a project with version control in Git since you don’t have to fiddle around with setting up a proper network based repository until you get to the point where others are collaborating.

  4. Cheers for spreading the word of using version control for electronic notebooks! I totally agree, and appreciated it so much during my PhD… not having to worry about backups, because I shared my repository (SVN back then, now all Git) between machines… the latter is even so trivial with Git :)

    I think that universities that still teach students how to write HTML have to stop doing that, and use that time to teach students about Git!

  5. jhumphries says:

    I never investigated Git, but I did look into various source control systems over a year ago. Our team at work uses CVS, and we have several issues with it. At one point we were going to move to Subversion, which resolves some of the grievances we have against CVS, but I was drawn towards distributed systems.

    They seem to make collaboration easier and would better facilitate code reviews (I wrote a tool that allows us to easily do code reviews now – it simply rewrites all of your CVS/Root files w/ the reviewer’s name and creates a network share for the reviewer to examine all of your changes).

    They may also make managing our branches easier – but I never actually tested a system out to make sure.

    At the time, I was investigating Mercurial. I’ve heard many good things about it. Sun Microsystems switched to it recently I think (relatively speaking – maybe a little over a year ago?).

    Anyhow, there is no support at work for any new version control system. Our only such champion left (John). Our organization is large enough that it is too difficult to roll it out in scale. And management would like engineering skills to be fungible – which means not having to learn new tools when moving from one team to another. So we stick with CVS (or Microsoft Team… about which I’ve heard nothing good – especially regarding the price!).

  6. Jason Adams says:

    I’ve only used CVS to check out stuff, but I’ve heard horror stories. It sucks when a bad piece of software gets entrenched. I understand the business motivation: it would cost them money and time, and it’s potentially risky to switch over to something new. Not everyone cares enough about their version control system to even be bothered with changing. The remaining few who want to learn something new are squelched.

    I haven’t used Mercurial, but I came across it. One comparison somebody ran had git outperforming it on speed and storage space, but single reports like that I take with a grain of salt. More evidence is needed.

  7. Fadzlan says:

    I’m not sure about the speed of Mercurial vs git, but in terms of space, Mercurial sure uses a lot more space, since a branch is another copy of the repo, whereas git’s branching is in the repository itself.
