Git – Combining Projects

There are many reasons to combine outside projects with your own. A
submodule is simply a project that forms a part of
your own Git repository but also exists independently in its own source
control repository. This chapter discusses why developers create submodules
and how Git attempts to deal with them.

Earlier in this book, we worked with a repository named public_html that we imagine contains your website. If your website relies on an AJAX library such as Prototype or jQuery, then you’ll need to have a copy of that library somewhere inside public_html. Not only that: you’d like to be able to update that library automatically, see what has changed when you do, and maybe even contribute changes back to the authors. Or perhaps, as Git allows and encourages, you want to make changes and not contribute them back but still be able to update your repository to their latest version.

Git does make all these things possible. But here’s the bad news: Git’s initial support for
submodules was unapologetically awful, for the simple reason that none of the Git developers had
a need for them. At the time that this book is being written, the situation has only recently
started to improve.

In the beginning, there were only two major projects that used Git—Git itself and
the Linux Kernel. These projects have two important things in common: they were both originally
written by Linus Torvalds, and they both have virtually no dependencies on any outside project.
Where they’ve borrowed code from other projects, they’ve imported it directly and made it their
own. There’s no intention of ever trying to merge that code back into someone else’s project.
Such an occurrence would be rare, and it would be easy enough to generate some diffs by hand and
submit them back to the other project.

If your project’s submodules are like that, where you import once and leave the old project behind forever, then you don’t need this chapter. You already know enough about Git to simply add a directory full of files.

On the other hand, sometimes things get more complicated. One common situation at many companies is to have a lot of applications that rely on a common utility library or set of libraries. You want each of your applications to be developed, shared, branched, and merged in its own Git repository, either because that’s the logical unit of separation, or perhaps because of code ownership issues.

But dividing your applications up this way creates a problem: what about the shared library? Each application relies on a particular version of the shared library, and you need to keep track of exactly which version. If someone upgrades the library by accident to a version that hasn’t been tested, they might end up breaking your application. Yet the utility library isn’t developed all by itself; usually people are tweaking it to add new features that they need in their own applications. Eventually, they want to share those new features with everybody else writing other applications; that’s what a utility library is for.

What can you do? That’s what this chapter is about. I discuss several strategies in common use—although some people might not dignify them with that term, preferring to call them hacks—and end with the most sophisticated solution, submodules.

The Old Solution: Partial Checkouts

A popular feature in many VCSs, including CVS and Subversion, is called a partial checkout. With a partial checkout, you choose to retrieve only a particular subdirectory or subdirectories of the repository and work just in there.[32]

If you have a central repository that holds all
your projects, partial checkouts can be a workable way of handling
submodules. Simply put your utility library in one subdirectory and put
each application using that library in another directory. When you want to
get one application, just check out two subdirectories (the library and
the application) instead of checking out all directories: that’s a partial
checkout.

One benefit of using partial checkouts is that you don’t have to
download the gigantic, full history of every file. You download just
the files you need for a particular revision of a particular project. You
may not even need the full history of just those files; the current
version alone may suffice.

This technique was especially popular in an older VCS: CVS. CVS has no conceptual
understanding of the whole repository; it only understands the history of individual files. In
fact, the history of the files is stored in the file itself. CVS’s repository format was so
simple that the repository administrator could make copies and use symbolic links between
different application repositories. Checking out a copy of each application would then
automatically check out a copy of the referenced files. You wouldn’t even have to know that
the files were shared with other projects.

This technique had its idiosyncrasies, but it has worked well on many projects for years. The KDE (K Desktop Environment) project, for example, encourages partial checkouts with their multigigabyte SVN repository.

Unfortunately, this idea isn’t compatible with distributed VCSs like Git. With Git, you don’t just download the current version of a selected set of files; you download all the versions of all the files. After all, every Git repository is a complete copy of the repository. Git’s current architecture doesn’t support partial checkouts well.[33]

As of this writing, the KDE project is considering a switch from SVN to Git, and submodules are their main point of contention. An import of the entire KDE repository into Git is still several gigabytes in size. Every KDE contributor would have to have a copy of all that data, even if they wanted to work on only one application. But you can’t just make one repository per application: each application depends on one or more of the KDE core libraries.

For KDE to successfully switch to Git, it needs an alternative to
huge, monolithic repositories using simple partial checkouts. For example,
one experimental import of KDE into Git separated the code base into
roughly 500 separate repositories.[34]

The Obvious Solution: Import the Code into Your Project

Let’s revisit one of the options glossed over earlier: why not just
import the library into your own project in a subdirectory? Then you can
copy in a new set of files if you ever want to update the library.

Depending on your needs, this method can actually work just fine. It
has these advantages:

  • You never end up with the wrong library version by
    accident.

  • It’s extremely simple to explain and understand, and it relies
    only on everyday Git features.

  • It works exactly the same way whether the external library is maintained in Git, some other VCS, or no VCS at all.

  • Your application repository is always self-contained, so a
    git clone of your application
    always includes everything your application needs.

  • It’s easy to apply application-specific patches to the library
    in your own repository, even if you don’t have commit access to the
    library’s repository.

  • Branching your application also makes a branch of the library,
    exactly as you’d expect.

  • If you use the subtree merge strategy (as described in the section
    Specialty Merges) in your git pull -s subtree command, then updating
    to newer versions of the library is just as easy as updating any
    other part of your project.

Unfortunately, there are also some disadvantages:

  • Each application that imports the same library duplicates that
    library’s files. There’s no easy way to share those Git objects
    between repositories. If KDE did this, for example, and you
    did want to check out the entire project—say,
    because you’re building the KDE distribution packages for Debian or
    Red Hat—then you would end up downloading the same library files
    dozens of times.

  • If your application makes changes to its copy of the library,
    then the only way to share those changes is by generating diffs and
    applying them to the main library’s repository. This is OK if you do
    it rarely, but it’s a lot of tedious work if you do it
    frequently.

For many people and many projects, these disadvantages aren’t very
serious. You should consider using this technique if you can, because its
simplicity often outweighs its disadvantages.

If you’re familiar with other VCSs, particularly CVS, you may have had some bad experiences that make you want to avoid this method. You should be aware that many of those problems do not arise in Git. For example:

  • CVS didn’t support file or directory renames, and its features
    (e.g., vendor branches) for importing new upstream
    packages meant it was easy to make mistakes. One common mistake was to
    forget to delete old files when merging in new versions, which would
    result in strange inconsistencies. Git doesn’t have this problem
    because importing any package is a simple matter of deleting a
    directory, recreating it, and using git add --all.

  • Importing a new module can be a multistep process requiring several commits, and you might make mistakes. In CVS or SVN, such mistakes form a permanent part of the repository’s history. This is normally harmless, but making mistakes can unnecessarily bloat the repository when importing huge files. With Git, if you screw up, then you simply throw away the erroneous commits before pushing them to anyone.

  • CVS made it hard to follow the history of branches. If you imported upstream version 1.0, then applied some of your own changes, and then wanted to import version 2.0, it was complicated to extract your local changes and re-apply them. Git’s improved history management makes this much easier.

  • Some VCSs are very slow when checking for changes through a huge number of files. If you import several large packages using this technique, then the everyday speed impact could cancel out the anticipated productivity gains from including submodules in your repository. Git, however, has been optimized for dealing with tens of thousands of files in one project, so this is unlikely to be a problem.

If you do decide to handle submodules by just importing them
directly, there are two ways to proceed: by copying the files manually or
by importing the history.

Importing Subprojects by Copying

The most obvious way to import another project’s files into your
project is by simply copying them. In fact, if the other project isn’t
stored in a Git repository, this is your only option.

The steps for doing this are exactly as you might expect: delete
any files already in that directory, create the set of files you want
(e.g., by extracting a tarball or ZIP file containing the library you
want to import), and then git add
them. For example:

    $ cd myproject.git
    $ rm -rf mylib
    $ git rm -r mylib
    $ tar -xzf /tmp/mylib-1.0.5.tar.gz
    $ mv mylib-1.0.5 mylib
    $ git add mylib
    $ git commit

This method works fine, with the following caveats:

  • Only the exact versions of the library you import will appear in your Git history. Compared to our next alternative—including the complete history of the subproject—you might actually find this convenient, because it keeps your log files clean.

  • If you make application-specific changes to the library files, then you’ll have to re-apply those changes whenever you import a new version. For example, you’ll have to manually extract the changes through git diff and incorporate them through git apply (see Chapter 8 or Chapter 14 for more information). Git won’t do this automatically.

  • Importing a new version requires you to rerun the full command
    sequence removing and adding files every time; you can’t just
    git pull.

On the other hand, copying is easy to understand and explain to your coworkers.
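
The second caveat, carrying application-specific changes across an import, can be sketched with git diff and git apply. This is only an illustration under stated assumptions: the directories mylib-1.0 and mylib-2.0 are hypothetical stand-ins for two unpacked release tarballs, and local.patch is an arbitrary filename chosen for the example.

```shell
#!/bin/sh
# Sketch: re-import a library by copying while carrying a local patch forward.
# mylib-1.0, mylib-2.0, and local.patch are hypothetical names.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Two unpacked "releases" of the library stand in for extracted tarballs.
mkdir mylib-1.0 mylib-2.0
echo 'util v1' > mylib-1.0/util.txt
echo 'option=default' > mylib-1.0/config.txt
echo 'util v2' > mylib-2.0/util.txt
echo 'option=default' > mylib-2.0/config.txt

git init -q myproject
cd myproject

# Import release 1.0 by copying, then make an application-specific change.
cp -R ../mylib-1.0 mylib
git add mylib
git commit -q -m 'import mylib 1.0'
import_commit=$(git rev-parse HEAD)

echo 'option=custom' > mylib/config.txt
git commit -q -a -m 'local tweak to mylib'

# Save the local changes made since the import as a patch.
git diff "$import_commit" HEAD -- mylib > ../local.patch

# Replace the directory with release 2.0, then re-apply the patch.
git rm -r -q mylib
cp -R ../mylib-2.0 mylib
git apply ../local.patch
git add mylib
git commit -q -m 'import mylib 2.0, keeping local tweak'

# Show the upgraded file alongside the preserved local change.
cat mylib/util.txt mylib/config.txt
```

The patch applies cleanly here only because release 2.0 left config.txt untouched. When upstream changes the same lines, git apply may refuse the patch and you will have to resolve the differences by hand.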

Importing Subprojects with git pull -s subtree

Another way to import a subproject into yours is by merging the
entire history from that subproject. Of course, it works only if the
subproject’s history is already stored in Git.

This is a bit tricky to set up for the first time; however, once you’ve done the work, future merges are much easier than with the simple file-copying method. Because Git knows the entire history of the subproject, it knows exactly what needs to happen every time you need to do an update.

Let’s say you want to write a new application called myapp and you want to include a copy of the
Git source code in a directory called git. First, let’s create the new repository
and make the first commit. (If you already have a myapp project, you can skip this part.)

    $ cd /tmp
    $ mkdir myapp
    $ cd myapp

    $ git init
    Initialized empty Git repository in /tmp/myapp/.git/

    $ ls

    $ echo hello > hello.txt

    $ git add hello.txt

    $ git commit -m 'first commit'
    Created initial commit 644e0ae: first commit
     1 files changed, 1 insertions(+), 0 deletions(-)
     create mode 100644 hello.txt

Next, import the git project
from your local copy, assumed to be ~/git.git.[35] The first step is just like the one in the previous
section: extract a copy of it into a directory called git, then commit it.

The following example takes a particular version of the git.git project, denoted by the tag v1.6.0. The command git archive v1.6.0 creates a tar file of all the v1.6.0 files. They are then extracted into the new git subdirectory:

    $ ls
    hello.txt

    $ mkdir git

    $ cd git
    $ (cd ~/git.git && git archive v1.6.0) | tar -xf -

    $ cd ..

    $ ls
    git/  hello.txt

    $ git add git

    $ git commit -m 'imported git v1.6.0'
    Created commit 72138f0: imported git v1.6.0
     1440 files changed, 299543 insertions(+), 0 deletions(-)

So far, you’ve imported the (initial) files by hand, but your
myapp project still doesn’t know
anything about the history of its submodule. Now
you must inform Git that you have imported v1.6.0, which means you also should have the
entire history up to v1.6.0. To do
that, use the -s ours merge strategy (from Chapter 9) with your git pull
command. Recall that -s ours just means “record that we’re doing a merge,
but my files are the right files, so don’t actually change anything.”

Git isn’t matching up directories and file contents between your project and the imported project or anything like that. Instead Git is only importing the history and tree pathnames as they are found in the original subproject. We’ll have to account for this relocated directory basis later, though.

Simply pulling v1.6.0 doesn’t work, which is due to a peculiarity of
git pull.

    $ git pull -s ours ~/git.git v1.6.0
    fatal: Couldn't find remote ref v1.6.0
    fatal: The remote end hung up unexpectedly

This might change in a future version of Git, but for now the
problem is handled by explicitly spelling out refs/tags/v1.6.0, as described in refs and symrefs of Chapter 6:

    $ git pull -s ours ~/git.git refs/tags/v1.6.0
    warning: no common commits
    remote: Counting objects: 67034, done.
    remote: Compressing objects: 100% (19135/19135), done.
    remote: Total 67034 (delta 47938), reused 65706 (delta 46656)
    Receiving objects: 100% (67034/67034), 14.33 MiB | 12587 KiB/s, done.
    Resolving deltas: 100% (47938/47938), done.
    From ~/git.git
     * tag               v1.6.0     -> FETCH_HEAD
    Merge made by ours.

If all the v1.6.0 files were
already committed, then you might think there was no work left to do. On
the contrary, Git just imported the entire history
of git.git up to v1.6.0, so even though the files are the same
as before, our repository is now a lot more complete. To be sure,
let’s check that the merge commit we just created didn’t really
change any files:

    $ git diff HEAD^ HEAD

You shouldn’t get any output from this command, which means the
files before and after the merge are exactly the same. Good.

Now let’s see what happens if we make some local changes to our
subproject and then try to upgrade it later. First, make a simple
change:

    $ cd git

    $ echo 'I am a git contributor!' > contribution.txt

    $ git add contribution.txt

    $ git commit -m 'My first contribution to git'
    Created commit 6c9fac5: My first contribution to git
     1 files changed, 1 insertions(+), 0 deletions(-)
     create mode 100644 git/contribution.txt

Our version of the Git subproject is now v1.6.0 with an extra patch.

Finally, let’s upgrade our Git to the v1.6.0.1 tag, but without losing our additional
contribution. It’s as easy as this:

    $ git pull -s subtree ~/git.git refs/tags/v1.6.0.1
    remote: Counting objects: 179, done.
    remote: Compressing objects: 100% (72/72), done.
    remote: Total 136 (delta 97), reused 100 (delta 61)
    Receiving objects: 100% (136/136), 25.24 KiB, done.
    Resolving deltas: 100% (97/97), completed with 40 local objects.
    From ~/git.git
     * tag               v1.6.0.1   -> FETCH_HEAD
    Merge made by subtree.

Warning

Don’t forget to specify the -s subtree merge strategy in your pull. The
merge might have worked even without -s subtree, because Git knows how to
deal with file renames and we do have a lot of renames: all the files from
the git.git project have been moved from the root directory of the project
into a subdirectory called git. The -s subtree flag tells Git to look
right away for that situation and deal with it. To be safe, you should
always use -s subtree when merging a subproject into a subdirectory
(except during the initial import, where we’ve seen that you should use
-s ours).

Was it really that easy? Let’s check that the files have been updated correctly. Because
all the files in v1.6.0.1 were in the root directory and
are now in the git directory, we must use some unusual
selector syntax with git diff. In this case, what we’re
saying is: “Tell me the difference between the commit from which
we merged (i.e., parent #2, which is v1.6.0.1) and what
we merged into, the HEAD version.” Because the latter is in the git
directory, we have to specify that directory after a colon. The former is in its root
directory, so we can omit the colon and default the directory.

The command and its output look like this:

    $ git diff HEAD^2 HEAD:git
    diff --git a/contribution.txt b/contribution.txt
    new file mode 100644
    index 0000000..7d8fd26
    --- /dev/null
    +++ b/contribution.txt
    @@ -0,0 +1 @@
    +I am a git contributor!

It worked! The only difference from v1.6.0.1 is the change we
applied earlier.

How did we know it was HEAD^2?
After the merge, you can inspect the commit and see which branch
HEADs were merged:

    Merge: 6c9fac5... 5760a6b...

As with any merge, those are HEAD^1 and HEAD^2. You should recognize the
latter:

    commit 5760a6b094736e6f59eb32c7abb4cdbb7fca1627
    Author: Junio C Hamano <gitster@pobox.com>
    Date:   Sun Aug 24 14:47:24 2008 -0700

        GIT 1.6.0.1

        Signed-off-by: Junio C Hamano <gitster@pobox.com>

If your situation is a bit more complex, you might need to place your subproject deeper into your repository structure and not right at the top level as shown in this example. For instance, you might instead need other/projects/git. Git doesn’t automatically keep track of the directory relocation that happens when you import it. Thus, as before, you would need to spell out the full path to the imported subproject:

    $ git diff HEAD^2 HEAD:other/projects/git

You can also break down our contributions to the git directory one commit at a time:

    $ git log --no-merges HEAD^2..HEAD
    commit 6c9fac58bed056c5b06fd70b847f137918b5a895
    Author: Jon Loeliger <jdl@example.com>
    Date:   Sat Sep 27 22:32:49 2008 -0400

        My first contribution to git

    commit 72138f05ba3e6681c73d0585d3d6d5b0ad329b7c
    Author: Jon Loeliger <jdl@example.com>
    Date:   Sat Sep 27 22:17:49 2008 -0400

        imported git v1.6.0

Using -s subtree, you can merge and remerge updates from the main git.git project into your subproject as many times as you want, and it will work just as if you simply had your own fork of the git.git project all by itself.
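
The whole sequence in this section can be replayed against a tiny stand-in for git.git. The sketch below makes several assumptions: the repository name upstream, the subdirectory lib, and the tags v1.0 and v2.0 are all hypothetical. It also assumes a recent Git, which needs --allow-unrelated-histories for the initial -s ours merge (Git 2.9 and later refuse to merge unrelated histories by default) and an explicit --no-rebase so that git pull knows to merge.

```shell
#!/bin/sh
# Sketch: replay the subtree import against a tiny stand-in upstream.
# "upstream", "lib", v1.0, and v2.0 are hypothetical names.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# A small upstream project with two tagged releases.
git init -q upstream
cd upstream
echo 'alpha v1' > a.txt
echo 'beta' > b.txt
echo 'gamma' > c.txt
git add .
git commit -q -m 'release 1.0'
git tag v1.0
echo 'alpha v2' > a.txt
git commit -q -a -m 'release 2.0'
git tag v2.0
cd ..

# The application repository.
git init -q myapp
cd myapp
echo hello > hello.txt
git add hello.txt
git commit -q -m 'first commit'

# Step 1: copy the v1.0 files into lib/ and commit them.
mkdir lib
git -C ../upstream archive v1.0 | tar -xf - -C lib
git add lib
git commit -q -m 'imported upstream v1.0'

# Step 2: record the upstream history with -s ours (no files change).
git pull -q --no-rebase --no-edit -s ours --allow-unrelated-histories \
    ../upstream refs/tags/v1.0

# Step 3: make a local change inside the subproject.
echo 'I am a contributor!' > lib/contribution.txt
git add lib/contribution.txt
git commit -q -m 'my first contribution'

# Step 4: upgrade to v2.0 with the subtree strategy.
git pull -q --no-rebase --no-edit -s subtree ../upstream refs/tags/v2.0

# Only the local contribution should differ from upstream v2.0.
git diff --name-status 'HEAD^2' HEAD:lib
```

The final git diff compares the root tree of the merged upstream commit with the lib subtree of the application, just as in the walkthrough above.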

Submitting Your Changes Upstream

Although you can easily merge history into your subproject, taking it out again is much harder. That’s because this technique doesn’t maintain a separate history of the subproject. It has only the history of the whole application project, including its subproject.

You could still merge your project’s history back into git.git using the -s subtree
merge strategy, but the result would be unexpected: you’d end up
importing all the commits from your entire
application project and then recording a deletion of all the files
except those in the git directory
at the point of the final merge.

Although such a merged history would be technically correct, it’s
just plain wrong to place the history of your entire application into
the repository holding the submodule. It would also mean that all the
versions of all the files in your application would become a permanent
part of the git project. They don’t
belong there, and it would be a time sink, produce an enormous amount of
irrelevant information, and waste a lot of effort. It’s the wrong
approach.

Instead, you’ll have to use alternative methods, such as git format-patch (discussed in Chapter 14). This requires more steps than a simple git pull. Luckily, you only have to approach the problem when contributing changes back to the subproject, not in the much more common case of pulling subproject changes into your application.
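
One possible way to do this, offered as a sketch and not as the only method, is to export a commit with paths rewritten relative to the subproject directory and then apply it in a clone of the subproject. The repository names upstream and myapp and the directory lib are hypothetical; the --relative option and git am are standard Git, but whether this workflow suits a given upstream project is an assumption.

```shell
#!/bin/sh
# Sketch: send a change made under lib/ back to the subproject as a patch.
# "upstream", "myapp", and "lib" are hypothetical names.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# The subproject's own repository.
git init -q upstream
cd upstream
echo 'alpha' > a.txt
git add a.txt
git commit -q -m 'upstream initial'
cd ..

# The application, with the subproject's files imported under lib/.
git init -q myapp
cd myapp
echo hello > hello.txt
mkdir lib
cp ../upstream/a.txt lib/a.txt
git add .
git commit -q -m 'app with imported lib'

# A commit that touches only the subproject directory.
echo 'I am a contributor!' > lib/contribution.txt
git add lib/contribution.txt
git commit -q -m 'Add contribution note'

# Export that commit with paths made relative to lib/ ...
git format-patch -q -1 --relative=lib/ -o "$work/patches" HEAD

# ... and apply it in the subproject's repository.
cd ../upstream
git am -q "$work"/patches/*.patch
git log --oneline -1
```

Commits that mix application files with subproject files need more care, because --relative silently drops the parts of a commit that fall outside the named directory.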

The Automated Solution: Checking out Subprojects Using Custom Scripts

After reading the previous section, you might have reasons not to
copy the history of your subproject directly into a subdirectory of your
application. After all, anyone can see that the two projects are separate:
your application depends on the library, but they are obviously two
different projects. Merging the two histories together doesn’t feel like a
clean solution.

There are other ways to do it that you might like better. Let’s look
at one obvious method: simply git clone
the subproject into a subdirectory by hand every time you clone the main
project, like this:

    $ git clone myapp myapp-test
    $ cd myapp-test
    $ git clone ~/git.git git
    $ echo git >.gitignore

This method is reminiscent of the partial checkout method in SVN or CVS. Instead of checking out just a few subdirectories of one huge project, you check out two small projects, but the idea is the same.

This method of handling submodules has a few key advantages:

  • The submodule doesn’t have to be in Git; it can be in any VCS or it can just be a tar or ZIP file from somewhere. Because you’re retrieving the files by hand, you can retrieve them from anywhere you want.

  • The history of your main project never gets mixed up with the
    history of your subprojects. The log doesn’t become crowded with
    unrelated commits, and the Git repository itself stays small.

  • If you make changes to the subproject, you can contribute them back exactly as if you were working on the subproject by itself, because, in essence, you are.

Of course, there are also some problems that you need to deal
with:

  • Explaining to other users how to check out all the subprojects
    can be tedious.

  • You need to somehow ensure that you get the right revision of each
    subproject.

  • When you switch to a different branch of your main project or
    when you git pull changes from
    someone else, the subproject isn’t updated automatically.

  • If you make a change to the subproject, you must remember to
    git push it separately.

  • If you don’t have rights to contribute back to the subproject
    (i.e., commit access to its repository), then you may not be able to
    easily make application-specific changes. (If the subproject is in
    Git, you can always put a public copy of your changes somewhere, of
    course.)

In short, cloning subprojects by hand gives you infinite
flexibility, but it’s easy to over-complicate things or to make
mistakes.

If you choose to use this method, the best approach is to
standardize it by writing some simple scripts and including them in your
repository. For example, you might have a script called ./update-submodules.sh that clones and/or
updates all your submodules automatically.

Depending on how much effort you want to put in, such a script could update your submodules to particular branches or tags or even to particular revisions. You could hard-code commit IDs into the script, for example, and then commit a new version of the script to your main project whenever you want to update your application to a new version of the library. Then, when people check out a particular revision of your application, they can run the script to automatically derive the corresponding version of the library.
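
A minimal sketch of such a script follows. The function name update_subproject, the directory lib, and the toy repository libsrc are hypothetical. In a real script the commit IDs would be hard-coded; here they are captured from a throwaway repository so the example can run on its own.

```shell
#!/bin/sh
# Sketch of an update-submodules.sh-style helper with pinned revisions.
# update_subproject, libsrc, and lib are hypothetical names.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# update_subproject PATH URL REV: clone PATH from URL if it is missing,
# otherwise fetch from its origin, then check out the pinned revision REV.
update_subproject() {
    path=$1 url=$2 rev=$3
    if [ -d "$path/.git" ]; then
        git -C "$path" fetch -q origin
    else
        git clone -q "$url" "$path"
    fi
    git -C "$path" checkout -q "$rev"
}

work=$(mktemp -d)
cd "$work"

# A toy library with two revisions stands in for the real subproject.
git init -q libsrc
cd libsrc
echo v1 > lib.txt
git add lib.txt
git commit -q -m 'lib v1'
pin1=$(git rev-parse HEAD)
echo v2 > lib.txt
git commit -q -a -m 'lib v2'
pin2=$(git rev-parse HEAD)
cd ..

mkdir myapp
cd myapp

# First run: clones the library and pins it to the older revision.
update_subproject lib "$work/libsrc" "$pin1"
cat lib/lib.txt

# Later run, after the pin in the script was bumped: fetch and re-pin.
update_subproject lib "$work/libsrc" "$pin2"
cat lib/lib.txt
```

Because the pinned commit IDs live in a file that is committed to the main project, checking out an old revision of the application and rerunning the script reproduces the matching library version.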

You might also think about creating a commit or update hook, using the techniques of Chapter 15, which prevents you from accidentally committing to your main project unless your changes to the subproject are properly committed and pushed.
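
The check that such a hook performs might look like the following sketch. The function name subproject_ready is hypothetical, and the test for pushed commits (asking whether any remote-tracking branch contains HEAD) is just one plausible definition of “properly pushed.” A real pre-commit hook would call the function and exit with a nonzero status when it fails.

```shell
#!/bin/sh
# Sketch of a check that a pre-commit hook could run on a subproject.
# subproject_ready, libsrc, and lib are hypothetical names.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# subproject_ready DIR: succeed only if DIR has no uncommitted changes
# and its HEAD commit is already on some remote-tracking branch.
subproject_ready() {
    dir=$1
    if [ -n "$(git -C "$dir" status --porcelain)" ]; then
        echo "$dir: uncommitted changes" >&2
        return 1
    fi
    if [ -z "$(git -C "$dir" branch -r --contains HEAD)" ]; then
        echo "$dir: HEAD has not been pushed" >&2
        return 1
    fi
}

work=$(mktemp -d)
cd "$work"

# A toy library and a clone of it that plays the role of the subproject.
git init -q libsrc
(cd libsrc && echo v1 > lib.txt && git add lib.txt && git commit -q -m 'lib v1')
git clone -q libsrc lib

# Freshly cloned: nothing to commit, nothing to push.
subproject_ready lib && echo 'ok: clean and pushed'

# An uncommitted edit blocks the check.
echo change >> lib/lib.txt
subproject_ready lib || echo 'blocked: uncommitted changes'

# A commit that exists only locally blocks it as well.
git -C lib commit -q -a -m 'local change'
subproject_ready lib || echo 'blocked: not pushed'
```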

You can well imagine that, if you want to manage your subprojects this way, then other people do, too. Thus, scripts to standardize and automate this process have already been written. One such script, by Miles Georgi, is called externals (or ext). You can find it at http://nopugs.com/ext-tutorial. Conveniently, ext works for any combination of SVN and Git projects and subprojects.

The Native Solution: gitlinks and git submodule

Git contains a command designed to work with submodules called git submodule. I saved it for last for two reasons:

  • It is much more complicated than simply importing the history of
    subprojects into your main project’s repository.

  • It is fundamentally the same as but more restrictive than the
    script-based solution just discussed.

Even though it sounds like Git submodules should be the natural
option, you should consider carefully before using them.

Git’s submodule support is evolving fast. The first mention of
submodules in Git development history was by Linus Torvalds in April 2007,
and there have been numerous changes since then. That makes it something
of a moving target, so you should check git help submodule in your version
of Git to find out if anything has changed since this book was written.

Unfortunately, the git submodule
command is not very transparent; you won’t be able to use it effectively
unless you understand exactly how it works. It’s a combination of two
separate features: so-called gitlinks and the actual git submodule command.

Gitlinks

A gitlink is a link from a tree object to a commit object.

Recall from Chapter 4 that each commit object points to a tree object and that each tree object points to a set of blobs and trees, which correspond (respectively) to files and subdirectories. A commit’s tree object uniquely identifies the exact set of files, filenames, and permissions attached to that commit. Also recall from Commit Graphs of Chapter 6, that the commits themselves are connected to each other in a DAG. Each commit object points to zero or more parent commits, and together they describe the history of your project.

But we haven’t yet seen a tree object pointing to a commit object.
The gitlink is Git’s mechanism to indicate a direct reference to another
Git repository.

Let’s try it out. As in Importing Subprojects with git pull -s subtree, we’ll create a
myapp repository and import the Git
source code into it:

    $ cd /tmp
    $ mkdir myapp
    $ cd myapp

    # Start the new super-project
    $ git init
    Initialized empty Git repository in /tmp/myapp/.git/

    $ echo hello >hello.txt

    $ git add hello.txt

    $ git commit -m 'first commit'
    [master (root-commit)]: created c3d9856: "first commit"
     1 files changed, 1 insertions(+), 0 deletions(-)
     create mode 100644 hello.txt

But this time, when we import the git project we’ll do so directly; we don’t use
git archive like we did last
time:

    $ ls
    hello.txt

    # Copy in a repository clone
    $ git clone ~/git.git git
    Initialized empty Git repository in /tmp/myapp/git/.git/

    $ cd git

    # Establish the desired submodule version
    $ git checkout v1.6.0
    Note: moving to "v1.6.0" which isn't a local branch
    If you want to create a new branch from this checkout, you may do so
    (now or later) by using -b with the checkout command again. Example:
      git checkout -b <new_branch_name>
    HEAD is now at ea02eef... GIT 1.6.0

    # Back to the super-project
    $ cd ..

    $ ls
    git/  hello.txt

    $ git add git

    $ git commit -m 'imported git v1.6.0'
    [master]: created b0814ac: "imported git v1.6.0"
     1 files changed, 1 insertions(+), 0 deletions(-)
     create mode 160000 git

Because there already exists a directory called git/.git (created during the git clone), git add git knows to create a gitlink to it.

Warning

Normally, git add git and
git add git/ (with the
POSIX-compatible trailing slash indicating that git must be a directory) would be
equivalent. But that’s not true if you want to create a gitlink! In
the sequence we just showed, adding a slash to make the command
git add git/ won’t create a gitlink
at all; it will just add all the files in the git directory, which is probably not what
you want.

Observe how the outcome of the preceding sequence differs from
that of the related steps in Importing Subprojects with git pull -s subtree. In that
section, the commit changed all the files in the repository. This time,
the commit message shows that only one file
changed. The resulting tree looks like this:

    $ git ls-tree HEAD
    160000 commit ea02eef096d4bfcbb83e76cfab0fcb42dbcad35e    git
    100644 blob ce013625030ba8dba906f756967f9e9ca394464a      hello.txt

The git subdirectory is of
type commit and has mode 160000. That
makes it a gitlink.
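
The same result can be reproduced with a much smaller nested repository. In this sketch the names libsrc, super, and lib are hypothetical. Note that recent versions of Git print a hint about adding an embedded repository at the git add step; the gitlink is still created.

```shell
#!/bin/sh
# Sketch: create a gitlink by adding a nested clone without a trailing slash.
# libsrc, super, and lib are hypothetical names.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# A toy project that will become the linked repository.
git init -q libsrc
cd libsrc
echo v1 > lib.txt
git add lib.txt
git commit -q -m 'lib v1'
cd ..

# The superproject.
git init -q super
cd super
echo hello > hello.txt
git add hello.txt
git commit -q -m 'first commit'

# Clone the toy project inside the superproject and add the directory.
# No trailing slash: this records a gitlink, not the individual files.
git clone -q ../libsrc lib
git add lib
git commit -q -m 'link lib'

# The lib entry has mode 160000 and type commit.
git ls-tree HEAD
```

The commit ID recorded for lib is whatever commit the nested clone has checked out at the moment of git add.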

Git usually treats gitlinks as simple pointer values or references
to other repositories. Most Git operations, such as clone, do not
dereference gitlinks and act on the submodule repository.

For example, if you push your project into another repository, it won’t push in the submodule’s commit, tree, and blob objects. If you clone your superproject repository, the subproject repository directories will be empty.

In the following example, the git subproject directory remains empty after
the clone command:

    $ cd /tmp

    $ git clone myapp app2
    Initialized empty Git repository in /tmp/app2/.git/

    $ cd app2

    $ ls
    git/  hello.txt

    $ ls git

    $ du git
    4       git

Gitlinks have the important feature that they link to objects that
are allowed to be missing from your repository.
After all, they’re supposed to be part of some other repository.

It is exactly because the gitlinks are allowed to be missing that
this technique even achieves one of the original goals: partial
checkouts. You don’t have to check out every subproject; you can check
out just the ones you need.

So now you know how to create a gitlink and that it’s allowed to
be missing. But missing objects aren’t very useful by themselves. How do
you get them back? That’s what the git submodule command is for.

The git submodule Command

At the time of this writing, the git submodule command is actually just a 700-line Unix shell script called git-submodule.sh. And if you’ve read this book all the way through to this point, you now know enough to write that script yourself. Its job is simple: to follow gitlinks and check out the corresponding repositories for you.

First of all, you should be aware that there’s no particular magic
involved in checking out a submodule’s files. In the app2 directory we just cloned, you could do
it yourself:

    $ cd /tmp/app2

    $ git ls-files --stage -- git
    160000 ea02eef096d4bfcbb83e76cfab0fcb42dbcad35e 0    git

    $ rmdir git

    $ git clone ~/git.git git
    Initialized empty Git repository in /tmp/app2/git/.git/

    $ cd git

    $ git checkout ea02eef
    Note: moving to "ea02eef" which isn't a local branch
    If you want to create a new branch from this checkout, you may do so
    (now or later) by using -b with the checkout command again. Example:
      git checkout -b <new_branch_name>
    HEAD is now at ea02eef... GIT 1.6.0
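The manual steps above can be collected into a small shell function. This is only a sketch of the idea, not the logic of the real git submodule script; the name manual_submodule_checkout is invented here, and the caller must supply the repository URL.

```shell
# manual_submodule_checkout PATH URL
# Hypothetical helper; run it from the top of the superproject's work tree.
manual_submodule_checkout() (
    path=$1
    url=$2
    # Field 2 of "git ls-files --stage" is the commit ID held by the gitlink.
    sha=$(git ls-files --stage -- "$path" | awk '$1 == "160000" { print $2 }')
    if [ -z "$sha" ]; then
        echo "no gitlink recorded at '$path'" >&2
        exit 1
    fi
    # Remove the empty placeholder directory, if any, so clone can create it.
    if [ -d "$path" ]; then
        rmdir "$path" || exit 1
    fi
    git clone -q "$url" "$path" || exit 1
    # Check out the recorded commit; the submodule ends up on a detached HEAD.
    git -C "$path" checkout -q "$sha"
)

# Usage, matching the transcript above:
#   cd /tmp/app2 && manual_submodule_checkout git ~/git.git
```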

The commands you just ran are equivalent to git submodule update. The only difference is
that git submodule does the tedious work for you, such as determining the
correct commit ID to check out. Unfortunately, it doesn’t know how to do
this without a bit of help:

    $ git submodule update
    No submodule mapping found in .gitmodules for path 'git'

The git submodule command needs
to know one important bit of information before it can do anything:
where can it find the repository for your submodule? It retrieves that
information from a file called .gitmodules, which looks like this:

    [submodule "git"]
            path = git
            url = /home/bob/git.git
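Because .gitmodules uses the same syntax as Git's other configuration files, it can be read with git config -f. The following sketch lists every mapping in such a file; the function name list_submodule_mappings is an invention for illustration.

```shell
# list_submodule_mappings [FILE]
# Hypothetical helper: print "<name><TAB><path><TAB><url>" for each
# [submodule "<name>"] section in FILE (default: .gitmodules).
list_submodule_mappings() (
    file=${1:-.gitmodules}
    # --get-regexp prints one "<key> <value>" pair per matching key.
    git config -f "$file" --get-regexp '^submodule\..*\.path$' |
    while read -r key path; do
        name=${key#submodule.}   # strip the leading "submodule."
        name=${name%.path}       # strip the trailing ".path"
        url=$(git config -f "$file" --get "submodule.$name.url")
        printf '%s\t%s\t%s\n' "$name" "$path" "$url"
    done
)

# Usage, from the top of the superproject:
#   list_submodule_mappings
```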

Using the file is a two-step process. First, create the .gitmodules file, either by hand or with
git submodule add. Because we created the gitlink using git add earlier,
it’s too late now for git submodule add, so just create the file by hand:

    $ cat >.gitmodules <<EOF
    [submodule "git"]
            path = git
            url = /home/bob/git.git
    EOF

Tip

The git submodule add command
that performs the same operations is:

    $ git submodule add /home/bob/git.git git

The git submodule add command
will add an entry to the .gitmodules file and populate a new Git
repository with a clone of the added repository.

Next, run git submodule init to
copy the settings from the .gitmodules file into your .git/config file:

    $ git submodule init
    Submodule 'git' (/home/bob/git.git) registered for path 'git'

    $ cat .git/config
    [core]
            repositoryformatversion = 0
            filemode = true
            bare = false
            logallrefupdates = true
    [remote "origin"]
            url = /tmp/myapp
            fetch = +refs/heads/*:refs/remotes/origin/*
    [branch "master"]
            remote = origin
            merge = refs/heads/master
    [submodule "git"]
            url = /home/bob/git.git

The git submodule init command
added only the last two lines.

The reason for this step is that you can reconfigure your local
submodules to point at a different repository from the one in the
official .gitmodules. If you make a
clone of someone’s project that uses submodules, you might want to keep
your own copy of the submodules and point your local clone at that. In
that case, you wouldn’t want to change the module’s official location in
.gitmodules, but you would want
git submodule to look at your
preferred location. So git submodule init copies any missing submodule
information from .gitmodules into .git/config, where you can safely edit it.
Just find the [submodule] section
referring to the submodule you’re changing, and edit the URL.
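Instead of opening .git/config in an editor, the same change can be made with git config. In the sketch below, the helper name show_submodule_urls and the replacement path /home/alice/git.git are assumptions made up for the example.

```shell
# show_submodule_urls NAME
# Hypothetical helper; run it from the top of the superproject.
# Prints the URL published in .gitmodules next to the one recorded in
# this clone's own .git/config (empty if "git submodule init" has not
# been run and no override has been set).
show_submodule_urls() (
    name=$1
    official=$(git config -f .gitmodules --get "submodule.$name.url")
    local_url=$(git config --local --get "submodule.$name.url")
    printf 'official: %s\nlocal: %s\n' "$official" "$local_url"
)

# Usage: point this clone at a private copy, then compare the two.
#   git config submodule.git.url /home/alice/git.git
#   show_submodule_urls git
```

Only the local setting changes; .gitmodules, which is committed and shared, stays as its authors wrote it.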

Finally, run git submodule update to actually update the files, or if needed, clone the initial subproject repository:

    # Force a complete new clone by removing what's there
    $ rm -rf git

    $ git submodule update
    Initialized empty Git repository in /tmp/app2/git/.git/
    Submodule path 'git': checked out 'ea02eef096d4bfcbb83e76cfab0fcb42dbcad35e'

Here git submodule update goes
to the repository pointed to in your .git/config, fetches the commit ID found in
git ls-tree HEAD -- git, and checks
out that revision in the directory specified in .git/config.

There are a few other things you need to know:

  • When you switch branches or git pull someone else’s branch, you always need to run git submodule update to obtain a matching set of submodules. This isn’t automatic because it could cause you to lose work in the submodule by mistake.

  • If you switch to another branch and don’t issue git submodule update, Git will think you have deliberately changed your submodule directory to point at a new commit (when really it was the old commit you were using before). If you then git commit -a, you will accidentally change the gitlink. Be careful!

  • You can update an existing gitlink by simply checking out the
    right version of a submodule, executing git add on the submodule
    directory, and then running git commit. You don’t use the git submodule command for that.

  • If you have updated and committed a gitlink on your branch and
    if you git pull or git merge another branch that updates the
    same gitlink differently, then Git doesn’t know
    how to represent this as a conflict and will just pick one or the
    other. You must remember to resolve conflicted gitlinks by yourself.
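One way to guard against changing a gitlink by accident is to compare the commit recorded in the index with the commit actually checked out in the submodule before committing. The function below is a minimal sketch of that check; the name submodule_in_sync is hypothetical.

```shell
# submodule_in_sync PATH
# Hypothetical helper; run it from the top of the superproject.
# Exit status: 0 if the submodule at PATH is at the commit recorded in
# the index, 1 if it differs, 2 if the comparison cannot be made.
submodule_in_sync() (
    path=$1
    recorded=$(git ls-files --stage -- "$path" | awk '$1 == "160000" { print $2 }')
    [ -n "$recorded" ] || { echo "no gitlink at '$path'" >&2; exit 2; }
    # An unpopulated submodule directory has no .git entry. Without this
    # guard, rev-parse would climb to the superproject and report its HEAD.
    [ -e "$path/.git" ] || { echo "'$path' is not checked out" >&2; exit 2; }
    actual=$(git -C "$path" rev-parse HEAD) || exit 2
    if [ "$recorded" = "$actual" ]; then
        echo "$path: in sync at $recorded"
    else
        echo "$path: gitlink records $recorded but $actual is checked out"
        exit 1
    fi
)

# Usage, for example before "git commit -a":
#   submodule_in_sync git || echo "run git submodule update first?"
```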

As you can see, the use of gitlinks and git submodule is quite complex. Fundamentally, the gitlink concept can perfectly represent how your submodules relate to your main project, but actually making use of that information is a lot harder than it sounds.

When considering how you want to use submodules in your own project, you need to consider carefully whether the complexity is worth it. Note that git submodule is a standalone command like any other, and it doesn’t make the process of maintaining submodules any simpler than, say, writing your own submodule scripts or using the ext package described at the end of the previous section. Unless you have a real need for the flexibility that git submodule provides, you might consider using one of the simpler methods.

On the other hand, I fully expect that the Git development community will address the shortfalls and issues with the git submodule command, to ultimately lead to a technically correct and very usable solution.


[32] In fact, SVN cleverly uses partial checkouts to implement all its branching and tagging features. You just make a copy of your files in a subdirectory and then check out only that subdirectory.

[33] Actually, there are some experimental patches that implement partial checkouts in Git. They aren’t yet in any released Git version, and may never be. Also, they are only partial checkouts, not partial clones. You still have to download the entire history even if it doesn’t end up in your working tree, and this limits the benefit. Some people are interested in solving that problem, too, but it’s extremely complicated—maybe even impossible—to do right.

[34] See http://labs.trolltech.com/blogs/2008/08/29/workflow-and-switching-to-git-part-2-the-tools/.

[35] If you don’t have such a repository already, you can clone it
from git://git.kernel.org/pub/scm/git/git.git.
