Git – Submodule Best Practices

Submodules are a powerful, though sometimes perceived as complex,
piece of the Git toolchain. At the highest level, they are a facility
for the composition of Git repositories (Figure 17-1).

Figure 17-1. Nested repos

But unlike some of their non-Git cousins such as SVN
Externals, they default to offering greater precision, pointing not
only at the network address of the nested repository, but also to the commit
hash of the nested repository (Figure 17-2).

Figure 17-2. Nested repos pointing to precise revision

Because each commit ref is, within a repo, a unique identifier of a
specific point in the graph and of all parent states that led up to that point,
pointing to the ref of another repo records that precise state in the commit
history of the parent project.

Submodule Commands

Although Chapter 16 provides an
exhaustive list of commands, a quick recap of the basic submodule actions
is helpful:

git submodule add address localdirectoryname

Register a new submodule for this superproject and,
optionally, express it in the specified folder name (can be a
subfolder path relative to the root of the project).

git submodule status

Summary of the commit ref and dirtiness state of all
submodules at this project level.

git submodule init

Use the .gitmodules long-term storage of
submodule information to update the .git/config file used during developer
repository actions.

git submodule update

Fetch the submodule contents using the address from
.git/config and check out the
superproject’s submodule-recorded ref in a detached HEAD state.

git submodule summary

Display a summary of the commits that separate each submodule’s
current state from its committed state.

git submodule foreach command

Scripts a shell command to be run on each submodule
and provides variables for $path,
$sha1, and other useful
identifiers.
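
The commands above can be tried end to end in a throwaway directory. The following is a minimal sketch, not a recommended project layout: the repository names (plugin, super, clone), the vendor/plugin folder, and the Demo identity are all assumptions made for illustration, and the protocol.file.allow setting is only needed because the example uses local paths in place of a network address.

```shell
#!/bin/sh
# Throwaway demo; every repository, path, and identity below is hypothetical.
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Recent Git releases refuse to clone submodules from local paths unless the
# file transport is explicitly allowed; a network address would not need this.
g() { git -c protocol.file.allow=always "$@"; }

# A small stand-alone repository that plays the role of the nested project.
git init -q "$work/plugin"
echo "plugin v1" > "$work/plugin/README.txt"
git -C "$work/plugin" add README.txt
git -C "$work/plugin" commit -q -m "Initial plugin commit"

# git submodule add: register the plugin under vendor/plugin and commit it.
git init -q "$work/super"
cd "$work/super"
g submodule add "$work/plugin" vendor/plugin
git commit -q -m "Add plugin as a submodule"

# A fresh clone of the superproject starts with an empty vendor/plugin folder.
git clone -q "$work/super" "$work/clone"
cd "$work/clone"
git submodule init        # copy the address from .gitmodules into .git/config
g submodule update        # fetch and check out the recorded commit
git submodule status      # show the recorded commit of each submodule
git submodule foreach 'echo "$name is at $sha1"'

# The checked-out commit should be exactly the one the superproject recorded.
recorded=$(git ls-tree HEAD vendor/plugin | awk '{ print $3 }')
checked_out=$(git -C vendor/plugin rev-parse HEAD)
echo "recorded:    $recorded"
echo "checked out: $checked_out"
```

After the update step, the nested directory sits in a detached HEAD state at the recorded commit, which is why the two hashes printed at the end agree.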

Why Submodules?

The most common driving factor behind the use of submodules
is modularization. Submodules provide a componentization of a source code
base in the absence of such a modularization at the binary level (DLL,
JAR, SO). Solutions such as Maven
Multimodule Projects and Gradle
Multiproject Builds are well-known Java solutions for
componentized binary or semibinary dependency management that don’t
require the entire source base to be checked out to a monolithic folder.
Likewise, the .NET space has Assemblies
that allow for binary consumption of subcomponents and plug-ins. Driving
the use of submodules in the Objective-C ecosystem is the contrasting
sparseness of options for modularity and the inclusion of compiled
binaries.

Take, for example, the instructions for the Pull To Refresh
functionality that so many iOS apps are leveraging today. The README suggests that a developer “Copy
the files, PullRefreshTableViewController.h, PullRefreshTableViewController.m, and arrow.png into your project.”
This concept of a nested source in a subdirectory is shown in Figure 17-3.

Figure 17-3. Nested source folders

Git submodules facilitate leaving the existing directory structure
of a subcomponent intact, provided the separation of components falls
along directory fault lines, while enabling precise labeling and version
control of each component that contributes to an aggregate project.

Borrowing database terminology, submodules can also
facilitate the creation of multiple views of different versions of the
same plug-ins or different overlapping sets of plug-ins. More than one
superproject can contain the same submodule, and the different
superprojects can record a different desired ref of the submodule, thus
projecting older and newer views of the composed system, while allowing
the submodule developers to continue unimpeded with forward development at
no risk to the consuming superprojects.

Submodules Preparation

When considering the use of Git submodules, the first
question to ask is if the composition of the code base is ready to accept
such a fracture. Submodules are always expressed as subdirectories of the
superproject. Submodules cannot blend sets of files into a single
directory. Field experience has shown that most systems already have a
subdirectory composition, even in a monolithic repository, as the crudest
form of modularization. Thus, the translation and extraction of a
subfolder (Figure 17-4) to a true
submodule is relatively easy and can be implemented by these
steps:

Figure 17-4. Nested source folder extracted

  1. Move the subdirectory out of the superproject to be a
    peer to the superproject directory. If maintaining repository history
    is important, consider using git filter-branch to help extract the
    subdirectory structure.

  2. Rename the submodule-to-be directory to more accurately express
    the nature of the submodule. For example, a refresh subdirectory might be renamed to
    client-app-refresh-plug-in.

  3. Create a new upstream hosting for the submodule as a first-class
    project (e.g., create a new project on GitHub to host the extracted
    code).

  4. Initialize the now stand-alone plug-in as a Git repo and push
    the commit to the newly created project hosting URL.

  5. In the superproject, add a Git submodule, pointing to the new
    submodule project URL.

  6. Commit and push the superproject, which will include the newly
    created .gitmodules file.
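
The six steps above can be sketched as a script. This is only an illustration under stated assumptions: the client-app project, its refresh folder, and the file names are hypothetical, a local bare repository stands in for the newly created hosted project, and the history-preserving git filter-branch route is used, so the plug-in is already a Git repo by the time step 4 is reached.

```shell
#!/bin/sh
# Throwaway demo; every repository, path, and identity below is hypothetical.
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Local-path submodule clones need the file transport allowed explicitly.
g() { git -c protocol.file.allow=always "$@"; }

# A monolithic project with a refresh/ subdirectory worth extracting.
git init -q "$work/client-app"
cd "$work/client-app"
mkdir refresh
echo "app code" > app.txt
echo "refresh code" > refresh/refresh.txt
git add app.txt refresh
git commit -q -m "Monolithic project"

# Steps 1-2: copy the history into a peer directory with a more descriptive
# name, then keep only the commits and files that lived under refresh/.
git clone -q "$work/client-app" "$work/client-app-refresh-plug-in"
cd "$work/client-app-refresh-plug-in"
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --subdirectory-filter refresh HEAD

# Step 3: a local bare repository stands in for the new hosted project.
mkdir -p "$work/hosting"
hosting="$work/hosting/client-app-refresh-plug-in.git"
git init -q --bare "$hosting"

# Step 4: publish the extracted history to the new hosting address.
git remote set-url origin "$hosting"
git push -q origin HEAD

# Step 5: swap the in-tree folder for a submodule pointing at the hosting.
cd "$work/client-app"
git rm -r -q refresh
g submodule add "$hosting" refresh

# Step 6: commit the submodule entry and .gitmodules (a real project would
# also push the superproject at this point).
git commit -q -m "Consume refresh as a submodule"
git ls-tree HEAD
```

The final listing should show refresh with an entry type of commit, alongside the new .gitmodules file.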

Why Read Only?

The recommendation for the previous extraction of a
subdirectory into a Git submodule advised for it to be cloned via a
read-only address, which frequently means access
through https:// without a username or
git://. This recommendation has served
many users of submodules very well, making it easier to cope with the
complexity that the use of submodules brings about. It offers an enforced
separation of activities, pushing work on submodules out into the
stand-alone clone of the submodule and suggesting that it should first be
engineered, tested, and built in an independent way. Then, as a secondary
step, the developer switches focus back to the superproject, then fetches
and checks out the newer revision of the submodule. This step is
occasionally lamented as being tedious, but many developers learn to
appreciate the precision this offers over the less deterministic approach
of having a floating version of the subcomponent (in the style of an SVN
External pointing to trunk) always
pointing to the latest committed state.

Why Not Read Only?

If the previous recommendation is greatly disliked, it is practical,
though more risky, to update the source code directly within the
submodules of a superproject, committing, pushing, and checking out from
that nested directory. It can be slightly more efficient to use this
combined approach, although it forgoes the true separation of
implementing versus consuming modes that submodules were meant to bring
about.

The greatest risk with this all-in-one working directory approach,
even for veteran submodule users, is the committing of code and the
recording of an updated submodule hash in the superproject
without having pushed the submodule’s new commits to
a shared network repository. Thus, if the superproject’s new commit is
pushed, other developers, upon pulling the updated superproject, will find
they cannot fully check out the current committed ref because there are
inaccessible commits in the unpushed subproject that the superproject is
calling for.
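
One way to guard against this mistake is the --recurse-submodules=check option of git push, which refuses to publish a superproject commit whose submodule commits are not on any of the submodule's remotes. The sketch below reproduces the risky sequence in a throwaway directory; the lib and super repositories, the local bare repositories standing in for shared hosting, and the Demo identity are all assumptions made for illustration.

```shell
#!/bin/sh
# Throwaway demo; every repository, path, and identity below is hypothetical.
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Local-path submodule clones need the file transport allowed explicitly.
g() { git -c protocol.file.allow=always "$@"; }

# Local bare repositories stand in for the shared network hosting.
git init -q --bare "$work/lib.git"
git init -q --bare "$work/super.git"

# Seed the library's hosting with a first commit.
git init -q "$work/lib-seed"
git -C "$work/lib-seed" commit -q --allow-empty -m "lib v1"
git -C "$work/lib-seed" push -q "$work/lib.git" HEAD

# A superproject that consumes the library and is published once.
git init -q "$work/super"
cd "$work/super"
git remote add origin "$work/super.git"
g submodule add "$work/lib.git" lib
git commit -q -m "Add lib"
git push -q -u origin HEAD

# The risky workflow: commit inside the nested directory, record the new
# hash in the superproject, and forget to push the submodule.
git -C lib commit -q --allow-empty -m "lib v2 (not pushed yet)"
git add lib
git commit -q -m "Bump lib to v2"

# With check, the superproject push is refused while lib v2 is unpublished.
if git push -q --recurse-submodules=check origin HEAD 2>/dev/null; then
    first_attempt=pushed
else
    first_attempt=refused
fi

# Publish the submodule first; the same superproject push then goes through.
git -C lib push -q origin HEAD
if git push -q --recurse-submodules=check origin HEAD; then
    second_attempt=pushed
else
    second_attempt=refused
fi
echo "first attempt: $first_attempt, second attempt: $second_attempt"
```

A related mode, --recurse-submodules=on-demand, asks Git to push the affected submodules itself before pushing the superproject.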

Examining the Hashes of Submodule Commits

For developers wanting to examine their project one level
deeper than they will use it on a daily basis, the recording of a
submodule commit ref is a fascinatingly simple thing to observe. The ref
of the submodule’s commit is stored in the tree just
as the ref of a subdirectory or blob would be, but with an entry type of
commit rather than tree or
blob.

    $ git ls-tree HEAD
    100644 blob 0cf8086ddd1ac6c6463405ea9aa46102e0e6eb20  .gitmodules
    100644 blob e425f022e79989a5ecb2c8343e697d1e4bf70258  README.txt
    040000 tree aaa0af6b82db99c660b169962524e2201ac7079c  resources
    040000 tree 42103128ceaebabff8f50cf408903d12e14c21d9  src
    160000 commit 47b28b4e89481095f0eefe764eeefafcfa7e5b6c  submodule1

A practical use of this tooling output, sometimes from a build
automation script, is to examine the state of a consumed submodule and
compare it to another known state. git rev-parse can be used on a HEAD or labeled build in another phase of
automation to capture a known good point of the submodule, and the
resultant hash can then be compared to the currently preserved ref (state) of
the submodule within the superproject.
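
A comparison of this kind might look like the following sketch. It assumes a hypothetical release-1.0 tag marks the known good build of the submodule, and the helper name recorded_submodule_commit is invented for this example; the repositories are created in a throwaway directory so the script can run on its own.

```shell
#!/bin/sh
# Throwaway demo; every repository, tag, and identity below is hypothetical.
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Local-path submodule clones need the file transport allowed explicitly.
g() { git -c protocol.file.allow=always "$@"; }

# Print the commit recorded for a submodule path at a superproject revision.
# usage: recorded_submodule_commit <rev> <path>
recorded_submodule_commit() {
    git ls-tree "$1" "$2" | awk '$2 == "commit" { print $3 }'
}

# A submodule upstream with a labeled, known good build.
git init -q "$work/submodule1"
git -C "$work/submodule1" commit -q --allow-empty -m "Known good state"
git -C "$work/submodule1" tag release-1.0

# A superproject that records the submodule at that point.
git init -q "$work/super"
cd "$work/super"
g submodule add "$work/submodule1" submodule1
git commit -q -m "Add submodule1"

# Upstream moves on; the superproject still points at the older commit.
git -C "$work/submodule1" commit -q --allow-empty -m "Newer work"

# Capture the known good point and compare it to the recorded ref.
known_good=$(git -C "$work/submodule1" rev-parse 'release-1.0^{commit}')
recorded=$(recorded_submodule_commit HEAD submodule1)

if [ "$recorded" = "$known_good" ]; then
    verdict="submodule1 is at the known good commit"
else
    verdict="submodule1 has drifted from the known good commit"
fi
echo "$verdict ($recorded)"
```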

Credential Reuse

A traditional git clone user@hostname:pathtorepo
is acceptable for a stand-alone Git
repository. However, this is a less desirable address for a git submodule add URL command because the
username will be saved in the submodule metadata at the superproject repository level. This username
will be preserved and unintentionally used by all other developers who
clone the repository.

In a business where access control to repositories is
decided on a per user basis, it would be undesirable to store a specific
username as part of the .gitmodules
recorded address for a submodule. It would be nice if the superproject’s
username used during cloning were passed along to the submodules’ cloning
operation.

The Git submodule commands know to take the credentials
given during the superproject cloning operation and pass them downward
(Figure 17-5) to any actions invoked by
--recurse-submodules. This leaves the .gitmodules address free of any usernames and
usable by any developer authorized to clone the project.

Figure 17-5. Reuse of credentials in submodules
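
A quick audit can confirm that no personal username has slipped into the recorded addresses. The sketch below is one possible check, not a Git feature: the .gitmodules content, the host git.example.com, the user alice, and the helper name urls_with_username are all hypothetical, and the pattern only catches the simple user@host forms.

```shell
#!/bin/sh
# Throwaway demo; the .gitmodules content below is hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"

# One username-free address and one address carrying a personal username.
cat > .gitmodules <<'EOF'
[submodule "lib-a"]
    path = lib-a
    url = https://git.example.com/team/lib-a.git
[submodule "lib-b"]
    path = lib-b
    url = ssh://alice@git.example.com/team/lib-b.git
EOF

# Print every recorded submodule address that carries a user@ prefix.
urls_with_username() {
    git config -f .gitmodules --get-regexp '^submodule\..*\.url$' |
        grep -E ' ([a-z+]+://)?[^/@: ]+@' || true
}

before=$(urls_with_username)
echo "before: ${before:-none}"

# Rewrite the offending entry to a username-free address.
git config -f .gitmodules submodule.lib-b.url https://git.example.com/team/lib-b.git

after=$(urls_with_username)
echo "after: ${after:-none}"
```

In a real superproject, the edited .gitmodules would then be committed, and existing clones would run git submodule sync to pick up the new address.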

Use Cases

Open Sourcing a Book’s Code Samples

One of the most exciting examples of applying
submodules was the open sourcing of the Building and
Testing Gradle
book’s code examples long before the book
itself was put on the market. This allowed for the creation of some
early buzz around the book as well as community contributions to and
polishing of the examples. Using GitHub as the repository host, the
top level book project was closed source, but contained a submodule
for the example code in a folder named examples. Specific source code files in
the examples directory were
directly referenced by the book prose AsciiDoc files. The book PDF
and HTML generation tooling had no idea a Git submodule was used; it
was just a regular directory as far as it was concerned. The
contributors to the open source examples had no burden on how this
code was used in the book. It was an eye-opening experience that
other technical authors are encouraged to repeat.

A Plug-in

Frequently in the Objective-C world, but also in the
ANSI C and C++ ecosystems, plug-in–like code can be incorporated as
a submodule into a superproject without losing the ability to update
or the connection to the original add-in author’s repository. The
traditional README-suggested
process of copying these files into your project leaves them
detached from any historical metadata and subject to a manual
copy-and-paste update. This plug-in pattern extends even to
noncompiled code such as Emacs Lisp
setups and dotfile
configurations with the inclusion of oh-my-zsh.

A Large Repo

The most contentious use of submodules is for scaling down the
size of a repository. Although a practical solution to Git’s desire
to have relatively small repos (1 to 4GB total) compared to
several-hundred-gigabyte SVN repositories, strategic developers
should consider solutions that link projects on a binary or
Application Programming
Interface (API) level rather than at the source level that
submodules provide.

Visibility Constraints

A final and unique implementation pattern of
submodules is the partitioning of (access control–based) visibility
of a composed application. One Git-using development team had
cryptographic code with licensing constraints permitting only a
handful of developers to see it. That code was stored as a Git
submodule and when the superproject was cloned, the permissions
prevented the majority of developers from being able to clone that
submodule. The build system for this project was carefully
constructed to adapt to the missing source of the cryptographic
component, outputting a
developer-only build. The SSH key of the
continuous integration server,
on the other hand, did have permission to retrieve the cryptography
submodule, thus producing the feature-complete builds that customers
would ultimately receive.

Multilevel Nesting of Repos

The use of submodules discussed thus far can be extended to
another level of recursion. Submodules can in turn be superprojects, and
thus contain submodules. This once led to a proliferation of custom automation
scripts to recursively apply behavior to every nested submodule. However,
that need has been mitigated through recent improvements in submodule
support across the Git vocabulary.

Submodules have received renewed attention in the 1.6.x and
1.7.x era of Git, with the addition of the
--recurse-submodules option to the majority of the
network-enabled Git commands. As of Git Version 1.7.9.4, this option is
supported by the clone, fetch, and pull commands. Furthermore, the convenience of
working with nested submodules has been improved with submodule status, submodule update, and submodule foreach, all supporting the
--recursive option.
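
Two levels of nesting can be exercised with a short script. This is a minimal sketch in a throwaway directory: the inner, middle, and outer repositories and the Demo identity are assumptions made for illustration, and the protocol.file.allow setting is only there because local paths stand in for network addresses.

```shell
#!/bin/sh
# Throwaway demo; every repository, path, and identity below is hypothetical.
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Local-path submodule clones need the file transport allowed explicitly.
g() { git -c protocol.file.allow=always "$@"; }

# Innermost repository.
git init -q "$work/inner"
echo "inner" > "$work/inner/inner.txt"
git -C "$work/inner" add inner.txt
git -C "$work/inner" commit -q -m "inner v1"

# Middle repository: a submodule that is itself a superproject.
git init -q "$work/middle"
cd "$work/middle"
g submodule add "$work/inner" inner
git commit -q -m "Add inner"

# Outer superproject.
git init -q "$work/outer"
cd "$work/outer"
g submodule add "$work/middle" middle
git commit -q -m "Add middle"

# One clone command populates every nesting level.
g clone -q --recurse-submodules "$work/outer" "$work/clone"
cd "$work/clone"

# The --recursive option walks into nested submodules as well.
git submodule status --recursive
git submodule foreach --recursive 'echo "visiting $name"'
nested_count=$(git submodule status --recursive | wc -l | tr -d ' ')
echo "submodules found at all levels: $nested_count"
```

Without --recursive, the status and foreach commands would stop at middle and never report inner.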

Submodules on the Horizon

I’ve been pleased to see that as submodule tooling support
increases, such as the Graphical User Interface (GUI) support for revision
updates in Git
Tower, in addition to the
hyperlinking of submodules
on GitHub, adoption has also increased (see Figure 17-6). This also parallels the developer
community’s ever increasing proficiency in Git. As the idea of pointers to
specific views of all files at an instant in time becomes more of a
pedestrian concept, the use of submodules is likely to increase even
further.

Figure 17-6. Submodule hyperlinks on GitHub repositories
