Git – Advanced Manipulations

Using git filter-branch

The command git filter-branch is a generic branch processing command that allows
you to arbitrarily rewrite the commits of a branch using custom commands
that operate on different objects within the repository. Some filters work
on commits, some on tree objects and directory structures, and
others let you manipulate the environment.

Does that sound useful and yet dangerous?

Good.

As you might suspect, with great power comes great
responsibility.[40] The power and purpose of filter-branch is also the source of my warning:
it has the potential to rewrite the entire repository’s commit history.
Executing this command on a repository that has already been published for
others to clone and use will likely cause them endless grief later. As
with all rebasing operations, commit history will change. After this
command, you should consider any
repositories cloned from it earlier as obsolete.

With that warning about rewriting repository history behind us,
let’s find out what the command can do, when and why it might be useful,
and how to use it responsibly.

The filter-branch command runs a
series of filters on one or more branches within your repository. Each
filter can have its own custom filtering command. You don’t have to run
them all, or even more than one. But they are designed and sequenced so
that earlier filters can affect the behavior of later filters. The
subdirectory-filter runs as a
precommit-processing selection filter, and the tag-name-filter runs as a postcommit-processing
step.

To help you get a clearer picture of what is happening
during the filtering process, it might help to know that as of version
1.7.9, git filter-branch is a shell
script.[41] Except for the commit-filter, each
command is evaluated in a shell (sh) context using eval.

Here is a brief description of each filter and the order in which
they run:

env-filter command

The env-filter can
be used to create or alter the shell environment settings prior to
running the subsequent filters and committing the newly rewritten
objects. Of note, changing variables such as GIT_AUTHOR_NAME, GIT_AUTHOR_EMAIL, GIT_COMMITTER_NAME, and GIT_COMMITTER_EMAIL may be useful. The
command should likely both set and export
environment variables.

tree-filter command

The tree-filter
allows you to modify the contents of a directory that will be
captured by a tree object. You can use this filter to remove files
from or add files to the repository retroactively. This filter
checks out the branch at each commit during the filtering. Be aware
that the .gitignore file is not
effective during this filter.

index-filter command

The index-filter is
used to alter the contents of the index prior to making a commit.
Throughout the filtering process, the index of each commit is made
available without checking out the corresponding files into a
working directory. Thus, this filter is similar to the tree-filter but faster if you don’t
actually need the file contents during the filter. You should study
the low-level git update-index
command.

parent-filter command

The parent-filter
allows you to restructure the parent relationship of every commit.
For a given commit, you specify its new parent or parents. To use
this properly, you should study the low-level git commit-tree command.

msg-filter command

Just prior to actually making a newly filtered commit,
the msg-filter allows you to edit
the commit message. The command should
accept the old message on stdin
and write the new message on stdout.

commit-filter command

Normally during the filtering pipeline, git commit-tree will be used to perform
the commit. However, this filter gives you control over this step
yourself. Your command will be called with the new (possibly rewritten)
tree-obj and a list of (possibly rewritten) -p parent-obj parameters. The
(possibly rewritten) commit message will be on stdin. You should likely still use
git commit-tree, but a few convenience functions are also provided in the
filter’s environment: map, skip_commit, git_commit_non_empty_tree, and die.
The git filter-branch manual page has details for each of these
functions.

tag-name-filter command

If your repository has any tags, you should probably
use tag-name-filter to rewrite
existing tags to reference the newly created corresponding commits.
By default, the old tags will remain, but you can use cat as the filter to obtain direct new-for-old mappings of your tags.
Although simply mapping tags to reference the new, corresponding
commits is certainly possible, maintaining a
signed tag is not. Remember that the whole point of
signing a tag was to maintain a cryptographically secure indicator
of the repository at a certain point in its history. That just went
out the window here, right? So all those signatures on signed tags
will be removed from the corresponding new tags.

subdirectory-filter command

The subdirectory-filter can be used to limit
the rewriting of history to only those commits that affect a
specific directory. That is, after filtering, the new repository
will contain only the named directory at its
root.
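To make the subdirectory-filter description concrete, here is a minimal sketch run against a throwaway repository. The directory names, file names, and identity settings are invented for illustration, and FILTER_BRANCH_SQUELCH_WARNING is set only because newer Git versions otherwise pause to print a warning before filtering.

```shell
#!/bin/sh
# Sketch: keep only the history of docs/ and make it the new repository root.
# All names below are hypothetical.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

mkdir docs src
echo "int main(void) { return 0; }" > src/main.c
git add src
git commit -q -m "Add source"

echo "guide" > docs/guide.txt
git add docs
git commit -q -m "Add guide"

# Rewrite the current branch so that docs/ becomes the top-level directory.
git filter-branch --subdirectory-filter docs HEAD >/dev/null

root_files=$(git ls-tree --name-only HEAD)   # files at the new root
commit_count=$(git rev-list --count HEAD)    # commits that touched docs/
echo "root: $root_files, commits: $commit_count"
```

Only the commit that touched docs/ should survive, and guide.txt should now sit at the top level of the tree.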

After a git filter-branch
completes, the original references comprising the entire old commit
history are available as new refs in refs/original. Naturally, this implies that the
refs/original directory must be empty
at the start of the filtering operation. After verifying that you obtained
the filtered history you desired, and the original commit history is no
longer needed, carefully remove the .git/refs/original refs. (Or, if you want to be
fully Git compliant and Git friendly, you can even use the command
git update-ref -d refs/original/branch for each
branch you filtered, where branch is the full name of the original ref,
such as refs/heads/master.) If you do not remove this
directory, you will continue to have the entirety of both the old and new
content within your repository. The old refs will linger and prevent
garbage collection (see Garbage Collection) from trimming out the
otherwise obsolete commits.[42] If you don’t want to explicitly remove this directory, you
can also clone away from it. That is, make a clone of the repository,
leaving these original refs behind and not cloning them into a new
repository. Think of it as a natural checkpoint backup.
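The cleanup just described might look something like the following sketch. It builds a scratch repository (all names are invented), rewrites one commit message so that a backup ref appears under refs/original, and then deletes every backup ref with git update-ref. The reflog expiry and gc steps at the end are my assumption about what a thorough cleanup needs; treat them as optional.

```shell
#!/bin/sh
# Sketch: remove the refs/original backups left behind by git filter-branch.
# Repository contents and identity are hypothetical.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "notes" > Readme
git add Readme
git commit -q -m "draft notes"

# Any rewrite will do; this one just edits the commit message.
git filter-branch --msg-filter 'sed -e "s/draft/final/"' HEAD >/dev/null

# The backup is stored under the full original ref name.
backup_refs=$(git for-each-ref --format='%(refname)' refs/original/)
echo "backup refs: $backup_refs"

# Delete each backup ref through Git rather than by removing files.
git for-each-ref --format='%(refname)' refs/original/ |
while read -r ref; do
    git update-ref -d "$ref"
done

# Optional: drop reflog entries and prune the now-unreferenced objects.
git reflog expire --expire=now --all
git gc -q --prune=now

remaining=$(git for-each-ref refs/original/ | wc -l)
subject=$(git log -1 --format=%s)
echo "remaining backup refs: $remaining, subject: $subject"
```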

There are several reasons that best practices with git filter-branch suggest you should always
operate on a newly cloned repository. For starters, git filter-branch flat-out requires that the
operation begin with a clean working directory. Because git filter-branch modifies your original
repository in place, it is often described as being a
destructive operation. Because the command has many steps,
options, and subtleties, running it can be quite tricky and often
difficult to get right on the first attempt. Saving the original
repository is just prudent computing.

Examples Using git filter-branch

Now that we know what git filter-branch can do, let’s look at a few cases where it can
be used productively. One of the most useful situations occurs when you
have just created a repository full of commit history and want to clean
it up or do a large-scale alteration on it prior to making it available
for cloning and general use by others.
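One such large-scale alteration is correcting an author identity with the env-filter described earlier. The following is only a sketch under stated assumptions: the repository is a scratch one, and both email addresses are made up. Note that the filter both sets and exports each variable.

```shell
#!/bin/sh
# Sketch: replace an old email address on every commit using --env-filter.
# The addresses and repository contents are hypothetical.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # newer Git versions otherwise pause to warn

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "old@example.com"

echo "notes" > Readme
git add Readme
git commit -q -m "Collect some notes"

# The filter text is evaluated by the shell once per commit.
git filter-branch --env-filter '
    if [ "$GIT_AUTHOR_EMAIL" = "old@example.com" ]; then
        GIT_AUTHOR_EMAIL="new@example.com"
        export GIT_AUTHOR_EMAIL
    fi
    if [ "$GIT_COMMITTER_EMAIL" = "old@example.com" ]; then
        GIT_COMMITTER_EMAIL="new@example.com"
        export GIT_COMMITTER_EMAIL
    fi
' HEAD >/dev/null

author_email=$(git log -1 --format=%ae)
committer_email=$(git log -1 --format=%ce)
echo "author: $author_email, committer: $committer_email"
```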

Using git filter-branch to expunge a file

A common use for git filter-branch is to completely remove a file from the entire
history of a repository. Remember, Git maintains the complete history
of every file within the repository. Thus, simply deleting a file with
git rm will not remove it from
older history. One can always go back to earlier commits and retrieve
the file.

However, by using git filter-branch, a file can be removed from any and every
commit in the repository, making it appear as if it was never there in
the first place.

Let’s work on an example repository that contains personal notes
after reading various books. The notes are stored in files named after
the works.

    $ cd BookNotes

    $ ls
    1984  Animal_Farm  Nightfall  Readme  Snow_Crash

    $ git log --pretty=oneline --abbrev-commit
    ffd358c Read Asimov's 'Nightfall'.
    4df8f74 Read a few classics.
    8d3f5a9 Read 'Snow Crash'
    3ed7354 Collect some notes about books.

And the classics from the third commit 4df8f74 are:

    $ git show 4df8f74
    commit 4df8f74b786b31b6043c44df59d7d13ee2b4b298
    Author: Jon Loeliger <jdl@example.com>
    Date:   Sat Jan 14 12:57:35 2012 -0600

    Read a few classics.

        - Animal Farm by George Orwell
        - 1984 by George Orwell

    diff --git a/1984 b/1984
    new file mode 100644
    index 0000000..84a2da2
    --- /dev/null
    +++ b/1984
    @@ -0,0 +1 @@
    +George Orwell is disturbed.
    diff --git a/Animal_Farm b/Animal_Farm
    new file mode 100644
    index 0000000..e1fcda1
    --- /dev/null
    +++ b/Animal_Farm
    @@ -0,0 +1 @@
    +Animal Farm was interesting.

Suppose for some history-revising reason we have decided to
remove any record of George Orwell’s 1984 from
the repository. If you don’t care about the old commit history, simply
issuing a git rm 1984 would
suffice. But to be thoroughly Orwellian, it must be removed from the
complete history of the repository. It must never have existed.

Of all the filters listed previously, the likeliest candidates
for this operation are the tree-filter and index-filter. Because this is a small
repository and the operation we want to do, namely, remove one file,
is pretty simple and direct, we’ll use the tree-filter.

As advised earlier, start with a clean clone, just in
case.

    $ cd ..
    $ git clone BookNotes BookNotes.revised
    Cloning into 'BookNotes.revised'...
    done.
    $ cd BookNotes.revised

    $ git filter-branch --tree-filter 'rm 1984' master
    Rewrite 3ed7354c2c8ae2678122512b26d591a9ed61663e (1/4)
        rm: cannot remove `1984': No such file or directory
    tree filter failed: rm 1984

    $ ls
    1984  Animal_Farm  Nightfall  Readme  Snow_Crash

Clearly that didn’t go well and something failed. The file is
still in the repository.

Let’s think a little about what Git is doing here. Git will
iterate over each commit in the master branch, starting with the very first
commit, establish the context (index, files, directories, etc.) of
that commit, and then try to remove the file 1984.

Git tells you which commit it was modifying when the command
failed. Commit 3ed7354 is the first
of 4 commits.

    Rewrite 3ed7354c2c8ae2678122512b26d591a9ed61663e (1/4)

But recall that the file 1984 was introduced in the third commit,
4df8f74, and not the first. That
means that for the first two commits, 3ed7354 and 8d3f5a9, the 1984 file was not yet in the repository or
any of its managed files. That in turn means that when establishing
the filtering context of those first two commits, a simple rm 1984 shell command within the top-level
directory will fail for lack of a file to remove. It’s exactly as if
you had typed rm snizzle-frotz in a
directory with no snizzle-frotz
file in it.

    $ cd /tmp
    $ rm snizzle-frotz
    rm: cannot remove `snizzle-frotz': No such file or directory

The trick is to realize that when removing a file, you don’t
care whether the file is actually present or not. So just force the
removal and ignore nonexistent files using the -f or
--force option:

    $ cd /tmp
    $ rm -f snizzle-frotz
    $

OK, back to the BookNotes.revised repository:

    $ cd BookNotes.revised
    $ git filter-branch --tree-filter 'rm -f 1984' master
    Rewrite ffd358c675a1c6d36114e10a92d93fdc1ee84629 (4/4)
    Ref 'refs/heads/master' was rewritten

As a side note, Git really scrolls through all the commits,
stating which one it is presently rewriting, but only the last one
shows up on your screen, as just shown. If you are a bit more clever,
perhaps by piping that output through less, you can see that it actually prints
each commit processed:

    Rewrite 3ed7354c2c8ae2678122512b26d591a9ed61663e (1/4)
    Rewrite 8d3f5a96b18f9795a1bb41295e5a9d2d4eb414b4 (2/4)
    Rewrite 4df8f74b786b31b6043c44df59d7d13ee2b4b298 (3/4)
    Rewrite ffd358c675a1c6d36114e10a92d93fdc1ee84629 (4/4)

But it worked this time:

    $ ls
    Animal_Farm  Nightfall  Readme  Snow_Crash

The 1984 file is now
gone!

Tip

For the terminally curious, the corresponding command using
index-filter would be something
like this:

    $ git filter-branch --index-filter \
      'git rm --cached --ignore-unmatch 1984' master

Let’s look at the new commit log:

    $ git log --pretty=oneline --abbrev-commit
    ad1000b Read Asimov's 'Nightfall'.
    7298fc5 Read a few classics.
    8d3f5a9 Read 'Snow Crash'
    3ed7354 Collect some notes about books.

Notice how each commit starting with the original third commit
(4df8f74 and ffd358c) now has a different SHA1 value
(7298fc5 and ad1000b), whereas the earlier commits
(3ed7354 and 8d3f5a9) remain unchanged.

During the filtering and rewriting process, Git creates and
maintains this mapping between old and new commit values and makes it
available to you as the map
convenience function. If for some reason you need to convert from an
old commit SHA1 to the corresponding new SHA1, you can do so using
this mapping from within your filter
command.

Let’s investigate a bit more, though.

    $ git show 7298fc5
    commit 7298fc55d1496c7e70909f3ebce238d447d07951
    Author: Jon Loeliger <jdl@example.com>
    Date:   Sat Jan 14 12:57:35 2012 -0600

    Read a few classics.

        - Animal Farm by George Orwell
        - 1984 by George Orwell

    diff --git a/Animal_Farm b/Animal_Farm
    new file mode 100644
    index 0000000..e1fcda1
    --- /dev/null
    +++ b/Animal_Farm
    @@ -0,0 +1 @@
    +Animal Farm was interesting.

Indeed the commit that first introduced 1984 no longer does so! That means the file
was never introduced in the first place. It is not just gone from the
top commit; it is not just gone from any commit reachable from the
master branch; it never existed on
this branch.
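A related wrinkle: if a commit did nothing except add the expunged file, removing the file leaves that commit with no changes at all. One way to drop such commits is a commit-filter that calls the git_commit_non_empty_tree convenience function mentioned earlier. This is a sketch on a scratch repository; the file name secret.txt and the commit messages are invented.

```shell
#!/bin/sh
# Sketch: expunge a file and skip any commit that becomes empty as a result.
# Repository contents are hypothetical.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "notes" > Readme
git add Readme
git commit -q -m "Collect some notes"

echo "hunter2" > secret.txt
git add secret.txt
git commit -q -m "Add a file that should never have been committed"

# index-filter removes the file; commit-filter skips commits whose tree
# ends up identical to that of their single parent.
git filter-branch \
    --index-filter 'git rm -q --cached --ignore-unmatch secret.txt' \
    --commit-filter 'git_commit_non_empty_tree "$@"' \
    HEAD >/dev/null

commit_count=$(git rev-list --count HEAD)
echo "commits remaining: $commit_count"
```

In this sketch, only the first commit should remain, because the second one no longer changes anything.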

But doesn’t it bother you that the commit message itself still
mentions the 1984 book? Let’s fix
that in the next section!

Using filter-branch to edit a commit message

Here’s the problem we’re solving: some commit message
needs to be revised. In the previous section, we saw how to remove a
file from the complete history of a repository. However, the commit
message that used to introduce it still alludes
to it:

    $ git log -1 7298fc55
    commit 7298fc55d1496c7e70909f3ebce238d447d07951
    Author: Jon Loeliger <jdl@example.com>
    Date:   Sat Jan 14 12:57:35 2012 -0600

    Read a few classics.

        - Animal Farm by George Orwell
        - 1984 by George Orwell

That last line has to go!

This is the perfect use case for the
--msg-filter filter. Your filter command should
accept the old text of a commit message on stdin and write its revised text on stdout. That is, your filter should be a
classic stdin-to-stdout edit filter. Typically, it will be
something like sed, although it can
be as complex as needed.

In our case, we’ll want to delete that last 1984 line. We’ll also want to touch up the
previous sentence to just talk about one book rather than a
few. A sed command to do
these edits looks like this:

    sed -e "/1984/d" -e "s/few classics/classic/"

Put that together with the --msg-filter option.
Be careful with your line breaks on input here. It should be all one
line, or use the single quote as a command input continuation
technique.

    $ git filter-branch --msg-filter '
        sed -e "/1984/d" -e "s/few classics/classic/"' master
    Rewrite ad1000b936acf7dbe4a29da6706cb759efded1ae (4/4)
    Ref 'refs/heads/master' was rewritten

Let’s check:

    $ git log --pretty=oneline --abbrev-commit
    bf7351c Read Asimov's 'Nightfall'.
    f28e55d Read a classic.
    8d3f5a9 Read 'Snow Crash'
    3ed7354 Collect some notes about books.

We can already see that the log message from commit f28e55d has been singularized by our
sed script. Good. Looking again at
the whole message:

    $ git log -1 f28e55d
    commit f28e55dc8bbdee555a3f7778ba8355db9ab4c4a1
    Author: Jon Loeliger <jdl@example.com>
    Date:   Sat Jan 14 12:57:35 2012 -0600

    Read a classic.

        - Animal Farm by George Orwell

Now it is truly as if it never existed in this repository! And
we’ve always been at war with Eastasia.

One cautionary note about the filtering process: make sure that
you are both operating on the items you want to
change, and that you are operating on only those items!

For example, the sed command
from the previous --msg-filter example appears to
change precisely the one commit message we wanted to adjust. However,
be aware that the same sed script is
applied to every commit message in the history. If there were other,
perhaps incidental occurrences of the string 1984 in other commit messages, those lines would
also have been deleted because our script was not very discriminating.
Consequently, you may have to write a more detailed sed command or a more clever
script.
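One possible "more clever script" keys off the GIT_COMMIT variable, which git filter-branch sets to the ID of the commit being rewritten. The sketch below applies the sed edit to a single commit and passes every other message through cat. The scratch repository, its messages, and the TARGET_COMMIT variable name are all invented for illustration.

```shell
#!/bin/sh
# Sketch: edit exactly one commit message and leave the rest untouched.
# TARGET_COMMIT is a hypothetical helper variable, not something Git defines.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

printf 'notes\n' > Readme
git add Readme
git commit -q -m "Collect some notes about books."

printf 'Animal Farm was interesting.\n' > Animal_Farm
git add Animal_Farm
git commit -q -m "Read a few classics." -m "- Animal Farm by George Orwell
- 1984 by George Orwell"

printf 'more notes\n' >> Readme
git add Readme
git commit -q -m "Plan the 1984 reading group"

# Export the ID so the filter, which runs in a child shell, can see it.
TARGET_COMMIT=$(git rev-parse HEAD~1)
export TARGET_COMMIT

git filter-branch --msg-filter '
    if [ "$GIT_COMMIT" = "$TARGET_COMMIT" ]; then
        sed -e "/1984/d" -e "s/few classics/classic/"
    else
        cat
    fi' HEAD >/dev/null

edited_subject=$(git log -1 --format=%s HEAD~1)
edited_body=$(git log -1 --format=%b HEAD~1)
untouched_subject=$(git log -1 --format=%s HEAD)
echo "$edited_subject / $edited_body / $untouched_subject"
```

The incidental mention of 1984 in the newest commit message survives because that commit never reaches the sed branch.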

filter-branch Pitfalls

It is important to understand a brutal consequence of the
name of this Git command: it is filter-branch. At
its core, the git filter-branch
command is designed to operate on just one branch or ref. However, it
can operate on many branches or refs.

In many cases, you want to have it operate on
all branches so as to obtain a repository-wide
coverage. In these cases, you will need the -- --all
tacked onto the end of the command.

    $ git filter-branch --index-filter \
    "git rm --cached -f --ignore-unmatch '*.jpeg'" \
    -- --all

Similarly, you almost certainly want to translate any tag refs
from a prefiltered state into the new postfiltered repository. That
means adding --tag-name-filter cat is also quite
common:

    $ git filter-branch --index-filter \
    "git rm --cached -f --ignore-unmatch '*.jpeg'" \
    --tag-name-filter cat \
    -- --all

Tip

How about this one? You used --tree-filter or
--index-filter to remove a file from a repository, but did
that file get moved or have its name changed at some point in its
history? You can use a command like this to find out:

    $ git log --name-only --follow --all -- file

If other names for that file exist, you might want to delete
those versions as well.
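As a rough sketch of that workflow, the script below renames a file in a scratch repository, uses the --follow form of git log to collect every name the file has had, and then removes both names with an index-filter. The file names are invented, and the --pretty=format: option is used only to suppress the commit headers so that bare path names remain.

```shell
#!/bin/sh
# Sketch: discover earlier names of a file, then expunge all of them.
# File names and messages are hypothetical.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

printf 'notes\n' > Readme
git add Readme
git commit -q -m "Collect some notes"

printf 'George Orwell is disturbed.\n' > 1984
git add 1984
git commit -q -m "Read 1984"

git mv 1984 Nineteen_Eighty_Four
git commit -q -m "Spell out the title"

# List every path name the file has carried, one per line, without duplicates.
names=$(git log --name-only --follow --all --pretty=format: -- Nineteen_Eighty_Four |
        sed '/^$/d' | sort -u)
echo "$names"

# Remove both the old and the new name from every commit on every ref.
git filter-branch --index-filter \
    'git rm -q --cached --ignore-unmatch 1984 Nineteen_Eighty_Four' \
    -- --all >/dev/null

final_files=$(git ls-tree -r --name-only HEAD)
echo "files at HEAD: $final_files"
```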

How I Learned to Love git rev-list

One day, I received this piece of email:

Jon,

I’m trying to figure out how to do a date-based check out from a
Git repository into an empty working directory. Unfortunately, winding
my way through the Git manual pages makes me feel like I’m playing
Adventure.

Eric

Indeed. Let’s see if we can navigate some of those twisty
passages.

Date-Based Checkout

It might seem that a command like git checkout master@{Jan 1, 2011} should work.
However, that command is really using the reflog (see The Stash) to resolve the date-based reference
for the master ref. There are lots of
ways this innocent-looking construct might fail: your repository may not
have the reflog enabled, you may not have manipulated the master ref during that time period, or the
reflog may have already expired refs from that time period. Even more
subtly, that construct may not give you your expected answer. It
asks the reflog to resolve where your master was at the given time as you
manipulated the branch, and not according to the branch’s commit
timeline. They may be related, especially if you developed and committed
that history in this repository, but they don’t have to be.

Ultimately, this approach can be a misleading dead-end. Using the
reflog might get what you want. But it can also
fail, and it isn’t a reliable method.

Instead, you should use the git rev-list command. It is the general purpose workhorse whose
job is to combine a multitude of options, sort through a complex commit
history of many branches, intuit potentially vague user specifications,
limit search spaces, and ultimately locate selected commits from within
the repository history. It then emits one or more SHA1 IDs for use by
other tools. Think of git rev-list
and its myriad options as a commit database front-end query tool for
your repository.

In this case, the goal is fairly simple: find the one commit in a
repository that existed immediately before a given date on a given
branch and then check it out.

Let’s use the actual Git source repository because it has a fairly
extensive and explorable history. First, we’ll use rev-list to find that SHA1. The -n 1
option limits the output from the command to just one commit
ID.

Here, we try to locate just the last master commit of 2011 from the Git source
repository:

    $ git clone git://github.com/gitster/git.git
    Cloning into 'git'...
    remote: Counting objects: 126850, done.
    remote: Compressing objects: 100% (41033/41033), done.
    remote: Total 126850 (delta 93115), reused 117003 (delta 84141)
    Receiving objects: 100% (126850/126850), 27.56 MiB | 1.03 MiB/s, done.
    Resolving deltas: 100% (93115/93115), done.

    $ cd git
    $ git rev-list -n 1 --before="Jan 1, 2012 00:00:00" master
    0eddcbf1612ed044de586777b233caf8016c6e70

Having identified the commit, you may use it, tag it, reference
it, or even check it out. But as the checkout note reminds you, you are
on a detached HEAD.

    $ git checkout 0eddcb
    Note: checking out '0eddcb'.

    You are in 'detached HEAD' state. You can look around, make experimental
    changes and commit them, and you can discard any commits you make in this
    state without impacting any branches by performing another checkout.

    If you want to create a new branch to retain commits you create, you may
    do so (now or later) by using -b with the checkout command again. Example:

      git checkout -b new_branch_name

    HEAD is now at 0eddcbf... Add MYMETA.json to perl/.gitignore

But is that really the right commit?

    $ git log -1 --pretty=fuller
    commit 0eddcbf1612ed044de586777b233caf8016c6e70
    Author:     Jack Nagel <jacknagel@gmail.com>
    AuthorDate: Wed Dec 28 22:42:05 2011 -0600
    Commit:     Junio C Hamano <gitster@pobox.com>
    CommitDate: Thu Dec 29 13:08:47 2011 -0800

    Add MYMETA.json to perl/.gitignore
    ...

The rev-list date selection
uses the CommitDate field, not the
AuthorDate field. So it looks like
the last commit of 2011 in the Git repository happened on December 29,
2011.
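If you would like to try this without cloning the Git sources, the same query can be reproduced on a tiny scratch repository. In this sketch the commit dates are forced through the GIT_AUTHOR_DATE and GIT_COMMITTER_DATE environment variables; the helper function commit_on, the dates, and the messages are all invented.

```shell
#!/bin/sh
# Sketch: find the last commit made before a given moment with git rev-list.
# commit_on is a hypothetical helper defined here, not a Git command.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# commit_on <date> <message>: record a commit with fixed author and committer dates.
commit_on() {
    echo "$2" >> log.txt
    git add log.txt
    GIT_AUTHOR_DATE="$1" GIT_COMMITTER_DATE="$1" git commit -q -m "$2"
}

commit_on "2011-12-29 13:08:47 +0000" "Last commit of 2011"
commit_on "2012-01-01 01:18:53 +0000" "First commit of 2012"

# Spell out the time and zone so nothing is left for Git to guess.
found=$(git rev-list -n 1 --before="2012-01-01 00:00:00 +0000" HEAD)
subject=$(git log -1 --format=%s "$found")
echo "$found $subject"
```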

Date-based checkout cautions

A few words of caution are in order, though. Git’s date
handling is implemented using a function called approxidate(). Not that dates are inherently
approximate, but rather that Git’s interpretation of what you meant
is approximated, usually due to insufficient details or
precision.

    $ git rev-list -n 1 --before="Jan 1, 2012 00:00:00" master
    0eddcbf1612ed044de586777b233caf8016c6e70

    $ git rev-list -n 1 --before="Jan 1, 2012" master
    5c951ef47bf2e34dbde58bda88d430937657d2aa

I typed those two commands at 11:05 A.M. local time. For lack of
a specified time in the second command, Git
assumed I meant at this time on Jan 1, 2012.
Consequently, 11 more hours of leeway were available in which to match
commits.

    $ git log -1 --pretty=fuller 5c951ef
    commit 5c951ef47bf2e34dbde58bda88d430937657d2aa
    Author:     Clemens Buchacher <drizzd@aon.at>
    AuthorDate: Sat Dec 31 12:50:56 2011 +0100
    Commit:     Junio C Hamano <gitster@pobox.com>
    CommitDate: Sun Jan 1 01:18:53 2012 -0800

    Documentation: read-tree --prefix works with existing subtrees
    ...

This commit happened an hour and 18 minutes into the new year;
well within the 11 hours past midnight that I accidentally specified
in my second command.

Git’s Date Parsing

So does Git’s date parsing behavior even make sense?
Probably.

Git is trying to intuit the intended meaning behind vaguely
specified time requests. For example, how should yesterday be interpreted? As the previous
24-hour period? As the absolute time period midnight-to-midnight of
the previous calendar date? As some vague notion of yesterday’s
business working hours? Git happens to use the first interpretation:
the 24 hours prior to the current time. Generalizing, when a date used as a
starting or ending point is specified without a time, Git uses the current
time of day as the demarcation. If you wanted to be more precise about just
exactly when yesterday, you could have said
something like yesterday noon, or
5pm yesterday.

One more caution about date-based checkout. Although you may get
a valid answer to your query for a specific commit, that same question
at some later date may yield a different answer. For example, consider
a repository with several lines of development happening on different
branches. As previously, when you request the commit --before date
on a given branch, you get an
answer for the branch as it exists just then. At some later point in
time, however, new commits from other branches might be merged into
your branch, altering the notion of which commit might satisfy your
search conditions. In the previous January 1, 2012 example, someone
might merge in a commit from another branch that is closer to midnight
December 31, 2011 than December 29, 2011 at 13:08:47.

Retrieve Old Version of a File

Sometimes in the course of software archeology, you simply
want to retrieve an old version of a file from the repository history.
It seems overkill to use the techniques of a date-based checkout as
described in Date-Based Checkout because
that causes a complete change in your working directory state for every
directory and file just to get one file. In fact, it is even likely that
you want to keep your current working directory state but replace the
current version of just one file by reverting it to an earlier
version.

The first step is to identify a commit that contains the desired
version of the file. The direct approach is to use an explicit branch,
tag, or ref already known to have the correct version. In the absence of
that information, some searching has to be done. And when searching the
commit history, you should think about using some rev-list techniques to identify commits that
have the desired file. As previously seen, dates can be used to select
interesting commits. Git also allows the search to be restricted to a
particular file or set of files. Git calls this approach path limiting.
It provides the ultimate guide to possible previous
commits that might contain different versions of a file, or as Git calls
them, paths.

Again, let’s explore Git’s source repository itself to see what
previous versions of, say, date.c are available.

    $ git clone git://github.com/gitster/git.git
    Cloning into 'git'...
    remote: Counting objects: 126850, done.
    remote: Compressing objects: 100% (41033/41033), done.
    remote: Total 126850 (delta 93115), reused 117003 (delta 84141)
    Receiving objects: 100% (126850/126850), 27.56 MiB | 1.03 MiB/s, done.
    Resolving deltas: 100% (93115/93115), done.

    $ git rev-list master -- date.c
    ee646eb48f9a7fc6c225facf2b7449a8a65ef8f2
    f1e9c548ce45005521892af0299696204ece286b
    ...
    89967023da94c0d874713284869e1924797d30bb
    ecee9d9e793c7573cf3730fb9746527a0a7e94e7

Uh, yeah, something like 60-odd lines of SHA1 commit IDs. Fun! But
what does it all mean? And how do you use it?

Because I didn’t specify the -n 1 option, all
matching commit IDs have been generated and printed. The default is to
emit them in reverse chronological order. So this means commit ee646e contains the most recent version of the
file date.c, and ecee9d9 contains the oldest version. In fact,
looking at commit ecee9d9 shows the
file being introduced into the repository for the first time.

    $ git show --stat ecee9d9 --pretty=short
    commit ecee9d9e793c7573cf3730fb9746527a0a7e94e7
    Author: Edgar Toernig <froese@gmx.de>

    [PATCH] Do date parsing by hand...

     Makefile      |    4 +-
     cache.h       |    3 +
     commit-tree.c |   27 +--------
     date.c        |  184 +++++++++++++++++++++++++++++++++++++++++++++
     4 files changed, 191 insertions(+), 27 deletions(-)

Where you go from here to find your desired commit is kind
of sketchy. You could do git log
operations on randomly selected SHA1 values from that rev-list
output. Or you could binary search the time stamps on
commits from that list. Earlier we used the -n 1 option to
select the most recent. It’s really hard to say what trick might work in
your selection process to identify the precise commit that contains the
version of a file that is interesting to you.
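One trick that sometimes helps is to combine the two rev-list techniques seen so far: a --before date together with path limiting. The sketch below is run on a scratch repository with forced commit dates; the helper commit_on, the file contents, and the dates are invented for illustration.

```shell
#!/bin/sh
# Sketch: find the newest commit before a date that touched one particular file.
# commit_on is a hypothetical helper defined here, not a Git command.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# commit_on <date> <file> <content> <message>
commit_on() {
    printf '%s\n' "$3" > "$2"
    git add "$2"
    GIT_AUTHOR_DATE="$1" GIT_COMMITTER_DATE="$1" git commit -q -m "$4"
}

commit_on "2011-06-01 12:00:00 +0000" date.c  "version 1" "Add date.c"
commit_on "2011-12-30 12:00:00 +0000" other.c "unrelated" "Add other.c"
commit_on "2012-02-01 12:00:00 +0000" date.c  "version 2" "Rework date.c"

# Date limit plus path limit: commits that did not touch date.c are skipped.
found=$(git rev-list -n 1 --before="2012-01-01 00:00:00 +0000" HEAD -- date.c)
subject=$(git log -1 --format=%s "$found")
content=$(git show "$found:date.c")
echo "$subject: $content"
```

Here the commit from December 30 is newer than the one selected, but it is passed over because it never touched date.c.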

But once you have identified one of those
commits, how do you use it? What does that version of date.c look like? What if we wanted to
retrieve it in place?

There are three slightly different approaches you can use to get
that version of a file. The first form directly checks out the named
version and overwrites the existing version in your working
directory.

    $ git checkout ecee9d9 date.c

Tip

If you want to get the version of a file from a commit
and you don’t know its SHA1, but you do happen to know some text from
its commit log message, you can use this searching technique to obtain
it:

    $ git checkout :/"Fix PR-1705" main.c

The youngest commit found is used.

In two other very similar commands, Git accepts the form
commit:path to
name the desired file (i.e., path) as it existed at the time of the commit,
and writes the specified version of the file to
stdout. What you do with that output
is up to you, though. You could pipe the output to other commands or
create files:

    $ git show ecee9d9:date.c > date.c-oldest

Or:

    $ git cat-file -p 89967:date.c > date.c-first-change

The difference between these two forms is a bit esoteric. The
former filters the output file through any applicable text conversion
filters, whereas the latter is a more basic, plumbing command and does
not. Differences might show up between these two commands when manipulating binaries, when
textconv filters are set up, or
possibly during some newline handling transformations. If you want the
raw data, use the cat-file -p form. If you
want the transformed version as it would be when checked out or added to
the repository, use the show
form.
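Here is a small sketch that exercises all three retrieval forms side by side on a scratch repository. The file contents and output file names are invented; with a plain text file and no conversion filters configured, the show and cat-file forms should produce identical bytes.

```shell
#!/bin/sh
# Sketch: three ways to retrieve an older version of one file.
# Repository contents are hypothetical.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "version 1" > date.c
git add date.c
git commit -q -m "Add date.c"

echo "version 2" > date.c
git add date.c
git commit -q -m "Rework date.c"

first=$(git rev-parse HEAD~1)

# Forms 2 and 3: write the old contents to stdout and capture them in new files.
git show "$first:date.c" > date.c-oldest
git cat-file -p "$first:date.c" > date.c-plumbing

# Form 1: overwrite the working copy (and index entry) with the old version.
git checkout -q "$first" -- date.c

from_show=$(cat date.c-oldest)
from_cat_file=$(cat date.c-plumbing)
working_copy=$(cat date.c)
echo "$from_show | $from_cat_file | $working_copy"
```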

These are exactly the same mechanisms you would use to obtain
versions of a file as it appears in another branch:

    $ git checkout dev date.c 

    $ git show dev:date.c > date.c-dev

Or even earlier on the same branch:

    $ git checkout HEAD~2 date.c

Interactive Hunk Staging

Although a bit of an ominous moniker, interactive hunk
staging is nevertheless an incredibly powerful tool that can be used to
simplify and organize your development into concise and easily understood
commits. If anyone has ever asked you to split your patch up or make
single-concept patches, chances are good that this section is for
you!

Unless you are a super coder, and both think and develop in concise
patches, your day-to-day development probably resembles mine: a little
scattered, perhaps over-extended, and likely containing several
intertwined ideas all mixed up as they occurred to you. One coding thought
leads to another and pretty soon you fixed the original bug, stumbled onto
another (but fixed it!), and then added a new easy feature while you were
there. Oh, and, you fixed those two typos as well.

And, if you, like I do, appreciate having someone review
your changes to important code before you ask for it to be accepted
upstream, chances are good that having all of those different, unrelated
changes will not make for a logical presentation of a single patch.
Indeed, some open source projects insist that submitted patches contain
separate self-contained fixes. That is, a patch shouldn’t serve multiple
purposes in one shot. Instead, each idea should stand alone and should be
presentable as a well-defined, simple patch that is just large enough to
do the job and nothing more. If more than one idea needs to be upstreamed,
more than one patch, perhaps in a sequence, will be needed. Common wisdom
suggests that these sorts of patches and patch sequences lead to very
solid reviews, quick turnaround, and easy acceptance into the mainline
upstream development.

So how do these perfect patch sequences come about? Although I
strive for a development style that facilitates simple patches, I’m not
always successful. Nevertheless, Git provides some tools to help formulate
good patches. One of those tools is the ability to interactively select
and commit pieces, or hunks, of a patch, leaving the rest
to be committed in a later patch. Ultimately, you will want to create a
new sequence of smaller commits that still sum up to your original
work.

What Git won’t do for you is decide which conceptual pieces of a
patch belong together and which do not. You have to be able to discern the
meaning and grouping of hunks that make logical sense together. Sometimes
those hunks are all in one file, but sometimes they are in multiple files.
Collectively, all the conceptually related hunks must be selected and
staged together as part of one commit.

Furthermore, you must ensure that your selection of hunks still
meets any external requirements. For example, if you are writing source
code that must be compiled, you will likely want to ensure that the code
base continues to be compilable after each commit. Thus, you must ensure
that your patch breakup, when reassembled in smaller parts, still compiles
at each commit within the new sequence. Git can’t do that for you; that’s
the part where you have to think. Sorry.

Staging hunks interactively is as easy as adding the
-p option to the git add command!

    $ git add -p file.c

Interactive hunk staging looks pretty easy, and it is. But we should
probably still have a mental model in mind of what Git is doing with our
patches. Remember way back in Chapter 5,
I explained how Git maintains the index as a staging area that accumulates
your changes prior to committing them. That’s still happening. But instead
of gathering the changes an entire file at a time, Git is picking apart
the changes you have made in your working copy of a file, and allowing you
to select which individual part or parts to stage in the index, waiting to
be committed.

Let’s suppose we’re developing a program to print out a histogram of
white-space–separated words found in a file. The very first version of
this program is the Hello, World! program that proves
things are starting out on the right compilation track. Here’s main.c:

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /*
         * Print a histogram of words found in a file.
         * "Words" are any whitespace separated characters.
         * Words are listed in no particular order.
         * FIXME: Implementation needed still!
         */
        printf("Histogram of words\n");
    }

Add a Makefile and .gitignore, and put it all in a new
repository:

    $ mkdir /tmp/histogram
    $ cd /tmp/histogram
    $ git init
    Initialized empty Git repository in /tmp/histogram/.git/
    $ git add main.c Makefile .gitignore

    $ git commit -m "Initial histogram program."
    [master (root-commit) 3ba81f7] Initial histogram program.
     3 files changed, 18 insertions(+), 0 deletions(-)
     create mode 100644 .gitignore
     create mode 100644 Makefile
     create mode 100644 main.c

Let’s do some miscellaneous development until main.c looks like this:

    #include <stdio.h>
    #include <stdlib.h>

    struct htentry {
        char *item;
        int count;
        struct htentry *next;
    };

    struct htentry ht_table[256];

    void ht_init(void)
    {
        /* FIXME: details */
    }

    int main(int argc, char **argv)
    {
        FILE *f;

        f = fopen(argv[1], "r");
        if (f == 0)
            exit(-1);

        /*
         * Print a histogram of words found in a file.
         * "Words" are any whitespace separated characters.
         * Words are listed in no particular order.
         * FIXME: Implementation needed still!
         */
        printf("Histogram of words\n");

        ht_init();
    }

Notice that this development effort has introduced two conceptually
different changes: the hash table structure and storage, and the
beginnings of the file reading operation. In a perfect world, these two
concepts would be introduced into the program with two separate patches.
It will take us a couple of steps to get there, but Git will help us split
these changes properly.

Git, along with most of the Free World, considers a hunk to be any
series of lines from a diff command
that are delineated by a line that looks something like this:

    @@ -1,7 +1,27 @@

or this:

    @@ -9,4 +29,6 @@ int main(int argc, char **argv)

In this case, git diff
shows two hunks:

    $ git diff
    diff --git a/main.c b/main.c
    index 9243ccf..b07f5dd 100644
    --- a/main.c
    +++ b/main.c
    @@ -1,7 +1,27 @@
     #include <stdio.h>
    +#include <stdlib.h>
    +
    +struct htentry {
    +       char *item;
    +       int count;
    +       struct htentry *next;
    +};
    +
    +struct htentry ht_table[256];
    +
    +void ht_init(void)
    +{
    +       /* FIXME: details */
    +}

     int main(int argc, char **argv)
     {
    +       FILE *f;
    +
    +       f = fopen(argv[1], "r");
    +       if (f == 0)
    +               exit(-1);
    +
        /*
         * Print a histogram of words found in a file.
         * "Words" are any whitespace separated characters.
    @@ -9,4 +29,6 @@ int main(int argc, char **argv)
         * FIXME: Implementation needed still!
         */
        printf("Histogram of words\n");
    +
    +       ht_init();
     }

The first hunk starts with the line @@ -1,7 +1,27 @@
and finishes at the start of the second hunk: @@ -9,4 +29,6 @@ int main(int argc, char **argv).
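If the numbers in a hunk header look opaque, this small sketch pulls them apart using nothing but the shell and sed. The variable names are my own, and the sketch assumes both line counts are present in the header (Git omits a count when it is 1).

```shell
#!/bin/sh
# Sketch: decode the ranges in a unified-diff hunk header.
header='@@ -9,4 +29,6 @@ int main(int argc, char **argv)'

# Keep only the "-9,4 +29,6" part between the @@ markers.
ranges=$(printf '%s\n' "$header" | sed -e 's/^@@ \(.*\) @@.*$/\1/')

old=${ranges% *}                  # "-9,4"  : range in the original file
new=${ranges#* }                  # "+29,6" : range in the new file
old=${old#-}
new=${new#+}

old_start=${old%,*}
old_count=${old#*,}
new_start=${new%,*}
new_count=${new#*,}

echo "old file: $old_count lines starting at line $old_start"
echo "new file: $new_count lines starting at line $new_start"
```

Anything after the second @@ is just a hint about the enclosing function and is not part of the range.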

When interactively staging hunks with git add -p, Git offers a choice
for each hunk in turn: do you want to stage it?

But let’s look at our patch a bit more closely and consider the need
to break up the pieces so that conceptually related parts are all gathered
up and staged at the same time. That means we’d like to stage all the hash
table parts together in one patch, and then stage all the file operations
in a second patch. Unfortunately, it looks like the first hunk has
both hash table and file operation pieces in one
hunk! That means, for the purposes of the first commit (i.e., the hash
table pieces), we want to both stage it and not stage it. Or more
precisely, we want to stage part of the hunk. If Git
only asks us about the first and second hunks, we are in trouble.

But, not to worry! The hunk staging will allow us to
split a hunk. Any place where a contiguous sequence
of added and deleted lines identified by a plus or minus in the first
column is broken up by original context text, a split operation may be
performed.

Let’s see how this works by starting with a git add -p main.c command:

    $ git add -p main.c
    diff --git a/main.c b/main.c
    index 4809266..c60b800 100644
    --- a/main.c
    +++ b/main.c
    @@ -1,7 +1,27 @@
     #include <stdio.h>
    +#include <stdlib.h>
    +
    +struct htentry {
    +    char *item;
    +    int count;
    +    struct htentry *next;
    +};
    +
    +struct htentry ht_table[256];
    +
    +void ht_init(void)
    +{
    +    /* FIXME: details */
    +}

     int main(int argc, char **argv)
     {
    +    FILE *f;
    +
    +    f = fopen(argv[1], "r");
    +    if (f == 0)
    +        exit(-1);
    +
        /*
         * Print a histogram of words found in a file.
         * "Words" are any whitespace separated characters.
    Stage this hunk [y,n,q,a,d,/,j,J,g,s,e,?]?

After reviewing this hunk and seeing both the hash table and file
operation related changes there, you realize you need to both stage and
not stage this hunk. That is your clue to answer s, for split, to the question.

    Stage this hunk [y,n,q,a,d,/,j,J,g,s,e,?]? s
    Split into 2 hunks.
    @@ -1,4 +1,18 @@
     #include <stdio.h>
    +#include <stdlib.h>
    +
    +struct htentry {
    +    char *item;
    +    int count;
    +    struct htentry *next;
    +};
    +
    +struct htentry ht_table[256];
    +
    +void ht_init(void)
    +{
    +    /* FIXME: details */
    +}

     int main(int argc, char **argv)
     {
    Stage this hunk [y,n,q,a,d,/,j,J,g,e,?]?

Excellent. We want this hunk staged.

    Stage this hunk [y,n,q,a,d,/,j,J,g,e,?]? y

And immediately next up:

    @@ -2,6 +16,12 @@

     int main(int argc, char **argv)
     {
    +    FILE *f;
    +
    +    f = fopen(argv[1], "r");
    +    if (f == 0)
    +        exit(-1);
    +
        /*
         * Print a histogram of words found in a file.
         * "Words" are any whitespace separated characters.
    Stage this hunk [y,n,q,a,d,/,K,j,J,g,e,?]?

But not that one.

    Stage this hunk [y,n,q,a,d,/,K,j,J,g,e,?]? n

And finally, Git offers to stage the last hunk. We want it,
too.

    @@ -9,4 +29,6 @@ int main(int argc, char **argv)
         * FIXME: Implementation needed still!
         */
        printf("Histogram of words\n");
    +
    +    ht_init();
     }
    Stage this hunk [y,n,q,a,d,/,K,g,e,?]? y

Let’s review. Originally, there were two hunks. But we wanted only
part of the first hunk and all of the second. So when Git offered us the
first hunk we had to split it into two subhunks. We then staged the first
subhunk, and not the second subhunk. We then staged the entire original
second hunk.

Verifying that the staged pieces look correct is easy:

    $ git diff --staged
    diff --git a/main.c b/main.c
    index 4809266..8a95bb0 100644
    --- a/main.c
    +++ b/main.c
    @@ -1,4 +1,18 @@
     #include <stdio.h>
    +#include <stdlib.h>
    +
    +struct htentry {
    +       char *item;
    +       int count;
    +       struct htentry *next;
    +};
    +
    +struct htentry ht_table[256];
    +
    +void ht_init(void)
    +{
    +       /* FIXME: details */
    +}

     int main(int argc, char **argv)
     {
    @@ -9,4 +23,6 @@ int main(int argc, char **argv)
         * FIXME: Implementation needed still!
         */
        printf("Histogram of words\n");
    +
    +       ht_init();
     }

That looks good, so you can go ahead and commit it. Don’t worry that
there are lingering differences remaining in the file main.c. That’s by design because it is the
next patch! Oh, and don’t name the file on this
next git commit command because that
would commit the entire file and not just the staged parts.

    $ git commit -m "Introduce a Hash Table."
    [master 66a212c] Introduce a Hash Table.
     1 files changed, 16 insertions(+), 0 deletions(-)

    $ git diff
    diff --git a/main.c b/main.c
    index 8a95bb0..c60b800 100644
    --- a/main.c
    +++ b/main.c
    @@ -16,6 +16,12 @@ void ht_init(void)

     int main(int argc, char **argv)
     {
    +       FILE *f;
    +
    +       f = fopen(argv[1], "r");
    +       if (f == 0)
    +               exit(-1);
    +
        /*
         * Print a histogram of words found in a file.
         * "Words" are any whitespace separated characters.

And with that, just add and commit the remaining change because it
is the total material for the file operations patch.

    $ git add main.c
    $ git commit -m "Open the word source file."
    [master e649d27] Open the word source file.
     1 files changed, 6 insertions(+), 0 deletions(-)

A glance at the commit history shows two new commits:

    $ git log --graph --oneline
    * e649d27 Open the word source file.
    * 66a212c Introduce a Hash Table.
    * 3ba81f7 Initial histogram program.

And that is a happy patch sequence!

As usual, there are a few caveats and extenuating circumstances. For
instance, what about that sneaky line:

    #include <stdlib.h>

Doesn’t it really belong with the file operation patch and not the
hash table patch? Yep. You got me. It does.

That’s a bit trickier to handle. But let’s do it anyway. We’ll have
to use the e option. First, reset to
the first commit and leave all those changes in your working tree so we
can do it all over again.

    $ git reset 3ba81f7
    Unstaged changes after reset:
    M    main.c
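If you are wary of that reset, this sketch shows in a scratch repository that a plain git reset to an earlier commit moves the branch back while leaving the file contents in the working tree. The file name and commit messages are invented for the example.

```shell
#!/bin/sh
# Sketch: a plain (mixed) reset rewinds the branch but keeps your edits.
set -e
work=$(mktemp -d)                 # throwaway repository (hypothetical)
cd "$work"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo one > file
git add file
git commit -q -m "first"
echo two >> file
git commit -q -m "second" file

first=$(git rev-parse HEAD~1)
git reset -q "$first"

status=$(git status --porcelain)
echo "HEAD is now $(git rev-parse --short HEAD)"
echo "status: $status"
cat file                          # still contains both lines
```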

Do the git add -p again, and
split the first patch just like before. But this time, instead of
answering y to the first subhunk
staging request, answer e and request
to edit the patch:

    $ git add -p
    diff --git a/main.c b/main.c
    index 4809266..c60b800 100644
    --- a/main.c
    +++ b/main.c
    @@ -1,7 +1,27 @@
     #include <stdio.h>
    +#include <stdlib.h>
    +
    +struct htentry {
    +    char *item;
    +    int count;
    +    struct htentry *next;
    +};
    +
    +struct htentry ht_table[256];
    +
    +void ht_init(void)
    +{
    +    /* FIXME: details */
    +}

     int main(int argc, char **argv)
     {
    +    FILE *f;
    +
    +    f = fopen(argv[1], "r");
    +    if (f == 0)
    +        exit(-1);
    +
        /*
         * Print a histogram of words found in a file.
         * "Words" are any whitespace separated characters.
    Stage this hunk [y,n,q,a,d,/,j,J,g,s,e,?]? s
    Split into 2 hunks.
    @@ -1,4 +1,18 @@
     #include <stdio.h>
    +#include <stdlib.h>
    +
    +struct htentry {
    +    char *item;
    +    int count;
    +    struct htentry *next;
    +};
    +
    +struct htentry ht_table[256];
    +
    +void ht_init(void)
    +{
    +    /* FIXME: details */
    +}

     int main(int argc, char **argv)
     {
    Stage this hunk [y,n,q,a,d,/,j,J,g,e,?]? e

You will be placed in your favorite editor[43] and allowed the chance to manually edit the patch. Read the
comment at the bottom of the editor buffer. Carefully delete that one
#include <stdlib.h> line. Don’t
disturb the context lines, and don’t mess with the line counts. Git, and
most any patch program, will lose its mind if you mess with the context
lines. However, my editor updates the line counts automatically.

In this case, because the #include line was removed, it will be swept up
in the remainder of the patches that get formed. This effectively
introduces it at the correct time in the patch with the other file
operation changes.

It is kind of tricky here, but Git now assumes that when you exit
your editor, the patch that is left in your editor should be applied and
its effects staged. So it offers you the following
hunk and lets you choose its disposition. Be careful.

Because Git has moved on to the file operation changes, don’t stage
those changes yet, but do pick up the last hash table change:

    @@ -2,6 +16,12 @@

     int main(int argc, char **argv)
     {
    +    FILE *f;
    +
    +    f = fopen(argv[1], "r");
    +    if (f == 0)
    +        exit(-1);
    +
        /*
         * Print a histogram of words found in a file.
         * "Words" are any whitespace separated characters.
    Stage this hunk [y,n,q,a,d,/,K,j,J,g,e,?]? n
    @@ -9,4 +29,6 @@ int main(int argc, char **argv)
         * FIXME: Implementation needed still!
         */
        printf("Histogram of words\n");
    +
    +    ht_init();
     }
    Stage this hunk [y,n,q,a,d,/,K,g,e,?]? y

The separation can be verified, noting that the #include <stdlib.h> line has been
correctly associated with the file operations now:

    $ git diff
    diff --git a/main.c b/main.c
    index 3e77315..c60b800 100644
    --- a/main.c
    +++ b/main.c
    @@ -1,4 +1,5 @@
     #include <stdio.h>
    +#include <stdlib.h>

     struct htentry {
        char *item;
    @@ -15,6 +16,12 @@ void ht_init(void)

     int main(int argc, char **argv)
     {
    +       FILE *f;
    +
    +       f = fopen(argv[1], "r");
    +       if (f == 0)
    +               exit(-1);
    +
        /*
         * Print a histogram of words found in a file.
         * "Words" are any whitespace separated characters.

As before, wrap up with a git commit
for the hash table patch, then stage and commit the
remaining file operation pieces.

I’ve only touched on the essential responses to the Stage this hunk?
question. In fact, even more options than those listed
in its prompt (i.e., [y,n,q,a,d,/,K,g,e,?]) are available. There are
options to delay the fate of a hunk and then revisit it when prompted
again later.

Furthermore, although this example only had two hunks in one file,
the staging operation generalizes to many hunks, possibly split, in many
files. Pulling together changes across multiple files can be a simple
process of applying git add -p to each
file that has a hunk needing to be staged.

However, there is another, outer level to the whole interactive hunk
staging process that can be invoked using the git add -i
command. It can be a bit cryptic, but its purpose is to
allow you to select which paths (i.e., files) to stage in the index. As a
sub-option, you may then select the patch option for your chosen paths. This enters
the previously described per-file staging mechanism.

Recovering a Lost Commit

Occasionally, an ill-timed git reset
command or an accidental branch deletion leaves you
wishing you hadn’t lost the development it represented, and wishing you
could recover it somehow. The usual approach to recovering such work is to
inspect your reflog as shown in Chapter 11.
Sometimes the reflog isn’t available, perhaps because it has been turned
off (e.g., core.logAllRefUpdates = false),
because you are manipulating a bare repository directly,
or perhaps because the reflog has simply expired. For whatever reason,
sometimes the reflog cannot help recover a lost commit.
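For comparison, here is what the usual reflog route looks like when the reflog is available. This sketch loses a commit with a hard reset in a scratch repository and then reads its SHA1 back from the reflog; the file and commit messages are invented for the example.

```shell
#!/bin/sh
# Sketch: find a commit lost to "git reset --hard" by way of the reflog.
set -e
work=$(mktemp -d)                 # throwaway repository (hypothetical)
cd "$work"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo foo > file
git add file
git commit -q -m "Add some foo"
echo bar >> file
git commit -q -m "Add some bar" file

lost=$(git rev-parse HEAD)        # remembered only to check the result
git reset -q --hard HEAD^

git reflog                        # every recent position of HEAD
found=$(git rev-parse 'HEAD@{1}') # where HEAD was just before the reset
echo "recovered id: $found"
```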

The git fsck Command

Although not foolproof, Git provides the command git fsck to help locate lost data. The word
fsck is an old abbreviation for file system check.
Although this command does not check your filesystem, it
does have many characteristics and algorithms that are quite similar to
a traditional filesystem check, and results in some of the same output
data as well.

Understanding how git fsck
works is predicated on a good understanding of Git’s data structures as
described in Chapter 4. Normally, every
object in the Git repository, whether it is a blob, tree, commit, or
tag, is connected to another object and anchored to a branch name, tag
name, or some other symbolic ref such as a reflog name.

However, various commands and manipulations can leave
objects in the object store that are not linked into the complete data
structure somehow. These objects are called unreachable
or dangling. They are unreachable because a traversal of
the full data structure that starts from every named ref and follows
every tag, commit, commit parent,
and tree object reference will never encounter the lost object. In a
sense, it is out there dangling on its own.

But traversing the ref-based commit graph isn’t the only way to
walk every object in the database! Consider simply listing the objects
in your object store using ls
directly:

    $ cd path/to/some/repo
    $ ls -R .git/objects/
    .git/objects/:
    25  3b  73  82  info  pack

    .git/objects/25:
    7cc5642cb1a054f08cc83f2d943e56fd3ebe99

    .git/objects/3b:
    d1f0e29744a1f32b08d5650e62e2e62afb177c

    .git/objects/73:
    8d05ac5663972e2dcf4b473e04b3d1f19ba674

    .git/objects/82:
    b5fee28277349b6d46beff5fdf6a7152347ba0

    .git/objects/info:

    .git/objects/pack:

In this simple example, the set of objects in the repository has
been listed without doing a traversal of the refs and commits.

By carefully comparing the total set of objects with those
reachable via a traversal of the ref-based commit graph, you can
determine all of the unreferenced objects. From the previous example,
the second object listed turns out to be an unreferenced blob (i.e.,
file):

    $ git fsck
    Checking object directories: 100% (256/256), done.
    dangling blob 3bd1f0e29744a1f32b08d5650e62e2e62afb177c

Let’s follow an example that shows how a lost commit can occur,
and see how git fsck can recover it.
First, construct a simple, new repository with a single simple file in
it.

    $ mkdir /tmp/lost
    $ cd /tmp/lost
    $ git init
    Initialized empty Git repository in /tmp/lost/.git/
    $ echo "foo" >> file
    $ git add file
    $ git commit -m "Add some foo"
    [master (root-commit) f85b097] Add some foo
     1 files changed, 1 insertions(+), 0 deletions(-)
     create mode 100644 file

    $ git fsck
    Checking object directories: 100% (256/256), done.

    $ ls -R .git/objects/
    .git/objects/:
    25  4a  f8  info  pack

    .git/objects/25:
    7cc5642cb1a054f08cc83f2d943e56fd3ebe99

    .git/objects/4a:
    1c03029e7407c0afe9fc0320b3258e188b115e

    .git/objects/f8:
    5b097ee0f77c5f4dc1868037acbffe59b0e93e

    .git/objects/info:

    .git/objects/pack:

Notice that there are only three objects and none of them are
dangling. In fact, starting from the master ref, which is the f85b097ee commit object, the traversal points
to the tree object 4a1c0302 and then
the blob 257cc564.

Tip

The command git cat-file -t object-id can be used to
determine an object’s type.
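As a small sketch of that traversal, the following script makes a one-file commit in a scratch repository and asks git cat-file -t about each of the three objects. The repository location and file are assumptions of the example.

```shell
#!/bin/sh
# Sketch: walk commit -> tree -> blob and report each object's type.
set -e
work=$(mktemp -d)                 # throwaway repository (hypothetical)
cd "$work"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo foo > file
git add file
git commit -q -m "Add some foo"

commit=$(git rev-parse HEAD)
tree=$(git rev-parse 'HEAD^{tree}')   # the tree the commit points to
blob=$(git rev-parse HEAD:file)       # the blob that tree lists for "file"

for id in "$commit" "$tree" "$blob"; do
    printf '%s %s\n' "$(git cat-file -t "$id")" "$id"
done
```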

Now let’s make a second commit, and then hard reset back to the
first commit:

    $ echo bar >> file
    $ git commit -m "Add some bar" file
    [master 11e0dc9] Add some bar
     1 files changed, 1 insertions(+), 0 deletions(-)

And now the accident that causes us to lose a
commit:

    $ git reset --hard HEAD^
    HEAD is now at f85b097 Add some foo

    $ git fsck
    Checking object directories: 100% (256/256), done.

But wait! git fsck doesn’t
report any dangling object. It doesn’t seem to be lost after all. This
is exactly what the reflog is designed to do: prevent you from
accidentally losing commits. (See The Reflog.)

So let’s try again after brutally eliminating the
reflog:

    # Not recommended; this is for purposes of exposition only!
    $ rm -rf .git/logs
    $ git fsck
    Checking object directories: 100% (256/256), done.
    dangling commit 11e0dc9c11d8f650711b48c4a5707edf5c8a02fe

    $ ls -R .git/objects/
    .git/objects/:
    11  25  3b  41  4a  f8  info  pack

    .git/objects/11:
    e0dc9c11d8f650711b48c4a5707edf5c8a02fe

    .git/objects/25:
    7cc5642cb1a054f08cc83f2d943e56fd3ebe99

    .git/objects/3b:
    d1f0e29744a1f32b08d5650e62e2e62afb177c

    .git/objects/41:
    31fe4d33cd85da805ac9a6697c2251c913881c

    .git/objects/4a:
    1c03029e7407c0afe9fc0320b3258e188b115e

    .git/objects/f8:
    5b097ee0f77c5f4dc1868037acbffe59b0e93e

    .git/objects/info:

    .git/objects/pack:

Tip

You can use the git fsck --no-reflogs
command to find dangling objects as if the reflog were not
available to reference commits. That is, objects that are only
reachable from the reflog will be considered unreachable.
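Here is a sketch of the gentler route, run in a scratch repository so nothing of value is at risk. It loses a commit with a hard reset and then compares what git fsck reports with and without the reflog taken into account. File names and messages are invented for the example.

```shell
#!/bin/sh
# Sketch: a commit held only by the reflog shows up with --no-reflogs.
set -e
work=$(mktemp -d)                 # throwaway repository (hypothetical)
cd "$work"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo foo > file
git add file
git commit -q -m "Add some foo"
echo bar >> file
git commit -q -m "Add some bar" file

lost=$(git rev-parse HEAD)
git reset -q --hard HEAD^

with_reflog=$(git fsck 2>/dev/null)
without_reflog=$(git fsck --no-reflogs 2>/dev/null)

echo "default fsck:      ${with_reflog:-nothing dangling}"
echo "fsck --no-reflogs: $without_reflog"
```

Unlike deleting .git/logs, this leaves the reflog intact for later use.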

Now we can see that only the reflog was referencing the second
commit 11e0dc9c in which the
bar content was added.

But how would we even know what that dangling commit is?

    $ git show 11e0dc9c
    commit 11e0dc9c11d8f650711b48c4a5707edf5c8a02fe
    Author: Jon Loeliger <jdl@example.com>
    Date:   Sun Feb 10 11:59:59 2012 -0600

    Add some bar

    diff --git a/file b/file
    index 257cc56..3bd1f0e 100644
    --- a/file
    +++ b/file
    @@ -1 +1,2 @@
     foo
    +bar

    # The "index" line above named blob 3bd1f0e

    $ git show 3bd1f0e
    foo
    bar

Note that the blob 3bd1f0e is
not considered dangling because it is actually referenced by the commit
11e0dc9c, even though the commit
itself is unreferenced.

Sometimes, though, git fsck
will find blobs that are unreferenced. Remember, every
time you git add a file to the index,
its blob is added to the object store. If you subsequently change that
content and re-add it, no commit will have captured the intermediate
blob that was added to the object store. Thus, it will be
unreferenced.

    $ echo baz >> file
    $ git add file
    $ git fsck
    Checking object directories: 100% (256/256), done.
    dangling commit 11e0dc9c11d8f650711b48c4a5707edf5c8a02fe

    $ echo quux >> file
    $ git add file
    $ git fsck
    Checking object directories: 100% (256/256), done.
    dangling blob 0c071e1d07528f124e31f1b6c71348ec13f21a7a
    dangling commit 11e0dc9c11d8f650711b48c4a5707edf5c8a02fe

The first git fsck
didn’t show a dangling blob because that blob was still referenced
directly by the index. Only after the content associated with the
pathname file was changed again and
re-added did that blob become dangling.

    $ git show 0c071e1d
    foo
    baz

If you find you have a very cluttered git fsck
report consisting entirely of unnecessary blobs and
commits and want to clean it up, consider running garbage collection as
described in Garbage Collection.

Reconnecting a Lost Commit

Although using git fsck
is a handy way to discover the SHA1 of lost commits and blobs, I
mentioned the reflog earlier as another mechanism. In fact, you could
even cut and paste the SHA1 from some lingering line of output found by scrolling
back over your terminal output log. Ultimately, it doesn’t matter how
you discover the SHA1 of a lost blob or commit. The question remains,
once you know it, how do you reconnect it or otherwise incorporate it
into your project?

By definition, blobs are nameless file content. All you really
have to do to reestablish a blob is place that content into a file and
git add it again. As I showed in the
previous section, git show can be
used on the blob SHA1 to obtain the full object content. Just redirect
that to your desired file:

    $ git show 0c071e1d > file2
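Putting the pieces together, this sketch reproduces the intermediate-blob situation in a scratch repository, picks the dangling blob out of the git fsck report with awk, and writes it back to a file. The awk filter and the file2 name are choices made for the example, and the sketch assumes there is exactly one dangling blob.

```shell
#!/bin/sh
# Sketch: locate a dangling blob with git fsck and restore its content.
set -e
work=$(mktemp -d)                 # throwaway repository (hypothetical)
cd "$work"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo foo > file
git add file
git commit -q -m "Add some foo"

echo baz >> file
git add file                      # intermediate blob enters the object store
echo quux >> file
git add file                      # index moves on; the earlier blob dangles

dangling=$(git fsck 2>/dev/null |
    awk '$1 == "dangling" && $2 == "blob" { print $3 }')

git show "$dangling" > file2      # recover the content under a new name
git add file2
cat file2
```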

On the other hand, reconnecting a commit might depend on what you
want to do with it. The simple example from the previous section is only
one commit. But it could equally well have been the first commit in an
entire sequence of commits that was lost. Maybe even an entire branch
was accidentally lost! Consequently, the usual practice is to reintroduce
a lost commit as a branch.

Here, the previously lost commit that introduced the bar content,
11e0dc9c, is re-introduced on the new branch called
recovered:

    $ git branch recovered 11e0dc9c
    $ git show-branch
    * [master] Add some foo
     ! [recovered] Add some bar
    --
     + [recovered] Add some bar
    *+ [master] Add some foo

From there it can be manipulated (kept as is, merged, etc.) as you
wish.
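The whole recovery, from accident to merge, might be sketched like this in a scratch repository. The awk filter, the branch name recovered, and the choice of a fast-forward merge are assumptions of the example, and the sketch assumes git fsck reports a single dangling commit.

```shell
#!/bin/sh
# Sketch: find a dangling commit, give it a branch name, and merge it back.
set -e
work=$(mktemp -d)                 # throwaway repository (hypothetical)
cd "$work"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo foo > file
git add file
git commit -q -m "Add some foo"
echo bar >> file
git commit -q -m "Add some bar" file

tip=$(git rev-parse HEAD)         # remembered only to check the result
git reset -q --hard HEAD^         # the accident

# Ignore the reflog so the lost commit is reported as dangling.
lost=$(git fsck --no-reflogs 2>/dev/null |
    awk '$1 == "dangling" && $2 == "commit" { print $3 }')

git branch recovered "$lost"      # anchor the commit to a name
git merge -q --ff-only recovered  # here, simply move the branch forward again

git log --oneline
```

Once the commit has a branch name, it is reachable again and is no longer a candidate for garbage collection.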


[40] François-Marie Arouet, of course!

[41] Due to the scripting context for each filter, it’s likely to
stay that way, too.

[42] But also see the section called Checklist for Shrinking a
Repository from the git-filter-branch manual page.

[43] emacs, right?
