Git – File Management and the Index

When your project is under the care of a VCS, you edit in your
working directory and commit your changes to your repository for
safekeeping. Git works similarly but inserts another layer, the
index, between the working directory and the
repository to stage, or collect, alterations. When
you manage your code with Git, you edit in your working directory,
accumulate changes in your index, and commit whatever has amassed in the
index as a single changeset.

You can think of Git’s index as a set of intended or prospective
modifications. You add, remove, move, or repeatedly edit files right up to
the culminating commit, which actualizes the accumulated changes in the
repository. Most of the critical work actually precedes the commit
step.

Tip

Remember, a commit is a two-step process: stage your changes and commit the
changes. An alteration found in the working directory but not in the index
isn’t staged and thus can’t be committed.
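As a minimal sketch of that two-step rule, the script below edits a tracked file without staging it and then attempts a commit. The temporary directory, the identity settings, and the file name are placeholders chosen for illustration.

```shell
# Throwaway repository; path and identity are placeholders.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "first" > index.html
git add index.html                  # step 1: stage
git commit -q -m "Add index.html"   # step 2: commit

echo "second" >> index.html         # edit, but do not stage

# With nothing staged, git commit refuses and exits non-zero.
if git commit -q -m "Unstaged edit" >/dev/null 2>&1; then
    committed=yes
else
    committed=no
fi
count=$(git rev-list --count HEAD)
echo "second commit created: $committed (commits so far: $count)"
```

The edit stays in the working directory until it is staged with git add.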

For convenience, Git allows you to combine the two steps when you change a file that is already tracked:

    $ git commit index.html

But if you move or remove a file, you don’t have that luxury. The
two steps must then be separate:

    $ git rm index.html
    $ git commit

This chapter[9] explains how to manage the index and your corpus of files. It
describes how to add and remove a file from your repository, how to rename a
file, and how to catalog the state of the index. The finale of this chapter
shows how to make Git ignore temporary and other irrelevant files that need
not be tracked by version control.

It’s All About the Index

Linus Torvalds argued on the Git mailing list that you can’t grasp and
fully appreciate the power of Git without first understanding the purpose
of the index.

Git’s index doesn’t contain any file content; it simply tracks what
you want to commit. When you run git commit, Git checks the index rather than your working directory
to discover what to commit. (Commits are covered fully in Chapter 6.)

Although many of Git’s porcelain (higher level)
commands are designed to hide the details of the index from you and make
your job easier, it is still important to keep the index and its state in
mind.

You can query the state of the index at any time with the command git status.
It explicitly calls out what files Git considers staged.
You can also peer into the internal state of Git with
plumbing commands such as git ls-files.

You’ll also likely find the git diff command useful during staging. (Diffs are discussed extensively in
Chapter 8.) This command can display two different
sets of changes: git diff displays the
changes that remain in your working directory and are not staged; git diff --cached shows changes that are staged
and will therefore contribute to your next commit.

You can use both variations of git diff to guide you through the process of staging changes.
Initially, the output of git diff is a large set of
all modifications, and the output of git diff --cached is empty. As you stage,
the former set will shrink and the latter set will grow. If all your
working changes are staged and ready for a commit, the
--cached output will be full and git diff will show nothing.
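The shrinking and growing sets can be watched directly. The following sketch assumes a scratch repository with two hypothetical files, a.txt and b.txt, and counts the paths reported by each form of git diff as staging proceeds.

```shell
# Throwaway repository; path, identity, and file names are placeholders.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "one" > a.txt
echo "two" > b.txt
git add a.txt b.txt
git commit -q -m "Initial"

# Modify both tracked files without staging them.
echo "more" >> a.txt
echo "more" >> b.txt

# Count the paths listed by "git diff" plus any extra options.
count() { git diff "$@" --name-only | wc -l | tr -d ' '; }

unstaged_before=$(count);  staged_before=$(count --cached)
git add a.txt
unstaged_mid=$(count);     staged_mid=$(count --cached)
git add b.txt
unstaged_after=$(count);   staged_after=$(count --cached)

echo "before: unstaged=$unstaged_before staged=$staged_before"
echo "middle: unstaged=$unstaged_mid staged=$staged_mid"
echo "after:  unstaged=$unstaged_after staged=$staged_after"
```

Each git add moves one path from the first set to the second.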

File Classifications in Git

Git classifies your files into three groups: tracked, ignored,
and untracked.

Tracked

A tracked file is any file already in the repository or any file
that is staged in the index. To add a new file somefile to
this group, run git add somefile.

Ignored

An ignored file must be explicitly declared invisible or ignored in
the repository even though it may be present within your working
directory. A software project tends to have a good number of ignored
files. Common ignored files include temporary and scratch files,
personal notes, compiler output, and most files generated
automatically during a build. Git does not ship with a built-in list of files
to ignore, so you configure your repository to recognize
them. Ignored files are discussed in detail later in this chapter
(see The .gitignore File).

Untracked

An untracked file is any file not found in either of the previous two
categories. Git considers the entire set of files in your working
directory and subtracts both the tracked files and the ignored files
to yield what is untracked.

Let’s explore the different categories of files by creating a brand
new working directory and repository and then working with some
files.

    $ mkdir /tmp/my_stuff
    $ cd /tmp/my_stuff
    $ git init

    $ git status
    # On branch master
    #
    # Initial commit
    #
    nothing to commit (create/copy files and use "git add" to track)

    $ echo "New data" > data

    $ git status
    # On branch master
    #
    # Initial commit
    #
    # Untracked files:
    #   (use "git add <file>..." to include in what will be committed)
    #
    #       data
    nothing added to commit but untracked files present (use "git add" to track)

Initially, there are no files and the tracked, ignored, and
untracked sets are empty. Once you create data, git status
reports a single, untracked file.

Editors and build environments often leave temporary or transient
files among your source code. Such files usually shouldn’t be tracked as
source files in a repository. To have Git ignore a file within a
directory, simply add that file’s name to the special file .gitignore:

    # Manually create an example junk file
    $ touch main.o

    $ git status
    # On branch master
    #
    # Initial commit
    #
    # Untracked files:
    #   (use "git add <file>..." to include in what will be committed)
    #
    #       data
    #       main.o

    $ echo main.o > .gitignore
    $ git status
    # On branch master
    #
    # Initial commit
    #
    # Untracked files:
    #   (use "git add <file>..." to include in what will be committed)
    #
    #       .gitignore
    #       data

Thus main.o is ignored, but
git status now shows a new, untracked
file called .gitignore. Although the .gitignore file has special meaning to Git, it is managed just like any
other normal file within your repository. Until .gitignore is added, Git considers it
untracked.
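The three classifications can also be listed one set at a time with git ls-files. This is a sketch under a few assumptions: a fresh scratch repository, the same file names as the walkthrough, and no extra ignore patterns from a personal global excludes file.

```shell
# Throwaway repository; the path is a placeholder.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

echo "New data" > data
touch main.o
echo main.o > .gitignore
git add data                      # data becomes tracked

tracked=$(git ls-files)
untracked=$(git ls-files --others --exclude-standard)
ignored=$(git ls-files --others --ignored --exclude-standard)

echo "tracked:   $tracked"
echo "untracked: $untracked"
echo "ignored:   $ignored"
```

Here data is tracked, main.o is ignored, and .gitignore itself remains untracked until it is added.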

The next few sections demonstrate different ways to change the
tracked status of a file as well as how to add or remove it from the
index.

Using git add

The command git add stages a file. In terms of Git’s file classifications, if a file
is untracked, then git add converts
that file’s status to tracked. When git add is used on a directory name, all of the files and
subdirectories beneath it are staged recursively.
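To illustrate the recursive case, the sketch below stages a hypothetical src directory with one nested subdirectory and then lists what landed in the index. The directory and file names are invented for the example.

```shell
# Throwaway repository; path and file names are placeholders.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

mkdir -p src/lib
echo "int main(void) { return 0; }" > src/main.c
echo "/* helpers */" > src/lib/util.c

git add src                       # one command, the whole subtree

staged=$(git ls-files)
staged_count=$(git ls-files | wc -l | tr -d ' ')
echo "$staged"
```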

Let’s continue the example from the previous section.

    $ git status
    # On branch master
    #
    # Initial commit
    #
    # Untracked files:
    #   (use "git add <file>..." to include in what will be committed)
    #
    #       .gitignore
    #       data

    # Track both new files.
    $ git add data .gitignore

    $ git status
    # On branch master
    #
    # Initial commit
    #
    # Changes to be committed:
    #   (use "git rm --cached <file>..." to unstage)
    #
    #       new file: .gitignore
    #       new file: data
    #

The first git status shows you that two files are untracked and reminds you that to
make a file tracked, you simply need to use git add.
After git add, both
data and .gitignore are staged and tracked, and ready to
be added to the repository on the next commit.

In terms of Git’s object model, the entirety of each file at the
moment you issued git add was copied
into the object store and indexed by its resulting SHA1 name. Staging a
file is also called caching a file[10] or putting a file in the index.

You can use git ls-files
to peer under the object model hood and find the SHA1 values
for those staged files:

    $ git ls-files --stage
    100644 0487f44090ad950f61955271cf0a2d6c6a83ad9a 0       .gitignore
    100644 534469f67ae5ce72a7a274faf30dee3c2ea1746d 0       data

Most of the day-to-day changes within your repository will likely be
simple edits. After any edit and before you commit your changes, run
git add to update the index with the
absolute latest and greatest version of your file. If you don’t, you’ll
have two different versions of the file: one captured in the object store
and referenced from the index, and the other in your working
directory.

To continue the example, let’s change the file data so it’s different from the one in the
index and use the arcane git hash-object file command (which you’ll hardly ever invoke directly) to
compute and print the SHA1 hash for the new version.

    $ git ls-files --stage
    100644 0487f44090ad950f61955271cf0a2d6c6a83ad9a 0       .gitignore
    100644 534469f67ae5ce72a7a274faf30dee3c2ea1746d 0       data

    # edit "data" to contain...
    $ cat data
    New data
    And some more data now

    $ git hash-object data
    e476983f39f6e4f453f0fe4a859410f63b58b500

After the file is amended, the previous version of the file in the
object store and index has SHA1 534469f67ae5ce72a7a274faf30dee3c2ea1746d.
However, the updated version of the file has SHA1 e476983f39f6e4f453f0fe4a859410f63b58b500. Let’s
update the index to contain the new
version of the file:

    $ git add data
    $ git ls-files --stage
    100644 0487f44090ad950f61955271cf0a2d6c6a83ad9a 0       .gitignore
    100644 e476983f39f6e4f453f0fe4a859410f63b58b500 0       data

The index now has the updated version of the file. Again, the
file data has been staged or, speaking loosely, the file
data is “in the index.” The latter phrase is less accurate because the
file is actually in the object store and the index merely refers to
it.

The seemingly idle play with SHA1 hashes and the index brings home a
key point: Think of git add not as “add this file,” but more as “add this
content.”
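One way to see that git add deals in content rather than filenames is to stage two files with identical contents. In this sketch the file names one and two are hypothetical; the point is that both index entries refer to the same blob ID, which is also what git hash-object computes.

```shell
# Throwaway repository; path and file names are placeholders.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

echo "same content" > one
cp one two
git add one two

# ls-files --stage prints: mode SP object-id SP stage TAB path
id_one=$(git ls-files --stage one | cut -d' ' -f2)
id_two=$(git ls-files --stage two | cut -d' ' -f2)
id_hash=$(git hash-object one)

echo "one:         $id_one"
echo "two:         $id_two"
echo "hash-object: $id_hash"
```

All three IDs match, because the ID depends only on the bytes that were staged.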

In any event, the important thing to remember is that the version of
a file in your working directory can be out of sync with the version
staged in the index. When it comes time to make a commit, Git uses the
version in the index.
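The following sketch shows that mismatch in action, assuming a scratch repository and a placeholder identity. The file is edited again after staging, and the commit records the staged version rather than the newer working copy.

```shell
# Throwaway repository; path and identity are placeholders.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "version 1" > data
git add data                      # index now refers to "version 1"
echo "version 2" > data           # working copy moves on, unstaged

git commit -q -m "Commit what was staged"

committed=$(git show HEAD:data)
working=$(cat data)
echo "committed: $committed"
echo "working:   $working"
```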

Tip

The --interactive option to either git add or git commit
can be a useful way to explore which files you would like to stage for a commit.

Some Notes on Using git commit

Using git commit --all

The -a or --all option to
git commit causes it to automatically stage all unstaged, tracked file
changes, including removals of tracked files from the working copy, before
it performs the commit.

Let’s see how this works by setting up a few files with different
staging characteristics:

    # Setup test repository
    $ mkdir /tmp/commit-all-example
    $ cd /tmp/commit-all-example
    $ git init
    Initialized empty Git repository in /tmp/commit-all-example/.git/

    $ echo something >> ready
    $ echo something else >> notyet
    $ git add ready notyet
    $ git commit -m "Setup"
    [master (root-commit) 71774a1] Setup
     2 files changed, 2 insertions(+), 0 deletions(-)
     create mode 100644 notyet
     create mode 100644 ready

    # Modify file "ready" and "git add" it to the index
    # edit ready
    $ git add ready

    # Modify file "notyet", leaving it unstaged
    # edit notyet

    # Add a new file in a subdirectory, but don't add it
    $ mkdir subdir
    $ echo Nope >> subdir/new

Use git status to see what a regular commit (without command line options)
would do:

    $ git status
    # On branch master
    # Changes to be committed:
    #   (use "git reset HEAD <file>..." to unstage)
    #
    #       modified:   ready
    #
    # Changed but not updated:
    #   (use "git add <file>..." to update what will be committed)
    #
    #       modified:   notyet
    #
    # Untracked files:
    #   (use "git add <file>..." to include in what will be committed)
    #
    #       subdir/

Here, the index is prepared to commit just the one file named
ready, because it’s the only file
that’s been staged.

However, if you run git commit --all, Git recursively traverses the entire repository,
stages all known, modified files, and commits them. In this case, when your
editor presents the commit message template, it should indicate that the
modified and known file notyet
will, in fact, be committed as well:

    # Please enter the commit message for your changes.
    # (Comment lines starting with '#' will not be included)
    # On branch master
    # Changes to be committed:
    #   (use "git reset HEAD <file>..." to unstage)
    #
    #       modified:   notyet
    #       modified:   ready
    #
    # Untracked files:
    #   (use "git add <file>..." to include in what will be committed)
    #
    #       subdir/

Finally, because the directory named subdir/ is new and no file name or path within
it is tracked, not even the --all option causes it to
be committed:

    Created commit db7de5f: Some --all thing.
     2 files changed, 2 insertions(+), 0 deletions(-)

While Git recursively traverses the repository looking for
modified and removed files, the completely new subdir/ directory and all of its files do not
become part of the commit.

Writing Commit Log Messages

If you do not directly supply a log message on the command line, Git runs an editor and prompts you
to write one. The editor is chosen from your configuration as
described in Configuration Files of Chapter 3.

If you are in the editor writing a commit log message and
for some reason decide to abort the operation, simply exit the editor
without saving; this results in an empty log message. If it’s too late
for that because you’ve already saved, just delete the entire log
message and save again. Git will not process an empty (no text)
commit.

Using git rm

The command git rm is, naturally, the inverse of git add.
It removes a file from both the index and the working
directory. However, because removing a file tends to be more problematic
(if something goes wrong) than adding a file, Git treats the removal of a
file with a bit more care.

Git will remove a file only from the index or from the index and
working directory simultaneously. Git will not remove a file just from the
working directory; the regular rm
command may be used for that purpose.

Removing a file from your directory and the index does not remove
the file’s existing history from the repository. Any versions of the file
that are part of its history already committed in the repository remain in
the object store and retain that history.
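A short sketch can confirm that committed history survives a removal. It assumes a scratch repository, a placeholder identity, and a hypothetical file named notes.

```shell
# Throwaway repository; path, identity, and file name are placeholders.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "keep me" > notes
git add notes
git commit -q -m "Add notes"

git rm -q notes                   # removes from index and working directory
git commit -q -m "Remove notes"

old=$(git show HEAD~1:notes)                         # content from before the removal
history=$(git log --oneline -- notes | wc -l | tr -d ' ')
echo "old content: $old"
echo "commits touching notes: $history"
```

The file is gone from the working directory, yet its earlier content and both commits that touched it are still reachable.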

Continuing the example, let’s introduce an accidental
additional file that shouldn’t be staged and see how to remove it.

    $ echo "Random stuff" > oops

    # Can't "git rm" files Git considers "other"
    # This should be just "rm oops"
    $ git rm oops
    fatal: pathspec 'oops' did not match any files

Because git rm is also an
operation on the index, the command won’t work on a file that hasn’t been
previously added to the repository or index; Git must first be aware of a
file. So let’s accidentally stage the oops file:

    # Accidentally stage "oops" file
    $ git add oops

    $ git status
    # On branch master
    #
    # Initial commit
    #
    # Changes to be committed:
    #   (use "git rm --cached <file>..." to unstage)
    #
    #       new file: .gitignore
    #       new file: data
    #       new file: oops
    #

To convert a file from staged to unstaged, use git rm --cached:

    $ git ls-files --stage
    100644 0487f44090ad950f61955271cf0a2d6c6a83ad9a 0       .gitignore
    100644 e476983f39f6e4f453f0fe4a859410f63b58b500 0       data
    100644 fcd87b055f261557434fa9956e6ce29433a5cd1c 0       oops

    $ git rm --cached oops
    rm 'oops'

    $ git ls-files --stage
    100644 0487f44090ad950f61955271cf0a2d6c6a83ad9a 0       .gitignore
    100644 e476983f39f6e4f453f0fe4a859410f63b58b500 0       data

Whereas git rm --cached removes
the file from the index and leaves it in the working directory, git rm removes the file from both the index and
the working directory.

Warning

Using git rm --cached to make a
file untracked while leaving a copy in the working directory is
dangerous, because you may forget that it is no longer being tracked.
Using this approach also overrides Git’s check that the working file’s
contents are current. Be careful.

If you want to remove a file once it’s been committed, just stage
the request through a simple git rm filename:

    $ git commit -m "Add some files"
    Created initial commit 5b22108: Add some files
     2 files changed, 3 insertions(+), 0 deletions(-)
     create mode 100644 .gitignore
     create mode 100644 data

    $ git rm data
    rm 'data'

    $ git status
    # On branch master
    # Changes to be committed:
    #   (use "git reset HEAD <file>..." to unstage)
    #
    #       deleted:    data
    #

Before Git removes a file, it checks to make sure the version of the
file in the working directory matches the latest version in the current
branch (the version that Git commands call the HEAD). This verification precludes the
accidental loss of any changes (due to your editing) that may have been
made to the file.

Tip

Use git rm -f to
force the removal of your file. Force is an
explicit mandate and removes the file even if you have altered it since
your last commit.
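The safety check and the forced override can be sketched as follows, assuming a scratch repository with a placeholder identity. A plain git rm is attempted on a file with uncommitted edits, followed by git rm -f.

```shell
# Throwaway repository; path and identity are placeholders.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "v1" > data
git add data
git commit -q -m "Add data"

echo "local edit" >> data         # uncommitted change

# Plain "git rm" refuses because the working copy differs from HEAD.
if git rm data >/dev/null 2>&1; then
    plain=removed
else
    plain=refused
fi

git rm -q -f data                 # force discards the local edit
if [ -e data ]; then forced=kept; else forced=removed; fi
echo "plain: $plain, forced: $forced"
```

The forced removal throws away the local edit for good, so it is worth a second look before using it.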

And in case you really meant to keep a file
that you accidentally removed, simply add it back:

    $ git add data
    fatal: pathspec 'data' did not match any files

Darn! Git removed the working copy, too! But don’t worry. VCSs are good at
recovering old versions of files:

    $ git checkout HEAD -- data
    $ cat data
    New data
    And some more data now

    $ git status
    # On branch master
    nothing to commit (working directory clean)

Using git mv

Suppose you need to move or rename a file. You may use a combination of git rm on
the old file and git add on the new
file, or you may use git mv directly.
Given a repository with a file named
stuff that you want to rename
newstuff, the following sequences of
commands are equivalent Git operations:

    $ mv stuff newstuff
    $ git rm stuff
    $ git add newstuff

and

    $ git mv stuff newstuff

In both cases, Git removes the pathname stuff from the index, adds a new pathname
newstuff, keeps the original content
for stuff in the object store, and
reassociates that content with the pathname newstuff.
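To check that a rename only changes the pathname in the index, the sketch below compares the blob ID before and after git mv. It assumes a scratch repository with a placeholder identity and reuses the stuff and newstuff names from the text.

```shell
# Throwaway repository; path and identity are placeholders.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "payload" > stuff
git add stuff
git commit -q -m "Add stuff"

before=$(git rev-parse HEAD:stuff)            # blob ID under the old name
git mv stuff newstuff
after=$(git ls-files --stage newstuff | cut -d' ' -f2)
names=$(git ls-files)

echo "before: $before"
echo "after:  $after"
echo "index paths: $names"
```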

With data back in the example
repository, let’s rename it and commit the change:

    $ git mv data mydata

    $ git status
    # On branch master
    # Changes to be committed:
    #   (use "git reset HEAD <file>..." to unstage)
    #
    #       renamed:    data -> mydata
    #

    $ git commit -m "Moved data to mydata"
    Created commit ec7d888: Moved data to mydata
     1 files changed, 0 insertions(+), 0 deletions(-)
     rename data => mydata (100%)

If you happen to check the history of the file, you may be a bit disturbed to see that Git has apparently
lost the history of the original data
file and remembers only that it renamed data to the current name:

    $ git log mydata
    commit ec7d888b6492370a8ef43f56162a2a4686aea3b4
    Author: Jon Loeliger <jdl@example.com>
    Date:   Sun Nov 2 19:01:20 2008 -0600

    Moved data to mydata

Git does still remember the whole history, but the display
is limited to the particular filename you specified in the command. The
--follow option asks Git to trace back through the log
and find the whole history associated with the content:

    $ git log --follow mydata
    commit ec7d888b6492370a8ef43f56162a2a4686aea3b4
    Author: Jon Loeliger <jdl@example.com>
    Date:   Sun Nov 2 19:01:20 2008 -0600

    Moved data to mydata

    commit 5b22108820b6638a86bf57145a136f3a7ab71818
    Author: Jon Loeliger <jdl@example.com>
    Date:   Sun Nov 2 18:38:28 2008 -0600

    Add some files

One of the classic problems with VCSs is that renaming a file can
cause them to lose track of a file’s history. Git preserves this
information even after a rename.

A Note on Tracking Renames

Let’s talk a bit more about how Git keeps track of file renames.

SVN, as an example of traditional revision control, does a lot of
work tracking when a file is renamed and moved around because it keeps
track only of diffs between files. If you move a file, it’s essentially
the same as deleting all the lines from the old file and adding them to
the new one. But it would be inefficient to transfer and store all the
contents of the file again whenever you do a simple rename; imagine
renaming a whole subdirectory that contains thousands of files.

To alleviate this situation, SVN tracks each rename explicitly. If
you want to rename hello.txt to
subdir/hello.txt, you must use
svn mv instead of svn rm and svn
add
on the files. Otherwise, SVN has no way to see that it’s a
rename and must go through the inefficient delete/add sequence just
described.

Next, given this exceptional feature of tracking a rename, the SVN
server needs a special protocol to tell its clients, please move
hello.txt into subdir/hello.txt.
Furthermore, each SVN
client must ensure that it performs this (relatively rare) operation
correctly.

Git, on the other hand, doesn’t keep track of a rename. You can move
or copy hello.txt anywhere you want,
but doing so affects only tree objects. (Remember that tree objects store
the relationships between content, whereas the content itself is stored in
blobs.) A look at the differences between two trees makes it obvious that
the blob named 3b18e5... has moved to a
new place. And even if you don’t explicitly examine the differences, every
part of the system knows it already has that blob, so every part knows it
doesn’t need another copy of it.
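As a rough illustration, the sketch below moves hello.txt into subdir/ with an ordinary mv, commits, and then asks git diff to compare the two trees with rename detection (-M) enabled. The repository path and identity are placeholders.

```shell
# Throwaway repository; path and identity are placeholders.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "hello" > hello.txt
git add hello.txt
git commit -q -m "Add hello.txt"

mkdir subdir
mv hello.txt subdir/hello.txt     # plain mv, no git mv involved
git add -A                        # stage the deletion and the addition
git commit -q -m "Move hello.txt"

# Git infers the rename by comparing the two trees.
status=$(git diff -M --name-status HEAD~1 HEAD | cut -f1)
old_blob=$(git rev-parse HEAD~1:hello.txt)
new_blob=$(git rev-parse HEAD:subdir/hello.txt)
echo "status: $status"
echo "blob unchanged: $old_blob"
```

The R100 status means a rename with 100% similarity, deduced after the fact from the identical blob.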

In this situation, as in many other places, Git’s simple hash-based
storage system simplifies a lot of
things that baffle or elude other revision control systems.

Problems with Tracking a Rename

Tracking the renaming of a file engenders a perennial debate among
developers of VCSs.

A simple rename is fodder enough for dissension. The argument
becomes even more heated when the file’s name changes and then its
content changes. Then the scenarios turn the parley from practical to
philosophical: Is that new file really a rename, or is it
merely similar to the old one? How similar should the new file be before
it’s considered the same file? If you apply someone’s patch that deletes
a file and recreates a similar one elsewhere, how is that managed? What
happens if a file is renamed in two different ways on two different
branches? Is it less error prone to automatically detect renames in such
a situation, as Git does, or to require the user to explicitly identify
renames, as SVN does?

In real life use, it seems that Git’s system for handling file
renames is superior, because there are just too many ways for a file to
be renamed and humans are simply not smart enough to make sure SVN knows
about them all. But there is no perfect system for handling renames …
yet.

The .gitignore File

Earlier in this chapter you saw how to use the .gitignore file to pass over main.o, an irrelevant file. As in that example,
you can skip any file by adding its name to .gitignore in the same directory.
Additionally, you can ignore the file everywhere by adding it to the
.gitignore file in the topmost
directory of your repository.

But Git also supports a much richer mechanism. A .gitignore file can contain a list of filename
patterns that specify what files to ignore. The
format of .gitignore is as
follows:

  • Blank lines are ignored, and lines starting with a pound sign
    (#) can be used for comments. However, the # does not represent a comment if it follows
    other text on the line.

  • A simple, literal filename matches a file in any directory with
    that name.

  • A directory name is marked by a trailing slash character
    (/). This matches the named
    directory and any subdirectory but does not match a file or a symbolic
    link.

  • A pattern containing shell globbing characters, such as
    an asterisk (*), is expanded as a
    shell glob pattern. Just as in standard shell globbing, the match
    cannot extend across directories and so an asterisk can match only a
    single file or directory name. But an asterisk can still be part of a
    pattern that includes slashes to specify directory names (e.g.,
    debug/32bit/*.o).

  • An initial exclamation point (!) inverts the sense of the pattern on the rest of the line.
    Additionally, any file excluded by an earlier pattern but matching an
    inversion rule is included. An inverted pattern overrides lower
    precedence rules.
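The pattern rules above can be tried out with git check-ignore, which exits with status 0 when a path is ignored. This is a sketch with invented file names, assuming a Git recent enough to include check-ignore and no conflicting patterns in a personal global excludes file.

```shell
# Throwaway repository; path and file names are placeholders.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

cat > .gitignore <<'EOF'
# build output (this line is a comment)
*.o
build/
!keep.o
EOF

mkdir -p build src
touch main.o keep.o src/util.o src/util.c build/out.txt

# Report yes/no for a single path.
is_ignored() {
    if git check-ignore -q "$1"; then echo yes; else echo no; fi
}

r_main=$(is_ignored main.o)          # matched by *.o
r_sub=$(is_ignored src/util.o)       # *.o has no slash, so it applies in subdirectories too
r_dir=$(is_ignored build/out.txt)    # inside the ignored build/ directory
r_keep=$(is_ignored keep.o)          # re-included by !keep.o
r_src=$(is_ignored src/util.c)       # matches no pattern

echo "main.o=$r_main src/util.o=$r_sub build/out.txt=$r_dir keep.o=$r_keep src/util.c=$r_src"
```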

Furthermore, Git allows you to have a .gitignore file in any directory within your
repository. Each file affects its
directory and all subdirectories. The .gitignore rules also cascade: you can override the rules in a
higher directory by including an inverted pattern (using the initial
!) in one of the subdirectories.

To resolve a hierarchy of directories with multiple .gitignore files, and to allow command-line addenda to the list of ignored
files, Git honors the following precedence, from highest to lowest:

  • Patterns specified on the command line.

  • Patterns read from .gitignore in the same directory.

  • Patterns in parent directories, proceeding upward. Hence, the
    current directory’s patterns overrule the parents’ patterns, and the
    parents close to the current directory take precedence over higher
    parents.

  • Patterns from the .git/info/exclude file.

  • Patterns from the file specified by the configuration variable
    core.excludesfile.

Because a .gitignore is treated
as a regular file within your repository, it is copied during clone
operations and applies to all copies of your repository. In general, you
should place entries into your version controlled .gitignore files only if the patterns apply to
all derived repositories universally.

If the exclusion pattern is somehow specific to your one repository
and should not (or might not) be applicable to anyone else’s clone of your
repository, then the patterns should instead go into the .git/info/exclude file, because it is not
propagated during clone operations. Its pattern format and treatment are
the same as for .gitignore files.
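A minimal sketch of a repository-local exclusion follows. The file name scratch.txt is hypothetical, and the mkdir -p is only a precaution in case the info directory was not created by git init.

```shell
# Throwaway repository; path and file name are placeholders.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

mkdir -p .git/info
echo "scratch.txt" >> .git/info/exclude   # local only, never cloned

touch scratch.txt
status=$(git status --porcelain)          # empty when nothing is reported
tracked=$(git ls-files)                   # the exclude file is not a tracked file

echo "status output: '$status'"
```

Because the pattern lives under .git/, nothing about it is committed or shared with other clones.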

Here’s another scenario. It’s typical to exclude .o files, which are generated from source by
the compiler. To ignore .o files,
place *.o in your top level .gitignore. But what if you also had a
particular *.o file that was, say,
supplied by someone else and for which you couldn’t generate a replacement
yourself? You’d likely want to explicitly track that particular file. You
might then have a configuration like this:

    $ cd my_package
    $ cat .gitignore
    *.o

    $ cd vendor_files
    $ cat .gitignore
    !driver.o

The combination of rules means that Git will ignore all .o files within the repository but will track
one exception, the file driver.o
within the vendor_files
subdirectory.
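The cascading behavior can be checked in the same way. This sketch recreates the two .gitignore files in a scratch repository; other.o is a hypothetical second object file added for contrast.

```shell
# Throwaway repository standing in for my_package; the path is a placeholder.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

echo '*.o' > .gitignore
mkdir vendor_files
echo '!driver.o' > vendor_files/.gitignore
touch vendor_files/driver.o vendor_files/other.o

# git check-ignore exits 0 for an ignored path and 1 otherwise.
if git check-ignore -q vendor_files/driver.o; then driver=ignored; else driver=trackable; fi
if git check-ignore -q vendor_files/other.o;  then other=ignored;  else other=trackable;  fi

echo "driver.o: $driver, other.o: $other"
```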

A Detailed View of Git’s Object Model and Files

By now, you should have the basic skills to manage files.
Nonetheless, keeping track of what file is where—working
directory, index, and repository—can be confusing. Let’s follow a
series of four pictures to visualize the progress of a single file named
file1 as it is edited, staged in the
index, and finally committed. Each picture simultaneously shows your
working directory, the index, and the object store. For simplicity, let’s
stick to just the master branch.

The initial state is shown in Figure 5-1. Here,
the working directory contains two files named file1 and file2, with contents foo and
bar, respectively.

In addition to file1 and
file2 in the working directory, the
master branch has a commit that records
a tree with exactly the same foo and bar
contents for files file1 and
file2. Furthermore, the index records
SHA1 values a23bf and 9d3a2 (respectively) for exactly those same file
contents. The working directory, the index, and the object store are all
synchronized and in agreement. Nothing is dirty.

Figure 5-2 shows the changes after editing
file1 in the working directory so
that its contents now consist of quux. Nothing in the index
nor in the object store has changed, but the working directory is now
considered dirty.

Figure 5-1. Initial files and objects
Figure 5-2. After editing file1

Some interesting changes take place when you use the command
git add file1 to stage the edit of
file1.

Figure 5-3. After git add

As Figure 5-3 shows, Git first takes the
version of file1 from the working
directory, computes a SHA1 hash ID (bd71363) for its contents, and stores those contents in
the object store under that ID. Next, Git records in the index that the pathname
file1 has been updated to the new
bd71363 SHA1.
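Those two steps can be approximated by hand with plumbing commands. This is only a sketch of roughly what git add does, not its actual implementation; the repository is a scratch one, and the resulting ID will differ from the abbreviated value used in the figures.

```shell
# Throwaway repository; the path is a placeholder.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

echo "quux" > file1

# Step 1: hash the contents and write the blob into the object store.
blob=$(git hash-object -w file1)

# Step 2: record mode, blob ID, and pathname in the index.
git update-index --add --cacheinfo 100644 "$blob" file1

index_id=$(git ls-files --stage file1 | cut -d' ' -f2)
obj_type=$(git cat-file -t "$blob")
echo "blob $blob is a $obj_type; index entry for file1: $index_id"
```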

Because the contents of file2
haven’t changed and no git add staged
file2, the index continues to
reference the original blob object for it.

At this point, you have staged file1 in the index, and the working directory
and index agree. However, the index is considered dirty with respect to
HEAD because it differs from the tree
recorded in the object store for the HEAD commit of the master branch.[11]

Finally, after all changes have been staged in the index, a commit applies them to the repository. The effects of
git commit are depicted in Figure 5-4.

Figure 5-4. After git commit

As Figure 5-4 shows, the commit initiates three steps. First, the virtual
tree object that is the index gets converted into a real tree object and
placed into the object store under its SHA1 name. Second, a new commit
object is created with your log message. The new commit points to the
newly created tree object and also to the previous or parent commit.
Third, the master branch ref is moved
from the most recent commit to the newly created commit object, becoming
the new master HEAD.
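The three steps map loosely onto three plumbing commands: git write-tree, git commit-tree, and git update-ref. The sketch below is an approximation of what git commit does, under the assumption of a scratch repository and a placeholder identity. It reads the branch name from HEAD instead of assuming master.

```shell
# Throwaway repository; path and identity are placeholders.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "foo" > file1
git add file1
git commit -q -m "Initial"

echo "quux" > file1
git add file1                               # index is now dirty with respect to HEAD

parent=$(git rev-parse HEAD)
tree=$(git write-tree)                      # 1: index becomes a real tree object
commit=$(echo "Edit file1" | git commit-tree "$tree" -p "$parent")   # 2: new commit object
branch=$(git symbolic-ref HEAD)             # e.g. refs/heads/master
git update-ref "$branch" "$commit"          # 3: move the branch ref

head_now=$(git rev-parse HEAD)
head_tree=$(git rev-parse 'HEAD^{tree}')
head_parent=$(git rev-parse 'HEAD^')
dirty=$(git status --porcelain)             # empty when everything agrees
echo "HEAD is now $head_now"
```

Afterward the working directory, index, and HEAD agree again, so git status reports nothing.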

An interesting detail is that the working directory, index,
and object store (represented by the HEAD of master) are once again all synchronized and in
agreement, just as they were in Figure 5-1.


[9] I have it on good authority that this chapter should, in fact, be
titled Things Bart Massey Hates About Git.

[10] You did see the --cached in the git status output, didn’t you?

[11] You can get a dirty index in the other direction, too, irrespective of the
working directory state. By reading a non-HEAD commit out of the object store into the
index and not checking out the corresponding
files into the working directory, you create the situation where the
index and working directory are not in agreement and where the index
is still dirty with respect to the HEAD.
