Some time ago, I stumbled upon the Initial Commit blog. While exploring its articles about Git, I decided to summarize my findings in this post.
Git is a version control tool that enables you to save snapshots of your project. These snapshots allow you to navigate back and forth between different project versions.
A single snapshot, called a commit, stores information about the project’s state at a given moment, along with metadata about the author, date/time, and a reference to the previous commit.
To better understand how commits are created and stored, we first need to explore the concepts of object types and object storage.
Object Storage
When initializing a new Git project, a kind of “database” is created. This “database” is simply a directory, almost empty at first (it holds only the info and pack subfolders), ready to store Git objects (such as commits).
The “database” directory is located under .git/objects. As objects are added, Git creates folders inside it named with two hexadecimal characters (00–ff).
Each Git object is saved as a file and stored within one of these folders.
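The SHA-1 hashing algorithm is used to determine an object’s location. The first two characters of the 40-character hash specify the folder where the object will be stored, and the remaining 38 characters form the file name. For example, the object de9f2c7fd25e1b3afad3e85a0bd17d9b100db4b3 is stored at:
.git/objects/de/9f2c7fd25e1b3afad3e85a0bd17d9b100db4b3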
This approach creates a straightforward indexing system for Git’s content-addressable database.
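As a minimal sketch of this indexing scheme, the helper below maps a hash to the path of its loose object. The function name loose_object_path and the default .git location are illustrative assumptions, not part of Git itself.

```python
from pathlib import PurePosixPath


def loose_object_path(sha1_hex: str, git_dir: str = ".git") -> PurePosixPath:
    """Return the path where a loose object with this hash would be stored."""
    if len(sha1_hex) != 40 or any(c not in "0123456789abcdef" for c in sha1_hex):
        raise ValueError("expected 40 lowercase hexadecimal characters")
    # The first two characters select the folder, the remaining 38 name the file.
    return PurePosixPath(git_dir) / "objects" / sha1_hex[:2] / sha1_hex[2:]


print(loose_object_path("de9f2c7fd25e1b3afad3e85a0bd17d9b100db4b3"))
# .git/objects/de/9f2c7fd25e1b3afad3e85a0bd17d9b100db4b3
```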
The Object
All Git objects are referenced by their SHA-1 hash, a string of 40 hexadecimal characters (0–f).
Each object is hashed based on its content. This is an important feature that helps save space by eliminating duplication: identical content always produces the same hash, so it is stored only once. In other words, every object in the object store is unique.
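The deduplication effect can be illustrated with a toy content-addressable store. This is only a sketch of the idea: the ContentStore class is hypothetical, and it hashes the raw bytes directly, whereas Git also hashes a small header.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is the SHA-1 of the value."""

    def __init__(self):
        self._objects = {}  # maps hex digest -> bytes

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        # Identical content always maps to the same key, so it is stored once.
        self._objects.setdefault(key, content)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
first = store.put(b"same bytes")
second = store.put(b"same bytes")
third = store.put(b"different bytes")
print(first == second, len(store))  # True 2
```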
There are four types of objects stored in the object store:
- a blob
- a tree
- a commit
- an (annotated) tag
Objects reference each other by hash, which chains them together: a commit points to a tree, and a tree points to blobs (and to other trees).
Commit -> Tree -> Blob
Each Git object follows the same format:
<type> <size>\0<data>
type – the object type: blob, tree, commit, or tag
' ' – a space used as a delimiter
size – the size of the data in bytes, written as a decimal number (the header itself is not counted)
\0 – a NULL byte (00000000)
data – the content of the object in binary format
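To make the layout concrete, here is a small sketch that builds and splits such a serialized object. The names encode_object and decode_object are made up for this example. It assumes the size field counts only the data bytes, which is how Git writes it.

```python
def encode_object(obj_type: str, data: bytes) -> bytes:
    """Prepend the '<type> <size>' header and the NULL byte to the raw data."""
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\x00"
    return header + data


def decode_object(raw: bytes):
    """Split a serialized object back into (type, data)."""
    header, _, data = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(data):
        raise ValueError("size in header does not match data length")
    return obj_type, data


raw = encode_object("blob", b"hello world\n")
print(raw)                 # b'blob 12\x00hello world\n'
print(decode_object(raw))  # ('blob', b'hello world\n')
```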
Git creates and saves objects using the following approach:
- First, Git constructs the object in memory.
- Next, it calculates the object’s SHA-1 hash value based on its content (the header included).
- Then, the object is compressed using Zlib.
- Finally, the compressed version of the object is saved in a file within the object storage, with the object’s hash used as the file name.
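The four steps above can be sketched in a few lines of Python. This is an approximation of what Git does for loose objects, not its actual implementation. The helper names are hypothetical, and the example writes into a temporary directory rather than a real repository.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, obj_type: str, data: bytes) -> str:
    # 1. Build the object in memory: header plus data.
    raw = f"{obj_type} {len(data)}".encode("ascii") + b"\x00" + data
    # 2. Hash the uncompressed bytes (header included).
    sha1_hex = hashlib.sha1(raw).hexdigest()
    # 3. Compress with zlib.
    compressed = zlib.compress(raw)
    # 4. Store under <first two chars>/<remaining 38 chars>.
    target = objects_dir / sha1_hex[:2] / sha1_hex[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # same content, same name: nothing to write
        target.write_bytes(compressed)
    return sha1_hex


def read_loose_object(objects_dir: Path, sha1_hex: str) -> bytes:
    compressed = (objects_dir / sha1_hex[:2] / sha1_hex[2:]).read_bytes()
    return zlib.decompress(compressed)


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"
    object_id = write_loose_object(objects_dir, "blob", b"hello world\n")
    restored = read_loose_object(objects_dir, object_id)

print(object_id)  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(restored)   # b'blob 12\x00hello world\n'
```

Hashing before compressing means the object's name does not depend on the compression level or zlib version used.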
Object Type: Blob
A blob is an object type whose binary data contains the content of a file in the repository. A blob stores only the file’s content and has no knowledge of the corresponding file it was created from. The file’s metadata, such as its path, name, and other details, is stored in a tree object.
blob <size-in-bytes>\0<binary-content-of-a-file>
Investigate blobs using the following commands:
# list all objects reachable from any ref (commits, trees, and blobs)
$ git rev-list --objects --all
# print the content of a blob
$ git cat-file -p <sha-1>
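Because a blob holds only content, two files with the same bytes share one blob regardless of their names. The sketch below computes blob hashes for a made-up set of files. The function blob_id and the file names are hypothetical. The result should match what git hash-object prints for the same bytes.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """SHA-1 of 'blob <size>', a NULL byte, and the content, as 40 hex characters."""
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()


# Hypothetical working tree: two paths share the same content.
files = {
    "README.md": b"hello world\n",
    "docs/copy-of-readme.md": b"hello world\n",
    "empty.txt": b"",
}
for path, content in files.items():
    print(path, blob_id(content))
# README.md 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
# docs/copy-of-readme.md 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
# empty.txt e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```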
Object Type: Tree
In Git, a tree object represents a snapshot of the directory structure at a specific point in time. It organizes and links the repository’s file content (stored as blob objects) and other directories (nested tree objects).
A tree associates blobs with their actual file path, name and permissions.
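When a tree is built in memory, its content holds one entry per blob or nested tree. The listing below uses the human-readable layout printed by git ls-tree; in the stored object each entry is written as <mode> <filename>\0 followed by the 20-byte binary hash, and the entry type is implied by the mode: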
tree <size-in-bytes>\0
<file-1-mode> <object-1-type> <object-1-hash> <filename>
<file-2-mode> <object-2-type> <object-2-hash> <filename>
...
You can obtain the hash of the tree associated with HEAD using the git rev-parse HEAD^{tree} command. Then use the git ls-tree <SHA-1 hash> command to inspect the tree’s content.
Similar to a blob, a tree is first built in memory. Next, its SHA-1 hash is calculated based on its content. Finally, the tree object is compressed using Zlib and saved as a file in object storage.
New tree objects are written when a commit is created, including the commits produced by merges and rebases. Directories whose content did not change reuse their existing tree objects.
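Here is a rough sketch of how a tree could be serialized and parsed. It assumes the stored layout in which each entry is the mode, a space, the file name, a NULL byte, and then the 20 raw bytes of the hash. The function names are invented for this example, and the sort order is simplified compared to Git's.

```python
import hashlib


def serialize_tree(entries):
    """entries: iterable of (mode, name, sha1_hex) tuples."""
    body = b""
    # Simplification: plain byte-wise sort by name. Git's real ordering
    # treats directory names as if they ended with '/'.
    for mode, name, sha1_hex in sorted(entries, key=lambda e: e[1].encode()):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\x00"
        body += bytes.fromhex(sha1_hex)  # 20 raw bytes, not 40 hex characters
    return b"tree %d\x00" % len(body) + body


def parse_tree(raw):
    """Yield (mode, name, sha1_hex) from a serialized tree object."""
    _, _, body = raw.partition(b"\x00")  # drop the 'tree <size>' header
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        sha1_hex = body[nul + 1:nul + 21].hex()
        yield mode, name, sha1_hex
        pos = nul + 21


entries = [
    ("100644", "README.md", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"),
    ("100644", "empty.txt", "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"),
]
raw_tree = serialize_tree(entries)
print(hashlib.sha1(raw_tree).hexdigest())
print(list(parse_tree(raw_tree)))
```

A tree with no entries serializes to just the header, whose hash is Git's well-known empty tree, 4b825dc642cb6eb9a060e54bf8d69288fbee4904.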
Object Type: Commit
A commit is a snapshot of a project taken by an author at a specific point in time.
Every commit (aside from the initial commit) contains a reference to the previous commit, known as the parent commit.
A commit can have more than one parent, for example, when it results from a Git merge that combines changes from multiple branches.
commit <size-in-bytes>\0
tree <tree hash>
parent <parent-1-commit-id>
parent <parent-2-commit-id>
...
author <name> <email> <date>
committer <name> <email> <date>
<empty line>
<commit message>
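The commit layout can be sketched in the same way. The code below is a simplified model: real commits may carry extra headers (for example a signature), which this parser does not handle. The identity, timestamp, and hashes are placeholder values, and the function names are hypothetical.

```python
import hashlib


def serialize_commit(tree, parents, author, committer, message):
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero, one, or several parents
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    # An empty line separates the headers from the commit message.
    body = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    return b"commit %d\x00" % len(body) + body


def parse_commit(raw):
    _, _, body = raw.partition(b"\x00")  # drop the 'commit <size>' header
    headers, _, message = body.decode("utf-8").partition("\n\n")
    parsed = {"tree": None, "parents": [], "message": message}
    for line in headers.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            parsed["parents"].append(value)
        else:
            parsed[key] = value
    return parsed


# Placeholder identity: name, email, seconds since the epoch, UTC offset.
who = "Jane Doe <jane@example.com> 1700000000 +0000"
raw_commit = serialize_commit(
    tree="4b825dc642cb6eb9a060e54bf8d69288fbee4904",
    parents=["6919c5e8c4dba68cac062e5c61b9cc339775c302"],
    author=who,
    committer=who,
    message="Add README\n",
)
print(hashlib.sha1(raw_commit).hexdigest())
print(parse_commit(raw_commit)["parents"])
```

Since the parent hashes are part of the hashed content, changing any ancestor changes the hash of every commit after it.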
Other Git Concepts
A ref is a human-readable name that points to a Git object, usually a commit.
A branch is one type of ref and is stored in the .git/refs/heads directory. Simply put, a branch is a label that points to a commit.
$ cat .git/refs/heads/master
> 6919c5e8c4dba68cac062e5c61b9cc339775c302
A tag is yet another type of ref, stored in the .git/refs/tags directory.
An annotated tag is a type of tag that, in addition to a reference (ref), creates an object stored in object storage. This object contains additional information, such as the tag message and a pointer to a commit.
A head is a reference that points to the tip of a branch. You can find heads in the .git/refs/heads directory.
HEAD is a special reference that identifies what you are currently working on. It is stored in .git/HEAD and normally points to a branch, which in turn points to a commit.
$ cat .git/HEAD
> ref: refs/heads/master
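Detached HEAD means that HEAD points directly to a commit instead of to a branch. In this state you can check out files, make changes, and even commit them, but no branch tracks the new commits. Once you switch away they become hard to find (and may eventually be garbage-collected), unless you create a new branch or tag that points to them.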
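The difference between an attached and a detached HEAD can be seen by reading the files directly. The sketch below builds a fake .git directory in a temporary folder and resolves HEAD in both states. The function resolve_head is hypothetical and assumes refs are stored as loose files (refs kept only in .git/packed-refs are not handled).

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path):
    """Return (branch_name_or_None, commit_hash_or_None) for a .git directory."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]  # e.g. refs/heads/master
        ref_file = git_dir / ref
        # Assumption: the ref is a loose file. A packed ref, or a branch
        # without any commit yet, yields None for the commit here.
        commit = ref_file.read_text().strip() if ref_file.is_file() else None
        branch = ref[len("refs/heads/"):] if ref.startswith("refs/heads/") else ref
        return branch, commit
    return None, head  # detached: HEAD holds a commit hash directly


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "refs" / "heads" / "master").write_text(
        "6919c5e8c4dba68cac062e5c61b9cc339775c302\n")
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    attached = resolve_head(git_dir)
    (git_dir / "HEAD").write_text("6919c5e8c4dba68cac062e5c61b9cc339775c302\n")
    detached = resolve_head(git_dir)

print(attached)  # ('master', '6919c5e8c4dba68cac062e5c61b9cc339775c302')
print(detached)  # (None, '6919c5e8c4dba68cac062e5c61b9cc339775c302')
```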
Conclusion
The core functionality of Git is not difficult to understand. It creatively leverages well-known concepts, such as trees and content hashing.
I use these concepts daily in my job. The product I’m building allows users to upload files containing geospatial information. I use content hashing to ensure that a file isn’t uploaded and processed multiple times. I also apply content hashing to geospatial shapes to prevent the same shape from being unnecessarily saved in the database multiple times.
Although I don’t have a use case for trees, I use content hashing similarly to Git to link file metadata to its content. However, in my case, the metadata is stored in database records, while the files are stored in a cloud bucket.