Git Made Simple

How Git Does Its Magic



Creative Commons License
Git Made Simple by Jim J. Moore is licensed under a Creative Commons Attribution 4.0 International License.
Presentation source at https://github.com/jdigger/git-fundimentals-preso

Jim Moore


  1. Naming Things
  2. Blobs, Trees and Commits
  3. Hash Trees
  4. Immutable DAG Benefits
  5. Performance and Cleanliness
  6. The Index
  7. References
  8. Add/Commit/Checkout/Cherry-Pick
  9. Merge/Rebase

Create A Repo

$ mkdir gittemp; cd gittemp
$ git init
$ git config user.name "Git Fundamentals"
$ git config user.email "gitfund@git.com"

Create Some Content

$ echo "Some cowbell" > file1.txt
$ echo "Just some file2 data" > file2.txt
$ mkdir morefiles
$ cp file1.txt morefiles/file3.txt


What’s in a name?

Names are important…

How do we want to identify the content we just created?


Solution

We’re going to give it a name based on its content rather than something effectively random.

Hash function — any function that can be used to map data of arbitrary size to data of fixed size.


Naming the Content

$ git hash-object file1.txt
77bd771f7fefc66994a22d922cad923ae068675a

$ git hash-object file2.txt
63ffb2bff74e9161a82b82999718f441df6a68f1

$ git hash-object morefiles/file3.txt
77bd771f7fefc66994a22d922cad923ae068675a


Identify What Should be Saved To The Repository

$ git add file1.txt file2.txt morefiles/file3.txt

$ git ls-files --stage
100644 77bd771f7fefc66994a22d922cad923ae068675a 0	file1.txt
100644 63ffb2bff74e9161a82b82999718f441df6a68f1 0	file2.txt
100644 77bd771f7fefc66994a22d922cad923ae068675a 0	morefiles/file3.txt

Save To The Repository

$ git commit -m "First good version"
[master (root-commit) 3b1b4ff] First good version
3 files changed, 3 insertions(+)
create mode 100644 file1.txt
create mode 100644 file2.txt
create mode 100644 morefiles/file3.txt

Inspecting the Commit

$ git cat-file -p 3b1b4ff
tree 0550b757d5b412ab40416d2c7ede1313d1c38087
author Git Fundamentals <gitfund@git.com> 1409711557 -0600
committer Git Fundamentals <gitfund@git.com> 1409711557 -0600

First good version

The commit fields

tree

id of the object that contains the file tree structure

parent(s)

zero or more ids of this commit’s parents. Most commits have exactly one parent; a merge usually has two, though it can have more. This is the first commit, so it has none (which is why no parent line appears in the output above)

author

the author (name and email) and the time the change was originally made

committer

the committer (name, email and time); usually the same as the author, but it differs when someone else applies the change, for example a maintainer applying a contributor’s patch

message

a description of why this change was saved

Inspecting the Tree

$ git cat-file -p 0550b75
100644 blob 77bd771f7fefc66994a22d922cad923ae068675a  file1.txt
100644 blob 63ffb2bff74e9161a82b82999718f441df6a68f1  file2.txt
040000 tree cb63181eaa27bf4caedf9d9d212840206a77c11d  morefiles

$ git cat-file -p cb63181
100644 blob 77bd771f7fefc66994a22d922cad923ae068675a  file3.txt

$ git cat-file -p 77bd771
Some cowbell

On the Filesystem

$ find .git/objects -type f
.git/objects/05/50b757d5b412ab40416d2c7ede1313d1c38087 # tree - top of commit 1
.git/objects/3b/1b4ff08b048aceeb4ccb38739101ec94b8fd01 # commit 1 - "First good version"
.git/objects/63/ffb2bff74e9161a82b82999718f441df6a68f1 # blob - file2.txt
.git/objects/77/bd771f7fefc66994a22d922cad923ae068675a # blob - file1.txt & file3.txt
.git/objects/cb/63181eaa27bf4caedf9d9d212840206a77c11d # tree - "morefiles" subdir under
                                                                top tree of commit 1

$ cat .git/HEAD
ref: refs/heads/master

$ cat .git/refs/heads/master
3b1b4ff08b048aceeb4ccb38739101ec94b8fd01

[Figure: object graph after the first commit]


Change File1.txt

$ echo "More cow-bell\!" > file1.txt

$ git commit -am "Added more cow-bell."
[master 7581387] Added more cow-bell.
1 file changed, 1 insertion(+), 1 deletion(-)

$ git cat-file -p 7581387
tree 366612747e1b0730cdbb680bb48ba6132b321848
parent 3b1b4ff08b048aceeb4ccb38739101ec94b8fd01
author Git Fundamentals <gitfund@git.com> 1409885160 -0600
committer Git Fundamentals <gitfund@git.com> 1409885160 -0600

Added more cow-bell.

$ git cat-file -p 3666127
100644 blob faff7b9f491fea306a748f2e17e536918fb6a601  file1.txt
100644 blob 63ffb2bff74e9161a82b82999718f441df6a68f1  file2.txt
040000 tree cb63181eaa27bf4caedf9d9d212840206a77c11d  morefiles

[Figure: object graph after the second commit]

Rename Files

$ mv file2.txt wow.txt ; mv morefiles hasfiles
$ git add -A && git commit -m "Renamed files"
[master e001241] Renamed files
2 files changed, 0 insertions(+), 0 deletions(-)
rename {morefiles => hasfiles}/file3.txt (100%)
rename file2.txt => wow.txt (100%)

$ git cat-file -p e001241
tree 96996c4fd8813d8854ebeee54d753115cb8a880f
parent 758138729b824c8aabecff1a961202f8fd7d50df
author Git Fundamentals <gitfund@git.com> 1409888276 -0600
committer Git Fundamentals <gitfund@git.com> 1409888276 -0600

Renamed files

$ git cat-file -p 96996c4f
100644 blob faff7b9f491fea306a748f2e17e536918fb6a601  file1.txt
040000 tree cb63181eaa27bf4caedf9d9d212840206a77c11d  hasfiles
100644 blob 63ffb2bff74e9161a82b82999718f441df6a68f1  wow.txt

[Figure: object graph after the rename commit]

The Power of Guarantees

Because each object’s content includes the hash of every object directly below it, a change in any sub-object guarantees that the hashes of all of its parents, up to the commit, will be different

Merkle Trees

For the developers/computer-science people, this is an extremely well-proven technique known as a hash tree (a.k.a. Merkle tree)

Instead of pointers made up of memory addresses, it uses pointers that are a unique "name" derived from the referenced value’s contents


It is used as a highly efficient way to guarantee data integrity for everything from filesystems and databases (e.g., ZFS, Riak, Cassandra) to peer-to-peer data transfer (e.g., BitTorrent) to currencies (e.g., Bitcoin)
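
To make the hash-tree idea concrete, here is a small sketch that rebuilds the example repository's tree ids from the bottom up and shows that changing one leaf changes the root. It assumes SHA-1 ids and plain non-executable files (mode 100644); the function names are invented for the example, and the file contents are taken from the commands earlier in the deck.

```python
import hashlib

def object_id(obj_type: bytes, body: bytes) -> str:
    return hashlib.sha1(obj_type + b" " + str(len(body)).encode() + b"\0" + body).hexdigest()

def blob_id(content: bytes) -> str:
    return object_id(b"blob", content)

def tree_id(entries: dict) -> str:
    """entries maps a name to ("blob", hex id) or ("tree", hex id)."""
    def sort_key(item):
        name, (kind, _) = item
        # Git orders tree entries as if directory names ended in "/".
        return (name + "/" if kind == "tree" else name).encode()
    body = b""
    for name, (kind, hex_id) in sorted(entries.items(), key=sort_key):
        mode = b"40000" if kind == "tree" else b"100644"
        body += mode + b" " + name.encode() + b"\0" + bytes.fromhex(hex_id)
    return object_id(b"tree", body)

def root_for(file1: bytes) -> str:
    sub = tree_id({"file3.txt": ("blob", blob_id(b"Some cowbell\n"))})
    return tree_id({
        "file1.txt": ("blob", blob_id(file1)),
        "file2.txt": ("blob", blob_id(b"Just some file2 data\n")),
        "morefiles": ("tree", sub),
    })

before = root_for(b"Some cowbell\n")
after = root_for(b"More cow-bell!\n")
print(before != after)  # True: one changed leaf gives a different root
```

The "morefiles" subtree hashes to the same id in both calls, so only the changed blob and the root tree are new objects.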

Direction of the Arrows

The DAG (Directed Acyclic Graph) based on hashes is what makes it possible for git to be fully distributed


Because identical content hashes to the same thing, and you have all of the content all of the time, it’s trivial to do the consistency checks needed to know exactly where, when and how content diverged

This is true even among systems with no communication between them

There is no need to know how you’re going to want to slice-n-dice that information ahead of time

Moving Files

It “noticed” that we renamed the file and directory…

How, since nothing in the data-structures recorded that?

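Some of the porcelain detects renames at runtime by comparing content, with a configurable similarity threshold (for example, git diff -M90%).

Because moves are detected at runtime rather than recorded, some of the porcelain can even show how individual lines have moved around: git blame -M follows lines moved within a file, and git blame -C follows lines moved or copied between files.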
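
The simplest case, an exact rename, falls straight out of the data: a path disappears, another appears, and both carry the same blob id. The sketch below handles only that 100% case using the abbreviated ids from the commits above; Git's real detection also scores similar-but-not-identical content, which is not modelled here. The function name is made up.

```python
def exact_renames(old: dict, new: dict) -> dict:
    """Map old path -> new path for files whose blob id is unchanged."""
    removed = {p: h for p, h in old.items() if p not in new}
    added = {p: h for p, h in new.items() if p not in old}
    renames = {}
    for old_path, blob in sorted(removed.items()):
        for new_path in sorted(added):
            if added[new_path] == blob:
                renames[old_path] = new_path
                del added[new_path]  # each new path is matched at most once
                break
    return renames

# path -> blob id, flattened from the trees before and after the rename commit
old = {"file1.txt": "faff7b9", "file2.txt": "63ffb2b", "morefiles/file3.txt": "77bd771"}
new = {"file1.txt": "faff7b9", "hasfiles/file3.txt": "77bd771", "wow.txt": "63ffb2b"}
print(exact_renames(old, new))
# {'file2.txt': 'wow.txt', 'morefiles/file3.txt': 'hasfiles/file3.txt'}
```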

Not Storing Deltas or Patches?

Each file, as we saw before, is a “complete blob” in git’s object store

Constant Time

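Notice that navigating between revisions is effectively an O(1) operation: a commit names its complete tree directly by hash, so there is no chain of patches to replay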

Writing to the repository is extremely fast: no matter how many files there are, the cost is just calculating the SHA-1s of the new content, zlib-compressing it, and copying only that new content into the object store

Reading vs Writing

Packing and Garbage Collection

$ git gc
Counting objects: 7, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (7/7), done.
Total 7 (delta 0), reused 0 (delta 0)

$ find .git/objects -type f
.git/objects/info/packs
.git/objects/pack/pack-4aaa8775b3535f0f7ae44215272d423d924ee3fa.idx
.git/objects/pack/pack-4aaa8775b3535f0f7ae44215272d423d924ee3fa.pack

The Index

The portal between the working-directory and the repository

Use Cases

All problems in computer science can be solved by another level of indirection.

 — David Wheeler, Beautiful Code

Thus, instead of just the Working Directory and the Repository, there is also the Index

(a.k.a. the Dir Cache, Staging Area, etc.)

Lifecycle

Working directory → Index → Tree → Commit

Manipulating the Index

[Screenshot: staging individual lines in SourceTree]

References

Tags

$ git tag aTag

$ cat .git/refs/tags/aTag
e0012413fb1c0728dc1d9b86c9b074dfc24a220a

$ git cat-file -t e001241
commit

Annotated Tags

$ git tag -a annTag -m "annotated tag here"

$ cat .git/refs/tags/annTag
85aca4362a29ddb70a70c6d6f882d4c1e04b2eb3

$ git cat-file -t 85aca43
tag

$ git cat-file -p 85aca43
object e0012413fb1c0728dc1d9b86c9b074dfc24a220a
type commit
tag annTag
tagger Git Fundamentals <gitfund@git.com> 1410039830 -0600

annotated tag here

Branches

$ cat .git/refs/heads/master
e0012413fb1c0728dc1d9b86c9b074dfc24a220a

The big difference between branches and tags is that the value in a branch’s file changes over time during a normal workflow (i.e., each time you commit to the branch), while a tag normally stays put.

Mutability

The contents of “objects” can never change.

Period.

Changing an object changes its hash, thereby guaranteeing consistency

“References” may change, which is why they don’t have a hash

HEAD

“HEAD” (.git/HEAD) is not really a “reference”

It’s used as part of the porcelain to keep track of where the “user” is


A “detached HEAD” is simply a HEAD that points to a specific commit instead of a branch
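
A small sketch of what "following HEAD" means on disk. It assumes refs are stored as loose files under the git directory (packed refs are ignored), and the resolve function is invented for this example.

```python
import tempfile
from pathlib import Path

def resolve(git_dir: Path, name: str = "HEAD") -> str:
    """Follow 'ref: ...' indirections until a commit id is reached."""
    value = (git_dir / name).read_text().strip()
    while value.startswith("ref: "):
        value = (git_dir / value[len("ref: "):]).read_text().strip()
    return value

commit = "e0012413fb1c0728dc1d9b86c9b074dfc24a220a"
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp)
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "refs" / "heads" / "master").write_text(commit + "\n")

    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    print(resolve(git_dir))  # attached: HEAD names a branch, the branch names the commit

    (git_dir / "HEAD").write_text(commit + "\n")
    print(resolve(git_dir))  # detached: HEAD holds the commit id itself
```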

Revision IDs

A great explanation on Stack Overflow.

Some primary examples:

sha1

a specific hash reference

refname

branch name, tag, HEAD, etc.

rev^

the first parent of the commit (rev^2 is the second parent of a merge)

  • “master^” is the parent of the commit pointed to by the head of “master”

rev~n

the nth-generation ancestor of the commit, following first parents

  • "HEAD~1" is the parent of the commit pointed to by HEAD
  • "HEAD~5" is the commit five generations back from the current one
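
The two suffixes can be sketched on a toy parent table built from the three commits made earlier. The PARENTS dictionary and both function names are hypothetical; real Git reads parents from the commit objects.

```python
# commit id -> list of parent ids (first parent first)
PARENTS = {
    "e001241": ["7581387"],
    "7581387": ["3b1b4ff"],
    "3b1b4ff": [],
}

def nth_parent(commit: str, n: int = 1) -> str:
    """rev^n: the n-th direct parent (only merges have n > 1)."""
    return PARENTS[commit][n - 1]

def ancestor(commit: str, n: int) -> str:
    """rev~n: go back n generations, always following the first parent."""
    for _ in range(n):
        commit = nth_parent(commit, 1)
    return commit

print(nth_parent("e001241"))   # 7581387  (master^)
print(ancestor("e001241", 2))  # 3b1b4ff  (master~2)
```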

Applying Changes

git add _filename_

The process:

  1. Compare the working-directory hash of the file to what is in the index
  2. If the hashes are different, copy the file into the object store as a new blob
  3. Record the hash for the file in the Index (a.k.a., “tree builder”)
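
The three steps above can be sketched with two dictionaries standing in for the Index and the object store. This is a deliberately simplified model (the real index also tracks file modes and stat data), and the add function is invented for the example.

```python
import hashlib

def blob_id(content: bytes) -> str:
    return hashlib.sha1(b"blob " + str(len(content)).encode() + b"\0" + content).hexdigest()

def add(index: dict, object_store: dict, path: str, content: bytes) -> bool:
    """Stage one file. Returns True if the index entry changed."""
    new_id = blob_id(content)
    if index.get(path) == new_id:
        return False                          # step 1: hashes match, nothing to do
    object_store.setdefault(new_id, content)  # step 2: store the blob (only once)
    index[path] = new_id                      # step 3: record the hash in the index
    return True

index, store = {}, {}
print(add(index, store, "file1.txt", b"Some cowbell\n"))            # True
print(add(index, store, "morefiles/file3.txt", b"Some cowbell\n"))  # True (new path, same blob)
print(add(index, store, "file1.txt", b"Some cowbell\n"))            # False (unchanged)
print(len(index), len(store))                                       # 2 1
```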


git commit -m "_message_"

The process:

  1. Write the Index out as a tree object
  2. Create a commit object that has the tree’s hash and meta-data (author, date, comment, etc.)
  3. Update the head of the branch pointed to by HEAD
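
A sketch of steps 2 and 3, taking the tree id from step 1 as given. It assumes SHA-1 ids, an attached HEAD, and the same person as author and committer; refs is a plain dictionary standing in for the files under .git, and both function names are made up.

```python
import hashlib

def commit_id(tree: str, parents: list, author: str, message: str) -> str:
    """Serialize a commit the way `git cat-file -p` displays it, then hash it."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + author, "committer " + author, "", message, ""]
    body = "\n".join(lines).encode()
    return hashlib.sha1(b"commit " + str(len(body)).encode() + b"\0" + body).hexdigest()

def commit(refs: dict, tree: str, author: str, message: str) -> str:
    branch = refs["HEAD"][len("ref: "):]                # e.g. refs/heads/master
    parents = [refs[branch]] if branch in refs else []  # a root commit has no parent
    new_id = commit_id(tree, parents, author, message)  # step 2: create the commit object
    refs[branch] = new_id                               # step 3: move the branch head
    return new_id

refs = {"HEAD": "ref: refs/heads/master"}
who = "Git Fundamentals <gitfund@git.com> 1409711557 -0600"
first = commit(refs, "0550b757d5b412ab40416d2c7ede1313d1c38087", who, "First good version")
second = commit(refs, "366612747e1b0730cdbb680bb48ba6132b321848", who, "Added more cow-bell.")
print(refs["refs/heads/master"] == second)  # True
print(refs["HEAD"])                         # ref: refs/heads/master (HEAD itself did not move)
```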

git checkout _branchname_

The process:

  1. Replace the Index with the tree for the commit
  2. Update the local filesystem to match the Index
  3. Change HEAD to point to the branch

git cherry-pick _rev_

The process:

  1. git diff -C _rev_~1 _rev_ → patch
  2. Apply patch to the current HEAD
  3. Create a new commit, copying over the metadata (such as “author”) from the cherry-picked commit

Merging and Rebasing

Merge

A new commit object is created with two parents

(more parents are possible, but don’t do it)

git merge _rev_

The process:

  1. The trees of both parents are examined, and their common ancestor is found
    • Identical trees/blobs (both hashes are the same) are added to the Index
    • A consistency check is done against the common ancestor to detect whether both sides have added or removed files
    • Three-way merges are performed on files changed on both sides, creating new blobs and adding their hashes to the Index
  2. Generate a new tree from the Index
  3. Create Commit
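
The path-level part of that process can be sketched as a comparison of three path-to-hash mappings. This is a simplified model, not Git's actual merge machinery: it only decides which side's blob wins per path and reports paths that would need a content-level three-way merge. All names and the short ids are illustrative.

```python
def merge_trees(base: dict, ours: dict, theirs: dict):
    """Return (merged path -> blob id, list of conflicting paths)."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o              # both sides agree (including both deleted)
        elif o == b:
            result = t              # only theirs changed
        elif t == b:
            result = o              # only ours changed
        else:
            conflicts.append(path)  # both changed differently: merge file contents
            continue
        if result is not None:      # None means the path was deleted
            merged[path] = result
    return merged, conflicts

base = {"a.txt": "1", "b.txt": "2", "c.txt": "3"}
ours = {"a.txt": "1x", "b.txt": "2", "c.txt": "3"}
theirs = {"a.txt": "1", "b.txt": "2y", "d.txt": "4"}
print(merge_trees(base, ours, theirs))
# ({'a.txt': '1x', 'b.txt': '2y', 'd.txt': '4'}, [])
```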

git merge BR2

[Figures: commit graph before and after git merge BR2]

git merge -s ours BR2

[Figures: commit graph before and after git merge -s ours BR2]

git merge b1

[Figures: branches before and after git merge b1]

Rebase

A series of cherry-picks is computed and applied to a new “base”

git rebase _rev_

The process:

  1. Find all the commits in the current branch, walking back until reaching the common ancestor with _rev_
  2. Update HEAD to the new base (“detaching” it)
  3. For each commit found in step 1, oldest first:
    • apply it as a cherry-pick to HEAD and move HEAD to the newly created commit
  4. Update the branch’s head and repoint HEAD to the branch
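
The same steps on a toy model, assuming linear history (one parent per commit) and treating each commit as just a parent plus a change label. The ids here are shortened hashes of that pair, not real Git ids, and every name is invented for the example.

```python
import hashlib

commits = {}  # id -> (parent id or None, change label)

def commit(parent, change):
    cid = hashlib.sha1(f"{parent}:{change}".encode()).hexdigest()[:7]
    commits[cid] = (parent, change)
    return cid

def ancestors(cid):
    chain = []
    while cid is not None:
        chain.append(cid)
        cid = commits[cid][0]
    return chain  # newest first, includes the starting commit

def rebase(branch_head, new_base):
    on_base = set(ancestors(new_base))
    to_replay = [c for c in ancestors(branch_head) if c not in on_base]  # step 1
    head = new_base                                                      # step 2
    for cid in reversed(to_replay):                                      # step 3: oldest first
        head = commit(head, commits[cid][1])  # same change, new parent, so a new id
    return head                                                          # step 4: new branch head

root = commit(None, "A")
b1 = commit(root, "B")
topic = commit(commit(root, "C"), "D")

new_topic = rebase(topic, b1)
print([commits[c][1] for c in ancestors(new_topic)])  # ['D', 'C', 'B', 'A']
print(new_topic != topic, topic in commits)           # True True
```

The original commits are still in the store after the rebase; only the branch head moves, which is the immutability guarantee from earlier at work.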

git rebase b1

[Figures: branches before and after git rebase b1]

Interactive Rebase

Using the "-i" flag allows you to "step into" the cherry-pick process

You can:

Conclusion

Simplicity Brings Power

Trivial data structures and a few simple architectural principles (DAG, immutable, plumbing vs porcelain) provide an extremely rich, fast, flexible and easily customizable tool

Q & A
