another thing I'm trying to figure out about git is that -- many people think of commits as being diffs. Technically in git commits are snapshots.

but I can't tell if it's important to understand that git commits are "really" implemented as snapshots instead of diffs. Does it actually matter? Why?

(commits are treated in different ways by different commands: `git cherry-pick` treats a commit as a diff, `git checkout` treats it as a snapshot, `git log` treats it as a history)

(1/?)
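
A rough sketch of the "same commit, two views" idea from the post above. This is a toy model, not git's real data structures: a snapshot here is just a dict of path to contents, and `as_snapshot` / `as_diff` are hypothetical helper names.

```python
# Toy model, not git's real data structures: a snapshot maps path -> contents.
parent = {"README.md": "hello\n", "main.py": "print('hi')\n"}
commit = {"README.md": "hello\n", "main.py": "print('hello')\n", "util.py": "X = 1\n"}

def as_snapshot(new):
    """The 'snapshot view': the full set of files, no parent needed."""
    return sorted(new)

def as_diff(old, new):
    """The 'diff view': derived on demand by comparing a commit with its parent."""
    changes = {}
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            changes[path] = "added"
        elif path not in new:
            changes[path] = "deleted"
        elif old[path] != new[path]:
            changes[path] = "modified"
    return changes

print(as_snapshot(commit))      # ['README.md', 'main.py', 'util.py']
print(as_diff(parent, commit))  # {'main.py': 'modified', 'util.py': 'added'}
```

In this model only the snapshot is stored; the diff is computed when a command asks for it.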

@b0rk There’s only been one time I’ve cared that git stores commits as files and not diffs, and that was when I was regularly appending to large files in git. This would’ve been fine with diffs, since each individual change was small, but since the file itself was large, this caused enormous repo bloat. (Fortunately I did know that git stored commits as files, so I quickly realized what had happened and rewrote the repo to not do that.) Despite that, I don’t think this distinction is particularly important, since IMO hitting this sort of edge case is relatively rare.
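
A back-of-the-envelope sketch of why small appends to a large file can add up under a snapshot model. All the numbers are made-up assumptions, and the model deliberately ignores compression and any delta packing git may apply later, so treat it as a worst-case illustration rather than a measurement.

```python
# Hypothetical numbers: a 100 MB file that grows by 1 KB per commit.
initial_size = 100_000_000
append_size = 1_000
commits = 50

def snapshot_storage(initial, step, n):
    """Initial commit plus n appending commits, each storing a full copy."""
    return sum(initial + step * i for i in range(0, n + 1))

def diff_storage(initial, step, n):
    """Store the file once, then only the appended bytes for each commit."""
    return initial + step * n

full = snapshot_storage(initial_size, append_size, commits)
delta = diff_storage(initial_size, append_size, commits)
print(full)   # 5101275000
print(delta)  # 100050000
```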

I personally think this is a failing of a lot of advanced git tutorials, since I don’t think knowing the particulars of .git is important at all until you already have a sound grasp of the commit graph, head, index, working tree, etc.

@b0rk when people treat them as diffs, then they treat git like rsync. And then they do horrific things to the git history and merge branches onto themselves

@b0rk I think part of what's confusing is that git does use "deltas", but apparently that's not really tied to specific commits. I guess it's essentially just a data compression mechanism for a collection of blobs that otherwise have a lot of duplication, either for storage or for transferring information from the server.

@ids1024 that’s a good point, i’ve never thought about how that’s implemented

@ids1024 yea, the "pack" format is an implementation detail in how objects are associated with particular sets of bytes. It can save an appreciable amount of space over the "loose" scheme when there's a lot of duplication, but the user-facing content addressing is exactly the same, perhaps modulo performance.

The 'Internals' section of the Git Book makes vague reference to finding "similar" files when packing, but I'm not sure exactly how that works.

https://git-scm.com/book/en/v2/Git-Internals-Packfiles

@b0rk I don't think it's important to understand it, I see it more like "trivia".

But it shows sometimes and if you know about it it feels less weird. For example, `git status` will likely see one file deleted and another added, but `git diff` (or `git log -p`) will tell you the file has actually been renamed (and possibly slightly changed at the same time).
It also helps understand why `git diff` has different diffing algorithms (e.g. `--patience`) that can better detect such file moves.

@tbroyer that's a good example thanks

@b0rk Just personally, it made a massive difference to my grasp once I understood the storage mechanism… refs, tags, branches, all popped into place.

The whole git internals chapter here was eye-opening…

https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

… but I **strongly suspect** I had to be ready for it. (Like I wouldn’t just sit down a beginner with it as “Intro to Git” 😅)

@carlton do you know why this is? i've heard folks say this a million times (and it feels true to me too) but I don't really understand why it helps

reasons folks mentioned why it might be useful to understand a git commit is a snapshot & not a diff so far:

* it explains why git is weird about file renames (it's just guessing that the file was renamed)
* merge commits make much more sense when you think of them as snapshots (if commits are diffs, how could a commit have 2 parents??)
* it tells you that having gigantic files that you're constantly making small changes to is bad

(2/?)
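
A minimal sketch of the "it's just guessing that the file was renamed" point in the list above. Since only snapshots are compared, a rename has to be inferred by pairing a deleted path with a similar-looking added path. The function name `guess_renames` is hypothetical, and `difflib`'s ratio is a stand-in for illustration, not git's actual similarity metric. The 0.5 threshold mirrors git's default 50% similarity for rename detection.

```python
import difflib

def guess_renames(deleted, added, threshold=0.5):
    """Pair each deleted path with the most similar added path, if similar enough.

    `deleted` and `added` map path -> contents. difflib's ratio is NOT git's
    similarity metric; it only illustrates the idea of guessing.
    """
    renames = {}
    for old_path, old_text in deleted.items():
        best_path, best_score = None, 0.0
        for new_path, new_text in added.items():
            score = difflib.SequenceMatcher(None, old_text, new_text).ratio()
            if score > best_score:
                best_path, best_score = new_path, score
        if best_path is not None and best_score >= threshold:
            renames[old_path] = best_path
    return renames

deleted = {"notes.txt": "alpha\nbeta\ngamma\ndelta\n"}
added = {
    "docs/notes.txt": "alpha\nbeta\ngamma\ndelta\nepsilon\n",
    "LICENSE": "Permission is hereby granted\n",
}
print(guess_renames(deleted, added))  # {'notes.txt': 'docs/notes.txt'}
```

If the file is changed too much in the same commit, the score drops below the threshold and the guess turns into "one file deleted, one file added".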

@b0rk To me, both the "branch might not yet have diverged from other branches" thing and the "commit stored as a snapshot and not a diff" thing flow from everything in git being pointers. I would guess that since most programmers don't have to think about pointers anymore (a huge win!), the "pointer mindset" is less common.

I'm old enough that when I learned everything in git is addressed by its SHA it all made sense.

I also think of tags and branches as the same thing.

@neall yeah i hear people say “everything in git is a pointer” all the time but i haven’t figured out why it’s an ah ha moment to people, i find that when i say those words to git newcomers it’s not helpful to them haha

@b0rk doesn’t git use diffs sometimes for a commit?

@edwin yeah you can import/export patches from git for sure

@b0rk Is the commit the thing git produces when I run git checkout? Or is the commit the thing git stores in its magic directory? What are the hashes then? What gets written to disk?

@lampsofgold

i will try to answer!

* the commit is stored in the magic directory
* the hashes are IDs of commits
* when you run git checkout HASH, git looks in the magic directory and fetches the files for that commit and copies them into your workdir

does that help?
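
The three bullets above can be sketched as a toy object store. Everything here is an assumption for illustration: `store` stands in for the "magic directory", the IDs are SHA-1 hashes of a Python `repr` rather than git's real object format, and `put`, `commit`, and `checkout` are hypothetical helpers.

```python
import hashlib

store = {}  # stands in for the "magic directory": ID -> stored object

def put(obj):
    """Store an object under the SHA-1 of its repr (a toy ID, not git's format)."""
    oid = hashlib.sha1(repr(obj).encode()).hexdigest()
    store[oid] = obj
    return oid

def commit(files, parent=None):
    """Save a snapshot: every file's contents, plus a pointer to the parent."""
    tree = {path: put(text) for path, text in sorted(files.items())}
    return put({"tree": tree, "parent": parent})

def checkout(commit_id):
    """Look the commit up by its ID and rebuild the working directory from it."""
    tree = store[commit_id]["tree"]
    return {path: store[oid] for path, oid in tree.items()}

first = commit({"a.txt": "one\n"})
second = commit({"a.txt": "one\n", "b.txt": "two\n"}, parent=first)

print(checkout(first))   # {'a.txt': 'one\n'}
print(checkout(second))  # {'a.txt': 'one\n', 'b.txt': 'two\n'}
```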

@b0rk it does! I think I'm more confused about what actually gets stored in the magic directory, like how git accomplishes the feat of storing so many commits without ballooning the size of the repo. I thought it did it by placing each commit from a branch on top of each other to create the desired state, so each commit would be the difference between the previous state and this one, but I'm not sure how a snapshot fits into that

@lampsofgold yeah that's a great question!

i tried to explain it in this comic but it might not be that clear

(basically the point is that the sha1 hash of a file only changes when its contents change)

https://social.jvns.ca/@b0rk/111444540069664504

does that make any sense?
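
A small sketch of the "hash only changes when the contents change" point, for anyone who wants to poke at it. `blob_id` follows the way git hashes file contents (SHA-1 over a `blob <size>` header, a NUL byte, then the bytes). The two snapshots and their file names are made up for illustration.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """SHA-1 over a 'blob <size>' header, a NUL byte, then the file contents."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Two hypothetical snapshots: only notes.md changes between them.
snapshot_1 = {"big.csv": b"lots of rows\n", "notes.md": b"v1\n"}
snapshot_2 = {"big.csv": b"lots of rows\n", "notes.md": b"v2\n"}

ids_1 = {path: blob_id(data) for path, data in snapshot_1.items()}
ids_2 = {path: blob_id(data) for path, data in snapshot_2.items()}

# The unchanged file has the same ID in both snapshots, so it is stored once.
unique_blobs = set(ids_1.values()) | set(ids_2.values())
print(ids_1["big.csv"] == ids_2["big.csv"])  # True
print(len(unique_blobs))                     # 3
```

So two snapshots with four file entries between them need only three stored blobs, which is roughly how storing "whole snapshots" avoids ballooning when most files stay the same.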

@b0rk yes! That makes sense, I had thought the reason you don't want binary files in the repo was because git couldn't pick out which "lines" changed, but it's actually part of the fact that you don't want *any* large files in the repo

@lampsofgold yeah! i actually put large binary files in my git repos all the time but there are definitely some consequences to that choice (my repo is like 1GB to clone)

@b0rk any other places? Git is so inconsistent at times I’ve found.

@b0rk a commit is not *just* a snapshot, it's a snapshot and a pointer to the previous snapshot. so it can also be described as a diff of the previous snapshot (or the reverse)

@gray17 i didn’t think i needed to say “please don’t try to explain to me what a commit is” here but i guess i was wrong
