Whirlwind tour of random programming topics

Every now and then I poke my head into the hacking channels on the discord, where people tend to talk about some really cool stuff they’re doing, and then I get the urge to talk about some cool stuff myself. Problem is, on the one hand, I’m not really doing anything cool at the moment[1], so that’s out the window. On the other hand, there’s a lot of cool stuff outside of the immediate “make the GBA do a backflip” kind of wizardry that typically gets the spotlight, and a hobby programmer might never be exposed to it (source: I wasn’t exposed to a lot of this stuff until I was late into college). As someone who wants a soapbox to stand on and has a bit more outside experience than most, I figured that it might be useful to put together a bunch of topics that I find interesting and others might too.

I’m going to be treating this thread more like a blog than a traditional tutorial thread, so feel free to interject between posts with questions, comments, etc.

[1]: If you want to try a scavenger hunt, you can go find my personal math/programming blog that I post to maybe once a year. Fair warning that, if you’re reading this footnote, I don’t think you’ll find it super interesting.


git concepts for ROM hackers

Disclaimer

This is not a tutorial, especially for beginners. There are already far too many resources of that sort out there, most of which are significantly better than what I’m capable of producing. If that’s what you’re looking for, a few of my personal recommendations are:

  • GitHub’s own guides are a fantastic way to quickly get productive with both GitHub Desktop and the command line tool.

  • The Pro Git book can be a bit intimidating and more than a bit dense, but is extremely thorough. This is the resource I tend to reach for first when I’m having trouble.

Instead, this is more of a lecture about how git actually works.

I’m not going to talk too much about the visual-based GitHub Desktop app even though it’s great, because I don’t use it myself. That said, knowing more about what’s going on under the hood can help you be more effective even if you don’t use the command line.

Now, with that out of the way, onto the fun stuff.

Version control, in theory

An extremely common misconception is that the main purpose of git, and of version control in general, is tracking change history and making backups. This may have been true at some point during the dawn of computing when the dinosaurs ruled the earth in the 1970s, but version control since 1980 is focused much more on future versions of your code, and particularly on how you’re going to get there.

Version control exists to answer exactly one question, and one question only:

If you’re editing a file, and I’m editing the same file at the same time, what happens?

If you’ve used concurrent editing software like Google Docs before, this question might seem a bit silly – the person who edits first wins, and then maybe the second person’s changes will end up in between to make words like swweervre if you aren’t careful. However, it’s often nice to have a bit more control over what’s happening, or to be able to do my work without an internet connection, or to be able to walk away from my computer for lunch and not have my code get changed while I’m gone.

Another solution is that, if I’m editing a file, I can “reserve” that file while making my changes, leaving you to sit on your hands until I’m done. Those of you who’ve participated in a community project that primarily consisted of emailing a ROM back and forth with FEBuilder may understand the pain of doing this, which is often more than just the pain of waiting. If the change that I want to make is long, or if it touches a lot of the same stuff that your change touched, I have to spend extra time learning about your change (which may, in and of itself, be big and hard to understand) and convincing myself that my change won’t break anything before I can actually try to start my work. This is actually the approach that the earliest version control systems used, which is where the “checkout” terminology that you may have seen comes from.

A compromise might be to force you, either by some technology or through social pressure, to release your file somewhat frequently, so that I don’t have to wait ten years every time I have to make a change. This will probably have the side effect of making my changes faster, because you’ll have less time to completely change the file, meaning I don’t have to spend as much time figuring out what you did.

In fact, what if we got rid of the reservation system altogether, and just fixed conflicts after the fact, if there was a problem? I mean, how often are you and I editing the exact same line? It’d be way nicer if we could just both work on our changes by ourselves, and then, if it turns out that we accidentally both changed the same thing, it would just tell me so I can fix it, after the fact.

As you may have guessed, modern version control looks a lot like the last idea. It turns out that “the same line” doesn’t quite work for real-life uses, but the idea is the same. The three ideas I outlined above are just about the core concepts behind three of the four big “generations” of version control systems (VCSs) so far (the last generation is the rise of distributed VCS, which is a bit different). From here on out, I’m going to be talking entirely about the world of modern VCSs, including darcs, bzr, hg, and of course git. Most of the concepts in this post apply to all of them, but I’m primarily going to focus on git, since that’s the one that’s most relevant today (also, I’ll give the first person who can DM me proof of someone using bzr in 2021 a badge. The source needs to be before the date that this post was published, from the person/group themselves, so a blog post or tech announcement dated between 2020-12-31 and 2021-11-10).

Differences and commitment issues

The fundamental unit of version control is the “commit”, which consists of two things:

  • Zero or more parent commits
  • A diff, also referred to as the contents of a commit

Let’s first talk about diffs, since those are simpler. Another word you might see version control junkies use is “patch”, which means more-or-less the same thing. As a ROMhacker, you’ve probably heard this word and have some idea of what it means, and it’s used in the same sense here. A “diff” is just a list of changes. For example, if I have a file:

Fire Emblem is an amazing video game series
that I love to play because of its tactics

and I change it to say

Fire Emblem is a terrible video game series
that I hate to play because of its shipping

Donate to Circles

we could think of the diff in three parts, like so:

Fire Emblem is [-an amazing-]{+a terrible+} video game series

that I [-love-]{+hate+} to play because of its [-tactics-]{+shipping+}

{+Donate to Circles+}

where text inside [-...-] is a removal, and text inside {+...+} is an addition. If I were to make this change in git and ask it to show me the diff with git diff, it gives me this:

diff --git a/file b/file
index fa23bad..f17c0b8 100644
--- a/file
+++ b/file
@@ -1,2 +1,4 @@
-Fire Emblem is an amazing video game series
-that I love to play because of its tactics
+Fire Emblem is a terrible video game series
+that I hate to play because of its shipping
+
+Donate to Circles

There’s a bit to see here, but the important things are the lines suggestively marked with - and + at the start. The lines with - are removals, and the + lines are additions, so this diff is saying “remove the first two lines and replace them with the + lines”. Diffs tend to be line-based first for technical reasons (which amount to “it’s easiest”), with patches stored as lists of groups of lines called hunks (the details of how a real text diffing algorithm works are a bit more complicated and implementation-specific than I really want to get into in this post).
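
If you want to poke at line-based diffs without setting up a repository, here’s a minimal sketch using Python’s standard difflib module. To be clear, this is not git’s diffing code, just a stand-in that happens to emit the same unified format. The a/file and b/file labels are ones I picked to mirror the output above.

```python
import difflib

old = [
    "Fire Emblem is an amazing video game series",
    "that I love to play because of its tactics",
]
new = [
    "Fire Emblem is a terrible video game series",
    "that I hate to play because of its shipping",
    "",
    "Donate to Circles",
]

# unified_diff compares two lists of lines and yields the diff line by line.
# lineterm="" stops it from appending newlines, since our lines have none.
diff_lines = list(
    difflib.unified_diff(old, new, fromfile="a/file", tofile="b/file", lineterm="")
)

for line in diff_lines:
    print(line)
```

The result has the same @@ -1,2 +1,4 @@ hunk header and the same - and + lines as the git output, minus the diff --git and index lines, which are git-specific.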

Finally, a third word you might see used in the technical manuals is changeset, which is also basically the same thing. Related to the concept of diffs and commits is a revision, which is “what my code looks like after applying a commit and all of its parents”. In our example above, the diff is that whole complicated thing up above containing six operations (two removals, four additions), but the revision is just the contents of the file, namely, the four lines.

Note that a lot of resources will use “commit” and “revision” interchangeably, and refer to things like “the parent revision”, which just means “the revision of the parent commit”. I’m certainly not qualified to say what’s the correct definition, and some operations make more sense when viewed one way or the other, so I will do my best to be self-consistent and use “commit” when I’m talking about a diff, and “revision” otherwise.
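
To make the commit/revision distinction concrete, here’s a toy model of the definitions above. The Hunk and Commit classes and the revision function are names I made up for this sketch, and it only follows the first parent, so treat it as an illustration of the idea rather than a description of how git stores anything.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hunk:
    start: int          # 0-based index of the first affected line in the parent revision
    removed: List[str]  # lines expected at that position, which get taken out
    added: List[str]    # lines put in their place


@dataclass
class Commit:
    diff: List[Hunk]
    parents: List["Commit"] = field(default_factory=list)


def apply_diff(lines, diff):
    """Apply every hunk to a list of lines and return the new list."""
    result = list(lines)
    # Work from the bottom of the file up so earlier indices stay valid.
    for hunk in sorted(diff, key=lambda h: h.start, reverse=True):
        end = hunk.start + len(hunk.removed)
        if result[hunk.start:end] != hunk.removed:
            raise ValueError("hunk does not match the parent revision")
        result[hunk.start:end] = hunk.added
    return result


def revision(commit):
    """The file contents after applying a commit and all of its ancestors."""
    base = revision(commit.parents[0]) if commit.parents else []
    return apply_diff(base, commit.diff)


before = [
    "Fire Emblem is an amazing video game series",
    "that I love to play because of its tactics",
]
after = [
    "Fire Emblem is a terrible video game series",
    "that I hate to play because of its shipping",
    "",
    "Donate to Circles",
]

first = Commit(diff=[Hunk(0, [], before)])                    # zero parents
second = Commit(diff=[Hunk(0, before, after)], parents=[first])

print(revision(second))
```

Here second.diff is the commit (six operations), while revision(second) is the four-line file you get after replaying it on top of its parent.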

Merging, in practice

Now, let’s say that both of us are working on this same file at the same time. In particular, you’re making a commit with this diff:

diff --git a/file b/file
index f17c0b8..766b656 100644
--- a/file
+++ b/file
@@ -1,4 +1,4 @@
 Fire Emblem is a terrible video game series
-that I hate to play because of its shipping
+that I hate to play because it has swords

 Donate to Circles

and I’m making a commit with this diff:

diff --git a/file b/file
index f17c0b8..be0e2f2 100644
--- a/file
+++ b/file
@@ -1,4 +1,4 @@
 Fire Emblem is a terrible video game series
 that I hate to play because of its shipping

-Donate to Circles
+Donate to Cam

Now, if you commit first, and then I commit, ideally, we’d want git to just automagically come up with a diff that contains both of these changesets, since they don’t really have anything to do with each other.

And in fact it does:

diff --cc file
index be0e2f2,766b656..b263b5e
--- a/file
+++ b/file
@@@ -1,4 -1,4 +1,4 @@@
  Fire Emblem is a terrible video game series
- that I hate to play because of its shipping
+ that I hate to play because it has swords

 -Donate to Circles
 +Donate to Cam

Note that we got both changes together, because they don’t interfere with each other! Don’t mind the extra spaces in this diff formatting, it’s just showing that the two changes came from different places. This is why commits can have more than one parent – if the commit is a merge, it actually has two (or more!). The zero-parent case is much less interesting, and is usually just an artifact of requiring “a first commit”.
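
The rule git is applying here can be sketched in a few lines: compare each side against the common starting point, and if only one side touched a line, take that side’s version. This is a deliberately simplified version that assumes nobody adds or removes lines (real merges work on hunks, not fixed line positions), and merge_lines and MergeConflict are hypothetical names of my own.

```python
class MergeConflict(Exception):
    """Raised when both sides changed the same line in different ways."""


def merge_lines(base, ours, theirs):
    """Merge two edited copies of base, line by line."""
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged = []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), start=1):
        if o == t:        # both sides agree (same edit, or no edit at all)
            merged.append(o)
        elif o == b:      # only their side changed this line
            merged.append(t)
        elif t == b:      # only our side changed this line
            merged.append(o)
        else:             # both changed it, and differently
            raise MergeConflict(f"line {number} changed on both sides")
    return merged


base = [
    "Fire Emblem is a terrible video game series",
    "that I hate to play because of its shipping",
    "",
    "Donate to Circles",
]
mine = base[:3] + ["Donate to Cam"]
yours = [base[0], "that I hate to play because it has swords"] + base[2:]

for line in merge_lines(base, mine, yours):
    print(line)
```

Running it prints the merged file with both the swords line and the Cam line, matching the merge result above.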

This operation of combining two different changesets is called a merge, and git will try its hardest to do these by itself when it can. Sometimes, though, it’s just impossible, like if we try to edit the same line at the same time:

diff --git a/file b/file
index b263b5e..400ee9d 100644
--- a/file
+++ b/file
@@ -1,4 +1,4 @@
 Fire Emblem is a terrible video game series
-that I hate to play because it has swords
+that I hate to play because it has axes

 Donate to Cam

and

diff --git a/file b/file
index b263b5e..882a070 100644
--- a/file
+++ b/file
@@ -1,4 +1,4 @@
 Fire Emblem is a terrible video game series
-that I hate to play because it has swords
+that I hate to play because it has lances

 Donate to Cam

Here, there’s no way to possibly know whose diff should take priority, and we don’t want to just choose the “last” one, because we could be wiping out some valuable work. Instead, this creates a merge conflict, which git will show like this:

Fire Emblem is a terrible video game series
<<<<<<< HEAD
that I hate to play because it has lances
=======
that I hate to play because it has axes
>>>>>>> other

Donate to Cam

Don’t get intimidated! HEAD just refers to the current revision, and other is the other commit we’re trying to merge. Basically, git is trying to tell us:

  • What this hunk looks like at the current revision
  • What this hunk looks like at the revision we’re trying to merge in

and forcing us to figure it out ourselves. Note that I’m talking about revisions here, not diffs, which is important – git is showing us the final state of both of our changes, not the changes themselves. There is a way to coax git into showing us the two diffs (this is called a three way merge, where you see the base and also both diffs), but I don’t know how offhand (nor is it particularly interesting – it gets a bit brain-melting in more complicated situations, but the basic concept is always the same).
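
For the curious, here’s a small sketch of how those conflict markers are laid out, including the variant that also shows the base. If I remember right, git has a merge.conflictStyle setting, and setting it to diff3 adds the base version between the two sides after a line of ||||||| characters. The conflict_block function is a hypothetical helper, and the "base" label is my own placeholder (git prints its own label there).

```python
def conflict_block(ours, theirs, base=None, ours_label="HEAD", theirs_label="other"):
    """Lay out one conflicting hunk the way it appears in the working file."""
    lines = ["<<<<<<< " + ours_label, *ours]
    if base is not None:
        # Optional middle section: what the hunk looked like before either change.
        lines += ["||||||| base", *base]
    lines += ["=======", *theirs, ">>>>>>> " + theirs_label]
    return lines


lances = ["that I hate to play because it has lances"]
axes = ["that I hate to play because it has axes"]
swords = ["that I hate to play because it has swords"]

print("\n".join(conflict_block(lances, axes)))
print()
print("\n".join(conflict_block(lances, axes, base=swords)))
```

The first call reproduces the two-sided markers shown above, and the second adds the swords line in the middle so you can see what each side started from.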

Git can also do some fancier things, where one commit editing a line that is moved by another commit will work “properly”, where the merge will apply to the new location, but I couldn’t convince my test repository to demonstrate it.

Branching out

In practice, it’s not very convenient for merges to work at the commit level. While it isn’t the main purpose, commits also serve as a useful changelog and backup system, so it’s pretty useful to make smaller commits before you’re fully ready to merge. Thankfully, with the model we already have, we actually get this for free!

Say we have some commits B1 (me) and B2 (you), descending from a parent A. I can just make a commit C1 descending from B1, and you won’t even see it! In other words, our commit graph looks like this:

  .--B1--C1
 /
A
 \
  `--B2

But now, we have a problem. If we just look at the diff of C1 to merge, it may not make any sense, because C1 is based on B1, which you haven’t seen. Instead, we want to track everything from A to C1, and merge that with B2. This is what’s known as a branch, which has a base (the “first” commit/revision that we care about) and a tip (the last). Then, we can take the diff between the revision of A and the revision of C1 (think of this as B1 + C1), and merge it with the diff of B2. As you can see, it only really makes sense to merge two branches with the same base.
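
Here’s a quick sketch of the “B1 + C1” idea, again using Python’s difflib as a stand-in for git’s diffing. The chapter list is made-up example content, and changed_lines is a hypothetical helper that throws away the headers and context so only the - and + lines are left.

```python
import difflib

rev_a = ["chapter 1: prologue", "chapter 2: escape", "chapter 3: siege"]
rev_b1 = ["chapter 1: prologue", "chapter 2: ambush", "chapter 3: siege"]  # after B1
rev_c1 = rev_b1 + ["chapter 4: finale"]                                     # after C1


def changed_lines(old, new):
    """Return only the removed/added lines of a unified diff."""
    diff = list(difflib.unified_diff(old, new, lineterm=""))
    body = diff[2:]  # skip the two file header lines
    return [line for line in body if line.startswith(("+", "-"))]


print(changed_lines(rev_a, rev_b1))  # what B1 did
print(changed_lines(rev_b1, rev_c1))  # what C1 did
print(changed_lines(rev_a, rev_c1))  # the whole branch, base to tip
```

The last diff contains the changes from both B1 and C1, and it’s expressed against A, which is a revision you already have. That’s the diff that can sensibly be merged with B2.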

Note that git’s idea of a branch is slightly different – what git calls a branch is actually just a name for a specific commit, like C1. When you try to merge two branches, git will actually search the graph for the least common ancestor (git’s own name for it is the “merge base”), which is the latest commit that both branches descend from, and use that as the common base for the merge. This is much more convenient to use than the above construction, but it’s a bit harder to wrap your head around if you aren’t used to thinking about graphs and diffs.
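
If graphs aren’t your thing, the search is less scary than it sounds. Below is a naive sketch on the example graph from earlier, stored as a dictionary from each commit to its parents. The function names are mine, and I’d expect git’s real implementation to be a lot smarter about performance, but the result is the same idea: collect everything both tips descend from, then keep the latest.

```python
def ancestors(parents, commit):
    """Every commit reachable from `commit` via parent links, itself included."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def merge_bases(parents, left, right):
    """Common ancestors that are not themselves ancestors of another common ancestor."""
    common = ancestors(parents, left) & ancestors(parents, right)
    return {
        c for c in common
        if not any(c != d and c in ancestors(parents, d) for d in common)
    }


# The graph from above: B1 and B2 both descend from A, and C1 descends from B1.
parents = {"A": [], "B1": ["A"], "C1": ["B1"], "B2": ["A"]}

print(merge_bases(parents, "C1", "B2"))
```

This returns a set because, in graphs with criss-crossing merges, there can be more than one “latest” common ancestor. For our little graph the answer is just A.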

All the world’s a stage

To wrap up, let’s talk a bit about a git-specific construct. Keeping track of the file system and which files to actually commit is really hard to do well, and every VCS does it slightly differently. git uses something called a “staging area”, which is basically an intermediate thing holding all the changes you want to commit.

When you run git add (or check the right box in GitHub Desktop), that adds any changes in that file to the staging area (it actually stages all hunks in that file, which is how GitHub Desktop lets you commit individual lines instead of full files). Running git commit then creates a commit from all changes that are currently staged. If you have one, your .gitignore file just tells git not to automatically add untracked files to the staging area if you try to stage “all” changes, which you can override by force-adding files manually with git add -f (try to avoid doing this – .gitignore only applies to files git isn’t tracking yet, so once an ignored file exists at the current revision, changes to it get picked up like any other file’s, which is rarely what you wanted when you ignored it).

Tracking the staging area is something that very few people do because, in the grand scheme of things, it’s pretty rare that you want to do something other than “commit all of my changes”. However, it pays to know that it exists in the rare case that it matters, such as if you want to have some local files that you don’t want to commit, but also don’t want to add to your .gitignore.
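
To tie the pieces together, here’s a toy model of the working files, the staging area, and commits. ToyRepo and everything in it are hypothetical names for this sketch, the file names are invented, and the ignore handling reflects my understanding that ignore rules only apply to files that aren’t tracked yet. Real git is far more involved, so take this as a mental model only.

```python
from fnmatch import fnmatch


class ToyRepo:
    def __init__(self, ignore_patterns=()):
        self.working = {}   # path -> contents of the files on disk
        self.index = {}     # the staging area: what the next commit will contain
        self.commits = []   # one snapshot of the index per commit
        self.ignore = list(ignore_patterns)

    def _ignored(self, path):
        # Assumption: ignore rules only matter for paths that aren't tracked yet.
        matches = any(fnmatch(path, pattern) for pattern in self.ignore)
        return matches and path not in self.index

    def add(self, path, force=False):
        """Stage one file, like naming it explicitly on the command line."""
        if self._ignored(path) and not force:
            raise ValueError(f"{path} is ignored; pass force=True to stage it anyway")
        self.index[path] = self.working[path]

    def add_all(self):
        """Stage everything that isn't ignored."""
        for path, contents in self.working.items():
            if not self._ignored(path):
                self.index[path] = contents

    def commit(self):
        """Record whatever is staged right now, regardless of what's on disk."""
        snapshot = dict(self.index)
        self.commits.append(snapshot)
        return snapshot


repo = ToyRepo(ignore_patterns=["*.sav"])
repo.working = {"main.event": "v1", "notes.txt": "todo", "game.sav": "save data"}

repo.add("main.event")
print(repo.commit())        # only main.event, notes.txt stays local

repo.add_all()
print(sorted(repo.index))   # game.sav is skipped because it matches *.sav
```

The first commit is the “local files I don’t want to commit but also don’t want to ignore” case: notes.txt exists on disk, but because it was never staged, it never makes it into the commit.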

Footnotes

A neat bit of trivia that took me a while to understand was what “Pull Request” actually meant. Pushing means sending all your commits to the server, and pulling means asking the server to send all its commits to you, so that you and the server end up with the same commit graph. A “pull request”, then, is asking a repository to pull from you and then merge your branch into master.

You also may have noticed that most of the diffs I posted are text-based. This isn’t by accident: people have put a lot of time and effort into figuring out how to compute correct and fast diffs on text files, whereas doing this for binaries efficiently is an unsolved problem (maybe someone wants to submit UPS as a binary patch system for git?). This means that git ends up storing the entire file any time a non-text file changes instead of just the diff, which can lead to a lot of slowdown and huge blowups in file size when you clone the repository. This is why it’s generally bad to store things like images, .dmps, etc., in git repos.
