Note: The passage is originally written in Chinese and translated into English via ChatGPT.
Translated passage:
Alias: Gitlet in Hindsight: Why I Suggest You Always Read the DON’Ts Part in Spec First
I spent a few days catching up on the renowned project Gitlet. Its weighty 60-page spec covers everything (lol), and the project provides comprehensive training in every technical foundation, from system design to integration testing. It truly deserves the highest praise of any project in the course’s history.
[Reference]: UCB CS61B-21SP-Gitlet
Project Overview
Gitlet is a version control system that mimics the functionality of the popular system Git and implements some of its basic commands, including init, add, commit, rm, checkout, branch, reset, rm-branch, merge, and more.
As an individual project for the course, it starts with only a few necessary .java classes and a few lines of sample code. The task requires designing and completing the system, object methods, data structures, and a few algorithms based on the requirements.
Gitlet Version-Control Mechanism
In essence, version control in Git (Gitlet) revolves around the question of “how to save a certain version” and “how to switch to a specific version.” These two questions can be understood from three levels, from top to bottom: the user level, the object level, and the file read/write level.
From a design perspective, there should be an abstraction barrier between these three levels. This means that users issuing commands don’t need to know about or manipulate objects, pointers, etc., and interactions between objects should not involve file read/write operations either.
User Level
First, let’s discuss what happens at a higher level, which is the part that users are aware of.
- What does Git initialization do? It creates a hidden directory called .git in the current working directory (CWD) and some files inside it.
- How are file versions saved? When a commit is made, Git captures the current snapshot of the committed files and stores it in the .git directory.
- How to switch to a specific version? When using commands like checkout or reset to switch versions, Git looks for the corresponding snapshot based on the given branch name/commit ID in the .git directory. It then restores the specified file or the entire directory in the CWD to match that snapshot.
Object Level
Now let’s see how these steps are implemented at the object level. Gitlet simplifies the directory structure of Git to some extent, storing less metadata for each object, but the essence remains the same. The following diagram represents the structure of the .gitlet directory.
- Gitlet version control utilizes two types of objects: Commit and Blob.
  - Each Blob object corresponds to a file snapshot.
  - Each Commit object corresponds to a commit.
- How are these objects used to track file versions?
  - When a file is added to the staging area (add [file name]), a Blob object is created to store the current file content. The mapping between the file name and the corresponding Blob instance is then stored in the staging area.
  - When committing files, a Commit object is created. It retrieves the mapping relationships from the staging area and saves them in the Commit object. In addition to the index mapping, each Commit object also records the parent Commit, timestamp, commit message, etc.
  - Example: In the diagram below, each blue square represents a Commit object. Inside each Commit object, there is a Map that records the file snapshots for the current commit. For instance, both Hello.txt in Commit 1 and Commit 2 point to Blob 0, indicating that the file content (snapshot) did not change in these two commits.
- How is switching to a specific version implemented? By moving pointers.
  - To switch the HEAD pointer to another branch’s branchHeadCommit, at the object level, it means that the HEAD pointer, originally pointing to a Commit object on branchA, should now point to another Commit object branchHeadCommit on branchB. This can be achieved by something like updatePointerToCommit(HEAD, branchHeadCommit).
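The object-level bookkeeping described above can be sketched entirely in memory. This is a minimal sketch under stated assumptions, not the author's implementation: the class ObjectModelSketch and every method name in it are hypothetical, commit IDs are a simple counter standing in for SHA-1 hashes, blob IDs are passed in as plain strings, and nothing is written to disk.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of commits, branches and HEAD. All names are hypothetical. */
class ObjectModelSketch {

    /** A commit records its parent, a message, and a file name -> blob ID map. */
    static final class Commit {
        final String parentId;                // null for the initial commit
        final String message;
        final TreeMap<String, String> files;  // file name -> blob ID

        Commit(String parentId, String message, Map<String, String> files) {
            this.parentId = parentId;
            this.message = message;
            this.files = new TreeMap<>(files);
        }
    }

    final Map<String, Commit> commits = new HashMap<>();      // commit ID -> Commit
    final Map<String, String> branchHeads = new HashMap<>();  // branch name -> commit ID
    final Map<String, String> staged = new TreeMap<>();       // file name -> blob ID
    String currentBranch = "master";
    String head;                                              // ID of the current commit
    private int counter = 0;                                  // stand-in for SHA-1 IDs

    ObjectModelSketch() {
        head = store(new Commit(null, "initial commit", new TreeMap<>()));
        branchHeads.put(currentBranch, head);
    }

    private String store(Commit c) {
        String id = "commit" + counter++;
        commits.put(id, c);
        return id;
    }

    /** add: remember which blob holds the current content of fileName. */
    void add(String fileName, String blobId) {
        staged.put(fileName, blobId);
    }

    /** commit: copy the parent's map, overlay the staged entries, move the pointers. */
    String commit(String message) {
        Map<String, String> files = new TreeMap<>(commits.get(head).files);
        files.putAll(staged);
        staged.clear();
        head = store(new Commit(head, message, files));
        branchHeads.put(currentBranch, head);
        return head;
    }

    /** branch: a new pointer to the current commit. */
    void branch(String name) {
        branchHeads.put(name, head);
    }

    /** Switching versions only moves HEAD to the other branch's head commit. */
    void checkoutBranch(String name) {
        if (!branchHeads.containsKey(name)) {
            throw new IllegalArgumentException("No such branch: " + name);
        }
        currentBranch = name;
        head = branchHeads.get(name);
    }
}
```

Because an unchanged file keeps the same blob ID in the copied map, two consecutive commits end up pointing at the same Blob, which is the Hello.txt situation in the example above.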
File I/O Level
Finally, let’s dive into the lowest level and examine file read/write operations. Since certain commands require storing Blob objects, Commit objects, and the current content of the staging area locally, two questions arise:
- How are objects stored as data (to retrieve and use them later)?
  - Java’s serialization is used to store objects. In Gitlet, all objects can be serialized and stored in files, including Blob, Commit, and StagingArea (if applicable). In the .git directory, they are stored in the /objects/ directory.
  - On the other hand, pointers are managed through file read/write operations (without serialization). Each pointer corresponds to a file that contains the ID of the object it points to. When modifying a pointer’s target, the actual change occurs in the file by updating the ID. In the .git directory, pointers are stored in the .git/refs/ directory.

```java
/* Serialize a Model object */
Model m = ....; // Assuming Model class implements Serializable
File outFile = new File(saveFileName); // Create a new File
try {
    ObjectOutputStream out =
        new ObjectOutputStream(new FileOutputStream(outFile));
    out.writeObject(m); // Write the object to the stream
    out.close();
} catch (IOException excp) {
    ...
}
```

```java
/* Deserialize a Model object */
Model m;
File inFile = new File(saveFileName);
try {
    ObjectInputStream inp =
        new ObjectInputStream(new FileInputStream(inFile));
    m = (Model) inp.readObject(); // Cast object into the expected class
    inp.close();
} catch (IOException | ClassNotFoundException excp) {
    ...
    m = null;
}
```

- How to find and retrieve objects/modify pointer targets?
  - Gitlet, like Git, uses SHA-1 (Secure Hash Algorithm 1) to generate a 160-bit hash value as the unique ID (40 hexadecimal characters) for each object. When an object is created, its ID is generated based on its content. For example, identical file contents will produce the same ID after hashing. The filenames for the stored objects are their respective IDs. This means that the ID can be used to locate the serialized objects in the directory. Furthermore, this enables content-addressable lookup based on the object’s content.
  - Regarding object retrieval, let’s take obtaining a Commit object as an example. The steps include: obtaining the commit ID (which should be a field of the object) -> obtaining the file path based on the ID (since they are stored in a specified directory) -> deserializing the file. Similarly, modifying the object involves updating its content and then serializing it back into the file.
  - As for modifying pointer targets, at the file read/write level, the operation involves: obtaining the ID of the targetCommit -> writing the ID into the file corresponding to the HEAD pointer.
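The lookup steps above (ID from content, path from ID, pointer as a file holding an ID) can be sketched as follows. This is an illustrative sketch only: the class ContentStoreSketch, its method names, and the "objects" and "HEAD" file names under the given root are assumptions made for this example, and it stores raw bytes rather than serialized Java objects to keep it short. Only standard JDK calls are used (java.security.MessageDigest for SHA-1, java.nio.file.Files for I/O).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical content-addressed store: each file is named after the SHA-1 of its bytes. */
class ContentStoreSketch {
    private final Path objectsDir;  // assumed layout: <root>/objects/<id>
    private final Path headFile;    // assumed layout: <root>/HEAD

    ContentStoreSketch(Path root) throws IOException {
        objectsDir = root.resolve("objects");
        headFile = root.resolve("HEAD");
        Files.createDirectories(objectsDir);
    }

    /** Returns the 40-character hexadecimal SHA-1 of the given bytes. */
    static String sha1Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    /** Saves the content under its own ID; identical content maps to the same file. */
    String save(byte[] content) throws IOException {
        String id = sha1Hex(content);
        Files.write(objectsDir.resolve(id), content);
        return id;
    }

    /** ID -> path -> bytes. */
    byte[] load(String id) throws IOException {
        return Files.readAllBytes(objectsDir.resolve(id));
    }

    /** Moving a pointer is just overwriting the ID stored in its file. */
    void setHead(String commitId) throws IOException {
        Files.write(headFile, commitId.getBytes(StandardCharsets.UTF_8));
    }

    String getHead() throws IOException {
        return new String(Files.readAllBytes(headFile), StandardCharsets.UTF_8);
    }
}
```

Saving the same bytes twice yields the same ID and overwrites the same file, which is why unchanged files cost no extra storage in this scheme.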
(Reminder to self) It’s important to encapsulate these operations within the objects. The main logic should not contain statements like commitMap.put(readContentAsString(commitPath), readContentAsString(blobPath)). When reviewing others’ implementations, if you come across such mixed file read/write operations at the object level, be cautious.
In Brief
The above passages may seem repetitive, but they provide different perspectives on version control systems. Here’s a basic analogy:
- Initializing a version control system == Placing a small box in the current working directory.
- Saving file versions == Each time a commit is made, making a copy of the submitted files and storing them in the box.
- Switching the current directory or a file in the directory to a specific version == Finding the corresponding archive in the box and bringing it back to the current working directory (CWD).
The other concepts, such as objects, pointers, hashing, etc., are methods used to optimize the copying process and speed up the retrieval of files from the box (at least that’s how I understand it):
- Blob object == Archive of a single file.
- Commit object == A note indicating which file versions to retrieve at a particular time.
- Commit tree == An outline of the notes (commits).
- SHA-1 hashing == Giving each file a name based on its content (useful for fast content comparison and addressing).
- Pointers == Labels indicating which version is currently in the box.
Archiving files is a straightforward process that anyone can do (think: paper_final_final_final.docx). However, I believe the essence lies in SHA-1 hashing.
Reflection
Notes taken while coding, might be messy.
Reading Order for the Spec
- The project specification for Gitlet is quite lengthy, making it impractical to read it all at once before starting. If I were to do it again, I would watch the videos first, skim through the command explanations and the “avoids” section, and then refer to the spec while coding.
- It’s important to pay close attention to the “Don’t” sections in the spec. They are there because people tend to make exactly those mistakes. For example, I used a HashMap as the default Map implementation and ran into a Heisenbug. A TreeMap should be used instead to keep the order consistent. (This bug comes up again later.)
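To make the HashMap/TreeMap point concrete: HashMap makes no guarantee about iteration order, while TreeMap always iterates in sorted key order. The sketch below is hypothetical (the class and method names are made up for illustration, and the exact cause of the Heisenbug is not stated in this post); it only shows that copying entries into a TreeMap gives the same listing regardless of the order in which the entries were inserted, which matters if the listing is later hashed or serialized.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: produce an order-independent listing of a file map. */
class DeterministicListing {

    /** Returns "name=blobId;" pairs sorted by file name, whatever Map type is passed in. */
    static String canonicalListing(Map<String, String> files) {
        StringBuilder sb = new StringBuilder();
        // Copying into a TreeMap sorts the entries by key.
        for (Map.Entry<String, String> e : new TreeMap<>(files).entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append(';');
        }
        return sb.toString();
    }
}
```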
Regarding Design
- Initially, it’s crucial to read the entire spec comprehensively, understanding the roles of each object and their commonly used interaction methods. Once the design is clear, implementation can begin.
  Positive example: When implementing the branch command, a major directory overhaul was planned. However, due to a well-designed abstraction earlier, only a single line needed to be added to the File directory without any other changes.
- Protect the abstraction barrier. Interactions between higher-level objects should avoid using lower-level operations.
  Negative example: Initially, hashing and serialization were done directly in the main logic, leading to a lot of refactoring during the encapsulation process.
- Naming is crucial. After some painful lessons, the following points are summarized:
  - Uniformity: Just like joining database tables, if objects need to communicate, they must have some common names. A negative example would be what I did initially, using different names for the IDs obtained from SHA-1 hashing, such as shaName, shaId, hashName, etc.
  - Intuitiveness: Variable names should be as specific as possible. For example, “map” can be written as “keyToVal,” making it easier to understand.
  - Generality: Methods should not be overly specific so that they can be easily recalled when used elsewhere. For example, instead of using “getHead” and “getMaster,” it is better to use “getCommit” and “getPointer.”
Other Points
- It would be beneficial to read the source code of Git to find better practices, although the specification alone is generally sufficient.
Stats
Time and Space
The code consists of approximately 1,000 lines. It took around 4.5 days to complete, or around 40 hours as recorded by Wakatime, more than 10 of which went to debugging (oops!). I remember Josh sharing some data in class: most students took around 30-40 hours to complete the project.
From a statistical standpoint, Gitlet is not a large-scale project. However, considering that it requires independent completion and involves design, unit and integration testing, makefile, Java file I/O, algorithms, encoding, and even training on Git itself, it is still a highly rewarding experience.
Autograder Results
All functional tests passed successfully. The Extra-Credit tests failed, as well as the style check (mainly due to naming, which I’ll improve next time). However, I believe those failures don’t significantly impact the overall outcome, so I didn’t continue the autograder-oriented programming.
BTW, reasons for not implementing Extra-Credit features:
- Towards the end, many commands were combinations of previous commands, resulting in diminishing returns.
- The “remote” command in Gitlet differs greatly from Git and may not contribute much to understanding the underlying logic of Git.
- After encountering a Heisenbug, my energy was depleted.
Despite not pursuing the Extra-Credit tasks, I still found the Gitlet project rewarding and a valuable learning experience.
Reflection
Gitlet is an impressive project, even by the standards of renowned universities. As I mentioned at the beginning, I firmly believe that most individuals with a similar level of proficiency can gain a lot from this project. Just take a glance at the spec, and you’ll understand.
Speaking from a technical standpoint, I’m not qualified to say much as I consider myself a novice. So, let me conclude with a few remarks. In one of Josh’s lectures, he shared the results of a Gitlet survey, and one particular detail stuck with me. Some students spent over 50 hours on this project (which, in my personal experience, far exceeds the workload of a typical course assignment in the freshman year). However, only a few individuals gave negative feedback in the end, and I recall Josh expressing his apologies for that. This indirectly reflects the worthwhileness of “The Gitlet Grind.”
An official end of CS61B – so grateful for the open-source materials! Time to move on and continue working on something else.
Appendix
The following is an excerpt from the README I wrote, explaining the design aspects of my Gitlet implementation.
Design
Abstraction Principle
- An issue with version control systems:
  Requires cumbersome operations like hashing, serialization, map operations, directory concatenation, file I/O, etc.
- Solution:
  - On a higher level, involve only communications between objects (between Blob and Commit, there should only be Blob b = commit1.get(filename)).
  - Eliminate the need to dive into low-level operations through encapsulation.
    i.e. Outside the class of that object, never try to hash things, or modify maps inside Commit/Blob objects.
    E.g. The StagingArea supports common map operations. Upon put (fileName, Commit), it completes: read commit into commit id -> put into its map -> serialize itself and write into the file for staging.
Persistence
The directory structure looks like this:
[directory tree of the .gitlet folder not preserved]
The Main class is the entry class of the project. It is responsible for calling different functions according to given commands.
The Repository class will set up all persistence. It will
- Create and initialize files and directories in the .gitlet folder if the directory does not exist;
- Handle all updates of HEAD, master, branchHeads and the serialization of two StagingAreas add and rm;
- Execute the commands / function calls from Main.
The Commit class handles the serialization of Commit objects. It also deals with conversion between commit ids and commit objects. Each Commit records mappings of held file names and their corresponding file content. Specifically, it fulfills the following purposes:
- Constructs Commit objects;
- Serializes and saves Commit objects to the .gitlet/commits directory;
- Given a commit id, retrieves the corresponding Commit object.
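The three responsibilities listed above could look roughly like the sketch below. It is not the author's Commit class: CommitSketch, its fields, and the save/fromId method names are hypothetical, the ID is supplied by the caller instead of being derived with SHA-1, and the commits directory is passed in as a parameter rather than fixed to .gitlet/commits.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical commit that knows how to save itself and how to be looked up by ID. */
class CommitSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String id;        // assumption: supplied by the caller, not hashed here
    private final String parentId;  // null for the initial commit
    private final String message;
    private final TreeMap<String, String> files;  // file name -> blob ID

    CommitSketch(String id, String parentId, String message, Map<String, String> files) {
        this.id = id;
        this.parentId = parentId;
        this.message = message;
        this.files = new TreeMap<>(files);
    }

    String getId() { return id; }
    String getParentId() { return parentId; }
    String getMessage() { return message; }
    String getBlobId(String fileName) { return files.get(fileName); }

    /** Serializes this commit into commitsDir, using the commit ID as the file name. */
    void save(File commitsDir) throws IOException {
        commitsDir.mkdirs();
        File target = new File(commitsDir, id);
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(target))) {
            out.writeObject(this);
        }
    }

    /** ID -> file path -> deserialized commit. */
    static CommitSketch fromId(File commitsDir, String id)
            throws IOException, ClassNotFoundException {
        File source = new File(commitsDir, id);
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(source))) {
            return (CommitSketch) in.readObject();
        }
    }
}
```

Callers only ever see IDs and CommitSketch objects; the streams and paths stay inside the class, which is the abstraction barrier the Design section asks for.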
The Blob class handles the serialization of Blob objects. A blob is a snapshot of a file’s content at the moment of addition. For instance, a file named “hello.txt” can refer to different Blobs in different Commits.
Its functions are similar to Commit, namely object construction, serialization and retrieval.
The StagingArea class stores files for addition and removal. A StagingArea works like a Java Map, stores mappings of file plain names to their blob ids, and supports basic map operations (remove, get, put). add and rm are StagingAreas for staged addition and removal respectively.
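A map-like staging area that persists itself on every change might be sketched as below. Everything here is an assumption for illustration: the class name StagingAreaSketch, the open factory method, and the choice to serialize only the underlying TreeMap to a caller-supplied file. The original StagingArea may be structured differently.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.TreeMap;

/** Hypothetical staging area: a file name -> blob ID map that saves itself after each change. */
class StagingAreaSketch {
    private final File file;                          // where this staging area is persisted
    private final TreeMap<String, String> fileToBlobId;

    private StagingAreaSketch(File file, TreeMap<String, String> fileToBlobId) {
        this.file = file;
        this.fileToBlobId = fileToBlobId;
    }

    /** Loads the staging area stored in file, or starts an empty one if file does not exist. */
    @SuppressWarnings("unchecked")
    static StagingAreaSketch open(File file) throws IOException, ClassNotFoundException {
        if (!file.exists()) {
            return new StagingAreaSketch(file, new TreeMap<>());
        }
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return new StagingAreaSketch(file, (TreeMap<String, String>) in.readObject());
        }
    }

    String get(String fileName) {
        return fileToBlobId.get(fileName);
    }

    void put(String fileName, String blobId) throws IOException {
        fileToBlobId.put(fileName, blobId);
        save();
    }

    void remove(String fileName) throws IOException {
        fileToBlobId.remove(fileName);
        save();
    }

    /** Callers never serialize anything themselves; every mutation ends here. */
    private void save() throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(fileToBlobId);
        }
    }
}
```

With this shape, the two staging areas mentioned above would simply be two instances opened on two different files, one for additions and one for removals.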