This post explores the fundamentals of version control systems and backup strategies as covered in Lecture 8 of the OCF Linux SysAdmin DeCal. Understanding these concepts is crucial for any developer or system administrator looking to maintain code integrity and data safety.
Why Version Control?
Version Control Systems (VCS) are essential tools in modern software development for several key reasons:
- Track changes: Monitor modifications to code over time.
- Collaboration: Work with others without worrying too much about conflicting changes.
- Safe experimentation: Create and test features without breaking production code.
Examples of VCS include Git, Mercurial, Subversion, Perforce, and Bazaar. Git has become the modern-day standard, and it's important to note: Git is not GitHub! Git is the version control system itself, while GitHub is a cloud-based hosting service for Git repositories.
About Git
Git is a free and open-source system created by Linus Torvalds in 2005 for development of the Linux kernel. After the kernel's previous proprietary VCS, BitKeeper, withdrew its free version, Torvalds built the first implementation in just 2-3 months.
Fun fact: "Git" is also British slang meaning "unpleasant person." Torvalds has said: "I'm an egotistical bastard, and I name all my projects after myself. First 'Linux', now 'Git'."
What Makes Git Special?
Git Stores Snapshots
Git stores snapshots (versions) of files and directory structure, keeping track of relationships, authors, dates, and log messages.
Git Has Integrity
Git hashes every object with SHA-1, so nothing can change without Git knowing. This protects against:
- Lost information in transit
- File corruption
Git is Very Fast
Nearly every operation is local with no network latency overhead. Browsing project history involves Git reading directly from your local database. This means you can work offline.
Distributed Architecture
There can be many copies of a given repository, each supporting independent development, with machinery to transmit and reconcile versions between repositories.
Git Internals
Git represents project history as a directed acyclic graph (DAG) of commit nodes. Nodes point one-way to the state they're based on, and there are no cycles.
Key Components
- Commits: Point to the tree capturing the project state at that moment (a snapshot), plus metadata and parent commits.
- Blobs: Files, stored as "blobs" of bits.
- Trees: Folders, containing blobs and/or other trees.
- Branches: Pointers to the head of a line of work. Default name is master or main.
- HEAD: A pointer to the local branch you're currently on.
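These pieces are easy to see with git cat-file, which prints raw objects straight from the local database. A quick sketch — run it inside any repository with at least one commit:

```shell
git cat-file -p HEAD            # the commit object: tree, parent, author, message
git cat-file -p 'HEAD^{tree}'   # the tree: a list of blobs and subtrees
git log -1 --format=%T          # just the SHA-1 of that tree
```

Every object here is addressed by its SHA-1 hash, which is exactly how Git gets its integrity guarantee.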
File States in Git
Understanding file states is crucial for working with Git:
Modified
File is changed but isn't committed to your database (repository) yet.
Staged
Modified file is marked in its current version to be included in the next commit snapshot.
Tracked vs. Untracked
- Tracked files: Files Git knows about (from the last snapshot plus newly staged files). Can be unmodified, modified, or staged.
- Untracked files: Everything else. Use .gitignore to specify intentionally untracked files to ignore.
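A minimal .gitignore might look like this (the patterns are illustrative, not from the lecture — note that comments must sit on their own lines):

```
# compiled objects
*.o
# build output directory
build/
# editor swap files
*.swp
# macOS folder metadata
.DS_Store
```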
Getting Started with Git
Creating a Repository
For an existing project in directory proj/:
cd proj/
git init
This creates a new subdirectory, .git, making proj/ a Git repository.
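From there, a first commit might look like this (the file name is illustrative):

```shell
cd proj/
git init
git add README.md                 # stage a file (assumes README.md exists)
git commit -m "Initial commit"    # record the snapshot
git log --oneline                 # one line per commit in the history
```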
Cloning an Existing Repository
git clone <repo URL> [destination]
This creates a directory, initializes a .git directory inside it, and pulls down all data from the repository.
git clone https://github.com/0xcf/decal-labs
Branching: Git's Killer Feature
Branching is what sets Git apart from other version control systems. The idea is to create a new 'branch' with the current branch as the 'trunk'. This creates a new pointer at the same location as HEAD. Every time you commit, the pointer of the active branch moves forward automatically.
Basic Branch Commands
- git branch <name>: Create a new branch
- git checkout <name>: Switch to a branch
- git checkout -b <name>: Create and switch to a new branch
- git branch -d <name>: Delete a branch
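A quick sketch of these commands in sequence (the branch name feature is illustrative):

```shell
git checkout -b feature   # create "feature" and switch to it
git branch                # list branches; * marks the one you're on
git checkout master       # switch back (use main if that's your default)
git branch -d feature     # delete it; -d refuses unless fully merged
```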
Merging
The idea is to combine a branch back into the mainline/trunk with a merge commit. For example, if you have branch iss53 (issue #53) and are ready to merge it into the main codebase:
git checkout master
git merge iss53
git branch -d iss53
This changes HEAD to point to master, creates a merge commit combining both branches, and deletes the branch iss53.
Merge Conflicts
If the same part of the same file differs between the two branches you're merging, Git can't merge them cleanly and reports a merge conflict. Conflicting files are left unmerged, and Git adds conflict-resolution markers to them:
<<<<<<< HEAD
[Lines of code from HEAD (i.e., master)]
=======
[Lines of code from the branch being merged in (i.e., iss53)]
>>>>>>> iss53
Use git status to see which files conflict, edit them to resolve conflicts, then git add each file to mark it as resolved, and git commit to finalize the merge.
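Sticking with the iss53 example, the full resolution might look like this (index.html stands in for whichever file conflicts):

```shell
git merge iss53        # fails: "Automatic merge failed; fix conflicts..."
git status             # conflicting files appear under "Unmerged paths"
vim index.html         # edit out the <<<<<<< ======= >>>>>>> markers
git add index.html     # mark the conflict as resolved
git commit             # finalize the merge commit
```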
Rebasing
The idea is to take all commits from a branch and apply them on top of HEAD with no merge commit, as if you had made all the commits on the main branch to begin with (linear history).
git checkout experiment
git rebase master
git checkout master
git merge experiment
Note: Rebase rewrites history, so it's best used on private branches that no one else has based work on.
Working with Remotes
The remote is the offsite copy of the repository. origin is the default name for a remote when you run git clone. Remote branch names take the form <remote>/<branch>.
Essential Remote Commands
- git remote -v: View all remotes and their URLs
- git remote show <remote>: Show information about a remote
- git fetch <remote>: Fetch new data from the remote and update the local database without modifying the working directory
- git pull [remote/branch]: Equivalent to git fetch && git merge
- git push <remote> <branch>: Push local commits to the remote branch
Good practice: Pull before you push to avoid conflicts.
The Git Workflow
Here's a typical workflow for working with Git:
git checkout master
git checkout -b feature
[Modify files]
git add file-changed-1 [...] file-changed-n
git commit -m "description of changes"
git push origin feature
git checkout master
git pull origin master
git merge feature
git push origin master
Backups: Just Do It
According to Murphy's law: "Anything that can go wrong will go wrong." You need backups to protect against:
- Accidental or malicious deletion
- Device failure
- Software failure
- Theft
Important considerations:
- Automate backups because you will forget
- Don't leak information! Backups must be secure
- Make sure your backups actually work by routinely testing your backup and recovery procedures
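One common way to automate this is a cron job; a hypothetical crontab entry (the host and paths are invented for illustration):

```
# Every night at 02:30, mirror /home to a backup server over SSH
30 2 * * * rsync -a --delete /home/ backup@backups.example.com:/srv/backups/home/
```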
The 3-2-1 Rule
Follow this simple rule to ensure data safety:
- 3: Have at least 3 copies of your data
- 2: Store your data on at least 2 different media (e.g., 1 hard drive, 1 backup server)
- 1: Have at least 1 copy of your data off-site (e.g., on Amazon S3, "the cloud")
The GitLab Database Outage: A Cautionary Tale
On January 31, 2017, GitLab suffered a major database outage that serves as an important lesson on the importance of backups.
What Happened
An engineer accidentally ran rm -rf on their production PostgreSQL database. They noticed and stopped it after about a second, but 300GB of production data was already gone. This shouldn't have been catastrophic—they could just recover from a backup, right?
Backup 1: Amazon S3
GitLab had an automated process to upload a backup to Amazon S3 every 24 hours. However, their backups had been failing for weeks due to a version mismatch, and their notification system was broken too.
Backup 2: Azure Disk Snapshots
GitLab had enabled Azure disk snapshots to run every 24 hours... except on the database servers, because they thought they had enough backups.
Hail Mary: LVM Snapshots
LVM (Logical Volume Manager) snapshots weren't meant to be a backup, but luckily they had these. An engineer had run the copy from prod to staging ~6 hours before the incident. Unfortunately, it took GitLab 18 hours to recover since staging was not meant for data recovery (different region and slow disks).
Impact
GitLab estimated they lost at least 5,000 projects, 5,000 comments, and roughly 700 users. The incident became a feel-good story of transparency (the engineer involved wasn't fired), but also a cautionary tale about backup failures.
The lesson: keep backups, make sure they actually work, and practice recovering from them.
Tools for Backups
rsync
Simple command-line utility for local ↔ remote transfer. Skips copying files that are the same at the destination, making it ideal for backups. Uses SSH for transferring to remote hosts.
rsync -av -P [source] user@host:[destination]
The -a flag enables archive mode (recursion, plus preserving permissions, timestamps, symlinks, and ownership); -P shows progress and keeps partially transferred files so interrupted transfers can resume.
rclone
Described as "rsync for cloud storage," rclone supports every major cloud storage provider and can mount cloud storage as a local filesystem.
rclone sync source:path dest:path
Additional Resources
- Pro Git (git-scm.com) — The bible for Git. Chapters 1-5 form a good foundation.
- Oh Shit, Git!?! (ohshitgit.com) — "Git is hard: screwing up is easy, and figuring out how to fix your mistakes is fucking impossible." Short guide on how to recover from common Git mistakes.
- Shell/Editor integration — Tools like vim-fugitive can enhance your Git workflow.