Using Git to Version Control Your Manuscripts
In this post, I would like to explain why I like to use Git for version control, and provide a short guide for beginners who want to use Git for academic purposes.
1. Why Git?
- Git takes a snapshot of your files every time you commit, and the full history can be pushed to a remote repository, so that you do not have to manually add identifiers to each version of your files.
- Web-based hosting service providers, such as GitHub, GitLab, or Bitbucket, provide a nice interface where you can see differences between any two versions of a file. This is particularly helpful when multiple people collaborate on the same project.
- Git is very flexible - there are graphical interfaces where you can drag & drop, as well as command line commands that enable you to commit, pull, or push.
- There are “oops” moments when you accidentally delete something you did not mean to delete. Without a backup on another hard drive, or the help of a paid cloud storage solution, the only thing you can do is to write the stuff again - a pain! With Git, restoring the file from the last commit (or pulling it from the remote repository if the local copy is gone) resolves this problem.
- It is completely free. GitHub now offers free private repositories, with a limit of 1GB per project, and 100MB per file. With the free version, private repositories can have up to 3 collaborators. Bitbucket has a less visually attractive interface. GitLab allows unlimited collaborators for private repositories, and each project can have up to 10GB space.
- I particularly like GitHub’s issue tracking and project management. You can list all things that need to be done in the project management page as cards, and assign “to do”, “in progress”, and “done” labels to them.
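To make the “oops” scenario from the list above concrete, here is a minimal sketch of recovering a deleted file that had already been committed. It runs in a throwaway repository; the file name, the author identity, and the temporary directory are all hypothetical and exist only for this demonstration.

```shell
# Throwaway repository in a temporary directory (hypothetical setup).
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q

# Hypothetical identity, set only for this demo repository.
git config user.name "Jane Doe"
git config user.email "jane@example.com"

# Commit a small draft so that Git has a snapshot of it.
echo 'Important paragraph.' > my-draft.tex
git add my-draft.tex
git commit -q -m 'add draft'

rm my-draft.tex                 # the "oops" moment
git checkout -- my-draft.tex    # restore the file from the last commit
cat my-draft.tex
```

Newer versions of Git also offer git restore my-draft.tex for the same purpose. Either way, only content that was committed can be recovered like this.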
2. Get Started
2.1 What to version control and what not to
Usually when starting a project, we have a few subfolders, including manuscript,
simulation, real data application, and references. The manuscript folder will
contain the .tex and .bib files, together with any .bst files per
the specifications of journals. The simulation folder will be home to the code we
use to conduct simulation studies, as well as any saved results in .RData. The
real data folder will, similarly, also contain code files. The references folder
will hold the reference material.
One quick principle is to only version control things that may change. For
example, there is no need to track the .RData files, as long as the
seed is properly set and the whole simulation process is replicable. The
.tex, .bib, and .bst files, however, are what we may keep updating. Therefore
these should be the target.
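As a rough sketch, the folder layout described above could be created as follows. The project name my-project and the exact folder names (for instance real-data) are hypothetical; use whatever names suit your project.

```shell
# Hypothetical project name; adjust to taste.
project="my-project"

# Create the four subfolders described above.
mkdir -p "$project/manuscript" \
         "$project/simulation" \
         "$project/real-data" \
         "$project/references"

# List what was created.
ls "$project"
```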
2.2 Using GitHub (as an example)
Now suppose I have a local folder, together with the four subfolders mentioned before. The first thing to do is to log in to GitHub and create an empty repository.
Next, we need to associate the local folder with the remote repository. We do this by changing the directory to the local folder and typing:
    git init
    git remote add origin git@github.com:ys-xue/furry-disco.git  # establish association
    git add manuscript/my-draft.tex   # hypothetical file
    git add manuscript/my-draft.bib   # hypothetical file
    git commit -m 'initial commit'    # commit message, a brief summary of changes you made
    git push -u origin master         # push the changes to the master branch of the remote repo
For those files we never want to track, we can specify their extensions in the
.gitignore so that they will be automatically ignored by Git:
    touch .gitignore
    echo "**/*.pdf" >> .gitignore
    echo "**/*.RData" >> .gitignore
    echo "**/*.Rds" >> .gitignore
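To check that the patterns behave as intended, git check-ignore reports which of the given paths Git would ignore. The sketch below uses a throwaway repository with hypothetical file names; in a real project you would run only the last command, from the project folder.

```shell
# Throwaway repository in a temporary directory (hypothetical setup).
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q

# The same three patterns as above.
printf '%s\n' '**/*.pdf' '**/*.RData' '**/*.Rds' > .gitignore

# Hypothetical files: one we want to track, two we do not.
mkdir -p manuscript simulation
touch manuscript/my-draft.tex manuscript/my-draft.pdf simulation/results.RData

# Prints only those paths that match a pattern in .gitignore.
ignored="$(git check-ignore manuscript/my-draft.tex \
                            manuscript/my-draft.pdf \
                            simulation/results.RData)"
echo "$ignored"
```

The .tex file should not appear in the output, which means Git will keep tracking it.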
Remark: Lines are not wrapped on GitHub’s graphical interface; each one is
displayed as a single line. Therefore, it is recommended that each line in the
.tex files as well as the code files be kept under a certain maximum length,
say, 80 or 100 characters. This will make finding the difference between
versions much easier.
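One possible way to find lines that exceed the limit is a short awk command. This is only a sketch: the sample file and the variable names are hypothetical, and the limit of 80 is one of the values suggested above.

```shell
max_len=80

# Hypothetical sample file with one short line and one overlong line.
demo_file="$(mktemp)"
{
  echo 'A short line.'
  printf '%0120d\n' 0   # 120 characters, deliberately too long
} > "$demo_file"

# Report file name, line number, and length of every line over the limit.
long_lines="$(awk -v max="$max_len" \
  'length($0) > max { print FILENAME ":" FNR ": " length($0) " characters" }' \
  "$demo_file")"
echo "$long_lines"
```

On a real project, replace "$demo_file" with the files to check, for example manuscript/*.tex.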
2.3 Particularly Useful Functions
git blame -L start,end file.file helps you identify when the lines between
start and end were last modified, and who modified them.
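As an illustration, the following sketch runs git blame on lines 2 to 3 of a small file in a throwaway repository. The file name, its content, and the author identity are hypothetical.

```shell
# Throwaway repository in a temporary directory (hypothetical setup).
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q

# Hypothetical identity, set only for this demo repository.
git config user.name "Jane Doe"
git config user.email "jane@example.com"

printf '%s\n' 'line one' 'line two' 'line three' > my-draft.tex
git add my-draft.tex
git commit -q -m 'initial commit'

# Who last changed lines 2 to 3, and when?
blame_out="$(git blame -L 2,3 my-draft.tex)"
echo "$blame_out"
```

Each output line starts with an abbreviated commit hash, followed by the author, the date, the line number, and the line itself. Note that there is no space after the comma in the -L argument.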