Using Git to Version Control Your Manuscripts
In this post, I explain why I like to use Git for version control, and provide a short guide for beginners who want to use Git for academic purposes.
1. Why Git?
- Git takes a snapshot of your tracked files at every commit and keeps the full history in the repository (and in the remote repository once you push), so that you do not have to manually add identifiers to each version of your files.
- Web-based hosting service providers, such as GitHub, GitLab, or Bitbucket, provide a nice interface where you can see differences between any two versions of a file. This is particularly helpful when multiple people collaborate on the same project.
- Git is very flexible: there are graphical interfaces where you can drag and drop, as well as command-line commands that let you commit, pull, or push.
- There are “oops” moments when you accidentally delete something you did not mean to delete. Without a backup on another hard drive, or the help of a paid cloud storage solution, the only thing you can do is write the material again, which is a pain! With Git, a single command restores the last committed version, and if the whole local folder is lost, cloning the remote repository brings it back.
- It is completely free. GitHub now offers free private repositories, with a limit of 1GB per project, and 100MB per file. With the free version, private repositories can have up to 3 collaborators. Bitbucket has a less visually attractive interface. GitLab allows unlimited collaborators for private repositories, and each project can have up to 10GB space.
- I particularly like GitHub’s issue tracking and project management. You can list all things that need to be done in the project management page as cards, and assign “to do”, “in progress”, and “done” labels to them.
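The “oops” scenario from the list above can be tried out safely in a throwaway repository. The following is only a minimal sketch: the folder, the file name, and the author identity are all hypothetical, and it assumes the deleted file had already been committed. It uses git checkout to bring the file back; on Git 2.23 or later, git restore should do the same job.

```shell
#!/bin/sh
# Sketch: recover an accidentally deleted file that was already committed.
# The demo directory, file name, and author identity are hypothetical.
set -e

demo_dir=$(mktemp -d)            # throwaway folder, not a real project
cd "$demo_dir"
git init -q

mkdir manuscript
echo '% hypothetical draft' > manuscript/my-draft.tex
git add manuscript/my-draft.tex
# Pass the identity inline so the demo does not depend on global Git config.
git -c user.name='Demo Author' -c user.email='demo@example.com' \
    commit -q -m 'initial commit'

rm manuscript/my-draft.tex       # the "oops" moment

# Restore the working copy of the file from the last committed snapshot.
git checkout -- manuscript/my-draft.tex

restored=$(cat manuscript/my-draft.tex)
echo "$restored"
```

Anything that was never committed cannot be recovered this way, which is one more reason to commit often.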
2. Get Started
2.1 What to version control and what not to
Usually when starting a project, we have a few subfolders, including manuscript,
simulation, real data application, and references. The manuscript folder will
contain the .tex and .bib files, together with any .cls or .bst files per the
specifications of journals. The simulation folder will be home to the code we
use to conduct simulation studies, as well as any saved .RData files. The real
data folder will, similarly, also contain code files. The references folder
will contain .pdf files.
One quick principle is to only version control things that may change. For
example, the .pdf files will not change. Papers are papers. A local copy is
enough. Neither do we need to keep track of the .RData files, as long as the
seed is properly set and the whole simulation process is replicable. The .tex,
.bib, .cls and .bst files, however, are what we may keep updating. Therefore
these should be the target.
2.2 Using GitHub (as an example)
Now suppose I have a local folder, together with four subfolders as mentioned before. The first thing to do is to log in to GitHub and create an empty repository.
Next, we need to associate the local folder with the remote repository. We do this by changing the directory to the local folder and typing:
git init
git remote add origin git@github.com:ys-xue/furry-disco.git # establish association
git add manuscript/my-draft.tex # hypothetical file
git add manuscript/my-draft.bib # hypothetical file
git commit -m 'initial commit' # commit message, a brief summary of changes you made
git push -u origin master # push the changes to the master branch of the remote repo
For those files we never want to track, we can specify their extensions in the
.gitignore file so that they will be automatically ignored by Git:
touch .gitignore
echo "**/*.pdf" >> .gitignore
echo "**/*.RData" >> .gitignore
echo "**/*.Rds" >> .gitignore
The **/*.pdf pattern is a wildcard that matches .pdf files in the main
directory and all its subfolders. The code above is only an illustration. In
practice, we also want to ignore the intermediary files from compiling .tex
files.
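As a rough illustration of that last point, here is one way a fuller .gitignore might look. The LaTeX extensions listed are common by-products of compilation, but the exact set depends on your packages and build tool, so treat the list as an assumption to adapt. The folder and file names are hypothetical; git check-ignore is used only to confirm which paths the rules catch.

```shell
#!/bin/sh
# Sketch: a .gitignore covering common LaTeX by-products, checked with
# git check-ignore. Paths and the extension list are illustrative only.
set -e

demo_dir=$(mktemp -d)            # throwaway folder for the demo
cd "$demo_dir"
git init -q

# The three patterns from the post, plus typical LaTeX intermediary files.
cat > .gitignore <<'EOF'
**/*.pdf
**/*.RData
**/*.Rds
*.aux
*.log
*.out
*.toc
*.blg
*.synctex.gz
EOF

# check-ignore prints only the paths that match an ignore rule;
# the paths do not need to exist on disk.
ignored=$(git check-ignore \
    manuscript/my-draft.aux \
    manuscript/my-draft.log \
    manuscript/my-draft.tex \
    references/paper.pdf)
echo "$ignored"
```

The .tex file should be missing from the printed list, since it is exactly what we want to track. Some journals ask for the .bbl file at submission, so whether to ignore it is a judgment call.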
Remark: GitHub’s graphical interface does not wrap long lines when showing
differences; each line of the file is displayed as a single line. Therefore, it
is recommended that each line in the .tex files as well as the code files be
kept under a certain maximum length, say, 80 or 100 characters. This will make
finding the differences between versions much easier.
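A quick way to spot lines that exceed the limit is a one-line awk filter. The sketch below builds a hypothetical two-line file so that it runs on its own; in practice you would point the awk command at your own .tex or code files and pick whatever limit you prefer.

```shell
#!/bin/sh
# Sketch: list lines longer than a chosen limit. The file is a hypothetical
# stand-in: one short line followed by one line of 101 characters.
set -e

demo_dir=$(mktemp -d)
cd "$demo_dir"

printf '%s\n' 'A short line.' > my-draft.tex
awk 'BEGIN { for (i = 0; i < 101; i++) printf "x"; print "" }' >> my-draft.tex

max_len=80                       # chosen limit; 100 is another common choice

# Print file name, line number, and length for every line over the limit.
long_lines=$(awk -v max="$max_len" \
    'length($0) > max { print FILENAME ":" FNR ": " length($0) " characters" }' \
    my-draft.tex)
echo "$long_lines"
```

Many editors can also hard-wrap text at a fixed column, which avoids the problem in the first place.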
2.3 Particularly Useful Functions
git blame -L start,end file.file
helps you identify when the lines between start and end were written, and who
wrote them. Note that there is no space after the comma.
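To see what the output looks like without touching a real project, the command can be tried in a throwaway repository. This is just a sketch: the file, its contents, and the author identity are made up for the demo.

```shell
#!/bin/sh
# Sketch: run git blame on a line range in a throwaway repository.
# File name, contents, and author identity are hypothetical.
set -e

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q

printf '%s\n' 'line one' 'line two' 'line three' 'line four' > my-draft.tex
git add my-draft.tex
# Pass the identity inline so the demo does not depend on global Git config.
git -c user.name='Demo Author' -c user.email='demo@example.com' \
    commit -q -m 'initial commit'

# Annotate only lines 2 through 3: commit hash, author, date, line content.
blame_out=$(git blame -L 2,3 my-draft.tex)
echo "$blame_out"
```

Each printed line starts with an abbreviated commit hash, followed by the author and the date of the commit that last touched that line.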