Using Git to Version Control Your Manuscripts

In this post, I explain why I like using Git for version control, and offer a short beginner's guide to using Git for academic work.

1. Why Git?

  1. Git takes snapshots of your tracked files every time you commit and keeps them in the repository's history, which can be synced to a remote repository, so you do not have to manually add identifiers to each version of your files.
  2. Web-based hosting service providers, such as GitHub, GitLab, or Bitbucket, provide a nice interface where you can see differences between any two versions of a file. This is particularly helpful when multiple people collaborate on the same project.
  3. Git is very flexible: there are graphical interfaces where you can drag and drop, as well as command-line commands that let you commit, pull, or push.
  4. There are “oops” moments when you accidentally delete something you did not mean to delete. Without a backup on another hard drive, or the help of a paid cloud storage solution, the only thing you can do is to write the material again, which is a pain! With Git, a file deleted from the working directory can be restored from the last commit with a single checkout command, and a fresh copy of the whole project can be pulled or cloned from the remote repository.
  5. It is completely free. At the time of writing, GitHub offers free private repositories, with a limit of 1GB per project and 100MB per file. With the free version, private repositories can have up to 3 collaborators. Bitbucket has a less visually attractive interface. GitLab allows unlimited collaborators for private repositories, and each project can have up to 10GB of space.
  6. I particularly like GitHub’s issue tracking and project management. You can list all things that need to be done in the project management page as cards, and assign “to do”, “in progress”, and “done” labels to them.
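
As a rough illustration of items 2 and 4 above, the following sketch compares two committed versions of a file on the command line and then brings back a file deleted by accident. It runs in a throwaway directory; the file name, commit messages, and author identity are placeholders, not part of any real project.

```shell
set -eu

# Work in a throwaway directory so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Placeholder identity for the demo commits (hypothetical values).
git config user.name "Example Author"
git config user.email "author@example.com"

# Two versions of a hypothetical draft.
printf 'First draft of the introduction.\n' > my-draft.tex
git add my-draft.tex
git commit -q -m 'first version'

printf 'Revised introduction.\n' > my-draft.tex
git commit -q -am 'second version'

# Compare the two committed versions of the file.
git --no-pager diff HEAD~1 HEAD -- my-draft.tex

# "Oops": the file is deleted from the working directory...
rm my-draft.tex

# ...and restored from the last commit.
git checkout -- my-draft.tex
cat my-draft.tex
```

In Git 2.23 and later, git restore my-draft.tex does the same job as the checkout line. Note that only committed content can be recovered this way; edits that were never committed are gone with the file.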

2. Get Started

2.1 What to version control and what not to

Usually when starting a project, we have a few subfolders, including manuscript, simulation, real data application, and references. The manuscript folder will contain the .tex and .bib files, together with any .cls or .bst files required by the journal's specifications. The simulation folder will be home to the code we use to conduct simulation studies, as well as any saved .RData files. The real data folder will, similarly, contain code files. The references folder will contain .pdf files.

One quick principle is to only version control things that may change. For example, the .pdf files will not change. Papers are papers, and a local copy is enough. Nor do we need to keep track of the .RData files, as long as the seed is properly set and the whole simulation process is replicable. The .tex, .bib, .cls and .bst files, however, are what we may keep updating. Therefore, these should be the target.
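
A minimal sketch of this principle, assuming a hypothetical project layout: the folder and file names below are placeholders chosen for illustration. Only the manuscript sources are staged, while the .pdf and .RData files stay on disk without being tracked.

```shell
set -eu

project="$(mktemp -d)"
cd "$project"
git init -q

# Hypothetical layout with the four subfolders described above.
mkdir -p manuscript simulation real-data references

# Empty placeholder files standing in for real content.
touch manuscript/my-draft.tex manuscript/my-draft.bib
touch simulation/results.RData
touch references/some-paper.pdf

# Stage only the files that are expected to change.
# Other changing files could be staged the same way.
git add manuscript/my-draft.tex manuscript/my-draft.bib

# List what is staged; the .pdf and .RData files are left out.
git diff --cached --name-only
```

Staging files one by one gets tedious as a project grows, which is where ignore rules become useful.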

2.2 Using GitHub (as an example)

Now suppose I have a local folder, together with the four subfolders mentioned before. The first thing to do is to log in to GitHub and create an empty repository.

Next, we need to associate the local folder with the remote repository. We do this by changing to the local folder and typing:

git init
git remote add origin git@github.com:ys-xue/furry-disco.git # establish association
git add manuscript/my-draft.tex # hypothetical file
git add manuscript/my-draft.bib # hypothetical file
git commit -m 'initial commit' # commit message, a brief summary of changes you made
git push -u origin master # push the changes to the master branch of the remote repo (use main if that is what your branch is called)
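
After this first push, day-to-day work is a loop of edit, review, commit, and push. The sketch below walks through one round of that loop. So that it runs anywhere without a GitHub account, it makes two assumptions: a local bare repository stands in for the GitHub remote, and the identity, file name, and commit messages are placeholders.

```shell
set -eu

work="$(mktemp -d)"

# A local bare repository stands in for the GitHub remote in this demo.
git init -q --bare "$work/remote.git"

mkdir "$work/project"
cd "$work/project"
git init -q
git config user.name "Example Author"       # placeholder identity
git config user.email "author@example.com"  # placeholder identity
git remote add origin "$work/remote.git"

# Initial commit and push, mirroring the commands above.
mkdir manuscript
printf 'Draft, version 1.\n' > manuscript/my-draft.tex
git add manuscript/my-draft.tex
git commit -q -m 'initial commit'
git push -q -u origin HEAD   # HEAD = the current branch, whatever its name

# A later editing session: change, review, commit, push.
printf 'Draft, version 2.\n' > manuscript/my-draft.tex
git status --short           # which files changed
git --no-pager diff          # what changed, line by line
git add manuscript/my-draft.tex
git commit -q -m 'revise introduction'
git push -q

# The history, one commit per line.
git --no-pager log --oneline
```

Pushing HEAD rather than a named branch sidesteps the question of whether the default branch is called master or main on a given installation.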

For those files we never want to track, we can specify their extensions in the .gitignore so that they will be automatically ignored by Git:

touch .gitignore
echo "**/*.pdf" >> .gitignore
echo "**/*.RData" >> .gitignore
echo "**/*.Rds" >> .gitignore

The pattern **/*.pdf is a wildcard that matches .pdf files in the main directory and all its subfolders. The code above is only an illustration. In practice, we also want to ignore the intermediate files produced when compiling .tex files, such as .aux and .log files.
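
A fuller .gitignore might look like the sketch below. The list of LaTeX by-products is an assumption about a typical pdflatex/BibTeX setup and is not exhaustive, and the file names are placeholders. The git check-ignore command is a convenient way to confirm which pattern, if any, catches a given path.

```shell
set -eu

project="$(mktemp -d)"
cd "$project"
git init -q

# Patterns from the text, plus common LaTeX by-products (an assumed,
# non-exhaustive list; adjust to the tools actually in use).
cat > .gitignore <<'EOF'
**/*.pdf
**/*.RData
**/*.Rds
*.aux
*.bbl
*.blg
*.log
*.out
*.toc
*.synctex.gz
EOF

# Placeholder files to test the patterns against.
mkdir manuscript
touch manuscript/my-draft.tex manuscript/my-draft.aux manuscript/my-draft.pdf

# check-ignore -v reports which pattern matches each ignored path.
git check-ignore -v manuscript/my-draft.aux manuscript/my-draft.pdf

# The .tex file is still visible to Git as an untracked file.
git status --short --untracked-files=all
```

A pattern without a slash, such as *.aux, already matches at every directory level, so it behaves like the **/ form used for the .pdf files.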

Remark: GitHub’s graphical interface does not wrap lines; each line is displayed on a single line, however long it is. Because Git compares versions line by line, a one-word change in a paragraph-long line marks the whole paragraph as changed. It is therefore recommended to keep each line in the .tex files and the code files under a certain maximum length, say, 80 or 100 characters. This makes finding the differences between versions much easier.
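
One way to check for overly long lines before committing is a short awk command. This is a sketch only: the 80-character limit and the file name are placeholders, and the sample file is created here purely for the demonstration.

```shell
set -eu

dir="$(mktemp -d)"
cd "$dir"

# A hypothetical .tex file with one short and one overly long line.
{
  printf 'A short line.\n'
  printf 'This line is deliberately written to run well past the eighty-character limit for the demo.\n'
} > my-draft.tex

# Report file name, line number, and length for every line over the limit.
max=80
long_lines="$(awk -v max="$max" \
  'length($0) > max { printf "%s:%d: %d characters\n", FILENAME, FNR, length($0) }' \
  my-draft.tex)"
printf '%s\n' "$long_lines"
```

The same awk command accepts several files at once, for example manuscript/*.tex, so it can be run over a whole folder.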

2.3 Particularly Useful Functions
  • git blame -L start,end file shows, for each line between start and end of the given file, when it was last modified and who modified it. Note that there is no space after the comma.
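
To see what this looks like in practice, the sketch below builds a tiny repository with two commits by two hypothetical authors and then asks who last touched lines 2 through 3. All names, addresses, and file contents are made up for the demonstration.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q

# First commit by one hypothetical author.
printf 'Line one.\nLine two.\nLine three.\n' > my-draft.tex
git add my-draft.tex
git -c user.name="Author A" -c user.email="a@example.com" \
  commit -q -m 'add draft'

# Second commit by another hypothetical author, changing only line 2.
printf 'Line one.\nLine two, revised.\nLine three.\n' > my-draft.tex
git -c user.name="Author B" -c user.email="b@example.com" \
  commit -q -am 'revise line two'

# Who last touched lines 2 through 3, and when?
blame_output="$(git --no-pager blame -L 2,3 my-draft.tex)"
printf '%s\n' "$blame_output"
```

Each output line starts with an abbreviated commit hash, followed by the author, the timestamp, the line number, and the line itself. Here line 2 is attributed to Author B and line 3 to Author A.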
Yishu Xue