Git, GitHub and the command line
Code written on the Analytical Platform should be stored in a git repository on GitHub. This includes R, Python and notebooks. Be careful NOT to include data or secrets on GitHub. (Data goes in S3 buckets and secrets, such as passwords or API keys, should be in Parameter Store.)
Note: Before you can use GitHub with R Studio or Jupyter, you need to connect them by creating an ‘SSH key’. Full guidance is here.
GitHub enables you to collaborate with colleagues on code and share your work with them. It puts your code in a centralised, searchable place. It enables easier and more robust approaches to quality assurance, and it enables you to version control your work. More information about the benefits of GitHub can be found here.
If you are new to Git and Github, it is worth clarifying the difference between Git and Github. Git is the software that looks after the version control of code, whereas Github is the website on which you publish and share your version controlled code. In practice this means you use Git to track versions of your code, and then submit those changes to Github.
Follow the step-by-step guide of how to create a GitHub project repo, followed by how to sync with it in R Studio and Jupyter.
Note: If any of the animated gifs below do not display correctly, try a different web browser e.g. Microsoft Edge, which is installed on your DOM1 machine.
Set up GitHub keys to access it from R Studio and Jupyter
To configure Git and GitHub for the Analytical Platform, you must complete the following steps:
- Create an SSH key.
- Add the SSH key to GitHub.
- Configure your username and email in Git on the Analytical Platform.
Create an SSH key
You can create an SSH key in RStudio or JupyterLab. You will need an SSH key for each tool that you use.
Although it is possible to copy an SSH key from RStudio to JupyterLab, please do not do this, as it is insecure. Use a separate SSH key for each environment that you work in.
To create an SSH key in RStudio, follow the steps below:
- Open RStudio from the Analytical Platform control panel.
- In the menu bar, select Tools then Global Options…
- In the options window, select Git/SVN in the navigation menu.
- Select Create RSA key…
- Select Create.
- Select Close when the information window appears.
- Select View public key.
- Copy the SSH key to the clipboard by pressing Ctrl+C on Windows or ⌘C on Mac.
To create an SSH key in JupyterLab, follow the steps below:
- Open JupyterLab from the Analytical Platform control panel.
- Select the + icon in the file browser to open a new Launcher tab.
- Select Terminal from the ‘Other’ section.
Create an SSH key by running:
ssh-keygen -t rsa -b 4096 -C "email@example.com"
Here, you should substitute the email address you used to sign up to GitHub.
When prompted to enter a file in which to save the key, press Enter to accept the default location.
When prompted to enter a passphrase, press Enter to not set a passphrase.
View the SSH key by running:
Select the SSH key and copy it to the clipboard by pressing Ctrl+C on Windows or ⌘C on Mac.
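As a minimal sketch of the key-generation steps above, the commands below create a key pair and print the public half. The temporary directory and the `-f`, `-N ""` and `-q` options are assumptions added so the sketch runs without prompts and without touching an existing key. On the platform you would accept the default location as described above, in which case `ssh-keygen` normally writes the public key to `~/.ssh/id_rsa.pub`.

```shell
# Illustrative only: write the key pair to a temporary directory so that
# no existing key in ~/.ssh is overwritten.
key_dir="$(mktemp -d)"

# -f sets the output file and -N "" sets an empty passphrase, which mirrors
# pressing Enter at both prompts in the interactive steps. -q keeps it quiet.
ssh-keygen -t rsa -b 4096 -C "email@example.com" -f "$key_dir/id_rsa" -N "" -q

# The public key is the file ending in .pub; this is the text you add to GitHub.
cat "$key_dir/id_rsa.pub"
```

Only the contents of the `.pub` file should ever be shared; the file without the extension is the private key.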
Add the SSH key to GitHub
To add the SSH key to GitHub, you should follow the guidance here. If you are only migrating your environment, you will only need to add your SSH key and set up the github user name and email, after which you should be able to use your old repositories. To try to clone them, follow this guidance for R, and this guidance for Python.
Configure your username and email in Git on the Analytical Platform
To configure your username and email in Git on the Analytical Platform using RStudio or JupyterLab, follow the steps below:
- Open a new terminal:
- In RStudio, select Tools in the menu bar and then Shell…
- In JupyterLab, select the + icon in the file browser and then select Terminal from the Other section in the new Launcher tab.
Configure your username by running:
git config --global user.name 'Your Name'
Here, you should substitute your GitHub username.
Configure your email address by running:
git config --global user.email 'firstname.lastname@example.org'
Here, you should substitute the email address you used to sign up to GitHub.
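To check that both settings were stored, you can read them back. The sketch below repeats the two commands with the placeholder values from above and then prints what Git has recorded; the variable names `configured_name` and `configured_email` are only illustrative.

```shell
# Set the identity Git records in each commit (substitute your own details).
git config --global user.name 'Your Name'
git config --global user.email 'firstname.lastname@example.org'

# Read the values back to confirm they were stored.
configured_name="$(git config --global --get user.name)"
configured_email="$(git config --global --get user.email)"
echo "Commits will be recorded as: $configured_name <$configured_email>"
```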
Creating your project repo on GitHub
Step 1 - Create a new project repo in the moj-analytical-services Github page
A GitHub ‘repo’ (short for ‘repository’) is conceptually similar to setting up a project folder on the DOM1 shared drive to save your work, and share it with others. The files in this Github repo represent the definitive version of the project. Everyone who works on the project makes contributions to this definitive version from their personal versions.
Note that if you want to contribute to an existing project, you can skip this step.
In your web browser go to github.com and make sure you’re signed in.
Once signed in, go to the MoJ Analytical Services homepage.
Then follow the steps in this gif to create a new repository.
Leave your repository ‘private’ for now - the default setting. In the next step you will add access to colleagues and possibly make it ‘public’ (on the internet).
Make sure the owner is set to ‘moj-analytical-services’. This is the default setting, so long as you have clicked on ‘New’ from the MoJ Analytical Services homepage.
Step 2: Navigate to your new Repository on GitHub to decide who can see your code
Try to be as open as possible about who can view your code. Go to the Settings section of your repository (top right of the repository’s homepage) and then click on Collaborators & Teams on the left hand side panel. From there you can then decide on one of the four options below. They start with the most private all the way to completely public code:
- PRIVATE: Leave the default setting of your repository so it’s only visible to you as the creator.
- YOUR TEAM: Can the code be shared within your team? If so, add your team to the repository.
- ALL PLATFORM USERS: Can the code be shared with all Analytical Platform users? If so, add the ‘everyone’ team to the repository.
- PUBLIC: Can the code be public? If so, make it a public repository. To do this, click on the ‘Options’ section of the Settings, then scroll down to the ‘Danger Zone’ area that has a ‘Make public’ button.
We find that for most of our work, there’s no reason not to add the ‘+everyone’ team of all Analytical Platform users with read access to the code. This is possible as sensitive datasets are not stored in GitHub. By making code more open (either internally or publicly), users can start to get much more value out of the extremely powerful code search in GitHub.
Warning: Repos should contain no passwords/secrets and no data (apart from small reference tables) - this is particularly important for public repos, but applies to private ones too. And remember that GitHub shows the full history of files and changes in your repo, so removing these things requires special effort.
For more info, see choosing public, internal or private repos.
- you can add one or more teams to a repository, each with different permissions. For example, your team could have write privileges, but the ‘everyone’ team could be read only.
Below are point and click steps you can use to sync with your new GitHub repo in R Studio. You can also use the command line.
Step 1: Navigate to your platform R Studio and make a copy of the Github project in your R Studio
In this step, we create a copy of the definitive GitHub project in your personal R Studio workspace. This means you have a version of the project which you can work on and change.
Follow the steps in this gif: (Note: we now recommend making repositories Internal, which is not shown in this gif)
- When you copy the link to the repo from GitHub, ensure you use the SSH link, which starts git@github.com, as opposed to the HTTPS one, which starts https://.
- If this is your first time cloning a repo from GitHub you may be prompted to answer if you want to continue. Type yes and press Enter.
Step 2: Edit your files and track them using Git
Edit your files as usual using R Studio.
Once you’re happy with your changes, Git enables you to create a ‘commit’. Each git commit creates a snapshot of your personal files on the Platform. You can always undo changes to your work by reverting back to any of the snapshots. This ‘snapshotting’ ability is why git is a ‘version control’ system.
In the following gif, we demonstrate changing a single file, staging the changes, and committing them. In reality, each commit would typically include changes to a number of different files, rather than the single file shown in the gif.
- ‘committing’ does not sync your changes with github.com. It just creates a snapshot of your personal files in your platform disk.
- Git will only become aware of changes you’ve made after you’ve saved the file as shown in the gif. Unsaved changes are signified when the filename in the code editor tab is red with an asterisk.
Step 3: Sync (‘push’) your work with github.com
In R Studio, click the ‘Push’ button (the green up arrow). This will send any change you have committed to the definitive version of the project on Github. You can then navigate to the project on Github in your web browser and you should see the changes.
- After pushing, make sure you refresh the GitHub page in your web browser to see changes.
That’s it! If you’re working on a personal project, and are not collaborating with others, those three basic steps will allow you to apply version control to your work with GitHub.
Git functions aren’t built into JupyterLab. Use the command line instead - see below.
The command line is the text interface to your Analytical Platform tools. When googling, it may also be referred to as the shell, terminal, or console (and perhaps other names). In Jupyter, you can get the command line by selecting ‘Terminal’ from the launcher screen (the + button in the top left of JupyterLab). You can also use all these commands in RStudio by going to Tools -> Terminal -> New Terminal.
Once you are comfortable using the Terminal (in either R Studio or Jupyter) you can run all Git commands from the command line. If you are quite new to the command line, there are a few commands you may find useful to know, in addition to the git commands described later in this section:
mkdir: create a new directory/folder
cd: change directory
touch: create a file
ls: list files
For example, to create a new Python script, main.py, in a new folder, scripts, you would do:
> mkdir scripts
> cd scripts
> touch main.py
> ls
main.py
You can go back a directory using cd .. and back to your home directory with cd ~. Some other commands you may wish to use are:
rm <filename>: delete file(s)
cp <filename> <new_location>: copy a file from current location to a new one
mv <filename> <new_location>: move a file from current location to a new one
It is a good idea to avoid the use of whitespace in file, folder and repository names, but if you have included a space you can escape it using a backslash (e.g. cd directory\ with\ spaces). You can also hit the tab key to autocomplete if your file or directory already exists.
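The commands above can be tried out together in a throwaway directory. This is a minimal sketch: the `sandbox` variable and the file names `backup.py` and `old_main.py` are hypothetical, chosen only for illustration.

```shell
# Work in a temporary directory so nothing in your home directory is touched.
sandbox="$(mktemp -d)"
cd "$sandbox"

mkdir scripts              # create a new folder
cd scripts                 # move into it
touch main.py              # create an empty file
ls                         # lists: main.py

cp main.py backup.py       # copy main.py to a new file
mv backup.py old_main.py   # rename (move) the copy
rm old_main.py             # delete the copy again

mkdir "directory with spaces"
cd directory\ with\ spaces # spaces escaped with backslashes
cd ..                      # back up one level, into scripts
```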
Make a copy of a GitHub project (‘cloning’)
Use your browser to go to the repository you want to copy. Click on ‘Code’ and select the ‘SSH’ tab. You’ll see a link. Click on the button to its right (the overlapping rectangles) to copy that link.
In the command line, navigate to the directory where you want to keep your copy of the project.
Enter git clone followed by the link you’ve just copied from GitHub. So to clone this guidance enter:
git clone git@github.com:moj-analytical-services/user-guidance.git
Add files to your next commit (‘staging’)
Add changed files to your next commit with:
git add <filename1> <filename2>
This is known as ‘staging’ the files.
You can also type git add . to add all changed files to your next commit. Before you do this, use git status to check which files will be added.
‘Commit’ the files you’ve added: git commit. After calling this command, you need to provide a commit message. R Studio provides a popup. Jupyter will start an editor where you write the message, before saving and exiting it.
To commit and add a message in one command, use git commit -m "Your commit message". This is useful if you’re only including a short commit message.
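Staging and committing can be practised safely in a throwaway repository. The sketch below assumes nothing about your real projects: `repo_dir`, the file names and the "Example User" identity are hypothetical, and the identity is set for this repository only.

```shell
# Create a throwaway repository to practise in.
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init --quiet

# Identity for this repository only (hypothetical values), so the commit
# works even if the global configuration has not been set up yet.
git config user.name "Example User"
git config user.email "example.user@example.org"

# Create two files, then check what Git has noticed.
echo 'print("hello")' > main.py
echo "# Example project" > README.md
git status --short         # both files are listed as untracked (??)

# Stage both files, then take the snapshot with a short message.
git add main.py README.md
git commit --quiet -m "Add main script and README"

# Show the history and count the commits made so far.
git log --oneline
commit_count="$(git rev-list --count HEAD)"
echo "Commits so far: $commit_count"
```

Nothing here has been sent to GitHub; the commit exists only in the local repository.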
Sync work with GitHub (‘pushing’)
‘Push’ your commits to GitHub: git push origin <branch_name>.
The default branch name is main. If you’re pushing to this, your command would be git push origin main.
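Pushing can be rehearsed without touching GitHub by letting a local "bare" repository play the part of the remote. This is only a sketch under that assumption: `demo_dir`, `remote.git`, `work` and the "Example User" identity are hypothetical names. The branch name is looked up rather than hard-coded, because it depends on how Git is configured.

```shell
demo_dir="$(mktemp -d)"

# A bare repository stands in for the GitHub repo in this sketch.
git init --quiet --bare "$demo_dir/remote.git"

# Clone it, as you would clone from GitHub (cloning an empty repository
# may print a warning, which is expected here).
git clone --quiet "$demo_dir/remote.git" "$demo_dir/work"
cd "$demo_dir/work"
git config user.name "Example User"
git config user.email "example.user@example.org"

# Commit a change locally.
echo "first draft" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes"

# Find the name of the current branch and push it to the remote.
branch_name="$(git symbolic-ref --short HEAD)"
git push --quiet origin "$branch_name"

# The remote now holds the commit.
remote_commits="$(git --git-dir="$demo_dir/remote.git" rev-list --count "$branch_name")"
echo "Commits on remote $branch_name: $remote_commits"
```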
Working on a branch
One of the most useful aspects of git is ‘branching’. This involves a few extra steps, but it enables some really important benefits:
- Allows you to separate out work in progress from completed work. This means there is always a single ‘latest’ definitive working version of the code, that everyone agrees is the ‘master copy’.
- Enables you and collaborators to work on the same project and files concurrently, resolving conflicts if you edit the same parts of the same files.
- Enables you to coordinate work on several new features or bugs at once, keeping track of how the code has changed and why, and whether it’s been quality assured.
- Creates intuitive, tagged ‘undo points’ which allow you to revert back to a previous version of the project, e.g. we may wish to revert to the exact code that was tagged ‘model run 2015Q1’.
We therefore highly recommend using branches. (Up until now, we’ve been working on a single branch: the default branch, called ‘main’ in newer repositories and ‘master’ in older ones.)
Step 1 (optional): Create an Issue in github that describes the piece of work you’re about to do (the purpose of the branch)
Github ‘issues’ are a central place to maintain a ‘to do’ list for a project, and to discuss them with your team. ‘Issues’ can be bug fixes (such as ‘fix divide by zero errors in output tables’), or features (e.g. ‘add a percentage change column to output table’), or anything else you want.
By using issues, you can keep track of who is working on what. If you use issues, you automatically preserve a record of why changes were made to code. So you can see when a line of code was last changed, and which issue it related to, and who wrote it.
Step 2: Create a new branch in R Studio and tell Github about its existence
Create a branch with a name of your choosing. The branch is essentially a label for the segment of work you’re doing. If you’re working on an issue, it often makes sense to name the branch after the issue.
To create a branch, you need to enter the following two commands into the shell:
git checkout -b my_branch_name. Substitute my_branch_name for a name of your choosing. This command simultaneously creates the branch and switches to it, so you are immediately working on it.
git push -u origin my_branch_name. This tells github.com about the existence of the new branch.
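The two commands can be seen working end to end in the sketch below, which again uses a local bare repository as a stand-in for GitHub. The names `demo_dir`, `remote.git`, `work` and the "Example User" identity are hypothetical. An initial commit is made first because a branch needs at least one commit before it can be pushed.

```shell
demo_dir="$(mktemp -d)"
git init --quiet --bare "$demo_dir/remote.git"   # stand-in for GitHub
git clone --quiet "$demo_dir/remote.git" "$demo_dir/work"
cd "$demo_dir/work"
git config user.name "Example User"
git config user.email "example.user@example.org"

# An initial commit on the default branch, so there is something to branch from.
echo "baseline" > analysis.txt
git add analysis.txt
git commit --quiet -m "Initial commit"
git push --quiet origin HEAD

# Create the new branch and switch to it in one step.
git checkout --quiet -b my_branch_name

# Tell the remote about the branch; -u records it as the upstream, so
# later pushes and pulls on this branch need no extra arguments.
git push --quiet -u origin my_branch_name

current_branch="$(git symbolic-ref --short HEAD)"
upstream="$(git rev-parse --abbrev-ref 'my_branch_name@{upstream}')"
echo "On $current_branch, tracking $upstream"
```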
Step 3: Make some changes to address the Github issue, and push (sync) them with Github
Make changes to the code, commit them, and push them to Github.
Step 4: View changes on Github and create pull request
You can now view the changes in Github.
Github recognises that you’ve synced some code on a branch, and asks you whether you want to merge these changes onto the main ‘master’ branch.
You merge the changes using something called a ‘pull request’. A ‘pull request’ is a set of suggested changes to your project. You can merge these changes in yourself, or you can ask another collaborator to review the changes.
One way of using this process is for quality assurance. For instance, a team may agree that each pull request must be reviewed by a second team member before it is merged. The code on the main ‘master’ branch is then considered to be quality assured at all times. Pull requests also allow you and others working on the project to leave comments and feedback about the code. You can also leave comments that reference issues on the issue log (by writing # followed by the issue number). For example you might comment saying “This pull request now fixes issue #102 and completes task #103”.
Step 5: Sync the changes you made on github.com with your local platform
When you merged the pull request, you made changes to your files on Github. Your personal version of the project in your R Studio hasn’t changed, and is unaware of these changes.
The final step is therefore to switch back to the ‘master’ branch in R Studio, and ‘Pull’ the code. ‘Pulling’ makes R Studio check for changes on Github, and update your local files to incorporate any changes.
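From the command line, this last step is a checkout followed by a pull. The sketch below simulates the whole cycle locally. It rests on clearly labelled assumptions: a bare repository stands in for GitHub, and a second clone called `reviewer` merges the branch to imitate a pull request being merged on github.com. All directory names and identities are hypothetical.

```shell
demo_dir="$(mktemp -d)"
git init --quiet --bare "$demo_dir/remote.git"   # stand-in for GitHub

# Your copy of the project, with one commit on the default branch.
git clone --quiet "$demo_dir/remote.git" "$demo_dir/work"
cd "$demo_dir/work"
git config user.name "Example User"
git config user.email "example.user@example.org"
echo "baseline" > analysis.txt
git add analysis.txt
git commit --quiet -m "Initial commit"
default_branch="$(git symbolic-ref --short HEAD)"
git push --quiet origin "$default_branch"

# Work on a branch and push it.
git checkout --quiet -b my_branch_name
echo "new feature" >> analysis.txt
git commit --quiet -am "Add new feature"
git push --quiet -u origin my_branch_name

# Simulate the pull request being merged on GitHub: a second clone merges
# the branch into the default branch and pushes the result.
git clone --quiet --branch "$default_branch" "$demo_dir/remote.git" "$demo_dir/reviewer"
cd "$demo_dir/reviewer"
git config user.name "Example Reviewer"
git config user.email "example.reviewer@example.org"
git merge --quiet --no-ff -m "Merge my_branch_name" origin/my_branch_name
git push --quiet origin "$default_branch"

# Back in your copy: switch to the default branch and pull the merged changes.
cd "$demo_dir/work"
git checkout --quiet "$default_branch"
git pull --quiet --ff-only origin "$default_branch"
last_line="$(tail -n 1 analysis.txt)"
echo "Last line after pulling: $last_line"
```

Before the pull, the default branch in `work` still contains only the baseline line; afterwards it includes the change made on the branch.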
Git training resources
If you are new to git and you want to learn more, we recommend that you complete the basic tutorial available here.
The slides from the ASD git training are available here (dom1 access only)
- Using Github with R
- Introductory interactive tutorial.
- Quickstart guide and cheatsheet here and in pdf format here.
- More in depth materials:
The platform has configured simple “safety barriers” to reduce risk of accidentally exposing sensitive data on GitHub. For example it stops you committing a CSV file, because in most circumstances you should not put data into GitHub - it should be kept in an S3 bucket where it can be shared with authorized people (rather than the whole of DASD). This is the case even for internal or private repositories, because it doesn’t take much to make these public in the future. These rules can be overridden if that makes more sense.
|What|How it’s configured|Reasoning|How to override|
|---|---|---|---|
|Data files (.csv, .xls etc) & zip files|~/.gitignore|You should not put data into GitHub - it should be kept in an S3 bucket where it can be shared with authorized people.|When you add the file:|
|Zip files|~/.gitignore|It’s better to unpack these files and commit the raw source. You can’t keep track of diffs of individual files if you keep them bundled up. There might be a data file lurking in the zip, which isn’t checked if it is bundled like this. Note: git has its own built in compression methods.|When you add the file:|
|Large files (>5 Mb)|~/.git-templates/hooks/pre-commit|Likely to be data|When you commit:|
|Notebook output stripping|~/.git-templates/hooks/pre-commit|Jupyter Notebook output often contains data|When you commit:|
|Pushing to non-official GitHub organizations|~/.git-templates/hooks/pre-push|It would be outside MoJ control - not normally allowed.|When you push:|
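The platform's own override commands are not reproduced here. As a sketch of the general Git mechanism only, the example below shows how an ignore rule blocks a file and how the standard `-f` (`--force`) option of `git add` overrides it. It assumes a repo-level `.gitignore` as a stand-in for the platform's `~/.gitignore`, and `lookup.csv` is a hypothetical small reference table. Hooks such as pre-commit and pre-push are a separate mechanism; standard Git lets you skip them with `--no-verify` on `git commit` and `git push`.

```shell
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init --quiet

# Assumption: a repo-level .gitignore stands in for the platform's ~/.gitignore.
echo "*.csv" > .gitignore
echo "code,label" > lookup.csv     # a small reference table

# A plain add is refused because the file matches an ignore pattern.
if git add lookup.csv 2>/dev/null; then
    plain_add="accepted"
else
    plain_add="refused"
fi

# -f (--force) adds the file despite the ignore pattern.
git add -f lookup.csv
staged_files="$(git diff --cached --name-only)"
echo "Plain add: $plain_add; staged: $staged_files"
```

Overriding should be a deliberate choice for files you are sure contain no sensitive data.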
Private R packages on GitHub
Public, internal and private repositories
GitHub repositories can be public, internal or private.
In any case, repos should contain no passwords/secrets and no data (apart from small reference tables) - this is particularly important for public repos, but applies to internal and private ones too. And remember that GitHub shows the full history of files and changes in your repo, so removing these things requires special effort.
A public repo is visible to the world. Again, it is particularly important that these contain no passwords/secrets or data.
An internal repo is internal to the moj-analytical-services GitHub organisation and not visible to the outside world. The repo’s Owners and Admins can control which people / teams can see the repo on GitHub by going to Settings > Collaborators & Teams and adding teams to read/write/admin access groups. (‘Internal’ is equivalent to ‘private’ with the group everyone added with read permission.)
A private repo is visible only to the users/teams specifically added to the repo by the repo’s owners (or organization admins). Configure this here: Settings > Collaborators & Teams
Choosing public, internal or private repos
As an organization we aspire to use public repos by default. There are a host of benefits of coding in the open. With research and analysis it builds trust and transparency with the public, and reproducible methods allow others to test and build on your work.
However, it requires more discipline to avoid mistakes such as letting secrets or sensitive information slip into a repo, so it tends to require more experienced developers and care over any political sensitivities related to the topics your analysis covers: open-source coding is continuous and worldwide publishing.
As a result, sometimes internal and private repos are necessary, for example when the code reveals a sensitive policy change that is not yet announced.
In this case, an internal repo has huge benefits over a private repo because it gives the code visibility amongst internal users:
- it is good for learning and sharing methods
- it is good for collaboration
- it makes your code searchable.
Choosing internal still keeps code that shouldn’t be in the public domain (e.g. unpublished commentary on a .Rmd file) hidden from anyone outside the organisation.
Private repos are unlikely to be appropriate for MoJ work, but are available if necessary.
If in doubt, discuss with your manager and/or the AP team.
Private R packages for reproducible analysis
When a repository (e.g. an R package) is internal or private you need to authenticate to access it from R.
If you don’t then you will get:
Error: HTTP error 404. Not Found. (GitHub doesn’t even acknowledge the existence of the repo, to avoid speculative searching for private repositories.)
The remotes package enables you to install private R packages using the same SSH credentials you use to clone repositories from GitHub.
To install a package which is stored in the moj-analytical-services organisation on GitHub, in a repository called package_repo_name, you can use the command:
or, if using
In both cases, replace package_repo_name with the name of the repository you have developed your package in.
Secrets and passwords
Never put a secret or password in your code. Even when the repo is private. See MOJ policy: https://security-guidance.service.justice.gov.uk/
Other tips and tricks
Search the code in MoJ Analytical Services to see who else has used a package.
Hyperlink to a specific line of code in your project
See here for how to do this.
View who made changes to code, when and why using Git blame.
Make your project available to people on different teams
Assign a reviewer to a pull request, and leave comments.
View how files have changed on the platform and on
Error when switching branches: fatal: index file smaller than expected.
This occurs when the index file gets corrupted, and can be fixed with:
$ rm .git/index
$ git add .
$ git reset HEAD
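The repair can be rehearsed in a throwaway repository. In the sketch below the index is damaged on purpose by overwriting it with a single byte, which is an assumption made purely to have something to fix; `repo_dir` and the "Example User" identity are hypothetical. The three commands from above then rebuild the index from the working files and the last commit.

```shell
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init --quiet
git config user.name "Example User"
git config user.email "example.user@example.org"
echo "some analysis" > analysis.txt
git add analysis.txt
git commit --quiet -m "Initial commit"

# Deliberately damage the index so that it is too small to be valid.
printf 'x' > .git/index

# The fix: delete the damaged index, rebuild it from the working files,
# then reset it to match the last commit.
rm .git/index
git add .
git reset HEAD

# The repository is usable again and reports no pending changes.
status_after_fix="$(git status --porcelain)"
echo "Pending changes after fix: ${status_after_fix:-none}"
```

Your files and commit history are not affected, because the index only records what is staged for the next commit.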