Data Science Tools: Introduction to git-extras
Additional plugins for git to make using repositories more manageable for Data Science.
An introduction to the open source git-extras package, its installation on an Ubuntu (GNU/Linux) environment, and its application in general and for Data Science.
It’s common practice for most Data Scientists to interact with Version Control Systems (VCS) like Git. Beyond the basic commands like log, you might wonder what else you need. If you use Jupyter notebooks, it’s common to evolve an analysis across multiple notebooks, so there isn’t much involvement with git other than as an insurance policy to undo errors or recover from file corruption.
The motivation here is that whilst Exploratory Data Analysis (EDA) is often a solo effort, there are occasions where collaboration with others is required, such as Machine Learning (ML) model deployment and, crucially, documentation.
With the context set, the git-extras package provides additional utilities to support most users of git.
The installation instructions cover Linux, macOS and Windows; for Ubuntu, use the apt package manager:
Once the package is confirmed to be available in the repository, it can be installed:
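As a rough sketch of those two steps on Ubuntu, the helper below reports whether git-extras is already present and otherwise prints the apt commands to run. The function name check_git_extras is made up for illustration; apt-cache policy and apt install are standard apt commands, and the install itself needs sudo, so it is printed here rather than executed.

```shell
#!/usr/bin/env bash
# check_git_extras is a hypothetical helper name, not part of git-extras.
check_git_extras() {
    if git extras --version >/dev/null 2>&1; then
        # git-extras registers itself as the "git extras" subcommand.
        echo "git-extras $(git extras --version) is installed"
    else
        echo "git-extras not found"
        echo "  check availability: apt-cache policy git-extras"
        echo "  install:            sudo apt install git-extras"
    fi
}

check_git_extras
```

Running the check again after installing shows which version the distribution actually shipped.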
The following terminal cast captures the installation process:
Most GNU/Linux distributions are likely to ship an older version of the git-extras package (in this case version 5.1), so not all commands will work, such as
The full list of commands is documented on the GitHub repository, but a few key examples are summarised below (in no particular order):
As you can see, it provides a quick snapshot of the project, which in this case is relatively new. It’s an easy way to get a feel for the repo, analogous to using
The command condenses the traditional checkout -b and push -u origin approach into a single step:
The code above lists both local and remote branches (-a) and shows only a single branch called master. The traditional approach to creating and syncing a branch requires two steps; git create-branch with the -r flag does the same in a single step. The final listing of branches shows that the new branch (02-git-extras) is available both locally and remotely (
It’s important to “reserve” the remote branch name as soon as possible to prevent a clash with fellow contributors, particularly for documentation branches.
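For comparison, the traditional two-step approach can be sketched in a throwaway sandbox. This is plain git rather than git-extras: a local bare repository stands in for the shared remote, and the identity and paths are placeholders.

```shell
#!/usr/bin/env bash
# Throwaway sandbox: a bare repository plays the role of the shared remote.
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/remote.git"
git init -q "$sandbox/work"
cd "$sandbox/work" || exit 1
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
git remote add origin "$sandbox/remote.git"
git commit -q --allow-empty -m "Initial commit"

# Step 1: create the branch locally and switch to it.
git checkout -q -b 02-git-extras
# Step 2: publish it and set the upstream, which reserves the name remotely.
git push -q -u origin 02-git-extras

# List local and remote branches (-a) to confirm both exist.
git branch -a
```

The second step is the one that is easy to forget, which is why a single command that does both is attractive.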
Whilst the time saving from create-branch may not look significant, renaming a branch both locally and remotely takes several manual steps, and the following command makes this much easier:
The first form of the command renames any arbitrary branch, given the old and new names. The list of branches (git branch -a) shows that the new name (02-new-name) is visible locally and remotely. This branch is then checked out for the second form of the command, which renames the current branch. Finally, the branches are listed again to show that the local and remote branches align.
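Renaming by hand, without git-extras, might look like the following sketch. It is a plain-git approximation and not necessarily how git rename-branch is implemented; the sandbox, bare remote and identity are placeholders.

```shell
#!/usr/bin/env bash
# Throwaway sandbox with a bare repository acting as the remote.
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/remote.git"
git init -q "$sandbox/work"
cd "$sandbox/work" || exit 1
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
git remote add origin "$sandbox/remote.git"
git commit -q --allow-empty -m "Initial commit"
git checkout -q -b 02-git-extras
git push -q -u origin 02-git-extras

# 1. Rename the local branch.
git branch -m 02-git-extras 02-new-name
# 2. Remove the old name from the remote.
git push -q origin --delete 02-git-extras
# 3. Publish the new name and point the upstream at it.
git push -q -u origin 02-new-name

git branch -a
```

Three commands, each of which can fail independently, is the friction a single rename command removes.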
Git Ignore and Ignore-io
The first thing to do for a Data Science git repo after init is, of course, to manage the .gitignore file. An unusual feature of .gitignore is that, if you don’t want to sync the file itself, you add its own filename to the list. The git ignore command adds a number of useful features for managing .gitignore files seamlessly:
By default, the git ignore command lists the contents of both the local and global ignore files. Providing the -l flag shows only the local patterns. A pattern is added by simply providing it after the command, as is the case for Python compiled files: git ignore "*.pyc".
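What adding a local pattern amounts to can be shown without git-extras. The sketch below appends the pattern to .gitignore by hand and then asks git check-ignore, a standard git command, which rule matches a given file. The sandbox path and file names are made up for illustration.

```shell
#!/usr/bin/env bash
sandbox=$(mktemp -d)
git init -q "$sandbox/project"
cd "$sandbox/project" || exit 1

# Plain-shell stand-in for adding a local pattern: append it to .gitignore.
echo "*.pyc" >> .gitignore

# Show the local patterns.
cat .gitignore

# Ask git which rule (file, line number, pattern) matches a compiled file.
git check-ignore -v module.pyc
```

git check-ignore works on paths that do not exist yet, so a pattern can be verified before any matching file is created.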
Rather than remembering common patterns for different Integrated Development Environments (IDEs), text editors (such as vim) etc., community-provided patterns can be obtained from gitignore.io:
The service allows users to combine different ignore patterns, for example macOS with Python. The output is in the form of a text file:
The same output can be obtained from the git ignore-io command and displayed on the screen:
To add the ignore patterns to the local .gitignore file, use git ignore-io -a macos. The -a flag appends the patterns to the local .gitignore file.
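The text file can also be fetched directly with curl. The sketch below only builds the request URL: the helper name ignore_io_url is hypothetical, and the API address is an assumption about where the gitignore.io service is hosted, so check it before relying on it.

```shell
#!/usr/bin/env bash
# ignore_io_url is a hypothetical helper; the base address is an assumption.
ignore_io_url() {
    local IFS=,
    # "$*" joins the arguments with the first character of IFS (a comma).
    echo "https://www.toptal.com/developers/gitignore/api/$*"
}

# Build the URL for the combined macOS and Python templates.
ignore_io_url macos python

# To append the patterns to the local file (needs network access):
# curl -sSfL "$(ignore_io_url macos python)" >> .gitignore
```

Keeping the download as a separate, commented-out step makes it easy to inspect the patterns before they land in .gitignore.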
Git Show-tree, undo & setup
The last three are summarised briefly. The
git show-tree command replicates popular one-liners from Stack Overflow for showing the
The git undo command allows for stepping back by one commit or by a number of commits:
The git setup command initialises git (by default in the current working directory), adds all the files and makes an initial commit. The command accepts an alternative directory as an argument.
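In plain git, the setup and undo steps described above correspond roughly to the sketch below. It is an approximation under stated assumptions: the soft reset is one possible way to step back a commit and may not match the mode git undo uses by default, and the directory, file names and identity are placeholders.

```shell
#!/usr/bin/env bash
sandbox=$(mktemp -d)
mkdir "$sandbox/analysis"
cd "$sandbox/analysis" || exit 1
echo "# Analysis" > README.md

# Roughly what the setup step is described as doing:
# initialise git, add all the files and make an initial commit.
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
git add .
git commit -q -m "Initial commit"

# Make a second commit so there is something to step back from.
echo "draft" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# Step back one commit, keeping the changes staged (an assumed reset mode).
git reset -q --soft HEAD~1

git log --oneline
git status --short
```

After the reset the log is back to the initial commit, and notes.txt is still staged, ready to be amended or committed again.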
The open source git-extras package provides additional commands for git users that can reduce the friction associated with using the tool. The basic installation on an Ubuntu environment was illustrated, and some of the key commands were showcased.
Gists, notebooks and terminal casts are by the author. All of the artwork is based on assets explicitly licensed as CC0, Public Domain or SIL OFL and is therefore non-infringing. The theme is inspired by and based on my favourite vim theme: Gruvbox.
2021-03-07: Added updated artwork, attribution info and addendum section.