Data Science Tools: Introduction to git-extras

Additional plugins for git to make using repositories more manageable for Data Science.

Scope

An introduction to the open source git-extras package, its installation on an Ubuntu (GNU/Linux) environment, and its application in general and for Data Science.

Introduction

It’s common practice for most Data Scientists to interact with Version Control Systems (VCS) like Git. Beyond the basic commands like init, add, commit, status, diff and log, you might wonder what else you need. If you use Jupyter notebooks, it’s common to evolve an analysis across multiple notebooks, so git often serves as little more than an insurance policy against errors or file corruption.

The motivation here is that whilst Exploratory Data Analysis (EDA) is often a solo effort, there are occasions where collaboration with others is required, such as Machine Learning (ML) model deployment and, crucially, documentation.

With the context set, the git-extras package provides additional utilities to support most users of git.

Git-Extras

Installation

The installation instructions cover Linux, macOS and Windows; for Ubuntu, use the apt package manager to check that the package is available:
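The original listing is not reproduced here. As a minimal sketch, assuming a Debian or Ubuntu system, the availability check could look like the following. The guard around apt-cache is my own addition so the snippet also runs (and simply prints a note) on systems without apt:

```shell
# Package name as published in the Ubuntu repositories
pkg="git-extras"

if command -v apt-cache >/dev/null 2>&1; then
    # Show the installed and candidate versions known to apt
    apt-cache policy "$pkg"
else
    echo "apt-cache not found; this check applies to Debian/Ubuntu systems only"
fi
```

The "Candidate" line in the output indicates which version apt would install.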

Having confirmed the package is available in the repository, it can be installed:
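A sketch of the installation step, assuming a user with sudo rights. The helper name install_git_extras is hypothetical and only there so that nothing is installed until the function is called explicitly:

```shell
# Hypothetical helper: refresh the package index, then install git-extras
install_git_extras() {
    sudo apt update
    sudo apt install -y git-extras
    # Print the installed git-extras version as a quick sanity check
    git extras --version
}

# Uncomment to run the installation:
# install_git_extras
```

Running the two apt commands directly in a terminal works just as well; the wrapper only makes the snippet safe to paste.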

The author captured the installation process in a terminal cast (git-extras installation on Ubuntu).

Most GNU/Linux distributions are likely to have an older version of the git-extras package (in this case version 5.1), so some commands, such as git brv, will not work.

Commands

The full list of commands is documented on the GitHub repository; a few key examples are summarised below (in no particular order):

Example Commands

Git Summary

The git summary command provides a quick snapshot of the project, which in this case is relatively new. It’s an easy way to get a feel for the repo, analogous to using .info() in pandas.
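To try this without touching a real project, here is a small sketch that builds a throwaway repository and summarises it. The demo identity, file names and commit messages are made up for illustration, and the plain-git fallback is only a rough stand-in for when git-extras is not installed:

```shell
# Throwaway identity so commits work even without a global git config
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# Build a tiny repository with two commits
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
echo "# Demo project" > README.md
git add README.md
git commit -q -m "Add README"
echo "print('hello')" > analysis.py
git add analysis.py
git commit -q -m "Add analysis script"

if command -v git-summary >/dev/null 2>&1; then
    git summary
else
    # Rough plain-git stand-in: commit count and commits per author
    echo "commits: $(git rev-list --count HEAD)"
    git shortlog -sn HEAD
fi
```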

Git Create-branch

The command reduces the traditional checkout -b and push -u origin approach into a single command:
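The original listing is not reproduced here, so the following is a sketch under a few assumptions: a local bare repository stands in for the hosted remote, the branch name 01-traditional is invented to show the two-step approach, and the remote name origin is passed explicitly to -r (as I understand the git-extras documentation, it is optional and defaults to origin). The git-extras call is guarded so the script still runs where the package is missing:

```shell
# Throwaway identity so commits work even without a global git config
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# A local bare repository stands in for a hosted remote
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/repo"
cd "$work/repo" || exit 1
git commit -q --allow-empty -m "Initial commit"
git push -q -u origin HEAD
base=$(git rev-parse --abbrev-ref HEAD)   # default branch, e.g. master or main

git branch -a                             # only the default branch so far

# Traditional approach: two steps
git checkout -q -b 01-traditional
git push -q -u origin 01-traditional

# git-extras approach: one step
git checkout -q "$base"
if command -v git-create-branch >/dev/null 2>&1; then
    git create-branch -r origin 02-git-extras
fi

git branch -a                             # new branches, local and remotes/origin
```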

The example lists all branches, both local and remote (-a), and initially shows only a single branch called master. The traditional approach to creating and syncing a branch requires two steps; git create-branch with the -r flag does the same in a single step. The final listing of branches shows that the new branch (02-git-extras) is available both locally and remotely (remotes/origin).

It’s important to “reserve” the remote branch name as soon as possible to prevent a clash with fellow contributors, particularly for documentation branches.

Git Rename-branch

Whilst the time saving from create-branch may not look significant, renaming a branch both locally and remotely takes several manual steps, and the following command makes this much easier:
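As a sketch (again with a local bare repository standing in for the remote), the first form of the command is shown next to roughly what a manual rename involves in plain git. The manual steps are my own reading of the task, not the package's implementation, and the name 03-final-name in the comment is hypothetical:

```shell
# Throwaway identity so commits work even without a global git config
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# Repository with one published branch to rename
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/repo"
cd "$work/repo" || exit 1
git commit -q --allow-empty -m "Initial commit"
git push -q -u origin HEAD
git checkout -q -b 02-git-extras
git push -q -u origin 02-git-extras

if command -v git-rename-branch >/dev/null 2>&1; then
    # First form: rename any branch by giving the old and new names
    git rename-branch 02-git-extras 02-new-name
    # Second form renames the checked-out branch: git rename-branch 03-final-name
else
    # Manual route with plain git: three steps
    git branch -m 02-git-extras 02-new-name     # rename locally
    git push -q origin --delete 02-git-extras   # remove the old remote branch
    git push -q -u origin 02-new-name           # publish the new name
fi

git branch -a
```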

The first form of the command renames any arbitrary branch, given the old and new names. The list of branches (git branch -a) shows that the new name (02-new-name) is visible locally and remotely. This branch is then checked out for the second form of the command, which renames the current branch. Finally, the branches are listed again to show that the local and remote branches align.

Git Ignore and Ignore-io

The first thing to do for a Data Science git repo after init is, of course, to manage the .gitignore file. An unusual feature of .gitignore is that if you don’t want to sync the file itself, you add its own filename to the list. The git ignore command adds a number of useful features to manage .gitignore files seamlessly:
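A minimal sketch in a throwaway repository. The fallback branch simply edits .gitignore by hand and is only there so the snippet runs without git-extras:

```shell
# Empty throwaway repository
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

if command -v git-ignore >/dev/null 2>&1; then
    git ignore "*.pyc"    # add a pattern to the local .gitignore
    git ignore -l         # list the local patterns only
else
    # Plain-shell fallback: append the pattern by hand and print the file
    echo "*.pyc" >> .gitignore
    cat .gitignore
fi
```

The quotes around the pattern stop the shell from expanding the wildcard before git sees it.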

By default, the git ignore command lists the contents of both the local and global ignore files. Providing the -l flag shows only the local patterns. A pattern is added by simply providing it after the command, as is the case for compiled Python files: git ignore "*.pyc".

Rather than remembering common patterns for different Integrated Development Environments (IDEs), text editors (such as vim) and so on, community-provided patterns can be obtained from gitignore.io:

.gitignore.io | Screenshot by Author | Content and Artwork by Toptal

The service allows users to combine different ignore patterns, for example macOS with python. The output is in the form of a text file:

The same output can be obtained from the git ignore-io command and displayed on the screen:

To add the ignore patterns to the local .gitignore file, use git ignore-io -a macos. The -a flag appends the patterns to the local .gitignore file.
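A sketch of both uses in a throwaway repository, assuming git-extras is installed and the machine has network access (the patterns are fetched from the gitignore.io service). The types variable is only a convenience of mine for this example:

```shell
# Empty throwaway repository
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Template names as used on gitignore.io; left unquoted below on purpose
# so that each name is passed as a separate argument
types="macos python"

if command -v git-ignore-io >/dev/null 2>&1; then
    # Print the combined patterns to the screen
    git ignore-io $types
    # Append the same patterns to the local .gitignore
    git ignore-io -a $types
else
    echo "git-extras not installed; skipping: git ignore-io -a $types"
fi
```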

Git Show-tree, Undo and Setup

The last three are summarised briefly. The git show-tree command replicates popular one-liners from Stack Overflow for showing the git graph:
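A sketch with a small two-branch history (branch name and commit messages invented). The fallback is one of the well-known plain-git one-liners in the same spirit; I am not claiming it matches the exact format git show-tree uses:

```shell
# Throwaway identity so commits work even without a global git config
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# Small history that diverges into two branches
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git commit -q --allow-empty -m "Initial commit"
base=$(git rev-parse --abbrev-ref HEAD)   # default branch, e.g. master or main
git checkout -q -b feature
git commit -q --allow-empty -m "Work on feature"
git checkout -q "$base"
git commit -q --allow-empty -m "Work on default branch"

if command -v git-show-tree >/dev/null 2>&1; then
    git show-tree
else
    # Plain-git one-liner that draws the commit graph for all branches
    git log --graph --oneline --decorate --all
fi
```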

The git undo command removes the most recent commit, or a given number of recent commits:
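A sketch on a throwaway repository with three commits. The plain-git fallback (a soft reset) is my assumption of a roughly comparable operation; how git undo treats the working tree by default may vary between versions, so check git help undo before relying on it:

```shell
# Throwaway identity so commits work even without a global git config
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# Repository with three small commits
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
for n in 1 2 3; do
    echo "step $n" > "file$n.txt"
    git add "file$n.txt"
    git commit -q -m "Commit $n"
done

if command -v git-undo >/dev/null 2>&1; then
    git undo                      # remove the latest commit; "git undo 2" removes two
else
    git reset -q --soft HEAD~1    # plain-git stand-in: drop the latest commit
fi

git log --oneline                 # two commits remain
```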

Finally, the git setup command initialises git, adds all the files and makes an initial commit. By default it works in the current working directory, but it accepts an alternative directory as an argument.
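A sketch using the directory-argument form on a throwaway project folder (the README content is invented). The fallback spells out the same three steps in plain git:

```shell
# Throwaway identity so commits work even without a global git config
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# A project folder that is not yet under version control
project=$(mktemp -d)
echo "# New analysis" > "$project/README.md"

if command -v git-setup >/dev/null 2>&1; then
    git setup "$project"
else
    # Plain-git steps: initialise, stage everything, make the first commit
    cd "$project" || exit 1
    git init -q
    git add .
    git commit -q -m "Initial commit"
fi

git -C "$project" log --oneline
```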

Conclusion

The open source git-extras package provides extra commands for git users that can reduce the friction associated with using the tool. The basic installation on an Ubuntu environment was illustrated and some of the key commands were showcased.

Addendum

Attribution

All gists, notebooks and terminal casts are by the author. All of the artwork is based on assets explicitly licensed as CC0, Public Domain or SIL OFL and is therefore non-infringing. The theme is inspired by and based on my favourite vim theme: Gruvbox.

Changelog

2021-03-07: Added updated artwork, attribution info and addendum section.

Data Scientist and Chartered Aeronautical Engineer (MEng CEng EUR ING MRAeS) with over 15 years experience in the Aerospace, Defence and Rail Industry.