In this tutorial we introduce git-annex and its use within a research
laboratory to help share data and code among lab members, external
collaborators and anonymous users. git-annex is a software tool that
extends the better-known git in convenient ways when dealing with
large files and large repositories. In the following, we introduce
some basic concepts and then describe the scenario and the workflow
that we implemented in our lab, which we believe can be useful to
other people in a similar setting. Both git and git-annex take some
dedication before they can be used fruitfully. During that process it
is common to make mistakes, as it was for us. For this reason, this
tutorial also describes common errors and how to recover from them.
Note that basic familiarity with git is assumed as a prerequisite for
this tutorial.
The tutorial is structured as follows. First we describe the scenario
in which git-annex is used. Then we provide some preliminary
information about what git-annex is, along with some additional
technical details. After that, we describe how to set up a centralized
repository that will host a copy of all data. This is the main part of
the tutorial, in which we describe how to make the repository easily
accessible from a web server and from GitHub. The last part of the
tutorial describes the use of the repository from the point of view of
standard users, who just need to access the data and get updates, as
well as of content creators, i.e. those with the rights to add new
content to the repository remotely.
In our lab, we have large datasets (terabytes of data), which comprise many large files as well as the code that generated part of the data. Such datasets are kept on a storage server. Small portions of the data are frequently needed on the local desktop and laptop computers of lab members and collaborators, for processing and analysis. Moreover, new data are frequently generated by lab members, on local computers, by further processing of available data. One main aim is to share such new data and the code with others. Additionally, the data already shared are not static: from time to time, code is updated or bugs are fixed, so some of the preprocessed data is regenerated and shared, replacing the previous version. In such a setting, it is important for lab members and collaborators to get updates of data and code in a simple way.
Simply put, git-annex is an extension of git that provides some extra functionalities:
- Large files in the repository are not copied locally when cloning or fetching/pulling. Of course, they can be retrieved on request. Additionally, local copies of large files can be removed to free some space.
- git-annex keeps track of how many copies of each file exist and where they are.
- TODO
Here, we do not describe git beyond noting that it is, to our
knowledge, the most widely used version control system. There is an
enormous amount of information already available about git. git helps
keep track of updates to files and supports collaborative work among
multiple users. Unfortunately, git does not provide native support for
handling large files in a convenient way. That is what git-annex adds
to it.
In this tutorial we refer to git-annex version 6.20171211, on
GNU/Linux machines using Ubuntu 16.04. As of this version of
git-annex, the default format of the repository is v5. In the future,
we plan to upgrade to v6, following the default settings of
git-annex. At that time, we plan to update the parts of this tutorial
that are affected by this change.
git-annex can be invoked from the command line in two alternative
ways, either as git-annex or as git annex. To our knowledge, there is
no difference between them.
Up to v5 of the repository format, git-annex uses certain filesystem
features that may not be available on all filesystems, like symbolic
links and FIFOs. For example, the FAT filesystem does not provide
them. When initializing the repository with git annex init (see below
for further details), a clear warning appears on the screen if you are
using such a crippled filesystem. Nevertheless, git-annex has ways to
(partially) address such problems. In this tutorial, we do not discuss
such issues and we assume that a non-crippled filesystem is available,
like the EXT4 filesystem, the default on many GNU/Linux systems.
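Before initializing a repository, a directory can be probed for the two features mentioned above. The following is only a rough sketch and not the check that git-annex itself performs; the function name probe_fs is hypothetical and the target directory is assumed to be writable.

```shell
# probe_fs DIR
#   Rough check for symbolic link and FIFO support in DIR.
#   (Hypothetical helper; git-annex runs its own test at init time.)
probe_fs() {
    dir="${1:-.}"
    tmp="$(mktemp -d "$dir/fsprobe.XXXXXX")" || return 1
    symlinks=no
    fifos=no
    : > "$tmp/target"                       # empty file to link to
    if ln -s target "$tmp/link" 2>/dev/null && [ -L "$tmp/link" ]; then
        symlinks=yes
    fi
    if mkfifo "$tmp/fifo" 2>/dev/null && [ -p "$tmp/fifo" ]; then
        fifos=yes
    fi
    rm -rf "$tmp"                           # leave no residue behind
    echo "symbolic links: $symlinks"
    echo "FIFOs: $fifos"
}

# Example: probe the system temporary directory.
probe_fs "${TMPDIR:-/tmp}"
```

If either line reports "no", the warning about a crippled filesystem is to be expected when running git annex init in that location.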
TODO
In the following example, we create a directory /labdata on a
storage-server, where we store a copy of all the data with git and
git-annex, so that they can be shared with lab members and external
collaborators. The repository hosts both the git database, in
/labdata/.git/, and a copy of the actual files and directories, the
working tree, in /labdata/, for easy browsing.
Permission to add or modify the data in the repository is enforced
through filesystem permissions by creating a group of users, named
dataowners. Everyone else can (only) read the data in the repository.
Here we describe the step-by-step procedure to create the repository from scratch, with example commands followed by their detailed explanation:
cd /
mkdir labdata
addgroup dataowners
adduser contributor dataowners
chgrp dataowners labdata
chmod g+rwx labdata
chmod o+rx-w labdata
chmod g+s labdata
cd labdata
This first group of commands creates the directory /labdata to host
the repository, creates a new system group dataowners and adds the
user contributor to that group; others may be added in the same
way. Then the group of /labdata is set to dataowners. Additionally,
read (r), write (w) and access (x) permissions are granted to the
group (g+rwx), and read and access (but not write) permissions are
granted to everyone else (o+rx-w). Finally, the setgid permission is
enabled for the group (g+s), so that all future files and directories
created inside /labdata automatically inherit the group dataowners,
and new directories also inherit the setgid bit.
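The effect of the setgid bit can be observed without root privileges on a throwaway directory. This is a small demonstration sketch, assuming GNU coreutils (the stat -c syntax) as found on Ubuntu; the temporary directory stands in for /labdata.

```shell
# Demonstrate setgid inheritance on a scratch directory.
demo="$(mktemp -d)"
chmod g+rwx,g+s "$demo"          # same group bits as set on /labdata
mkdir "$demo/subdir"             # created after the setgid bit was set

parent_mode="$(stat -c '%A' "$demo")"
child_mode="$(stat -c '%A' "$demo/subdir")"
parent_group="$(stat -c '%G' "$demo")"
child_group="$(stat -c '%G' "$demo/subdir")"

echo "parent: $parent_mode $parent_group"
echo "child:  $child_mode $child_group"
rm -rf "$demo"
```

In both mode strings, an s (or S) in the group execute position indicates that the setgid bit is set, and the two group names should match.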
git init --shared=group
git annex init storage-server
This second group of commands creates the git repository and the
additional git-annex part of it. Note that the git-annex part of the
repository can only be initialized within an existing git
repository. In order to make the repository group-writable and
accessible to everyone, the initialization of the git repository
requires --shared=group. This properly sets permissions within
/labdata/.git/. The initialization of git-annex creates a
/labdata/.git/annex/ directory, called the annex, where git-annex
stores all its information. Finally, we added the optional
storage-server description when initializing the git-annex part of the
repository. This is a convenient way to give the repository a
human-readable label.
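The outcome of the two initialization commands can be inspected with a small helper. This is a sketch only: the function name check_shared_repo is hypothetical, and it merely reads the core.sharedRepository setting written by git init --shared=group and looks for the .git/annex/ directory.

```shell
# check_shared_repo REPO
#   Report the sharing setting of REPO and whether the annex exists.
#   (Hypothetical helper, not part of git or git-annex.)
check_shared_repo() {
    repo="$1"
    shared="$(git -C "$repo" config --get core.sharedRepository || echo unset)"
    echo "core.sharedRepository: $shared"
    if [ -d "$repo/.git/annex" ]; then
        echo "annex directory: present"
    else
        echo "annex directory: missing (run git annex init)"
    fi
}

# Usage on the storage server:
# check_shared_repo /labdata
```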
At this point, content/changes can be added to the repository in two main ways:
- Directly on the storage-server, by copying files and directories into /labdata and then:
  - either via git annex add <file> and git commit -m <message>. In this case, the file is added to the annex, i.e. moved to /labdata/.git/annex/objects/, set read-only and renamed according to its checksum, and a symbolic link pointing to it is created in the original location of the file. Only the symbolic link is added to the git repository, while git-annex keeps track of the content. From the user's perspective, the initial file is still accessible, through the link, in read-only mode. Note that, when cloning this repository, only the symbolic link of this file will be present and not its content, unless explicitly requested.
  - or via git add <file> and git commit -m <message>. In this case the file is added to the git repository and not to the annex. Note that, when cloning this repository, a copy of this file will be present, as always with git.
- From remote repositories, through git push or git annex sync. In this second case, the repository must be configured properly, as explained below.
Whether to use git annex add <file> or git add <file> can be decided
for each file individually, and depends on the purpose of the file and
of the repository. Typically, code should be added via git add <file>
and data via git annex add <file>. Nevertheless, it is possible to use
git annex add <file> for everything. If, at a later stage, a file
needs to be moved from the git repository to the annex, or vice versa,
it can be migrated, as discussed later in this tutorial.
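As an illustration of a per-file decision, the sketch below suggests a command based on file size. The size criterion, the 1 MiB default and the function name suggest_add are assumptions made for this example only; the tutorial's own rule of thumb is code versus data, so adapt the test to your needs.

```shell
# suggest_add FILE [THRESHOLD_BYTES]
#   Print the command that would be used for FILE under a simple,
#   hypothetical size rule: larger than the threshold goes to the annex.
suggest_add() {
    file="$1"
    threshold="${2:-1048576}"                 # assumed default: 1 MiB
    size="$(wc -c < "$file" | tr -d ' ')"     # size in bytes
    if [ "$size" -gt "$threshold" ]; then
        echo "git annex add $file"
    else
        echo "git add $file"
    fi
}

# Usage:
# suggest_add analysis.py
# suggest_add recording.dat 500000
```

The function only prints a suggestion and does not modify the repository.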
Here follows an example transcript of what happens when executing git annex add <file> on a file foo present in the repository:
> ls -al
total 16
drwxrwsr-x 3 ele dataowners 4096 dic 26 16:19 .
drwxr-xr-x 26 root root 4096 dic 26 16:13 ..
-rw-rw-r-- 1 ele dataowners 4 dic 26 16:19 foo
drwxrwsr-x 9 ele dataowners 4096 dic 26 16:18 .git
> git annex add foo
> ls -al
total 16
drwxrwsr-x 3 ele dataowners 4096 dic 26 16:21 .
drwxr-xr-x 26 root root 4096 dic 26 16:13 ..
lrwxrwxrwx 1 ele dataowners 178 dic 26 16:19 foo -> .git/annex/objects/g7/9v/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
drwxrwsr-x 9 ele dataowners 4096 dic 26 16:21 .git
> git commit -m "added foo"
[master (root-commit) 3e461c6] added foo
1 file changed, 1 insertion(+)
create mode 120000 foo
> ls -al .git/annex/objects/g7/9v/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
-rw-rw-rw- 1 ele dataowners 4 dic 26 16:19 .git/annex/objects/g7/9v/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730/SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
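The name that the file receives inside the annex, called its key, encodes the checksum backend, the size of the content and the checksum itself. The following sketch splits the key shown in the transcript above using plain shell parameter expansion; the variable names are chosen for this example only.

```shell
# Key copied from the transcript above. In a real repository it could
# be obtained with: key="$(basename "$(readlink foo)")"
key="SHA256E-s4--7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730"

backend="${key%%-*}"          # text before the first "-"
fields="${key#*-}"            # drop the backend and its "-"
fields="${fields%%--*}"       # keep what precedes the "--" separator
size="${fields#s}"            # strip the leading "s" of the size field
digest="${key##*--}"          # text after the "--" separator

echo "backend: $backend"
echo "size:    $size bytes"
echo "digest:  $digest"
```

The size of 4 bytes agrees with the size reported by ls for foo before it was annexed. For files with an extension, the SHA256E backend appends the extension after the digest, a case this sketch does not separate out.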
TODO: explain the branches below
> git branch -a
git-annex
* master
synced/master
remotes/origin/HEAD -> origin/master
remotes/origin/git-annex
remotes/origin/master
remotes/origin/synced/git-annex
remotes/origin/synced/master
Content and changes can be created on remote clones of the repository, i.e. on the local computers of lab members and collaborators. Such content and changes need to be pushed to the storage-server in order to be shared. For this reason, the storage-server needs to be properly configured to allow that, in two steps. The first is:
git config receive.denyCurrentBranch updateInstead
With this command, we allow remote users to push to the
repository. Normally, this is not permitted, because the repository is
non-bare, i.e. it has a working tree of files and directories besides
the .git/ database. If you do not plan to push changes from remote
clones, then you do not need this configuration. Note that, with this
setting, git updates the working tree only when a push lands on the
currently checked-out branch and the working tree is clean; pushes to
other branches, such as the synced/master branch that git annex sync
pushes to, do not update the working tree by themselves. See below how
to enable the automatic update of the working tree. The second step
is:
cd /labdata
git annex wanted . standard
git annex group . backup
The storage-server is meant to keep copies of all files in the
repository. When content is created remotely, it is very important to
tell the storage-server to enforce this requirement when operations
like git annex sync are performed (see TODO). To enforce this and
similar behaviors, git annex provides rich expressions that can be
set, but it also offers standard groups of preferences. The commands
above tell the repository to use a standard group of preferences
called backup, which means "All content is wanted. Even content of
old/deleted files".
If you want to check whether standard groups are enabled in the
repository, you just need to use the commands above, without
specifying standard and backup. The following transcript shows an
example:
> git annex wanted .
standard
> git annex group .
backup
Note that you can set multiple standard groups, whose effect is left as an exercise to the reader. Continuing the previous example:
> git annex group . client
group . ok
(recording state in git...)
> git annex group .
client backup
If you added a standard group by mistake and want to remove it, you
need to use git annex ungroup, as here:
> git annex ungroup . client
> git annex group .
backup
Information on how to access the repository when the storage server directory with the data is exposed via web server.
Basically, git update-server-info should be executed whenever the
repository is updated remotely, e.g. via push, or after a local
commit. To do that automatically, git hooks must be enabled:
cd /labdata
mv .git/hooks/post-update.sample .git/hooks/post-update
cp -a .git/hooks/post-update .git/hooks/post-commit
Warning: hooks need to be executable.
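One way to make sure the two hooks are executable is sketched below. The function name make_hooks_executable is hypothetical; it only applies chmod +x to the post-update and post-commit files in the given hooks directory and reports what it found.

```shell
# make_hooks_executable HOOKS_DIR
#   Set the executable bit on the two hooks used in this tutorial.
#   (Hypothetical helper; HOOKS_DIR is e.g. /labdata/.git/hooks)
make_hooks_executable() {
    hooks_dir="$1"
    for hook in post-update post-commit; do
        if [ -f "$hooks_dir/$hook" ]; then
            chmod +x "$hooks_dir/$hook"
            if [ -x "$hooks_dir/$hook" ]; then
                echo "$hook: executable"
            fi
        else
            echo "$hook: missing"
        fi
    done
}

# Usage on the storage server:
# make_hooks_executable /labdata/.git/hooks
```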
Add new special remote via http.
git remote add httpdata HTTPURL/.git
git annex initremote datasrc type=git location=HTTPURL/.git autoenable=true
git annex merge # necessary?
git remote rm httpdata # not needed anymore
Note: after pulling/syncing in remote clones, git annex init should be
re-run, according to the man page. It may be necessary to run
git annex enableremote datasrc on the user's computer. TODO.
TODO
The idea is to keep a copy of the repository on GitHub, without the
contents of the annex, so that it is more visible and can be easily
cloned by anonymous users. Moreover, it can be set up so that content
can be retrieved via git annex get <file>, leveraging access to the
storage-server and/or public access via the web.
....create repository on github....
git remote add github <github-URL-to-repository>
git push -u github master
git push -u github git-annex
After populating and using the repository, it is common to realize
that it may not be smart to have all files stored with git-annex and
that it would be better to have them simply stored in git. The
following commands migrate files from git-annex to git:
git annex unannex <file>
git add <file>
git commit -m <message>
Note that git annex unannex <file> does not need a commit.
Viceversa: TODO.
TODO
In this section, we describe the use of git-annex from the point of
view of users, when the centralized repository is already
available. We distinguish between users who just access the repository
to obtain the data and, from time to time, the updates, and users who
contribute to the repository by creating new content or code to be
sent to the central repository.
As a user, the first thing to do is to clone the repository hosted on the storage server. Note that the repository may be reached in several ways: via SSH, if you have an account on the storage server; via HTTP, if the repository has been published with a web server; or via GitHub, if this option has been set up. In this last case, the content of the files in the repository is not available from GitHub, and at least one of the other means must be available to reach the content.
git clone user@storage-server:/labdata
The directory labdata/ is then created, with the whole tree of
directories and, for each file, either a symbolic link to the
(missing) content, if the file was added with git annex add <file>, or
the actual file, if it was added with git add <file>. Additionally, as
in every git repository, the labdata/.git directory is present,
hosting all the git history and internal files. Note that the
directory labdata/.git/annex, created by git annex, is not present
yet. Still, the information that git-annex needs to retrieve the
content of the files in the annex is already available, because it is
stored in the git-annex branch. The list of all available branches
shows it:
> git branch -a
* master
remotes/origin/HEAD -> origin/master
remotes/origin/git-annex
remotes/origin/master
remotes/origin/synced/git-annex
remotes/origin/synced/master
For this reason, the content of the files currently appearing just as broken links can be easily retrieved with:
git annex get <file>
where <file> is a filename, a directory, or an expression with
wildcards that addresses the content we require.
From time to time, the user can retrieve updates of the repository by executing:
git pull
The user can also ask git-annex where to find the content of a given
file:
git annex whereis <file>
If a user is also a contributor to the repository, then he/she can
create new content and push it to the repository on the storage
server. To do that, some additional steps are needed on the local
clone of the repository. For clarity, the following instructions start
from cloning the repository as contributor:
git clone contributor@storage-server:/labdata
cd labdata/
git annex init contributor1-desktop
Here, git annex init <label> is not mandatory, but it is good practice
for a collaborator to add a human-readable label describing the local
repository, because it will show up in the information stored by
git-annex and shared with others.
A second important step is to inform git-annex that the local
repository should only get the content explicitly requested by the
collaborator. This is important when, later, the contributor sends new
content to the main repository on the storage server with
git annex sync. git annex provides a rich and flexible set of
expressions to set the preferences for content that is automatically
retrieved during certain operations. See [allow-remote-content] for a
more detailed explanation. Here, the main step is to set the content
preferences of the local repository to a standard group called manual,
meaning that content is only retrieved manually by the contributor via
git annex get <file> and removed manually, when needed, with
git annex drop <file>:
cd labdata/
git annex wanted . standard
git annex group . manual
At this point, the contributor can create new files, add them to the
annex via git annex add <file> and commit:
...creating new files...
git annex add <newfiles>
git commit -m "created <newfiles>"
At this point, local changes can be sent to the repository on the
storage server with git push, but the content will not be sent, only
the symbolic link and some metadata. To copy the content of the file
to the annex on the storage server, git-annex provides the command
git annex copy <newfiles> --to=origin. Note that, instead of
indicating the specific filename, it is sufficient to indicate the
name of the directory with the new files, and git annex will figure
out what content needs to be copied, e.g.:
git annex copy . --to=origin
Since pushing content to a repository often requires pulling and
merging changes first, git-annex provides a more convenient way to
perform all these operations, through the sync command:
git annex sync origin --content
Internally, git annex sync --content performs the following steps:
git commit
git pull
git merge
git push
git annex copy . --to=origin
git annex get .
Note that the last two steps are skipped if --content is
omitted. Moreover, had the standard group manual not been set in the
local repository, all files available on the storage server would have
been copied locally. In any case, if that happens, interrupting the
retrieval with CTRL-C is safe.
If a <file> is stored in the annex and changes need to be made to it,
then the file must be unlocked first:
git annex unlock <file>
...edit...edit...edit...
git annex add <file>
git commit -m "updated <file>"
Note that git annex unlock <file> removes the symbolic link and copies
the content of the file in its place, with write permission. This is a
second copy of the file, because the one in the annex is left
untouched. After changing the file, git annex add and git commit can
be performed as usual. Note that, if you need to change a file
frequently, it may be more convenient to store it with git instead of
git-annex.
What happens if you attempt to edit a file without unlocking it first?
Files added with git-annex appear as symbolic links in the
filesystem. An application, such as an editor, should warn that you
are opening a link and not a file. Moreover, the content of the file,
pointed to by the link, is stored in .git/annex/objects/ and is
write-protected. This is the only copy of the content of the file in
the local repository, which is why it is protected. An application
attempting to write to this file should either fail, with permission
denied, or clearly ask for confirmation before writing to a
write-protected file, as e.g. Sublime Text 3 does. If the user insists
on writing to the file and the application allows it, the internal
copy kept by git-annex is damaged. With git annex fsck <file>,
git-annex first reports that the local copy of the file is no longer
good and moves it to .git/annex/bad/. To resolve such a situation, it
is necessary to retrieve a pristine copy of the file with
git annex get <file>, then unlock it and either redo the edits or copy
the file in .git/annex/bad/ over the unlocked file, and finally add
and commit.
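The recovery steps described above, after git annex fsck <file> has moved the damaged content aside, can be collected in a small helper. This is a sketch under assumptions: the function names run and recover_annexed_file, the DRY_RUN variable and the commit message are hypothetical, and the path of the damaged copy under .git/annex/bad/ has to be looked up by hand and passed as second argument.

```shell
# run CMD...
#   Execute CMD, or just print it when DRY_RUN is set (for previewing).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recover_annexed_file FILE BAD_COPY
#   FILE:     annexed file that was edited without unlocking
#   BAD_COPY: damaged content that fsck moved to .git/annex/bad/
recover_annexed_file() {
    file="$1"
    bad_copy="$2"
    run git annex get "$file" &&          # fetch a pristine copy
    run git annex unlock "$file" &&       # make the file writable
    run cp "$bad_copy" "$file" &&         # bring back the edited content
    run git annex add "$file" &&
    run git commit -m "recovered edits to $file"
}

# Preview the commands without touching the repository:
# DRY_RUN=1 recover_annexed_file data.bin .git/annex/bad/<key>
```

Running it first with DRY_RUN=1 prints the five commands, which makes it easy to check the paths before anything is changed.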
git clone http://storage-server.mydomain.com/labdata
cd labdata
git annex get <files>
[...]
git pull
git annex get <files>
Thanks to Michael Hanke's post for inspiring parts of this tutorial and showing interesting solutions.
Thanks to Yaroslav Halchenko and Michael Hanke for their continuous
effort in improving and maintaining NeuroDebian which, among many
other things, provides Debian/Ubuntu repositories with the latest
git-annex, within the package git-annex-standalone.
Special thanks to Joey Hess, author of git-annex, for the beautiful
and intriguing piece of software that sometimes teases us like a
puzzle, like git does.