Datalad and Git Annex

Version control for our code helps us keep track of changes and supports reproducibility. With datalad, we can have similar version control over our data.

Background Reading

What should you know before using datalad? There's a lot of great information online about datalad, so I won't regurgitate it here. But, you should probably skim through the datalad handbook before jumping into using it on your own. There's also a nice cheat sheet of the main commands. As you're working through the basic background reading, I suggest focusing on the following points:

similarities with git and overlap with git-annex
creating datasets
linking with GitHub and Google Drive
running commands and logging output
saving changes to a dataset
publishing and sharing datasets

An Example Setup

Below are some of my notes from when I set this up and got it working. There was some trial and error, so I've tried to document some things that went wrong for me along the way. Before you begin, read through some of the guides and especially have an understanding of the commands used here.

Creating a dataset and a special remote

You first need to create a dataset. For this, I used the commands below (note that I'm using datalad version 0.11.8).

cd /data/projects
mkdir srndna-public-test
cd srndna-public-test
datalad create --annex-version 7 --text-no-annex --description "SRNDNA public test data on Smith Lab Linux" --shared-access group

Note that the --description should say what the data is and where it lives. I couldn't find documentation for the --shared-access group option, but I think this is important for sharing later. Oddly, it wasn't available as an option when I first started testing with datalad version 0.12.0rc5, but this could've been a quirk of what I was doing.

Now you have a dataset and you're in it, but you still want to set up a remote before going much further. The remote is where your data (i.e., your big files) will go (it is the annex). For this step, I'm using rclone and its integration with git-annex and Google Drive. I was initially using a different special remote, git-annex-remote-googledrive, but I could only push data to the annex and couldn't pull it back down (note that this method might also lose access to Google Drive soon).

git annex initremote gdrive type=external externaltype=rclone target=dvs-temple prefix=srndna-public-test/annex chunk=50MiB encryption=none rclone_layout=lower

In the command above, target=dvs-temple is the name of the remote I set up with rclone (the setup is really easy and reliable, though maybe a little slow). The prefix=srndna-public-test/annex is where the data will be stored in my Google Drive. Finally, the gdrive is how this remote will be recognized by git-annex (and datalad?) later.
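Since the same parameters come up again when the remote is enabled on another machine, one option is to keep them in a single helper. This is only a sketch: the function name gdrive_remote_args is made up for illustration, and the values are just the ones from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of git-annex or datalad): print the
# special-remote parameters used in these notes, one per line, so that
# initremote and enableremote can share them.
gdrive_remote_args() {
    local target="$1"   # name of the rclone remote (dvs-temple in these notes)
    local prefix="$2"   # folder on the drive where annexed content is stored
    printf '%s\n' \
        "type=external" \
        "externaltype=rclone" \
        "target=${target}" \
        "prefix=${prefix}" \
        "chunk=50MiB" \
        "encryption=none" \
        "rclone_layout=lower"
}

# Usage (left as a comment because it needs git-annex, a configured rclone
# remote, and bash 4+ for mapfile):
#   mapfile -t args < <(gdrive_remote_args dvs-temple srndna-public-test/annex)
#   git annex initremote gdrive "${args[@]}"
```

Calling the function on its own just prints the seven key=value pairs, which is a cheap way to eyeball them before touching the repository.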

Now we need another place to store code and symlinks that can be version controlled. For this step, the datalad create-sibling-github command allows us to create a sibling dataset on the lab GitHub. We'll publish changes here and hopefully this is where folks can download the test data before it lands on OpenNeuro.

datalad create-sibling-github srndna-public-test -s DVS-Lab-GitHub --github-organization DVS-Lab --publish-depends gdrive

Customizing for neuroimaging data

Before I push/publish anything, I first want to make some changes to what gets tracked in the repo and the annex. For the latter, I'm copying from what the HeuDiConv folks do for their datalad integration (add link).

cp /data/tools/general/.gitignore .
cp /data/tools/general/LICENSE .
git add .
git commit -m "add license and .gitignore tuned for python"
cp ../srndna-test/.gitattributes .
git commit -am "tune for heudiconv and imaging"

I also want to add in my stable scripts and code from the parent project. Note that I'm using cp -rL to make sure the copy command copies the actual file (not just the link), which is a critical consideration when you start using datalad and git annex.

cp -rL ../srndna-test/*.sh .
cp -rL ../srndna-test/stimuli .
cp -rL ../srndna-test/code .
cp -rL ../srndna/README.md .
mkdir -p derivatives/fsl/
cp -rL ../srndna-test/derivatives/fsl/templates derivatives/fsl/.
git add .
git commit -m "add README and scripts from srndna-test repo"
cp -rL ../srndna/masks/ . # these will get added to the annex below (via datalad save)
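To see why the -L flag matters, here is a small self-contained demonstration that doesn't touch any real dataset. It fakes an annexed file as a relative symlink inside a throwaway directory (the repo_a/repo_b layout and file names are invented for the example) and copies it both ways. I use cp -R for the comparison because GNU cp treats -r and -R the same, while some other cp implementations don't.

```shell
#!/usr/bin/env bash
# Compare copying a symlink with and without -L, using a temporary directory.
set -eu
demo="$(mktemp -d)"
mkdir -p "$demo/repo_a/objects" "$demo/repo_a/src" "$demo/repo_b"

# A stand-in for annexed content, plus a relative symlink pointing at it
# (git-annex also uses relative symlinks into its object store).
echo "pretend this is a large image" > "$demo/repo_a/objects/key"
ln -s ../objects/key "$demo/repo_a/src/mask.nii.gz"

# Without -L the symlink itself is copied; in repo_b it points at nothing.
cp -R "$demo/repo_a/src" "$demo/repo_b/links"

# With -L the symlink is followed and the actual content is copied.
cp -rL "$demo/repo_a/src" "$demo/repo_b/files"

ls -l "$demo/repo_b/links" "$demo/repo_b/files"
```

The copy under links is a dangling symlink, while the copy under files is a regular file with the content in it, which is what you want when moving files between datasets.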

Saving and adding new data

We still haven't pushed anything to the remotes yet. That will happen soon. Let's first save where we are and tag it. Note that this command is sort of like git commit -am "your helpful commit message" but it is focused specifically on the state of the data.

datalad save . -m "initial save" --version-tag "initialsetup"

While we're here, let's also take advantage of the datalad run command and start some of our data conversion. This step will add more data to our annex.

datalad run -m "heudiconv, defacing, and mriqc" "bash run_prepdata.sh"

Publishing the data

If you're using git-annex and/or datalad, you are probably interested in backing up, sharing, and/or managing your data. By publishing to a remote (e.g., GitHub and Google Drive), we can back up our data and share it with others more easily. Recall that when we first set up the dataset (i.e., the folder on the local computer) and its sibling (i.e., the GitHub repo that will mirror everything), we included the --publish-depends gdrive option. That option should force changes to the data to be pushed to my Google Drive, though this doesn't seem to happen automatically. So, now we can run the commands below to publish the contents to GitHub and Google Drive.

datalad publish --to DVS-Lab-GitHub
git annex copy --to gdrive
datalad publish --to DVS-Lab-GitHub

As far as I know, you need both commands to get your changes onto GitHub and onto the annex. The order seems to be important because the git-annex records of where each file's content lives change during the git annex copy command and need to be pushed to GitHub afterward. Hopefully this will eventually be a single datalad command.
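Because the order matters, it may be worth wrapping the three steps in a small script so they always run in the same sequence and stop at the first failure. The sketch below is my own convenience wrapper, not a datalad feature: the names run and publish_all and the DRY_RUN variable are hypothetical, and the commands inside are exactly the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the publish steps from these notes.
# With DRY_RUN=1 the commands are only printed, not executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

publish_all() {
    local sibling="$1"   # name of the GitHub sibling
    local annex="$2"     # name of the git-annex special remote
    # Each step only runs if the previous one succeeded.
    run datalad publish --to "$sibling" &&
    run git annex copy --to "$annex" &&
    run datalad publish --to "$sibling"
}

# Usage, from inside the dataset:
#   DRY_RUN=1 publish_all DVS-Lab-GitHub gdrive   # print the commands only
#   publish_all DVS-Lab-GitHub gdrive             # actually publish
```

The dry-run mode is there mainly so the sequence can be checked without pushing anything.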

If you don't want others to access your data yet, you can make your GitHub sibling private and/or not give them access to the Google Drive folder with the data. For this example, it's only a few subjects and I've tried to make it accessible to everyone. Using OpenNeuro and a DOI would be a much better solution for permanent archival, and that's what we plan to do with the full dataset.

Note that datalad publish will put the code and the symlinks to the annexed files on the master branch. The git-annex bookkeeping (e.g., which remotes hold the content of each file) goes to its own branch labeled git-annex.

Accessing the published data

Assuming you have access to the folder on my Google Drive*, you should be able to install this dataset and run it on your own computer using the commands below.

datalad install https://github.com/DVS-Lab/srndna-public-test
cd srndna-public-test
git annex enableremote gdrive type=external externaltype=rclone target=dvs-temple prefix=srndna-public-test/annex chunk=50MiB encryption=none rclone_layout=lower
git annex copy --from gdrive

*POTENTIAL ISSUE: How could someone else set up an rclone remote to my own personal drive? I guess they see it as "shared with you" if I share it. This is how I have things set up for K-Lab at Rutgers, but I only see their "Temple" folder right now.
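One thing a collaborator can check before running enableremote is whether an rclone remote with the expected name exists on their machine at all. A small sketch, assuming they configure an rclone remote with the same name I used (dvs-temple); the helper name rclone_remote_listed is made up for this example. It reads a listing from standard input so it can be tried without rclone installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the named rclone remote appears in a listing
# read from standard input. `rclone listremotes` prints one remote per line,
# each followed by a colon (e.g. "dvs-temple:").
rclone_remote_listed() {
    # -F fixed string, -x match the whole line, -q no output
    grep -Fxq -- "$1:"
}

# Usage (left as a comment because it needs rclone and git-annex):
#   if rclone listremotes | rclone_remote_listed dvs-temple; then
#       echo "found dvs-temple; run the git annex enableremote command above"
#   else
#       echo "no rclone remote named dvs-temple; create one with 'rclone config'" >&2
#   fi
```

This only confirms that the name is configured, not that the shared folder is actually visible through it.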

Troubleshooting

datalad save . -m "initial save" --version-tag "initialsetup" didn't let me use the --jobs 5 argument
Also can't do datalad status with this version of datalad?
Unlocking/locking files can be a little tricky, so be aware that a locked file cannot be read by some programs (e.g., mriqc).
Behavior of the annex can be a little weird/unexpected even with changes to .gitattributes. For example, if I git add . the NIfTI files in masks, they don't get added to the annex even though they are > 100kb, so it's better to let datalad sort this out automatically with its save command.
Need to be careful with annexing the fsf files from FSL. May need to adjust .gitattributes file so that they never get annexed.
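For the fsf point, one possible adjustment is a rule that keeps *.fsf files out of the annex no matter how large they are. The sketch below appends such a rule to a .gitattributes file exactly once. The helper name keep_fsf_in_git is hypothetical, and I haven't checked how this interacts with the rules copied from the HeuDiConv setup, so treat it as a starting point. The rule itself uses git-annex's annex.largefiles attribute, where the value nothing means no matching file is treated as a large file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: make sure FSL design files (*.fsf) stay in git rather
# than the annex by appending a rule to a .gitattributes file, only once.
keep_fsf_in_git() {
    local attrs="${1:-.gitattributes}"
    local rule='*.fsf annex.largefiles=nothing'
    touch "$attrs"
    # Append only if the exact line is not already there.
    # Later lines in .gitattributes override earlier ones, so adding the
    # rule at the end lets it win over a general size-based rule.
    grep -Fxq -- "$rule" "$attrs" || printf '%s\n' "$rule" >> "$attrs"
}

# Usage, from the top of the dataset:
#   keep_fsf_in_git .gitattributes
#   git add .gitattributes
#   git commit -m "never annex fsf files"
```

Afterward, git check-attr annex.largefiles -- path/to/design.fsf (with a real path substituted) should show which value applies to a given file.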
