Reference Data with CVMFS without Ansible

Overview
Objectives:
  • Have an understanding of what CVMFS is and how it works

  • Install and configure the CVMFS client on a Linux machine and mount the Galaxy reference data repository

  • Configure your Galaxy to use these reference genomes and indices

Time estimation: 1 hour
Last modification: Oct 18, 2022
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Overview

CernVM-FS is a distributed filesystem designed for sharing read-only data across the globe. In the Galaxy Project we use it to share things that many Galaxy servers need, namely:

  • Reference Data
    • Genome sequences for hundreds of useful species.
    • Indices for the genome sequences
    • Various bioinformatic tool indices for the available genomes
  • Tool containers
    • Singularity containers of everything stored in Biocontainers (a bioinformatics tool container repository). You get these for free every time you build a Bioconda recipe/package for a tool.
  • Others too.

From the CERN website:

“The CernVM File System provides a scalable, reliable and low-maintenance software distribution service. It was developed to assist High Energy Physics (HEP) collaborations to deploy software on the worldwide-distributed computing infrastructure used to run data processing applications. CernVM-FS is implemented as a POSIX read-only file system in user space (a FUSE module). Files and directories are hosted on standard web servers and mounted in the universal namespace /cvmfs.”

https://cernvm.cern.ch/portal/filesystem

A slideshow presentation on this subject can be found here. More details on the usegalaxy.org (Galaxy Main’s) reference data setup and CVMFS system can be found here.

There are two sections to this exercise. The first shows you how to use Ansible to set up and configure CVMFS for Galaxy. The second shows you how to do everything manually. We recommend the Ansible method; the manual method is included mainly to give a more in-depth understanding of what is happening.

If you really want to perform all these tasks manually, go here, otherwise just follow along.

Agenda
  1. Overview
  2. CVMFS and Galaxy without Ansible
    1. Configuring CVMFS
    2. Testing it out
    3. Look at the repository

CVMFS and Galaxy without Ansible

Comment: Manual version of Ansible Commands

If you wish to achieve the same result as the Ansible version of this tutorial, but by running each step manually, follow these instructions. If you have already completed the Ansible version, everything below is already done and you do not need to repeat it.

We are going to set up a CVMFS mount of the Galaxy reference data repository on our machines. To do this, we install and configure the CVMFS client and then mount the appropriate CVMFS repository using the publicly available keys.

Hands-on: Installing the CVMFS Client
  1. On your remote machine, first install the CERN software apt repository, then the CVMFS client and config utility:

    sudo apt install lsb-release
    wget https://ecsft.cern.ch/dist/cvmfs/cvmfs-release/cvmfs-release-latest_all.deb
    sudo dpkg -i cvmfs-release-latest_all.deb
    rm -f cvmfs-release-latest_all.deb
    sudo apt-get update
    
    sudo apt install cvmfs cvmfs-config
    
  2. Now we need to run the CVMFS setup script.

    sudo cvmfs_config setup
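
Before moving on to configuration, it can help to confirm that the client tools actually landed on the PATH. The following is a minimal sketch, not part of the original tutorial; the function name check_cvmfs_tools is hypothetical, and it only checks for the cvmfs2 client binary and the cvmfs_config utility.

```shell
#!/bin/sh
# Sketch: report whether the CVMFS client tools are available on PATH.
# check_cvmfs_tools is a hypothetical helper name, not part of CVMFS.
check_cvmfs_tools() {
  missing=0
  for tool in cvmfs2 cvmfs_config; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  # Non-zero return when at least one tool is absent.
  return "$missing"
}

check_cvmfs_tools || echo "Install the cvmfs package before continuing."
```

If both tools are found, running sudo cvmfs_config chksetup is a further check of the client setup.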
    

Configuring CVMFS

The configuration is not complex for CVMFS:

Hands-on: Configuring CVMFS
  1. Create a /etc/cvmfs/default.local file with the following contents:

    CVMFS_REPOSITORIES="data.galaxyproject.org"
    CVMFS_HTTP_PROXY="DIRECT"
    CVMFS_QUOTA_LIMIT="500"
    CVMFS_CACHE_BASE="/srv/cvmfs/cache"
    CVMFS_USE_GEOAPI=yes
    

    This tells CVMFS to mount the Galaxy reference data repository, to keep its cache in a specific location limited to 500 MB, and to use the instance’s geo-location to choose the best CVMFS repo server to connect to. If you use the Ansible role instead, the cvmfs_quota_limit role variable controls the cache limit.

  2. Create a /etc/cvmfs/domain.d/galaxyproject.org.conf file with the following contents:

    CVMFS_SERVER_URL="http://cvmfs1-tacc0.galaxyproject.org/cvmfs/@fqrn@;http://cvmfs1-iu0.galaxyproject.org/cvmfs/@fqrn@;http://cvmfs1-psu0.galaxyproject.org/cvmfs/@fqrn@;http://galaxy.jrc.ec.europa.eu:8008/cvmfs/@fqrn@;http://cvmfs1-mel0.gvl.org.au/cvmfs/@fqrn@;http://cvmfs1-ufr0.galaxyproject.eu/cvmfs/@fqrn@"
    

    This is a list of the available stratum 1 servers that have this repo.

  3. Create a /etc/cvmfs/keys/data.galaxyproject.org.pub file with the following contents:

    -----BEGIN PUBLIC KEY-----
    MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA5LHQuKWzcX5iBbCGsXGt
    6CRi9+a9cKZG4UlX/lJukEJ+3dSxVDWJs88PSdLk+E25494oU56hB8YeVq+W8AQE
    3LWx2K2ruRjEAI2o8sRgs/IbafjZ7cBuERzqj3Tn5qUIBFoKUMWMSIiWTQe2Sfnj
    GzfDoswr5TTk7aH/FIXUjLnLGGCOzPtUC244IhHARzu86bWYxQJUw0/kZl5wVGcH
    maSgr39h1xPst0Vx1keJ95AH0wqxPbCcyBGtF1L6HQlLidmoIDqcCQpLsGJJEoOs
    NVNhhcb66OJHah5ppI1N3cZehdaKyr1XcF9eedwLFTvuiwTn6qMmttT/tHX7rcxT
    owIDAQAB
    -----END PUBLIC KEY-----
    
  4. Make a directory for the cache files

    sudo mkdir /srv/cvmfs
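
The two plain-text configuration files from steps 1 and 2 can also be staged and read back from a script. The sketch below assumes a scratch directory (the STAGE variable and the show_staged_config helper are hypothetical names) rather than writing to /etc/cvmfs directly. It relies on the files using shell-style assignments, and it shows how the @fqrn@ placeholder expands to the fully qualified repository name for each Stratum 1 server.

```shell
#!/bin/sh
# Sketch: stage the config files from steps 1 and 2, then read them back.
# STAGE is a hypothetical scratch directory; nothing is written to /etc/cvmfs.
STAGE="${STAGE:-$(mktemp -d)}"
mkdir -p "$STAGE/domain.d"

cat > "$STAGE/default.local" <<'EOF'
CVMFS_REPOSITORIES="data.galaxyproject.org"
CVMFS_HTTP_PROXY="DIRECT"
CVMFS_QUOTA_LIMIT="500"
CVMFS_CACHE_BASE="/srv/cvmfs/cache"
CVMFS_USE_GEOAPI=yes
EOF

cat > "$STAGE/domain.d/galaxyproject.org.conf" <<'EOF'
CVMFS_SERVER_URL="http://cvmfs1-tacc0.galaxyproject.org/cvmfs/@fqrn@;http://cvmfs1-iu0.galaxyproject.org/cvmfs/@fqrn@;http://cvmfs1-psu0.galaxyproject.org/cvmfs/@fqrn@;http://galaxy.jrc.ec.europa.eu:8008/cvmfs/@fqrn@;http://cvmfs1-mel0.gvl.org.au/cvmfs/@fqrn@;http://cvmfs1-ufr0.galaxyproject.eu/cvmfs/@fqrn@"
EOF

# Source both files in a subshell so the variables do not leak into the caller.
show_staged_config() (
  . "$STAGE/default.local"
  . "$STAGE/domain.d/galaxyproject.org.conf"
  echo "repository: $CVMFS_REPOSITORIES"
  echo "cache: $CVMFS_CACHE_BASE (limit $CVMFS_QUOTA_LIMIT MB)"
  # One server per line, with @fqrn@ replaced by the repository name.
  # This assumes a single repository in CVMFS_REPOSITORIES, as in this tutorial.
  printf '%s\n' "$CVMFS_SERVER_URL" | tr ';' '\n' | sed "s|@fqrn@|$CVMFS_REPOSITORIES|"
)

show_staged_config
```

Once the real files are in place under /etc/cvmfs, sudo cvmfs_config showconfig data.galaxyproject.org prints the configuration the client will actually use.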
    

Testing it out

Probe the connection.

Hands-on: Testing it out
  1. Run sudo cvmfs_config probe data.galaxyproject.org

    Question

    What does it output?

    OK
    

    If this doesn’t return OK then you may need to restart autofs: sudo systemctl restart autofs

  2. Change directory into /cvmfs/ and list the files in that folder

    Question

    What do you see?

    You should see nothing, as CVMFS uses autofs in order to mount paths only upon request.

  3. Change directory into /cvmfs/data.galaxyproject.org/.

    Input: Bash
    cd /cvmfs/data.galaxyproject.org/
    ls
    ls byhand
    ls managed
    
    Question

    What do you see now?

    You’ll see .loc files, genomes and indices. Because autofs mounts the repository only when it is accessed, the folder appeared to be missing in the previous step.

    And just like that we all have access to all the reference genomes and associated tool indices thanks to the Galaxy Project, IDC, and Nate’s hard work!

    If you are developing a new tool and want to add a reference genome, we recommend you talk to us on Gitter. You can also look at one of the tools that uses reference data and copy its approach. If you are creating entirely new location files, you will need to write the data manager.
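
To get a quick overview of the location files mentioned above, the listing can be scripted. This is a sketch only: list_loc_files is a hypothetical helper, and the default search depth of 3 is an assumption about where the .loc files sit below byhand and managed. Accessing the path is what prompts autofs to mount the repository.

```shell
#!/bin/sh
# Sketch: list Galaxy location (.loc) files under a mounted repository.
# list_loc_files is a hypothetical helper; arguments are ROOT and MAXDEPTH.
list_loc_files() {
  root="${1:-/cvmfs/data.galaxyproject.org}"
  depth="${2:-3}"   # assumed depth, adjust if the layout differs
  # Testing the directory touches the path, which triggers the autofs mount.
  if [ ! -d "$root" ]; then
    echo "not mounted or not found: $root" >&2
    return 1
  fi
  find "$root" -maxdepth "$depth" -type f -name '*.loc' | sort
}

# Show the first few location files, if the repository is available.
list_loc_files | head -n 5
```

Passing a different directory as the first argument lets you try the helper against any folder, for example a local copy of some location files.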

Look at the repository

To configure Galaxy to use the CVMFS references we have just mounted, see the Ansible tutorial.

Frequently Asked Questions

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Galaxy Server administration topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum


Citing this Tutorial

  1. Simon Gladman, Helena Rasche, Reference Data with CVMFS without Ansible (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/admin/tutorials/cvmfs-manual/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012



@misc{admin-cvmfs-manual,
author = "Simon Gladman and Helena Rasche",
title = "Reference Data with CVMFS without Ansible (Galaxy Training Materials)",
year = "",
month = "",
day = "",
url = "\url{https://training.galaxyproject.org/training-material/topics/admin/tutorials/cvmfs-manual/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                   

Congratulations on successfully completing this tutorial!