Automating Galaxy workflows using the command line
Overview
Questions:
- How can I schedule and run tens or hundreds of Galaxy workflows easily?
- How can I automate my analyses when large amounts of data are being produced daily?
Objectives:
- Learn to use the planemo run subcommand to run workflows from the command line.
- Be able to write simple shell scripts for running multiple workflows concurrently or sequentially.
- Learn how to use Pangolin to assign annotated variants to lineages.
Requirements:
- Familiarity with Galaxy and basic associated concepts, in particular workflows
- Basic knowledge of the command line
Time estimation: 2 hours
Last modification: Oct 18, 2022
Introduction
Galaxy is well-known as a web-based data analysis platform which provides a graphical interface for executing common bioinformatics tools in a reproducible manner. However, Galaxy is not just a user-friendly interface for executing one tool at a time. It provides two very useful features which allow scaling data analyses up to a high-throughput level: dataset collections and workflows.
- Dataset collections represent groups of similar datasets. This doesn’t sound especially exciting, but it gets interesting when a tool that normally takes a single dataset is executed with a collection as input. In this case, Galaxy creates a new job for every dataset contained within the collection, and stores the tool outputs in a new collection (or collections, if there are multiple outputs) with the same size and structure as the input. This process is referred to as ‘mapping over’ the input collection.
- Workflows are pipelines made up of multiple Galaxy tools executed in sequence. When a workflow is executed, Galaxy schedules all the jobs which need to be run to produce the workflow outputs. They will remain scheduled (represented by the familiar grey color in the Galaxy history) until the required inputs become available. After they complete, they will make their outputs available, which allows the next set of jobs to begin.
Between them, collections and workflows make it possible to scale up from running a single tool on a single dataset to running multiple tools on multiple datasets.
Why use the command line?
All the functionality described so far is available through the graphical interface. So why use the command line? Let’s consider a couple of scenarios:
- You want to run some molecular dynamics simulations to perform free energy calculations for protein-ligand binding. You have a compound library of 1000 ligands and would like to run an ensemble of 10 simulations for each. You would like to have the analysis for each ligand in a separate history, with the ensembles represented as dataset collections, which means you need to invoke your workflow 1000 times - theoretically possible to do in the graphical interface, but you probably want to avoid it if possible.
- You are conducting research on a virus which is responsible for a deadly pandemic. New genomic data from the virus is being produced constantly, and you have a variant calling workflow which will tell you if a particular sample contains new mutations. You would like to run this workflow as soon as the data appears - every day perhaps, or even every hour. This will be quite tough if you have to click a button each time yourself.
If you are encountering problems similar to these in your research with Galaxy, this tutorial is for you! We will explain how to trigger workflow execution via the command line using Planemo, and provide some tips on how to write scripts to automate the process. The result will be a bot which takes care of the whole analysis process for you.
Get workflows and data
Workflows and data for this tutorial are hosted on GitHub.
Hands-on: Download workflow and data
Download the workflows and data for this tutorial using git clone.
Input: git clone
git clone https://github.com/usegalaxy-eu/workflow-automation-tutorial.git
Next, step into the cloned folder and take a look around.
Input
cd workflow-automation-tutorial
ls
Output: Folder contents
example  LICENSE  pangolin  README.md
Of the two subfolders, example/ contains just a toy workflow, which is used in the following to guide you through the basics of running workflows from the command line. The pangolin/ folder holds the workflow and other material you will use in the second part of the tutorial, in which you will be setting up an automated system for assigning batches of SARS-CoV-2 variant data to viral lineages.
A short guide to Planemo
The main tool we will use in this tutorial is Planemo, a command-line tool with a very wide range of functionality. If you have ever developed a Galaxy tool, you have probably encountered the planemo test, planemo serve and planemo lint subcommands. In this tutorial we will be using a different subcommand: planemo run.
Comment: Note
The Planemo documentation provides a more detailed tutorial on the planemo run functionality. Its pages on ‘Best Practices for Maintaining Galaxy Workflows’ and ‘Test Format’ also contain a lot of useful information.
For the purposes of this tutorial, we assume you have a recent version of Planemo (0.74.4 or later) installed in a virtual environment. If you don’t, please follow the installation instructions.
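Before continuing, it can be worth confirming that the installed Planemo meets this minimum version. The following is a minimal sketch rather than part of the original material: the helper name version_ge is hypothetical, it relies on sort -V (available in GNU coreutils), and the usage comment assumes that planemo --version prints the version number as the last word of its output.

```shell
# Return success if version $1 is greater than or equal to version $2.
version_ge() {
    # sort -V orders version strings; if the smaller of the two is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (assumes the version is the last word printed by `planemo --version`):
#   installed=$(planemo --version | awk '{print $NF}')
#   version_ge "$installed" 0.74.4 || echo "Please upgrade Planemo" >&2
```

A version-aware comparison is used because a plain string comparison would rank 0.74.11 below 0.74.4.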
Get the workflow and prepare the job file
For this section, we will use a very simple workflow consisting of two text manipulation tools chained together.
Hands-on: Step into and explore the example folder
Input
cd example
ls
Output: Folder contents
tutorial.ga
The tutorial.ga file defines the workflow in JSON format; if we are confident we have a functional workflow, we don’t need to worry about its contents or modify it. However, we need a second file, the so-called ‘job file’, which specifies the particular dataset and parameter inputs which should be used to execute the workflow. We can create a template for this file using the planemo workflow_job_init subcommand.
Hands-on: Creating the job file
- Run the planemo workflow_job_init subcommand.
Input: workflow_job_init
planemo workflow_job_init tutorial.ga -o tutorial-init-job.yml
# Now let's view the contents
cat tutorial-init-job.yml
The planemo workflow_job_init command identifies the inputs of the workflow provided and creates a template job file with placeholder values for each.
Output: File contents
Dataset 1:
  class: File
  path: todo_test_data_path.ext
Dataset 2:
  class: File
  path: todo_test_data_path.ext
Number of lines: todo_param_value
The job file contains three inputs: two dataset inputs and one integer (parameter input).
- Create two files which can be used as inputs:
Input: Creating the input files
printf "hello\nworld" > dataset1.txt
printf "hello\nuniverse!" > dataset2.txt
ls
Output
dataset1.txt  dataset2.txt  tutorial.ga  tutorial-init-job.yml
- Replace the placeholder values in the job file, so that it looks like the following:
Dataset 1:
  class: File
  path: dataset1.txt
Dataset 2:
  class: File
  path: dataset2.txt
Number of lines: 3
Now we are ready to execute the workflow with our chosen parameters!
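Editing the job file by hand is fine for a single run, but once runs are scripted it can be convenient to generate the file instead. The following is a minimal sketch under stated assumptions: write_job_file is a hypothetical helper, and the input labels (Dataset 1, Dataset 2, Number of lines) are simply copied from the template produced by planemo workflow_job_init above.

```shell
# Write a job file for tutorial.ga; the input labels must match those
# reported by `planemo workflow_job_init`.
write_job_file() {
    # $1, $2: paths to the two input datasets; $3: number of lines; $4: output file
    cat > "$4" <<EOF
Dataset 1:
  class: File
  path: $1
Dataset 2:
  class: File
  path: $2
Number of lines: $3
EOF
}

# Usage:
#   write_job_file dataset1.txt dataset2.txt 3 tutorial-init-job.yml
```

Generating the file from arguments keeps the structure in one place, so only the values change between runs.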
Running the workflow
Now that we have a simple workflow, we can run it using planemo run. At this point you need to choose a Galaxy server on which you want the workflow to run. One of the big public servers would be a possible choice. You could also use a local Galaxy instance. Either way, once you’ve chosen a server, the next step is to get your API key.
Hands-on: Running our workflow
- Run the planemo run subcommand.
Input: planemo run
planemo run tutorial.ga tutorial-init-job.yml --galaxy_url <SERVER_URL> --galaxy_user_key <YOUR_API_KEY> --history_name "Test Planemo WF" --tags "planemo-tutorial"
Switch to your web browser - you should be able to see that a new history has been created with the chosen name and tag.
One potential disadvantage of the previous command is that it waits until the invoked workflow has fully completed. For our very small example, this doesn’t matter, but for a workflow which takes hours or days to finish, it might be undesirable. Fortunately, planemo run provides a --no_wait flag which makes the command exit as soon as the workflow has been successfully scheduled.
- Run the planemo run subcommand with the --no_wait flag.
Input: planemo run
planemo run tutorial.ga tutorial-init-job.yml --galaxy_url <SERVER_URL> --galaxy_user_key <YOUR_API_KEY> --history_name "Test Planemo WF with no_wait" --tags "planemo-tutorial" --no_wait
This time you should see that the planemo run command exits as soon as the two datasets have been uploaded and the workflow has been scheduled.
Using Galaxy workflow and dataset IDs
We’ve now executed the same workflow twice. If you inspect your histories and workflows through the Galaxy web interface, you will see that a new workflow was created on the server for each invocation, and both Dataset 1 and Dataset 2 were uploaded twice. This is undesirable - we are creating a lot of clutter and the uploads are creating additional unnecessary work for the Galaxy server.
Every object associated with Galaxy, including workflows, datasets and dataset collections, has a hexadecimal ID associated with it, which looks something like 6b15dfc0393f172c. Once the datasets and workflows we need have been uploaded to Galaxy once, we can use these IDs in our subsequent workflow invocations.
Hands-on: Running our workflow using dataset and workflow IDs
- Navigate to one of the histories to get dataset IDs for the input datasets. For each one:
  - Click on the View details icon on the dataset in the history.
  - Under the heading Dataset Information, find the row History Content API ID and copy the hexadecimal ID next to it.
- Modify tutorial-init-job.yml to look like the following:
Dataset 1:
  class: File
  # path: dataset1.txt
  galaxy_id: <ID OF DATASET 1>
Dataset 2:
  class: File
  # path: dataset2.txt
  galaxy_id: <ID OF DATASET 2>
Number of lines: 3
- Now we need to get the workflow ID:
  - Go to the workflows panel in Galaxy and find one of the workflows that have just been uploaded.
  - From the dropdown menu, select Edit, to take you to the workflow editing interface.
  - The URL in your browser will look something like https://usegalaxy.eu/workflow/editor?id=34d18f081b73cb15. Copy the part after ?id= - this is the workflow ID.
- Run the planemo run subcommand using the new workflow ID.
Input: planemo run
planemo run <WORKFLOW ID> tutorial-init-job.yml --galaxy_url <SERVER_URL> --galaxy_user_key <YOUR_API_KEY> --history_name "Test Planemo WF with Planemo" --tags "planemo-tutorial" --no_wait
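If you copy the whole editor URL rather than just the ID, the ID can be cut out with shell parameter expansion. This is a small sketch, assuming the URL has the form shown above with id= as its last query parameter; the function name workflow_id_from_url is made up for this example.

```shell
# Print the value following "id=" in a Galaxy workflow editor URL.
workflow_id_from_url() {
    # ${1##*id=} removes everything up to and including the last "id="
    printf '%s\n' "${1##*id=}"
}

workflow_id_from_url "https://usegalaxy.eu/workflow/editor?id=34d18f081b73cb15"
# prints 34d18f081b73cb15
```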
Using Planemo profiles
Planemo provides a useful profile feature which can help simplify long commands. The idea is that flags which need to be used multiple times in different invocations can be combined together and run as a single profile. Let’s see how this works below.
Hands-on: Creating and using Planemo profiles
- Create a Planemo profile with the following command:
Input: planemo profile_create
planemo profile_create planemo-tutorial --galaxy_url <SERVER_URL> --galaxy_user_key <YOUR_API_KEY>
Output: Terminal
Profile [planemo-tutorial] created.
You can view and delete existing profiles using the profile_list and profile_delete subcommands.
- Now we can run our workflow yet again using the profile we have created:
Input: planemo run
planemo run <WORKFLOW ID> tutorial-init-job.yml --profile planemo-tutorial --history_name "Test Planemo WF with profile" --tags "planemo-tutorial"
This invokes the workflow with all the parameters specified in the profile planemo-tutorial.
Automated runs of a workflow for SARS-CoV-2 lineage assignment
It’s now time to apply your newly acquired knowledge of workflow execution with Planemo to a relevant scientific problem.
Scientific background
The SARS-CoV-2 pandemic has been accompanied by unprecedented world-wide sequencing efforts. One of the accepted goals behind sequencing hundreds of thousands of individual viral isolates is to monitor the evolution and spread of viral lineages as close to real time as possible. Viral lineages are characterized by defining patterns of mutations that make them different from each other and from the original virus that started the pandemic at the beginning of 2020. Examples of viral lineages are B.1.1.7, first observed in the UK in the fall of 2020 and now termed variant of concern (VOC) alpha according to the WHO’s classification system, and B.1.617.2, first seen in India at the beginning of 2021 and now recognized as VOC delta.
Pangolin is a widely used tool for assigning newly sequenced viral isolates to established viral lineages, and in this final section of this tutorial you are going to run a workflow that:
- takes a collection of variant datasets in the variant call format (VCF), where you can think of a collection as representing a batch of freshly sequenced viral isolates, with each of its VCF datasets listing the nucleotide differences between one sample and the sequence of an original SARS-CoV-2 reference isolate
- reconstructs the viral genome sequence of each sample by incorporating its variants into the reference isolate’s sequence
- uses Pangolin to classify the resulting collection of genome sequences in FASTA format and to create a report of lineage assignments for all samples.
Just like in a real-world situation, you will receive VCF files for several batches of samples, and you will face the challenge of uploading the files from each batch as a collection into Galaxy and of triggering a run of the workflow for each of them.
Setting up the bot
Unlike in the previous toy example, you will not get complete step-by-step instructions. Instead, try to transfer the knowledge from part 1 to this new, more complex task yourself.
Every step along the way comes with solutions, which you can consult at any time, but you’re encouraged to give each problem some thought first.
As a very first step, however, let’s look at how the material for this part is arranged.
Hands-on: Step into and explore the pangolin folder
Input
cd ../pangolin
ls
Output: Folder contents
data  solutions  vcf2lineage.ga
The file vcf2lineage.ga defines the workflow just described, while the data/ folder holds the batches of VCF files we would, ultimately, like to run the workflow on.
Now, as a start, let’s get the workflow running on the first batch of files in the data/batch1/ subfolder.
Hands-on: An initial workflow run
- Create a template job file for the vcf2lineage.ga workflow.
- Replace the placeholder values: Reference genome should point to data/NC_045512.2_reference_sequence.fasta, Variant calls should contain all the VCF files in data/batch1, and min-AF for consensus variant should be set to 0.7.
- Now that we have a complete job file, let’s run the workflow.
We have now performed a test invocation of the vcf2lineage workflow. It was already more challenging than the first example; for the first time, we needed to resort to writing a script to achieve a task, in this case the construction of the job file.
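One way such a job file could be constructed is sketched below. This is an illustration under stated assumptions, not the tutorial's reference solution: make_batch_job_file is a hypothetical helper, the VCF files are assumed to carry a .vcf extension, and the collection layout (class: Collection, collection_type: list, one element with an identifier per file) reflects our understanding of the Planemo job file format, so check it against the template produced by workflow_job_init.

```shell
# Write a job file for vcf2lineage.ga that lists every VCF file in a batch
# directory as an element of a list collection.
make_batch_job_file() {
    # $1: batch directory (e.g. data/batch1); $2: output job file
    {
        printf 'Reference genome:\n'
        printf '  class: File\n'
        printf '  path: data/NC_045512.2_reference_sequence.fasta\n'
        printf 'Variant calls:\n'
        printf '  class: Collection\n'
        printf '  collection_type: list\n'
        printf '  elements:\n'
        for vcf in "$1"/*.vcf; do
            [ -e "$vcf" ] || continue   # skip the unexpanded pattern if no files match
            printf '  - class: File\n'
            # Assumption: the file name without extension serves as the sample name
            printf '    identifier: %s\n' "$(basename "$vcf" .vcf)"
            printf '    path: %s\n' "$vcf"
        done
        printf 'min-AF for consensus variant: 0.7\n'
    } > "$2"
}

# Usage:
#   make_batch_job_file data/batch1 vcf2lineage-batch1-job.yml
```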
The next step is to automate this process so we can run the workflow on each of the 10 batch*/ directories in the data/ folder. We can imagine that these are newly produced data released at regular intervals, which need to be analysed.
Hands-on: Automating vcf2lineage execution
- If we want to execute the workflow multiple times, we will once again encounter the issue that the datasets and workflow will be reuploaded each time. To avoid this, let’s obtain the dataset ID for the Reference genome (which stays the same for each invocation) and the workflow ID for the vcf2lineage.ga workflow.
- Now let’s create a template job file vcf2lineage-job-template.yml which we can modify at each invocation as necessary. We can start with the output of workflow_job_init, add the Reference genome dataset ID and set min-AF for consensus variant to 0.7 again.
- Write a shell script to iterate over all the batches, create a job file for each and invoke the workflow with planemo run. After execution, move the processed batch to data/complete.
- Run your script. Do you notice any issues? What could be improved?
More advanced solutions
This was a very basic example of a workflow. Perhaps for your case, you need a more customized solution.
For example, it might be the case that you want to run multiple different workflows, one after another. In this case you would need to implement some sort of check to verify if one invocation had finished, before beginning with the next one. Planemo will probably not be enough for a task like this; you will need to resort to using the lower-level BioBlend library to interact directly with the Galaxy API.
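The kind of check described here often boils down to a polling loop. The following shell sketch shows only that generic loop, under loud assumptions: poll_until is a hypothetical helper, and the command it polls (invocation_is_finished in the usage comment) is a placeholder you would have to write yourself, for example as a small BioBlend script that exits with status 0 once the invocation is complete.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
poll_until() {
    # $1: maximum number of attempts; $2: seconds to sleep between attempts;
    # remaining arguments: the command to run
    max_attempts=$1
    interval=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep only if another attempt will follow.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$interval"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage (invocation_is_finished is a placeholder for your own check):
#   poll_until 120 60 invocation_is_finished "<INVOCATION ID>" \
#       && planemo run <NEXT WORKFLOW ID> next-job.yml --profile planemo-tutorial
```

Bounding the number of attempts means a stalled invocation eventually produces an error instead of blocking the script forever.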
The Galaxy SARS-CoV-2 genome surveillance bot provides an example of a more advanced customized workflow execution solution, combining Planemo commands, custom BioBlend scripts and bash scripts, which then get run automatically via continuous integration (CI) on a Jenkins server.
Conclusion
You should now have a better idea about how to run Galaxy workflows from the command line and how to apply the ideas you have learnt to your own project.
Key points
Workflows can be executed not only through the web browser, but also via the command line.
Executing workflows programmatically allows automation of analyses.
Frequently Asked Questions
Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Using Galaxy and Managing your Data topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum.
Feedback
Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.
Citing this Tutorial
- Simon Bray, Wolfgang Maier, Automating Galaxy workflows using the command line (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/workflow-automation/tutorial.html Online; accessed Sun Jul 20 2025
- Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
Congratulations on successfully completing this tutorial!