Functional annotation of protein sequences

Author(s)

Overview
Questions:

How to perform functional annotation on protein sequences?

Objectives:

Perform functional annotation using EggNOG-mapper and InterProScan

Requirements:

Introduction to Galaxy Analyses

Time estimation: 1 hour

Level: Introductory Introductory

Supporting Materials:

Topic Overview slides

Datasets

Workflows

FAQs

instances Available on these Galaxies

docker_image Docker image
Galaxy Africa Galaxy India Galaxy@GenOuest Street Science UseGalaxy.be UseGalaxy.eu UseGalaxy.fr UseGalaxy.org.au

Last modification: Sep 28, 2022

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License The GTN Framework is licensed under MIT

Introduction

When performing the structural annotation of a genome sequence, you get the position of each gene, but you don’t have information about their name of their function. That’s the goal of functional annotation.

In this short tutorial, we will run the most commonly used tools to perform functional annotation, starting from the predicted protein sequences of a few example genes.

For a more complete view of how this step integrates into a whole genome sequencing and annotation process, you can have a look at the Funannotate tutorial.

Agenda

In this tutorial, we will cover:

Introduction

Data upload

Functional annotation

EggNOG Mapper

InterProScan

Conclusion

Data upload

We will annotate a small set of protein sequences. These sequences were predicted from the gene structures obtained in the Funannotate tutorial? Though these sequences from from a fungal species, you can run the same tools on proteins from any organisms, including prokaryotes.

Hands-on: Data upload
Create a new history for this tutorial

Click the new-history icon at the top of the history panel.

If the new-history is missing:

Click on the galaxy-gear icon (History options) on the top of the history panel

Select the option Create New from the menu
Import the files from Zenodo or from the shared data library (GTN - Material -> genome-annotation -> Functional annotation of protein sequences):
https://zenodo.org/api/files/6628a5e4-d6be-47bd-bdaa-f2646112578e/proteins.fasta
Copy the link location

Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

Select Paste/Fetch Data

Paste the link into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Shared data (top panel) then Data libraries

Navigate to the correct folder as indicated by your instructor

Select the desired files

Click on the To History button near the top and select as Datasets from the dropdown menu

In the pop-up window, select the history you want to import the files to (or create a new one)

Click on Import

Functional annotation

EggNOG Mapper

EggNOG Mapper compares each protein sequence of the annotation to a huge set of ortholog groups from the EggNOG database. In this database, each ortholog group is associated with functional annotation like Gene Ontology (GO) terms or KEGG pathways. When the protein sequence of a new gene is found to be very similar to one of these ortholog groups, the corresponding functional annotation is transfered to this new gene.

Hands-on

eggNOG Mapper Tool: toolshed.g2.bx.psu.edu/repos/galaxyp/eggnog_mapper/eggnog_mapper/2.1.8+galaxy2.1.8 with the following parameters:

param-file “Fasta sequences to annotate”: proteins.fasta (Input dataset)

“Version of eggNOG Database”: select the latest version available

In “Output Options”:

“Exclude header lines and stats from output files”: No

The output of this tool is a tabular file, where each line represents a gene from our annotation, with the functional annotation that was found by EggNOG-mapper. It includes a predicted protein name, GO terms, EC numbers, KEGG identifiers, …

Display the file and explore which kind of identifiers were found by EggNOG Mapper.

InterProScan

InterPro is a huge integrated database of protein families. Each family is characterized by one or muliple signatures (i.e. sequence motifs) that are specific to the protein family, and corresponding functional annotation like protein names or Gene Ontology (GO). A good proportion of the signatures are manually curated, which means they are of very good quality.

InterProScan is a tool that analyses each protein sequence from our annotation to determine if they contain one or several of the signatures from InterPro. When a protein contains a known signature, the corresponding functional annotation will be assigned to it by InterProScan.

InterProScan itself runs multiple applications to search for the signatures in the protein sequences. It is possible to select exactly which ones we want to use when launching the analysis (by default all will be run).

Hands-on

InterProScan Tool: toolshed.g2.bx.psu.edu/repos/bgruening/interproscan/interproscan/5.55-88.0+galaxy3 with the following parameters:

param-file “Protein FASTA File”: proteins.fasta (Input dataset)

“InterProScan database”: select the latest version available

“Use applications with restricted license, only for non-commercial use?”: Yes (set it to No if you run InterProScan for commercial use)

“Output format”: Tab-separated values format (TSV) and XML

To speed up the processing by InterProScan during this tutorial, you can disable Pfam and PANTHER applications. When analysing real data, it is adviced to keep them enabled.

When some applications are disabled, you will of course miss the corresponding results in the output of InterProScan.

The output of this tool is both a tabular file and an XML file. Both contain the same information, but the tabular one is more readable for a Human: each line represents a gene from our annotation, with the different domains and motifs that were found by InterProScan.

If you display the TSV file you should see something like this:

InterProScan TSV output

Each line correspond to a motif found in one of the annotated proteins. The most interesting columns are:

Column 1: the protein identifier
Column 5: the identifier of the signature that was found in the protein sequence
Column 4: the databank where this signature comes from (InterProScan regroups several motifs databanks)
Column 6: the human readable description of the motif
Columns 7 and 8: the position where the motif was found
Column 9: a score for the match (if available)
Column 12 and 13: identifier of the signature integrated in InterPro (if available). Have a look an example webpage for IPR036859 on InterPro.
The following columns contains various identifiers that were assigned to the protein based on the match with the signature (Gene ontology term, Reactome, …)

The XML output file contains the same information in a computer-friendly format, we will use it in the next step.

Conclusion

Congratulations for reaching the end of this tutorial! Now you know how to perform the functional annotation of a set of protein sequences, using EggNOG mapper and InterProScan.

If you want to collect more functional annotation, you can try to run the NCBI BLAST+ blastp Tool: toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_blastp_wrapper/2.10.1+galaxy2 or Diamond Tool: toolshed.g2.bx.psu.edu/repos/bgruening/diamond/bg_diamond/2.0.15+galaxy0 tools against the UniProt or NR databases (Diamond runs much faster on big datasets). These tools will search for similarities between your protein sequences and the ones already described in big international databases.

Also note that many other more specialised tools exist to collect even more functional annotation, in particular for certain species (prokaryotes forexample), or enzyme/protein families.

Key points

EggNOG Mapper compares sequences to a database of annotated orthologous sequences

InterProScan detects known motifs in protein sequences

Frequently Asked Questions

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Genome Annotation topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Anthony Bretaudeau, Functional annotation of protein sequences (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/functional/tutorial.html Online; accessed TODAY
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{genome-annotation-functional,
author = "Anthony Bretaudeau",
title = "Functional annotation of protein sequences (Galaxy Training Materials)",
year = "",
month = "",
day = ""
url = "\url{https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/functional/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                   

Funding

These individuals or organisations provided funding support for the development of this resource

See Funder Profile

This project (2020-1-NL01-KA203-064717) is funded with the support of the Erasmus+ programme of the European Union. Their funding has supported a large number of tutorials within the GTN across a wide array of topics. eu flag with the text: with the support of the erasmus programme of the european union

Congratulations on successfully completing this tutorial!