Basic Variant Finding
Variant calling is the process of identifying differences between two genome samples. These differences are usually limited to single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels). Larger structural variants, such as inversions, duplications and large deletions, are not typically covered by “variant calling”.
- Map sequence reads against a reference
- Visualise the resultant BAM file of alignments
- Search the BAM file for variants using a SNP caller
- Filter the SNPs
For variant calling, we need a reference genome that is of the same strain as the input sequence reads.
For this tutorial, our reference is the
If these files are not presently in your Galaxy history, import them from the Training dataset page.
Section 1 - Read Mapping
Map reads with BWA mem
- Go to Tools → NGS Analysis → NGS: Mapping → Map with BWA mem
- For “Will you select a reference genome from your history or use a built-in index?” select “Use a genome from history…”.
- Then for “Reference Sequence” choose the wildtype.fna file.
- For “Single or Paired-end reads” choose Paired.
- Then choose the first set of reads, mutant_R1.fastq, and the second set of reads, mutant_R2.fastq.
- Leave everything else as default.
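For readers curious about what this Galaxy step roughly corresponds to on the command line, the sketch below builds the equivalent `bwa` invocations in Python. It is an illustration only: it assumes the `bwa` program is installed separately, the helper name `bwa_mem_commands` is hypothetical, and the Galaxy wrapper may add options (and converts the output to BAM) that are not shown here.

```python
def bwa_mem_commands(reference, reads_1, reads_2):
    """Return the two bwa command lines for paired-end mapping.

    Hypothetical helper: it only builds the argument lists and does not
    run anything, so bwa does not need to be installed to try it.
    """
    # Step 1: build the index that bwa needs for the reference sequence.
    index_cmd = ["bwa", "index", reference]
    # Step 2: map both read files; bwa mem writes SAM to standard output.
    mem_cmd = ["bwa", "mem", reference, reads_1, reads_2]
    return index_cmd, mem_cmd


index_cmd, mem_cmd = bwa_mem_commands(
    "wildtype.fna", "mutant_R1.fastq", "mutant_R2.fastq"
)
print(" ".join(index_cmd))
print(" ".join(mem_cmd))
```

Passing these lists to `subprocess.run` would execute them, but only on a machine where `bwa` is available.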
Convert the BAM File to a SAM file.
Now we want to look at the results. BWA mem will produce a BAM file. This is a compressed binary format that is not human-readable, so to see what it looks like we need to convert the BAM file to a SAM file.
- Go to Tools → NGS: Sam tools → BAM-to-SAM
- Select the BAM file we just produced above and click
Now look at the contents of the resultant SAM file.
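To help read the SAM file, here is a minimal Python sketch that splits one alignment line into its main tab-separated columns. The field order follows the SAM format's eleven mandatory columns. The `SamRecord` and `parse_sam_line` names are hypothetical, and the example line is made up rather than taken from the tutorial data.

```python
from typing import NamedTuple, Optional


class SamRecord(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flags (paired, reverse strand, ...)
    rname: str  # name of the reference sequence the read mapped to
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # alignment description, e.g. 10M = 10 aligned bases
    seq: str    # read sequence
    qual: str   # per-base qualities, encoded as characters


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Parse one alignment line; header lines start with '@' and are skipped."""
    if line.startswith("@"):
        return None
    f = line.rstrip("\n").split("\t")
    # Columns 7-9 (mate reference, mate position, template length) are ignored.
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9], f[10])


# Made-up example line for illustration only.
example = "read1\t99\tcontig1\t100\t60\t10M\t=\t250\t160\tACGTACGTAC\tIIIIIIIIII"
rec = parse_sam_line(example)
print(rec.rname, rec.pos, rec.cigar)

# Bit 0x10 of the flag is set when the read maps to the reverse strand.
is_reverse = bool(rec.flag & 0x10)
print("reverse strand:", is_reverse)
```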
View BAM file in JBrowse
- Go to Statistics and Visualisation → Graph/Display Data → JBrowse
- For “Fasta Sequence(s)” choose wildtype.fna. This sequence will be the reference against which the BAM file is displayed.
- For “Produce a Standalone Instance” select Yes.
- For “Genetic Code” choose 11: The Bacterial, Archaeal and Plant Plastid Code.
We will now set up a new track to display the sequence reads (the .bam file).
Track 1 - sequence reads
- Click “Insert Track Group”
- For “Track Category” name it “sequence reads”
- Click “Insert Annotation Track”
- For “Track Type” choose BAM Pileups
- For “BAM Track Data” select the BAM file
- For “Autogenerate SNP Track” select Yes
A new file will be created, called JBrowse on data XX and data XX - Complete. Click on the eye icon next to the file name. The JBrowse window will appear in the centre Galaxy panel.
On the left, tick the boxes to display the tracks.
Use the minus button to zoom out to see:
- sequence reads and their coverage (the grey graph)
Use the plus button to zoom in to see:
- probable real variants (a whole column of SNPs)
- probable errors (a single mismatch here and there)
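The visual rule of thumb above (a full column of mismatches versus an isolated one) can be sketched in Python as a simple fraction test on one alignment column. This is only an illustration of the idea, not how JBrowse or any variant caller decides: the function names and both thresholds (0.9 and 0.1) are assumptions chosen for the example.

```python
from collections import Counter


def alt_fraction(column_bases, ref_base):
    """Fraction of reads in one pileup column that differ from the reference."""
    counts = Counter(column_bases)
    depth = sum(counts.values())
    if depth == 0:
        return 0.0
    return (depth - counts[ref_base]) / depth


def classify_column(column_bases, ref_base, high=0.9, low=0.1):
    """Label a column using two hypothetical thresholds on the mismatch fraction."""
    frac = alt_fraction(column_bases, ref_base)
    if frac == 0:
        return "matches reference"
    if frac >= high:
        return "probable variant"
    if frac <= low:
        return "probable error"
    return "ambiguous"


# Every read shows A where the reference has T: a whole column of mismatches.
whole_column = classify_column("A" * 20, "T")
# Only one read out of twenty disagrees with the reference.
single_mismatch = classify_column("T" * 19 + "A", "T")
print(whole_column)
print(single_mismatch)
```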
In the coordinates box, type in 47299 and then click “Go” to see the position of the SNP discussed above.
- the correct codon at this position is TGT, coding for the amino acid Cysteine, in the middle row of the amino acid translations.
- the mutation of T → A turns this triplet into TGA, a stop codon.
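The effect of this substitution can be checked with a few lines of Python. The sketch below uses a deliberately tiny codon table containing only the four TG* codons (a full table has 64 entries). The assignments shown are those of genetic code 11, which was selected earlier. The `mutate` helper is a hypothetical name.

```python
# Partial codon table: only the TG* codons, as assigned in genetic code 11.
CODON_TABLE = {"TGT": "Cys", "TGC": "Cys", "TGA": "Stop", "TGG": "Trp"}


def mutate(codon, index, new_base):
    """Return the codon with the base at the given 0-based index replaced."""
    return codon[:index] + new_base + codon[index + 1:]


wild = "TGT"
mutant = mutate(wild, 2, "A")  # T -> A at the third codon position

print(wild, "->", CODON_TABLE[wild])
print(mutant, "->", CODON_TABLE[mutant])
```

Because the new codon is a stop codon, translation would end at this position, producing a truncated protein.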
End of part 1 of this exercise! More Slides!
Section 2 - Variant calling
Call variants in our BAM file with FreeBayes
- Go to Tools → NGS: Variant Analysis → FreeBayes
- For “Load reference genome from” select History
- For “BAM file” select our BAM file, x: Map with BWA-MEM on data xx, data xx, and data xx (mapped reads in BAM format)
- For “Use the following dataset as the reference sequence” select x: Wildtype.fna
- Leave everything else as default
FreeBayes will now search through each position of our BAM file and look for statistically supported variants. It uses Bayesian inference to do this. See the FreeBayes documentation for details if you’re interested. (Warning! It’s complex probability theory…)
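To give a flavour of the Bayesian idea, here is a toy Python sketch for a single position in a haploid genome. It is explicitly not the FreeBayes model, which is considerably more elaborate. The assumptions are that each read independently shows the wrong base with probability `error_rate`, and that a variant has prior probability `prior_alt`. The function name and both default values are invented for illustration.

```python
import math


def posterior_alt(n_reads, n_alt, error_rate=0.01, prior_alt=0.001):
    """Posterior probability that the sample carries the alternate base.

    Toy model with two hypotheses (reference base or alternate base) and
    independent per-read errors. Not the FreeBayes algorithm.
    """
    n_ref = n_reads - n_alt
    # Log of prior times likelihood under each hypothesis.
    log_alt = (math.log(prior_alt)
               + n_alt * math.log(1 - error_rate)
               + n_ref * math.log(error_rate))
    log_ref = (math.log(1 - prior_alt)
               + n_alt * math.log(error_rate)
               + n_ref * math.log(1 - error_rate))
    # Subtract the maximum before exponentiating to avoid numeric underflow.
    m = max(log_alt, log_ref)
    a = math.exp(log_alt - m)
    b = math.exp(log_ref - m)
    return a / (a + b)


all_alt = posterior_alt(30, 30)  # every read supports the variant
one_alt = posterior_alt(30, 1)   # a single read disagrees with the reference
print(f"30 of 30 reads: {all_alt:.6f}")
print(f"1 of 30 reads: {one_alt:.6g}")
```

Even with a small prior, consistent evidence from many reads pushes the posterior towards 1, while a lone mismatch is explained far better by a sequencing error.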
Once it’s complete you’ll see a new file in your history called x: FreeBayes on data x and data x (variants). This is a VCF file, the format we discussed in the slides.
Click on the “eye” icon to view the file. You’ll notice that there is a lot of header information followed by some found variants. Can you find the one we looked at earlier in our visualisation?
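If you would rather search the VCF programmatically, a minimal Python sketch is shown below. It skips the header lines (which start with `#`), reads the first six tab-separated columns of each record, and looks for position 47299. The contig name, the quality values, the second record and the QUAL threshold of 30 are all made up for illustration. Only the position and the T → A change come from this tutorial.

```python
def read_vcf_records(lines):
    """Parse VCF data lines into dictionaries, skipping header lines."""
    records = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
        records.append({
            "chrom": chrom,
            "pos": int(pos),
            "ref": ref,
            "alt": alt,
            # VCF uses "." for a missing quality value.
            "qual": None if qual == "." else float(qual),
        })
    return records


# Made-up VCF text for illustration only.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "contig1\t47299\t.\tT\tA\t1000.0\t.\t.\n"
    "contig1\t50000\t.\tG\tC\t5.0\t.\t.\n"
)
records = read_vcf_records(vcf_text.splitlines())

# Find the variant seen earlier in the visualisation.
hits = [r for r in records if r["pos"] == 47299]
print(hits)

# Filtering by quality: 30 is a hypothetical cut-off, not a recommendation.
confident = [r for r in records if r["qual"] is not None and r["qual"] >= 30]
print(len(confident), "record(s) pass the filter")
```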