Basic Variant Finding

Background

Variant calling is the process of identifying differences between two genome samples, typically a sequenced sample and a reference. The differences reported are usually limited to single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels). Larger structural variation, such as inversions, duplications and large deletions, is not typically covered by “variant calling”.
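To make the idea concrete, here is a minimal Python sketch that reports single-base differences between two sequences. It assumes the sequences are already aligned and of equal length, which real reads are not; the function name find_snps and the example sequences are made up for illustration and are not part of the tutorial data.

```python
def find_snps(reference, sample):
    """Return (1-based position, reference base, sample base) for each mismatch.

    Toy comparison only: both sequences must already be aligned and have
    the same length, so indels cannot be detected this way.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length for this toy comparison")
    return [
        (index + 1, ref_base, sample_base)
        for index, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Hypothetical sequences, chosen so that exactly one base differs.
reference = "ACGTACGTAC"
sample = "ACGTACCTAC"
snps = find_snps(reference, sample)
print(snps)  # [(7, 'G', 'C')]
```

A real variant caller works from many overlapping, error-prone reads rather than one finished sequence, which is why the mapping and statistical steps in the rest of this tutorial are needed.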

Learning Objectives

  1. Map sequence reads against a reference
  2. Visualise the resultant BAM file of alignments
  3. Search the BAM file for variants using a SNP caller
  4. Filter the SNPs

Prepare reference

For variant calling, we need a reference genome that is of the same strain as the input sequence reads.

For this tutorial, our reference is the wildtype.fna file and our reads are mutant_R1.fastq and mutant_R2.fastq.

If these files are not already in your Galaxy history, import them from the Training dataset page.

Section 1 - Read Mapping

Map reads with BWA mem

Convert the BAM file to a SAM file

Now we want to look at the results. BWA mem produces a BAM file, which is a compressed binary format that is not human-readable. To see what it looks like, we need to convert the BAM file to a SAM file.

Now look at the contents of the resultant SAM file.
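When reading the SAM file, it helps to know that each alignment line has 11 mandatory tab-separated columns, followed by optional tags. The Python sketch below splits one line into named fields. The function names and the example line are invented for illustration; the line does not come from the tutorial data.

```python
# The 11 mandatory SAM columns, in order.
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Return a dict of the mandatory fields, or None for a header line."""
    if line.startswith("@"):  # header lines start with '@'
        return None
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    return record


def is_mapped(record):
    """Bit 0x4 of FLAG is set when the read is unmapped."""
    return not (record["FLAG"] & 0x4)


# Hypothetical alignment line, for illustration only.
example_line = "read1\t99\tcontig1\t100\t60\t10M\t=\t250\t160\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
record = parse_sam_line(example_line)
print(record["RNAME"], record["POS"], record["CIGAR"], is_mapped(record))
```

POS is the 1-based leftmost position of the alignment on the reference, and CIGAR describes how the read bases line up against it (10M means 10 aligned bases).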

View BAM file in JBrowse

Track 1 - sequence reads

JBrowse screenshot

End of part 1 of this exercise! More Slides!

Section 2 - Variant calling

Call variants in our BAM file with FreeBayes

FreeBayes will now search through each position of our BAM file and look for statistically supported variants. It uses Bayesian inference to do this. See here for details if you’re interested. (Warning! It’s complex probability theory…)
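For a rough feel of what Bayesian inference means at a single position, the Python sketch below weighs two hypotheses (the sample matches the reference, or it carries the alternate base) by how well each explains the observed reads. This is a deliberately simplified haploid toy model and is not FreeBayes's actual algorithm; the function name, the error rate and the prior are all assumed values chosen for illustration.

```python
from math import comb


def posterior_variant(alt_reads, total_reads, error_rate=0.01, prior=0.001):
    """Posterior probability that a haploid site carries the alternate base.

    Toy model: every read is an independent observation that is wrong
    with probability error_rate. The prior is the assumed chance that
    any given site is a variant before looking at the reads.
    """
    ref_reads = total_reads - alt_reads
    ways = comb(total_reads, alt_reads)
    # Binomial likelihood of the reads if the true base is the reference:
    # every alternate read must then be a sequencing error.
    likelihood_ref = ways * error_rate ** alt_reads * (1 - error_rate) ** ref_reads
    # Likelihood if the true base is the alternate: reference reads are errors.
    likelihood_alt = ways * (1 - error_rate) ** alt_reads * error_rate ** ref_reads
    evidence = prior * likelihood_alt + (1 - prior) * likelihood_ref
    return prior * likelihood_alt / evidence


# 10 of 10 reads show the alternate base: strong support for a variant.
print(posterior_variant(10, 10))
# 1 of 10 reads shows the alternate base: most likely a sequencing error.
print(posterior_variant(1, 10))
```

The point of the sketch is only that a variant is reported when the reads are much better explained by a real difference than by sequencing error, even after allowing for variants being rare.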

Once it’s complete, you’ll see a new file in your history called x: FreeBayes on data x and data x (variants). This is a VCF file, the format we discussed in the slides.

Click on the “eye” icon to view the file. You’ll notice that there is a lot of header information, followed by the variants that were found. Can you find the one we looked at earlier in our visualisation?
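If you download the VCF file, the same structure can be read programmatically: lines starting with # are header lines, and each remaining line describes one variant in tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The Python sketch below parses such lines and keeps only SNPs above a quality threshold. The example lines, the function names and the threshold of 20 are assumptions made for illustration, not values from this tutorial.

```python
def parse_vcf(lines):
    """Yield one dict per variant line, skipping header and blank lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, var_id, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
        yield {
            "CHROM": chrom,
            "POS": int(pos),
            "REF": ref,
            "ALT": alt.split(","),  # several alternate alleles are comma-separated
            "QUAL": None if qual == "." else float(qual),
            "INFO": info,
        }


def is_snp(variant):
    """A SNP replaces exactly one reference base with one alternate base."""
    return len(variant["REF"]) == 1 and all(len(a) == 1 for a in variant["ALT"])


def filter_snps(variants, min_qual=20.0):
    """Keep SNPs whose QUAL is at least min_qual (an assumed threshold)."""
    return [
        v for v in variants
        if is_snp(v) and v["QUAL"] is not None and v["QUAL"] >= min_qual
    ]


# Hypothetical VCF content, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig1\t1023\t.\tG\tA\t812.5\t.\tDP=30",
    "contig1\t2210\t.\tT\tC\t3.1\t.\tDP=4",
    "contig1\t3500\t.\tAT\tA\t640.0\t.\tDP=28",
]
kept = filter_snps(parse_vcf(example_vcf))
print([(v["POS"], v["REF"], v["ALT"]) for v in kept])  # [(1023, 'G', ['A'])]
```

In this made-up example the second variant is dropped for low quality and the third because it is a deletion rather than a SNP.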