We know this field can be confusing or even overwhelming to newcomers, and getting to grips with a large and varied toolkit like the GATK can be a big challenge. We have produced a presentation that we hope will help you review all the background information that you need to know in order to use the GATK:
In addition, the following links feature a lot of useful educational material about concepts and terminology related to next-generation sequencing:
A basic review of the sequencing process.
An excellent, detailed overview of the myriad next-gen sequencing methodologies.
A nice piece explaining the problems inherent in trying to analyze terabytes of data. The GATK addresses this issue by requiring all datasets be in reference order, so only small chunks of the genome need to be in memory at once, as explained here.
Imagine a simple question like, "What's the depth of coverage at position A of the genome?"
First, suppose you are given billions of reads that are aligned to the genome but not ordered in any particular way (except perhaps in the order they were emitted by the sequencer). This simple question is then very difficult to answer efficiently, because the algorithm is forced to examine every single read in succession, since any one of them might span position A. Computing this single value can take several hours.
Instead, imagine the billions of reads are now sorted in reference order (that is to say, on each chromosome, the reads are stored on disk in the same order they appear on the chromosome). Now, answering the question above is trivial, as the algorithm can jump to the desired location, examine only the reads that span the position, and return immediately after those reads (and only those reads) are inspected. The total number of reads that need to be interrogated is only a handful, rather than several billion, and the processing time is seconds, not hours.
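To make the contrast concrete, here is a minimal sketch in Python. It is not how the GATK or BAM indexing is actually implemented; it only illustrates why sorting helps. The assumptions are ours: reads are modeled as hypothetical (start, end) tuples with inclusive coordinates on a single chromosome, and a known maximum read length (max_read_len) bounds how far back the search must look.

```python
from bisect import bisect_left, bisect_right

def depth_unsorted(reads, pos):
    """Unordered reads: every read must be examined, since any could span pos."""
    return sum(1 for start, end in reads if start <= pos <= end)

def depth_sorted(reads, starts, pos, max_read_len):
    """Reads sorted by start coordinate; starts is the parallel list of starts.

    Assumption: no read is longer than max_read_len, so only reads starting in
    [pos - max_read_len + 1, pos] can possibly span pos.
    """
    lo = bisect_left(starts, pos - max_read_len + 1)   # first candidate read
    hi = bisect_right(starts, pos)                     # one past the last candidate
    return sum(1 for start, end in reads[lo:hi] if end >= pos)

# Toy data, already in reference order (inclusive coordinates).
reads = [(1, 5), (3, 7), (4, 8), (10, 14), (12, 16)]
starts = [start for start, _ in reads]

print(depth_sorted(reads, starts, 4, max_read_len=5))  # prints 3
```

Both functions return the same depth, but the sorted version locates the candidate reads with two binary searches and inspects only that small slice, whereas the unsorted version touches every read.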
This reference-ordered sorting enables the GATK to process terabytes of data quickly and without tremendous memory overhead. Most GATK tools run very quickly and with less than 2 gigabytes of RAM. Without this sorting, the GATK cannot operate correctly. Reference ordering is therefore a fundamental rule of working with the GATK, and it is the basis for the Central Dogma of the GATK:
Can we all agree that this is 2016 and next-generation sequencing is really just sequencing at this point?
Seriously, I was in college when NGS was becoming a thing. In techno-geological terms, that was the Cretaceous. Yet over a decade later, this super-vague term is somehow still stuck in our collective consciousness.
I'm not the one to say what the real next generation of sequencing is; maybe Nanopore and all that exotic long-read tech. My point is that calling the current generation of sequencing technology next-gen or NGS is embarrassingly retrograde and we should all stop*.
*I'm sure we have some old articles in our docs that use the term NGS; if you point them out to me, I'll fix them.
Of course, there's still Sanger sequencing and we want to be able to tell the difference -- but really, isn't it Sanger that is the oddball now, and the rest is just regular sequencing? Well, if we must -- hey look we have a technically-accurate term, it's called high-throughput sequencing. It even comes with a reasonably snappy three-letter abbreviation to slap in titles and on posters where space is at a premium: HTS (putting the hts in htsjdk).
Alright, rant over. Until next time.
As part of a variant calling pipeline, I'm interested in lowering the allele frequency threshold in GATK's HaplotypeCaller variant caller to 0.01 (1%), if possible. If not, is there another variant calling tool that doesn't filter out variants with low allele frequencies?
Background: I know for a fact that there's at least one variant in my sample that doesn't appear in my VCF file if allele frequencies below 0.1 (10%) are filtered out during variant calling, as is the default in some programs. I can see the variant when I inspect the corresponding BAM file with samtools tview, and another lab called that specific variant too.
Thanks in advance, Alon