QC Methods
All next-generation sequencing data run through GSA-Firehose is subjected to strict quality control procedures. These procedures include verifying that minimum coverage has been achieved per lane and checking fingerprint concordance, the number of SNPs called per lane, the error rate per cycle, the depth of coverage per lane and per target, and various metrics on the variants themselves. These checks are applied uniformly to all projects passed through the GSA-Firehose pipeline.
Fingerprint Validation
All samples are genotyped on a 24-site fingerprint assay at known, variable sites. For targeted sequencing experiments, additional capture probes designed specifically to recover these sites are added to the larger set of capture probes for the actual target area (i.e. the exome). After sequencing, we call SNPs in each lane and check that the calls at fingerprint sites are both confident (i.e. there is sufficient coverage at the fingerprint sites) and consistent with the previously observed genotype. We use the number of confident calls made in the panel, and how many of those calls match the expected genotype, to confirm that identity verification is working as intended. When we see bad fingerprint data, we can typically attribute it to one of the following errors:
- Few confidently called sites: this generally indicates that the lab did not spike in the fingerprint baits.
- Confidently called sites which do not match the expected genotype: this indicates identity failure.
- No fingerprint data available: this indicates that fingerprint data was not available at the time the lanes were run through the analysis pipeline.
The table below gives example fingerprint panel results:
| Lane | Confidently called sites | Confidently called matching sites | Lane fingerprint status |
|---|---|---|---|
| 1 | 24 | 23 | Good fingerprint match |
| 2 | 23 | 8 | Identity failure |
| 3 | 4 | 3 | Missing baits |
| 4 | - | - | Lack of fingerprint data at Picard processing step |
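The three failure modes above can be sketched as a small classifier over the two counts in the table. This is only an illustration: the cutoffs `MIN_CONFIDENT_SITES` and `MIN_MATCH_FRACTION` are hypothetical values chosen to reproduce the example rows, not the thresholds the pipeline actually uses.

```python
from typing import Optional

PANEL_SIZE = 24  # number of sites in the fingerprint assay (from the text)

# Hypothetical cutoffs, picked only so the example table is reproduced.
MIN_CONFIDENT_SITES = 12
MIN_MATCH_FRACTION = 0.9


def fingerprint_status(confident: Optional[int], matching: Optional[int]) -> str:
    """Classify one lane from its fingerprint panel counts."""
    # No counts at all: fingerprint data was unavailable at processing time.
    if confident is None or matching is None:
        return "Lack of fingerprint data"
    if not 0 <= matching <= confident <= PANEL_SIZE:
        raise ValueError("expected 0 <= matching <= confident <= PANEL_SIZE")
    # Few confident calls: the fingerprint baits were probably not spiked in.
    if confident < MIN_CONFIDENT_SITES:
        return "Missing baits"
    # Confident calls that disagree with the expected genotype.
    if matching / confident < MIN_MATCH_FRACTION:
        return "Identity failure"
    return "Good fingerprint match"


# The four example lanes from the table above.
lanes = {1: (24, 23), 2: (23, 8), 3: (4, 3), 4: (None, None)}
for lane, (confident, matching) in lanes.items():
    print(lane, fingerprint_status(confident, matching))
```

Checking the number of confident sites before the match fraction matters: a lane with only four confident calls cannot support an identity decision either way, so it is reported as missing baits rather than as a match or a failure.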
In general, when the fingerprint panel fails, we try to ensure that a project has another reliable mechanism for identity verification. Identity QC is particularly important before SNP calling occurs, especially in large datasets. Because our SNP caller uses all samples at once to call SNPs, removing the effect of any one sample requires running the entire SNP calling process again. This is computationally intensive, so we prefer to make calls as few times as possible. We therefore exclude any sample whose identity we cannot verify, to avoid later reprocessing of sample sets that contain unidentified samples.
SNPs called per lane
As an additional QC check, we call SNPs in each targeted sequencing lane, independent of other lanes. We expect SNPs to occur at a consistent rate across samples and lanes. When the SNP total for a lane differs by orders of magnitude from the expected rate of SNP calling (see plot at right), that lane is excluded. We are especially strict about lanes with very large numbers of SNPs, as these have the potential to dramatically increase the false positive rate. Lanes showing unusually low numbers of SNPs are often left in unless project analysts have good reasons to remove them, as they are less likely to decrease the specificity of the callset.
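A minimal sketch of this check, assuming per-lane SNP counts are available as a dictionary. The use of the median as the "expected" count and the one-order-of-magnitude cutoff `MAX_LOG10_DEVIATION` are assumptions made for illustration, and the lane counts are invented.

```python
import math
from statistics import median

# Hypothetical cutoff: flag lanes whose SNP count differs from the median
# lane by at least one order of magnitude (a factor of 10).
MAX_LOG10_DEVIATION = 1.0


def flag_snp_outliers(snps_per_lane):
    """Return {lane: "high" | "low"} for lanes far from the median SNP count."""
    typical = median(snps_per_lane.values())
    if typical <= 0:
        raise ValueError("median SNP count must be positive")
    flags = {}
    for lane, count in snps_per_lane.items():
        # Treat zero-SNP lanes as infinitely low instead of taking log10(0).
        deviation = math.log10(count / typical) if count > 0 else -math.inf
        if deviation >= MAX_LOG10_DEVIATION:
            flags[lane] = "high"  # candidates for exclusion (false positive risk)
        elif deviation <= -MAX_LOG10_DEVIATION:
            flags[lane] = "low"   # usually kept unless analysts object
    return flags


# Invented counts for illustration only.
counts = {"lane1": 5200, "lane2": 4900, "lane3": 61000, "lane4": 310, "lane5": 5050}
print(flag_snp_outliers(counts))
```

The two flags are kept separate because the text treats them differently: high-count lanes are excluded, while low-count lanes are normally retained.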
Error rate per cycle
Lanes showing particularly high error rates may be excluded as low quality, especially if this reduces overall callset quality (as measured by other factors such as the false positive rate) and/or these lanes fail other QC metrics. Generally the low-quality bases from these lanes are excluded during analysis anyway by our standard Q20 base quality score filter. We do expect a small increase in error rate in later cycles, but lanes showing high error rates in early cycles are removed when necessary.
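To make the Q20 filter concrete: a Phred quality score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1% chance that the base call is wrong. The sketch below pairs that conversion with a simple early-cycle check. The window `EARLY_CYCLES` and the cutoff `MAX_EARLY_ERROR_RATE` are hypothetical, and the per-cycle error rates are invented.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


# Hypothetical settings for illustration; not the pipeline's actual values.
EARLY_CYCLES = 25            # how many leading cycles count as "early"
MAX_EARLY_ERROR_RATE = 0.01  # same error probability as Q20


def has_high_early_error(error_rate_by_cycle) -> bool:
    """True if the mean error rate over the early cycles exceeds the cutoff."""
    early = error_rate_by_cycle[:EARLY_CYCLES]
    if not early:
        raise ValueError("no per-cycle error rates supplied")
    return sum(early) / len(early) > MAX_EARLY_ERROR_RATE


# Invented lanes: one with the expected slow rise in later cycles,
# one with a burst of errors in the first ten cycles.
typical_lane = [0.002 + 0.0001 * cycle for cycle in range(76)]
bad_start_lane = [0.03] * 10 + [0.005] * 66

print(phred_to_error_probability(20))
print(has_high_early_error(typical_lane), has_high_early_error(bad_start_lane))
```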
Coverage
Sequencing for a sample is considered "complete" when >80% of the target area is covered to >20x. After this minimum threshold is met, we consider coverage in a number of other ways, flagging deviations from expectations for further investigation on a case-by-case basis. Gross coverage consistency is examined across all targets in all samples; targets with high G-C content are generally expected to be less well covered. Samples with an unusual number of failed targets may be excluded from further analysis. If there are particular genes of interest, we look at coverage on a by-gene basis to ensure that they meet project-set standards for depth. Coverage is also considered by lane; however, lanes with higher coverage generally compensate for lanes from the same sample with low coverage. When coverage is lower than desired at any of these levels, the overall quality of the callset in low-coverage regions is flagged and evaluated more carefully on a case-by-case basis.
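The completeness rule can be written down directly. This sketch assumes per-base depths over the target area are available as a list of integers (the input layout is an assumption) and reads both thresholds as strict inequalities, as the text states them.

```python
MIN_DEPTH = 20       # ">20x" from the text
MIN_FRACTION = 0.80  # ">80% of the target area" from the text


def fraction_covered(depths, min_depth=MIN_DEPTH):
    """Fraction of target bases whose depth is strictly greater than min_depth."""
    if not depths:
        raise ValueError("no target bases supplied")
    return sum(1 for depth in depths if depth > min_depth) / len(depths)


def is_complete(depths):
    """Apply the completeness rule: more than 80% of bases above 20x."""
    return fraction_covered(depths) > MIN_FRACTION


# Invented depths: 85 of 100 target bases at 35x, the rest at 10x.
sample_depths = [35] * 85 + [10] * 15
print(fraction_covered(sample_depths), is_complete(sample_depths))
```

The same `fraction_covered` helper could be applied per target, per gene, or per lane by passing the corresponding subset of depths.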
SNP callset metrics
After a final set of high-quality lanes is identified, we make final SNP calls on the full set. The final callset is evaluated in several ways. First, we calculate the number of variant sites. This is expected to be roughly consistent (+/- 10%) with the neutral expectation (larger deviations are expected for projects that involve many related samples). We also check the ratio of transition mutations to transversion mutations (Ti/Tv), as this quantity has specific expected values in various regions of the genome. Based on empirical measurements from HapMap3 and SNPs from the 1000 Genomes Project, we expect a Ti/Tv ratio of around 2.3 across whole genome sequences. In coding regions, we observe known variants to have a Ti/Tv of ~3.3, while novel variants typically stabilize around 2.8-3.0 (novel variants are enriched for transversions as these variants are much more likely to be recently acquired, low-frequency, deleterious mutations). When the Ti/Tv ratio is substantially lower than expected, this indicates a lower quality callset containing more false positives (see this slide for a detailed explanation of the reasoning behind this).
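For reference, a transition swaps a purine for a purine (A<->G) or a pyrimidine for a pyrimidine (C<->T); every other single-base change is a transversion. A minimal sketch of the Ti/Tv calculation, assuming the callset is reduced to a list of biallelic (ref, alt) pairs, which is an illustrative input format rather than the pipeline's own:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T changes, False for transversions."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ti_tv_ratio(snps):
    """Ratio of transitions to transversions over a list of (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("Ti/Tv is undefined without transversions")
    return transitions / transversions


# Invented calls: four transitions and two transversions.
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(ti_tv_ratio(calls))
```

Random errors are expected to produce a lower ratio than real variation, because each base has two possible transversions but only one possible transition. This is why false positives pull the observed Ti/Tv down.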
