Dear GATK team,
Is there value in cohort calling for RNA-Seq, similar to what is recommended in the GATK DNA-Seq workflow? I am trying to understand why cohort calling is strongly emphasized for DNA-Seq but not mentioned in the RNA-Seq workflow.
Hi, I am performing RNA-Seq to identify new polymorphisms in a species of sea star. Our short-term goal is to generate novel DNA sequences of coding genes for phylogenetic analysis. It is therefore important that polymorphisms be called accurately and that they can be phased.
Our reference genome is poorly assembled and comprises over 60,000 scaffolds and contigs. Consequently, when paired-end RNA-Seq reads are aligned to this reference genome (using TopHat), the two mates of a pair are often mapped to different scaffolds or contigs. This seems to greatly lower the mapping quality (MAPQ) score, which in turn leads to HaplotypeCaller missing well-supported polymorphisms, because the reads that support them have MAPQ values between 1 and 3.
The obvious solution is to set --min-mapping-quality-score to 1 or 2, rather than the default of 20, and to raise --min_base_quality_score from the default of 10 to perhaps 25 or 30. This does, however, increase the risk of calling false positives in poorly aligned regions.
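To make the trade-off concrete, here is a minimal Python sketch of how the two thresholds change which reads count as evidence at a single site. It is only an illustration on made-up data: the `Read` class, the `count_support` function, and the toy pileup are hypothetical, and this is not how HaplotypeCaller filters reads internally.

```python
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    mapq: int           # mapping quality of the alignment
    base_qual: int      # Phred quality of the base covering the site
    supports_alt: bool  # True if the base matches the alternate allele


def count_support(reads, min_mapq, min_base_qual):
    """Return (ref_count, alt_count) over reads passing both thresholds."""
    ref = alt = 0
    for r in reads:
        # Skip reads that fail either the mapping or the base quality cutoff.
        if r.mapq < min_mapq or r.base_qual < min_base_qual:
            continue
        if r.supports_alt:
            alt += 1
        else:
            ref += 1
    return ref, alt


# Toy pileup (invented values): alt-supporting reads have low mapping
# quality because their mates landed on a different scaffold.
reads = [
    Read("r1", 50, 35, False),
    Read("r2", 50, 33, False),
    Read("r3", 3, 36, True),
    Read("r4", 2, 34, True),
    Read("r5", 1, 12, True),
    Read("r6", 3, 38, True),
]

# Defaults described above: mapping quality >= 20, base quality >= 10.
default_counts = count_support(reads, min_mapq=20, min_base_qual=10)
# Relaxed mapping quality, stricter base quality.
relaxed_counts = count_support(reads, min_mapq=1, min_base_qual=25)

print("default (ref, alt):", default_counts)
print("relaxed (ref, alt):", relaxed_counts)
```

With the default cutoffs none of the alt-supporting reads in this toy pileup survive, whereas the relaxed mapping quality cutoff lets most of them through, and the stricter base quality cutoff only removes the one with a low-quality base.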
Has this situation been considered by the GATK development team, and is there a recommended way to account for it?
My question is about using bwa to map RNA-seq data to the entire human genome. Which specific settings should be used to maximize mapping? Would it be effective to run it with no options on the command line?
Thank you for your time