Hello, and thanks for making all the GATK tools! I have recently started to try my hand at variant calling of my RNA-seq data, following the GATK Best Practices more or less verbatim, only excluding indel alignment (because I am only interested in SNPs at this point) and the BQSR (partly because I have very high quality data, but mostly because I couldn't get it to work in the workflow).
I have three replicates for each of my samples, and my question is where, if at all, I should merge the data from them. I am not sure if I can (or even should!) merge the FASTQ files before the alignment step, or merge the aligned BAM files, or something else. I read that for aligners such as BWA the options are (more or less) equivalent, but seeing as the RNA-seq Best Practice workflow using STAR... What would be the "correct" way to do it, if at all? How would merging (at some level) affect the speed of the workflow, and can I optimise that somehow?
If it's a bad idea to do merging, how would I determine the "true" variant from my three resulting VCF-files at the end, for cases where they differ?
Hi, I am trying to call SNPs from RNA-seq data. The data that I have is a pooled sample (from the larvae of shellfish 1000s of larvae pooled together to get enough RNA) I have 6 of those samples. Can I use GATK to call SNPs in these pooled samples ?
I am using I am using GATK RNA-seq variant pipeline for finding muttaion/vatiants on the list of gene given in teh follwoing command line
java-1.7 -jar -Xincgc -Xmx1586M GenomeAnalysisTK-3.2-2.jar -T HaplotypeCaller --filter_reads_with_N_cigar -R human_genome37_gatk.fa -D dbsnp_137.hg19.vcf -I sample_split.bam -o sample.vcf -L mylist.intervals
And the resulting VCF files has for variants AF either 100 % or 50 % . It would be great if anyone would explain me what does AF means in INFO column from VCF file. I AF means allele frequency or it has to be calculated separately from VCF file and if so how can I do it..??? Thank you
I have followed the instructions found here https://www.broadinstitute.org/gatk/guide/article?id=3891 to analyze RNA-seq data. However, when I tried to validate the final bam file using ValidateSamFile tool I get the following errors:
Error Type Count ERROR:INVALID_CIGAR 43527 ERROR:MATES_ARE_SAME_END 4416796 ERROR:MATE_NOT_FOUND 4522387 ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 3675811 ERROR:MISMATCH_MATE_ALIGNMENT_START 7743225 ERROR:MISMATCH_MATE_CIGAR_STRING 7664 WARNING:MISSING_TAG_NM 75931282
It looks like the Split'N'Trim step is the cause of these errors since I validate the bam files before and after.
Can anyone tell me why I am getting these errors and how to fix them please?
I appreciate your help, Regards Hak
I am trying to run RNA-SeQC for alignment of reads against Zebrafish reference using this command:
java -Xmx24g -jar /share/pkg/RNA-SeQC/1.1.7/RNA-SeQC_v1.1.7.jar -r danRer7.fa -t danRer7/ucsc_new.gtf
-n 1000 -s 'sample1|tumor_aln_Filtered_SortedFixed_Reorder_Clean_MarkDup_AddRG.bam|RNASEQC analysis' -o metrics
After running for a while I get this error: Running GATK Depth of Coverage Analysis .... Arguments: -T DepthOfCoverage -R /home/sg15w/WES/Coel/BWAIndex/danRer7.fa -I /project/umw_michael_czech/BIOIFX-032/alignment/BWA/RealignedBamDir/tumor_aln_Filtered_SortedFixed_Reorder_Clean_MarkDup_AddRG.bam -o _metrics/sample1/highexpr//perBaseDoC.out -L _metrics/sample1/highexpr/intervals.list -l ERROR Arguments Array: [-T, DepthOfCoverage, -R, /home/sg15w/WES/Coel/BWAIndex/danRer7.fa, -I, /project/umw_michael_czech/BIOIFX-032/alignment/BWA/RealignedBamDir/tumor_aln_Filtered_SortedFixed_Reorder_Clean_MarkDup_AddRG.bam, -o, _metrics/sample1/highexpr//perBaseDoC.out, -L, _metrics/sample1/highexpr/intervals.list, -l, ERROR] org.broadinstitute.sting.utils.exceptions.UserException$MalformedGenomeLoc: Badly formed genome loc: Parameters to GenomeLocParser are incorrect:The genome loc coordinates 62098192-62098475 exceed the contig size (59938731) at org.broadinstitute.sting.utils.GenomeLocParser.vglHelper(GenomeLocParser.java:324) at org.broadinstitute.sting.utils.GenomeLocParser.validateGenomeLoc(GenomeLocParser.java:307) at org.broadinstitute.sting.utils.GenomeLocParser.createGenomeLoc(GenomeLocParser.java:265) at org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:389) at org.broadinstitute.sting.utils.interval.IntervalUtils.intervalFileToList(IntervalUtils.java:139) at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:71) at org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:616) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:583) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:233) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146) at org.broadinstitute.cga.rnaseq.gatk.GATKTools.runDoC(GATKTools.java:59) at org.broadinstitute.cga.rnaseq.PerBaseDoC.runDoC(PerBaseDoC.java:888) at org.broadinstitute.cga.rnaseq.PerBaseDoC.runDoC(PerBaseDoC.java:858) at org.broadinstitute.cga.rnaseq.RNASeqMetrics.runMetrics(RNASeqMetrics.java:264) at org.broadinstitute.cga.rnaseq.RNASeqMetrics.execute(RNASeqMetrics.java:166) at org.broadinstitute.cga.rnaseq.RNASeqMetrics.main(RNASeqMetrics.java:135) RNA-SeQC Total Runtime: 115 min
Could you please help me fix this error.
Dear GATK team,
Is there a value in cohort calling in RNA-Seq similar to what is recommended in the GATK DNA-Seq workflow? I am trying to understand why cohort calling is highly emphasized in DNA-Seq but not mentioned in the RNA-Seq workflow.
Hi, I am performing RNA-Seq to identify new polymorphisms in a species of sea star. Our short-term goal is to generate novel DNA sequences of coding genes for phylogenetic analysis. It is therefore important that polymorphisms be called accurately and that they can be phased.
Our reference genome is poorly assembled and comprises over 60,000 scaffolds and contigs. Subsequently, when paired-end RNA-Seq reads are aligned to this reference genome (using TopHat), the two halves of the pair are often mapped to different scaffolds or contigs. This seems to greatly lower the MAQ score, which in turn leads to HaplotypeCaller missing well-supported polymorphisms, because the reads that support them have MAQ values between 1 and 3.
The obvious solution for this is to set the --min-mapping-quality-score to 1 or 2, rather than the default of 20; and raising the --min_base_quality_score from the default value of 10 to maybe 25 or 30. This does, however, increase the risk of calling false positives from poorly aligned regions.
Has this situation been considered by the GATK development team, and is there a recommended way to account for it?
My question is on bwa software when one want to map RNA-seq data on the entire human genome. What should be the specific settings to use to get maximum mapping? Should it be effective if no options are used in the command line?
Thank you for your time