Hi, I'm currently trying to recalibrate base qualities. I already prepared my BAM file to include the necessary fields (RG) for GATK. When I call
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R myref.fasta -I reads.bam -knownSites empty.vcf -o recal_data.table
I get the error that the reads file is unmapped. Of course, this is not true. The BAM file is coordinate-sorted and has an associated index file (bai). The reference also has an index file (fai) and just contains a single contig. I tested renaming the index files from .bam.bai and fasta.fai to .bai and .fai but this didn't help. Furthermore, I checked the content of my BAM file to see whether the given reference sequence name matches with that in the .fasta file and it does
The .fasta looks like this (first couple of lines):
>sample_consensus_048 GCCAGCCCCCTGATGGGGGCGACACTCCACCATGAATCACTCCCCTGTGAGGAACTATTG TCTTCACGCAGAAAGCGTCTAGCCATGGCGTTAGTATGAGTGTCGTGCAGCCTCCAGGAC CCCCCCTCCCGGGAGAGCCATAGTGGTCTGCGGAACCGGTGAGTACACCGGAATTGCCAG GACGACCGGGTCCTTTCTTGGATAAACCCGCTCAATGCCTGGAGATTTGGGCGTGCCCCC
The .bam looks like this (first couple of lines, for RG info I added some placeholder values):
@HD VN:1.4 SO:coordinate @SQ SN:sample_consensus_048 LN:9374 @RG ID:1 LB:UsedLibrary PL:Illumina SM:ReadGroupSampleName PU:BARCODE @PG ID:smalt PN:smalt VN:0.7.4 CL:smaltcommand MISEQ-02:155:000000000-AF9CT:1:1101:7745:13826 99 sample_consensus_048 3499 60 218M = 3861 486 ACCAAGTGGAGGGTGAGGTCCAGATTGTGTCAACTGCTGCCCAGACTTTCCTGGCAACGTGCATCAATGGGGTGTGCTGGACCGTCTACCACGGGGCGGGAACGAGGACCATTGCATCACCCAAGGGTCCTGTCATACAGATGTACACCAATGTAGACAAAGACCTTGTAGGCTGGCCCGCTCCTCAAGGTACCCGCTCACTGACACCCTGCACCTGC CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGAFGGGGGGGGGGGGGGGGGGFFGGGGGDEEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGFFGGCFGGGGGGGGECEGGGGGGFGGFEGFGGGGFGGGGGGCGEGGGGGGG:8DFGFGGFGFB:(9C<6+<DF@F;534;@@FEF RG:Z:1 NM:i:0 AS:i:218 MISEQ-02:155:000000000-AF9CT:1:1103:24094:6940 99 sample_consensus_048 3499 60 213M = 3871 487 ACCAAGTGGAGGGTGAGGTCCAGATTGTGTCAACTGCTGCCCAGACTTTCCTGGCAACGTGCATCAATGGGGTGTGCTGGACCGTCTACCACGGGGCGGGAACGAGGACCATTGCATCACCCAAGGGTCTCGTCATACAGATGTACACCAATGTAGACAAAGACCTTGTAGGCTGGCCCGGTCCTCAAGGTACCCGCTCACTGACACCCTGCA CCCCCFEFGDFEGEGGGGGEFGGGGGGGGGGFFGFGFGC96EEGCG<C<EFGFFCCGGGFF@GGAFFGGGGDGGEC@FFGG<CFCFGFGGGGCG:FFFGE7FCGDGGC@F<=BDAE<FGG83<<=D7@,,73:CGC,EAA;,>3EFAE,<8@FFGFFFGC3D@BBF9=3,@EF;?C*>=C*=1:+?9<>FCFAGFG*0/3++2=*2;D68=?D RG:Z:1 NM:i:3 AS:i:204
At the moment I just use an empty VCF file, as it doesn't make any difference and I wanted to exclude the structure of the VCF as an error source (backslashes before bashes are just inserted for formatting on the forum).
\##fileformat=VCFv4.0 \##fileDate=20150730 \##source=lofreq_cmd \##reference=ref_seq.fasta \##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw Depth"> \##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Frequency"> \##INFO=<ID=SB,Number=1,Type=Integer,Description="Phred-scaled strand bias at this position"> \##INFO=<ID=DP4,Number=4,Type=Integer,Description="Counts for ref-forward bases, ref-reverse, alt-forward and alt-reverse bases"> \##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL."> \##INFO=<ID=CONSVAR,Number=0,Type=Flag,Description="Indicates that the variant is a consensus variant (as opposed to a low frequency variant)."> \##INFO=<ID=HRUN,Number=1,Type=Integer,Description="Homopolymer length to the right of report indel position"> \##FILTER=<ID=min_snvqual_49,Description="Minimum SNV Quality (Phred) 49"> \##FILTER=<ID=min_indelqual_20,Description="Minimum Indel Quality (Phred) 20"> \#CHROM POS ID REF ALT QUAL FILTER INFO
I don't have any problems with my BAM file in any other usage scenario, e.g., using samtools or Biopython, so I'm not sure why GATK throws an error about unmapped reads. I would be really grateful if someone had an idea what might cause the problem with my bam/reference/vcf! I could only find one post where a person had the same error, but I didn't find anything that could help me.
I'm quite new to SNP calling. I am trying to setup a pipeline which includes GATK IndelRealigner as a final step. My bam file (before realignment) is a little over 1GB. After running the indel realigner however, it's reduced to 18MB! I'm assuming its throwing out way too many reads or something has gone wrong.
I'm calling the indel realigner with the default options as follows:
java -Xmx16g -jar $GATK_DIR/GenomeAnalysisTK.jar \ -T IndelRealigner \ -R /path/to/my/ref \ -I input.bam.intervals \ -targetIntervals input.bam.intervals \ -o realn.bam \
I am generating the read groups using
AddOrReplaceReadGroups.jar (from picard tools) and interval file using GATK
RealignerTargetCreator with default options.
My bam file was generated off the raw reads of experiment
SRA181417 fetched from SRA (after cleaning adapters using cutadapt, mapping to reference using bwa-mem, and removing duplicate reads using picard tools)
I have tried this on other reads and do not have the same issue. Can anyone comment on why indel realigner could be throwing out so many reads.
I don't see any information from this post http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk,
I was wondering if it's possible to get the number of forward/reverse reads in the final VCF outputted by HaplotypeCaller?
Hi, Does GATK2 provide a walker/option to summarize the read alignment in a given BAM file? The summary including total reads, reads mapped/%, reads uniquely mapped/%, reads uniquely mapped with 0mm/%, reads mapped on-target/%, reads uniquely mapped on-target%, etc is of great use to assess the mapping quality for whole genome or targeted analysis. Please advice me on how I can obtain this using any of the walkers available. Thanks, Raj