Revert a BAM file back to FastQ. This comes in handy when you receive data that has been processed but not according to GATK Best Practices, and you want to reset and reprocess it properly.
Shuffle the reads in the bam file so they are not in a biased order before alignment by running the following HTSlib command:
htscmd bamshuf -uOn 128 aln_reads.bam tmp > shuffled_reads.bam
This creates a new BAM file containing the original reads, which still retain their mapping information, but now they are no longer sorted.
The aligner uses blocks of paired reads to estimate the insert size. If you don’t shuffle your original bam, the blocks of insert size will not be randomly distributed across the genome, rather they will all come from the same region, biasing the insert size calculation. This is a very important step which is unfortunately often overlooked.
Revert the BAM file to FastQ format by running the following HTSlib command:
htscmd bam2fq -a shuffled_reads.bam > interleaved_reads.fq
This creates an interleaved FastQ file called
interleaved_reads.fq containing the now-unmapped paired reads.
Interleaved simply means that for each pair of reads in your paired-end data set, both the forward and the reverse reads are in the same file, as opposed to having them in separate files.
Compress the FastQ file to reduce its size using the gzip utility:
This creates a gzipped FastQ file called
interleaved_reads.fq.gz. This file is ready to be used as input for the Best Practices workflow.
BWA handles gzipped fastq files natively, so you don’t need to unzip the file to use it later on.
If you’re feeling adventurous, you can do all of the above with this beautiful one-liner, which will save you a heap of time that the program would otherwise spend performing I/O (loading in and writing out data to/from disk):
htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz
The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.
The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.
All BAM files must satisfy the following requirements:
See the BAM specification for more information.
It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:
Human genome reference consortium standard ordering and names (b3x): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...
UCSC convention (hg1x): chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...
The easiest way to do it is to download Samtools and run the following command to examine the header of your file:
$ samtools view -H /path/to/my.bam @HD VN:1.0 GO:none SO:coordinate @SQ SN:1 LN:247249719 @SQ SN:2 LN:242951149 @SQ SN:3 LN:199501827 @SQ SN:4 LN:191273063 @SQ SN:5 LN:180857866 @SQ SN:6 LN:170899992 @SQ SN:7 LN:158821424 @SQ SN:8 LN:146274826 @SQ SN:9 LN:140273252 @SQ SN:10 LN:135374737 @SQ SN:11 LN:134452384 @SQ SN:12 LN:132349534 @SQ SN:13 LN:114142980 @SQ SN:14 LN:106368585 @SQ SN:15 LN:100338915 @SQ SN:16 LN:88827254 @SQ SN:17 LN:78774742 @SQ SN:18 LN:76117153 @SQ SN:19 LN:63811651 @SQ SN:20 LN:62435964 @SQ SN:21 LN:46944323 @SQ SN:22 LN:49691432 @SQ SN:X LN:154913754 @SQ SN:Y LN:57772954 @SQ SN:MT LN:16571 @SQ SN:NT_113887 LN:3994 ...
If the order of the contigs here matches the contig ordering specified above, and the
SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.
Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence.
A quick Unix command using Samtools will do the trick:
$ samtools view -H /path/to/my.bam | grep '^@RG' @RG ID:0 PL:solid PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414 CN:bcm @RG ID:1 PL:solid PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414 CN:bcm @RG ID:2 PL:LS454 PU:R_2008_10_02_06_06_12_FLX01080312_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC @RG ID:3 PL:LS454 PU:R_2008_10_02_06_07_08_rig19_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC @RG ID:4 PL:LS454 PU:R_2008_10_02_17_50_32_FLX03080339_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC ...
The presence of the
@RG tags indicate the presence of read groups. Each read group has a
SM tag, indicating the sample from which the reads belonging to that read group originate.
In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,
$ samtools view /path/to/my.bam | grep '^@RG' EAS139_44:2:61:681:18781 35 1 1 0 51M = 9 59 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 EAS139_44:7:84:1300:7601 35 1 1 0 51M = 12 62 TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3 MF:i:18 Aq:i:0 NM:i:1 UQ:i:5 H0:i:0 H1:i:85 EAS139_44:8:59:118:13881 35 1 1 0 51M = 2 52 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 EAS139_46:3:75:1326:2391 35 1 1 0 51M = 12 62 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 ...
membership in a read group is specified by the
RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).
Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.
For technical details, see the SAM specification on the Samtools website.
|Tag||Importance||SAM spec definition||Meaning|
||Required||Read group identifier. Each
||Ideally, this should be a globally unique identify across all sequencing data in the world, such as the Illumina flowcell + lane name and number. Will be referenced by each read with the
||Sample. Use pool name where a pool is being sequenced.||Required. As important as
||The name of the sample sequenced in this read group. GATK tools treat all read groups with the same
||Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.||Important. Not currently used in the GATK, but was in the past, and may return. The only way to known the sequencing technology used to generate the sequencing data .||It's a good idea to use this field.|
||DNA preparation library identify||Essential for MarkDuplicates||MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.|
We do not require value for the
A concrete example may be instructive. Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following
@RG fields in the header:
Dad's data: @RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 @RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 Mom's data: @RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200 @RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200 @RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400 @RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400 Kid's data: @RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400 @RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).
Use Picard's AddOrReplaceReadGroups tool to add read group information.
You can use the GATK to do the following:
GATK -I full.bam -T PrintReads -L chr1:10-20 -o subset.bam
and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.
It is useful for fixing problems such as not having read groups in a bam file.
java -jar picard/AddOrReplaceReadGroups.jar I= testdata/exampleNORG.bam O= exampleNewRG.bam SORT_ORDER=coordinate RGID=foo RGLB=bar RGPL=illumina RGSM=DePristo
Note that this tool is now part of the Picard package: http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups
This tool can fix BAM files without read group information:
# throws an error java -jar dist/GenomeAnalysisTK.jar -R testdata/exampleFASTA.fasta -I testdata/exampleNORG.bam -T UnifiedGenotyper # fix the read groups java -jar picard/AddOrReplaceReadGroups.jar I= testdata/exampleNORG.bam O= exampleNewRG.bam SORT_ORDER=coordinate RGID=foo RGLB=bar RGPL=illumina RGSM=DePristo CREATE_INDEX=True # runs without error java -jar dist/GenomeAnalysisTK.jar -R testdata/exampleFASTA.fasta -I exampleNewRG.bam -T UnifiedGenotyper
The GATK can be particular about the ordering of a BAM file. If you find yourself in the not uncommon situation of having created or received BAM files sorted in a bad order, you can use the tool ReorderSam to generate a new BAM file where the reads have been reordered to match a well-ordered reference file.
java -jar picard/ReorderSam.jar I= lexicographc.bam O= kayrotypic.bam REFERENCE= Homo_sapiens_assembly18.kayrotypic.fasta
This tool requires you have a correctly sorted version of the reference sequence you used to align your reads. This tool will drop reads that don't have equivalent contigs in the new reference (potentially bad, but maybe not). If contigs have the same name in the bam and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a lift over tool!
The tool, though once in the GATK, is now part of the Picard package.
For a complete, detailed argument reference, refer to the GATK document page here.
DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:
Because the common question "What proportion of my targeted bases are well-powered to discover SNPs?" is answered by the last matrix on the above list, it is strongly recommended that this walker be run on all samples simultaneously.
For humans, DepthOfCoverage can also be configured to output these statistics aggregated over genes, by providing it with a RefSeq ROD.
DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.
To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument
The provided gene list must be of the following format:
585 NM_001005484 chr1 + 58953 59871 58953 59871 1 58953, 59871, 0 OR4F5 cmpl cmpl 0, 587 NM_001005224 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F3 cmpl cmpl 0, 587 NM_001005277 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F16 cmpl cmpl 0, 587 NM_001005221 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F29 cmpl cmpl 0, 589 NM_001005224 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F3 cmpl cmpl 0, 589 NM_001005277 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F16 cmpl cmpl 0, 589 NM_001005221 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F29 cmpl cmpl 0,
If you are on the broad network, the properly-formatted file containing refseq genes and transcripts is located at
If you supply the -geneList argument, DepthOfCoverage will output an additional summary file that looks as follows:
Gene_Name Total_Cvg Avg_Cvg Sample_1_Total_Cvg Sample_1_Avg_Cvg Sample_1_Cvg_Q3 Sample_1_Cvg_Median Sample_1_Cvg_Q1 SORT1 594710 238.27 594710 238.27 165 245 330 NOTCH2 3011542 357.84 3011542 357.84 222 399 >500 LMNA 563183 186.73 563183 186.73 116 187 262 NOS1AP 513031 203.50 513031 203.50 91 191 290
Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The
-geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes,
-geneList is ignored if
-omitIntervals is thrown.
i have been using HaplotypeCaller in gVCF mode on a cohort of 830 samples spread over 2450 bams. the number of bams per sample varies from 1-4. for samples with <=3 bams, the routine works perfectly. but for samples with 4 bams, the jobs always crash and I receive the error:
ERROR MESSAGE: Invalid command line: Argument emitRefConfidence has a bad value: Can only be used in single sample mode currently
is this a bug? are there any options i can use to avoid this error. i suppose it is possible that there is an issue with my bams, but it seems odd that the error occurs systematically with 4 bam samples and never for samples with 3 or fewer bams.
thanks for any help!
When I run the BQSR with GATK 2.8.1, I can't check the PG information on output bam file.
BWA->Picard dedup->GATK LocalRealign->GATK BQSR
Below is the output BAM file after BQSR step.
@PG ID:GATK IndelRealigner VN:2.3-9-gdcdccbb CL:knownAlleles=[(RodBinding name=knownAlleles source=/nG/Reference/hgdownload.cse.ucsc.edu/goldenPath/hg19/KNOWNINDEL/Mills_and_1000G_gold_standard.indels.hg19.vcf)] targetIntervals=/nG/Data/2077/vcf1/node1/chrY/Databind/chrY.bam.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null @PG ID:MarkDuplicates PN:MarkDuplicates VN:1.92(1464) CL:net.sf.picard.sam.MarkDuplicates INPUT=[/nG/Data/2077/step2_makebam/node1-1.bam, /nG/Data/2077/step2_makebam/node1-2.bam, /nG/Data/2077/step2_makebam/node1-3.bam, /nG/Data/2077/step2_makebam/node1-4.bam, /nG/Data/2077/step2_makebam/node1-5.bam, /nG/Data/2077/step2_makebam/node1-6.bam] OUTPUT=/dev/stdout METRICS_FILE=/nG/Data/2077/temp/picard_info.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000000 TMP_DIR=[/nG/Data/2077/temp] QUIET=true VALIDATION_STRINGENCY=LENIENT COMPRESSION_LEVEL=0 MAX_RECORDS_IN_RAM=2000000 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO CREATE_INDEX=false CREATE_MD5_FILE=false
Related thread (guess) http://gatkforums.broadinstitute.org/discussion/2118/baserecalibration-prinread-don-t-create-a-header-and-don-t-obtain-oq-orignal-base-quality-in-bam
I use --sample_name option to restrict output reads for PintReads walker (GATK version - 2.4.9). This option works well but the output BAM header @RG still contains sample name (SM) that was not included. Is there a way to apply this --sample_name option to BAM header at the same time? I could reheader the BAM files using samtools but I'm looking for a better and easier way.
We have used bwa 0.7.4 aln and sampe to align illumina reads. Then used the following command java -Xmx6g -jar ~/GenomeAnalysisTK-2.8-1/GenomeAnalysisTK.jar -T BaseRecalibrator -I ~/temp/BIR-08_130330_I288_FCD1P68ACXX_L7_SZAIPI025187-74.sortedindelrealigned.bam -R ~/hg19/ucsc.hg19.fasta -knownSites ~/dbSNP/dbsnp_137.hg19.vcf -o ~/BIR-08_130330_I288_FCD1P68ACXX_L7_SZAIPI025187-74.sortedBQSR.grp Which gave the following error message
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: START (90) > (89) STOP -- this should never happen, please check read: FCD1P68ACXX:7:1315:19572:52424#CGCGGTGA 1/2 90b aligned read. (CIGAR: 85M4I1M2D) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:537) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinatesRightTail(ReadClipper.java:193) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:389) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:392) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:245) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:132) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:228) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:216) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
can you help me in this error message? Why its coming and how to rectify it? Thanks in advance Mayukh
Is it possible to source and read bam files directly from ftp site into Unified Genotyper without downloading them first. This option works in samtools view -- but if I try the simple straightforward way in GATK (just replacing the -I inputfilename.bam with -I ftplinktobamfile.bam ) it does not work. Is there another way of doing this? This would save me a lot of diskspace if I could do this.
I was using PrintReads as part of Base Quality Recalibration. The resulting BAM header is incorrect as there are two @PG entries for GATK PrintReads. The bam file I am using is internal to the broad institute and is picard aggregated. There is an entry for GATK PrintReads in the original bam. Would it be possible for the @PG header lines to include a .A-Z like the other headers?
If I have a bam file with three different read groups, and use SplitSamFile to split it like so:
java -Xmx2g -jar $GATKJAR -T SplitSamFile -I $INBAM -R $GENOME --outputRoot $PROJD/$IND/
Each of the output bam files have all three read groups. Is that the intended behavior? I would like each file to have only it's own read group info in the heads. Sorry for the bash arguments in the code above, is makes in readable at least.
I don't know if this question has been asked, if so sorry.
When using UnifiedGenotyper, I was wondering if it was possible (via hidden command option, etc) to use a VCF file as a prior. Currently I have just been adding additional bam files, but it would be nice (and quicker) if I could use a Indexed file.
Thanks for your time.
Hi I am using this version of GATK which is java -jar /apps1/gatk/GenomeAnalysisTK-2.6-5-gba531bd/GenomeAnalysisTK.jar
I have obtained my bam files using bowtie as aligner. So the quality score is 255 in the 5th column. I want to use UnifiedGenotyper and when i use it it gives me an error saying MappingQualityUnavailable.
So I see read filter ReassignOneMappingQualityFilter which can convert 255 to 60 and then it would be easily taken by UnifiedGenotyper to do the snp calling.
So what command line should i be using
This page is not telling how to provide input file and then output file http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_filters_ReassignOneMappingQualityFilter.html#-RMQF
I have just started to explore GATK. Hope to hear from you soon
we are running tests trying to get UG to produce 1 vcf per sample when inputting bams from multiple subjects. our situation is complicated slightly by the fact that each sample has 3 bams. when we input all 6 bams into UG, hoping to output 2 vcfs (1 per sample) we instead get a single vcf. we found some relevant advice in this post: http://gatkforums.broadinstitute.org/discussion/2262/why-unifiedgenotyper-treat-multiple-bam-input-as-one-sample but still haven't solved the issue.
details include: 1) we are inputting 6 bams for our test, 3 per sample for 2 samples. 2) bams were generated using Bioscope from targeted capture reads sequenced on a Solid 4. 3) as recommended in the post above we checked out the @RG statements in the bam headers using Samtools -- lines for the 6 bams are as follows:
@RG ID:20130610202026358 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-10T16:20:26-0400 SM:S1
@RG ID:20130611214013844 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-11T17:40:13-0400 SM:S1
@RG ID:20130613002511879 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:147 DT:2013-06-12T20:25:11-0400 SM:S1
@RG ID:20130611021848236 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-10T22:18:48-0400 SM:S1
@RG ID:20130612014345277 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-11T21:43:45-0400 SM:S1
@RG ID:20130613085411753 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:150 DT:2013-06-13T04:54:11-0400 SM:S1
Based on the former post, I would have expected each of these bams to generate a separate vcf as it appears the ids are all different (which would not have been desirable either, as we are hoping to generate 2 vcfs in this test). Thus, it is not clear if/how we should use Picard tool AddOrReplaceReadGroups to modify the @RG headers?
Does that make sense? Any advice?
i have spend many hours trying to run GATK depthofcoverage but it never works.
-T DepthOfCoverage -R /home/remi/Analyse/CNV/ERR125905/bam/Chloroplastgenomebarley.fa -I /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bam -L /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bed -o /home/remi/Analyse/CNV/FishingCNV_1.5.3/out/ERfiltre.bam.coverage --minMappingQuality 15 --minBaseQuality 10 --omitDepthOutputAtEachBase --logging_level INFO --summaryCoverageThreshold 5 --summaryCoverageThreshold 7 --summaryCoverageThreshold 10 --summaryCoverageThreshold 15 --summaryCoverageThreshold 20 --summaryCoverageThreshold 30 --summaryCoverageThreshold 50
My BAM header seem to be malformed.
ERROR MESSAGE: SAM/BAM file /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups
here is the 1rst line of the header:
@SQ SN:Chloroplastgenomebarley LN:136462 @PG ID:bwa PN:bwa VN:0.5.9-r16 ERR125905.35 99 Chloroplastgenomebarley 69543 29 101M = 69854 412 TTTGATCCCTCTGATCCTGTTCTGGATCCAATGTGGAGACAAGGTATGTTCGTAATTCCCTTCATGACTCGTTTAGGAATAACGGATCCTTGGGGTGGTTG D-:D?BDDDDCC-?ADCBBBDDDDD:BDD= :6 C-4<9@62@@<:?=B??B=DC28=B&?:AA:4 ERR125905.35 147 Chloroplastgenomebarley 69854 29 101M = 69543 -412 GGCTTTCTGTCGCTTGTGGGCTTTTCCTATAACGGCTTTTTATGTTCCTGGGATATGGGTATCCGATCCTTATGGACTAACTGGAAAAGTACAAGCTGTAA #################################################A-B49= @@2>+:CCC:@@ 66DD@-@DDD?B::@-CA:5?:ADD?ADBB??
I Have search in the forum and doc about it. I have try to reorder my header with picard:
@HD VN:1.4 SO:unsorted @SQ SN:Chloroplastgenomebarley LN:136462 UR:file:/home/remi/Analyse/REFGEN/Chloroplastgenomebarley.fa M5:7a7b36ef01cc1a2af1c8451ca3800f93 @PG ID:bwa PN:bwa VN:0.5.9-r16 ERR125905.35 99 Chloroplastgenomebarley 69543 29 101M = 69854 412 TTTGATCCCTCTGATCCTGTTCTGGATCCAATGTGGAGACAAGGTATGTTCGTAATTCCCTTCATGACTCGTTTAGGAATAACGGATCCTTGGGGTGGTTG D-:D?BDDDDCC-?ADCBBBDDDDD:BDD= :6 C-4<9@62@@<:?=B??B=DC28=B&?:AA:4 ERR125905.35 147 Chloroplastgenomebarley 69854 29 101M = 69543 -412 GGCTTTCTGTCGCTTGTGGGCTTTTCCTATAACGGCTTTTTATGTTCCTGGGATATGGGTATCCGATCCTTATGGACTAACTGGAAAAGTACAAGCTGTAA #################################################A-B49= @@2>+:CCC:@@ 66DD@-@DDD?B::@-CA:5?:ADD?ADBB??but no more change.
Someone can help me please ?
I am currently running an analysis using the HaplotypeCaller on 300 large BAM files on our cluster and decided to chunk the the genome in 3MB bins in order for them to be processed in a decent time. I'm however experiencing very long runtimes as more and more jobs get scheduled to run in parallel on the same files. Looking at the GATK options, I saw these 2 that I thought could be of help and was wondering what were the recommendation for using them: --num_bam_file_handles --read_buffer_size
More precisely, does the num_bam_file_handles increase processing time by a lot? and what is the default value for --read_buffer_size ?
Thanks a lot, Laurent
I have applied PhaseByTransmission on a trio with a ped file and now want to run ReadBackedPhasing. However, each of the trio variant calls were called from a different BAM file (as each was from a different individual). In the ReadBackedPhasing documentation it only mentions using the program with a single bam. Does this mean that I need to merge the bams for each of the three individuals into a single bam? If so, do you have any suggested programs that work well with GATK?
Hello, I am a first-time user of GATK and have spent some time now on trying to get the input bam files in the appropriate format. To run IndelRealigner, I have added ReadGroups, Reordered and Index my bam file with the respective Picard-Tools.
My command-line is the following:
java -Djava.io.tmpdir='pwd'/tmp -jar GenomeAnalysisTK.jar -I ./add_read_groups_reorder_index.bam -R ./genome.fa -T IndelRealigner -targetIntervals ./gatk.intervals -o ./*.bam -known ./Mills-1000G-indels.vcf --consensusDeterminationModel KNOWNS_ONLY -LOD 0.4
I get the following message:
SAM/BAM file /home/gp53/tophat2-merge-ctl-1st-2nd-readgroups-reorder-index.bam is malformed: SAM file doesn't have any read groups defined in the header.
My reads are paired-end aligned with TopHat2 I will appreciate your help on this. Thanks, G.
Hi, Does GATK2 provide a walker/option to summarize the read alignment in a given BAM file? The summary including total reads, reads mapped/%, reads uniquely mapped/%, reads uniquely mapped with 0mm/%, reads mapped on-target/%, reads uniquely mapped on-target%, etc is of great use to assess the mapping quality for whole genome or targeted analysis. Please advice me on how I can obtain this using any of the walkers available. Thanks, Raj
Dear All, I am very new to the analysis of NGS data.
I would like to merge the information of sample 1029 from HGDP (http://cdna.eva.mpg.de/denisova/VCF/human/HGDP01029.hg19_1000g.12.mod.vcf.gz) to SAN sample in Schuster et al 2010 ftp://ftp.bx.psu.edu/data/bushman/hg18/bam/KB1illumChr12.bam)
If I well understood, I should call the variants from the bam file and then merge with the vcf. Is it correct? Could you gently suggest me the best way to do it in your opinion? When should i convert my files to the same reference sequence?
In addition I am looking at http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0, and I am trying to do Variant Detection on the example file NA12878. I have some doubt, Where I can find MarkDuplicates tool? Should I invoke it just with -T argument? Or Do I need to install it?
I am really sorry, I am trying to understand GATK, but it is not rally intuitive, so of you have any tips or recommendation please let me know it.
I am getting the following error when running DepthOfCoverage:
I have already reheadered my bam file to fix a contig mismatch error, and the fasta dict file was generated automatically by gatk. Moreover, about 160 lines of output are generated, but I do not see any irregularities at the position where the code crashed. Please let me know what I can try. Thank you.
I was running UnifiedGenotyper on a set of 26 bam files. There was one particular position where I was comparing calls to the actual pileup and I noticed a major discrepancy. There was a no-call ("./.") for that position for one of the bam files while most other samples had calls. That non-called sample, though, had a very convincing variant in the pileup, with lots of high quality coverage at that position.
I then tried running just that bam file alone through UnifiedGenotyper, or that bam file along with two others. In both cases, the 1/1 variant is called properly with the following genotype field:
This seems to me to be a serious bug. Is this anything that's been noted before?
I am running GATKLite version 2.1-3-ge1dbcc8
We are attempting to see if using ReducedReads will help with the overwhelming file sizes for the SNP calling we are doing on whole genome BAM files. We have been using a protocol similar to the one described in best practices document: Best: multi-sample realignment with known sites and recalibration. My question is what is the best point in the pipeline to use ReducedReads?