# Tagged with #bam 5 documentation articles | 0 announcements | 32 forum discussions

#### Objective

Revert a BAM file back to FastQ. This comes in handy when you receive data that has been processed but not according to GATK Best Practices, and you want to reset and reprocess it properly.

#### Prerequisites

• Installed HTSlib

#### Steps

1. Shuffle the reads in the bam file
2. Revert the BAM file to FastQ format
3. Compress the FastQ file

### 1. Shuffle the reads in the bam file

#### Action

Shuffle the reads in the bam file so they are not in a biased order before alignment by running the following HTSlib command:

htscmd bamshuf -uOn 128 aln_reads.bam tmp > shuffled_reads.bam


#### Expected Result

This creates a new BAM file containing the original reads, which still retain their mapping information, but now they are no longer sorted.

The aligner uses blocks of paired reads to estimate the insert size. If you don’t shuffle your original bam, the blocks of insert size will not be randomly distributed across the genome, rather they will all come from the same region, biasing the insert size calculation. This is a very important step which is unfortunately often overlooked.

### 2. Revert the BAM file to FastQ

#### Action

Revert the BAM file to FastQ format by running the following HTSlib command:

htscmd bam2fq -a shuffled_reads.bam > interleaved_reads.fq


#### Expected Result

This creates an interleaved FastQ file called interleaved_reads.fq containing the now-unmapped paired reads.

Interleaved simply means that for each pair of reads in your paired-end data set, both the forward and the reverse reads are in the same file, as opposed to having them in separate files.

### 3. Compress the FastQ file

#### Action

Compress the FastQ file to reduce its size using the gzip utility:

gzip interleaved_reads.fq


#### Expected Result

This creates a gzipped FastQ file called interleaved_reads.fq.gz. This file is ready to be used as input for the Best Practices workflow.

BWA handles gzipped fastq files natively, so you don’t need to unzip the file to use it later on.

### 4. Note for advanced users

If you’re feeling adventurous, you can do all of the above with this beautiful one-liner, which will save you a heap of time that the program would otherwise spend performing I/O (loading in and writing out data to/from disk):

htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz


### 1. What file formats do you support for sequencer output?

The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.

### 2. How do I get my data into BAM format?

The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.

### 3. What are the formatting requirements for my BAM file(s)?

All BAM files must satisfy the following requirements:

• It must be aligned to one of the references described here.
• It must be sorted in coordinate order (not by queryname and not "unsorted").
• It must list the read groups with sample names in the header.
• The BAM file must pass Picard validation.

### 4. What is the canonical ordering of human reference contigs in a BAM file?

It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:

Human genome reference consortium standard ordering and names (b3x): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...

UCSC convention (hg1x): chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...

$samtools view -H /path/to/my.bam @HD VN:1.0 GO:none SO:coordinate @SQ SN:1 LN:247249719 @SQ SN:2 LN:242951149 @SQ SN:3 LN:199501827 @SQ SN:4 LN:191273063 @SQ SN:5 LN:180857866 @SQ SN:6 LN:170899992 @SQ SN:7 LN:158821424 @SQ SN:8 LN:146274826 @SQ SN:9 LN:140273252 @SQ SN:10 LN:135374737 @SQ SN:11 LN:134452384 @SQ SN:12 LN:132349534 @SQ SN:13 LN:114142980 @SQ SN:14 LN:106368585 @SQ SN:15 LN:100338915 @SQ SN:16 LN:88827254 @SQ SN:17 LN:78774742 @SQ SN:18 LN:76117153 @SQ SN:19 LN:63811651 @SQ SN:20 LN:62435964 @SQ SN:21 LN:46944323 @SQ SN:22 LN:49691432 @SQ SN:X LN:154913754 @SQ SN:Y LN:57772954 @SQ SN:MT LN:16571 @SQ SN:NT_113887 LN:3994 ...  If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements. ### 6. My BAM file isn't sorted that way. How can I fix it? Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence. ### 7. How can I tell if my BAM file has read group and sample information? A quick Unix command using Samtools will do the trick: $ samtools view -H /path/to/my.bam | grep '^@RG'
@RG ID:0    PL:solid    PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP   LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:1    PL:solid    PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP    LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:2    PL:LS454    PU:R_2008_10_02_06_06_12_FLX01080312_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:3    PL:LS454    PU:R_2008_10_02_06_07_08_rig19_retry    LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:4    PL:LS454    PU:R_2008_10_02_17_50_32_FLX03080339_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
...


The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate.

samtools view /path/to/my.bam | grep '^@RG' EAS139_44:2:61:681:18781 35 1 1 0 51M = 9 59 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 EAS139_44:7:84:1300:7601 35 1 1 0 51M = 12 62 TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3 MF:i:18 Aq:i:0 NM:i:1 UQ:i:5 H0:i:0 H1:i:85 EAS139_44:8:59:118:13881 35 1 1 0 51M = 2 52 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 EAS139_46:3:75:1326:2391 35 1 1 0 51M = 12 62 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 ...  membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414). ### 8. My BAM file doesn't have read group and sample information. Do I really need it? Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information. ### 9. What's the meaning of the standard read group fields? For technical details, see the SAM specification on the Samtools website. Tag Importance SAM spec definition Meaning ID Required Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in header section. Read groupIDs may be modified when merging SAM files in order to handle collisions. Ideally, this should be a globally unique identify across all sequencing data in the world, such as the Illumina flowcell + lane name and number. Will be referenced by each read with the RG:Z field, allowing tools to determine the read group information associated with each read, including the sample from which the read came. Also, a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model. SM Sample. Use pool name where a pool is being sequenced. Required. As important as ID. The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper. PL Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO. Important. Not currently used in the GATK, but was in the past, and may return. The only way to known the sequencing technology used to generate the sequencing data . It's a good idea to use this field. LB DNA preparation library identify Essential for MarkDuplicates MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes. We do not require value for the CN, DS, DT, PG, PI, or PU fields. A concrete example may be instructive. Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header: Dad's data: @RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 @RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 Mom's data: @RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200 @RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200 @RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400 @RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400 Kid's data: @RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400 @RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400  Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library). ### 9. My BAM file doesn't have read group and sample information. How do I add it? Use Picard's AddOrReplaceReadGroups tool to add read group information. ### 10. How do I know if my BAM file is valid? Picard contains a tool called ValidateSamFile that can be used for this. BAMs passing STRICT validation stringency work best with the GATK. ### 11. What's the best way to create a subset of my BAM file containing only reads over a small interval? You can use the GATK to do the following: GATK -I full.bam -T PrintReads -L chr1:10-20 -o subset.bam  and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK. ### This utility replaces read groups in a BAM file It is useful for fixing problems such as not having read groups in a bam file. java -jar picard/AddOrReplaceReadGroups.jar I= testdata/exampleNORG.bam O= exampleNewRG.bam SORT_ORDER=coordinate RGID=foo RGLB=bar RGPL=illumina RGSM=DePristo  Note that this tool is now part of the Picard package: http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups This tool can fix BAM files without read group information: # throws an error java -jar dist/GenomeAnalysisTK.jar -R testdata/exampleFASTA.fasta -I testdata/exampleNORG.bam -T UnifiedGenotyper # fix the read groups java -jar picard/AddOrReplaceReadGroups.jar I= testdata/exampleNORG.bam O= exampleNewRG.bam SORT_ORDER=coordinate RGID=foo RGLB=bar RGPL=illumina RGSM=DePristo CREATE_INDEX=True # runs without error java -jar dist/GenomeAnalysisTK.jar -R testdata/exampleFASTA.fasta -I exampleNewRG.bam -T UnifiedGenotyper  ### ReorderSam The GATK can be particular about the ordering of a BAM file. If you find yourself in the not uncommon situation of having created or received BAM files sorted in a bad order, you can use the tool ReorderSam to generate a new BAM file where the reads have been reordered to match a well-ordered reference file. java -jar picard/ReorderSam.jar I= lexicographc.bam O= kayrotypic.bam REFERENCE= Homo_sapiens_assembly18.kayrotypic.fasta  This tool requires you have a correctly sorted version of the reference sequence you used to align your reads. This tool will drop reads that don't have equivalent contigs in the new reference (potentially bad, but maybe not). If contigs have the same name in the bam and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a lift over tool! The tool, though once in the GATK, is now part of the Picard package. #### See this announcement regarding our plans for support of DepthOfCoverage and DiagnoseTargets. If you find that there are functionalities missing in either tool, leave us a comment and we will consider adding them. For a complete, detailed argument reference, refer to the GATK document page here. ### Introduction DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file: • Total, mean, median, and quartiles for each partition type: aggregate • Total, mean, median, and quartiles for each partition type: for each interval • A series of histograms of the number of bases covered to Y depth for each partition type (granular; e.g. Y can be a range, like 16 to 22) • A matrix of counts of the number of intervals for which at least Y samples and/or read groups had a median coverage of at least X • A matrix of counts of the number of bases that were covered to at least X depth, in at least Y groups (e.g. # of loci with ≥15x coverage for ≥12 samples) • A matrix of proportions of the number of bases that were covered to at least X depth, in at least Y groups (e.g. proportion of loci with ≥18x coverage for ≥15 libraries) Because the common question "What proportion of my targeted bases are well-powered to discover SNPs?" is answered by the last matrix on the above list, it is strongly recommended that this walker be run on all samples simultaneously. For humans, DepthOfCoverage can also be configured to output these statistics aggregated over genes, by providing it with a RefSeq ROD. DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth. ### Coverage by Gene To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument -geneList /path/to/gene/list.txt  The provided gene list must be of the following format: 585 NM_001005484 chr1 + 58953 59871 58953 59871 1 58953, 59871, 0 OR4F5 cmpl cmpl 0, 587 NM_001005224 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F3 cmpl cmpl 0, 587 NM_001005277 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F16 cmpl cmpl 0, 587 NM_001005221 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F29 cmpl cmpl 0, 589 NM_001005224 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F3 cmpl cmpl 0, 589 NM_001005277 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F16 cmpl cmpl 0, 589 NM_001005221 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F29 cmpl cmpl 0,  If you are on the broad network, the properly-formatted file containing refseq genes and transcripts is located at /humgen/gsa-hpprojects/GATK/data/refGene.sorted.txt  If you supply the -geneList argument, DepthOfCoverage will output an additional summary file that looks as follows: Gene_Name Total_Cvg Avg_Cvg Sample_1_Total_Cvg Sample_1_Avg_Cvg Sample_1_Cvg_Q3 Sample_1_Cvg_Median Sample_1_Cvg_Q1 SORT1 594710 238.27 594710 238.27 165 245 330 NOTCH2 3011542 357.84 3011542 357.84 222 399 &gt;500 LMNA 563183 186.73 563183 186.73 116 187 262 NOS1AP 513031 203.50 513031 203.50 91 191 290  Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is thrown. No posts found with the requested search criteria. Hello, I was asked to re-post this question here. It was originally posted in the Picard forum at GitHub at https://github.com/broadinstitute/picard/issues/161. Regards, Bernt ## ORIGINAL POST (edited) There seems to be a problems with FixMateInformation crashing with Exception in thread "main" java.lang.NullPointerException at htsjdk.samtools.SamPairUtil.setMateInformationOnSupplementalAlignment(SamPairUtil.java:300) at htsjdk.samtools.SamPairUtilSetMateInfoIterator.advance(SamPairUtil.java:442)
at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.next(SamPairUtil.java:454) at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.next(SamPairUtil.java:360)
at picard.sam.FixMateInformation.doWork(FixMateInformation.java:194)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:185)
at picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:125)
at picard.sam.FixMateInformation.main(FixMateInformation.java:93)


The problem first appeared in version 1.121. It is present in version 1.128. Versions prior to 1.120 worked and continue to work fine. I am currently using Java 1.7.0_75, but I observed the same problem with earlier version of Java. The problem occurs under several different version of Fedora.

The command lines I am using are:

java -jar picard-1.128/picard.jar FixMateInformation INPUT=test.bam OUTPUT=fixed.bam (fails)

java -jar picard-1.121/FixMateInformation.jar INPUT=test.bam OUTPUT=fixed.bam (fails)

java -jar picard-1.120/FixMateInformation.jar INPUT=test.bam OUTPUT=fixed.bam (succeeds)

I have observed the problem with various BAM files. This one is (a small subset of) the output of an indel realignment with GATK.

## Later in the same thread:

ValidateSam produces: java -jar /opt/ghpc/picard-1.121/ValidateSamFile.jar INPUT=test.bam OUTPUT=out.bam [Wed Feb 18 08:48:40 CET 2015] picard.sam.ValidateSamFile INPUT=test.bam OUTPUT=out.bam MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false [Wed Feb 18 08:48:40 CET 2015] Executing as bernt@interactive.ghpc.dk on Linux 2.6.35.14-106.fc14.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13; Picard version: 1.121(da291b4d265f877808b216dce96eaeffd2f30bd3_1411396652) IntelDeflater [Wed Feb 18 08:48:41 CET 2015] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes. Runtime.totalMemory()=505937920 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

## And later:

The problem also exists in the new version, 1.129.

## New

Re-posted in GATK forum

Picard (1.129)'s ValidateSamFile complains about Mate not found - which was the reason for running FixMateInformation in the first place.

The output is:

java -jar /opt/ghpc/picard-current/picard.jar ValidateSamFile I=test.bam

[Thu Mar 26 16:44:49 CET 2015] picard.sam.ValidateSamFile INPUT=test.bam    MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Thu Mar 26 16:44:49 CET 2015] Executing as bernt@interactive.ghpc.dk on Linux 2.6.35.14-106.fc14.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13; Picard version: 1.129(b508b2885562a4e932d3a3a60b8ea283b7ec78e2_1424706677) IntelDeflater
Maximum output of [100] errors reached.
[Thu Mar 26 16:44:50 CET 2015] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=505937920


Please let me know where to send the test BAM file.

Regards,

Bernt

Hi,

I am attempting to merge the output of tophat in order to run some RNASeq QC metrics. This is a single read 50 bp on a Hiseq. In order to get past the fact that tophat gives a MAPQ of 255 to unmapped reads (not 0 as expected by Picard) I used the following tool ( https://github.com/cbrueffer/tophat-recondition) to change it.

Once completed, I added read groups using picard and then sorted the accepted_hits.bam by coordinate and sorted the unmapped reads by queryname.

tophat-recondition.py /home/ryan/NGS_Data/No_Dox

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ AddOrReplaceReadGroups \ I=/home/ryan/NGS_Data/unmapped_fixup.bam \ O=/home/ryan/NGS_Data/unmapped_fixup-RG.bam \ RGID=No_Dox RGLB=No_Dox RGPL=illumina RGPU=GCCAAT RGSM=No_Dox

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ AddOrReplaceReadGroups \ I=/home/ryan/NGS_Data/accepted_hits.bam \ O=/home/ryan/NGS_Data/accepted_hits-RG.bam \ SORT_ORDER=coordinate \ RGID=No_Dox RGLB=No_Dox RGPL=illumina RGPU=GCCAAT RGSM=No_Dox \ CREATE_INDEX=true

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ SortSam \ I=/home/ryan/NGS_Data/unmapped_fixup-RG.bam \ O=/home/ryan/NGS_Data/unmapped_fixup-RG-sorted.bam \ SORT_ORDER=queryname

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ SortSam \ I=/home/ryan/NGS_Data/accepted_hits-RG.bam \ O=/home/ryan/NGS_Data/accepted_hits-RG-sorted.bam \ SORT_ORDER=coordinate \ CREATE_INDEX=true

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ MergeBamAlignment \ UNMAPPED_BAM=/home/ryan/NGS_Data/unmapped_fixup-RG-sorted.bam \ ALIGNED_BAM=/home/ryan/NGS_Data/accepted_hits-RG-sorted.bam \ O=/home/ryan/NGS_Data/merge_unmapped_accepted_hits_No_Dox.bam \ SORT_ORDER=coordinate \ REFERENCE_SEQUENCE=/home/ryan/Reference/human_spikein/hg19_spikein.fa \ PROGRAM_RECORD_ID=Tophat \ PROGRAM_GROUP_VERSION=0.1 \ PROGRAM_GROUP_COMMAND_LINE=tophat/dummy \ PAIRED_RUN=false \ VALIDATION_STRINGENCY=LENIENT \ CREATE_INDEX=true

I then get the following warning followed by the error:

WARNING 2015-03-17 09:44:22 SamAlignmentMerger Exception merging bam alignment - attempting to sort aligned reads and try again: Aligned record iterator (HS3:608:C6LNLACXX:7:2101:9946:63417) is behind the unmapped reads (HS3:608:C6LNLACXX:7:2101:9947:11009) INFO 2015-03-17 09:44:33 SamAlignmentMerger Read 1000000 records from alignment SAM/BAM. INFO 2015-03-17 09:44:43 SamAlignmentMerger Read 2000000 records from alignment SAM/BAM. .... INFO 2015-03-17 09:58:01 SamAlignmentMerger Read 96000000 records from alignment SAM/BAM. INFO 2015-03-17 09:58:09 SamAlignmentMerger Read 97000000 records from alignment SAM/BAM. INFO 2015-03-17 09:58:15 SamAlignmentMerger Finished reading 97571897 total records from alignment SAM/BAM. [Tue Mar 17 09:58:16 PDT 2015] picard.sam.MergeBamAlignment done. Elapsed time: 14.32 minutes. Runtime.totalMemory()=1908932608 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" java.lang.IllegalStateException: Aligned record iterator (HS3:608:C6LNLACXX:7:1101:10000:11036) is behind the unmapped reads (HS3:608:C6LNLACXX:7:1101:10000:48402) at picard.sam.AbstractAlignmentMerger.mergeAlignment(AbstractAlignmentMerger.java:383) at picard.sam.SamAlignmentMerger.mergeAlignment(SamAlignmentMerger.java:153) at picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:248) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

I have searched and have been unsuccessful at resolving this problem. Any ideas?

I am running picard 1.129 using java version "1.7.0_60"

ValidateSamFile on both inputs into MergeBamAlignment returns no errors in those files. I am at a lost and seeking for help!

Thanks,

Ryan

Hi,

with GATK 3.3-0 I am confronted with an error that was present in a much older version, but seemed resolved about a year ago:

ERROR MESSAGE: There is no space left on the device, so writing failed

There is 8TB left on the drive, no user limit. Sometimes re-running the exact same job works, sometimes not. Some jobs keep failing despite asking for an insane amount of memory on the cluster, given these are RNAseq bam files, the largest one being less than 7GB.

For example:

qsub -b y -cwd -N step3_145 -o step3_145.o -e step3_145.e -V -l h_vmem=40G /share/apps/java/oracle/1.8.0_11/bin/java -Xmx35G -jar /data/home/hhx037/GATK-3.3.0/GenomeAnalysisTK.jar -T SplitNCigarReads -R Homo_sapiens.GRCh37.75.dna.1-22XYMT.fa -I Analyses/file_dedup.bam -o Analyses/file_splittedcigar.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

Here is the log:

INFO 10:50:51,568 HelpFormatter - Executing as hhx037@panos1 on Linux 2.6.32-431.1.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_11-b12. INFO 10:50:51,571 HelpFormatter - Date/Time: 2015/02/13 10:50:51 INFO 10:50:51,571 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:50:51,576 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:50:52,503 GenomeAnalysisEngine - Strictness is SILENT INFO 10:50:52,827 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 10:50:52,861 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 10:50:52,876 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 10:50:53,021 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 10:50:53,027 GenomeAnalysisEngine - Done preparing for traversal INFO 10:50:53,030 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 10:50:53,030 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 10:50:53,030 ProgressMeter - Location | reads | elapsed | reads | completed | runtime | runtime INFO 10:50:53,047 ReadShardBalancer$1 - Loading BAM index data INFO 10:50:53,050 ReadShardBalancer$1 - Done loading BAM index data INFO 10:51:23,404 ProgressMeter - 1:1477348 702953.0 30.0 s 43.0 s 0.0% 17.5 h 17.5 h INFO 10:52:32,660 ProgressMeter - 1:16909108 1202983.0 99.0 s 82.0 s 0.5% 5.0 h 5.0 h INFO 10:53:09,769 ProgressMeter - 1:21069702 1302985.0 2.3 m 104.0 s 0.7% 5.6 h 5.5 h INFO 10:53:49,083 ProgressMeter - 1:27951393 1803181.0 2.9 m 97.0 s 0.9% 5.4 h 5.4 h INFO 10:54:29,275 ProgressMeter - 1:32739969 2103299.0 3.6 m 102.0 s 1.1% 5.7 h 5.6 h INFO 10:55:09,177 ProgressMeter - 1:36643589 2203300.0 4.3 m 116.0 s 1.2% 6.0 h 5.9 h INFO 10:55:45,643 ProgressMeter - 1:39854010 2303302.0 4.9 m 2.1 m 1.3% 6.3 h 6.2 h INFO 10:56:25,147 ProgressMeter - 1:40542516 2403303.0 5.5 m 2.3 m 1.3% 7.0 h 6.9 h INFO 10:57:10,934 ProgressMeter - 1:40654849 2503322.0 6.3 m 2.5 m 1.3% 8.0 h 7.9 h INFO 10:57:54,084 ProgressMeter - 1:43162895 2503322.0 7.0 m 2.8 m 1.4% 8.4 h 8.3 h INFO 10:58:24,149 ProgressMeter - 1:45244391 2703426.0 7.5 m 2.8 m 1.5% 8.6 h 8.4 h INFO 10:58:56,749 ProgressMeter - 1:53716450 2803427.0 8.1 m 2.9 m 1.7% 7.7 h 7.6 h INFO 10:59:38,928 ProgressMeter - 1:86821106 3103432.0 8.8 m 2.8 m 2.8% 5.2 h 5.1 h INFO 11:00:11,337 ProgressMeter - 1:93301870 3303437.0 9.3 m 2.8 m 3.0% 5.1 h 5.0 h INFO 11:01:13,113 ProgressMeter - 1:115252321 3803590.0 10.3 m 2.7 m 3.7% 4.6 h 4.5 h INFO 11:02:02,172 ProgressMeter - 1:145441389 4303778.0 11.2 m 2.6 m 4.7% 4.0 h 3.8 h INFO 11:02:38,237 ProgressMeter - 1:150547232 4703871.0 11.8 m 2.5 m 4.9% 4.0 h 3.8 h INFO 11:03:09,693 ProgressMeter - 1:153362937 5003904.0 12.3 m 2.5 m 5.0% 4.1 h 3.9 h INFO 11:03:39,934 ProgressMeter - 1:155984762 5403968.0 12.8 m 2.4 m 5.0% 4.2 h 4.0 h INFO 11:04:05,477 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ------------------------------------------------------------------------------------------

I understand temporary files may be large, but not that large. Are the temporary files written in the working directory (as I believe should be the case), or are they written in GATK installation directory?

Also, note I never run into this problem with the previous version.

Any idea?

Cheers,

Stephane

Hi,

I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains information for one chromosome for one individual. I was wondering if it is best for me to pass all the files for a single individual together when running HaplotypeCaller, if it will increase the accuracy of the calling, or if I can just run HaplotypeCaller on each individual bam file separately.

Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of?

Alva

Hi, I am working with RNA-Seq data from 6 different samples. Part of my research is to identify novel polymorphisms. I have generated a filtered vcf file for each sample. I would like to now combine these into a single vcf.

I am concerned about sites that were either not covered by the RNA-Seq analysis or were no different from the reference allele in some individuals but not others. These sites will be ‘missed’ when haplotypeCaller analyzes each sample individually and will not be represented in the downstream vcf files.

When the files are combined, what happens to these ‘missed’ sites? Are they automatically excluded? Are they treated as missing data? Is the absent data filled in from the reference genome?

Alternatively, can BaseRecallibrator and/or HaplotypeCaller simultaneously analyze multiple bam files?

Is it common practice to combine bam files for discovering sequence variants?

Greeting all.

Currently, I have been using Picard's built-in library "BuildBamIndex" in order to index my bam files.

I have followed the manual described in Picard sites but I got error message.

Here is my command line that you can easily understand as below.

java -Xmx8g -XX:ParallelGCThreads=8 -jar $picard/BuildBamIndex.jar I=$RealignedBamDir/output6 I tried different approach to avoid this error message so I used "samtools index" which i think is also same function as Picard BuildBamIndex. After using samtools, I successfully got my bam index files. I suppose that there are no major difference between Picard bamindex and samtools bam index. I am confusing that why only samtools index procedure is working fine? Below is my error message when run "BuildBamIndex" from Picard. **[Sun Jan 18 22:15:42 KST 2015] picard.sam.BuildBamIndex INPUT=/DATA1/sclee1/data/URC_WES/U01/01U_N_Filtered_Sorted_Markdup_readgroup.bam VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false [Sun Jan 18 22:15:42 KST 2015] picard.sam.BuildBamIndex done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=2058354688 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" htsjdk.samtools.SAMException: Exception creating BAM index for record HSQ-2K:530:C5PJAACXX:6:2109:18806:13902 1/2 101b aligned read. at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:92) at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:291) at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:271) at picard.sam.BuildBamIndex.doWork(BuildBamIndex.java:133) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105) Caused by: htsjdk.samtools.SAMException: BAM cannot be indexed without setting a fileSource for record HSQ-2K:530:C5PJAACXX:6:2109:18806:13902 1/2 101b aligned read. at htsjdk.samtools.BAMIndexMetaData.recordMetaData(BAMIndexMetaData.java:130) at htsjdk.samtools.BAMIndexerBAMIndexBuilder.processAlignment(BAMIndexer.java:182) at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:90) ... 6 more **

I look forward to hearing positive answers from you soon.

Bye!

Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).

Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with bwa mem) is this:

• 01- BAM: Local INDEL realignment (RealignerTargetCreator/IndelRealigner)
• 02- BAM: MarkDuplicates
• 03- BAM: Local INDEL realignment second pass (RealignerTargetCreator/IndelRealigner)
• 04- BAM: Calling variants using HaplotypeCaller
• 05- VCF: Hard filter variants for truth set for BQSR (there is no known variant site databases so we can use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and we also filter out heterozygous positions using "isHet == 1".
• 06- VCF: Calculate BQSR table using the high quality hard-filtered variants from step 05.
• 07- BAM: Apply BQSR recalibration from previous step to BAM file from step 04.
• 08- BAM: Calling variants on recalibrated BAM file from previous step using HaplotypeCaller, also emitting reference sites using --output_mode EMIT_ALL_SITES \ and --emitRefConfidence GVCF \

Does this sound like a reasonable thing to do? What options should I use in step 8 in order for HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES \ and --emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).

Hi,

I want to use HaplotypeCaller to call families together. I have bam files for each individual in the 4 families I am studying, as well as a ped file describing the pedigree information. The problem is that these families have complex pedigrees, with the parents (mother and father), the children, and then one grandchild for each child (do not have information about the other parent of the grandchild). I would like to call these families with their complex pedigrees together, and I would like to call all the 4 families together to maximize the power of the calling. However, I'm not sure how to do that with just the -ped option. -ped seems to be designed for only one family or cohort, and I'm not sure it would work for me to feed it all my bams as inputs. Are there any other tools for GATK that I could use to call complex pedigrees?

The other possibility would be to call the 4 trios separately and each child-grandchild pair separately, but not sure how to do that either with just -ped. What would you recommend?

And finally, I get an error message saying that --emitRefConfidence only works for single sample mode. It seems that I should omit this option when I run HaplotypeCaller on my families, but are there any other options that I should use for cohort calling besides the standard ones (which I know to be --variant_index_type and --variant_index_parameter)?

Thanks, Alva

I have a human exome experiment on which I am using hg19 resources (reference, targets, dbSNP, ... the whole shebang). I want to add some 1000Genomes exomes to this experiment, but the available BAMs are from GRCh37.

Is there a tool to port the BAMs from GRCh37 to hg19, and to continue with that? Maybe LiftOver?

Do you rather recommend re-processing the 1000Genomes BAMs on hg19? Would that mean regenerate FASTQs and re-do the whole map/MarkDup/IndelReal/BQSR steps?

For now, I have worked on the original BAMs but have renamed all the classical chromosomes from "1" to "chr1" and I got rid of the mitochondrial chromosome and all other contigs (got rid of these contigs also in the resources to avoid GATKs complaints on missing contigs). How bad would you think that is based on the differences you know between GRCh37 and hg19?

Thanks a lot for your help!

I am preparing BAM files from the 1000 genomes project to use in my GATK pipeline (along with other already processed BAMs) and I have the following issues:

• chromosome notation on my BAMs is from GRCh37 but my pipeline uses hg19, so I would like to replace chromosome notation (1 -> chr1)
• the mitochondrial chromosome is slightly different in hg19 and GRCh37 (see here), so I want to leave it out
• and actually leave out all alternate contigs

This sounds quite trivial, but I haven't found a clean way to do this yet. I have tried the following: i=INPUT.bam j=OUTPUT.bam samtools view -h $i | awk 'BEGIN{FS=OFS="\t"} (/^@/ && !/@SQ/){print$0} $2~/^SN:[1-9]|^SN:X|^SN:Y/{print$0} $3~/^[1-9]|X|Y/{$3="chr"$3; print$0} ' | sed 's/SN:/SN:chr/g' | samtools view -bS - > j However, when I try running the HaplotypeCaller, I get the following error: ##### ERROR MESSAGE: BAM file(s) do not have the contig: chrM. You are probably using a different reference than the one this file was aligned with Could you help me prepare these BAM files for processing? Thanks a lot in advance Hi GATK team, I am using GATK to call random mutations caused by mutagen. I have more than 5 mutant lines sequenced and I want to call SNP/INDELs which are unique for each samples (since they are randomly introduced they must be unique for each samples). Majoirity of SNP will be shared between all mutant samples as this are the natural variations between line used for mutagenesis and the Reference sequence. I am using UnifiedGenotyper (because I have sequenced pooled DNA for each mutant family) by giving all the 5 bam files from samples. I wonder, by calling SNP from all 5 mutant samples together may negatively influence the unique SNPs? Does GATK consider all the 5 samples as a population and give preference (or high score to) locus which are shared between majority of samples? In my case, would it be better to call SNP/INDEL separately for all 5 samples? The command I am using is java -jar GenomeAnalysisTKLite-2.3-9-gdcdccbb/GenomeAnalysisTKLite.jar -T UnifiedGenotyper -glm BOTH -R ref.fa -nt 10 -I results/1_sorted.bam_fixed_rmdup_realign.bam -I results/2_sorted.bam_fixed_rmdup_realign.bam -I results/3_sorted.bam_fixed_rmdup_realign.bam -I results/4_sorted.bam_fixed_rmdup_realign.bam -I results/5_sorted.bam_fixed_rmdup_realign.bam -o results/out.vcf I hope I am clear, and looking forward to learn more, Thank you in advance i have been using HaplotypeCaller in gVCF mode on a cohort of 830 samples spread over 2450 bams. the number of bams per sample varies from 1-4. for samples with <=3 bams, the routine works perfectly. but for samples with 4 bams, the jobs always crash and I receive the error: ERROR MESSAGE: Invalid command line: Argument emitRefConfidence has a bad value: Can only be used in single sample mode currently is this a bug? are there any options i can use to avoid this error. i suppose it is possible that there is an issue with my bams, but it seems odd that the error occurs systematically with 4 bam samples and never for samples with 3 or fewer bams. thanks for any help! RodWalkers are really fast and great. Is there any way to connect a BAM file to it and get the pileup in the given positions? If not, what is the appropriate way to get pileups for a small set of selection positions? When I run the BQSR with GATK 2.8.1, I can't check the PG information on output bam file. BWA->Picard dedup->GATK LocalRealign->GATK BQSR Below is the output BAM file after BQSR step. @PG ID:GATK IndelRealigner VN:2.3-9-gdcdccbb CL:knownAlleles=[(RodBinding name=knownAlleles source=/nG/Reference/hgdownload.cse.ucsc.edu/goldenPath/hg19/KNOWNINDEL/Mills_and_1000G_gold_standard.indels.hg19.vcf)] targetIntervals=/nG/Data/2077/vcf1/node1/chrY/Databind/chrY.bam.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null @PG ID:MarkDuplicates PN:MarkDuplicates VN:1.92(1464) CL:net.sf.picard.sam.MarkDuplicates INPUT=[/nG/Data/2077/step2_makebam/node1-1.bam, /nG/Data/2077/step2_makebam/node1-2.bam, /nG/Data/2077/step2_makebam/node1-3.bam, /nG/Data/2077/step2_makebam/node1-4.bam, /nG/Data/2077/step2_makebam/node1-5.bam, /nG/Data/2077/step2_makebam/node1-6.bam] OUTPUT=/dev/stdout METRICS_FILE=/nG/Data/2077/temp/picard_info.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000000 TMP_DIR=[/nG/Data/2077/temp] QUIET=true VALIDATION_STRINGENCY=LENIENT COMPRESSION_LEVEL=0 MAX_RECORDS_IN_RAM=2000000 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO CREATE_INDEX=false CREATE_MD5_FILE=false Related thread (guess) http://gatkforums.broadinstitute.org/discussion/2118/baserecalibration-prinread-don-t-create-a-header-and-don-t-obtain-oq-orignal-base-quality-in-bam Hi, I use --sample_name option to restrict output reads for PintReads walker (GATK version - 2.4.9). This option works well but the output BAM header @RG still contains sample name (SM) that was not included. Is there a way to apply this --sample_name option to BAM header at the same time? I could reheader the BAM files using samtools but I'm looking for a better and easier way. Thank you, Taka We have used bwa 0.7.4 aln and sampe to align illumina reads. Then used the following command java -Xmx6g -jar ~/GenomeAnalysisTK-2.8-1/GenomeAnalysisTK.jar -T BaseRecalibrator -I ~/temp/BIR-08_130330_I288_FCD1P68ACXX_L7_SZAIPI025187-74.sortedindelrealigned.bam -R ~/hg19/ucsc.hg19.fasta -knownSites ~/dbSNP/dbsnp_137.hg19.vcf -o ~/BIR-08_130330_I288_FCD1P68ACXX_L7_SZAIPI025187-74.sortedBQSR.grp Which gave the following error message ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace org.broadinstitute.sting.utils.exceptions.ReviewedStingException: START (90) > (89) STOP -- this should never happen, please check read: FCD1P68ACXX:7:1315:19572:52424#CGCGGTGA 1/2 90b aligned read. (CIGAR: 85M4I1M2D) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:537) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinatesRightTail(ReadClipper.java:193) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:389) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:392) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:245) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:132) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNanoTraverseReadsMap.apply(TraverseReadsNano.java:228) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNanoTraverseReadsMap.apply(TraverseReadsNano.java:216) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: START (90) > (89) STOP -- this should never happen, please check read: FCD1P68ACXX:7:1315:19572:52424#CGCGGTGA 1/2 90b aligned read. (CIGAR: 85M4I1M2D) can you help me in this error message? Why its coming and how to rectify it? Thanks in advance Mayukh Is it possible to source and read bam files directly from ftp site into Unified Genotyper without downloading them first. This option works in samtools view -- but if I try the simple straightforward way in GATK (just replacing the -I inputfilename.bam with -I ftplinktobamfile.bam ) it does not work. Is there another way of doing this? This would save me a lot of diskspace if I could do this. Hi, I was using PrintReads as part of Base Quality Recalibration. The resulting BAM header is incorrect as there are two @PG entries for GATK PrintReads. The bam file I am using is internal to the broad institute and is picard aggregated. There is an entry for GATK PrintReads in the original bam. Would it be possible for the @PG header lines to include a .A-Z like the other headers? Thanks Aaron ##### ERROR A USER ERROR has occurred (version 2.4-9-g532efad): ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed ##### ERROR Please do not post this error to the GATK forum ##### ERROR ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: SAM/BAM file /btl/projects/aaron/Mouse/G42214/TET-OT-827ctl.recalibrated.bam is malformed: Program record with group id GATK PrintReads already exists in SAMFileHeader! Hi, If I have a bam file with three different read groups, and use SplitSamFile to split it like so: java -Xmx2g -jarGATKJAR -T SplitSamFile -I $INBAM -R$GENOME --outputRoot $PROJD/$IND/


Each of the output bam files have all three read groups. Is that the intended behavior? I would like each file to have only it's own read group info in the heads. Sorry for the bash arguments in the code above, is makes in readable at least.

Thanks

Daniel

I don't know if this question has been asked, if so sorry.

When using UnifiedGenotyper, I was wondering if it was possible (via hidden command option, etc) to use a VCF file as a prior. Currently I have just been adding additional bam files, but it would be nice (and quicker) if I could use a Indexed file.

Shawn.

Hi I am using this version of GATK which is java -jar /apps1/gatk/GenomeAnalysisTK-2.6-5-gba531bd/GenomeAnalysisTK.jar

I have obtained my bam files using bowtie as aligner. So the quality score is 255 in the 5th column. I want to use UnifiedGenotyper and when i use it it gives me an error saying MappingQualityUnavailable.

So I see read filter ReassignOneMappingQualityFilter which can convert 255 to 60 and then it would be easily taken by UnifiedGenotyper to do the snp calling.

So what command line should i be using

I have just started to explore GATK. Hope to hear from you soon

Regards Varun

we are running tests trying to get UG to produce 1 vcf per sample when inputting bams from multiple subjects. our situation is complicated slightly by the fact that each sample has 3 bams. when we input all 6 bams into UG, hoping to output 2 vcfs (1 per sample) we instead get a single vcf. we found some relevant advice in this post: http://gatkforums.broadinstitute.org/discussion/2262/why-unifiedgenotyper-treat-multiple-bam-input-as-one-sample but still haven't solved the issue.

details include: 1) we are inputting 6 bams for our test, 3 per sample for 2 samples. 2) bams were generated using Bioscope from targeted capture reads sequenced on a Solid 4. 3) as recommended in the post above we checked out the @RG statements in the bam headers using Samtools -- lines for the 6 bams are as follows:

sample 1:

@RG ID:20130610202026358 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-10T16:20:26-0400 SM:S1

@RG ID:20130611214013844 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-11T17:40:13-0400 SM:S1

@RG ID:20130613002511879 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:147 DT:2013-06-12T20:25:11-0400 SM:S1

sample 2:

@RG ID:20130611021848236 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-10T22:18:48-0400 SM:S1

@RG ID:20130612014345277 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-11T21:43:45-0400 SM:S1

@RG ID:20130613085411753 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:150 DT:2013-06-13T04:54:11-0400 SM:S1

Based on the former post, I would have expected each of these bams to generate a separate vcf as it appears the ids are all different (which would not have been desirable either, as we are hoping to generate 2 vcfs in this test). Thus, it is not clear if/how we should use Picard tool AddOrReplaceReadGroups to modify the @RG headers?

Does that make sense? Any advice?

Hello,

i have spend many hours trying to run GATK depthofcoverage but it never works. last try: -T DepthOfCoverage -R /home/remi/Analyse/CNV/ERR125905/bam/Chloroplastgenomebarley.fa -I /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bam -L /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bed -o /home/remi/Analyse/CNV/FishingCNV_1.5.3/out/ERfiltre.bam.coverage --minMappingQuality 15 --minBaseQuality 10 --omitDepthOutputAtEachBase --logging_level INFO --summaryCoverageThreshold 5 --summaryCoverageThreshold 7 --summaryCoverageThreshold 10 --summaryCoverageThreshold 15 --summaryCoverageThreshold 20 --summaryCoverageThreshold 30 --summaryCoverageThreshold 50

My BAM header seem to be malformed.

ERROR MESSAGE: SAM/BAM file /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups

here is the 1rst line of the header:

@SQ SN:Chloroplastgenomebarley LN:136462 @PG ID:bwa PN:bwa VN:0.5.9-r16 ERR125905.35 99 Chloroplastgenomebarley 69543 29 101M = 69854 412 TTTGATCCCTCTGATCCTGTTCTGGATCCAATGTGGAGACAAGGTATGTTCGTAATTCCCTTCATGACTCGTTTAGGAATAACGGATCCTTGGGGTGGTTG D-:D?BDDDDCC-?ADCBBBDDDDD:BDD= :6 C-4<9@62@@<:?=B??B=DC28=B&?:AA:4 ERR125905.35 147 Chloroplastgenomebarley 69854 29 101M = 69543 -412 GGCTTTCTGTCGCTTGTGGGCTTTTCCTATAACGGCTTTTTATGTTCCTGGGATATGGGTATCCGATCCTTATGGACTAACTGGAAAAGTACAAGCTGTAA #################################################A-B49= @@2>+:CCC:@@ 66DD@-@DDD?B::@-CA:5?:ADD?ADBB??

I Have search in the forum and doc about it. I have try to reorder my header with picard:

@HD VN:1.4 SO:unsorted @SQ SN:Chloroplastgenomebarley LN:136462 UR:file:/home/remi/Analyse/REFGEN/Chloroplastgenomebarley.fa M5:7a7b36ef01cc1a2af1c8451ca3800f93 @PG ID:bwa PN:bwa VN:0.5.9-r16 ERR125905.35 99 Chloroplastgenomebarley 69543 29 101M = 69854 412 TTTGATCCCTCTGATCCTGTTCTGGATCCAATGTGGAGACAAGGTATGTTCGTAATTCCCTTCATGACTCGTTTAGGAATAACGGATCCTTGGGGTGGTTG D-:D?BDDDDCC-?ADCBBBDDDDD:BDD= :6 C-4<9@62@@<:?=B??B=DC28=B&?:AA:4 ERR125905.35 147 Chloroplastgenomebarley 69854 29 101M = 69543 -412 GGCTTTCTGTCGCTTGTGGGCTTTTCCTATAACGGCTTTTTATGTTCCTGGGATATGGGTATCCGATCCTTATGGACTAACTGGAAAAGTACAAGCTGTAA #################################################A-B49= @@2>+:CCC:@@ 66DD@-@DDD?B::@-CA:5?:ADD?ADBB?? but no more change.

Someone can help me please ?

Regards, Remi



Dear all,

I am currently running an analysis using the HaplotypeCaller on 300 large BAM files on our cluster and decided to chunk the the genome in 3MB bins in order for them to be processed in a decent time. I'm however experiencing very long runtimes as more and more jobs get scheduled to run in parallel on the same files. Looking at the GATK options, I saw these 2 that I thought could be of help and was wondering what were the recommendation for using them: --num_bam_file_handles --read_buffer_size

More precisely, does the num_bam_file_handles increase processing time by a lot? and what is the default value for --read_buffer_size ?

Thanks a lot, Laurent

I have applied PhaseByTransmission on a trio with a ped file and now want to run ReadBackedPhasing. However, each of the trio variant calls were called from a different BAM file (as each was from a different individual). In the ReadBackedPhasing documentation it only mentions using the program with a single bam. Does this mean that I need to merge the bams for each of the three individuals into a single bam? If so, do you have any suggested programs that work well with GATK?

Hello, I am a first-time user of GATK and have spent some time now on trying to get the input bam files in the appropriate format. To run IndelRealigner, I have added ReadGroups, Reordered and Index my bam file with the respective Picard-Tools.

My command-line is the following:

java -Djava.io.tmpdir='pwd'/tmp -jar GenomeAnalysisTK.jar -I ./add_read_groups_reorder_index.bam -R ./genome.fa -T IndelRealigner -targetIntervals ./gatk.intervals -o ./*.bam -known ./Mills-1000G-indels.vcf --consensusDeterminationModel KNOWNS_ONLY -LOD 0.4


I get the following message:

SAM/BAM file /home/gp53/tophat2-merge-ctl-1st-2nd-readgroups-reorder-index.bam is malformed: SAM file doesn't have any read groups defined in the header.
`

My reads are paired-end aligned with TopHat2 I will appreciate your help on this. Thanks, G.

how to add read group to the bam file using PICARD generated from GS reference mapper BAM file?

Hi, Does GATK2 provide a walker/option to summarize the read alignment in a given BAM file? The summary including total reads, reads mapped/%, reads uniquely mapped/%, reads uniquely mapped with 0mm/%, reads mapped on-target/%, reads uniquely mapped on-target%, etc is of great use to assess the mapping quality for whole genome or targeted analysis. Please advice me on how I can obtain this using any of the walkers available. Thanks, Raj

Dear All, I am very new to the analysis of NGS data.

I would like to merge the information of sample 1029 from HGDP (http://cdna.eva.mpg.de/denisova/VCF/human/HGDP01029.hg19_1000g.12.mod.vcf.gz) to SAN sample in Schuster et al 2010 ftp://ftp.bx.psu.edu/data/bushman/hg18/bam/KB1illumChr12.bam)

If I well understood, I should call the variants from the bam file and then merge with the vcf. Is it correct? Could you gently suggest me the best way to do it in your opinion? When should i convert my files to the same reference sequence?

In addition I am looking at http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0, and I am trying to do Variant Detection on the example file NA12878. I have some doubt, Where I can find MarkDuplicates tool? Should I invoke it just with -T argument? Or Do I need to install it?

I am really sorry, I am trying to understand GATK, but it is not rally intuitive, so of you have any tips or recommendation please let me know it.

I am getting the following error when running DepthOfCoverage:

I have already reheadered my bam file to fix a contig mismatch error, and the fasta dict file was generated automatically by gatk. Moreover, about 160 lines of output are generated, but I do not see any irregularities at the position where the code crashed. Please let me know what I can try. Thank you.

I was running UnifiedGenotyper on a set of 26 bam files. There was one particular position where I was comparing calls to the actual pileup and I noticed a major discrepancy. There was a no-call ("./.") for that position for one of the bam files while most other samples had calls. That non-called sample, though, had a very convincing variant in the pileup, with lots of high quality coverage at that position.

I then tried running just that bam file alone through UnifiedGenotyper, or that bam file along with two others. In both cases, the 1/1 variant is called properly with the following genotype field: