# Tagged with #bam 9 documentation articles | 0 announcements | 47 forum discussions

Created 2016-01-08 21:06:41 | Updated 2016-02-08 17:17:59 | Tags: best-practices bam markduplicates duplicates

Here we discuss two tools, MarkDuplicates and MarkDuplicatesWithMateCigar, that flag duplicates. We provide example data and example commands for you to follow along the tutorial (section 1) and include tips in estimating library complexity for PCR-free samples and patterned flow cell technologies. In section 2, we point out special memory considerations for these tools. In section 3, we highlight the similarities and differences between the two tools. Finally, we get into some details that may be of interest to some that includes comments on the metrics file (section 4).

To mark duplicates in RNA-Seq data, use MarkDuplicates. Reasons are explained in section 2 and section 3. And if you are considering using MarkDuplicatesWithMateCigar for your DNA data, be sure insert lengths are short and you have a low percentage of split or multi-mapping records.

Obviously, expect more duplicates for samples prepared with PCR than for PCR-free preparations. Duplicates arise from various sources, including within the sequencing run. As such, even PCR-free data can give rise to duplicates, albeit at low rates, as illustrated here with our example data.

#### Prerequisites

• Installed Picard tools
• Coordinate-sorted and indexed BAM alignment data. Secondary/supplementary alignments are flagged appropriately (256 and 2048 flags) and additionally with the mate unmapped (8) flag. See the MergeBamAlignment section (3C) of Tutorial#6483 for a description of how MergeBamAlignment ensures such flagging.
• For MarkDuplicatesWithMateCigar, pre-computed Mate CIGAR (MC) tags. Data produced according to Tutorial#6483 will have the MC tags added by MergeBamAlignment. Alternatively, see tools RevertOriginalBaseQualitiesAndAddMateCigar and FixMateInformation.
• Appropriately assigned Read Group (RG) information. Read Group library (RGLB) information is factored for molecular duplicate detection. Optical duplicates are limited to those from the same RGID.

• Use the advanced tutorial bundle's human_g1k_v37_decoy.fasta as reference. This same reference is available to load in IGV.
• tutorial_6747.tar.gz data contain human paired 2x150 whole genome sequence reads originally aligning at ~30x depth of coverage. The sample is a PCR-free preparation of the NA12878 individual run on the HiSeq X platform. This machine type, along with HiSeq 4000, has the newer patterned flow cell that differs from the typical non-patterned flow cell. I took the reads aligning to a one Mbp genomic interval (10:96,000,000-97,000,000) and sanitized and realigned the reads (BWA-MEM -M) to the entire genome according to the workflow presented in Tutorial#6483 to produce snippet.bam. The data has (i) no supplementary records; (ii) secondary records flagged with the 256 flag and the mate-unmapped (8) flag; and (iii) unmapped records (4 flag) with mapped mates (mates have 8 flag), zero MAPQ (column 5) and asterisks for CIGAR (column 6). The notation allows read pairs where one mate maps and the other does not to sort and remain together when we apply genomic intervals such as in the generation of the snippet.

## 1. Commands for MarkDuplicates and MarkDuplicatesWithMateCigar

The following commands take a coordinate-sorted and indexed BAM and return (i) a BAM with the same records in coordinate order and with duplicates marked by the 1024 flag, (ii) a duplication metrics file, and (iii) an optional matching BAI index.

For a given file with all MC (mate CIGAR) tags accounted for:

• and where all mates are accounted for, each tool--MarkDuplicates and MarkDuplicatesWithMateCigar--examines the same duplicate sets but prioritize which inserts get marked duplicate differently. This situation is represented by our snippet example data.
• but containing missing mates records, MarkDuplicates ignores the records, while MarkDuplicatesWithMateCigar still considers them for duplicate marking using the MC tag for mate information. Again, the duplicate scoring methods differ for each tool.

Use the following commands to flag duplicates for 6747_snippet.bam. These commands produce qualitatively different data.

Score duplicate sets based on the sum of base qualities using MarkDuplicates:

java -Xmx32G -jar picard.jar MarkDuplicates \
INPUT=6747_snippet.bam \ #specify multiple times to merge
OUTPUT=6747_snippet_markduplicates.bam \
METRICS_FILE=6747_snippet_markduplicates_metrics.txt \
OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \ #changed from default of 100
CREATE_INDEX=true \ #optional
TMP_DIR=/tmp

Score duplicate sets based on total mapped reference length using MarkDuplicatesWithMateCigar:

java -Xmx32G -jar picard.jar MarkDuplicatesWithMateCigar \
INPUT=6747_snippet.bam \ #specify multiple times to merge
OUTPUT=6747_snippet_markduplicateswithmatecigar.bam \
METRICS_FILE=6747_snippet_markduplicateswithmatecigar_metrics.txt \
OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \ #changed from default of 100
CREATE_INDEX=true \ #optional
TMP_DIR=/tmp

• Each tool has a distinct default DUPLICATE_SCORING_STRATEGY. For MarkDuplicatesWithMateCigar it is TOTAL_MAPPED_REFERENCE_LENGTH and this is the only scoring strategy available. For MarkDuplicates you can switch the DUPLICATE_SCORING_STRATEGY between the default SUM_OF_BASE_QUALITIES and TOTAL_MAPPED_REFERENCE_LENGTH. Both scoring strategies use information pertaining to both mates in a pair, but in the case of MarkDuplicatesWithMateCigar the information for the mate comes from the read's MC tag and not from the actual mate.
• To merge multiple files into a single output, e.g. when aggregating a sample from across lanes, specify the INPUT parameter for each file. The tools merge the read records from the multiple files into the single output file. The tools marks duplicates for the entire library (RGLB) and accounts for optical duplicates per RGID. INPUT files must be coordinate sorted and indexed.
• The Broad's production workflow increases OPTICAL_DUPLICATE_PIXEL_DISTANCE to 2500, to better estimate library complexity. The default setting for this parameter is 100. Changing this parameter does not alter duplicate marking. It only changes the count for optical duplicates and the library complexity estimate in the metrics file in that whatever is counted as an optical duplicate does not factor towards library complexity. The increase has to do with the fact that our example data was sequenced in a patterned flow cell of a HiSeq X machine. Both HiSeq X and HiSeq 4000 technologies decrease pixel unit area by 10-fold and so the equivalent pixel distance in non-patterned flow cells is 250. You may ask why are we still counting optical duplicates for patterned flow cells that by design should have no optical duplicates. We are hijacking this feature of the tools to account for other types of duplicates arising from the sequencer. Sequencer duplicates are not limited to optical duplicates and should be differentiated from PCR duplicates for more accurate library complexity estimates.
• By default the tools flag duplicates and retain them in the output file. To remove the duplicate records from the resulting file, set the REMOVE_DUPLICATES parameter to true. However, given you can set GATK tools to include duplicates in analyses by adding -drf DuplicateRead to commands, a better option for value-added storage efficiency is to retain the resulting marked file over the input file.
• To optionally create a .bai index, add and set the CREATE_INDEX parameter to true.

For snippet, the duplication metrics are identical whether marked by MarkDuplicates or MarkDuplicatesWithMateCigar. We have 13.4008% duplication, with 255 unpaired read duplicates and 18,254 paired read duplicates. However, as the screenshot at the top of this page illustrates, and as section 4 explains, the data qualitatively differ.

## 2. Slow or out of memory error? Special memory considerations for duplicate marking tools

The seemingly simple task of marking duplicates is one of the most memory hungry processes, especially for paired end reads. Both tools are compute-intensive and require upping memory compared to other processes.

Because of the single-pass nature of MarkDuplicatesWithMateCigar, for a given file its memory requirements can be greater than for MarkDuplicates. What this means is that MarkDuplicatesWithMateCigar streams the duplicate marking routine in a manner that allows for piping. Due to these memory constraints for MarkDuplicatesWithMateCigar, we recommend MarkDuplicates for alignments that have large reference skips, e.g. spliced RNA alignments.

For large files, (1) use the Java -Xmx setting and (2) set the environmental variable TMP_DIR for a temporary directory. These options allow the tool to run without slowing down as well as run without causing an out of memory error. For the purposes of this tutorial, commands are given as if the example data is a large file, which we know it is not.

    java -Xmx32G -jar picard.jar MarkDuplicates \
... \
TMP_DIR=/tmp 

These options can be omitted for small files such as the example data and the equivalent command is as follows.

    java -jar picard.jar MarkDuplicates ...   

### Set the java maxheapsize, specified by the -Xmx#G option, to the maximum your system allows.

The high memory cost, especially for MarkDuplicatesWithMateCigar, is in part because the tool systematically traverses genomic coordinate intervals for inserts in question, and for every read it marks as a duplicate it must keep track of the mate, which may or may not map nearby, so that reads are marked as pairs with each record emitted in its coordinate turn. In the meanwhile, this information is held in memory, which is the first choice for faster processing, until the memory limit is reached, at which point memory spills to disk. We set this limit high to minimize instances of memory spilling to disk.

In the example command, the -Xmx32G Java option caps the maximum heap size, or memory usage, to 32 gigabytes, which is the limit on the server I use. This is in contrast to the 8G setting I use for other processes on the same sample data--a 75G BAM file. To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize.

### Set an additional temporary directory with the TMP_DIR parameter for memory spillage.

When the tool hits the memory limit, memory spills to disk. This causes data to traverse in and out of the processor's I/O device, slowing the process down. Disk is a location you specify with the TMP_DIR parameter. If you work on a server separate from where you read and write files to, setting TMP_DIR to the server's local temporary directory (typically /tmp) can reduce processing time compared to setting it to the storage disk. This is because the tool then additionally avoids traversing the network file system when spilling memory. Be sure the TMP_DIR location you specify provides enough storage space. Use df -h to see how much is available.

## 3. Conceptual overview of duplicate flagging

The aim of duplicate marking is to flag all but one of a duplicate set as duplicates and to use duplicate metrics to estimate library complexity. Duplicates have a higher probability of being non-independent measurements from the exact same template DNA. Duplicate inserts are marked by the 0x400 bit (1024 flag) in the second column of a SAM record, for each mate of a pair. This allows downstream GATK tools to exclude duplicates from analyses (most do this by default). Certain duplicates, i.e. PCR and sequencer duplicates, violate assumptions of variant calling and also potentially amplify errors. Removing these, even at the cost of removing serendipitous biological duplicates, allows us to be conservative in calculating the confidence of variants.

GATK tools allow you to disable the duplicate read filter with -drf DuplicateRead so you can include duplicates in analyses.

For a whole genome DNA sample, duplicates arise from three sources: (i) in DNA shearing from distinct molecular templates identical in insert mapping, (ii) from PCR amplification of a template (PCR duplicates), and (iii) from sequencing, e.g. optical duplicates. The tools cannot distinguish between these types of duplicates with the exception of optical duplicates. In estimating library complexity, the latter two types of duplicates are undesirable and should each factor differently.

When should we not care about duplicates? Given duplication metrics, we can make some judgement calls on the quality of our sample preparation and sequencer run. Of course, we may not expect a complex library if our samples are targeted amplicons. Also, we may expect minimal duplicates if our samples are PCR-free. Or it may be that because of the variation inherent in expression level data, e.g. RNA-Seq, duplicate marking becomes ritualistic. Unless you are certain of your edge case (amplicon sequencing, RNA-Seq allele-specific expression analysis, etc.) where duplicate marking adds minimal value, you should go ahead and mark duplicates. You may find yourself staring at an IGV session trying to visually calculate the strength of the evidence for a variant. We can pat ourselves on the back for having the forethought to systematically mark duplicates and turn on the IGV duplicate filter.

The Broad's Genomics Platform uses MarkDuplicates twice for multiplexed samples. Duplicates are flagged first per sample per lane to estimate lane-level library complexity, and second to aggregate data per sample while marking all library duplicates. In the second pass, duplicate marking tools again assess all reads for duplicates and overwrite any prior flags.

Our two duplicate flagging tools share common features but differ at the core. As the name implies, MarkDuplicatesWithMateCigar uses the MC (mate CIGAR) tag for mate alignment information. Unlike MarkDuplicates, it is a single-pass tool that requires pre-computed MC tags.

• For RNA-Seq data mapped against the genome, use MarkDuplicates. Specifically, MarkDuplicatesWithMateCigar will refuse to process data with large reference skips frequent in spliced RNA transcripts where the gaps are denoted with an N in the CIGAR string.
• Both tools only consider primary mappings, even if mapped to different contigs, and ignore secondary/supplementary alignments (256 flag and 2048 flag) altogether. Because of this, before flagging duplicates, be sure to mark primary alignments according to a strategy most suited to your experimental aims. See MergeBamAlignment's PRIMARY_ALIGNMENT_STRATEGY parameter for strategies the tool considers for changing primary markings made by an aligner.
• Both tools identify duplicate sets identically with the exception that MarkDuplicatesWithMateCigar additionally considers reads with missing mates. Missing mates occur for example when aligned reads are filtered using an interval list of genomic regions. This creates divorced reads whose mates aligned outside the targeted intervals.
• Both tools identify duplicates as sets of read pairs that have the same unclipped alignment start and unclipped alignment end. The tools intelligently factor for discordant pair orientations given these start and end coordinates. Within a duplicate set, with the exception of optical duplicates, read pairs may have either pair orientation--F1R2 or F2R1. For optical duplicates, pairs in the set must have the same orientation. Why this is is explained in section 4.
• Both tools take into account clipped and gapped alignments and singly mapping reads (mate unmapped and not secondary/supplementary).
• Each tool flags duplicates according to different priorities. MarkDuplicatesWithMateCigar prioritizes which pair to leave as the representative nondup based on the total mapped length of a pair while MarkDuplicates can prioritize based on the sum of base qualities of a pair (default) or the total mapped length of a pair. Duplicate inserts are marked at both ends.

## 4. Details of interest to some

To reach a high target coverage depth, some fraction of sequenced reads will by stochastic means be duplicate reads.

Let us hope the truth of a variant never comes down to so few reads that duplicates should matter so. Keep in mind the better evidence for a variant is the presence of overlapping reads that contain the variant. Also, take estimated library complexity at face value--an estimate.

### Don't be duped by identical numbers. Data from the two tools qualitatively differ.

First, let me reiterate that secondary and supplementary alignment records are skipped and never flagged as duplicate.

Given a file with no missing mates, each tool identifies the same duplicate sets from primary alignments only and therefore the same number of duplicates. To reiterate, the number of identical loci or duplicate sets and the records within each set are the same for each tool. However, each tool differs in how it decides which insert(s) within a set get flagged and thus which insert remains the representative nondup. Also, if there are ties, the tools may break them differently in that tie-breaking can depend on the sort order of the records in memory.

• MarkDuplicates by default prioritizes the sum of base qualities for both mates of a pair. The pair with the highest sum of base qualities remains as the nondup.
• As a consequence of using the mate's CIGAR string (provided by the MC tag), MarkDuplicatesWithMateCigar can only prioritize the total mapped reference length, as provided by the CIGAR string, in scoring duplicates in a set. The pair with the longest mapping length remains as the nondup.
• If there are ties after applying each scoring strategy, both tools break the ties down a chain of deterministic factors starting with read name.

### Duplicate metrics in brief

We can break down the metrics file into two parts: (1) a table of metrics that counts various categories of duplicates and gives the library complexity estimate, and (2) histogram values in two columns.

See DuplicationMetrics for descriptions of each metric. For paired reads, duplicates are considered for the insert. For single end reads, duplicates are considered singly for the read, increasing the likelihood of being identified as a duplicate. Given the lack of insert-level information for these singly mapping reads, the insert metrics calculations exclude these.

The library complexity estimate only considers the duplicates that remain after subtracting out optical duplicates. For the math to derive estimated library size, see formula (1.2) in Mathematical Notes on SAMtools Algorithms.

The histogram values extrapolate the calculated library complexity to a saturation curve plotting the gains in complexity if you sequence additional aliquots of the same library. The first bin's value represents the current complexity.

### Pair orientation F1R2 is distinct from F2R1 for optical duplicates

Here we refer you to a five minute video illustrating what happens at the molecular level in a typical sequencing by synthesis run.

What I would like to highlight is that each strand of an insert has a chance to seed a different cluster. I will also point out, due to sequencing chemistry, F1 and R1 reads typically have better base qualities than F2 and R2 reads.

Optical duplicate designation requires the same pair orientation.

Let us work out the implications of this for a paired end, unstranded DNA library. During sequencing, within the flow cell, for a particular insert produced by sample preparation, the strands of the insert are separated and each strand has a chance to seed a different cluster. Let's say for InsertAB, ClusterA and ClusterB and for InsertCD, ClusterC and ClusterD. InsertAB and InsertCD are identical in sequence and length and map to the same loci. It is possible InsertAB and InsertCD are PCR duplicates and also possible they represent original inserts. Each strand is then sequenced in the forward and reverse to give four pieces of information in total for the given insert, e.g. ReadPairA and ReadPairB for InsertAB. The pair orientation of these two pairs are reversed--one cluster will give F1R2 and the other will give F2R1 pair orientation. Both read pairs map exactly to the same loci. Our duplicate marking tools consider ReadPairA and ReadPairB in the same duplicate set for regular duplicates but not for optical duplicates. Optical duplicates require identical pair orientation.

Created 2015-11-30 20:10:43 | Updated 2015-12-02 21:52:16 | Tags: printreads bam pre-processing

#### Prerequisites

• Installed GATK tools
• Reference genome
• Coordinate-sorted, aligned and indexed BAM

• Use the advanced tutorial bundle's human_g1k_v37_decoy.fasta as reference
• tutorial_6517.tar.gz contains four files: 6517_2Mbp_input.bam and .bai covering reads aligning to 10:90,000,000-92,000,000 and 6517_1Mbp_output.bam and .bai covering 10:91,000,000-92,000,000

### Create a snippet of reads corresponding to a genomic interval using PrintReads

PrintReads merges or subsets sequence data. The tool automatically applies MalformedReadFilter and BadCigarFilter to filter out certain types of reads that cause problems for downstream GATK tools, e.g. reads with mismatching numbers of bases and base qualities or reads with CIGAR strings containing the N operator.

• To create a test snippet of RNA-Seq data that retains reads with Ns in CIGAR strings, use -U ALLOW_N_CIGAR_READS.

Subsetting reads corresponding to a genomic interval using PrintReads requires reads that are aligned to a reference genome, coordinate-sorted and indexed. Place the .bai index in the same directory as the .bam file.

java -Xmx8G -jar /path/GenomeAnalysisTK.jar \
-R /path/human_g1k_v37_decoy.fasta \ #reference fasta
-L 10:91000000-92000000 \ #desired genomic interval chr:start-end
-I 6517_2Mbp_input.bam \ #input
-o 6517_1Mbp_output.bam 

This creates a subset of reads from the input file, 6517_2Mbp_input.bam, that align to the interval defined by the -L option, here a 1 Mbp region on chromosome 10. The tool creates two new files, 6517_1Mbp_output.bam and corresponding index 6517_1Mbp_output.bai.

• For paired reads, the tool does not account for reads whose mate aligns outside of the defined interval. To filter these lost mate reads, use RevertSam's SANITIZE option.

To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee #sets environmental variable for temporary directory

Created 2015-11-24 22:15:30 | Updated 2015-12-16 22:43:04 | Tags: bam igv

Visualize sequence read alignment data (BAM or SAM) on IGV using this quick-start tutorial. The Integrative Genomics Viewer is a non-GATK tool developed at the Broad Institute that allows for interactive exploration of large genomic datasets.

#### Prerequisites

• Coordinate-sorted and aligned BAM or SAM file
• Corresponding BAI index
• Matching reference genome to which the reads align. See IGV hosted genomes to check if IGV hosts a reference genome or this page for instructions on loading a .genome or FASTA file genome.

• tutorial_6491.tar.gz contains a coordinated-sorted BAM and corresponding BAI. Most reads align to a 1 Mbp genomic interval on chromosome 10 (10:96,000,000–97,000,000) of the human GRCh37 reference assembly. Specifically, reads align to GATK bundle's human_g1k_v37_decoy.fasta that corresponds to the Human (1kg, b37+decoy) reference hosted by IGV.

## View aligned reads using IGV

To view aligned reads using the Integrative Genomics Viewer (IGV), the SAM or BAM file must be coordinate-sorted and indexed.

1. Always load the reference genome first. Go to Genomes>Load Genome From Server or load from the drop-down menu in the upper left corner. Select Human (1kg, b37+decoy).
2. Load the data file. Go to File>Load from File and select 6491_snippet.bam. IGV automatically uses the corresponding 6491_snippet.bai index in the same folder.
3. Zoom in to see alignments. For our tutorial data, copy and paste 10:96,867,400-96,869,400 into the textbox at the top and press Go. A 2 kbp region of chromosome 10 comes into view as shown in the screenshot above.

Alongside read data, IGV automatically generates a coverage track that sums the depth of reads for each genomic position.

## Find a specific read and view as pairs

1. Right-click on the alignment track and Select by name. Copy and paste H0164ALXX140820:2:2107:7323:30703 into the read name textbox and press OK. IGV will highlight two reads corresponding to this query name in bold red.
2. Right-click on the alignment track and select View as pairs. The two highlighted reads will display in the same row connected by a line as shown in the screenshot.

Because IGV holds in memory a limited set of data overlapping with the genomic interval in view (this is what makes IGV fast), the select by name feature also applies only to the data that you call into view. For example, we know this read has a secondary alignment on contig hs37d5 (hs37d5:10,198,000-10,200,000).

### Some tips

If you find IGV sluggish, download a Java Web Start jnlp version of IGV that allows more memory. The highest memory setting as of this writing is 10 GB (RAM) for machines with 64-bit Java. For the tutorial example data, the typical 2 GB allocation is sufficient.

• To run the jnlp version of IGV, you may need to adjust your system's Java Control Panel settings, e.g. enable Java content in the browser. Also, when first opening the jnlp, overcome Mac OS X's gatekeeper function by right-clicking the saved jnlp and selecting Open with Java Web Start.

To change display settings, check out either the Alignment Preferences panel or the Alignment track Pop-up menu. For persistent changes to your IGV display settings, use the Preferences panel. For track-by-track changes, use the Pop-up menus.

Default Alignment Preferences settings are tuned to genomic sequence libraries. Go to View>Preferences and make sure the settings under the Alignments tab allows you to view reads of interest, e.g. duplicate reads.

• IGV saves any changes you make to these settings and applies them to future sessions.
• Some changes apply only to new sessions started after the change.
• To restore default preferences, delete or rename the prefs.properties file within your system's igv folder. IGV automatically generates a new prefs.properties file with default settings. See [IGV's user guide] for details.

After loading data, adjust viewing modes specific to track type by right-clicking on a track to pop up a menu of options. For alignment tracks, these options are described here.

Created 2015-11-23 21:11:51 | Updated 2015-12-10 21:53:28 | Tags: bam readgroup fastq pre-processing ubam

Here we outline how to generate an unmapped BAM (uBAM) from either a FASTQ or aligned BAM file. We use Picard's FastqToSam to convert a FASTQ (Option A) or Picard's RevertSam to convert an aligned BAM (Option B).

(A) Convert FASTQ to uBAM and add read group information using FastqToSam (B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

#### Prerequisites

• Installed Picard tools

Tutorial data reads were originally aligned to the advanced tutorial bundle's human_g1k_v37_decoy.fasta reference and to 10:91,000,000-92,000,000.

#### Related resources

• Our dictionary entry on read groups discusses the importance of assigning appropriate read group fields that differentiate samples and factors that contribute to batch effects, e.g. flow cell lane. Be sure your read group fields conform to the recommendations.
• This How to is part of a larger workflow and tutorial on (How to) Efficiently map and clean up short read sequence data.
• To extract reads in a genomic interval from the aligned BAM, use GATK's PrintReads.
• In the future we will post on how to generate a uBAM from BCL data using IlluminaBasecallsToSAM.

### (A) Convert FASTQ to uBAM and add read group information using FastqToSam

Picard's FastqToSam transforms a FASTQ file to an unmapped BAM, requires two read group fields and makes optional specification of other read group fields. In the command below we note which fields are required for GATK Best Practices Workflows. All other read group fields are optional.

java -Xmx8G -jar /path/picard.jar FastqToSam \
FASTQ=6484_snippet_XT_interleaved.fq \ #our single tutorial file contains both reads in a pair
OUTPUT=6484_snippet_fastqtosam.bam \
READ_GROUP_NAME=H0164.2 \ #required; changed from default of A
SAMPLE_NAME=NA12878 \ #required
LIBRARY_NAME=Solexa-272222 \ #required
PLATFORM_UNIT=H0164ALXX140820.2 \
PLATFORM=illumina \ #recommended
SEQUENCING_CENTER=BI \
RUN_DATE=2014-08-20T00:00:00-0400

Some details on select parameters:

• QUALITY_FORMAT is detected automatically if unspecified.
• SORT_ORDER by default is queryname.
• Specify both FASTQ and FASTQ2 for paired reads in separate files.
• PLATFORM_UNIT is often in run_barcode.lane format. Include if sample is multiplexed.
• RUN_DATE is in Iso8601 date format.

### (B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

We use Picard's RevertSam to remove alignment information and generate an unmapped BAM (uBAM). For our tutorial file we have to call on some additional parameters that we explain below. This illustrates the need to cater the tool's parameters to each dataset. As such, it is a good idea to test the reversion process on a subset of reads before committing to reverting the entirety of a large BAM. Follow the directions in this How to to create a snippet of aligned reads corresponding to a genomic interval.

We use the following parameters.

java -Xmx8G -jar /path/picard.jar RevertSam \
I=6484_snippet.bam \
O=6484_snippet_revertsam.bam \
SANITIZE=true \
MAX_DISCARD_FRACTION=0.005 \ #informational; does not affect processing
ATTRIBUTE_TO_CLEAR=XT \
ATTRIBUTE_TO_CLEAR=XN \
ATTRIBUTE_TO_CLEAR=AS \ #Picard release of 9/2015 clears AS by default
ATTRIBUTE_TO_CLEAR=OC \
ATTRIBUTE_TO_CLEAR=OP \
SORT_ORDER=queryname \ #default
RESTORE_ORIGINAL_QUALITIES=true \ #default
REMOVE_DUPLICATE_INFORMATION=true \ #default
REMOVE_ALIGNMENT_INFORMATION=true #default

To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee #sets environmental variable for temporary directory

We invoke or change multiple RevertSam parameters to generate an unmapped BAM

• We remove nonstandard alignment tags with the ATTRIBUTE_TO_CLEAR option. Standard tags cleared by default are NM, UQ, PG, MD, MQ, SA, MC, and AS tags (AS for Picard releases starting 9/2015). Additionally, the OQ tag is removed by the default RESTORE_ORIGINAL_QUALITIES parameter. Remove all other nonstandard tags by specifying each with the ATTRIBUTE_TO_CLEAR option. For example, we clear the XT tag using this option for our tutorial file so that it is free for use by other tools, e.g. MarkIlluminaAdapters. To list all tags within a BAM, use the command below.

samtools view input.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[1]++) { print }}'  For the tutorial file, this gives RG, OC, XN, OP and XT tags as well as those removed by default. We remove all of these except the RG tag. See your aligner's documentation and the Sequence Alignment/Map Format Specification for descriptions of tags. • Additionally, we invoke the SANITIZE option to remove reads that cause problems for certain tools, e.g. MarkIlluminaAdapters. Downstream tools will have problems with paired reads with missing mates, duplicated records, and records with mismatches in length of bases and qualities. Any paired reads file subset for a genomic interval requires sanitizing to remove reads with lost mates that align outside of the interval. • In this command, we've set MAX_DISCARD_FRACTION to a more strict threshold of 0.005 instead of the default 0.01. Whether or not this fraction is reached, the tool informs you of the number and fraction of reads it discards. This parameter asks the tool to additionally inform you of the discarded fraction via an exception as it finishes processing. Exception in thread "main" picard.PicardException: Discarded 0.787% which is above MAX_DISCARD_FRACTION of 0.500%  Some comments on options kept at default: • SORT_ORDER=queryname For paired read files, because each read in a pair has the same query name, sorting results in interleaved reads. This means that reads in a pair are listed consecutively within the same file. We make sure to alter the previous sort order. Coordinate sorted reads result in the aligner incorrectly estimating insert size from blocks of paired reads as they are not randomly distributed. • RESTORE_ORIGINAL_QUALITIES=true Restoring original base qualities to the QUAL field requires OQ tags listing original qualities. The OQ tag uses the same encoding as the QUAL field, e.g. ASCII Phred-scaled base quality+33 for tutorial data. After restoring the QUAL field, RevertSam removes the tag. • REMOVE_ALIGNMENT_INFORMATION=true will remove program group records and alignment flag and tag information. For example, flags reset to unmapped values, e.g. 77 and 141 for paired reads. The parameter also invokes the default ATTRIBUTE_TO_CLEAR parameter which removes standard alignment tags. RevertSam ignores ATTRIBUTE_TO_CLEAR when REMOVE_ALIGNMENT_INFORMATION=false. Below we show below a read pair before and after RevertSam from the tutorial data. Notice the first listed read in the pair becomes reverse-complemented after RevertSam. This restores how reads are represented when they come off the sequencer--5' to 3' of the read being sequenced. For 6484_snippet.bam, SANITIZE removes 2,202 out of 279,796 (0.787%) reads, leaving us with 277,594 reads. Original BAM H0164ALXX140820:2:1101:10003:23460 83 10 91515318 60 151M = 91515130 -339 CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA :<<=>@AAB@AA@AA>6@@A:>,*@A@<@??@8?9>@==8?:?@?;?:><??@>==9?>8>@:?>>=>;<==>>;>?=?>>=<==>>=>9<=>??>?>;8>?><?<=:>>>;4>=>7=6>=>>=><;=;>===?=>=>>?9>>>>??==== MC:Z:60M91S MD:Z:151 PG:Z:MarkDuplicates RG:Z:H0164.2 NM:i:0 MQ:i:0 OQ:Z:<FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA UQ:i:0 AS:i:151 H0164ALXX140820:2:1101:10003:23460 163 10 91515130 0 60M91S = 91515318 339 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC :0;.=;8?7==?794<<;:>769=,<;0:=<0=:9===/,:-==29>;,5,98=599;<=########################################################################################### SA:Z:2,33141573,-,37S69M45S,0,1; MC:Z:151M MD:Z:48T4T6 PG:Z:MarkDuplicates RG:Z:H0164.2 NM:i:2 MQ:i:60 OQ:Z:<-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### UQ:i:49 AS:i:50 After RevertSam H0164ALXX140820:2:1101:10003:23460 77 * 0 0 * * 0 0 TGAGCTGGAAAGATTGCTTTTGCCCTGAAGTCTGAGGCGGCAGTGAGCCATGACTGCACCACTGCATTCCAGCCTGGGTGACAGAACAAGACCTTGTCTCTTTAAAAGAGGAAAGAAAAGGGAAAGGGAAAGGGAAGGGGAAGGGGATGGG AFFFFAJJFJAJJJJJFJJJJJAFA<JFJJJJ7J<JJJFFJJJFJFJFJJJAFJJJJJJJFFJJJJFJFJJJJFJJFJJJJJFJJJJJAJJAJFAJFJJJFFJAJAJJJAJ<FFJ H0164ALXX140820:2:1101:10003:23460 141 * 0 0 * * 0 0 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### RG:Z:H0164.2 back to top Created 2015-11-23 20:55:24 | Updated 2016-01-19 13:30:21 | Tags: bam fastq bwa-mem pre-processing ubam merged-bams If you are interested in emulating the methods used by the Broad Genomics Platform to pre-process your short read sequencing data, you have landed on the right page. The parsimonious operating procedures outlined in this three-step workflow both maximize data quality, storage and processing efficiency to produce a mapped and clean BAM. This clean BAM is ready for analysis workflows that start with MarkDuplicates. Since your sequencing data could be in a number of formats, the first step of this workflow refers you to specific methods to generate a compatible unmapped BAM (uBAM, Tutorial#6484) or (uBAMXT, Tutorial#6570 coming soon). Not all unmapped BAMs are equal and these methods emphasize cleaning up prior meta information while giving you the opportunity to assign proper read group fields. The second step of the workflow has you marking adapter sequences, e.g. arising from read-through of short inserts, using MarkIlluminaAdapters such that they contribute minimally to alignments and allow the aligner to map otherwise unmappable reads. The third step pipes three processes to produce the final BAM. Piping SamToFastq, BWA-MEM and MergeBamAlignment saves time and allows you to bypass storage of larger intermediate FASTQ and SAM files. In particular, MergeBamAlignment merges defined information from the aligned SAM with that of the uBAM to conserve read data, and importantly, it generates additional meta information and unifies meta data. The resulting clean BAM is coordinate sorted, indexed. The workflow reflects a lossless operating procedure that retains original sequencing read information within the final BAM file such that data is amenable to reversion and analysis by different means. These practices make scaling up and long-term storage efficient, as one needs only keep the final BAM file. Geraldine_VdAuwera points out that there are many different ways of correctly preprocessing HTS data for variant discovery and ours is only one approach. So keep this in mind. We present this workflow using real data from a public sample. The original data file, called Solexa-272222, is large at 150 GB. The file contains 151 bp paired PCR-free reads giving 30x coverage of a human whole genome sample referred to as NA12878. The entire sample library was sequenced in a single flow cell lane and thereby assigns all the reads the same read group ID. The example commands work both on this large file and on smaller files containing a subset of the reads, collectively referred to as snippet. NA12878 has a variant in exon 5 of the CYP2C19 gene, on the portion of chromosome 10 covered by the snippet, resulting in a nonfunctional protein. Consistent with GATK's recommendation of using the most up-to-date tools, for the given example results, with the exception of BWA, we used the most current versions of tools as of their testing (September to December 2015). We provide illustrative example results, some of which were derived from processing the original large file and some of which show intermediate stages skipped by this workflow. Download example snippet data to follow along the tutorial. We welcome feedback. Share your suggestions in the Comments section at the bottom of this page. #### Jump to a section #### Tools involved #### Prerequisites • Installed Picard tools • Installed GATK tools • Installed BWA • Reference genome • Illumina or similar tech DNA sequence reads file containing data corresponding to one read group ID. That is, the file contains data from one sample and from one flow cell lane. #### Download example data • Use the advanced tutorial bundle's human_g1k_v37_decoy.fasta as reference. This same reference is available to load in IGV. • I divided the example data into two tarballs: tutorial_6483_piped.tar.gz contains the files for the piped process and tutorial_6483_intermediate_files.tar.gz contains the intermediate files produced by running each process independently. The data contain reads originally aligning to a one Mbp genomic interval (10:96,000,000-97,000,000) of GRCh37. The table shows the steps of the workflow, corresponding input and output example data files and approximate minutes and disk space needed to process each step. Additionally, we tabulate the time and minimum storage needed to complete the workflow as presented (piped) or without piping. #### Related resources #### Other notes • When transforming data files, we stick to using Picard tools over other tools to avoid subtle incompatibilities. • For large files, (1) use the Java -Xmx setting and (2) set the environmental variable TMP_DIR for a temporary directory. java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \ TMP_DIR=/path/shlee  In the command, the -Xmx8G Java option caps the maximum heap size, or memory usage, to eight gigabytes. The path given by TMP_DIR points the tool to scratch space that it can use. These options allow the tool to run without slowing down as well as run without causing an out of memory error. The -Xmx settings we provide here are more than sufficient for most cases. For GATK, 4G is standard, while for Picard less is needed. Some tools, e.g. MarkDuplicates, may require more. These options can be omitted for small files such as the example data and the equivalent command is as follows. java -jar /path/picard.jar MarkIlluminaAdapters  To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize. Note that any setting beyond available memory spills to storage and slows a system down. If multithreading, increase memory proportionately to the number of threads. e.g. if 1G is the minimum required for one thread, then use 2G for two threads. • When I call default options within a command, follow suit to ensure the same results. ## 1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL If you have raw reads data in BAM format with appropriately assigned read group fields, then you can start with step 2. Namely, besides differentiating samples, the read group ID should differentiate factors contributing to technical batch effects, i.e. flow cell lane. If not, you need to reassign read group fields. This dictionary post describes factors to consider and this post and this post provide some strategic advice on handling multiplexed data. If your reads are mapped, or in BCL or FASTQ format, then generate an unmapped BAM according to the following instructions. • To convert FASTQ or revert aligned BAM files, follow directions in Tutorial#6484. The resulting uBAM needs to have its adapter sequences marked as outlined in the next step (step 2). • To convert an Illumina Base Call files (BCL) use IlluminaBasecallsToSam. The tool marks adapter sequences at the same time. The resulting uBAMXT has adapter sequences marked with the XT tag so you can skip step 2 of this workflow and go directly to step 3. The corresponding Tutorial#6570 is coming soon. See if you can revert 6483_snippet.bam, containing 279,534 aligned reads, to the unmapped 6383_snippet_revertsam.bam, containing 275,546 reads. back to top ## 2. Mark adapter sequences using MarkIlluminaAdapters MarkIlluminaAdapters adds the XT tag to a read record to mark the 5' start position of the specified adapter sequence and produces a metrics file. Some of the marked adapters come from concatenated adapters that randomly arise from the primordial soup that is a PCR reaction. Others represent read-through to 3' adapter ends of reads and arise from insert sizes that are shorter than the read length. In some instances read-though can affect the majority of reads in a sample, e.g. in Nextera library samples over-titrated with transposomes, and render these reads unmappable by certain aligners. Tools such as SamToFastq use the XT tag in various ways to effectively remove adapter sequence contribution to read alignment and alignment scoring metrics. Depending on your library preparation, insert size distribution and read length, expect varying amounts of such marked reads. java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \ I=6483_snippet_revertsam.bam \ O=6483_snippet_markilluminaadapters.bam \ M=6483_snippet_markilluminaadapters_metrics.txt \ #naming required TMP_DIR=/path/shlee #optional to process large files This produces two files. (1) The metrics file, 6483_snippet_markilluminaadapters_metrics.txt bins the number of tagged adapter bases versus the number of reads. (2) The 6483_snippet_markilluminaadapters.bam file is identical to the input BAM, 6483_snippet_revertsam.bam, except reads with adapter sequences will be marked with a tag in XT:i:# format, where # denotes the 5' starting position of the adapter sequence. At least six bases are required to mark a sequence. Reads without adapter sequence remain untagged. • By default, the tool uses Illumina adapter sequences. This is sufficient for our example data. • Adjust the default standard Illumina adapter sequences to any adapter sequence using the FIVE_PRIME_ADAPTER and THREE_PRIME_ADAPTER parameters. To clear and add new adapter sequences first set ADAPTERS to 'null' then specify each sequence with the parameter. We plot the metrics data that is in GATKReport file format using RStudio, and as you can see, marked bases vary in size up to the full length of reads. Do you get the same number of marked reads? 6483_snippet marks 448 reads (0.16%) with XT, while the original Solexa-272222 marks 3,236,552 reads (0.39%). Below, we show a read pair marked with the XT tag by MarkIlluminaAdapters. The insert region sequences for the reads overlap by a length corresponding approximately to the XT tag value. For XT:i:20, the majority of the read is adapter sequence. The same read pair is shown after SamToFastq transformation, where adapter sequence base quality scores have been set to 2 (# symbol), and after MergeBamAlignment, which restores original base quality scores. Unmapped uBAM (step 1) After MarkIlluminaAdapters (step 2) After SamToFastq (step 3) After MergeBamAlignment (step 3) back to top ## 3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment This step actually pipes three processes, performed by three different tools. Our tutorial example files are small enough to easily view, manipulate and store, so any difference in piped or independent processing will be negligible. For larger data, however, using Unix pipelines can add up to significant savings in processing time and storage. Not all tools are amenable to piping and piping the wrong tools or wrong format can result in anomalous data. The three tools we pipe are SamToFastq, BWA-MEM and MergeBamAlignment. By piping these we bypass storage of larger intermediate FASTQ and SAM files. We additionally save time by eliminating the need for the processor to read in and write out data for two of the processes, as piping retains data in the processor's input-output (I/O) device for the next process. To make the information more digestible, we will first talk about each tool separately. At the end of the section, we provide the piped command. back to top ### 3A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq Picard's SamToFastq takes read identifiers, read sequences, and base quality scores to write a Sanger FASTQ format file. We use additional options to effectively remove previously marked adapter sequences, in this example marked with an XT tag. By specifying CLIPPING_ATTRIBUTE=XT and CLIPPING_ACTION=2, SamToFastq changes the quality scores of bases marked by XT to two--a rather low score in the Phred scale. This effectively removes the adapter portion of sequences from contributing to downstream read alignment and alignment scoring metrics. Illustration of an intermediate step unused in workflow. See piped command. java -Xmx8G -jar /path/picard.jar SamToFastq \ I=6483_snippet_markilluminaadapters.bam \ FASTQ=6483_snippet_samtofastq_interleaved.fq \ CLIPPING_ATTRIBUTE=XT \ CLIPPING_ACTION=2 \ INTERLEAVE=true \ NON_PF=true \ TMP_DIR=/path/shlee #optional to process large files  This produces a FASTQ file in which all extant meta data, i.e. read group information, alignment information, flags and tags are purged. What remains are the read query names prefaced with the @ symbol, read sequences and read base quality scores. • For our paired reads example file we set SamToFastq's INTERLEAVE to true. During the conversion to FASTQ format, the query name of the reads in a pair are marked with /1 or /2 and paired reads are retained in the same FASTQ file. BWA aligner accepts interleaved FASTQ files given the -p option. • We change the NON_PF, aka INCLUDE_NON_PF_READS, option from default to true. SamToFastq will then retain reads marked by what some consider an archaic 0x200 flag bit that denotes reads that do not pass quality controls, aka reads failing platform or vendor quality checks. Our tutorial data do not contain such reads and we call out this option for illustration only. • Other CLIPPING_ACTION options include (1) X to hard-clip, (2) N to change bases to Ns or (3) another number to change the base qualities of those positions to the given value. back to top ### 3B. Align reads and flag secondary hits using BWA-MEM In this workflow, alignment is the most compute intensive and will take the longest time. GATK's variant discovery workflow recommends Burrows-Wheeler Aligner's maximal exact matches (BWA-MEM) algorithm (Li 2013 reference; Li 2014 benchmarks; homepage; manual). BWA-MEM is suitable for aligning high-quality long reads ranging from 70 bp to 1 Mbp against a large reference genome such as the human genome. • Aligning our snippet reads against either a portion or the whole genome is not equivalent to aligning our original Solexa-272222 file, merging and taking a new slice from the same genomic interval. • For the tutorial, we use BWA v 0.7.7.r441, the same aligner used by the Broad Genomics Platform as of this writing (9/2015). • As mentioned, alignment is a compute intensive process. For faster processing, use a reference genome with decoy sequences, also called a decoy genome. For example, the Broad's Genomics Platform uses an Hg19/GRCh37 reference sequence that includes Ebstein-Barr virus (EBV) sequence to soak up reads that fail to align to the human reference that the aligner would otherwise spend an inordinate amount of time trying to align as split reads. GATK's resource bundle provides a standard decoy genome from the 1000 Genomes Project. • BWA alignment requires an indexed reference genome file. Indexing is specific to algorithms. To index the human genome for BWA, we apply BWA's index function on the reference genome file, e.g. human_g1k_v37_decoy.fasta. This produces five index files with the extensions amb, ann, bwt, pac and sa. bwa index -a bwtsw human_g1k_v37_decoy.fasta The example command below aligns our example data against the GRCh37 genome. The tool automatically locates the index files within the same folder as the reference FASTA file. Illustration of an intermediate step unused in workflow. See piped command. /path/bwa mem -M -t 7 -p /path/human_g1k_v37_decoy.fasta \ 6483_snippet_samtofastq_interleaved.fq > 6483_snippet_bwa_mem.sam This command takes the FASTQ file, 6483_snippet_samtofastq_interleaved.fq, and produces an aligned SAM format file, 6483_snippet_unthreaded_bwa_mem.sam, containing read alignment information, an automatically generated program group record and reads sorted in the same order as the input FASTQ file. Aligner-assigned alignment information, flag and tag values reflect each read's or split read segment's best sequence match and does not take into consideration whether pairs are mapped optimally or if a mate is unmapped. Added tags include the aligner-specific XS tag that marks secondary alignment scores in XS:i:# format. This tag is given for each read even when the score is zero and even for unmapped reads. The program group record (@PG) in the header gives the program group ID, group name, group version and recapitulates the given command. Reads are sorted by query name. For the given version of BWA, the aligned file is in SAM format even if given a BAM extension. Does the aligned file contain read group information? We invoke three options in the command. • -M to flag shorter split hits as secondary. This is optional for Picard compatibility as MarkDuplicates can directly process BWA's alignment, whether or not the alignment marks secondary hits. However, if we want MergeBamAlignment to reassign proper pair alignments, to generate data comparable to that produced by the Broad Genomics Platform, then we must mark secondary alignments. • -p to indicate the given file contains interleaved paired reads. • -t followed by a number for the number of processor threads to use concurrently. Here we use seven threads which is one less than the total threads available on my Mac laptap. Check your server or system's total number of threads with the following command provided by KateN. getconf _NPROCESSORS_ONLN  In the example data, all of the 1211 unmapped reads each have an asterisk (*) in column 6 of the SAM record, where a read typically records its CIGAR string. The asterisk represents that the CIGAR string is unavailable. The several asterisked reads I examined are recorded as mapping exactly to the same location as their mapped mates but with MAPQ of zero. Additionally, the asterisked reads had varying noticeable amounts of low base qualities, e.g. strings of #s, that corresponded to original base quality calls and not those changed by SamToFastq. This accounting by BWA allows these pairs to always list together, even when the reads are coordinate-sorted, and leaves a pointer to the genomic mapping of the mate of the unmapped read. For the example read pair shown below, comparing sequences shows no apparent overlap, with the highest identity at 72% over 25 nts. After MarkIlluminaAdapters (step 2) After BWA-MEM (step 3) After MergeBamAlignment (step 3) back to top ### 3C. Restore altered data and apply & adjust meta information using MergeBamAlignment MergeBamAlignment is a beast of a tool, so its introduction is longer. It does more than is implied by its name. Explaining these features requires I fill you in on some background. Broadly, the tool merges defined information from the unmapped BAM (uBAM, step 1) with that of the aligned BAM (step 3) to conserve read data, e.g. original read information and base quality scores. The tool also generates additional meta information based on the information generated by the aligner, which may alter aligner-generated designations, e.g. mate information and secondary alignment flags. The tool then makes adjustments so that all meta information is congruent, e.g. read and mate strand information based on proper mate designations. We ascribe the resulting BAM as clean. Specifically, the aligned BAM generated in step 3 lacks read group information and certain tags--the UQ (Phred likelihood of the segment), MC (CIGAR string for mate) and MQ (mapping quality of mate) tags. It has hard-clipped sequences from split reads and altered base qualities. The reads also have what some call mapping artifacts but what are really just features we should not expect from our aligner. For example, the meta information so far does not consider whether pairs are optimally mapped and whether a mate is unmapped (in reality or for accounting purposes). Depending on these assignments, MergeBamAlignment adjusts the read and read mate strand orientations for reads in a proper pair. Finally, the alignment records are sorted by query name. We would like to fix all of these issues before taking our data to a variant discovery workflow. Enter MergeBamAlignment. As the tool name implies, MergeBamAlignment applies read group information from the uBAM and retains the program group information from the aligned BAM. In restoring original sequences, the tool adjusts CIGAR strings from hard-clipped to soft-clipped. If the alignment file is missing reads present in the unaligned file, then these are retained as unmapped records. Additionally, MergeBamAlignment evaluates primary alignment designations according to a user-specified strategy, e.g. for optimal mate pair mapping, and changes secondary alignment and mate unmapped flags based on its calculations. Additional for desired congruency. I will soon explain these and additional changes in more detail and show a read record to illustrate. Consider what PRIMARY_ALIGNMENT_STRATEGY option best suits your samples. MergeBamAlignment applies this strategy to a read for which the aligner has provided more than one primary alignment, and for which one is designated primary by virtue of another record being marked secondary. MergeBamAlignment considers and switches only existing primary and secondary designations. Therefore, it is critical that these were previously flagged. A read with multiple alignment records may map to multiple loci or may be chimeric--that is, splits the alignment. It is possible for an aligner to produce multiple alignments as well as multiple primary alignments, e.g. in the case of a linear alignment set of split reads. When one alignment, or alignment set in the case of chimeric read records, is designated primary, others are designated either secondary or supplementary. Invoking the -M option, we had BWA mark the record with the longest aligning section of split reads as primary and all other records as secondary. MergeBamAlignment further adjusts this secondary designation and adds the read mapped in proper pair (0x2) and mate unmapped (0x8) flags. The tool then adjusts the strand orientation flag for a read (0x10) and it proper mate (0x20). In the command, we change CLIP_ADAPTERS, MAX_INSERTIONS_OR_DELETIONS and PRIMARY_ALIGNMENT_STRATEGY values from default, and invoke other optional parameters. The path to the reference FASTA given by R should also contain the corresponding .dict sequence dictionary with the same prefix as the reference FASTA. It is imperative that both the uBAM and aligned BAM are both sorted by queryname. Illustration of an intermediate step unused in workflow. See piped command. java -Xmx16G -jar /path/picard.jar MergeBamAlignment \ R=/path/Homo_sapiens_assembly19.fasta \ UNMAPPED_BAM=6383_snippet_revertsam.bam \ ALIGNED_BAM=6483_snippet_bwa_mem.sam \ #accepts either SAM or BAM O=6483_snippet_mergebamalignment.bam \ CREATE_INDEX=true \ #standard Picard option for coordinate-sorted outputs ADD_MATE_CIGAR=true \ #default; adds MC tag CLIP_ADAPTERS=false \ #changed from default CLIP_OVERLAPPING_READS=true \ #default; soft-clips ends so mates do not overlap INCLUDE_SECONDARY_ALIGNMENTS=true \ #default MAX_INSERTIONS_OR_DELETIONS=-1 \ #changed to allow any number of insertions or deletions PRIMARY_ALIGNMENT_STRATEGY=MostDistant \ #changed from default BestMapq ATTRIBUTES_TO_RETAIN=XS \ #specify multiple times to retain tags starting with X, Y, or Z TMP_DIR=/path/shlee #optional to process large files This generates a coordinate-sorted and clean BAM, 6483_snippet_mergebamalignment.bam, and corresponding .bai index. These are ready for analyses starting with MarkDuplicates. The two bullet-point lists below describe changes to the resulting file. The first list gives general comments on select parameters and the second describes some of the notable changes to our example data. Comments on select parameters • Setting PRIMARY_ALIGNMENT_STRATEGYto MostDistant marks primary alignments based on the alignment pair with the largest insert size. This strategy is based on the premise that of chimeric sections of a read aligning to consecutive regions, the alignment giving the largest insert size with the mate gives the most information. • It may well be that alignments marked as secondary represent interesting biology, so we retain them with the INCLUDE_SECONDARY_ALIGNMENTS parameter. • Setting MAX_INSERTIONS_OR_DELETIONS to -1 retains reads irregardless of the number of insertions and deletions. The default is 1. • Because we leave the ALIGNER_PROPER_PAIR_FLAGS parameter at the default false value, MergeBamAlignment will reassess and reassign proper pair designations made by the aligner. These are explained below using the example data. • ATTRIBUTES_TO_RETAIN is specified to carryover the XS tag from the alignment, which reports BWA-MEM's suboptimal alignment scores. My impression is that this is the next highest score for any alternative or additional alignments BWA considered, whether or not these additional alignments made it into the final aligned records. (IGV's BLAT feature allows you to search for additional sequence matches). For our tutorial data, this is the only additional unaccounted tag from the alignment. The XS tag in unnecessary for the Best Practices Workflow and is not retained by the Broad Genomics Platform's pipeline. We retain it here not only to illustrate that the tool carries over select alignment information only if asked, but also because I think it prudent. Given how compute intensive the alignment process is, the additional ~1% gain in the snippet file size seems a small price against having to rerun the alignment because we realize later that we want the tag. • Setting CLIP_ADAPTERS to false leaves reads unclipped. • By default the merged file is coordinate sorted. We set CREATE_INDEX to true to additionally create the bai index. • We need not invoke PROGRAM options as BWA's program group information is sufficient and is retained in the merging. • As a standalone tool, we would normally feed in a BAM file for ALIGNED_BAM instead of the much larger SAM. We will be piping this step however and so need not add an extra conversion to BAM. Description of changes to our example data • MergeBamAlignment merges header information from the two sources that define read groups (@RG) and program groups (@PG) as well as reference contigs. • Tags are updated for our example data as shown in the table. The tool retains SA, MD, NM and AS tags from the alignment, given these are not present in the uBAM. The tool additionally adds UQ (the Phred likelihood of the segment), MC (mate CIGAR string) and MQ (mapping quality of the mate/next segment) tags if applicable. For unmapped reads (marked with an * asterisk in column 6 of the SAM record), the tool removes AS and XS tags and assigns MC (if applicable), PG and RG tags. This is illustrated for example read H0164ALXX140820:2:1101:29704:6495 in the BWA-MEM section of this document. • Original base quality score restoration is illustrated in step 2. The example below shows a read pair for which MergeBamAlignment adjusts multiple information fields, and these changes are described in the remaining bullet points. • MergeBamAlignment changes hard-clipping to soft-clipping, e.g. 96H55M to 96S55M, and restores corresponding truncated sequences with the original full-length read sequence. • The tool reorders the read records to reflect the chromosome and contig ordering in the header and the genomic coordinates for each. • MergeBamAlignment's MostDistant PRIMARY_ALIGNMENT_STRATEGY asks the tool to consider the best pair to mark as primary from the primary and secondary records. In this pair, one of the reads has two alignment loci, on contig hs37d5 and on chromosome 10. The two loci align 115 and 55 nucleotides, respectively, and the aligned sequences are identical by 55 bases. Flag values set by BWA-MEM indicate the contig hs37d5 record is primary and the shorter chromosome 10 record is secondary. For this chimeric read, MergeBamAlignment reassigns the chromosome 10 mapping as the primary alignment and the contig hs37d5 mapping as secondary (0x100 flag bit). • In addition, MergeBamAlignment designates each record on chromosome 10 as read mapped in proper pair (0x2 flag bit) and the contig hs37d5 mapping as mate unmapped (0x8 flag bit). IGV's paired reads mode displays the two chromosome 10 mappings as a pair after these MergeBamAlignment adjustments. • MergeBamAlignment adjusts read reverse strand (0x10 flag bit) and mate reverse strand (0x20 flag bit) flags consistent with changes to the proper pair designation. For our non-stranded DNA-Seq library alignments displayed in IGV, a read pointing rightward is in the forward direction (absence of 0x10 flag) and a read pointing leftward is in the reverse direction (flagged with 0x10). In a typical pair, where the rightward pointing read is to the left of the leftward pointing read, the left read will also have the mate reverse strand (0x20) flag. Two distinct classes of mate unmapped read records are now present in our example file: (1) reads whose mates truly failed to map and are marked by an asterisk * in column 6 of the SAM record and (2) multimapping reads whose mates are in fact mapped but in a proper pair that excludes the particular read record. Each of these two classes of mate unmapped reads can contain multimapping reads that map to two or more locations. Comparing 6483_snippet_bwa_mem.sam and 6483_snippet_mergebamalignment.bam, we see the numberunmapped reads remains the same at 1211, while the number of records with the mate unmapped flag increases by 1359, from 1276 to 2635. These now account for 0.951% of the 276,970 read records. For 6483_snippet_mergebamalignment.bam, how many additional unique reads become mate unmapped? After BWA-MEM alignment After MergeBamAlignment back to top ### 3D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM We pipe the three tools described above to generate an aligned BAM file sorted by query name. In the piped command, the commands for the three processes are given together, separated by a vertical bar |. We also replace each intermediate output and input file name with a symbolic path to the system's output and input devices, here /dev/stdout and /dev/stdin, respectively. We need only provide the first input file and name the last output file. Before using a piped command, we should ask UNIX to stop the piped command if any step of the pipe should error and also return to us the error messages. Type the following into your shell to set these UNIX options. set -o pipefail Overview of command structure [SamToFastq] | [BWA-MEM] | [MergeBamAlignment] Piped command java -Xmx8G -jar /path/picard.jar SamToFastq \ I=6483_snippet_markilluminaadapters.bam \ FASTQ=/dev/stdout \ CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true \ TMP_DIR=/path/shlee | \ /path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta /dev/stdin | \ java -Xmx16G -jar /path/picard.jar MergeBamAlignment \ ALIGNED_BAM=/dev/stdin \ UNMAPPED_BAM=6383_snippet_revertsam.bam \ OUTPUT=6483_snippet_piped.bam \ R=/path/Homo_sapiens_assembly19.fasta CREATE_INDEX=true ADD_MATE_CIGAR=true \ CLIP_ADAPTERS=false CLIP_OVERLAPPING_READS=true \ INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1 \ PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS \ TMP_DIR=/path/shlee The piped output file, 6483_snippet_piped.bam, is for all intensive purposes the same as 6483_snippet_mergebamalignment.bam, produced by running MergeBamAlignment separately without piping. However, the resulting files, as well as new runs of the workflow on the same data, have the potential to differ in small ways because each uses a different alignment instance. How do these small differences arise? Counting the number of mate unmapped reads shows that this number remains unchanged for the two described workflows. Two counts emitted at the end of the process updates, that also remain constant for these instances, are the number of alignment records and the number of unmapped reads. INFO 2015-12-08 17:25:59 AbstractAlignmentMerger Wrote 275759 alignment records and 1211 unmapped reads. back to top ### Some final remarks We have produced a clean BAM that is coordinate-sorted and indexed, in an efficient manner that minimizes processing time and storage needs. The file is ready for marking duplicates as outlined in Tutorial#2799. Additionally, we can now free up storage on our file system by deleting the original file we started with, the uBAM and the uBAMXT. We sleep well at night knowing that the clean BAM retains all original information. We have two final comments (1) on multiplexed samples and (2) on fitting this workflow into a larger workflow. For multiplexed samples, first perform the workflow steps on a file representing one sample and one lane. Then mark duplicates. Later, after some steps in the GATK's variant discovery workflow, and after aggregating files from the same sample from across lanes into a single file, mark duplicates again. These two marking steps ensure you find both optical and PCR duplicates. For workflows that nestle this pipeline, consider additionally optimizing java jar's parameters for SamToFastq and MergeBamAlignment. For example, the following are the additional settings used by the Broad Genomics Platform in the piped command for very large data sets.  java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq ... java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment ... I give my sincere thanks to Julian Hess, the GATK team and the Data Sciences and Data Engineering (DSDE) team members for all their help in writing this and related documents. back to top Created 2013-07-03 00:18:01 | Updated 2013-07-31 15:57:23 | Tags: bam revert #### Objective Revert a BAM file back to FastQ. This comes in handy when you receive data that has been processed but not according to GATK Best Practices, and you want to reset and reprocess it properly. #### Prerequisites • Installed HTSlib #### Steps 1. Shuffle the reads in the bam file 2. Revert the BAM file to FastQ format 3. Compress the FastQ file 4. Note for advanced users ### 1. Shuffle the reads in the bam file #### Action Shuffle the reads in the bam file so they are not in a biased order before alignment by running the following HTSlib command: htscmd bamshuf -uOn 128 aln_reads.bam tmp > shuffled_reads.bam  #### Expected Result This creates a new BAM file containing the original reads, which still retain their mapping information, but now they are no longer sorted. The aligner uses blocks of paired reads to estimate the insert size. If you don’t shuffle your original bam, the blocks of insert size will not be randomly distributed across the genome, rather they will all come from the same region, biasing the insert size calculation. This is a very important step which is unfortunately often overlooked. ### 2. Revert the BAM file to FastQ #### Action Revert the BAM file to FastQ format by running the following HTSlib command: htscmd bam2fq -a shuffled_reads.bam > interleaved_reads.fq  #### Expected Result This creates an interleaved FastQ file called interleaved_reads.fq containing the now-unmapped paired reads. Interleaved simply means that for each pair of reads in your paired-end data set, both the forward and the reverse reads are in the same file, as opposed to having them in separate files. ### 3. Compress the FastQ file #### Action Compress the FastQ file to reduce its size using the gzip utility: gzip interleaved_reads.fq #### Expected Result This creates a gzipped FastQ file called interleaved_reads.fq.gz. This file is ready to be used as input for the Best Practices workflow. BWA handles gzipped fastq files natively, so you don’t need to unzip the file to use it later on. ### 4. Note for advanced users If you’re feeling adventurous, you can do all of the above with this beautiful one-liner, which will save you a heap of time that the program would otherwise spend performing I/O (loading in and writing out data to/from disk): htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz  Created 2012-08-11 06:46:51 | Updated 2016-01-04 22:37:34 | Tags: vcf bam script reordersam sorting contig-order This is not as common as the "wrong reference build" problem, but it still pops up every now and then: a collaborator gives you a BAM or VCF file that's derived from the correct reference, but for whatever reason the contigs are not sorted in the same order. The GATK can be particular about the ordering BAM and VCF files so it will fail with an error in this case. So what do you do? ### For BAM files You run Picard's ReorderSam tool on your BAM file, using the reference genome dictionary as a template, like this: java -jar picard.jar ReorderSam \ I=original.bam \ O=reordered.bam \ R=reference.fasta \ CREATE_INDEX=TRUE Where reference.fasta is your genome reference, which must be accompanied by a valid *.dict dictionary file. The CREATE_INDEX argument is optional but useful if you plan to use the resulting file directly with GATK (otherwise you'll need to run another tool to create an index). Be aware that this tool will drop reads that don't have equivalent contigs in the new reference (potentially bad or not, depending on what you want). If contigs have the same name in the BAM and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a liftover tool! ### For VCF files You run Picard's SortVcf tool on your VCF file, using the reference genome dictionary as a template, like this: java -jar picard.jar SortVcf \ I=original.vcf \ O=sorted.vcf \ SEQUENCE_DICTIONARY=reference.dict  Where reference.dict is the sequence dictionary of your genome reference. Note that you may need to delete the index file that gets created automatically for your new VCF by the Picard tool. GATK will automatically regenerate an index file for your VCF. Created 2012-08-11 03:43:49 | Updated 2015-11-20 19:39:58 | Tags: input bam sequencing ### 1. What file formats do you support for sequence data input? The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). Starting with version 3.5, the CRAM format is supported as well. SAM format is not supported but can be easily converted with Picard tools. ### 2. How do I get my data into BAM format? The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner. ### 3. What are the formatting requirements for my BAM file(s)? All BAM/CRAM files must satisfy the following requirements: • It must be aligned to one of the references described here. • It must be sorted in coordinate order (not by queryname and not "unsorted"). • It must list the read groups with sample names in the header. • Every read must belong to a read group. • The BAM file must pass Picard ValidateSamFile validation. See the official BAM specification for more information on what constitutes a valid BAM file. ### 4. What is the canonical ordering of human reference contigs in a BAM file? It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows: Human genome reference consortium standard ordering and names (b3x): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT... UCSC convention (hg1x): chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY... ### 5. How can I tell if my BAM file is sorted properly? The easiest way to do it is to download Samtools and run the following command to examine the header of your file:  samtools view -H /path/to/my.bam
@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:1    LN:247249719
@SQ     SN:2    LN:242951149
@SQ     SN:3    LN:199501827
@SQ     SN:4    LN:191273063
@SQ     SN:5    LN:180857866
@SQ     SN:6    LN:170899992
@SQ     SN:7    LN:158821424
@SQ     SN:8    LN:146274826
@SQ     SN:9    LN:140273252
@SQ     SN:10   LN:135374737
@SQ     SN:11   LN:134452384
@SQ     SN:12   LN:132349534
@SQ     SN:13   LN:114142980
@SQ     SN:14   LN:106368585
@SQ     SN:15   LN:100338915
@SQ     SN:16   LN:88827254
@SQ     SN:17   LN:78774742
@SQ     SN:18   LN:76117153
@SQ     SN:19   LN:63811651
@SQ     SN:20   LN:62435964
@SQ     SN:21   LN:46944323
@SQ     SN:22   LN:49691432
@SQ     SN:X    LN:154913754
@SQ     SN:Y    LN:57772954
@SQ     SN:MT   LN:16571
@SQ     SN:NT_113887    LN:3994
...

If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.

### 6. My BAM file isn't sorted that way. How can I fix it?

Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence.

### 7. How can I tell if my BAM file has read group and sample information?

A quick Unix command using Samtools will do the trick:

$samtools view -H /path/to/my.bam | grep '^@RG' @RG ID:0 PL:solid PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414 CN:bcm @RG ID:1 PL:solid PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414 CN:bcm @RG ID:2 PL:LS454 PU:R_2008_10_02_06_06_12_FLX01080312_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC @RG ID:3 PL:LS454 PU:R_2008_10_02_06_07_08_rig19_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC @RG ID:4 PL:LS454 PU:R_2008_10_02_17_50_32_FLX03080339_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC ... The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate. In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads, $ samtools view /path/to/my.bam | grep '^@RG'
EAS139_44:2:61:681:18781    35  1   1   0   51M =   9   59  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601    35  1   1   0   51M =   12  62  TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3  MF:i:18 Aq:i:0  NM:i:1  UQ:i:5  H0:i:0  H1:i:85
EAS139_44:8:59:118:13881    35  1   1   0   51M =   2   52  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391    35  1   1   0   51M =   12  62  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
...

membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).

### 8. My BAM file doesn't have read group and sample information. Do I really need it?

Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information. You can use Picard's AddOrReplaceReadGroups tool to add read group information.

### 11. What's the best way to create a subset of my BAM file containing only reads over a small interval?

You can use the GATK to do the following:

java -jar GenomeAnalysisTK.jar -R reference.fasta -I full_input.bam -T PrintReads -L chr1:10-20 -o subset_input.bam

and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.

See the Dictionary entry on read groups.

As detailed in the FAQs about input requirements, GATK expects all read groups appearing in the read data to be specified in the file header, and will fail with an error if it does not find that information (whether there is no read group information in the file, or a subset of reads do not have read groups).

Typically you should read group information when you perform the original alignments (with e.g. BWA, which has an option to do so). So what do you do if you forgot to do that, and you don't want to have to rerun BWA all over again?

### Solution

Here's an example:

# throws an error
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fasta \
-o output.vcf

SORT_ORDER=coordinate \
RGID=foo \
RGLB=bar \
RGPL=illumina \
RGSM=Sample1 \
CREATE_INDEX=True

# runs without error
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fasta \
-o output.vcf

Note that if you don't know what information to put in the read groups, you should ask whoever performed the sequencing or provided the BAM to give you the metadata you need.

No posts found with the requested search criteria.

Created 2015-12-15 23:23:57 | Updated | Tags: readbackedphasing vcf bam htsjdk

I got a Contig X does not have a length field error when I run the phasing tool. By digging in more, it seems a user exception from htsjdk code, here is the trace of the coding line, has anyone experienced this issue before? Is it a problem in contig reference sequence or bad VCF file, or bad bam file?

Thanks for any help,

--------------------------------error exception raised in following code line-------------------------------------------------------------------

public SAMSequenceRecord getSAMSequenceRecord() { final String lengthString = this.getGenericFieldValue("length"); if (lengthString == null) throw new TribbleException("Contig " + this.getID() + " does not have a length field.");

which is called by RMDTrackBuilder.java validateVariantAgainstSequenceDictionary(final String name, final String descriptorName, final AbstractFeatureReader reader, final SAMSequenceDictionary dict )

Created 2015-11-20 14:59:27 | Updated 2015-11-20 15:00:44 | Tags: readbackedphasing bam

I want to phase some DNA-seq data.

java -jar GenomeAnalysisTK.jar -T ReadBackedPhasing -R ref.fasta -I readnames.bam --variant test.vcf -L Chr.list -o phased_SNPs.vcf --phaseQualityThresh 20.0

My vcf file looks like this and only contains information for 1 sample

# CHROM POS ID REF ALT QUAL FILTER INFO

NC_024331.1 131 . G GA . PASS SAF=0.655738;DP=61 NC_024331.1 147 . C G . PASS SAF=0.320000;DP=25 NC_024331.1 422 . C A . PASS SAF=0.414545;DP=275

Now I am getting an error

##### ERROR MESSAGE: No common samples in VCF and BAM headers, so nothing could possibly be phased!

Is there somewhere in the header of the vcf I can add AnySampleName?

Thanks!!

Created 2015-10-12 19:01:16 | Updated | Tags: vcf bam

I have a problem processing a BAM file with GATK 3.4-46. This is my command:

java -Xmx3g -jar GenomeAnalysisTK.jar -l INFO -R resources/Homo_sapiens_assembly19.fasta -T UnifiedGenotyper -I /tmp/foo.bam -L Y -rf BadCigar -o /tmp/foo.vcf --output_mode EMIT_ALL_CONFIDENT_SITES

and I get an error:

ERROR MESSAGE: SAM/BAM/CRAM file htsjdk.samtools.SamReaderPrimitiveSamReaderToSamReaderAdapter@32734443 appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 63; please see the GATK --help documentation for options related to this error I searched on the web and found that adding some more flags has helped some people to resolve the problem so I modified the command: java -Xmx3g -jar GenomeAnalysisTK.jar -l INFO -R resources/Homo_sapiens_assembly19.fasta -T UnifiedGenotyper -I /tmp/foo.bam -L Y -rf BadCigar -o /tmp/foo.vcf --output_mode EMIT_ALL_CONFIDENT_SITES --fix_misencoded_quality_scores -fixMisencodedQuals Now it fails in a different way: ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool Any idea how to fix this? Created 2015-09-26 12:01:02 | Updated | Tags: bam picard htsjdk Dear colleagues, I am running a third-party program that makes use of the htsjdk library. For some of my samples (BWA alignments) it raises the following exception: Exception in thread "Thread-5" java.lang.IllegalStateException: Inappropriate call if not paired read at net.sf.samtools.SAMRecord.requireReadPaired(SAMRecord.java:655) at net.sf.samtools.SAMRecord.getMateUnmappedFlag(SAMRecord.java:682) It seems there may be some problem inside the BAM file. However, after running Picard's ValidateSamFile it only finds 1,155 INVALID_MAPPING_QUALITY errors (no errors related to mate reads). Anyhow, I repaired the bam file using Picard's CleanSam and rerun the third-party program. The "Inappropriate call if not paired read" error persisted. How could I solve this, please? Thanks in advance, Federico Created 2015-09-25 23:52:49 | Updated | Tags: best-practices bam gatk Hi, So far, I have only tried running Best Practices on bam files where each bam file corresponds to a certain chromosome. How do I go about partitioning the bam files further and running Best Practices on parts of bam files? I want to make my pipeline even more efficient. Thanks for your help, Alva Created 2015-09-22 16:01:09 | Updated | Tags: bam gatk error markduplicates merged-bams Hello, I am relatively new to GATK and stuck on this problem. After downloading the bam file that I wish to analyze, I am cutting portions of the file because I only wish to analyze variants at specific genes. Once I have the smaller files for each gene, I am using samtools merge to merge all of the smaller files back into one full sliced bam file. Next, I am realigning this file using bwa aln and sampe. After this step, I am attempting to use GATK best practices to mark the variants in these genes. First, I am using GATK AddOrReplaceReadGroups to modify read group information as necessary. Then, I am using Picard's MarkDuplicates to mark duplicates in the re-aligned bam file. However, I get the following error. Exception in thread "main" net.sf.picard.PicardException: Value was put into PairInfoMap more than once. 156: TLL:HWI-ST1222:5:2308:12532:76745#0 at net.sf.picard.sam.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:124) at net.sf.picard.sam.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:78) at net.sf.picard.sam.DiskReadEndsMap.remove(DiskReadEndsMap.java:61) at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:418) at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:161) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:145) Looking at Picard's FAQ page (https://broadinstitute.github.io/picard/faq.html), it suggests that I try using Picard's MergeBamAlignment which also fails giving an error suggesting that the error is in the record ID of the bam file. Exception in thread "main" net.sf.picard.PicardException: Program Record ID already in use in unmapped BAM file. at net.sf.picard.sam.SamAlignmentMerger.<init>(SamAlignmentMerger.java:131) at net.sf.picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:226) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.sam.MergeBamAlignment.main(MergeBamAlignment.java:205) Continuing on, trying to change the options on samtools merge to change the record ID's when merging together the single files into the full sliced file also failed. Some more work into these issues found that using samtools fixmate and then Picard SortSam before calling MarkDuplicates resolves the issue, but the total amount of variants called by GATK is different when I run the sliced file through against a single gene file (for the same locations). Are there any other options to explore to resolve this error? Created 2015-09-17 21:56:03 | Updated | Tags: bqsr best-practices bam gatk Hi, I recently re-generated my realigned bam files (realigned.bam) from the GATK Best Practices, but I encountered an error when I validated the bam files using the validation tool (ValidateSamFile.jar) in Picard: ERROR: Record 188823, Read name HWI-ST1329:286:C2DT8ACXX:2:1303:11866:8561, Mate alignment does not match alignment start of mate ERROR: Record 188823, Read name HWI-ST1329:286:C2DT8ACXX:2:1303:11866:8561, Mate negative strand flag does not match read negative strand flag of mate etc. What exactly is going on? Where in the pipeline could this have occurred? I do not have the intermediary files so I cannot diagnose this accurately. Before I started the GATK Best Practices, my bam files had no problems; when I validated them I did not get any error messages. Does this mean that I have to regenerate the files again and start from the very beginning? I tried to use the Picard tool FixMateInformation.jar, but then I got an error message regarding the index files. I tried deleting the bam index for one bam file and then recreating it to see if there was a problem with the indexes, but doing so did not resolve the issue. So it seems that something went wrong with the bam files at some earlier step. Has this error been encountered before? Not sure how to proceed next, except restart everything. Best, Alva Created 2015-09-14 16:27:46 | Updated | Tags: haplotypecaller vcf bam rnaseq variant-calling Hello, I'm using GATK to call variants in my RNA-Seq data. I'm noticing something strange, perhaps someone can help? For a number of sites the VCF is reporting things I cannot replicate from BAMs. How can I recover the reads that contribute to a variant call? Here is an example for 1 site in 1 sample, but I've observed this at many sites/samples:  grep 235068463 file.vcf
chr1    235068463   .   T   C   1795.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-3.530;ClippingRankSum=-0.535;DP=60;FS=7.844;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.401;QD=29.93;ReadPosRankSum=3.557   GT:AD:DP:GQ:PL  0/1:5,55:60:44:1824,0,44

60 reads, 5 T, 55 C.

samtools view -uh file.md.realn.bam chr1:235068463-235068463 |samtools mpileup - |grep 235068463 [mpileup] 1 samples in 1 input files <mpileup> Set max per-file depth to 8000 chr1 235068463 N 60 cCCccccCCCcccccCcccccccccCCCccCCCCCcCcccccCCCcCcCCccCCCCccCC >CA@B@>A>BA@BCABACCC:@@ACABBBCAACBBCABCB@CABBAB?>A?CBBAAAABA There are just 60 C's at that location. How do I decide what the genotype here is? C/C or C/T ? For methodology I'm using gatk/3.2.0. I tried using HC from gatk/3.3.1 and got the same result. The bam and vcf files come from the final two lines: -2 pass STAR -Mark Dups -SplitNCigarReads -RealignerTargetCreator -IndelRealigner -BaseRecalibrator -PrintReads -MergeSamFiles.jar -Mark Dups -RealignerTargetCreator -IndelRealigner -HaplotyeCaller Thanks, Kipp Created 2015-08-14 19:03:41 | Updated | Tags: bam samtools Hi, I’ve been trying to track down an issue that cropped up when we were validating our pipeline on a newer system. We have a test that produces different output each time it’s run (it seems to cycle randomly between five different outputs), but only on ubuntu 14. The same test produces the same result every time on ubuntu 12, and when run on a Mac OS X desktop. I was able to create an isolated test using only SamReader and SamLocusIterator that simply iterates over the BAM and writes out every locus to a text file. This file exhibits the same behavior, e.g. it cycles between five different outputs. The test code is here: http://paste.openstack.org/show/412884/ Note that this will generate a lot of output, so it’s best to run with a small BAM file. My test file of 170M produces an 800M output file. I tested using JDK 8u51 on all machines, and am using the latest picard tools (1.138) although it originally showed up using a much earlier version of picard tools (1.93) and JDK 7. I’m not a bioinformatician so I don’t really know what the expected behavior is, although I couldn’t find the source of this variation by looking at the SAM tools source. If this is expected behavior that is OK, but I’d like confirmation of it before considering that the validation is complete. Better yet would be a way to disable or control this behavior to allow for consistent test results. Thanks, Phil Shapiro Created 2015-07-23 16:43:59 | Updated | Tags: bam Hi, We have 8 sequence files representing 8 fish families (5 fish each pooled). The families are belonging to 2 different phenotypes (Pheno 1 VS.2). Sequence files came from 3 different Illumina lanes. Could you please help us in assigning Group IDs and Sample names. Thanks, Rafet Al-Tobasei PhD candidate Computational Science department MTSU Created 2015-06-27 16:56:26 | Updated | Tags: bam gatk Meaning does it matter if my intervals file goes chr1, chr10, instead of chr1, chr2 ? Additionally, will a command line error occur if bam files are not sorted in coordinate order with respect to the reference? Thank you -Best Steve Created 2015-06-23 16:19:39 | Updated | Tags: vcf developer bam bug error I hava a question wish to get help from the developers: I am using GATK with two modles: 1, I just use the UnifiedGenotyper to call the variants from a prepared bam file, then I get a vcf file. (Call it A.vcf) 2, I run the UnifiedGenotype by Chr, one by one, say, by using the "-L" arg, then I get a sets of small vcf files. But, the result of ChrY is confusing me a lot.... the ChrY part in the A.vcf is QUITE different from the small vcf file that generated by "-L chrY", the difference seems to be larger than 50%. That means, the result is DIFFERENT for chrY. However, I have also checked the other Chromosomes, the difference is slight. ONLY the ChrY has this problem. Our script pasted here: java -Xmx30g -jar /data/SG/Env/software_installed/GenomeAnalysisTK.jar \ -L xx \ # I add -L option here when I do step 2. when I generate A.vcf ,I didn't add -L here -R ucsc.hg19.fasta \ -T RealignerTargetCreator \ -I{sampleName}.bam \ -o {sampleName}.intervals \ -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \ -known 1000G_phase1.indels.hg19.sites.vcf java -Xmx30g -jar /data/SG/Env/software_installed/GenomeAnalysisTK.jar \ -L xx \ # I add -L option here when I do step 2. when I generate A.vcf ,I didn't add -L here -R ucsc.hg19.fasta \ -T IndelRealigner \ -targetIntervals{sampleName}.intervals \ -I ${sampleName}.bam \ -o${sampleName}.realigned.bam \ -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \ -known 1000G_phase1.indels.hg19.sites.vcf

samtools index {sampleName}.realigned.bam java -Xmx30g -jar /data/SG/Env/software_installed/GenomeAnalysisTK.jar \ -L xx \ # I add -L option here when I do step 2. when I generate A.vcf ,I didn't add -L here -R ucsc.hg19.fasta \ -T BaseRecalibrator \ -nct 8 \ -I{sampleName}.realigned.bam \ -knownSites dbsnp_138.hg19.vcf \ -knownSites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \ -knownSites 1000G_phase1.indels.hg19.sites.vcf \ -o ${sampleName}.recal_data.grp java -Xmx30g -jar /data/SG/Env/software_installed/GenomeAnalysisTK.jar \ -L xx \ # I add -L option here when I do step 2. when I generate A.vcf ,I didn't add -L here -R ucsc.hg19.fasta \ -T PrintReads \ -nct 8 \ -I${sampleName}.realigned.bam \ -BQSR ${sampleName}.recal_data.grp \ -o${sampleName}.realigned.recal.bam

samtools index {sampleName}.realigned.recal.bam java -Xmx30g -jar /data/SG/Env/software_installed/GenomeAnalysisTK.jar \ -L xx \ # I add -L option here when I do step 2. when I generate A.vcf ,I didn't add -L here -R ucsc.hg19.fasta \ -T UnifiedGenotyper \ -nct 8 \ -glm BOTH \ -I{sampleName}.realigned.recal.bam \ -D dbsnp_138.hg19.vcf \ -o {sampleName}.vcf \ #here A.vcf or small vcf generated -stand_call_conf 50.0 \ -stand_emit_conf 10.0 \ -dcov 200 \ -A AlleleBalance -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A RMSMappingQuality -A InbreedingCoeff -A Coverage Wish you guys can offer me some help. Thanks, Created 2015-05-13 12:41:10 | Updated | Tags: indelrealigner bam error Hello, I run into a problem after the pre-processing, it seems that extra contigs where added to my bam file compared to the reference I used, which make the indel realigner step impossible to do. I have checked the headers of my file and the reference is the same but my bam file as a hundreds of additional contigs. Not sure what happen. The steps to get the bam where: • Aligned with bwa mem • Transform to bam and sort (Samtools) • Dedup (picard) • Add read group (picard) • Index bam (samtools) • Run Realigner target creator When I check the header of my bam file it still show the right contigs but when running it complains of difference (additional) compare to my reference. I am currently re-testing the whole pipeline on a single sample but if you have any pointer to what could cause this, maybe a problem with the bam formating? I am running GATK 3.3.0-g37228af Java 1.7 I have attached the ouput log from the command. Thanks, Julien PS: I attended your workshop in Cambridge! Created 2015-05-09 14:48:21 | Updated | Tags: best-practices bam third-party-tools picard Following GATK's best practices, I have individually realigned/recalibrated my sample-lane bams and merged them by sample: sample1_lane1.dedup.realn.recal.bam --> sample1_lane2.dedup.realn.recal.bam --> sample1.merged.bam sample2_lane1.dedup.realn.recal.bam --> sample2_lane2.dedup.realn.recal.bam --> sample2.merged.bam ... I am ready to dedup and realign my sample merged bams, however I am uncertain on the best approach. Is the consensus to convert back to fastq via Picard (MarkDuplicates, SortSam, and SamToFastq) and then run bwa mem? Or is it more expedient/accurate to realign the sample-merged bam using bwa aln followed by bwa sampe? Created 2015-04-29 10:27:03 | Updated | Tags: simulatereadsforvariants bam Dear list, I would like to better understand how the module SimulateReadsForVariants is working with VCF files. Given a VCF file containing 10 samples, I expected the SimulateReadsForVariants to simulate 10 BAM files for each sample. The module however simulates only 1 BAM file. How can we generate sample-specific simulated BAM files? Do we have to create 10 VCF files containing only 1 sample each? Many thanks for your lights, Best, Rémi Created 2015-03-26 16:03:02 | Updated | Tags: bam picard nullpointerexception fixmateinformation Hello, I was asked to re-post this question here. It was originally posted in the Picard forum at GitHub at https://github.com/broadinstitute/picard/issues/161. Regards, Bernt ## ORIGINAL POST (edited) There seems to be a problems with FixMateInformation crashing with Exception in thread "main" java.lang.NullPointerException at htsjdk.samtools.SamPairUtil.setMateInformationOnSupplementalAlignment(SamPairUtil.java:300) at htsjdk.samtools.SamPairUtilSetMateInfoIterator.advance(SamPairUtil.java:442)
at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.next(SamPairUtil.java:454) at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.next(SamPairUtil.java:360)
at picard.sam.FixMateInformation.doWork(FixMateInformation.java:194)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:185)
at picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:125)
at picard.sam.FixMateInformation.main(FixMateInformation.java:93)

The problem first appeared in version 1.121. It is present in version 1.128. Versions prior to 1.120 worked and continue to work fine. I am currently using Java 1.7.0_75, but I observed the same problem with earlier version of Java. The problem occurs under several different version of Fedora.

The command lines I am using are:

java -jar picard-1.128/picard.jar FixMateInformation INPUT=test.bam OUTPUT=fixed.bam (fails)

java -jar picard-1.121/FixMateInformation.jar INPUT=test.bam OUTPUT=fixed.bam (fails)

java -jar picard-1.120/FixMateInformation.jar INPUT=test.bam OUTPUT=fixed.bam (succeeds)

I have observed the problem with various BAM files. This one is (a small subset of) the output of an indel realignment with GATK.

## Later in the same thread:

ValidateSam produces: java -jar /opt/ghpc/picard-1.121/ValidateSamFile.jar INPUT=test.bam OUTPUT=out.bam [Wed Feb 18 08:48:40 CET 2015] picard.sam.ValidateSamFile INPUT=test.bam OUTPUT=out.bam MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false [Wed Feb 18 08:48:40 CET 2015] Executing as bernt@interactive.ghpc.dk on Linux 2.6.35.14-106.fc14.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13; Picard version: 1.121(da291b4d265f877808b216dce96eaeffd2f30bd3_1411396652) IntelDeflater [Wed Feb 18 08:48:41 CET 2015] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes. Runtime.totalMemory()=505937920 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

## And later:

The problem also exists in the new version, 1.129.

## New

Re-posted in GATK forum

Picard (1.129)'s ValidateSamFile complains about Mate not found - which was the reason for running FixMateInformation in the first place.

The output is:

java -jar /opt/ghpc/picard-current/picard.jar ValidateSamFile I=test.bam

[Thu Mar 26 16:44:49 CET 2015] picard.sam.ValidateSamFile INPUT=test.bam    MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Thu Mar 26 16:44:49 CET 2015] Executing as bernt@interactive.ghpc.dk on Linux 2.6.35.14-106.fc14.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13; Picard version: 1.129(b508b2885562a4e932d3a3a60b8ea283b7ec78e2_1424706677) IntelDeflater
Maximum output of [100] errors reached.
[Thu Mar 26 16:44:50 CET 2015] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=505937920
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

Please let me know where to send the test BAM file.

Regards,

Bernt

Created 2015-03-17 22:54:15 | Updated | Tags: bam picard

Hi,

I am attempting to merge the output of tophat in order to run some RNASeq QC metrics. This is a single read 50 bp on a Hiseq. In order to get past the fact that tophat gives a MAPQ of 255 to unmapped reads (not 0 as expected by Picard) I used the following tool ( https://github.com/cbrueffer/tophat-recondition) to change it.

Once completed, I added read groups using picard and then sorted the accepted_hits.bam by coordinate and sorted the unmapped reads by queryname.

tophat-recondition.py /home/ryan/NGS_Data/No_Dox

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ AddOrReplaceReadGroups \ I=/home/ryan/NGS_Data/unmapped_fixup.bam \ O=/home/ryan/NGS_Data/unmapped_fixup-RG.bam \ RGID=No_Dox RGLB=No_Dox RGPL=illumina RGPU=GCCAAT RGSM=No_Dox

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ AddOrReplaceReadGroups \ I=/home/ryan/NGS_Data/accepted_hits.bam \ O=/home/ryan/NGS_Data/accepted_hits-RG.bam \ SORT_ORDER=coordinate \ RGID=No_Dox RGLB=No_Dox RGPL=illumina RGPU=GCCAAT RGSM=No_Dox \ CREATE_INDEX=true

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ SortSam \ I=/home/ryan/NGS_Data/unmapped_fixup-RG.bam \ O=/home/ryan/NGS_Data/unmapped_fixup-RG-sorted.bam \ SORT_ORDER=queryname

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ SortSam \ I=/home/ryan/NGS_Data/accepted_hits-RG.bam \ O=/home/ryan/NGS_Data/accepted_hits-RG-sorted.bam \ SORT_ORDER=coordinate \ CREATE_INDEX=true

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ MergeBamAlignment \ UNMAPPED_BAM=/home/ryan/NGS_Data/unmapped_fixup-RG-sorted.bam \ ALIGNED_BAM=/home/ryan/NGS_Data/accepted_hits-RG-sorted.bam \ O=/home/ryan/NGS_Data/merge_unmapped_accepted_hits_No_Dox.bam \ SORT_ORDER=coordinate \ REFERENCE_SEQUENCE=/home/ryan/Reference/human_spikein/hg19_spikein.fa \ PROGRAM_RECORD_ID=Tophat \ PROGRAM_GROUP_VERSION=0.1 \ PROGRAM_GROUP_COMMAND_LINE=tophat/dummy \ PAIRED_RUN=false \ VALIDATION_STRINGENCY=LENIENT \ CREATE_INDEX=true

I then get the following warning followed by the error:

WARNING 2015-03-17 09:44:22 SamAlignmentMerger Exception merging bam alignment - attempting to sort aligned reads and try again: Aligned record iterator (HS3:608:C6LNLACXX:7:2101:9946:63417) is behind the unmapped reads (HS3:608:C6LNLACXX:7:2101:9947:11009) INFO 2015-03-17 09:44:33 SamAlignmentMerger Read 1000000 records from alignment SAM/BAM. INFO 2015-03-17 09:44:43 SamAlignmentMerger Read 2000000 records from alignment SAM/BAM. .... INFO 2015-03-17 09:58:01 SamAlignmentMerger Read 96000000 records from alignment SAM/BAM. INFO 2015-03-17 09:58:09 SamAlignmentMerger Read 97000000 records from alignment SAM/BAM. INFO 2015-03-17 09:58:15 SamAlignmentMerger Finished reading 97571897 total records from alignment SAM/BAM. [Tue Mar 17 09:58:16 PDT 2015] picard.sam.MergeBamAlignment done. Elapsed time: 14.32 minutes. Runtime.totalMemory()=1908932608 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" java.lang.IllegalStateException: Aligned record iterator (HS3:608:C6LNLACXX:7:1101:10000:11036) is behind the unmapped reads (HS3:608:C6LNLACXX:7:1101:10000:48402) at picard.sam.AbstractAlignmentMerger.mergeAlignment(AbstractAlignmentMerger.java:383) at picard.sam.SamAlignmentMerger.mergeAlignment(SamAlignmentMerger.java:153) at picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:248) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

I have searched and have been unsuccessful at resolving this problem. Any ideas?

I am running picard 1.129 using java version "1.7.0_60"

ValidateSamFile on both inputs into MergeBamAlignment returns no errors in those files. I am at a lost and seeking for help!

Thanks,

Ryan

Created 2015-02-13 11:14:24 | Updated | Tags: bam gatk error

Hi,

with GATK 3.3-0 I am confronted with an error that was present in a much older version, but seemed resolved about a year ago:

ERROR MESSAGE: There is no space left on the device, so writing failed

There is 8TB left on the drive, no user limit. Sometimes re-running the exact same job works, sometimes not. Some jobs keep failing despite asking for an insane amount of memory on the cluster, given these are RNAseq bam files, the largest one being less than 7GB.

For example:

qsub -b y -cwd -N step3_145 -o step3_145.o -e step3_145.e -V -l h_vmem=40G /share/apps/java/oracle/1.8.0_11/bin/java -Xmx35G -jar /data/home/hhx037/GATK-3.3.0/GenomeAnalysisTK.jar -T SplitNCigarReads -R Homo_sapiens.GRCh37.75.dna.1-22XYMT.fa -I Analyses/file_dedup.bam -o Analyses/file_splittedcigar.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

Here is the log:

INFO 10:50:51,568 HelpFormatter - Executing as hhx037@panos1 on Linux 2.6.32-431.1.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_11-b12. INFO 10:50:51,571 HelpFormatter - Date/Time: 2015/02/13 10:50:51 INFO 10:50:51,571 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:50:51,576 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:50:52,503 GenomeAnalysisEngine - Strictness is SILENT INFO 10:50:52,827 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 10:50:52,861 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 10:50:52,876 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 10:50:53,021 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 10:50:53,027 GenomeAnalysisEngine - Done preparing for traversal INFO 10:50:53,030 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 10:50:53,030 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 10:50:53,030 ProgressMeter - Location | reads | elapsed | reads | completed | runtime | runtime INFO 10:50:53,047 ReadShardBalancer$1 - Loading BAM index data INFO 10:50:53,050 ReadShardBalancer$1 - Done loading BAM index data INFO 10:51:23,404 ProgressMeter - 1:1477348 702953.0 30.0 s 43.0 s 0.0% 17.5 h 17.5 h INFO 10:52:32,660 ProgressMeter - 1:16909108 1202983.0 99.0 s 82.0 s 0.5% 5.0 h 5.0 h INFO 10:53:09,769 ProgressMeter - 1:21069702 1302985.0 2.3 m 104.0 s 0.7% 5.6 h 5.5 h INFO 10:53:49,083 ProgressMeter - 1:27951393 1803181.0 2.9 m 97.0 s 0.9% 5.4 h 5.4 h INFO 10:54:29,275 ProgressMeter - 1:32739969 2103299.0 3.6 m 102.0 s 1.1% 5.7 h 5.6 h INFO 10:55:09,177 ProgressMeter - 1:36643589 2203300.0 4.3 m 116.0 s 1.2% 6.0 h 5.9 h INFO 10:55:45,643 ProgressMeter - 1:39854010 2303302.0 4.9 m 2.1 m 1.3% 6.3 h 6.2 h INFO 10:56:25,147 ProgressMeter - 1:40542516 2403303.0 5.5 m 2.3 m 1.3% 7.0 h 6.9 h INFO 10:57:10,934 ProgressMeter - 1:40654849 2503322.0 6.3 m 2.5 m 1.3% 8.0 h 7.9 h INFO 10:57:54,084 ProgressMeter - 1:43162895 2503322.0 7.0 m 2.8 m 1.4% 8.4 h 8.3 h INFO 10:58:24,149 ProgressMeter - 1:45244391 2703426.0 7.5 m 2.8 m 1.5% 8.6 h 8.4 h INFO 10:58:56,749 ProgressMeter - 1:53716450 2803427.0 8.1 m 2.9 m 1.7% 7.7 h 7.6 h INFO 10:59:38,928 ProgressMeter - 1:86821106 3103432.0 8.8 m 2.8 m 2.8% 5.2 h 5.1 h INFO 11:00:11,337 ProgressMeter - 1:93301870 3303437.0 9.3 m 2.8 m 3.0% 5.1 h 5.0 h INFO 11:01:13,113 ProgressMeter - 1:115252321 3803590.0 10.3 m 2.7 m 3.7% 4.6 h 4.5 h INFO 11:02:02,172 ProgressMeter - 1:145441389 4303778.0 11.2 m 2.6 m 4.7% 4.0 h 3.8 h INFO 11:02:38,237 ProgressMeter - 1:150547232 4703871.0 11.8 m 2.5 m 4.9% 4.0 h 3.8 h INFO 11:03:09,693 ProgressMeter - 1:153362937 5003904.0 12.3 m 2.5 m 5.0% 4.1 h 3.9 h INFO 11:03:39,934 ProgressMeter - 1:155984762 5403968.0 12.8 m 2.4 m 5.0% 4.2 h 4.0 h INFO 11:04:05,477 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ------------------------------------------------------------------------------------------

I understand temporary files may be large, but not that large. Are the temporary files written in the working directory (as I believe should be the case), or are they written in GATK installation directory?

Also, note I never run into this problem with the previous version.

Any idea?

Cheers,

Stephane

Created 2015-01-23 16:55:57 | Updated | Tags: vqsr haplotypecaller bam gatk genotypegvcfs variant-calling

Hi,

I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains information for one chromosome for one individual. I was wondering if it is best for me to pass all the files for a single individual together when running HaplotypeCaller, if it will increase the accuracy of the calling, or if I can just run HaplotypeCaller on each individual bam file separately.

Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of?

Alva

Created 2015-01-21 19:53:26 | Updated | Tags: baserecalibrator haplotypecaller vcf bam merge rnaseq

Hi, I am working with RNA-Seq data from 6 different samples. Part of my research is to identify novel polymorphisms. I have generated a filtered vcf file for each sample. I would like to now combine these into a single vcf.

I am concerned about sites that were either not covered by the RNA-Seq analysis or were no different from the reference allele in some individuals but not others. These sites will be ‘missed’ when haplotypeCaller analyzes each sample individually and will not be represented in the downstream vcf files.

When the files are combined, what happens to these ‘missed’ sites? Are they automatically excluded? Are they treated as missing data? Is the absent data filled in from the reference genome?

Alternatively, can BaseRecallibrator and/or HaplotypeCaller simultaneously analyze multiple bam files?

Is it common practice to combine bam files for discovering sequence variants?

Created 2015-01-18 13:28:03 | Updated 2015-01-18 13:29:37 | Tags: bam picard index preprocess

Greeting all.

Currently, I have been using Picard's built-in library "BuildBamIndex" in order to index my bam files.

I have followed the manual described in Picard sites but I got error message.

Here is my command line that you can easily understand as below.

java -Xmx8g -XX:ParallelGCThreads=8 -jar $picard/BuildBamIndex.jar I=$RealignedBamDir/output6 I tried different approach to avoid this error message so I used "samtools index" which i think is also same function as Picard BuildBamIndex. After using samtools, I successfully got my bam index files. I suppose that there are no major difference between Picard bamindex and samtools bam index. I am confusing that why only samtools index procedure is working fine? Below is my error message when run "BuildBamIndex" from Picard. [Sun Jan 18 22:15:42 KST 2015] picard.sam.BuildBamIndex INPUT=/DATA1/sclee1/data/URC_WES/U01/01U_N_Filtered_Sorted_Markdup_readgroup.bam VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false [Sun Jan 18 22:15:42 KST 2015] picard.sam.BuildBamIndex done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=2058354688 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" htsjdk.samtools.SAMException: Exception creating BAM index for record HSQ-2K:530:C5PJAACXX:6:2109:18806:13902 1/2 101b aligned read. at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:92) at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:291) at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:271) at picard.sam.BuildBamIndex.doWork(BuildBamIndex.java:133) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105) Caused by: htsjdk.samtools.SAMException: BAM cannot be indexed without setting a fileSource for record HSQ-2K:530:C5PJAACXX:6:2109:18806:13902 1/2 101b aligned read. at htsjdk.samtools.BAMIndexMetaData.recordMetaData(BAMIndexMetaData.java:130) at htsjdk.samtools.BAMIndexerBAMIndexBuilder.processAlignment(BAMIndexer.java:182) at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:90) ... 6 more

I look forward to hearing positive answers from you soon.

Bye!

Created 2015-01-09 16:47:35 | Updated | Tags: bqsr haplotypecaller variantfiltration vcf bam workflows snps gatk3

Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).

Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with bwa mem) is this:

• 01- BAM: Local INDEL realignment (RealignerTargetCreator/IndelRealigner)
• 02- BAM: MarkDuplicates
• 03- BAM: Local INDEL realignment second pass (RealignerTargetCreator/IndelRealigner)
• 04- BAM: Calling variants using HaplotypeCaller
• 05- VCF: Hard filter variants for truth set for BQSR (there is no known variant site databases so we can use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and we also filter out heterozygous positions using "isHet == 1".
• 06- VCF: Calculate BQSR table using the high quality hard-filtered variants from step 05.
• 07- BAM: Apply BQSR recalibration from previous step to BAM file from step 04.
• 08- BAM: Calling variants on recalibrated BAM file from previous step using HaplotypeCaller, also emitting reference sites using --output_mode EMIT_ALL_SITES \ and --emitRefConfidence GVCF \

Does this sound like a reasonable thing to do? What options should I use in step 8 in order for HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES \ and --emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).

Created 2014-12-02 16:32:32 | Updated | Tags: haplotypecaller bam gatk

Hi,

I want to use HaplotypeCaller to call families together. I have bam files for each individual in the 4 families I am studying, as well as a ped file describing the pedigree information. The problem is that these families have complex pedigrees, with the parents (mother and father), the children, and then one grandchild for each child (do not have information about the other parent of the grandchild). I would like to call these families with their complex pedigrees together, and I would like to call all the 4 families together to maximize the power of the calling. However, I'm not sure how to do that with just the -ped option. -ped seems to be designed for only one family or cohort, and I'm not sure it would work for me to feed it all my bams as inputs. Are there any other tools for GATK that I could use to call complex pedigrees?

The other possibility would be to call the 4 trios separately and each child-grandchild pair separately, but not sure how to do that either with just -ped. What would you recommend?

And finally, I get an error message saying that --emitRefConfidence only works for single sample mode. It seems that I should omit this option when I run HaplotypeCaller on my families, but are there any other options that I should use for cohort calling besides the standard ones (which I know to be --variant_index_type and --variant_index_parameter)?

Thanks, Alva

Created 2014-06-27 12:30:27 | Updated | Tags: bam hg19 1000g grch37

I have a human exome experiment on which I am using hg19 resources (reference, targets, dbSNP, ... the whole shebang). I want to add some 1000Genomes exomes to this experiment, but the available BAMs are from GRCh37.

Is there a tool to port the BAMs from GRCh37 to hg19, and to continue with that? Maybe LiftOver?

Do you rather recommend re-processing the 1000Genomes BAMs on hg19? Would that mean regenerate FASTQs and re-do the whole map/MarkDup/IndelReal/BQSR steps?

For now, I have worked on the original BAMs but have renamed all the classical chromosomes from "1" to "chr1" and I got rid of the mitochondrial chromosome and all other contigs (got rid of these contigs also in the resources to avoid GATKs complaints on missing contigs). How bad would you think that is based on the differences you know between GRCh37 and hg19?

Thanks a lot for your help!

Created 2014-06-06 14:54:02 | Updated | Tags: haplotypecaller bam hg19 grch37

I am preparing BAM files from the 1000 genomes project to use in my GATK pipeline (along with other already processed BAMs) and I have the following issues:

• chromosome notation on my BAMs is from GRCh37 but my pipeline uses hg19, so I would like to replace chromosome notation (1 -> chr1)
• the mitochondrial chromosome is slightly different in hg19 and GRCh37 (see here), so I want to leave it out
• and actually leave out all alternate contigs

This sounds quite trivial, but I haven't found a clean way to do this yet. I have tried the following:

i=INPUT.bam
j=OUTPUT.bam
samtools view -h $i | awk 'BEGIN{FS=OFS="\t"} (/^@/ && !/@SQ/){print$0} $2~/^SN:[1-9]|^SN:X|^SN:Y/{print$0}  $3~/^[1-9]|X|Y/{$3="chr"$3; print$0} ' | sed 's/SN:/SN:chr/g' | samtools view -bS - > j However, when I try running the HaplotypeCaller, I get the following error: ##### ERROR MESSAGE: BAM file(s) do not have the contig: chrM. You are probably using a different reference than the one this file was aligned with Could you help me prepare these BAM files for processing? Thanks a lot in advance Created 2014-04-29 12:21:43 | Updated | Tags: unifiedgenotyper bam Hi GATK team, I am using GATK to call random mutations caused by mutagen. I have more than 5 mutant lines sequenced and I want to call SNP/INDELs which are unique for each samples (since they are randomly introduced they must be unique for each samples). Majoirity of SNP will be shared between all mutant samples as this are the natural variations between line used for mutagenesis and the Reference sequence. I am using UnifiedGenotyper (because I have sequenced pooled DNA for each mutant family) by giving all the 5 bam files from samples. I wonder, by calling SNP from all 5 mutant samples together may negatively influence the unique SNPs? Does GATK consider all the 5 samples as a population and give preference (or high score to) locus which are shared between majority of samples? In my case, would it be better to call SNP/INDEL separately for all 5 samples? The command I am using is java -jar GenomeAnalysisTKLite-2.3-9-gdcdccbb/GenomeAnalysisTKLite.jar -T UnifiedGenotyper -glm BOTH -R ref.fa -nt 10 -I results/1_sorted.bam_fixed_rmdup_realign.bam -I results/2_sorted.bam_fixed_rmdup_realign.bam -I results/3_sorted.bam_fixed_rmdup_realign.bam -I results/4_sorted.bam_fixed_rmdup_realign.bam -I results/5_sorted.bam_fixed_rmdup_realign.bam -o results/out.vcf I hope I am clear, and looking forward to learn more, Thank you in advance Created 2014-03-18 22:26:05 | Updated | Tags: haplotypecaller bam multiple-inputs gvcf i have been using HaplotypeCaller in gVCF mode on a cohort of 830 samples spread over 2450 bams. the number of bams per sample varies from 1-4. for samples with <=3 bams, the routine works perfectly. but for samples with 4 bams, the jobs always crash and I receive the error: ERROR MESSAGE: Invalid command line: Argument emitRefConfidence has a bad value: Can only be used in single sample mode currently is this a bug? are there any options i can use to avoid this error. i suppose it is possible that there is an issue with my bams, but it seems odd that the error occurs systematically with 4 bam samples and never for samples with 3 or fewer bams. thanks for any help! Created 2014-03-11 08:10:55 | Updated | Tags: vcf bam rodwalker RodWalkers are really fast and great. Is there any way to connect a BAM file to it and get the pileup in the given positions? If not, what is the appropriate way to get pileups for a small set of selection positions? Created 2014-03-05 08:09:15 | Updated | Tags: bqsr bam header When I run the BQSR with GATK 2.8.1, I can't check the PG information on output bam file. BWA->Picard dedup->GATK LocalRealign->GATK BQSR Below is the output BAM file after BQSR step. @PG ID:GATK IndelRealigner VN:2.3-9-gdcdccbb CL:knownAlleles=[(RodBinding name=knownAlleles source=/nG/Reference/hgdownload.cse.ucsc.edu/goldenPath/hg19/KNOWNINDEL/Mills_and_1000G_gold_standard.indels.hg19.vcf)] targetIntervals=/nG/Data/2077/vcf1/node1/chrY/Databind/chrY.bam.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null @PG ID:MarkDuplicates PN:MarkDuplicates VN:1.92(1464) CL:net.sf.picard.sam.MarkDuplicates INPUT=[/nG/Data/2077/step2_makebam/node1-1.bam, /nG/Data/2077/step2_makebam/node1-2.bam, /nG/Data/2077/step2_makebam/node1-3.bam, /nG/Data/2077/step2_makebam/node1-4.bam, /nG/Data/2077/step2_makebam/node1-5.bam, /nG/Data/2077/step2_makebam/node1-6.bam] OUTPUT=/dev/stdout METRICS_FILE=/nG/Data/2077/temp/picard_info.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000000 TMP_DIR=[/nG/Data/2077/temp] QUIET=true VALIDATION_STRINGENCY=LENIENT COMPRESSION_LEVEL=0 MAX_RECORDS_IN_RAM=2000000 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO CREATE_INDEX=false CREATE_MD5_FILE=false Created 2014-03-02 23:34:47 | Updated | Tags: printreadswalker bam header Hi, I use --sample_name option to restrict output reads for PintReads walker (GATK version - 2.4.9). This option works well but the output BAM header @RG still contains sample name (SM) that was not included. Is there a way to apply this --sample_name option to BAM header at the same time? I could reheader the BAM files using samtools but I'm looking for a better and easier way. Thank you, Taka Created 2013-12-16 11:41:23 | Updated | Tags: baserecalibrator bam bwa We have used bwa 0.7.4 aln and sampe to align illumina reads. Then used the following command java -Xmx6g -jar ~/GenomeAnalysisTK-2.8-1/GenomeAnalysisTK.jar -T BaseRecalibrator -I ~/temp/BIR-08_130330_I288_FCD1P68ACXX_L7_SZAIPI025187-74.sortedindelrealigned.bam -R ~/hg19/ucsc.hg19.fasta -knownSites ~/dbSNP/dbsnp_137.hg19.vcf -o ~/BIR-08_130330_I288_FCD1P68ACXX_L7_SZAIPI025187-74.sortedBQSR.grp Which gave the following error message ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace org.broadinstitute.sting.utils.exceptions.ReviewedStingException: START (90) > (89) STOP -- this should never happen, please check read: FCD1P68ACXX:7:1315:19572:52424#CGCGGTGA 1/2 90b aligned read. (CIGAR: 85M4I1M2D) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:537) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinatesRightTail(ReadClipper.java:193) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:389) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:392) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:245) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:132) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNanoTraverseReadsMap.apply(TraverseReadsNano.java:228) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNanoTraverseReadsMap.apply(TraverseReadsNano.java:216) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: START (90) > (89) STOP -- this should never happen, please check read: FCD1P68ACXX:7:1315:19572:52424#CGCGGTGA 1/2 90b aligned read. (CIGAR: 85M4I1M2D) can you help me in this error message? Why its coming and how to rectify it? Thanks in advance Mayukh Created 2013-12-03 10:35:10 | Updated | Tags: unifiedgenotyper input ftp bam Is it possible to source and read bam files directly from ftp site into Unified Genotyper without downloading them first. This option works in samtools view -- but if I try the simple straightforward way in GATK (just replacing the -I inputfilename.bam with -I ftplinktobamfile.bam ) it does not work. Is there another way of doing this? This would save me a lot of diskspace if I could do this. Created 2013-09-13 15:46:18 | Updated | Tags: printreads bam bug Hi, I was using PrintReads as part of Base Quality Recalibration. The resulting BAM header is incorrect as there are two @PG entries for GATK PrintReads. The bam file I am using is internal to the broad institute and is picard aggregated. There is an entry for GATK PrintReads in the original bam. Would it be possible for the @PG header lines to include a .A-Z like the other headers? Thanks Aaron ##### ERROR A USER ERROR has occurred (version 2.4-9-g532efad): ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed ##### ERROR Please do not post this error to the GATK forum ##### ERROR ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: SAM/BAM file /btl/projects/aaron/Mouse/G42214/TET-OT-827ctl.recalibrated.bam is malformed: Program record with group id GATK PrintReads already exists in SAMFileHeader! Created 2013-09-05 10:17:13 | Updated | Tags: splitsamfile gatk2 bam Hi, If I have a bam file with three different read groups, and use SplitSamFile to split it like so: java -Xmx2g -jarGATKJAR -T SplitSamFile -I $INBAM -R$GENOME --outputRoot $PROJD/$IND/

Each of the output bam files have all three read groups. Is that the intended behavior? I would like each file to have only it's own read group info in the heads. Sorry for the bash arguments in the code above, is makes in readable at least.

Thanks

Daniel

Created 2013-09-02 20:48:40 | Updated | Tags: unifiedgenotyper vcf bam priors

I don't know if this question has been asked, if so sorry.

When using UnifiedGenotyper, I was wondering if it was possible (via hidden command option, etc) to use a VCF file as a prior. Currently I have just been adding additional bam files, but it would be nice (and quicker) if I could use a Indexed file.

Shawn.

Created 2013-08-20 15:29:17 | Updated 2013-08-20 15:40:47 | Tags: bam

Hi I am using this version of GATK which is java -jar /apps1/gatk/GenomeAnalysisTK-2.6-5-gba531bd/GenomeAnalysisTK.jar

I have obtained my bam files using bowtie as aligner. So the quality score is 255 in the 5th column. I want to use UnifiedGenotyper and when i use it it gives me an error saying MappingQualityUnavailable.

So I see read filter ReassignOneMappingQualityFilter which can convert 255 to 60 and then it would be easily taken by UnifiedGenotyper to do the snp calling.

So what command line should i be using

I have just started to explore GATK. Hope to hear from you soon

Regards Varun

Created 2013-06-26 16:28:06 | Updated | Tags: unifiedgenotyper vcf bam multi-sample variant-calling

we are running tests trying to get UG to produce 1 vcf per sample when inputting bams from multiple subjects. our situation is complicated slightly by the fact that each sample has 3 bams. when we input all 6 bams into UG, hoping to output 2 vcfs (1 per sample) we instead get a single vcf. we found some relevant advice in this post: http://gatkforums.broadinstitute.org/discussion/2262/why-unifiedgenotyper-treat-multiple-bam-input-as-one-sample but still haven't solved the issue.

details include: 1) we are inputting 6 bams for our test, 3 per sample for 2 samples. 2) bams were generated using Bioscope from targeted capture reads sequenced on a Solid 4. 3) as recommended in the post above we checked out the @RG statements in the bam headers using Samtools -- lines for the 6 bams are as follows:

sample 1:

@RG ID:20130610202026358 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-10T16:20:26-0400 SM:S1

@RG ID:20130611214013844 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-11T17:40:13-0400 SM:S1

@RG ID:20130613002511879 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:147 DT:2013-06-12T20:25:11-0400 SM:S1

sample 2:

@RG ID:20130611021848236 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-10T22:18:48-0400 SM:S1

@RG ID:20130612014345277 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-11T21:43:45-0400 SM:S1

@RG ID:20130613085411753 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:150 DT:2013-06-13T04:54:11-0400 SM:S1

Based on the former post, I would have expected each of these bams to generate a separate vcf as it appears the ids are all different (which would not have been desirable either, as we are hoping to generate 2 vcfs in this test). Thus, it is not clear if/how we should use Picard tool AddOrReplaceReadGroups to modify the @RG headers?

Does that make sense? Any advice?

Created 2013-05-16 15:37:00 | Updated 2013-05-16 15:37:50 | Tags: depthofcoverage bam gatk

Hello,

i have spend many hours trying to run GATK depthofcoverage but it never works. last try: -T DepthOfCoverage -R /home/remi/Analyse/CNV/ERR125905/bam/Chloroplastgenomebarley.fa -I /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bam -L /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bed -o /home/remi/Analyse/CNV/FishingCNV_1.5.3/out/ERfiltre.bam.coverage --minMappingQuality 15 --minBaseQuality 10 --omitDepthOutputAtEachBase --logging_level INFO --summaryCoverageThreshold 5 --summaryCoverageThreshold 7 --summaryCoverageThreshold 10 --summaryCoverageThreshold 15 --summaryCoverageThreshold 20 --summaryCoverageThreshold 30 --summaryCoverageThreshold 50

My BAM header seem to be malformed.

ERROR MESSAGE: SAM/BAM file /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups

here is the 1rst line of the header:

@SQ SN:Chloroplastgenomebarley LN:136462 @PG ID:bwa PN:bwa VN:0.5.9-r16 ERR125905.35 99 Chloroplastgenomebarley 69543 29 101M = 69854 412 TTTGATCCCTCTGATCCTGTTCTGGATCCAATGTGGAGACAAGGTATGTTCGTAATTCCCTTCATGACTCGTTTAGGAATAACGGATCCTTGGGGTGGTTG D-:D?BDDDDCC-?ADCBBBDDDDD:BDD= :6 C-4<9@62@@<:?=B??B=DC28=B&?:AA:4 ERR125905.35 147 Chloroplastgenomebarley 69854 29 101M = 69543 -412 GGCTTTCTGTCGCTTGTGGGCTTTTCCTATAACGGCTTTTTATGTTCCTGGGATATGGGTATCCGATCCTTATGGACTAACTGGAAAAGTACAAGCTGTAA #################################################A-B49= @@2>+:CCC:@@ 66DD@-@DDD?B::@-CA:5?:ADD?ADBB??

I Have search in the forum and doc about it. I have try to reorder my header with picard:

@HD VN:1.4 SO:unsorted @SQ SN:Chloroplastgenomebarley LN:136462 UR:file:/home/remi/Analyse/REFGEN/Chloroplastgenomebarley.fa M5:7a7b36ef01cc1a2af1c8451ca3800f93 @PG ID:bwa PN:bwa VN:0.5.9-r16 ERR125905.35 99 Chloroplastgenomebarley 69543 29 101M = 69854 412 TTTGATCCCTCTGATCCTGTTCTGGATCCAATGTGGAGACAAGGTATGTTCGTAATTCCCTTCATGACTCGTTTAGGAATAACGGATCCTTGGGGTGGTTG D-:D?BDDDDCC-?ADCBBBDDDDD:BDD= :6 C-4<9@62@@<:?=B??B=DC28=B&?:AA:4 ERR125905.35 147 Chloroplastgenomebarley 69854 29 101M = 69543 -412 GGCTTTCTGTCGCTTGTGGGCTTTTCCTATAACGGCTTTTTATGTTCCTGGGATATGGGTATCCGATCCTTATGGACTAACTGGAAAAGTACAAGCTGTAA #################################################A-B49= @@2>+:CCC:@@ 66DD@-@DDD?B::@-CA:5?:ADD?ADBB??  but no more change.

Someone can help me please ?

Regards, Remi



Created 2013-04-02 09:21:05 | Updated 2013-04-02 09:21:56 | Tags: bam performance

Dear all,

I am currently running an analysis using the HaplotypeCaller on 300 large BAM files on our cluster and decided to chunk the the genome in 3MB bins in order for them to be processed in a decent time. I'm however experiencing very long runtimes as more and more jobs get scheduled to run in parallel on the same files. Looking at the GATK options, I saw these 2 that I thought could be of help and was wondering what were the recommendation for using them: --num_bam_file_handles --read_buffer_size

More precisely, does the num_bam_file_handles increase processing time by a lot? and what is the default value for --read_buffer_size ?

Thanks a lot, Laurent

Created 2013-03-29 20:27:26 | Updated | Tags: readbackedphasing bam

I have applied PhaseByTransmission on a trio with a ped file and now want to run ReadBackedPhasing. However, each of the trio variant calls were called from a different BAM file (as each was from a different individual). In the ReadBackedPhasing documentation it only mentions using the program with a single bam. Does this mean that I need to merge the bams for each of the three individuals into a single bam? If so, do you have any suggested programs that work well with GATK?

Created 2013-02-14 19:41:01 | Updated 2013-02-15 06:36:36 | Tags: indelrealigner input bam

Hello, I am a first-time user of GATK and have spent some time now on trying to get the input bam files in the appropriate format. To run IndelRealigner, I have added ReadGroups, Reordered and Index my bam file with the respective Picard-Tools.

My command-line is the following:

java -Djava.io.tmpdir='pwd'/tmp -jar GenomeAnalysisTK.jar -I ./add_read_groups_reorder_index.bam -R ./genome.fa -T IndelRealigner -targetIntervals ./gatk.intervals -o ./*.bam -known ./Mills-1000G-indels.vcf --consensusDeterminationModel KNOWNS_ONLY -LOD 0.4

I get the following message:

SAM/BAM file /home/gp53/tophat2-merge-ctl-1st-2nd-readgroups-reorder-index.bam is malformed: SAM file doesn't have any read groups defined in the header.`

My reads are paired-end aligned with TopHat2 I will appreciate your help on this. Thanks, G.

Created 2012-12-24 20:09:25 | Updated 2013-01-07 19:16:08 | Tags: bam readgroup picard

how to add read group to the bam file using PICARD generated from GS reference mapper BAM file?

Created 2012-11-29 08:16:23 | Updated | Tags: bam walker summary mapping reads

Hi, Does GATK2 provide a walker/option to summarize the read alignment in a given BAM file? The summary including total reads, reads mapped/%, reads uniquely mapped/%, reads uniquely mapped with 0mm/%, reads mapped on-target/%, reads uniquely mapped on-target%, etc is of great use to assess the mapping quality for whole genome or targeted analysis. Please advice me on how I can obtain this using any of the walkers available. Thanks, Raj

Created 2012-11-16 18:29:54 | Updated 2013-01-07 20:01:32 | Tags: vcf bam merge

Dear All, I am very new to the analysis of NGS data.

I would like to merge the information of sample 1029 from HGDP (http://cdna.eva.mpg.de/denisova/VCF/human/HGDP01029.hg19_1000g.12.mod.vcf.gz) to SAN sample in Schuster et al 2010 ftp://ftp.bx.psu.edu/data/bushman/hg18/bam/KB1illumChr12.bam)

If I well understood, I should call the variants from the bam file and then merge with the vcf. Is it correct? Could you gently suggest me the best way to do it in your opinion? When should i convert my files to the same reference sequence?

In addition I am looking at http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0, and I am trying to do Variant Detection on the example file NA12878. I have some doubt, Where I can find MarkDuplicates tool? Should I invoke it just with -T argument? Or Do I need to install it?

I am really sorry, I am trying to understand GATK, but it is not rally intuitive, so of you have any tips or recommendation please let me know it.

Created 2012-10-15 14:43:35 | Updated 2012-10-18 00:43:47 | Tags: depthofcoverage bam

I am getting the following error when running DepthOfCoverage:

I have already reheadered my bam file to fix a contig mismatch error, and the fasta dict file was generated automatically by gatk. Moreover, about 160 lines of output are generated, but I do not see any irregularities at the position where the code crashed. Please let me know what I can try. Thank you.

Created 2012-10-05 22:43:09 | Updated 2012-10-18 01:09:20 | Tags: unifiedgenotyper bam genotype

I was running UnifiedGenotyper on a set of 26 bam files. There was one particular position where I was comparing calls to the actual pileup and I noticed a major discrepancy. There was a no-call ("./.") for that position for one of the bam files while most other samples had calls. That non-called sample, though, had a very convincing variant in the pileup, with lots of high quality coverage at that position.

I then tried running just that bam file alone through UnifiedGenotyper, or that bam file along with two others. In both cases, the 1/1 variant is called properly with the following genotype field: