# Tagged with #picard 7 documentation articles | 2 announcements | 33 forum discussions

Created 2013-07-03 00:53:53 | Updated 2015-09-24 12:57:53 | Tags: analyst readgroup picard indexing

#### Objective

Fix a BAM that is not indexed or not sorted, has not had duplicates marked, or is lacking read group information. These steps can be performed independently of each other but this order is recommended.

#### Prerequisites

• Installed Picard tools

#### Steps

1. Sort the aligned reads by coordinate order
2. Mark duplicates
3. Add read group information
4. Index the BAM file

#### Note

You may ask, is all of this really necessary? The GATK is notorious for imposing strict formatting guidelines and requiring the presence of information such as read groups that other software packages do not require. Although this represents a small additional processing burden upfront, the downstream benefits are numerous, including the ability to process library data individually, and significant gains in speed and parallelization options.

### 1. Sort the aligned reads by coordinate order

#### Action

Run the following Picard command:

java -jar picard.jar SortSam \
INPUT=unsorted_reads.bam \
OUTPUT=sorted_reads.bam \
SORT_ORDER=coordinate


#### Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.

### 2. Mark duplicate reads

#### Action

Run the following Picard command:

java -jar picard.jar MarkDuplicates \
INPUT=sorted_reads.bam \
OUTPUT=dedup_reads.bam \
METRICS_FILE=metrics.txt


#### Expected Results

This creates a file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also creates a file called metrics.txt that contains metrics regarding duplication of the data.

#### More details

During the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called dedupping in bioinformatics slang) identifies these reads as such so that the GATK tools know to ignore them.

### 3. Add read group information

#### Action

Run the following Picard command:

java -jar picard.jar AddOrReplaceReadGroups \
INPUT=dedup_reads.bam \
OUTPUT=addrg_reads.bam \
RGID=group1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1


#### Expected Results

This creates a file called addrg_reads.bam with the same content as the input file, except that the reads will now have read group information attached.

### 4. Index the BAM file

#### Action

Run the following Picard command:

java -jar picard.jar BuildBamIndex \
INPUT=addrg_reads.bam


#### Expected Results

This creates an index file called addrg_reads.bai, which is ready to be used in the Best Practices workflow.

Since Picard tools do not systematically create an index file when they output a new BAM file (unlike GATK tools, which will always output indexed files), it is best to keep the indexing step for last.

Created 2013-07-02 00:16:14 | Updated 2015-09-24 12:12:04 | Tags: install rscript igv picard gsalib samtools r ggplot2 rstudio

#### Objective

Install all software packages required to follow the GATK Best Practices.

#### Prerequisites

To follow these instructions, you will need to have a basic understanding of the meaning of the following words and command-line operations. If you are unfamiliar with any of the following, you should consult a more experienced colleague or your systems administrator if you have one. There are also many good online tutorials you can use to learn the necessary notions.

• Basic Unix environment commands
• Binary / Executable
• Compiling a binary
• Adding a binary to your path
• Command-line shell, terminal or console
• Software library

You will also need to have access to an ANSI compliant C++ compiler and the tools needed for normal compilations (make, shell, the standard library, tar, gunzip). These tools are usually pre-installed on Linux/Unix systems. On MacOS X, you may need to install the MacOS Xcode tools. See https://developer.apple.com/xcode/ for relevant information and software downloads. The XCode tools are free but an AppleID may be required to download them.

Starting with version 2.6, the GATK requires Java Runtime Environment version 1.7. All Linux/Unix and MacOS X systems should have a JRE pre-installed, but the version may vary. To test your Java version, run the following command in the shell:

java -version


This should return a message along the lines of "java version 1.7.0_25" as well as some details on the Runtime Environment (JRE) and Virtual Machine (VM). If you have a version other than 1.7.x, be aware that you may run into trouble with some of the more advanced features of the Picard and GATK tools. The simplest solution is to install an additional JRE and specify which you want to use at the command-line. To find out how to do so, you should seek help from your systems administrator.

#### Software packages

1. BWA
2. SAMtools
3. Picard
4. Genome Analysis Toolkit (GATK)
5. IGV
6. RStudio IDE and R libraries ggplot2 and gsalib

Note that the version numbers of packages you download may be different than shown in the instructions below. If so, please adapt the number accordingly in the commands.

### 1. BWA

Read the overview of the BWA software on the BWA project homepage, then download the latest version of the software package.

• Installation

Unpack the tar file using:

tar xvjf bwa-0.7.12.tar.bz2


This will produce a directory called bwa-0.7.12 containing the files necessary to compile the BWA binary. Move to this directory and compile using:

cd bwa-0.7.12
make


The compiled binary is called bwa. You should find it within the same folder (bwa-0.7.12 in this example). You may also find other compiled binaries; at time of writing, a second binary called bwamem-lite is also included. You can disregard this file for now. Finally, just add the BWA binary to your path to make it available on the command line. This completes the installation process.

• Testing

Open a shell and run:

bwa


This should print out some version and author information as well as a list of commands. As the Usage line states, to use BWA you will always build your command lines like this:

bwa <command> [options]


This means you first make the call to the binary (bwa), then you specify which command (method) you wish to use (e.g. index) then any options (i.e. arguments such as input files or parameters) used by the program to perform that command.

### 2. SAMtools

Read the overview of the SAMtools software on the SAMtools project homepage, then download the latest version of the software package.

• Installation

Unpack the tar file using:

tar xvjf samtools-0.1.2.tar.bz2


This will produce a directory called samtools-0.1.2 containing the files necessary to compile the SAMtools binary. Move to this directory and compile using:

cd samtools-0.1.2
make


The compiled binary is called samtools. You should find it within the same folder (samtools-0.1.2 in this example). Finally, add the SAMtools binary to your path to make it available on the command line. This completes the installation process.

• Testing

Open a shell and run:

samtools


This should print out some version information as well as a list of commands. As the Usage line states, to use SAMtools you will always build your command lines like this:

samtools <command> [options]


This means you first make the call to the binary (samtools), then you specify which command (method) you wish to use (e.g. index) then any options (i.e. arguments such as input files or parameters) used by the program to perform that command. This convention is similar to the one used by BWA.

### 3. Picard

Read the overview of the Picard software on the Picard project homepage, then download the latest version of the software package.

• Installation

Unpack the zip file using:

unzip picard-tools-1.139.zip


This will produce a directory called picard-tools-1.139 containing the Picard jar files. Picard tools are distributed as a pre-compiled Java executable (jar file) so there is no need to compile them.

Note that it is not possible to add jar files to your path to make the tools available on the command line; you have to specify the full path to the jar file in your java command, which would look like this:

java -jar ~/my_tools/jars/picard.jar <Toolname> [options]


This syntax will be explained in a little more detail further below.

However, you can set up a shortcut called an "environment variable" in your shell profile configuration to make this easier. The idea is that you create a variable that tells your system where to find a given jar, like this:

PICARD="~/my_tools/jars/picard.jar"


So then when you want to run a Picard tool, you just need to call the jar by its shortcut, like this:

java -jar $PICARD <Toolname> [options]


The exact way to set this up depends on what shell you're using and how your environment is configured. We like this overview and tutorial which explains how it all works; but if you are new to the command line environment and you find this too much to deal with, we recommend asking for help from your institution's IT support group.

This completes the installation process.

• Testing

Open a shell and run:

java -jar picard.jar -h


This should print out some version and usage information.

At this point you will have noticed an important difference between BWA and Picard tools. To use BWA, we called on the BWA program and specified which of its internal tools we wanted to apply. To use Picard, we called on Java itself as the main program, then specified which jar file to use, and then which tool from within that jar we wanted to run. This applies to all Picard tools; to use them you will always build your command lines like this:

java -jar picard.jar <ToolName> [options]


This means you first make the call to Java itself as the main program, then specify the picard.jar file, then specify which tool you want, and finally you pass whatever other arguments (input files, parameters etc.) are needed for the analysis.

Note that the command-line syntax of Picard tools has recently changed from java -jar <ToolName>.jar to java -jar picard.jar <ToolName>. We are using the newer syntax in this document, but some of our other documents may not have been updated yet. If you encounter any documents using the old syntax, let us know and we'll update them accordingly. If you are already using an older version of Picard, either adapt the commands or better, upgrade your version!

Next we will see that GATK tools are called in essentially the same way, although the way the options are specified is a little different.
The reasons for how tools in a given software package are organized and invoked are largely due to the preferences of the software developers. They generally do not reflect strict technical requirements, although they can have an effect on speed and efficiency.

### 4. Genome Analysis Toolkit (GATK)

Hopefully if you're reading this, you're already acquainted with the purpose of the GATK, so go ahead and download the latest version of the software package. In order to access the downloads, you need to register for a free account on the GATK support forum. You will also need to read and accept the license agreement before downloading the GATK software package. Note that if you intend to use the GATK for commercial purposes, you will need to purchase a license. See the licensing page for an overview of the commercial licensing conditions.

• Installation

Unpack the tar file using:

tar xjf GenomeAnalysisTK-3.3-0.tar.bz2


This will produce a directory called GenomeAnalysisTK-3.3-0 containing the GATK jar file, which is called GenomeAnalysisTK.jar, as well as a directory of example files called resources. GATK tools are distributed as a single pre-compiled Java executable so there is no need to compile them. Just like we discussed for Picard, it's not possible to add the GATK to your path, but you can set up a shortcut to the jar file using environment variables as described above.

This completes the installation process.

• Testing

Open a shell and run:

java -jar GenomeAnalysisTK.jar -h


This should print out some version and usage information, as well as a list of the tools included in the GATK.
As the Usage line states, to use GATK you will always build your command lines like this:

java -jar GenomeAnalysisTK.jar -T <ToolName> [arguments]


This means that just like for Picard, you first make the call to Java itself as the main program, then specify the GenomeAnalysisTK.jar file, then specify which tool you want, and finally you pass whatever other arguments (input files, parameters etc.) are needed for the analysis.

### 5. IGV

The Integrated Genomics Viewer is a genome browser that allows you to view BAM, VCF and other genomic file information in context. It has a graphical user interface that is very easy to use, and can be downloaded for free (though registration is required) from this website. We encourage you to read through IGV's very helpful user guide, which includes many detailed tutorials that will help you use the program most effectively.

### 6. RStudio IDE and R libraries ggplot2 and gsalib

Download the latest version of RStudio IDE. The webpage should automatically detect what platform you are running on and recommend the version most suitable for your system.

• Installation

Follow the installation instructions provided. Binaries are provided for all major platforms; typically they just need to be placed in your Applications (or Programs) directory. Open RStudio and type the following command in the console window:

install.packages("ggplot2")


This will download and install the ggplot2 library as well as any other library packages that ggplot2 depends on for its operation. Note that some users have reported having to install two additional packages themselves, called reshape and gplots, which you can do as follows:

install.packages("reshape")
install.packages("gplots")


Finally, do the same thing to install the gsalib library:

install.packages("gsalib")


This will download and install the gsalib library.
#### Important note

If you are using a recent version of ggplot2 and a version of GATK older than 3.2, you may encounter an error when trying to generate the BQSR or VQSR recalibration plots. This is because until recently our scripts were still using an older version of certain ggplot2 functions. This has been fixed in GATK 3.2, so you should either upgrade your version of GATK (recommended) or downgrade your version of ggplot2. If you experience further issues generating the BQSR recalibration plots, please see this tutorial.

Created 2013-06-17 20:58:04 | Updated 2015-09-24 13:06:53 | Tags: official bwa picard mapping markduplicates

#### Objective

Map the read data to the reference and mark duplicates.

#### Prerequisites

• This tutorial assumes adapter sequences have been removed.

#### Steps

1. Identify read group information
2. Generate a SAM file containing aligned reads
3. Convert to BAM, sort and mark duplicates

### 1. Identify read group information

The read group information is key for downstream GATK functionality. The GATK will not work without a read group tag. Make sure to enter as much metadata as you know about your data in the read group fields provided. For more information about all the possible fields in the @RG tag, take a look at the SAM specification.

#### Action

Compose the read group identifier in the following format:

@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1


where the \t stands for the tab character.

### 2. Generate a SAM file containing aligned reads

#### Action

Run the following BWA command:

bwa mem -M -R '<read group info>' -p reference.fa raw_reads.fq > aligned_reads.sam


replacing the <read group info> bit with the read group identifier you composed at the previous step. The -M flag causes BWA to mark shorter split hits as secondary (essential for Picard compatibility).
#### Expected Result

This creates a file called aligned_reads.sam containing the aligned reads from all input files, combined, annotated and aligned to the same reference.

Note that here we are using a command that is specific for pair ended data in an interleaved fastq file, which is what we are providing to you as a tutorial file. To map other types of datasets (e.g. single-ended or pair-ended in forward/reverse read files) you will need to adapt the command accordingly. Please see the BWA documentation for exact usage and more options for these commands.

### 3. Convert to BAM, sort and mark duplicates

These initial pre-processing operations format the data to suit the requirements of the GATK tools.

#### Action

Run the following Picard command to sort the SAM file and convert it to BAM:

java -jar picard.jar SortSam \
INPUT=aligned_reads.sam \
OUTPUT=sorted_reads.bam \
SORT_ORDER=coordinate


#### Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.

#### Action

Run the following Picard command to mark duplicates:

java -jar picard.jar MarkDuplicates \
INPUT=sorted_reads.bam \
OUTPUT=dedup_reads.bam \
METRICS_FILE=metrics.txt


#### Expected Result

This creates a sorted BAM file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also produces a metrics file called metrics.txt containing (can you guess?) metrics.

#### Action

Run the following Picard command to index the BAM file:

java -jar picard.jar BuildBamIndex \
INPUT=dedup_reads.bam


#### Expected Result

This creates an index file for the BAM file called dedup_reads.bai.

Created 2013-06-17 20:44:09 | Updated 2015-09-24 13:10:42 | Tags: picard

#### Objective

Prepare a reference sequence so that it is suitable for use with BWA and GATK.

#### Prerequisites

• Installed BWA
• Installed SAMtools
• Installed Picard

#### Steps

1. Generate the BWA index
2. Generate the Fasta file index
3. Generate the sequence dictionary

### 1. Generate the BWA index

#### Action

Run the following BWA command:

bwa index -a bwtsw reference.fa


where -a bwtsw specifies that we want to use the indexing algorithm that is capable of handling the whole human genome.

#### Expected Result

This creates a collection of files used by BWA to perform the alignment.

### 2. Generate the fasta file index

#### Action

Run the following SAMtools command:

samtools faidx reference.fa


#### Expected Result

This creates a file called reference.fa.fai, with one record per line for each of the contigs in the FASTA reference file. Each record is composed of the contig name, size, location, basesPerLine and bytesPerLine.

### 3. Generate the sequence dictionary

#### Action

Run the following Picard command:

java -jar picard.jar CreateSequenceDictionary \
REFERENCE=reference.fa \
OUTPUT=reference.dict


Note that this is the new syntax for use with the latest version of Picard. Older versions used a slightly different syntax because all the tools were in separate jars, so you'd call e.g. java -jar CreateSequenceDictionary.jar directly.

#### Expected Result

This creates a file called reference.dict formatted like a SAM header, describing the contents of your reference FASTA file.

Created 2013-02-13 18:06:06 | Updated 2014-06-24 09:45:27 | Tags: tribble picard sam-jdk variantcontext htsjdk

The picard repository on github contains all picard public tools. Libraries live under the htsjdk, which includes the samtools-jdk, tribble, and variant packages (which includes VariantContext and associated classes as well as the VCF/BCF codecs).

If you just need to check out the sources and don't need to make any commits into the picard repository, the command is:

git clone https://github.com/broadinstitute/picard.git


Then within the picard directory, clone the htsjdk:
cd picard
git clone https://github.com/samtools/htsjdk.git


Then you can attach the picard/src/java and picard/htsjdk/src/java directories in IntelliJ as a source directory (File -> Project Structure -> Libraries -> Click the plus sign -> "Attach Files or Directories" in the latest IntelliJ).

To build picard and the htsjdk all at once, type ant from within the picard directory. To run tests, type ant test.

If you do need to make commits into the picard repository, first you'll need to create a github account, fork picard or htsjdk, make your changes, and then issue a pull request. For more info on pull requests, see: https://help.github.com/articles/using-pull-requests

Created 2012-08-15 18:16:01 | Updated 2014-09-17 13:20:21 | Tags: official developer tribble intermediate picard htsjdk

## 1. Overview

The Tribble project was started as an effort to overhaul our reference-ordered data system; we had many different formats that were shoehorned into a common framework that didn't really work as intended. What we wanted was a common framework that allowed for searching of reference-ordered data, regardless of the underlying type. Jim Robinson had developed indexing schemes for text-based files, which were incorporated into the Tribble library.

## 2. Architecture Overview

Tribble provides a lightweight interface and API for querying features and creating indexes from feature files, while allowing iteration over known feature files that we're unable to create indexes for. The main entry point for external users is the BasicFeatureReader class. It takes in a codec, an index file, and a file containing the features to be processed. With an instance of a BasicFeatureReader, you can query for features that span a specific location, or get an iterator over all the records in the file.

## 3. Developer Overview

For developers, there are two important classes to implement: the FeatureCodec, which decodes lines of text and produces features, and the Feature class, which is your underlying record type.

• Feature

This is the genomically oriented feature that represents the underlying data in the input file. For instance in the VCF format, this is the variant call including quality information, the reference base, and the alternate base. The required information to implement a feature is the chromosome name, the start position (one-based), and the stop position. The start and stop position represent a closed, one-based interval, i.e. the first base in chromosome one would be chr1:1-1.

• FeatureCodec

This class takes in a line of text (from an input source, whether it's a file, compressed file, or an http link), and produces the above feature.

To implement your new format into Tribble, you need to implement the two above classes (in an appropriately named subfolder in the Tribble check-out). The Feature object should know nothing about the file representation; it should represent the data as an in-memory object. The interface for a feature looks like:

public interface Feature {

    /**
     * Return the features reference sequence name, e.g chromosome or contig
     */
    public String getChr();

    /**
     * Return the start position in 1-based coordinates (first base is 1)
     */
    public int getStart();

    /**
     * Return the end position following 1-based fully closed conventions. The length of a feature is
     * end - start + 1;
     */
    public int getEnd();
}


And the interface for FeatureCodec:

/**
 * the base interface for classes that read in features.
 * @param <T> The feature type this codec reads
 */
public interface FeatureCodec<T extends Feature> {

    /**
     * Decode a line to obtain just its FeatureLoc for indexing -- contig, start, and stop.
     *
     * @param line the input line to decode
     * @return Return the FeatureLoc encoded by the line, or null if the line does not represent a feature (e.g. is
     * a comment)
     */
    public Feature decodeLoc(String line);

    /**
     * Decode a line as a Feature.
     *
     * @param line the input line to decode
     * @return Return the Feature encoded by the line, or null if the line does not represent a feature (e.g. is
     * a comment)
     */
    public T decode(String line);

    /**
     * This function returns the object the codec generates. This is allowed to be Feature in the case where
     * conditionally different types are generated. Be as specific as you can though.
     *
     * This function is used by reflections-based tools, so we can know the underlying type.
     *
     * @return the feature type this codec generates.
     */
    public Class<T> getFeatureType();

    /**
     * Read and return the header, or null if there is no header.
     *
     * @return header object
     */
    public Object readHeader(LineReader reader);
}


## 4. Supported Formats

The following formats are supported in Tribble:

• VCF Format
• DbSNP Format
• BED Format
• GATK Interval Format

## 5. Updating the Tribble, htsjdk, and/or Picard library

Updating the revision of Tribble on the system is a relatively straightforward task if the following steps are taken.

NOTE: Any directory starting with ~ may be different on your machine, depending on where you cloned the various repositories for gsa-unstable, picard, and htsjdk.

A Maven script to install picard into the local repository is located under gsa-unstable/private/picard-maven. To operate, it requires a symbolic link named picard pointing to a working checkout of the picard github repository. NOTE: compiling picard requires an htsjdk github repository checkout available at picard/htsjdk, either as a subdirectory or another symbolic link. The final full path should be gsa-unstable/private/picard-maven/picard/htsjdk.
cd ~/src/gsa-unstable
cd private/picard-maven
ln -s ~/src/picard picard


Create a git branch of Picard and/or htsjdk and make your changes. To install your changes into the GATK you must run mvn install in the private/picard-maven directory. This will compile and copy the jars into gsa-unstable/public/repo, and update gsa-unstable/gatk-root/pom.xml with the corresponding version. While you are making changes, your revision of picard and htsjdk will be labeled with -SNAPSHOT.

cd ~/src/gsa-unstable
cd private/picard-maven
mvn install


Continue testing in the GATK. Once your changes and updated tests for picard/htsjdk are complete, push your branch and submit your pull request to the Picard and/or htsjdk github. After your Picard/htsjdk patches are accepted, switch your Picard/htsjdk branches back to the master branch. NOTE: Leave your gsa-unstable branch on your development branch!

cd ~/src/picard
ant clean
git checkout master
git fetch
git rebase
cd htsjdk
git checkout master
git fetch
git rebase


NOTE: The version number of old and new Picard/htsjdk will vary, and during active development will end with -SNAPSHOT. While, if needed, you may push a -SNAPSHOT version for testing on Bamboo, you should NOT submit a pull request with a -SNAPSHOT version. -SNAPSHOT indicates your local changes are not reproducible from source control.

When ready, run mvn install once more to create the non -SNAPSHOT versions under gsa-unstable/public/repo. In that directory, git add the new version, and git rm the old versions.

cd ~/src/gsa-unstable
cd public/repo
git add picard/picard/1.115.1499/
git add samtools/htsjdk/1.115.1509/
git rm -r picard/picard/1.112.1452/
git rm -r samtools/htsjdk/1.112.1452/


Commit and then push your gsa-unstable branch, then issue a pull request for review.
Created 2012-07-23 18:02:24 | Updated 2015-05-15 09:13:41 | Tags: bam utilities picard reordersam sorting order

This error occurs when, for example, a collaborator gives you a BAM that's derived from what was originally the same reference as you are using, but for whatever reason the contigs are not sorted in the same order. The GATK can be particular about the ordering of a BAM file, so it will fail with an error in this case.

So what do you do? You use a Picard tool called ReorderSam to, well, reorder your BAM file. Here's an example usage where we reorder a BAM file that was sorted lexicographically so that the output will be another BAM, but this time sorted karyotypically:

java -jar picard.jar ReorderSam \
I=lexicographic.bam \
O=karyotypic.bam \
REFERENCE=Homo_sapiens_assembly18.karyotypic.fasta


This tool requires you have a correctly sorted version of the reference sequence you used to align your reads. Be aware that this tool will drop reads that don't have equivalent contigs in the new reference (potentially bad, but maybe not). If contigs have the same name in the bam and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a liftover tool!

This tool is part of the Picard package.

Created 2015-03-09 19:07:23 | Updated | Tags: picard support htsjdk

Here's some good news for anyone who has been using both GATK and the Picard tools in their work -- which means all of you, since you all follow our Best Practices to a tee, right?

As you may know, both toolkits are developed here at the Broad Institute, and are deployed together in the Broad's analysis pipelines. The fact that they have been developed, released and supported separately so far is more an accident of history and internal organization than anything else (and we know it's inconvenient to y'all). The good news is that we're taking steps to consolidate these efforts, which we believe will benefit everyone.
In that spirit, we have been working closely with the Picard tools development team, and we're now ready to take the first step of consolidating support for the tools. From now on, you will be able to ask us questions about the Picard tools, and report bugs, in the GATK forum. And developers will be happy to hear that we are also committed to supporting HTSJDK for developers through the Github repo's Issues tracker. In the near future, we will also start hosting downloads and documentation for the Picard tools on the GATK website. And before you ask, yes, the Picard tools will continue to be completely open-source and available to all free of charge.

To recap, we have brought the GATK and Picard teams together, and we are working on bringing together in the same place all the methods and tools to perform genome analysis. Our goal is to make a world where you can run our complete Best Practices pipeline end-to-end with a single Broad toolkit. We think it'll make your life easier, because it sure is making ours easier.

Created 2014-11-05 19:09:49 | Updated 2014-11-24 22:57:13 | Tags: commandline picard topstory syntax

Consider this a public service announcement, since most GATK users probably also use Picard tools routinely. The recently released version 1.124 of the Picard tools includes many lovely improvements, bug fixes and even a few new tools (see release notes for full details) -- but you'll want to pay attention to one major change in particular.

From this version onward, the Picard release will contain a single JAR file containing all the tools, instead of one JAR file per tool as it was before. This means that when you invoke a Picard tool, you'll invoke a single JAR, then specify which tool (which they call CLP for Command Line Program) you want to run. This should feel completely familiar if you already use GATK regularly, but it does mean you'll need to update any scripts that use Picard tools to the new paradigm.
Other than that, there's no change of syntax; Picard will still use e.g. I=input.bam where GATK would use -I input.bam. We will need to update some of our own documentation accordingly over the near future; please bear with us as we go through this process, and let us know by commenting in this thread if you find any docs that have yet to be updated.

Created 2015-10-02 12:37:21 | Updated | Tags: picard genotypeconcordance gatk-resource-bundle

I want to compare my vcf file to a vcf file supplied in the GATK bundle using Picard GenotypeConcordance. In the terminology used by Picard GenotypeConcordance, I want to use a vcf file in the bundle as the "truth sample." The problem is the vcf files in the bundle lack the sample name needed by Picard GenotypeConcordance. That is, there is no value in these supplied vcf files to satisfy the required Picard GenotypeConcordance option:

TRUTH_SAMPLE (String) The name of the truth sample within the truth VCF. Required.

Take dbsnp_138.hg19.vcf.gz as an example:

zcat dbsnp_138.hg19.vcf.gz | grep CHROM
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO


Based on the description of the VCF file format elsewhere on this GATK site (https://broadinstitute.org/gatk/guide/article?id=1268), I expected to see a FORMAT field and a sample name field following the INFO field.

How should I proceed?
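For reference, the #CHROM header line tells you whether a VCF carries genotype columns at all: a sites-only file like dbsnp_138 has exactly the 8 fixed columns, while a genotyped VCF adds FORMAT plus one column per sample. A minimal sketch of the check (in Python rather than htsjdk, purely for illustration; the example lines are made up):

```python
def vcf_sample_names(lines):
    """Return the sample names from a VCF's #CHROM header line,
    or [] for a sites-only VCF with no genotype columns."""
    for line in lines:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            # fields 0-7 are CHROM..INFO; field 8 (if present) is FORMAT;
            # anything after that is a sample column
            return fields[9:]
    raise ValueError("no #CHROM header line found")

sites_only = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"]
with_samples = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\n"]
print(vcf_sample_names(sites_only))    # []
print(vcf_sample_names(with_samples))  # ['NA12878']
```

If this returns an empty list for the bundle file, there is simply no sample name that could satisfy TRUTH_SAMPLE, regardless of what value you pass.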

Created 2015-09-27 12:23:40 | Updated | Tags: picard markduplicates libraries

I was hoping this had been addressed already on the forum, but I've not seen a definitive answer although I have seen a similar question posed on this and other forums.

Our current duplicate-marking procedure with Picard MarkDuplicates is to run it on merges of lane data generated from the same library. I believe this makes sense; once duplicates are marked, the library-level merges are combined to create a sample-level, multi-library BAM file. Any duplicates found across libraries would not be expected to be PCR duplicates, just identical fragments.

It's not clear, though, whether Picard MarkDuplicates is library-aware, i.e. when it marks duplicates, does it only consider read pairs from the same library, or, if run against a BAM merged from multiple libraries, will it mark any duplicates it finds?

I don't see this addressed in the documentation, so I assume that is not the case, but I have seen suggestions elsewhere that it might be so.
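For illustration only, here is a toy sketch of what "library-aware" would mean in practice: duplicate groups are keyed on the library (from the read group) as well as on position and strand, so identical fragments from different libraries are never marked against each other. This is not Picard's actual code; all names and the pick-the-best rule are invented:

```python
from collections import defaultdict

def mark_duplicates(reads, library_aware=True):
    """Toy duplicate marker. Each read is a tuple
    (name, library, chrom, pos, strand, qual). Within each group,
    the highest-quality read is kept and the rest are marked."""
    groups = defaultdict(list)
    for r in reads:
        name, lib, chrom, pos, strand, qual = r
        key = (lib, chrom, pos, strand) if library_aware else (chrom, pos, strand)
        groups[key].append(r)
    dups = set()
    for members in groups.values():
        members.sort(key=lambda r: -r[5])  # best quality first
        for r in members[1:]:
            dups.add(r[0])
    return dups

reads = [
    ("a", "libA", "chr1", 100, "+", 60),
    ("b", "libA", "chr1", 100, "+", 30),  # duplicate of a (same library)
    ("c", "libB", "chr1", 100, "+", 50),  # same coordinates, other library
]
print(sorted(mark_duplicates(reads, library_aware=True)))   # ['b']
print(sorted(mark_duplicates(reads, library_aware=False)))  # ['b', 'c']
```

The difference between the two calls is exactly the behaviour the question asks about: whether read "c" gets marked when the merged BAM contains multiple libraries.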

Created 2015-09-26 12:01:02 | Updated | Tags: bam picard htsjdk

Dear colleagues,

I am running a third-party program that makes use of the htsjdk library. For some of my samples (BWA alignments) it raises the following exception:

Exception in thread "Thread-5" java.lang.IllegalStateException: Inappropriate call if not paired read
    at net.sf.samtools.SAMRecord.requireReadPaired(SAMRecord.java:655)
    at net.sf.samtools.SAMRecord.getMateUnmappedFlag(SAMRecord.java:682)

It seems there may be some problem inside the BAM file. However, when I run Picard's ValidateSamFile it finds only 1,155 INVALID_MAPPING_QUALITY errors (no errors related to mate reads). I nevertheless repaired the BAM file using Picard's CleanSam and re-ran the third-party program, but the "Inappropriate call if not paired read" error persisted.

How could I solve this, please?

Thanks in advance, Federico

Created 2015-09-16 18:12:09 | Updated | Tags: java picard

Hi GATK team!

I have an issue with Picard SamToFastq.

I ran on a 16 GB BAM this command (on the latest version of Picard, I tried several):

java -Djava.io.tmpdir=/scratch -XX:ParallelGCThreads=8 -Dsamjdk.use_async_io=true -Dsamjdk.buffer_size=4194304 -Xmx12G -jar /picard/dist/picard.jar SamToFastq \
VALIDATION_STRINGENCY=LENIENT \
INPUT=input.bam \
FASTQ=output.pair1.fastq.gz \
SECOND_END_FASTQ=output.pair2.fastq.gz


With this command line, I was able to process WES BAMs two or three times larger without any memory complaints.

On this occasion, I got the result below (I also tweaked the MAX_RECORDS_IN_RAM argument and moved from a 24 GB RAM machine to a 48 GB RAM machine):

[Wed Sep 16 12:17:28 EDT 2015] picard.sam.SamToFastq INPUT=input.bam FASTQ=output.pair1.fastq.gz SECOND_END_FASTQ=output.pair2.fastq.gz VALIDATION_STRINGENCY=LENIENT    OUTPUT_PER_RG=false RG_TAG=PU RE_REVERSE=true INTERLEAVE=false INCLUDE_NON_PF_READS=false READ1_TRIM=0 READ2_TRIM=0 INCLUDE_NON_PRIMARY_ALIGNMENTS=false VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Wed Sep 16 12:17:28 EDT 2015] Executing as emixaM@glop on Linux 2.6.32-504.23.4.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-ea-b07; Picard version: 1.139(fd19c75b9a42d82cb57d45b53e1f93f0a3588541_1442367311) JdkDeflater
INFO    2015-09-16 12:17:36 SamToFastq  Processed     1 000 000 records.  Elapsed time: 00:00:07s.  Time for last 1 000 000:    7s.  Last read position: chr1:19 244 878
INFO    2015-09-16 12:17:45 SamToFastq  Processed     2 000 000 records.  Elapsed time: 00:00:17s.  Time for last 1 000 000:    9s.  Last read position: chr1:38 435 197
INFO    2015-09-16 12:17:57 SamToFastq  Processed     3 000 000 records.  Elapsed time: 00:00:29s.  Time for last 1 000 000:   12s.  Last read position: chr1:63 999 267
INFO    2015-09-16 12:18:10 SamToFastq  Processed     4 000 000 records.  Elapsed time: 00:00:42s.  Time for last 1 000 000:   13s.  Last read position: chr1:101 467 007
INFO    2015-09-16 12:18:17 SamToFastq  Processed     5 000 000 records.  Elapsed time: 00:00:48s.  Time for last 1 000 000:    6s.  Last read position: chr1:144 192 193
INFO    2015-09-16 12:18:36 SamToFastq  Processed     6 000 000 records.  Elapsed time: 00:01:07s.  Time for last 1 000 000:   19s.  Last read position: chr1:154 917 552
INFO    2015-09-16 12:18:43 SamToFastq  Processed     7 000 000 records.  Elapsed time: 00:01:14s.  Time for last 1 000 000:    7s.  Last read position: chr1:171 501 716
INFO    2015-09-16 12:19:01 SamToFastq  Processed     8 000 000 records.  Elapsed time: 00:01:33s.  Time for last 1 000 000:   18s.  Last read position: chr1:201 869 422
INFO    2015-09-16 12:19:08 SamToFastq  Processed     9 000 000 records.  Elapsed time: 00:01:40s.  Time for last 1 000 000:    6s.  Last read position: chr1:230 511 835
INFO    2015-09-16 12:19:15 SamToFastq  Processed    10 000 000 records.  Elapsed time: 00:01:46s.  Time for last 1 000 000:    6s.  Last read position: chr2:20 182 092
INFO    2015-09-16 12:19:57 SamToFastq  Processed    11 000 000 records.  Elapsed time: 00:02:28s.  Time for last 1 000 000:   41s.  Last read position: chr2:52 391 334
INFO    2015-09-16 12:20:55 SamToFastq  Processed    12 000 000 records.  Elapsed time: 00:03:26s.  Time for last 1 000 000:   58s.  Last read position: chr2:87 420 712
[Wed Sep 16 12:38:29 EDT 2015] picard.sam.SamToFastq done. Elapsed time: 21,03 minutes.
Runtime.totalMemory()=11453595648
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at htsjdk.samtools.BAMRecord.decodeAttributes(BAMRecord.java:308)
at htsjdk.samtools.BAMRecord.getAttribute(BAMRecord.java:288)
at picard.sam.SamToFastq.doWork(SamToFastq.java:166)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)


Do you have an idea of what I could change to avoid hitting this overhead limit?

Cheers!

Created 2015-09-14 20:57:40 | Updated | Tags: picard

Our group is trying to get a bwa-sampe/picard pipeline running on a windows cluster, and we're having problems piping to picard.

On unix, we'd use 'java -jar picard.jar SortSam I=/dev/stdin ...', but Windows doesn't recognize /dev/stdin (even under cygwin) and complains that the file doesn't exist.

We can, of course, write the sam file out in its entirety, but we'd rather save the time and writing to disk, if possible.

Anyone have a solution (other than switch to unix)? Thanks in advance.

Created 2015-09-03 07:52:38 | Updated | Tags: picard picard-markduplicates

Hi there,

Hope someone can shed some light on this issue.

I have a problem running Picard MarkDuplicates: I get a "No space left on device" error. Searching around, I found people mentioning that it might be an issue with the tmpdir folder specified; however, the folder I'm using for tmpdir is massive (72GB). Looking more closely at the error log, I found the "retain ... data points before spilling to disk" line.

It shows a number that very closely matches the number of records read before the error message (28872640 vs 29,000,000):

INFO 2015-09-03 15:53:32 MarkDuplicates Will retain up to 28872640 data points before spilling to disk.
...
INFO 2015-09-03 15:55:50 MarkDuplicates Read 29,000,000 records. Elapsed time: 00:02:18s. Time for last 1,000,000: 4s. Last read position: chr7:39,503,936
INFO 2015-09-03 15:55:50 MarkDuplicates Tracking 195949 as yet unmatched pairs. 13309 records in RAM.
[Thu Sep 03 15:55:53 EST 2015] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 2.35 minutes.
Runtime.totalMemory()=6107234304
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: java.io.IOException: No space left on device
    at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:245)
    at htsjdk.samtools.util.SortingCollection.add(SortingCollection.java:165)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:281)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:114)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:318)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:127)
    at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:100)
    at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:137)
    at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:236)
    ... 6 more

I had a play around with the java memory option (-Xmx??g) when issuing my MarkDuplicates call, and I see that increasing the memory increases the number of data points retained before spilling to disk. This in turn increases the number of records read before my "No space left on device" error.

e.g. -Xmx16g gave me 59674689 data points before spilling to disk, and I got up to 60,000,000 records read before the "no space left on device" error.

I know I can increase my memory to allow for more records, but there is a limit to doing that if I have a huge bam.

What I would like to know is what "Will retain up to 28872640 data points before spilling to disk." actually means. I thought it was a safeguard for memory usage: if the number of records/data points is exceeded, some are written to file, allowing more records to be read. That would mean you can still process a large BAM with only a small amount of memory, but it does not seem to work that way from what I'm seeing.
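For what it's worth, the pattern the message describes can be sketched in a few lines: records are buffered in memory up to the limit, then sorted runs are spilled to temporary files and merged back together on read. The safeguard protects RAM, but every spilled run still consumes tmp-disk space, so a genuinely full tmp filesystem fails exactly this way. A toy sketch (not Picard's SortingCollection; the class name is invented):

```python
import heapq
import tempfile

class SpillingSorter:
    """Toy external sorter: buffer up to max_in_ram items, spill sorted
    runs to temp files, then merge-iterate everything in order."""
    def __init__(self, max_in_ram):
        self.max_in_ram = max_in_ram
        self.buffer = []
        self.spill_files = []

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.max_in_ram:
            self._spill()

    def _spill(self):
        # each spill writes one sorted run to a temp file (disk usage!)
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        for x in sorted(self.buffer):
            f.write(f"{x}\n")
        f.seek(0)
        self.spill_files.append(f)
        self.buffer = []

    def __iter__(self):
        runs = [sorted(self.buffer)]
        for f in self.spill_files:
            f.seek(0)
            runs.append(int(line) for line in f)
        return heapq.merge(*runs)

s = SpillingSorter(max_in_ram=3)
for x in [5, 1, 4, 2, 8, 3, 7]:
    s.add(x)                   # spills two runs: [1,4,5] and [2,3,8]
print(list(s))                 # [1, 2, 3, 4, 5, 7, 8]
print(len(s.spill_files))      # 2
```

So a higher -Xmx means fewer, larger spills, but the total spilled volume for a given BAM is roughly the same; if the error really is disk space, pointing TMP_DIR at a filesystem with more free space (or checking quotas on it) matters more than the heap size.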

My command:

java -Xmx16g -Djavaio.tmpdir=/short/a32/working_temp -jar $PICARD_TOOLS_DIR/picard.jar MarkDuplicates \
INPUT=output.bam \
OUTPUT=output.marked.bam \
METRICS_FILE=metrics \
CREATE_INDEX=true \
VALIDATION_STRINGENCY=LENIENT

Hope you can help, and thanks for your assistance in advance. Eddie

Created 2015-08-31 10:15:59 | Updated | Tags: picard filtervariants

Dear all, I have a problem using the Picard tool FilterVcf. I first ran the command:

java -jar picard.jar FilterVcf \
MIN_AB=10.0 \
INPUT=12-3455_S6.final.vcf \
OUTPUT=test.vcf

And I received this error:

[Mon Aug 31 12:04:53 CEST 2015] Executing as smerella@b002 on Linux 2.6.32-279.2.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18; Picard version: 1.138() JdkDeflater
[Mon Aug 31 12:04:56 CEST 2015] picard.vcf.filter.FilterVcf done. Elapsed time: 0.06 minutes.
Runtime.totalMemory()=1138753536
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: A reference dictionary is required for creating Tribble indices on the fly
    at htsjdk.variant.variantcontext.writer.VariantContextWriterBuilder.build(VariantContextWriterBuilder.java:389)
    at picard.vcf.filter.FilterVcf.doWork(FilterVcf.java:97)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Then I tried:

java -jar picard.jar FilterVcf \
REFERENCE_SEQUENCE=/lustre1/genomes/hg19/fa/hg19 \
MIN_AB=10.0 \
INPUT=12-3455_S6.final.vcf \
OUTPUT=test.vcf

And I received the same error. In the directory /lustre1/genomes/hg19/fa/ I have the files hg19.dict, hg19.fa and hg19.fa.fai. I don't understand what I am doing wrong. Thanks for the help.

[smerella@b002 scratch_bmc]$ java -jar picard.jar FilterVcf --version
1.138()

Created 2015-08-20 17:51:21 | Updated | Tags: picard markduplicates

Is it possible to retrieve Picard duplication metrics without running MarkDuplicates? I would like to get these metrics for a merged bam file where the original bams already have dups flagged, and did not go through the same PCR.

Created 2015-08-20 10:56:07 | Updated | Tags: picard metrics

Hi all, I have two questions about the Picard metrics. Sorry if they are stupid or already answered, but I didn't find anything by googling or searching this forum.

1. HsMetrics: the metric-definitions page states that the tool returns the percentage of all target bases achieving 2x, 10x, 20x, 30x, 40x, 50x and 100x, but when I run the command, the output file contains only 2x, 10x, 20x and 30x. Is that correct? Is there an option I have to use to add the 40x, 50x and 100x information? My command is:

java -jar CalculateHsMetrics.jar \
BAIT_INTERVALS=PICARD_interval_list_CANCER.list \
REFERENCE_SEQUENCE=hg19.fa \
TARGET_INTERVALS=PICARD_interval_list_CANCER.list \
BAIT_SET_NAME=CANCER \
METRIC_ACCUMULATION_LEVEL=SAMPLE \
INPUT=sample.bam \
OUTPUT=sample.metrics

2. On the Picard metrics page I found metrics about variant calling (https://broadinstitute.github.io/picard/picard-metric-definitions.html#CollectVariantCallingMetrics.VariantCallingDetailMetrics), but I couldn't find the associated Picard tool. Is that an old metric no longer in use?

Thanks for the help; I hope I'm not bothering you.

java -jar CalculateHsMetrics.jar --version 1.66(1178) Linux version 2.6.32-279.2.1.el6.x86_64 (mockbuild@c6b7.bsys.dev.centos.org)

Created 2015-08-12 23:44:49 | Updated 2015-08-12 23:53:26 | Tags: liftovervariants vcf hapmap picard liftovervcf

I am having a problem with picard's LiftoverVcf.

I am trying to lift over HapMap files (plink files downloaded from HapMap and converted to VCF using plink) from NCBI36 to hg38. I was able to do this with GATK LiftoverVariants. My problem came when I had to merge the hapmap.hg38 files with some genotype files (which I lifted over from hg19 to hg38 using GATK LiftoverVariants) so that I can run population stratification using plink. I used vcf-merge, but it complained that a SNP has a different reference allele in the two files: rs3094315 should have reference allele G, which was correct in the genotype.hg38 files but wrong in the hapmap.hg38 files. I also tried first lifting hapmap.ncbi36 to hg19 and then to hg38, but the offending allele was still there. So I decided to try lifting hapmap.ncbi36 with LiftoverVcf from Picard.

1. I downloaded the newest picard build (20 hours old) picard-tools-1.138.
2. Used the command:

java -Xmx6000m -jar ../../../tools/picard-tools-1.138/picard.jar LiftoverVcf \
I=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.vcf \
O=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.vcf \
C=../../../tools/liftover/chain_files/hg18ToHg38.over.chain \
REJECT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.reject.vcf \
R=../../../data/assemblies/hg38/hg38.fa \
VERBOSITY=ERROR

Here is the run: [Thu Aug 13 00:43:40 CEST 2015] picard.vcf.LiftoverVcf INPUT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.vcf OUTPUT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.vcf CHAIN=......\tools\liftover\chain_files\hg18ToHg38.over.chain REJECT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.reject.vcf REFERENCE_SEQUENCE=......\data\assemblies\hg19\assemble\hg38.fa VERBOSITY=ERROR QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json

Here is the error:

Exception in thread "main" java.lang.IllegalStateException: Allele in genotype A* not in the variant context [T*, C]
    at htsjdk.variant.variantcontext.VariantContext.validateGenotypes(VariantContext.java:1357)
    at htsjdk.variant.variantcontext.VariantContext.validate(VariantContext.java:1295)
    at htsjdk.variant.variantcontext.VariantContext.<init>(VariantContext.java:410)
    at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:496)
    at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:490)
    at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:200)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

1. I have no idea which SNP is the problem.
2. I do not know what T* means (it does not seem to exist in the file).
3. I am new to Picard, so I thought VERBOSITY=ERROR would give me something more, but nothing more appeared.
4. Given that lifting hapmap.ncbi36 to hg19 and then to hg38 produced the same erroneous reference allele, I suppose lifting will not fix this and I will have to work with dbsnp to correct my file. Do you know how I can change the reference allele in a VCF? Is there a tool for this? Is there a liftover tool for dbsnp?
5. As a side note, I want to make Picard work because I read that you will be deprecating the GATK liftover and will support the Picard liftover (at some point in the future), so help with this tool will be appreciated.
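Regarding point 4, one hedged idea rather than a known tool: for biallelic SNPs where the two builds simply disagree on which allele is labelled REF, the swap can be scripted by exchanging REF/ALT and flipping the GT indices. A toy sketch under that assumption (the rsID set and example line are hypothetical, GT is assumed to be the first FORMAT field, and multiallelic or strand-flipped sites need real care):

```python
def swap_ref_alt(vcf_line, ids_to_fix):
    """Swap REF and ALT (and flip 0/1 genotype indices) for biallelic
    records whose ID is in ids_to_fix; return other lines unchanged."""
    f = vcf_line.rstrip("\n").split("\t")
    if f[2] not in ids_to_fix or "," in f[4]:
        return vcf_line  # not listed, or multiallelic: leave alone
    f[3], f[4] = f[4], f[3]
    flip = {"0": "1", "1": "0"}
    for i in range(9, len(f)):  # sample columns
        gt, sep, rest = f[i].partition(":")
        f[i] = "".join(flip.get(c, c) for c in gt) + sep + rest
    return "\t".join(f) + "\n"

line = "1\t752566\trs3094315\tA\tG\t.\t.\t.\tGT\t0/0\t0/1\n"
print(swap_ref_alt(line, {"rs3094315"}), end="")
# -> 1\t752566\trs3094315\tG\tA\t.\t.\t.\tGT\t1/1\t1/0
```

This only relabels alleles; it does not validate against the new reference, so it should be paired with a REF check against the hg38 fasta before trusting the output.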

Created 2015-07-29 18:59:07 | Updated | Tags: bwa picard

I noticed that the current build of picard contains a 3rd-party version of bwa 0.5.9-r16, whose version.txt contains the text:

"patched by AW, and multithread samse/sampe patch applied."

I have an external collaborator who is interested in applying this patch. Could someone provide a download and instructions for doing so, or for accessing this patched version of BWA? Or can I just send the bwa executable directly to the collaborator?

Thanks!

Created 2015-07-10 22:42:44 | Updated | Tags: vcf picard

Hi, I would like to know how I can return the entire contents of a VCF file without having to know the chromosome/start/end parameters. The VCFFileReader class has a query method that can return parts of the VCF like so:

What I would like is to be able to say: 'return everything the VCF has'. Is there a way to do this with picard?

Martin

Created 2015-06-21 13:14:36 | Updated | Tags: vcf picard

What is the fastest way to get the total number of variants in a VCF file? (Using picard-tools-1.119, via SplitVcfs.jar.)

So far the fastest way I have found is this:

private static int getNumVariants(VCFFileReader reader) {
    int totalVariants = 0;
    final CloseableIterator<VariantContext> iterator = reader.iterator();
    while (iterator.hasNext()) {
        iterator.next();
        totalVariants++;
    }
    iterator.close();
    return totalVariants;
}

But this appears to iterate through the entire VCF file, which for large files seems very inefficient...

I am thinking that there must be a faster way. After all, the number of variants is simply the total number of lines in the file minus the number of lines in the header?

Any way to get this?
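For a plain-text VCF, that really is the "total lines minus header lines" idea, and it avoids building VariantContext objects entirely. A sketch of the counting rule in Python rather than htsjdk, just for illustration (each data record is one non-# line):

```python
def count_variants(vcf_lines):
    """Count VCF data records: every non-blank line not starting with '#'."""
    return sum(1 for line in vcf_lines
               if line.strip() and not line.startswith("#"))

vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tG\t.\t.\t.\n",
    "1\t200\t.\tC\tT\t.\t.\t.\n",
]
print(count_variants(vcf))  # 2
```

One caveat: this counts records, not variants after normalization, so a multiallelic line still counts once, the same as the iterator approach.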

Thanks Martin

Created 2015-06-21 13:08:56 | Updated 2015-06-21 13:10:27 | Tags: picard vcftools vcf-format

I am trying to open small uncompressed VCFs via picard-tools-1.119, via SplitVcfs.jar.

File vcfFile = new File("test2.vcf");
VCFFileReader vcfReader = new VCFFileReader(vcfFile, false);

Even though I've specified that an index is not required in the constructor, this generates the following error:

htsjdk.tribble.TribbleException: Index not found for: MYLOCATION\test2.vcf
    at htsjdk.tribble.TribbleIndexedFeatureReader.query(TribbleIndexedFeatureReader.java:253)
    at htsjdk.variant.vcf.VCFFileReader.query(VCFFileReader.java:124)

(If I don't specify the index parameter, it generates a different error: "An index is required, but none found".)

I've attached a test file so you can see it for yourself. (NOTE: I had to compress it into a .zip, as the forum didn't allow me to attach the file itself; to replicate my results, uncompress it and load the plain file, test2.vcf.)

My question: is there any way to open uncompressed VCF files in picard via VCFFileReader or otherwise?

Thanks Martin

PS: these are small VCFs that would be exported out from larger datasets, so compressing/indexing them would seem like an unnecessary extra step.

Created 2015-06-01 10:54:07 | Updated | Tags: best-practices picard mapping

I am trying to follow the best practices for mapping my (Paired-end Illumina HiSeq) reads to the reference, by following this presentation:

From what I understand, I should use MergeBamAlignment to clean up the output from bwa, and then use this cleaned up output for the rest of the analysis. However, when I run ValidateSamFile after running MergeBamAlignment I get a lot of errors, and running CleanSam on the file does not resolve any of them. What am I doing wrong? I've tried searching the web for more details about MergeBamAlignment but I haven't been able to find much. Please let me know if you require any additional information.

How I ran MergeBamAlignment:

picard-tools MergeBamAlignment \
UNMAPPED_BAM=unmapped_reads.sam \
ALIGNED_BAM=aligned_reads.sam \
OUTPUT=aligned_reads.merged.bam \
REFERENCE_SEQUENCE=/path/to/reference.fasta \
PAIRED_RUN=true  # Why is this needed?

Error report from ValidateSamFile:

## HISTOGRAM java.lang.String
Error Type                              Count
ERROR:INVALID_CIGAR                     5261
ERROR:MATES_ARE_SAME_END                30
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND     30

Created 2015-05-30 13:45:41 | Updated | Tags: indelrealigner picard reads

Hello,

I'm quite new to SNP calling. I am trying to set up a pipeline which includes GATK IndelRealigner as a final step. My BAM file (before realignment) is a little over 1GB; after running the indel realigner, however, it's reduced to 18MB! I'm assuming it's throwing out way too many reads, or something has gone wrong.

I'm calling the indel realigner with the default options as follows:

java -Xmx16g -jar GATK_DIR/GenomeAnalysisTK.jar \
-T IndelRealigner \
-R /path/to/my/ref \
-I input.bam.intervals \
-targetIntervals input.bam.intervals \
-o realn.bam

I am generating the read groups using AddOrReplaceReadGroups.jar (from Picard tools) and the interval file using GATK RealignerTargetCreator with default options. My BAM file was generated from the raw reads of experiment SRA181417 fetched from SRA (after cleaning adapters with cutadapt, mapping to the reference with bwa-mem, and removing duplicate reads with Picard tools). I have tried this on other reads and do not have the same issue. Can anyone comment on why the indel realigner could be throwing out so many reads? Thank you

Created 2015-05-09 14:48:21 | Updated | Tags: best-practices bam third-party-tools picard

Following GATK's best practices, I have individually realigned/recalibrated my sample-lane bams and merged them by sample:

sample1_lane1.dedup.realn.recal.bam --> sample1_lane2.dedup.realn.recal.bam --> sample1.merged.bam
sample2_lane1.dedup.realn.recal.bam --> sample2_lane2.dedup.realn.recal.bam --> sample2.merged.bam
...

I am ready to dedup and realign my sample-merged bams; however, I am uncertain about the best approach. Is the consensus to convert back to fastq via Picard (MarkDuplicates, SortSam, and SamToFastq) and then run bwa mem? Or is it more expedient/accurate to realign the sample-merged bam using bwa aln followed by bwa sampe?

Created 2015-03-29 01:11:42 | Updated 2015-03-29 01:14:42 | Tags: picard

I have 16 libraries that were pooled and sequenced on the same lane. When using AddOrReplaceReadGroups.jar, should I create a unique RGLB name for each library (e.g., lib1, lib2, lib3, etc.)? Apologies if this sounds like a silly question, but I have seen folks on seq-answers use the same name (e.g., lib1) for the RGLB flag when libraries were sequenced on the same lane. Thanks!
Created 2015-03-26 16:03:02 | Updated | Tags: bam picard nullpointerexception fixmateinformation

Hello, I was asked to re-post this question here. It was originally posted in the Picard forum on GitHub at https://github.com/broadinstitute/picard/issues/161. Regards, Bernt

## ORIGINAL POST (edited)

There seems to be a problem with FixMateInformation crashing with:

Exception in thread "main" java.lang.NullPointerException
    at htsjdk.samtools.SamPairUtil.setMateInformationOnSupplementalAlignment(SamPairUtil.java:300)
    at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.advance(SamPairUtil.java:442)
    at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.next(SamPairUtil.java:454)
    at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.next(SamPairUtil.java:360)
    at picard.sam.FixMateInformation.doWork(FixMateInformation.java:194)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:185)
    at picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:125)
    at picard.sam.FixMateInformation.main(FixMateInformation.java:93)


The problem first appeared in version 1.121 and is still present in version 1.128. Versions up to and including 1.120 worked and continue to work fine. I am currently using Java 1.7.0_75, but I observed the same problem with earlier versions of Java. The problem occurs under several different versions of Fedora.

The command lines I am using are:

java -jar picard-1.128/picard.jar FixMateInformation INPUT=test.bam OUTPUT=fixed.bam (fails)

java -jar picard-1.121/FixMateInformation.jar INPUT=test.bam OUTPUT=fixed.bam (fails)

java -jar picard-1.120/FixMateInformation.jar INPUT=test.bam OUTPUT=fixed.bam (succeeds)

I have observed the problem with various BAM files. This one is (a small subset of) the output of an indel realignment with GATK.

## Later in the same thread:

ValidateSamFile produces:

java -jar /opt/ghpc/picard-1.121/ValidateSamFile.jar INPUT=test.bam OUTPUT=out.bam
[Wed Feb 18 08:48:40 CET 2015] picard.sam.ValidateSamFile INPUT=test.bam OUTPUT=out.bam MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Wed Feb 18 08:48:40 CET 2015] Executing as bernt@interactive.ghpc.dk on Linux 2.6.35.14-106.fc14.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13; Picard version: 1.121(da291b4d265f877808b216dce96eaeffd2f30bd3_1411396652) IntelDeflater
[Wed Feb 18 08:48:41 CET 2015] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=505937920
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

## And later:

The problem also exists in the new version, 1.129.

## New

Re-posted in GATK forum

Picard (1.129)'s ValidateSamFile complains about Mate not found - which was the reason for running FixMateInformation in the first place.

The output is:

java -jar /opt/ghpc/picard-current/picard.jar ValidateSamFile I=test.bam

[Thu Mar 26 16:44:49 CET 2015] picard.sam.ValidateSamFile INPUT=test.bam    MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Thu Mar 26 16:44:49 CET 2015] Executing as bernt@interactive.ghpc.dk on Linux 2.6.35.14-106.fc14.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13; Picard version: 1.129(b508b2885562a4e932d3a3a60b8ea283b7ec78e2_1424706677) IntelDeflater
Maximum output of [100] errors reached.
[Thu Mar 26 16:44:50 CET 2015] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=505937920
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp


Please let me know where to send the test BAM file.

Regards,

Bernt

Created 2015-03-17 22:54:15 | Updated | Tags: bam picard

Hi,

I am attempting to merge the output of tophat in order to run some RNASeq QC metrics. This is a single read 50 bp on a Hiseq. In order to get past the fact that tophat gives a MAPQ of 255 to unmapped reads (not 0 as expected by Picard) I used the following tool ( https://github.com/cbrueffer/tophat-recondition) to change it.

Once completed, I added read groups using picard and then sorted the accepted_hits.bam by coordinate and sorted the unmapped reads by queryname.

tophat-recondition.py /home/ryan/NGS_Data/No_Dox

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar AddOrReplaceReadGroups \
I=/home/ryan/NGS_Data/unmapped_fixup.bam \
O=/home/ryan/NGS_Data/unmapped_fixup-RG.bam \
RGID=No_Dox RGLB=No_Dox RGPL=illumina RGPU=GCCAAT RGSM=No_Dox

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar AddOrReplaceReadGroups \
I=/home/ryan/NGS_Data/accepted_hits.bam \
O=/home/ryan/NGS_Data/accepted_hits-RG.bam \
SORT_ORDER=coordinate \
RGID=No_Dox RGLB=No_Dox RGPL=illumina RGPU=GCCAAT RGSM=No_Dox \
CREATE_INDEX=true

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar SortSam \
I=/home/ryan/NGS_Data/unmapped_fixup-RG.bam \
O=/home/ryan/NGS_Data/unmapped_fixup-RG-sorted.bam \
SORT_ORDER=queryname

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar SortSam \
I=/home/ryan/NGS_Data/accepted_hits-RG.bam \
O=/home/ryan/NGS_Data/accepted_hits-RG-sorted.bam \
SORT_ORDER=coordinate \
CREATE_INDEX=true

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar MergeBamAlignment \
UNMAPPED_BAM=/home/ryan/NGS_Data/unmapped_fixup-RG-sorted.bam \
ALIGNED_BAM=/home/ryan/NGS_Data/accepted_hits-RG-sorted.bam \
O=/home/ryan/NGS_Data/merge_unmapped_accepted_hits_No_Dox.bam \
SORT_ORDER=coordinate \
REFERENCE_SEQUENCE=/home/ryan/Reference/human_spikein/hg19_spikein.fa \
PROGRAM_RECORD_ID=Tophat \
PROGRAM_GROUP_VERSION=0.1 \
PROGRAM_GROUP_COMMAND_LINE=tophat/dummy \
PAIRED_RUN=false \
VALIDATION_STRINGENCY=LENIENT \
CREATE_INDEX=true

I then get the following warning followed by the error:

WARNING 2015-03-17 09:44:22 SamAlignmentMerger Exception merging bam alignment - attempting to sort aligned reads and try again: Aligned record iterator (HS3:608:C6LNLACXX:7:2101:9946:63417) is behind the unmapped reads (HS3:608:C6LNLACXX:7:2101:9947:11009)
INFO 2015-03-17 09:44:33 SamAlignmentMerger Read 1000000 records from alignment SAM/BAM.
INFO 2015-03-17 09:44:43 SamAlignmentMerger Read 2000000 records from alignment SAM/BAM.
....
INFO 2015-03-17 09:58:01 SamAlignmentMerger Read 96000000 records from alignment SAM/BAM.
INFO 2015-03-17 09:58:09 SamAlignmentMerger Read 97000000 records from alignment SAM/BAM.
INFO 2015-03-17 09:58:15 SamAlignmentMerger Finished reading 97571897 total records from alignment SAM/BAM.
[Tue Mar 17 09:58:16 PDT 2015] picard.sam.MergeBamAlignment done. Elapsed time: 14.32 minutes.
Runtime.totalMemory()=1908932608
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalStateException: Aligned record iterator (HS3:608:C6LNLACXX:7:1101:10000:11036) is behind the unmapped reads (HS3:608:C6LNLACXX:7:1101:10000:48402)
    at picard.sam.AbstractAlignmentMerger.mergeAlignment(AbstractAlignmentMerger.java:383)
    at picard.sam.SamAlignmentMerger.mergeAlignment(SamAlignmentMerger.java:153)
    at picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:248)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

I have searched but have been unable to resolve this problem. Any ideas?

I am running picard 1.129 using java version "1.7.0_60"

ValidateSamFile on both inputs to MergeBamAlignment returns no errors in those files. I am at a loss and asking for help!
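One thing that can produce exactly this symptom even when both files validate cleanly: MergeBamAlignment walks both inputs in lockstep and assumes both use the same queryname collation. Different tools can sort "by queryname" differently (plain byte order vs. treating embedded numbers numerically), and read names that differ only in a trailing number then come out in different orders. A small illustration of the two collations disagreeing (the read names here are made up):

```python
import re

def natural_key(name):
    """Split a read name into text and integer chunks so that embedded
    numbers compare numerically rather than byte-wise."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]

# two names identical except for the final tile/coordinate number
names = ["HS3:7:2101:9946", "HS3:7:2101:63417"]
lex = sorted(names)                    # byte order: '6' < '9', so 63417 first
nat = sorted(names, key=natural_key)   # numeric order: 9946 < 63417
print(lex)  # ['HS3:7:2101:63417', 'HS3:7:2101:9946']
print(nat)  # ['HS3:7:2101:9946', 'HS3:7:2101:63417']
```

If the two inputs were sorted by different tools (or one was coordinate-sorted), re-sorting both with the same Picard SortSam SORT_ORDER=queryname immediately before the merge is a cheap thing to try.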

Thanks,

Ryan
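For what it's worth, this error is usually about input ordering: MergeBamAlignment walks the unmapped and aligned BAMs side by side in queryname order (SORT_ORDER= describes only the output), so both inputs need to be queryname-sorted. A rough Python sketch of the invariant the merger relies on, using the read names from the warning above (simplified; the real htsjdk comparator has extra tie-breaking rules):

```python
# Illustration (not a fix): MergeBamAlignment expects both inputs in
# queryname order. Simplified lexicographic check -- the real htsjdk
# comparator adds tie-breaking rules on top of the name comparison.

def is_queryname_sorted(names):
    """True if the read names are in (lexicographic) queryname order."""
    return all(a <= b for a, b in zip(names, names[1:]))

# Read names taken from the warning above: the merger saw the aligned
# iterator at ...9946:63417 while the unmapped reads had advanced to
# ...9947:11009, i.e. the aligned name sorts before ("is behind") it.
aligned = "HS3:608:C6LNLACXX:7:2101:9946:63417"
unmapped = "HS3:608:C6LNLACXX:7:2101:9947:11009"
assert aligned < unmapped

# A file where these two reads appear in this order breaks the invariant:
assert not is_queryname_sorted([unmapped, aligned])
```

The "-RG-sorted" filenames suggest the inputs were coordinate-sorted, which would produce exactly this kind of out-of-order name comparison.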

Created 2015-01-26 17:10:50 | Updated | Tags: picard bwa-and-gatk

Hey guys,

I was running MarkDuplicates on BWA output, and I got the error "java.lang.NoClassDefFoundError: java/lang/ref/Finalizer$2". To get around it, I tried the following:

1) Used different versions of Picard. I tried 1.90, 1.119, 1.228. None of them worked.
2) Built picard.jar from source code. Still didn't work.
3) Used -M in BWA mem; no luck. When I used aln and sampe, I still got the same error.
4) Tried MarkDuplicates on RNA-Seq data: it worked perfectly on TopHat output!

So the conclusion I've reached is that MarkDuplicates doesn't work well with BWA output. Does anyone know how to deal with this situation? Thank you in advance!

More information: I'm using openjdk version "1.6.0-internal" on a Linux system. The error message:

Exception in thread "main" java.lang.NoClassDefFoundError: java/lang/ref/Finalizer$2
    at java.lang.ref.Finalizer.runFinalization(Finalizer.java:144)
    at java.lang.Runtime.runFinalization0(Native Method)
    at java.lang.Runtime.runFinalization(Runtime.java:705)
    at java.lang.System.runFinalization(System.java:967)
    at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:58)
    at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:49)
    at htsjdk.samtools.util.ResourceLimitedMap.get(ResourceLimitedMap.java:76)
    at htsjdk.samtools.CoordinateSortedPairInfoMap.getOutputStreamForSequence(CoordinateSortedPairInfoMap.java:180)
    at htsjdk.samtools.CoordinateSortedPairInfoMap.put(CoordinateSortedPairInfoMap.java:164)
    at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.put(DiskBasedReadEndsForMarkDuplicatesMap.java:65)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:290)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:114)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Created 2015-01-18 13:28:03 | Updated 2015-01-18 13:29:37 | Tags: bam picard index preprocess

Greeting all.

Currently, I have been using Picard's built-in tool "BuildBamIndex" to index my bam files.

I followed the manual on the Picard site, but I got an error message.

Here is my command line:

java -Xmx8g -XX:ParallelGCThreads=8 -jar $picard/BuildBamIndex.jar I=$RealignedBamDir/output6

To avoid this error I tried a different approach and used "samtools index", which I think performs the same function as Picard's BuildBamIndex. With samtools, I successfully got my bam index files. I assume there is no major difference between a Picard bam index and a samtools bam index, so I am confused as to why only the samtools index procedure works fine. Below is the error message from running BuildBamIndex in Picard.

[Sun Jan 18 22:15:42 KST 2015] picard.sam.BuildBamIndex INPUT=/DATA1/sclee1/data/URC_WES/U01/01U_N_Filtered_Sorted_Markdup_readgroup.bam VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sun Jan 18 22:15:42 KST 2015] picard.sam.BuildBamIndex done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058354688
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Exception creating BAM index for record HSQ-2K:530:C5PJAACXX:6:2109:18806:13902 1/2 101b aligned read.
    at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:92)
    at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:291)
    at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:271)
    at picard.sam.BuildBamIndex.doWork(BuildBamIndex.java:133)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: htsjdk.samtools.SAMException: BAM cannot be indexed without setting a fileSource for record HSQ-2K:530:C5PJAACXX:6:2109:18806:13902 1/2 101b aligned read.
    at htsjdk.samtools.BAMIndexMetaData.recordMetaData(BAMIndexMetaData.java:130)
    at htsjdk.samtools.BAMIndexer$BAMIndexBuilder.processAlignment(BAMIndexer.java:182)
    at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:90)
    ... 6 more

I look forward to hearing positive answers from you soon.

Bye!
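Whatever the difference between the two indexers turns out to be, a cheap first sanity check when an indexer misbehaves is that the input really is a complete BGZF/BAM file. Every intact BAM ends with a fixed 28-byte BGZF EOF block (from the SAM/BGZF specification); a small Python sketch of the check:

```python
import os

# The 28-byte BGZF EOF marker that terminates every intact BAM file,
# as defined in the SAM/BGZF specification.
BGZF_EOF = bytes([
    0x1F, 0x8B, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF,
    0x06, 0x00, 0x42, 0x43, 0x02, 0x00, 0x1B, 0x00, 0x03, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
])

def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF EOF marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as fh:
        fh.seek(-len(BGZF_EOF), os.SEEK_END)  # last 28 bytes
        return fh.read() == BGZF_EOF
```

samtools and htsjdk perform a similar check themselves and warn when the marker is absent, which typically means the file was truncated during copying or writing.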

Created 2014-12-05 15:12:16 | Updated 2014-12-05 15:12:48 | Tags: solid picard markduplicates

Hi,

I'm having trouble removing duplicates using Picard tools on SOLiD data. I get a regex not matching error.

The reads have the following names:

22_758_632_F3

604_1497_576

124_1189_1519_F5

358_1875_702_F5-DNA

And I don't think Picard is able to parse these read names with its default regex.

I tried changing the default regex. This time it does not throw an error, but it runs for too long and eventually fails (out of memory). I suspect I'm not giving the right regex. Here is my command:

java -jar $PICARD_TOOLS_HOME/MarkDuplicates.jar \
    I=$FILE \
    O=$BAMs/MarkDuplicates/$SAMPLE.MD.bam \
    M=$BAMs/MarkDuplicates/$SAMPLE.metrics \
    READ_NAME_REGEX="([0-9]+)([0-9]+)([0-9]+).*"

Any help is appreciated. Thanks!
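For what it's worth, the pattern in the command above has no separators between its three capture groups, so it needs three consecutive digit runs at the start of the name and cannot cleanly split underscore-delimited SOLiD names. An underscore-separated variant at least matches the four names quoted above; this is an illustration, not a validated Picard setting:

```python
import re

# Picard's READ_NAME_REGEX must expose three integer groups (tile, x, y)
# for optical-duplicate detection; the default targets colon-delimited
# Illumina names. NOTE: the underscore pattern below is an assumption
# that merely matches the SOLiD names quoted in the post.
SOLID_REGEX = re.compile(r"([0-9]+)_([0-9]+)_([0-9]+).*")

names = ["22_758_632_F3", "604_1497_576",
         "124_1189_1519_F5", "358_1875_702_F5-DNA"]
for name in names:
    tile, x, y = SOLID_REGEX.match(name).groups()

# e.g. the first name parses as tile 22, x 758, y 632:
assert SOLID_REGEX.match("22_758_632_F3").groups() == ("22", "758", "632")

# The pattern from the question, with no separators, fails outright on
# names like 22_758_632_F3 (only two digits before the underscore):
assert re.match(r"([0-9]+)([0-9]+)([0-9]+).*", "22_758_632_F3") is None
```

A regex that silently fails to match does not by itself explain the out-of-memory failure, but getting the groups right is the first thing to rule out.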

Created 2014-09-02 19:46:27 | Updated 2014-09-02 20:10:09 | Tags: haplotypecaller picard

Hi,

We ran HaplotypeCaller on some bams and got the same error on all of them: "Rod span is not contained within the data shard, meaning we wouldn't get all of the data we need."

Here is my command:

java -Xmx32g -jar ~/GATK-3.2.2/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R /hg19/hg19.fasta \
    -I $1 \
    -ERC GVCF \
    -nct 16 \
    --variant_index_type LINEAR \
    --variant_index_parameter 128000 \
    --pair_hmm_implementation VECTOR_LOGLESS_CACHING \
    -o ${TEMPORARY_DIR}/2

Can you help me figure out what's wrong? The error said it may be a bug, and I cannot find where this issue was previously addressed. I wonder if something went wrong with the bam processing. We received these bams from the sequencing institution already aligned. Individual reads were missing the @RG tag, so we used bamaddrg to add the @RG tags. That caused issues with the bam 'bins', so we had to run htsjdk.samtools.FixBAMFile. We ran MarkDuplicates and then attempted the above HaplotypeCaller command.

Thanks!

Mark

Created 2014-08-15 17:43:45 | Updated | Tags: developer picard customwalker htsjdk

Hi fellow htsjdk/picard/gatk developers!

I've been thinking about this for quite some time now, so I thought I should write up a quick post about it here. I've been writing custom tools for our group using both Picard and GATK for some time now. It's been working nicely, but I have been missing a set of basic tutorials and examples for users to quickly get started writing walkers. My most commonly used reference has been the 20-line life savers (http://www.slideshare.net/danbolser/20line-lifesavers-coding-simple-solutions-in-the-gatk), which is getting a bit dated. What I would like to see is something like the following:

• What's in htsjdk? What's not in htsjdk? (from a dev's perspective - in terms of frameworks)
• What's in picard? What's not in picard? (from a dev's perspective - in terms of frameworks)
• What's in gatk? What's not in gatk? (from a dev's perspective - in terms of frameworks)
• When to use htsjdk, picard and GATK. What are the strengths and weaknesses of the three. (possibly more that I've missed)
• Your first htsjdk walker
• Your first picard walker
• Your first gatk walker
• Traversing a BAM in htsjdk vs gatk - what are the differences

There might be more stuff that could go in here as well. The driving force behind this is that I'm myself a bit confused by the overlap of these three packages/frameworks. I do understand that picard uses htsjdk, and that GATK uses both as dependencies, but it's not super clear what extra functionality (for a developer) is added from htsjdk -> picard -> gatk.

Could we assemble a small group of interested developers to contribute to this? We could set up a git repo with the examples and tutorials for easy collaboration and sharing online. Anyone interested? I'll count myself as the first member :)

Created 2013-11-20 06:41:02 | Updated 2013-11-20 06:41:55 | Tags: queue qscript picard markduplicates

Hi,

So I've finally taken the plunge and migrated our analysis pipeline to Queue. With some great feedback from @johandahlberg, I have gotten to a state where most of the stuff is running smoothly on the cluster. I'm trying to add Picard's CalculateHsMetrics to the pipeline, but am having some issues.
This code:

case class hsmetrics(inBam: File, baitIntervals: File, targetIntervals: File, outMetrics: File) extends CalculateHsMetrics with ExternalCommonArgs with SingleCoreJob with OneDayJob {
    @Input(doc="Input BAM file") val bam: File = inBam
    @Output(doc="Metrics file") val metrics: File = outMetrics
    this.input :+= bam
    this.targets = targetIntervals
    this.baits = baitIntervals
    this.output = metrics
    this.reference = refGenome
    this.isIntermediate = false
}

gives the following error message:

ERROR 06:56:25,047 QGraph - Missing 2 values for function: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp' null 'INPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.bam' 'TMP_DIR=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp' 'VALIDATION_STRINGENCY=SILENT' 'OUTPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.preMarkDupsHsMetrics.metrics' 'BAIT_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals' 'TARGET_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals' 'REFERENCE_SEQUENCE=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/bwaindex0.6/exampleFASTA.fasta' 'METRIC_ACCUMULATION_LEVEL=SAMPLE'
ERROR 06:56:25,048 QGraph - @Argument: jarFile - jar
ERROR 06:56:25,049 QGraph - @Argument: javaMainClass - Main class to run from javaClasspath

And yeah, it seems that the jar file is currently set to null in the command line.

However, MarkDuplicates runs fine without setting the jar:

case class dedup(inBam: File, outBam: File, metricsFile: File) extends MarkDuplicates with ExternalCommonArgs with SingleCoreJob with OneDayJob {
    @Input(doc = "Input bam file") var inbam = inBam
    @Output(doc = "Output BAM file with dups removed") var outbam = outBam
    this.REMOVE_DUPLICATES = true
    this.input :+= inBam
    this.output = outBam
    this.metrics = metricsFile
    this.memoryLimit = 3
    this.isIntermediate = false
}

Why does CalculateHsMetrics need the jar, but not MarkDuplicates? Both are imported with import org.broadinstitute.sting.queue.extensions.picard._.

Created 2013-11-08 10:45:37 | Updated | Tags: reducereads error picard

Hi guys,

I've seen this error has been reported other times, for different reasons. The thing is that the bam file I'm using to reduce the reads has been processed through the GATK pipeline without problems, realignment and recalibration included. Therefore, I assumed the bam file generated after BQSR would be GATK-compliant. I was running with Queue, so I just ran the exact command passed to the job in interactive mode, to see what happens. Below are the full command and error message (apologies for the lengthy output); there is no stack trace after the error.

[fles@login07 reduced] 'java' '-Xmx12288m' \
    '-Djava.io.tmpdir=/scratch/scratch/fles/project_analysis/reduced/tmp' \
    '-cp' '/home/fles/applications/Queue-2.7-4-g6f46d11/Queue.jar' \
    'org.broadinstitute.sting.gatk.CommandLineGATK' \
    '-T' 'ReduceReads' \
    '-I' '/home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam' \
    '-R' '/home/fles/Scratch/gatkbundle_2.5/human_g1k_v37.fasta' \
    '-o' '/scratch/scratch/fles/project_analysis/reduced/projectTrios.U1_PJ5208467.recal.reduced.bam'
INFO  09:27:21,728 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:27:21,730 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:29:52
INFO  09:27:21,731 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  09:27:21,731 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  09:27:21,735 HelpFormatter - Program Args: -T ReduceReads -I /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam -R /home/fles/Scratch/gatkbundle_2.5/human_g1k_v37.fasta -o /scratch/scratch/fles/project_analysis/reduced/projectTrios.U1_PJ5208467.recal.reduced.bam
INFO  09:27:21,735 HelpFormatter - Date/Time: 2013/11/08 09:27:21
INFO  09:27:21,735 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:27:21,735 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:27:34,156 GenomeAnalysisEngine - Strictness is SILENT
INFO  09:27:34,491 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 40
INFO  09:27:34,503 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  09:27:34,627 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.12
INFO  09:27:35,039 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO  09:27:35,045 GenomeAnalysisEngine - Done preparing for traversal
INFO  09:27:35,045 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  09:27:35,046 ProgressMeter -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining
INFO  09:27:35,080 ReadShardBalancer$1 - Loading BAM index data
INFO  09:27:35,081 ReadShardBalancer$1 - Done loading BAM index data
INFO  09:28:05,059 ProgressMeter -      1:18958138        1.00e+06   30.0 s       30.0 s      0.6%        81.8 m    81.3 m
INFO  09:28:35,069 ProgressMeter -      1:46733396        2.30e+06   60.0 s       26.0 s      1.5%        66.4 m    65.4 m
INFO  09:29:05,079 ProgressMeter -      1:92187730        3.50e+06   90.0 s       25.0 s      3.0%        50.5 m    49.0 m
INFO  09:29:35,088 ProgressMeter -     1:145281942        4.90e+06  120.0 s       24.0 s      4.7%        42.7 m    40.7 m
INFO  09:30:05,098 ProgressMeter -     1:152323864        6.40e+06    2.5 m       23.0 s      4.9%        50.9 m    48.4 m
INFO  09:30:35,893 ProgressMeter -     1:181206886        7.70e+06    3.0 m       23.0 s      5.8%        51.4 m    48.4 m
INFO  09:31:05,902 ProgressMeter -     1:217604563        8.90e+06    3.5 m       23.0 s      7.0%        49.9 m    46.4 m
INFO  09:31:35,913 ProgressMeter -      2:14782401        1.02e+07    4.0 m       23.0 s      8.5%        47.0 m    43.0 m
INFO  09:32:05,922 ProgressMeter -      2:62429207        1.15e+07    4.5 m       23.0 s     10.0%        44.8 m    40.3 m
INFO  09:32:35,931 ProgressMeter -      2:97877374        1.28e+07    5.0 m       23.0 s     11.2%        44.7 m    39.7 m
INFO  09:33:06,218 ProgressMeter -     2:135574018        1.42e+07    5.5 m       23.0 s     12.4%        44.5 m    38.9 m
INFO  09:33:36,227 ProgressMeter -     2:179431307        1.56e+07    6.0 m       23.0 s     13.8%        43.5 m    37.5 m
INFO  09:34:06,237 ProgressMeter -     2:216279690        1.69e+07    6.5 m       23.0 s     15.0%        43.4 m    36.9 m
INFO  09:34:36,248 ProgressMeter -      3:14974731        1.81e+07    7.0 m       23.0 s     16.4%        42.9 m    35.9 m
INFO  09:35:07,073 ProgressMeter -      3:52443620        1.94e+07    7.5 m       23.0 s     17.6%        42.9 m    35.4 m
INFO  09:35:37,084 ProgressMeter -     3:111366536        2.05e+07    8.0 m       23.0 s     19.5%        41.3 m    33.2 m
INFO  09:36:07,094 ProgressMeter -     3:155571144        2.18e+07    8.5 m       23.0 s     20.9%        40.8 m    32.3 m
INFO  09:36:37,103 ProgressMeter -       4:3495327        2.31e+07    9.0 m       23.0 s     22.4%        40.4 m    31.3 m
INFO  09:37:07,114 ProgressMeter -      4:48178306        2.43e+07    9.5 m       23.0 s     23.8%        40.0 m    30.5 m
INFO  09:37:37,270 ProgressMeter -     4:106747046        2.56e+07   10.0 m       23.0 s     25.7%        39.0 m    29.0 m
INFO  09:38:07,483 ProgressMeter -     4:181303657        2.69e+07   10.5 m       23.0 s     28.1%        37.5 m    26.9 m
INFO  09:38:37,493 ProgressMeter -      5:41149454        2.81e+07   11.0 m       23.0 s     29.7%        37.1 m    26.1 m
INFO  09:38:51,094 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: SAM/BAM file /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam is malformed: Read error; BinaryCodec in readmode; file: /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam
##### ERROR ------------------------------------------------------------------------------------------


Following your usual advice, I validated the bam file produced by BQSR with Picard, and I get the exact same error, but no specific error indication:

[fles@login07 reduced]$ java -jar ~/applications/picard-tools-1.102/ValidateSamFile.jar \
    INPUT=/home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam \
    IGNORE_WARNINGS=TRUE

[Fri Nov 08 09:59:42 GMT 2013] net.sf.picard.sam.ValidateSamFile INPUT=/home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam IGNORE_WARNINGS=true MAX_OUTPUT=100 VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Fri Nov 08 09:59:42 GMT 2013] Executing as fles@login07 on Linux 2.6.18-194.11.4.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18; Picard version: 1.102(1591)
INFO 2013-11-08 10:01:01 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:01:18s. Time for last 10,000,000: 78s. Last read position: 1:204,966,172
INFO 2013-11-08 10:02:19 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:02:36s. Time for last 10,000,000: 78s. Last read position: 2:232,121,396
INFO 2013-11-08 10:03:36 SamFileValidator Validated Read 30,000,000 records. Elapsed time: 00:03:54s. Time for last 10,000,000: 77s. Last read position: 4:123,140,629
[Fri Nov 08 10:04:00 GMT 2013] net.sf.picard.sam.ValidateSamFile done. Elapsed time: 4.30 minutes.
Runtime.totalMemory()=300941312
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" net.sf.samtools.util.RuntimeIOException: Read error; BinaryCodec in readmode; file: /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam
    at net.sf.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:397)
    at net.sf.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:371)
    at net.sf.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:357)
    at net.sf.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:200)
    at net.sf.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:558)
    at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:532)
    at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:522)
    at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:481)
    at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:687)
    at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:665)
    at net.sf.picard.sam.SamFileValidator.validateSamRecordsAndQualityFormat(SamFileValidator.java:241)
    at net.sf.picard.sam.SamFileValidator.validateSamFile(SamFileValidator.java:177)
    at net.sf.picard.sam.SamFileValidator.validateSamFileSummary(SamFileValidator.java:104)
    at net.sf.picard.sam.ValidateSamFile.doWork(ValidateSamFile.java:164)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
    at net.sf.picard.sam.ValidateSamFile.main(ValidateSamFile.java:100)
Caused by: java.io.IOException: Unexpected compressed block length: 1
    at net.sf.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:358)
    at net.sf.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:113)
    at net.sf.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:238)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at net.sf.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:395)

Any suggestions on what I might be doing wrong?

Created 2013-08-13 21:38:39 | Updated | Tags: vcf gatk picard

I finally got the filtered VCF file from the BWA + Picard + GATK pipeline, and I have 11 exome-seq data files which were processed as a list of inputs to GATK. In the process of getting the VCF, I did not see an option for separating the 11 samples. Now I've got two VCF files (one for SNPs and the other for indels), each of which has 11 samples. My question is how to proceed from here. Should I separate the 11 samples before annotation, or annotate first and then split the 11 samples into individual files? The big question here is how to split the samples from the VCF files. Thanks!

Created 2013-03-14 14:36:56 | Updated | Tags: picard

I happened to be readying a new reference genome and was using FastaStats to force the creation of the .dict and .fai files.
The automatic .dict file creation first creates a temporary file, creates the .dict using Picard, and then copies it as appropriate. However, the newer versions of Picard won't overwrite the .dict file, and so there is an error using the Java-created temporary file as an output. The problematic section seems to be ReferenceDataSource.java. The error manifests as:

Index file /gs01/projects/ngs/resources/gatk/2.3/human_g1k_v37.dict does not exist but could not be created because: /gs01/projects/ngs/resources/gatk/2.3/dict8545428789729910550.tmp already exists. Delete this file and try again, or specify a different output file.

Created 2013-02-23 19:50:26 | Updated 2013-02-23 19:56:12 | Tags: realignertargetcreator picard

Hi there,

I get an error when I try to run GATK with the following command:

java -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I merged_bam_files_indexed_markduplicate.bam -o reads.intervals

However I get this error:

SAM/BAM file SAMFileReader{/merged_bam_files_indexed_markduplicate.bam} is malformed: Read HWI-ST303_0093:5:5:13416:34802#0 is either missing the read group or its read group is not defined in the BAM header, both of which are required by the GATK. Please use http://gatkforums.broadinstitute.org/discussion/59/companion-utilities-replacereadgroups to fix this problem

It suggests a header issue; however, my bam file has a header:

samtools view -h merged_bam_files_indexed_markduplicate.bam | grep ^@RG
@RG ID:test1 PL:Illumina PU:HWI-ST303 LB:test PI:75 SM:test CN:japan
@RG ID:test2 PL:Illumina PU:HWI-ST303 LB:test PI:75 SM:test CN:japan

When I grep the read from the error message:

HWI-ST303_0093:5:5:13416:34802#0 99 1 1090 29 23S60M17S = 1150 160 TGTTTGGGTTGAAGATTGATACTGGAAGAAGATTAGAATTGTAGAAAGGGGAAAACGATGTTAGAAAGTTAATACGGCTTACTCCAGATCCTTGGATCTC GGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGFGGGGGGGGGDGFGFGGGGGFEDFGEGGGDGEG?FGGDDGFFDGGEDDFFFFEDG?E MD:Z:60 PG:Z:MarkDuplicates RG:Z:test1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 XT:A:M

Following the Picard solution:

java -XX:MaxDirectMemorySize=4G -jar picard-tools-1.85/AddOrReplaceReadGroups.jar I=test.bam O=test.header.bam SORT_ORDER=coordinate RGID=test RGLB=test RGPL=Illumina RGSM=test/ RGPU=HWI-ST303 RGCN=japan CREATE_INDEX=True

I get this error after 2 min.:

Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 12247781, Read name HWI-ST303_0093:5:26:10129:50409#0, MAPQ should be 0 for unmapped read.

Any recommendation on how to solve this issue?
My plan to resolve the issue is the following:

picard/MarkDuplicates.jar I=test.bam O=test_markduplicate.bam M=test.matrix AS=true VALIDATION_STRINGENCY=LENIENT
samtools index test_markduplicate.bam

While the command runs, I see a lot of messages like the one below, but it keeps running:

Ignoring SAM validation error: ERROR: Record (number), Read name HWI-ST303_0093:5:5:13416:34802#0, RG ID on SAMRecord not found in header: test1

and then I will try the GATK RealignerTargetCreator. I already tried the following:

java -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I merged_bam_files_indexed_markduplicate.bam -o reads.intervals --validation_strictness LENIENT

But I still got the same error. N.B.: the same command runs with no issue with GATK version 1.2.

My pipeline in short: mapping the paired-end reads with

bwa aln -q 20 ref.fa read > files.sai
bwa sampe ref.fa file1.sai file2.sai read1 read2 > test1.sam
samtools view -bS test1.sam | samtools sort - test
samtools index test1.bam
samtools merge -rh RG.txt test test1.bam test2.bam

RG.txt:
@RG ID:test1 PL:Illumina PU:HWI-ST303 LB:test PI:75 SM:test CN:japan
@RG ID:test2 PL:Illumina PU:HWI-ST303 LB:test PI:75 SM:test CN:japan

samtools index test.bam
picard/MarkDuplicates.jar I=test.bam O=test_markduplicate.bam M=test.matrix AS=true VALIDATION_STRINGENCY=SILENT
samtools index test_markduplicate.bam

Created 2012-12-24 20:09:25 | Updated 2013-01-07 19:16:08 | Tags: bam readgroup picard

How do I use Picard to add a read group to a BAM file generated by the GS Reference Mapper?

Created 2012-11-28 11:47:29 | Updated | Tags: indelrealigner picard microscheduler baserecalibration createtarget

Hi all,

I am doing an exome analysis with BWA 0.6.1-r104, Picard 1.79 and GATK v2.2-8-gec077cd. I have paired-end reads; my protocol until now is (in brief, omitting options etc.)
bwa aln R1.fastq
bwa aln R2.fastq
bwa sampe R1.sai R2.sai
picard/CleanSam.jar
picard/SortSam.jar
picard/MarkDuplicates.jar
picard/AddOrReplaceReadGroups.jar
picard/BuildBamIndex.jar
GATK -T RealignerTargetCreator -known dbsnp.vcf
GATK -T IndelRealigner -known dbsnp.vcf
GATK -T BaseRecalibrator -knownSites dbsnp.vcf
GATK -T PrintReads

A closer look at the output of the above toolchain revealed changes in read counts I did not quite understand.

I have 85767226 paired-end = 171534452 sequences in the fastQ files. BWA reports this number, and the cleaned SAM file has 171534452 alignments as expected.

MarkDuplicates reports:

Read 165619516 records. 2 pairs never matched. Marking 20272927 records as duplicates. Found 2919670 optical duplicate clusters.

so nearly 6 million reads seem to be missing.

The CreateTargets MicroScheduler reports:

35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
-> 428072 reads (0.26% of total) failing BadMateFilter
-> 16077607 reads (9.65% of total) failing DuplicateReadFilter
-> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter

so nearly 5 million reads seem to be missing.

The Realigner MicroScheduler reports:

0 reads were filtered out during traversal out of 171551640 total (0.00%)

which appears a miracle to me, since 1) there are even more reads now than input sequences, and 2) all those crappy reads reported by CreateTargets do not appear.

From Base recalibration, the MicroScheduler reports:

41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
-> 16010068 reads (9.32% of total) failing DuplicateReadFilter
-> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter

..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with roughly the same number.

I found these varying counts a little irritating -- can someone please give me a hint on the logic of these numbers? And does the protocol look meaningful? Thanks for any comments!
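A note on reading those MicroScheduler summaries: each percentage is simply reads filtered divided by the total reads for that particular traversal, and each walker counts only the reads it actually traverses, which is at least part of why the totals differ between steps. A quick Python check of the numbers quoted above:

```python
# Sanity-checking the MicroScheduler lines quoted above: each percentage
# is reads_filtered / total_reads for that traversal, rounded to 2 dp.
total = 166_579_875           # CreateTargets traversal total
filtered = 35_915_555         # reads filtered out in that traversal
assert round(100 * filtered / total, 2) == 21.56

# ...and the individual filters from the same traversal:
assert round(100 * 428_072 / total, 2) == 0.26       # BadMateFilter
assert round(100 * 16_077_607 / total, 2) == 9.65    # DuplicateReadFilter
assert round(100 * 19_409_876 / total, 2) == 11.65   # MappingQualityZeroFilter
```

The percentages are internally consistent; the open question in the post is why the *totals* (166579875 vs 171551640 vs 171703265) move between walkers.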
Created 2012-09-12 15:14:41 | Updated 2013-01-07 20:40:49 | Tags: bwa picard mtdna

Picard appears not to like the way BWA codes mtDNA. I am doing human exome sequencing using a copy of hg19 which I obtained from UCSC and indexed using BWA per the instructions here:

## Example 1

[Tue Aug 28 12:45:16 EDT 2012] net.sf.picard.sam.SortSam done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=125435904
FAQ: http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page
Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Non-numeric value in ISIZE column; Line 3982
Line: FCC0CHTACXX:1:1101:14789:3170#TAGCTTAT 117 chrM 304415842 0 100M = -1610645157 2379906297 TGCGACTTGGAAGCGGATTCAGAGGACAGGACAGAACACTTGGGCAAGTGAATCTCTGTCTGTCTGTCTGTCTCATTGGTTGGTTTATTTCCATTTTCTT B@<:>CDDDBDDBDEEEEEEFEFCCHHFHHGGIIIHIGJJJIIGGGIIIIJJJIIGJIJGG@CEIFJIJJJJIJIJIJJJJIJJJGIHHGHFFEFFFCCC RG:Z:1868 XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:2 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:39G45G14 XA:Z:chrM,-391302964,100M,2;
    at net.sf.samtools.SAMTextReader.reportFatalErrorParsingLine(SAMTextReader.java:223)
    at net.sf.samtools.SAMTextReader.access$400(SAMTextReader.java:40)
    at net.sf.samtools.SAMTextReader$RecordIterator.parseInt(SAMTextReader.java:293)
    at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:394)
    at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:278)
    at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:250)
    at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:641)
    at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:619)
at net.sf.picard.sam.SortSam.doWork(SortSam.java:68)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
at net.sf.picard.sam.SortSam.main(SortSam.java:57)


## Example 2

java -jar ~/bin/picard-tools-1.74/MarkDuplicates.jar \
INPUT=1sorted.bam \
OUTPUT=1dedup.bam \
ASSUME_SORTED=true \
METRICS_FILE=metrics \
CREATE_INDEX=true \
VALIDATION_STRINGENCY=LENIENT

...
Ignoring SAM validation error: ERROR: Record 691, Read name FCC0CHTACXX:1:1302:4748:176644#GGCTACAT, Mate Alignment start (436154938) must be <= reference sequence length (16571) on reference chrM
Ignoring SAM validation error: ERROR: Record 692, Read name FCC0CHTACXX:1:2104:8494:167812#GGCTACAT, Mate Alignment start should != 0 because reference name != *.
Ignoring SAM validation error: ERROR: Record 693, Read name FCC0CHTACXX:1:1201:21002:183608#GGCTACAT, Mate Alignment start should != 0 because reference name != *.
Ignoring SAM validation error: ERROR: Record 694, Read name FCC0CHTACXX:1:2303:3184:35872#GGCTACAT, Mate Alignment start (436154812) must be <= reference sequence length (16571) on reference chrM
...

I've truncated the output; in fact it throws such an error for every line containing a mitochondrial read.

I suspect I could solve this by writing my own script to go in and change the way one column is coded, but more broadly, I am interested in the answer to "how do you make BWA, Picard and GATK work seamlessly together without needing to do your own scripting"?
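One reading of the "Non-numeric value in ISIZE column" error in Example 1, for what it's worth: the values in the rejected record violate hard limits in the SAM specification, so this looks like corrupted mate coordinates rather than a mere coding-style difference. TLEN/ISIZE is a signed 32-bit integer and positions must be non-negative, and the quoted numbers fail both checks:

```python
# Values copied from the rejected SAM line and from the Example 2
# errors above, checked against SAM-spec limits.
INT32_MAX = 2**31 - 1

isize = 2379906297        # ISIZE column of the rejected line
mate_pos = -1610645157    # mate position (PNEXT) on the same line
pos = 304415842           # POS of the record, on chrM
chrM_len = 16571          # chrM length quoted in the Example 2 errors

assert isize > INT32_MAX   # overflows int32 -> parser calls it "non-numeric"
assert mate_pos < 0        # impossible alignment position
assert pos > chrM_len      # POS is also far beyond the end of chrM
```

All three violations point at the same record fields (POS, PNEXT, TLEN), which is consistent with the mate-alignment-start errors Picard reports in Example 2.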