Tagged with #picard
5 documentation articles | 2 announcements | 46 forum discussions



Created 2013-07-03 00:53:53 | Updated 2015-12-04 18:55:56 | Tags: readgroup picard indexing
Comments (5)

Fix a BAM that is not indexed or not sorted, has not had duplicates marked, or is lacking read group information. The options on this page are listed in order of decreasing complexity.

You may ask, is all of this really necessary? The GATK imposes strict formatting guidelines, including requiring certain read group information, that other software packages do not require. Although this represents a small additional processing burden upfront, the downstream benefits are numerous, including the ability to process library data individually, and significant gains in speed and parallelization options.

Prerequisites

  • Installed Picard tools
  • If indexing or marking duplicates, then coordinate sorted reads
  • If coordinate sorting, then reference aligned reads
  • For each read group ID, a single BAM file. If you have a multiplexed file, separate to individual files per sample.

Jump to a section on this page

  1. Add read groups, coordinate sort and index using AddOrReplaceReadGroups
  2. Coordinate sort and index using SortSam
  3. Index an already coordinate-sorted BAM using BuildBamIndex
  4. Mark duplicates using MarkDuplicates

Tools involved

Related resources

  • Our dictionary entry on read groups discusses the importance of assigning appropriate read group fields that differentiate samples and factors that contribute to batch effects, e.g. flow cell lane. Be sure that your read group fields conform to the recommendations.
  • Picard's standard options includes adding CREATE_INDEX to the commands of any of its tools that produce coordinate sorted BAMs.


1. Add read groups, coordinate sort and index using AddOrReplaceReadGroups

Use Picard's AddOrReplaceReadGroups to appropriately label read group (@RG) fields, coordinate sort and index a BAM file. Only the five required @RG fields are included in the command shown. Consider the other optional @RG fields for better record keeping.

java -jar picard.jar AddOrReplaceReadGroups \ 
    INPUT=reads.bam \ 
    OUTPUT=reads_addRG.bam \ 
    RGID=H0164.2 \ #be sure to change from default of 1
    RGLB= library1 \ 
    RGPL=illumina \ 
    RGPU=H0164ALXX140820.2 \ 
    RGSM=sample1 \ 

This creates a file called reads_addRG.bam with the same content and sorting as the input file, except the SAM record header's @RG line will be updated with the new information for the specified fields and each read will now have an RG tag filled with the @RG ID field information. Because of this repetition, the length of the @RG ID field contributes to file size.

To additionally coordinate sort by genomic location and create a .bai index, add the following options to the command.

    SORT_ORDER=coordinate \ 
    CREATE_INDEX=true

To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee #sets environmental variable for temporary directory


2. Coordinate sort and index using SortSam

Picard's SortSam both sorts and indexes and converts between SAM and BAM formats. For coordinate sorting, reads must be aligned to a reference genome.

java -jar picard.jar SortSam \ 
    INPUT=reads.bam \ 
    OUTPUT=reads_sorted.bam \ 
    SORT_ORDER=coordinate \

Concurrently index by tacking on the following option.

    CREATE_INDEX=true

This creates a file called reads_sorted.bam containing reads sorted by genomic location, aka coordinate, and a .bai index file with the same prefix as the output, e.g. reads_sorted.bai, within the same directory.

To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee #sets environmental variable for temporary directory


3. Index an already coordinate-sorted BAM using BuildBamIndex

Picard's BuildBamIndex allows you to index a BAM that is already coordinate sorted.

java -jar picard.jar BuildBamIndex \ 
    INPUT=reads_sorted.bam 

This creates a .bai index file with the same prefix as the input file, e.g. reads_sorted.bai, within the same directory as the input file. You want to keep this default behavior as many tools require the same prefix and directory location for the pair of files. Note that Picard tools do not systematically create an index file when they output a new BAM file, whereas GATK tools will always output indexed files.

To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee #sets environmental variable for temporary directory


4. Mark duplicates using MarkDuplicates

Picard's MarkDuplicates flags both PCR and optical duplicate reads with a 1024 (0x400) SAM flag. The input BAM must be coordinate sorted.

java -jar picard.jar MarkDuplicates \ 
    INPUT=reads_sorted.bam \ 
    OUTPUT=reads_markdup.bam \
    METRICS_FILE=metrics.txt \
    CREATE_INDEX=true

This creates a file called reads_markdup.bam with duplicate reads marked. It also creates a file called metrics.txt containing duplicate read data metrics and a .bai index file.

To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee #sets environmental variable for temporary directory
  • During sequencing, which involves PCR amplification within the sequencing machine, by a stochastic process we end up sequencing a proportion of DNA molecules that arose from the same parent insert. To be stringent in our variant discovery, GATK tools discount the duplicate reads as evidence for or against a putative variant.
  • Marking duplicates is less relevant to targeted amplicon sequencing and RNA-Seq analysis.
  • Optical duplicates arise from a read being sequenced twice as neighboring clusters.

back to top



Created 2013-07-02 00:16:14 | Updated 2015-09-24 12:12:04 | Tags: install rscript igv picard gsalib samtools r ggplot2 rstudio
Comments (45)

Objective

Install all software packages required to follow the GATK Best Practices.

Prerequisites

To follow these instructions, you will need to have a basic understanding of the meaning of the following words and command-line operations. If you are unfamiliar with any of the following, you should consult a more experienced colleague or your systems administrator if you have one. There are also many good online tutorials you can use to learn the necessary notions.

  • Basic Unix environment commands
  • Binary / Executable
  • Compiling a binary
  • Adding a binary to your path
  • Command-line shell, terminal or console
  • Software library

You will also need to have access to an ANSI compliant C++ compiler and the tools needed for normal compilations (make, shell, the standard library, tar, gunzip). These tools are usually pre-installed on Linux/Unix systems. On MacOS X, you may need to install the MacOS Xcode tools. See https://developer.apple.com/xcode/ for relevant information and software downloads. The XCode tools are free but an AppleID may be required to download them.

Starting with version 2.6, the GATK requires Java Runtime Environment version 1.7. All Linux/Unix and MacOS X systems should have a JRE pre-installed, but the version may vary. To test your Java version, run the following command in the shell:

java -version 

This should return a message along the lines of ”java version 1.7.0_25” as well as some details on the Runtime Environment (JRE) and Virtual Machine (VM). If you have a version other than 1.7.x, be aware that you may run into trouble with some of the more advanced features of the Picard and GATK tools. The simplest solution is to install an additional JRE and specify which you want to use at the command-line. To find out how to do so, you should seek help from your systems administrator.

Software packages

  1. BWA
  2. SAMtools
  3. Picard
  4. Genome Analysis Toolkit (GATK)
  5. IGV
  6. RStudio IDE and R libraries ggplot2 and gsalib

Note that the version numbers of packages you download may be different than shown in the instructions below. If so, please adapt the number accordingly in the commands.


1. BWA

Read the overview of the BWA software on the BWA project homepage, then download the latest version of the software package.

  • Installation

Unpack the tar file using:

tar xvzf bwa-0.7.12.tar.bz2 

This will produce a directory called bwa-0.7.12 containing the files necessary to compile the BWA binary. Move to this directory and compile using:

cd bwa-0.7.12
make

The compiled binary is called bwa. You should find it within the same folder (bwa-0.7.12 in this example). You may also find other compiled binaries; at time of writing, a second binary called bwamem-lite is also included. You can disregard this file for now. Finally, just add the BWA binary to your path to make it available on the command line. This completes the installation process.

  • Testing

Open a shell and run:

bwa 

This should print out some version and author information as well as a list of commands. As the Usage line states, to use BWA you will always build your command lines like this:

bwa <command> [options] 

This means you first make the call to the binary (bwa), then you specify which command (method) you wish to use (e.g. index) then any options (i.e. arguments such as input files or parameters) used by the program to perform that command.


2. SAMtools

Read the overview of the SAMtools software on the SAMtools project homepage, then download the latest version of the software package.

  • Installation

Unpack the tar file using:

tar xvjf samtools-0.1.2.tar.bz2 

This will produce a directory called samtools-0.1.2 containing the files necessary to compile the SAMtools binary. Move to this directory and compile using:

cd samtools-0.1.2 
make 

The compiled binary is called samtools. You should find it within the same folder (samtools-0.1.2 in this example). Finally, add the SAMtools binary to your path to make it available on the command line. This completes the installation process.

  • Testing

Open a shell and run:

samtools 

This should print out some version information as well as a list of commands. As the Usage line states, to use SAMtools you will always build your command lines like this:

samtools <command> [options] 

This means you first make the call to the binary (samtools), then you specify which command (method) you wish to use (e.g. index) then any options (i.e. arguments such as input files or parameters) used by the program to perform that command. This is a similar convention as used by BWA.


3. Picard

Read the overview of the Picard software on the Picard project homepage, then download the latest version of the software package.

  • Installation

Unpack the zip file using:

tar xjf picard-tools-1.139.zip 

This will produce a directory called picard-tools-1.139 containing the Picard jar files. Picard tools are distributed as a pre-compiled Java executable (jar file) so there is no need to compile them.

Note that it is not possible to add jar files to your path to make the tools available on the command line; you have to specify the full path to the jar file in your java command, which would look like this:

java -jar ~/my_tools/jars/picard.jar <Toolname> [options]

This syntax will be explained in a little more detail further below.

However, you can set up a shortcut called an "environment variable" in your shell profile configuration to make this easier. The idea is that you create a variable that tells your system where to find a given jar, like this:

PICARD = "~/my_tools/jars/picard.jar"

So then when you want to run a Picard tool, you just need to call the jar by its shortcut, like this:

java -jar $PICARD <Toolname> [options]

The exact way to set this up depends on what shell you're using and how your environment is configured. We like this overview and tutorial which explains how it all works; but if you are new to the command line environment and you find this too much too deal with, we recommend asking for help from your institution's IT support group.

This completes the installation process.

  • Testing

Open a shell and run:

java -jar picard.jar -h 

This should print out some version and usage information about the AddOrReplaceReadGroups.jar tool. At this point you will have noticed an important difference between BWA and Picard tools. To use BWA, we called on the BWA program and specified which of its internal tools we wanted to apply. To use Picard, we called on Java itself as the main program, then specified which jar file to use, knowing that one jar file = one tool. This applies to all Picard tools; to use them you will always build your command lines like this:

java -jar picard.jar <ToolName> [options] 

This means you first make the call to Java itself as the main program, then specify the picard.jar file, then specify which tool you want, and finally you pass whatever other arguments (input files, parameters etc.) are needed for the analysis.

Note that the command-line syntax of Picard tools has recently changed from java -jar <ToolName>.jar to java -jar picard.jar <ToolName>. We are using the newer syntax in this document, but some of our other documents may not have been updated yet. If you encounter any documents using the old syntax, let us know and we'll update them accordingly. If you are already using an older version of Picard, either adapt the commands or better, upgrade your version!

Next we will see that GATK tools are called in essentially the same way, although the way the options are specified is a little different. The reasons for how tools in a given software package are organized and invoked are largely due to the preferences of the software developers. They generally do not reflect strict technical requirements, although they can have an effect on speed and efficiency.


4. Genome Analysis Toolkit (GATK)

Hopefully if you're reading this, you're already acquainted with the purpose of the GATK, so go ahead and download the latest version of the software package.

In order to access the downloads, you need to register for a free account on the GATK support forum. You will also need to read and accept the license agreement before downloading the GATK software package. Note that if you intend to use the GATK for commercial purposes, you will need to purchase a license. See the licensing page for an overview of the commercial licensing conditions.

  • Installation

Unpack the tar file using:

tar xjf GenomeAnalysisTK-3.3-0.tar.bz2 

This will produce a directory called GenomeAnalysisTK-3.3-0 containing the GATK jar file, which is called GenomeAnalysisTK.jar, as well as a directory of example files called resources. GATK tools are distributed as a single pre-compiled Java executable so there is no need to compile them. Just like we discussed for Picard, it's not possible to add the GATK to your path, but you can set up a shortcut to the jar file using environment variables as described above.

This completes the installation process.

  • Testing

Open a shell and run:

java -jar GenomeAnalysisTK.jar -h 

This should print out some version and usage information, as well as a list of the tools included in the GATK. As the Usage line states, to use GATK you will always build your command lines like this:

java -jar GenomeAnalysisTK.jar -T <ToolName> [arguments] 

This means that just like for Picard, you first make the call to Java itself as the main program, then specify the GenomeAnalysisTK.jar file, then specify which tool you want, and finally you pass whatever other arguments (input files, parameters etc.) are needed for the analysis.


5. IGV

The Integrated Genomics Viewer is a genome browser that allows you to view BAM, VCF and other genomic file information in context. It has a graphical user interface that is very easy to use, and can be downloaded for free (though registration is required) from this website. We encourage you to read through IGV's very helpful user guide, which includes many detailed tutorials that will help you use the program most effectively.


6. RStudio IDE and R libraries ggplot2 and gsalib

Download the latest version of RStudio IDE. The webpage should automatically detect what platform you are running on and recommend the version most suitable for your system.

  • Installation

Follow the installation instructions provided. Binaries are provided for all major platforms; typically they just need to be placed in your Applications (or Programs) directory. Open RStudio and type the following command in the console window:

install.packages("ggplot2") 

This will download and install the ggplot2 library as well as any other library packages that ggplot2 depends on for its operation. Note that some users have reported having to install two additional package themselves, called reshape and gplots, which you can do as follows:

install.packages("reshape")
install.packages("gplots")

Finally, do the same thing to install the gsalib library:

install.packages("gsalib")

This will download and install the gsalib library.

Important note

If you are using a recent version of ggplot2 and a version of GATK older than 3.2, you may encounter an error when trying to generate the BQSR or VQSR recalibration plots. This is because until recently our scripts were still using an older version of certain ggplot2 functions. This has been fixed in GATK 3.2, so you should either upgrade your version of GATK (recommended) or downgrade your version of ggplot2. If you experience further issues generating the BQSR recalibration plots, please see this tutorial.


Created 2013-06-17 20:44:09 | Updated 2015-09-24 13:10:42 | Tags: picard
Comments (16)

Objective

Prepare a reference sequence so that it is suitable for use with BWA and GATK.

Prerequisites

  • Installed BWA
  • Installed SAMTools
  • Installed Picard

Steps

  1. Generate the BWA index
  2. Generate the Fasta file index
  3. Generate the sequence dictionary

1. Generate the BWA index

Action

Run the following BWA command:

bwa index -a bwtsw reference.fa 

where -a bwtsw specifies that we want to use the indexing algorithm that is capable of handling the whole human genome.

Expected Result

This creates a collection of files used by BWA to perform the alignment.


2. Generate the fasta file index

Action

Run the following SAMtools command:

samtools faidx reference.fa 

Expected Result

This creates a file called reference.fa.fai, with one record per line for each of the contigs in the FASTA reference file. Each record is composed of the contig name, size, location, basesPerLine and bytesPerLine.


3. Generate the sequence dictionary

Action

Run the following Picard command:

java -jar picard.jar CreateSequenceDictionary \
    REFERENCE=reference.fa \ 
    OUTPUT=reference.dict 

Note that this is the new syntax for use with the latest version of Picard. Older versions used a slightly different syntax because all the tools were in separate jars, so you'd call e.g. java -jar CreateSequenceDictionary.jar directly.

Expected Result

This creates a file called reference.dict formatted like a SAM header, describing the contents of your reference FASTA file.


Created 2013-02-13 18:06:06 | Updated 2014-06-24 09:45:27 | Tags: tribble picard sam-jdk variantcontext htsjdk
Comments (5)

The picard repository on github contains all picard public tools. Libraries live under the htsjdk, which includes the samtools-jdk, tribble, and variant packages (which includes VariantContext and associated classes as well as the VCF/BCF codecs).

If you just need to check out the sources and don't need to make any commits into the picard repository, the command is:

git clone https://github.com/broadinstitute/picard.git

Then within the picard directory, clone the htsjdk.

cd picard
git clone https://github.com/samtools/htsjdk.git

Then you can attach the picard/src/java and picard/htsjdk/src/java directories in IntelliJ as a source directory (File -> Project Structure -> Libraries -> Click the plus sign -> "Attach Files or Directories" in the latest IntelliJ).

To build picard and the htsjdk all at once, type ant from within the picard directory. To run tests, type ant test

If you do need to make commits into the picard repository, first you'll need to create a github account, fork picard or htsjdk, make your changes, and then issue a pull request. For more info on pull requests, see: https://help.github.com/articles/using-pull-requests


Created 2012-08-15 18:16:01 | Updated 2014-09-17 13:20:21 | Tags: developer tribble intermediate picard htsjdk
Comments (6)

1. Overview

The Tribble project was started as an effort to overhaul our reference-ordered data system; we had many different formats that were shoehorned into a common framework that didn't really work as intended. What we wanted was a common framework that allowed for searching of reference ordered data, regardless of the underlying type. Jim Robinson had developed indexing schemes for text-based files, which was incorporated into the Tribble library.

2. Architecture Overview

Tribble provides a lightweight interface and API for querying features and creating indexes from feature files, while allowing iteration over know feature files that we're unable to create indexes for. The main entry point for external users is the BasicFeatureReader class. It takes in a codec, an index file, and a file containing the features to be processed. With an instance of a BasicFeatureReader, you can query for features that span a specific location, or get an iterator over all the records in the file.

3. Developer Overview

For developers, there are two important classes to implement: the FeatureCodec, which decodes lines of text and produces features, and the feature class, which is your underlying record type.

For developers there are two classes that are important:

  • Feature

    This is the genomicly oriented feature that represents the underlying data in the input file. For instance in the VCF format, this is the variant call including quality information, the reference base, and the alternate base. The required information to implement a feature is the chromosome name, the start position (one based), and the stop position. The start and stop position represent a closed, one-based interval. I.e. the first base in chromosome one would be chr1:1-1.

  • FeatureCodec

    This class takes in a line of text (from an input source, whether it's a file, compressed file, or a http link), and produces the above feature.

To implement your new format into Tribble, you need to implement the two above classes (in an appropriately named subfolder in the Tribble check-out). The Feature object should know nothing about the file representation; it should represent the data as an in-memory object. The interface for a feature looks like:

public interface Feature {

    /**
     * Return the features reference sequence name, e.g chromosome or contig
     */
    public String getChr();

    /**
     * Return the start position in 1-based coordinates (first base is 1)
     */
    public int getStart();

    /**
     * Return the end position following 1-based fully closed conventions.  The length of a feature is
     * end - start + 1;
     */
    public int getEnd();
}

And the interface for FeatureCodec:

/**
 * the base interface for classes that read in features.
 * @param <T> The feature type this codec reads
 */
public interface FeatureCodec<T extends Feature> {
    /**
     * Decode a line to obtain just its FeatureLoc for indexing -- contig, start, and stop.
     *
     * @param line the input line to decode
     * @return  Return the FeatureLoc encoded by the line, or null if the line does not represent a feature (e.g. is
     * a comment)
     */
    public Feature decodeLoc(String line);

    /**
     * Decode a line as a Feature.
     *
     * @param line the input line to decode
     * @return  Return the Feature encoded by the line,  or null if the line does not represent a feature (e.g. is
     * a comment)
     */
    public T decode(String line);

    /**
     * This function returns the object the codec generates.  This is allowed to be Feature in the case where
     * conditionally different types are generated.  Be as specific as you can though.
     *
     * This function is used by reflections based tools, so we can know the underlying type
     *
     * @return the feature type this codec generates.
     */
    public Class<T> getFeatureType();

    /**  Read and return the header, or null if there is no header.
     *
     * @return header object
     */
    public Object readHeader(LineReader reader);
}

4. Supported Formats

The following formats are supported in Tribble:

  • VCF Format
  • DbSNP Format
  • BED Format
  • GATK Interval Format

5. Updating the Tribble, htsjdk, and/or Picard library

Updating the revision of Tribble on the system is a relatively straightforward task if the following steps are taken.

NOTE: Any directory starting with ~ may be different on your machine, depending on where you cloned the various repositories for gsa-unstable, picard, and htsjdk.

A Maven script to install picard into the local repository is located under gsa-unstable/private/picard-maven. To operate, it requires a symbolic link named picard pointing to a working checkout of the picard github repository. NOTE: compiling picard requires an htsjdk github repository checkout available at picard/htsjdk, either as a subdirectory or another symbolic link. The final full path should be gsa-unstable/private/picard-maven/picard/htsjdk.

cd ~/src/gsa-unstable
cd private/picard-maven
ln -s ~/src/picard picard

Create a git branch of Picard and/or htsjdk and make your changes. To install your changes into the GATK you must run mvn install in the private/picard-maven directory. This will compile and copy the jars into gsa-unstable/public/repo, and update gsa-unstable/gatk-root/pom.xml with the corresponding version. While making changes your revision of picard and htslib will be labeled with -SNAPSHOT.

cd ~/src/gsa-unstable
cd private/picard-maven
mvn install

Continue testing in the GATK. Once your changes and updated tests for picard/htsjdk are complete, push your branch and submit your pull request to the Picard and/or htsjdk github. After your Picard/htsjdk patches are accepted, switch your Picard/htsjdk branches back to the master branch. NOTE: Leave your gsa-unstable branch on your development branch!

cd ~/src/picard
ant clean
git checkout master
git fetch
git rebase
cd htsjdk
git checkout master
git fetch
git rebase

NOTE: The version number of old and new Picard/htsjdk will vary, and during active development will end with -SNAPSHOT. While, if needed, you may push -SNAPSHOT version for testing on Bamboo, you should NOT submit a pull request with a -SNAPSHOT version. -SNAPSHOT indicates your local changes are not reproducible from source control.

When ready, run mvn install once more to create the non -SNAPSHOT versions under gsa-unstable/public/repo. In that directory, git add the new version, and git rm the old versions.

cd ~/src/gsa-unstable
cd public/repo
git add picard/picard/1.115.1499/
git add samtools/htsjdk/1.115.1509/
git rm -r picard/picard/1.112.1452/
git rm -r samtools/htsjdk/1.112.1452/

Commit and then push your gsa-unstable branch, then issue a pull request for review.


Created 2015-03-09 19:07:23 | Updated | Tags: picard support htsjdk
Comments (4)

Here's some good news for anyone who has been using both GATK and the Picard tools in their work -- which means all of you, since you all follow our Best Practices to a tee, right?

As you may know, both toolkits are developed here at the Broad Institute, and are deployed together in the Broad's analysis pipelines. The fact that they have been developed, released and supported separately so far is more an accident of history and internal organization than anything else (and we know it's inconvenient to y'all).

The good news is that we're taking steps to consolidate these efforts, which we believe will benefit everyone. In that spirit, we have been working closely with the Picard tools development team, and we're now ready to take the first step of consolidating support for the tools. From now on, you will be able to ask us questions about the Picard tools, and report bugs, in the GATK forum. And developers will be happy to hear that we are also committed to supporting HTSJDK for developers through the Github repo’s Issues tracker.

In the near future, we will also start hosting downloads and documentation for the Picard tools on the GATK website. And before you ask, yes, the Picard tools will continue to be completely open-source and available to all free of charge.

To recap, we have brought the GATK and Picard teams together, and we are working on bringing together in the same place all the methods and tools to perform genome analysis. Our goal is to make a world where you can run our complete Best Practices pipeline end-to-end with a single Broad toolkit. We think it’ll make your life easier, because it sure is making ours easier.


Created 2014-11-05 19:09:49 | Updated 2014-11-24 22:57:13 | Tags: commandline picard topstory syntax
Comments (0)

Consider this a public service announcement, since most GATK users probably also use Picard tools routinely. The recently released version 1.124 of the Picard tools includes many lovely improvements, bug fixes and even a few new tools (see release notes for full details) -- but you'll want to pay attention to one major change in particular.

From this version onward, the Picard release will contain a single JAR file containing all the tools, instead of one JAR file per tool as it was before. This means that when you invoke a Picar tool, you'll invoke a single JAR, then specify which tool (which they call CLP for Command Line Program) you want to run. This should feel completely familiar if you already use GATK regularly, but it does mean you'll need to update any scripts that use Picard tools to the new paradigm. Other than that, there's no change of syntax; Picard will still use e.g. I=input.bam where GATK would use -I input.bam.

We will need to update some of our own documentation accordingly over the near future; please bear with us as we go through this process, and let us know by commenting in this thread if you find any docs that have yet to be updated.


Created 2016-02-11 11:26:28 | Updated | Tags: picard calculatehsmetrics
Comments (3)

I'm trying to run CalculateHsMetrics with following command from Picard Tools

java -jar /tools/picard-tools-1.141/picard.jar CalculateHsMetrics \ BI=probe_cancer_panel_2.0.bed \ TI=target_cancer_panel_2.0.bed \ I=R1_edit3_8bp_TE030_fq-quality_checked_R2_edit3_8bp_TE030_fq-quality_checked__sort46.bam \ O=output

Getting Following STDOUT:

[Thu Feb 11 15:40:12 IST 2016] picard.analysis.directed.CalculateHsMetrics BAIT_INTERVALS=[probe_cancer_panel_2.0.bed] TARGET_INTERVALS=[target_cancer_panel_2.0.bed] INPUT=R1_edit3_8bp_TE030_fq-quality_checked_R2_edit3_8bp_TE030_fq-quality_checked__sort46.bam OUTPUT=output METRIC_ACCUMULATION_LEVEL=[ALL_READS] VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json [Thu Feb 11 15:40:12 IST 2016] Executing as root@pts00256 on Linux 3.16.0-30-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_79-b15; Picard version: 1.141(8ece590411350163e7689e9e77aab8efcb622170_1447695087) IntelDeflater [Thu Feb 11 15:40:12 IST 2016] picard.analysis.directed.CalculateHsMetrics done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=2058354688 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" java.lang.IllegalStateException: Interval list file must contain header. at htsjdk.samtools.util.IntervalList.fromReader(IntervalList.java:436) at htsjdk.samtools.util.IntervalList.fromFile(IntervalList.java:381) at htsjdk.samtools.util.IntervalList.fromFiles(IntervalList.java:410) at picard.analysis.directed.CollectTargetedMetrics.doWork(CollectTargetedMetrics.java:85) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

I'm not able to understand "Interval list file must contain header."

"probe_cancer_panel_2.0.bed" file used as BAIT file, contents(first 10 lines) as follows :

chr1 17345282 17345332 SDHB|NM_003000_exon_0_0_chr1_17345225_r_1_1939010 1000 + chr1 17345375 17345425 SDHB|NM_003000_exon_0_0_chr1_17345225_r_0_1939009 1000 - chr1 17345486 17345536 SDHB|NM_003000_exon_0_0_chr1_17345225_r_1_1939010 1000 - chr1 17345123 17345173 SDHB|NM_003000_exon_0_0_chr1_17345225_r_0_1939009 1000 + chr1 17349254 17349304 SDHB|NM_003000_exon_1_0_chr1_17349103_r_0_2563050 1000 - chr1 17349014 17349064 SDHB|NM_003000_exon_1_0_chr1_17349103_r_0_2563050 1000 + chr1 17350378 17350428 SDHB|NM_003000_exon_2_0_chr1_17350468_r_0_2563051 1000 + chr1 17350619 17350669 SDHB|NM_003000_exon_2_0_chr1_17350468_r_0_2563051 1000 - chr1 17354413 17354463 SDHB|NM_003000_exon_3_0_chr1_17354244_r_0_2563052 1000 - chr1 17354152 17354202 SDHB|NM_003000_exon_3_0_chr1_17354244_r_0_2563052 1000 +

"target_cancer_panel_2.0.bed" file used as Target file, contents(first 10 lines) as follows :

chr1 1981908 1982140 PRKCZ|NM_002744_exon_0_0_chr1_1981909_f chr1 1986879 1987001 PRKCZ|NM_002744_exon_1_0_chr1_1986880_f chr1 1987922 1988012 PRKCZ|NM_002744_exon_2_0_chr1_1987923_f chr1 1990979 1991030 PRKCZ|NM_002744_exon_3_0_chr1_1990980_f chr1 2005085 2005368 PRKCZ|NM_001033581_exon_0_0_chr1_2005086_f chr1 2005424 2005714 PRKCZ|NM_001242874_exon_0_0_chr1_2005425_f chr1 2036154 2036334 PRKCZ|NM_001033582_exon_0_0_chr1_2036155_f chr1 2066700 2066786 PRKCZ|NM_001033581_exon_1_0_chr1_2066701_f chr1 2075648 2075780 PRKCZ|NM_001033581_exon_2_0_chr1_2075649_f chr1 2077465 2077547 PRKCZ|NM_001033581_exon_3_0_chr1_2077466_f

Please let me understand what kind of header it requires.


Created 2016-02-06 22:26:13 | Updated | Tags: picard cram
Comments (8)

Hello there!

I am trying to use picard tools to mark duplicates using a cram format file; however I could not find any documentation to address this problem. How can I pass a valid CRAM reference?

Thanks in advance,

-lili

Sat Feb 06 16:09:42 CST 2016] picard.sam.markduplicates.MarkDuplicatesWithMateCigar MINIMUM_DISTANCE=250 INPUT=[/EXOME/gatk/test_10990_bwa_srtd.cram] OUTPUT=EXOME/gatk/test_10990_wes_dupMC.cram METRICS_FILE=/EXOME/gatk/test_10990_wes_dupMC_metrics.txt OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 CREATE_INDEX=true SKIP_PAIRS_WITH_NO_MATE_CIGAR=true BLOCK_SIZE=100000 REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=TOTAL_MAPPED_REFERENCE_LENGTH PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicatesWithMateCigar READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json [Sat Feb 06 16:09:42 CST 2016] Executing as antunes@gpu10 on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_40-b26; Picard version: 2.1.0(25ebc07f7fbaa7c1a4a8e6c130c88c1d10681802_1454776546) IntelDeflater [Sat Feb 06 16:09:42 CST 2016] picard.sam.markduplicates.MarkDuplicatesWithMateCigar done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=4116185088 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" java.lang.IllegalStateException: A valid CRAM reference was not supplied and one cannot be acquired via the property settings reference_fasta or use_cram_ref_download at htsjdk.samtools.cram.ref.ReferenceSource.getDefaultCRAMReferenceSource(ReferenceSource.java:98) at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:269) at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:205) at picard.sam.markduplicates.MarkDuplicatesWithMateCigar.doWork(MarkDuplicatesWithMateCigar.java:118) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)


Created 2016-01-20 16:11:48 | Updated | Tags: install picard failure htsjdk
Comments (10)

(If this is the wrong place to post this, please tell me where to go)

I am new to GATK, trying to build Picard because I apparently will need to build Picard sequence directories for my genomes of interest.

I have installed Java6, Java8, ant, and HTSJDK. I have downloaded Picard 2a49ee2.

Following instructions at broadinstitute.github.io/picard/building.html, I have defined JAVA6_HOME, cd'd into broadinstitute-picard-2a49ee2, and when I run ant -lib lib/ant package-commands I get this message (relevant part only) "BUILD FAILED ... Basedir .../broadinstitute-picard-2a49ee2/htsjdk does not exist"

So it seems I need to do something for the build to know where my HTSJDK lives. Or did my download fail? Am I supposed to make a symbolic link from .../broadinstitute-picard-2a49ee2/htsjdk over to someplace in my HTSJDK directory? If so, to where?

Thanks for any help, Bob H


Created 2016-01-18 14:17:50 | Updated 2016-01-18 14:19:05 | Tags: picard indel-vcf-gatk short-read-preprocessing
Comments (2)

Does picard MergeBamAlignment function require paired reads as ALIGNED_BAM? This is not mentioned in the picard documents

However, we use trimmomatic to do reads QC, which generates three outputs: paired reads, unique reads after QC for 1st reads and 2nd reads. Then we use bwa to map three files individually to generate three sam files. Next, we merged bwa results into one bam file using picard MergeSamFiles. At last, we try to create a clean-up mapping file using picard MergeBamAlignment with a unmapped_bam file created from input fastq reads files using picard FastqToSam. The sort order of files are correct.

Does this mean picard and/or GATK does not work with a mixture of paired-reads and single-reads maps?


Created 2016-01-13 17:18:14 | Updated | Tags: picard r collectmultiplemetrics
Comments (3)

Hello!

I run CollectMultipleMetrics in a RNAseq pipeline, and I encounter this error:

[Wed Jan 13 11:50:10 EST 2016] picard.analysis.CollectMultipleMetrics INPUT=sorted.mdup.bam REFERENCE_SEQUENCE=Mus_musculus.GRCm38.fa OUTPUT=metrics PROGRAM=[CollectAlignmentSummaryMetrics, CollectInsertSizeMetrics, QualityScoreDistribution, MeanQualityByCycle, CollectBaseDistributionByCycle, CollectAlignmentSummaryMetrics, CollectInsertSizeMetrics] TMP_DIR=[/scratch] VALIDATION_STRINGENCY=SILENT MAX_RECORDS_IN_RAM=2000000 ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false [Wed Jan 13 11:50:10 EST 2016] Executing as me@machine on Linux 2.6.32-573.12.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-ea-b07; Picard version: 1.123(286a232caea2fdc8fdd88574c09c460b46386fff_1413818736) IntelDeflater INFO 2016-01-13 11:50:19 SinglePassSamProgram Processed 1 000 000 records. Elapsed time: 00:00:08s. Time for last 1 000 000: 7s. Last read position: 1:24 615 160 INFO 2016-01-13 11:50:26 SinglePassSamProgram Processed 2 000 000 records. Elapsed time: 00:00:15s. Time for last 1 000 000: 6s. Last read position: 1:58 405 526 INFO 2016-01-13 11:50:33 SinglePassSamProgram Processed 3 000 000 records. Elapsed time: 00:00:22s. Time for last 1 000 000: 7s. Last read position: 1:84 726 212 INFO 2016-01-13 11:50:39 SinglePassSamProgram Processed 4 000 000 records. Elapsed time: 00:00:28s. Time for last 1 000 000: 6s. Last read position: 1:131 258 270 INFO 2016-01-13 11:50:46 SinglePassSamProgram Processed 5 000 000 records. Elapsed time: 00:00:35s. Time for last 1 000 000: 6s. Last read position: 1:162 699 014 INFO 2016-01-13 11:50:53 SinglePassSamProgram Processed 6 000 000 records. Elapsed time: 00:00:42s. Time for last 1 000 000: 6s. Last read position: 1:180 423 776 INFO 2016-01-13 11:51:00 SinglePassSamProgram Processed 7 000 000 records. Elapsed time: 00:00:49s. Time for last 1 000 000: 7s. Last read position: 10:40 288 148 INFO 2016-01-13 11:51:07 SinglePassSamProgram Processed 8 000 000 records. Elapsed time: 00:00:56s. Time for last 1 000 000: 6s. Last read position: 10:60 300 791 INFO 2016-01-13 11:51:14 SinglePassSamProgram Processed 9 000 000 records. Elapsed time: 00:01:03s. Time for last 1 000 000: 6s. Last read position: 10:78 194 008 INFO 2016-01-13 11:51:21 SinglePassSamProgram Processed 10 000 000 records. Elapsed time: 00:01:10s. Time for last 1 000 000: 7s. Last read position: 10:81 369 411 INFO 2016-01-13 11:51:27 SinglePassSamProgram Processed 11 000 000 records. Elapsed time: 00:01:17s. Time for last 1 000 000: 6s. Last read position: 10:117 277 576 INFO 2016-01-13 11:51:33 SinglePassSamProgram Processed 12 000 000 records. Elapsed time: 00:01:23s. Time for last 1 000 000: 6s. Last read position: 10:117 282 106 INFO 2016-01-13 11:51:40 SinglePassSamProgram Processed 13 000 000 records. Elapsed time: 00:01:30s. Time for last 1 000 000: 6s. Last read position: 10:128 492 616 INFO 2016-01-13 11:51:47 SinglePassSamProgram Processed 14 000 000 records. Elapsed time: 00:01:37s. Time for last 1 000 000: 7s. Last read position: 11:20 077 246 INFO 2016-01-13 11:51:54 SinglePassSamProgram Processed 15 000 000 records. Elapsed time: 00:01:43s. Time for last 1 000 000: 6s. Last read position: 11:50 284 406 INFO 2016-01-13 11:52:00 SinglePassSamProgram Processed 16 000 000 records. Elapsed time: 00:01:49s. Time for last 1 000 000: 6s. Last read position: 11:62 594 925 INFO 2016-01-13 11:52:07 SinglePassSamProgram Processed 17 000 000 records. Elapsed time: 00:01:56s. Time for last 1 000 000: 7s. Last read position: 11:75 576 108 INFO 2016-01-13 11:52:14 SinglePassSamProgram Processed 18 000 000 records. Elapsed time: 00:02:04s. Time for last 1 000 000: 7s. Last read position: 11:86 815 079 INFO 2016-01-13 11:52:21 SinglePassSamProgram Processed 19 000 000 records. Elapsed time: 00:02:10s. Time for last 1 000 000: 6s. Last read position: 11:100 781 869 INFO 2016-01-13 11:52:28 SinglePassSamProgram Processed 20 000 000 records. Elapsed time: 00:02:17s. Time for last 1 000 000: 7s. Last read position: 11:109 668 732 INFO 2016-01-13 11:52:35 SinglePassSamProgram Processed 21 000 000 records. Elapsed time: 00:02:24s. Time for last 1 000 000: 6s. Last read position: 11:120 347 082 INFO 2016-01-13 11:52:42 SinglePassSamProgram Processed 22 000 000 records. Elapsed time: 00:02:31s. Time for last 1 000 000: 6s. Last read position: 12:32 851 067 INFO 2016-01-13 11:52:48 SinglePassSamProgram Processed 23 000 000 records. Elapsed time: 00:02:38s. Time for last 1 000 000: 6s. Last read position: 12:86 883 769 INFO 2016-01-13 11:52:55 SinglePassSamProgram Processed 24 000 000 records. Elapsed time: 00:02:45s. Time for last 1 000 000: 6s. Last read position: 12:113 153 306 INFO 2016-01-13 11:53:03 SinglePassSamProgram Processed 25 000 000 records. Elapsed time: 00:02:52s. Time for last 1 000 000: 7s. Last read position: 13:52 611 052 INFO 2016-01-13 11:53:09 SinglePassSamProgram Processed 26 000 000 records. Elapsed time: 00:02:58s. Time for last 1 000 000: 6s. Last read position: 13:74 720 979 INFO 2016-01-13 11:53:16 SinglePassSamProgram Processed 27 000 000 records. Elapsed time: 00:03:05s. Time for last 1 000 000: 6s. Last read position: 14:14 116 451 INFO 2016-01-13 11:53:23 SinglePassSamProgram Processed 28 000 000 records. Elapsed time: 00:03:12s. Time for last 1 000 000: 6s. Last read position: 14:45 597 948 INFO 2016-01-13 11:53:29 SinglePassSamProgram Processed 29 000 000 records. Elapsed time: 00:03:18s. Time for last 1 000 000: 6s. Last read position: 14:63 521 977 INFO 2016-01-13 11:53:36 SinglePassSamProgram Processed 30 000 000 records. Elapsed time: 00:03:25s. Time for last 1 000 000: 6s. Last read position: 15:5 116 661 INFO 2016-01-13 11:53:43 SinglePassSamProgram Processed 31 000 000 records. Elapsed time: 00:03:32s. Time for last 1 000 000: 6s. Last read position: 15:73 809 694 INFO 2016-01-13 11:53:50 SinglePassSamProgram Processed 32 000 000 records. Elapsed time: 00:03:39s. Time for last 1 000 000: 6s. Last read position: 15:79 534 754 INFO 2016-01-13 11:53:57 SinglePassSamProgram Processed 33 000 000 records. Elapsed time: 00:03:46s. Time for last 1 000 000: 7s. Last read position: 15:98 931 497 INFO 2016-01-13 11:54:04 SinglePassSamProgram Processed 34 000 000 records. Elapsed time: 00:03:53s. Time for last 1 000 000: 6s. Last read position: 16:5 008 851 INFO 2016-01-13 11:54:11 SinglePassSamProgram Processed 35 000 000 records. Elapsed time: 00:04:00s. Time for last 1 000 000: 6s. Last read position: 16:35 022 587 INFO 2016-01-13 11:54:18 SinglePassSamProgram Processed 36 000 000 records. Elapsed time: 00:04:08s. Time for last 1 000 000: 7s. Last read position: 17:6 132 340 INFO 2016-01-13 11:54:25 SinglePassSamProgram Processed 37 000 000 records. Elapsed time: 00:04:14s. Time for last 1 000 000: 6s. Last read position: 17:26 646 052 INFO 2016-01-13 11:54:32 SinglePassSamProgram Processed 38 000 000 records. Elapsed time: 00:04:21s. Time for last 1 000 000: 6s. Last read position: 17:33 952 419 INFO 2016-01-13 11:54:38 SinglePassSamProgram Processed 39 000 000 records. Elapsed time: 00:04:28s. Time for last 1 000 000: 6s. Last read position: 17:39 847 693 INFO 2016-01-13 11:54:46 SinglePassSamProgram Processed 40 000 000 records. Elapsed time: 00:04:35s. Time for last 1 000 000: 7s. Last read position: 17:70 974 194 INFO 2016-01-13 11:54:53 SinglePassSamProgram Processed 41 000 000 records. Elapsed time: 00:04:42s. Time for last 1 000 000: 7s. Last read position: 18:34 349 659 INFO 2016-01-13 11:54:59 SinglePassSamProgram Processed 42 000 000 records. Elapsed time: 00:04:49s. Time for last 1 000 000: 6s. Last read position: 18:65 809 480 INFO 2016-01-13 11:55:06 SinglePassSamProgram Processed 43 000 000 records. Elapsed time: 00:04:56s. Time for last 1 000 000: 6s. Last read position: 19:4 316 395 INFO 2016-01-13 11:55:14 SinglePassSamProgram Processed 44 000 000 records. Elapsed time: 00:05:03s. Time for last 1 000 000: 7s. Last read position: 19:7 215 445 INFO 2016-01-13 11:55:20 SinglePassSamProgram Processed 45 000 000 records. Elapsed time: 00:05:09s. Time for last 1 000 000: 6s. Last read position: 19:21 276 603 INFO 2016-01-13 11:55:27 SinglePassSamProgram Processed 46 000 000 records. Elapsed time: 00:05:16s. Time for last 1 000 000: 6s. Last read position: 19:56 439 778 INFO 2016-01-13 11:55:34 SinglePassSamProgram Processed 47 000 000 records. Elapsed time: 00:05:23s. Time for last 1 000 000: 6s. Last read position: 2:24 405 695 INFO 2016-01-13 11:55:41 SinglePassSamProgram Processed 48 000 000 records. Elapsed time: 00:05:30s. Time for last 1 000 000: 7s. Last read position: 2:34 810 968 INFO 2016-01-13 11:55:48 SinglePassSamProgram Processed 49 000 000 records. Elapsed time: 00:05:37s. Time for last 1 000 000: 6s. Last read position: 2:79 662 361 INFO 2016-01-13 11:55:54 SinglePassSamProgram Processed 50 000 000 records. Elapsed time: 00:05:43s. Time for last 1 000 000: 6s. Last read position: 2:120 721 272 INFO 2016-01-13 11:56:01 SinglePassSamProgram Processed 51 000 000 records. Elapsed time: 00:05:50s. Time for last 1 000 000: 6s. Last read position: 2:148 872 896 INFO 2016-01-13 11:56:08 SinglePassSamProgram Processed 52 000 000 records. Elapsed time: 00:05:57s. Time for last 1 000 000: 6s. Last read position: 2:162 934 652 INFO 2016-01-13 11:56:15 SinglePassSamProgram Processed 53 000 000 records. Elapsed time: 00:06:04s. Time for last 1 000 000: 6s. Last read position: 2:178 035 820 INFO 2016-01-13 11:56:21 SinglePassSamProgram Processed 54 000 000 records. Elapsed time: 00:06:11s. Time for last 1 000 000: 6s. Last read position: 3:65 380 183 INFO 2016-01-13 11:56:28 SinglePassSamProgram Processed 55 000 000 records. Elapsed time: 00:06:18s. Time for last 1 000 000: 6s. Last read position: 3:90 605 083 INFO 2016-01-13 11:56:35 SinglePassSamProgram Processed 56 000 000 records. Elapsed time: 00:06:25s. Time for last 1 000 000: 6s. Last read position: 3:103 044 640 INFO 2016-01-13 11:56:42 SinglePassSamProgram Processed 57 000 000 records. Elapsed time: 00:06:31s. Time for last 1 000 000: 6s. Last read position: 3:142 567 677 INFO 2016-01-13 11:56:49 SinglePassSamProgram Processed 58 000 000 records. Elapsed time: 00:06:38s. Time for last 1 000 000: 7s. Last read position: 4:43 549 937 INFO 2016-01-13 11:56:56 SinglePassSamProgram Processed 59 000 000 records. Elapsed time: 00:06:45s. Time for last 1 000 000: 6s. Last read position: 4:103 564 637 INFO 2016-01-13 11:57:02 SinglePassSamProgram Processed 60 000 000 records. Elapsed time: 00:06:51s. Time for last 1 000 000: 6s. Last read position: 4:126 232 636 INFO 2016-01-13 11:57:09 SinglePassSamProgram Processed 61 000 000 records. Elapsed time: 00:06:58s. Time for last 1 000 000: 6s. Last read position: 4:134 161 336 INFO 2016-01-13 11:57:16 SinglePassSamProgram Processed 62 000 000 records. Elapsed time: 00:07:05s. Time for last 1 000 000: 6s. Last read position: 4:145 213 984 INFO 2016-01-13 11:57:22 SinglePassSamProgram Processed 63 000 000 records. Elapsed time: 00:07:12s. Time for last 1 000 000: 6s. Last read position: 5:5 783 143 INFO 2016-01-13 11:57:29 SinglePassSamProgram Processed 64 000 000 records. Elapsed time: 00:07:18s. Time for last 1 000 000: 6s. Last read position: 5:44 176 835 INFO 2016-01-13 11:57:36 SinglePassSamProgram Processed 65 000 000 records. Elapsed time: 00:07:25s. Time for last 1 000 000: 6s. Last read position: 5:107 903 709 INFO 2016-01-13 11:57:43 SinglePassSamProgram Processed 66 000 000 records. Elapsed time: 00:07:32s. Time for last 1 000 000: 6s. Last read position: 5:121 372 232 INFO 2016-01-13 11:57:49 SinglePassSamProgram Processed 67 000 000 records. Elapsed time: 00:07:38s. Time for last 1 000 000: 5s. Last read position: 5:134 620 184 INFO 2016-01-13 11:57:56 SinglePassSamProgram Processed 68 000 000 records. Elapsed time: 00:07:45s. Time for last 1 000 000: 7s. Last read position: 5:142 903 802 INFO 2016-01-13 11:58:03 SinglePassSamProgram Processed 69 000 000 records. Elapsed time: 00:07:52s. Time for last 1 000 000: 7s. Last read position: 5:151 643 744 INFO 2016-01-13 11:58:10 SinglePassSamProgram Processed 70 000 000 records. Elapsed time: 00:07:59s. Time for last 1 000 000: 7s. Last read position: 6:55 060 998 INFO 2016-01-13 11:58:17 SinglePassSamProgram Processed 71 000 000 records. Elapsed time: 00:08:06s. Time for last 1 000 000: 6s. Last read position: 6:97 221 731 INFO 2016-01-13 11:58:24 SinglePassSamProgram Processed 72 000 000 records. Elapsed time: 00:08:13s. Time for last 1 000 000: 6s. Last read position: 6:124 728 807 INFO 2016-01-13 11:58:31 SinglePassSamProgram Processed 73 000 000 records. Elapsed time: 00:08:20s. Time for last 1 000 000: 7s. Last read position: 7:3 650 845 INFO 2016-01-13 11:58:38 SinglePassSamProgram Processed 74 000 000 records. Elapsed time: 00:08:27s. Time for last 1 000 000: 6s. Last read position: 7:19 572 846 INFO 2016-01-13 11:58:44 SinglePassSamProgram Processed 75 000 000 records. Elapsed time: 00:08:34s. Time for last 1 000 000: 6s. Last read position: 7:29 149 560 INFO 2016-01-13 11:58:51 SinglePassSamProgram Processed 76 000 000 records. Elapsed time: 00:08:41s. Time for last 1 000 000: 6s. Last read position: 7:45 458 060 INFO 2016-01-13 11:58:58 SinglePassSamProgram Processed 77 000 000 records. Elapsed time: 00:08:47s. Time for last 1 000 000: 6s. Last read position: 7:80 735 551 INFO 2016-01-13 11:59:05 SinglePassSamProgram Processed 78 000 000 records. Elapsed time: 00:08:54s. Time for last 1 000 000: 6s. Last read position: 7:104 465 010 INFO 2016-01-13 11:59:12 SinglePassSamProgram Processed 79 000 000 records. Elapsed time: 00:09:01s. Time for last 1 000 000: 6s. Last read position: 7:126 699 913 INFO 2016-01-13 11:59:19 SinglePassSamProgram Processed 80 000 000 records. Elapsed time: 00:09:08s. Time for last 1 000 000: 7s. Last read position: 7:135 684 619 INFO 2016-01-13 11:59:26 SinglePassSamProgram Processed 81 000 000 records. Elapsed time: 00:09:15s. Time for last 1 000 000: 7s. Last read position: 8:13 587 009 INFO 2016-01-13 11:59:32 SinglePassSamProgram Processed 82 000 000 records. Elapsed time: 00:09:22s. Time for last 1 000 000: 6s. Last read position: 8:70 029 074 INFO 2016-01-13 11:59:39 SinglePassSamProgram Processed 83 000 000 records. Elapsed time: 00:09:29s. Time for last 1 000 000: 6s. Last read position: 8:83 725 864 INFO 2016-01-13 11:59:46 SinglePassSamProgram Processed 84 000 000 records. Elapsed time: 00:09:35s. Time for last 1 000 000: 6s. Last read position: 8:105 525 578 INFO 2016-01-13 11:59:53 SinglePassSamProgram Processed 85 000 000 records. Elapsed time: 00:09:43s. Time for last 1 000 000: 7s. Last read position: 8:122 699 229 INFO 2016-01-13 12:00:00 SinglePassSamProgram Processed 86 000 000 records. Elapsed time: 00:09:50s. Time for last 1 000 000: 7s. Last read position: 9:35 226 343 INFO 2016-01-13 12:00:07 SinglePassSamProgram Processed 87 000 000 records. Elapsed time: 00:09:56s. Time for last 1 000 000: 6s. Last read position: 9:57 627 423 INFO 2016-01-13 12:00:14 SinglePassSamProgram Processed 88 000 000 records. Elapsed time: 00:10:03s. Time for last 1 000 000: 6s. Last read position: 9:70 003 456 INFO 2016-01-13 12:00:20 SinglePassSamProgram Processed 89 000 000 records. Elapsed time: 00:10:10s. Time for last 1 000 000: 6s. Last read position: 9:96 896 206 INFO 2016-01-13 12:00:27 SinglePassSamProgram Processed 90 000 000 records. Elapsed time: 00:10:17s. Time for last 1 000 000: 7s. Last read position: 9:114 747 811 INFO 2016-01-13 12:00:34 SinglePassSamProgram Processed 91 000 000 records. Elapsed time: 00:10:24s. Time for last 1 000 000: 6s. Last read position: MT:5 758 INFO 2016-01-13 12:00:38 SinglePassSamProgram Processed 92 000 000 records. Elapsed time: 00:10:27s. Time for last 1 000 000: 3s. Last read position: MT:8 095 INFO 2016-01-13 12:00:43 SinglePassSamProgram Processed 93 000 000 records. Elapsed time: 00:10:32s. Time for last 1 000 000: 5s. Last read position: X:7 634 529 INFO 2016-01-13 12:00:50 SinglePassSamProgram Processed 94 000 000 records. Elapsed time: 00:10:39s. Time for last 1 000 000: 6s. Last read position: X:73 896 041 INFO 2016-01-13 12:00:57 SinglePassSamProgram Processed 95 000 000 records. Elapsed time: 00:10:46s. Time for last 1 000 000: 6s. Last read position: X:101 683 503 INFO 2016-01-13 12:01:04 SinglePassSamProgram Processed 96 000 000 records. Elapsed time: 00:10:53s. Time for last 1 000 000: 6s. Last read position: X:167 207 520 INFO 2016-01-13 12:01:13 SinglePassSamProgram Processed 97 000 000 records. Elapsed time: 00:11:02s. Time for last 1 000 000: 9s. Last read position: */* INFO 2016-01-13 12:01:24 SinglePassSamProgram Processed 98 000 000 records. Elapsed time: 00:11:13s. Time for last 1 000 000: 11s. Last read position: */* INFO 2016-01-13 12:01:35 SinglePassSamProgram Processed 99 000 000 records. Elapsed time: 00:11:24s. Time for last 1 000 000: 11s. Last read position: */* INFO 2016-01-13 12:01:37 RExecutor Executing R script via command: Rscript /scratch/script9078635170208800759.R metrics/ID.base_distribution_by_cycle_metrics metrics/ID.base_distribution_by_cycle.pdf sorted.mdup.bam ERROR 2016-01-13 12:01:39 ProcessExecutor Error in FUN(X[[i]], ...) : ERROR 2016-01-13 12:01:39 ProcessExecutor only defined on a data frame with all numeric variables ERROR 2016-01-13 12:01:39 ProcessExecutor Calls: plot ... plot.default -> xy.coords -> Summary.data.frame -> lapply -> FUN ERROR 2016-01-13 12:01:39 ProcessExecutor Exécution arrêtée [Wed Jan 13 12:01:39 EST 2016] picard.analysis.CollectMultipleMetrics done. Elapsed time: 11,48 minutes. Runtime.totalMemory()=761266176 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" picard.PicardException: R script nucleotideDistributionByCycle.R failed with return code 1 at picard.analysis.CollectBaseDistributionByCycle.finish(CollectBaseDistributionByCycle.java:80) at picard.analysis.SinglePassSamProgram.makeItSo(SinglePassSamProgram.java:136) at picard.analysis.CollectMultipleMetrics.doWork(CollectMultipleMetrics.java:144) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:185) at picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:125) at picard.analysis.CollectMultipleMetrics.main(CollectMultipleMetrics.java:108)


Created 2015-12-15 16:21:48 | Updated | Tags: queue picard collectmultiplemetrics
Comments (4)

Greetings,

Some of the picard's collect (e.g. CollectMultipleMetrics, CollectInsertSizeMetrics) commands load R script resources. Those resources were present in Queue <= v3.3-0 but they are absent from the subsequent versions. $ jar tf 3.3-0/Queue.jar | grep "picard.*R$" picard/analysis/baseDistributionByCycle.R picard/analysis/gcBias.R picard/analysis/meanQualityByCycle.R picard/analysis/rnaSeqCoverage.R picard/analysis/qualityScoreDistribution.R picard/analysis/insertSizeHistogram.R $ jar tf 3.4-0/Queue.jar | grep "picard.*R$" $ jar tf 3.5-0/Queue.jar | grep "picard.*R$" $ Without those resources, CollectMultipleMetrics throws something similar to java.lang.IllegalArgumentException: Script [picard/analysis/qualityScoreDistribution.R] not found in classpath depending on which commands it hits first. I thought I would be able to overcome that by adding picard.jar to CLASSPATH but it did not help. I had to add them to Queue.jar in order to make it work.

It would be nice to have those resources back in the distributed Queue.jar.

Thanks!

Roman


Created 2015-12-09 22:34:45 | Updated 2015-12-09 22:36:50 | Tags: picard reordersam
Comments (7)

Origin of the problem: GATK detected different order of the bam file and the reference file as follows:

ERROR MESSAGE: Input files reads and reference have incompatible contigs: The contig order in reads and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
ERROR reads contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, X, Y, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, MT, NT_113887, ...]
ERROR reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, NT_113887, ...]

Then I referred to the link, https://www.broadinstitute.org/gatk/guide/article?id=1328

And decided to use Picard ReorderSam tool, which led me to the issue reported here,

The Problem: Picard ReorderSam terminates with error. Command: java -Xmx110g -Djava.io.tmpdir=$workDir/merged-bams/tmp -jar ./picard/1.115/ReorderSam.jar ALLOW_INCOMPLETE_DICT_CONCORDANCE=true TMP_DIR=$workDir/merged-bams/tmp I=$workDir/merged-bams/$sample.sorted.cleaned.bam R=$refGenome O=$workDir/merged-bams/$sample.sorted.reordered.bam Error: INFO 2015-12-09 12:24:39 ReorderSam Writing reads... INFO 2015-12-09 12:24:39 ReorderSam Processing All reads [Wed Dec 09 13:20:08 CST 2015] picard.sam.ReorderSam done. Elapsed time: 55.49 minutes. Runtime.totalMemory()=15967387648 To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Read name HWUSI-EAS1612_61FV6:6:91:1510:1207#0, Read CIGAR M operator maps off end of reference at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:452) at htsjdk.samtools.BAMRecord.getCigar(BAMRecord.java:247) at htsjdk.samtools.SAMRecord.getAlignmentEnd(SAMRecord.java:460) at htsjdk.samtools.SAMRecord.computeIndexingBin(SAMRecord.java:1235) at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:1609) at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:642) at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:628) at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:598) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:514) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:488) at picard.sam.ReorderSam.writeReads(ReorderSam.java:165) at picard.sam.ReorderSam.doWork(ReorderSam.java:127) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183) at picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:124) at picard.sam.ReorderSam.main(ReorderSam.java:85) [bam_header_read] EOF marker is absent. The input is probably truncated. [bam_header_read] invalid BAM binary header (this is not a BAM file). [bam_index_core] Invalid BAM header.[bam_index_build2] fail to index the BAM file.

Attempts to fix:

  1. I performed Picard CleanSam to solve this error: java -Xmx56g -jar ./picard/1.115/CleanSam.jar I=$workDir/merged-bams/$sample.sorted.bam O=$workDir/merged-bams/$sample.sorted.cleaned.bam And the output is attached.
  2. Then reordering this cleaned sam also throws the same error as mentioned above.
  3. Next I performed Picard ValidateSamfile and the log contains the readnames with error"Read CIGAR M operator maps off end of reference"

Can you please help me get around this issue? All I really want is proceed with GATK having same order of bam and reference contigs. I have been referring vigorously to several GATK discussions, but none addresses this issues directly or has helped find a solution.


Created 2015-12-09 09:31:39 | Updated | Tags: error picard validatesamfile
Comments (9)

Hi guys, probably my bad, but couldn't find in Picard Documentation an explanation of the error "ERROR:INVALID_VERSION_NUMBER" Would someone be able to shed some light, and explain what such an error refers to, and if that is something I should worry about before processing samples through GATK? cheers, Francesco


Created 2015-11-27 14:59:22 | Updated | Tags: picard multifasta
Comments (2)

Hi!

I have a problem running Picard's CollectAlignmentSummaryMetrics: I aligned Illumina paired-end reads (processed with Trimmomatic) against a genome from NCBI RefSeq (GCF_000187205.2_ASM18720v4) using BWA (bwa-0.7.8). The reference is a multifasta with NC_017847, NC_017848 and NC_020524. Then a SAM file was created and I called Picard's SortSam, MarkDuplicates and CollectAlignmentSummaryMetrics:

java -jar picard-tools-1.119/SortSam.jar INPUT=mapped.sam OUTPUT=mapped_sorted.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT java -jar picard-tools-1.119/MarkDuplicates.jar INPUT=mapped_sorted.bam METRICS_FILE=dedup_metrics.txt OUTPUT=mapped_sorted_dedup.bam VALIDATION_STRINGENCY=LENIENT java -jar picard-tools-1.119/CollectAlignmentSummaryMetrics.jar INPUT=mapped_sorted_dedup.bam R=GCF_000187205.2_ASM18720v4.fa O=alignment_metrics.txt VALIDATION_STRINGENCY=LENIENT

The error is Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1 after calling CollectAlignmentSummaryMetrics. So, I thought it would be a good idea to check the SAM file, ran

java -jar picard-tools-1.119/ValidateSamFile.jar I=mapped_sorted_dedup.bam REFERENCE_SEQUENCE=GCF_000187205.2_ASM18720v4.fa

and got ERROR: Record 1, Read name HWI-ST751:181:C3VLMACXX:3:2106:15520:100263, Mate Alignment start should != 0 because reference name != *. ERROR: Record 1, Read name HWI-ST751:181:C3VLMACXX:3:2106:15520:100263, Alignment start should != 0 because reference name != *. [...] Exception in thread "main" htsjdk.samtools.SAMException: Exception counting mismatches for read HWI-ST751:181:C3VLMACXX:3:2106:15520:100263 1/2 101b aligned read.

Removing this read and another two reads causing the same error from the BAM file finally worked. These problematic reads have the following entries in the BAM file: HWI-ST751:181:C3VLMACXX:3:2106:15520:100263 73 NC_017847 0 37 101M = 0 0 ATTTAAAATCTTTTCTTTTTTTCTTTTAGAGGGCCTGAAACGATTGCTAATAAATGCTTTGAAAGCCTATTAAGGTACAAAAACTCCGTTTCAGCGGTACA BCCFFFFFGHHHHJIJJJJJJJJJJJJIIJJIBHHHIGEGJJJJJIJJJJJJJJJJIIJJJJHHHHHFFFFFFFDEEEEEDDDDCCDDACDDECDDDDDBD X0:i:1 X1:i:0 MD:Z:0 PG:Z:MarkDuplicates RG:Z:C5120-5141_R XG:i:0 AM:i:0 NM:i:0 SM:i:37 XM:i:4 XO:i:0 XT:A:U HWI-ST751:181:C3VLMACXX:3:2106:15520:100263 133 NC_017847 0 0 * = 0 0 GAATATTTGTACCGTAATTAAAACTTTTTGTACCGTTAAAACGGAGTATTTGTACCGCTGAAACGGAGTTTTTGTACCTTAATAGGCTTTCAAAGCATTTA CCCFFFFFHHHHHJIIJJJIJJJJJJJJJJFIIIJJJJJJJJJJIJ7BGIHGFHIJJJJGFIHHHHDD;AEEDDDEEDDDDDEEDCDDDDDDDDDDDDDD@ PG:Z:MarkDuplicates RG:Z:C5120-5141_R

HWI-ST751:181:C3VLMACXX:3:1205:16810:95215 113 NC_017847 0 37 97M NC_020524 18 0 CCGCTGAAACGGAGTTTTTGTACCTTAATAGGCTTTCAAAGCATTTATTAGCAATCGTTTCAGGCCCTCTAAAAGAAAAAAAGAAAAGATTTTAAAT 5.:;@:9CB<>@>>?;(>6666@EEDDC?76CHEHFIHGHHF;HCFF=4B9;GHE?GAB???::)9GEE>IIHIGGBGIC?@>GDD>FD:DDDD@@; X0:i:1 X1:i:0 MD:Z:0 PG:Z:MarkDuplicates RG:Z:C5120-5141_R XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:1 XO:i:0 XT:A:U HWI-ST751:181:C3VLMACXX:3:1205:16810:95215 177 NC_020524 18 37 101M NC_017847 0 0 TTTAACGGTACAAAAAGTTTTAATTACGGTACAAATATTCCTTTTTAACGGTACAAAAATTCTGGCCTTTAATTTCAATAGCTTAAAAGGTACAAAAATTC A8<<,;:3A?8ABBA>>CBBDB?<DDFFEC;FEFEE>FAF@>B@CFFFD?99?6IEFFFBF=FBC:<IIHDC@IEFEHEDDADDFEIGDFC2D?B:DB:<? X0:i:1 X1:i:0 MD:Z:101 PG:Z:MarkDuplicates RG:Z:C5120-5141_R XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U

HWI-ST751:181:C3VLMACXX:3:2215:21036:11255 177 NC_017847 0 37 101M NC_020524 135 0 TACCGCTGAAACGGAGTTTTTGTACCTTAATAGGCTTTCAAAGCATTTATTAGCAATCGTTTCAGGCCCTCTAAAAGAAAAAAAGAAAAGATTTTAAATTC 9BDDCDEDDDDDDDDDDDDDDEC>C>EDFFFFFHHHHHHJJJIGIIGIIH@IJIJJJJIIJIIIJIGIIIHGJJJJIJJJJJJJIJJJHHHHHFFFFFBB@ X0:i:1 X1:i:0 MD:Z:0 PG:Z:MarkDuplicates RG:Z:C5120-5141_R XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:2 XO:i:0 XT:A:U HWI-ST751:181:C3VLMACXX:3:2215:21036:11255 113 NC_020524 135 37 101M NC_017847 0 0 ACGGTACATTTTCTCCGTTAAAAGTCCTATAAAGGTACAAAAACTCCGAATAAGTACCTTTTAAAGCCATTTTTCCATTAACTTAAAGAATACATAAAAGG DDDEEFDDDDD@CADFFFFFGHFGHIJJIJJJJJIJJJJJIHDIGJJJJJJJIGFDIJJIIHJJIIJJIIJJIJIJHJ9IJJJJJJJJHHHHHFFFFFCCC X0:i:1 X1:i:0 MD:Z:101 PG:Z:MarkDuplicates RG:Z:C5120-5141_R XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U

As far as I can see, I would say that the problem is that the reads are mapped to different reference sequences (at least the two last ones). Is there a way to avoid this problem when using Picard?

Thanks in advance!


Created 2015-11-26 13:23:28 | Updated 2015-11-26 13:26:44 | Tags: picard collectvariantcallingmetrics
Comments (7)

I hope this is the right forum to ask about picard these days. I run CollectVariantCallingMetrics on a one-sample vcf file and get both the . variant_calling_summary_metrics and . variant_calling_detail_metrics. I would expect these to be close to identical, when only one sample is used, however, most values are different. Command: java -Djava.awt.headless=true -Xmx1500m -jar picard.jar CollectVariantCallingMetrics INPUT=${samplename}.final_variants.vcf.gz DBSNP=${DBSNP} OUTPUT=${samplename}.CollectVariantCallingMetrics

output:

METRICS CLASS picard.vcf.CollectVariantCallingMetrics$VariantCallingDetailMetrics
SAMPLE_ALIAS HET_HOMVAR_RATIO TOTAL_SNPS NUM_IN_DB_SNP NOVEL_SNPS FILTERED_SNPS PCT_DBSNP DBSNP_TITV SAMPLENAME 1.515108 80563 79938 625 27362 0.992242 2.431405

METRICS CLASS picard.vcf.CollectVariantCallingMetrics$VariantCallingSummaryMetrics
TOTAL_SNPS NUM_IN_DB_SNP NOVEL_SNPS FILTERED_SNPS PCT_DBSNP DBSNP_TITV 84172 82347 1825 29284 0.978318 2.401504

picard version= 1.141


Created 2015-10-23 18:15:22 | Updated | Tags: picard markduplicates
Comments (2)

Hello, I've been using the Picardtools MarkDuplicates tool. I'd like to identify which reads are duplicates of each other (ie. if read.1234 is a duplicate of read.5678, I want to be able to retrieve this relationship). Does the MarkDuplicates output indicate this in any way? While I could group reads together if they share the same start coordinate listed in the BAM file, this gets a little tricky if the reads align to the minus strand, or if there are mismatches in the first couple of nucleotides in the read. I think the MarkDuplicates program must be collecting this information behind the scenes when it's finding duplicates. Thank you very much for your help.


Created 2015-10-21 08:15:13 | Updated | Tags: picard trimming readadaptortrimmer
Comments (3)

Hi,

Is there a "Best Practices" for how to use ReadAdaptorTrimmer? To me it seems that there is a Catch 22, if one wants to use GATK and Picard.

According to the ReadAdaptorTrimmer documentation: "Read data MUST be in query name ordering as produced, for example with Picard's FastqToBam". Therefore, I would start by doing

java picard.jar FastqToSam FASTQ={r1_file} FASTQ2={r2_file} OUTPUT={bam_file} SM={sample} SORT_ORDER=queryname

to convert my FASTQ files into a sorted uBAM file. However, ReadAdaptorTrimmer requires the BAM file to be indexed, but if I then try

java picard.jar BuildBamIndex INPUT={bam_file}

it fails because BuildBamIndex requires that the BAM file is sorted by coordinate (which does not make sense since the reads are not yet aligned).

Thanks, Michael


Created 2015-10-19 19:49:00 | Updated | Tags: picard collectoxogmetrics
Comments (3)

Hi there, I'm having trouble getting picard's CollectOxoGMetrics to produce output. Specifically, commands like:

java -Xmx2g -jar ~/tools/picard-tools-1.140/picard.jar CollectOxoGMetrics I=test.bam O=test.oxog.csv R=~/resources/GRCh38.p2.fa

produces a "test.oxog.csv" file with some header information including the command line arguments, etc., but no other content besides a few blank lines. No warning or error information is emitted, and other metrics commands seem to work just fine. I've tried a couple of different bams now (one in-house generated, one from Illumina) and got the same results for each one. Am I missing something obvious? Thanks!


Created 2015-10-02 12:37:21 | Updated | Tags: picard resource-bundle
Comments (7)

I want to compare my vcf file to a vcf file supplied in the GATK bundle using Picard GenotypeConcordance. In the terminology used by Picard GenotypeConcordance I want to use a vcf file in the bundle as the "truth sample."

The problem is the vcf files in the bundle lack the sample name needed by Picard GenotypeConcordance.

That is, there is no value in these supplied vcf files to satisfy the Picard GenotypeConcordance required option: TRUTH_SAMPLE (String) The name of the truth sample within the truth VCF Required.

Take dbsnp_138.hg19.vcf.gz as an example:

$ zcat dbsnp_138.hg19.vcf.gz | grep CHROM
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO

Based on the description of the vcf file format described elsewhere on this GATK site https://broadinstitute.org/gatk/guide/article?id=1268 I expect to see a FORMAT field and a sample name field following the INFO field.

How should I proceed?


Created 2015-09-27 12:23:40 | Updated | Tags: picard markduplicates libraries
Comments (6)

I was hoping this had been addressed already on the forum, but I've not seen a definitive answer although I have seen a similar question posed on this and other forums.

Our current mark duplicate procedure using Picard MarkDuplicates is to run merges across lane data generated from the same library. I believe this makes sense, and once duplicates are marked, then library level merges are combined to create a sample level, multi-library bam file. Any duplicates found across libraries would not be expected to be PCR duplicates but instead just identical fragments.

It's not clear though whether Picard MarkDuplicates is library aware....ie. when it does mark duplicates does it account for read pairs only from the same library, or if run against a bam merge generated from multiple libraries, will it mark any duplicates it finds.

I don't see this addressed in the documentation, so I assume that is not the case, but I have seen suggestions elsewhere that it might be so.


Created 2015-09-26 12:01:02 | Updated | Tags: bam picard htsjdk
Comments (2)

Dear colleagues,

I am running a third-party program that makes use of the htsjdk library. For some of my samples (BWA alignments) it raises the following exception:

Exception in thread "Thread-5" java.lang.IllegalStateException: Inappropriate call if not paired read at net.sf.samtools.SAMRecord.requireReadPaired(SAMRecord.java:655) at net.sf.samtools.SAMRecord.getMateUnmappedFlag(SAMRecord.java:682)

It seems there may be some problem inside the BAM file. However, after running Picard's ValidateSamFile it only finds 1,155 INVALID_MAPPING_QUALITY errors (no errors related to mate reads). Anyhow, I repaired the bam file using Picard's CleanSam and rerun the third-party program. The "Inappropriate call if not paired read" error persisted.

How could I solve this, please?

Thanks in advance, Federico


Created 2015-09-16 18:12:09 | Updated | Tags: java picard
Comments (4)

Hi GATK team!

I have an issue with Picard SamToFastq.

I ran on a 16 GB BAM this command (on the latest version of Picard, I tried several):

java -Djava.io.tmpdir=/scratch -XX:ParallelGCThreads=8 -Dsamjdk.use_async_io=true -Dsamjdk.buffer_size=4194304 -Xmx12G -jar /picard/dist/picard.jar SamToFastq \
  VALIDATION_STRINGENCY=LENIENT \
  INPUT=input.bam \
  FASTQ=output.pair1.fastq.gz \
  SECOND_END_FASTQ=output.pair2.fastq.gz

With this command-line, I was able to process WES BAM two or three times larger without the RAM complaining at all.

On this occurence, I have this result (and I tweaked MAX_RECORDS_IN_RAM argument, change from a 24 GB RAM machine to a 48 GB RAM machine):

[Wed Sep 16 12:17:28 EDT 2015] picard.sam.SamToFastq INPUT=input.bam FASTQ=output.pair1.fastq.gz SECOND_END_FASTQ=output.pair2.fastq.gz VALIDATION_STRINGENCY=LENIENT    OUTPUT_PER_RG=false RG_TAG=PU RE_REVERSE=true INTERLEAVE=false INCLUDE_NON_PF_READS=false READ1_TRIM=0 READ2_TRIM=0 INCLUDE_NON_PRIMARY_ALIGNMENTS=false VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Wed Sep 16 12:17:28 EDT 2015] Executing as emixaM@glop on Linux 2.6.32-504.23.4.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-ea-b07; Picard version: 1.139(fd19c75b9a42d82cb57d45b53e1f93f0a3588541_1442367311) JdkDeflater
INFO    2015-09-16 12:17:36 SamToFastq  Processed     1 000 000 records.  Elapsed time: 00:00:07s.  Time for last 1 000 000:    7s.  Last read position: chr1:19 244 878
INFO    2015-09-16 12:17:45 SamToFastq  Processed     2 000 000 records.  Elapsed time: 00:00:17s.  Time for last 1 000 000:    9s.  Last read position: chr1:38 435 197
INFO    2015-09-16 12:17:57 SamToFastq  Processed     3 000 000 records.  Elapsed time: 00:00:29s.  Time for last 1 000 000:   12s.  Last read position: chr1:63 999 267
INFO    2015-09-16 12:18:10 SamToFastq  Processed     4 000 000 records.  Elapsed time: 00:00:42s.  Time for last 1 000 000:   13s.  Last read position: chr1:101 467 007
INFO    2015-09-16 12:18:17 SamToFastq  Processed     5 000 000 records.  Elapsed time: 00:00:48s.  Time for last 1 000 000:    6s.  Last read position: chr1:144 192 193
INFO    2015-09-16 12:18:36 SamToFastq  Processed     6 000 000 records.  Elapsed time: 00:01:07s.  Time for last 1 000 000:   19s.  Last read position: chr1:154 917 552
INFO    2015-09-16 12:18:43 SamToFastq  Processed     7 000 000 records.  Elapsed time: 00:01:14s.  Time for last 1 000 000:    7s.  Last read position: chr1:171 501 716
INFO    2015-09-16 12:19:01 SamToFastq  Processed     8 000 000 records.  Elapsed time: 00:01:33s.  Time for last 1 000 000:   18s.  Last read position: chr1:201 869 422
INFO    2015-09-16 12:19:08 SamToFastq  Processed     9 000 000 records.  Elapsed time: 00:01:40s.  Time for last 1 000 000:    6s.  Last read position: chr1:230 511 835
INFO    2015-09-16 12:19:15 SamToFastq  Processed    10 000 000 records.  Elapsed time: 00:01:46s.  Time for last 1 000 000:    6s.  Last read position: chr2:20 182 092
INFO    2015-09-16 12:19:57 SamToFastq  Processed    11 000 000 records.  Elapsed time: 00:02:28s.  Time for last 1 000 000:   41s.  Last read position: chr2:52 391 334
INFO    2015-09-16 12:20:55 SamToFastq  Processed    12 000 000 records.  Elapsed time: 00:03:26s.  Time for last 1 000 000:   58s.  Last read position: chr2:87 420 712
[Wed Sep 16 12:38:29 EDT 2015] picard.sam.SamToFastq done. Elapsed time: 21,03 minutes.
Runtime.totalMemory()=11453595648
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at htsjdk.samtools.BinaryTagCodec.readNullTerminatedString(BinaryTagCodec.java:414)
    at htsjdk.samtools.BinaryTagCodec.readSingleValue(BinaryTagCodec.java:318)
    at htsjdk.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:282)
    at htsjdk.samtools.BAMRecord.decodeAttributes(BAMRecord.java:308)
    at htsjdk.samtools.BAMRecord.getAttribute(BAMRecord.java:288)
    at htsjdk.samtools.SAMRecord.getReadGroup(SAMRecord.java:691)
    at picard.sam.SamToFastq.doWork(SamToFastq.java:166)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Do you have an idea on what I can change to not cross this overhead limit?

Cheers!


Created 2015-09-14 20:57:40 | Updated | Tags: picard
Comments (1)

Our group is trying to get a bwa-sampe/picard pipeline running on a windows cluster, and we're having problems piping to picard.

On unix, we'd use 'java -jar picard.jar SortSam I=/dev/stdin ...' but windows doesn't recognize /dev/stdin (even under cygwin) and complains that the file doesn't exist.

We can, of course, write the sam file out in its entirety, but we'd rather save the time and writing to disk, if possible.

Anyone have a solution (other than switch to unix)? Thanks in advance.


Created 2015-09-03 07:52:38 | Updated | Tags: picard markduplicates
Comments (3)

Hi there,

Hope someone can shed some light on this issue.

I have problem running picard-tools MarkDuplicates. I get an error "No space left on device". Having a bit of a search I found people mention that it might be an issue with the tmpdir folder specified. However the folder I'm using for tmpdir is massive (72GB). Looking a bit more at the error log, I found the retain data points before spilling to disk line.

It had a number that matched very closely to the number of records read before the error message. (28872640 vs 29,000,000)

INFO 2015-09-03 15:53:32 MarkDuplicates Will retain up to 28872640 data points before spilling to disk. ... INFO 2015-09-03 15:55:50 MarkDuplicates Read 29,000,000 records. Elapsed time: 00:02:18s. Time for last 1,000,000: 4s. Last read position: chr7:39,503,936 INFO 2015-09-03 15:55:50 MarkDuplicates Tracking 195949 as yet unmatched pairs. 13309 records in RAM. [Thu Sep 03 15:55:53 EST 2015] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 2.35 minutes. Runtime.totalMemory()=6107234304 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: java.io.IOException: No space left on device at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:245) at htsjdk.samtools.util.SortingCollection.add(SortingCollection.java:165) at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:281) at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:114) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105) Caused by: java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:318) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126) at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:127) at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:100) at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:137) at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:236) ... 6 more

I had a play around with the memory option of java (-Xmx??g) when I issue my MarkDuplicates call, and I see that increase in memory increase the number of data points before spilling to disk. This then increase the number of records read before my "No sapce left in device" error.

eg -Xmx16g gave me 59674689 data points before spilling to disk and I got up to 60,000,000 records read before "no space left on device" error.

I know I can increase my memory to allow for more records, but there is a limit to doing that if I have a huge bam.

What I would like to know is what does "Will retain up to 28872640 data points before spilling to disk." actually mean. I thought it was a safe guard for memory usage, where if the number of records/data point is excceeded then some will be written to file, thus allowing more records to be read. This mean you can still process a large bam with only a small amount of memory. But it does not seem to work that way from what I'm seeing.

My entry "java -Xmx16g -Djavaio.tmpdir=/short/a32/working_temp -jar $PICARD_TOOLS_DIR/picard.jar MarkDuplicates INPUT=output.bam OUTPUT=output.marked.bam METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT "

Hope you can help and thanks for your assistance in advance. Eddie


Created 2015-08-31 10:15:59 | Updated | Tags: picard filtervariants
Comments (14)

Dear all, I have a problem using Picard tool FilterVcf. I first run the command:

java -jar picard.jar FilterVcf \ MIN_AB=10.0 \ INPUT=12-3455_S6.final.vcf \ OUTPUT=test.vcf

And I receive the error:

[Mon Aug 31 12:04:53 CEST 2015] Executing as smerella@b002 on Linux 2.6.32-279.2.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18; Picard version: 1.138() JdkDeflater [Mon Aug 31 12:04:56 CEST 2015] picard.vcf.filter.FilterVcf done. Elapsed time: 0.06 minutes. Runtime.totalMemory()=1138753536 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" java.lang.IllegalArgumentException: A reference dictionary is required for creating Tribble indices on the fly at htsjdk.variant.variantcontext.writer.VariantContextWriterBuilder.build(VariantContextWriterBuilder.java:389) at picard.vcf.filter.FilterVcf.doWork(FilterVcf.java:97) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Then I tried with: java -jar picard.jar FilterVcf \ REFERENCE_SEQUENCE=/lustre1/genomes/hg19/fa/hg19 \ MIN_AB=10.0 \ INPUT=12-3455_S6.final.vcf \ OUTPUT=test.vcf

And I received the same error. In the directory /lustre1/genomes/hg19/fa/ I have the files hg19.dict hg19.fa hg19.fa.fai. I don't understand what I am doing wrong.

Thanks for the help.

[smerella@b002 scratch_bmc]$ java -jar picard.jar FilterVcf --version 1.138()


Created 2015-08-20 17:51:21 | Updated | Tags: picard markduplicates
Comments (9)

Is it possible to retrieve Picard duplication metrics without running MarkDuplicates? I would like to get these metrics for a merged bam file where the original bams already have dups flagged, and did not go through the same PCR.


Created 2015-08-20 10:56:07 | Updated | Tags: picard metrics
Comments (4)

Hi all, I have 2 questions about the Picard metrics. Sorry if they are stupid or you have already answered but I didn't find anything googling or searching in this forum.

  1. HsMetrics: in the page of metrics definition it is declared that the tool returns the percentage of all target bases achieving 2x,10x,20x,30,40x,50x,100x but when I run the command in the output file there are only 2x,10x,20x,30x. Is that correct? Is there any option I have to use to add information about 40x,50x and 100x? My command is:

java -jar CalculateHsMetrics.jar \ BAIT_INTERVALS=PICARD_interval_list_CANCER.list \ REFERENCE_SEQUENCE=hg19.fa \ TARGET_INTERVALS=PICARD_interval_list_CANCER.list \ BAIT_SET_NAME=CANCER \ METRIC_ACCUMULATION_LEVEL=SAMPLE \ INPUT=sample.bam\ OUTPUT=sample.metrics

  1. In the PICARD metric page I found metrics about VariantCalling: https://broadinstitute.github.io/picard/picard-metric-definitions.html#CollectVariantCallingMetrics.VariantCallingDetailMetrics But I didn't find the associated Picard tool. Is that an old metric not in use anymore?

Thanks for the help, hope to not bother you.

java -jar CalculateHsMetrics.jar --version 1.66(1178) Linux version 2.6.32-279.2.1.el6.x86_64 (mockbuild@c6b7.bsys.dev.centos.org)


Created 2015-08-12 23:44:49 | Updated 2015-08-12 23:53:26 | Tags: liftovervariants vcf hapmap picard liftovervcf
Comments (1)

I am having a problem with picard's LiftoverVcf.

I am trying to Liftover hapmap files (downloaded plink files from hapmap and converted to vcf using plink) from ncbi36 to hg38. I was able to do this with GATK LiftoverVariants. My problem came when I had to merge the hapmap.hg38 with some genotype files (that I liftover from hg19 to hg38 using GATK LiftoverVariants). I am merging them so that I can run population stratification using plink. I used vcf-merge but it complained that a SNP has different reference allele in both files: rs3094315, should be reference allele G (which was correct in the genotype.hg38 files but in the hapmap.hg38 files it was wrong). I also first tried to lift hapmap.ncbi36 to hg19 then to hg38 but the offending allele was still there. So I decided to try and lift the hapmap.ncbi36 using LiftoverVCF from picard.

  1. I downloaded the newest picard build (20 hours old) picard-tools-1.138.
  2. Used the command: java -jar -Xmx6000m ../../../tools/picard-tools-1.138/picard.jar LiftoverVcf I=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.vcf O=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.vcf C=../../../tools/liftover/chain_files/hg18ToHg38.over.chain REJECT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.reject.vcf R=../../../data/assemblies/hg38/hg38.fa VERBOSITY=ERROR

Here is the run: [Thu Aug 13 00:43:40 CEST 2015] picard.vcf.LiftoverVcf INPUT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.vcf OUTPUT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.vcf CHAIN=......\tools\liftover\chain_files\hg18ToHg38.over.chain REJECT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.reject.vcf REFERENCE_SEQUENCE=......\data\assemblies\hg19\assemble\hg38.fa VERBOSITY=ERROR QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json

Here is the error: Exception in thread "main" java.lang.IllegalStateException: Allele in genotype A not in the variant context [T, C] at htsjdk.variant.variantcontext.VariantContext.validateGenotypes(VariantContext.java:1357) at htsjdk.variant.variantcontext.VariantContext.validate(VariantContext.java:1295) at htsjdk.variant.variantcontext.VariantContext.(VariantContext.java:410) at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:496) at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:490) at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:200) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

  1. I have no idea which SNP is the problem.
  2. I do not know what T* means (does not seem to exist in the file).
  3. I am new to picard so I thought VERBOSE=ERROR will give me something more but nothing more appeared.
  4. Given that lifting hapmap.ncbi36 to hg19 then to hg38 produced the same erroneous reference allele I suppose lifting will not fix this and I will have to work with dnsnp to correct my file. Do you know how I can change reference allele in a vcf? Is there a tool for this? Is there a liftover tool for dbsnp?
  5. As a side note I want to make picard work because I read that you will be deprecating the GATK liftover and will support the picard liftover (at some point in the future) so help with this tool will be appreciated.

Created 2015-07-29 18:59:07 | Updated | Tags: bwa picard
Comments (2)

I noticed that the current build of picard contains a 3rd party version of bwa 0.5.9-r16, the version.txt which contains the text:

"patched by AW, and multithread samse/sampe patch applied."

I have an external collaborator who is interested in applying this patch. Could someone provide download and instructions for doing so, or for accessing this patched version of BWA? Or can I just send the bwa executable directly to the collaborator?

Thanks!


Created 2015-07-10 22:42:44 | Updated | Tags: vcf picard
Comments (2)

Hi I would like to know how can I return the entire contents of a VCF file without having to know the chromosome/start/end parameters. The VCFFileReader class has a query method that can return parts of the VCF like as:

VCFFileReader.query(chromosome, startPos, endPos);

What I would like is to be able to say: 'return everything the VCF has'. Is there a way to do this with picard?

Martin


Created 2015-06-21 13:14:36 | Updated | Tags: vcf picard
Comments (1)

What is the fastest way of getting the total number of variants in a VCF file? (using picard-tools-1.119, via SplitVcfs.jar.)

So far the fastest way I could have done it was this:

private static int getNumVariants(VCFFileReader reader) { int totalVariants = 0; final CloseableIterator iterator = reader.iterator(); while (iterator.hasNext()) {iterator.next(); totalVariants++; } iterator.close();

  return totalVariants;

}

  • but this appears to iterate through the entire VCF file which for large files seems very inefficient...

I am thinking that there must be a faster way. After all, the number of variants is simply: total number of lines in file - number of lines in header?

Any way to get this?

Thanks Martin


Created 2015-06-21 13:08:56 | Updated 2015-06-21 13:10:27 | Tags: picard vcftools vcf-format
Comments (2)

I am trying to open small uncompressed VCFs via picard-tools-1.119, via SplitVcfs.jar.

File vcfFile = new File("test2.vcf"); VCFFileReader vcfReader = new VCFFileReader(vcfFile,false);

Even though I've specified that an index is not required in the constructor, this generates the following error:

htsjdk.tribble.TribbleException: Index not found for: MYLOCATION\test2.vcf at htsjdk.tribble.TribbleIndexedFeatureReader.query(TribbleIndexedFeatureReader.java:253) at htsjdk.variant.vcf.VCFFileReader.query(VCFFileReader.java:124) (if I don't specify the index parameter, then it generates a different error:"An index is required, but none found" )

I've attached a test file so you can see it for yourself. ( NOTE: I've had to compress it into a .zip, otherwise the forum didn't allow me to attach the file itself, but if you want to replicate my results you will need to uncompress it and try loading the plain file (test2.vcf).

My question: is there any way to open uncompressed VCF files in picard via VCFFileReader or otherwise?

Thanks Martin

PS: these are small VCFs that would be exported out from larger datasets, so compressing/indexing them would seem like an unnecessary extra step.


Created 2015-06-01 10:54:07 | Updated | Tags: best-practices picard mapping
Comments (4)

I am trying to follow the best practices for mapping my (Paired-end Illumina HiSeq) reads to the reference, by following this presentation:

From what I understand, I should use MergeBamAlignment to clean up the output from bwa, and then use this cleaned up output for the rest of the analysis. However, when I run ValidateSamFile after running MergeBamAlignment I get a lot of errors, and running CleanSam on the file does not resolve any of them. What am I doing wrong? I've tried searching the web for more details about MergeBamAlignment but I haven't been able to find much. Please let me know if you require any additional information.

How I ran MergeBamAlignment picard-tools MergeBamAlignment \ UNMAPPED_BAM=unmapped_reads.sam \ ALIGNED_BAM=aligned_reads.sam \ OUTPUT=aligned_reads.merged.bam \ REFERENCE_SEQUENCE=/path/to/reference.fasta \ PAIRED_RUN=true # Why is this needed?

Error report from ValidateSamFile

HISTOGRAM java.lang.String

Error Type Count
ERROR:INVALID_CIGAR 5261
ERROR:MATES_ARE_SAME_END 30
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 30

Created 2015-05-30 13:45:41 | Updated | Tags: indelrealigner picard reads
Comments (5)

Hello,

I'm quite new to SNP calling. I am trying to setup a pipeline which includes GATK IndelRealigner as a final step. My bam file (before realignment) is a little over 1GB. After running the indel realigner however, it's reduced to 18MB! I'm assuming its throwing out way too many reads or something has gone wrong.

I'm calling the indel realigner with the default options as follows:

java -Xmx16g -jar $GATK_DIR/GenomeAnalysisTK.jar \
   -T IndelRealigner \
   -R /path/to/my/ref \
   -I input.bam.intervals \
   -targetIntervals input.bam.intervals \
   -o realn.bam \

I am generating the read groups using AddOrReplaceReadGroups.jar (from picard tools) and interval file using GATK RealignerTargetCreator with default options.

My bam file was generated off the raw reads of experiment SRA181417 fetched from SRA (after cleaning adapters using cutadapt, mapping to reference using bwa-mem, and removing duplicate reads using picard tools)

I have tried this on other reads and do not have the same issue. Can anyone comment on why indel realigner could be throwing out so many reads.

Thank you


Created 2015-05-09 14:48:21 | Updated | Tags: best-practices bam third-party-tools picard
Comments (1)

Following GATK's best practices, I have individually realigned/recalibrated my sample-lane bams and merged them by sample:

sample1_lane1.dedup.realn.recal.bam --> sample1_lane2.dedup.realn.recal.bam --> sample1.merged.bam sample2_lane1.dedup.realn.recal.bam --> sample2_lane2.dedup.realn.recal.bam --> sample2.merged.bam ...

I am ready to dedup and realign my sample merged bams, however I am uncertain on the best approach. Is the consensus to convert back to fastq via Picard (MarkDuplicates, SortSam, and SamToFastq) and then run bwa mem? Or is it more expedient/accurate to realign the sample-merged bam using bwa aln followed by bwa sampe?


Created 2015-03-29 01:11:42 | Updated 2015-03-29 01:14:42 | Tags: picard
Comments (4)

I have 16 libraries that were pooled and sequenced on the same lane. When using AddOrReplaceReadGroups.jar, should I create a unique RGLB name for each library (e.g., lib1, lib2, lib3, etc.)? Apologies if this sounds like a silly question, but I have seen folks on seq-answers use the same name (e.g., lib1) for the RGLB flag when libraries were sequenced on the same lane.

Thanks!


Created 2015-03-26 16:03:02 | Updated | Tags: bam picard nullpointerexception fixmateinformation
Comments (3)

Hello,

I was asked to re-post this question here. It was originally posted in the Picard forum at GitHub at https://github.com/broadinstitute/picard/issues/161.

Regards,

Bernt


ORIGINAL POST (edited)

There seems to be a problems with FixMateInformation crashing with

Exception in thread "main" java.lang.``````NullPointerException``````
at htsjdk.samtools.SamPairUtil.setMateInformationOnSupplementalAlignment(SamPairUtil.java:300)
at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.advance(SamPairUtil.java:442)
at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.next(SamPairUtil.java:454)
at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.next(SamPairUtil.java:360)
at picard.sam.FixMateInformation.doWork(FixMateInformation.java:194)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:185)
at picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:125)
at picard.sam.FixMateInformation.main(FixMateInformation.java:93)

The problem first appeared in version 1.121. It is present in version 1.128. Versions prior to 1.120 worked and continue to work fine. I am currently using Java 1.7.0_75, but I observed the same problem with earlier version of Java. The problem occurs under several different version of Fedora.

The command lines I am using are:

java -jar picard-1.128/picard.jar FixMateInformation INPUT=test.bam OUTPUT=fixed.bam (fails)

java -jar picard-1.121/FixMateInformation.jar INPUT=test.bam OUTPUT=fixed.bam (fails)

java -jar picard-1.120/FixMateInformation.jar INPUT=test.bam OUTPUT=fixed.bam (succeeds)

I have observed the problem with various BAM files. This one is (a small subset of) the output of an indel realignment with GATK.


Later in the same thread:

ValidateSam produces: java -jar /opt/ghpc/picard-1.121/ValidateSamFile.jar INPUT=test.bam OUTPUT=out.bam [Wed Feb 18 08:48:40 CET 2015] picard.sam.ValidateSamFile INPUT=test.bam OUTPUT=out.bam MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false [Wed Feb 18 08:48:40 CET 2015] Executing as bernt@interactive.ghpc.dk on Linux 2.6.35.14-106.fc14.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13; Picard version: 1.121(da291b4d265f877808b216dce96eaeffd2f30bd3_1411396652) IntelDeflater [Wed Feb 18 08:48:41 CET 2015] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes. Runtime.totalMemory()=505937920 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp


And later:

The problem also exists in the new version, 1.129.


New

Re-posted in GATK forum

Picard (1.129)'s ValidateSamFile complains about Mate not found - which was the reason for running FixMateInformation in the first place.

The output is:

java -jar /opt/ghpc/picard-current/picard.jar ValidateSamFile I=test.bam

[Thu Mar 26 16:44:49 CET 2015] picard.sam.ValidateSamFile INPUT=test.bam    MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Thu Mar 26 16:44:49 CET 2015] Executing as bernt@interactive.ghpc.dk on Linux 2.6.35.14-106.fc14.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13; Picard version: 1.129(b508b2885562a4e932d3a3a60b8ea283b7ec78e2_1424706677) IntelDeflater
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2215:5439:78978, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1216:7853:25411, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2301:9078:52020, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2104:18417:29553, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2310:18752:24451, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2310:6551:24766, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1105:9672:78339, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2112:20003:44801, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2213:8473:74864, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2306:11852:94726, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1110:11106:17369, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2215:12401:47072, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2312:13964:14859, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1312:3886:41184, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1206:12827:34659, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1107:18908:98983, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1313:7640:45146, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1306:1595:15034, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2209:2036:47281, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1201:6826:100382, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2213:4861:63517, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2204:10202:63100, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1207:7125:93640, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1101:9691:36089, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2211:1839:100174, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2312:7331:16518, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2303:13396:44533, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1103:15274:86897, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2110:1541:39614, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1206:10320:20874, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2104:12084:25830, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2115:6231:35664, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1106:5365:6728, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1201:5887:87680, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1204:9449:99890, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2207:6920:91927, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1113:17505:78862, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2311:19423:17546, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2303:6787:39570, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1116:6350:25293, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1305:15016:58323, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1116:10894:97830, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2306:13179:38191, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1301:11303:99731, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2102:13726:37821, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2312:11652:76919, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1208:4895:32748, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1106:9371:79983, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2111:1798:22917, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1107:1267:20231, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1109:15189:92031, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2302:9045:63944, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1102:14247:57062, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2305:7407:36655, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2204:12584:72228, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1111:18302:40904, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2316:8382:94789, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2109:12845:82338, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1206:10557:31568, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2210:14790:11210, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1303:7824:5423, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2111:9909:100689, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2202:16293:94205, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1102:16519:74708, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1305:10365:69588, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2102:8288:100810, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1311:17645:65928, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1109:17819:68329, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2206:3160:52730, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1112:18820:52584, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1108:4475:4687, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2205:7334:35631, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2106:9384:64665, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2316:12960:78271, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1104:3451:71528, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2211:21055:28695, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2202:13814:96357, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2111:17147:10853, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2106:20520:88043, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1214:2637:77724, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1109:9367:35640, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1215:11379:23758, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1304:17507:91188, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2204:12459:100042, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2216:8585:77239, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1313:12667:24591, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1316:10367:5281, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1315:15333:2359, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1206:5534:7650, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2102:4820:93659, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2104:6528:72676, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2305:7297:76200, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1315:5361:88165, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2305:17200:26640, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1302:2356:100479, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1101:3217:24975, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2314:1898:42432, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:1316:5424:4897, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2104:16620:81246, Mate not found for paired read
ERROR: Read name HWI-D00474:91:C62ARANXX:8:2102:15822:17446, Mate not found for paired read
Maximum output of [100] errors reached.
[Thu Mar 26 16:44:50 CET 2015] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=505937920
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

Please let me know where to send the test BAM file.

Regards,

Bernt


Created 2015-03-17 22:54:15 | Updated | Tags: bam picard
Comments (5)

Hi,

I am attempting to merge the output of tophat in order to run some RNASeq QC metrics. This is a single read 50 bp on a Hiseq. In order to get past the fact that tophat gives a MAPQ of 255 to unmapped reads (not 0 as expected by Picard) I used the following tool ( https://github.com/cbrueffer/tophat-recondition) to change it.

Once completed, I added read groups using picard and then sorted the accepted_hits.bam by coordinate and sorted the unmapped reads by queryname.

tophat-recondition.py /home/ryan/NGS_Data/No_Dox

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ AddOrReplaceReadGroups \ I=/home/ryan/NGS_Data/unmapped_fixup.bam \ O=/home/ryan/NGS_Data/unmapped_fixup-RG.bam \ RGID=No_Dox RGLB=No_Dox RGPL=illumina RGPU=GCCAAT RGSM=No_Dox

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ AddOrReplaceReadGroups \ I=/home/ryan/NGS_Data/accepted_hits.bam \ O=/home/ryan/NGS_Data/accepted_hits-RG.bam \ SORT_ORDER=coordinate \ RGID=No_Dox RGLB=No_Dox RGPL=illumina RGPU=GCCAAT RGSM=No_Dox \ CREATE_INDEX=true

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ SortSam \ I=/home/ryan/NGS_Data/unmapped_fixup-RG.bam \ O=/home/ryan/NGS_Data/unmapped_fixup-RG-sorted.bam \ SORT_ORDER=queryname

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ SortSam \ I=/home/ryan/NGS_Data/accepted_hits-RG.bam \ O=/home/ryan/NGS_Data/accepted_hits-RG-sorted.bam \ SORT_ORDER=coordinate \ CREATE_INDEX=true

java -Xmx2g -jar /home/ryan/jar/picard-tools-1.129/picard.jar \ MergeBamAlignment \ UNMAPPED_BAM=/home/ryan/NGS_Data/unmapped_fixup-RG-sorted.bam \ ALIGNED_BAM=/home/ryan/NGS_Data/accepted_hits-RG-sorted.bam \ O=/home/ryan/NGS_Data/merge_unmapped_accepted_hits_No_Dox.bam \ SORT_ORDER=coordinate \ REFERENCE_SEQUENCE=/home/ryan/Reference/human_spikein/hg19_spikein.fa \ PROGRAM_RECORD_ID=Tophat \ PROGRAM_GROUP_VERSION=0.1 \ PROGRAM_GROUP_COMMAND_LINE=tophat/dummy \ PAIRED_RUN=false \ VALIDATION_STRINGENCY=LENIENT \ CREATE_INDEX=true

I then get the following warning followed by the error:

WARNING 2015-03-17 09:44:22 SamAlignmentMerger Exception merging bam alignment - attempting to sort aligned reads and try again: Aligned record iterator (HS3:608:C6LNLACXX:7:2101:9946:63417) is behind the unmapped reads (HS3:608:C6LNLACXX:7:2101:9947:11009) INFO 2015-03-17 09:44:33 SamAlignmentMerger Read 1000000 records from alignment SAM/BAM. INFO 2015-03-17 09:44:43 SamAlignmentMerger Read 2000000 records from alignment SAM/BAM. .... INFO 2015-03-17 09:58:01 SamAlignmentMerger Read 96000000 records from alignment SAM/BAM. INFO 2015-03-17 09:58:09 SamAlignmentMerger Read 97000000 records from alignment SAM/BAM. INFO 2015-03-17 09:58:15 SamAlignmentMerger Finished reading 97571897 total records from alignment SAM/BAM. [Tue Mar 17 09:58:16 PDT 2015] picard.sam.MergeBamAlignment done. Elapsed time: 14.32 minutes. Runtime.totalMemory()=1908932608 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" java.lang.IllegalStateException: Aligned record iterator (HS3:608:C6LNLACXX:7:1101:10000:11036) is behind the unmapped reads (HS3:608:C6LNLACXX:7:1101:10000:48402) at picard.sam.AbstractAlignmentMerger.mergeAlignment(AbstractAlignmentMerger.java:383) at picard.sam.SamAlignmentMerger.mergeAlignment(SamAlignmentMerger.java:153) at picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:248) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

I have searched and have been unsuccessful at resolving this problem. Any ideas?

I am running picard 1.129 using java version "1.7.0_60"

ValidateSamFile on both inputs into MergeBamAlignment returns no errors in those files. I am at a lost and seeking for help!

Thanks,

Ryan


Created 2015-01-26 17:10:50 | Updated | Tags: picard
Comments (2)

Hey guys,

I was running MarkDuplicates with BWA output, and I got a error "java.lang.NoClassDefFoundError: java/lang/ref/Finalizer$2". To get around it, I tried the following:

1) Used different versions of Picard. I tried 1.90, 1.119, 1.228. None of them worked. 2) I built picard.jar from source code. Still didn't work. 3) Used -M in BWA mem, no luck. When I used aln and sampe, I still got the same error. 4) So I tried MarkDuplicates on RNA-Seq data, it worked perfectly on TopHat output!

So the conclusion I've reached is that MarkDuplicats doesn't work well with BWA output. Anyone knows how to deal with this situation? Thank you in advance!

More information: I'm using openjdk version "1.6.0-internal", Linux system.

The error message: Exception in thread "main" java.lang.NoClassDefFoundError: java/lang/ref/Finalizer$2 at java.lang.ref.Finalizer.runFinalization(Finalizer.java:144) at java.lang.Runtime.runFinalization0(Native Method) at java.lang.Runtime.runFinalization(Runtime.java:705) at java.lang.System.runFinalization(System.java:967) at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:58) at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:49) at htsjdk.samtools.util.ResourceLimitedMap.get(ResourceLimitedMap.java:76) at htsjdk.samtools.CoordinateSortedPairInfoMap.getOutputStreamForSequence(CoordinateSortedPairInfoMap.java:180) at htsjdk.samtools.CoordinateSortedPairInfoMap.put(CoordinateSortedPairInfoMap.java:164) at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.put(DiskBasedReadEndsForMarkDuplicatesMap.java:65) at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:290) at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:114) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)


Created 2015-01-18 13:28:03 | Updated 2015-01-18 13:29:37 | Tags: bam picard index preprocess
Comments (2)

Greeting all.

Currently, I have been using Picard's built-in library "BuildBamIndex" in order to index my bam files.

I have followed the manual described in Picard sites but I got error message.

Here is my command line that you can easily understand as below.

java -Xmx8g -XX:ParallelGCThreads=8 -jar $picard/BuildBamIndex.jar I=$RealignedBamDir/$output6

I tried different approach to avoid this error message so I used "samtools index" which i think is also same function as Picard BuildBamIndex.

After using samtools, I successfully got my bam index files.

I suppose that there are no major difference between Picard bamindex and samtools bam index.

I am confusing that why only samtools index procedure is working fine?

Below is my error message when run "BuildBamIndex" from Picard.

[Sun Jan 18 22:15:42 KST 2015] picard.sam.BuildBamIndex INPUT=/DATA1/sclee1/data/URC_WES/U01/01U_N_Filtered_Sorted_Markdup_readgroup.bam VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false [Sun Jan 18 22:15:42 KST 2015] picard.sam.BuildBamIndex done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=2058354688 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" htsjdk.samtools.SAMException: Exception creating BAM index for record HSQ-2K:530:C5PJAACXX:6:2109:18806:13902 1/2 101b aligned read. at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:92) at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:291) at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:271) at picard.sam.BuildBamIndex.doWork(BuildBamIndex.java:133) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105) Caused by: htsjdk.samtools.SAMException: BAM cannot be indexed without setting a fileSource for record HSQ-2K:530:C5PJAACXX:6:2109:18806:13902 1/2 101b aligned read. at htsjdk.samtools.BAMIndexMetaData.recordMetaData(BAMIndexMetaData.java:130) at htsjdk.samtools.BAMIndexer$BAMIndexBuilder.processAlignment(BAMIndexer.java:182) at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:90) ... 6 more

I look forward to hearing positive answers from you soon.

Bye!


Created 2014-12-05 15:12:16 | Updated 2014-12-05 15:12:48 | Tags: solid picard markduplicates
Comments (2)

Hi,

I'm having trouble removing duplicates using Picard tools on SOLiD data. I get a regex not matching error.

The reads have the following names:

22_758_632_F3

604_1497_576

124_1189_1519_F5

358_1875_702_F5-DNA

And I don't think Picard tools is able to pick these read names with its default regex.

I tried to change the default regex. This time it does not throw an error, but it takes too long and times out (out of memory). I suspect I'm not giving the right regex. Here is my command:

java -jar $PICARD_TOOLS_HOME/MarkDuplicates.jar I=$FILE O=$BAMs/MarkDuplicates/$SAMPLE.MD.bam M=$BAMs/MarkDuplicates/$SAMPLE.metrics READ_NAMEREGEX="([0-9]+)([0-9]+)_([0-9]+).*"

Any help is appreciated. Thanks!


Created 2014-09-02 19:46:27 | Updated 2014-09-02 20:10:09 | Tags: haplotypecaller picard
Comments (2)

Hi,

We ran HaplotypeCaller on some bams and got the same error on them: Rod span is not contained within the data shard, meaning we wouldn't get all of the data we need.

Here is my command: java -Xmx32g -jar ~/GATK-3.2.2/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R /hg19/hg19.fasta \ -I $1 \ -ERC GVCF \ -nct 16 \ --variant_index_type LINEAR \ --variant_index_parameter 128000 \ --pair_hmm_implementation VECTOR_LOGLESS_CACHING \ -o ${TEMPORARY_DIR}/$2

Can you help me figure out whats wrong? The error said it may be a bug and I cannot find where this issue was previously addressed. I wonder if something went wrong with the bam processing.

We received these bams from the sequencing institution already aligned. Individual reads were missing the @RG tag, so we used bamaddrg to add the @RG tags. That caused issues with the bam 'bins', so we had to run the htsjdk.samtools.FixBAMFile. We ran MarkDuplicates and attempted the above HaplotypeCaller command.

Thanks!

Mark


Created 2014-08-15 17:43:45 | Updated | Tags: developer picard customwalker htsjdk
Comments (4)

Hi fellow htsjdk/picard/gatk developers!

I've been thinking about this for quite some time now, so I thought I should write up a quick post about it here.

I've been writing custom tools for our group using both picard and GATK for some time now. It's been working nicely, but I have been missing a set of basic tutorials and examples, for users to quickly get started writing walkers. My most commonly used reference has been the 20-line life savers (http://www.slideshare.net/danbolser/20line-lifesavers-coding-simple-solutions-in-the-gatk) which is getting a bit dated.

What I would like to see is something like for following:

  • What's in htsjsk? What's not in htsjdk? (from a dev's perspective - in terms of frameworks)
  • What's in picard? What's not in picard? (from a dev's perspective - in terms of frameworks)
  • What's in gatk? What's not in gatk? (from a dev's perspective - in terms of frameworks)
  • When to use htsjdk, picard any GATK. What are the strengths and weaknesses of the three. (possibly more that I've missed)
  • Your first htsjdk walker
  • Your first picard walker
  • Your first gatk walker
  • Traversing a BAM in htsjdk vs gatk - what are the differences

There might be more stuff that could go in here as well. The driving force behind this is that I'm myself a bit confused by the overlap of these three packages/frameworks. I do understand that picard uses htsjdk, and that GATK uses both as dependencies, but it's not super clear what extra functionality (for a developer) is added from htsjdk -> picard -> gatk.

Could we assemble a small group of interested developers to contribute to this? We could set up a git repo with the examples and tutorials for easy collaboration and sharing online.

Anyone interested? I'll could myself as the first member :)


Created 2013-11-20 06:41:02 | Updated 2013-11-20 06:41:55 | Tags: queue qscript picard markduplicates
Comments (9)

Hi,

So I've finally taken the plunge and migrated our analysis pipeline to Queue. With some great feedback from @johandahlberg, I have gotten to a state where most of the stuff is running smoothly on the cluster.

I'm trying to add Picard's CalculateHSMetrics to the pipeline, but am having some issues. This code:

case class hsmetrics(inBam: File, baitIntervals: File, targetIntervals: File, outMetrics: File) extends CalculateHsMetrics with ExternalCommonArgs with SingleCoreJob with OneDayJob {
    @Input(doc="Input BAM file") val bam: File = inBam
    @Output(doc="Metrics file") val metrics: File = outMetrics
    this.input :+= bam
    this.targets = targetIntervals
    this.baits = baitIntervals
    this.output = metrics
    this.reference =  refGenome
    this.isIntermediate = false
}

Gives the following error message:

ERROR 06:56:25,047 QGraph - Missing 2 values for function:  'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp' null 'INPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.bam'  'TMP_DIR=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp'  'VALIDATION_STRINGENCY=SILENT'  'OUTPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.preMarkDupsHsMetrics.metrics'  'BAIT_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals'  'TARGET_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals'  'REFERENCE_SEQUENCE=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/bwaindex0.6/exampleFASTA.fasta'  'METRIC_ACCUMULATION_LEVEL=SAMPLE'  
ERROR 06:56:25,048 QGraph -   @Argument: jarFile - jar 
ERROR 06:56:25,049 QGraph -   @Argument: javaMainClass - Main class to run from javaClasspath 

And yeah, is seems that the jar file is currently set to null in the command line. However, MarkDuplicates runs fine without setting the jar:

case class dedup(inBam: File, outBam: File, metricsFile: File) extends MarkDuplicates with ExternalCommonArgs with SingleCoreJob with OneDayJob {
    @Input(doc = "Input bam file") var inbam = inBam
    @Output(doc = "Output BAM file with dups removed") var outbam = outBam
    this.REMOVE_DUPLICATES = true
    this.input :+= inBam
    this.output = outBam
    this.metrics = metricsFile
    this.memoryLimit = 3
    this.isIntermediate = false
}

Why does CalculateHSMetrics need the jar, but not MarkDuplicates? Both are imported with import org.broadinstitute.sting.queue.extensions.picard._.


Created 2013-11-08 10:45:37 | Updated | Tags: reducereads error picard
Comments (8)

Hi guys, I've seen this error has been reported other times, for different reasons. The thing is that, the bam file I'm using to reduce the reads has been processed through GATK pipeline without problems, realignment and recalibration included. Therefore, I assumed the bam file generated after BQSR would be GATK-compliant. I was running with Queue, so I just run here the exact command passed to the job in an interactive mode, to see what happens.

Here below is the full command and error message (apologies for lengthy output), where there's no stack trace after the error.

        [fles@login07 reduced]$ 'java'  '-Xmx12288m'  '-Djava.io.tmpdir=/scratch/scratch/fles/project_analysis/reduced/tmp'  '-cp' '/home/fles/applications/Queue-2.7-4-g6f46d11/Queue.jar'  'org.broadinstitute.sting.gatk.CommandLineGATK'  '-T' 'ReduceReads'  '-I' '/home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam'  '-R' '/home/fles/Scratch/gatkbundle_2.5/human_g1k_v37.fasta'  '-o' '/scratch/scratch/fles/project_analysis/reduced/projectTrios.U1_PJ5208467.recal.reduced.bam'
        INFO  09:27:21,728 HelpFormatter - -------------------------------------------------------------------------------- 
        INFO  09:27:21,730 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:29:52 
        INFO  09:27:21,731 HelpFormatter - Copyright (c) 2010 The Broad Institute 
        INFO  09:27:21,731 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
        INFO  09:27:21,735 HelpFormatter - Program Args: -T ReduceReads -I /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam -R /home/fles/Scratch/gatkbundle_2.5/human_g1k_v37.fasta -o /scratch/scratch/fles/project_analysis/reduced/projectTrios.U1_PJ5208467.recal.reduced.bam 
        INFO  09:27:21,735 HelpFormatter - Date/Time: 2013/11/08 09:27:21 
        INFO  09:27:21,735 HelpFormatter - -------------------------------------------------------------------------------- 
        INFO  09:27:21,735 HelpFormatter - -------------------------------------------------------------------------------- 
        INFO  09:27:34,156 GenomeAnalysisEngine - Strictness is SILENT 
        INFO  09:27:34,491 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 40 
        INFO  09:27:34,503 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
        INFO  09:27:34,627 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.12 
        INFO  09:27:35,039 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
        INFO  09:27:35,045 GenomeAnalysisEngine - Done preparing for traversal 
        INFO  09:27:35,045 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
        INFO  09:27:35,046 ProgressMeter -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining 
        INFO  09:27:35,080 ReadShardBalancer$1 - Loading BAM index data 
        INFO  09:27:35,081 ReadShardBalancer$1 - Done loading BAM index data 
        INFO  09:28:05,059 ProgressMeter -      1:18958138        1.00e+06   30.0 s       30.0 s      0.6%        81.8 m    81.3 m 
        INFO  09:28:35,069 ProgressMeter -      1:46733396        2.30e+06   60.0 s       26.0 s      1.5%        66.4 m    65.4 m 
        INFO  09:29:05,079 ProgressMeter -      1:92187730        3.50e+06   90.0 s       25.0 s      3.0%        50.5 m    49.0 m 
        INFO  09:29:35,088 ProgressMeter -     1:145281942        4.90e+06  120.0 s       24.0 s      4.7%        42.7 m    40.7 m 
        INFO  09:30:05,098 ProgressMeter -     1:152323864        6.40e+06    2.5 m       23.0 s      4.9%        50.9 m    48.4 m 
        INFO  09:30:35,893 ProgressMeter -     1:181206886        7.70e+06    3.0 m       23.0 s      5.8%        51.4 m    48.4 m 
        INFO  09:31:05,902 ProgressMeter -     1:217604563        8.90e+06    3.5 m       23.0 s      7.0%        49.9 m    46.4 m 
        INFO  09:31:35,913 ProgressMeter -      2:14782401        1.02e+07    4.0 m       23.0 s      8.5%        47.0 m    43.0 m 
        INFO  09:32:05,922 ProgressMeter -      2:62429207        1.15e+07    4.5 m       23.0 s     10.0%        44.8 m    40.3 m 
        INFO  09:32:35,931 ProgressMeter -      2:97877374        1.28e+07    5.0 m       23.0 s     11.2%        44.7 m    39.7 m 
        INFO  09:33:06,218 ProgressMeter -     2:135574018        1.42e+07    5.5 m       23.0 s     12.4%        44.5 m    38.9 m 
        INFO  09:33:36,227 ProgressMeter -     2:179431307        1.56e+07    6.0 m       23.0 s     13.8%        43.5 m    37.5 m 
        INFO  09:34:06,237 ProgressMeter -     2:216279690        1.69e+07    6.5 m       23.0 s     15.0%        43.4 m    36.9 m 
        INFO  09:34:36,248 ProgressMeter -      3:14974731        1.81e+07    7.0 m       23.0 s     16.4%        42.9 m    35.9 m 
        INFO  09:35:07,073 ProgressMeter -      3:52443620        1.94e+07    7.5 m       23.0 s     17.6%        42.9 m    35.4 m 
        INFO  09:35:37,084 ProgressMeter -     3:111366536        2.05e+07    8.0 m       23.0 s     19.5%        41.3 m    33.2 m 
        INFO  09:36:07,094 ProgressMeter -     3:155571144        2.18e+07    8.5 m       23.0 s     20.9%        40.8 m    32.3 m 
        INFO  09:36:37,103 ProgressMeter -       4:3495327        2.31e+07    9.0 m       23.0 s     22.4%        40.4 m    31.3 m 
        INFO  09:37:07,114 ProgressMeter -      4:48178306        2.43e+07    9.5 m       23.0 s     23.8%        40.0 m    30.5 m 
        INFO  09:37:37,270 ProgressMeter -     4:106747046        2.56e+07   10.0 m       23.0 s     25.7%        39.0 m    29.0 m 
        INFO  09:38:07,483 ProgressMeter -     4:181303657        2.69e+07   10.5 m       23.0 s     28.1%        37.5 m    26.9 m 
        INFO  09:38:37,493 ProgressMeter -      5:41149454        2.81e+07   11.0 m       23.0 s     29.7%        37.1 m    26.1 m 
        INFO  09:38:51,094 GATKRunReport - Uploaded run statistics report to AWS S3 
        ##### ERROR ------------------------------------------------------------------------------------------
        ##### ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11): 
        ##### ERROR
        ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
        ##### ERROR The error message below tells you what is the problem.
        ##### ERROR
        ##### ERROR If the problem is an invalid argument, please check the online documentation guide
        ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
        ##### ERROR
        ##### ERROR Visit our website and forum for extensive documentation and answers to 
        ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
        ##### ERROR
        ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
        ##### ERROR
        ##### ERROR MESSAGE: SAM/BAM file /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam is malformed: Read error; BinaryCodec in readmode; file: /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam
        ##### ERROR ------------------------------------------------------------------------------------------

Following your usual advice, I validated the bam file produced by BQSR with Picard and I get the exact same error, but no specific error indication

        [fles@login07 reduced]$ java -jar ~/applications/picard-tools-1.102/ValidateSamFile.jar \
        > INPUT=/home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam \
        > IGNORE_WARNINGS=TRUE
        [Fri Nov 08 09:59:42 GMT 2013] net.sf.picard.sam.ValidateSamFile INPUT=/home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam IGNORE_WARNINGS=true    MAX_OUTPUT=100 VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
        [Fri Nov 08 09:59:42 GMT 2013] Executing as fles@login07 on Linux 2.6.18-194.11.4.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18; Picard version: 1.102(1591)
        INFO    2013-11-08 10:01:01 SamFileValidator    Validated Read    10,000,000 records.  Elapsed time: 00:01:18s.  Time for last 10,000,000:   78s.  Last read position: 1:204,966,172
        INFO    2013-11-08 10:02:19 SamFileValidator    Validated Read    20,000,000 records.  Elapsed time: 00:02:36s.  Time for last 10,000,000:   78s.  Last read position: 2:232,121,396
        INFO    2013-11-08 10:03:36 SamFileValidator    Validated Read    30,000,000 records.  Elapsed time: 00:03:54s.  Time for last 10,000,000:   77s.  Last read position: 4:123,140,629
        [Fri Nov 08 10:04:00 GMT 2013] net.sf.picard.sam.ValidateSamFile done. Elapsed time: 4.30 minutes.
        Runtime.totalMemory()=300941312
        To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
        Exception in thread "main" net.sf.samtools.util.RuntimeIOException: Read error; BinaryCodec in readmode; file: /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam
            at net.sf.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:397)
            at net.sf.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:371)
            at net.sf.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:357)
            at net.sf.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:200)
            at net.sf.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:558)
            at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:532)
            at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:522)
            at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:481)
            at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:687)
            at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:665)
            at net.sf.picard.sam.SamFileValidator.validateSamRecordsAndQualityFormat(SamFileValidator.java:241)
            at net.sf.picard.sam.SamFileValidator.validateSamFile(SamFileValidator.java:177)
            at net.sf.picard.sam.SamFileValidator.validateSamFileSummary(SamFileValidator.java:104)
            at net.sf.picard.sam.ValidateSamFile.doWork(ValidateSamFile.java:164)
            at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
            at net.sf.picard.sam.ValidateSamFile.main(ValidateSamFile.java:100)
        Caused by: java.io.IOException: Unexpected compressed block length: 1
            at net.sf.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:358)
            at net.sf.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:113)
            at net.sf.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:238)
            at java.io.DataInputStream.read(DataInputStream.java:149)
            at net.sf.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:395)

any suggestions on what I might do wrong?


Created 2013-08-13 21:38:39 | Updated | Tags: vcf gatk picard
Comments (5)

I finally got the filtered VCF file from PWA + PiCard + GATK pipeline, and have 11 exome-seq data files which were processed as a list of input to GATK. In the process of getting VCF, I did not see an option of separating the 11 samples. Now, I've got two VCF files (one for SNPs and the other for indels) that each has 11 samples. My question is how to proceed from here?

Should I separate the 11 files before annotation? or annotation first then split them 11 samples to individual files? Big question here is how to split the samples from vcf files? thanks


Created 2013-03-14 14:36:56 | Updated | Tags: picard
Comments (1)

I happened to be readying a new reference genome and was using FastaStats to force the creation of the .dict and .fai files. The automatic .dict file creation first creates a temporary file, creates the .dict using Picard, and then copies it as appropriate. However, the newer versions of Picard won't overwrite the dict file, and so there is an error using the Java created temporary file as an output. The problematic section seems to be ReferenceDataSource.java. The error manifests as:

Index file /gs01/projects/ngs/resources/gatk/2.3/human_g1k_v37.dict does not exist but could not be created because: .   /gs01/projects/ngs/resources/gatk/2.3/dict8545428789729910550.tmp already exists.  Delete this file and try again, or specify a different output file.

Created 2013-02-23 19:50:26 | Updated 2013-02-23 19:56:12 | Tags: realignertargetcreator picard
Comments (14)

Hi there,

I get an error when I try to run GATK with the following command:

java -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I  merged_bam_files_indexed_markduplicate.bam -o reads.intervals

However I get this error:

SAM/BAM file SAMFileReader{/merged_bam_files_indexed_markduplicate.bam} is malformed: Read HWI-ST303_0093:5:5:13416:34802#0 is either missing the read group or its read group is not defined in the BAM header, both of which are required by the GATK.  Please use http://gatkforums.broadinstitute.org/discussion/59/companion-utilities-replacereadgroups to fix this problem

It suggest that it a header issue however my bam file has a header:

samtools view -h merged_bam_files_indexed_markduplicate.bam | grep ^@RG
@RG     ID:test1      PL:Illumina     PU:HWI-ST303    LB:test     PI:75   SM:test   CN:japan
@RG     ID:test2      PL:Illumina     PU:HWI-ST303    LB:test     PI:75   SM:test    CN:japan

when I grep the read within the error:

HWI-ST303_0093:5:5:13416:34802#0        99      1       1090    29      23S60M17S       =       1150    160     TGTTTGGGTTGAAGATTGATACTGGAAGAAGATTAGAATTGTAGAAAGGGGAAAACGATGTTAGAAAGTTAATACGGCTTACTCCAGATCCTTGGATCTC        GGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGFGGGGGGGGGDGFGFGGGGGFEDFGEGGGDGEG?FGGDDGFFDGGEDDFFFFEDG?E        MD:Z:60 PG:Z:MarkDuplicates     RG:Z:test1      XG:i:0  AM:i:29 NM:i:0  SM:i:29 XM:i:0  XO:i:0  XT:A:M

Following Picard solution:

java -XX:MaxDirectMemorySize=4G -jar picard-tools-1.85/AddOrReplaceReadGroups.jar I= test.bam O= test.header.bam SORT_ORDER=coordinate RGID=test RGLB=test  RGPL=Illumina RGSM=test/ RGPU=HWI-ST303  RGCN=japan CREATE_INDEX=True 

I get this error after 2 min.:

Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 12247781, Read name HWI-ST303_0093:5:26:10129:50409#0, MAPQ should be 0 for unmapped read.`

Any recommendation on how to solve this issue ?

My plan is to do the following to resolve the issue:

picard/MarkDuplicates.jar I=test.bam O=test_markduplicate.bam M=test.matrix AS=true VALIDATION_STRINGENCY=LENIANT
samtools  index test_markduplicate.bam

I see a lot of messages like below but the command still running:

Ignoring SAM validation error: ERROR: Record (number), Read name HWI-ST303_0093:5:5:13416:34802#0, RG ID on SAMRecord not found in header: test1

while running the command

then try the GATK RealignerTargetCreator

I already tried to do the following

java -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I  merged_bam_files_indexed_markduplicate.bam -o reads.intervals --validation_strictness LENIENT

But I still got the same error

N.B: the same command run with no issue with GATK version (1.2)

My pipeline in short: mapping the paired end reads with

bwa aln -q 20 ref.fa read > files.sai
bwa sampe ref.fa file1.sai file2.sai read1 read2 > test1.sam
samtools view -bS test1.sam | samtools sort - test
samtools  index test1.bam
samtools merge -rh RG.txt test test1.bam test2.bam

RG.txt

@RG     ID:test1      PL:Illumina     PU:HWI-ST303    LB:test     PI:75   SM:test   CN:japan
@RG     ID:test2      PL:Illumina     PU:HWI-ST303    LB:test     PI:75   SM:test    CN:japan

samtools  index test.bam
picard/MarkDuplicates.jar I=test.bam O=test_markduplicate.bam M=test.matrix AS=true VALIDATION_STRINGENCY=SILENT
samtools  index test_markduplicate.bam

Created 2012-12-24 20:09:25 | Updated 2013-01-07 19:16:08 | Tags: bam readgroup picard
Comments (3)

how to add read group to the bam file using PICARD generated from GS reference mapper BAM file?


Created 2012-11-28 11:47:29 | Updated | Tags: indelrealigner picard microscheduler baserecalibration createtarget
Comments (2)

Hi all,

I am doing an exome analysis with BWA 0.6.1-r104, Picard 1.79 and GATK v2.2-8-gec077cd. I have paired end reads, my protocol until now is (in brief, omitting options etc.)

bwa aln R1.fastq bwa aln R2.fastq bwa sampe R1.sai R2.sai picard/CleanSam.jar picard/SortSam.jar picard/MarkDuplicates.jar picard/AddOrReplaceReadGroups.jar picard/BuildBamIndex.jar GATK -T RealignerTargetCreator -known dbsnp.vcf GATK -T IndelRealigner -known dbsnp.vcf GATK -T BaseRecalibrator -knownSites dbsnp.vcf GATK -T PrintReads

A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.

I have 85767226 paired end = 171534452 sequences in fastQ file

BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.

MarkDuplicates reports:

Read 165619516 records. 2 pairs never matched. Marking 20272927 records as duplicates. Found 2919670 optical duplicate clusters.

so nearly 6 million reads seem to miss.

CreateTargets MicroScheduler reports

35915555 reads were filtered out during traversal out of 166579875 total (21.56%) -> 428072 reads (0.26% of total) failing BadMateFilter -> 16077607 reads (9.65% of total) failing DuplicateReadFilter -> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter

so nearly 5 million reads seem to miss

The Realigner MicroScheduler reports

0 reads were filtered out during traversal out of 171551640 total (0.00%)

which appears a miracle to me since 1) there are even more reads now than input sequences, 2) all those crappy reads reported by CreateTargets do not appear.

From Base recalibration MicroScheduler, I get

41397379 reads were filtered out during traversal out of 171703265 total (24.11%) -> 16010068 reads (9.32% of total) failing DuplicateReadFilter -> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter

..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with "roughly" the same number.

I found these varying counts a little irritating -- can someone please give me a hint on the logics of these numbers? And, does the protocol look meaningful?

Thanks for any comments!


Created 2012-09-12 15:14:41 | Updated 2013-01-07 20:40:49 | Tags: bwa picard mtdna
Comments (1)

Picard appears not to like the way BWA codes mtDNA. I am doing human exome sequencing using a copy of hg19 which I obtained from UCSC and indexed using BWA per the instructions here:

Example 1

[Tue Aug 28 12:45:16 EDT 2012] net.sf.picard.sam.SortSam done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=125435904
FAQ: http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page
Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Non-numeric value in ISIZE column; Line 3982
Line: FCC0CHTACXX:1:1101:14789:3170#TAGCTTAT 117 chrM 304415842 0 100M = -1610645157 2379906297 TGCGACTTGGAAGCGGATTCAGAGGACAGGACAGAACACTTGGGCAAGTGAATCTCTGTCTGTCTGTCTGTCTCATTGGTTGGTTTATTTCCATTTTCTT B@<:>CDDDBDDBDEEEEEEFEFCCHHFHHGGIIIHIGJJJIIGGGIIIIJJJIIGJIJGG@CEIFJIJJJJIJIJIJJJJIJJJGIHHGHFFEFFFCCC RG:Z:1868 XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:2 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:39G45G14 XA:Z:chrM,-391302964,100M,2;
at net.sf.samtools.SAMTextReader.reportFatalErrorParsingLine(SAMTextReader.java:223)
at net.sf.samtools.SAMTextReader.access$400(SAMTextReader.java:40)
at net.sf.samtools.SAMTextReader$RecordIterator.parseInt(SAMTextReader.java:293)
at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:394)
at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:278)
at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:250)
at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:641)
at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:619)
at net.sf.picard.sam.SortSam.doWork(SortSam.java:68)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
at net.sf.picard.sam.SortSam.main(SortSam.java:57)

Example 2

java -jar ~/bin/picard-tools-1.74/MarkDuplicates.jar \
INPUT=1sorted.bam \
OUTPUT=1dedup.bam \
ASSUME_SORTED=true \
METRICS_FILE=metrics \
CREATE_INDEX=true \
VALIDATION_STRINGENCY=LENIENT

...
Ignoring SAM validation error: ERROR: Record 691, Read name FCC0CHTACXX:1:1302:4748:176644#GGCTACAT, Mate Alignment start (436154938) must be <= reference sequence length (16571) on reference chrM
Ignoring SAM validation error: ERROR: Record 692, Read name FCC0CHTACXX:1:2104:8494:167812#GGCTACAT, Mate Alignment start should != 0 because reference name != *.
Ignoring SAM validation error: ERROR: Record 693, Read name FCC0CHTACXX:1:1201:21002:183608#GGCTACAT, Mate Alignment start should != 0 because reference name != *.
Ignoring SAM validation error: ERROR: Record 694, Read name FCC0CHTACXX:1:2303:3184:35872#GGCTACAT, Mate Alignment start (436154812) must be <= reference sequence length (16571) on reference chrM
...

I've truncated the output; in fact it throws such an error for every single line of mitochondrial reads.

I suspect I could solve this by writing my own script to go in and change the way one column is coded, but more broadly, I am interested in the answer to "how do you make BWA, Picard and GATK work seamlessly together without needing to do your own scripting"?