# Tagged with #performance 5 documentation articles | 3 announcements | 17 forum discussions

Created 2013-12-20 22:01:53 | Updated 2014-10-31 20:36:41 | Tags: performance presentations conferences

Mauricio Carneiro presented this slide deck at the workshop organized by Mnt Sinai School of Medicine on December 10, 2013. The other presentations made at the workshop were posted here.

Please note that we cannot guarantee content hosted on other websites; if outgoing links becomes outdated please let us know.

Created 2012-12-18 21:35:34 | Updated 2013-01-26 05:10:36 | Tags: intro queue parallelism performance scatter-gather multithreading nct nt

This document explains the concepts involved and how they are applied within the GATK (and Queue where applicable). For specific configuration recommendations, see the companion document on parallelizing GATK tools.

### 1. Introducing the concept of parallelism

Parallelism is a way to make a program finish faster by performing several operations in parallel, rather than sequentially (i.e. waiting for each operation to finish before starting the next one).

Imagine you need to cook rice for sixty-four people, but your rice cooker can only make enough rice for four people at a time. If you have to cook all the batches of rice sequentially, it's going to take all night. But if you have eight rice cookers that you can use in parallel, you can finish up to eight times faster.

This is a very simple idea but it has a key requirement: you have to be able to break down the job into smaller tasks that can be done independently. It's easy enough to divide portions of rice because rice itself is a collection of discrete units. In contrast, let's look at a case where you can't make that kind of division: it takes one pregnant woman nine months to grow a baby, but you can't do it in one month by having nine women share the work.

The good news is that most GATK runs are more like rice than like babies. Because GATK tools are built to use the Map/Reduce method (see doc for details), most GATK runs essentially consist of a series of many small independent operations that can be parallelized.

Parallelism is a great way to speed up processing on large amounts of data, but it has "overhead" costs. Without getting too technical at this point, let's just say that parallelized jobs need to be managed, you have to set aside memory for them, regulate file access, collect results and so on. So it's important to balance the costs against the benefits, and avoid dividing the overall work into too many small jobs.

Going back to the introductory example, you wouldn't want to use a million tiny rice cookers that each boil a single grain of rice. They would take way too much space on your countertop, and the time it would take to distribute each grain then collect it when it's cooked would negate any benefits from parallelizing in the first place.

#### Parallel computing in practice (sort of)

OK, parallelism sounds great (despite the tradeoffs caveat), but how do we get from cooking rice to executing programs? What actually happens in the computer?

Consider that when you run a program like the GATK, you're just telling the computer to execute a set of instructions.

Let's say we have a text file and we want to count the number of lines in it. The set of instructions to do this can be as simple as:

• open the file, count the number of lines in the file, tell us the number, close the file

Note that tell us the number can mean writing it to the console, or storing it somewhere for use later on.

Now let's say we want to know the number of words on each line. The set of instructions would be:

• open the file, read the first line, count the number of words, tell us the number, read the second line, count the number of words, tell us the number, read the third line, count the number of words, tell us the number

And so on until we've read all the lines, and finally we can close the file. It's pretty straightforward, but if our file has a lot of lines, it will take a long time, and it will probably not use all the computing power we have available.

So to parallelize this program and save time, we just cut up this set of instructions into separate subsets like this:

• open the file, index the lines

• read the first line, count the number of words, tell us the number
• read the second line, count the number of words, tell us the number
• read the third line, count the number of words, tell us the number
• [repeat for all lines]

• collect final results and close the file

Here, the read the Nth line steps can be performed in parallel, because they are all independent operations.

You'll notice that we added a step, index the lines. That's a little bit of peliminary work that allows us to perform the read the Nth line steps in parallel (or in any order we want) because it tells us how many lines there are and where to find each one within the file. It makes the whole process much more efficient. As you may know, the GATK requires index files for the main data files (reference, BAMs and VCFs); the reason is essentially to have that indexing step already done.

Anyway, that's the general principle: you transform your linear set of instructions into several subsets of instructions. There's usually one subset that has to be run first and one that has to be run last, but all the subsets in the middle can be run at the same time (in parallel) or in whatever order you want.

### 2. Parallelizing the GATK

There are three different modes of parallelism offered by the GATK, and to really understand the difference you first need to understand what are the different levels of computing that are involved.

#### A quick word about levels of computing

By levels of computing, we mean the computing units in terms of hardware: the core, the machine (or CPU) and the cluster.

• Core: the level below the machine. On your laptop or desktop, the CPU (central processing unit, or processor) contains one or more cores. If you have a recent machine, your CPU probably has at least two cores, and is therefore called dual-core. If it has four, it's a quad-core, and so on. High-end consumer machines like the latest Mac Pro have up to twelve-core CPUs (which should be called dodeca-core if we follow the Latin terminology) but the CPUs on some professional-grade machines can have tens or hundreds of cores.

• Machine: the middle of the scale. For most of us, the machine is the laptop or desktop computer. Really we should refer to the CPU specifically, since that's the relevant part that does the processing, but the most common usage is to say machine. Except if the machine is part of a cluster, in which case it's called a node.

• Cluster: the level above the machine. This is a high-performance computing structure made of a bunch of machines (usually called nodes) networked together. If you have access to a cluster, chances are it either belongs to your institution, or your company is renting time on it. A cluster can also be called a server farm or a load-sharing facility.

Parallelism can be applied at all three of these levels, but in different ways of course, and under different names. Parallelism takes the name of multi-threading at the core and machine levels, and scatter-gather at the cluster level.

In computing, a thread of execution is a set of instructions that the program issues to the processor to get work done. In single-threading mode, a program only sends a single thread at a time to the processor and waits for it to be finished before sending another one. In multi-threading mode, the program may send several threads to the processor at the same time.

Not making sense? Let's go back to our earlier example, in which we wanted to count the number of words in each line of our text document. Hopefully it is clear that the first version of our little program (one long set of sequential instructions) is what you would run in single-threaded mode. And the second version (several subsets of instructions) is what you would run in multi-threaded mode, with each subset forming a separate thread. You would send out the first thread, which performs the preliminary work; then once it's done you would send the "middle" threads, which can be run in parallel; then finally once they're all done you would send out the final thread to clean up and collect final results.

If you're still having a hard time visualizing what the different threads are like, just imagine that you're doing cross-stitching. If you're a regular human, you're working with just one hand. You're pulling a needle and thread (a single thread!) through the canvas, making one stitch after another, one row after another. Now try to imagine an octopus doing cross-stitching. He can make several rows of stitches at the same time using a different needle and thread for each. Multi-threading in computers is surprisingly similar to that.

Hey, if you have a better example, let us know in the forum and we'll use that instead.

Alright, now that you understand the idea of multithreading, let's get practical: how do we do get the GATK to use multi-threading?

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively. They can be combined, since they act at different levels of computing:

• -nt / --num_threads controls the number of data threads sent to the processor (acting at the machine level)

• -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread (acting at the core level).

Not all GATK tools can use these options due to the nature of the analyses that they perform and how they traverse the data. Even in the case of tools that are used sequentially to perform a multi-step process, the individual tools may not support the same options. For example, at time of writing (Dec. 2012), of the tools involved in local realignment around indels, RealignerTargetCreator supports -nt but not -nct, while IndelRealigner does not support either of these options.

In addition, there are some important technical details that affect how these options can be used with optimal results. Those are explained along with specific recommendations for the main GATK tools in a companion document on parallelizing the GATK.

#### Scatter-gather

If you Google it, you'll find that the term scatter-gather can refer to a lot of different things, including strategies to get the best price quotes from online vendors, methods to control memory allocation and… an indie-rock band. What all of those things have in common (except possibly the band) is that they involve breaking up a task into smaller, parallelized tasks (scattering) then collecting and integrating the results (gathering). That should sound really familiar to you by now, since it's the general principle of parallel computing.

So yes, "scatter-gather" is really just another way to say we're parallelizing things. OK, but how is it different from multithreading, and why do we need yet another name?

As you know by now, multithreading specifically refers to what happens internally when the program (in our case, the GATK) sends several sets of instructions to the processor to achieve the instructions that you originally gave it in a single command-line. In contrast, the scatter-gather strategy as used by the GATK involves a separate program, called Queue, which generates separate GATK jobs (each with its own command-line) to achieve the instructions given in a so-called Qscript (i.e. a script written for Queue in a programming language called Scala).

At the simplest level, the Qscript can involve a single GATK tool*. In that case Queue will create separate GATK commands that will each run that tool on a portion of the input data (= the scatter step). The results of each run will be stored in temporary files. Then once all the runs are done, Queue will collate all the results into the final output files, as if the tool had been run as a single command (= the gather step).

Note that Queue has additional capabilities, such as managing the use of multiple GATK tools in a dependency-aware manner to run complex pipelines, but that is outside the scope of this article. To learn more about pipelining the GATK with Queue, please see the Queue documentation.

#### Compare and combine

So you see, scatter-gather is a very different process from multi-threading because the parallelization happens outside of the program itself. The big advantage is that this opens up the upper level of computing: the cluster level. Remember, the GATK program is limited to dispatching threads to the processor of the machine on which it is run – it cannot by itself send threads to a different machine. But Queue can dispatch scattered GATK jobs to different machines in a computing cluster by interfacing with your cluster's job management software.

That being said, multithreading has the great advantage that cores and machines all have access to shared machine memory with very high bandwidth capacity. In contrast, the multiple machines on a network used for scatter-gather are fundamentally limited by network costs.

The good news is that you can combine scatter-gather and multithreading: use Queue to scatter GATK jobs to different nodes on your cluster, then use the GATK's internal multithreading capabilities to parallelize the jobs running on each node.

Going back to the rice-cooking example, it's as if instead of cooking the rice yourself, you hired a catering company to do it for you. The company assigns the work to several people, who each have their own cooking station with multiple rice cookers. Now you can feed a lot more people in the same amount of time! And you don't even have to clean the dishes.

Created 2012-12-14 21:59:43 | Updated 2016-03-31 18:41:31 | Tags: queue parallelism performance scatter-gather multithreading nct nt

This document provides technical details and recommendations on how the parallelism options offered by the GATK can be used to yield optimal performance results.

### Overview

As explained in the primer on parallelism for the GATK, there are two main kinds of parallelism that can be applied to the GATK: multi-threading and scatter-gather (using Queue).

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively, which can be combined:

• -nt / --num_threads controls the number of data threads sent to the processor
• -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread

Each data thread needs to be given the full amount of memory you’d normally give a single run. So if you’re running a tool that normally requires 2 Gb of memory to run, if you use -nt 4, the multithreaded run will use 8 Gb of memory. In contrast, CPU threads will share the memory allocated to their “mother” data thread, so you don’t need to worry about allocating memory based on the number of CPU threads you use.

#### Additional consideration when using -nct with versions 2.2 and 2.3

Because of the way the -nct option was originally implemented, in versions 2.2 and 2.3, there is one CPU thread that is reserved by the system to “manage” the rest. So if you use -nct, you’ll only really start seeing a speedup with -nct 3 (which yields two effective "working" threads) and above. This limitation has been resolved in the implementation that will be available in versions 2.4 and up.

### Scatter-gather

For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.

### Applicability of parallelism to the major GATK tools

Please note that not all tools support all parallelization modes. The parallelization modes that are available for each tool depend partly on the type of traversal that the tool uses to walk through the data, and partly on the nature of the analyses it performs.

Tool Full name Type of traversal NT NCT SG
RTC RealignerTargetCreator RodWalker + - -
IR IndelRealigner ReadWalker - - +
BR BaseRecalibrator LocusWalker - + +
UG UnifiedGenotyper LocusWalker + + +

### Recommended configurations

The table below summarizes configurations that we typically use for our own projects (one per tool, except we give three alternate possibilities for the UnifiedGenotyper). The different values allocated for each tool reflect not only the technical capabilities of these tools (which options are supported), but also our empirical observations of what provides the best tradeoffs between performance gains and commitment of resources. Please note however that this is meant only as a guide, and that we cannot give you any guarantee that these configurations are the best for your own setup. You will probably have to experiment with the settings to find the configuration that is right for you.

Tool RTC IR BR PR RR UG
Available modes NT SG NCT,SG NCT SG NT,NCT,SG
Cluster nodes 1 4 4 1 4 4 / 4 / 4
CPU threads (-nct) 1 1 8 4-8 1 3 / 6 / 24
Data threads (-nt) 24 1 1 1 1 8 / 4 / 1
Memory (Gb) 48 4 4 4 4 32 / 16 / 4

Where NT is data multithreading, NCT is CPU multithreading and SG is scatter-gather using Queue. For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.

Created 2012-11-05 16:20:38 | Updated 2013-01-14 17:35:25 | Tags: intro parallelism performance walkers map-reduce

### Overview

One of the key challenges of working with next-gen sequence data is that input files are usually very large. We can’t just make the program open the files, load all the data into memory and perform whatever analysis is needed on all of it in one go. It’s just too much work, even for supercomputers.

Instead, we make the program cut the job into smaller tasks that the computer can easily process separately. Then we have it combine the results of each step into the final result.

### Map/Reduce

Map/Reduce is the technique we use to achieve this. It consists of three steps formally called filter, map and reduce. Let’s apply it to an example case where we want to find out what is the average depth of coverage in our dataset for a certain region of the genome.

• filter determines what subset of the data needs to be processed in each task. In our example, the program lists all the reference positions in our region of interest.

• map applies the function, i.e. performs the analysis on each subset of data. In our example, for each position in the list, the program looks into the BAM file, pulls out the pileup of bases and outputs the depth of coverage at that position.

• reduce combines the elements in the list of results output by the map function. In our example, the program takes the coverage numbers that were calculated separately for all the reference positions and calculates their average, which is the final result we want.

This may seem trivial for such a simple example, but it is a very powerful method with many advantages. Among other things, it makes it relatively easy to parallelize operations, which makes the tools run much faster on large datasets.

### Walkers, filters and traversal types

All the tools in the GATK are built from the ground up to take advantage of this method. That’s why we call them walkers: because they “walk” across the genome, getting things done.

Note that even though it’s not included in the Map/Reduce technique’s name, the filter step is very important. It determines what data get presented to the tool for analysis, selecting only the appropriate data for each task and discarding anything that’s not relevant. This is a key part of the Map/Reduce technique, because that’s what makes each task “bite-sized” enough for the computer to handle easily.

Each tool has filters that are tailored specifically for the type of analysis it performs. The filters rely on traversal engines, which are little programs that are designed to “traverse” the data (i.e. walk through the data) in specific ways.

There are three major types of traversal: Locus Traversal, Read Traversal and Active Region Traversal. In our interval coverage example, the tool’s filter uses the Locus Traversal engine, which walks through the data by locus, i.e. by position along the reference genome. Because of that, the tool is classified as a Locus Walker. Similarly, the Read Traversal engine is used, you’ve guessed it, by Read Walkers.

The GATK engine comes packed with many other ways to walk through the genome and get the job done seamlessly, but those are the ones you’ll encounter most often.

Created 2012-08-11 04:36:16 | Updated 2012-10-18 14:57:10 | Tags: developer performance ngs

Imagine a simple question like, "What's the depth of coverage at position A of the genome?"

First, you are given billions of reads that are aligned to the genome but not ordered in any particular way (except perhaps in the order they were emitted by the sequencer). This simple question is then very difficult to answer efficiently, because the algorithm is forced to examine every single read in succession, since any one of them might span position A. The algorithm must now take several hours in order to compute this value.

Instead, imagine the billions of reads are now sorted in reference order (that is to say, on each chromosome, the reads are stored on disk in the same order they appear on the chromosome). Now, answering the question above is trivial, as the algorithm can jump to the desired location, examine only the reads that span the position, and return immediately after those reads (and only those reads) are inspected. The total number of reads that need to be interrogated is only a handful, rather than several billion, and the processing time is seconds, not hours.

This reference-ordered sorting enables the GATK to process terabytes of data quickly and without tremendous memory overhead. Most GATK tools run very quickly and with less than 2 gigabytes of RAM. Without this sorting, the GATK cannot operate correctly. Thus, it is a fundamental rule of working with the GATK, which is the reason for the Central Dogma of the GATK:

#### All datasets (reads, alignments, quality scores, variants, dbSNP information, gene tracks, interval lists - everything) must be sorted in order of one of the canonical references sequences.

Created 2016-03-16 15:03:06 | Updated 2016-03-17 16:09:20 | Tags: profile parallelism performance runtime speed benchmarks

When you're setting up a variant discovery pipeline, you face two problems: deciding what tools to run (with what options), and how to run them efficiently so that it doesn't take forever. Between our documentation and our support forum, we can get you most if not all the way to solving the first problem, unless you're working with something really unusual.

However, the second problem is not something we've been able to help much with. We only benchmark computational requirements/performance for the purposes of our in-house pipelines, which are very specific to our particular infrastructure, and we don't have the resources to test different configurations. As a result it's been hard for us to give satisfying answers to questions like "How long should this take?" or "How much RAM do I need?" -- and we're aware this is a big point of pain.

So I'm really pleased to announce that a team of engineers at Intel have been developing a system to profile pipelines that implement our Best Practices workflows on a range of hardware configurations. This is a project we've been supporting by providing test data and implementation advice, and it's really gratifying to see it bear fruit: the team recently published their first round of profiling, done on the per-sample segment of the germline variation pipeline (from BWA to HaplotypeCaller; FASTQ to GVCF) on a trio of whole genomes.

The white paper is available from the GATK-specific page of Intel's Health-IT initiative website and contains some very useful insights regarding key bottlenecks in the pipeline. It also details the applicability of parallelizing options for each tool, as well as the effect of using different numbers of threads on performance, when run on a single 36-core machine. Spoiler alert: more isn't always better!

Read on for a couple of highlights of what I thought was especially interesting in the Intel team's white paper.

### Finding the thread count that's right for you: not just a Bed, Bath & Beyond question

First, this figure showing how parallelization affects turnaround time for each tool in the pipeline does a great job of identifying where parallelization makes a the biggest difference (hellooooo BWA), as well as where the addition of more parallel threads shows quickly diminishing returns.

Note that here the term "thread" refers both to parallelized execution that is achieved through multithreading approaches (in GATK, these are invoked using -nt and/or -nct depending on the tool) as well as through local scatter-gather, which consists of running the tools over slices of data generated using Queue, the companion program to GATK. In terms of computational resources the result is similar: a certain number of cores on the machine are dedicated to running the given component. Full details are given in the script that accompanies the white paper; in a nutshell, multithreading was used for BWA-mem and RealignerTargetCreator, whereas scatter-gather was used to parallelize IndelRealigner, BaseRecalibrator, PrintReads and HaplotypeCaller.

You can see that for the genome mapping step, done with BWA-mem, going from 1 to just 4 threads leads to a 4-fold decrease in runtime, which is yuuuge since it's the most time-consuming step in the pipeline. And beyond that you can still bring runtime down further by throwing more threads at the problem, until the relative gains bottom out somewhere between 8 and 36 -- at which point you're spending very little time on BWA mapping. Your mileage may vary, but for reference, in Broad's production pipeline we give BWA-mem 12 threads.

On the GATK end of the pipeline, HaplotypeCaller also shows an about 4-fold speedup when going from a single core to 4 scatter-gather jobs run on different cores, but beyond that the gains from additional parallelization tend to be progressively more modest. Multithreading with -nct is not used at all because it has proved fairly unstable in HaplotypeCaller, leading to occasional unpredictable crashes.

The remaining steps see less dramatic improvement, comparatively speaking, though BaseRecalibrator and PrintReads with -BQSR do show a decent 2-fold speedup when run on four cores instead of one. But this shows fairly clearly that there's little point to blindly throwing more parallelization at these tools.

### Are we there yet? Are we there yet? Are we there yet?

In this second figure we're looking at CPU utilization throughout the pipeline, i.e. how much computing the machine is doing at any given time -- as opposed to doing boring things like reading and writing data to and from files (I/O), which is like driving from Omaha to Denver (a long flat drive where nothing exciting happens, but things get fun once you get there).

Note that this figure corresponds specifically to an "optimized run", i.e. a specific configuration of the pipeline where each tool was parallelized optimally based on earlier results.

You can see that BWA-mem is a busy beaver, spending pretty much all of its time furiously calculating mapping scores and writing out results to the output SAM file as it goes. In contrast, if we look at the tools that write out BAM files, we see a flurry of activity up front, then the line goes flat during a long period spent just writing out results to the output BAM file.

In the case of BaseRecalibrator, the tool only outputs a recalibration table, which doesn't take very long at all. Then looking at PrintReads (run with -BQSR) you see a similar activity profile to BaseRecalibrator's, which corresponds to the on-the-fly recalibration done by the engine before the recalibrated data is written to the final pre-processed BAM file that will be fed to HaplotypeCaller.

Finally you see that HaplotypeCaller itself is the most compute-intensive tool in the GATK end of the pipeline; although this is not shown in the figure, I can tell you that much of its time is spent on graph assembly and pairHMM alignment of haplotypes. Note that here HaplotypeCaller is writing out a GVCF file; if you were to run HaplotypeCaller in regular mode (not using ERC GVCF) you would see a shorter period of I/O flatline because the variants-only VCF output amounts to a much smaller file.

### What's next?

If the above made you want to know more, head on over to Intel's Health-IT website and get the full white paper.

As I mentioned this is only the first pass in an ongoing project. The next step is going to involve implementing the joint genotyping and filtering with VQSR for the WGS trio, as well as profiling the equivalent exome pipeline on a cohort of ~30 exomes.

You may have noticed that this first pass was done one a single (albeit multi-core) machine; this was done on purpose to provide a baseline for end-to-end execution with the simplest configuration. Our friends at Intel will be looking at multi-machine setup in a future iteration, and for our part we'll have some new developments for you on the pipelining front soon -- so stay tuned to this blog or follow @gatk_dev on Twitter!

Created 2014-03-18 00:36:21 | Updated 2014-03-20 14:10:47 | Tags: release performance version-highlights topstory pairhmm

This may seem crazy considering we released the big 3.0 version not two weeks ago, but yes, we have a new version for you already! It's a bit of a special case because this release is all about the hardware-based optimizations we had previously announced. What we hadn't announced yet was that this is the fruit of a new collaboration with a team at Intel (which you can read more about here), so we were waiting for everyone to be ready for the big reveal.

### Intel inside GATK

So basically, the story is that we've started collaborating with the Intel Bio Team to enable key parts of the GATK to run more efficiently on certain hardware configurations. For our first project together, we tackled the PairHMM algorithm, which is responsible for a large proportion of the runtime of HaplotypeCaller analyses. The resulting optimizations, which are the main feature in version 3.1, produce significant speedups for HaplotypeCaller runs on a wide range of hardware.

We will continue working with Intel to further improve the performance of GATK tools that have historically been afflicted with performance issues and long runtimes (hello BQSR). As always, we hope these new features will make your life easier, and we welcome your feedback in the forum!

### In practice

Note that these optimizations currently work on Linux systems only, and will not work on Mac or Windows operating systems. In the near future we will add support for Mac OS. We have no plans to add support for Windows since the GATK itself does not run on Windows.

Please note also that to take advantage of these optimizations, you need to opt-in by adding the following flag to your GATK command: -pairHMM VECTOR_LOGLESS_CACHING.

Here is a handy little table of the speedups you can expect depending on the hardware and operating system you are using. The configurations given here are the minimum requirements for benefiting from the expected speedup ranges shown in the third column. Keep in mind that these numbers are based on tests in controlled conditions; in the wild, your mileage may vary.

Linux kernel version Architecture / Processor Expected speedup Instruction set
Any 64-bit Linux Any x86 64-bit 1-1.5x Non-vector
Linux 2.6 or newer Penryn (Core 2 or newer) 1.3-1.8x SSE 4.1
Linux 2.6.30 or newer SandyBridge (i3, i5, i7, Xeon E3, E5, E7 or newer) 2-2.5x AVX

To find out exactly which processor is in your machine, you can run this command in the terminal:

$cat /proc/cpuinfo | grep "model name" model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz In this example, the machine has 4 cores (8-threads), so you see the answer 8 times. With the model name (here i7-2600) you can look up your hardware's relevant capabilities in the Wikipedia page on vector extensions. Alternatively, Intel has provided us with some links to lists of processors categorized by architecture, in which you can look up your hardware: #### Penryn processors #### Sandy Bridge processors Finally, a few notes to clarify some concepts regarding Linux kernels vs. distributions and processors vs. architectures: • SandyBridge and Penryn are microarchitectures; essentially, these are sets of instructions built into the CPU. Core 2, core i3, i4, i7, Xeon e3, e5, e7 are the processors that will implement a specific architecture to make use of the relevant improvements (see table above). • The Linux kernel has no connection with Linux distribution (e.g. Ubuntu, RedHat etc). Any distribution can use any kernel they want. There are "default kernels" shipped with each distribution, but that's beyond the scope of this article to cover (there are at least 300 Linux distributions out there). But you can always install whatever kernel version you want. • The kernel version 2.6.30 was released in 2009, so we expect every sane person or IT out there to be using something better than this. Created 2014-03-18 00:22:21 | Updated 2014-03-20 13:50:00 | Tags: performance hardware We're very excited to announce that we have started collaborating with a team from Intel (yep, that Intel) to optimize key parts of the GATK code to make it run faster. The first fruits of this collaboration --a set of hardware-based optimizations for the PairHMM algorithm in HaplotypeCaller-- are available as of today in version 3.1 of the GATK. Please see the release notes and version highlights in the Version History section of the Guide for details. Of course this is only the beginning, and we're looking forward to delivering more performance improvements for various other GATK tools moving forward as part of this collaboration. What's really cool is that this collaboration extends beyond our little GATK team; the Intel Bio Team is also going to be working with other groups at the Broad Institute to make their software run faster as well, all with the goal of accelerating scientific research and discovery. For more details and background information, see the Bio-IT World story here: http://www.bio-itworld.com/2014/3/20/broad-intel-announce-speed-improvements-gatk-powered-by-intel-optimizations.html Created 2016-02-11 16:22:24 | Updated | Tags: performance Dear GATK Developers, Thank your for maintaining such a valuable tool. I noticed a performance bottleneck with newer GATK versions (I'm on the latest version right now v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56) where (simple) analysis types such as SelectVariants or VariantsToTable would run for days when they would previously complete in less than an hour. I have to add that I am using bgzip-compressed and tabix-indexed VCF files throughout the analysis and that I see the performance problem on our cluster running several processes in parallel. Most Java processes would show a rather low percentage of CPU consumption, while our cluster file system maxes out at around 100,000 read requests per second. Using 'strace' of an active Java thread, I noticed that millions of "pread" system calls of 183 bytes against the reference sequence *.fasta file seem to be issued.  [ ~]$ strace -p 28395
Process 28395 attached - interrupt to quit
pread(11, "TGGTAAAATGTGGTTGGATGAAGCGTACGCTT"..., 183, 46038724) = 183
pread(11, "TAGACCATTCTATCAAAATGCTCTTTCTACAG"..., 183, 46038907) = 183
...

The question is, what am I doing wrong? Alternatively, are you aware of any changes to how the reference sequence FASTA file is read in recent GATK versions especially in the context of bgzip-compressed and tabix-indexed VCF files. Are you aware that so many read requests are issued? I tried to go back to version v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 and although the "pread" requests seem to include larger regions, the run time of a single process seems similar. I would have to go back further, but you may be quicker in doing so.

Suggested by our system administrator, I managed to overcome the performance bottleneck by copying the reference sequence .fasta, .fasta.fai and *.dict files into the Linux shared memory tmpfs file system (/dev/shm/). As an example, VariantsToTable run times have fallen from 5 hours to just 31 seconds (13h vs 74s per 1M sites), SelectVariants from 5 hours to 18 minutes (26m vs 87s per 1M sites). In case I'm not the only one seeing this, would it make sense to read (and keep in memory) larger chunks of the reference sequence?

Command lines, log files etc. would be available, of course.

Thanks, Michael

Created 2015-07-26 17:04:49 | Updated | Tags: performance java threads top

Hi,

first of all, yes, I am aware of this post ( http://gatkforums.broadinstitute.org/discussion/comment/20469/#Comment_20469 ) and its recommendation of using -XX:ParallelGCThreads.

I have been testing my pipeline to run HaplotypeCaller and, through testing, I have found java/GATK opening too many ¿unnecessary? threads. This leads to increased CPU metrics on our cluster and a not-so-happy sysadmin.

An example of my command is as follows:

java -Xmx20G -XX:ParallelGCThreads=${gct} -Djava.io.tmpdir=${TMPDIR} -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R ${ref} \ -I${input} \ -L ${reg} \ -ERC GVCF \ -nct${nct} \ --genotyping_mode DISCOVERY \ -stand_emit_conf 10 \ -stand_call_conf 30 \ -o ${name}.raw_variants.annotated.g.vcf \ -A QualByDepth -A RMSMappingQuality -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A StrandOddsRatio -A Coverage -A InbreedingCoeff For example, when running this in the cluster with nct=1 and ParallelGCThreads=4, top shows the following:$ top -c -u theredia -H -b -n1
top - 20:25:40 up 15 days, 7:00, 1 user, load average: 5.95, 5.41, 5.15 Tasks: 520 total, 7 running, 513 sleeping, 0 stopped, 0 zombie Cpu(s): 11.3%us, 0.1%sy, 0.1%ni, 88.4%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 99192688k total, 10184220k used, 89008468k free, 886776k buffers Swap: 10239992k total, 0k used, 10239992k free, 800664k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29376 theredia 20 0 21.9g 657m 12m R 97.0 0.7 2:13.99 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 29377 theredia 20 0 21.9g 657m 12m S 1.9 0.7 0:00.79 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 29378 theredia 20 0 21.9g 657m 12m S 1.9 0.7 0:00.79 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 29379 theredia 20 0 21.9g 657m 12m S 1.9 0.7 0:00.79 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 30210 theredia 20 0 17424 1504 868 R 1.9 0.0 0:00.01 top -c -u theredia -H -b -n1
29375 theredia 20 0 21.9g 657m 12m S 0.0 0.7 0:00.00 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 29380 theredia 20 0 21.9g 657m 12m S 0.0 0.7 0:00.77 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 29381 theredia 20 0 21.9g 657m 12m S 0.0 0.7 0:00.35 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 29382 theredia 20 0 21.9g 657m 12m S 0.0 0.7 0:00.03 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 29383 theredia 20 0 21.9g 657m 12m S 0.0 0.7 0:00.05 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 29384 theredia 20 0 21.9g 657m 12m S 0.0 0.7 0:00.00 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 29385 theredia 20 0 21.9g 657m 12m S 0.0 0.7 0:12.40 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 29386 theredia 20 0 21.9g 657m 12m S 0.0 0.7 0:10.34 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 29387 theredia 20 0 21.9g 657m 12m S 0.0 0.7 0:00.00 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 29388 theredia 20 0 21.9g 657m 12m S 0.0 0.7 0:00.06 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 29429 theredia 20 0 21.9g 657m 12m S 0.0 0.7 0:00.00 java -Xmx20G -XX:ParallelGCThreads=4 -Djava.io.tmpdir=/tmp/16535.1.lab_dcexs.q -jar /aplic/noarch/GATK/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller 29747 theredia 20 0 112m 1968 888 S 0.0 0.0 0:00.02 sshd: theredia@pts/0
29748 theredia 20 0 114m 6060 1460 S 0.0 0.0 0:00.05 -bash

All in all, a count of the number of threads tested is:

12-core node: nct=1 GC=NO : 21 nct=1 GC=4 : 15 nct=1 GC=8 : 19 nct=1 GC=12 : 23 nct=4 GC=NO : 26 nct=4 GC=4 : 20

20-core node: nct=1 GC=NO : 26 nct=1 GC=4 : 15 nct=1 GC=8 : 19 nct=1 GC=12 : 23 nct=1 GC=15 : 26 nct=1 GC=20 : 31 nct=4 GC=NO : 31 nct=4 GC=4 : 20 nct=4 GC=8 : 24 nct=8 GC=4 : 24 nct=8 GC=8 : 28

In order to run the calling pipeline for the chromosome 1 of a single 40x sample, I have to request -Xmx20G and ask SunGridEngine to reserve 4 threads for me, even when requesing -nct=1 and parallelGCThreads=4. Any request lower than this ends up with the job stalled at some point with progress reporter stuck at the same point for 3 days before I have to kill the job.

Our cluster has core-binding enabled. Thus, if I request only 1 thread/core, but GATK opens 15/20 threads, all these 19 extra threads have to fight the main thread for cpu time and the main process slows down a lot.

Does GATK really need this amount of resources? I there a way I can further reduce the amount of threads java/GATK is spawning? Our sysadmins are not happy because this rises the cpu load levels of the involved nodes and rises several alarms with no need to.

Txema

Created 2015-02-13 08:25:46 | Updated 2015-02-13 08:26:24 | Tags: bqsr printreads performance material

Hi, Here I have no problem :)

I ran a bqsr workflow on 2 set of data that were almost similare (same study, 2 indiv.), on 2 different machines.

Here are my observations : set1 : Dell powerEdge R820, with 128Go Ram, printReads ran in 7h hours with ~35Go ram used. set2 : Dell powerEdge R730 with 193Go ram (and many jobs already running) printReads ran in 1h20 min with ~35Go ram used.

Both sets were launched with the exact same command : -nct 20 set2 output is a little bigger (16Go vs 14Go for set1).

I'd like to know if this huge difference in duration is mostly due to materials, and if it is, which part ? CPU speed, cache, ram access ?

Thanks

Created 2014-06-10 20:57:36 | Updated | Tags: variantannotator best-practices vcf developer performance walkers java gatk gatk-walkers

Hi,

I was wondering if there is a nice way to apply multiple processing steps to each variant (or a group of variants) as they are read so that the variant file is not read again and again. My understanding is that even if I use Queue, each script would read the vcf again. Is that correct?

Created 2014-06-03 16:39:33 | Updated | Tags: haplotypecaller queue performance

Hello,

I adapted the example UnifiedGenotyper script to run HaplotypeCaller. Unfortunately, so far I haven't been able to get it to use multiple CPUs. I am using a server with 64 cores (AMD Opteron) and 512 gb RAM. Invoking the -nct option made no difference either. The example job is a highly heterogeneous set of 6 samples with a 270 mbp reference. Predicted runtime: 6 weeks (and we will need to genotype batches of up to 40 samples...). The script creates ten output directories (/.queue/scatterGather/hc-1-sg/temp_01_of_10, etc), but files are only produced in one of them. The local guru cannot spot the problem either. I attach the script - I will appreciate some feedback.

As a sidenote, does the memoryLimit refer to the entire process, or is it per core?

Many thanks, Chris

# Command:

/queue -S hc.scala -R /whale-data/kk443/mapping/scaffolds_cinnabar_W.fasta -I test_bams.list -NCT 4 -SCC 20 -MBQ 20 -H 0.1 -IH 0.001 -O test2.10cpu.4nct.output.vcf -run

# Script:

//kk443 AT cam.ac.uk 30/05/2014 Script to run the GATK HaplotypeCaller with Queue parallelism.

class ExampleUnifiedGenotyper extends QScript {

qscript =>

// Required arguments. All initialized to empty values.

@Input(doc="The reference file for the bam files.", shortName="R") var referenceFile: File = // is scala shorthand for null

@Input(doc="Bam file to genotype.", shortName="I") var bamFile: File = _

@Argument(shortName="H") var hets: Float = _

@Argument(shortName="IH") var indelHeterozygosity: Float = _

@Output(shortName="O") var o: File = _

// The following arguments are all optional.

@Argument(shortName="SCC", required=false) var stand_callconf: Int =

@Argument(shortName="MBQ", required=false) var mbq: Int = _

@Argument(shortName="NCT", required=false) var nct: Int = _

// This trait allows us set the variables below in one place, // and then reuse this trait on each CommandLineGATK function below. trait HaplotypeCallerArguments extends CommandLineGATK { this.reference_sequence = qscript.referenceFile this.intervals = if (qscript.intervals == null) Nil else List(qscript.intervals) // Set the memory limit to 2 gigabytes on each command. this.memoryLimit = 40 }

def script() { // Create the four functions that we may run depending on options. val genotyper = new HaplotypeCaller with HaplotypeCallerArguments

genotyper.scatterCount = 10
genotyper.input_file :+= qscript.bamFile
genotyper.out = swapExt(qscript.bamFile, "bam", "unfiltered.vcf")

add(genotyper)

} }

# StdOut:

INFO 23:27:36,528 QScriptManager - Compiling 1 QScript INFO 23:27:44,663 QScriptManager - Compilation complete INFO 23:27:44,806 HelpFormatter - ---------------------------------------------------------------------- INFO 23:27:44,806 HelpFormatter - Queue v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:09 INFO 23:27:44,807 HelpFormatter - Copyright (c) 2012 The Broad Institute INFO 23:27:44,807 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 23:27:44,808 HelpFormatter - Program Args: -S hc.scala -R /whale-data/kk443/mapping/scaffolds_cinnabar_W.fasta -I test_bams.list -NCT 2 -SCC 20 -MBQ 20 -H 0.1 -IH 0.001 -O test4.20cpu.2nct.200GB.output.vcf -run INFO 23:27:44,809 HelpFormatter - Executing as kk443@bonobo--bio on Linux 3.2.0-60-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_25-b15. INFO 23:27:44,809 HelpFormatter - Date/Time: 2014/06/01 23:27:44 INFO 23:27:44,810 HelpFormatter - ---------------------------------------------------------------------- INFO 23:27:44,810 HelpFormatter - ---------------------------------------------------------------------- INFO 23:27:44,818 QCommandLine - Scripting ExampleUnifiedGenotyper INFO 23:27:44,969 QCommandLine - Added 1 functions INFO 23:27:44,971 QGraph - Generating graph. INFO 23:27:44,994 QGraph - Generating scatter gather jobs. INFO 23:27:45,058 QGraph - Removing original jobs. INFO 23:27:45,061 QGraph - Adding scatter gather jobs. INFO 23:27:45,282 QGraph - Regenerating graph. INFO 23:27:45,362 QGraph - Running jobs. INFO 23:27:45,512 FunctionEdge - Starting: 'java' '-Xmx40960m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/whale-data/kk443/mapping/AAAfinished/.queue/tmp' '-cp' '/whale-data/biosoft/src/Queue-3.1-1/Queue.jar' 'org.broadinstitute.sting.gatk.CommandLineGATK' '-T' 'HaplotypeCaller' '-I' '/whale-data/kk443/mapping/AAAfinished/test_bams.list' '-L' '/whale-data/kk443/mapping/AAAfinished/.queue/scatterGather/hc-1-sg/temp_01_of_10/scatter.intervals' '-R' '/whale-data/kk443/mapping/scaffolds_cinnabar_W.fasta' '-o' '/whale-data/kk443/mapping/AAAfinished/.queue/scatterGather/hc-1-sg/temp_01_of_10/test_bams.listunfiltered.vcf' INFO 23:27:45,512 FunctionEdge - Output written to /whale-data/kk443/mapping/AAAfinished/.queue/scatterGather/hc-1-sg/temp_01_of_10/test_bams.listunfiltered.vcf.out

Created 2014-03-26 15:09:35 | Updated | Tags: haplotypecaller performance combinegvcfs

I've got 300 gvcfs as a results of a Queue pipeline, that I want to combine. When I run CombineGVCFs (GATK v3.1-1) this however seems fairly slow:

INFO  15:24:22,100 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  15:57:52,778 ProgressMeter -      1:11456201        1.10e+07   33.5 m        3.0 m      0.4%         6.4 d     6.3 d
INFO  15:58:52,780 ProgressMeter -      1:11805001        1.10e+07   34.5 m        3.1 m      0.4%         6.4 d     6.3 d
INFO  15:59:52,781 ProgressMeter -      1:12140201        1.20e+07   35.5 m        3.0 m      0.4%         6.4 d     6.3 d 

Is there a way of improving the performance of this merge? 6 days seems like a lot, but of course not unfeasible. Likewise, what kind of performance could I expect in the GenotypeGVCFs step?

Created 2014-03-14 22:24:41 | Updated | Tags: haplotypecaller performance multi-sample

our group is calling variants using 5.5mb of targeted capture across ~2400 bams (for 830 subjects). i have successfully executed using unifiedgenotyper, but we would also like to run using haplotypecaller to compare. my basic approach has been to scatter the capture region into 25 equal sized chunks, run parallel and then CatVariants to gather afterward. this worked nicely for UG, but for HC estimated run-times for even these small intervals are >7 days and the jobs inevitably crash due to memory problems.

so, my question is how should i approach this large cohort problem in HC? i have read HC works best in cohorts <100, so should i divide up my cohort into chunks of 100 bams and combine afterward? seems this would undermine the benefits of joint calling, no? i have also read that GATK 3.0 is going to address this issue with HC -- have these improvements been made with the current version (which I am not currently using)? if so, is there a best practice or example available anywhere with cmd specs for situations similar to mine? finally, is it ok to split my capture region and run in parallel as i have done? this doesn't undermine the benefits of de novo local assembly by HC? thanks in advance for any input!

Created 2013-10-29 14:44:59 | Updated | Tags: unifiedgenotyper performance runtime

Hello there,

I ran GATK's UnifiedGenotyper of version 2.7-4 (SNP calling) with multiple data sets that where growing in size (10 sets ranging from 8.5 GB to 88GB) with 36 cpu threads and 8 GB heap assigned. I noticed that the UnifiedGenotyper performed much better than linear with a runtime increase of 4.6 times for the largest data set, although this was 10 times larger than the smallest. Now I wonder how this is achieved by GATK? What part of the processing makes it scale so well to be even better than linear here? Is it the SNP calling that is so performant, or how the data is preprocessed/filtered?

Best,

Cindy

Created 2013-10-29 14:38:24 | Updated | Tags: unifiedgenotyper performance multithreading runtime

Hello there,

I have run GATK's UnifiedGenotyper tool with different thread sizes (9, 18, 36 on a 40-core machine) in order to see how well it scales for an increasing thread size. I first tried out UnifiedGenotyper of version 1.7 and set -nt to 9, 18, and 36 with 8GB heap assigned for a 88GB large BAM file, respectively. When looking at the results, I noticed that runtime performance increased when using more threads, so I updated GATK to version 2.7-4 in order to check if this behavior has vanished with the newer version. I also noticed then that there exists another thread-parameter for the number of cpu threads.

So, I ran GATK's UnifiedGenotyper version 2.7-4 for a 88GB large file on a 40-core linux machine with -nt 1, -nct 9/18/36, and assigned 8GB heap. The behavior in runtime performance remained the same. What am I missing here?

Best,

Cindy

Created 2013-09-26 20:53:59 | Updated | Tags: unifiedgenotyper performance

Hi,

In the below log file of a UnifiedGenotyper run, the run made steady progress until about 95% completion, at which point the program keeps running for 3 hours. There has not been any new output to the log file or the vcf file.

In other runs, I've also seen Java GC errors.

Just wondering if something is wrong?

top reveals that the program is still working very hard.

18824 hbigelow 20 0 35.7g 33g 5452 R 94.4 52.9 167:27.83 java
18818 hbigelow 20 0 35.7g 33g 5452 R 93.7 52.9 167:44.83 java
18820 hbigelow 20 0 35.7g 33g 5452 R 93.7 52.9 167:28.44 java
18821 hbigelow 20 0 35.7g 33g 5452 R 93.7 52.9 167:24.83 java
18822 hbigelow 20 0 35.7g 33g 5452 R 93.7 52.9 167:25.96 java
18823 hbigelow 20 0 35.7g 33g 5452 R 93.7 52.9 167:26.57 java
18817 hbigelow 20 0 35.7g 33g 5452 R 93.4 52.9 167:26.73 java
18825 hbigelow 20 0 35.7g 33g 5452 R 93.4 52.9 167:24.03 java
18819 hbigelow 20 0 35.7g 33g 5452 R 93.1 52.9 167:27.78 java
18816 hbigelow 20 0 35.7g 33g 5452 R 92.7 52.9 167:33.16 java
18826 hbigelow 20 0 35.7g 33g 5452 S 5.6 52.9 4:34.92 java

INFO 08:23:21,511 HelpFormatter - --------------------------------------------------------------------------------
INFO 08:23:21,512 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.1-3-ge1dbcc8, Compiled 2012/08/22 18:12:18
INFO 08:23:21,513 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 08:23:21,516 HelpFormatter - Date/Time: 2013/09/26 08:23:21
INFO 08:23:21,516 HelpFormatter - --------------------------------------------------------------------------------
INFO 08:23:21,516 HelpFormatter - --------------------------------------------------------------------------------
INFO 08:23:21,554 ArgumentTypeDescriptor - Dynamically determined type of /dev/shm/dbsnp_135.hg19.reordered.vcf to be VCF
INFO 08:23:21,567 GenomeAnalysisEngine - Strictness is SILENT
INFO 08:23:21,601 SAMDataSource - Running in asynchronous I/O mode; number of threads = 2
INFO 08:23:21,604 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 08:23:21,814 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.21 s, 50 of 113 in 0.21 s / 0.00 m (240.38 tasks/s). 63 remaining with est. completion in 0.26 s / 0.00 m
INFO 08:23:21,985 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.17 s, 100 of 113 in 0.38 s / 0.01 m (263.16 tasks/s). 13 remaining with est. completion in 0.05 s / 0.00 m INFO 08:23:22,029 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.43
INFO 08:23:22,149 RMDTrackBuilder - Loading Tribble index from disk for file /dev/shm/dbsnp_135.hg19.reordered.vcf
INFO 08:23:22,285 MicroScheduler - Running the GATK in parallel mode with 10 concurrent threads
INFO 08:23:36,786 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 08:23:36,860 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.07 s, 50 of 113 in 0.07 s / 0.00 m (675.68 tasks/s). 63 remaining with est. completion in 0.09 s / 0.00 m
INFO 08:23:36,921 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.06 s, 100 of 113 in 0.14 s / 0.00 m (740.74 tasks/s). 13 remaining with est. completion in 0.02 s / 0.00 m INFO 08:23:36,943 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.16
INFO 08:23:36,980 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 08:23:37,091 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.11 s, 50 of 113 in 0.11 s / 0.00 m (454.55 tasks/s). 63 remaining with est. completion in 0.14 s / 0.00 m
INFO 08:23:37,186 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.09 s, 100 of 113 in 0.21 s / 0.00 m (487.80 tasks/s). 13 remaining with est. completion in 0.03 s / 0.00 m INFO 08:23:37,202 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.22
INFO 08:23:37,211 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 08:23:37,281 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.07 s, 50 of 113 in 0.07 s / 0.00 m (724.64 tasks/s). 63 remaining with est. completion in 0.09 s / 0.00 m
INFO 08:23:37,337 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.06 s, 100 of 113 in 0.13 s / 0.00 m (793.65 tasks/s). 13 remaining with est. completion in 0.02 s / 0.00 m INFO 08:23:37,354 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.14
INFO 08:23:37,362 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 08:23:37,418 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.06 s, 50 of 113 in 0.06 s / 0.00 m (909.09 tasks/s). 63 remaining with est. completion in 0.07 s / 0.00 m
INFO 08:23:37,468 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.05 s, 100 of 113 in 0.11 s / 0.00 m (943.40 tasks/s). 13 remaining with est. completion in 0.01 s / 0.00 m INFO 08:23:37,484 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.12
INFO 08:23:37,492 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 08:23:37,552 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.06 s, 50 of 113 in 0.06 s / 0.00 m (847.46 tasks/s). 63 remaining with est. completion in 0.07 s / 0.00 m
INFO 08:23:37,605 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.05 s, 100 of 113 in 0.11 s / 0.00 m (884.96 tasks/s). 13 remaining with est. completion in 0.01 s / 0.00 m INFO 08:23:37,629 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.14
INFO 08:23:37,638 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 08:23:37,710 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.07 s, 50 of 113 in 0.07 s / 0.00 m (704.23 tasks/s). 63 remaining with est. completion in 0.09 s / 0.00 m
INFO 08:23:37,775 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.06 s, 100 of 113 in 0.14 s / 0.00 m (735.29 tasks/s). 13 remaining with est. completion in 0.02 s / 0.00 m INFO 08:23:37,794 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.16
INFO 08:23:37,800 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 08:23:37,863 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.06 s, 50 of 113 in 0.06 s / 0.00 m (793.65 tasks/s). 63 remaining with est. completion in 0.08 s / 0.00 m
INFO 08:23:37,932 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.07 s, 100 of 113 in 0.13 s / 0.00 m (763.36 tasks/s). 13 remaining with est. completion in 0.02 s / 0.00 m INFO 08:23:37,946 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.15
INFO 08:23:37,955 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 08:23:38,024 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.07 s, 50 of 113 in 0.07 s / 0.00 m (724.64 tasks/s). 63 remaining with est. completion in 0.09 s / 0.00 m
INFO 08:23:38,083 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.06 s, 100 of 113 in 0.13 s / 0.00 m (781.25 tasks/s). 13 remaining with est. completion in 0.02 s / 0.00 m INFO 08:23:38,100 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.15
INFO 08:23:38,111 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 08:23:38,172 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.06 s, 50 of 113 in 0.06 s / 0.00 m (833.33 tasks/s). 63 remaining with est. completion in 0.08 s / 0.00 m
INFO 08:23:38,223 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.05 s, 100 of 113 in 0.11 s / 0.00 m (892.86 tasks/s). 13 remaining with est. completion in 0.01 s / 0.00 m INFO 08:23:38,238 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.13
INFO 08:23:40,277 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 08:23:40,277 TraversalEngine - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 08:23:41,846 RMDTrackBuilder - Loading Tribble index from disk for file /dev/shm/dbsnp_135.hg19.reordered.vcf
INFO 08:23:41,906 RMDTrackBuilder - Loading Tribble index from disk for file /dev/shm/dbsnp_135.hg19.reordered.vcf
INFO 08:23:42,095 RMDTrackBuilder - Loading Tribble index from disk for file /dev/shm/dbsnp_135.hg19.reordered.vcf
INFO 08:23:42,756 RMDTrackBuilder - Loading Tribble index from disk for file /dev/shm/dbsnp_135.hg19.reordered.vcf
INFO 08:23:42,815 RMDTrackBuilder - Loading Tribble index from disk for file /dev/shm/dbsnp_135.hg19.reordered.vcf
INFO 08:23:42,876 RMDTrackBuilder - Loading Tribble index from disk for file /dev/shm/dbsnp_135.hg19.reordered.vcf
INFO 08:23:42,940 RMDTrackBuilder - Loading Tribble index from disk for file /dev/shm/dbsnp_135.hg19.reordered.vcf
INFO 08:23:43,003 RMDTrackBuilder - Loading Tribble index from disk for file /dev/shm/dbsnp_135.hg19.reordered.vcf
INFO 08:23:43,071 RMDTrackBuilder - Loading Tribble index from disk for file /dev/shm/dbsnp_135.hg19.reordered.vcf
INFO 08:24:26,712 TraversalEngine - chr6:30116458 2.67e+03 49.9 s 5.2 h 0.4% 3.6 h 3.6 h
INFO 08:24:57,294 TraversalEngine - chr6:30151544 4.98e+03 80.5 s 4.5 h 0.5% 4.4 h 4.4 h
INFO 08:25:27,414 TraversalEngine - chr6:30078455 7.94e+04 110.6 s 23.2 m 0.3% 11.8 h 11.7 h
INFO 08:26:01,304 TraversalEngine - chr6:30207424 1.58e+05 2.4 m 15.2 m 0.7% 5.8 h 5.8 h
INFO 08:26:31,314 TraversalEngine - chr6:30240151 2.07e+05 2.9 m 14.0 m 0.8% 6.1 h 6.0 h
INFO 08:27:01,327 TraversalEngine - chr6:30442757 2.95e+05 3.4 m 11.5 m 1.5% 3.8 h 3.8 h
INFO 08:27:31,499 TraversalEngine - chr6:30449352 3.67e+05 3.9 m 10.6 m 1.5% 4.4 h 4.3 h
INFO 08:28:02,353 TraversalEngine - chr6:30591358 4.60e+05 4.4 m 9.6 m 2.0% 3.7 h 3.7 h
INFO 08:28:33,020 TraversalEngine - chr6:30704038 5.73e+05 4.9 m 8.6 m 2.3% 3.5 h 3.4 h
INFO 08:29:03,166 TraversalEngine - chr6:30757202 7.08e+05 5.4 m 7.7 m 2.5% 3.6 h 3.5 h
INFO 08:29:33,511 TraversalEngine - chr6:30875948 8.10e+05 5.9 m 7.3 m 2.9% 3.4 h 3.3 h
INFO 08:30:04,560 TraversalEngine - chr6:30930472 8.98e+05 6.5 m 7.2 m 3.1% 3.5 h 3.4 h
INFO 08:30:34,561 TraversalEngine - chr6:31119846 1.01e+06 7.0 m 6.9 m 3.7% 3.1 h 3.0 h
INFO 08:31:04,608 TraversalEngine - chr6:31129141 1.11e+06 7.5 m 6.7 m 3.8% 3.3 h 3.2 h
INFO 08:31:35,056 TraversalEngine - chr6:31267895 1.19e+06 8.0 m 6.7 m 4.2% 3.1 h 3.0 h
INFO 08:32:06,800 TraversalEngine - chr6:31394444 1.25e+06 8.5 m 6.8 m 4.6% 3.0 h 2.9 h
INFO 08:32:37,026 TraversalEngine - chr6:31434315 1.32e+06 9.0 m 6.8 m 4.8% 3.1 h 3.0 h
INFO 08:33:07,053 TraversalEngine - chr6:31468931 1.44e+06 9.5 m 6.6 m 4.9% 3.2 h 3.1 h
INFO 08:33:37,148 TraversalEngine - chr6:31640746 1.53e+06 10.0 m 6.6 m 5.5% 3.0 h 2.9 h
INFO 08:34:07,255 TraversalEngine - chr6:31732803 1.70e+06 10.5 m 6.2 m 5.8% 3.0 h 2.9 h
INFO 08:34:37,324 TraversalEngine - chr6:31949427 1.82e+06 11.0 m 6.1 m 6.5% 2.8 h 2.6 h
INFO 08:35:07,332 TraversalEngine - chr6:32038748 1.96e+06 11.5 m 5.9 m 6.8% 2.8 h 2.6 h
INFO 08:35:38,115 TraversalEngine - chr6:32260917 2.11e+06 12.0 m 5.7 m 7.5% 2.7 h 2.5 h
INFO 08:36:08,304 TraversalEngine - chr6:32299837 2.20e+06 12.5 m 5.7 m 7.7% 2.7 h 2.5 h
INFO 08:36:38,418 TraversalEngine - chr6:32426140 2.31e+06 13.0 m 5.6 m 8.1% 2.7 h 2.5 h
INFO 08:37:08,567 TraversalEngine - chr6:32434978 2.40e+06 13.5 m 5.6 m 8.1% 2.8 h 2.6 h
INFO 08:37:38,614 TraversalEngine - chr6:32420173 2.44e+06 14.0 m 5.8 m 8.1% 2.9 h 2.7 h
INFO 08:38:08,935 TraversalEngine - chr6:32589965 2.46e+06 14.5 m 5.9 m 8.6% 2.8 h 2.6 h
INFO 08:38:39,144 TraversalEngine - chr6:32640862 2.49e+06 15.0 m 6.0 m 8.8% 2.8 h 2.6 h
INFO 08:39:10,577 TraversalEngine - chr6:32518967 2.54e+06 15.6 m 6.1 m 8.4% 3.1 h 2.8 h
INFO 08:39:41,771 TraversalEngine - chr6:32553892 2.57e+06 16.1 m 6.3 m 8.5% 3.1 h 2.9 h
INFO 08:40:12,056 TraversalEngine - chr6:32682675 2.62e+06 16.6 m 6.3 m 8.9% 3.1 h 2.8 h
INFO 08:40:42,555 TraversalEngine - chr6:32785968 2.64e+06 17.1 m 6.5 m 9.3% 3.1 h 2.8 h
INFO 08:41:20,092 TraversalEngine - chr6:32795824 2.70e+06 17.7 m 6.6 m 9.3% 3.2 h 2.9 h
INFO 08:41:50,291 TraversalEngine - chr6:32809878 2.74e+06 18.2 m 6.6 m 9.4% 3.2 h 2.9 h

(i've cut a few lines here to be able to post this message -- otherwise it is too long)

INFO 08:58:38,847 TraversalEngine - chr6:37809937 7.74e+06 35.0 m 4.5 m 26.0% 2.2 h 99.5 m
INFO 08:59:08,943 TraversalEngine - chr6:38032669 7.90e+06 35.5 m 4.5 m 26.8% 2.2 h 97.2 m
INFO 08:59:39,215 TraversalEngine - chr6:38178709 8.05e+06 36.0 m 4.5 m 27.3% 2.2 h 96.2 m
INFO 09:00:09,319 TraversalEngine - chr6:38223072 8.21e+06 36.5 m 4.5 m 27.4% 2.2 h 96.8 m
INFO 09:00:39,452 TraversalEngine - chr6:38433658 8.37e+06 37.0 m 4.4 m 28.1% 2.2 h 94.7 m
INFO 09:01:13,716 TraversalEngine - chr6:38590529 8.52e+06 37.6 m 4.4 m 28.6% 2.2 h 93.7 m
INFO 09:01:43,802 TraversalEngine - chr6:38675790 8.61e+06 38.1 m 4.4 m 28.9% 2.2 h 93.7 m
INFO 09:02:13,922 TraversalEngine - chr6:38912999 8.78e+06 38.6 m 4.4 m 29.7% 2.2 h 91.4 m
INFO 09:02:43,946 TraversalEngine - chr6:38924222 8.89e+06 39.1 m 4.4 m 29.7% 2.2 h 92.4 m
INFO 09:03:14,210 TraversalEngine - chr6:38976501 9.03e+06 39.6 m 4.4 m 29.9% 2.2 h 92.8 m
INFO 09:03:44,460 TraversalEngine - chr6:39186721 9.10e+06 40.1 m 4.4 m 30.6% 2.2 h 90.9 m
INFO 09:04:14,581 TraversalEngine - chr6:39286832 9.25e+06 40.6 m 4.4 m 31.0% 2.2 h 90.6 m
INFO 09:04:44,648 TraversalEngine - chr6:39513555 9.40e+06 41.1 m 4.4 m 31.7% 2.2 h 88.6 m
INFO 09:05:14,890 TraversalEngine - chr6:39643835 9.53e+06 41.6 m 4.4 m 32.1% 2.2 h 87.9 m
INFO 09:05:45,112 TraversalEngine - chr6:39726072 9.68e+06 42.1 m 4.4 m 32.4% 2.2 h 87.8 m
INFO 09:06:15,205 TraversalEngine - chr6:39843217 9.81e+06 42.6 m 4.3 m 32.8% 2.2 h 87.3 m
INFO 09:06:45,235 TraversalEngine - chr6:40035220 9.97e+06 43.1 m 4.3 m 33.5% 2.1 h 85.8 m
INFO 09:07:18,386 TraversalEngine - chr6:40118351 1.01e+07 43.7 m 4.3 m 33.7% 2.2 h 85.9 m
INFO 09:07:48,699 TraversalEngine - chr6:40295716 1.02e+07 44.2 m 4.3 m 34.3% 2.1 h 84.6 m
INFO 09:08:18,770 TraversalEngine - chr6:40443578 1.03e+07 44.7 m 4.3 m 34.8% 2.1 h 83.7 m
INFO 09:08:48,965 TraversalEngine - chr6:40513739 1.05e+07 45.2 m 4.3 m 35.0% 2.1 h 83.8 m
INFO 09:09:19,107 TraversalEngine - chr6:40731644 1.06e+07 45.7 m 4.3 m 35.8% 2.1 h 82.1 m
INFO 09:09:49,166 TraversalEngine - chr6:40746672 1.07e+07 46.2 m 4.3 m 35.8% 2.1 h 82.8 m
INFO 09:10:19,460 TraversalEngine - chr6:40922695 1.08e+07 46.7 m 4.3 m 36.4% 2.1 h 81.6 m
INFO 09:10:49,470 TraversalEngine - chr6:41053193 1.10e+07 47.2 m 4.3 m 36.8% 2.1 h 80.9 m
INFO 09:11:19,503 TraversalEngine - chr6:41258443 1.11e+07 47.7 m 4.3 m 37.5% 2.1 h 79.4 m
INFO 09:11:49,657 TraversalEngine - chr6:41373032 1.13e+07 48.2 m 4.3 m 37.9% 2.1 h 79.0 m
INFO 09:12:19,818 TraversalEngine - chr6:41519483 1.14e+07 48.7 m 4.3 m 38.4% 2.1 h 78.2 m
INFO 09:12:49,994 TraversalEngine - chr6:41607431 1.15e+07 49.2 m 4.3 m 38.7% 2.1 h 78.0 m
INFO 09:13:20,085 TraversalEngine - chr6:41829544 1.17e+07 49.7 m 4.3 m 39.4% 2.1 h 76.4 m
INFO 09:13:51,114 TraversalEngine - chr6:41976835 1.18e+07 50.2 m 4.2 m 39.9% 2.1 h 75.6 m
INFO 09:14:21,198 TraversalEngine - chr6:42042108 1.19e+07 50.7 m 4.3 m 40.1% 2.1 h 75.7 m
INFO 09:14:51,368 TraversalEngine - chr6:42145915 1.20e+07 51.2 m 4.3 m 40.5% 2.1 h 75.3 m
INFO 09:15:21,745 TraversalEngine - chr6:42230347 1.21e+07 51.7 m 4.3 m 40.8% 2.1 h 75.2 m
INFO 09:15:51,769 TraversalEngine - chr6:42312940 1.23e+07 52.2 m 4.3 m 41.0% 2.1 h 75.1 m
INFO 09:16:21,989 TraversalEngine - chr6:42521707 1.24e+07 52.8 m 4.3 m 41.7% 2.1 h 73.6 m
INFO 09:16:52,068 TraversalEngine - chr6:42667582 1.26e+07 53.3 m 4.2 m 42.2% 2.1 h 72.9 m
INFO 09:17:22,213 TraversalEngine - chr6:42596694 1.26e+07 53.8 m 4.3 m 42.0% 2.1 h 74.3 m
INFO 09:17:52,355 TraversalEngine - chr6:42912768 1.28e+07 54.3 m 4.2 m 43.0% 2.1 h 71.8 m
INFO 09:18:23,674 TraversalEngine - chr6:42940249 1.29e+07 54.8 m 4.2 m 43.1% 2.1 h 72.2 m
INFO 09:18:53,795 TraversalEngine - chr6:43164214 1.30e+07 55.3 m 4.2 m 43.9% 2.1 h 70.7 m
INFO 09:19:23,953 TraversalEngine - chr6:43296472 1.32e+07 55.8 m 4.2 m 44.3% 2.1 h 70.1 m
INFO 09:19:58,879 TraversalEngine - chr6:43393448 1.33e+07 56.4 m 4.2 m 44.6% 2.1 h 69.9 m
INFO 09:20:28,923 TraversalEngine - chr6:43584770 1.35e+07 56.9 m 4.2 m 45.3% 2.1 h 68.7 m
INFO 09:20:59,073 TraversalEngine - chr6:43643818 1.36e+07 57.4 m 4.2 m 45.5% 2.1 h 68.8 m
INFO 09:21:32,811 TraversalEngine - chr6:43832842 1.37e+07 57.9 m 4.2 m 46.1% 2.1 h 67.7 m
INFO 09:22:03,108 TraversalEngine - chr6:43929592 1.39e+07 58.4 m 4.2 m 46.4% 2.1 h 67.4 m
INFO 09:22:33,243 TraversalEngine - chr6:44125651 1.40e+07 58.9 m 4.2 m 47.1% 2.1 h 66.2 m
INFO 09:23:03,473 TraversalEngine - chr6:44165849 1.41e+07 59.4 m 4.2 m 47.2% 2.1 h 66.4 m
INFO 09:23:33,496 TraversalEngine - chr6:44273738 1.42e+07 59.9 m 4.2 m 47.6% 2.1 h 66.0 m
INFO 09:24:03,513 TraversalEngine - chr6:44394715 1.43e+07 60.4 m 4.2 m 48.0% 2.1 h 65.5 m
INFO 09:24:35,836 TraversalEngine - chr6:44524782 1.44e+07 61.0 m 4.2 m 48.4% 2.1 h 65.0 m
INFO 09:25:05,908 TraversalEngine - chr6:44593246 1.45e+07 61.5 m 4.2 m 48.6% 2.1 h 64.9 m
INFO 09:25:35,943 TraversalEngine - chr6:44724080 1.47e+07 62.0 m 4.2 m 49.1% 2.1 h 64.3 m
INFO 09:26:06,087 TraversalEngine - chr6:44842595 1.48e+07 62.5 m 4.2 m 49.5% 2.1 h 63.8 m
INFO 09:26:36,236 TraversalEngine - chr6:45061281 1.50e+07 63.0 m 4.2 m 50.2% 2.1 h 62.5 m
INFO 09:27:06,248 TraversalEngine - chr6:45195311 1.51e+07 63.5 m 4.2 m 50.7% 2.1 h 61.9 m
INFO 09:27:36,407 TraversalEngine - chr6:45374247 1.53e+07 64.0 m 4.2 m 51.2% 2.1 h 60.9 m
INFO 09:28:06,485 TraversalEngine - chr6:45429293 1.54e+07 64.5 m 4.2 m 51.4% 2.1 h 60.9 m
INFO 09:28:36,712 TraversalEngine - chr6:45624998 1.55e+07 65.0 m 4.2 m 52.1% 2.1 h 59.8 m
INFO 09:29:06,764 TraversalEngine - chr6:45811826 1.57e+07 65.5 m 4.2 m 52.7% 2.1 h 58.8 m
INFO 09:29:38,307 TraversalEngine - chr6:45959089 1.58e+07 66.0 m 4.2 m 53.2% 2.1 h 58.1 m
INFO 09:30:08,500 TraversalEngine - chr6:46032156 1.59e+07 66.5 m 4.2 m 53.4% 2.1 h 58.0 m
INFO 09:30:38,609 TraversalEngine - chr6:46117802 1.61e+07 67.0 m 4.2 m 53.7% 2.1 h 57.7 m
INFO 09:31:08,773 TraversalEngine - chr6:46356384 1.62e+07 67.5 m 4.2 m 54.5% 2.1 h 56.3 m
INFO 09:31:38,954 TraversalEngine - chr6:46391290 1.63e+07 68.0 m 4.2 m 54.6% 2.1 h 56.5 m
INFO 09:32:09,073 TraversalEngine - chr6:46542734 1.65e+07 68.5 m 4.2 m 55.1% 2.1 h 55.8 m
INFO 09:32:39,250 TraversalEngine - chr6:46762643 1.66e+07 69.0 m 4.2 m 55.9% 2.1 h 54.5 m
INFO 09:33:09,403 TraversalEngine - chr6:46774570 1.68e+07 69.5 m 4.1 m 55.9% 2.1 h 54.8 m
INFO 09:33:39,535 TraversalEngine - chr6:46947944 1.69e+07 70.0 m 4.1 m 56.5% 2.1 h 53.9 m
INFO 09:34:09,839 TraversalEngine - chr6:47156780 1.70e+07 70.6 m 4.1 m 57.2% 2.1 h 52.8 m
INFO 09:34:39,986 TraversalEngine - chr6:47317115 1.72e+07 71.1 m 4.1 m 57.7% 2.1 h 52.0 m
INFO 09:35:09,991 TraversalEngine - chr6:47416650 1.73e+07 71.6 m 4.1 m 58.1% 2.1 h 51.7 m
INFO 09:35:40,106 TraversalEngine - chr6:47471655 1.74e+07 72.1 m 4.1 m 58.2% 2.1 h 51.7 m
INFO 09:36:10,116 TraversalEngine - chr6:47591682 1.76e+07 72.6 m 4.1 m 58.6% 2.1 h 51.2 m
INFO 09:36:40,252 TraversalEngine - chr6:47735236 1.77e+07 73.1 m 4.1 m 59.1% 2.1 h 50.5 m
INFO 09:37:10,376 TraversalEngine - chr6:47882396 1.78e+07 73.6 m 4.1 m 59.6% 2.1 h 49.8 m
INFO 09:37:40,429 TraversalEngine - chr6:48031975 1.80e+07 74.1 m 4.1 m 60.1% 2.1 h 49.2 m
INFO 09:38:10,761 TraversalEngine - chr6:48148047 1.81e+07 74.6 m 4.1 m 60.5% 2.1 h 48.7 m
INFO 09:38:40,815 TraversalEngine - chr6:48358393 1.83e+07 75.1 m 4.1 m 61.2% 2.0 h 47.6 m
INFO 09:39:10,834 TraversalEngine - chr6:48520973 1.84e+07 75.6 m 4.1 m 61.7% 2.0 h 46.8 m
INFO 09:39:40,902 TraversalEngine - chr6:48590730 1.86e+07 76.1 m 4.1 m 62.0% 2.0 h 46.7 m
INFO 09:40:10,940 TraversalEngine - chr6:48771497 1.87e+07 76.6 m 4.1 m 62.6% 2.0 h 45.8 m
INFO 09:40:41,559 TraversalEngine - chr6:48872402 1.88e+07 77.1 m 4.1 m 62.9% 2.0 h 45.4 m
INFO 09:41:11,622 TraversalEngine - chr6:49063500 1.89e+07 77.6 m 4.1 m 63.5% 2.0 h 44.5 m
INFO 09:41:41,801 TraversalEngine - chr6:49192972 1.91e+07 78.1 m 4.1 m 64.0% 2.0 h 44.0 m
INFO 09:42:11,815 TraversalEngine - chr6:49386761 1.93e+07 78.6 m 4.1 m 64.6% 2.0 h 43.0 m
INFO 09:42:41,830 TraversalEngine - chr6:49440520 1.94e+07 79.1 m 4.1 m 64.8% 2.0 h 43.0 m
INFO 09:43:11,960 TraversalEngine - chr6:49644100 1.95e+07 79.6 m 4.1 m 65.5% 2.0 h 42.0 m
INFO 09:43:42,117 TraversalEngine - chr6:49791580 1.97e+07 80.1 m 4.1 m 66.0% 2.0 h 41.3 m
INFO 09:44:12,453 TraversalEngine - chr6:49942980 1.98e+07 80.6 m 4.1 m 66.5% 2.0 h 40.6 m
INFO 09:44:42,514 TraversalEngine - chr6:50080075 2.00e+07 81.1 m 4.1 m 66.9% 2.0 h 40.1 m
INFO 09:45:12,756 TraversalEngine - chr6:50132228 2.01e+07 81.6 m 4.1 m 67.1% 2.0 h 40.0 m
INFO 09:45:43,083 TraversalEngine - chr6:50341911 2.03e+07 82.1 m 4.0 m 67.8% 2.0 h 39.0 m
INFO 09:46:13,272 TraversalEngine - chr6:50459968 2.04e+07 82.6 m 4.0 m 68.2% 2.0 h 38.5 m
INFO 09:46:43,325 TraversalEngine - chr6:50654861 2.06e+07 83.1 m 4.0 m 68.8% 2.0 h 37.6 m
INFO 09:47:13,333 TraversalEngine - chr6:50770087 2.07e+07 83.6 m 4.0 m 69.2% 2.0 h 37.2 m
INFO 09:47:43,428 TraversalEngine - chr6:50958433 2.09e+07 84.1 m 4.0 m 69.9% 2.0 h 36.3 m
INFO 09:48:13,559 TraversalEngine - chr6:51052363 2.10e+07 84.6 m 4.0 m 70.2% 2.0 h 36.0 m
INFO 09:48:43,710 TraversalEngine - chr6:51269289 2.12e+07 85.1 m 4.0 m 70.9% 2.0 h 34.9 m
INFO 09:49:13,780 TraversalEngine - chr6:51325417 2.12e+07 85.6 m 4.0 m 71.1% 2.0 h 34.8 m
INFO 09:49:43,958 TraversalEngine - chr6:51427067 2.14e+07 86.1 m 4.0 m 71.4% 2.0 h 34.5 m
INFO 09:50:13,998 TraversalEngine - chr6:51618301 2.15e+07 86.6 m 4.0 m 72.1% 2.0 h 33.6 m
INFO 09:50:44,171 TraversalEngine - chr6:51769423 2.17e+07 87.1 m 4.0 m 72.6% 2.0 h 32.9 m
INFO 09:51:14,302 TraversalEngine - chr6:51921220 2.18e+07 87.6 m 4.0 m 73.1% 119.9 m 32.3 m
INFO 09:51:44,486 TraversalEngine - chr6:52053794 2.19e+07 88.1 m 4.0 m 73.5% 119.9 m 31.8 m
INFO 09:52:14,531 TraversalEngine - chr6:52084325 2.21e+07 88.6 m 4.0 m 73.6% 2.0 h 31.8 m
INFO 09:52:44,620 TraversalEngine - chr6:52318415 2.22e+07 89.1 m 4.0 m 74.4% 119.8 m 30.7 m
INFO 09:53:15,252 TraversalEngine - chr6:52495376 2.23e+07 89.6 m 4.0 m 75.0% 119.5 m 29.9 m
INFO 09:53:45,312 TraversalEngine - chr6:52508016 2.24e+07 90.1 m 4.0 m 75.0% 2.0 h 30.0 m
INFO 09:54:15,358 TraversalEngine - chr6:52523967 2.26e+07 90.6 m 4.0 m 75.1% 2.0 h 30.1 m
INFO 09:54:45,366 TraversalEngine - chr6:52825024 2.27e+07 91.1 m 4.0 m 76.1% 119.8 m 28.7 m
INFO 09:55:15,533 TraversalEngine - chr6:52957057 2.29e+07 91.6 m 4.0 m 76.5% 119.8 m 28.1 m
INFO 09:55:45,630 TraversalEngine - chr6:53090402 2.30e+07 92.1 m 4.0 m 77.0% 119.7 m 27.6 m
INFO 09:56:15,731 TraversalEngine - chr6:53116593 2.31e+07 92.6 m 4.0 m 77.1% 2.0 h 27.6 m
INFO 09:56:45,870 TraversalEngine - chr6:53306749 2.33e+07 93.2 m 4.0 m 77.7% 119.9 m 26.8 m
INFO 09:57:16,185 TraversalEngine - chr6:53480683 2.34e+07 93.7 m 4.0 m 78.3% 119.7 m 26.0 m
INFO 09:57:46,349 TraversalEngine - chr6:53674612 2.35e+07 94.2 m 4.0 m 78.9% 119.3 m 25.2 m
INFO 09:58:16,617 TraversalEngine - chr6:53728699 2.36e+07 94.7 m 4.0 m 79.1% 119.7 m 25.0 m
INFO 09:58:46,622 TraversalEngine - chr6:53889922 2.38e+07 95.2 m 4.0 m 79.6% 119.5 m 24.3 m
INFO 09:59:16,780 TraversalEngine - chr6:53995948 2.39e+07 95.7 m 4.0 m 80.0% 119.6 m 23.9 m
INFO 09:59:46,941 TraversalEngine - chr6:54171265 2.41e+07 96.2 m 4.0 m 80.6% 119.4 m 23.2 m
INFO 10:00:17,025 TraversalEngine - chr6:54346501 2.42e+07 96.7 m 4.0 m 81.2% 119.1 m 22.4 m
INFO 10:00:47,168 TraversalEngine - chr6:54416886 2.43e+07 97.2 m 4.0 m 81.4% 119.4 m 22.2 m
INFO 10:01:17,220 TraversalEngine - chr6:54612159 2.45e+07 97.7 m 4.0 m 82.0% 119.1 m 21.4 m
INFO 10:01:47,429 TraversalEngine - chr6:54741948 2.46e+07 98.2 m 4.0 m 82.5% 119.0 m 20.9 m
INFO 10:02:17,457 TraversalEngine - chr6:54875353 2.47e+07 98.7 m 4.0 m 82.9% 119.0 m 20.3 m
INFO 10:02:47,470 TraversalEngine - chr6:55034479 2.49e+07 99.2 m 4.0 m 83.4% 118.8 m 19.7 m
INFO 10:03:17,501 TraversalEngine - chr6:55077719 2.50e+07 99.7 m 4.0 m 83.6% 119.2 m 19.6 m
INFO 10:03:47,629 TraversalEngine - chr6:55239211 2.52e+07 100.2 m 4.0 m 84.1% 119.1 m 18.9 m
INFO 10:04:17,807 TraversalEngine - chr6:55261503 2.53e+07 100.7 m 4.0 m 84.2% 119.6 m 18.9 m
INFO 10:04:48,022 TraversalEngine - chr6:55529351 2.54e+07 101.2 m 4.0 m 85.1% 118.9 m 17.7 m
INFO 10:05:18,030 TraversalEngine - chr6:55690263 2.55e+07 101.7 m 4.0 m 85.6% 118.7 m 17.1 m
INFO 10:05:48,177 TraversalEngine - chr6:55872577 2.57e+07 102.2 m 4.0 m 86.2% 118.5 m 16.3 m
INFO 10:06:18,302 TraversalEngine - chr6:56066218 2.59e+07 102.7 m 4.0 m 86.9% 118.2 m 15.5 m
INFO 10:06:50,984 TraversalEngine - chr6:56095241 2.60e+07 103.2 m 4.0 m 87.0% 118.7 m 15.4 m
INFO 10:07:21,004 TraversalEngine - chr6:56254088 2.62e+07 103.7 m 4.0 m 87.5% 118.5 m 14.8 m
INFO 10:07:51,015 TraversalEngine - chr6:56358201 2.63e+07 104.2 m 4.0 m 87.9% 118.6 m 14.4 m
INFO 10:08:21,055 TraversalEngine - chr6:56521140 2.65e+07 104.7 m 4.0 m 88.4% 118.5 m 13.7 m
INFO 10:08:51,097 TraversalEngine - chr6:56653469 2.66e+07 105.2 m 4.0 m 88.8% 118.5 m 13.2 m
INFO 10:09:21,237 TraversalEngine - chr6:56919917 2.68e+07 105.7 m 3.9 m 89.7% 117.8 m 12.1 m
INFO 10:09:51,254 TraversalEngine - chr6:56995310 2.70e+07 106.2 m 3.9 m 90.0% 118.1 m 11.8 m
INFO 10:10:21,418 TraversalEngine - chr6:57123859 2.71e+07 106.7 m 3.9 m 90.4% 118.1 m 11.3 m
INFO 10:10:51,765 TraversalEngine - chr6:57297613 2.72e+07 107.2 m 3.9 m 91.0% 117.9 m 10.6 m
INFO 10:11:22,167 TraversalEngine - chr6:57254325 2.72e+07 107.8 m 4.0 m 90.8% 118.6 m 10.9 m
INFO 10:11:52,516 TraversalEngine - chr6:57270883 2.72e+07 108.3 m 4.0 m 90.9% 119.1 m 10.8 m
INFO 10:12:22,849 TraversalEngine - chr6:57274093 2.72e+07 108.8 m 4.0 m 90.9% 119.6 m 10.9 m
INFO 10:12:53,635 TraversalEngine - chr6:57428723 2.73e+07 109.3 m 4.0 m 91.4% 119.5 m 10.2 m
INFO 10:13:24,366 TraversalEngine - chr6:57366872 2.73e+07 109.8 m 4.0 m 91.2% 2.0 h 10.6 m
INFO 10:13:55,899 TraversalEngine - chr6:57444991 2.73e+07 110.3 m 4.0 m 91.5% 2.0 h 10.3 m
INFO 10:14:25,994 TraversalEngine - chr6:57415703 2.73e+07 110.8 m 4.1 m 91.4% 2.0 h 10.4 m
INFO 10:14:57,164 TraversalEngine - chr6:57450056 2.74e+07 111.3 m 4.1 m 91.5% 2.0 h 10.3 m
INFO 10:15:28,614 TraversalEngine - chr6:57373410 2.74e+07 111.9 m 4.1 m 91.2% 2.0 h 10.7 m
INFO 10:16:00,023 TraversalEngine - chr6:57376265 2.74e+07 112.4 m 4.1 m 91.3% 2.1 h 10.8 m
INFO 10:16:30,470 TraversalEngine - chr6:57390372 2.74e+07 112.9 m 4.1 m 91.3% 2.1 h 10.8 m
INFO 10:17:00,616 TraversalEngine - chr6:57473509 2.74e+07 113.4 m 4.1 m 91.6% 2.1 h 10.4 m
INFO 10:17:31,852 TraversalEngine - chr6:57408327 2.74e+07 113.9 m 4.2 m 91.4% 2.1 h 10.8 m
INFO 10:18:02,372 TraversalEngine - chr6:57624633 2.75e+07 114.4 m 4.2 m 92.1% 2.1 h 9.8 m
INFO 10:18:32,415 TraversalEngine - chr6:57619768 2.75e+07 114.9 m 4.2 m 92.1% 2.1 h 9.9 m
INFO 10:19:03,387 TraversalEngine - chr6:57672367 2.75e+07 115.4 m 4.2 m 92.2% 2.1 h 9.7 m
INFO 10:19:33,472 TraversalEngine - chr6:57713390 2.76e+07 115.9 m 4.2 m 92.4% 2.1 h 9.6 m
INFO 10:20:03,707 TraversalEngine - chr6:57588693 2.76e+07 116.4 m 4.2 m 92.0% 2.1 h 10.2 m
INFO 10:20:33,937 TraversalEngine - chr6:57805974 2.77e+07 117.0 m 4.2 m 92.7% 2.1 h 9.2 m
INFO 10:21:04,581 TraversalEngine - chr6:57854191 2.77e+07 117.5 m 4.2 m 92.8% 2.1 h 9.0 m
INFO 10:21:38,029 TraversalEngine - chr6:57846906 2.77e+07 118.0 m 4.3 m 92.8% 2.1 h 9.1 m
INFO 10:22:11,718 TraversalEngine - chr6:57880594 2.78e+07 118.6 m 4.3 m 92.9% 2.1 h 9.0 m
INFO 10:22:44,835 TraversalEngine - chr6:57895296 2.78e+07 119.1 m 4.3 m 93.0% 2.1 h 9.0 m
INFO 10:23:19,001 TraversalEngine - chr6:57950669 2.78e+07 119.7 m 4.3 m 93.2% 2.1 h 8.8 m
INFO 10:23:52,808 TraversalEngine - chr6:57970048 2.78e+07 2.0 h 4.3 m 93.2% 2.1 h 8.7 m
INFO 10:24:26,193 TraversalEngine - chr6:57519976 2.79e+07 2.0 h 4.3 m 91.7% 2.2 h 10.9 m
INFO 10:25:00,160 TraversalEngine - chr6:57995097 2.79e+07 2.0 h 4.4 m 93.3% 2.2 h 8.7 m
INFO 10:25:33,919 TraversalEngine - chr6:58050289 2.79e+07 2.0 h 4.4 m 93.5% 2.2 h 8.5 m
INFO 10:26:06,617 TraversalEngine - chr6:58073510 2.79e+07 2.0 h 4.4 m 93.6% 2.2 h 8.4 m
INFO 10:26:39,639 TraversalEngine - chr6:58165229 2.80e+07 2.1 h 4.4 m 93.9% 2.2 h 8.0 m
INFO 10:27:13,254 TraversalEngine - chr6:58152533 2.81e+07 2.1 h 4.4 m 93.8% 2.2 h 8.1 m
INFO 10:27:45,992 TraversalEngine - chr6:58247070 2.81e+07 2.1 h 4.4 m 94.2% 2.2 h 7.7 m
INFO 10:28:19,085 TraversalEngine - chr6:58195026 2.81e+07 2.1 h 4.4 m 94.0% 2.2 h 8.0 m
INFO 10:28:51,984 TraversalEngine - chr6:58315635 2.82e+07 2.1 h 4.4 m 94.4% 2.2 h 7.5 m
INFO 10:29:22,635 TraversalEngine - chr6:58310322 2.83e+07 2.1 h 4.4 m 94.4% 2.2 h 7.5 m
INFO 10:29:52,658 TraversalEngine - chr6:58454325 2.84e+07 2.1 h 4.5 m 94.8% 2.2 h 6.9 m
INFO 10:30:23,461 TraversalEngine - chr6:58583327 2.85e+07 2.1 h 4.5 m 95.3% 2.2 h 6.3 m
INFO 10:30:53,585 TraversalEngine - chr6:58744180 2.86e+07 2.1 h 4.5 m 95.8% 2.2 h 5.6 m
INFO 10:31:23,886 TraversalEngine - chr6:58765337 2.99e+07 2.1 h 4.3 m 95.9% 2.2 h 5.5 m
INFO 10:31:57,941 TraversalEngine - chr6:57604792 3.00e+07 2.1 h 4.3 m 92.0% 2.3 h 11.1 m
INFO 10:35:46,247 TraversalEngine - chr6:58776926 3.00e+07 2.2 h 4.4 m 95.9% 2.3 h 5.6 m
INFO 10:42:29,294 TraversalEngine - chr6:58777926 3.00e+07 2.3 h 4.6 m 95.9% 2.4 h 5.9 m
INFO 10:49:41,001 TraversalEngine - chr6:58778926 3.00e+07 2.4 h 4.9 m 95.9% 2.5 h 6.2 m
INFO 10:55:43,727 TraversalEngine - chr6:58779926 3.00e+07 2.5 h 5.1 m 95.9% 2.6 h 6.4 m

Created 2013-06-04 19:41:41 | Updated | Tags: haplotypecaller performance

Hey there!

I've been using HaplotypeCaller as part of a new whole genome variant calling pipeline I'm working on and I had a question about the number of samples to use. From my tests, it seems like increasing the number of samples to run HaplotypeCaller on simultaneously improves the accuracy no matter how many samples I add (I've tried 1, 4, and 8 samples at a time). Before I tried 16 samples, I was wondering if you could tell me if there's a point of diminishing returns for adding samples to HaplotypeCaller. It seems like with every sample I add, the time/sample increases, so I don't want to keep adding samples if it's not going to result in an improved call set, but if it does improve the results I'll deal with it and live with the longer run times. I should note that I'm making this pipeline for an experiment where there will be up to 50 individuals, and of those, there are family groups of 3-4 people. If running HaplotypeCaller on all 50 simultaneously would result in the best call set, that's what I'll do. Thanks! (By the way, I love the improvements you made with 2.5!)

• Grant

Created 2013-04-02 09:21:05 | Updated 2013-04-02 09:21:56 | Tags: bam performance

Dear all,

I am currently running an analysis using the HaplotypeCaller on 300 large BAM files on our cluster and decided to chunk the the genome in 3MB bins in order for them to be processed in a decent time. I'm however experiencing very long runtimes as more and more jobs get scheduled to run in parallel on the same files. Looking at the GATK options, I saw these 2 that I thought could be of help and was wondering what were the recommendation for using them: --num_bam_file_handles --read_buffer_size

More precisely, does the num_bam_file_handles increase processing time by a lot? and what is the default value for --read_buffer_size ?

Thanks a lot, Laurent

Created 2013-01-10 20:43:28 | Updated 2013-01-10 20:49:09 | Tags: dataprocessingpipeline queue dpp performance community

Is there any rule of thumb for allocating memory through "bsub" for running DataProcessingPipeline per bam file or per number of reads ?

Thanks

Created 2012-11-14 20:48:52 | Updated 2012-11-14 21:05:07 | Tags: performance

Hello, I was wondering if you have a page that explains which parameters affect performance of which tragets of the GATK?

In particular

• DepthOfCoverage
• RealignerTargetCreator
• IndelRealigner

For whoel genome, DepthOfCoverage can take more than 30hours and this is with the ommit base counts. Would putting in a higher read_buffer_size with more ram help, etc.

Thanks Louis

Created 2012-10-17 15:56:12 | Updated 2012-10-17 16:12:17 | Tags: haplotypecaller performance

Hi All, I've been attempting to use the haplotype caller on my 50x coverage exome data. The bam being parsed is about 12G. Each time, the caller runs for many hours and then the output is only the header of the VCF - no errors seen. I'm wondering if this is due to limited space on my drives or if the expected file size is much larger than I am anticipating.

Command:

GenomeAnalysisTK.jar -T HaplotypeCaller -R  Homo_sapiens_assembly19.fasta -I input.bam --dbsnp dbsnp_132.b37.nochr.vcf  -stand_call_conf 30 -stand_emit_conf 10 -o output.Haplotypes.vcf

Created 2012-10-15 18:59:35 | Updated 2012-10-18 00:50:04 | Tags: reducereads performance

Hi Team,

I have been running GATK2 ReduceReads on a large (100Gb) Bam file, and even though at the very beginning it runs very smoothly and predicts a week for finishing the task, after a few hours it gets totally stock. We first thought that it could be a garbage collection (or java memory allocation issue), but the logs show that the garbage collection works well.

The command is (similar behavior for smaller Xms and Xmx values) java -Xmx30g -Xms30g -XX:+PrintGCTimeStamps -XX:+UseParallelOldGC -XX:+PrintGCDetails -Xloggc:gc.log -verbose:gc -jar $path$ref -T ReduceReads -I input.bam -o output.bam

The first few lines of the log file are

INFO 01:12:21,541 TraversalEngine - chr1:1094599 5.89e+05 9.9 m 16.8 m 0.0% 19.4 d 19.4 d INFO 01:13:21,628 TraversalEngine - chr1:2112411 9.44e+05 10.9 m 11.6 m 0.1% 11.2 d 11.2 d INFO 01:14:22,065 TraversalEngine - chr1:3051535 1.29e+06 11.9 m 9.3 m 0.1% 8.5 d 8.5 d INFO 01:15:22,297 TraversalEngine - chr1:4084547 1.59e+06 12.9 m 8.1 m 0.1% 6.9 d 6.9 d INFO 01:16:24,130 TraversalEngine - chr1:4719991 1.82e+06 13.9 m 7.7 m 0.2% 6.4 d 6.4 d

but after a short while it gets totally stock, and even in the location 121485073 of chromosome 1, there is almost no progress at all, and the estimated finish time goes over 11 weeks, and still increasing.

Any idea what the reason for this could be, and how we can solve the problem? The same command runs successfully on small (less than 5gig) Bam files though

Created 2012-08-28 20:41:29 | Updated 2012-10-19 00:51:12 | Tags: unifiedgenotyper performance

Background: I am testing GATK (ver. 2.0-39) for use in de novo SNP identification using targeted Illumina seq. against a set of ~2500 genes from 28 different indiv. genotypes, same species. These are PE 50 and PE100 libs. I do not have a defined set of indels or SNPs to use as a reference as per GATK Phase 1 best practices. The genome seq. for this organism is a first draft (2.2 GB with ~ 835,000 clusters/contigs). I decided to first test four libraries (two PE50 and two PE100) and then check the results and tweak switches as necessary before scaling up to the full complement of sample libs. So far I have:

1. Assigned readgroups and mapped reads (individually) of the 4 test libs. to the reference using bowtie2
2. Sorted, then combined outputs into a single bam file (12 GB)
3. Run GATK ReduceReads to generate a 6 GB bam file
4. Run UnifiedGenotyper with the cmd:

.

java -Djava.io.tmpdir=/path/tmp_dir -jar /path/GenomeAnalysisTK.jar -T UnifiedGenotyper -R speciesname_idx/speciesname.fasta -I 4.libs_reduced.bam -o 4.libs.UG -nt 6

My questions are:

1. Can GATK be run efficiently without Phase 1 processing?
2. Is the ref. genome too large, w.r.t. the # of clusters?
3. Would one expect this approach to require an inordinate amount of time to process a dataset of this size and complexity?

The program initially died because java didn't have enough write space. So I gave it a tmp dir. and it ran for 3 days and died after maxing out a hard, 2 TB directory size limit. I am now running it again with a 4 TB limit.

After 27 hr, I have only traversed 5.2% of the genome (if I'm understanding the stdout correctly).

INFO  16:33:47,746 TraversalEngine - ctg7180006247957:754        1.15e+08   26.9 h       14.0 m      5.2%         3.1 w     2.9 w

So, at this rate, that's ~21 days to process ~15% of the libs. I thought maybe there was an excessive amt of swap occurring that might be slowing things down, but of the 126 GB RAM available only~ 20-30 GB are being utilized among mine several other jobs, so not likely an issue.

I have no experience with this program, but this just seems way too slow for processing a relatively small dataset... and I wonder if it will ever be able to crunch through the full set of 28 libs.

Any suggestions/thoughts as to why this is occurring, and what I might be able do to speed things up would be greatly appreciated!

Walt