Tagged with #paper
2 documentation articles | 1 announcement | 0 forum discussions

Created 2015-09-28 04:03:04 | Updated 2015-09-28 04:38:58 | Tags: paper citation
Comments (0)

To date we have published three papers on GATK (citation details below). The ideal way to cite the GATK is to use all as a triple citation, as in:

We sequenced 10 samples on 10 lanes on an Illumina HiSeq 2000, aligned the resulting reads to the hg19 reference genome with BWA (Li & Durbin), applied GATK (McKenna et al., 2010) base quality score recalibration, indel realignment, duplicate removal, and performed SNP and INDEL discovery and genotyping across all 10 samples simultaneously using standard hard filtering parameters or variant quality score recalibration according to GATK Best Practices recommendations (DePristo et al., 2011; Van der Auwera et al., 2013).

McKenna et al. 2010 : Original description of the GATK framework

The first GATK paper covers the computational philosophy underlying the GATK and is a good citation for the GATK in general.

The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA, 2010 GENOME RESEARCH 20:1297-303

Article | Pubmed

DePristo et al. 2011 : First incarnation of the Best Practices workflow

The second GATK paper describes in more detail some of the key tools commonly used in the GATK for high-throughput sequencing data processing and variant discovery. The paper covers base quality score recalibration, indel realignment, SNP calling with UnifiedGenotyper, variant quality score recalibration and their application to deep whole genome, whole exome, and low-pass multi-sample calling. This is a good citation if you use the GATK for variant discovery.

A framework for variation discovery and genotyping using next-generation DNA sequencing data DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M, 2011 NATURE GENETICS 43:491-498

Article | Pubmed

Note that the workflow described in this paper corresponds to the version 1.x to 2.x best practices. Some key steps for variant discovery have been significantly modified in later versions (3.x onwards). This paper should not be used as a definitive guide to variant discovery with GATK. For that, please see our online documentation guide.

Van der Auwera et al. 2013 : Hands-on tutorial with step-by-step explanations

The third GATK paper describes the Best Practices for Variant Discovery (version 2.x). It is intended mainly as a learning resource for first-time users and as a protocol reference. This is a good citation to include in a Materials and Methods section.

From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M, 2013 CURRENT PROTOCOLS IN BIOINFORMATICS 43:11.10.1-11.10.33

Article | PubMed

Remember that as our work continues and our Best Practices recommendations evolve, specific command lines, argument values and even tool choices described in the paper become obsolete. Be sure to always refer to our Best Practices documentation for the most up-to-date and version-appropriate recommendations.

Created 2012-08-09 14:28:32 | Updated 2013-03-25 21:51:48 | Tags: official bundle analyst developer paper intermediate
Comments (1)

New WGS and WEx CEU trio BAM files

We have sequenced at the Broad Institute and released to the 1000 Genomes Project the following datasets for the three members of the CEU trio (NA12878, NA12891 and NA12892):

  • WEx (150x) sequence
  • WGS (>60x) sequence

This is better data to work with than the original DePristo et al. BAMs files, so we recommend you download and analyze these files if you are looking for complete, large-scale data sets to evaluate the GATK or other tools.

Here's the rough library properties of the BAMs:

CEU trio BAM libraries

These data files can be downloaded from the 1000 Genomes DCC

NA12878 Datasets from DePristo et al. (2011) Nature Genetics

Here are the datasets we used in the GATK paper cited below.

DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D and Daly, M (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 43:491-498.

Some of the BAM and VCF files are currently hosted by the NCBI: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/

  • NA12878.hiseq.wgs.bwa.recal.bam -- BAM file for NA12878 HiSeq whole genome
  • NA12878.hiseq.wgs.bwa.raw.bam Raw reads (in BAM format, see below)
  • NA12878.ga2.exome.maq.recal.bam -- BAM file for NA12878 GenomeAnalyzer II whole exome (hg18)
  • NA12878.ga2.exome.maq.raw.bam Raw reads (in BAM format, see below)
  • NA12878.hiseq.wgs.vcf.gz -- SNP calls for NA12878 HiSeq whole genome (hg18)
  • NA12878.ga2.exome.vcf.gz -- SNP calls for NA12878 GenomeAnalyzer II whole exome (hg18)
  • BAM files for CEU + NA12878 whole genome (b36). These are the standard BAM files for the 1000 Genomes pilot CEU samples plus a 4x downsampled version of NA12878 from the pilot 2 data set, available in the DePristoNatGenet2011 directory of the GSA FTP Server
  • SNP calls for CEU + NA12878 whole genome (b36) are available in the DePristoNatGenet2011 directory of the GSA FTP Server
  • Crossbow comparison SNP calls are available in the DePristoNatGenet2011 directory of the GSA FTP Server as crossbow.filtered.vcf. The raw calls can be viewed by ignoring the FILTER field status
  • whole_exome_agilent_designed_120.Homo_sapiens_assembly18.targets.interval_list -- targets used in the analysis of the exome capture data

Please note that we have not collected the indel calls for the paper, as these are only used for filtering SNPs near indels. If you want to call accurate indels, please use the new GATK indel caller in the Unified Genotyper.


Both the GATK and the sequencing technologies have improved significantly since the analyses performed in this paper.

  • If you are conducting a review today, we would recommend that the newest version of the GATK, which performs much better than the version described in the paper. Moreover, we would also recommend one use the newest version of Crossbow as well, in case they have improved things. The GATK calls for NA12878 from the paper (above) will give one a good idea what a good call set looks like whole-genome or whole-exome.

  • The data sets used in the paper are no longer state-of-the-art. The WEx BAM is GAII data aligned with MAQ on hg18, but a state-of-the-art data set would use HiSeq and BWA on hg19. Even the 64x HiSeq WG data set is already more than one year old. For a better assessment, we would recommend you use a newer data set for these samples, if you have the capacity to generate it. This applies less to the WG NA12878 data, which is pretty good, but the NA12878 WEx from the paper is nearly 2 years old now and notably worse than our most recent data sets.

Obviously, this was an annoyance for us as well, as it would have been nice to use a state-of-the-art data set for the WEx. But we decided to freeze the data used for analysis to actually finish this paper.

How do I get the raw FASTQ file from a BAM?

If you want the raw, machine output for the data analyzed in the GATK framework paper, obtain the raw BAM files above and convert them from SAM to FASTQ using the Picard tool SamToFastq.

Created 2014-04-14 21:00:52 | Updated 2014-04-14 21:01:34 | Tags: paper job job-offer
Comments (4)

We have a problem: we have a truckload of material sitting around waiting to be published, but no time to actually write the papers. So we're looking for someone who will help us convert this computational biology goldmine into cold hard Nature Biotech/Methods papers.

This is a great opportunity for an early-career, postdoc-level scientist who has experience publishing papers, demonstrated writing ability, and is not afraid of wrangling complex technical material.

Make no mistake, we're not looking for a ghostwriter; this will involve intellectual contribution worth the authorship in high-profile publications. But the basic material is ready and waiting.

Here is the complete job description; feel free to ask questions in the comments or by private message to me (Geraldine):

Scientific Technical Writer

The Genome Sequencing and Analysis group seeks a scientist with excellent writing skills and experience with scientific publication to work with scientists and engineers on transforming notes and technical documentation for existing analytical methods and tools into polished papers aimed at high-profile journals such as Nature Biotechnology and Nature Genetics. In collaboration with researchers working at the forefront of bioinformatics, including the group that develops the gold-standard Genome Analysis Toolkit (GATK), the writer would be responsible for designing paper structure and identifying necessary content to (1) better explain the theory that underlies methods and tools and (2) powerfully illustrate use cases based on public datasets. The writer would focus their writing on the introduction, results, and discussion sections of the papers, expanding existing technical material, while the team will provide the methods sections.

The writer will be hired as a consultant, with a minimum of 20 hours/week of work (more hours are possible if desired).


  • Ph.D. in a scientific discipline (not necessarily bioinformatics or computational biology, though familiarity with those fields is helpful);
  • experience and skill in writing scientific papers (experience with other scientific writing activities, e.g., grant-writing, is also helpful);
  • ability to craft a logically structured text from unstructured material;
  • ability to synthesize information from diverse sources;
  • ability to write clearly and concisely without sacrificing precision;
  • ability to communicate effectively in a team setting;
  • ability to work in a highly collaborative team environment; and
  • desire to take direction and iterate drafts with team members.
No posts found with the requested search criteria.