Map the read data to the reference and mark duplicates.
The read group information is key for downstream GATK functionality. The GATK will not work without a read group tag. Make sure to enter as much metadata as you know about your data in the read group fields provided. For more information about all the possible fields in the @RG tag, take a look at the SAM specification.
Compose the read group identifier in the following format:
\t stands for the tab character.
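As a minimal sketch of composing such an identifier in the shell: the ID, SM, PL, LB and PU fields are standard @RG tags from the SAM specification, but every value below (group1, sample1, and so on) is a hypothetical placeholder, and the variable names are only for illustration.

```shell
# Hypothetical placeholder values; replace with the metadata for your own data.
RG_ID="group1"      # read group identifier
RG_SM="sample1"     # sample name
RG_PL="ILLUMINA"    # sequencing platform
RG_LB="lib1"        # library
RG_PU="unit1"       # platform unit

# Inside double quotes the shell keeps \t as a literal backslash followed by t.
# bwa mem converts each \t to a real tab when it writes the @RG header line.
READ_GROUP="@RG\tID:${RG_ID}\tSM:${RG_SM}\tPL:${RG_PL}\tLB:${RG_LB}\tPU:${RG_PU}"

# printf with %s prints the string unchanged (echo may expand \t in some shells).
printf '%s\n' "$READ_GROUP"
```

The resulting string can then be passed to the -R option of bwa mem, quoted so that the shell does not alter it.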
Run the following BWA command:
In this command, replace <read group info> with the read group identifier you composed in the previous step.
bwa mem -M -R '<read group info>' -p reference.fa raw_reads.fq > aligned_reads.sam
The -M flag causes BWA to mark shorter split hits as secondary (essential for Picard compatibility).
This creates a file called aligned_reads.sam containing the aligned reads from all input files, combined, annotated and aligned to the same reference.
Note that here we are using a command that is specific to paired-end data in an interleaved FASTQ file, which is what we are providing to you as a tutorial file. To map other types of datasets (e.g. single-end, or paired-end in separate forward/reverse read files) you will need to adapt the command accordingly. Please see the BWA documentation for exact usage and more options for these commands.
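The adaptation described above can be sketched as a small dry-run helper that only prints the bwa mem command for each input layout, without running it. The function name bwa_mem_cmd and the file names reads_1.fq and reads_2.fq are hypothetical. The sketch assumes standard bwa mem usage, where -p marks an interleaved paired-end file and two FASTQ arguments mean separate forward and reverse files.

```shell
# Print (do not run) the bwa mem command for a given input layout.
# Usage: bwa_mem_cmd <layout> <read group> <reference> <fastq> [<fastq2>]
bwa_mem_cmd() {
    layout=$1; rg=$2; ref=$3; fq1=$4; fq2=${5:-}
    case "$layout" in
        # Paired-end reads interleaved in one file: needs -p.
        interleaved) printf "bwa mem -M -R '%s' -p %s %s\n" "$rg" "$ref" "$fq1" ;;
        # Single-end reads: one file, no -p.
        single)      printf "bwa mem -M -R '%s' %s %s\n" "$rg" "$ref" "$fq1" ;;
        # Paired-end reads in separate forward/reverse files: two files, no -p.
        paired)      printf "bwa mem -M -R '%s' %s %s %s\n" "$rg" "$ref" "$fq1" "$fq2" ;;
        *)           echo "unknown layout: $layout" >&2; return 1 ;;
    esac
}

bwa_mem_cmd single '<read group info>' reference.fa raw_reads.fq
bwa_mem_cmd paired '<read group info>' reference.fa reads_1.fq reads_2.fq
```

In each case the printed command still needs its output redirected to a SAM file, as in the tutorial command.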
These initial pre-processing operations format the data to suit the requirements of the GATK tools.
Run the following Picard command to sort the SAM file and convert it to BAM:
java -jar picard.jar SortSam \
    INPUT=aligned_reads.sam \
    OUTPUT=sorted_reads.bam \
    SORT_ORDER=coordinate
This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.
Run the following Picard command to mark duplicates:
java -jar picard.jar MarkDuplicates \
    INPUT=sorted_reads.bam \
    OUTPUT=dedup_reads.bam \
    METRICS_FILE=metrics.txt
This creates a sorted BAM file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also produces a metrics file called metrics.txt containing (can you guess?) metrics.
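Duplicate reads are marked by setting bit 0x400 (decimal 1024) in the FLAG field of each record, as defined in the SAM specification. A minimal sketch of testing a FLAG value for that bit in the shell follows; the function name is_duplicate and the example FLAG values are illustrative only.

```shell
# Succeed if a SAM FLAG value has the duplicate bit (0x400 = 1024) set.
is_duplicate() {
    [ $(( $1 & 1024 )) -ne 0 ]
}

# 1187 = 1024 + 163, so the duplicate bit is set; 163 lacks it.
is_duplicate 1187 && echo "1187: marked as duplicate"
is_duplicate 163  || echo "163: not marked as duplicate"
```

If samtools is installed, `samtools view -c -f 1024 dedup_reads.bam` should count the records carrying this flag. The metrics file reports duplication statistics as well.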
Run the following Picard command to index the BAM file:
java -jar picard.jar BuildBamIndex \
    INPUT=dedup_reads.bam
This creates an index file for the BAM file called
I noticed that the current build of Picard contains a third-party version of bwa (0.5.9-r16), whose version.txt contains the text:
"patched by AW, and multithread samse/sampe patch applied."
I have an external collaborator who is interested in applying this patch. Could someone provide a download and instructions for doing so, or for accessing this patched version of BWA? Or can I just send the bwa executable directly to the collaborator?
Hi guys, we are currently building a simulation program in our lab, and we were wondering whether there is already a program, known to the community or written by the GATK team, that reverts BAM files containing split spliced reads, like those output by the Split'N'Trim step?
I'm asking because we are testing multiple programs for mapping RNA-seq reads; bwa outputs split reads directly, and we want to revert them to reads with N CIGAR operators.
Thanks a lot for your help!
My question is about using bwa to map RNA-seq data to the entire human genome. Which specific settings should be used to maximize mapping? Is it effective to run it with no options on the command line?
Thank you for your time
We used bwa 0.7.4 aln and sampe to align Illumina reads. We then ran the following command:
java -Xmx6g -jar ~/GenomeAnalysisTK-2.8-1/GenomeAnalysisTK.jar -T BaseRecalibrator -I ~/temp/BIR-08_130330_I288_FCD1P68ACXX_L7_SZAIPI025187-74.sortedindelrealigned.bam -R ~/hg19/ucsc.hg19.fasta -knownSites ~/dbSNP/dbsnp_137.hg19.vcf -o ~/BIR-08_130330_I288_FCD1P68ACXX_L7_SZAIPI025187-74.sortedBQSR.grp
which gave the following error message:
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: START (90) > (89) STOP -- this should never happen, please check read: FCD1P68ACXX:7:1315:19572:52424#CGCGGTGA 1/2 90b aligned read. (CIGAR: 85M4I1M2D)
    at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:537)
    at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinatesRightTail(ReadClipper.java:193)
    at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:389)
    at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:392)
    at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:245)
    at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:132)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:228)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:216)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
Can you help me with this error message? Why does it occur, and how can I fix it? Thanks in advance, Mayukh
I have the genomes of several isolates of a parasite, and I would like to investigate synonymous/non-synonymous substitutions to identify potential antigens, as well as SNPs genome-wide. I am wondering how well BWA/GATK are suited for this purpose. I've been told that BWA only works well with sequences that are <2% divergent, and some of the antigens in this species are known to be >20% divergent. However, I also know that GATK does local realignment around indels. So I would specifically like to know: is BWA/GATK good for looking at substitutions/SNPs in highly variable genes, and if not, which other alignment tools are compatible and appropriate for this purpose?
Previously I only used BWA and, as described in the Best Practices, I performed the realignment step. Now I want to integrate Stampy, combined with BWA, into my pipeline.
Do you think I should still perform the realignment step?
Picard appears not to like the way BWA codes mtDNA. I am doing human exome sequencing using a copy of hg19 which I obtained from UCSC and indexed using BWA per the instructions here:
[Tue Aug 28 12:45:16 EDT 2012] net.sf.picard.sam.SortSam done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=125435904
FAQ: http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page
Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Non-numeric value in ISIZE column; Line 3982
Line: FCC0CHTACXX:1:1101:14789:3170#TAGCTTAT 117 chrM 304415842 0 100M = -1610645157 2379906297 TGCGACTTGGAAGCGGATTCAGAGGACAGGACAGAACACTTGGGCAAGTGAATCTCTGTCTGTCTGTCTGTCTCATTGGTTGGTTTATTTCCATTTTCTT B@<:>CDDDBDDBDEEEEEEFEFCCHHFHHGGIIIHIGJJJIIGGGIIIIJJJIIGJIJGG@CEIFJIJJJJIJIJIJJJJIJJJGIHHGHFFEFFFCCC RG:Z:1868 XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:2 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:39G45G14 XA:Z:chrM,-391302964,100M,2;
    at net.sf.samtools.SAMTextReader.reportFatalErrorParsingLine(SAMTextReader.java:223)
    at net.sf.samtools.SAMTextReader.access$400(SAMTextReader.java:40)
    at net.sf.samtools.SAMTextReader$RecordIterator.parseInt(SAMTextReader.java:293)
    at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:394)
    at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:278)
    at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:250)
    at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:641)
    at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:619)
    at net.sf.picard.sam.SortSam.doWork(SortSam.java:68)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
    at net.sf.picard.sam.SortSam.main(SortSam.java:57)
java -jar ~/bin/picard-tools-1.74/MarkDuplicates.jar \
    INPUT=1sorted.bam \
    OUTPUT=1dedup.bam \
    ASSUME_SORTED=true \
    METRICS_FILE=metrics \
    CREATE_INDEX=true \
    VALIDATION_STRINGENCY=LENIENT
...
Ignoring SAM validation error: ERROR: Record 691, Read name FCC0CHTACXX:1:1302:4748:176644#GGCTACAT, Mate Alignment start (436154938) must be <= reference sequence length (16571) on reference chrM
Ignoring SAM validation error: ERROR: Record 692, Read name FCC0CHTACXX:1:2104:8494:167812#GGCTACAT, Mate Alignment start should != 0 because reference name != *.
Ignoring SAM validation error: ERROR: Record 693, Read name FCC0CHTACXX:1:1201:21002:183608#GGCTACAT, Mate Alignment start should != 0 because reference name != *.
Ignoring SAM validation error: ERROR: Record 694, Read name FCC0CHTACXX:1:2303:3184:35872#GGCTACAT, Mate Alignment start (436154812) must be <= reference sequence length (16571) on reference chrM
...
I've truncated the output; in fact it throws such an error for every single line of mitochondrial reads.
I suspect I could solve this by writing my own script to go in and change the way one column is coded, but more broadly, I am interested in the answer to "how do you make BWA, Picard and GATK work seamlessly together without needing to do your own scripting"?