Tagged with #header
0 documentation articles | 0 announcements | 4 forum discussions

No posts found with the requested search criteria.
No posts found with the requested search criteria.

Created 2014-09-08 11:15:00 | Updated 2014-09-08 11:16:08 | Tags: baserecalibrator readgroup header rg pu lb
Comments (4)

Hi, I was wondering about how GATK handles bam read group info, as behaviour I have observed is not as expected from reading the documentation.
I am running a comparison of different illumina platforms, using the same library, so have given my bam files RG info that I think reflects this:

@RG ID:2_1  PL:ILLUMINA PU:Platform1_FCA    LB:Sample2  DT:2013-09-01T01:00:00+0100 SM:Sample2_Platform1
@RG ID:2_2  PL:ILLUMINA PU:Platform2_FCB    LB:Sample2  DT:2013-07-17T01:00:00+0100 SM:Sample2_Platform2
@RG ID:2_3  PL:ILLUMINA PU:Platform3_FCC    LB:Sample2  DT:2014-05-16T01:00:00+0100 SM:Sample2_Platform3

(IDs changed to something more generic)

I have run BaseRecalibrator, and the read group results are broken down by the PU field:

ReadGroup                    EventType  EmpiricalQuality  EstimatedQReported  Observations   Errors     
Platform1_FCA    M                   28.2433             28.1998  6283676400.00   9416437.03
Platform1_FCA    I                   41.2350             45.0000  6283676400.00    472843.68
Platform1_FCA    D                   40.2707             45.0000  6283676400.00    590401.90
Platform2_FCB  M                   28.7442             30.5860   258157515.00    344716.76
Platform2_FCB  I                   44.4728             45.0000   258157515.00      9216.35
Platform2_FCB  D                   41.1310             45.0000   258157515.00     19896.01
Platform3_FCC     M                   22.9983             23.2025  2510670817.00  12588158.43
Platform3_FCC     I                   44.0934             45.0000  2510670817.00     97824.77
Platform3_FCC     D                   41.1932             45.0000  2510670817.00    190753.01

This is not what I would expect given that the FAQ for GATK states We do not require value for the CN, DS, DT, PG, PI, or PU fields. I would expect that the RG ID field is used, that is the unique identifier for the Read group, and is by definition unique. Failing that I would expect the SM field to be used as this is stated as being used in the GATK documentation and is unique in my bam file. I'm guessing GATK would use the LB field preferentially but as this is not unique in my case it uses something else that is. I can find no errors or warnings about this in the output or stdout/stderr.

I'm wondering what would happen when I add a second sample, with the same PU IDs?

Can anyone clarify how GATK BaseRecalibrator (and other tools) handle read group fields, especially where LB is the same for multiple read groups? I need to know if I am using the RG fields correctly for my experiment, and also so that the end results of the GATK pipeline best reflect what I am trying to analyse.

Many thanks Anna

Created 2014-03-05 08:09:15 | Updated | Tags: bqsr bam header pg-tag
Comments (7)

When I run the BQSR with GATK 2.8.1, I can't check the PG information on output bam file.

BWA->Picard dedup->GATK LocalRealign->GATK BQSR

Below is the output BAM file after BQSR step.

@PG ID:GATK IndelRealigner VN:2.3-9-gdcdccbb CL:knownAlleles=[(RodBinding name=knownAlleles source=/nG/Reference/hgdownload.cse.ucsc.edu/goldenPath/hg19/KNOWNINDEL/Mills_and_1000G_gold_standard.indels.hg19.vcf)] targetIntervals=/nG/Data/2077/vcf1/node1/chrY/Databind/chrY.bam.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null @PG ID:MarkDuplicates PN:MarkDuplicates VN:1.92(1464) CL:net.sf.picard.sam.MarkDuplicates INPUT=[/nG/Data/2077/step2_makebam/node1-1.bam, /nG/Data/2077/step2_makebam/node1-2.bam, /nG/Data/2077/step2_makebam/node1-3.bam, /nG/Data/2077/step2_makebam/node1-4.bam, /nG/Data/2077/step2_makebam/node1-5.bam, /nG/Data/2077/step2_makebam/node1-6.bam] OUTPUT=/dev/stdout METRICS_FILE=/nG/Data/2077/temp/picard_info.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000000 TMP_DIR=[/nG/Data/2077/temp] QUIET=true VALIDATION_STRINGENCY=LENIENT COMPRESSION_LEVEL=0 MAX_RECORDS_IN_RAM=2000000 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO CREATE_INDEX=false CREATE_MD5_FILE=false

Related thread (guess) http://gatkforums.broadinstitute.org/discussion/2118/baserecalibration-prinread-don-t-create-a-header-and-don-t-obtain-oq-orignal-base-quality-in-bam

Created 2014-03-02 23:34:47 | Updated | Tags: printreadswalker bam header
Comments (6)


I use --sample_name option to restrict output reads for PintReads walker (GATK version - 2.4.9). This option works well but the output BAM header @RG still contains sample name (SM) that was not included. Is there a way to apply this --sample_name option to BAM header at the same time? I could reheader the BAM files using samtools but I'm looking for a better and easier way.

Thank you,


Created 2013-11-12 14:31:50 | Updated 2013-11-12 14:35:51 | Tags: fastaalternatereferencemaker consensus name header
Comments (4)


I am using FastaAlternateReferenceMaker to create consensus sequences as follows:

java -Xmx12g -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R /Reference/chromosome.1.fa -o /Output/consensus.1.fa --variant /VCF/chromosome1.vcf -L /Interval/chromosome.1.list

This command line produces a single fasta file including consensus sequences for the intervals provided in the list file.

Two questions:

(1) Presumably, sequence header for each sequence in the consensus fasta file corresponds to the contig name in the master reference file. Is there any way to modify the command line to print, say, the interval as the sequence header instead of the contig name?

Note: I can feed the program one interval at a time and name the output file accordingly, however, I'd like to stick to submitting my query per one chromosome at a time for the sake of saving up time.

(2) Number of sequences in the consensus fasta do not match the number of non-overlapping intervals in the input list. I would think that this is because some intervals are variant-free and therefore no alternate reference is reported for them. Do you confirm this is the case? I cannot easily check because the issue mentioned in (1).

Thank you.