Hi, I was wondering about how GATK handles bam read group info, as behaviour I have observed is not as expected from reading the documentation.
I am running a comparison of different illumina platforms, using the same library, so have given my bam files RG info that I think reflects this:
@RG ID:2_1 PL:ILLUMINA PU:Platform1_FCA LB:Sample2 DT:2013-09-01T01:00:00+0100 SM:Sample2_Platform1 @RG ID:2_2 PL:ILLUMINA PU:Platform2_FCB LB:Sample2 DT:2013-07-17T01:00:00+0100 SM:Sample2_Platform2 @RG ID:2_3 PL:ILLUMINA PU:Platform3_FCC LB:Sample2 DT:2014-05-16T01:00:00+0100 SM:Sample2_Platform3
(IDs changed to something more generic)
I have run BaseRecalibrator, and the read group results are broken down by the PU field:
ReadGroup EventType EmpiricalQuality EstimatedQReported Observations Errors Platform1_FCA M 28.2433 28.1998 6283676400.00 9416437.03 Platform1_FCA I 41.2350 45.0000 6283676400.00 472843.68 Platform1_FCA D 40.2707 45.0000 6283676400.00 590401.90 Platform2_FCB M 28.7442 30.5860 258157515.00 344716.76 Platform2_FCB I 44.4728 45.0000 258157515.00 9216.35 Platform2_FCB D 41.1310 45.0000 258157515.00 19896.01 Platform3_FCC M 22.9983 23.2025 2510670817.00 12588158.43 Platform3_FCC I 44.0934 45.0000 2510670817.00 97824.77 Platform3_FCC D 41.1932 45.0000 2510670817.00 190753.01
This is not what I would expect given that the FAQ for GATK states We do not require value for the CN, DS, DT, PG, PI, or PU fields. I would expect that the RG ID field is used, that is the unique identifier for the Read group, and is by definition unique. Failing that I would expect the SM field to be used as this is stated as being used in the GATK documentation and is unique in my bam file. I'm guessing GATK would use the LB field preferentially but as this is not unique in my case it uses something else that is. I can find no errors or warnings about this in the output or stdout/stderr.
I'm wondering what would happen when I add a second sample, with the same PU IDs?
Can anyone clarify how GATK BaseRecalibrator (and other tools) handle read group fields, especially where LB is the same for multiple read groups? I need to know if I am using the RG fields correctly for my experiment, and also so that the end results of the GATK pipeline best reflect what I am trying to analyse.
Many thanks Anna
When I run the BQSR with GATK 2.8.1, I can't check the PG information on output bam file.
BWA->Picard dedup->GATK LocalRealign->GATK BQSR
Below is the output BAM file after BQSR step.
@PG ID:GATK IndelRealigner VN:2.3-9-gdcdccbb CL:knownAlleles=[(RodBinding name=knownAlleles source=/nG/Reference/hgdownload.cse.ucsc.edu/goldenPath/hg19/KNOWNINDEL/Mills_and_1000G_gold_standard.indels.hg19.vcf)] targetIntervals=/nG/Data/2077/vcf1/node1/chrY/Databind/chrY.bam.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null @PG ID:MarkDuplicates PN:MarkDuplicates VN:1.92(1464) CL:net.sf.picard.sam.MarkDuplicates INPUT=[/nG/Data/2077/step2_makebam/node1-1.bam, /nG/Data/2077/step2_makebam/node1-2.bam, /nG/Data/2077/step2_makebam/node1-3.bam, /nG/Data/2077/step2_makebam/node1-4.bam, /nG/Data/2077/step2_makebam/node1-5.bam, /nG/Data/2077/step2_makebam/node1-6.bam] OUTPUT=/dev/stdout METRICS_FILE=/nG/Data/2077/temp/picard_info.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000000 TMP_DIR=[/nG/Data/2077/temp] QUIET=true VALIDATION_STRINGENCY=LENIENT COMPRESSION_LEVEL=0 MAX_RECORDS_IN_RAM=2000000 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO CREATE_INDEX=false CREATE_MD5_FILE=false
I use --sample_name option to restrict output reads for PintReads walker (GATK version - 2.4.9). This option works well but the output BAM header @RG still contains sample name (SM) that was not included. Is there a way to apply this --sample_name option to BAM header at the same time? I could reheader the BAM files using samtools but I'm looking for a better and easier way.
I am using FastaAlternateReferenceMaker to create consensus sequences as follows:
java -Xmx12g -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R /Reference/chromosome.1.fa -o /Output/consensus.1.fa --variant /VCF/chromosome1.vcf -L /Interval/chromosome.1.list
This command line produces a single fasta file including consensus sequences for the intervals provided in the list file.
(1) Presumably, sequence header for each sequence in the consensus fasta file corresponds to the contig name in the master reference file. Is there any way to modify the command line to print, say, the interval as the sequence header instead of the contig name?
Note: I can feed the program one interval at a time and name the output file accordingly, however, I'd like to stick to submitting my query per one chromosome at a time for the sake of saving up time.
(2) Number of sequences in the consensus fasta do not match the number of non-overlapping intervals in the input list. I would think that this is because some intervals are variant-free and therefore no alternate reference is reported for them. Do you confirm this is the case? I cannot easily check because the issue mentioned in (1).