# Tagged with #2.7 0 documentation articles | 0 announcements | 2 forum discussions

No posts found with the requested search criteria.
No posts found with the requested search criteria.

I'm somewhat struggling with the new negative training model in 2.7. Specifically, this paragraph in the FAQ causes me trouble:

Finally, please be advised that while the default recommendation for --numBadVariants is 1000, this value is geared for smaller datasets. This is the number of the worst scoring variants to use when building the model of bad variants. If you have a dataset that's on the large side, you may need to increase this value considerably, especially for SNPs.

And so I keep thinking about how to scale it with my dataset, and I keep wanting to just make it a percentage of the total variants - which is of course the behavior that was removed! In the Version History for 2.7, you say

Because of how relative amounts of good and bad variants tend to scale differently with call set size, we also realized it was a bad idea to have the selection of bad variants be based on a percentage (as it has been until now) and instead switched it to a hard number

Can you comment a little further about how it scales? I'm assuming it's non-linear, and my intuition would be that smaller sets have proportionally more bad variants. Is that what you've seen? Do you have any other observations that could help guide selection of that parameter?

I get the following error when running the unified genotyper on the attached bam file:

##### ERROR MESSAGE: SAM/BAM file ./MD_CHW_AAC_2699.recalibrated.bam is malformed: Premature EOF; BinaryCodec in readmode; file: /data/itch/winni/proj/marchini/converge/variantCalling/./MD_CHW_AAC_2699.recalibrated.bam

I was using this code:


/usr/local/bin/java -Xmx128G -jar ~/src/GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar -T UnifiedGenotyper -I ./MD_CHW_AAC_2699.recalibrated.bam -R /data/1kg/reference_v37d5/hs37d5.fa -o ./test.vcf



There are two bam index files located in the same directory:


MD_CHW_AAC_2699.recalibrated.bam.bai
MD_CHW_AAC_2699.recalibrated.bai



The second index (.bai) is outdated and older than the bam file. The first index (.bam.bai) is newer than the bam file. The first index is valid and the second index is not valid. What I would expect is for GATK to do one of the following:

1. Use the newer of the two index files and run through
2. Use the .bai file and die with an error saying that the index file is older than the bam file

Instead, what I think GATK is doing is checking that one of the two index files is newer than the bam file, and then using the .bai file regardless of which index was in fact the newer of the two.

I see this error in GATK 2.6-5 and 2.7-2. Is this a bug?

Regards, Warren Kretzschmar