Converting Birdseed to VCF
ContEst, the contamination estimation tool, requires a VCF as input if you want to provide known genotypes from an array. To convert Birdseed output (the most commonly used tool for processing raw array CEL files into genotypes) into a VCF we've provided a simple python script.
The tool requires one non-standard Python module: pyfasta. You should install this before proceeding, and check that the module is visible on your python classpath.
Other files that are need:
Running the Tool
The parameters for the tool can be found by typing the following python BirdseedToVCF.py -h, which displays the help information:
usage: BirdseedToVCF.py [-h] --birdseed BIRDSEED --output_vcf OUTPUT_VCF
--array_sample ARRAY_SAMPLE --vcf_sample VCF_SAMPLE
--snp_annotation_file SNP_ANNOTATION_FILE --fasta
FASTA [--add_chr] [--one_based_annotations]
Convert a birdseed call file to a VCF for contamination input
-h, --help show this help message and exit
--birdseed BIRDSEED the birdseed file, containing SNP positions and their
genotype call column
the output vcf file
the sample name to look for in the array (probe) file,
this is converted to the output sample name for the
the sample name used for the VCF file; this should be
the same sample used in the BAM file in the
the annotation file, describing the SNP positions and
their alleles from the array manufacturer
--fasta FASTA the reference fasta file
--add_chr sometime the reference sequence uses chr, where the
arrays dont, in this case you want to add a 'chr' to
the array contig name
annotation files are most likely zero based, if this
flag is set we assume its one based
convert the mitochondrial name of MT to M
--show_dropped_sites if we drop a site due to missing information, output a
--sort_on_the_fly sometimes probes can come out of order, depending on
the reference. If this flag is set we store every
record in memory, and sort when outputting (large
if the contigs have a canonical ordering, we'll order
the contigs by the ordering in this file. Each contig
should be on a seperate line, in the order you'd like
An example command line would look like the following:
python BirdseedToVCF.py \
--birdseed birdseed.calls.txt \
--output_vcf GBM-Native-02-0047-Normal.vcf \
--snp_annotation_file GenomeWideSNP_6.na30.annot.hg18.csv.pickle \
--array_sample TRIBE_p_TCGAaffx_B1_2_GBM_Nsp_GenomeWideSNP_6_D08_155834 \
--vcf_sample GBM-Native-02-0047-Normal \
--fasta ~/fasta/hg18.fasta \
In the above example, we're taking the birdseed.calls.txt as input, and outputting a file called GBM-Native-02-0047-Normal.vcf. Since this an hg18 sample (both the birdseed and the pickle file are hg18), we add the --add_chr command, since our birdseed output used simple numbered chromosomes (e.g. 1 instead of chr1), whereas our reference uses the chr1 convention. Since our samples are named differently in the birdseed file, we use the --array_sample parameter to specify the name in the birdseed file, as well as the --vcf_sample to indicate what we'd like to be in the output VCF. This is important, as your sample name here must match what's specified in the BAM file for ContEst to run correctly (there are ways around this, but why not get it right now).
Sample output from a run should look like the following:
python BirdseedToVCF.py --birdseed example_calls_small.txt --output_vcf my_sample.vcf --snp_annotation_file GenomeWideSNP_6.na30.annot.hg18.csv.pickle --array_sample FADGE_p_TCGA_b100_SNP_N_GenomeWideSNP_6_C08_747496 --vcf_sample MY_SAMPLE --fasta ~/fasta/hg18.fasta
[Status] loading the FASTA file...
[Status] loading the annotation file...
[Status] loading from pickled annotation file...
Contig (chromosome) ordering
You may run into trouble when the ordering of the contigs in your birdseed file are different than your genome build Fasta file; the best way to handle this is to use the --contig_ordering to provide a reordering of the output contigs to match that of the Fasta file. This file is simply a list (one chromosome name per line), indicating the order of contigs you'd like to have in the output VCF, regardless of what was seen in the input birdseed file. Contigs not listed in this file will be omitted from the output, even if they're in the input birdseed file.