How do I prepare the input files?

ContEst takes a series of input files representing the sequencing data, population structure, and array-based genotype calls.

Sequencing data

Sequencing data must be in the BAM file format, which is standard across most modern sequencing technologies and sequencing centers.  BAM is the binary version of the SAM format; unfortunately, ContEst cannot read SAM files directly.  More information about this file format can be found in this PDF, or on the Samtools site.  Since ContEst is built as a Genome Analysis Toolkit (GATK) tool, input sequencing data must also conform to the specifications set forth by the GATK.
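
As a quick sanity check before running ContEst, the sketch below tests whether a file starts like a BAM file rather than a plain-text SAM file. It relies on two properties of the BAM format: the file is BGZF-compressed (readable with a standard gzip reader), and the decompressed stream begins with the four bytes "BAM\1". The function name looks_like_bam is hypothetical, and the check covers only the file signature, not the additional GATK requirements.

```python
import gzip

BAM_MAGIC = b"BAM\x01"  # first four bytes of a decompressed BAM stream


def looks_like_bam(path):
    """Return True if the file at `path` starts like a BAM file."""
    try:
        # BAM is BGZF-compressed, and BGZF blocks are valid gzip members,
        # so the standard gzip module can read the start of the stream.
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError):
        # Not gzip-compressed (for example, a plain-text SAM file),
        # or too short to contain a complete gzip member.
        return False
```

A file that fails this check is most likely a SAM file or a truncated download and needs to be converted or re-fetched before use.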

Reference information

The reference file must be in the FASTA format and must also meet these additional constraints.  That said, the common human reference sequences in FASTA format (e.g. HG18, HG19, b36, b37) will work as long as they match the sequence information encoded in your BAM file and array files.
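
One way to catch a mismatched genome build early is to compare sequence names across files. The sketch below is a minimal illustration, not part of ContEst: it compares the names in a FASTA file against the @SQ entries of a sequence dictionary (.dict) file, the same kind of file used later on this page for the Birdseed conversion. It assumes the dictionary follows the SAM header layout (tab-separated @SQ lines with an SN: tag). All function names are hypothetical.

```python
def fasta_sequence_names(lines):
    """Yield the sequence name from each FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            # The name is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            if fields:
                yield fields[0]


def dict_sequence_names(lines):
    """Yield the SN: value from each @SQ line of a sequence dictionary."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@SQ":
            for field in fields[1:]:
                if field.startswith("SN:"):
                    yield field[3:]


def names_match(fasta_lines, dict_lines):
    """Return True if both inputs list the same names in the same order."""
    return list(fasta_sequence_names(fasta_lines)) == list(
        dict_sequence_names(dict_lines)
    )
```

Both functions accept any iterable of text lines, so open file handles can be passed directly, for example names_match(open(fasta_path), open(dict_path)). A typical failure is a reference that names chromosomes "chr1" paired with a dictionary that names them "1".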

Array-based genotype data

Conversion of BirdSeed array input files to Variant Call Format (VCF) files

ContEst requires its genotype input files to be in the Variant Call Format (VCF).  This format was developed for the 1000 Genomes project and has become a standard in the genetics community for encoding information about variant calls, site identity, and other genomically positioned data (including structural variant information).  More information about the specification can be found on the 1000 Genomes website:

The VCF Specification

Many users of the ContEst tool will have their array-based calls in the Birdseed format (after running the Birdseed suite on various array platforms; see the Birdseed website for more information).  Going from Birdseed files to the VCF calls ContEst expects is a two-step conversion.

A new alternative pathway that converts Birdseed call files to a VCF is available here.

Convert Birdseed files to GELI intermediate

The first step is to convert the Birdseed files into an intermediate GELI file, the precursor to the final VCF file required by ContEst.  The tool BirdseedSNPToGeli performs this conversion.  The command to run this tool is:

java -Xmx1g -jar BirdseedSNPsToGeli.jar I=<birdseed.file> O=<sample.id>.reference.geli S=<sample.id> SD=<sequence.dictionary> R=<fasta> SNP60_DEFINITION=<snp60.definition>

Where:

  • birdseed.file is the input Birdseed file
  • sample.id is the name of the sample
  • sequence.dictionary is the sequence dictionary file (.dict file), available from the Picard toolset.  See the directions here.
  • fasta is the FASTA file for the appropriate genome build
  • snp60.definition is the SNP 6.0 definition file, available for hg18 and hg19 here

When this is completed, you should have a GELI file as output.  This is fed into the next step, converting the GELI file into VCF.

Converting GELI files to a VCF input file

To convert a GELI input file to a VCF file, download the following tool, GeliToVCF.jar.zip.  The command to run it is:

java -Djava.io.tmpdir=<tmp.dir> -Xmx2g -jar GeliToVCF.jar O=<sample.id>.vcf GELI=<geli.file> SAMPLE_NAME=<sample.id>

Where:

  • tmp.dir is the location of temporary space on your hard drive
  • sample.id is the name of your sample
  • geli.file is the location of the input GELI file
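
The two conversion steps can be chained in a small script. The sketch below only assembles the two command lines shown above as argument lists; it assumes both jar files sit in the current working directory and that the GELI file written by the first step is named <sample.id>.reference.geli, as in the first command. The function names are hypothetical.

```python
def birdseed_to_geli_command(birdseed_file, sample_id, sequence_dictionary,
                             fasta, snp60_definition):
    """Build the argument list for step 1 (Birdseed to GELI)."""
    return [
        "java", "-Xmx1g", "-jar", "BirdseedSNPsToGeli.jar",
        f"I={birdseed_file}",
        f"O={sample_id}.reference.geli",
        f"S={sample_id}",
        f"SD={sequence_dictionary}",
        f"R={fasta}",
        f"SNP60_DEFINITION={snp60_definition}",
    ]


def geli_to_vcf_command(tmp_dir, sample_id, geli_file):
    """Build the argument list for step 2 (GELI to VCF)."""
    return [
        "java", f"-Djava.io.tmpdir={tmp_dir}", "-Xmx2g", "-jar", "GeliToVCF.jar",
        f"O={sample_id}.vcf",
        f"GELI={geli_file}",
        f"SAMPLE_NAME={sample_id}",
    ]


def conversion_commands(birdseed_file, sample_id, sequence_dictionary,
                        fasta, snp60_definition, tmp_dir):
    """Return both commands in order, feeding step 1's output into step 2."""
    geli_file = f"{sample_id}.reference.geli"  # output name used in step 1
    return [
        birdseed_to_geli_command(birdseed_file, sample_id, sequence_dictionary,
                                 fasta, snp60_definition),
        geli_to_vcf_command(tmp_dir, sample_id, geli_file),
    ]
```

Each returned list can be passed to subprocess.run(command, check=True) so that the second step runs only if the first one succeeds.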

Creating a population frequency VCF file

HapMap population frequencies, mapped to both HG18 and HG19, are available in the download section.  However, if you would like to build your own population frequencies, simply construct a VCF with your frequencies represented in the INFO field in the following format:

<population-name>={<ref-allele>*=<ref-allele-frequency>, <alt-allele>=<alt-allele-frequency>}

For example:

CEU={A*=0.13030, G=0.86970}

This describes a population named "CEU" in which "A" is the reference base, with a population frequency of 0.13030, and "G" is the non-reference base, with a frequency of 0.86970 in this population.
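
When building these entries for many sites, it can help to generate and check them programmatically. The sketch below formats and parses a single entry in the layout shown above. The function names are hypothetical, the five decimal places simply mirror the example, and how the entry sits alongside other INFO fields in your VCF is not covered here.

```python
import re

# <population-name>={<allele>[*]=<frequency>, <allele>=<frequency>}
ENTRY_PATTERN = re.compile(r"^(?P<population>[^=]+)=\{(?P<body>.*)\}$")


def format_population_entry(population, ref_allele, ref_freq,
                            alt_allele, alt_freq):
    """Format one entry; the reference allele is marked with '*'."""
    return (f"{population}={{{ref_allele}*={ref_freq:.5f}, "
            f"{alt_allele}={alt_freq:.5f}}}")


def parse_population_entry(entry):
    """Return (population, reference allele, {allele: frequency})."""
    match = ENTRY_PATTERN.match(entry.strip())
    if match is None:
        raise ValueError(f"not a population frequency entry: {entry!r}")
    ref_allele = None
    frequencies = {}
    for item in match.group("body").split(","):
        allele, _, value = item.strip().partition("=")
        if allele.endswith("*"):
            # The trailing '*' marks the reference allele.
            allele = allele[:-1]
            ref_allele = allele
        frequencies[allele] = float(value)
    return match.group("population"), ref_allele, frequencies
```

For the example above, format_population_entry("CEU", "A", 0.1303, "G", 0.8697) reproduces the string CEU={A*=0.13030, G=0.86970}, and parse_population_entry recovers the population name, the reference allele, and both frequencies from it.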