We do not cover the generalities of whole-genome association studies here; please refer to the literature for how to organize, design, and analyze a whole-genome scan for association to phenotype. Instead, we focus on the Birdsuite output files and how you might use them to your advantage in your study, and to a lesser extent on how to perform CNV analyses. (Please check the PLINK website for a more thorough treatment of these topics.) If you intend to use PLINK, please see the file conversion pipeline described below (Birdsuite to PLINK).
We note that CNV analysis is confounded in the same ways as SNP analysis: by batch effects, artifacts arising from differential sample preparation, and so on. We specifically point out files that can be used to filter your data in order to reduce these artifacts, but stress that it is important to measure and control for these effects as much as possible (for example, by correcting for global inflation in association statistics, removing associations that appear to be due to plate artifacts, using permutation testing, confirming top hits visually, and replicating findings).
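As one illustration of correcting for global inflation, the sketch below applies genomic control, a common approach that is not part of Birdsuite itself. It assumes you already have a list of 1-degree-of-freedom chi-square association statistics from your own analysis; the function names are hypothetical.

```python
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Return lambda: the observed median statistic over the expected median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def correct_for_inflation(chi2_stats):
    """Divide every statistic by lambda, but only when lambda exceeds 1."""
    lam = max(genomic_inflation(chi2_stats), 1.0)
    return [stat / lam for stat in chi2_stats]
```

A lambda well above 1 suggests systematic inflation (for example from batch or plate effects) that should be investigated, not just divided away.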
To analyze CNPs, the files needed from Birdsuite output include BASE.canary_calls, BASE.canary_confs, and BASE.probeset_summary. BASE.canary_calls contains one line per CNP and one column per sample (plus one header column). The value at each (cnp_id, sample) position in the matrix indicates the copy number genotype of that sample at that CNP. The locations of the CNPs can be found in the metadata, for example in the file GenomeWideSNP_6.hg18.cnv_defs.
BASE.canary_confs is an identically sized file that contains a value reflecting the confidence of each genotype at the respective location. Values closer to 0 are higher confidence; these are calculated in a manner akin to those for Birdseed. We consider values less than 0.1 to be highly confident.
Finally, BASE.probeset_summary is also an identically sized file, containing the summarized intensity for each (cnp_id, sample) pair, which is the input to Canary. One can manually review the calling by plotting these intensities, coloring by genotype, and checking whether the clustering appears reasonable and robust across each plate.
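A minimal sketch of combining the calls and confidences might look like the following. It assumes both files are tab-delimited and begin with a header row of sample names; neither detail is stated above, so check your own files first. The function names are hypothetical, and the 0.1 cutoff is the high-confidence threshold mentioned above.

```python
import csv


def read_matrix(path):
    """Read a matrix file: assumed tab-delimited, header row of sample names,
    then one row per CNP whose first field is the cnp_id."""
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    samples = rows[0][1:]
    values = {row[0]: row[1:] for row in rows[1:]}
    return samples, values


def confident_calls(calls_path, confs_path, max_conf=0.1):
    """Return {cnp_id: {sample: call or None}}, keeping only calls whose
    confidence value is below max_conf (values closer to 0 are better)."""
    samples, calls = read_matrix(calls_path)
    conf_samples, confs = read_matrix(confs_path)
    if samples != conf_samples:
        raise ValueError("calls and confs files list different samples")
    result = {}
    for cnp_id, genotypes in calls.items():
        conf_row = confs[cnp_id]
        result[cnp_id] = {
            sample: (call if float(conf) < max_conf else None)
            for sample, call, conf in zip(samples, genotypes, conf_row)
        }
    return result
```

Calls are kept as the raw strings from the file, so convert them to integers yourself once you have confirmed how missing calls are encoded.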
Due to the inherent difficulty in calling novel CNVs with high confidence, there are many files available to help prioritize copy-variable segments discovered for each sample.
As Birdseye acts on one chromosome at a time, there is one directory per chromosome with some valuable files in each. These directories are found under BASE.CHR.birdseye_dir/. In each directory there are the following files:
CNsummaries.txt | A matrix of intensities for each CN probe on that chromosome (rows) for each sample (columns). |
CNclusters.txt | Estimates of means and variances of intensities for samples of 0 copies, 1 copy, and 2 copies. The 2-copy cluster is empirically calculated from the data, while the 0- and 1-copy clusters are imputed. Clusters are semicolon delimited; within each cluster the first value is the mean, and the second is the variance. |
SNPsummaries.txt | A matrix of intensities for each SNP probe on that chromosome (rows) for each allele (alternating rows) for each sample (columns). |
SNPclusters.txt | Estimates of means and variances of intensities for each SNP for AA samples, AB (heterozygous) samples, and BB samples. This file is identical to BASE.birdseed_clusters (see below), but broken up by chromosome. |
cn_sample_variances.txt | A sample-specific measure of noise of CN probes. The number reflects the square of the average number of standard deviations a sample is from the expected value; these are carried forward into the HMM such that noisy samples do not over-segment. The largest allowed value is 2.25, and the smallest 0.444. Samples with high relative variance (e.g. greater than 2) indicate either noisy data (most likely) or a lot of copy number variation on that particular chromosome (much less likely). Thus samples with high variance should be filtered out of downstream analyses, or at least you should be aware that their CNV calls may be questionable. |
snp_sample_variances.txt | A sample-specific measure of noise of SNP probes. Otherwise identical to cn_sample_variances.txt. Noisy samples for SNP probes are slightly less common due to the 4-fold redundancy (and subsequent summarization) of SNP probes on the array. |
overall_cn_estimate.txt | An estimate of the total number of copies of that particular chromosome for each sample. Should be close to 2 for most samples for most chromosomes. Values significantly different than 2 indicate chromosomal abnormalities such as trisomies or monosomies. Chromosome 23 represents the X chromosome, and so this value should be close to 1 for males. Values not close to an integer can indicate mosaicism or contamination, and again it would be reasonable to filter downstream analyses based on these numbers. |
cn_segments.txt | This has the actual segmentation of the chromosome for each sample. The tab-delimited columns represent: sample number, copy number, chromosome, segment start, segment end, per-probe quality score, size, number of loci (CN or SNP probes) in the segment, and LOD score of the probability of the segment being the stated copy number versus the copy number of the flanking segments. |
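The per-sample checks described for cn_sample_variances.txt and overall_cn_estimate.txt can be sketched as below. The layout of those files is not specified here, so the sketch takes already-parsed values. The variance cutoff of 2 comes from the text above; the tolerance of 0.2 around the expected copy number and all function names are assumptions for illustration.

```python
def expected_copies(chromosome, is_male):
    """Expected whole-chromosome copy number: 2, except 1 for X in males.
    Only chromosomes 1-23 are handled (23 is the X chromosome)."""
    if not 1 <= chromosome <= 23:
        raise ValueError("only chromosomes 1-23 are supported")
    return 1 if (chromosome == 23 and is_male) else 2


def qc_flags(chromosome, is_male, cn_variance, overall_cn,
             max_variance=2.0, cn_tolerance=0.2):
    """Return a list of reasons to distrust this sample on this chromosome."""
    flags = []
    if cn_variance > max_variance:
        # Most likely noisy data; much less likely extensive real variation.
        flags.append("high_cn_variance")
    if abs(overall_cn - expected_copies(chromosome, is_male)) > cn_tolerance:
        # Possible trisomy/monosomy, mosaicism, or contamination.
        flags.append("unexpected_chromosome_copy_number")
    return flags
```

For example, `qc_flags(1, False, 2.2, 2.05)` flags only the high variance, while `qc_flags(21, False, 1.0, 2.9)` flags only the chromosome-level copy number.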
The file BASE.birdseye_calls has the individual chromosome cn_segments.txt files conveniently concatenated together, with a prepended column giving the sample file name rather than just the sample number (which is the sample order as input to Birdsuite). This file can form the basis for downstream analysis; one might want to filter for well-behaved samples and chromosomes (based on the above chromosome-specific files), for autosomal segments (chromosome column less than 23), for a certain class of copy-variable segments (e.g. 0 for homozygous deletion), for segments of a certain size (e.g. greater than 20 kb), or for segments with a confident LOD score (e.g. greater than 5).
Note that in the current version, chromosome 24 (the Y chromosome) is incorrectly segmented; chromosome 23 (the X chromosome) is reasonably segmented, but many segmental duplications (specifically with the Y chromosome) make analysis on X much more difficult.
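The segment filters suggested above for BASE.birdseye_calls can be sketched as follows. The column order follows the description of cn_segments.txt with the sample file name prepended. The sketch assumes the file is tab-delimited with integer coordinates and sizes, and it skips any row whose numeric columns do not parse (such as a header row, if one is present). The names `Segment` and `filter_segments` are hypothetical.

```python
import csv
from typing import NamedTuple


class Segment(NamedTuple):
    sample_name: str
    sample_number: int
    copy_number: int
    chromosome: int
    start: int
    end: int
    per_probe_score: float
    size: int
    num_loci: int
    lod: float


def parse_segment(fields):
    """Convert the first ten tab-delimited fields into a Segment."""
    return Segment(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]),
        int(fields[4]), int(fields[5]), float(fields[6]),
        int(fields[7]), int(fields[8]), float(fields[9]),
    )


def filter_segments(path, copy_numbers=(0,), min_size=20000, min_lod=5.0):
    """Keep autosomal segments with the requested copy numbers, a size
    above min_size (in bases), and a LOD score above min_lod."""
    kept = []
    with open(path, newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            if len(fields) < 10:
                continue
            try:
                seg = parse_segment(fields)
            except ValueError:
                continue  # unparseable row, e.g. a header line
            if (seg.chromosome < 23
                    and seg.copy_number in copy_numbers
                    and seg.size > min_size
                    and seg.lod > min_lod):
                kept.append(seg)
    return kept
```

Passing `copy_numbers=(0, 1)` would keep both homozygous and heterozygous deletions. Filtering out poorly behaved samples is left to a separate step.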
To analyze SNPs, there are two options:
We have recently created a way to convert your Birdsuite data into a set of files that will encompass the three areas of focus. Please keep in mind that this pipeline is currently in its first release cycle. If you have any questions please direct them to whitwort@broadinstitute.org:
You can find out more about this by reading the documentation, or by downloading the pipeline archive from the download page and reading the manual from there.