GC-content

From ArachneWiki


GC-content is the fraction of the bases in a DNA sequence that are cytosine (C) or guanine (G), as opposed to adenine (A) or thymine (T). A genome is called GC-rich if significantly more than 50% of its bases are G or C. The complement of GC-content is AT-content; a genome is called AT-rich if significantly more than 50% of its bases are A or T.
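As a minimal sketch of this definition, the Python functions below compute GC-content and AT-content for a sequence held in a string. The function names are illustrative and not part of Arachne, and skipping ambiguous symbols such as N is an assumption of this sketch rather than something the definition requires.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous symbols such as N are not counted (assumption)
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / total


def at_content(sequence: str) -> float:
    """AT-content is the complement of GC-content."""
    return 1.0 - gc_content(sequence)
```

For example, `gc_content("GGCCAATT")` evaluates to 0.5, since four of the eight bases are G or C.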

Unbalanced GC-content causes problems in whole-genome shotgun assembly because it introduces cloning bias into the Sanger sequencing process. Because G-C base pairs bind more strongly than A-T base pairs, very GC-rich inserts tend to fold in upon themselves into secondary structures, making it hard for the E. coli cell to take up the insert. Very AT-rich inserts, on the other hand, can be lethal to the E. coli host, resulting in lower cloning rates. The upshot is that Sanger sequencing is biased against regions of a genome with unbalanced GC-content: these regions are sequenced to lower coverage and are difficult to assemble.

Evidence suggests that Solexa reads are susceptible to a bias similar to that of Sanger reads: they are less likely to come from regions that are very GC-rich or GC-poor. The mechanism for this is not known.

Note that A and T tend to appear at similar frequencies to each other, as do C and G. Hence the GC-content of a genome implies approximate values for the frequencies of all four bases, as follows: Let fGC be the GC-content of a genome. Then fC ≈ fG ≈ fGC/2 and fA ≈ fT ≈ (1 - fGC)/2.
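The approximation above can be written as a short Python sketch. The function name is hypothetical, and the result is only the estimate implied by assuming that A and T occur equally often and that C and G occur equally often; it is not a measurement.

```python
def base_frequencies_from_gc(f_gc: float) -> dict:
    """Estimate the four base frequencies implied by a GC-content f_gc.

    Assumes A and T are equally frequent, and C and G are equally frequent.
    """
    if not 0.0 <= f_gc <= 1.0:
        raise ValueError("f_gc must be a fraction between 0 and 1")
    f_at = 1.0 - f_gc
    return {
        "A": f_at / 2,
        "C": f_gc / 2,
        "G": f_gc / 2,
        "T": f_at / 2,
    }
```

With fGC = 0.23, for instance, the estimate is roughly 11.5% each for C and G and roughly 38.5% each for A and T.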

In Arachne, the module SimpleACGTContent evaluates the frequency of each base, and the GC-content, in a fasta or fastb file.
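The following Python sketch shows the kind of computation described here for FASTA-formatted text. It is not the SimpleACGTContent module and does not reproduce its interface or output; the function name is hypothetical, fastb files are not handled, and symbols other than A, C, G, and T are ignored by assumption.

```python
from collections import Counter
from typing import Iterable


def acgt_content(fasta_lines: Iterable[str]) -> dict:
    """Return the frequency of each base, plus the GC-content, for FASTA text.

    fasta_lines is any iterable of lines, such as an open text file.
    """
    counts = Counter()
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence header lines
        counts.update(line.upper())
    # Only unambiguous bases contribute to the total (assumption of this sketch).
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("no A, C, G, or T bases found")
    freqs = {base: counts[base] / total for base in "ACGT"}
    freqs["GC"] = freqs["G"] + freqs["C"]
    return freqs
```

Because the function accepts any iterable of lines, it can be given an open file handle directly or a list of strings when experimenting.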

Effects in genomics

GC-richness in localized areas of a genome points to important regions such as CpG islands and Alu sequences. These regions are studied for their effects on gene regulation and expression.
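One simple way to look for localized GC-richness is to compute GC-content in a sliding window along the sequence. The Python sketch below illustrates the idea; the function name and the default window and step sizes are arbitrary assumptions. It only reports GC-content per window and does not by itself identify CpG islands or Alu sequences, which are usually characterized by further criteria.

```python
def windowed_gc(sequence: str, window: int = 200, step: int = 50) -> list:
    """Return (start, gc_fraction) pairs for each full window in the sequence.

    The fraction is taken over the window length, so ambiguous symbols
    such as N count as non-GC positions (assumption of this sketch).
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    seq = sequence.upper()
    results = []
    # Windows that would run past the end of the sequence are dropped.
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, gc / window))
    return results
```

Windows whose GC fraction stands well above the genome-wide value are candidates for closer inspection.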

The protozoan Plasmodium falciparum has one of the most GC-poor genomes known, with fGC ≈ 23%.
