Neurospora crassa Assembly 3 Repetitive Sequence Analysis
Method
This analysis consisted of running a repeat finder on assembly 3 of N. crassa, grouping the resulting repeat elements into clusters, and then performing basic statistical calculations on the repeat elements, clusters, and relevant subsets of the assembly.
The repeat finder, cross_match.manyreads version 0.990319, was run so that in addition to finding sections of sequence repeated within individual contigs, it would also uncover sections of sequence repeated across contigs. The characteristics of the set of repeat elements found depend on the parameters given to the repeat finder. It should be noted that these parameters, as well as the criteria used in processing the output of the repeat finder, are heuristic choices rather than inherent properties of the genome, and that they play an important role in the analysis. For this analysis, three parameters were set explicitly and all others kept their default values. The "minmatch" parameter, which specifies the minimum length match required to nucleate a SWAT comparison, was set to 12. The "minscore" parameter, which specifies the minimum alignment score, was set to 20. The "penalty" parameter, which specifies the score penalty for mismatched bases, was set to -2.
Running the repeat finder with these parameters produced 16,434,048 repeats. Over 55% of these were less than 30 bases long, and only 1.27% were more than 100 bases long. To focus on the most significant repeats and reduce the complexity of the analysis, repeats less than 100 bases long were filtered out before clustering, leaving a set of 209,472 repeats to be clustered.
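The length filter described above can be sketched as follows. This is a minimal illustration, not the program used for the analysis: the `RepeatPair` record, its field names, and the rule that both copies of a pair must reach the threshold are assumptions made for the example.

```python
from typing import NamedTuple

class RepeatPair(NamedTuple):
    # Hypothetical record for one repeat pair reported by the repeat finder.
    # Coordinates are assumed to be 1-based and inclusive.
    contig_a: str
    start_a: int
    end_a: int
    contig_b: str
    start_b: int
    end_b: int

MIN_REPEAT_LENGTH = 100  # threshold stated in the text

def span_length(start, end):
    """Length of an inclusive span; abs() tolerates reversed coordinates."""
    return abs(end - start) + 1

def keep_pair(pair, min_length=MIN_REPEAT_LENGTH):
    """Assumption: a pair is kept only if both copies reach min_length."""
    return (span_length(pair.start_a, pair.end_a) >= min_length
            and span_length(pair.start_b, pair.end_b) >= min_length)

def filter_pairs(pairs, min_length=MIN_REPEAT_LENGTH):
    """Return the pairs that survive the length filter."""
    return [p for p in pairs if keep_pair(p, min_length)]

# Share of repeats retained, using the counts reported in the text.
percent_retained = 100 * 209_472 / 16_434_048
```

With the reported counts, `percent_retained` rounds to 1.27, which matches the share of repeats stated to be more than 100 bases long.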
Since the repeat finder only outputs repeat pairs, a repeat element that occurs in more than two places is output as many pairs of repeats. Repeat clustering, which was done by a program written for this analysis, is the process whereby pairs that share repeat elements are grouped together. After clustering, clusters with an average element length of less than 200 bases were removed, once again to focus on the more significant repeat elements and to keep the data set manageable. This left a set of 27,844 repeat elements in 6,108 clusters, which was used to generate the statistics for this analysis.
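One way to group pairs that share elements is a union-find (disjoint-set) structure. The sketch below is an assumption about how such clustering could be done, not a description of the program written for this analysis. In particular, it treats two elements as "shared" only when their (contig, start, end) coordinates are identical, whereas a real implementation would probably also need to merge overlapping elements.

```python
from collections import defaultdict

def cluster_pairs(pairs):
    """Group repeat elements into clusters.

    pairs: iterable of (element_a, element_b), where each element is a
    hashable (contig, start, end) tuple. Elements connected through any
    chain of pairs end up in the same cluster.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for element in list(parent):
        clusters[find(element)].add(element)
    return list(clusters.values())

def element_length(element):
    """Length of a (contig, start, end) element, 1-based inclusive."""
    _, start, end = element
    return end - start + 1

def filter_clusters(clusters, min_avg_length=200):
    """Drop clusters whose average element length is below the threshold."""
    return [c for c in clusters
            if sum(element_length(e) for e in c) / len(c) >= min_avg_length]
```

For example, the pairs (A, B) and (B, C) yield a single cluster {A, B, C}, which is why a sequence repeated in three places is reported as one cluster rather than as separate pairs.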
Results
The minimum possible cluster size is 2, and in fact 67% of the clusters had only two repeat elements. The average number of elements in a cluster was 4.56, with a standard deviation of 18, and the largest cluster contained 1,042 elements. As a consequence of the two filtering steps, the smallest repeat element had a length of 186 bases. The maximum element length was 25,408 bases and the average element length was 1,043 bases, with a standard deviation of 1,114.
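The summary statistics above are simple to reproduce from a list of cluster sizes or element lengths. The helper below is an illustrative sketch. Its name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which form was used.

```python
import statistics

def summarize(values):
    """Minimum, maximum, mean, and population standard deviation."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),  # assumption: population form
    }

def fraction_equal(values, target):
    """Fraction of values equal to target, e.g. clusters of size 2."""
    return sum(1 for v in values if v == target) / len(values)

# Consistency check on the reported totals: elements per cluster.
average_cluster_size = 27_844 / 6_108
```

The reported totals are internally consistent: 27,844 elements in 6,108 clusters gives an average cluster size that rounds to 4.56.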
There are 38,044,343 bases in Assembly 3 of the N. crassa genome, 3,797,002 of which lie in at least one repeat element. Thus, 9.98% of the genome is repetitive at the level of individual bases. An important objective of this analysis was to understand the distribution of repetitive elements throughout the genome. However, since the assembly is a work in progress, the distribution of repetitive sequence can currently only be described definitively with respect to known features such as exons. Given that 38% of the bases in the assembly lie in an exon, if repeat elements were distributed at random, about 38% of the bases in repeats would also be expected to lie in exons. However, only 3.6% of repetitive sequence was found to be exonic, roughly one-tenth of the share expected under a random distribution.
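Counting bases that lie in at least one repeat element requires merging overlapping elements so that no base is counted twice. The sketch below shows one way to do this. The function name and the (contig, start, end) representation with 1-based inclusive coordinates are assumptions for illustration.

```python
from collections import defaultdict

def covered_bases(intervals):
    """Count bases lying in at least one interval.

    intervals: iterable of (contig, start, end), 1-based and inclusive.
    Overlapping or adjacent intervals on the same contig are merged first.
    """
    by_contig = defaultdict(list)
    for contig, start, end in intervals:
        by_contig[contig].append((start, end))

    total = 0
    for spans in by_contig.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:      # overlaps or abuts current run
                cur_end = max(cur_end, end)
            else:                         # gap: close the current run
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total

# Figures reported in the text.
percent_repetitive = 100 * 3_797_002 / 38_044_343
# Observed exonic share of repeats relative to the share expected if
# repeats were spread at random (values from the summary table).
observed_over_expected = 3.64 / 38.24
```

With the reported figures, `percent_repetitive` rounds to 9.98 and `observed_over_expected` is a little under 0.1.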
Another important objective of this analysis was to determine the GC content of several relevant subsets of the genome. The GC content of the entire assembly was found to be 49.87%. The GC content of repetitive sequence was found to be 31.6% vs. 51.9% for non-repetitive sequence. For comparison, the GC content of exonic sequence was found to be 56% vs. 46% for non-exonic sequence. It thus appears that the GC content of a section of sequence depends much more on whether the sequence is repetitive than on whether it is exonic.
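GC content for any subset of the assembly can be computed as in the sketch below. Excluding ambiguous bases such as N from the denominator is an assumption, since the text does not say how they were handled. The second function is a consistency check: the whole-genome value should equal the length-weighted average of the values for a subset and its complement.

```python
def gc_percent(sequence):
    """Percent G+C among unambiguous bases (A, C, G, T) in a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    # Assumption: ambiguous bases (e.g. N) are left out of the denominator.
    return 100.0 * gc / acgt if acgt else 0.0

def weighted_gc(fraction_in_subset, gc_subset, gc_complement):
    """Whole-genome GC implied by a subset and its complement."""
    return (fraction_in_subset * gc_subset
            + (1.0 - fraction_in_subset) * gc_complement)

# Values from the summary tables.
from_repeats = weighted_gc(0.0998, 31.60, 51.90)  # repeat vs. non-repeat
from_exons = weighted_gc(0.3824, 56.02, 46.07)    # exon vs. non-exon
```

Both partitions give approximately 49.87, the reported whole-genome GC content, when exonic sequence is taken as 56.02% GC and non-exonic sequence as 46.07% GC, as listed in the summary table.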
Summary
Overview
| Total bases: | 38,044,343 |
| Repeat bases: | 3,797,002 |
| Percent repeat: | 9.98 |
| Number of repeat elements: | 27,844 |
| Average element length: | 1,043 |
| Number of clusters: | 6,108 |
| Average cluster size: | 4.56 elements |
| Minimum cluster size: | 2 elements |
| Maximum cluster size: | 1,042 elements |
| Percent of sequence in exons: | 38.24 |
| Percent of repetitive sequence in exons: | 3.64 |
| Percent of repetitive sequence not in exons: | 96.36 |
% GC Content
| Whole Genome: | 49.87 |
| Non-Exon: | 46.07 |
| Exon: | 56.02 |
| Non-Repeat: | 51.90 |
| Repeat: | 31.60 |
| Repetitive and non-exonic: | 30.79 |
| Repetitive and exonic: | 53.25 |
| Non-repetitive and non-exonic: | 48.89 |
| Non-repetitive and exonic: | 56.05 |
| Exonic or repetitive: | 50.95 |
SP 3/6/02
