Frequently Asked Questions



  1. What is whole-genome shotgun sequencing?

    Whole-genome shotgun sequencing is a technique for determining the DNA sequence of a genome by randomly shearing the DNA, sequencing many overlapping fragments, and inferring the original sequence from the overlaps between those fragments. This method has been used successfully for bacterial genomes and for subclones such as Fosmids. See Assembly for details.

  2. What is an assembly?

    An assembly is a representation of the computationally derived relative positions of a set of sequenced fragments. When these individual sequences overlap, a consensus sequence is derived representing the most likely base at each position in the assembly. In this way, increased sequence redundancy improves the quality of the assembly and the confidence in the consensus. See Assembly for details.
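    As a rough illustration of how a consensus can be derived from overlapping reads, the sketch below takes a simple majority vote at each position. It assumes the reads have already been placed on a common coordinate system, and the function name and input layout are hypothetical. Real assemblers also weigh base quality scores, which this sketch ignores.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus for reads given as (start, sequence) pairs.

    Positions that no read covers are reported as 'N'.
    """
    if not aligned_reads:
        return ""
    length = max(start + len(seq) for start, seq in aligned_reads)
    columns = [Counter() for _ in range(length)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    # Pick the most frequent base in each column.
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)

# Four made-up reads; the last one carries an error (T instead of A at position 4).
reads = [(0, "ACGTAC"), (2, "GTACGG"), (4, "ACGGTT"), (3, "TTCGG")]
print(consensus(reads))  # ACGTACGGTT
```

    The erroneous base is outvoted three to one, which is the sense in which extra redundancy raises confidence in the consensus.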

  3. What does the name "Contig 1.XXX" mean?

    A contig is a sequence fragment created by assembling whole-genome shotgun reads. See Assembly for details.

    Every assembly contains multiple contigs. Each assembly is numbered sequentially. The number preceding the decimal point indicates the assembly number. Contigs within an assembly are also numbered sequentially. Thus "Contig 1.177" indicates contig #177 within assembly 1.
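    When working with contig names in scripts, it can help to split the name into its two numbers instead of treating "1.177" as a decimal. A minimal sketch, assuming names end in the "assembly.contig" pattern described above (the helper name is hypothetical):

```python
import re

# Matches the trailing "<assembly>.<contig>" part of a name such as "Contig 1.177".
_CONTIG_RE = re.compile(r"(\d+)\.(\d+)\s*$")

def parse_contig_name(name):
    """Return (assembly_number, contig_number) as integers."""
    match = _CONTIG_RE.search(name)
    if match is None:
        raise ValueError(f"not a contig name: {name!r}")
    return int(match.group(1)), int(match.group(2))

print(parse_contig_name("Contig 1.177"))  # (1, 177)

# Sorting on the parsed tuple avoids the decimal trap where 1.9 would sort after 1.10.
names = ["Contig 1.10", "Contig 1.9", "Contig 1.177"]
print(sorted(names, key=parse_contig_name))
```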

  4. What is a sequence contig?

    A sequence contig is the extended contiguous sequence that is produced by the assembly process that joins overlapping sequences. See Assembly for details.

  5. Are the contigs ordered?

    Contigs within the same supercontig are ordered. See Assembly for details.

  6. What is a sequence supercontig?

    A supercontig consists of one or more sequence contigs known to occur in a specific order and orientation. Because we sequence each end of the plasmid and Fosmid subclones, we can recognize that when one end of a clone lies in one sequence contig and the other end lies in a different sequence contig, these two contigs probably lie close to each other. To create a supercontig, we require that two or more such linking clones join the two sequence contigs. See Assembly for details.
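    The linking rule can be sketched as follows. This is not the assembler's actual code: the input layout and function names are assumptions, and the sketch only groups contigs that are joined by at least two linking clones. It does not work out order or orientation.

```python
from collections import Counter

def link_contigs(clone_ends, min_links=2):
    """Count clones whose two ends fall in different contigs.

    clone_ends holds one (contig_a, contig_b) pair per clone.
    Only contig pairs supported by at least min_links clones are kept.
    """
    counts = Counter()
    for a, b in clone_ends:
        if a != b:  # both ends in the same contig link nothing
            counts[tuple(sorted((a, b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_links}

def group_supercontigs(contigs, links):
    """Group contigs connected by accepted links (simple union-find)."""
    parent = {c: c for c in contigs}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in links:
        parent[find(a)] = find(b)
    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())

# Made-up clone placements: 1.4 and 1.5 share only one clone, so they stay apart.
clone_ends = [("1.1", "1.2"), ("1.2", "1.1"), ("1.2", "1.3"),
              ("1.4", "1.4"), ("1.3", "1.2"), ("1.4", "1.5")]
links = link_contigs(clone_ends)
print(group_supercontigs(["1.1", "1.2", "1.3", "1.4", "1.5"], links))
```

    Requiring two links rather than one guards against a single chimeric or misplaced clone joining unrelated contigs.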

  7. Are the supercontigs ordered?

    No, the supercontigs are not ordered by number.

  8. How big is the Ustilago maydis genome?

    Our current total unique contig length is ~20 Mb.

  9. What strain was sequenced?

    U. maydis strain 521.

  10. What is the current state of the assembly?

    The current assembly contains 274 sequence contigs >2 kb.

  11. How complete is the current assembly?

    Since the estimated genome size of Ustilago maydis is ~20 Mb, the current release represents 98% of the Ustilago maydis genome and is covered to a depth of ~10X. It excludes very highly conserved repetitive sequence and ribosomal RNA genes.

  12. Are the contigs ordered? For example, is contig 1.5 flanked by contigs 1.4 and 1.6?

    The contigs are numbered sequentially within larger supercontig fragments. Contigs within the same supercontig are positionally ordered. See Ustilago maydis Contig Numbering for details.

  13. How has the sequence been generated for the Ustilago maydis project?

    Our data consist of over 300,000 individual sequencing reads obtained by sequencing each end of plasmids and Fosmids from libraries containing randomly sheared fragments of 4 kb, 40 kb and 110 kb average insert size respectively. See Assembly for details.

  14. Will the genome be finished?

    There are no plans to finish the genome.

  15. How will we know the assembly is correct?

    The quality of the assembly will be assessed in several ways. In addition to requiring that the paired plasmid and Fosmid ends occur in a logical manner, our assembly of the Ustilago maydis genome will be verified through: 1) comparison with available genomic sequences, and 2) correlation with the physical map provided by Bayer CropScience.

  16. What data are available?

    In this version of our data release, all sequence contigs over 2 kb are available. Smaller contigs are sparsely covered and often include poor-quality or contaminated DNA. Sequence contig data can be accessed in two ways: through a BLASTN or TBLASTN search with an option for contig subsequence retrieval, or through FTP download of the entire genome. Contig sequences are subject to change throughout this project, so the version number of each data release is prepended to the contig number (e.g. 1.235 denotes assembly version 1, contig #235).

    Genome assemblies from Bayer and Exelixis are also available.

  17. What about Fosmid end sequences?

    These sequences were crucial for ordering and orienting the genome, and they provide templates for gaps that are not captured by plasmids. They can be accessed using the file endreads.csv.gz.

  18. What happened to two contigs?

    Two of the contigs in the previous release (contigs 1.246 and 1.251) have been identified as mitochondrial contigs. The contig names were changed as follows:

    New name                                  Old name
    Ustilago maydis mitochondria contig 1.1   Ustilago maydis contig 1.246
    Ustilago maydis mitochondria contig 1.2   Ustilago maydis contig 1.251


  1. What format is the download file in?

    The genome data is pure text in multiple FASTA format. The text file has been compressed using gzip. To uncompress the file:

        gunzip ustilago_maydis_1.fasta.gz

  2. Why does gunzip tell me the file is not in gzip format?

    Some browsers (such as newer versions of Netscape) automatically unzip files after download. If this has happened, the file will be about 20 MB rather than the 6 MB of the compressed file. In that case, just rename the file to remove the .gz suffix.
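    If you read the download from a script, you can sidestep the question by checking whether the file really is gzip-compressed before opening it. The sketch below tests for the two-byte gzip signature and then reads the multiple FASTA records. The function names are hypothetical, and the header text of the real records is not assumed.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file

def open_maybe_gzip(path):
    """Open path as text, decompressing only if it starts with the gzip signature."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "r")

def read_fasta(handle):
    """Yield (header, sequence) pairs from a multiple FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

    Passing the handle returned by open_maybe_gzip("ustilago_maydis_1.fasta.gz") to read_fasta should then work whether or not the browser has already uncompressed the file.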

  3. The download fails. What should I do?

    Downloading through the browser uses HTTP. You can also try accessing the FTP site directly via the URL:


  1. Why is my BLAST job taking so long?

    BLAST jobs are queued and handled with other internal Broad processes in a general Load Sharing Facility. The delay for receiving your BLAST results depends on the current load.

  2. Why are my BLAST results split into multiple email messages?

    Some email programs are configured with a maximum message size and will automatically split large files into smaller pieces. If this is undesirable, you will need to reconfigure your email program.

  3. What sequences can I BLAST against?

    You can BLAST your query sequence against our entire assembly or against special sequence sets that were excluded from the assembly.

  4. Why do I get the message "ERROR: BLASTSetUpSearch: Unable to calculate Karlin-Altschul params, check query sequence"?

    From the NCBI Blast FAQ:

    This will happen if your entire query sequence has been masked by low complexity filtering. You will need to turn filtering off to get hits. For further information on filtering, please read the sections of the BLAST FAQs on Q: What is low-complexity sequence? and also Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?

  5. After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?

    From the NCBI Blast FAQ:

    You are seeing the result of automatic filtering of your query for low-complexity sequence that is performed to prevent artifactual hits. The filter substitutes any low-complexity sequence that it finds with the letter "N" in nucleotide sequence (e.g., "NNNNNNNNNNNNN") or the letter "X" in protein sequences (e.g., "XXXXXXXXX"). Low-complexity regions can result in high scores that reflect compositional bias rather than significant position-by-position alignment (Wootton & Federhen, 1996). Filter programs can eliminate these potentially confounding matches from the BLAST reports, leaving regions whose BLAST statistics reflect the specificity of their pairwise alignment. Queries searched with the blastn program are filtered with DUST. The other BLAST programs use SEG.

  6. What is low-complexity sequence?

    From the NCBI Blast FAQ:

    Regions with low-complexity sequence have an unusual composition and this can create problems in sequence similarity searching (Wootton & Federhen, 1996). Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits (please also see Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?)

    In BLAST searches performed without a filter, certain hits will be reported with high scores only because of the presence of a low-complexity region. Most often, this type of match cannot be thought of as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
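    To give a feel for what a low-complexity filter does, the sketch below masks windows whose Shannon entropy falls below a threshold. This is an illustrative stand-in only: it is neither DUST nor SEG, and the window size and threshold are arbitrary assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (in bits) of the letter composition of a string."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def mask_low_complexity(seq, window=12, threshold=1.0, mask_char="N"):
    """Replace every window whose entropy is below threshold with mask_char."""
    masked = list(seq)
    for start in range(0, len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)

# The nucleotide example from the answer above is masked completely.
print(mask_low_complexity("AAATAAAAAAAATAAAAAAT"))  # NNNNNNNNNNNNNNNNNNNN
# A made-up sequence with balanced composition is left untouched.
print(mask_low_complexity("ACGTGCATCGATGCTAGCTA"))  # ACGTGCATCGATGCTAGCTA
```

    A query that is masked from end to end in this way leaves nothing to align, which is the situation behind the Karlin-Altschul error described in question 4.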


  1. What's the Broad Institute?

    The Eli and Edythe L. Broad Institute is a partnership among MIT, Harvard and affiliated hospitals and the Whitehead Institute for Biomedical Research. Its mission is to create the tools for genomic medicine and make them freely available to the world and to pioneer their application to the study and treatment of disease.

  2. What's FGI?

    The Fungal Genome Initiative.

  3. How do I cite the sequence for publication?

    Publications should include the following citation:
    Ustilago maydis Sequencing Project. Broad Institute of MIT and Harvard (

  4. Who do I contact with questions about the sequencing?

    For additional help or to send feedback about the website, please email annotation-webmaster(at)

  5. Where are the beautiful photos from?

    The photos on the front page are: (from top to bottom)

    • sporidia, courtesy Gero Steinberg at MPI Terrestrial Microbiology
    • Fuz- mutants (mutants that form altered or no filaments on charcoal agar).
      Photomicrograph by Flora Banuett at California State University
    • The filamentous form of U. maydis in culture.
      Photomicrograph by Flora Banuett at California State University
    • Teliospores of U. maydis.
      Photomicrograph by Flora Banuett at California State University
    • Corn cob infected by Ustilago maydis, courtesy Joerg Kaemper at MPI Terrestrial Microbiology