ForDistribution

The ForDistribution module converts Arachne's binary-format output files into a human-readable form suitable for submission to NCBI. The converted files are written to the KEY directory, whose name defaults to ForDistribution. Note that the ForDistribution directory is a subdirectory of DATA and corresponds to one specific SUBDIR, which is recorded in the source file.

Of particular interest are the "markup files" listed below. These files are produced by RunMarkup, which is called by ForDistribution. RunMarkup tags regions of the assembly that are potentially enriched for misassemblies. The markup files are documented here: ftp://ftp.broad.mit.edu/pub/wga/misc/docs/AssemblyMarkup2.0.pdf
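
As a rough illustration of how the tagged regions might be inspected, the sketch below reads a gzipped GFF file and totals the tagged bases per sequence. It assumes the markup GFF files follow the generic GFF layout (tab-separated: seqname, source, feature, start, end, ...), which this page does not confirm; the AssemblyMarkup document above is the authority. The function names are hypothetical and not part of Arachne.

```python
import gzip
from collections import Counter


def parse_gff_lines(lines):
    """Yield one dict per feature line, assuming generic tab-separated GFF columns."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a GFF feature line under the assumed layout
        yield {
            "seqname": cols[0],
            "source": cols[1],
            "feature": cols[2],
            "start": int(cols[3]),  # GFF coordinates are 1-based, inclusive
            "end": int(cols[4]),
        }


def read_gff_gz(path):
    """Read every feature from a gzipped GFF file into a list."""
    with gzip.open(path, "rt") as handle:
        return list(parse_gff_lines(handle))


def tagged_bases_per_sequence(records):
    """Sum feature lengths per sequence (overlapping features are counted twice)."""
    totals = Counter()
    for rec in records:
        totals[rec["seqname"]] += rec["end"] - rec["start"] + 1
    return totals
```

For example, `tagged_bases_per_sequence(read_gff_gz("defects.gff.gz"))` would give a per-sequence count of bases tagged in that file, provided the layout assumption holds.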

File list

  • All files generated by Assemblez and placed in RUN (listed in Output), except assemblez.log
  • Information about this ForDistribution run
    • README: pointer to documentation (this page!)
    • source: which assembly (i.e., which SUBDIR) generated the files in ForDistribution.
    • ForDistribution.command: the exact command used to run the module, including all default command-line arguments.
  • Basic assembly statistics
    • BasicAssemblyStats.out: output from the BasicAssemblyStats module: some core statistics of the assembly (coverage, contig N50, etc.)
    • BasicAssemblyOneLiner.out: output from the BasicAssemblyOneLiner module, in CSV format.
    • ReadUsage.Table: a simple table of statistics on the assembled reads.
    • LibStatsOverview.out: output from the LibStatsOverview module: some library statistics (percent of reads assembled in valid pairs, percent of reads for which their mate was not assembled, etc.)
    • PhysicalCoverageByLib.out: a CSV file detailing the physical coverage, broken down by library.
  • Raw assembly output
    • assembly.agp: agp file of the supercontigs (for each supercontig, a list of contig - gap - contig - gap, etc.)
    • assembly_supers.fasta.gz: gzipped fasta file for output supercontigs (gaps between contigs are filled with N's)
    • assembly_supers.quals.gz: gzipped qual file for output supercontigs (gaps between contigs are filled with N's).
    • assembly.bases.gz: gzipped fasta file of output contigs.
    • assembly.quals.gz: gzipped quals file of output contigs.
    • unplaced.fasta.gz: gzipped fasta of all unplaced reads.
    • unplaced.qual.gz: gzipped qual of all unplaced reads.
  • Markup files
    • qual.qc.xml.MarkupN50.txt
    • low_coverage.qc.xml.MarkupN50.txt
    • defects.gff.gz
    • stretch.gff.gz
    • repeats.gff.gz
    • qual.gff.gz
    • hqd.gff.gz
    • defects.xml.gz
    • stretch.xml.gz
    • repeats.xml.gz
    • qual.xml.gz
    • hqd.xml.gz
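
The gzipped fasta files listed above can be read with standard tools. The following is a minimal sketch that collects sequence lengths from a gzipped fasta file such as assembly.bases.gz and computes an N50 from them. The function names are hypothetical, and the N50 definition used here (the length at which the cumulative sum of lengths, largest first, reaches half the total) is a common convention that may differ in detail from what BasicAssemblyStats reports.

```python
import gzip


def read_fasta_lengths(path):
    """Return {record_name: sequence_length} for a gzipped fasta file."""
    lengths = {}
    name = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token of the header as the name.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Length at which the running total (largest first) reaches half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input
```

For example, `n50(read_fasta_lengths("assembly.bases.gz").values())` gives a contig N50 under this definition.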

If the assembly is mapped onto chromosomes

In some cases, maps can be used to anchor the assembly to chromosomes. When this is done, the ForDistribution directory contains four additional files (actually symbolic links!). These are:

  • mapped.agp.chromosome.agp: agp file of the chromosomes (for each chromosome, a list of contig - gap - contig - gap, etc., in the agp format).
  • mapped.agp.chromosome.qual.gz: the gzipped qual file of the chromosomes.
  • mapped.agp.chromosome.fasta.gz: the gzipped fasta file of the chromosomes.
  • MapSnps.out: the output of the module MapSnps, to tag SNPs.
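
The agp files (assembly.agp and, when present, mapped.agp.chromosome.agp) describe each object as an ordered list of contigs and gaps. The sketch below parses such a file and summarizes each object. It assumes the standard NCBI AGP column layout (tab-separated; component type N or U marks a gap line whose sixth column is the gap length), which this page does not spell out. The function names are hypothetical.

```python
def parse_agp(lines):
    """Yield one dict per AGP line, assuming the standard NCBI AGP columns."""
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"line {number}: expected at least 8 columns")
        record = {
            "object": cols[0],
            "object_beg": int(cols[1]),
            "object_end": int(cols[2]),
            "part_number": int(cols[3]),
            "type": cols[4],
        }
        if cols[4] in ("N", "U"):
            # Gap line: columns 6 and 7 hold the gap length and gap type.
            record["is_gap"] = True
            record["gap_length"] = int(cols[5])
            record["gap_type"] = cols[6]
        else:
            if len(cols) < 9:
                raise ValueError(f"line {number}: component line needs 9 columns")
            # Component line: columns 6 to 9 describe the placed contig.
            record["is_gap"] = False
            record["component_id"] = cols[5]
            record["component_beg"] = int(cols[6])
            record["component_end"] = int(cols[7])
            record["orientation"] = cols[8]
        yield record


def summarize_objects(records):
    """Per object: number of components, total gap bases, and overall length."""
    summary = {}
    for rec in records:
        entry = summary.setdefault(
            rec["object"], {"components": 0, "gap_bases": 0, "length": 0}
        )
        if rec["is_gap"]:
            entry["gap_bases"] += rec["gap_length"]
        else:
            entry["components"] += 1
        entry["length"] = max(entry["length"], rec["object_end"])
    return summary
```

A typical use would be `with open("assembly.agp") as handle: summary = summarize_objects(parse_agp(handle))`, which yields one entry per supercontig (or per chromosome for the mapped file).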