Expand All | Collapse All
WARNING: This documentation is preliminary and still under development!
3.0: Input Requirements
Input to Pilon consists of the input genome in FASTA format along with one or more BAM files of reads aligned to the input genome.
3.0.1: Input Genome (FASTA)
Pilon requires a FASTA of the input genome, either a draft assembly or a reference to be used for variant calling. If the input is an assembly to be improved, each scaffold should be a FASTA element, with at least 10 consecutive Ns representing gaps between contigs within the scaffold.
Input FASTA sequences should conform to the FASTA Sequence Format.
3.0.2: Aligned Read Files (BAMs)
Pilon also requires one or more BAM files of reads aligned to the input genome. These BAMs are specific by three command line arguments which specify the type of the input library:
for unpaired sequencing reads.for paired-end sequencing of DNA fragments, such as Illumina paired reads of fragment size <1000bp.
for paired sequencing data of larger insert size, such as Illumina mate pair libraries, typically of insert size >1000bp.
for unpaired sequencing reads.
BAM files are the binary form of Sequence Alignment Map (SAM) files. The format is described in the PDF file linked below:
Note: You may get the following message if your BAM index file is missing or older than the BAM file itself:
WARNING: BAM index file "path/to/bai_file" is older than BAM "/path/to/bam_file".
To resolve this issue run:
samtools index /path/to_bam_file
3.1: Pilon Arguments
In the examples below, the “pilon” command is shorthand for the java invocation of the pilon jar file, for example:
java --Xmx16G -jar /path/to/pilon.jar
The –genome argument and either frags, jumps or unpaired data must be supplied in order for Pilon to run:
|–genome <fasta file>||The fasta file for the genome to be improved and that was used to generate the bam files.
||pilon --genome genome.fasta --frags frags.bam|
|--frags <bam file>||A BAM file containing fragment paired-end alignments and aligned to --genome argument using BWA or bowtie2. This argument may be specified more than once.||pilon --genome genome.fasta --frags frags.bam|
|--jumps <bam file>||A BAM file containing jump (mate pair) paired-end alignments and aligned to --genome argument using BWA or bowtie2. This argument may be specified more than once.||pilon --jumps jumps.bam|
|--unpaired <bam file>||A BAM file containing unpaired alignments and aligned to --genome argument using BWA or bowtie2. This argument may be specified more than once.||pilon --unpaired unpaired.bam|
The following arguments are optional, and should be selected based on the desired parameters and outputs:
Specifies a prefix for all output files (default "pilon").
pilon --genome genome.fasta --frags frags.bam --output myPilonData
If specified, a file listing changes in the output.fasta will be generated. This is a tab-separated file with four columns: original coordinate, output coordinate, original sequence, and output sequence.
pilon --genome genome.fasta --frags frags.bam --output myPilonData --changes
If specified, a VCF file will be generated
pilon --genome genome.fasta --frags frags.bam --vcf
Create .bed and .wig diagnostic tracks suitable for viewing in IGV or GenomeView.
pilon --genome genome.fasta --frags frags.bam --tracks
Shorthand to set up parameters for variant calling, as opposed to assembly improvement; currently equivalent to --vcf --fix +breaks
pilon --genome genome.fasta --frags frags.bam --variant
Input FASTA elements larger than this will be processed in smaller pieces not to exceed this size (default 10000000).
pilon --genome genome.fasta --frags frags.bam --chunksize 2000000
Indicates that the sample is from a diploid organism to direct calling of heterozygous SNPs. WARNING: Pilon has very limited support for diploid genomes; more is planned for the future.
pilon --genome genome.fasta --frags frags.bam --diploid
A comma-separated list of categories of issues to try to fix: "bases": try to fix individual bases and small indels; "amb": fix ambiguous bases in fasta output (to most likely alternative); "gaps": try to fill gaps;"local": try to detect and fix local misassemblies; "all": all of the above (default); "none": none of the above; new fasta file will not be written. The following are experimental fix types: "breaks": allow local reassembly to open new gaps (with "local"); "novel": attempt to assemble novel contigs from unaligned non-jump reads. Individual fix issue types can be turned on or off by preceding the fix with "+" or "-", respectively.
pilon --genome genome.fasta --frags frags.bam --fix "bases,amb"
Include reads that failed quality filtering by the sequencing instrument
pilon --genome genome.fasta --frags frags.bam --nonpf
Include reads marked as duplicates in input BAMs
pilon --genome genome.fasta --frags frags.bam --duplicates
Only process the specified targets. Targets are comma-separated with each target consisting of a fasta element name and optionally a base range separated from the fasta element name by a colon(:).
pilon --genome genome.fasta --frags frags.bam --targets scaffold00001,scaffold00002:10000-20000
Produce a more verbose output.
pilon --genome genome.fasta --frags frags.bam --verbose
Produce copious debugging output (implies verbose as well).
pilon --genome genome.fasta --frags frags.bam --debug
|--defaultqual <value>||Assumes bases are of this quality if quals are not present in input BAMs (default 15).||pilon --genome genome.fasta --frags frags.bam --defaultqual 10|
||The number of bases at the end of well-aligned reads that will be ignored (default=20).
||pilon --genome genome.fasta --frags frags.bam --flank 10|
|--gapmargin <value>||Gap closures must be within this number of bases of the original gap size in the input genome to be considered (default=100000, which essentially disables this check)||pilon --genome genome.fasta --frags frags.bam --gapmargin 500|
|--K||Kmer size used by internal assembler (default 47).||pilon --genome genome.fasta --frags frags.bam --K 31|
|--mindepth <value>||Variants (snps and indels) will only be called if there is coverage of valid reads at this depth or more. If this value is ≥1, it is an absolute depth; if it is a fraction <1, then minimum depth is computed by multiplying this value by the mean coverage for the region, with a minimum value of 5 (default 0.1: min depth to call is 10% of mean coverage or 5, whichever is greater).||pilon --genome genome.fasta --frags frags.bam --mindepth 0.5|
|--mingap <value>||The minimum size for unclosed gaps (default = 10)||pilon --genome genome.fasta --frags frags.bam --mingap 20|
|--minqual <value>||The minimum base quality to consider for pileups (default = 0)||pilon --genome genome.fasta --frags frags.bam --minqual 10|
|--nostrays||Skip making a pass through the input BAM files to identify stray pairs, that is, those pairs in which both reads are aligned but not marked valid because they have inconsistent orientation or separation. Identifying stray pairs can help fill gaps and assemble larger insertions, especially of repeat content. However, doing so sometimes consumes considerable memory.||pilon --genome genome.fasta --frags frags.bam --nostrays|
3.2: Pilon Output
3.2.1: Variant Output
Pilon variant output is stored in a file named pilon.vcf by default (if --output is specified, the file will be named <output>.vcf).
Calls are classified by small number VCF FILTER tags:
|PASS||A passing call, either reference confirmation or difference|
|Amb||Ambiguous; significant evidence for more than one allele at this position. Meant for haploid genomes, this filter tag is suppressed by the --diploid argument|
|LowCov||Valid read coverage less than the threshold controlled by the --mindepth argument|
|Del||Provides pileup information for loci which were removed by a variation in another line; this gives a sense of what was going on at that locus if the larger variation were not called|
Block substitutions resulting from local reassembly of suspicious regions are represented by VCF Structural Variant records. Pilon assigns SVTYPE=INS if the variant contains more bases than the reference region, otherwise SVTYPE=DEL.
Pilon includes many computed values in the VCF INFO field; here is an example along with a description of the values:
|DP=38||Depth of valid reads in pileup (not invalid pair; not softclipped)|
|TD=144||Total Depth, including reads excluded from pileups|
|BQ=25||Mean base base quality in pileup|
|MQ=19||Mean mapping quality in pileup|
|QD=2||Quality by depth; meant to give a sense of how confident the base call is. Don't read too much into this number.|
|BC=0,22,2,14||Base count in pileups (order A,C,G,T)|
|QP=0,72,1,27||Percentage of weighted evidence for each base (order A,C,G,T)|
|PC=958||Physical coverage of valid pairs crossing this locus|
|IC=0||Number of reads in pileup calling an insertion at this locus|
|DC=0||Number of reads in pileup calling a deletion at this locus|
|XC=0||Number of reads in pileup soft-clipped at this locus|
|AC=1||Alternate allele count, as defined in VCF spec (0=reference call, 1=heterozygous/ambiguous, 2=alternate call)|
|AF=0.27||Fraction of evidence in support of alternate allele|
Click the link below for a primer on VCF file output format.
3.2.2: Track Files
If run with the "--tracks" argument, Pilon produces .bed and .wig files that may be viewed in IGV, GenomeView, and other applications that support these formats. The tracks produced by Pilon are as follows:
|Pilon||pilonPilon.bed||Several classes of issue found by Pilon in a compact format|
|Changes||pilonChanges.wig||Changes made by Pilon|
|Unconfirmed||pilonUnconfirmed.wig||Non-zero in regions where the input genome was not confirmed|
|Copy Number||pilonCopyNumber.wig||The copy number in the genome of the sequence at a given location|
|Coverage||pilonCoverage.wig||The sequence coverage at a given position|
|Bad Coverage||pilonBadCoverage.wig||The coverage of reads that do not map logically|
|Delta Coverage||pilonDeltaCoverage.wig||A measure of local rate-of-change of valid coverage|
|Dip Coverage||pilonDipCoverage.wig||A metric designed to identify local dips in coverage, often indicating a contiguity break|
|Frag Coverage||pilonFragCoverage.wig||The fragment read coverage at a given position|
|Physical Coverage||pilonPhysicalCoverage.wig||The physical coverage at a given position|
|GC track||pilonGC.wig||The percent GC of sequence at a given location|
|Pct Bad||pilonPctBad.wig||Percentage of invalid reads (usually bad pairing) compared to total depth|
|Weighted Qual||pilonWeightedQual.wig||The weighted quality of reads at a given position|
|Weighted MQ||pilonWeightedMq.wig||The weighted mapping quality of reads at given position|
|Clipped Alignments||pilonClippedAlignments.wig||A count of how many soft-clipping events started at this locus|
The "SD" tracks express a metric as standard deviations from the mean of the metric across a given input fasta element. The values are integers representing 0.1 sigma, i.e., a value of -21 means 2.1 standard deviations below the mean.
Many of these tracks were primarily of use in developing Pilon's heuristics for making calls and identifying regions of possible misassembly; they may be removed in a future release.
3.2.4: Pilon Fasta
The pilon fasta file contains a version of the assembly in which errors are fixed as specified by the --fix option. Pilon renames the sequence headers by appending "_pilon" to each record name. The name of the file is 'pilon.fasta' unless the user specified an --output parameter.
3.2.5: Pilon Changes
If run with the "--changes" argument, Pilon produces a file containing a space-delimited record of every change made in the assembly as instructed by the --fix option. The format for the file is as follows:
<Original Scaffold Coordinate> <New Scaffold Coordinate> <Original Sequence> <New Sequence>
These headers are further described in the table below
|Original Scaffold Coordinate||The coordinate of sequence in the original fasta that was flagged by pilon for a change||scaffold00003:690754|
|New Scaffold Coordinate||The coordinate of the changed sequence in the pilon fasta file||scaffold00003_pilon:690809-690853|
|Original Sequence||The sequence in the original fasta file that was flagged by pilon for a change||. (pilon uses a period to indicate that no sequence was removed)|
|New Sequence||The sequence in the pilon fasta file that was inserted or deleted by pilon to fix the assembly||GGCCAGTCCACAACAAGGCAAACATACCAACGCCCACGGCTATCT|
Using the examples in the table above, a pilon change file record would appear as follows:
scaffold00003:690754 scaffold00003_pilon:690809-690853 . GGCCAGTCCACAACAAGGCAAACATACCAACGCCCACGGCTATCT
In this case, pilon did not remove any sequence from the position 690754 in scaffold 3 of the original assembly, but instead inserted 45 bases and wrote the output to the pilon fasta file at position 690809-690853 in scaffold 3.
Pilon determines which regions may contain contiguity breaks requiring local reassembly by looking for the following motifs:
3.2.6: How Pilon Identifies Bad Locations and Regions
Pilon determines which regions may contain contiguity breaks requiring local reassembly by looking for the following motifs: