Manual

Expand All | Collapse All

WARNING: This documentation is preliminary and still under development!

3.0: Input Requirements

Input to Pilon consists of the input genome in FASTA format along with one or more BAM files of reads aligned to the input genome.

3.0.1: Input Genome (FASTA)

Pilon requires a FASTA of the input genome, either a draft assembly or a reference to be used for variant calling.  If the input is an assembly to be improved, each scaffold should be a FASTA element, with at least 10 consecutive Ns representing gaps between contigs within the scaffold.

Input FASTA sequences should conform to the FASTA Sequence Format.

3.0.2: Aligned Read Files (BAMs)

Pilon also requires one or more BAM files of reads aligned to the input genome.  These BAMs are specific by three command line arguments which specify the type of the input library:

--frags <frags.bam>

for unpaired sequencing reads.for paired-end sequencing of DNA fragments, such as Illumina paired reads of fragment size <1000bp.

--jumps <jumps.bam>

for paired sequencing data of larger insert size, such as Illumina mate pair libraries, typically of insert size >1000bp.

--unpaired <unpaired.bam>

for unpaired sequencing reads.

Pilon makes considerable use of pairing information, so it is far preferred to have paired data.  For all paired BAMs, Pilon assumes that the pairs are in Forward-Reverse (FR) orientation, and the aligner is able to set the “proper pair” flag correctly based on the separation and orientation of the reads in the pair. Note that standard Illumina paired jumping libraries are sequenced in RF orientation. This means that the reads will need to be flipped prior to alignment.
Pilon development was done primarily using BAM files produced by the aligners BWA 0.5.9 (using sampe for paired alignment) and Bowtie 2.

BAM files are the binary form of Sequence Alignment Map (SAM) files. The format is described in the PDF file linked below:

SAM/BAM Format Specifications

Note: You may get the following message if your BAM index file is missing or older than the BAM file itself:

WARNING: BAM index file "path/to/bai_file" is older than BAM "/path/to/bam_file".

To resolve this issue run:

samtools index /path/to_bam_file



3.1: Pilon Arguments

In the examples below, the “pilon” command is shorthand for the java invocation of the pilon jar file, for example:

java --Xmx16G -jar /path/to/pilon.jar

3.1.1: Required

The –genome argument and either frags, jumps or unpaired data must be supplied in order for Pilon to run:

Argument Description Example
–genome <fasta file> The fasta file for the genome to be improved and that was used to generate the bam files. pilon --genome genome.fasta --frags frags.bam
--frags <bam file> A BAM file containing fragment paired-end alignments and aligned to --genome argument using BWA or bowtie2. This argument may be specified more than once. pilon --genome genome.fasta --frags frags.bam
--jumps <bam file> A BAM file containing jump (mate pair) paired-end alignments and aligned to --genome argument using BWA or bowtie2. This argument may be specified more than once. pilon --jumps jumps.bam
--unpaired <bam file> A BAM file containing unpaired alignments and aligned to --genome argument using BWA or bowtie2. This argument may be specified more than once. pilon --unpaired unpaired.bam


3.1.2: Optional

The following arguments are optional, and should be selected based on the desired parameters and outputs:

Output Options

Argument Description Example
--output <prefix> Specifies a prefix for all output files (default "pilon").  pilon --genome genome.fasta --frags frags.bam --output myPilonData
--changes  If specified, a file listing changes in the output.fasta will be generated. This is a tab-separated file with four columns: original coordinate, output coordinate, original sequence, and output sequence.  pilon --genome genome.fasta --frags frags.bam --output myPilonData --changes
--vcf  If specified, a VCF file will be generated  pilon --genome genome.fasta --frags frags.bam --vcf
--tracks Create .bed and .wig diagnostic tracks suitable for viewing in IGV or GenomeView. pilon --genome genome.fasta --frags frags.bam --tracks

Control Options
Argument Description Example
--variant Shorthand to set up parameters for variant calling, as opposed to assembly improvement; currently equivalent to --vcf --fix +breaks pilon --genome genome.fasta --frags frags.bam --variant
--chunksize Input FASTA elements larger than this will be processed in smaller pieces not to exceed this size (default 10000000). pilon --genome genome.fasta --frags frags.bam --chunksize 2000000
--diploid Indicates that the sample is from a diploid organism to direct calling of heterozygous SNPs. WARNING: Pilon has very limited support for diploid genomes; more is planned for the future. pilon --genome genome.fasta --frags frags.bam --diploid
--fix <list> A comma-separated list of categories of issues to try to fix: "bases": try to fix individual bases and small indels; "amb": fix ambiguous bases in fasta output (to most likely alternative); "gaps": try to fill gaps;"local": try to detect and fix local misassemblies; "all": all of the above (default); "none": none of the above; new fasta file will not be written. The following are experimental fix types: "breaks": allow local reassembly to open new gaps (with "local"); "novel": attempt to assemble novel contigs from unaligned non-jump reads. Individual fix issue types can be turned on or off by preceding the fix with "+" or "-", respectively.  pilon --genome genome.fasta --frags frags.bam --fix "bases,amb"
--nonpf Include reads that failed quality filtering by the sequencing instrument pilon --genome genome.fasta --frags frags.bam --nonpf
--duplicates Include reads marked as duplicates in input BAMs pilon --genome genome.fasta --frags frags.bam --duplicates
--targets <list> Only process the specified targets. Targets are comma-separated with each target consisting of a fasta element name and optionally a base range separated from the fasta element name by a colon(:). pilon --genome genome.fasta --frags frags.bam --targets scaffold00001,scaffold00002:10000-20000
--verbose Produce a more verbose output. pilon --genome genome.fasta --frags frags.bam --verbose
--debug Produce copious debugging output (implies verbose as well). pilon --genome genome.fasta --frags frags.bam --debug

Heuristic Options

Argument Description Example
--defaultqual <value> Assumes bases are of this quality if quals are not present in input BAMs (default 15). pilon --genome genome.fasta --frags frags.bam --defaultqual 10
--flank <value> The number of bases at the end of well-aligned reads that will be ignored (default=20). pilon --genome genome.fasta --frags frags.bam --flank 10
--gapmargin <value> Gap closures must be within this number of bases of the original gap size in the input genome to be considered (default=100000, which essentially disables this check) pilon --genome genome.fasta --frags frags.bam  --gapmargin 500
--K Kmer size used by internal assembler (default 47). pilon --genome genome.fasta --frags frags.bam  --K 31
--mindepth <value> Variants (snps and indels) will only be called if there is coverage of valid reads at this depth or more. If this value is ≥1, it is an absolute depth; if it is a fraction <1, then minimum depth is computed by multiplying this value by the mean coverage for the region, with a minimum value of 5 (default 0.1: min depth to call is 10% of mean coverage or 5, whichever is greater). pilon --genome genome.fasta --frags frags.bam --mindepth 0.5
--mingap <value> The minimum size for unclosed gaps (default = 10)  pilon --genome genome.fasta --frags frags.bam --mingap 20
--minqual <value>  The minimum base quality to consider for pileups (default = 0)  pilon --genome genome.fasta --frags frags.bam --minqual 10
--nostrays Skip making a pass through the input BAM files to identify stray pairs, that is, those pairs in which both reads are aligned but not marked valid because they have inconsistent orientation or separation. Identifying stray pairs can help fill gaps and assemble larger insertions, especially of repeat content. However, doing so sometimes consumes considerable memory. pilon --genome genome.fasta --frags frags.bam --nostrays


3.2: Pilon Output


3.2.1: Variant Output

Pilon variant output is stored in a file named pilon.vcf by default (if --output is specified, the file will be named <output>.vcf).

Calls are classified by small number VCF FILTER tags:

Filter Tag Description
PASS A passing call, either reference confirmation or difference
Amb Ambiguous; significant evidence for more than one allele at this position. Meant for haploid genomes, this filter tag is suppressed by the --diploid argument
LowCov Valid read coverage less than the threshold controlled by the --mindepth argument
Del Provides pileup information for loci which were removed by a variation in another line; this gives a sense of what was going on at that locus if the larger variation were not called

 

Block substitutions resulting from local reassembly of suspicious regions are represented by VCF Structural Variant records.  Pilon assigns SVTYPE=INS if the variant contains more bases than the reference region, otherwise SVTYPE=DEL.

Pilon includes many computed values in the VCF INFO field; here is an example along with a description of the values:

DP=38;TD=55;BQ=25;MQ=19;QD=2;BC=0,22,2,14;QP=0,72,1,27;PC=958;IC=0;DC=0;XC=0;AC=1;AF=0.27
Example Description
DP=38 Depth of valid reads in pileup (not invalid pair; not softclipped)
TD=144 Total Depth, including reads excluded from pileups
BQ=25 Mean base base quality in pileup
MQ=19 Mean mapping quality in pileup
QD=2 Quality by depth; meant to give a sense of how confident the base call is. Don't read too much into this number.
BC=0,22,2,14 Base count in pileups (order A,C,G,T)
QP=0,72,1,27 Percentage of weighted evidence for each base (order A,C,G,T)
PC=958 Physical coverage of valid pairs crossing this locus
IC=0 Number of reads in pileup calling an insertion at this locus
DC=0 Number of reads in pileup calling a deletion at this locus
XC=0 Number of reads in pileup soft-clipped at this locus
AC=1 Alternate allele count, as defined in VCF spec (0=reference call, 1=heterozygous/ambiguous, 2=alternate call)
AF=0.27 Fraction of evidence in support of alternate allele

Click the link below for a primer on VCF file output format.

Variant Call Format 4.1 at the 1000 Genomes Project

3.2.2: Track Files

If run with the "--tracks" argument, Pilon produces .bed and .wig files that may be viewed in IGV, GenomeView, and other applications that support these formats. The tracks produced by Pilon are as follows:

Track Name Filename Description
Pilon pilonPilon.bed Several classes of issue found by Pilon in a compact format
Changes pilonChanges.wig Changes made by Pilon
Unconfirmed pilonUnconfirmed.wig Non-zero in regions where the input genome was not confirmed
Copy Number pilonCopyNumber.wig The copy number in the genome of the sequence at a given location
Coverage pilonCoverage.wig The sequence coverage at a given position
Bad Coverage pilonBadCoverage.wig The coverage of reads that do not map logically
Delta Coverage pilonDeltaCoverage.wig A measure of local rate-of-change of valid coverage
Dip Coverage pilonDipCoverage.wig A metric designed to identify local dips in coverage, often indicating a contiguity break
Frag Coverage pilonFragCoverage.wig The fragment read coverage at a given position
Physical Coverage pilonPhysicalCoverage.wig The physical coverage at a given position
GC track pilonGC.wig The percent GC of sequence at a given location
Pct Bad pilonPctBad.wig Percentage of invalid reads (usually bad pairing) compared to total depth
Weighted Qual pilonWeightedQual.wig The weighted quality of reads at a given position
Weighted MQ pilonWeightedMq.wig The weighted mapping quality of reads at given position
Clipped Alignments pilonClippedAlignments.wig A count of how many soft-clipping events started at this locus

The "SD" tracks express a metric as standard deviations from the mean of the metric across a given input fasta element.  The values are integers representing 0.1 sigma, i.e., a value of -21 means 2.1 standard deviations below the mean.

Many of these tracks were primarily of use in developing Pilon's heuristics for making calls and identifying regions of possible misassembly; they may be removed in a future release.

3.2.4: Pilon Fasta

The pilon fasta file contains a version of the assembly in which errors are fixed as specified by the --fix option. Pilon renames the sequence headers by appending "_pilon" to each record name. The name of the file is 'pilon.fasta' unless the user specified an --output parameter.
3.2.5: Pilon Changes

If run with the "--changes" argument, Pilon produces a file containing a space-delimited record of every change made in the assembly as instructed by the --fix option. The format for the file is as follows:

<Original Scaffold Coordinate> <New Scaffold Coordinate> <Original Sequence> <New Sequence>

These headers are further described in the table below

Header Description Example
Original Scaffold Coordinate The coordinate of sequence in the original fasta that was flagged by pilon for a change scaffold00003:690754
New Scaffold Coordinate The coordinate of the changed sequence in the pilon fasta file scaffold00003_pilon:690809-690853
Original Sequence The sequence in the original fasta file that was flagged by pilon for a change . (pilon uses a period to indicate that no sequence was removed)
New Sequence The sequence in the pilon fasta file that was inserted or deleted by pilon to fix the assembly GGCCAGTCCACAACAAGGCAAACATACCAACGCCCACGGCTATCT

Using the examples in the table above, a pilon change file record would appear as follows:

scaffold00003:690754 scaffold00003_pilon:690809-690853 . GGCCAGTCCACAACAAGGCAAACATACCAACGCCCACGGCTATCT

In this case, pilon did not remove any sequence from the position 690754 in scaffold 3 of the original assembly, but instead inserted 45 bases and wrote the output to the pilon fasta file at position 690809-690853 in scaffold 3.

3.2.6: How Pilon Identifies Bad Locations and Regions

Pilon determines which regions may contain contiguity breaks requiring local reassembly by looking for the following motifs:

  • "V-dips," or locations where coverage drops significantly as compared to flanking regions
  • General low coverage
  • Large percentage of soft-clipped alignments at a specific locus
  • A high percentage of invalid pairs
Note that multiple such events in close proximity (~200bp) will be treated as a single suspect region.