The configuration file (reads_config.xml) is a required input file in the DATA directory. It consists of a set of rules, specified in XML format, that are applied to the input reads in the XML files. The configuration file allows you to correct and augment the information in the XML files. You can create a configuration file with the help of the module FindXmlFeatures. For example configuration files, see the sample projects.
The configuration file allows for an easy way to set parameters that are common to a group of reads. For instance, below we demonstrate how to set insert size and insert size standard deviation for all the reads in a particular library. Also, if the XML files are missing any required fields, the configuration file can contain rules to provide the missing information. When you first attempt to run Arachne, you are likely to get an error message about read information that can be fixed with the configuration file.
The configuration file must set a <type> field for every read. The type field cannot be put in the original XML files, because they must conform to the Trace Archive Format specification.
As an XML file, the configuration file has a formal "document type" definition, which can be found in the file DATA/dtds/configuration.dtd. The file begins with
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "configuration.dtd">
<configuration>
and ends with:
</configuration>
The types of constructs that can be put in between are comments, macros, and, most importantly, rules.
Comments are in the standard XML format, for example:
<!-- ******** Some contaminated reads, to be tossed ******** -->
Macros facilitate abbreviation, pointless or otherwise. For example, a macro could be defined so that every subsequent occurrence of the string $gh is changed to the string gringlehopper. Any text could have been used in place of gringlehopper.
Rules require more explanation, because they have nontrivial syntax. For example,
<rule>
  <name> exclude probable human reads </name>
  <match>
    <match_field>plate_id</match_field>
    <regex>^G10P6007$</regex>
  </match>
  <match>
    <match_field>plate_id</match_field>
    <regex>^G10P6130$</regex>
  </match>
  <action><remove /></action>
</rule>
would cause all reads having plate_id G10P6007 or G10P6130 to be ignored by Arachne.
More generally, a rule is defined by three fields:
* <name>: Explanatory title. One per rule or none at all.
* <match>: Defines which reads are affected by the rule. One or more per rule.
* <action>: Defines what happens to those reads affected by the rule, namely those reads specified in one or more of the match fields. Exactly one per rule.
The <match> tag is composed of at least one pair of <match_field> and <regex> tags. If multiple <match_field> <regex> pairs are used (between a given <match> ... </match> pair), then the match applies to those reads that satisfy all of the intervening conditions.
* <match_field>: The field from the XML ancillary data to test for a matching read, in lower-case.
* <regex>: A regular expression to match against the contents of the specified <match_field>.
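As a sketch of the all-conditions behavior described above, the rule below places two <match_field>/<regex> pairs inside a single <match>, so it should affect only reads that are on plate G10P6007 and whose names also begin with G20. The combination of values is hypothetical, chosen for illustration from fields that appear elsewhere in this document.

```xml
<!-- Illustrative only: both conditions must hold for a read to match -->
<rule>
  <name> remove G20 reads on one plate </name>
  <match>
    <match_field>plate_id</match_field>
    <regex>^G10P6007$</regex>
    <match_field>trace_name</match_field>
    <regex>^G20</regex>
  </match>
  <action><remove /></action>
</rule>
```

Splitting the two pairs into two separate <match> blocks would instead select reads satisfying either condition.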
The <action> tag requires one of the following sub-tags. Only one type of sub-tag is allowed for a single <action> tag, though multiple <set> tags are allowed in a single <action> tag.
* <remove />: Remove any matching read. Only one <remove /> tag is allowed in each <action> tag.
* <unpair />: Remove any pairing information for matching reads. Only one <unpair /> tag is allowed in each <action> tag.
* <set>: Set a field for any matching read. The syntax is

<set>
  <set_field> ... </set_field>
  <value> ... </value>
</set>

but there may be more than one <set> tag within a given <action>.
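For instance, supposing the pairing information for one plate were unreliable, an <unpair /> action might be written as in the sketch below. The scenario is hypothetical, and the plate_id value is borrowed from the earlier example purely for illustration.

```xml
<!-- Illustrative only: keep the reads but discard their pairing information -->
<rule>
  <name> unpair reads from a suspect plate </name>
  <match>
    <match_field>plate_id</match_field>
    <regex>^G10P6130$</regex>
  </match>
  <action><unpair /></action>
</rule>
```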
Values of other fields associated with the matching read may be referred to in <value> tags by prepending the name of the field with an "@". Also, integer arithmetic evaluation will occur when setting numeric fields such as insert_size and insert_stdev. For example,
<set>
  <set_field> insert_stdev </set_field>
  <value>@insert_size/10</value>
</set>
will set insert_stdev to 10% of insert_size.
Rules are applied in the order in which they appear in the configuration file. Interactions between the rules are possible, and consequently, the order in which the rules appear may matter.
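The following pair of rules is a sketch of one such interaction. It assumes that a later rule sees the field values written by an earlier one, which is what the ordering statement above suggests but is not spelled out here. Under that assumption, the second rule would derive insert_stdev from the insert_size set by the first; with the rules reversed, it would use whatever insert_size the read had beforehand.

```xml
<!-- Illustrative only: the second rule relies on the first having run -->
<rule>
  <name> set insert size for G20 reads </name>
  <match>
    <match_field>trace_name</match_field>
    <regex>^G20</regex>
  </match>
  <action>
    <set>
      <set_field>insert_size</set_field>
      <value>4000</value>
    </set>
  </action>
</rule>

<rule>
  <name> derive insert stdev from insert size </name>
  <match>
    <match_field>trace_name</match_field>
    <regex>^G20</regex>
  </match>
  <action>
    <set>
      <set_field>insert_stdev</set_field>
      <value>@insert_size/10</value>
    </set>
  </action>
</rule>
```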
Here is an example of a rule that sets the insert statistics for all reads whose names begin with G20:
<rule>
  <name> set insert stats for G20 reads </name>
  <match>
    <match_field>trace_name</match_field>
    <regex>^G20</regex>
  </match>
  <action>
    <set>
      <set_field>insert_size</set_field>
      <value>4000</value>
    </set>
    <set>
      <set_field>insert_stdev</set_field>
      <value>400</value>
    </set>
  </action>
</rule>
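The same pattern can be keyed on a library rather than on the read name. The sketch below assumes that the XML files carry a library_id field (the lower-case form of the Trace Archive LIBRARY_ID field). The library name LIB4KB and the insert values are made up for illustration.

```xml
<!-- Illustrative only: library_id and LIB4KB are assumptions -->
<rule>
  <name> set insert stats for one library </name>
  <match>
    <match_field>library_id</match_field>
    <regex>^LIB4KB$</regex>
  </match>
  <action>
    <set>
      <set_field>insert_size</set_field>
      <value>4000</value>
    </set>
    <set>
      <set_field>insert_stdev</set_field>
      <value>@insert_size/10</value>
    </set>
  </action>
</rule>
```

Here insert_stdev is written as an expression rather than a literal. This assumes that @insert_size within the same <action> refers to the value just set; if that does not hold, a literal 400 can be used instead.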
Finally, we give an example that shows how to designate every read as being a paired production read:
<rule>
  <name> all reads are paired production reads </name>
  <match>
    <match_field>trace_name</match_field>
    <regex>.</regex>
  </match>
  <action>
    <set>
      <set_field>type</set_field>
      <value>paired_production</value>
    </set>
  </action>
</rule>
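Putting the pieces together, a minimal reads_config.xml might look like the sketch below. It assumes the file closes with the </configuration> tag matching the opening one, and it contains only a comment and a single removal rule. A real project would normally add further rules, including one that sets <type> for every read.

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "configuration.dtd">
<configuration>

<!-- ******** Minimal illustrative configuration ******** -->

<rule>
  <name> exclude one contaminated plate </name>
  <match>
    <match_field>plate_id</match_field>
    <regex>^G10P6007$</regex>
  </match>
  <action><remove /></action>
</rule>

</configuration>
```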
Other ways to configure
Additionally, one may provide an exclusion file, a list of read names to be excluded from the assembly. This file should be named "reads.to_exclude" and should be located in DATA; each line of the file should have one read name. The reads in this file will be excluded prior to the application of any rules.