XML ancillary files

From ArachneWiki

Revision as of 18:11, 7 April 2008 by Dheiman (Talk | contribs)

The XML ancillary files are located at DATA/traceinfo/*xml*. They contain ancillary data about the reads; each read gets its own entry in one of the files. This ancillary data may be modified and supplemented with the aid of the configuration file.

The XML ancillary files are in the Trace Archive XML format (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc) and are parsed by the module TraceArchiveParser. We use only a subset of the fields specified in the Trace Archive XML definition. Note that all field names are case-insensitive. The following fields are required for each read entry (or may be, depending on the read type):

  • trace_name: The name of the read, which should be unique.
  • plate_id: The name of the plate on which the read resides. For paired production reads, it is normal practice to designate the same plate_id for two physical plates, one having the forward reads and the other having the reverse reads.
  • well_id: The well on the plate that the read came from.
  • type: "paired_production", "unpaired_production", or "transposon". Note that this field is not part of the Trace Archive XML format and therefore must be set using the configuration file.
  • template_id: The name of the template (insert). Arachne identifies forward-reverse read pairs as those sharing the same template_id. Required for reads designated "paired_production" or "transposon". For transposon reads, the template is simply a kludge to associate the pair of reads that come from the same transposon event, so each transposon event should have its own template_id.
  • clip_vector_left: The position at the beginning of the read sequence at which it should be clipped due to vector sequence (i.e. the index of the first base of non-vector sequence).
  • clip_vector_right: The position at the end of the read sequence at which it should be clipped due to vector sequence (i.e. the index of the last base of non-vector sequence).
  • insert_size: For paired production reads, the estimated insert size in bases; for transposon reads, the estimated separation in bases. Required to be non-zero for reads designated "paired_production" or "transposon" (see the type field above).
  • insert_stdev: For paired production reads, the standard deviation of the insert size in bases; for transposon reads, the standard deviation of the separation in bases. Required to be non-zero for reads designated "paired_production" or "transposon".
  • trace_end: The direction of the read on its insert (either F for forward or R for reverse). Required for reads designated "paired_production" or "transposon".
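The rules in the list above can be sketched as a small check in Python. This is only an illustration and not the logic of TraceArchiveParser: the function names (check_read_entry, is_nonzero_number) are hypothetical, and the sketch assumes that trace_name, plate_id, well_id, type, and the two clip_vector fields are needed for every read, and that the type may be written with either underscores or spaces.

```python
# Illustrative sketch only; not Arachne's TraceArchiveParser.
# Assumption: these fields are needed for every read type.
ALWAYS_REQUIRED = ("trace_name", "plate_id", "well_id", "type",
                   "clip_vector_left", "clip_vector_right")
# Fields the page marks as required for paired production and transposon reads.
PAIRED_REQUIRED = ("template_id", "insert_size", "insert_stdev", "trace_end")
PAIRED_TYPES = {"paired_production", "transposon"}
KNOWN_TYPES = PAIRED_TYPES | {"unpaired_production"}


def is_nonzero_number(text):
    """Return True if text parses as a number other than zero."""
    try:
        return float(text) != 0
    except ValueError:
        return False


def check_read_entry(entry):
    """Return a list of problems for one read entry (dict of field -> string)."""
    # Field names are case-insensitive, so compare them in lower case.
    fields = {key.lower(): value.strip() for key, value in entry.items()}
    problems = [f"missing {name}" for name in ALWAYS_REQUIRED
                if not fields.get(name)]

    # Assumption: accept "paired production" as well as "paired_production".
    read_type = fields.get("type", "").replace(" ", "_")
    if read_type and read_type not in KNOWN_TYPES:
        problems.append(f"unknown type {read_type!r}")

    if read_type in PAIRED_TYPES:
        problems += [f"missing {name}" for name in PAIRED_REQUIRED
                     if not fields.get(name)]
        for name in ("insert_size", "insert_stdev"):
            if fields.get(name) and not is_nonzero_number(fields[name]):
                problems.append(f"{name} must be a non-zero number")
        if fields.get("trace_end") and fields["trace_end"] not in ("F", "R"):
            problems.append("trace_end must be F or R")
    return problems
```

An empty list means that the entry satisfies the rules as read here; the type value would normally come from the configuration file rather than from the XML itself.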

The following fields are optional:

  • center_name: The research center from which the read came.
  • library_id or seq_lib_id: The name of the library containing the read. Optional, but highly recommended.
  • ti: The trace archive number.

Example

<trace>
       <trace_name>L3191P13FA1.T0</trace_name>
       <plate_id>L3191P13</plate_id>
       <well_id>A01</well_id>
       <template_id>L3191P13A1</template_id>
       <trace_end>F</trace_end>
       <clip_vector_left>39</clip_vector_left>
       <clip_vector_right>801</clip_vector_right>
       <library_id>L3191M1</library_id>
       <insert_size>1500</insert_size>
       <insert_stdev>225</insert_stdev>
</trace>
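
As a rough illustration of how such an entry can be read, the sketch below parses the example with Python's standard xml.etree.ElementTree module. It is not Arachne's TraceArchiveParser; the helper name read_fields is hypothetical, and the length calculation assumes that both clip positions are inclusive indices, as the field descriptions suggest.

```python
import xml.etree.ElementTree as ET

# The example entry from this page, reproduced as a string.
EXAMPLE = """
<trace>
       <trace_name>L3191P13FA1.T0</trace_name>
       <plate_id>L3191P13</plate_id>
       <well_id>A01</well_id>
       <template_id>L3191P13A1</template_id>
       <trace_end>F</trace_end>
       <clip_vector_left>39</clip_vector_left>
       <clip_vector_right>801</clip_vector_right>
       <library_id>L3191M1</library_id>
       <insert_size>1500</insert_size>
       <insert_stdev>225</insert_stdev>
</trace>
"""


def read_fields(trace_element):
    """Map each child tag (lower-cased, since names are case-insensitive) to its text."""
    return {child.tag.lower(): (child.text or "").strip()
            for child in trace_element}


fields = read_fields(ET.fromstring(EXAMPLE.strip()))

# Assumption: both clip positions are inclusive, so the number of bases
# kept after vector clipping is right - left + 1.
left = int(fields["clip_vector_left"])
right = int(fields["clip_vector_right"])
retained_bases = right - left + 1

if __name__ == "__main__":
    print(fields["trace_name"], fields["template_id"], retained_bases)
```

Note that the example has no type field; as described above, the type is supplied through the configuration file.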