Broad: Annotation: Argo: Help: Use Case

Summary:

            This document describes a scenario where Argo is used to create plausible gene models based on a variety of displayed genomic data. The genomic data is draft sequence from the sea squirt Ciona savignyi. The intent of the document is to illustrate the use of some of Argo’s key visualization, editing and analysis tools.

Background:

            In a study characterizing cell adhesion molecules encoded by the genome of Ciona savignyi, high-scoring tblastn hits were detected in the contig sc15687 using human integrin α subunits as query sequences. Examination of GenomeScan gene predictions from this contig revealed a gene that encoded a protein with an unusual molecular architecture.

Typical human integrin α subunit

GenomeScan predicted peptide super_15687.nrpepthits.geos.11

The ab initio prediction appears to be a concatenation of two integrin proteins. Argo will be used to display this gene prediction and additional evidence in the context of the genomic sequence, and more accurate gene models will be created.

Contents:

1)    Obtain Example Files

2)    Load Sequence

3)    Load GENSCAN Data

4)    Load GFF Data

5)    Load Blast Data

6)    Adjusting the View

7)    Obtain Genomic Sequence for External Use

8)    Insert Transcript Features

9)    Edit Transcript Features

1) Obtain example files. Make local copies of all the files here.

           

2) Loading sequence into Argo. Open Argo and use the FASTA: Add File command in the Sequence Tree window to load the sequence “super_15687.fa” and open it in a Feature Map window. The length of the displayed sequence can be controlled in the pop-up dialog box. In this case, display all 322648 base pairs.
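Before loading a FASTA file, it can be useful to confirm the record name and sequence length outside of Argo. The following is a minimal Python sketch, not part of Argo; the function name read_fasta and the tiny example record are made up for illustration.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: use the first word after '>' as the record name
            words = line[1:].split()
            name = words[0] if words else "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Tiny made-up record standing in for the contents of a real FASTA file
example = ">demo_contig example record\nACGTACGT\nACGT\n"
sequences = read_fasta(example)
lengths = {name: len(seq) for name, seq in sequences.items()}
print(lengths)
```

For a real file, the text would be read with open("super_15687.fa").read() and the reported length compared with the value shown in Argo's pop-up dialog box.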

The default view can be modified using the View: Map Options menu item. A useful setting here is to check the Segregate Strands? option and provide different colors for the positive and negative strands. In this case, white is the positive strand and off-white is the negative strand.

When an acceptable set of viewing options is obtained, the preferences file can be exported using the User: Export Preferences menu item. Once saved, the preferences can be re-loaded as needed. An example preference file is available here.

3) Loading GENSCAN format Files. Load the GENSCAN and GenomeScan output files (sc15687.gs and sc15687.geos, respectively) by opening Track: Track Table, clicking on the Genscan tab and selecting “Load Tracks from File”. Each data set is treated as a separate track and, once loaded, the tracks can be color coded. In this case, GENSCAN data is light gray and GenomeScan is dark blue.

4) Loading GFF Files. Load the output from GENEID and Augustus using the gff1 tab and the “Load Tracks from File” option. This highlights a useful aspect of the Argo visualization function. There are four different gff1 files containing GENEID results, each one derived from a run of the program using different isochore parameters. When the results are simultaneously loaded into Argo, visual comparison of the output from each run is easy.

Note: The output of any gene prediction algorithm can be loaded into Argo as long as it is possible to obtain (or convert) the output in gff or GENSCAN format. Some possible options include:

GENSCAN: http://genes.mit.edu/GENSCAN.html or

http://bioweb.pasteur.fr/seqanal/interfaces/genscan-simple.html

GenomeScan: http://genes.mit.edu/genomescan.html

GeneID: http://genome.imim.es/software/geneid/geneid.html

Augustus: http://augustus.gobics.de/

Genewise: http://www.ebi.ac.uk/Wise2/advanced.html
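When a prediction program does not write gff directly, its exon coordinates can usually be converted with a short script. The sketch below writes the nine tab-delimited columns of the general GFF layout (seqname, source, feature, start, end, score, strand, frame, group). The function name, the source label "my_predictor" and the coordinates are hypothetical, and whether Argo's gff1 loader accepts exactly this layout is an assumption that should be checked against a small test file.

```python
def exons_to_gff(seqname, source, exons, strand, group):
    """Format (start, end) exon pairs as tab-delimited GFF lines.

    Coordinates are 1-based and inclusive, with start <= end.
    Score and frame are written as '.' (not available).
    """
    lines = []
    for start, end in sorted(exons):
        fields = [seqname, source, "exon", str(start), str(end),
                  ".", strand, ".", group]
        lines.append("\t".join(fields))
    return "\n".join(lines)


# Made-up exon coordinates for a single predicted gene
gff_text = exons_to_gff("sc15687", "my_predictor",
                        [(500, 620), (100, 250)], "+", "gene_1")
print(gff_text)
```

Using the same group value for every exon of a gene is what ties the exons together into one transcript in most GFF consumers.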

5) Loading Blast Results. Load blast results by opening Track: Track Table, clicking on the Blast tab followed by the “Load Tracks from File” option. The blast output files available for display are:

sc15687_huints_tbn.out – tblastn with human integrins as the queries and super_15687.fa as database

sc15687_otherints_tbn.out – tblastn with integrins from a variety of non-mammalian species as the queries and super_15687.fa as database

sc15687_v_CiESTtbx.out – tblastx with super_15687.fa as query and, as database, several EST assemblies encoding the probable ortholog of this gene in Ciona intestinalis, a sea squirt that is about as closely related to Ciona savignyi as human is to mouse.

These blast results were generated with blastall 2.2.4. Other text-only versions of blast output can work with the Blast loader as well. For Blast2seqs, use:

http://bioweb.pasteur.fr/seqanal/interfaces/bl2seq.html

When the blast result file is selected and opened, an option dialog box appears asking if the subject coordinates should be used to draw the feature.

The answer depends on how the blast analysis was performed: the feature must be drawn using whichever set of coordinates refers to the contig. For the tblastn results, super_15687.fa was the database (subject), so the answer is yes. For the EST tblastx output described above, super_15687.fa was the query, so the answer is no.

When the blast files are loaded, a quick glance indicates that the majority of the HSPs are concentrated on a gene located roughly between 90 kb and 140 kb. They are displayed in purple and red in the screen capture below. NOTE: The raw alignment of each HSP is visible in the Inspector window.
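The choice between subject and query coordinates can be sketched in a few lines of Python. This is only an illustration of the reasoning, not Argo's loader; the dictionary keys and the coordinates are hypothetical.

```python
def contig_interval(hsp, contig_is_subject):
    """Return the (start, end) of an HSP on the genomic contig, low coordinate first."""
    if contig_is_subject:
        a, b = hsp["sbjct_start"], hsp["sbjct_end"]
    else:
        a, b = hsp["query_start"], hsp["query_end"]
    # Minus-strand hits are reported with start > end, so normalize the order
    return (min(a, b), max(a, b))


# tblastn: protein query against the contig, so the contig is the subject
tblastn_hsp = {"query_start": 35, "query_end": 120,
               "sbjct_start": 101250, "sbjct_end": 100993}
# tblastx: contig as query against EST assemblies, so the contig is the query
tblastx_hsp = {"query_start": 95010, "query_end": 95300,
               "sbjct_start": 12, "sbjct_end": 302}

tblastn_interval = contig_interval(tblastn_hsp, contig_is_subject=True)
tblastx_interval = contig_interval(tblastx_hsp, contig_is_subject=False)
print(tblastn_interval, tblastx_interval)
```

Choosing the wrong set of coordinates would place the tblastx features at EST positions (here 12 to 302) rather than at their true location on the contig.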

This view, however, shows only a subset of the features. All the blast results can be seen using the Zoom: Vertical Size to Fit menu item.

6) Adjusting the view. The blast results identify the region of the contig that contains the integrin gene. It is possible to focus the view on this region in 2 different ways. Horizontal zoom, usually referred to simply as zoom, adjusts the number of base pairs displayed in the visible portion of the Feature Map. In the above view, all 322 kb are displayed. The easiest way to zoom in on the region of interest is by using the z-(left-click) zoom function. Press and hold the z key while you click the left mouse button and draw a rectangle in the Feature Map that contains the sequence you wish to view. Zooming out can be accomplished by z-(right-click). Other horizontal zoom options are located within the Zoom menu and the right-click pop-up menu.

For example, normal and “Vertical Size to Fit” views of the gene of interest are shown below.

-------------------------------------------

Another option is to open only the portion of the sequence you wish to view, using the pop-up dialog box that appears when a sequence is first opened. A further option is to use the File: Refresh Map Data menu option (also located on the right-click pop-up menu).

7) Obtain Genomic Sequence for External Use. The notion that the initial ab initio gene predictions (GENSCAN in gray, GenomeScan in blue) result from the concatenation of two adjacent copies of an integrin α subunit gene is supported by the repeating nature of the blast results. The DNA selection feature of Argo can be used to further characterize the duplication. Click and hold in the ruler near the top of the Feature Map window. Drag your mouse to include the area you wish to select. The boundaries can be adjusted by clicking on the selected region until it is outlined in blue and dragging the edges as needed. In order to characterize this particular duplication in greater detail, the genomic sequence selections were copied out of the DNA tab of the Inspector window using <ctrl-c> and pasted into NCBI Blast2Sequences. Once finished, the selections can be deleted using the Delete key or the right-click menu option.
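The same kind of region can also be cut out of the contig with a script when the coordinates are known. The following Python sketch assumes 1-based, inclusive coordinates as shown on a sequence ruler; the function name and the short example sequence are made up, and this is not how Argo itself performs the selection.

```python
# Translation table for complementing DNA bases
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_region(sequence, start, end, strand="+"):
    """Return bases start..end (1-based, inclusive); reverse-complement for '-'."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates out of range")
    region = sequence[start - 1:end]
    if strand == "-":
        region = region.translate(_COMPLEMENT)[::-1]
    return region


contig = "AACCGGTTAC"  # made-up stand-in for the contig sequence
plus_region = extract_region(contig, 3, 6)
minus_region = extract_region(contig, 1, 4, strand="-")
print(plus_region, minus_region)
```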

The feature map is shown above with both versions of the gene selected. The 3’ version on the right is active, as indicated by the blue outlining of the selection. The sequences of the 2 genes were copied from the Inspector window, aligned with NCBI blast 2 sequences and found to be about 95% identical. The alignment is about 10 kb in length and each duplicated section contains a copy of the integrin gene.
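For reference, a percent identity figure of this kind is the number of identical aligned positions divided by the alignment length. A minimal Python sketch of that arithmetic follows, assuming two already-aligned strings of equal length with "-" for gaps; the example strings are made up and are not the integrin sequences.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences carry the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if the bases agree and neither is a gap
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


identity = percent_identity("ACGTACGTAC", "ACGTTCGTAC")
print(identity)  # 9 of 10 columns match
```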

5’ gene

 
                       

3’ gene

 
 


It is possible that one of these genes is an unprocessed pseudogene. In order to know for sure, more accurate gene models or Ciona savignyi mRNA sequence would be required. Argo, however, can be used to try to improve the gene models based on the available evidence.

8) Insert Transcript Features. It is clear that there should be two gene models at this locus, and the GENSCAN prediction will be used as starting material. Select the GENSCAN prediction and go to Edit: Insert Feature.

Name the gene/group “integrin” and provide the unique identifier “copyA”. Because the goal is to create 2 gene models out of this 1 GENSCAN prediction, repeat this process a second time, this time providing the unique identifier “copyB”. The inserted features will be highlighted in pink to indicate that they have not been saved.

NOTE: The types of unsaved edits are indicated by the buttons in the upper right-hand corner of Argo. In this case, pending insert is indicated in pink. You can undo any changes you make prior to saving by using the File: View Unsaved Edits menu item. Once saved, the features will be colored according to what you select in the Track: Track Table Dialog box. In this case I’ve selected “sky blue” for my annotated features.

In the view above, the inserted transcripts are shuffled in with the gene prediction results. This view can be adjusted using the Track: Track Table… dialog box. The inserted features can be moved to the center of the Feature Map window by changing their arrangement to “Segregated” or by providing them with a higher priority sort key.

9) Edit Transcript Features. Two identical transcripts have been inserted, one for each copy. The first operation is to remove the overlap between the two genes. A rough estimate of the footprint of each gene is apparent from the blast HSP pattern (black lines below), but a more accurate idea can be obtained using the mRNA translation and protein viewing tool in the Inspector window.

From the repetitive nature of the blast hits, it is clear that the junction is probably exon 18, but more information would be useful. NOTE: an exon count can be obtained by hovering over the exon of interest.

Integrin α subunits have a characteristic sequence near their carboxy-terminus that is similar to the consensus “GFFKR”. This sequence can be searched for using the Inspector window. Select one of the inserted transcripts and select the Protein tab of the inspector window.

Right-clicking in the Protein tab causes a menu to pop up with the option “Find Matching Pattern…”. A search for GFFKR yields no results but GFFK gives a single hit in exon 17. When a sequence is highlighted in the Inspector window, the corresponding sequence is also highlighted in the Feature Map window.
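The effect of relaxing the search pattern can be reproduced outside Argo with a regular expression. The sketch below uses Python's re module; the peptide fragment is made up for illustration, and the pattern syntax accepted by Argo's own search may differ.

```python
import re


def find_motif(protein, pattern):
    """Return the 1-based start position of every non-overlapping match."""
    return [m.start() + 1 for m in re.finditer(pattern, protein)]


# Made-up peptide fragment, not the real integrin translation
protein = "MWKCGFFKSNRPPLEE"

strict_hits = find_motif(protein, "GFFKR")      # exact consensus
relaxed_hits = find_motif(protein, "GFFK")      # shortened pattern
variant_hits = find_motif(protein, "GFFK[RS]")  # allow R or S at the last position
print(strict_hits, relaxed_hits, variant_hits)
```

Shortening the pattern or allowing alternatives at one position is a simple way to find near-consensus motifs that an exact search misses.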

In order to split the 2 inserted features, select one, right-click and select the “Update Feature” option. The “Update Feature” dialog box will appear:

In the structure window, each exon is listed starting with the 5’ end of the gene and going to the 3’ end. The 3’ end of the right-hand gene (relative to the genomic sequence) is in exon 17, so all but the first 17 of these exons are selected and deleted.

The yellow highlight and yellow button in the upper right-hand corner indicate an unsaved change to an existing transcript model (update). A similar modification is done for copyB except that, in this case, everything except the first 17 exons is retained.

Based on the pattern of blast hits, the left-hand copy of the gene (copyB) extends five exons in the 3’ direction relative to the right-hand copy (copyA). As a result, copyB will be shortened. This can be done by selecting the exon of interest and pressing the delete key.
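The split of one exon list into two transcripts amounts to cutting the list after exon 17. A minimal Python sketch of that bookkeeping is shown here; the exon coordinates and the total of 30 exons are made up, since only the position of the cut comes from this example.

```python
def split_transcript(exons, last_exon_of_first_gene):
    """Split a 5'-to-3' exon list into two lists after the given exon number."""
    first = exons[:last_exon_of_first_gene]
    second = exons[last_exon_of_first_gene:]
    return first, second


# Made-up list of 30 exons as (start, end) pairs, ordered 5' to 3'
exons = [(i * 1000 + 1, i * 1000 + 150) for i in range(30)]

copy_a, copy_b = split_transcript(exons, 17)
print(len(copy_a), len(copy_b))
```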

There is another notable difference between the 3’ ends of these genes as well.

             

                        copyB                                                                 copyA

In the above views, copyA has 4 exons and the “GFFKS” sequence (arrowhead). CopyB has only 3 exons and the GFFKS sequence is missing. Add the 4th exon by dragging one of the blast HSPs onto the copyB intron.

The last exon of copyB is shorter than the last exon of copyA. A reasonable extension of this last exon can be obtained by dragging one of the GENEID predicted exons onto the last exon of copyB. A dialog box will appear allowing control of the extension; in this case, merge was selected.

A blast HSP was used to modify the transcript and, as a result, the edited transcript has non-canonical splice junctions. These can be displayed by using the right-click Splice Site Profile option.

The non-canonical junctions are indicated in red. These junctions can be repaired by opening the DNA tab of the inspector window.

NOTE: When non-canonical splice junctions are present, the DNA in the DNA tab itself is colored red. There is an AG a few bases upstream from the first non-canonical junction. Select sequence from that AG into the adjacent exon, right-click and select the “Make Highlighted Sequence Exonic” option.
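The canonical rule being checked is that an intron begins with GT and ends with AG. A small Python sketch of such a check follows, assuming plus-strand exons given as 1-based inclusive coordinates; the sequence and coordinates are made up, and the criteria used by Argo's Splice Site Profile are not documented here, so this is only an approximation of the idea. Rare non-GT-AG introns do exist, so a flagged junction is a prompt for inspection rather than proof of an error.

```python
def intron_boundaries(genome, exons):
    """Return (donor, acceptor) dinucleotides for each intron between exons.

    exons: sorted (start, end) pairs, 1-based inclusive, plus strand only.
    """
    result = []
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        donor = genome[end1:end1 + 2]             # first two intron bases
        acceptor = genome[start2 - 3:start2 - 1]  # last two intron bases
        result.append((donor, acceptor))
    return result


def is_canonical(donor, acceptor):
    """True for the common GT...AG intron boundary."""
    return donor.upper() == "GT" and acceptor.upper() == "AG"


# Made-up sequence: exon 1 (1-6), intron (7-18), exon 2 (19-24)
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"

good = intron_boundaries(genome, [(1, 6), (19, 24)])
bad = intron_boundaries(genome, [(1, 6), (17, 24)])  # second exon starts 2 bases early
print(good, bad)
```

In the second case the acceptor is "TC" rather than "AG", which is the kind of junction that moving an exon boundary to the nearby AG repairs.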

Even after the junctions are repaired, the GFFKS sequence is still not in frame, probably because blast HSP evidence was used to create the second-to-last exon.

Using Featurize (Broken Externally??):

The built-in alignment feature of Argo can be used to correct this problem. The goal is to make this exon as similar as possible to exon 16 of copyA. Select the copyB transcript and copy the gene sequence out of the DNA tab of the Inspector window. Re-activate the Feature Map window and go to the Analyze: Align menu item. Paste the gene sequence of copyB into the target window and the exon 16 sequence into the query window. Align the sequences, then select the Text tab.

The aligned sequence can then be used to create a feature by clicking the Featurize button.

Alternatively, the exon can be corrected using pattern matching and highlighting in a similar manner to the splice junction repair procedure. The final result is the correction of the copyB gene to contain this consensus sequence:

 


Contact: Reinhard Engels
argo-support@broad.mit.edu
617-452-2650
320 Charles, Room 2164