Quickstart

This Quickstart guide gets you up and running with Conrad for basic gene prediction.  One of Conrad's major strengths is its flexibility: users can apply their own data to improve the gene calling process.  Right out of the box, though, Conrad is a state-of-the-art gene prediction program that can call genes on a single reference sequence or use comparative data in the form of alignments of closely related species.

Conrad comes packaged with trained models for several organisms.  If you want to call genes in one of these organisms, you can use the trained models as is and skip the section related to training, making the process even easier.  If not, you will need to train the model yourself given a trusted set of genes from your organism.  Conrad is easily trainable by users, and the Quickstart guide covers the entire training and prediction process.

Install and Verify Conrad

Conrad is written in Java and requires Java 1.5 or higher to be installed on your machine in order to run.  If you don't have Java, download it now.   
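A quick way to see which Java runtime, if any, is on your PATH is sketched below. The check_java helper name is illustrative and not part of Conrad; the only command it relies on is the standard java -version, which prints its report to stderr.

```shell
# Print the first line of `java -version`, or a hint if Java is missing.
# `java -version` writes to stderr, hence the 2>&1 redirect.
check_java() {
    if command -v java >/dev/null 2>&1; then
        java -version 2>&1 | head -n 1
    else
        echo "java not found on PATH; install Java 1.5 or higher"
    fi
}

check_java
```

Any reported version of 1.5 or later satisfies the requirement stated above.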

To get started, download Conrad and unpack it into any directory.  There is no install program; once unpacked, Conrad is ready to run.  Conrad comes packaged with some sample data and models which you can use to verify that it is working correctly.  They also serve as useful examples to help you get started.  To train the default model on the sample data on Unix or Mac, change to the directory where you unpacked Conrad and type:

bin/conrad.sh train models/singleSpecies.xml samples/validation/trainingData trainingFiles/validation.ser

and on Windows:

bin\conrad.bat train models/singleSpecies.xml samples/validation/trainingData trainingFiles/validation.ser

After a minute or two the file trainingFiles/validation.ser should be created and you should see output similar to the following:

21:36:35WARNING calhoun.analysis.crf.features.interval13.StateTransitionsInterval13.train - 7 genes, 4 introns, 1.57 exons/gene
21:36:35WARNING calhoun.analysis.crf.statistics.MixtureOfGammas.setup - you called a mixture of gammas model but set flag to force it to model as an exponential length distribution.
21:36:35WARNING calhoun.analysis.crf.statistics.MixtureOfGammas.setup - fewer than 20 lengths supplied for training; modeling with an exponential distribution instead of a mixture of Gammas
21:36:35WARNING calhoun.analysis.crf.statistics.MixtureOfGammas.setup - fewer than 20 lengths supplied for training; modeling with an exponential distribution instead of a mixture of Gammas
21:36:35WARNING calhoun.analysis.crf.statistics.MixtureOfGammas.setup - fewer than 20 lengths supplied for training; modeling with an exponential distribution instead of a mixture of Gammas
Trained in 0.047 seconds.
Training weights
21:36:35WARNING calhoun.analysis.crf.solver.CacheProcessorDeluxe.setTrainingData - Discarding sequence 0 Seq #0 Pos 326 Training segment length 3 state: 1 outside the allowed length 5-5000
21:36:35WARNING calhoun.analysis.crf.solver.CacheProcessorDeluxe.setTrainingData - Using 6 training sequences. Discarded 1 because of state length or transition problems.
21:36:53WARNING calhoun.analysis.crf.solver.CacheProcessorDeluxe.setTrainingData - Discarding sequence 0 Seq #0 Pos 326 Training segment length 3 state: 1 outside the allowed length 5-5000
21:36:53WARNING calhoun.analysis.crf.solver.CacheProcessorDeluxe.setTrainingData - Using 6 training sequences. Discarded 1 because of state length or transition problems.
21:36:54WARNING calhoun.analysis.crf.solver.StandardOptimizer.optimize - NOTE: You ARE NOT requiring convergence of LBFGS
21:37:00WARNING calhoun.analysis.crf.solver.StandardOptimizer.optimize - Objective value unchanged: -0.99853355 returned 1 times.
Trained weights in 34.516 seconds. 34.562999999999995 total.

Assuming the training was successful you can go ahead and predict genes on the sample data.  On Unix or Mac the command is:

bin/conrad.sh test trainingFiles/validation.ser samples/validation/testData samples/validation/output

and on Windows:

bin\conrad.bat test trainingFiles/validation.ser samples/validation/testData samples/validation/output

Note that this command uses the 'test' option which compares the output with the correct answers. For de novo prediction, one would use the 'predict' option instead. Note also that samples/validation/testData contains the same data as samples/validation/trainingData, which means the above is an in-sample test.

A GTF and a .dat file will be written to the samples/validation/output directory and you should see output similar to the following:

Beginning test
Testing complete
NOTE: If you're using the CRF for prediction and pass in a dummy (e.g. all zeros) hidden sequence, then many of the following statistics will not be meaningful
[State=intergenic] ( TP=2774, FP=0, FN=47, TN=1933 ) ( AP=2821, AN=1933, PP=2774, PN=1980 ) ( CC=0.980 ) ( ACP=0.990 ) ( AC=0.980 ) ( sens=0.983, spec=1.000, avSS=0.992 )
[State=exon0] ( TP=292, FP=0, FN=3, TN=4459 ) ( AP=295, AN=4459, PP=292, PN=4462 ) ( CC=0.995 ) ( ACP=0.997 ) ( AC=0.995 ) ( sens=0.990, spec=1.000, avSS=0.995 )
[State=exon1] ( TP=174, FP=0, FN=0, TN=4580 ) ( AP=174, AN=4580, PP=174, PN=4580 ) ( CC=1.000 ) ( ACP=1.000 ) ( AC=1.000 ) ( sens=1.000, spec=1.000, avSS=1.000 )
[State=exon2] ( TP=1229, FP=15, FN=0, TN=3510 ) ( AP=1229, AN=3525, PP=1244, PN=3510 ) ( CC=0.992 ) ( ACP=0.996 ) ( AC=0.992 ) ( sens=1.000, spec=0.988, avSS=0.994 )
[State=intron0] ( TP=186, FP=35, FN=0, TN=4533 ) ( AP=186, AN=4568, PP=221, PN=4533 ) ( CC=0.914 ) ( ACP=0.958 ) ( AC=0.917 ) ( sens=1.000, spec=0.842, avSS=0.921 )
[State=intron1] ( TP=49, FP=0, FN=0, TN=4705 ) ( AP=49, AN=4705, PP=49, PN=4705 ) ( CC=1.000 ) ( ACP=1.000 ) ( AC=1.000 ) ( sens=1.000, spec=1.000, avSS=1.000 )
[State=intron2] ( TP=0, FP=0, FN=0, TN=4754 ) ( AP=0, AN=4754, PP=0, PN=4754 ) ( margins not split )
 
...  
[Coding exons] ( TP=10, FP=1, FN=2 ) ( AP=12, PP=11 )( sens=0.833, spec=0.909, avSS=0.871 )
Perfectly predicted hidden sequences: 6/7 85.71 %
Nucleotide Hidden State Agreement: 4704/4754 98.95 %
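The sens, spec and avSS values in this report can be reproduced from the TP, FP and FN counts on each line. The sketch below is inferred from the sample output rather than taken from Conrad's code: it assumes sens = TP/(TP+FN), spec = TP/(TP+FP) and avSS is their mean, which matches the printed numbers. The sens_spec helper name is hypothetical.

```shell
# Recompute sensitivity, specificity and their average from raw counts.
# Usage: sens_spec TP FP FN
sens_spec() {
    awk -v tpos="$1" -v fpos="$2" -v fneg="$3" 'BEGIN {
        # Nothing to compute when a denominator is zero (e.g. the intron2 line).
        if (tpos + fneg == 0 || tpos + fpos == 0) { print "undefined"; exit }
        sens = tpos / (tpos + fneg)   # share of actual positives that were predicted
        spec = tpos / (tpos + fpos)   # share of predicted positives that are correct
        printf "sens=%.3f spec=%.3f avSS=%.3f\n", sens, spec, (sens + spec) / 2
    }'
}

sens_spec 10 1 2      # [Coding exons] line   -> sens=0.833 spec=0.909 avSS=0.871
sens_spec 186 35 0    # [State=intron0] line  -> sens=1.000 spec=0.842 avSS=0.921
```

Note that "spec" computed this way is the fraction of predictions that are correct, so a state with no false positives scores 1.000 regardless of how many true negatives there are.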

If you can successfully train and test on this validation set, then your Conrad installation is working fine and you can begin working with your own data. 

Training Conrad on a new organism

Selecting a model

To run Conrad on your own data you must first select a model file to use.  A model is defined in an XML file, which is the first argument passed to the Conrad train command.  If you are working with a single genome sequence (no comparative data), then you should begin with the single species model, located at models/singleSpecies.xml.  If you have alignments of closely related genomes, you can use the comparative model located at models/comparative.xml.  More models are available which incorporate other data types, and you can create your own models which incorporate your own features or change the state model used by Conrad.

Preparing data

Once you have selected a model, you will need to prepare a training data set, which consists of a set of known genes for the organism.  Conrad will use these known genes to set the parameters of the model.  This training process is an essential part of gene prediction, and a lot of work can go into selecting a good training set. In general, your set should consist of at least 200 genes which you are confident are correct.  In typical use, there won't be long contiguous stretches of DNA where all genes are well annotated. To handle this, we recommend training on isolated gene sequences, one sequence per gene, with 200 bases of intergenic region on each side. This is how we trained Conrad during development and how we use it in production at the Broad Institute.
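The 200-base flanks described above amount to simple coordinate arithmetic. The sketch below assumes 1-based inclusive coordinates (the GTF convention) and clamps the window to the ends of the contig. The flank_window helper is hypothetical, not a Conrad command, and extracting the sequence itself is left to your own tools.

```shell
# Print the window "start end" that pads a gene by FLANK bases on each side,
# clamped to the contig. Coordinates are 1-based and inclusive.
# Usage: flank_window GENE_START GENE_END CONTIG_LENGTH [FLANK]
flank_window() {
    gene_start=$1
    gene_end=$2
    contig_len=$3
    flank=${4:-200}            # default to the 200 bases recommended above
    win_start=$((gene_start - flank))
    win_end=$((gene_end + flank))
    # Do not run off either end of the contig.
    if [ "$win_start" -lt 1 ]; then win_start=1; fi
    if [ "$win_end" -gt "$contig_len" ]; then win_end=$contig_len; fi
    echo "$win_start $win_end"
}

flank_window 1000 2000 5000   # -> 800 2200
flank_window 150 900 1000     # -> 1 1000 (clamped at both ends)
```

When a gene sits closer than 200 bases to a contig end, the clamped window is simply shorter on that side.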

If you are using the single species model, you will require two files of training data:

If you are using the comparative model, you will need the two files above and two additional files:

These files should all be placed in the same directory, and the name of this directory will be the second argument passed to the train command.  For a realistic example, take a look at the samples/aspergillus/trainingData directory.

Conrad's data I/O handling is completely customizable, so if your data is in another format or you need to include additional data, you can reconfigure Conrad to handle it. See the configuration section for more details.

Training the model

The train command runs the actual training process.  The train command requires an XML model file, a location for the input data and the name of the training file.  The training file is a binary file containing the trained parameters for the model, and is what you will use for testing.  The command to run training on Unix and Mac is:

bin/conrad.sh train <xml model file> <training data> <training file>

On Windows, replace conrad.sh with conrad.bat.

Depending on the size of your data set and the complexity of your model, training may take anywhere from a few seconds to many hours. During the optimization process, the Conrad training output will provide you with information on how the optimization is proceeding.

Memory usage

The default scripts that come with Conrad run it with 1 GB of memory. This is enough for many problems, but large problems may require more. Also, if your machine has less than 1 GB of RAM you may want to lower this value. To change it, edit the conrad.sh or conrad.bat file and change the -Xmx1024m argument to a higher or lower number.
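One way to make that edit from the command line on Unix or Mac is sketched below. It assumes the heap setting appears in the script literally as -Xmx followed by digits and a unit letter, as the -Xmx1024m argument does. The set_heap helper is a hypothetical name, not part of Conrad.

```shell
# Rewrite the -Xmx setting in a launcher script, keeping a .bak copy.
# Usage: set_heap SCRIPT SIZE   (SIZE is e.g. 512m, 2048m or 4g)
set_heap() {
    script=$1
    size=$2
    cp "$script" "$script.bak" || return 1
    # Writing back into the existing file keeps its executable permission.
    sed "s/-Xmx[0-9][0-9]*[kKmMgG]/-Xmx$size/" "$script.bak" > "$script"
}
```

For example, set_heap bin/conrad.sh 2048m would raise the limit to 2 GB and leave the original script in bin/conrad.sh.bak.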

Testing the model

Often, you will want to train your model on one set of known genes and then test its performance by running predictions and comparing them to another set of known genes. Conrad contains built-in functionality for this through its test command. The test command requires test data set up in the same format as the training data and produces a set of accuracy and comparison statistics. The command to run testing on Unix and Mac is:

bin/conrad.sh test <training file> <input data> <base filename for output>

On Windows, replace conrad.sh with conrad.bat.
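The train and test steps can be chained in a small wrapper, sketched below for Unix and Mac. The train_and_test function and the CONRAD variable are illustrative names. The sketch also assumes that conrad.sh exits with a non-zero status when training fails, which this guide does not confirm.

```shell
# Path to the Conrad launcher; override by setting CONRAD before calling.
CONRAD=${CONRAD:-bin/conrad.sh}

# Train a model, then test it on held-out data.
# Usage: train_and_test MODEL_XML TRAIN_DIR TRAINING_FILE TEST_DIR OUTPUT_BASE
train_and_test() {
    # Stop here if training reports failure (assumed: non-zero exit status).
    "$CONRAD" train "$1" "$2" "$3" || { echo "training failed" >&2; return 1; }
    # The training file written above is the first argument to test.
    "$CONRAD" test "$3" "$4" "$5"
}
```

For example, train_and_test models/singleSpecies.xml samples/validation/trainingData trainingFiles/validation.ser samples/validation/testData samples/validation/output repeats the validation run from the installation section.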

Predicting genes

In order to predict genes you need a training file for your organism. The training file can be one of the pre-trained files that Conrad ships with or a file you have trained yourself using the steps above. The prediction command takes a training file, a set of input data, and a base name for the output files. The input data is in the same format as that used for training, but without the GTF file containing the genes. The output consists of two files: a GTF file containing the predicted genes and a .dat file containing statistics about the process. The command to run prediction on Unix and Mac is:

bin/conrad.sh predict <training file> <input data> <base filename for output>

On Windows, replace conrad.sh with conrad.bat.

Next steps

There are many more things you can do with Conrad. It is highly configurable. Check out some of the How Tos for more information on: