Conrad uses a model file to control its behavior. The default models that come with Conrad provide basic gene calling behavior, but these models can be edited to enhance gene calling with additional features and data, or replaced entirely to apply Conrad to a completely different problem. This section describes what the model file specifies and how to edit it.
The model file is an XML file that uses the Java Spring configuration language and can be edited in any regular text editor. If you are not familiar with XML, you should learn about well-formed XML documents before editing the model file.
The Conrad model file describes a set of objects that Conrad uses to define the CRF model and algorithms. Each object also contains configuration information that describes the specifics of how that object should operate. The configuration file allows you to specify different objects to use and thereby change the behavior of Conrad. This allows almost all of the internals of Conrad to be changed from the model file alone, without modifying or recompiling any code inside Conrad itself. The objects that Conrad reads from the configuration file are the inputHandler, outputHandler, inference, optimizer, and model, along with supporting objects such as a cacheProcessor.
Each object is implemented by a Java class which implements a specific Java interface defined by Conrad. The configuration of each object is handled by defining Java properties on the class which are set when the object is created. These properties can be simple data types like numbers and text, or more complicated objects in and of themselves.
Model files may be very simple or very complicated, depending on how much configuration they include. In this section we will walk through the basic gene calling model provided with Conrad, singleSpecies.xml, which contains a significant amount of configuration information and demonstrates the power of this technique.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
<bean id="inputHandler" class="calhoun.analysis.crf.io.InputHandlerDirectory">
<property name="inputReaders">
<map>
<entry>
<key><value>ref.fasta</value></key>
<bean class="calhoun.analysis.crf.io.FastaInput">
<property name="header" value="name"/>
<property name="sequence" value="ref"/>
</bean>
</entry>
</map>
</property>
<property name="hiddenSequenceFile" value="ref.gtf" />
<property name="hiddenStateReader">
<bean class="calhoun.analysis.crf.io.GTFInputInterval13"/>
</property>
</bean>
The first several lines are boilerplate XML text that is present at the beginning of each model file. The real content starts on line 4 with the definition of the first object, or bean, inputHandler. In this case, the inputHandler is specified as an InputHandlerDirectory, which means that the training and prediction data will be contained in a set of consistently named files within the same directory. Other inputHandler objects define different policies, and any class that implements the InputHandler interface can be used. Another example is InputHandlerFile, which reads all of the data from a single file.
The first piece of configuration for the input handler is a property called inputReaders. This property is a map that associates filenames within the data directory with a particular class used to read the data in from that file. In this case, a FastaInput object will be used to read in the file called ref.fasta. This is an example of a property that is configured with another object. Reading the javadoc for FastaInput shows that it creates two input components: one for the header information in the fasta file and another for the sequence. These input components are the actual observations that are accessed by the features. We will see more on this later when we examine the feature configuration. Also see the I/O guide.
A second property of the input handler configuration is hiddenSequenceFile, which indicates that the hidden sequence (the correct gene positions) for the training data will be contained in a file called ref.gtf. The format of that file is controlled by the hiddenStateReader property, which here indicates that the genes are stored as GTF and read in using the GTFInputInterval13 class.
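As a rough sketch, the InputHandlerFile class mentioned above could fill the same role when all data lives in one file. The inputFile property name below is an illustrative assumption, not taken from the Conrad javadoc, which should be consulted for the real configuration:

```xml
<!-- Hypothetical sketch: an input handler that reads all data from a
     single file. The "inputFile" property name is an assumption; check
     the InputHandlerFile javadoc for the actual properties. -->
<bean id="inputHandler" class="calhoun.analysis.crf.io.InputHandlerFile">
  <property name="inputFile" value="training.dat"/>
</bean>
```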
<bean id="outputHandler" class="calhoun.analysis.crf.io.OutputHandlerGeneCallStats">
<property name="inputHandler" ref="inputHandler"/>
<property name="manager" ref="model"/>
</bean>
The beans may be defined in any order, as long as they have the proper ids. The order presented here is simply the one used in the example files. The next section of the file specifies the outputHandler. The output handler must implement the OutputHandler interface. In this example, the handler is the class OutputHandlerGeneCallStats. This class writes out the results as a GTF file and calculates several specific statistics related to gene calling. This output handler must be configured with information about the model and about the input formats. Note that other objects defined in the configuration file can be referenced by id, as the inputHandler and model objects are here.
<bean id="inference" class="calhoun.analysis.crf.solver.SemiMarkovViterbi">
<property name="cacheProcessor" ref="cache"/>
</bean>
The next object, inference, must implement the CRFInference interface. Two inference algorithms currently exist, Viterbi and SemiMarkovViterbi. The latter must be used if segment features (with explicit lengths) are present in the model. The SemiMarkovViterbi algorithm requires a cache processor, an object that is used to store the results of feature functions for quick access during training. The cacheProcessor object is defined below, as it is shared by the inference and optimizer objects.
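For a model without explicit-length segment features, the plain Viterbi algorithm could be substituted. Assuming it lives in the same solver package and needs no cache processor (both are assumptions to verify against the javadoc), the bean might be as simple as:

```xml
<!-- Hypothetical sketch: swapping in the non-semi-Markov Viterbi
     algorithm. The package name and the absence of a cacheProcessor
     property are assumptions; verify against the Conrad javadoc. -->
<bean id="inference" class="calhoun.analysis.crf.solver.Viterbi"/>
```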
<bean id="optimizer" class="calhoun.analysis.crf.solver.TwoPassOptimizer">
<property name="firstPass">
<bean class="calhoun.analysis.crf.solver.StandardOptimizer">
<property name="objectiveFunction">
<bean class="calhoun.analysis.crf.solver.MaximumLikelihoodSemiMarkovGradient" >
<property name="cacheProcessor" ref="cache"/>
</bean>
</property>
</bean>
</property>
<property name="secondPass">
<bean class="calhoun.analysis.crf.solver.StandardOptimizer">
<property name="requireConvergence" value="false"/>
<property name="objectiveFunction">
<bean class="calhoun.analysis.crf.solver.MaximumExpectedAccuracySemiMarkovGradient" >
<property name="cacheProcessor" ref="cache"/>
<property name="score">
<bean class="calhoun.analysis.crf.scoring.SimScoreStateAndExonBoundariesInt13"/>
</property>
</bean>
</property>
</bean>
</property>
</bean>
The optimizer object implements the training algorithm and must implement the CRFTraining interface. In this case we are using a TwoPassOptimizer, a complex object that actually executes two different training algorithms. In the first pass, it performs maximum likelihood training using the StandardOptimizer with a MaximumLikelihoodSemiMarkovGradient. The StandardOptimizer performs an L-BFGS optimization, and the gradient specifies the function to optimize; in this case it is the likelihood of the training data given the weights.

The second pass of the TwoPassOptimizer takes the weights that resulted from the first pass and uses them as the starting point for maximum expected accuracy (MEA) training. This is once again done using the StandardOptimizer, but this time with a different objective function, the MaximumExpectedAccuracySemiMarkovGradient. The MEA algorithm can further be configured with a custom function to define accuracy; in this case we are using the SimScoreStateAndExonBoundariesInt13 function (the default is SimScoreMaxStateAgreement). In addition to the optimizers discussed here, there is also a FixedWeightOptimizer that simply sets the weights without training, and a SeededOptimizer that performs an optimization starting with weights read from a separate training file.
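As an illustration, a model that skips training altogether could swap in the FixedWeightOptimizer mentioned above. The weights property name and list format below are assumptions made for this sketch:

```xml
<!-- Hypothetical sketch: an optimizer that uses fixed weights instead
     of training. The "weights" property name is an assumption; check
     the FixedWeightOptimizer javadoc for the actual configuration. -->
<bean id="optimizer" class="calhoun.analysis.crf.solver.FixedWeightOptimizer">
  <property name="weights">
    <list>
      <value>1.0</value>
      <value>1.0</value>
      <!-- one value per feature function in the model -->
    </list>
  </property>
</bean>
```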
<bean id="cache" class="calhoun.analysis.crf.solver.CacheProcessorDeluxe">
<property name="lookbackArraySize" value="5000"/>
<property name="lookbackArrayFeatureSize" value="3"/>
<property name="ignoreInvalidTrainingData" value="true"/>
<property name="semiMarkovSetup">
<bean id="lengths" class="calhoun.analysis.crf.SemiMarkovSetup">
<property name="ignoreSemiMarkovSelfTransitions" value="true"/>
<property name="maxLengths">
<list>
<value>5000</value>
<value>5000</value>
<value>5000</value>
<value>5000</value>
<value>600</value>
<value>600</value>
<value>600</value>
<value>5000</value>
<value>5000</value>
<value>5000</value>
<value>600</value>
<value>600</value>
<value>600</value>
</list>
</property>
<property name="minLengths">
<list>
<value>18</value>
<value>5</value>
<value>5</value>
<value>5</value>
<value>15</value>
<value>15</value>
<value>15</value>
<value>5</value>
<value>5</value>
<value>5</value>
<value>15</value>
<value>15</value>
<value>15</value>
</list>
</property>
</bean>
</property>
</bean>
The cacheProcessor is not directly required by Conrad, but the inference and training objects we used both require one to be specified. The CacheProcessorDeluxe defined here is the standard object, and it implements the required CacheProcessor interface. Among other details, this instance is configured to automatically filter out training data that contains problems, either invalid transitions or states that are too short or too long. By default these would cause the training to fail. The minLengths and maxLengths properties specify the minimum and maximum allowed segment sizes for each state. The first number in each list specifies the limits for state 0, the second for state 1, and so on. The other parameters listed here are defined in the javadoc for CacheProcessorDeluxe.
<bean id="model" class="calhoun.analysis.crf.features.interval13.Interval13Model">
<property name="narrowBoundaries" value="true"/>
<property name="componentFeatures">
<list>
<!-- First, compute probability of hidden state sequences -->
<bean class="calhoun.analysis.crf.features.interval13.StateTransitionsInterval13">
<property name="inputComponent" value="ref"/>
</bean>
<bean class="calhoun.analysis.crf.features.interval13.StateLengthLogprobInterval13">
<property name="inputComponent" value="ref"/>
<property name="multipleFeatures" value="true"/>
</bean>
<!-- Second, restrict to hidden state sequences that, given reference sequence observations, are legal -->
<bean class="calhoun.analysis.crf.features.interval13.GeneConstraintsInterval13">
<property name="inputComponent" value="ref"/>
</bean>
<!-- Third, compute the probability of the emitted ref sequence given the hidden state -->
<bean class="calhoun.analysis.crf.features.interval13.ReferenceBasePredictorInterval13">
<property name="inputComponent" value="ref"/>
<property name="multipleFeatures" value="true"/>
</bean>
<bean class="calhoun.analysis.crf.features.interval13.PWMInterval13" >
<property name="inputComponent" value="ref"/>
<property name="multipleFeatures" value="true"/>
</bean>
</list>
</property>
</bean>
</beans>
The final, and most important, object is the model itself. This object must implement the ModelManager interface, and in this case we are using the Interval13Model. In this model the states and transitions are hardcoded into the model class, but the list of feature functions can be customized. For the single species model covered here, there are 5 different feature classes. Each feature class can implement multiple individual feature functions, and so the 5 classes here generate the 22 different feature functions in the single species model. Each feature class must implement one of the FeatureManager interfaces. In this case we are using the following features: StateTransitionsInterval13, StateLengthLogprobInterval13, GeneConstraintsInterval13, ReferenceBasePredictorInterval13, and PWMInterval13.
Each feature has its own configuration. In this case we can see that each of the features is configured to use an inputComponent called ref. Looking back at the inputHandler section of the config file, we can see that this component name was assigned to the sequence of the fasta input file. Therefore, all of these features will get the genomic DNA sequence as their input observations. This mechanism allows different features to be configured to use different parts of the input data. In addition, some of the features have additional configuration. In this case, several features have the multipleFeatures property. Each of these features has the ability to generate either a single feature function (which then gets assigned a single weight) or multiple feature functions (and therefore multiple weights). Since our testing showed that models with multiple weights had superior performance, we use that option in the default configuration file.
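For example, one of these feature classes could be collapsed to a single feature function with one shared weight by changing its multipleFeatures property (that false means "single feature function" is inferred from the property name; the javadoc should confirm it):

```xml
<!-- Configuring ReferenceBasePredictorInterval13 to contribute a single
     feature function (and thus a single weight) instead of multiple. -->
<bean class="calhoun.analysis.crf.features.interval13.ReferenceBasePredictorInterval13">
  <property name="inputComponent" value="ref"/>
  <property name="multipleFeatures" value="false"/>
</bean>
```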
The final line of the file, "</beans>", signals the end of the file and is required for the XML document to be well formed.
Any aspect of the configuration can be changed by editing the XML model file. The new model can then be trained and tested. When Conrad processes the XML file, it creates all of the necessary objects. For the default models that ship with Conrad, all of the required Java classes are contained within conrad.jar. You can also use classes in the model file that are not in conrad.jar. If you do this, you must place the additional classes on the Java classpath so that they can be found when Conrad starts up. You can do this with the -cp option to Java. This is the easiest way to add your own classes to the configuration.
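For instance, a bean definition could name a class of your own, provided that class implements the appropriate Conrad interface and is on the classpath. The class name below is hypothetical and stands in for any user-supplied implementation:

```xml
<!-- Hypothetical example: a user-supplied input handler. The class
     com.example.MyInputHandler is not part of Conrad; it stands in for
     any class on the classpath that implements InputHandler. -->
<bean id="inputHandler" class="com.example.MyInputHandler"/>
```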
The Spring configuration format that Conrad uses supports many other advanced features. Take a look at the Spring documentation for full details.