Feature functions are the core of a CRF. The Conrad CRF engine defines several interfaces that users implement to create feature functions, and it contains algorithms that use those features to perform training and inference.
All feature functions in Conrad are defined by a FeatureManager class. A FeatureManager can have zero or more features associated with it. Having zero features is useful for a FeatureManager that only implements constraints. Having multiple features associated with a single manager is useful when several features share code or data used for evaluation. An example is the StateLengthLogProb feature manager, where the length distribution for each individual state is a separate function, but all of the functions are evaluated using one mathematical model. Each feature is assigned a numeric id, which indicates which weight in the weight vector is associated with that feature. All of the features associated with a single manager have contiguous ids.
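To illustrate the id scheme, here is a minimal sketch of a manager that owns one length feature per state and claims a contiguous block of ids. The class and method names here are hypothetical illustrations, not Conrad's actual API.

```java
// Hypothetical sketch of contiguous feature-id assignment; the names
// StateLengthSketch, assignIds, and idForState are illustrative only.
class StateLengthSketch {
    private int startIx;          // id of this manager's first feature
    private final int numStates;  // one length-distribution feature per state

    StateLengthSketch(int numStates) {
        this.numStates = numStates;
    }

    // The engine hands the manager a starting id; the manager claims a
    // contiguous block and returns the next free id.
    int assignIds(int startIx) {
        this.startIx = startIx;
        return startIx + numStates;
    }

    // The feature id (and hence weight-vector index) for a given state.
    int idForState(int state) {
        return startIx + state;
    }
}
```

Because the ids are contiguous, the engine can map any of the manager's features to its weight with a single offset.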
There is extensive documentation about the specifics of implementing a FeatureManager in the javadoc.
In the mathematical definition there is only a single type of feature, but a variety of performance and technical constraints make it desirable to have several types of features for different circumstances. Conrad explicitly divides features along two dimensions. The first dimension is whether or not the feature function uses the previous state to determine its value. Features that use only the current state are called "node" features; features that use both the current and previous states are called "edge" features. Node features are evaluated once for each possible state, while edge features are evaluated once for each possible transition, which makes node features more efficient than edge features.
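The efficiency difference can be made concrete with a rough evaluation count. Assuming a fully connected state graph (an assumption for illustration; real models prune transitions), per position a node feature is evaluated once per state while an edge feature is evaluated once per ordered state pair:

```java
// Rough per-feature evaluation counts, assuming every state can follow
// every other state. These helper names are illustrative, not Conrad's API.
class EvalCounts {
    // One evaluation per (position, state) pair.
    static long nodeEvals(int numStates, int seqLen) {
        return (long) numStates * seqLen;
    }

    // One evaluation per (position, previousState, currentState) triple.
    static long edgeEvals(int numStates, int seqLen) {
        return (long) numStates * numStates * seqLen;
    }
}
```

So with 13 states, an edge feature costs roughly 13 times as many evaluations as a node feature over the same sequence.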
The second dimension along which we categorize features is whether or not they use the length of the segment. Features in a semi-Markov CRF are defined over segments, and we call these "explicit length" features. Features in a full Markov CRF are defined at a single position, and we call these regular features. It is possible to mix regular and explicit-length features in a single model: Conrad converts Markov features into semi-Markov features by summing the value returned at each position in the segment.
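The summing conversion can be sketched in a few lines. The functional interface and method below are hypothetical stand-ins for Conrad's internal machinery, shown only to make the lifting explicit:

```java
// Sketch: lifting a position-wise (Markov) feature to an explicit-length
// (semi-Markov) feature by summing over the segment's positions.
// FeatureAtPosition and segmentValue are illustrative names, not Conrad's API.
class SegmentSum {
    interface FeatureAtPosition {
        double eval(int pos, int state);
    }

    // Induced segment-feature value for `state` on [start, end] inclusive.
    static double segmentValue(FeatureAtPosition f, int state, int start, int end) {
        double sum = 0.0;
        for (int pos = start; pos <= end; pos++) {
            sum += f.eval(pos, state);
        }
        return sum;
    }
}
```

For example, a position-wise feature that always returns 1 lifts to a segment feature equal to the segment length.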
When implementing a FeatureManager, you must implement one of the four derived interfaces corresponding to the particular type of feature: FeatureManagerNode, FeatureManagerEdge, FeatureManagerNodeExplicitLength, or FeatureManagerEdgeExplicitLength. There is also a specialized feature type called FeatureManagerNodeBoundaries, which handles the case of a regular Markov feature that uses an alternative strategy to evaluate its value in a semi-Markov environment.
Conrad is built to cache all feature evaluations so that the feature functions are not called repeatedly during training. This cuts training time and makes it tractable to train on large data sets. It does so at the expense of memory, since all of the feature values must be stored in memory. Conrad uses several cache strategies to handle the different types of features, and choosing the correct cache strategy is important for achieving maximal performance. For more details on choosing the cache strategy for a feature, see the javadoc for CacheProcessorDeluxe.
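The basic time-for-memory trade can be sketched as a dense cache of node-feature values, computed once and then read by every training iteration. This is an illustration of the idea only; it is not CacheProcessorDeluxe's actual layout or API:

```java
// Sketch of a dense node-feature cache: one double per (position, state)
// pair, filled once so training iterations read memory instead of
// re-invoking the feature function. Illustrative only, not Conrad's API.
class DenseNodeCache {
    private final double[] vals;  // indexed by pos * numStates + state
    private final int numStates;

    DenseNodeCache(int seqLen, int numStates) {
        this.numStates = numStates;
        this.vals = new double[seqLen * numStates];
    }

    void put(int pos, int state, double v) {
        vals[pos * numStates + state] = v;
    }

    double get(int pos, int state) {
        return vals[pos * numStates + state];
    }
}
```

Even this simple layout shows why cache choice matters: memory grows with sequence length times the number of states (and, for edge features, the number of transitions), so denser feature types need different strategies.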
In Conrad, FeatureManagers can be used to implement constraints as well as to evaluate feature values. Constraints restrict the set of labelings searched during training and testing, and can be used to eliminate illegal or nonsensical labelings or to speed performance. The constraints are defined using the same categorization as the features.
Constraints can be very important for performance. For gene calling, the node and edge constraints we use to enforce legal gene structures are absolutely necessary in order to achieve reasonable performance. Constraints are implemented simply by calling invalidate on the FeatureList passed to a feature evaluation.
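A pure-constraint evaluation might look like the following sketch. The FeatureList here is a stand-in with an invalidate method as described above; the surrounding types, method signature, and state numbering are assumptions for illustration, not Conrad's exact interfaces:

```java
// Sketch of a constraint-only node evaluation: it contributes no feature
// values, it only vetoes illegal states. Types and signatures are
// hypothetical stand-ins for Conrad's FeatureList and node interface.
class ConstraintSketch {
    static class FeatureList {
        private boolean valid = true;
        void invalidate() { valid = false; }  // rules this labeling out
        boolean isValid() { return valid; }
    }

    // Example rule: forbid a hypothetical intron state (id 2) at the
    // first position of the sequence.
    static void evaluateNode(int pos, int state, FeatureList result) {
        if (pos == 0 && state == 2) {
            result.invalidate();
        }
        // A manager with zero features adds nothing else here.
    }
}
```

Once invalidated, that state (or transition) is dropped from the search, which is how constraints both enforce legality and prune the dynamic program.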