calhoun.analysis.crf
Interface FeatureManager<InputType>

Type Parameters:
InputType - the InputSequence type this FeatureManager expects.
All Superinterfaces:
java.io.Serializable
All Known Subinterfaces:
FeatureManagerEdge<InputType>, FeatureManagerEdgeExplicitLength<InputType>, FeatureManagerNode<InputType>, FeatureManagerNodeBoundaries<InputType>, FeatureManagerNodeExplicitLength<InputType>, ModelManager
All Known Implementing Classes:
AbstractFeatureManager, BeanModel, BlastInterval13, CodingStopFeature, CompositeFeatureManager, ConstrainedFeatureManager, ConstraintTest.FixedEdges, EmissionMarkovFeature, EndFeatures, ESTEdge, ESTExon, ESTInterval13, ESTInterval29, ESTIntron, FelsensteinFeatures, FootprintsInterval13, FootprintsInterval29, GapConjunctionFeatures, GapFeaturesInterval13, GapFeaturesInterval29, GeneConstraints, GeneConstraintsInterval13, GeneConstraintsInterval29, GeneConstraintsToy, IndicatorEdges, Interval13Model, Interval29Model, IntervalPresenceFeatures, IntronLengthFeature, KmerFeatures, MaxentMotifFeatures, PfamGenic, PfamPhase, PhylogeneticLogprobInterval13, PositionWeightMatrixFeatures, PWM_evolution, PWMInterval13, PWMInterval29, ReferenceBasePredictorInterval13, ReferenceBasePredictorInterval13Base, ReferenceBasePredictorInterval29, ReferenceBasePredictorInterval29Base, ReferenceBasePredictorNodeOnlyInterval13, ReferenceBasePredictorNodeOnlyInterval29, ReferenceBasePredictorZeroPadInterval13, ReferenceBasePredictorZeroPadInterval29, StartFeatures, StateLengthLogprobInterval13, StateLengthLogprobInterval29, StateTransitionsInterval13, StateTransitionsInterval29, TestFeatures.EmissionFeature, TestFeatures.ExplicitHalfExponentialLengthFeature, TestFeatures.ExponentialLengthFeature, TestFeatures.GaussianLengthFeature, TestFeatures.HalfExponentialLengthFeature, TestFeatures.TestFeature, WeightedEdges, WeightedStateChanges, ZeroOrderManager, ZeroOrderModel

public interface FeatureManager<InputType>
extends java.io.Serializable

Evaluates CRF feature functions on the model. This is the base interface for all feature managers defined within the CRF.

Feature managers versus features

Each feature manager can manage zero, one, or more individual features in the CRF. This distinction between feature managers and individual features allows a single FeatureManager object to provide evaluations for many related features. For natural language applications, a single feature manager might hold thousands of features, each corresponding to the presence of a single word in its dictionary. In the special case of a feature manager that contains only constraints and never has non-zero evaluations, the feature manager may contain zero features.
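The one-manager-to-many-features relationship can be sketched with a hypothetical word-dictionary manager. All names below are invented for the sketch and are not part of the Conrad API; the point is only that one object owns a whole block of feature indices, one per dictionary word.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one manager object backing one feature per dictionary word.
public class WordDictionarySketch {
    private final Map<String, Integer> wordToFeatureIndex = new LinkedHashMap<>();

    // During training, each unique word becomes one managed feature.
    public void buildDictionary(List<String> trainingWords, int startIx) {
        for (String w : trainingWords) {
            wordToFeatureIndex.putIfAbsent(w, startIx + wordToFeatureIndex.size());
        }
    }

    public int getNumFeatures() {
        return wordToFeatureIndex.size();
    }

    // Index of the single feature that fires for this word, or -1 if unknown.
    public int featureIndexFor(String word) {
        return wordToFeatureIndex.getOrDefault(word, -1);
    }
}
```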

Feature manager lifecycle

There are three basic stages to the FeatureManager lifecycle.

Configuration

FeatureManagers are normally specified as part of a model in an XML model file and are instantiated and configured by the engine when the model file is read. It is possible to add arbitrary configuration properties to any FeatureManager by adding public get and set methods for the properties. For more detail on this, see the Spring documentation.

The one piece of configuration common to all FeatureManagers is the inputComponent property. This property is used when the input data contains multiple different sources of input and the FeatureManager computes its evaluations from only one of them. In this case, each input source is given a different component name. The inputComponent property of the FeatureManager is then set to the name of the input source it uses for evaluation (usually within the XML configuration). Once this configuration is complete, the Conrad engine will ensure that the feature manager is handed only the correct component during evaluation. This enables modular feature managers: a feature manager can be written to assume a very specific set of input data and then combined in a model with feature managers that expect different input data; provided the input components are assigned correctly, each feature will behave as expected without any modifications.
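A model-file entry for such a feature manager might look like the following Spring-style bean definition. This is an illustrative fragment only: the bean id, class name, and component name are invented here, not taken from a real Conrad model file.

```xml
<!-- Illustrative only: binds a hypothetical feature manager to one input component -->
<bean id="blastFeatures" class="calhoun.analysis.crf.features.SomeFeatureManager">
  <!-- this manager's evaluations are computed only from the component named "blast" -->
  <property name="inputComponent" value="blast"/>
</bean>
```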

Training

Setting evaluation parameters during training

The training stage of the lifecycle allows the feature manager to examine a full set of training data in case any of its evaluations require parameters that need to be learned. Although every feature manager is required to implement the train function, it is not required to parameterize its evaluations. Note that training the features is different from training the feature weights, which involves an optimization over all features. As an example of a feature manager that uses training, the DenseWeightedEdge feature examines the training data to compute transition log probabilities for each edge in the model. During the evaluation phase, this log probability is the value returned for each edge.
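The idea behind a feature like DenseWeightedEdge can be sketched in simplified, self-contained form. The code below is a stand-in, not the actual Conrad implementation: it counts state transitions in integer label sequences and converts them to smoothed log probabilities, which is the kind of parameter a train call would learn.

```java
import java.util.List;

// Simplified stand-in for training a transition log-probability feature.
public class TransitionTrainerSketch {
    // logProb[i][j] = log P(next state = j | current state = i)
    public static double[][] train(List<int[]> labelSequences, int numStates) {
        double[][] counts = new double[numStates][numStates];
        for (int[] seq : labelSequences) {
            for (int t = 1; t < seq.length; t++) {
                counts[seq[t - 1]][seq[t]] += 1.0;
            }
        }
        double[][] logProb = new double[numStates][numStates];
        for (int i = 0; i < numStates; i++) {
            double total = 0.0;
            for (int j = 0; j < numStates; j++) total += counts[i][j];
            for (int j = 0; j < numStates; j++) {
                // Add-one smoothing so unseen transitions stay finite.
                logProb[i][j] = Math.log((counts[i][j] + 1.0) / (total + numStates));
            }
        }
        return logProb;
    }
}
```

At evaluation time, the learned logProb entry for an edge would be the value the feature returns for that edge.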

Determining the number of features

During training, a feature manager must fix the number of features it manages. For some feature managers this number may be fixed beforehand, but for others it may not be determined until the training data is examined. An example is a feature manager that builds a dictionary from text and has one feature for each word in the input data; the exact number of features is not known until the training data has been examined and the number of unique words determined. After the train function has been called, it is legal for the engine to call getNumFeatures().

Additionally, during the call to train the engine must assign feature indices to each feature. Since many feature managers may be combined together and each feature requires a unique index, each feature manager is given a range of numbers to use for indices. The range begins with the startIx parameter passed into train and includes as many consecutive integers as the feature manager has features. Therefore, a feature manager with 3 features that receives a startIx of 5 will assign feature indices 5, 6, and 7 to its features. The next feature manager will then be given 8 as its startIx.

Evaluation

Once a feature manager is configured and trained, it is available to the Conrad engine for evaluation. Each FeatureManager must implement one or more of the FeatureEvaluation interfaces, each of which defines an evaluate function. The most general evaluate function is a real-valued function that depends on the entire input sequence, a start and end position in the hidden sequence, a hidden state, and a previous hidden state. During training and inference, each feature must be evaluated for each state at every position, so evaluate will be called many times during each of these processes.

When evaluate is called, the FeatureManager is responsible for computing a value for each feature it manages. For each non-zero evaluation, the feature manager must call FeatureList.addFeature(int, double) with the feature index and the evaluated value. Additionally, feature managers can also declare a particular state, transition, or segment invalid. In that case, the engine will remove all paths containing that state, transition or segment from consideration.
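The evaluation contract can be sketched with a stand-in for the FeatureList callback. Only the FeatureList.addFeature(int, double) signature is taken from the real API; the interface definition, the toy feature, and all other names below are invented for the sketch.

```java
public class EvaluateSketch {
    // Minimal stand-in for the engine's FeatureList callback.
    public interface FeatureList {
        void addFeature(int featureIndex, double value);
    }

    // Toy node feature: fires with value 1.0 when the input symbol matches the state.
    public static void evaluate(int[] input, int pos, int state,
                                int myFeatureIndex, FeatureList result) {
        if (input[pos] == state) {
            // Only non-zero evaluations are reported; zeros are simply not added.
            result.addFeature(myFeatureIndex, 1.0);
        }
    }
}
```

Note the key convention: the manager never reports zero-valued evaluations, it just declines to call addFeature.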

Other notes

Serialization

All features should be serializable. Conrad uses Java serialization to save its models, and so a feature must be serializable if it is to be saved as part of a trained model. The advantage of using serialization is that all parameterization of models will be automatically stored with the model, making it easy to build and incorporate new features into a model. As a side effect of this, features should not store unnecessarily large amounts of data, such as a full copy of the training data, as this will be stored in the serialized model and read in again for inference. This will slow the reading of the model and add to memory overhead during inference.
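The save/load behavior relies only on standard Java serialization. A minimal round trip, using a toy serializable "feature" class rather than a real Conrad one, shows how a learned parameter survives being written out and read back:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Toy serializable "feature": its learned parameter survives a save/load round trip.
public class SerializationSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    double learnedWeight = 2.5;

    public static SerializationSketch roundTrip(SerializationSketch f) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(f); // analogous to saving a trained model
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SerializationSketch) in.readObject();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Any large field (such as a retained copy of the training data) would be written out the same way, which is exactly why the text advises against keeping such data in fields.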

Performance

Conrad does extensive caching and attempts to make the minimum possible number of calls to the feature evaluate functions. During training iterations, the engine generally works with cached versions of the feature evaluations, and the evaluate functions are not called. Extensive optimization of FeatureManager code is therefore not usually necessary to achieve adequate training performance. For inference, however, the bottleneck is usually the evaluation of the features: inference makes only a single pass, so the cost of evaluation is not amortized over repeated iterations as it is in training.

Choosing a derived class

There are several FeatureEvaluation interfaces to choose from, each taking a different subset of the most general feature function signature. It is important to choose the correct interface when implementing a feature manager. Choosing an interface with extra parameters can lead to inefficient caching and slow performance; choosing one with too few parameters can lead to incorrect results.


Method Summary
 CacheStrategySpec getCacheStrategy()
          Returns the caching strategy that the CacheProcessor should use to cache values for this feature.
 java.lang.String getFeatureName(int featureIndex)
          Returns a human identifiable name for the feature referenced by a given index.
 java.lang.String getInputComponent()
          Specifies the particular component of the input which this feature manager uses, if the input data is a composite input which has multiple components.
 int getNumFeatures()
          Returns the number of features maintained by this FeatureManager.
 void setInputComponent(java.lang.String name)
          Sets which particular component of a CompositeInput this FeatureManager has access to.
 void train(int startingIndex, ModelManager modelInfo, java.util.List<? extends TrainingSequence<? extends InputType>> data)
          Provides access to the entire training set so that FeatureManager can compute global properties and assign feature indices.
 

Method Detail

getCacheStrategy

CacheStrategySpec getCacheStrategy()
Returns the caching strategy that the CacheProcessor should use to cache values for this feature. This is only a hint; the cache processor is not required to use this (or any) caching strategy. The AbstractFeatureManager base class defaults to the UNSPECIFIED cache strategy.

Returns:
cache strategy specification appropriate for this feature.

getFeatureName

java.lang.String getFeatureName(int featureIndex)
Returns a human identifiable name for the feature referenced by a given index. Used for display purposes only.

Parameters:
featureIndex - the index of this feature
Returns:
the human readable name of this feature

getInputComponent

java.lang.String getInputComponent()
Specifies the particular component of the input which this feature manager uses, if the input data is a composite input which has multiple components. May be null if the input is not a composite or if the feature manager has access to all of the input. See the Conrad User's Guide for how to set up input data.

Returns:
string name of the input component this feature should look at, or null if the feature has access to all inputs.

setInputComponent

void setInputComponent(java.lang.String name)
Sets which particular component of a CompositeInput this FeatureManager has access to.

Parameters:
name - name of the component of the CompositeInput this FeatureManager has access to. Usually set automatically during configuration.

train

void train(int startingIndex,
           ModelManager modelInfo,
           java.util.List<? extends TrainingSequence<? extends InputType>> data)
Provides access to the entire training set so that the FeatureManager can compute global properties and assign feature indices. This will be called before any evaluations are requested. If the FeatureManager can have a variable number of features, this number must be fixed within this method.

Parameters:
startingIndex - the feature index of the first feature owned by this FeatureManager. Each FeatureManager must use consecutive indices, so the last index used will be startingIndex + numFeatures - 1.
modelInfo - the model that contains this feature
data - the full list of training sequences to use to train the feature

getNumFeatures

int getNumFeatures()
Returns the number of features maintained by this FeatureManager. This number must be fixed after the call to train is complete.

Returns:
number of features managed by this FeatureManager