package calhoun.analysis.crf;

import java.io.Serializable;
import java.util.List;

import calhoun.analysis.crf.io.TrainingSequence;

/** Evaluates CRF feature functions on the model. This is the base interface for all feature managers defined within the CRF.<p>
 * <h2>Feature managers versus features</h2>
 * Each feature manager can manage zero, one, or more individual features in the CRF. This distinction between feature managers and
 * individual features allows a single <code>FeatureManager</code> object to provide evaluations for many related features. For natural language
 * applications, a single feature manager might hold thousands of features, each corresponding to the presence of a single word in its dictionary.
 * In the special case of a feature manager that contains only constraints and never has non-zero evaluations, the feature manager may contain zero features.
 * <h2>Feature manager lifecycle</h2>
 * There are three basic stages to the <code>FeatureManager</code> lifecycle.
 * <ul>
 * <li><b>Configuration</b> - Instantiating the feature manager from XML, setting the input component and other options.
 * <li><b>Training</b> - Parameterizing the features on a set of training data.
 * <li><b>Evaluation</b> - Evaluating the features during training or inference.
 * </ul>
 * <h3>Configuration</h3>
 * <code>FeatureManager</code>s are normally specified as part of a model in an XML model file and are instantiated and configured
 * by the engine when the model file is read. It is possible to add arbitrary configuration properties to any <code>FeatureManager</code>
 * by adding public get and set methods for the properties. For more detail on this, see <a href="http://www.springframework.org/documentation">the Spring documentation</a>.
 * <p>
 * The one piece of configuration common to all <code>FeatureManager</code>s is the <code>inputComponent</code> property. This property is
 * used when the input data contains multiple different sources of input, and the <code>FeatureManager</code> computes its evaluations off of
 * only one of them. In this case, each input source is given a different component name. The <code>inputComponent</code> property of the
 * <code>FeatureManager</code> is then set to the name of the input source it uses for evaluation (usually this is done within the XML configuration).
 * Once this configuration is complete, the Conrad engine will ensure that the feature manager is handed only the correct component during evaluation.
 * This enables modular feature managers. A feature manager can be written to
 * assume a very specific set of input data. It can then be combined in a model with feature managers that expect different input data, and,
 * provided the input components are assigned correctly, each feature will behave as expected without any modifications.
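 * As a rough illustration of this contract (the classes below are simplified stand-ins for illustration only, not the real Conrad API),
 * a feature manager configured with an input component sees only that component's data:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a composite input: maps component names to data.
class CompositeInputSketch {
    private final Map<String, String> components = new HashMap<>();

    void put(String name, String data) { components.put(name, data); }
    String get(String name) { return components.get(name); }
}

// Simplified stand-in for a component-aware feature manager.
class ComponentAwareSketch {
    private String inputComponent;

    void setInputComponent(String name) { this.inputComponent = name; }

    // The engine hands the manager only its configured component.
    String resolveInput(CompositeInputSketch input) {
        return input.get(inputComponent);
    }
}
```

 * Two such managers configured with different component names can then be combined in one model without either one
 * needing to know about the other's input.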
 * <h3>Training</h3>
 * <h4>Setting evaluation parameters during training</h4>
 * The training stage of the lifecycle allows the feature manager to examine a full set of training data in case any of its evaluations require parameters
 * that need to be learned. Although the feature manager is required to implement the <code>train</code> function, it need not actually parameterize its evaluations.
 * Note that training the features is different from training the feature weights, which involves an optimization over all features. As an example
 * of a feature manager which uses training, the DenseWeightedEdge feature examines the training data to compute transition log probabilities for each edge in
 * the model. Then during the evaluation phase, this log probability is the value returned for each edge.
 * <h4>Determining the number of features</h4>
 * During training, a feature manager must fix the number of features it's managing. For some feature managers this number may be fixed beforehand,
 * but for others it may not be determined until the training data is examined. An example is a feature manager that builds a dictionary from text and has one feature for each word in the input
 * data. The exact number of features is not known until the training data has been examined and the number of unique words counted. After the train
 * function has been called, it is legal for the engine to call {@link #getNumFeatures()}.
 * <p>
 * Additionally, during the call to <code>train</code> the engine must assign feature indices to each feature. Since many feature managers may be combined together
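 * To make the dictionary example concrete, here is a hypothetical sketch (simplified names, not the real Conrad API) in which
 * the feature count is only known after <code>train</code> has scanned the data:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical dictionary-style manager: one feature per unique word,
// so the feature count is unknown until the training data is scanned.
class DictionarySketch {
    private final Set<String> words = new LinkedHashSet<>();

    void train(List<String> sentences) {
        for (String sentence : sentences) {
            words.addAll(Arrays.asList(sentence.split("\\s+")));
        }
    }

    // Legal to call only after train(): the count is now fixed.
    int getNumFeatures() { return words.size(); }
}
```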
 * and each feature requires a unique index, each feature manager is given a range of numbers to use for indices. The range begins with the <code>startIx</code>
 * parameter that is passed into train and includes as many consecutive integers as the feature manager has features. Therefore, a feature manager with 3 features
 * that receives a <code>startIx</code> of 5 will assign feature ids 5, 6, and 7 to its features. The next feature manager will then be given 8 as its <code>startIx</code>.
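 * The index arithmetic above can be sketched as follows (a simplified illustration, not engine code):

```java
// Simplified sketch of the index hand-out described above: a manager with
// numFeatures features starting at startIx owns startIx .. startIx + numFeatures - 1.
class IndexRangeSketch {
    static int[] assignIndices(int startIx, int numFeatures) {
        int[] ids = new int[numFeatures];
        for (int i = 0; i < numFeatures; i++) {
            ids[i] = startIx + i;
        }
        return ids;
    }

    // The next manager's startIx picks up right after this manager's range.
    static int nextStartIx(int startIx, int numFeatures) {
        return startIx + numFeatures;
    }
}
```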
 * <h3>Evaluation</h3>
 * Once a feature manager is configured and trained it is available to the Conrad engine for evaluation. Each <code>FeatureManager</code> must implement one or
 * more of the <code>FeatureEvaluation</code> interfaces. Each of these interfaces defines an evaluate function. The most general evaluate function is a real-valued
 * function which is dependent on the entire input sequence, a start and end position in the hidden sequence, a hidden state, and a previous hidden state.
 * During training and inference, each feature must be evaluated for each state at every position. Therefore <code>evaluate</code> will be called many times
 * during each of these processes.
 * <p>
 * When evaluate is called, the <code>FeatureManager</code> is responsible for computing a value for each feature it manages. For each non-zero evaluation, the
 * feature manager must call {@link calhoun.analysis.crf.FeatureList#addFeature(int, double)} with the feature index and the evaluated value. Additionally, feature
 * managers can also declare a particular state, transition, or segment invalid. In that case, the engine will remove all paths containing that state, transition or segment
 * from consideration.
 * <h2>Other notes</h2>
 * <h3>Serialization</h3>
 * All features should be serializable. Conrad uses Java serialization to save its models, and so a feature must be serializable if it is to be
 * saved as part of a trained model. The advantage of using serialization is that all parameterization of models will be automatically stored with the model,
 * making it easy to build and incorporate new features into a model. As a side effect of this, features should not store unnecessarily large amounts of data, such
 * as a full copy of the training data, as this will be stored in the serialized model and read in again for inference. This will slow the reading of the model and
 * add to memory overhead during inference.
 * <h3>Performance</h3>
 * Conrad does extensive caching and attempts to make the minimum number of calls possible to the feature evaluate functions. Thus, during training iterations, the engine
 * is generally working with cached versions of the feature evaluations and the evaluate functions are not called. Therefore, extensive optimization of <code>FeatureManager</code>
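 * A minimal sketch of that evaluation contract (again with simplified stand-in types, not the real <code>FeatureList</code> API) might look like:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for FeatureList: collects non-zero evaluations.
class FeatureListSketch {
    final Map<Integer, Double> values = new LinkedHashMap<>();

    void addFeature(int index, double value) { values.put(index, value); }
}

// Hypothetical edge feature that fires only on the 0 -> 1 transition.
class EdgeFeatureSketch {
    private final int featureIndex;

    EdgeFeatureSketch(int featureIndex) { this.featureIndex = featureIndex; }

    void evaluate(int prevState, int state, FeatureListSketch result) {
        double value = (prevState == 0 && state == 1) ? 1.0 : 0.0;
        // Only non-zero evaluations are reported; zeros are simply omitted.
        if (value != 0.0) {
            result.addFeature(featureIndex, value);
        }
    }
}
```

 * Reporting only non-zero values keeps the evaluation callback sparse, which matters given how many times evaluate is called.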
 * code is not usually necessary to achieve adequate training performance. For inference, however, the bottleneck is usually the evaluation of the features themselves, since
 * inference evaluates the features only once and their cost cannot be amortized across iterations.
 * <h3>Choosing a derived class</h3>
 * There are several <code>FeatureEvaluation</code> interfaces to choose from. The various interfaces each take different subsets of the most general feature function
 * signature. It is important to choose the correct interface when implementing a feature manager. Choosing an interface with extra parameters can lead to
 * inefficient caching and slow performance. Choosing one with too few parameters can lead to incorrect results.
 * @param <InputType> the InputSequence type this FeatureManager expects.
 */
public interface FeatureManager<InputType> extends Serializable {

/** Caching strategy that the {@link calhoun.analysis.crf.solver.CacheProcessor} should use to cache values for this feature.
 * This is only a hint; the cache processor is not required to use this (or any) caching strategy. The base implementation defaults
 * to the UNSPECIFIED cache strategy.
 *
 * @return cache strategy specification appropriate for this feature.
 */
CacheStrategySpec getCacheStrategy();

/** Returns a human-readable name for the feature referenced by a given index. Used for display purposes only.
 * @param featureIndex the index of this feature
 * @return the human-readable name of this feature
 * */
String getFeatureName(int featureIndex);

/** Specifies the particular component of the input which this feature manager uses, if the input data is a composite input which has multiple components.
 * May be <code>null</code> if the input is not a composite or if the feature manager has access to all
 * of the input. See the <i>Conrad User's Guide</i> for <a href="golem:8080/display/Conrad/Specifying_Input_Data">how to set up input data</a>.
 * @return string name of the input component this feature should look at, or null if the feature has access to all inputs. */
String getInputComponent();

/** Sets which particular component of a CompositeInput this FeatureManager has access to.
 * @param name name of the component of the CompositeInput this FeatureManager has access to. Usually set automatically during configuration.
 */
void setInputComponent(String name);

/** Provides access to the entire training set so that FeatureManager can compute global properties and assign feature indices. This will be called before
 * any evaluations are requested. If the FeatureManager can have a variable number of features, this must be fixed within this method.
 *
 * @param startingIndex the feature index of the first feature owned by this FeatureManager. Each FeatureManager
 * must use consecutive indices, so the last index used will be startingIndex + numFeatures - 1.
 * @param modelInfo the model that contains this feature
 * @param data the full list of training sequences to use to train the feature
 * */
void train(int startingIndex, ModelManager modelInfo, List<? extends TrainingSequence<? extends InputType> > data);

/** Returns the number of features maintained by this <code>FeatureManager</code>. This number must be fixed after the call to <code>train</code> is complete.
 * @return number of features managed by this <code>FeatureManager</code>
 */
int getNumFeatures();
}