Functions in this package implement activities of the linearization process.
Activities are the building blocks that can be combined to build sophisticated linearization systems.
Examples of complete linearization systems can be found in flink.linearize.
Note
Understanding linearization activities
The description of the linearization activities requires some basic understanding of how the process works and of its variants.
For more information about these topics, please refer to our papers:
Note
Parallel activities
All the activities that can be parallelized implicitly implement parallelization. The parameter controlling the number of CPUs used is consistently named ncpu.
A wrapper around flink.pytk.learn() to learn models with tree or linear kernels.
outdir is a string representing the path to an output directory, and train is a string representing the path of the datafile to be used to train the model.
If num_splits is greater than one, then the training data file train is first partitioned into num_splits parts. The splits are created so as to preserve the positive to negative ratio of the original file. Then, each split (or the original input file train, if num_splits is 1) is used to train a model.
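The ratio-preserving partitioning described above can be sketched as follows. This is only an illustration, not the package's code: the helper name stratified_split is hypothetical, and the sketch assumes that each line of the data file starts with a numeric label whose sign gives the class.

```python
def stratified_split(lines, num_splits):
    """Partition labelled lines into num_splits parts, keeping the
    positive-to-negative ratio of each part close to the original."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    # Assumption: the first whitespace-separated field is a numeric label.
    positives = [ln for ln in lines if float(ln.split()[0]) > 0]
    negatives = [ln for ln in lines if float(ln.split()[0]) <= 0]
    splits = [[] for _ in range(num_splits)]
    # Deal each class round-robin so every split gets its share of both.
    for i, ln in enumerate(positives):
        splits[i % num_splits].append(ln)
    for i, ln in enumerate(negatives):
        splits[i % num_splits].append(ln)
    return splits


data = ["+1 a", "-1 b", "-1 c", "+1 d", "-1 e", "-1 f"]
parts = stratified_split(data, 2)
# Each part holds one positive and two negative examples.
```

Dealing each class separately is what keeps the class balance: a plain contiguous split could leave one part with no positive examples at all.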
learn_options is a list of keyword-arguments (most importantly, kernel_type and cost_factor) which are forwarded to flink.pytk.get_learn_parameters() to set learning options. See flink.pytk.get_learn_parameters() for the list of available options.
For each split <s>, the following files are created in the outdir directory:
By default, the function returns the path of the newly learned model. If num_splits is greater than 1, then the function returns the list of the paths of the models learned.
If get_etime is True, then the function returns a pair in which:
So, for example, if num_splits equals 2 and get_etime is True, the function would return a composite result in the form:
([path_to_first_model, path_to_second_model], elapsed_time)
Classify the file datafile using the model model.
If ncpu is greater than 1, then datafile is split in ncpu chunks which are classified in parallel. The predictions are then recombined in the proper order.
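The split, classify and recombine scheme can be illustrated with the sketch below. It is not the package's implementation: the names split_in_chunks, classify_parallel and toy_classifier are hypothetical, and a thread pool is used here only for brevity, since the text does not say which parallelization mechanism the activity relies on.

```python
from concurrent.futures import ThreadPoolExecutor


def split_in_chunks(items, ncpu):
    """Split items into at most ncpu contiguous chunks of near-equal size."""
    if ncpu < 1:
        raise ValueError("ncpu must be at least 1")
    size, extra = divmod(len(items), ncpu)
    chunks, start = [], 0
    for i in range(ncpu):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            chunks.append(items[start:end])
        start = end
    return chunks


def classify_parallel(examples, classify_chunk, ncpu=1):
    """Classify chunks concurrently and join predictions in input order."""
    chunks = split_in_chunks(examples, ncpu)
    with ThreadPoolExecutor(max_workers=ncpu) as pool:
        # map() yields results in the order of the submitted chunks,
        # so concatenating them restores the original example order.
        results = pool.map(classify_chunk, chunks)
        return [pred for chunk_preds in results for pred in chunk_preds]


def toy_classifier(chunk):
    # Hypothetical stand-in for a real classifier: sign of the number.
    return [1 if x >= 0 else -1 for x in chunk]


predictions = classify_parallel([3, -2, 0, -7, 5], toy_classifier, ncpu=2)
# predictions == [1, -1, 1, -1, 1]
```

Keeping the chunks contiguous is what makes the recombination trivial: predictions only need to be concatenated chunk by chunk.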
If predictions is not None, then classifier decisions are stored at that path. Otherwise, if outdir is not None, then classifier decisions are stored in:
join(outdir, basename(infile) + ".pred")
Finally, if both predictions and outdir are None, then classifier decisions are saved in:
infile + ".pred"
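The three rules above amount to a simple precedence order, which can be sketched as follows. The helper name resolve_predictions_path is hypothetical and is used here only to make the order of the checks explicit.

```python
from os.path import basename, join


def resolve_predictions_path(infile, predictions=None, outdir=None):
    """Return the path where classifier decisions would be stored."""
    if predictions is not None:
        # An explicit path always wins.
        return predictions
    if outdir is not None:
        # Otherwise the file goes into outdir, named after the input file.
        return join(outdir, basename(infile) + ".pred")
    # Fall back to a file next to the input file.
    return infile + ".pred"


explicit = resolve_predictions_path("data/test.dat", predictions="my.pred")
in_outdir = resolve_predictions_path("data/test.dat", outdir="out")
default = resolve_predictions_path("data/test.dat")
```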
The function returns the path to the output predictions file. If get_etime is True, then it returns a pair whose first element is the path to the prediction file, and the second is elapsed classification time.
Mine relevant fragments from one or more models with the greedy mining algorithm.
If models is the path to a model, then flink.pytk.miners.greedy_miner() is used to mine the most relevant fragments and store them in a dictionary. After mining, the dictionary can be used to project tree-like data onto a linear space where only the most relevant fragments are accounted for.
The output dictionary will be created in:
out_dict_path = join(outdir, "dictionary.dict")
If models is a list of paths, then the same process is applied to each model in the list, and the dictionaries obtained from each model are combined into a unique dictionary.
threshold and min_frequency are parameters of the mining algorithm. threshold is used to calculate the minimum relevance of a fragment to be included in the dictionary, according to the formula:
min_relevance = max_base_frag_relevance / threshold
where max_base_frag_relevance is the relevance of the heaviest of the base fragments encoded in a model. For the definition of what a base fragment is and for more details on the mining algorithm, please refer to this paper.
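As a worked example of the formula, the sketch below computes the cutoff and applies it to a toy set of fragments. The function names and the toy relevance values are hypothetical, and the use of a greater-or-equal comparison is an assumption: the text only states how the minimum relevance is computed.

```python
def min_relevance(max_base_frag_relevance, threshold):
    """Minimum relevance a fragment needs to enter the dictionary."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return max_base_frag_relevance / threshold


def select_fragments(relevance_by_fragment, max_base_frag_relevance, threshold):
    """Keep the fragments whose relevance reaches the cutoff (assumed >=)."""
    cutoff = min_relevance(max_base_frag_relevance, threshold)
    return {f: r for f, r in relevance_by_fragment.items() if r >= cutoff}


relevances = {"(NP (DT)(NN))": 8.0, "(VP (VB)(NP))": 3.0, "(S (NP)(VP))": 0.5}
kept = select_fragments(relevances, max_base_frag_relevance=8.0, threshold=4)
# The cutoff is 8.0 / 4 = 2.0, so the fragment with relevance 0.5 is dropped.
```

Since threshold is a divisor, larger values lower the cutoff and therefore admit more fragments into the dictionary.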
ncpu is the maximum number of processing units to employ (though the actual maximum degree of parallelization is bounded by the cardinality of models).
If get_etime is True, the function returns a pair (dict, elapsed), where the first element is the path to the output dictionary, and the second element is the elapsed CPU time.
Linearize a TK file, i.e. project the examples in the file onto a lower dimensional space where only relevant fragments are accounted for.
The relevant fragments are those stored in the dictionary represented by the path dictionary. Dictionaries are created by the mining algorithms in flink.pytk.miners or by the activity flink.activities.mine, which is a wrapper around flink.pytk.miners.greedy_miner().
outdir is the output directory of the activity, and infile is the tree-encoded file to be linearized.
ncpu is the parallelization degree of the activity. If it is larger than 1, then infile is split into ncpu chunks, and each chunk is linearized in parallel.
Linearization works by looking at the relevant fragments stored in the dictionary, and building vectors in which each component measures the contribution of one of the relevant fragments.
By default, the value of a component is the number of occurrences of the corresponding fragment. This is the kind of linearization that is used in what we call the OPT model (see CoNLL 2010 paper above).
If use_gradient_components is set to True, then the value of each component is set to the actual relevance of the fragment instead. This kind of linearization is the one employed for the LIN model described in the same paper. In this configuration, by default the function assumes that the input file infile is actually a TK model, and not just any data file. To linearize a data file instead, the user should also provide an instance of flink.pytk.KernelParams as the optional kernel_params argument, to inform the function about the kernel parameters to be used in estimating fragment relevance. This is the configuration used by tk2lin().
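The default, count-based behaviour can be illustrated with a small sketch. It assumes that the fragments of an example have already been enumerated as strings and that the dictionary is an in-memory mapping from fragment to feature index; both are simplifications made for the example, and linearize_example is a hypothetical name.

```python
from collections import Counter


def linearize_example(fragments, dictionary):
    """Map the fragments of one example to a sparse count vector.

    fragments: iterable of fragment strings found in the example.
    dictionary: maps each relevant fragment to its feature index.
    Returns {feature_index: count}, ignoring non-relevant fragments.
    """
    counts = Counter(f for f in fragments if f in dictionary)
    return {dictionary[f]: n for f, n in counts.items()}


dictionary = {"(NP (DT)(NN))": 1, "(VP (VB)(NP))": 2}
example = ["(NP (DT)(NN))", "(S (NP)(VP))", "(NP (DT)(NN))", "(VP (VB)(NP))"]
vector = linearize_example(example, dictionary)
# vector == {1: 2, 2: 1}; "(S (NP)(VP))" is not in the dictionary and is ignored.
```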
The function produces two output files:
the linearization of infile, in:
join(outdir, basename(infile) + ".lin")
a mapping file in:
join(outdir, basename(infile) + ".nid_map")
The linearized file can be used for training or classification in the lower dimensional linear space. The mapping file is used by sort_features() to reverse engineer a linear model and explicitly list the most relevant fragments in the linear space.
By default, the function returns a pair consisting of the path to the linearized file and the path to the mapping file.
If get_etime is True, then the running time of the function is added as the last element of the result tuple.
Generate a linear model according to the LIN architecture, i.e. by linearizing the support vectors encoded in the TK model tk_model.
The function invokes linearize_file() with the appropriate arguments, and then alters the header of the new model so that svm_learn can recognize it as a linear model.
If outfile is not None, then the linearized model is stored at that path. Otherwise, if outdir is not None, then the model is saved in:
join(outdir, basename(tk_model) + ".tk2lin")
Finally, if both outfile and outdir are None, then the linearized model is saved in:
tk_model + ".tk2lin"
If get_etime is False, return the path to the linearized model. Otherwise, return a pair consisting of the path to the linearized model and the time elapsed in the linearization process.
Calculate the norm of the gradient described by the support vectors encoded in the linear model modelfile.
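A minimal sketch of this computation is given below. It assumes that the gradient is the usual weight vector of a linear SVM, i.e. the sum of the support vectors weighted by their coefficients, and that the model has already been parsed into (coefficient, sparse vector) pairs. The function name and the in-memory representation are hypothetical; parsing of the model file is not shown.

```python
import math
from collections import defaultdict


def gradient_norm(support_vectors):
    """Euclidean norm of w = sum_i coeff_i * x_i.

    support_vectors: list of (coefficient, {feature_index: value}) pairs,
    where coefficient is assumed to be alpha_i * y_i.
    """
    w = defaultdict(float)
    for coeff, features in support_vectors:
        for idx, value in features.items():
            w[idx] += coeff * value
    return math.sqrt(sum(v * v for v in w.values()))


svs = [(1.0, {1: 3.0}), (-1.0, {2: 4.0})]
norm = gradient_norm(svs)
# w has components 3.0 and -4.0, so the norm is 5.0.
```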
Counts TK fragments encoded in the model tk_modelfile.
The function returns a dictionary with the following keys:
min: minimum number of fragments encoded by a support vector
max: maximum number of fragments encoded by a support vector
support vectors.
Returns the gradient norm for the TK model tk_model.
Generate a ranking of the most relevant fragments (represented as trees) encoded in a linearized model.
dictionary is a fragment dictionary, and nid_map is the mapping generated while linearizing the training file. linearized_model is a linear model learned with the linearized training set.
Note
The nid_map file must be relative to the linearization of the file used to learn the model. Otherwise, linearized_model could include some features which are not present in nid_map.
Therefore, the general usage pattern for reverse engineering a model is as follows:
tk_model = learn(outdir, ..., kernel_type = <STK or PTK>)
dictionary = mine(outdir, tk_model, ...)
linearized_training, nid_map = linearize(outdir, ...)
lin_model = learn(outdir, linearized_training, ...)
relevant_fragments = sort_features(dictionary, lin_model, nid_map, ...)
If outfile is None, then the ranked list of most relevant fragments is stored in:
linearized_model + ".best_frags"
Otherwise, the list of fragments is stored in outfile.
The function returns the path to the output file.
Returns a pair in which the first element is the list of models found in the directory modelsdir, and the second element is the name of the kernel function used to learn the models.
The headers of the models are scanned to verify that they are learned using the same kernel function.
Evaluate the predictions in predictions against the oracle labels in oracle.
Each line of predictions is split on spaces, and the first field is used as prediction value. Similarly, the first field of oracle is used as gold label for the corresponding example.
If multiclass is False, then a binary classification problem is assumed. A decision is correct if the prediction and the label have the same sign. Otherwise, a multiclass classification problem is assumed. In this case, prediction and label are considered as strings, and a decision is correct if the two strings have the same value.
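The two correctness rules can be sketched as follows. The function names are hypothetical, and the sketch makes two assumptions that the text does not settle: fields are split on any whitespace rather than on single spaces, and a value of zero is treated as negative.

```python
def is_correct(prediction_line, oracle_line, multiclass=False):
    """Compare the first field of a prediction line and of an oracle line."""
    pred = prediction_line.split()[0]
    gold = oracle_line.split()[0]
    if multiclass:
        # Multiclass: labels are compared as strings.
        return pred == gold
    # Binary: only the sign matters (assumption: zero counts as negative).
    return (float(pred) > 0) == (float(gold) > 0)


def accuracy(prediction_lines, oracle_lines, multiclass=False):
    """Fraction of correct decisions over paired lines."""
    pairs = list(zip(prediction_lines, oracle_lines))
    hits = sum(is_correct(p, o, multiclass) for p, o in pairs)
    return hits / len(pairs) if pairs else 0.0


binary_acc = accuracy(["0.8", "-0.2 extra", "1.3"], ["+1 tree", "-1", "-1"])
multi_acc = accuracy(["LOC", "NUM"], ["LOC tree", "HUM"], multiclass=True)
```

In the binary call the first two decisions match in sign and the third does not, so two out of three are correct; in the multiclass call only the first label matches.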
The function returns a dictionary with the following keys:
If store is not None, then it must be a string describing the path of a file where to store the results of the evaluation.
This module implements facilities for the optimization of the cost factor for model learning (optimize_cost_factor()) and of the threshold parameter of the greedy mining algorithm (optimize_threshold()) for kernel space mining.
Warning
These activities can be extremely time consuming!
They carry out several learning/classification iterations, and can take a very long time for larger datasets.
Optimize the threshold parameter of the greedy mining algorithm (see flink.pytk.miners.greedy_miner()).
models is the set of models for which the miner must be optimized, and benchmark is the data file to be used as training/test benchmark (so it should be either the same training file used to learn the models models or an ad-hoc development set).
The models listed in models are mined for different values of the threshold parameter. For each threshold value, the resulting fragment dictionary is used to linearize the benchmark. The linearized benchmark is used to create num_folds folds and to cross-validate the accuracy of the resulting linear classifier. The best threshold value is selected so as to maximize the average F1 measure across the folds.
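The selection step at the end of this procedure can be sketched as follows, assuming that the per-fold F1 values for each threshold have already been collected. The function names and the toy figures are hypothetical; f1_score uses the standard definition of the F1 measure.

```python
def f1_score(tp, fp, fn):
    """Standard F1 measure from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(fold_f1_by_threshold):
    """Pick the threshold with the highest mean F1 across folds.

    fold_f1_by_threshold: {threshold: [F1 on fold 1, F1 on fold 2, ...]}.
    Ties are resolved in favour of the first threshold encountered.
    """
    def mean(values):
        return sum(values) / len(values)

    return max(fold_f1_by_threshold, key=lambda t: mean(fold_f1_by_threshold[t]))


# Toy per-fold results for three threshold values and two folds.
results = {2.5: [0.5, 0.75], 5: [0.75, 0.75], 10: [0.5, 0.5]}
best = best_threshold(results)
# Mean F1 is 0.625, 0.75 and 0.5 respectively, so best == 5.
```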
The arguments lowerbound and upperbound can be used to change the range of threshold values to be considered. The algorithm used to generate the set of threshold values is the following:
oom = range(int(log10(lowerbound))-1, int(log10(upperbound))+1)
result = []
for order in oom:
    for factor in [2.5, 5, 7.5, 10]:
        value = (10**order) * factor
        if value > upperbound:
            return result
        if value >= lowerbound:
            result.append(value)
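To see which values this procedure produces, the snippet can be wrapped in a function and run. The function name threshold_values is ours, and the final return statement is an addition that covers the case in which the loop ends without exceeding upperbound.

```python
from math import log10


def threshold_values(lowerbound, upperbound):
    """Candidate thresholds: 2.5, 5, 7.5 and 10 times each power of ten,
    restricted to the interval [lowerbound, upperbound]."""
    oom = range(int(log10(lowerbound)) - 1, int(log10(upperbound)) + 1)
    result = []
    for order in oom:
        for factor in [2.5, 5, 7.5, 10]:
            value = (10 ** order) * factor
            if value > upperbound:
                return result
            if value >= lowerbound:
                result.append(value)
    return result  # added for safety; not part of the original snippet


values = threshold_values(2, 50)
# values == [2.5, 5, 7.5, 10, 25.0, 50]
```

The candidates are therefore spaced evenly within each order of magnitude, which keeps the number of mining and cross-validation runs small even for wide ranges.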
ncpu is the number of parallel processes that will be activated.
outdir is the directory where the function produces its output. For each threshold value, the function creates a distinct sub-directory in outdir where the relative files are stored.
The function returns a dictionary with the following keys:
If get_etime is True, then the function returns a pair in which the first element is the aforementioned dictionary, and the second is the time required to perform the optimization.
Perform cross-fold evaluation on the datafile benchmark to optimize the svm_learn cost factor.
kernel_type is the kernel to use for learning the model (one of those allowed by flink.pytk.get_learn_parameters()), whereas num_folds is the number of folds for cross-fold evaluation.
outdir is the directory where output files are produced, and ncpu is the maximum number of parallel processes to employ.
The optimization is carried out by means of a simple hill-climbing algorithm. The initial value for the optimization can be set via the initial_point parameter.
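A generic hill-climbing loop of this kind is sketched below. The text does not describe how neighbouring cost factors are generated or when the search stops, so the multiplicative step, the stopping rule and the names hill_climb, step and max_iter are all assumptions made for illustration. In practice, score would be the cross-validated performance obtained with a given cost factor.

```python
def hill_climb(score, initial_point=1.0, step=2.0, max_iter=20):
    """Maximize score(x) over x > 0 by repeatedly moving to a better neighbour."""
    best, best_score = initial_point, score(initial_point)
    for _ in range(max_iter):
        # Assumption: neighbours are obtained by scaling the current point.
        candidates = [best * step, best / step]
        top_score, top = max((score(c), c) for c in candidates)
        if top_score <= best_score:
            break  # local optimum: no neighbour improves the score
        best, best_score = top, top_score
    return best, best_score


def toy_score(cost_factor):
    # Hypothetical stand-in for cross-validated F1, peaking at 8.
    return -(cost_factor - 8.0) ** 2


best_cost, best_value = hill_climb(toy_score, initial_point=1.0)
# The search moves 1 -> 2 -> 4 -> 8 and stops there.
```

Since every call to score stands for a full round of cross-validation, a real implementation would likely cache the scores of points already visited.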
The function returns a dictionary with three values indexed by the following keys: