flink 1.0 documentation

2. Linearizing a binary classifier with flink.linearize

This package provides an implementation of a linearization system for binary classifiers.

The implementation is a straightforward combination of all the activities defined in flink.activities.

With respect to the description of the linearization process in Main activities of the linearization process, the kernel space learning activity (KSL) has been separated from the rest. This choice makes it easier to run several experiments: the models are learned in the kernel space only once and can then be reused while experimenting with different mining and linear space learning configurations.

As a consequence, linearizing a model is a two-step activity:

  • the executable module flink.linearize.ksl is used to learn TK models
  • the executable module flink.linearize.binary is pointed to the output directory of flink.linearize.ksl and used to actually linearize the models in there.

A typical usage pattern is the following:

python3 -m flink.linearize.ksl -o <ksl_dir> [other options] <training_file>
python3 -m flink.linearize.binary -k <ksl_dir> [other options] <training_file> <test_file>

where <ksl_dir> and <training_file> have the same value in both commands.
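The two-step pattern can also be driven from a script. The following is a minimal sketch, not part of FLinK: the function name `linearize`, its parameters, and the use of `sys.executable` (which assumes that flink is importable by the current interpreter) are illustrative assumptions. It only shows how the same <ksl_dir> and <training_file> values are shared between the two invocations.

```python
import subprocess
import sys


def linearize(ksl_dir, training, test, ksl_opts=(), bin_opts=(),
              run=subprocess.run):
    """Run KSL, then linearization, reusing ksl_dir and training.

    Hypothetical helper: `run` is injectable so the command lines can be
    inspected without actually launching the modules.
    """
    ksl_cmd = [sys.executable, "-m", "flink.linearize.ksl",
               "-o", ksl_dir, *ksl_opts, training]
    bin_cmd = [sys.executable, "-m", "flink.linearize.binary",
               "-k", ksl_dir, *bin_opts, training, test]
    for cmd in (ksl_cmd, bin_cmd):
        # check=True aborts the pipeline if a step exits with an error.
        run(cmd, check=True)
    return ksl_cmd, bin_cmd
```

Any other option of the two modules can be passed through `ksl_opts` and `bin_opts` as a sequence of strings.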

More details about command line options of the two modules are provided in the following sections.

2.1. Learning models in the tree kernel space: flink.linearize.ksl

This executable module implements the KSL stage of the linearization process.

Available command line options can be listed by invoking the module with the -h flag:

python3 -m flink.linearize.ksl -h

In the simplest usage scenario, we want to learn a TK model from a valid SVM-Light-TK training file <training>, using the kernel <kernel_type> (valid values are STK for the syntactic tree kernel and PTK for the partial tree kernel), and store the results in the output directory <outdir>.

We can do so by issuing the command:

python3 -m flink.linearize.ksl -o <outdir> -k <kernel_type> <training>

Learning a model in this way is the first step of the OPT and LIN architectures.

For the Split architecture, we can automatically partition the training file into <num_splits> chunks and learn as many models by adding the -s <num_splits> option:

python3 -m flink.linearize.ksl -o <outdir> -k <kernel_type> \
                               -s <num_splits> <training>

Note

Split learning is carried out by producing balanced splits of the training data file, i.e. each split will contain approximately the same positive/negative example ratio as the original training file.

Still, in some cases each split could contain too few positive examples for the learning algorithm to generalize properly, resulting in errors that can affect all the subsequent activities of the linearization process.

The responsibility of using a reasonable number of splits (with respect to the data set and task at hand) is left to the user of the module.
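To make the idea of a balanced split concrete, here is a minimal sketch of one way to partition examples while preserving the class ratio. It is not the code used by flink.linearize.ksl: the function name `balanced_splits` is hypothetical, and the sketch assumes that the first whitespace-separated token of each line is the class label.

```python
from collections import defaultdict


def balanced_splits(lines, num_splits):
    """Deal examples round-robin, class by class, into num_splits chunks.

    Illustrative only: assumes the label is the first token of each line.
    """
    by_label = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        label = line.split(None, 1)[0]
        by_label[label].append(line)

    splits = [[] for _ in range(num_splits)]
    for label in sorted(by_label):
        for i, line in enumerate(by_label[label]):
            splits[i % num_splits].append(line)
    return splits
```

Counting the positive examples in each returned chunk is a quick way to check whether a given number of splits leaves too few of them per model.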

For each split <x>, the program will produce the following output files:

  • <outdir>/<x>: the training data split

  • <outdir>/<x>-model<params>: the corresponding model file, where <params> is the concatenation of the parameters used to learn the model

  • <outdir>/<x>-model<params>.stdout: the standard output of SVM-Light-TK

  • <outdir>/<x>-model<params>.stderr: the standard error of SVM-Light-TK

In case of non-split learning, the only split is the actual training file, which is copied to the output directory.

If running on a multi-processor or multi-core architecture, it makes sense to use the -n <ncpu> option along with the -s <num_splits> option to use up to <ncpu> parallel processes for learning the models:

python3 -m flink.linearize.ksl -o <outdir> -k <kernel_type> \
                               -s <num_splits> -n <ncpu> <training>

In both cases (single model or split model learning), we can use the option -j <cost_factor> to set a cost factor other than 1:

python3 -m flink.linearize.ksl -o <outdir> \
                               -k <kernel_type> -j <cost_factor> \
                               -s <num_splits> -n <ncpu> <training>

Note

In the case of split learning, the same cost factor is used for all the models.

Lastly, the -c flag can be used to force flink.linearize.ksl to erase the contents of <outdir> before learning the new models.
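The options described in this section can be assembled programmatically. The sketch below is a hypothetical helper (the name `ksl_command` and its keyword arguments are assumptions, not part of FLinK) that builds an argument list using only the flags documented above: -o, -k, -j, -s, -n and -c.

```python
import sys


def ksl_command(outdir, kernel_type, training,
                num_splits=None, ncpu=None, cost_factor=None, clean=False):
    """Build the argv for flink.linearize.ksl (illustrative helper)."""
    if kernel_type not in ("STK", "PTK"):
        raise ValueError("kernel_type must be 'STK' or 'PTK'")
    cmd = [sys.executable, "-m", "flink.linearize.ksl",
           "-o", outdir, "-k", kernel_type]
    if cost_factor is not None:
        cmd += ["-j", str(cost_factor)]   # same cost factor for every split
    if num_splits is not None:
        cmd += ["-s", str(num_splits)]    # Split architecture
    if ncpu is not None:
        cmd += ["-n", str(ncpu)]          # parallel learning processes
    if clean:
        cmd.append("-c")                  # erase <outdir> before learning
    cmd.append(training)
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`.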

2.2. Binary STK and PTK model linearization: flink.linearize.binary

This module implements a greedy-miner based linearization system for PTK and STK models.

It assumes that the models have already been learned using flink.linearize.ksl and that they are located in a directory <ksl_dir>.

Depending on how the models were learned (single model vs. split model configuration), this module implements either the OPT or the Split linearization models described in Linearization architectures.

The LIN model, which like OPT requires non-split kernel space learning, is provided as a special case of the OPT model via a command line flag, as explained in LIN model evaluation.

Available command line options can be listed by invoking the module with the -h flag:

python3 -m flink.linearize.binary -h

The simplest command line for this module looks like:

python3 -m flink.linearize.binary -k <ksl_dir> -o <outdir> \
           <training> <test>

where:

  • <ksl_dir> is the output directory of flink.linearize.ksl
  • <outdir> is the selected output directory for the linearizer
  • <training> is the training data file
  • <test> is the test file to evaluate the model

Note

<training> must be the same file used for kernel space learning!

In this case, the mining algorithm will be invoked with all the default parameters, and learning in the linear space will be carried out with the default cost factor of 1.

This behavior can be customized by changing the value of the following command line options:

  • -t THRESHOLD and -f MIN_FREQUENCY: the values of the threshold factor and minimum fragment frequency parameters of the greedy miner algorithm flink.pytk.miners.greedy_miner().

    If THRESHOLD is set to 0, then the module automatically carries out a cross-fold evaluation to estimate an appropriate value for the parameter. The optimization is carried out via flink.optimize.optimize_threshold() using the training set as benchmark.

    The number of folds to be used can be controlled via the option -a (it defaults to 5).

  • -j COST_FACTOR: cost factor for linear space learning. If set to 0, then the module automatically carries out cost factor optimization via flink.optimize.optimize_cost_factor(), using the training set as a benchmark.

    The number of folds to be used can be controlled via the option -b (it defaults to 5).
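As an illustration of how these options combine, the following sketch builds a command line for flink.linearize.binary. The helper name `binary_command` and its keyword arguments are hypothetical; only the flags themselves (-k, -o, -t, -f, -j, -a, -b) come from the description above. Note that a value of 0 is passed through explicitly, since for -t and -j it requests automatic optimization.

```python
import sys


def binary_command(ksl_dir, outdir, training, test,
                   threshold=None, min_frequency=None, cost_factor=None,
                   threshold_folds=None, cost_folds=None):
    """Build the argv for flink.linearize.binary (illustrative helper)."""
    cmd = [sys.executable, "-m", "flink.linearize.binary",
           "-k", ksl_dir, "-o", outdir]
    options = [
        ("-t", threshold),        # 0 triggers threshold optimization
        ("-f", min_frequency),
        ("-j", cost_factor),      # 0 triggers cost factor optimization
        ("-a", threshold_folds),  # folds for threshold optimization
        ("-b", cost_folds),       # folds for cost factor optimization
    ]
    for flag, value in options:
        if value is not None:     # keep 0, drop only unset options
            cmd += [flag, str(value)]
    cmd += [training, test]
    return cmd
```

Options left as `None` are omitted, so the module falls back to its own defaults.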

The module writes several output files to the selected directory <outdir>.

Most notably, assuming that <bn_training> and <bn_test> are the base names (i.e. file names without directory information) of <training> and <test>, respectively:

  • the file <outdir>/dictionary.dict is the dictionary of the most relevant fragments mined from the models in <ksl_dir> for the given parameters of the mining algorithm;
  • the files <outdir>/<bn_training>.lin and <outdir>/<bn_test>.lin are the linearized training and test set, respectively;
  • the file <outdir>/<bn_training>.lin.model-<params> is the model learned in the linear space from the linearized training set;
  • the file <outdir>/<bn_training>.lin.model-<params>.best_frags contains a ranking of the most relevant fragments according to their weight as estimated by the learner in the linear space;
  • the file <outdir>/<bn_test>.lin.pred contains the classifier's predictions for the linearized test set;
  • the file <outdir>/<bn_test>.lin.pred.eval contains the evaluation of the predictions in <outdir>/<bn_test>.lin.pred.
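The naming scheme above can be captured in a small helper, for example to locate the results after a run. This is a sketch under stated assumptions: the function name `expected_outputs` and the dictionary keys are hypothetical, and `params` stands for the <params> suffix, whose exact format is not specified here and must be supplied by the caller.

```python
import os


def expected_outputs(outdir, training, test, params):
    """Map descriptive keys to the output paths listed above.

    Illustrative only: `params` is a placeholder for the <params> suffix.
    """
    bn_training = os.path.basename(training)  # file name without directory
    bn_test = os.path.basename(test)
    model = os.path.join(outdir, f"{bn_training}.lin.model-{params}")
    pred = os.path.join(outdir, f"{bn_test}.lin.pred")
    return {
        "dictionary": os.path.join(outdir, "dictionary.dict"),
        "training_lin": os.path.join(outdir, f"{bn_training}.lin"),
        "test_lin": os.path.join(outdir, f"{bn_test}.lin"),
        "model": model,
        "best_frags": model + ".best_frags",
        "predictions": pred,
        "evaluation": pred + ".eval",
    }
```

Checking each returned path with `os.path.exists` is a simple way to confirm that a run completed all of its sub-tasks.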

The module will also output to the terminal the results of the evaluation, plus several other bits of information including the time necessary to perform each sub-task.

2.2.1. LIN model evaluation

Assuming that tree kernel learning was not carried out in a split fashion, the module can also be used to evaluate the LIN model on the same data set.

To this end, add the -L flag to the command line of the module. In this case, it will also print to standard output the evaluation of the LIN model on the same benchmark, as well as the gradient norm of the linearized model.

The following files will also be present in the output directory:

  • <outdir>/<bn_training>.lin.model-<params>.tk2lin: the linearized model file;
  • <outdir>/<bn_test>.lin.pred_tk2lin: test predictions according to the linearized model, and
  • <outdir>/<bn_test>.lin.pred_tk2lin.eval: evaluation of the predictions obtained with the linearized model.

2.2.2. Tree kernel evaluation

For convenience, flink.linearize.binary can also evaluate the accuracy of the TK models learned during KSL. As in the previous case, this feature is only available if kernel space learning was carried out in a non-split fashion.

This can be accomplished by supplying the -T command line flag.

If the flag is present, the program will also output the results of classification using the TK model, as well as the gradient norm of the TK model and the number of fragments in the model.

Note

The number of fragments in the model is a very loose lower bound on the real number of fragments it encodes, as it is calculated as the number of fragments of the largest tree in the model. The real number of fragments in the model should generally be of the same order of magnitude.

The output directory will also contain the file <outdir>/<bn_test>.pred with the predictions generated according to the TK model.

© Copyright 2009-2011, Daniele Pighin. Created using Sphinx 1.1pre.