Decision Forests

Overview

An implementation of the method popularised by the paper 'Random Forests' by L. Breiman which is a refinement, with a decent helping of rigour, of the papers 'Random Decision Forests' by T. K. Ho and 'Shape Quantization and Recognition with Randomized Trees' by Y. Amit and D. Geman. This particular implementation is based on the tutorial from ICCV 2011, which has a corresponding technical report: 'Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning' by A. Criminisi, J. Shotton and E. Konukoglu.

Ultimately this is intended to be a fully featured decision forest, but for now it only supports classification and a limited set of test generators. This early version nonetheless contains support for both continuous and discrete features, and incremental learning (Admittedly in a primitive form.). The implementation is pure python, using numpy (Future versions will include inline C++ based optimisations.), and is extremely modular, to allow for future expansion.

Usage typically consists of initially creating a DF object and configuring it by setting its goal (Only classification is currently available.), its generator (A generator provides tests with which to split the data set at a node.), and its Pruner, which defines when to stop growing each tree. Once set up, the learn method is called with a set of exemplars, which it uses to train the model. After the model has been learned the evaluate method can be applied to new exemplars, to get a probability distribution over class assignment (And optionally a hard assignment.). The included tests demonstrate usage; if you are interested in incremental learning then test_inc.py is the file to read.
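
The following is a minimal sketch of that workflow. The constructor arguments for Classification and the generator class name (AxisClassifyGen) are assumptions for illustration only - check goals.py and gen_classify.py for the real signatures.

```python
import numpy
from df import DF, MatrixES, Classification, AxisClassifyGen  # AxisClassifyGen is a hypothetical name

# Two channels: continuous features first, the integer answer channel last.
es = MatrixES()  # assumes a no-argument constructor
es.add(numpy.random.rand(100, 4).astype(numpy.float32))                 # channel 0: features
es.add(numpy.random.randint(0, 3, size=(100, 1)).astype(numpy.int32))  # channel 1: labels, kept 2D

forest = DF()
forest.setGoal(Classification(3, 1))  # hypothetical arguments: 3 classes, labels in channel 1
forest.setGen(AxisClassifyGen(0, 4))  # hypothetical: an optimising generator over channel 0
# The default pruner is a PruneCap with default parameters - see setPruner/getPruner.

forest.learn(8, es)  # grow 8 trees; assumes weightChannel and clamp default to None

novel = MatrixES()
novel.add(numpy.random.rand(10, 4).astype(numpy.float32))
answers = forest.evaluate(novel, which='best')  # one hard assignment per exemplar
```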

Particular attention must be paid to the generators - it is these that contain all of the really important parameters in the system, and choosing the right configuration will have a massive impact on performance. Consequently there are a lot of generators, built on different principles, including composite generators that combine other generators to give an effectively infinite selection. This means that you should be able to find a good choice for your data, but also that finding it might take a certain amount of effort - be prepared to try many combinations.

If you are reading readme.txt then you can generate the html documentation by running make_doc.py.

Contains the following files:

df.py - Primary python file - import this to get all the classes, particularly the DF class, around which all functionality revolves.

exemplars.py - Provides the wrapper around data. All data fed into the system must match the interface provided within, probably by being a bunch of data matrices wrapped by the MatrixES class. Uses a very specific vocabulary - each data point provided to the system is an exemplar, where all exemplars have a set of shared channels which each contain 1 or more features. Generators create tests for specific channels; channels also provide auxiliary information, such as the actual answer and weights for the exemplars during learning. Given a specific channel index, a specific feature index within that channel and a specific exemplar index you get back a single value from the structure. Whilst no such implementation is provided, the capability exists to inherit from the interface and implement one specific to the problem at hand, for instance by calculating values on the fly.

goals.py - Defines a generic interface for defining the goal of a decision tree. Currently provides the Classification implementation of this interface, for solving the classification problem.

generators.py - Provides the generator interface and the composite generators. The composite generators allow you to select tests from multiple generators - this is most often used to combine a generator for continuous tests and a generator for discrete tests when your data contains both data types.
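
For instance (MergeGen is documented under Classes below; the two constituent generators here are placeholders for whatever classes from gen_classify.py / gen_random.py match your channels, and the no-argument constructor is an assumption):

```python
from df import MergeGen

gen = MergeGen()
gen.add(continuous_gen)  # e.g. a generator producing AxisSplit tests on a float32 channel
gen.add(discrete_gen)    # e.g. a generator producing DiscreteBucket tests on an int32 channel
forest.setGen(gen)
```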

gen_median.py - Generators that attempt to split the data in half. Generally not a good approach, but for some problems this works very well.

gen_random.py - Generators based on pure randomness - generate enough random tests and something will work. This is very much in the vein of 'Extremely Randomized Trees'.

gen_classify.py - Optimising generators specific to the problem of classification. These are the traditional choices, and should be the first approaches tried.

pruners.py - Defines the concept of a Pruner, which decides when to stop growing a tree. Provides a default approach that stops growth when any of several limits is approached.

test.py - Defines the actual tests, and their interface. In practice a user only cares about Generators, as each Generator inherits from the Test relevant to what it returns.

nodes.py - Defines the node of a tree, where all the magic in fact happens. It's completely internal to the system however, and never used directly.

test_discrete.py - Tests the discrete generators on some simulated discrete data.

test_continuous.py - Tests the continuous generators on some simulated continuous data.

test_mix.py - Tests the composite generators on a simulated problem that contains both discrete and continuous data.

test_inc.py - Tests incremental learning, also running batch learning on the same data so performance can be compared.

readme.txt - This file, which is included in the html documentation.

make_doc.py - Builds the html documentation.


Classes

DF()

Master object for the decision forest system - provides the entire interface. Typical use consists of setting up the system - its goal, pruner and generator(s) - providing data to train a model, and then using the model to analyse new exemplars. Incremental learning is also supported, albeit via a not very sophisticated implementation. Note that this class is compatible with python's serialisation routines, should you need to save/load a trained model.

addTree(self, es, weightChannel, ret = False, dummy = False) Adds an entirely new tree to the system given all of the new data. Uses all exemplars in the ExemplarSet, which can optionally include a channel with a single feature in it to weight the vectors; indicated via weightChannel. Typically this is used indirectly via the learn method, rather than by the user of an instance of this class.

allowC(self, allow) By default the system will attempt to compile and use C code instead of running the (much slower) python code - this allows you to force it to not use C code, or switch C back on if you had previously switched it off. Typically only used for speed comparisons and debugging, but also useful if the use of C code doesn't work on your system. Just be aware that the speed difference is galactic.

answer_types(self) Returns a dictionary giving all the answer types that can be requested using the which parameter of the evaluate method. The keys give the string to be provided to which, whilst the values give human readable descriptions of what will be returned. 'best' is always provided, as a point estimate of the best answer; most models also provide 'prob', which is a probability distribution over 'best', such that 'best' is the argmax of 'prob'.

error(self) Returns the average error of all the trees - meaning depends on the Goal at hand, but should provide an idea of how well the model is working.

evaluate(self, es, index = slice(None, None, None), which = 'best', mp = False, callback) Given some exemplars this returns a list containing the output of the model for each exemplar. The returned list aligns with index, which defaults to everything, and hence if not provided is aligned with es, the ExemplarSet. The meaning of the entries in the list depends on the Goal of the model and on which: which can either be a single answer type from the goal object or a list of answer types, to get a tuple of answers for each list entry - the result is what the Goal's answer method returns. The answer_types method passes through to provide the relevant information. Can be run in multiprocessing mode if you set the mp variable to True - only worth it if you have a lot of data (Also note that it splits by tree, so each process does all data items but for just one of the trees.). Should not be called if size()==0.
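
For example, to pull out both the hard assignment and the distribution for every exemplar at once:

```python
results = forest.evaluate(es, which=['best', 'prob'], mp=True)  # one tuple per exemplar
best = [r[0] for r in results]  # point estimates, aligned with es
prob = [r[1] for r in results]  # distributions; best[i] is the argmax of prob[i]
```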

getGen(self) Returns the Generator object for the system.

getGoal(self) Returns the current Goal object.

getGrow(self) Returns True if the trees will be subject to further growth during incremental learning, when they have gained enough data to subdivide further.

getInc(self) Returns the status of incremental learning - True if it's enabled, False if it is not.

getPruner(self) Returns the current Pruner object.

learn(self, trees, es, weightChannel, clamp, mp = True, callback) This learns a model given data and, when it is switched on, will also do incremental learning. trees is how many new trees to create - for normal learning this is simply how many to make, for incremental learning it is how many to add to those that have already been made - more is always better, within reason, but it is these that cost you computation and memory. es is the ExemplarSet containing the data to train on. For incremental learning you always provide the previous data, at the same indices, with the new exemplars appended to the end. weightChannel allows you to give a channel containing a single feature if you want to weight the importance of the exemplars. clamp is only relevant to incremental learning - it is effectively a maximum number of trees to allow, where it throws away the weakest trees first. This is how incremental learning works, and so must be set for that mode - by constantly adding new trees as new data arrives and updating the error metrics of the older trees (The error will typically increase with new data.) the less well trained (and typically older) trees will be culled. mp indicates if multiprocessing should be used or not - True to do so, False to not. Will automatically switch itself off if not supported.
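
A minimal sketch of incremental use, assuming weightChannel defaults to None and clamp can be passed by keyword (new_features and new_labels stand in for freshly arrived data matrices):

```python
forest.setInc(True)  # must be enabled before the initial learn call
forest.learn(8, es)  # initial model: 8 trees

# Later: the previous exemplars stay at the same indices, new ones are appended.
es.append(new_features, new_labels)
forest.learn(4, es, clamp=16)  # add 4 trees, then cull down to the best 16
```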

lumberjack(self, count) Once a bunch of trees have been learnt this culls them, reducing them such that there are no more than count. It terminates those with the highest error rate first, and does nothing if there are not enough trees to exceed count. Typically this is used by the learn method, rather than by the object user.

nodes(self) Returns the total number of nodes in all the trees.

setGen(self, gen) Allows you to set the Generator object from which node tests are obtained - must be set before anything happens. You must not change this once training starts.

setGoal(self, goal) Allows you to set a goal object, of type Goal - must be called before doing anything, and must not be changed after anything is done.

setInc(self, inc, grow = False) Set this to True to support incremental learning, False to not. Having incremental learning on costs extra memory, but has little if any computational effect. If incremental learning is on you can also switch grow on, in which case as more data arrives it tries to split the leaf nodes of trees that have already been grown. This requires a bit more memory, as it needs to keep the indices of the training set for future growth. Note that the default pruner is entirely inappropriate for this mode - the pruner has to be set such that as more data arrives it will allow further growth.

setPruner(self, pruner) Sets the pruner, which controls when to stop growing each tree. By default this is set to the PruneCap object with default parameters, though you might want to use getPruner to get it so you can adjust its parameters to match the problem at hand, as the pruner is important for avoiding overfitting.
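
For instance (threshold values here are purely illustrative):

```python
pruner = forest.getPruner()  # the default PruneCap instance
pruner.setMaxDepth(12)       # cap how deep each tree may grow
pruner.setMinTrain(8)        # do not split nodes with fewer than 8 training exemplars
pruner.setMinGain(1e-3)      # stop when a split gains too little information
```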

size(self) Returns the number of trees within the forest.

ExemplarSet()

An interface for a set of feature vectors, referred to as exemplars - whilst a data matrix will typically be used, this allows the possibility of exemplars for which that is impractical, i.e. calculating them on the fly if there is an unreasonable number of features within each exemplar. Also supports the concept of channels, which ties in with test generation so you can have different generators for each channel. For training the 'answer' is stored in its own channel (Note that, because that channel will not exist for novel exemplars, it should always be the last channel, so that indexing is consistent, unless it is replaced with a dummy channel.), the type of which will depend on the problem being solved. Also allows the mixing of both continuous and discrete values.

channels(self) Returns how many channels of features are provided. They are indexed {0, ..., return-1}.

codeC(self, channel, name) Returns a dictionary containing all the entities needed to access the given channel of the exemplar from within C, using name to provide a unique string to avoid namespace clashes. Will raise a NotImplementedError if not available. ['type'] = The C type of the channel, often 'float'. ['input'] = The input object to be passed into the C code; must be protected from any messing around that scipy.weave might do. ['itype'] = The input type in C, as a string, usually 'PyObject *' or 'PyArrayObject *'. ['get'] = Returns code for a function to get values from the channel of the exemplar; has calling convention <type> <name>_get(<itype> input, int exemplar, int feature). ['exemplars'] = Code for a function to get the number of exemplars; has calling convention int <name>_exemplars(<itype> input). Will obviously return the same value for all channels, so can be a bit redundant. ['features'] = Code for a function that returns how many features the channel has; calling convention is int <name>_features(<itype> input). ['name'] is also provided, which contains the base name handed to this method, for convenience.

dtype(self, channel) Returns the numpy dtype used for the given channel. numpy.float32 and numpy.int32 are considered to be the standard choices.

exemplars(self) Returns how many exemplars are provided.

features(self, channel) Returns how many features exist for each exemplar for the given channel - they can then be indexed {0, ..., return-1} for the given channel.

key(self) Provides a unique string that can be used to hash the results of codeC, to avoid repeated generation. Must be implemented if codeC is implemented.

listCodeC(self, name) Helper method - returns a tuple indexed by channel that gives the dictionary returned by codeC for each channel in this exemplar. It generates the names using the provided name by adding the number indexing the channel to the end. Happens to be the required input elsewhere.

name(self, channel) Returns a string as a name for a channel - this is optional and provided more for human usage. Will return None if the channel in question has no name.

tupleInputC(self) Helper method, that can be overridden - returns a tuple containing the inputs needed for the exemplar.

MatrixES(ExemplarSet)

The most common exemplar set - basically what you use when all the feature vectors can be computed and then stored in memory without issue. Contains a data matrix for each channel, where these are provided by the user.
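
A brief construction sketch, assuming a no-argument constructor:

```python
import numpy
from df import MatrixES

es = MatrixES()
c0 = es.add(numpy.zeros((50, 3), dtype=numpy.float32))  # add returns the channel index: 0
c1 = es.add(numpy.zeros((50, 1), dtype=numpy.int32))    # 1 - kept 2D, as a 1D array is read as one exemplar
assert es.channels() == 2 and es.exemplars() == 50 and es.features(c0) == 3
```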

add(self, dm) Adds a new data matrix of information as another channel. Returns its channel index. If given a 1D matrix assumes that there is only one exemplar and adjusts it accordingly.

append(self, *args) Allows you to add exemplars to the structure, by providing a set of data matrices that align with those contained, which contain the new exemplars. Note that this is slow and generally ill advised. If adding a single new exemplar the arrays can be 1D.

channels(self) Returns how many channels of features are provided. They are indexed {0, ..., return-1}.

codeC(self, channel, name) Returns a dictionary containing all the entities needed to access the given channel of the exemplar from within C, using name to provide a unique string to avoid namespace clashes. Will raise a NotImplementedError if not available. ['type'] = The C type of the channel, often 'float'. ['input'] = The input object to be passed into the C code; must be protected from any messing around that scipy.weave might do. ['itype'] = The input type in C, as a string, usually 'PyObject *' or 'PyArrayObject *'. ['get'] = Returns code for a function to get values from the channel of the exemplar; has calling convention <type> <name>_get(<itype> input, int exemplar, int feature). ['exemplars'] = Code for a function to get the number of exemplars; has calling convention int <name>_exemplars(<itype> input). Will obviously return the same value for all channels, so can be a bit redundant. ['features'] = Code for a function that returns how many features the channel has; calling convention is int <name>_features(<itype> input). ['name'] is also provided, which contains the base name handed to this method, for convenience.

dtype(self, channel) Returns the numpy dtype used for the given channel. numpy.float32 and numpy.int32 are considered to be the standard choices.

exemplars(self) Returns how many exemplars are provided.

features(self, channel) Returns how many features exist for each exemplar for the given channel - they can then be indexed {0, ..., return-1} for the given channel.

key(self) Provides a unique string that can be used to hash the results of codeC, to avoid repeated generation. Must be implemented if codeC is implemented.

listCodeC(self, name) Helper method - returns a tuple indexed by channel that gives the dictionary returned by codeC for each channel in this exemplar. It generates the names using the provided name by adding the number indexing the channel to the end. Happens to be the required input elsewhere.

name(self, channel) Returns a string as a name for a channel - this is optional and provided more for human usage. Will return None if the channel in question has no name.

tupleInputC(self) Helper method, that can be overridden - returns a tuple containing the inputs needed for the exemplar.

MatrixGrow(ExemplarSet)

A slightly more advanced version of the basic exemplar set with better support for incremental learning, as it allows appends to be more efficient. It still assumes that all of the data can be fitted in memory, and makes use of numpy arrays for internal storage.
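
A quick sketch of incremental use, assuming a no-argument constructor (the data matrices are placeholders):

```python
from df import MatrixGrow

grow = MatrixGrow()
grow.add(feat_block)   # channel 0: an (n, f) float32 matrix
grow.add(label_block)  # channel 1: an (n, 1) int32 answer matrix

# New exemplars arrive later - appended without the copying cost MatrixES incurs.
grow.append(more_feats, more_labels)
```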

add(self, dm) Adds a new data matrix of information as another channel. Returns its channel index. If given a 1D matrix assumes that there is only one exemplar and adjusts it accordingly.

append(self, *args) Allows you to add exemplars to the structure, by providing a set of data matrices that align with those contained, which contain the new exemplars. If adding a single new exemplar the arrays can be 1D.

channels(self) Returns how many channels of features are provided. They are indexed {0, ..., return-1}.

codeC(self, channel, name) Returns a dictionary containing all the entities needed to access the given channel of the exemplar from within C, using name to provide a unique string to avoid namespace clashes. Will raise a NotImplementedError if not available. ['type'] = The C type of the channel, often 'float'. ['input'] = The input object to be passed into the C code; must be protected from any messing around that scipy.weave might do. ['itype'] = The input type in C, as a string, usually 'PyObject *' or 'PyArrayObject *'. ['get'] = Returns code for a function to get values from the channel of the exemplar; has calling convention <type> <name>_get(<itype> input, int exemplar, int feature). ['exemplars'] = Code for a function to get the number of exemplars; has calling convention int <name>_exemplars(<itype> input). Will obviously return the same value for all channels, so can be a bit redundant. ['features'] = Code for a function that returns how many features the channel has; calling convention is int <name>_features(<itype> input). ['name'] is also provided, which contains the base name handed to this method, for convenience.

dtype(self, channel) Returns the numpy dtype used for the given channel. numpy.float32 and numpy.int32 are considered to be the standard choices.

exemplars(self) Returns how many exemplars are provided.

features(self, channel) Returns how many features exist for each exemplar for the given channel - they can then be indexed {0, ..., return-1} for the given channel.

key(self) Provides a unique string that can be used to hash the results of codeC, to avoid repeated generation. Must be implemented if codeC is implemented.

listCodeC(self, name) Helper method - returns a tuple indexed by channel that gives the dictionary returned by codeC for each channel in this exemplar. It generates the names using the provided name by adding the number indexing the channel to the end. Happens to be the required input elsewhere.

make_compact(self) Internal method really - converts the data structure so that len(dmcList)==1, by concatenating arrays as needed.

name(self, channel) Returns a string as a name for a channel - this is optional and provided more for human usage. Will return None if the channel in question has no name.

tupleInputC(self) Helper method, that can be overridden - returns a tuple containing the inputs needed for the exemplar.

Goal()

Interface that defines the purpose of a decision forest - defines what the tree is optimising, what statistics to store at each node and what is returned to the user as the answer when they provide a novel feature to the forest (i.e. how to combine the statistics).

answer(self, stats_list, which, es, index, trees) Given a feature, a forest yields a list of statistics entities, obtained from the leaf nodes that the feature ends up in, one for each tree (Could be as low as just one entity.). This converts that statistics entity list into an answer, to be passed to the user, possibly using es with the index of the one entry that the stats list is for. As multiple answer types exist (As provided by the answer_types method.) you provide the one(s) you want via the which parameter - if which is a string then that answer type is returned, if it is a list of strings then a tuple aligned with it is returned, containing multiple answers. If multiple types are needed then requesting them as a list should hopefully be optimised by this method to avoid duplicate calculation. Also requires the trees themselves, as a list aligned with stats_list.

answer_batch(self, stats_lists, which, es, indices, trees) A batch version of answer, that does multiple stats lists at once. stats_lists consists of a list of lists, where the outer list matches the entries in indices (A numpy array), and the inner lists are the samples, aligned with the trees list. es is the exemplar object that matches up with indices, and which gives the output(s) to provide. Return value is a list, matching indices, that contains the answer for each, which can be a tuple if which is a list/tuple. A default implementation is provided.

answer_types(self) When classifying a new feature an answer is to be provided, of which several possibilities exist. This returns a dictionary of those possibilities (key==name, value==human readable description of what it is.), from which the user can select. By convention 'best' must always exist, as the best guess that the algorithm can give (A point estimate of the answer the user is after.). If a probability distribution over 'best' can be provided then that should be available as 'prob' (It is highly recommended that this be provided.).

clone(self) Returns a deep copy of this object.

codeC(self, name, escl) Returns a dictionary of strings containing C code that implements the Goal's methods in C - name is a prefix on the names used, escl the result of listCodeC on the exemplar set from which it will get its data. The contents of its return value must contain some of: {'stats': 'void <name>_stats(PyObject * data, Exemplar * index, void *& out, size_t & outLen)' - data is the list of channels for the exemplar object, index the exemplars to use. The stats object is stuck into out, and the size updated accordingly. If the provided out object is too small it will be free-ed and then a large enough buffer malloc-ed; null is handled correctly if outLen is 0., 'updateStats': 'void <name>_updateStats(PyObject * data, Exemplar * index, void *& inout, size_t & inoutLen)' - Same as stats, except the inout data arrives already containing a stats object, which is to be updated with the provided exemplars., 'entropy': 'float <name>_entropy(void * stats, size_t statsLen)' - Given a stats object returns its entropy., 'summary': 'void <name>_summary(PyObject * data, Exemplar * index, void *& out, size_t & outLen)' - Basically the same as stats, except this time it is using the exemplars to calculate a summary. Interface works in the same way., 'updateSummary': 'void <name>_updateSummary(PyObject * data, Exemplar * index, void *& inout, size_t & inoutLen)' - Given a summary object, using the inout variables it updates it with the provided exemplars., 'error': 'void <name>_error(void * stats, size_t statsLen, void * summary, size_t summaryLen, float & error, float & weight)' - Given two buffers, representing the stats and the summary, this calculates the error, which is put into the reference error. This should be done incrementally, such that errors from all nodes in a tree can be merged - error will be initialised at 0, and additionally weight is provided which can be used as it wishes (Incremental mean is typical.), also initialised as 0.}. Optional - if it throws the NotImplementedError (The default) everything will be done in python; if some C code is dependent on a missing C method it will also be done in python. The code can be dependent on the associated exemplar code where applicable.

entropy(self, stats) Given a statistics entity this returns the associated entropy - this is used to choose which test is best.

error(self, stats, summary) Given a stats entity and a summary entity (i.e. the details of the testing and training sets that have reached a leaf) this returns the error of the testing set versus the model learnt from the training set. The actual return is a pair - (error, weight), so that the errors from all the leaves can be combined in a weighted average. The error metric is arbitrary, but the probability of 'being wrong' is a good choice. An alternate mode exists, where weight is set to None - in this case no averaging occurs and the results from all nodes are just summed together.

key(self) Provides a unique string that can be used to hash the results of codeC, to avoid repeated generation. Must be implemented if codeC is implemented.

postTreeGrow(self, root, gen) After a tree is initially grown (At which point its shape is locked, though incremental learning could still be applied.) this method is given the root node of the tree, and can do anything it likes to it - a post processing step, in case the stats objects need some extra cleverness. Most Goals do not need to implement this. Also provided the generator for the tests in the tree.

stats(self, es, index, weights) Generates a statistics entity for a node, based on the exemplars that make it to the node. The statistics entity is decided by the task at hand, but must allow the node's entropy to be calculated; additionally a collection of these is used to generate the answer when a feature is given to the decision forest. es is an exemplar set, index the indices of the exemplars in es that have made it to this node. weights is an optional set of weights for the exemplars, weighting how many exemplars each is worth - will be a 1D numpy.float32 array aligned with the exemplar set, and can contain fractional weights.

summary(self, es, index, weights) Once a tree has been grown a testing set (The 'out-of-bag' set) is typically run through it to find out how good it is. This consists of two steps, the first of which is to generate a summary of the oob set that made it to each leaf. This method generates that summary, and must do so such that the next step - the use of a stats and summary entity to infer an error metric, with a weight for averaging the error metrics from all leaves - can be performed. For incremental learning it is also required to be able to add new exemplars at a later time.

updateStats(self, stats, es, index, weights) Given a stats entity, as generated by the stats method, this returns a copy of that stats entity that has had additional exemplars factored in, specifically those passed in. This allows a tree to be updated with further training examples (Or, at least, its stats to be updated - its structure is set in stone once built.). Needed for incremental learning.

updateSummary(self, summary, es, index, weights) For incremental learning the summaries need to be updated with further testing examples - this does that. Given a summary and some exemplars it returns a copy of the summary updated with the new exemplars.

Classification(Goal)

The standard goal of a decision forest - classification. When training it expects the existence of a discrete channel containing a single feature for each exemplar, the index of which is provided. Each discrete value indicates a different training class, and they should be densely packed, starting from 0 inclusive, i.e. belonging to the set {0, ..., # of classes-1}. The number of classes is typically provided, though None can be given instead, in which case it will automatically resize data structures as needed as more classes (Still densely packed.) are seen. A side effect of this mode is that when it returns arrays indexed by class the size will be data driven, and from the view of the user effectively arbitrary - user code will have to handle this.
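
numpy.unique offers a convenient way to remap arbitrary labels into the densely packed form required (the reshape keeps the answer channel 2D for MatrixES):

```python
import numpy

raw = numpy.array(['cat', 'dog', 'cat', 'bird'])
names, labels = numpy.unique(raw, return_inverse=True)  # labels are dense ints in 0..K-1
es.add(labels.astype(numpy.int32).reshape(-1, 1))       # answer channel, one feature wide
```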

answer(self, stats_list, which, es, index, trees) Given a feature, a forest yields a list of statistics entities, obtained from the leaf nodes that the feature ends up in, one for each tree (Could be as low as just one entity.). This converts that statistics entity list into an answer, to be passed to the user, possibly using es with the index of the one entry that the stats list is for. As multiple answer types exist (As provided by the answer_types method.) you provide the one(s) you want via the which parameter - if which is a string then that answer type is returned, if it is a list of strings then a tuple aligned with it is returned, containing multiple answers. If multiple types are needed then requesting them as a list should hopefully be optimised by this method to avoid duplicate calculation. Also requires the trees themselves, as a list aligned with stats_list.

answer_batch(self, stats_lists, which, es, indices, trees) A batch version of answer, that does multiple stats lists at once. stats_lists consists of a list of lists, where the outer list matches the entries in indices (A numpy array), and the inner lists are the samples, aligned with the trees list. es is the exemplar object that matches up with indices, and which gives the output(s) to provide. Return value is a list, matching indices, that contains the answer for each, which can be a tuple if which is a list/tuple. A default implementation is provided.

answer_types(self) When classifying a new feature an answer is to be provided, of which several possibilities exist. This returns a dictionary of those possibilities (key==name, value==human readable description of what it is.), from which the user can select. By convention 'best' must always exist, as the best guess that the algorithm can give (A point estimate of the answer the user is after.). If a probability distribution over 'best' can be provided then that should be available as 'prob' (It is highly recommended that this be provided.).

clone(self) Returns a deep copy of this object.

codeC(self, name, escl) Returns a dictionary of strings containing C code that implements the Goal's methods in C - name is a prefix on the names used, escl the result of listCodeC on the exemplar set from which it will get its data. The contents of its return value must contain some of: {'stats': 'void <name>_stats(PyObject * data, Exemplar * index, void *& out, size_t & outLen)' - data is the list of channels for the exemplar object, index the exemplars to use. The stats object is stuck into out, and the size updated accordingly. If the provided out object is too small it will be free-ed and then a large enough buffer malloc-ed; null is handled correctly if outLen is 0., 'updateStats': 'void <name>_updateStats(PyObject * data, Exemplar * index, void *& inout, size_t & inoutLen)' - Same as stats, except the inout data arrives already containing a stats object, which is to be updated with the provided exemplars., 'entropy': 'float <name>_entropy(void * stats, size_t statsLen)' - Given a stats object returns its entropy., 'summary': 'void <name>_summary(PyObject * data, Exemplar * index, void *& out, size_t & outLen)' - Basically the same as stats, except this time it is using the exemplars to calculate a summary. Interface works in the same way., 'updateSummary': 'void <name>_updateSummary(PyObject * data, Exemplar * index, void *& inout, size_t & inoutLen)' - Given a summary object, using the inout variables it updates it with the provided exemplars., 'error': 'void <name>_error(void * stats, size_t statsLen, void * summary, size_t summaryLen, float & error, float & weight)' - Given two buffers, representing the stats and the summary, this calculates the error, which is put into the reference error. This should be done incrementally, such that errors from all nodes in a tree can be merged - error will be initialised at 0, and additionally weight is provided which can be used as it wishes (Incremental mean is typical.), also initialised as 0.}. Optional - if it throws the NotImplementedError (The default) everything will be done in python; if some C code is dependent on a missing C method it will also be done in python. The code can be dependent on the associated exemplar code where applicable.

entropy(self, stats) Given a statistics entity this returns the associated entropy - this is used to choose which test is best.

error(self, stats, summary) Given a stats entity and a summary entity (i.e. the details of the testing and training sets that have reached a leaf) this returns the error of the testing set versus the model learnt from the training set. The actual return is a pair - (error, weight), so that the errors from all the leaves can be combined in a weighted average. The error metric is arbitrary, but the probability of 'being wrong' is a good choice. An alternate mode exists, where weight is set to None - in this case no averaging occurs and the results from all nodes are just summed together.

key(self) Provides a unique string that can be used to hash the results of codeC, to avoid repeated generation. Must be implemented if codeC is implemented.

postTreeGrow(self, root, gen) After a tree is initially grown (At which point its shape is locked, though incremental learning could still be applied.) this method is given the root node of the tree, and can do anything it likes to it - a post processing step, in case the stats objects need some extra cleverness. Most Goals do not need to implement this. Also provided the generator for the tests in the tree.

stats(self, es, index, weights) Generates a statistics entity for a node, based on the exemplars that make it to the node. The statistics entity is decided by the task at hand, but must allow the node's entropy to be calculated; additionally a collection of these is used to generate the answer when a feature is given to the decision forest. es is an exemplar set, index the indices of the exemplars in es that have made it to this node. weights is an optional set of weights for the exemplars, weighting how many exemplars each is worth - will be a 1D numpy.float32 array aligned with the exemplar set, and can contain fractional weights.

summary(self, es, index, weights) Once a tree has been grown a testing set (The 'out-of-bag' set) is typically run through it to find out how good it is. This consists of two steps, the first of which is to generate a summary of the oob set that made it to each leaf. This method generates that summary, and must do so such that the next step - the use of a stats and summary entity to infer an error metric, with a weight for averaging the error metrics from all leaves - can be performed. For incremental learning it is also required to be able to add new exemplars at a later time.

updateStats(self, stats, es, index, weights) Given a stats entity, as generated by the stats method, this returns a copy of that stats entity that has had additional exemplars factored in, specifically those passed in. This allows a tree to be updated with further training examples (Or, at least, its stats to be updated - its structure is set in stone once built.). Needed for incremental learning.

updateSummary(self, summary, es, index, weights) For incremental learning the summaries need to be updated with further testing examples - this does that. Given a summary and some exemplars it returns a copy of the summary updated with the new exemplars.

DensityGaussian(Goal)

Provides the ability to construct a density estimate, using Gaussian distributions to represent the density at each node in the tree. A rather strange thing to be doing with a decision forest, and I am a little suspicious of it, but it does give usable results, at least for low enough dimensionalities that everything remains sane. Due to its nature it can be very memory consuming if you're doing incremental learning - the summary has to store all the provided samples. Requires a channel to contain all the features that are fed into the density estimate (It is to this that a Gaussian is fitted.), which is always channel 0. Other features can not exist, so typically input data will only have 1 channel. Because the divisions between nodes are sharp (This is a mixture model only between trees, not between leaf nodes within each tree.) the normalisation constant for each Gaussian has to be adjusted to take this into account. This is achieved by sampling - sending samples from the Gaussian down the tree and counting what percentage make the node. Note that when calculating the Gaussian at each node a prior is used, to avoid degeneracies, with a default weight of 1, so if weights are provided they should be scaled accordingly. Using a decision tree for density estimation is a bit hit and miss in my experience - you need to pay very close attention to tuning the min train parameter of the pruner, as information gain is a terrible stopping metric in this case. You also need a lot of trees to get something smooth out, which means it is quite computationally expensive.

answer(self, stats_list, which, es, index, trees) Given a feature, a forest yields a list of statistics entities, obtained from the leaf nodes that the feature ends up in, one for each tree (Could be as low as just one entity.). This converts that statistics entity list into an answer, to be passed to the user, possibly using es with the index of the one entry that the stats list is for. As multiple answer types exist (As provided by the answer_types method.) you provide the one(s) you want via the which parameter - if which is a string then that answer type is returned, if it is a list of strings then a tuple aligned with it is returned, containing multiple answers. If multiple types are needed then requesting them as a list should hopefully be optimised by this method to avoid duplicate calculation. Also requires the trees themselves, as a list aligned with stats_list.

answer_batch(self, stats_lists, which, es, indices, trees) A batch version of answer, that does multiple stats lists at once. stats_lists consists of a list of lists, where the outer list matches the entries in indices (A numpy array), and the inner lists are the samples, aligned with the trees list. es is the exemplar object that matches up with indices, and which gives the output(s) to provide. Return value is a list, matching indices, that contains the answer for each, which can be a tuple if which is a list/tuple. A default implementation is provided.

answer_types(self) When classifying a new feature an answer is to be provided, of which several possibilities exist. This returns a dictionary of those possibilities (key==name, value==human readable description of what it is.), from which the user can select. By convention 'best' must always exist, as the best guess that the algorithm can give (A point estimate of the answer the user is after.). If a probability distribution over 'best' can be provided then that should be available as 'prob' (It is highly recommended that this be provided.).

clone(self) Returns a deep copy of this object.

codeC(self, name, escl) Returns a dictionary of strings containing C code that implements the Goal's methods in C - name is a prefix on the names used, escl the result of listCodeC on the exemplar set from which it will get its data. The contents of its return value must contain some of: {'stats': 'void <name>_stats(PyObject * data, Exemplar * index, void *& out, size_t & outLen)' - data is the list of channels for the exemplar object, index the exemplars to use. The stats object is stuck into out, and the size updated accordingly. If the provided out object is too small it will be free-ed and then a large enough buffer malloc-ed; null is handled correctly if outLen is 0., 'updateStats': 'void <name>_updateStats(PyObject * data, Exemplar * index, void *& inout, size_t & inoutLen)' - Same as stats, except the inout data arrives already containing a stats object, which is to be updated with the provided exemplars., 'entropy': 'float <name>_entropy(void * stats, size_t statsLen)' - Given a stats object returns its entropy., 'summary': 'void <name>_summary(PyObject * data, Exemplar * index, void *& out, size_t & outLen)' - Basically the same as stats, except this time it is using the exemplars to calculate a summary. Interface works in the same way., 'updateSummary': 'void <name>_updateSummary(PyObject * data, Exemplar * index, void *& inout, size_t & inoutLen)' - Given a summary object, using the inout variables it updates it with the provided exemplars., 'error': 'void <name>_error(void * stats, size_t statsLen, void * summary, size_t summaryLen, float & error, float & weight)' - Given two buffers, representing the stats and the summary, this calculates the error, which is put into the reference error. This should be done incrementally, such that errors from all nodes in a tree can be merged - error will be initialised at 0, and additionally weight is provided which can be used as it wishes (Incremental mean is typical.), also initialised as 0.}. Optional - if it throws the NotImplementedError (The default) everything will be done in python; if some C code is dependent on a missing C method it will also be done in python. The code can be dependent on the associated exemplar code where applicable.

entropy(self, stats) Given a statistics entity this returns the associated entropy - this is used to choose which test is best.

error(self, stats, summary) Given a stats entity and a summary entity (i.e. the details of the testing and training sets that have reached a leaf) this returns the error of the testing set versus the model learnt from the training set. The actual return is a pair - (error, weight), so that the errors from all the leaves can be combined in a weighted average. The error metric is arbitrary, but the probability of 'being wrong' is a good choice. An alternate mode exists, where weight is set to None - in this case no averaging occurs and the results from all nodes are just summed together.

key(self) Provides a unique string that can be used to hash the results of codeC, to avoid repeated generation. Must be implemented if codeC is implemented.

postTreeGrow(self, root, gen) After a tree is initially grown (At which point its shape is locked, though incremental learning could still be applied.) this method is given the root node of the tree, and can do anything it likes to it - a post processing step, in case the stats objects need some extra cleverness. Most Goals do not need to implement this. Also provided the generator for the tests in the tree.

stats(self, es, index, weights) Generates a statistics entity for a node, based on the exemplars that make it to the node. The statistics entity is decided by the task at hand, but must allow the node's entropy to be calculated; additionally a collection of these is used to generate the answer when a feature is given to the decision forest. es is an exemplar set, index the indices of the exemplars in es that have made it to this node. weights is an optional set of weights for the exemplars, weighting how many exemplars each is worth - will be a 1D numpy.float32 array aligned with the exemplar set, and can contain fractional weights.

summary(self, es, index, weights) Once a tree has been grown a testing set (The 'out-of-bag' set) is typically run through it to find out how good it is. This consists of two steps, the first of which is to generate a summary of the oob set that made it to each leaf. This method generates that summary, and must do so such that the next step - the use of a stats and summary entity to infer an error metric, with a weight for averaging the error metrics from all leaves - can be performed. For incremental learning it is also required to be able to add new exemplars at a later time.

updateStats(self, stats, es, index, weights) Given a stats entity, as generated by the stats method, this returns a copy of that stats entity that has had additional exemplars factored in, specifically those passed in. This allows a tree to be updated with further training examples (Or, at least, its stats to be updated - its structure is set in stone once built.). Needed for incremental learning.

updateSummary(self, summary, es, index, weights) For incremental learning the summaries need to be updated with further testing examples - this does that. Given a summary and some exemplars it returns a copy of the summary updated with the new exemplars.

Pruner()

This abstracts the decision of when to stop growing a tree. It takes various statistics and stops growing when some condition is met.

clone(self) Returns a copy of this object.

keep(self, depth, trueCount, falseCount, infoGain, node) Each time a node is split this method is called to decide if the split should be kept or not - it returns True to keep it (And hence the children will be recursively split, and have keep called on them, etc..) and False to discard the node's children and stop. depth is how deep the node in question is, where the root node is 0, its children 1, and so on. trueCount and falseCount indicate how many data points in the training set go each way, whilst infoGain is the information gained by the split. Finally, node is the actual node in case some more complicated analysis is desired - at the time of passing in its test and stats will exist, but everything else will not.
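
A minimal custom Pruner, to illustrate the interface (real use would typically combine several criteria, as PruneCap below does):

```python
from df import Pruner

class DepthOnly(Pruner):
    """Illustrative pruner that keeps splitting until a fixed depth, ignoring all else."""
    def __init__(self, limit):
        self.limit = limit

    def clone(self):
        return DepthOnly(self.limit)

    def keep(self, depth, trueCount, falseCount, infoGain, node):
        return depth < self.limit  # True keeps the split and recurses into the children
```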

PruneCap(Pruner)

A simple but effective Pruner implementation - simply provides a set of thresholds on depth, number of training samples required to split, and information gained - when any one of the thresholds is tripped it stops further branching.

clone(self) Returns a copy of this object.

keep(self, depth, trueCount, falseCount, infoGain, node) Each time a node is split this method is called to decide if the split should be kept or not - it returns True to keep it (And hence the children will be recursively split, and have keep called on them, etc..) and False to discard the node's children and stop. depth is how deep the node in question is, where the root node is 0, its children 1, and so on. trueCount and falseCount indicate how many data points in the training set go each way, whilst infoGain is the information gained by the split. Finally, node is the actual node in case some more complicated analysis is desired - at the time of passing in its test and stats will exist, but everything else will not.

setMaxDepth(self, maxDepth) Sets the depth cap on the trees.

setMinDepth(self, minDepth) Sets the minimum tree growing depth - trees will be grown at least this deep, barring insurmountable issues.

setMinGain(self, mingain) Sets the minimum gain that is allowed for a split to be accepted.

setMinTrain(self, minTrain) Sets the minimum number of training exemplars a node must contain for it to be split.

Test()

Interface for a test definition. This provides the concept of a test that an exemplar either passes or fails. The test is actually defined by some arbitrary entity made by a matching generator, but this object is required to actually do the test - it contains the relevant code and any shared parameters, to keep memory consumption low, as there could be an awful lot of tests. The separation of test from generator is required as there are typically many ways to generate a specific test - generators inherit from the relevant test object.

do(self, test, es, index = slice(None, -1, None)) Does the test. Given the entity that defines the actual test and an ExemplarSet to run the test on. An optional index for the exemplars can be provided (Passed directly through to the exemplar set's [] operator, so its indexing rules apply.), but if omitted it runs for all. Return value is a boolean numpy array indexed by the relative exemplar indices, that gives True if it passed the test, False if it failed.

testCodeC(self, name, exemplar_list) Provides C code to perform the test - provides a C function that is given the test object as a pointer to the first byte and the index of the exemplar to test; it then returns true or false depending on whether it passes the test or not. The returned string contains a function with the calling convention bool <name>(PyObject * data, void * test, size_t test_length, int exemplar). data is a python tuple indexed by channel containing the object to be fed to the exemplar access function. To construct this function it needs the return value of listCodeC for an ExemplarSet, so it can get the calling convention to access the channel. When compiled the various functions must be available.

AxisSplit(Test)

Possibly the simplest test you can apply to continuous data - an axis-aligned split plane. Can also be applied to discrete data if that happens to make sense. This object stores which channel to apply the tests to, whilst each test entity is an 8 byte string, encoding an int32 then a float32 - the first indexes the feature to use from the channel, the second gives the offset, such that an input has this value subtracted and then fails the test if the result is less than zero, or passes if it is greater than or equal to zero.
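
The entity layout can be sketched with the struct module (byte order is an assumption here - the real code presumably uses the platform's native order):

```python
import struct

test = struct.pack('<if', 2, 0.5)  # feature 2 of the channel, split at an offset of 0.5
assert len(test) == 8              # int32 + float32

feature, offset = struct.unpack('<if', test)
```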

do(self, test, es, index = slice(None, None, None)) Does the test. Given the entity that defines the actual test and an ExemplarSet to run the test on. An optional index for the exemplars can be provided (Passed directly through to the exemplar set's [] operator, so its indexing rules apply.), but if omitted it runs for all. Return value is a boolean numpy array indexed by the relative exemplar indices, that gives True if it passed the test, False if it failed.

testCodeC(self, name, exemplar_list) Provides C code to perform the test - provides a C function that is given the test object as a pointer to the first byte and the index of the exemplar to test; it then returns true or false depending on whether it passes the test or not. The returned string contains a function with the calling convention bool <name>(PyObject * data, void * test, size_t test_length, int exemplar). data is a python tuple indexed by channel containing the object to be fed to the exemplar access function. To construct this function it needs the return value of listCodeC for an ExemplarSet, so it can get the calling convention to access the channel. When compiled the various functions must be available.

LinearSplit(Test)

Does a linear split of the data based on some small set of values. Can be applied to discrete data, though that would typically be a bit strange. This object stores both the channel to which the test is applied and how many dimensions are used, whilst the test entity is a string encoding three things in sequence. First are the int32 indices of the features to use from the exemplar's channel, second are the float32 values forming the vector that is dot-producted with the extracted values to project them onto the line perpendicular to the split plane, and finally the float32 offset, subtracted from the line position to give a decision that is negative to fail, zero or positive to pass.

do(self, test, es, index = slice(None, None, None)) Does the test. Given the entity that defines the actual test and an ExemplarSet to run the test on. An optional index for the exemplars can be provided (Passed directly through to the exemplar set's [] operator, so its indexing rules apply.), but if omitted it runs for all. Return value is a boolean numpy array indexed by the relative exemplar indices, that gives True if it passed the test, False if it failed.

testCodeC(self, name, exemplar_list) Provides C code to perform the test - provides a C function that is given the test object as a pointer to the first byte and the index of the exemplar to test; it then returns true or false depending on whether it passes the test or not. The returned string contains a function with the calling convention bool <name>(PyObject * data, void * test, size_t test_length, int exemplar). data is a python tuple indexed by channel containing the object to be fed to the exemplar access function. To construct this function it needs the return value of listCodeC for an ExemplarSet, so it can get the calling convention to access the channel. When compiled the various functions must be available.

DiscreteBucket(Test)

For discrete values. The test is applied to a single value, and consists of a list of values such that if it is equal to one of them it passes, but if it is not equal to any of them it fails. Basically a binary split of categorical data. The test entity is a string encoding first an int32 giving the index of the feature to use, followed by the remainder of the string forming a list of int32's that constitute the values that result in success.
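
Again sketched with the struct module, under the same byte-order assumption as the AxisSplit example:

```python
import struct

test = struct.pack('<4i', 0, 1, 4, 7)  # feature 0 passes on the values 1, 4 or 7

feature = struct.unpack_from('<i', test)[0]
count = len(test) // 4 - 1
accept = struct.unpack_from('<%ii' % count, test, 4)  # (1, 4, 7)
```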

do(self, test, es, index = slice(None, None, None)) Does the test. Given the entity that defines the actual test and an ExemplarSet to run the test on. An optional index for the exemplars can be provided (Passed directly through to the exemplar set's [] operator, so its indexing rules apply.), but if omitted it runs for all. Return value is a boolean numpy array indexed by the relative exemplar indices, that gives True if it passed the test, False if it failed.

testCodeC(self, name, exemplar_list) Provides C code to perform the test - provides a C function that is given the test object as a pointer to the first byte and the index of the exemplar to test; it then returns true or false depending on whether it passes the test or not. The returned string contains a function with the calling convention bool <name>(PyObject * data, void * test, size_t test_length, int exemplar). data is a python tuple indexed by channel containing the object to be fed to the exemplar access function. To construct this function it needs the return value of listCodeC for an ExemplarSet, so it can get the calling convention to access the channel. When compiled the various functions must be available.

Generator()

A generator - provides lots of test entities designed to split an exemplar set, via a (python) generator method (i.e. using yield). When a tree is constructed it is provided with a generator, and each time it wants to split the generator is given the exemplar set and an index into the relevant exemplars to split on, plus an optional weighting. It then yields a set of test entities, which are applied and scored via the goal, such that the best can be selected. This is more in line with extremely random decision forests, but there is nothing stopping the use of a goal-aware test generator that does do some kind of optimisation, potentially yielding just one test entity. The generator will contain the most important parameters of the decision forest, as it controls how the test entities are created and how many are tried - selecting the right generator and its associated parameters is essential for performance. An actual Generator is expected to also inherit from its associated Test object, such that it provides the do method. This is necessary as a test entity requires access to its associated Test object to work.

clone(self) Returns a (deep) copy of this object.

genCodeC(self, name, exemplar_list) Provides C code for a generator. The return value is a 2-tuple, the first entry containing the code and the second entry <state>, the name of a state object to be used by the system. The state object has two public variables for use by the user - void * test and size_t length. The code itself will contain the definition of <state> and two functions: void <name>_init(<state> & state, PyObject * data, Exemplar * test_set) and bool <name>_next(<state> & state, PyObject * data, Exemplar * test_set). Usage consists of creating an instance of <state> and calling <name>_init on it, then repeatedly calling <name>_next - each time it returns true you can use the variables in <state> to get at the test, but when it returns false it is time to stop (And the <state> will have been cleaned up.). If it is not available then a NotImplementedError will be raised.

itertests(self, es, index, weights) Generates test entities that split the provided features into two sets, yielding them one at a time so that the caller can select the best according to the current goal. It is allowed to do whatever it likes to create these test entities, which provides an insane amount of customisation potential, if possibly rather too much choice. es is the exemplar set, whilst index gives the exemplars within the exemplar set for which it is creating a test. weights optionally provides a weight for each exemplar, aligned with es.
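
Writing your own generator is therefore just a matter of implementing this method. Below is a minimal sketch, assuming the interfaces described here; note that a real generator would inherit from a concrete Test class (e.g. AxisSplit) so the do method works, and the exemplar-set accessors and the encode call are illustrative placeholders, to be replaced by whatever your associated Test object actually uses:

```python
import random
import numpy
from df import Generator, Test

class UniformAxisGen(Generator, Test):
    """Hypothetical generator: yields random axis-aligned thresholds,
    drawn uniformly from the range of each chosen feature."""
    def __init__(self, channel, attempts = 8):
        self.channel = channel    # Channel containing the continuous features.
        self.attempts = attempts  # Candidate tests to yield per split.

    def clone(self):
        return UniformAxisGen(self.channel, self.attempts)

    def itertests(self, es, index, weights):
        for _ in range(self.attempts):
            feat = random.randrange(es.features(self.channel))  # Placeholder accessor.
            values = es[self.channel, index, feat]              # Placeholder indexing.
            split = numpy.random.uniform(values.min(), values.max())
            yield self.encode(feat, split)  # Placeholder - pack however do() expects.
```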

MergeGen(Generator)

As most generators only handle a specific kind of data (discrete, continuous, one channel at a time.) the need arises to merge multiple generators for a given problem, in the sense that when iterating tests it provides the union of all tests from all of the contained generators. Alternatively, the possibility exists to get better results by using multiple generators with different properties, as the best test from all provided will ultimately be selected. This class merges up to 256 generators into one. The 256 limit comes from the fact that the test entities it provides have to encode which generator made them, so that the do method can send the test entity to the right test object, and it only uses a byte - in the unlikely event that more are needed a hierarchy can be used, though you're almost certainly doing it wrong if you get that far.

add(self, gen) Adds a generator to the provided set. Generators can be in multiple MergeGen/RandomGen objects, just as long as a loop is not formed.
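
A sketch of typical usage, assuming the constructor arguments of the contained generators (check the actual classes for the real signatures):

```python
from df import MergeGen, AxisMedianGen, LinearRandomGen

gen = MergeGen()
gen.add(AxisMedianGen(0, 4))             # Assumed arguments: channel 0, 4 candidates.
gen.add(LinearRandomGen(0, 2, 4, 4, 8))  # Assumed arguments - some random linear splits.
```

Each split will then consider the union of the tests yielded by both generators.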

clone(self) Returns a (deep) copy of this object.

do(self, test, es, index = slice(None, None, None)) Does the test, forwarding the test entity to the contained generator that created it - the byte encoded into the entity identifies which (see the class description).

genCodeC(self, name, exemplar_list) Provides C code for a generator. The return value is a 2-tuple, the first entry containing the code and the second entry <state>, the name of a state object to be used by the system. The state object has two public variables for use by the user - void * test and size_t length. The code itself will contain the definition of <state> and two functions: void <name>_init(<state> & state, PyObject * data, Exemplar * test_set) and bool <name>_next(<state> & state, PyObject * data, Exemplar * test_set). Usage consists of creating an instance of <state> and calling <name>_init on it, then repeatedly calling <name>_next - each time it returns true you can use the variables in <state> to get at the test, but when it returns false it is time to stop (At which point <state> will have been cleaned up.). If C code generation is not available then a NotImplementedError will be raised.

itertests(self, es, index, weights) Generates test entities that split the provided features into two sets, yielding them one at a time so that the caller can select the best according to the current goal. It is allowed to do whatever it likes to create these test entities, which provides an insane amount of customisation potential, if possibly rather too much choice. es is the exemplar set, whilst index gives the exemplars within the exemplar set for which it is creating a test. weights optionally provides a weight for each exemplar, aligned with es.

testCodeC(self, name, exemplar_list) Provides C code to perform the test, dispatching on the byte that identifies which contained generator created the test entity; see Test.testCodeC for the calling convention.

RandomGen(Generator)

This generator contains several generators, and randomly selects one to provide the tests each time itertests is called - it is not entirely clear what this could be used for, but it can certainly add some more randomness, for good or for bad. Supports weighting and merging multiple draws from the set of generators contained within. Has the same limit of 256 generators that MergeGen has, for the same reasons.

add(self, gen, weight = 1.0) Adds a generator to the provided set. Generators can be in multiple MergeGen/RandomGen objects, just as long as a loop is not formed. You can also provide a weight, to bias how often particular generators are selected.
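
A sketch of weighted usage, again with assumed constructor arguments for the contained generators:

```python
from df import RandomGen, AxisMedianGen, AxisRandomGen

gen = RandomGen()
gen.add(AxisMedianGen(0, 4), weight = 2.0)  # Selected twice as often...
gen.add(AxisRandomGen(0, 4), weight = 1.0)  # ...as this one.
```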

clone(self) Returns a (deep) copy of this object.

do(self, test, es, index = slice(None, None, None)) Does the test, forwarding the test entity to the contained generator that created it - the byte encoded into the entity identifies which (see the class description).

genCodeC(self, name, exemplar_list) Provides C code for a generator. The return value is a 2-tuple, the first entry containing the code and the second entry <state>, the name of a state object to be used by the system. The state object has two public variables for use by the user - void * test and size_t length. The code itself will contain the definition of <state> and two functions: void <name>_init(<state> & state, PyObject * data, Exemplar * test_set) and bool <name>_next(<state> & state, PyObject * data, Exemplar * test_set). Usage consists of creating an instance of <state> and calling <name>_init on it, then repeatedly calling <name>_next - each time it returns true you can use the variables in <state> to get at the test, but when it returns false it is time to stop (At which point <state> will have been cleaned up.). If C code generation is not available then a NotImplementedError will be raised.

itertests(self, es, index, weights) Generates test entities that split the provided features into two sets, yielding them one at a time so that the caller can select the best according to the current goal. It is allowed to do whatever it likes to create these test entities, which provides an insane amount of customisation potential, if possibly rather too much choice. es is the exemplar set, whilst index gives the exemplars within the exemplar set for which it is creating a test. weights optionally provides a weight for each exemplar, aligned with es.

testCodeC(self, name, exemplar_list) Provides C code to perform the test, dispatching on the byte that identifies which contained generator created the test entity; see Test.testCodeC for the calling convention.

AxisMedianGen(Generator, AxisSplit, Test)

Provides a generator for axis-aligned split planes that split the data set in half, i.e. it uses the median. Randomly selects which dimension to split on.
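
A sketch of the typical workflow with this generator, following the overview; the goal class, the constructor arguments and the exact configuration calls are all assumptions - the test_*.py files demonstrate the real interface:

```python
from df import *

forest = DF()
forest.goal = Classification(2, 1)  # Hypothetical: 2 classes, answers in channel 1.
forest.gen = AxisMedianGen(0, 4)    # Hypothetical: channel 0, 4 candidate dimensions.

forest.learn(es)                 # es would be e.g. a MatrixFS wrapping your data matrices.
dist = forest.evaluate(test_es)  # Distribution over class assignment per exemplar.
```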

clone(self) Returns a (deep) copy of this object.

do(self, test, es, index = slice(None, None, None)) Does the test. Given the entity that defines the actual test and an ExemplarSet to run the test on. An optional index for the exemplars can be provided (Passed directly through to the exemplar set's [] operator, so its indexing rules apply.), but if omitted it runs for all. Return value is a boolean numpy array, indexed by the relative exemplar indices, that gives True if the exemplar passed the test, False if it failed.

genCodeC(self, name, exemplar_list) Provides C code for a generator. The return value is a 2-tuple, the first entry containing the code and the second entry <state>, the name of a state object to be used by the system. The state object has two public variables for use by the user - void * test and size_t length. The code itself will contain the definition of <state> and two functions: void <name>_init(<state> & state, PyObject * data, Exemplar * test_set) and bool <name>_next(<state> & state, PyObject * data, Exemplar * test_set). Usage consists of creating an instance of <state> and calling <name>_init on it, then repeatedly calling <name>_next - each time it returns true you can use the variables in <state> to get at the test, but when it returns false it is time to stop (At which point <state> will have been cleaned up.). If C code generation is not available then a NotImplementedError will be raised.

itertests(self, es, index, weights) Generates test entities that split the provided features into two sets, yielding them one at a time so that the caller can select the best according to the current goal. It is allowed to do whatever it likes to create these test entities, which provides an insane amount of customisation potential, if possibly rather too much choice. es is the exemplar set, whilst index gives the exemplars within the exemplar set for which it is creating a test. weights optionally provides a weight for each exemplar, aligned with es.

testCodeC(self, name, exemplar_list) Provides C code to perform the test - a C function that is given the test object, as a pointer to its first byte, and the index of the exemplar to test; it returns true or false depending on whether the exemplar passes the test. The returned string contains a function with the calling convention bool <name>(PyObject * data, void * test, size_t test_length, int exemplar). data is a python tuple, indexed by channel, containing the object to be fed to the exemplar access function. To construct this function it needs the return value of listCodeC for an ExemplarSet, so it can get the calling convention for accessing the channel. When compiled the various functions must be available.

LinearMedianGen(Generator, LinearSplit, Test)

Provides a generator for split planes that takes the median of the features projected onto the split direction, so that it splits the data set in half. Randomly selects which dimensions to work with and the orientation of the split plane.
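
The idea, as a plain numpy sketch (illustrative only - the generator itself handles channel access and test encoding):

```python
import numpy

feats = numpy.random.randn(100, 2)         # The two randomly chosen dimensions.
direction = numpy.random.randn(2)
direction /= numpy.linalg.norm(direction)  # Orientation of the split plane.

projected = feats.dot(direction)           # Project onto the split direction.
threshold = numpy.median(projected)        # The median splits the exemplars in half.
passed = projected >= threshold
```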

clone(self) Returns a (deep) copy of this object.

do(self, test, es, index = slice(None, None, None)) Does the test. Given the entity that defines the actual test and an ExemplarSet to run the test on. An optional index for the exemplars can be provided (Passed directly through to the exemplar set's [] operator, so its indexing rules apply.), but if omitted it runs for all. Return value is a boolean numpy array, indexed by the relative exemplar indices, that gives True if the exemplar passed the test, False if it failed.

genCodeC(self, name, exemplar_list) Provides C code for a generator. The return value is a 2-tuple, the first entry containing the code and the second entry <state>, the name of a state object to be used by the system. The state object has two public variables for use by the user - void * test and size_t length. The code itself will contain the definition of <state> and two functions: void <name>_init(<state> & state, PyObject * data, Exemplar * test_set) and bool <name>_next(<state> & state, PyObject * data, Exemplar * test_set). Usage consists of creating an instance of <state> and calling <name>_init on it, then repeatedly calling <name>_next - each time it returns true you can use the variables in <state> to get at the test, but when it returns false it is time to stop (At which point <state> will have been cleaned up.). If C code generation is not available then a NotImplementedError will be raised.

itertests(self, es, index, weights) Generates test entities that split the provided features into two sets, yielding them one at a time so that the caller can select the best according to the current goal. It is allowed to do whatever it likes to create these test entities, which provides an insane amount of customisation potential, if possibly rather too much choice. es is the exemplar set, whilst index gives the exemplars within the exemplar set for which it is creating a test. weights optionally provides a weight for each exemplar, aligned with es.

testCodeC(self, name, exemplar_list) Provides C code to perform the test - a C function that is given the test object, as a pointer to its first byte, and the index of the exemplar to test; it returns true or false depending on whether the exemplar passes the test. The returned string contains a function with the calling convention bool <name>(PyObject * data, void * test, size_t test_length, int exemplar). data is a python tuple, indexed by channel, containing the object to be fed to the exemplar access function. To construct this function it needs the return value of listCodeC for an ExemplarSet, so it can get the calling convention for accessing the channel. When compiled the various functions must be available.

AxisRandomGen(Generator, AxisSplit, Test)

Provides a generator for axis-aligned split planes that split the data set at a random position, drawn from a normal distribution fitted to the data. Randomly selects which dimension to split on.

clone(self) Returns a (deep) copy of this object.

do(self, test, es, index = slice(None, None, None)) Does the test. Given the entity that defines the actual test and an ExemplarSet to run the test on. An optional index for the exemplars can be provided (Passed directly through to the exemplar set's [] operator, so its indexing rules apply.), but if omitted it runs for all. Return value is a boolean numpy array, indexed by the relative exemplar indices, that gives True if the exemplar passed the test, False if it failed.

genCodeC(self, name, exemplar_list) Provides C code for a generator. The return value is a 2-tuple, the first entry containing the code and the second entry <state>, the name of a state object to be used by the system. The state object has two public variables for use by the user - void * test and size_t length. The code itself will contain the definition of <state> and two functions: void <name>_init(<state> & state, PyObject * data, Exemplar * test_set) and bool <name>_next(<state> & state, PyObject * data, Exemplar * test_set). Usage consists of creating an instance of <state> and calling <name>_init on it, then repeatedly calling <name>_next - each time it returns true you can use the variables in <state> to get at the test, but when it returns false it is time to stop (At which point <state> will have been cleaned up.). If C code generation is not available then a NotImplementedError will be raised.

itertests(self, es, index, weights) Generates test entities that split the provided features into two sets, yielding them one at a time so that the caller can select the best according to the current goal. It is allowed to do whatever it likes to create these test entities, which provides an insane amount of customisation potential, if possibly rather too much choice. es is the exemplar set, whilst index gives the exemplars within the exemplar set for which it is creating a test. weights optionally provides a weight for each exemplar, aligned with es.

testCodeC(self, name, exemplar_list) Provides C code to perform the test - a C function that is given the test object, as a pointer to its first byte, and the index of the exemplar to test; it returns true or false depending on whether the exemplar passes the test. The returned string contains a function with the calling convention bool <name>(PyObject * data, void * test, size_t test_length, int exemplar). data is a python tuple, indexed by channel, containing the object to be fed to the exemplar access function. To construct this function it needs the return value of listCodeC for an ExemplarSet, so it can get the calling convention for accessing the channel. When compiled the various functions must be available.

LinearRandomGen(Generator, LinearSplit, Test)

Provides a generator for split planes that is entirely random. Randomly selects which dimensions to work with and the orientation of the split plane, and then where to put the split plane, with this last part drawn from a normal distribution.

clone(self) Returns a (deep) copy of this object.

do(self, test, es, index = slice(None, None, None)) Does the test. Given the entity that defines the actual test and an ExemplarSet to run the test on. An optional index for the exemplars can be provided (Passed directly through to the exemplar set's [] operator, so its indexing rules apply.), but if omitted it runs for all. Return value is a boolean numpy array, indexed by the relative exemplar indices, that gives True if the exemplar passed the test, False if it failed.

genCodeC(self, name, exemplar_list) Provides C code for a generator. The return value is a 2-tuple, the first entry containing the code and the second entry <state>, the name of a state object to be used by the system. The state object has two public variables for use by the user - void * test and size_t length. The code itself will contain the definition of <state> and two functions: void <name>_init(<state> & state, PyObject * data, Exemplar * test_set) and bool <name>_next(<state> & state, PyObject * data, Exemplar * test_set). Usage consists of creating an instance of <state> and calling <name>_init on it, then repeatedly calling <name>_next - each time it returns true you can use the variables in <state> to get at the test, but when it returns false it is time to stop (At which point <state> will have been cleaned up.). If C code generation is not available then a NotImplementedError will be raised.

itertests(self, es, index, weights) Generates test entities that split the provided features into two sets, yielding them one at a time so that the caller can select the best according to the current goal. It is allowed to do whatever it likes to create these test entities, which provides an insane amount of customisation potential, if possibly rather too much choice. es is the exemplar set, whilst index gives the exemplars within the exemplar set for which it is creating a test. weights optionally provides a weight for each exemplar, aligned with es.

testCodeC(self, name, exemplar_list) Provides C code to perform the test - a C function that is given the test object, as a pointer to its first byte, and the index of the exemplar to test; it returns true or false depending on whether the exemplar passes the test. The returned string contains a function with the calling convention bool <name>(PyObject * data, void * test, size_t test_length, int exemplar). data is a python tuple, indexed by channel, containing the object to be fed to the exemplar access function. To construct this function it needs the return value of listCodeC for an ExemplarSet, so it can get the calling convention for accessing the channel. When compiled the various functions must be available.

DiscreteRandomGen(Generator, DiscreteBucket, Test)

Defines a generator for discrete data. It takes a single discrete feature and randomly assigns just one value to pass, with all other values failing the test. The value is selected from those that appear in the provided data, weighted by how many times each occurs.
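
The selection rule, as a plain numpy sketch:

```python
import numpy

values = numpy.array([0, 1, 1, 2, 2, 2, 2])  # One discrete feature over the exemplars.
seen, counts = numpy.unique(values, return_counts = True)
chosen = numpy.random.choice(seen, p = counts / float(counts.sum()))
passed = values == chosen  # Only exemplars with the chosen value pass.
```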

clone(self) Returns a (deep) copy of this object.

do(self, test, es, index = slice(None, None, None)) Does the test. Given the entity that defines the actual test and an ExemplarSet to run the test on. An optional index for the exemplars can be provided (Passed directly through to the exemplar set's [] operator, so its indexing rules apply.), but if omitted it runs for all. Return value is a boolean numpy array, indexed by the relative exemplar indices, that gives True if the exemplar passed the test, False if it failed.

genCodeC(self, name, exemplar_list) Provides C code for a generator. The return value is a 2-tuple, the first entry containing the code and the second entry <state>, the name of a state object to be used by the system. The state object has two public variables for use by the user - void * test and size_t length. The code itself will contain the definition of <state> and two functions: void <name>_init(<state> & state, PyObject * data, Exemplar * test_set) and bool <name>_next(<state> & state, PyObject * data, Exemplar * test_set). Usage consists of creating an instance of <state> and calling <name>_init on it, then repeatedly calling <name>_next - each time it returns true you can use the variables in <state> to get at the test, but when it returns false it is time to stop (At which point <state> will have been cleaned up.). If C code generation is not available then a NotImplementedError will be raised.

itertests(self, es, index, weights) Generates test entities that split the provided features into two sets, yielding them one at a time so that the caller can select the best according to the current goal. It is allowed to do whatever it likes to create these test entities, which provides an insane amount of customisation potential, if possibly rather too much choice. es is the exemplar set, whilst index gives the exemplars within the exemplar set for which it is creating a test. weights optionally provides a weight for each exemplar, aligned with es.

testCodeC(self, name, exemplar_list) Provides C code to perform the test - a C function that is given the test object, as a pointer to its first byte, and the index of the exemplar to test; it returns true or false depending on whether the exemplar passes the test. The returned string contains a function with the calling convention bool <name>(PyObject * data, void * test, size_t test_length, int exemplar). data is a python tuple, indexed by channel, containing the object to be fed to the exemplar access function. To construct this function it needs the return value of listCodeC for an ExemplarSet, so it can get the calling convention for accessing the channel. When compiled the various functions must be available.

AxisClassifyGen(Generator, AxisSplit, Test)

Provides a generator that creates axis-aligned split planes that have their position selected to maximise the information gain with respect to the task of classification.
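
The optimisation it performs is equivalent to the following sketch - score every distinct threshold along the chosen axis by information gain and keep the best (the generator's actual implementation details may differ):

```python
import numpy

def entropy(labels):
    # Shannon entropy, in nats, of an array of non-negative integer labels.
    p = numpy.bincount(labels) / float(len(labels))
    p = p[p > 0]
    return -(p * numpy.log(p)).sum()

def best_threshold(values, labels):
    # Exhaustively score each distinct threshold by information gain.
    order = numpy.argsort(values)
    values, labels = values[order], labels[order]
    parent = entropy(labels)
    best, best_gain = None, -numpy.inf
    for i in range(1, len(values)):
        if values[i] == values[i-1]:
            continue  # Identical values cannot be separated by a threshold.
        child = (i * entropy(labels[:i]) + (len(labels) - i) * entropy(labels[i:])) / float(len(labels))
        gain = parent - child
        if gain > best_gain:
            best_gain = gain
            best = 0.5 * (values[i-1] + values[i])
    return best
```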

clone(self) Returns a (deep) copy of this object.

do(self, test, es, index = slice(None, None, None)) Does the test. Given the entity that defines the actual test and an ExemplarSet to run the test on. An optional index for the exemplars can be provided (Passed directly through to the exemplar set's [] operator, so its indexing rules apply.), but if omitted it runs for all. Return value is a boolean numpy array, indexed by the relative exemplar indices, that gives True if the exemplar passed the test, False if it failed.

genCodeC(self, name, exemplar_list) Provides C code for a generator. The return value is a 2-tuple, the first entry containing the code and the second entry <state>, the name of a state object to be used by the system. The state object has two public variables for use by the user - void * test and size_t length. The code itself will contain the definition of <state> and two functions: void <name>_init(<state> & state, PyObject * data, Exemplar * test_set) and bool <name>_next(<state> & state, PyObject * data, Exemplar * test_set). Usage consists of creating an instance of <state> and calling <name>_init on it, then repeatedly calling <name>_next - each time it returns true you can use the variables in <state> to get at the test, but when it returns false it is time to stop (At which point <state> will have been cleaned up.). If C code generation is not available then a NotImplementedError will be raised.

itertests(self, es, index, weights) Generates test entities that split the provided features into two sets, yielding them one at a time so that the caller can select the best according to the current goal. It is allowed to do whatever it likes to create these test entities, which provides an insane amount of customisation potential, if possibly rather too much choice. es is the exemplar set, whilst index gives the exemplars within the exemplar set for which it is creating a test. weights optionally provides a weight for each exemplar, aligned with es.

testCodeC(self, name, exemplar_list) Provides C code to perform the test - a C function that is given the test object, as a pointer to its first byte, and the index of the exemplar to test; it returns true or false depending on whether the exemplar passes the test. The returned string contains a function with the calling convention bool <name>(PyObject * data, void * test, size_t test_length, int exemplar). data is a python tuple, indexed by channel, containing the object to be fed to the exemplar access function. To construct this function it needs the return value of listCodeC for an ExemplarSet, so it can get the calling convention for accessing the channel. When compiled the various functions must be available.

LinearClassifyGen(Generator, LinearSplit, Test)

Provides a generator for split planes that projects the features onto a random direction and then optimises where along that direction to put the split plane, to maximise classification performance. Randomly selects which dimensions to work with and the orientation of the split plane.

clone(self) Returns a (deep) copy of this object.

do(self, test, es, index = slice(None, None, None)) Does the test. Given the entity that defines the actual test and an ExemplarSet to run the test on. An optional index for the exemplars can be provided (Passed directly through to the exemplar set's [] operator, so its indexing rules apply.), but if omitted it runs for all. Return value is a boolean numpy array, indexed by the relative exemplar indices, that gives True if the exemplar passed the test, False if it failed.

genCodeC(self, name, exemplar_list) Provides C code for a generator. The return value is a 2-tuple, the first entry containing the code and the second entry <state>, the name of a state object to be used by the system. The state object has two public variables for use by the user - void * test and size_t length. The code itself will contain the definition of <state> and two functions: void <name>_init(<state> & state, PyObject * data, Exemplar * test_set) and bool <name>_next(<state> & state, PyObject * data, Exemplar * test_set). Usage consists of creating an instance of <state> and calling <name>_init on it, then repeatedly calling <name>_next - each time it returns true you can use the variables in <state> to get at the test, but when it returns false it is time to stop (At which point <state> will have been cleaned up.). If C code generation is not available then a NotImplementedError will be raised.

itertests(self, es, index, weights) Generates test entities that split the provided features into two sets, yielding them one at a time so that the caller can select the best according to the current goal. It is allowed to do whatever it likes to create these test entities, which provides an insane amount of customisation potential, if possibly rather too much choice. es is the exemplar set, whilst index gives the exemplars within the exemplar set for which it is creating a test. weights optionally provides a weight for each exemplar, aligned with es.

testCodeC(self, name, exemplar_list) Provides C code to perform the test - a C function that is given the test object, as a pointer to its first byte, and the index of the exemplar to test; it returns true or false depending on whether the exemplar passes the test. The returned string contains a function with the calling convention bool <name>(PyObject * data, void * test, size_t test_length, int exemplar). data is a python tuple, indexed by channel, containing the object to be fed to the exemplar access function. To construct this function it needs the return value of listCodeC for an ExemplarSet, so it can get the calling convention for accessing the channel. When compiled the various functions must be available.

DiscreteClassifyGen(Generator, DiscreteBucket, Test)

Defines a generator for discrete data. It takes a single discrete feature and then greedily optimises the split to get the best classification performance. As it won't necessarily converge to the global optimum, multiple restarts are provided. The discrete values must form a contiguous set, starting at 0 and going upwards. When splitting it only uses values it sees - unseen values will fail the test, though it always arranges for the most informative half to be the one that passes the test.

clone(self) Returns a (deep) copy of this object.

do(self, test, es, index = slice(None, None, None)) Does the test. Given the entity that defines the actual test and an ExemplarSet to run the test on. An optional index for the exemplars can be provided (Passed directly through to the exemplar set's [] operator, so its indexing rules apply.), but if omitted it runs for all. Return value is a boolean numpy array, indexed by the relative exemplar indices, that gives True if the exemplar passed the test, False if it failed.

genCodeC(self, name, exemplar_list) Provides C code for a generator. The return value is a 2-tuple, the first entry containing the code and the second entry <state>, the name of a state object to be used by the system. The state object has two public variables for use by the user - void * test and size_t length. The code itself will contain the definition of <state> and two functions: void <name>_init(<state> & state, PyObject * data, Exemplar * test_set) and bool <name>_next(<state> & state, PyObject * data, Exemplar * test_set). Usage consists of creating an instance of <state> and calling <name>_init on it, then repeatedly calling <name>_next - each time it returns true you can use the variables in <state> to get at the test, but when it returns false it is time to stop (At which point <state> will have been cleaned up.). If C code generation is not available then a NotImplementedError will be raised.

itertests(self, es, index, weights) Generates test entities that split the provided features into two sets, yielding them one at a time so that the caller can select the best according to the current goal. It is allowed to do whatever it likes to create these test entities, which provides an insane amount of customisation potential, if possibly rather too much choice. es is the exemplar set, whilst index gives the exemplars within the exemplar set for which it is creating a test. weights optionally provides a weight for each exemplar, aligned with es.

testCodeC(self, name, exemplar_list) Provides C code to perform the test - a C function that is given the test object, as a pointer to its first byte, and the index of the exemplar to test; it returns true or false depending on whether the exemplar passes the test. The returned string contains a function with the calling convention bool <name>(PyObject * data, void * test, size_t test_length, int exemplar). data is a python tuple, indexed by channel, containing the object to be fed to the exemplar access function. To construct this function it needs the return value of listCodeC for an ExemplarSet, so it can get the calling convention for accessing the channel. When compiled the various functions must be available.

SVMClassifyGen(Generator, Test)

Allows you to use the SVM library as a classifier for a node. Note that it detects if the SVM library is available - if not then this class will not exist. Be warned that it is quite memory intensive, as it just wraps the SVM objects without any clever packing. Works by randomly selecting a class to separate and training a one-vs-all classifier, with random parameters on random features. The parameters are quite involved, as the SVM options and the randomness controls are both extensive.

clone(self) Returns a (deep) copy of this object.

do(self, test, es, index = slice(None, None, None)) Does the test. Given the entity that defines the actual test and an ExemplarSet to run the test on. An optional index for the exemplars can be provided (Passed directly through to the exemplar set's [] operator, so its indexing rules apply.), but if omitted it runs for all. Return value is a boolean numpy array, indexed by the relative exemplar indices, that gives True if the exemplar passed the test, False if it failed.

genCodeC(self, name, exemplar_list) Provides C code for a generator. The return value is a 2-tuple, the first entry containing the code and the second entry <state>, the name of a state object to be used by the system. The state object has two public variables for use by the user - void * test and size_t length. The code itself will contain the definition of <state> and two functions: void <name>_init(<state> & state, PyObject * data, Exemplar * test_set) and bool <name>_next(<state> & state, PyObject * data, Exemplar * test_set). Usage consists of creating an instance of <state> and calling <name>_init on it, then repeatedly calling <name>_next - each time it returns true you can use the variables in <state> to get at the test, but when it returns false it is time to stop (At which point <state> will have been cleaned up.). If C code generation is not available then a NotImplementedError will be raised.

itertests(self, es, index, weights) Generates test entities that split the provided features into two sets, yielding them one at a time so that the caller can select the best according to the current goal. It is allowed to do whatever it likes to create these test entities, which provides an insane amount of customisation potential, if possibly rather too much choice. es is the exemplar set, whilst index gives the exemplars within the exemplar set for which it is creating a test. weights optionally provides a weight for each exemplar, aligned with es.

testCodeC(self, name, exemplar_list) Provides C code to perform the test - a C function that is given the test object, as a pointer to its first byte, and the index of the exemplar to test; it returns true or false depending on whether the exemplar passes the test. The returned string contains a function with the calling convention bool <name>(PyObject * data, void * test, size_t test_length, int exemplar). data is a python tuple, indexed by channel, containing the object to be fed to the exemplar access function. To construct this function it needs the return value of listCodeC for an ExemplarSet, so it can get the calling convention for accessing the channel. When compiled the various functions must be available.

Node()

Defines a node - these are the bread and butter of the system. Each decision tree is made out of nodes, each of which contains a binary test - if a feature vector passes the test then it travels to the true child node; if it fails it travels to the false child node (Note the lowercase names, to avoid a reserved word clash.). Eventually a leaf node is reached, where test==None, at which point the stats object is obtained, merged with the equivalent from all the other decision trees, and then provided as the answer to the user. Note that this python object uses the slots technique to keep it small - there will often be many thousands of these in a trained model.
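
A sketch of the traversal just described - how a single tree turns one exemplar into a stats object. The attribute names follow the description above (test, true, false, stats); gen is the generator, which provides the do method:

```python
def descend(node, gen, es, i):
    # Walk from the root until a leaf (test==None) is reached.
    while node.test is not None:
        passed = gen.do(node.test, es, slice(i, i+1))[0]
        node = node.true if passed else node.false
    return node.stats  # Merged with the other trees' leaves to form the answer.
```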

addTrain(self, goal, gen, es, index = slice(None, None, None), weights, code) This allows you to update the nodes with more data, as though it was used for training. The actual tests are not affected, only the statistics at each node - part of incremental learning. You can optionally provide code generated by the addTrainC method to give it go faster stripes.

addTrainC(?) Provides C code that the addTrain method can use to accelerate itself - the standard rule applies, that the code is unique to each combination of input types.

clone(self) Returns a deep copy of this node. Note that it only copies the nodes - test, stats and summary are all assumed to contain invariant entities that are always replaced, never edited.

error(self, goal, gen, es, index = slice(None, None, None), weights, inc = False, store, code) Once a tree is trained this method allows you to determine how good it is, using a test set, which would typically be its out-of-bag (oob) test set. Given a test set, possibly weighted, it will return the error rate, as defined by the goal. goal is the Goal object used for training, gen the Generator. Also supports incremental testing, where the information gleaned from the test set is stored such that new test exemplars can be added later. This is the inc variable - True to store this (potentially large) quantity of information, and update it if it already exists; False to not store it, saving memory but disallowing incremental testing. Note that the error rate will change as more training data, as well as more testing data, is added - you can call it with es==None to get an error score without adding more testing exemplars, assuming it has previously been called with inc==True. store is for internal use only. code can be provided via the relevant parameter, as generated by the errorC method, allowing a dramatic speedup.

errorC(?) Provides C code that can be used by the error method to go much faster. Makes use of a goal, a generator and an exemplar set, and the code will be unique for each keying combination of these. Will return None if C code generation is not supported for the particular combination.

evaluate(self, out, gen, es, index = slice(None, None, None), code) Given a set of exemplars, and possibly an index, this outputs the inferred stats entities. Requires the generator so it can apply the tests. The output goes into out, a list indexed by exemplar position. If code is set to a string generated by evaluateC it uses that, for speed.

evaluateC(?) For a given generator and exemplar set this returns the C code (Actually the support code.) that evaluate can use to accelerate its run time, or None if the various components involved do not support C code generation.

give_birth(self, goal, gen, pruner, es, index = slice(None, None, None), weights, depth = 0, entropy, code) This recursively grows the tree until the pruner says to stop. goal is a Goal object, so it knows what to optimise; gen is a Generator object that provides tests for it to choose between; and pruner is a Pruner object that decides when to stop growing. The exemplar set to train on is then provided, optionally with the indices of which members to use and weights to assign to them (weights align with the exemplar set, not with the relative exemplar indices defined by index.). depth is the depth of this node - part of the recursive construction and used by the pruner as a possible reason to stop growing. entropy should match up with self.stats. Code generated by the static method initC can be passed in via code to accelerate test selection.

grow(self, goal, gen, pruner, es, index = slice(None, None, None), weights, depth = 0, code) This is called on a tree that has already been grown - it recurses to the children and continues as though growth never stopped. This can be used to grow the tree further with a less strict pruner, or to grow the tree after further information has been added. code can be passed in as generated by the initC static method, and will be used to optimise test generation.

initC(?) Static method that generates the C code used to accelerate test selection; the result can be passed into give_birth and grow via their code parameters.

removeIncError(self) Culls the information for incremental testing from the data structure, either to reset it ready for new information or just to shrink the data structure after learning is finished.

size(self) Returns how many nodes this (sub-)tree consists of.