Nodes

You can download all the code on this page from the code snippets directory.

A node is the basic building block of an MDP application. It represents a data processing element, for example a learning algorithm, a data filter, or a visualization step (see the Node List section for an exhaustive list and references).

Each node can have one or more training phases, during which the internal structures are learned from training data (e.g. the weights of a neural network are adapted, or a covariance matrix is estimated), and an execution phase, where new data can be processed forwards (by applying the transformation computed by the node) or backwards (by applying the inverse of that transformation, if it is defined).

Nodes have been designed to be applied to arbitrarily long sets of data; provided the underlying algorithms support it, the internal structures can be updated incrementally by sending multiple batches of data (this is equivalent to online learning if the chunks consist of single observations, or to batch learning if the whole data set is sent in a single chunk). This makes it possible to perform computations on large amounts of data that would not fit into memory and to generate data on-the-fly.

The Node class also defines some utility methods, for example copy, which returns an exact duplicate of a node, and save, which dumps the node to a file. Additional methods may be present, depending on the algorithm.
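For example, once a node exists (instantiation is covered next), it can be duplicated or pickled to disk. A minimal sketch; the file name is arbitrary:

>>> node = mdp.nodes.PCANode()
>>> node_copy = node.copy()      # an independent duplicate of the node
>>> node.save('pca_node.pkl')    # pickle the node to the given file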

Node Instantiation

A node can be obtained by creating an instance of the Node class.

Each node is characterized by an input dimension (i.e., the dimensionality of the input vectors), an output dimension, and a dtype, which determines the numerical type of the internal structures and of the output signal. By default, these attributes are inherited from the input data if left unspecified. The constructor of each node class can require other task-specific arguments. The full documentation is always available in the doc-string of the node’s class.

Some examples of node instantiation:

Create a node that performs Principal Component Analysis (PCA) whose input dimension and dtype are inherited from the input data during training. The output dimension defaults to the input dimension.

>>> pcanode1 = mdp.nodes.PCANode()
>>> pcanode1
PCANode(input_dim=None, output_dim=None, dtype=None)

Setting output_dim = 10 means that the node will keep only the first 10 principal components of the input.

>>> pcanode2 = mdp.nodes.PCANode(output_dim=10)
>>> pcanode2
PCANode(input_dim=None, output_dim=10, dtype=None)

The output dimensionality can also be specified in terms of the explained variance. If we want to keep the number of principal components which can account for 80% of the input variance, we set

>>> pcanode3 = mdp.nodes.PCANode(output_dim=0.8)
>>> pcanode3.desired_variance
0.8

If dtype is set to float32 (32-bit float), the input data is cast to single precision when received and the internal structures are also stored as float32. dtype influences the memory space necessary for a node and the precision with which the computations are performed.

>>> pcanode4 = mdp.nodes.PCANode(dtype='float32')
>>> pcanode4
PCANode(input_dim=None, output_dim=None, dtype='float32')

You can obtain a list of the numerical types supported by a node by looking at its supported_dtypes property:

>>> pcanode4.supported_dtypes             
[dtype('float32'), dtype('float64')...]

This attribute is a list of numpy.dtype objects.

A PolynomialExpansionNode expands its input in the space of polynomials of a given degree by computing all monomials up to the specified degree. Its constructor needs as first argument the degree of the polynomials space (3 in this case):

>>> expnode = mdp.nodes.PolynomialExpansionNode(3)
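Expanding two variables up to degree 3, for instance, yields the nine monomials x1, x2, x1^2, x1*x2, x2^2, x1^3, x1^2*x2, x1*x2^2, x2^3. A quick check (jumping ahead to node execution; the ordering of the monomials is an implementation detail):

>>> tmp = mdp.nodes.PolynomialExpansionNode(3)
>>> tmp.execute(np.array([[1., 2.]])).shape
(1, 9)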

Node Training

Some nodes need to be trained to perform their task. For example, the Principal Component Analysis (PCA) algorithm requires the computation of the mean and covariance matrix of a set of training data from which the principal eigenvectors of the data distribution are estimated.

This is done during a training phase by calling the train method. MDP supports both supervised and unsupervised training, as well as algorithms with multiple training phases.

Some examples of node training:

Create some random data to train the node

>>> x = np.random.random((100, 25))  # 25 variables, 100 observations

Analyze the batch of data x and update the estimates of the mean and covariance matrix:

>>> pcanode1.train(x)

At this point the input dimension and the dtype have been inherited from x

>>> pcanode1
PCANode(input_dim=25, output_dim=None, dtype='float64')

We can train our node with more than one chunk of data. This is especially useful when the input data is too long to be stored in memory or when it has to be created on-the-fly. (See also the Iterables section)

>>> for i in range(100):
...     x = np.random.random((100, 25))
...     pcanode1.train(x)

Some nodes don’t need to be trained, or cannot be trained at all

>>> expnode.is_trainable()
False

Trying to train them anyway would raise an IsNotTrainableException.
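A quick check (the exception class lives in the top-level mdp namespace):

>>> try:
...     expnode.train(x)
... except mdp.IsNotTrainableException:
...     print('expnode does not need training')
expnode does not need training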

The training phase ends when the stop_training, execute, or inverse methods (or certain other node-specific methods) are called. For example, we can finalize the PCA algorithm by computing and selecting the principal eigenvectors:

>>> pcanode1.stop_training()

If the PCANode was declared to keep a number of output components that depends on the variance to be explained, after training we can check the resulting number of output components and the variance they actually explain:

>>> pcanode3.train(x)
>>> pcanode3.stop_training()
>>> pcanode3.output_dim 
16
>>> pcanode3.explained_variance 
0.85261144755506446

It is now possible to access the trained internal data. In general, a list of the interesting internal attributes can be found in the class documentation.

>>> avg = pcanode1.avg            # mean of the input data
>>> v = pcanode1.get_projmatrix() # projection matrix

Some nodes, namely those implementing supervised algorithms, e.g. Fisher Discriminant Analysis (FDA), need labels or other supervised signals to be passed during training. Detailed information about the signature of the train method can be found in its doc-string.

>>> fdanode = mdp.nodes.FDANode()
>>> for label in ['a', 'b', 'c']:
...     x = np.random.random((100, 25))
...     fdanode.train(x, label)

A node can also require multiple training phases. For example, the training of fdanode is not complete yet, since it has two training phases: the first computes the mean of the data conditioned on the labels, and the second computes the overall and within-class covariance matrices and solves the FDA problem. The first phase must be stopped before the second can be trained:

>>> fdanode.stop_training()
>>> for label in ['a', 'b', 'c']:
...     x = np.random.random((100, 25))
...     fdanode.train(x, label)

The easiest way to train multiple phase nodes is using flows, which automatically handle multiple phases (see the Flows section).

Node Execution

Once the training is finished, it is possible to execute the node:

The input data is projected onto the principal components learned in the training phase

>>> x = np.random.random((100, 25))
>>> y_pca = pcanode1.execute(x)

Calling a node instance is equivalent to executing it

>>> y_pca = pcanode1(x)

The input data is expanded in the space of polynomials of degree 3

>>> x = np.random.random((100, 5))
>>> y_exp = expnode(x)

The input data is projected onto the directions learned by FDA

>>> x = np.random.random((100, 25))
>>> y_fda = fdanode(x)

Some nodes may allow for optional arguments in the execute method. As always the complete information can be found in the doc-string.

Node Inversion

If the operation computed by the node is invertible, the node can also be executed backwards, thus computing the inverse transformation:

In the case of PCA, for example, this corresponds to projecting a vector in the principal components space back to the original data space

>>> pcanode1.is_invertible()
True
>>> x = pcanode1.inverse(y_pca)
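Since pcanode1 keeps all 25 principal components, executing and then inverting reconstructs the data up to numerical precision. A quick consistency check:

>>> print(np.allclose(pcanode1.inverse(pcanode1(x)), x))
True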

The expansion node is not invertible

>>> expnode.is_invertible()
False

Trying to compute the inverse would raise an IsNotInvertibleException.
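Again, a quick check, mirroring the training example above:

>>> try:
...     expnode.inverse(y_exp)
... except mdp.IsNotInvertibleException:
...     print('expnode cannot be inverted')
expnode cannot be inverted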

Writing your own nodes: subclassing Node

MDP tries to make it easy to write new nodes that interface with the existing data processing elements.

The Node class is designed to make the implementation of new algorithms easy and intuitive. This base class takes care of setting the input and output dimensions and of casting the data to match the numerical type (e.g. float or double) of the internal variables, and it offers utility methods that can be used by the developer.

To expand the MDP library of implemented nodes with user-made nodes, it is sufficient to subclass Node, overriding some of the methods according to the algorithm one wants to implement, typically the _train, _stop_training, and _execute methods.

In its namespace MDP offers references to the main modules numpy or scipy, and to the subpackages linalg, random, and fft, as mdp.numx, mdp.numx_linalg, mdp.numx_rand, and mdp.numx_fft. This indirection exists to make it possible to support alternative numerical extensions in the future. It is therefore recommended to refer to the numerical extension through the MDP aliases mdp.numx, mdp.numx_linalg, mdp.numx_fft, and mdp.numx_rand when writing Node subclasses; this ensures that your nodes can be used without modification should MDP switch to an alternative numerical backend.

We’ll illustrate all this with some toy examples.

We start by defining a node that multiplies its input by 2.

Define the class as a subclass of Node:

>>> class TimesTwoNode(mdp.Node):

This node cannot be trained. To specify this, one has to overwrite the is_trainable method to return False:

...     def is_trainable(self):
...         return False

Execute only needs to multiply x by 2:

...     def _execute(self, x):
...         return 2*x

Note that the execute method, which is inherited from the Node parent class and should never be overwritten, performs some tests before calling the user-supplied _execute method: for example, it makes sure that x has the right rank and dimensionality, and casts it to the right dtype. Each subclass has to handle the dtype defined by the user or inherited from the input data, and make sure that internal structures are stored consistently. To help with this, the Node base class provides the method _refcast(array), which casts the input array only when its dtype differs from the Node instance's dtype.

The inverse of the multiplication by 2 is of course the division by 2

...     def _inverse(self, y):
...         return y/2

Test the new node

>>> class TimesTwoNode(mdp.Node):
...      def is_trainable(self):
...          return False
...      def _execute(self, x):
...          return 2*x
...      def _inverse(self, y):
...          return y/2
>>> node = TimesTwoNode(dtype='float32')
>>> x = mdp.numx.array([[1.0, 2.0, 3.0]])
>>> y = node(x)
>>> print(x, '* 2 =  ', y)
[[ 1.  2.  3.]] * 2 =   [[ 2.  4.  6.]]
>>> print(y, '/ 2 =', node.inverse(y))
[[ 2.  4.  6.]] / 2 = [[ 1.  2.  3.]]
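Because node was instantiated with dtype='float32', execute casts the double-precision input down to single precision on the fly. A quick check (mdp.numx arrays are double precision by default):

>>> x.dtype
dtype('float64')
>>> node(x).dtype
dtype('float32')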

We then define a node that raises the input to the power specified in the initialiser:

>>> class PowerNode(mdp.Node):

We redefine the __init__ method to take the power as its first argument. In general one should always give the user the possibility to set the dtype and the input dimensions; the default value None means that the exact value will be inherited from the input data:

...     def __init__(self, power, input_dim=None, dtype=None):

Initialize the parent class:

...         super(PowerNode, self).__init__(input_dim=input_dim, dtype=dtype)

Store the power:

...         self.power = power

PowerNode is not trainable:

...     def is_trainable(self):
...         return False

nor invertible:

...     def is_invertible(self):
...         return False

It is possible to overwrite the method _get_supported_dtypes to return the list of dtypes supported by the node:

...     def _get_supported_dtypes(self):
...         return ['float32', 'float64']

The supported types can be specified in any format allowed by the numpy.dtype constructor. The interface method get_supported_dtypes converts them and sets the property supported_dtypes, which is a list of numpy.dtype objects.

The _execute method:

...     def _execute(self, x):
...         return self._refcast(x**self.power)

Test the new node

>>> class PowerNode(mdp.Node):
...     def __init__(self, power, input_dim=None, dtype=None):
...         super(PowerNode, self).__init__(input_dim=input_dim, dtype=dtype)
...         self.power = power
...     def is_trainable(self):
...         return False
...     def is_invertible(self):
...         return False
...     def _get_supported_dtypes(self):
...         return ['float32', 'float64']
...     def _execute(self, x):
...         return self._refcast(x**self.power)
>>> node = PowerNode(3)
>>> x = mdp.numx.array([[1.0, 2.0, 3.0]])
>>> y = node(x)
>>> print(x, '**', node.power, '=', node(x))
[[ 1.  2.  3.]] ** 3 = [[  1.   8.  27.]]
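Requesting a dtype outside this list is rejected at instantiation. A small check (the exact exception message is an implementation detail):

>>> try:
...     node = PowerNode(3, dtype='int32')
... except mdp.NodeException:
...     print("'int32' is not supported by PowerNode")
'int32' is not supported by PowerNode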

We now define a node that needs to be trained. The MeanFreeNode computes the mean of its training data and subtracts it from the input during execution:

>>> class MeanFreeNode(mdp.Node):
...     def __init__(self, input_dim=None, dtype=None):
...         super(MeanFreeNode, self).__init__(input_dim=input_dim,
...                                            dtype=dtype)

We store the mean of the input data in an attribute. We initialize it to None since we don’t yet know how large an input vector will be:

...         self.avg = None

Same for the number of training points:

...         self.tlen = 0

The subclass only needs to overwrite the _train method, which will be called by the parent train after some testing and casting has been done:

...     def _train(self, x):
...         # Initialize the mean vector with the right
...         # size and dtype if necessary:
...         if self.avg is None:
...             self.avg = mdp.numx.zeros(self.input_dim,
...                                       dtype=self.dtype)

Update the mean with the sum of the new data:

...         self.avg += mdp.numx.sum(x, axis=0)

Count the number of points processed:

...         self.tlen += x.shape[0]

Note that the train method can have further arguments, which might be useful to implement algorithms that require supervised learning. For example, if you want to define a node that performs some form of classification you can define a _train(self, data, labels) method. The parent train checks data and takes care to pass the labels on (cf. for example mdp.nodes.FDANode).

The _stop_training function is called by the parent stop_training method when the training phase is over. We divide the sum of the training data by the number of training vectors to obtain the mean:

...     def _stop_training(self):
...         self.avg /= self.tlen
...         if self.output_dim is None:
...             self.output_dim = self.input_dim

Note that input_dim is set automatically by the train method, and we want to ensure that the node has output_dim set after training. For nodes that do not need training, this setting is performed automatically upon execution. The _execute and _inverse methods:

...     def _execute(self, x):
...         return x - self.avg
...     def _inverse(self, y):
...         return y + self.avg

Test the new node

>>> class MeanFreeNode(mdp.Node):
...     def __init__(self, input_dim=None, dtype=None):
...         super(MeanFreeNode, self).__init__(input_dim=input_dim,
...                                            dtype=dtype)
...         self.avg = None
...         self.tlen = 0
...     def _train(self, x):
...         # Initialize the mean vector with the right
...         # size and dtype if necessary:
...         if self.avg is None:
...             self.avg = mdp.numx.zeros(self.input_dim,
...                                       dtype=self.dtype)
...         self.avg += mdp.numx.sum(x, axis=0)
...         self.tlen += x.shape[0]
...     def _stop_training(self):
...         self.avg /= self.tlen
...         if self.output_dim is None:
...             self.output_dim = self.input_dim
...     def _execute(self, x):
...         return x - self.avg
...     def _inverse(self, y):
...         return y + self.avg
>>> node = MeanFreeNode()
>>> x = np.random.random((10,4))
>>> node.train(x)
>>> y = node(x)
>>> print('Mean of y (should be zero):\n', np.abs(np.around(np.mean(y, 0), 15)), sep='')
Mean of y (should be zero):
[ 0.  0.  0.  0.]
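Since MeanFreeNode also defines _inverse, the transformation can be undone. A quick consistency check:

>>> print(np.allclose(node.inverse(y), x))
True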

It is also possible to define nodes with multiple training phases. In such a case, calling the train and stop_training functions multiple times executes successive training phases (this kind of node is much easier to train using Flows). Here we’ll define a node that returns a mean-free, unit-variance signal. We define two training phases: first we compute the mean of the signal, and next we sum the squared, mean-free input to compute the standard deviation (of course it is possible to solve this problem in a single step; remember, this is just a toy example).

>>> class UnitVarianceNode(mdp.Node):
...     def __init__(self, input_dim=None, dtype=None):
...         super(UnitVarianceNode, self).__init__(input_dim=input_dim,
...                                                dtype=dtype)
...         self.avg = None # average
...         self.std = None # standard deviation
...         self.tlen = 0

The training sequence is defined by the method _get_train_seq, which returns a list of tuples, one for each training phase; each tuple contains references to the corresponding training and stop-training methods. The default return value is [(_train, _stop_training)], which explains the standard behavior illustrated above. We overwrite the method to return the list of our training/stop-training methods:

...     def _get_train_seq(self):
...         return [(self._train_mean, self._stop_mean),
...                 (self._train_std, self._stop_std)]

Next we define the training methods. The first phase is identical to the one in the previous example:

...     def _train_mean(self, x):
...         if self.avg is None:
...             self.avg = mdp.numx.zeros(self.input_dim,
...                                       dtype=self.dtype)
...         self.avg += mdp.numx.sum(x, 0)
...         self.tlen += x.shape[0]
...     def _stop_mean(self):
...         self.avg /= self.tlen

The second phase is only marginally different and needs little additional explanation:

...     def _train_std(self, x):
...         if self.std is None:
...             self.tlen = 0
...             self.std = mdp.numx.zeros(self.input_dim,
...                                       dtype=self.dtype)
...         self.std += mdp.numx.sum((x - self.avg)**2., 0)
...         self.tlen += x.shape[0]
...     def _stop_std(self):
...         # compute the standard deviation
...         self.std = mdp.numx.sqrt(self.std/(self.tlen-1))

The _execute and _inverse methods are not surprising, either:

...     def _execute(self, x):
...         return (x - self.avg)/self.std
...     def _inverse(self, y):
...         return y*self.std + self.avg

Test the new node

>>> class UnitVarianceNode(mdp.Node):
...     def __init__(self, input_dim=None, dtype=None):
...         super(UnitVarianceNode, self).__init__(input_dim=input_dim,
...                                                 dtype=dtype)
...         self.avg = None # average
...         self.std = None # standard deviation
...         self.tlen = 0
...     def _get_train_seq(self):
...         return [(self._train_mean, self._stop_mean),
...                 (self._train_std, self._stop_std)]
...     def _train_mean(self, x):
...         if self.avg is None:
...             self.avg = mdp.numx.zeros(self.input_dim,
...                                       dtype=self.dtype)
...         self.avg += mdp.numx.sum(x, 0)
...         self.tlen += x.shape[0]
...     def _stop_mean(self):
...         self.avg /= self.tlen
...     def _train_std(self, x):
...         if self.std is None:
...             self.tlen = 0
...             self.std = mdp.numx.zeros(self.input_dim,
...                                       dtype=self.dtype)
...         self.std += mdp.numx.sum((x - self.avg)**2., 0)
...         self.tlen += x.shape[0]
...     def _stop_std(self):
...         # compute the standard deviation
...         self.std = mdp.numx.sqrt(self.std/(self.tlen-1))
...     def _execute(self, x):
...         return (x - self.avg)/self.std
...     def _inverse(self, y):
...         return y*self.std + self.avg
>>> node = UnitVarianceNode()
>>> x = np.random.random((10,4))
>>> for phase in range(2):  # loop over the two training phases
...     node.train(x)
...     node.stop_training()
>>> y = node(x)  # execute the trained node
>>> print('Standard deviation of y (should be one): ', mdp.numx.std(y, axis=0, ddof=1))
Standard deviation of y (should be one):  [ 1.  1.  1.  1.]
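After both phases have been stopped, the node reports that training is over. A quick check using the standard Node interface:

>>> node.is_training()
False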

In our last example we’ll define a node that returns two copies of its input. The output is going to have twice as many dimensions.

>>> class TwiceNode(mdp.Node):
...     def is_trainable(self): return False
...     def is_invertible(self): return False

When Node inherits the input dimension, output dimension, and dtype from the input data, it calls the methods set_input_dim, set_output_dim, and set_dtype. These are the setters of the Python properties input_dim, output_dim, and dtype. If a subclass needs to change the default behavior, the internal methods _set_input_dim, _set_output_dim, and _set_dtype can be overwritten; the property setters call them after some basic testing and internal bookkeeping. These private methods are responsible for setting the private attributes _input_dim, _output_dim, and _dtype, which contain the actual values.

Here we overwrite _set_input_dim to automatically set the output dimension to be twice the input one, and _set_output_dim to raise an exception, since the output dimension should not be set explicitly.

...     def _set_input_dim(self, n):
...         self._input_dim = n
...         self._output_dim = 2*n
...     def _set_output_dim(self, n):
...         raise mdp.NodeException("Output dim can not be set explicitly!")

The _execute method:

...     def _execute(self, x):
...         return mdp.numx.concatenate((x, x), 1)

Test the new node

>>> class TwiceNode(mdp.Node):
...     def is_trainable(self): return False
...     def is_invertible(self): return False
...     def _set_input_dim(self, n):
...         self._input_dim = n
...         self._output_dim = 2*n
...     def _set_output_dim(self, n):
...         raise mdp.NodeException("Output dim can not be set explicitly!")
...     def _execute(self, x):
...         return mdp.numx.concatenate((x, x), 1)
>>> node = TwiceNode()
>>> x = mdp.numx.zeros((5,2))
>>> x
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])
>>> node.execute(x)
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])
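Setting the output dimension by hand now raises the exception we defined above. A quick check on a fresh instance (we catch the generic mdp.NodeException):

>>> node2 = TwiceNode()
>>> try:
...     node2.output_dim = 4
... except mdp.NodeException:
...     print('output dim can not be set explicitly')
output dim can not be set explicitly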