Caching execution results

CodeSnippet

You can download all the code on this page from the code snippets directory

Introduction

It is relatively common for nodes to process the same data several times. Usually this happens when training a long sequence of nodes using a fixed data set: to train the nodes at end of the sequence, the data has to be processed by all the preceding ones. This duplication of efforts may be costly, for example in image processing, when one needs to repeatedly filter the images (as in this example).

MDP offers a node extension that automatically caches the result of the execute method, which can boost the speed of an application considerably in such scenarios. The cache can be activated globally (i.e., for all node instances), for some node classes only, or for specific instances.

The caching mechanism is based on the library joblib, version 0.4.3 or higher.

Activating the caching extension

It is possible to activate the caching extension as for regular extension using the extension name 'cache_execute'. By default, the cached results will be stored in a database created in a temporary directory for the duration of the Python session. To change the caching directory, which may be useful to create a permanent cache over multiple sessions, one can call the function mdp.caching.set_cachedir.

We will illustrate the caching extension using a simple but relatively large Principal Component Analysis problem:

>>> # set up a relatively large PCA run
>>> import mdp
>>> import numpy as np
>>> from timeit import Timer
>>> x = np.random.rand(3000,1000)
>>> # create a PCANode and train it using the random data in 'x'
>>> pca_node = mdp.nodes.PCANode()
>>> pca_node.train(x)
>>> pca_node.stop_training()

The time for projecting the data x on the principal components drops dramatically after the caching extension is activated:

>>> # we will use this timer to measure the speed of 'pca_node.execute'
>>> timer = Timer("pca_node.execute(x)", "from __main__ import pca_node, x")
>>> mdp.caching.set_cachedir("/tmp/my_cache")
>>> mdp.activate_extension("cache_execute")
>>> # all calls to the 'execute' method will now be cached in 'my_cache'
>>> # the first time execute is called, the method is run
>>> # and the result is cached
>>> print timer.repeat(1, 1)[0], 'sec' 
1.188946008682251 sec
>>> # the second time, the result is retrieved from the cache
>>> print timer.repeat(1, 1)[0], 'sec' 
0.112375974655 sec
>>> mdp.deactivate_extension("cache_execute")
>>> # when the cache extension is deactivated, the 'execute' method is
>>> # called as usual
>>> print timer.repeat(1, 1)[0], 'sec' 
0.801102161407 sec

Alternative ways to activate the caching extension, which also expose more functionalities, can be found in the mdp.caching module. The functions activate_caching and deactivate_caching allow activating the cache only on certain Node classes, or specific instances. For example, the following line starts the cache extension, caching only instances of the classes SFANode and FDANode, and the instance pca_node.

>>> mdp.caching.activate_caching(cachedir='/tmp/my_cache',
...     cache_classes=[mdp.nodes.SFANode, mdp.nodes.FDANode],
...     cache_instances=[pca_node])
>>> # all calls to the 'execute' method of instances of 'SFANode' and
>>> # 'FDANode', and of 'pca_node' will now be cached in 'my_cache'
>>> mdp.caching.deactivate_caching()

Make sure to call the deactivate_caching method before the end of the session, or the cache directory may remain in a broken state.

Finally, the module mdp.caching also defines a context manager that closes the cache properly at the end of the block:

>>> with mdp.caching.cache(cachedir='/tmp/my_cache', cache_instances=[pca_node]):
...     # in the block, the cache is active
...     print timer.repeat(1, 1)[0], 'sec' 
...
0.101263999939 sec
>>> # at the end of the block, the cache is deactivated
>>> print timer.repeat(1, 1)[0], 'sec' 
0.801436901093 sec