Package mdp :: Package nodes :: Class CountVectorizerScikitsLearnNode

Class CountVectorizerScikitsLearnNode


Convert a collection of raw documents to a matrix of token counts. This node has been automatically generated by wrapping the scikits.learn.feature_extraction.text.CountVectorizer class from the sklearn library. The wrapped instance can be accessed through the scikits_alg attribute. This implementation produces a sparse representation of the counts using scipy.sparse.coo_matrix.

If you do not provide an a priori dictionary and you do not use an analyzer that performs some kind of feature selection, then the number of features (the vocabulary size found by analysing the data) might be very large and the count vectors might not fit in memory.

In this case it is recommended to use either the sparse.CountVectorizer variant of this class or a HashingVectorizer, which reduces the dimensionality to an arbitrary number by using random projection.
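The coo_matrix representation mentioned above stores only the nonzero counts as parallel (row, column, value) arrays. A minimal pure-Python sketch of that layout (scipy's coo_matrix constructor accepts the same three arrays):

```python
# Sketch of the COO (coordinate) sparse layout used for the count matrix.
# A dense matrix of token counts ...
dense = [
    [0, 2, 0],
    [1, 0, 3],
]

# ... is stored as three parallel lists holding only the nonzero entries.
rows, cols, data = [], [], []
for i, row in enumerate(dense):
    for j, value in enumerate(row):
        if value != 0:
            rows.append(i)
            cols.append(j)
            data.append(value)

print(rows)  # [0, 1, 1]
print(cols)  # [1, 0, 2]
print(data)  # [2, 1, 3]
```

For a vocabulary of tens of thousands of terms where each document contains only a few dozen distinct tokens, this layout avoids materialising the overwhelmingly zero dense matrix.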

Parameters

analyzer: WordNGramAnalyzer or CharNGramAnalyzer, optional

vocabulary: dict, optional

A dictionary where keys are tokens and values are indices in the matrix.

This is useful in order to fix the vocabulary in advance.

max_df : float in range [0.0, 1.0], optional, 1.0 by default

When building the vocabulary, ignore terms that have a term frequency strictly higher than the given threshold (corpus-specific stop words).

This parameter is ignored if vocabulary is not None.

max_features : optional, None by default

If not None, build a vocabulary that considers only the top max_features terms ordered by term frequency across the corpus.

This parameter is ignored if vocabulary is not None.

dtype: type, optional
Type of the matrix returned by fit_transform() or transform().
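As an illustration of what the wrapped vectorizer computes, the following pure-Python sketch builds a vocabulary from a corpus and produces a dense count matrix, honouring a fixed vocabulary dict when one is supplied. This is a toy counterpart only (whitespace tokenization assumed), not the node's actual implementation, which delegates to the wrapped scikits_alg instance:

```python
from collections import Counter

def count_vectorize(raw_documents, vocabulary=None):
    """Toy counterpart of CountVectorizer: whitespace tokenization,
    vocabulary learned from the corpus unless one is supplied."""
    tokenized = [doc.lower().split() for doc in raw_documents]
    if vocabulary is None:
        # Learn token -> column index, in sorted order for determinism.
        terms = sorted({tok for doc in tokenized for tok in doc})
        vocabulary = {term: idx for idx, term in enumerate(terms)}
    matrix = []
    for doc in tokenized:
        # Count only tokens that are in the vocabulary.
        counts = Counter(tok for tok in doc if tok in vocabulary)
        row = [0] * len(vocabulary)
        for term, n in counts.items():
            row[vocabulary[term]] = n
        matrix.append(row)
    return matrix, vocabulary

docs = ["the cat sat", "the cat ate the fish"]
matrix, vocab = count_vectorize(docs)
print(sorted(vocab))  # ['ate', 'cat', 'fish', 'sat', 'the']
print(matrix[1])      # [1, 1, 1, 0, 2]
```

Passing a vocabulary dict (token to column index, as described above) skips the learning step and fixes the columns in advance.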
Instance Methods
 
__init__(self, input_dim=None, output_dim=None, dtype=None, **kwargs)
Convert a collection of raw documents to a matrix of token counts. This node has been automatically generated by wrapping the scikits.learn.feature_extraction.text.CountVectorizer class from the sklearn library. The wrapped instance can be accessed through the scikits_alg attribute. This implementation produces a sparse representation of the counts using scipy.sparse.coo_matrix.
 
_execute(self, x)
 
_get_supported_dtypes(self)
Return the list of dtypes supported by this node. The types can be specified in any format allowed by numpy.dtype.
 
_stop_training(self, **kwargs)
Concatenate the collected data in a single array.
 
execute(self, x)
Extract token counts out of raw text documents. This node has been automatically generated by wrapping the scikits.learn.feature_extraction.text.CountVectorizer class from the sklearn library. The wrapped instance can be accessed through the scikits_alg attribute.
 
stop_training(self, **kwargs)
Learn a vocabulary dictionary of all tokens in the raw documents. This node has been automatically generated by wrapping the scikits.learn.feature_extraction.text.CountVectorizer class from the sklearn library. The wrapped instance can be accessed through the scikits_alg attribute.

Inherited from unreachable.newobject: __long__, __native__, __nonzero__, __unicode__, next

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __sizeof__, __subclasshook__

    Inherited from Cumulator
 
_train(self, *args)
Collect all input data in a list.
 
train(self, *args)
Collect all input data in a list.
    Inherited from Node
 
__add__(self, other)
 
__call__(self, x, *args, **kwargs)
Calling an instance of Node is equivalent to calling its execute method.
 
__repr__(self)
repr(x)
 
__str__(self)
str(x)
 
_check_input(self, x)
 
_check_output(self, y)
 
_check_train_args(self, x, *args, **kwargs)
 
_get_train_seq(self)
 
_if_training_stop_training(self)
 
_inverse(self, x)
 
_pre_execution_checks(self, x)
This method contains all pre-execution checks.
 
_pre_inversion_checks(self, y)
This method contains all pre-inversion checks.
 
_refcast(self, x)
Helper function to cast arrays to the internal dtype.
 
_set_dtype(self, t)
 
_set_input_dim(self, n)
 
_set_output_dim(self, n)
 
copy(self, protocol=None)
Return a deep copy of the node.
 
get_current_train_phase(self)
Return the index of the current training phase.
 
get_dtype(self)
Return dtype.
 
get_input_dim(self)
Return input dimensions.
 
get_output_dim(self)
Return output dimensions.
 
get_remaining_train_phase(self)
Return the number of training phases still to accomplish.
 
get_supported_dtypes(self)
Return dtypes supported by the node as a list of numpy.dtype objects.
 
has_multiple_training_phases(self)
Return True if the node has multiple training phases.
 
inverse(self, y, *args, **kwargs)
Invert y.
 
is_training(self)
Return True if the node is in the training phase, False otherwise.
 
save(self, filename, protocol=-1)
Save a pickled serialization of the node to filename. If filename is None, return a string.
 
set_dtype(self, t)
Set internal structures' dtype.
 
set_input_dim(self, n)
Set input dimensions.
 
set_output_dim(self, n)
Set output dimensions.
Static Methods
 
is_invertible()
Return True if the node can be inverted, False otherwise.
 
is_trainable()
Return True if the node can be trained, False otherwise.
Properties

Inherited from object: __class__

    Inherited from Node
  _train_seq
List of tuples:
  dtype
dtype
  input_dim
Input dimensions
  output_dim
Output dimensions
  supported_dtypes
Supported dtypes
Method Details

__init__(self, input_dim=None, output_dim=None, dtype=None, **kwargs)
(Constructor)

 

Convert a collection of raw documents to a matrix of token counts. This node has been automatically generated by wrapping the scikits.learn.feature_extraction.text.CountVectorizer class from the sklearn library. The wrapped instance can be accessed through the scikits_alg attribute. This implementation produces a sparse representation of the counts using scipy.sparse.coo_matrix.

If you do not provide an a priori dictionary and you do not use an analyzer that performs some kind of feature selection, then the number of features (the vocabulary size found by analysing the data) might be very large and the count vectors might not fit in memory.

In this case it is recommended to use either the sparse.CountVectorizer variant of this class or a HashingVectorizer, which reduces the dimensionality to an arbitrary number by using random projection.

Parameters

analyzer: WordNGramAnalyzer or CharNGramAnalyzer, optional

vocabulary: dict, optional

A dictionary where keys are tokens and values are indices in the matrix.

This is useful in order to fix the vocabulary in advance.

max_df : float in range [0.0, 1.0], optional, 1.0 by default

When building the vocabulary, ignore terms that have a term frequency strictly higher than the given threshold (corpus-specific stop words).

This parameter is ignored if vocabulary is not None.

max_features : optional, None by default

If not None, build a vocabulary that considers only the top max_features terms ordered by term frequency across the corpus.

This parameter is ignored if vocabulary is not None.

dtype: type, optional
Type of the matrix returned by fit_transform() or transform().
Overrides: object.__init__

_execute(self, x)

 
Overrides: Node._execute

_get_supported_dtypes(self)

 
Return the list of dtypes supported by this node. The types can be specified in any format allowed by numpy.dtype.
Returns: list
The list of dtypes supported by this node.
Overrides: Node._get_supported_dtypes

_stop_training(self, **kwargs)

 
Concatenate the collected data in a single array.
Overrides: Node._stop_training

execute(self, x)

 

Extract token counts out of raw text documents. This node has been automatically generated by wrapping the scikits.learn.feature_extraction.text.CountVectorizer class from the sklearn library. The wrapped instance can be accessed through the scikits_alg attribute.

Parameters

raw_documents: iterable
an iterable which yields either str, unicode or file objects

Returns

vectors: array, [n_samples, n_features]

Overrides: Node.execute
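After training, execute maps new raw documents onto the already-fixed vocabulary; tokens outside the vocabulary contribute nothing. A toy illustration of that transform step (whitespace tokenization and the vocabulary dict are assumptions of this sketch, not the wrapped analyzer's behaviour):

```python
from collections import Counter

def transform(raw_documents, vocabulary):
    """Count occurrences of known vocabulary terms in each document;
    out-of-vocabulary tokens are ignored."""
    matrix = []
    for doc in raw_documents:
        counts = Counter(doc.lower().split())
        # Emit columns in vocabulary-index order.
        matrix.append([counts.get(term, 0) for term in
                       sorted(vocabulary, key=vocabulary.get)])
    return matrix

vocab = {"cat": 0, "dog": 1, "fish": 2}   # hypothetical fixed vocabulary
print(transform(["the cat chased the dog", "fish fish"], vocab))
# [[1, 1, 0], [0, 0, 2]]
```

Note how "the" and "chased" vanish from the first row: only terms present in the training-time vocabulary receive a column.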

is_invertible()
Static Method

 
Return True if the node can be inverted, False otherwise.
Overrides: Node.is_invertible
(inherited documentation)

is_trainable()
Static Method

 
Return True if the node can be trained, False otherwise.
Returns: bool
A boolean indicating whether the node can be trained.
Overrides: Node.is_trainable

stop_training(self, **kwargs)

 

Learn a vocabulary dictionary of all tokens in the raw documents. This node has been automatically generated by wrapping the scikits.learn.feature_extraction.text.CountVectorizer class from the sklearn library. The wrapped instance can be accessed through the scikits_alg attribute.

Parameters

raw_documents: iterable
an iterable which yields either str, unicode or file objects

Returns

self

Overrides: Node.stop_training
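stop_training is the point at which the vocabulary is actually fixed. The pruning described under max_df and max_features can be sketched in pure Python; this is illustrative only (pre-tokenized input assumed), since the real work happens inside the wrapped scikits_alg instance:

```python
from collections import Counter

def learn_vocabulary(tokenized_docs, max_df=1.0, max_features=None):
    """Toy vocabulary learner: drop terms whose document frequency
    exceeds max_df, then keep the max_features most frequent terms."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()   # in how many documents each term occurs
    term_freq = Counter()  # total occurrences across the corpus
    for doc in tokenized_docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))
    # max_df: ignore corpus-specific stop words.
    kept = [t for t in term_freq if doc_freq[t] / n_docs <= max_df]
    # max_features: keep only the most frequent terms overall.
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return {term: idx for idx, term in enumerate(sorted(kept))}

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "ate"]]
print(learn_vocabulary(docs, max_df=0.9))
# 'the' appears in every document (df = 1.0 > 0.9), so it is dropped.
print(learn_vocabulary(docs, max_features=2))
```

Both filters are skipped when a vocabulary dict is passed to the constructor, matching the "ignored if vocabulary is not None" notes above.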