CountVectorizer

Extracts the vocabulary from a given collection of documents and generates a vector of token counts for each document.

Input

It takes in a DataFrame as input and transforms it to another DataFrame

Output

It adds a new column to the incoming DataFrame containing the vector of token counts in the input column, to generate the output DataFrame

Type

ml-transformer

Class

fire.nodes.ml.NodeCountVectorizer

Fields

Name Title Description
inputCol Input Column Input column name
outputCol Output Column Output column name
vocabularySize Vocabulary Size Max size of the vocabulary.

Details

CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary and generates a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.

More at Spark MLlib/ML docs page : https://spark.apache.org/docs/latest/ml-features.html#countvectorizer