CountVectorizer
Extracts the vocabulary from a given collection of documents and generates a vector of token counts for each document.
Input
Takes a DataFrame as input and transforms it into another DataFrame.
Output
Adds a new column to the incoming DataFrame that holds, for each row, the vector of token counts computed from the input column. The result is the output DataFrame.
Type
ml-transformer
Class
fire.nodes.ml.NodeCountVectorizer
Fields
| Name | Title | Description |
|---|---|---|
| inputCol | Input Column | Name of the input column containing the tokens to count |
| outputCol | Output Column | Name of the output column that receives the count vectors |
| vocabularySize | Vocabulary Size | Maximum number of terms kept in the vocabulary. |
Details
CountVectorizer and CountVectorizerModel help convert a collection of text documents into vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary and generate a CountVectorizerModel. The model produces sparse representations of the documents over the vocabulary, which can then be passed to other algorithms such as LDA.
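The two steps described here, extracting a vocabulary and then producing sparse count vectors over it, can be illustrated with a minimal plain-Python sketch. This is not the Spark or Fire implementation: the function names `fit_vocabulary` and `transform`, the alphabetical tie-breaking rule, and the `(size, {index: count})` representation of a sparse vector are all assumptions made for illustration.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size):
    """Return up to vocab_size tokens, most frequent across all documents first."""
    totals = Counter(token for doc in docs for token in doc)
    # Rank by descending corpus frequency. Breaking ties alphabetically is an
    # assumption of this sketch, chosen only to make the result deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:vocab_size]]


def transform(doc, vocabulary):
    """Return a sparse count vector as (vector size, {vocabulary index: count})."""
    index = {token: i for i, token in enumerate(vocabulary)}
    # Tokens outside the vocabulary are ignored.
    counts = Counter(index[token] for token in doc if token in index)
    return len(vocabulary), dict(sorted(counts.items()))


# Each document is already tokenized into a list of strings.
docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]

vocabulary = fit_vocabulary(docs, vocab_size=3)
vectors = [transform(doc, vocabulary) for doc in docs]
```

With these two documents, the tokens `a` and `b` each occur three times and `c` twice, so lowering `vocab_size` to 2 would drop `c` from the vocabulary and from every count vector. Only non-zero counts are stored, which is what makes the representation sparse.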
More on the Spark MLlib/ML docs page: https://spark.apache.org/docs/latest/ml-features.html#countvectorizer