
This function uses Python's sklearn for feature extraction and count vectorisation. It creates a CountVectorizer object with the specified parameters. CountVectorizer converts text data into vectors of token counts that can be used as model input. The resulting vectoriser is intended for use inside a BertopicR topic modelling pipeline.

Usage

bt_make_vectoriser(
  ...,
  ngram_range = c(1L, 2L),
  stop_words = "english",
  min_frequency = 0.1,
  max_features = NULL
)

Arguments

...

Additional parameters passed to sklearn's CountVectorizer

ngram_range

A vector of length 2 (default c(1, 2)) indicating the lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted as features. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
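To make the ngram_range semantics concrete, here is a minimal base R sketch. The helper word_ngrams() is hypothetical and not part of BertopicR or sklearn; it only illustrates which word n-grams fall inside a given range, assuming the text has already been split into tokens.

```r
# Hypothetical helper (not part of BertopicR): list the word n-grams for
# every n between ngram_range[1] and ngram_range[2], inclusive.
word_ngrams <- function(tokens, ngram_range = c(1L, 2L)) {
  out <- character(0)
  for (n in seq(ngram_range[1], ngram_range[2])) {
    # A document shorter than n tokens has no n-grams of that size
    if (n > length(tokens)) next
    starts <- seq_len(length(tokens) - n + 1L)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1L)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- c("topic", "models", "group", "documents")
word_ngrams(tokens, c(1L, 1L)) # unigrams only
word_ngrams(tokens, c(1L, 2L)) # unigrams and bigrams
word_ngrams(tokens, c(2L, 2L)) # bigrams only
```

The real CountVectorizer also handles tokenisation, lowercasing and stop word removal, which this sketch leaves out.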

stop_words

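String or character vector (default 'english'). If a string, it is passed to sklearn's _check_stop_list and the appropriate stop list is returned; 'english' is currently the only supported string value. A character vector is treated as a custom list of stop words, as shown in the Examples.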

min_frequency

Integer or float (default 0.1). When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. If min_frequency is explicitly passed as an integer (for example 2L), it is treated as an absolute count of documents. Otherwise, a value between 0 and 1 is treated as a proportion of documents, and a whole number is treated as an absolute count.
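The threshold rule above can be sketched in base R as follows. Both resolve_threshold() and keep_terms() are hypothetical helpers written for illustration; they are an assumption about how the description maps to a filter, not the code that BertopicR or sklearn runs. The boundary values 0 and 1 are not spelled out in the description, so the sketch simply treats any whole number as a count.

```r
# Hypothetical helper: turn min_frequency into a minimum number of documents.
resolve_threshold <- function(min_frequency, n_docs) {
  is_whole <- is.integer(min_frequency) || min_frequency == round(min_frequency)
  if (is_whole) {
    min_frequency          # absolute count of documents
  } else {
    min_frequency * n_docs # proportion of documents
  }
}

# Toy corpus, already tokenised
docs <- list(
  c("cats", "chase", "mice"),
  c("cats", "sleep"),
  c("dogs", "chase", "cats"),
  c("mice", "hide")
)

# Document frequency: the number of documents in which each term appears
doc_freq <- table(unlist(lapply(docs, unique)))

# Keep terms whose document frequency is not strictly below the threshold
keep_terms <- function(min_frequency) {
  threshold <- resolve_threshold(min_frequency, length(docs))
  sort(names(doc_freq)[doc_freq >= threshold])
}

keep_terms(2L)   # terms found in at least 2 documents
keep_terms(0.75) # terms found in at least 75% of documents
keep_terms(0.1)  # the default keeps every term in a corpus this small
```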

max_features

Integer or NULL (default NULL). If not NULL, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
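As a rough illustration of max_features, the base R sketch below ranks terms by their total frequency across a toy corpus and keeps only the top entries. top_terms() is a hypothetical helper, and how ties at the cut-off are broken is an assumption of the sketch rather than documented behaviour.

```r
# Hypothetical helper: keep the max_features most frequent terms,
# or every term when max_features is NULL.
top_terms <- function(tokens, max_features = NULL) {
  term_freq <- sort(table(tokens), decreasing = TRUE)
  if (is.null(max_features)) {
    return(names(term_freq))
  }
  n_keep <- min(max_features, length(term_freq))
  names(term_freq)[seq_len(n_keep)]
}

# All tokens from a toy corpus, pooled together
tokens <- c("cats", "chase", "mice", "cats", "sleep", "dogs", "chase", "cats")

top_terms(tokens, max_features = 2L) # the two most frequent terms
top_terms(tokens)                    # NULL: the full vocabulary
```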

Value

An sklearn CountVectorizer object configured with the provided parameters

Examples

# vectoriser model that converts text docs to n-grams of 1 to 2 tokens
vectoriser <- bt_make_vectoriser(ngram_range = c(1, 2), stop_words = "english")

# vectoriser model that converts text docs to n-grams of 1 to 3 tokens
vectoriser <- bt_make_vectoriser(ngram_range = c(1, 3), stop_words = "english")

# You can implement custom stopwords or stopwords from other sources
if (FALSE) {
  stopwords_cat <- tm::stopwords(kind = "catalan")
  vectoriser <- bt_make_vectoriser(ngram_range = c(1, 3), stop_words = stopwords_cat)
}

custom_stopwords <- c("these", "words", "are", "not", "helpful")
vectoriser <- bt_make_vectoriser(ngram_range = c(1, 2), stop_words = custom_stopwords)