Redistributes outliers using embeddings — bt_outliers_embeddings

Uses the cosine similarity of the document embeddings to find the topic closest to each outlier document and reassigns these documents accordingly. Note that this function only returns a new list of topics that can then be used to update the model. It does not change the model itself, so the topic classification the model outputs is the same after running this function. Use the bt_update_topics function to apply the change to the model.
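The idea can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy embeddings are made up, the cosine_sim helper is hypothetical, and representing each topic by the mean embedding of its documents is an assumption made for illustration only.

```r
# Toy data: 5 documents with 2-dimensional embeddings (made-up values).
embeddings <- rbind(
  c(1.0, 0.0),   # doc 1, topic 0
  c(0.9, 0.1),   # doc 2, topic 0
  c(0.0, 1.0),   # doc 3, topic 1
  c(0.8, 0.2),   # doc 4, outlier that lies close to topic 0
  c(-1.0, -1.0)  # doc 5, outlier that lies far from both topics
)
topics <- c(0, 0, 1, -1, -1)  # -1 marks an outlier
threshold <- 0.3

# Hypothetical helper: cosine similarity between two numeric vectors.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Assumption: each topic is represented by the mean embedding of its documents.
topic_ids <- sort(unique(topics[topics != -1]))
centroids <- do.call(rbind, lapply(topic_ids, function(k) {
  colMeans(embeddings[topics == k, , drop = FALSE])
}))

# Reassign an outlier only if its best similarity reaches the threshold.
new_topics <- topics
for (i in which(topics == -1)) {
  sims <- apply(centroids, 1, cosine_sim, b = embeddings[i, ])
  if (max(sims) >= threshold) new_topics[i] <- topic_ids[which.max(sims)]
}

print(data.frame(old_topic = topics, new_topic = new_topics))
```

In this sketch, document 4 moves to topic 0 because its similarity to that topic is well above the threshold, while document 5 stays an outlier because no topic is similar enough. The original topics vector is left untouched, which mirrors how the function returns suggestions rather than modifying the model.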

Usage

bt_outliers_embeddings(
  fitted_model,
  documents,
  topics,
  embeddings,
  embedding_model = NULL,
  threshold = 0.3
)

Arguments

fitted_model: Output of bt_fit_model() or another BERTopic topic model. The model must have been fitted to data.
documents: Documents to which the model was fit.
topics: Current topics associated with the documents.
embeddings: Embeddings used to create the topics.
embedding_model: If you did not instantiate the model with an embedding model, you will need to pass one here.
threshold: Minimum similarity between an outlier document and a topic for the outlier to be reassigned.

Value

A data frame containing each document, its old topic, and its new topic.

Details

Outlier reduction methods can be chained together, because each method operates on whatever list of topics is passed to the topics argument. As the examples show, you can run one outlier reduction method, e.g. bt_outliers_tokenset_similarity, which outputs a list of potential new topics, and then pass that list to another outlier reduction method, e.g. bt_outliers_embeddings, which bases its topic suggestions on the input list. In this way, aspects of multiple outlier reduction strategies can be combined.
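The chaining logic can be sketched in base R without a fitted model. The reassign_outliers helper and both suggestion vectors below are hypothetical, and the sketch assumes that each step only considers documents still labelled -1 in the topics it receives.

```r
# Hypothetical helper: fill in remaining outliers (-1) where a suggestion exists.
reassign_outliers <- function(topics, suggestions) {
  is_outlier <- topics == -1 & !is.na(suggestions)
  topics[is_outlier] <- suggestions[is_outlier]
  topics
}

topics <- c(0, -1, 1, -1, -1)  # -1 marks an outlier

# Made-up suggestions from two strategies; NA means no topic passed the threshold.
tokenset_suggestions  <- c(NA, 1, NA, NA, NA)
embedding_suggestions <- c(NA, 0, NA, 0, NA)

# The first strategy works on the original topics.
step_1 <- reassign_outliers(topics, tokenset_suggestions)

# The second strategy works on the output of the first, so it only sees
# documents that are still outliers after step 1.
step_2 <- reassign_outliers(step_1, embedding_suggestions)

print(step_2)
```

Here document 2 keeps the topic chosen by the first strategy, even though the second strategy would have suggested a different one, because it is no longer an outlier when the second step runs.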

Examples

if (FALSE) { # \dontrun{
# Reassign the outliers identified by the original clustering model
outliers <- bt_outliers_embeddings(fitted_model = topic_model, documents = docs, topics = topic_model$topics_, embeddings = embeddings)

# Chain strategies: build on the output of one reduction strategy with another
# Step 1: use tokenset similarity to redistribute outliers
outliers_ts <- bt_outliers_tokenset_similarity(fitted_model = topic_model, documents = docs, topics = topic_model$topics_)

# Step 2: apply the embedding method to the topics suggested by the tokenset similarity method
outliers_chain <- bt_outliers_embeddings(fitted_model = topic_model, documents = docs, topics = outliers_ts$new_topics, embeddings = embeddings)

} # }