
This creates topic representations based on the KeyBERT algorithm.

Usage

bt_representation_keybert(
  fitted_model,
  documents,
  document_embeddings,
  embedding_model,
  top_n_words = 10,
  nr_repr_docs = 50,
  nr_samples = 500,
  nr_candidate_words = 100
)

Arguments

fitted_model

Output of bt_fit_model() or another BERTopic topic model. The model must have been fitted to data.

documents

The documents the fitted_model was fitted to.

document_embeddings

The embeddings used to fit the model. These should have the same dimensions as the output of the embedder passed as embedding_model.

embedding_model

The model used to create the document_embeddings. It is also used to create candidate word embeddings, which are compared to topic embeddings using cosine similarity.

top_n_words

Number of keywords/phrases to extract per topic.

nr_repr_docs

Number of representative documents used to create each topic embedding.

nr_samples

Number of documents sampled per topic, from which the representative documents are selected.

nr_candidate_words

Number of candidate words to examine per topic.

Value

A KeyBERTInspired representation model.
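
For illustration, a minimal sketch of a call using the defaults shown in Usage. The objects fitted_model, docs, embeddings and embedder are placeholders assumed to come from an earlier workflow (the output of bt_fit_model(), the documents it was fitted to, their embeddings, and the embedder that produced them); they are not created by this function:

# Create a KeyBERT-inspired representation model from an already-fitted topic model
representation_keybert <- bt_representation_keybert(
  fitted_model = fitted_model,
  documents = docs,
  document_embeddings = embeddings,
  embedding_model = embedder,
  top_n_words = 10,
  nr_repr_docs = 50,
  nr_samples = 500,
  nr_candidate_words = 100
)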

Details

KeyBERT is a Python package for extracting keywords from documents. This representation works by:

  1. Selecting representative documents (nr_repr_docs) for each topic based on the cosine similarity between the c-TF-IDF representations of documents and their topics. This is done by sampling nr_samples documents from each topic, calculating their c-TF-IDF scores, and keeping the top nr_repr_docs.

  2. Selecting candidate words (nr_candidate_words) for each topic based on their c-TF-IDF scores for that topic.

  3. Creating topic embeddings by averaging the embeddings of each topic's representative documents, then comparing them with candidate word embeddings using cosine similarity to give a similarity score for each word and topic (a minimal sketch of this step follows the list).

  4. Representing each topic by the top_n_words with the highest cosine similarity to that topic.
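
To make steps 3 and 4 concrete, here is a small self-contained R sketch of the averaging and cosine-similarity scoring for a single topic. It uses toy random data and hypothetical names (doc_embeddings, word_embeddings, candidate_words); it is an illustration of the technique, not the package's internal implementation:

set.seed(42)
# Toy data for one topic: 5 representative-document embeddings and
# 8 candidate-word embeddings in a 4-dimensional embedding space
doc_embeddings  <- matrix(rnorm(5 * 4), nrow = 5)
word_embeddings <- matrix(rnorm(8 * 4), nrow = 8)
candidate_words <- paste0("word_", seq_len(8))

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Step 3: the topic embedding is the mean of its representative document embeddings
topic_embedding <- colMeans(doc_embeddings)

# Score each candidate word's embedding against the topic embedding
scores <- apply(word_embeddings, 1, cosine_sim, b = topic_embedding)

# Step 4: the candidates with the highest similarity represent the topic
top_n_words <- 3
candidate_words[order(scores, decreasing = TRUE)][seq_len(top_n_words)]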