
This creates topic representations based on the KeyBERT algorithm.

Usage

bt_representation_keybert(
  fitted_model,
  documents,
  document_embeddings,
  embedding_model,
  top_n_words = 10,
  nr_repr_docs = 50,
  nr_samples = 500,
  nr_candidate_words = 100
)

Arguments

fitted_model

Output of bt_fit_model() or another BERTopic topic model. The model must have been fitted to data.

documents

The documents the fitted_model was fitted to.

document_embeddings

The embeddings used to fit the model. These should have the same dimensions as the output of the embedder passed as embedding_model.

embedding_model

The model used to create the document_embeddings. It is also used to create candidate word embeddings, which are compared to topic embeddings using cosine similarity.

top_n_words

Number of keywords/phrases to extract per topic.

nr_repr_docs

Number of representative documents used to create each topic embedding.

nr_samples

Number of documents sampled per topic, from which the representative documents are selected.

nr_candidate_words

Number of candidate words to examine per topic.

Value

A KeyBERTInspired representation model.
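
For illustration, a minimal sketch of a call using the defaults shown in Usage. The objects fitted_model, docs, embeddings and embedder are placeholders assumed to come from an earlier workflow (the output of bt_fit_model(), the documents it was fitted to, their embeddings, and the embedder that produced them); they are not created by this function:

# Create a KeyBERT-inspired representation model from an already-fitted topic model
representation_keybert <- bt_representation_keybert(
  fitted_model = fitted_model,
  documents = docs,
  document_embeddings = embeddings,
  embedding_model = embedder,
  top_n_words = 10,
  nr_repr_docs = 50,
  nr_samples = 500,
  nr_candidate_words = 100
)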

Details

KeyBERT is a Python package for extracting keywords from documents. This representation works by:

  1. Selecting representative documents (nr_repr_docs) for each topic based on the cosine similarity between the c-TF-IDF representations of documents and their topics. This is done by sampling nr_samples documents from each topic, calculating their c-TF-IDF scores, and keeping the top nr_repr_docs.

  2. Selecting candidate words (nr_candidate_words) for each topic based on their c-TF-IDF scores for that topic.

  3. Creating topic embeddings by averaging the embeddings of each topic's representative documents, then comparing them with candidate word embeddings using cosine similarity to give a similarity score for each word and topic (a minimal sketch of this step follows the list).

  4. Representing each topic by the top_n_words with the highest cosine similarity to that topic.
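
To make steps 3 and 4 concrete, here is a small self-contained R sketch of the averaging and cosine-similarity scoring for a single topic. It uses toy random data and hypothetical names (doc_embeddings, word_embeddings, candidate_words); it is an illustration of the technique, not the package's internal implementation:

set.seed(42)
# Toy data for one topic: 5 representative-document embeddings and
# 8 candidate-word embeddings in a 4-dimensional embedding space
doc_embeddings  <- matrix(rnorm(5 * 4), nrow = 5)
word_embeddings <- matrix(rnorm(8 * 4), nrow = 8)
candidate_words <- paste0("word_", seq_len(8))

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Step 3: the topic embedding is the mean of its representative document embeddings
topic_embedding <- colMeans(doc_embeddings)

# Score each candidate word's embedding against the topic embedding
scores <- apply(word_embeddings, 1, cosine_sim, b = topic_embedding)

# Step 4: the candidates with the highest similarity represent the topic
top_n_words <- 3
candidate_words[order(scores, decreasing = TRUE)][seq_len(top_n_words)]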